
Intel Ct beta - what, why, how
Two weeks ago I helped a client start using the Intel Ct beta. As usual, I figured a few things out along the way, and now I want to share them.
Recently it became possible to download the Ct beta by registering on the site. So far it is Windows-only, but a Linux beta should appear after a while.
If you carefully read the Wikipedia article on Intel Ct... you will learn essentially nothing about what Ct actually is. There is also little information in Russian. That is a pity - it is one of those technologies that strike me personally as both useful and elegant.
In one sentence: Intel Ct is a technology that lets the programmer take advantage of the parallelism of current and upcoming hardware, in many of its forms, reasonably transparently. This field is already well trodden - CUDA, OpenCL, Cilk++, TBB, OpenMP, MPI, and so on. Each approach has its own application area, its strengths and weaknesses, its supporters, and its own directions of development. Ct's domain is the implementation of data-parallel algorithms, so of the list above, CUDA and OpenCL are the closest to it in terms of target tasks.
I would call Ct a DSL. It comes with its own runtime and is implemented in C++ (templates, operator overloading, everything very idiomatic). Naturally, a Ct program compiles with the Intel compiler, but a recent gcc will do the trick too.
The main type is Vec. Its instances are immutable and passed by value; the operators return a new instance. Hence the functional flavor, and notable advantages for multi-threaded correctness: no problems with data races or deadlocks, by design.
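A tiny sketch of what that value semantics means in practice (the declarations mirror the ones below; element-type details are omitted, so treat this as a sketch against the beta API rather than exact syntax):
Vec a(source, width);  // a vector initialized from the C++ buffer 'source'
Vec b = a + a;         // '+' builds and returns a brand-new vector; 'a' is untouched
// There is no way to mutate 'a' in place, so concurrent readers can
// never race on it - correctness falls out of the value semantics.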
Vecs come in several kinds - one-dimensional, two- and three-dimensional, sparse. Their elements can be of various types, including vectors. For example:
Vec charArray(source, width);          // create and initialize a one-dimensional vector
Vec2D intArray(source, width, height); // a two-dimensional vector
rcall(ct_foo)(charArray, 1);           // an example call into Ct code
A program is divided into two spaces: the C++ space and the Ct space. Both can live in the same .cpp file. In the C++ space you declare and initialize vectors; in the Ct space you perform operations on them. The code above, for example, is C++-space code. (I'm simplifying a bit here.)
And here is the Ct space:
void ct_foo(Vec& charArray, int t)
{
    charArray += t; // adds the scalar t to every element of the vector
}
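Putting the two spaces together, a minimal complete file might look roughly like this (a sketch assembled from the fragments above; the ct.h header name and the exact rcall signature are my assumptions - see ct_userguide.pdf for the authoritative spelling):
#include <ct.h> // assumed header name for the Ct runtime
// Ct space: executed by the Ct runtime, element-wise over the whole vector
void ct_foo(Vec& charArray, int t)
{
    charArray += t; // adds the scalar t to every element
}
int main()
{
    char source[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    Vec charArray(source, 8);    // C++ space: create and initialize the vector
    rcall(ct_foo)(charArray, 1); // cross from the C++ space into the Ct space
    return 0;
}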
Less trivial examples can be found in the samples directory, or here.
There are two build modes: debug and release. The main difference is which libraries you link against. After linking with the debug library, everything works transparently in the Intel Debugger (and in gdb too, of course) - you can debug it like regular C++ code.
The release build links with the Ct runtime (under the hood there is effectively a virtual machine, with a GC and a JIT). At launch, the code is recompiled for the architecture it is actually running on, and that recompilation is fast enough: even in the beta it is not a long process - I measured about 300 milliseconds for a medium-sized computational problem.
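That figure is easy to check yourself by timing the first rcall against a later one (a sketch; only the timing code is standard C++, and I'm assuming the JIT cost lands on the first invocation rather than at process startup):
#include <chrono>
#include <cstdio>
// ... Vec charArray(source, width); declared as above ...
auto t0 = std::chrono::steady_clock::now();
rcall(ct_foo)(charArray, 1); // first call: recompilation for this CPU happens around here
auto t1 = std::chrono::steady_clock::now();
rcall(ct_foo)(charArray, 1); // later calls reuse the already-compiled code
auto t2 = std::chrono::steady_clock::now();
std::printf("first: %lld ms, second: %lld ms\n",
            (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count(),
            (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count());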
What is the benefit of such a relatively complex internal architecture? Well, how else could you give the developer forward compatibility, free of charge?
For example, suppose that someday :) processors appear that combine a few "wide" x86 cores with a large number of simple x86 cores on shared memory. Or suppose you need to use SIMD and multiple cores at the same time. And while heterogeneous cores are a matter of the fairly distant future, SIMD registers widening to 256 bits is already coming soon, and 512 bits, just watch, will be right around the corner. No recompilation of your program is required for any of this - just relink against the dynamic library. The one with the runtime, that is.
Ct code expresses an algorithm in terms of the computational task, completely transparently with respect to the platforms it may execute on. The runtime figures out the platform, so there is no need to change the implementation or to worry about blocks/warps/threads.
Since the product has not been officially released yet, I cannot describe the functionality in much detail - it will change anyway. The changes are unlikely to be radical, though, and if they do happen, there will most likely be scripts and documents to help adapt already-written software to the release. If you download the beta now, the Doc directory contains a wonderful file, ct_userguide.pdf. Everything is in there; this time our technical writers did a particularly good job. Of course, none of it is in Russian, but when has that ever stopped anyone? More information will appear closer to the release, and in the meantime I can answer some questions in the comments.
It was hard to write this many words without once mentioning a certain product logically connected to this subject area. It sits not far from here, in the lab, but alas, I cannot say anything new about it yet.