Intel® Parallel Studio XE 2015 - Talk About New Names and Chips

    On August 26, 2014, the next new version of the Parallel Studio toolkit - 2015 was released. We wrote about the innovations of the previous version almost a year ago, and now it's time to review what has appeared in the latest release.
    Not so long ago, I tried to shed light on the confusing names of Intel software products in the corresponding post , but the good guys from marketing again changed everything. So, get acquainted with the new philosophy in the names:

    From now on, all tool packages are offered under the name Intel Parallel Studio XE 2015 , but in different versions. It turns out that now in order to use the compiler and libraries, you need Intel Parallel Studio XE 2015 Composer Edition. And the lives of thousands of developers immediately became easier, and all the bugs were gone, afraid of such a formidable name. Okay, let's go on without irony. If we add the well-known and used VTune + Inspector + Advisor to such a basic kit, we get the Pro version. Throwing Intel MPI and ITAC on top, we get a kit called Cluster Edition. I anticipate confusion with similar names, but, nevertheless, I think that people will gradually get used to such a renaming, because logic is present.

    Naturally, these are only marketing innovations. Let's see what we get with the advent of a new version in technical terms, but there really is something to try.


    Naturally, the compiler, as usual, in the new version has become even faster and more productive than the previous one. In part, I have already described some of the new “goodies” here , mainly focusing on new compiler reports.

    Here I will give more specifics about what has appeared. Say, in this version of the compiler “now and forever and ever” the C ++ 11 and Fortran 2003 standards are fully supported.
    The first one adds new string literals, an explicit replacement of virtual functions, thread_local, and an improvement in object constructors.
    The Fortran compiler added support for parameterized derived types, thereby closing the “gap” in the support of the Fortran 2003 standard. An interesting opportunity, by the way, is a certain remote analogue of templates from C ++ that gives us the right to control the size of data during compilation and program execution:

    TYPE humongous_matrix(k, d)
      INTEGER, KIND :: k = kind(0.0)
      INTEGER(selected_int_kind(12)), LEN :: d
      REAL(k) :: element(d,d)
    TYPE(humongous_matrix(8,10000000)) :: giant

    In this example, d can be set in runtime and control the size of the two-dimensional array of element. By the way, k must be known at compile time in order to specify the length of an integer type in this example.

    In addition, support for Fortran 2008 and OpenMP 4.0 standards has expanded. In the same my beloved Fortran, the BLOCK construction appeared, which was very useful when working with DO CONCURRENT. For OpenMP, added support for CANCEL, CANCELLATION POINT and DEPEND directives for tasks. In general, OpenMP was paid close attention not only in the new version of the compiler, but also actively worked on its expanded support in the profiler. But more on that later.

    In addition to such a rich additive in the standards, reports on optimization, and in particular, on the operation of the vectorizer, were substantially revised. In the blog I mentioned, I talked about this in more detail. Here I want to add that integration with Visual Studio really works, and now the reports are pleasing to the eye with their visibility:

    Well, at the end of the compiler topic - added the ability to offload computing on integrated graphics - Intel Graphics Technology, implemented using directives. The topic is voluminous, so I will not be scattered in this post, but I reserve the right to write a separate story about it.

    VTune Amplifier XE

    The new version of the studio has changed for the better and VTune. Now we have even more options for profiling both on the CPU and GPU. For example, since we can now do an offload on the GPU, then there is a corresponding analysis, however, so far only on Windows. They also expanded support for OpenCL. In addition, a TSX transaction analysis function has appeared. By the way, a very good overview of transactional memory in the Haswell processor is presented here .

    But these are all useful amenities for a very limited number of developers. But what almost everyone who encounters concurrency will be able to appreciate is an analysis of OpenMP scalability. Now, at the end of the profiling, a separate item for OpenMP will be displayed on the Summary page, provided that the application was built with the appropriate key (and Intel's compiler) and there are parallel areas in it.

    We can clearly see how much time our application worked sequentially, and, accordingly, evaluate how it will be further scaled, taking into account the Amdal law. Moreover, there is an approximation of the ideal execution time of the parallel part, which is calculated without taking into account the overhead and with the ideal load balance. So, we can now assess in advance whether it is worth investing in a particular piece of code or algorithm. As is customary, it is possible to immediately switch to the code of interest to us. A handy thing that I’m sure will be useful for many developers.

    What else? Ability to use external collectors. Suppose we wrote a script that collects various events occurring in the OS, and we want them to be collected during the profiling of our application. Now there is such an opportunity, and the result will be grouped and shown on the same timeline.

    For those who like working on apples, a graphical interface for OS X has been created. There is no profiling there, but you can view the results collected on Linux or Windows, or collect profiles remotely.

    Inspector and Advisor

    In the end, he left funds that can greatly simplify the life of any developer, a kind of bonus in the Parallel Studio XE package.

    The Inspector has significantly revised the mechanism for finding errors associated with general data. According to the assurances of the developers and based on a comparison of the XE 2013 Update 3 version and the “freshly released” version, the work acceleration reaches 20 times, which is good news. In addition, the amount of memory used for this has decreased.

    An interesting feature is a new graph that shows real-time memory usage. We launched an analysis of working with memory, and look at it, while realizing how actively memory is being spent. It looks like this:

    Upon completion of the analysis, you can also find possible errors that led to a significant increase in used memory by going to the desired piece of code and running through the stack:

    What's new in Advisor? This is of course a simulation opportunity for Xeon Phi!
    Let me remind you that this tool without implementing any parallel model, or simpler, without rewriting anything in the code, allows you to evaluate the possible performance gain in different parts of our application. Moreover, it is possible to profile it and find out what places to pay close attention to. Let's say what happens if we parallelize in this loop and start the calculation? How many times will we accelerate? All that is needed is to insert annotations into the code section of interest to us and run the tool.
    So, now you can estimate in advance how good our algorithm is for running on Xeon Phi:

    These images show that in one case the algorithm scales well and the expediency of using Xeon Phi is high, but the other, on the contrary, does not scale starting from 16 threads.
    In addition, it became possible to predict the behavior when changing the number of iterations and their duration / complexity:

    In general, a lot of different “goodies” have appeared that will definitely come in handy when developing highly optimized parallel and sequential code. All of them can and should be tried here for free, that is, for free, as usual, for 30 days.

    Also popular now: