Introduction to Parallel Computing

    Roughly speaking, a parallel machine is a set of processors, some memory, and some means of communication between them. It can be the dual-core processor in your (not exactly new) laptop, a multiprocessor server, or, say, a cluster (supercomputer). You may know nothing about such computers, but you know exactly why they are built: speed, speed, and speed again. Speed, however, is not the only advantage.

    Having solved the far-from-trivial problem of building such a machine, designers and developers still have to figure out how to make it do useful work. After all, the techniques and algorithms written for old single-processor, single-threaded machines are, as a rule, not suitable.

    What is most surprising is that universities are in no hurry to shift their curricula toward parallel computing, even though today you would have to try hard to find a computer with only one core. At my own Carleton University, parallel computing courses are not part of the compulsory Bachelor of Computer Science program and are available only to those who have completed the core courses of the first three years. Distributed computing courses sit at the same level, and the two are easy to confuse.



    What is the difference? The hardware involved in parallel computing is, as a rule, located in one physical place, tightly interconnected, and all of its operating parameters are known to the programmer. In distributed computing there is no tight, constant connection between nodes: true to the name, they are spread out over some territory, and the system's operating parameters are dynamic and not always known.

    The classical algorithms taught over the last half century are not suitable for parallel systems. As you might guess, an algorithm for a parallel machine is called a parallel algorithm. It often depends heavily on the architecture of a particular machine and is not as universal as its classical ancestor.

    You may ask: why invent a new kind of machine and then struggle with new algorithms for it, instead of just continuing to speed up regular computers? First, there is no way to build a single super-fast processor comparable to a modern parallel supercomputer, and even getting close would cost an enormous amount of resources. Second, many problems are solved well by parallel machines precisely because of their structure: computing complex chaotic systems such as the weather, simulating interactions of elementary particles in physics, modeling at the nano level (yes, yes, nanotechnology!), data mining, cryptography... the list goes on and on.
    For the programmer, as the end “user” of the parallel machine, there are two possibilities: the parallelism may be “visible” to them, or it may not. In the first case the programmer writes programs tailored to the computer's architecture and takes the parameters of that particular machine into account. In the second, the programmer may not even know that they are dealing with a non-classical computer, and all their classical methods keep working thanks to the thoughtful design of the system itself.

    Here is an example of a multi-core processor, a favorite of many: the SUN UltraSPARC T1/T2:



    8 cores, each supporting 4 (T1) or 8 (T2) hardware threads, which by simple multiplication gives us 32/64 threads per processor, respectively. In theory this means such a processor can do, in the same time, the work of 32/64 single-threaded processors. In practice, of course, plenty of resources are spent on switching between threads and other internal bookkeeping.
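    By the way, you can see how many of these hardware threads your own machine exposes and drive them all at once. Below is a minimal sketch in C (POSIX threads; the worker body is just a placeholder, and the _SC_NPROCESSORS_ONLN query is a common Unix extension, so treat the exact call as an assumption about your platform). Compile with something like cc threads.c -pthread.

    /* Minimal sketch: ask the OS how many logical processors it sees and
       launch one worker per logical processor. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void *worker(void *arg) {
        long id = (long)arg;
        printf("worker %ld running\n", id);       /* placeholder "work" */
        return NULL;
    }

    int main(void) {
        long n = sysconf(_SC_NPROCESSORS_ONLN);   /* logical CPUs visible to the OS */
        printf("OS reports %ld logical processors\n", n);

        pthread_t *t = malloc(n * sizeof *t);
        for (long i = 0; i < n; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (long i = 0; i < n; i++)
            pthread_join(t[i], NULL);
        free(t);
        return 0;
    }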

    And here, for example, is a popular graphics processing unit, the nVidia GeForce 9600 GT, with 64 cores on board.



    Recent GPUs have up to a thousand cores! We will talk about why GPUs have so many cores a bit later. For now, take a look at an example of the parallelism hierarchy in supercomputers:



    Clearly, gathering a pile of powerful computers is not the problem. The problem is getting them to work together, that is, connecting them with a fast network. Often such clusters use a high-speed switch, and all the computers simply connect to a fast local area network. Here, for example, is the one at our university:



    256 cores: 64 nodes, each with 4 cores at 2.2 GHz and 8 gigabytes of RAM. The interconnect uses a Foundry SuperX switch and a gigabit network. The OS is Linux Rocks. The switch is often the most expensive element: try finding a fast 64-port switch! This illustrates one of the biggest problems of parallel computers: the speed of information exchange. Processors have become very fast, while memory and buses are still slow, not to mention hard drives, which next to processors look like Stone Age tools.
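    To feel just how expensive that exchange is, you can measure it. Here is a rough ping-pong sketch in C using standard MPI calls (assuming an MPI installation with mpicc and mpirun; the message size and repetition count are arbitrary). On a gigabit LAN the round trip is typically orders of magnitude slower than an access to local memory.

    /* Rough ping-pong latency test between two nodes.
       Build and run: mpicc pingpong.c -o pingpong && mpirun -np 2 ./pingpong */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char buf[1024];                       /* 1 KB message */
        const int reps = 1000;
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("average round trip: %.1f microseconds\n", (t1 - t0) / reps * 1e6);

        MPI_Finalize();
        return 0;
    }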

    Classification



    Let's finally sort out the types of parallel machines. They differ in the type of memory, shared or distributed, and the type of control, SIMD or MIMD. That gives four types of parallel machines.

    Shared memory


    A classic computer looks like this:



    And the most obvious way to extend this scheme to several processors is this:



    Instead of a single processor there are several, and correspondingly more memory, but otherwise the same scheme. All processors share a common memory, and if processor 2 needs some data that processor 3 is working on, it gets it from the shared memory at the same speed as its own.
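    A tiny sketch of what that looks like in code (C with POSIX threads, compile with -pthread): one array sits in memory visible to both workers, each sums its own half, and nothing is ever copied between them.

    /* Shared-memory sketch: two threads sum halves of one shared array. */
    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    static double data[N];                   /* lives in memory shared by all threads */

    struct task { int lo, hi; double sum; };

    static void *partial_sum(void *arg) {
        struct task *t = arg;
        t->sum = 0.0;
        for (int i = t->lo; i < t->hi; i++)  /* reads the shared array directly */
            t->sum += data[i];
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < N; i++) data[i] = 1.0;

        struct task a = {0, N / 2, 0.0}, b = {N / 2, N, 0.0};
        pthread_t ta, tb;
        pthread_create(&ta, NULL, partial_sum, &a);
        pthread_create(&tb, NULL, partial_sum, &b);
        pthread_join(ta, NULL);
        pthread_join(tb, NULL);

        printf("total = %.0f\n", a.sum + b.sum);   /* expect 1000000 */
        return 0;
    }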

    Here's what the Quad Pentium looks like:



    Distributed memory


    As you have probably guessed, in this case each processor has its own memory (to be clear, we are not talking about the processor's internal memory).



    An example is the cluster described above: it is essentially a bunch of separate computers, each with its own memory. Here, if processor (or computer) 2 needs information from computer 3, it takes longer: the data must be requested and then transferred (in the case of a cluster, over the local network). The network topology affects the speed of this exchange, which is why different kinds of structures have been developed.
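    Here is a minimal sketch of that request-and-transfer step, again assuming an MPI installation (run with at least 4 processes so that ranks 2 and 3 exist): "computer 3" owns an array, "computer 2" asks for it and waits for the copy to arrive over the network.

    /* Distributed-memory sketch: rank 2 requests data that only rank 3 holds. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local[100];                    /* each rank's own, private memory */

        if (rank == 3) {
            for (int i = 0; i < 100; i++) local[i] = i;       /* rank 3's data  */
            int request;
            MPI_Recv(&request, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(local, 100, MPI_DOUBLE, 2, 1, MPI_COMM_WORLD);
        } else if (rank == 2) {
            int request = 1;                                  /* "please send"  */
            MPI_Send(&request, 1, MPI_INT, 3, 0, MPI_COMM_WORLD);
            MPI_Recv(local, 100, MPI_DOUBLE, 3, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 2 received %.0f ... %.0f from rank 3\n", local[0], local[99]);
        }

        MPI_Finalize();
        return 0;
    }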

    The simplest option that comes to mind is a two-dimensional grid (mesh):



    Or a cube:



    The IBM Blue Gene, a descendant of the famous IBM Deep Blue that beat a human at chess, has a similar structure. Only similar, because Blue Gene is in fact not a cube but a torus: the elements on the edges wrap around and connect to the opposite side:



    By the way, it is called Gene because it is actively used in genetic research.

    Another interesting structure that must have crossed someone's mind is everyone's favorite tree:



    Since the depth (or height) of a binary tree is log n, passing information between the two most distant nodes takes 2·log n hops, which is very good, yet such a structure is still rarely used. First, to split such a network into two isolated subnets it is enough to cut a single wire (remember the min-cut problem?), whereas in a two-dimensional mesh you would have to cut sqrt(n) wires! Work out the cube yourself. And second, too much traffic has to pass through the root node!

    Four-dimensional hypercubes were popular in the 80s:



    These are two three-dimensional cubes with their corresponding vertices connected. Cubes of even higher dimension were built, but they are hardly used anymore, in part because that means a huge number of wires!
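    To put some numbers behind these trade-offs, here is a back-of-the-envelope sketch in C using the usual textbook formulas: diameter is the longest shortest path in hops, bisection width is the fewest links you must cut to split the network in half. The node counts are arbitrary powers of four so every formula comes out exact.

    /* Compare diameter and bisection width of a 2D mesh, a binary tree,
       and a hypercube with n nodes. Build with: cc topo.c -lm */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        int sizes[] = {16, 64, 256, 1024};
        printf("%6s %16s %16s %16s\n", "n", "2D mesh", "binary tree", "hypercube");
        printf("%6s %8s %7s %8s %7s %8s %7s\n",
               "", "diam", "bisect", "diam", "bisect", "diam", "bisect");
        for (int i = 0; i < 4; i++) {
            int n    = sizes[i];
            int side = (int)sqrt((double)n);           /* mesh side length      */
            int logn = (int)round(log2((double)n));    /* tree height, cube dim */
            printf("%6d %8d %7d %8d %7d %8d %7d\n",
                   n,
                   2 * (side - 1), side,   /* mesh: cross both axes; cut one row of links */
                   2 * logn,       1,      /* tree: up to the root and down; cut one link */
                   logn,           n / 2); /* hypercube: flip each address bit            */
        }
        return 0;
    }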

    In general, designing a network to fit a problem is an interesting topic in its own right. Here, for example, is the so-called Omega Network:



    We have sorted out memory; now let's turn to control.

    SIMD vs. MIMD



    SIMD - Single Instruction stream, Multiple Data stream. There is only one control unit; it sends the same instructions to all the other processors. Each processor has its own set of data to operate on.
    MIMD - Multiple Instruction stream, Multiple Data stream. Each processor has its own control unit, and each processor can execute different instructions.

    SIMD systems are usually used for specialized tasks that need, as a rule, not so much flexibility and versatility as raw computing power: media processing, scientific computing (those same simulations and models), or, say, Fourier transforms of giant matrices. This is why GPUs have such a furious number of cores: they are SIMD systems, and there is usually only one “smart” processor (like the one in your computer) managing a bunch of simple, non-universal cores.



    Since the “control” processor sends the same instructions to all the “worker” processors, programming such systems requires some skill. Here is a simple example: each processor starts with its own values of A and B in memory, and all of them execute the following program:

    if (B == 0)
    then C = A
    else C = A/B





    The first line has executed and the condition has been evaluated on every processor; now the second line (then C = A) runs:



    Meanwhile, the second and third processors do nothing, because their value of B does not satisfy the condition. Then the third line executes, and this time the other two processors “rest”:
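    In real SIMD hardware this trick is usually expressed with masks: every lane evaluates the condition, both branch results are produced, and a per-lane mask selects which one to keep. Here is a sketch of the same A/B/C example using x86 SSE intrinsics purely as an illustration (a GPU would do the equivalent per-lane selection in its own way).

    /* Masked "if": per-lane condition, both results computed, blended by mask.
       Requires an x86 machine with SSE2 (standard on x86-64). */
    #include <emmintrin.h>
    #include <stdio.h>

    int main(void) {
        float A[4] = {8, 6, 4, 2};
        float B[4] = {0, 2, 0, 2};
        float C[4];

        __m128 a    = _mm_loadu_ps(A);
        __m128 b    = _mm_loadu_ps(B);
        __m128 mask = _mm_cmpeq_ps(b, _mm_setzero_ps());      /* per lane: B == 0 ? */

        __m128 then_val = a;                                   /* then: C = A       */
        __m128 safe_b   = _mm_or_ps(_mm_and_ps(mask, _mm_set1_ps(1.0f)),
                                    _mm_andnot_ps(mask, b));   /* dodge 0 divisors  */
        __m128 else_val = _mm_div_ps(a, safe_b);               /* else: C = A / B   */

        /* keep then_val where the mask is set, else_val elsewhere */
        __m128 c = _mm_or_ps(_mm_and_ps(mask, then_val),
                             _mm_andnot_ps(mask, else_val));
        _mm_storeu_ps(C, c);

        for (int i = 0; i < 4; i++)
            printf("C[%d] = %g\n", i, C[i]);                   /* 8, 3, 4, 1 */
        return 0;
    }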



    Examples of SIMD machines are the old ILLIAC IV, MPP, DAP, and Connection Machine CM-1/2, as well as modern vector units, specialized coprocessors, and GPUs such as those from nVidia.

    MIMD machines are more versatile, which is why they are what we use in our everyday computers. If your laptop has at least a dual-core processor, you are the lucky owner of a shared-memory MIMD machine! Distributed-memory MIMD machines are supercomputers like the IBM Blue Gene discussed above, or clusters.
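    A tiny sketch of what MIMD means on such a laptop (C with POSIX threads; the two functions are made up for the example): two threads run completely different instruction streams at the same time over shared memory.

    /* MIMD sketch: two threads, two different instruction streams. */
    #include <pthread.h>
    #include <stdio.h>

    static void *count_primes(void *arg) {
        int limit = *(int *)arg, count = 0;
        for (int n = 2; n < limit; n++) {
            int prime = 1;
            for (int d = 2; d * d <= n; d++)
                if (n % d == 0) { prime = 0; break; }
            count += prime;
        }
        printf("primes below %d: %d\n", limit, count);
        return NULL;
    }

    static void *sum_squares(void *arg) {
        int limit = *(int *)arg;
        long long sum = 0;
        for (int n = 1; n <= limit; n++)
            sum += (long long)n * n;
        printf("sum of squares up to %d: %lld\n", limit, sum);
        return NULL;
    }

    int main(void) {
        int a = 100000, b = 100000;
        pthread_t t1, t2;
        pthread_create(&t1, NULL, count_primes, &a);   /* instruction stream 1 */
        pthread_create(&t2, NULL, sum_squares, &b);    /* instruction stream 2 */
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }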

    Here is the result of the entire classification:



    With that, the introduction to the topic can be considered complete. Next time we will talk about how the speed of parallel machines is measured, write our first parallel program using recursion, run it on a small cluster, and learn how to analyze its speed and resource consumption.
