Data exchange with MPI: working with the Intel® MPI Library



In this post, we will talk about organizing data exchange with MPI, using the Intel MPI Library as an example. We think this information will be interesting to anyone who wants a practical, hands-on introduction to parallel high-performance computing.

We will give a brief description of how data exchange is organized in parallel MPI-based applications, along with links to external sources with more detailed descriptions. In the practical part, you will find a walkthrough of all the stages of developing the demo MPI application "Hello World", from setting up the environment to running the program itself.

MPI (Message Passing Interface)


MPI is a message-passing interface for processes that work on a common task. It is intended primarily for distributed-memory systems (MPP), unlike, for example, OpenMP. A distributed (cluster) system is typically a set of computing nodes connected by high-performance communication channels (for example, InfiniBand).

MPI is the most widespread data-exchange interface standard in parallel programming; it is standardized by the MPI Forum. MPI implementations exist for most modern platforms, operating systems and languages. MPI is widely used in computational physics, pharmaceuticals, materials science, genetics and other fields.

From the MPI point of view, a parallel program is a set of processes running on different computing nodes, each spawned from the same program code.

The main operation in MPI is message passing. MPI implements almost all the basic communication patterns: point-to-point, collective and one-sided.

Working with MPI


Let's walk through a real example of how a typical MPI program works. As the demo application, we take the sample source code that ships with the Intel MPI Library. Before running our first MPI program, we need to prepare and set up a working environment for the experiments.

Setting up a cluster environment


For experiments, we need a couple of computational nodes (preferably with similar characteristics). If you don’t have two servers at hand, you can always use cloud services.

For the demonstration, I chose Amazon Elastic Compute Cloud (Amazon EC2). Amazon provides new users with a free trial year of entry-level servers.

Working with Amazon EC2 is intuitive. In case of questions, you can refer to the detailed documentation (in English). If desired, you can use any other similar service.

We create two virtual servers. In the management console, select EC2 Virtual Servers in the Cloud, then Launch Instance (an "instance" is an instance of a virtual server).

The next step is choosing the operating system. The Intel MPI Library supports both Linux and Windows. For a first acquaintance with MPI, we will choose Linux: Red Hat Enterprise Linux 6.6 (64-bit) or SLES 11.3 / 12.0.
Select the Instance Type (server type). For our experiments, t2.micro is sufficient (1 vCPU, 2.5 GHz, Intel Xeon processor family, 1 GiB RAM). As a recently registered user, I could use this type for free; it is marked "Free tier eligible". Set Number of instances: 2 (the number of virtual servers).

After clicking Launch Instances (which starts the configured virtual servers), we save the SSH key needed to connect to the virtual servers from the outside. The status of the virtual servers and the IP addresses used to reach them from the local computer can be monitored in the management console.

An important point: in Network & Security / Security Groups, you need to create a rule that opens ports for TCP connections; this is required by the MPI process manager. The rule may look like this:
Type: Custom TCP Rule
Protocol: TCP
Port Range: 1024-65535
Source: 0.0.0.0/0

For security reasons, you can set a stricter rule, but for our demo this is enough.

Instructions on how to connect to the virtual servers from a local computer are available (in English).
I used PuTTY to connect to the servers from a Windows computer, and WinSCP to transfer files. Instructions on how to configure them to work with Amazon services are also available (in English).

The next step is to configure SSH. To set up passwordless SSH with public-key authentication, perform the following steps (a rough command sketch follows this list):
  1. Run the ssh-keygen utility on each host; it creates a pair of private and public keys in the $HOME/.ssh directory;
  2. Take the contents of the public key (the file with the .pub extension) from one server and append it to the $HOME/.ssh/authorized_keys file on the other server;
  3. Perform this procedure for both servers;
  4. Try connecting via SSH from one server to the other and back to verify that SSH is configured correctly. On the first connection, you may need to add the public key of the remote host to the $HOME/.ssh/known_hosts list.
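A rough command sketch of these steps (the key type and the way the key is copied are illustrative assumptions, not prescribed by the article):
$ ssh-keygen -t rsa
$ cat $HOME/.ssh/id_rsa.pub
(append the printed public key to $HOME/.ssh/authorized_keys on the other server)
$ chmod 600 $HOME/.ssh/authorized_keys
$ ssh ip-172-31-47-24 hostname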

Configuring MPI Library


So, the working environment is configured. Time to install MPI.
As the demo, let's take the 30-day trial version of the Intel MPI Library (~300 MB). Other MPI implementations, for example MPICH, can be used instead if desired. The latest available version of the Intel MPI Library at the time of writing is 5.0.3.048, and that is the one we take for our experiments.

Install Intel MPI Library following the instructions of the integrated installer (superuser privileges may be required).
$ tar xvfz l_mpi_p_5.0.3.048.tgz
$ cd l_mpi_p_5.0.3.048
$ ./install.sh

We perform the installation on each of the hosts, using the same installation path on both nodes. A more standard way to deploy MPI is to install it on network storage shared by all the worker nodes, but setting up such storage is beyond the scope of this article, so we limit ourselves to the simpler option.

To compile the MPI demo program, we will use the GNU C compiler (gcc).
The standard RHEL image on Amazon does not include it, so it has to be installed:
$ sudo yum install gcc

As the demo MPI program, we take test.c from the standard set of Intel MPI Library examples (located in the intel/impi/5.0.3.048/test folder).
To compile it, the first step is to source the Intel MPI Library environment:
$ . /home/ec2-user/intel/impi/5.0.3.048/intel64/bin/mpivars.sh

Next, we compile our test program using a wrapper script from the Intel MPI Library (all necessary MPI dependencies are added automatically at compile time):
$ cd /home/ec2-user/intel/impi/5.0.3.048/test
$ mpicc -o test.exe ./test.c

The resulting test.exe is copied to the second node:
$ scp test.exe ip-172-31-47-24:/home/ec2-user/intel/impi/5.0.3.048/test/

Before starting the MPI program, it will be useful to make a trial run of some standard Linux utility, for example, 'hostname':
$ mpirun -ppn 1 -n 2 -hosts ip-172-31-47-25,ip-172-31-47-24 hostname
ip-172-31-47-25
ip-172-31-47-24

The 'mpirun' utility is a program from the Intel MPI Library designed to run MPI applications. This is a kind of "launcher." It is this program that is responsible for launching an instance of the MPI program on each of the nodes listed in its arguments.

Regarding the options: '-ppn' is the number of processes to start on each node, '-n' is the total number of processes to start, '-hosts' is the list of nodes on which the specified application will be launched, and the last argument is the path to the executable (this may also be an application without MPI).

In our example, if running the hostname utility yields its output (the name of the computing node) from both virtual servers, then the MPI process manager is working correctly.

"Hello World" using MPI


As a demo MPI application, we took test.c from the standard set of Intel MPI Library examples.

The demo MPI application collects, from each of the parallel MPI processes, some information about the process and the computing node on which it is running, and prints this information on the head node.

Let us consider in more detail the main components of a typical MPI program.

#include "mpi.h"
Includes the mpi.h header file, which contains declarations of the basic MPI functions and constants.
If we use the special scripts from the Intel MPI Library (mpicc, mpiicc, etc.) to compile our application, the path to mpi.h is added automatically. Otherwise, the include path has to be specified explicitly at compile time.
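For example, a manual build without the wrapper script might look roughly like this (a sketch, assuming mpivars.sh has already been sourced so that the I_MPI_ROOT environment variable points to the installation directory):
$ gcc -I$I_MPI_ROOT/intel64/include -o test.exe test.c -L$I_MPI_ROOT/intel64/lib -lmpi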

MPI_Init (&argc, &argv);
...
MPI_Finalize ();
The MPI_Init () call is required to initialize the execution environment of the MPI program. After this call, the remaining MPI functions can be used.
The last MPI call in the program is MPI_Finalize (). On successful completion, each launched MPI process calls MPI_Finalize (), which cleans up internal MPI resources. Calling any MPI function after MPI_Finalize () is not allowed.
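Putting these two calls together, the smallest valid MPI program looks roughly like this (a bare frame, not the full test.c):

#include "mpi.h"

int main (int argc, char *argv[])
{
    MPI_Init (&argc, &argv);
    /* all other MPI calls go between Init and Finalize */
    MPI_Finalize ();
    return 0;
}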

To describe the rest of our MPI program, we need to consider the basic terms used in MPI programming.

An MPI program is a set of processes that can send messages to each other through various MPI functions. Each process has a special identifier, its rank. The rank is used in various MPI message-passing operations; for example, it can be specified as the identifier of the message recipient.

In addition, MPI has special objects called communicators, which describe groups of processes. Each process has a unique rank within a given communicator. The same process can belong to different communicators and, accordingly, can have different ranks in different communicators. Every data-transfer operation in MPI is performed within some communicator. By default, the communicator MPI_COMM_WORLD, which includes all available processes, is always created.
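As a small illustration of the last point (this fragment is not part of test.c), a process can obtain a different rank inside a newly created communicator:

int world_rank, sub_rank;
MPI_Comm sub_comm;
MPI_Comm_rank (MPI_COMM_WORLD, &world_rank);
/* split MPI_COMM_WORLD into "even" and "odd" sub-communicators */
MPI_Comm_split (MPI_COMM_WORLD, world_rank % 2, world_rank, &sub_comm);
MPI_Comm_rank (sub_comm, &sub_rank);   /* generally differs from world_rank */
MPI_Comm_free (&sub_comm);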

Back to test.c:

MPI_Comm_size (MPI_COMM_WORLD, &size);
MPI_Comm_rank (MPI_COMM_WORLD, &rank);
The MPI_Comm_size () call writes into the variable size the size of the MPI_COMM_WORLD communicator (the total number of processes, which we specified with the mpirun option '-n').
MPI_Comm_rank () writes into the variable rank the rank of the current MPI process within the MPI_COMM_WORLD communicator.

MPI_Get_processor_name (name, &namelen);
The MPI_Get_processor_name () call writes into the variable name the string identifier (name) of the computing node on which the corresponding process was launched.

The collected information (process rank, size of MPI_COMM_WORLD, processor name) is then sent from all nonzero ranks to rank zero using the MPI_Send () function:
MPI_Send (&rank, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
MPI_Send (&size, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
MPI_Send (&namelen, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
MPI_Send (name, namelen + 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);

The MPI_Send () function has the following format:
MPI_Send (buf, count, type, dest, tag, comm)
buf - address of the memory buffer containing the data to be sent;
count - the number of data elements in the message;
type - the data type of the message elements;
dest - the rank of the receiving process;
tag - a special tag for identifying messages;
comm - the communicator within which the message is sent.
A more detailed description of the MPI_Send () function and its arguments, as well as other MPI functions, can be found in the MPI standard (the documentation language is English).

At rank zero, messages sent by other ranks are received and printed on the screen:
printf ("Hello world: rank %d of %d running on %s\n", rank, size, name);
for (i = 1; i < size; i++) {
    MPI_Recv (&rank, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat);
    MPI_Recv (&size, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat);
    MPI_Recv (&namelen, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat);
    MPI_Recv (name, namelen + 1, MPI_CHAR, i, 1, MPI_COMM_WORLD, &stat);
    printf ("Hello world: rank %d of %d running on %s\n", rank, size, name);
}
For clarity, rank zero additionally prints its own data in the same format as the data received from the remote ranks.

The MPI_Recv () function has the following format:
MPI_Recv (buf, count, type, source, tag, comm, status)
buf, count, type - the memory buffer for the incoming message;
source - the rank of the process from which the message is to be received;
tag - the tag of the expected message;
comm - the communicator within which the data is received;
status - a pointer to a special MPI data structure that contains information about the result of the receive operation.

In this article, we will not delve into the intricacies of the MPI_Send () / MPI_Recv () functions. A description of the various types of MPI operations and the subtleties of their behavior is a topic for a separate article. We only note that in our program rank zero receives messages from the other processes in a strictly defined order, starting from rank one and increasing (this is determined by the source argument of the MPI_Recv () function, which runs from 1 to size - 1).
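If the order of arrival were not important, the loop body could instead accept the first message from whichever rank sends it and then read that rank's remaining messages; a sketch of such a variant (not part of test.c) using MPI_ANY_SOURCE:

int src;
MPI_Recv (&rank, 1, MPI_INT, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &stat);
src = stat.MPI_SOURCE;   /* the actual sender is recorded in the status */
MPI_Recv (&size, 1, MPI_INT, src, 1, MPI_COMM_WORLD, &stat);
MPI_Recv (&namelen, 1, MPI_INT, src, 1, MPI_COMM_WORLD, &stat);
MPI_Recv (name, namelen + 1, MPI_CHAR, src, 1, MPI_COMM_WORLD, &stat);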

The MPI_Send () / MPI_Recv () functions described above are an example of so-called point-to-point MPI operations, in which one rank exchanges messages with another within a specific communicator. There are also collective MPI operations, in which more than two ranks can take part in the data exchange. Collective MPI operations are a topic for a separate (and possibly more than one) article.
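As a small preview (this program is not part of the Intel MPI Library examples, just an illustrative sketch), the same "collect a value from every rank onto rank zero" pattern can be written with a single collective call, MPI_Gather (), instead of a loop of MPI_Send () / MPI_Recv ():

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
    int rank, size, i;
    int *all_ranks = NULL;

    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &size);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    /* only the root of the gather needs a receive buffer */
    if (rank == 0)
        all_ranks = (int *) malloc (size * sizeof (int));

    /* every rank contributes one int; rank 0 receives one int per rank */
    MPI_Gather (&rank, 1, MPI_INT, all_ranks, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (i = 0; i < size; i++)
            printf ("Gathered rank %d\n", all_ranks[i]);
        free (all_ranks);
    }

    MPI_Finalize ();
    return 0;
}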

As a result of our demo MPI program, we get:
$ mpirun -ppn 1 -n 2 -hosts ip-172-31-47-25,ip-172-31-47-24 /home/ec2-user/intel/impi/5.0.3.048/test/test.exe
Hello world: rank 0 of 2 running on ip-172-31-47-25
Hello world: rank 1 of 2 running on ip-172-31-47-24


Are you interested in what was discussed in this post and would you like to take part in the development of MPI technology? The Intel MPI Library development team (Nizhny Novgorod) is currently actively looking for associate engineers. For more information, see the official Intel website and the BrainStorage website.

And finally, a small survey about possible topics for future publications on high-performance computing.


List of topics for future high-performance computing publications

  • Parallel programming technologies
  • Intel tools for analyzing and improving parallel application performance
  • Cluster systems infrastructure
