High-Performance Computing - MPI

HPC: First MPI Laboratory

Initial Setup

Today you will learn how to compile, edit and run MPI programs. The first step is to connect to Setonix by typing

ssh your_username@setonix.pawsey.org.au

Once you have logged on, type the command hostname to learn the name of the login node. There are four login nodes, and users are assigned to them on a round-robin basis. Normally we do not run MPI programs on the login nodes, as they are shared amongst many users; today is an exception, however, as the programs are extremely simple. Type who to see who else is currently using your login node.

On many supercomputers we have to manually load modules to activate the MPI compilers, but on Setonix everything is ready to go. This includes man pages for the MPI functions, covering syntax and error codes; type man mpi to get a sense of the information available.

All the source code files linked to on this page can also be accessed via git:
git clone https://github.com/liamscarlett/intro-mpi

Hello World

The first step in any programming language is to perform simple I/O, for which the canonical example is 'Hello World'. To see how this works with MPI, download one of the following files (hello.f90 - hello.c). It is good practice to run your MPI programs in the /scratch directory, and sensible to put the exercises for each lab in a separate directory. For today, this could be something like /scratch/courses0100/your_username/mpi1.
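Before downloading, it is worth knowing roughly what to expect. A minimal MPI 'Hello World' in C looks something like the sketch below (the actual hello.c may differ in detail):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    /* Initialise the MPI environment before any other MPI call */
    MPI_Init(&argc, &argv);

    printf("Hello World\n");

    /* Shut down the MPI environment before exiting */
    MPI_Finalize();
    return 0;
}
```

Every MPI program is bracketed by MPI_Init and MPI_Finalize; each of the processes launched by srun executes the same code independently.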

Take a look at the file you have copied across using cat or a text pager (e.g. more, less), and compile it using the appropriate parallel compiler wrapper

ftn hello.f90 -o main
cc hello.c -o main

which generates an executable main capable of running in an MPI environment. To run the parallel program, type

srun -u -p work --reservation=CurtinHPCcourse2026 -A courses0100 -n 4 ./main

where the -n flag sets the number of processes to be used and main is the executable compiled above. The -u flag sets unbuffered output – useful if you're watching the program run in real time.
You should find that 'Hello World' appears four times on the screen; once from each process. Experiment by varying the number of processes. Don't be too greedy; remember there are other users on the system.

To make life easier, type the following command to set up an alias for srun

alias srun="srun -u -p work --reservation=CurtinHPCcourse2026 -A courses0100"

You can add this to your .bashrc (or equivalent) file. Wherever srun appears in the remainder of this workshop and the next, it is assumed that the alias has been set.

The srun command runs the program on a compute node, which means there is a delay between executing srun and the program running, as the scheduling system processes the job. To verify that this is the case, enter the command hostname to print the name of the login node, then enter srun hostname to run the hostname command on a compute node, and therefore print the name of the compute node.

Basic MPI

In lectures we saw how the MPI functions MPI_Comm_size and MPI_Comm_rank are used to determine the number of MPI processes, and the rank of each process, respectively. Our first application of these functions is a simple extension to the 'Hello World' program seen above. Download one of the following files (hello2.f90 - hello2.c) and compile using the appropriate compiler. Execute the program by typing srun -n 4 ./main. This should produce output along the lines of

Hello World. I am process 0
Hello World. I am process 1
Hello World. I am process 2
Hello World. I am process 3

Run the program several times and try varying the number of processes. You might find that the order of the output varies. Can you explain why?
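For reference, the core of hello2.c is likely along these lines (a sketch, not necessarily the exact file):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);

    /* Each process queries its own rank within the global communicator */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("Hello World. I am process %d\n", rank);

    MPI_Finalize();
    return 0;
}
```

Each process writes its line to the terminal independently of the others, which is a useful clue when thinking about the question above.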

Using a UNIX text editor, modify the program to include a call to MPI_Comm_size to determine the number of processes. Include this information in each line of the output in the manner of

Hello World. I am process 0 of 4

and so on for the other processes. Test that your code works by varying the number of processes passed to -n.
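One way to make this modification is sketched below; the variable names are illustrative, and only the lines around the print statement need to change:

```c
int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
printf("Hello World. I am process %d of %d\n", rank, size);
```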

The infinite series ∑1/k², where k is a positive integer, is a special case of the Riemann Zeta Function, with an exact value of π²/6. Download one of the following serial codes (zeta.f90 - zeta.c) which calculates this sum for the first 100,000 terms. Compile the program in the usual way (using either ftn or cc) and run the executable using the command

./main

Note that srun is not employed since the program runs on a single CPU. Warning: running without srun means that the program is executed on the login node, which is generally bad! You can only get away with this occasionally for small, single-core programs.
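The heart of the serial calculation is a single loop, sketched here as a function for clarity (the actual zeta.c may be organised differently):

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Partial sum of the series 1/k^2 over the first n terms.
 * Converges (slowly) to pi^2/6 as n grows. */
double zeta2_partial(int n)
{
    double sum = 0.0;
    for (int k = 1; k <= n; k++)
        sum += 1.0 / ((double)k * (double)k);
    return sum;
}
```

With n = 100,000 the partial sum agrees with π²/6 to roughly five decimal places, since the truncation error of this series is about 1/n.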

Modify the serial program to use MPI to determine the quantities ∑1/k², ∑1/k³, ∑1/k⁴ and ∑1/k⁵. Perform each summation on a different process, and check that exactly four processes are available. Each process should print both the value of its summation and the exact answer given in Equations 71-74 on the Wolfram web-site for the Riemann Zeta Function. No output should be generated if four processes are not specified; confirm this by varying the argument of -n.
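One possible shape for the parallel version is sketched below. The essential ingredients are the mapping from rank to exponent (here p = rank + 2) and the guard on the process count; the comparison against the tabulated exact values is left for you to fill in:

```c
#include <stdio.h>
#include <math.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    const int n = 100000;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size == 4) {
        int p = rank + 2;  /* rank 0 -> 1/k^2, rank 1 -> 1/k^3, ..., rank 3 -> 1/k^5 */
        double sum = 0.0;
        for (int k = 1; k <= n; k++)
            sum += 1.0 / pow((double)k, (double)p);
        printf("Process %d: sum of 1/k^%d = %.10f\n", rank, p, sum);
        /* ...also print the exact value from the tabulated equations here */
    }
    /* If size != 4 nothing is printed, as the exercise requires */

    MPI_Finalize();
    return 0;
}
```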

Point-to-Point Communication

In lectures you saw how MPI_Send and MPI_Recv can be used to send data from one process to another. Download one of the following codes (send.f90 - send.c) and compile and run. You can specify -n 2 as only two processes are required. You should find the output is along the lines of

Process 0 sent the number 10
Process 1 received the number 10

Have a close look at the code, and make sure you understand what is going on.
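The important pattern in send.c is likely along these lines (a sketch; tags and variable names are illustrative):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, number;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        number = 10;
        /* send one int to process 1, using message tag 0 */
        MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("Process 0 sent the number %d\n", number);
    } else if (rank == 1) {
        /* receive one int from process 0, matching tag 0 */
        MPI_Recv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 1 received the number %d\n", number);
    }

    MPI_Finalize();
    return 0;
}
```

Note how the sender and receiver run different branches of the same program: the destination rank in MPI_Send must match the source rank in the corresponding MPI_Recv, as must the tag.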

Modify the send program so that each process sends an integer to the other, e.g. Process 0 sends the number 10 to Process 1, while Process 1 sends the number 5 to Process 0.

Now consider a parallel calculation with four processes. Write a program which sends an integer (say 10) around a ring in which communication follows the progression 0→1, 1→2, 2→3 and 3→0. The process should terminate when the integer returns to Process 0. Your program should produce output along the lines of

Process 0 sent the number 10 to process 1
Process 1 received the number 10 from process 0

Process 1 sent the number 10 to process 2
Process 2 received the number 10 from process 1

Process 2 sent the number 10 to process 3
Process 3 received the number 10 from process 2

Process 3 sent the number 10 to process 0
Process 0 received the number 10 from process 3

Generalise the code to work for an arbitrary number of processes and confirm that your program continues to behave as expected.
Some hints:

  1. Define integers source and dest. For each rank, the source should be rank-1 and dest should be rank+1, with exceptions for the first and last ranks.
  2. Rank 0 needs to start the chain by posting a send, and follow that by posting a recv to receive the number from the last rank. Every other rank needs to post a recv first and then a send.
  3. In order to obtain the anticipated output order, it helps to have each process wait for a second after each MPI_Recv. In Fortran this is achieved with call sleep(1); the C function has the same name but requires #include <unistd.h>.
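Hint 1 can be expressed as a small helper function. The modular form below handles the wraparound for the first and last ranks in a single expression, which also takes care of the generalisation to an arbitrary number of processes (a sketch; the names source and dest follow the hint):

```c
/* Compute the ring neighbours of a given rank: messages flow
 * rank-1 -> rank -> rank+1, wrapping around at the ends. */
void ring_neighbours(int rank, int size, int *source, int *dest)
{
    *source = (rank - 1 + size) % size;  /* rank 0 receives from the last rank */
    *dest   = (rank + 1) % size;         /* the last rank sends back to rank 0 */
}
```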

Creating Deadlock

Download one of the following programs (array.f90 - array.c) in which two processes send each other an array of data. Have a look at the code, and make sure you understand what is going on (the important sections are the process-specific instructions). Compile the code, and run the executable on two processes. You will see that 100 numbers are sent from Process 0 to Process 1, and another 100 numbers (a separate data set) are sent in the reverse direction. Now run the code again with a command-line argument such as

srun -n 2 ./main 500

which alters the amount of data which is exchanged between the processes. Explore a variety of values, from 1 real up to 4000 reals. For large values you will find that the program appears to hang. In fact the program is deadlocked, with each process waiting for the other to finish. Press Ctrl-C twice quickly to cancel execution, and try a different value. Continue experimenting until you determine the maximum array size which won't deadlock.

While this behaviour may seem unexpected, the explanation is relatively straightforward. Small messages are sent using a buffer, while large messages require simultaneous cooperation between the two processes. In the former case, even though both processes initiate MPI_Send at much the same time, once the message has been copied to the system buffer, execution passes to the following line of code and the appropriate messages are received. In the case of the large messages, however, MPI_Send can't complete without the cooperation of a matching MPI_Recv. Since both processes post MPI_Send, neither command completes and a deadlock condition is created.
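In code, the pattern looks like this (a sketch with illustrative names; array.c will differ in detail). The first form deadlocks once the arrays outgrow the system buffer; reordering the calls on one process removes the problem for all sizes:

```c
/* Deadlock-prone: both processes post MPI_Send first.  This is safe
 * only while the message fits in the system buffer; for large arrays
 * both sends block, each waiting for a matching receive. */
if (rank == 0) {
    MPI_Send(a, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    MPI_Recv(b, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
} else {
    MPI_Send(a, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    MPI_Recv(b, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

/* One fix: reverse the order on one process, so that every send is
 * immediately matched by a receive, regardless of message size. */
if (rank == 0) {
    MPI_Send(a, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    MPI_Recv(b, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
} else {
    MPI_Recv(b, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Send(a, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
}
```

MPI also provides MPI_Sendrecv, which performs the exchange in a single call and avoids this deadlock by construction.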

Having determined the critical value for deadlock, run the program again with two command-line arguments, where one number is larger than the deadlock threshold and the other is smaller. With this format you specify separate array sizes for the messages sent from each process. Consider a range of message sizes, and explain what you observe.

As a final exercise, modify the array program to use double-precision data instead of the single-precision real (Fortran) and float (C) types. Does this affect the threshold at which deadlock occurs?

Collective Communication

Further Information

There are numerous guides for MPI available online. The documents below are from major supercomputing centres and provide an excellent introduction.

Today we ran a number of simple tests either directly on the login node or by running srun from the login node. It is also possible to create an interactive session on a compute node by typing the command

salloc --tasks=4 -p work -A courses0100 --reservation=CurtinHPCcourse2026 --time=00:10:00

to start a 10 minute session with 4 MPI processes. Vary the arguments of --tasks and --time according to the problem at hand. Once the interactive session has started, use srun in the usual manner. Note that you are free to specify fewer processes than the maximum should you choose to do so, but you can't request more. Be aware that Setonix is a production machine that is important for many researchers, so only request the resources you need. Additionally, do not leave the interactive session idle for long periods, and log out once you are finished.

If you wish to run gnuplot you will need to add the -Y option as an additional argument to ssh. This flag instructs ssh to forward the X11 output back to your display. To enable this you will need to log out and log in again. Once you have done this, type

module load gnuplot/6.0.0

to activate the relevant paths and libraries. Note that gnuplot only works remotely over X11, so if you don't have an X11 server running on your machine you will need to use another plotting program.