High-Performance Computing - MPI

HPC: Second MPI Laboratory

Initial Setup

Today you will explore non-blocking communication and master/worker parallelism with MPI. The assignment requires you to be able to view PPM image files generated on Setonix. One approach uses X11 forwarding to render the image directly from the remote machine without copying it to your local machine. Start by connecting to Setonix by typing

ssh -Y your_username@setonix.pawsey.org.au

where the -Y flag instructs ssh to forward the X11 output back to your display. macOS users will need to install the free XQuartz package. Windows users should connect via MobaXterm, which has built-in support for X11.

To test that X11 forwarding is working correctly, cd to the directory /software/projects/courses0100 and type the command

./xiv mandelbrot.ppm

A window should appear showing a render of the Mandelbrot set. Use "+" and "-" on the keyboard to zoom, and press the arrow keys to move around the image once zoomed in. Press "q" to quit. To learn more about the functionality of this simple image viewer, type

./xiv --help

Later today we will want to access this program from other directories. To make this easier, set up an alias by typing

alias xiv="/software/projects/courses0100/xiv"

Test that your alias works by copying mandelbrot.ppm to your home directory and running xiv mandelbrot.ppm from there.
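Note that an alias defined this way only lasts for the current session. To make it permanent (assuming your login shell is bash), you could append it to your ~/.bashrc:

```shell
# Persist the xiv alias across logins (assumes bash is your login shell)
echo 'alias xiv="/software/projects/courses0100/xiv"' >> ~/.bashrc
```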

If you are unable to get xiv working through X11 (e.g. XQuartz has issues on some recent versions of macOS), or if you are finding it too slow, you may find it easier to simply scp the image files to your local machine and open them there. If you take this approach, please make sure you are able to open .ppm files on your computer (the default document viewer on macOS, Preview, can open .ppm files). Copy the mandelbrot.ppm file to your computer and make sure you are able to open it.

On an unrelated note, recall that information on the MPI functions and subroutines is available via the UNIX man pages. For example, typing

man MPI_Recv

provides information on the usage and errors associated with a blocking receive. Details for both C and Fortran syntax are provided. These pages are available for all MPI routines. Look up a few, and test your memory to see how many functions you can remember from last week.

As was the case last session, note that all source files are also available via git:

git clone https://github.com/liamscarlett/intro-mpi

Non-Blocking Communication

In the first MPI laboratory you were provided with a simple piece of code which used the standard blocking send (MPI_Send and MPI_Recv) to pass a single integer from Process 0 to Process 1. Download this code (send.f90 - send.c) and modify the program to perform non-blocking communication using MPI_Isend/MPI_Irecv/MPI_Wait.

Last session you explored the deadlock condition between two processes: small messages were transmitted via a buffer, but when both arrays exceeded a critical size a deadlock arose. Download the array program from the previous laboratory (array.f90 - array.c). To remind yourself of the deadlock, compile and run it with a 400KB message as follows (recall that a real/float occupies four bytes of memory).

srun -n 2 main 100000

Now modify the code to perform non-blocking communication using MPI_Isend/MPI_Irecv/MPI_Waitall. Show that this avoids deadlock by running the code with 400KB messages.

Your next task is to write a two-process program which performs computation during non-blocking communication. The program should have the following structure:

On Process 0:

Use MPI_Isend to send 3 million reals/floats to Process 1. Once the Isend is posted, the program should enter a while loop which
  a) increments a counter (initially set to zero)
  b) uses MPI_Test to check whether the Isend has completed.

Exit the while loop when the Isend is complete, and then print out the value of the counter.
On Process 1:

Accept the message from Process 0 using a blocking receive (MPI_Recv).

Once the message has arrived, print out the last data value received.

Compile the program and run it several times. You should find that approximately 500,000 iterations are performed during the Isend operation. Now reduce the size of the message sent (e.g. consider 10^5 or 10^4 values). What happens? What are the implications for overlapping communication and computation?

Synchronous Communication

Download once again the deadlock program from last session (array.f90 - array.c) and replace both MPI_Send calls with MPI_Ssend. Compile and run, and exchange a single number between the processes using the command-line argument

srun -n 2 ./main 1

You should find that the code now always deadlocks, regardless of message size. This is because a synchronous send does not complete until the matching receive has been posted, so no amount of buffering can break the circular wait.

Now replace one of the MPI_Ssend calls with MPI_Send as used in the original version. Run the program with two command-line arguments in order to control the size of both messages. Under some conditions the code will not deadlock. Explain what you find.

Understanding Memory

When working with arrays it is important to appreciate how data is laid out in memory. To explore the manipulation of sections of an array, download the following program (memory.f90 - memory.c) and run it on two CPUs. If you are running the C version, use the command-line arguments

srun -n 2 ./main 0 10 0

while Fortran users should invoke

srun -n 2 ./main 1 10 1

You will find that an array on Process 0 is sent to Process 1, with the exact size and location of the message determined by the command-line arguments. Note that both arrays are fixed in length and contain ten integers. The meaning of the three arguments are (in order): starting position for MPI_Send, number of elements for MPI_Send, and receive position for MPI_Recv. Inspect the code to see how this is implemented.

Vary the number of elements sent, by changing the parameter 10 to some lower number, and observe the output. Using a message-size of 5, vary the start and end parameters. What happens if the start location has a value greater than 5? Explain what you observe. What happens if the end location is greater than 5? What unpredictable behaviour might occur by sending a message in this way?

Finally, C programmers should download and compile the program memory2.c. This accepts the same arguments as memory.c, and produces the same output (check this!). The key difference, however, is that malloc is used to set up the array structures, and thus the arrays passed to MPI_Send and MPI_Recv are accessed through pointers. This is a very important subtlety, so make sure you keep track of which arrays are dynamic (i.e. use malloc) and which are static (i.e. defined in the source).

The Mandelbrot Set

Consider the iterative sequence
            z_{i+1} = (z_i)^2 + κ
where z and κ are complex numbers, and z_0 = κ.

When the magnitude of κ is large, this sequence quickly diverges (i.e. the magnitude of z goes to ∞). The Mandelbrot set is defined as the set of values of κ for which this sequence remains bounded.

It is known that the Mandelbrot set is contained within the disk |κ|≤2, and therefore we only need to iterate until |z_i|>2 to determine that a point is not in the Mandelbrot set.
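This bound gives a simple escape-time test. A sketch of the per-point iteration (variable names assumed; the provided mandelbrot.c may differ in detail):

```c
/* Sketch: count iterations until |z| > 2 for z_{i+1} = z_i^2 + kappa,
 * z_0 = kappa, up to a cap of maxiter. */
int iterations(double kre, double kim, int maxiter) {
    double zre = kre, zim = kim;              /* z_0 = kappa */
    int n;
    for (n = 0; n < maxiter; n++) {
        if (zre * zre + zim * zim > 4.0)      /* |z| > 2: diverged */
            break;
        double tmp = zre * zre - zim * zim + kre;  /* Re(z^2 + kappa) */
        zim = 2.0 * zre * zim + kim;               /* Im(z^2 + kappa) */
        zre = tmp;
    }
    return n;                                  /* the iteration count */
}
```

Points for which the loop reaches maxiter are taken to be inside the set; comparing |z|^2 against 4 avoids a square root per iteration.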

The following serial code (mandelbrot.f90 - mandelbrot.c) discretises the complex region −2≤Re(κ)≤2 and −2≤Im(κ)≤2, and computes a 2D array containing the number of iterations required to reach the condition |z|>2, with an upper bound of 1000 iterations. The iteration count is converted into false colour and written to an image file mandelbrot.ppm. Compile the program in the usual way, run the executable by typing ./main, and wait for about 20 seconds. To view the image, type

xiv mandelbrot.ppm

which runs the program seen at the start of the session. Vary the zoom and explore the structure of the Mandelbrot set, paying close attention to the self-similar (fractal) nature of the image. Note that these calculations should be performed in /scratch/courses0100/username as the PPM files are quite large. The image size is set to 2000x2000 pixels; try increasing it slightly to explore the self-similarity further.

In preparation for the assessable task, examine and compile the following code (mandelbrot2.f90 - mandelbrot2.c) which produces the same output file as the previous version, except that now the array x is one-dimensional. This modification helps greatly when constructing a parallel version of the algorithm. Note particularly the use of integer division and modulo arithmetic, which converts the scalar loop index into an (i,j) pair. Use the UNIX diff command to confirm that the output is exactly the same, being careful to ensure the image size is the same for both codes.
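The index bookkeeping is just integer division and remainder. A sketch, assuming row-by-row storage on a grid of width w (mandelbrot2.c may order i and j the other way around):

```c
/* Sketch: convert between a 1D pixel index k and an (i,j) pair on a
 * w-pixel-wide grid stored row by row. */
void index_to_pair(int k, int w, int *i, int *j) {
    *i = k % w;             /* column: the remainder        */
    *j = k / w;             /* row: the integer division    */
}

int pair_to_index(int i, int j, int w) {
    return j * w + i;       /* the inverse mapping          */
}
```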

Assessable Tasks

Your assessable tasks involve three different approaches to MPI parallelisation of the Mandelbrot problem:

  1. Using the second of the two Mandelbrot codes, rewrite the program to use MPI with 10 processes on an 8000x8000 system. Use static decomposition (block partitioning) by dividing the complex plane into 10 rectangular strips, and have each process perform the computation for its own strip. When the calculation phase is finished, use MPI_GATHER to return all data to the root process. Use the diff command to confirm that the PPM file is identical to the serial version.

    Once you have confirmed that the output is correct, comment out the code that writes the PPM file, and use MPI_WTIME and MPI_BARRIER to determine the time that each process spends on computation, waiting and communication. Comment on the significant load imbalance between the various processes. Compare the performance to the serial version.

  2. Modify your program from Task 1 to use dynamic decomposition in a master-worker parallelism with arbitrary chunksize and number of workers. Vary the chunksize, and note the effect on the speedup relative to the original serial version. As before, use diff to confirm that the output is correct. Comment out the I/O routine and determine the time that each worker spends on computation and communication for a chunksize of 100,000 and 10 workers. How does this affect the load balancing?

    Using gnuplot or similar, produce a graph of total execution time vs chunksize, and thus determine the optimum chunksize with 10 workers (hint: it will be roughly constant over a fairly wide range of values). Explain what happens for very large and very small values of the chunksize. Check that you recover the same timing result as Task 1 when the chunksize is one-tenth of the total amount of work.

    Your code should work even when the chunksize does not evenly divide N*N. However, it is perfectly acceptable for this task to allow one of your workers to work on pixels outside the bounds of the main pixel grid to allow all workers to be given chunks of equal size. All that matters is that those out-of-bounds pixels are not printed to the image file, and that your solution does not suffer from memory-access faults. Compile your C code with the extra flags -g -fsanitize=address to check for memory access faults.

    Note: Do not use MPI_GATHER or any other type of collective communication for Task 2.

  3. As a final task, repeat the static decomposition exercise in Task 1 using a 1D cyclic partition instead of block partitioning. Compare your results with those from Task 1, and comment on what the best MPI parallelism strategy is for the Mandelbrot set.

Guidelines and Advice

The report is due on Thursday March 26th. It is to be written in a single-column format and should be prepared using LaTeX or a word processor such as Word or Pages. Please follow these instructions exactly to ensure your assignment is marked:

  1. Export your report as a PDF named FirstName.LastName.pdf.
  2. Please ensure that your first and last names are identical to how you are registered in BlackBoard, otherwise I may not be able to find you.
  3. Make sure that your name and SID are clearly stated on the first page of the PDF (e.g. in the header).
  4. Bundle your report and source code together in a ZIP file named FirstName.LastName.zip.
  5. Email your ZIP file to Liam Scarlett.
  6. Do not also attach the PDF. I will find it inside the ZIP file.
  7. The subject line of the email should be PHYS4004 MPI ASSESSABLE TASK (any emails without this will not be found when it comes time for me to start marking).
  8. Also upload your assignment to BlackBoard, with the PDF and source code as separate items (in that order).

Having the submissions in a uniform format makes my life easier when it comes to collecting and marking assignments. Any variation to the instructions above complicates things. There is no need to make waffly reports with cover pages, table of contents, lists of figures, etc. Strive for brevity. A summary of the rationale for your grade, and a breakdown of the grades for each task will be sent to your Curtin email address.

The report should be precise, understandable and concise. It should include a combination of written responses, timing data, graphical representations of key concepts (load balancing, communication cost, wait times, etc) and code. For the load balancing questions, be sure to include graphs that resemble the example on the "Timing Visualisation" slide in Lecture 2. Take care to answer all questions and address all requests for comment and comparison. Include all code fragments that are essential to understand the logical flow, but there is no need to repetitively include lengthy/identical I/O routines for all three programs. Do not pad out the report with pointless whitespace or meaningless words and figures. As a rough guide, it should not exceed ten pages in length.

The report should convey your insight into the subtleties of MPI programming and explain the logic behind your solution. It should include a critical assessment of the best strategy and the compromises involved. For Tasks 2 and 3, there are many possibilities. Feel free to suggest more than one solution.

A "template" for your report:
  1. Introduction:
  2. Static decomposition (block partitioning):
  3. Dynamic decomposition (master-worker):
    Include all the same points as previous, and:
  4. Static decomposition (cyclic partitioning):
    Include all the same points as for the block partitioning, and:
  5. Conclusions:


In developing your parallel algorithms and MPI programming skills you might find the following advice useful: