\documentclass[12pt]{article}
\usepackage{url}
\usepackage{times}
\usepackage[dvips]{graphicx}

%\textheight=25cm
\textwidth=14cm
%\topmargin=-4cm
%\oddsidemargin=0cm

\begin{document}

\title{Pypar Tutorial\\
Building a Parallel Program using Python}

\author{Ole Nielsen \\
  Australian National University, Canberra \\
  Ole.Nielsen@anu.edu.au}

\maketitle

\section*{Introduction}

This is a tutorial demonstrating essential features of parallel programming
by building a simple Python MPI program using the MPI binding
\texttt{pypar}, available at
\url{http://datamining.anu.edu.au/software/pypar}.
It is assumed that \texttt{pypar} has been installed on your system.
If not, see the installation document (\texttt{pypar\_installation.pdf})
that comes with pypar (or get it from the web site at
\url{http://datamining.anu.edu.au/~ole/pypar/pypar_installation.pdf}).
For a reference to all of Pypar's functionality, see the reference manual
(\texttt{pypar\_reference.pdf}) that comes with pypar (or get it from the
web site at
\url{http://datamining.anu.edu.au/~ole/pypar/pypar_reference.pdf}).

\section{The Message Passing Interface --- MPI}

MPI is the de facto standard for parallel programming of both distributed
computer clusters and shared memory architectures. MPI programming takes
the form of special commands that are imported from the MPI library. They
provide information to the program about how many processes are available,
which number each process has been allocated, and functions to communicate
with other processes.
See \url{http://www.netlib.org/utk/papers/mpi-book/mpi-book.html}
for a comprehensive reference to MPI (based on the C or Fortran
programming languages).

It is said that MPI is both large and small. What is meant is that the MPI
standard has hundreds of functions in it. However, many of the advanced
routines represent functionality that can be ignored unless one pursues
added flexibility (data types), overlap of computation and communication
(nonblocking send/receive), modularity (groups, communicators), or
convenience (collective operations, topologies). MPI is said to be small
because the majority of useful and efficient parallel programs are written
using only a small essential subset of MPI. In fact, it can be argued that
parallel performance is best achieved by keeping the parallelism simple.
Pypar is a Python binding to such a subset.

\section{A simple Pypar MPI program}

Let us start with a simple example: a Python program that imports the pypar
module and reports the number of processes, the local id and the host name.
This is the Pypar equivalent of the infamous 'hello world' program, if you
like. Make a file called \texttt{ni.py} using your favourite
editor\footnote{The author believes that your favourite editor should be
GNU/Emacs, but it is up to you, of course.} and type in the following
contents:
{\footnotesize
\begin{verbatim}
import pypar

p = pypar.rank()
P = pypar.size()
node = pypar.get_processor_name()

print "I am process %d of %d on node %s" %(p, P, node)

pypar.finalize()
\end{verbatim}}
\noindent and run it as follows:
\begin{verbatim}
mpirun -np 4 python ni.py
\end{verbatim}
You should get a greeting from each of the four nodes looking
something like
\begin{verbatim}
Pypar (version 1.9) initialised MPI OK with 4 processors
I am process 0 of 4 on node ninja-1
I am process 3 of 4 on node ninja-4
I am process 1 of 4 on node ninja-2
I am process 2 of 4 on node ninja-3
\end{verbatim}
The order of output may be arbitrary!
If four copies of the program are run, print is called four times. The
order in which each process prints its message is not determined: it
depends on when each process reaches that point in its execution of the
program and on how the output travels on the network.
\emph{Note that all the print statements, though they come from different
processes, will send their output intact to your shell window; this is
generally true of output commands. Input commands, like
\texttt{raw\_input}, will only work on the process with rank zero.}

\subsection{The MPI execution model}

It is important to understand that in MPI (and therefore also in Pypar)
multiple identical copies of this program will start simultaneously in
separate processes. These processes may run on one machine or on multiple
machines. This is a fundamental difference from ordinary programs, where,
if someone says ``run the program'', it is assumed that there is only one
instance of the program running.

In a nutshell, the program sets up a communication group of 4 processes
(when started with \texttt{mpirun -np 4}), where each process gets its rank
(\texttt{p}), prints it, and exits. To see what the Pypar calls do behind
the scenes, let us walk through the equivalent C program, \texttt{hello.c},
which is also used in the exercises below.

The first line, \verb+#include <stdio.h>+, should be familiar to all C
programmers. It includes the standard input/output routines like
\texttt{printf}. The second line, \verb+#include <mpi.h>+, includes the MPI
functions. The file \texttt{mpi.h} contains prototypes for all the MPI
routines in this program; this file is located in
\texttt{/usr/include/lam/mpi.h} in case you actually want to look at it.
The program starts with the \texttt{main} function, which takes the two
arguments \texttt{argc} and \texttt{argv} (as is normal in C):
the argument \texttt{argc} is the number of command-line arguments given to
the program (including itself) and \texttt{argv} is an array of strings
containing them. For example, \texttt{argv[0]} will contain the program
name itself. We will not be using them directly, but MPI needs access to
them through \texttt{MPI\_Init}.
Then the program declares one integer variable, \texttt{myid}.
The first step of the program,
\begin{verbatim}
MPI_Init(&argc, &argv);
\end{verbatim}
calls \texttt{MPI\_Init} to set up everything. This should be the first
command executed in all MPI programs. This routine takes pointers to
\texttt{argc} and \texttt{argv}, looks at them, pulls out the purely
MPI-relevant things, and generally fixes them so you can use command-line
arguments as normal.

\medskip

Next, the program runs \texttt{MPI\_Comm\_rank}, passing it the default
communicator \texttt{MPI\_COMM\_WORLD} (which is the group of \emph{all}
processes started) and a \emph{pointer} to \texttt{myid}. Passing a pointer
is the way C can change the value of the argument \texttt{myid}.
\texttt{MPI\_Comm\_rank} will set \texttt{myid} to the rank of the calling
process. Remember that in reality several instances of this program start
up in several different processes when the program is run. Each will
receive a unique number from \texttt{MPI\_Comm\_rank}.
Because multiple copies are running, each will execute all lines in the
program, including the \texttt{printf} statement which prints a message and
the rank.
After doing everything else, the program calls \texttt{MPI\_Finalize},
which generally terminates everything and shuts down MPI. This should be
the last command executed in all MPI programs.
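For reference, here is how the Pypar program \texttt{ni.py} from the start
of this section relates to the MPI routines just described. This is only a
sketch of the correspondence: importing \texttt{pypar} is assumed to take
care of the \texttt{MPI\_Init} step automatically, as suggested by the
``initialised MPI OK'' line in the output above.
{\footnotesize
\begin{verbatim}
# ni.py revisited: the pypar calls and the MPI routines they correspond to
import pypar                       # initialises MPI (MPI_Init)

p = pypar.rank()                   # MPI_Comm_rank: rank of this process
P = pypar.size()                   # MPI_Comm_size: number of processes
node = pypar.get_processor_name()  # MPI_Get_processor_name: host name

print "I am process %d of %d on node %s" %(p, P, node)

pypar.finalize()                   # MPI_Finalize: shut down MPI
\end{verbatim}}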
\subsection*{Compile the MPI program}

Normally, C programs are compiled with a command such as
\texttt{gcc hello.c -o hello.x}.
However, MPI programs need access to more libraries, which are provided
through the wrapper program \texttt{mpicc}. This can be used as usual.
Try to run
\begin{verbatim}
mpicc hello.c -o hello.x
\end{verbatim}
If you made a syntactically correct program, a \emph{compiled} program will
appear with the name you specified after the \texttt{-o} option. If not,
you will see some error messages and you'll have to correct those before
continuing.

\subsection*{Run the MPI program}

In order to run a compiled MPI program, you must type:
\begin{verbatim}
mpirun -np <number_of_processes> [options] <program_name> [arguments]
\end{verbatim}
where you specify the number of processes on which you want to run your
parallel program, optional mpirun options, and your program name and its
expected arguments.
In this case try:
\begin{verbatim}
mpirun -np 1 hello.x
\end{verbatim}
This will run one copy of your program and the output should look like this:
\begin{verbatim}
Sawatdii khrap thuk thuk khon (Process 0)
\end{verbatim}

Now try to start four processes by running \texttt{mpirun -np 4 hello.x}.
You should see the following output:
\begin{verbatim}
Sawatdii khrap thuk thuk khon (Process 0)
Sawatdii khrap thuk thuk khon (Process 3)
Sawatdii khrap thuk thuk khon (Process 1)
Sawatdii khrap thuk thuk khon (Process 2)
\end{verbatim}
Note that the order of output may be arbitrary! If four copies of the
program are run, printf is called four times. The order in which each
process prints its message is not determined: it depends on when each
process reaches that point in its execution of the program and on how the
output travels on the network. Your guess is as good as mine.
\emph{Note that all the printf's, though they come from different
processes, will send their output intact to your shell window; this is
generally true of output commands. Input commands, like scanf, will only
work on the process with rank zero.}

\subsubsection*{Exercise}
The MPI command
\texttt{MPI\_Comm\_size(MPI\_COMM\_WORLD, \&number\_of\_processes);}
will store the total number of processes started in the integer variable
\texttt{number\_of\_processes}.
Modify the program to give the following output:
\begin{verbatim}
Sawatdii khrap thuk thuk khon (Process 0 of 4)
Sawatdii khrap thuk thuk khon (Process 3 of 4)
Sawatdii khrap thuk thuk khon (Process 1 of 4)
Sawatdii khrap thuk thuk khon (Process 2 of 4)
\end{verbatim}
when started with four processes.

\subsection*{Running on multiple computers}

So far we have only \emph{simulated} the parallelism by running multiple
processes on one machine.
We will now add host specific information to the program and then run it
in parallel.
Add the following declarations to your program:
\begin{verbatim}
int namelen;
char processor_name[MPI_MAX_PROCESSOR_NAME];
\end{verbatim}
Add the command
\begin{verbatim}
MPI_Get_processor_name(processor_name, &namelen);
\end{verbatim}
This will store the hostname of the processor executing the given process,
along with the length of the hostname.
Then modify the print statement to
\begin{verbatim}
printf("Sawatdii khrap thuk thuk khon (Process %d of %d running on %s)\n",
       myid, number_of_processes, processor_name);
\end{verbatim}
Compiling and running the program you should see the following output:
\begin{verbatim}
Sawatdii khrap thuk thuk khon (Process 0 of 4 running on ninja-n)
Sawatdii khrap thuk thuk khon (Process 1 of 4 running on ninja-n)
Sawatdii khrap thuk thuk khon (Process 2 of 4 running on ninja-n)
Sawatdii khrap thuk thuk khon (Process 3 of 4 running on ninja-n)
\end{verbatim}

We are still running all our MPI processes on one computer, but we are now
ready to run this program in parallel.
Edit the file \verb+~/.lamhosts+ to contain all machines in our network and
run \verb+lamboot -v ~/.lamhosts+ again.
You should see the following diagnostic message:
\begin{verbatim}
LAM 6.5.8/MPI 2 C++/ROMIO - Indiana University

Executing hboot on n0 (ninja-1 - 1 CPU)...
Executing hboot on n1 (ninja-2 - 1 CPU)...
Executing hboot on n2 (ninja-3 - 1 CPU)...
Executing hboot on n3 (ninja-4 - 1 CPU)...
Executing hboot on n4 (ninja-5 - 1 CPU)...
Executing hboot on n5 (ninja-6 - 1 CPU)...
Executing hboot on n6 (ninja-7 - 1 CPU)...
Executing hboot on n7 (ninja-8 - 1 CPU)...
topology done
\end{verbatim}
\noindent Finally try to run your program again and verify that it runs on
different nodes.
What happens if you run more processes than there are physical machines?

\section*{Optional: synchronise the clocks}

\textbf{\emph{Note: This is an optional exercise for those who can't get
enough!}}
The clocks in our cluster are not synchronised: they show different times.
Try for example to run
\begin{verbatim}
ssh -x ninja-2 date; ssh -x ninja-5 date; ssh -x ninja-3 date; date
\end{verbatim}
You will see that the times listed differ significantly.
Try and search for things like \texttt{synchronize time network} and see if
you can figure out how to do it.

\section*{MPI resources}
\begin{itemize}
  \item \url{http://www.cs.appstate.edu/~can/classes/5530}
        (excellent introduction and many useful links)
  \item \url{http://www-unix.mcs.anl.gov/mpi/mpich}
        (The MPICH implementation)
  \item \url{http://www.lam-mpi.org} (The LAM-MPI implementation)
  \item \url{http://www.netlib.org/utk/papers/mpi-book/mpi-book.html}\\
        (The complete MPI reference)
\end{itemize}

\section*{Exercise 6}

In this exercise we will continue the introduction to MPI by using the
local id to differentiate work and by familiarising ourselves with the
basic send and receive primitives. We will then proceed to measure the
fundamental characteristics of the network: namely latency and bandwidth.

\subsection*{Using local id to differentiate work}

In Exercise 5 we had every processor write to the screen. This is
convenient for debugging purposes and I suggest you start every parallel
program with a message from each processor about its rank, the total number
of processors and the hostname it is running on (see below).
However, it is often also desirable to have only one processor write to the
screen, for example to report the final result of a calculation.
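In Pypar the same pattern is obtained by branching on the rank. The
following is a minimal sketch (using only the pypar calls introduced
above); the \texttt{Pp/P} diagnostic format mirrors the sample output used
in the exercises below.
\begin{verbatim}
# Sketch: every process reports its initialisation, but only process 0
# reports the final result / termination.
import pypar

p = pypar.rank()
P = pypar.size()
node = pypar.get_processor_name()

print "P%d/%d: Initialised OK on host %s" %(p, P, node)

# ... computations ...

if p == 0:
    print "P%d: Program terminating" %p

pypar.finalize()
\end{verbatim}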
\subsubsection*{Exercise}

Write a parallel program that produces something like the following output
when run on four processors:
\begin{verbatim}
P2/4: Initialised OK on host ninja-6
P0/4: Initialised OK on host ninja-4
P0: Program terminating
P1/4: Initialised OK on host ninja-5
P3/4: Initialised OK on host ninja-7
\end{verbatim}
Arrange the program such that all processors report on their initialisation
but only one, processor 0, reports on its termination just prior to
\texttt{MPI\_Finalize()}.
Note, as in Exercise 5, that the order of output is arbitrary.
%\emph{Hint: Use the value of myid to decide whether to print or not}.

\subsection*{A simple send and receive pair}

In this part we will use the fundamental MPI send and receive commands to
communicate among the processors. The commands are:
\begin{itemize}
  \item \texttt{MPI\_Send}:
  This routine performs a basic send; it may block until the message is
  received, depending on the specific implementation of MPI.
\begin{verbatim}
int MPI_Send(void* buf, int count, MPI_Datatype datatype, int dest,
             int tag, MPI_Comm comm)

Input:
  buf      - initial address of send buffer
  count    - number of elements in send buffer (nonnegative integer)
  datatype - datatype of each send buffer element
  dest     - rank of destination (integer)
  tag      - message tag (integer)
  comm     - communicator
\end{verbatim}
  \textbf{Example}: To send one integer, A say, to processor 3 with tag 13:
\begin{verbatim}
MPI_Send(&A, 1, MPI_INT, 3, 13, MPI_COMM_WORLD);
\end{verbatim}

  \item \texttt{MPI\_Recv}:
  This routine performs a basic receive.
\begin{verbatim}
int MPI_Recv(void* buf, int count, MPI_Datatype datatype, int source,
             int tag, MPI_Comm comm, MPI_Status *status)

Input:
  count    - maximum number of elements in receive buffer (integer)
  datatype - datatype of each receive buffer element
  source   - rank of source (integer)
  tag      - message tag (integer)
  comm     - communicator

Output:
  buf      - initial address of receive buffer
  status   - status object, provides information about the message
             received; status is a structure of type MPI_Status, the
             element status.MPI_SOURCE is the source of the message
             received, and the element status.MPI_TAG is the tag value.
\end{verbatim}
  \textbf{Example}: To receive one integer, B say, from processor 7 with
  tag 13:
\begin{verbatim}
MPI_Recv(&B, 1, MPI_INT, 7, 13, MPI_COMM_WORLD, &status);
\end{verbatim}
\end{itemize}

\noindent
These calls are described in more detail in Ian Foster's online book in
\textbf{MPI Basics}:
\url{http://www-unix.mcs.anl.gov/dbpp/text/node96.html}.
The chapter about \textbf{Asynchronous communication},
\url{http://www-unix.mcs.anl.gov/dbpp/text/node98.html}, describes how to
query the status object.

MPI defines the following constants for use with \texttt{MPI\_Recv}:
\begin{itemize}
  \item \texttt{MPI\_ANY\_SOURCE}: Can be used to specify that a message
        can be received from anywhere instead of a specific source.
  \item \texttt{MPI\_ANY\_TAG}: Can be used to specify that a message can
        have any tag instead of a specific tag.
\end{itemize}
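Pypar provides simpler, higher-level counterparts of these two routines:
\texttt{pypar.send(x, destination)} sends a Python object \texttt{x} to the
process with rank \texttt{destination}, and
\texttt{x = pypar.receive(source)} receives one from \texttt{source} (see
the reference manual for the optional arguments such as tags). A minimal
sketch mirroring the two C examples above:
\begin{verbatim}
# Sketch: send one integer from process 0 to process 3 using pypar.
# Run with at least four processes, e.g. mpirun -np 4 python send_recv.py
import pypar

p = pypar.rank()

if p == 0:
    A = 42
    pypar.send(A, 3)        # send A to process 3
elif p == 3:
    B = pypar.receive(0)    # receive the value from process 0
    print "P%d: received %d" %(p, B)

pypar.finalize()
\end{verbatim}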
\subsubsection*{Exercise}

The following program segment implements a very simple communication
pattern: processor 0 sends a number to processor 1 and both output a
diagnostic message.

Add to your previous program the declarations:
\begin{verbatim}
int A, B, source, destination, tag=13;
MPI_Status status;
\end{verbatim}
and the following code segment
\begin{verbatim}
if (myid == 0) {
  A = 42;
  destination = 1;
  printf("P%d: Sending value %d to MPI process %d\n",
         myid, A, destination);
  MPI_Send(&A, 1, MPI_INT, destination, tag, MPI_COMM_WORLD);
} else if (myid == 1) {
  source = 0;
  MPI_Recv(&B, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
  printf("P%d: Received value %d from MPI process %d\n",
         myid, B, source);
}
\end{verbatim}
Make sure it compiles and runs on 2 processors.
Verify that your output looks something like
\begin{verbatim}
P0/2: Initialised OK on host ninja-1
P0: Sending value 42 to MPI process 1
P0: Terminating
P1/2: Initialised OK on host ninja-2
P1: Received value 42 from MPI process 0
\end{verbatim}

\subsection*{Send messages around in a ring --- Exercise}

Write an MPI program which passes data around in a ring structure: from
process 0 to process 1, and so on. When the message reaches the last
processor it should be passed back to processor 0 --- this forms a
'communication ring'. Every process should add one to the value before
passing it on and write out a diagnostic message.

With a starting value of 42 and running on four processors your program
should produce the following output:
\begin{verbatim}
P0/4: Initialised OK on host ninja-1
P0: Sending value 42 to MPI process 1
P1/4: Initialised OK on host ninja-2
P3/4: Initialised OK on host ninja-4
P1: Received value 42 from MPI process 0
P1: Sending value 43 to MPI process 2
P2/4: Initialised OK on host ninja-3
P2: Received value 43 from MPI process 1
P2: Sending value 44 to MPI process 3
P3: Received value 44 from MPI process 2
P3: Sending value 45 to MPI process 0
P0: Received value 45 from MPI process 3
\end{verbatim}
\emph{Hint}: You can use the modulus operator (\texttt{\%}) to calculate
the destination such that the next processor after the last becomes 0:
\begin{verbatim}
destination = (myid + 1) % number_of_processes;
\end{verbatim}

\subsection*{Timing}

A main reason for doing parallel computing is faster execution.
Therefore \emph{timing} is an integral part of this game.
MPI provides the function \texttt{MPI\_Wtime}, which returns the wall clock
time in seconds. To time a code segment with \texttt{MPI\_Wtime} do
something like this:
\begin{verbatim}
double t0;

t0 = MPI_Wtime();
// (Computations)
printf("Time = %.6f sec\n", MPI_Wtime() - t0);
\end{verbatim}

\subsubsection*{Exercise}
Time the previous 'ring' program in such a way that only process 0 measures
the time and writes it to the screen.

\subsection*{Measure network latency and bandwidth}

The time it takes to transmit a message can be approximated by the model
\[
  t = t_l + \alpha t_b,
\]
where
\begin{itemize}
  \item $t_l$ is the \emph{latency}. The latency is the time it takes to
        start the transmission before anything gets communicated. This
        involves opening the network connection, buffering the message and
        so on.
  \item $t_b$ is the time it takes to communicate one byte once the
        connection is open. This is related to the \emph{bandwidth} $B$,
        defined as the number of bytes transmitted per second, by
        \[
          B = \frac{1}{t_b}.
        \]
  \item $\alpha$ is the number of bytes communicated.
\end{itemize}

\subsubsection*{Exercise}

\textbf{Can you work out what the network latency is?}
\emph{Hint:} Send a very small message (e.g.\ one integer) from one
processor to another and back again a number of times and measure the
time; a sketch of this `ping-pong' pattern in Pypar is given below.
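As a starting point, here is a minimal Pypar sketch of such a ping-pong
measurement. It uses \texttt{pypar.send}/\texttt{pypar.receive} as above
and Python's standard \texttt{time} module for the timing (Pypar may also
expose an \texttt{MPI\_Wtime}-style timer; check the reference manual).
The number of round trips is an arbitrary choice.
\begin{verbatim}
# Sketch: ping-pong a single integer between processes 0 and 1 to
# estimate the network latency. Run with at least two processes.
import time
import pypar

p = pypar.rank()
repeats = 1000                     # arbitrary number of round trips

if p == 0:
    t0 = time.time()
    for i in range(repeats):
        pypar.send(i, 1)           # ping
        x = pypar.receive(1)       # pong
    elapsed = time.time() - t0
    # Each round trip pays the latency twice (once in each direction).
    print "Estimated latency = %.6f sec" %(elapsed/(2.0*repeats))
elif p == 1:
    for i in range(repeats):
        x = pypar.receive(0)
        pypar.send(x, 0)

pypar.finalize()
\end{verbatim}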
\subsubsection*{Exercise}

\textbf{Can you work out how many megabytes the network can transmit per
second?}
Create arrays of varying size, send them from one processor to another and
back, and measure the time.
\emph{Hints:}
\begin{itemize}
  \item To create an array of double precision floating point numbers do
        the following:
\begin{verbatim}
#define MAX_LEN  500000    /* Largest block */
double A[MAX_LEN];
\end{verbatim}
  \item To populate the array with pseudo random numbers:
\begin{verbatim}
for (j = 0; j < MAX_LEN; j++) A[j] = rand();
\end{verbatim}
\end{itemize}

\noindent Running the Python program \texttt{paraprime.py} on one processor
gives
{\small
\begin{verbatim}
> python paraprime.py
I am proc 0 of 1
\end{verbatim}}
\noindent Now let us try and run the same program on four processors:
{\small
\begin{verbatim}
> prun -n 4 python paraprime.py
I am proc 0 of 4
I am proc 1 of 4
I am proc 2 of 4
I am proc 3 of 4
\end{verbatim}}
If you get a message from each of the four processors stating their id as
above, we know that everything works. We have a parallel program!

\section{Strategies for efficient parallel programming}

A final note on efficiency.
Parallel programming should lead to faster execution and/or the ability to
deal with larger problems.
The ultimate goal is to be \emph{P} times faster with \emph{P} processors.
However, the \emph{speedup} is usually less than \emph{P}. One must address
three critical issues to achieve good speedup:
\begin{itemize}
  \item \textbf{Interprocessor communication:} The amount and the frequency
        of messages should be kept as low as possible. Ensure that
        processors have plenty of work to do between communications.
  \item \textbf{Data distribution and load balancing:} If some processors
        finish much sooner than others, the total execution time of the
        parallel program is bounded by that of the slowest processor and we
        say that the program is poorly load balanced. Ensure that each
        processor gets its fair share of the work load.
  \item \textbf{Sequential parts of a program:} If half of a program, say,
        is inherently sequential, the speedup can never exceed 2 no matter
        how well the remaining half is parallelised. This is known as
        Amdahl's law. Ensure that all cost intensive parts get parallelised.
\end{itemize}

\section*{Exercise 8}

In Exercises 6--7 you sent messages around in rings and back and forth
between two processors, and you timed the network speed when communicating
between two processors.
The purpose of this exercise is to try a collective communication pattern
in which one processor sends to and gathers data from all the others using
the basic \texttt{MPI\_Send} and \texttt{MPI\_Recv}.

\subsection*{Distribute and gather integers}

Create a program where process 0 distributes an integer to all other
processes, then receives an integer back from each of them.
The other processes should receive the integer from process 0, add their
rank to it and pass it back to process 0.
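For orientation, here is a minimal Pypar sketch of this
distribute-and-gather pattern (the exercise itself asks for the C version
with \texttt{MPI\_Send} and \texttt{MPI\_Recv}); the value 42 is the same
arbitrary starting value used in the sample output below.
\begin{verbatim}
# Sketch: process 0 distributes a value to every other process and
# gathers the replies; each worker adds its rank and sends it back.
import pypar

p = pypar.rank()
P = pypar.size()
value = 42

if p == 0:
    for worker in range(1, P):
        pypar.send(value, worker)        # distribute
    for worker in range(1, P):
        result = pypar.receive(worker)   # gather
        print "P0: Received value %d from MPI process %d" %(result, worker)
else:
    x = pypar.receive(0)                 # receive from process 0
    pypar.send(x + p, 0)                 # add own rank and return

pypar.finalize()
\end{verbatim}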
With some suitable diagnostic output your program should produce something
like this when run on four processors:
\begin{verbatim}
P0/4: Initialised OK on host ninja-1
P0: Sending value 42 to MPI process 1
P0: Sending value 42 to MPI process 2
P0: Sending value 42 to MPI process 3
P3/4: Initialised OK on host ninja-4
P1/4: Initialised OK on host ninja-2
P2/4: Initialised OK on host ninja-3
P3: Received value 42 from MPI process 0
P3: Sending value 45 to MPI process 0
P1: Received value 42 from MPI process 0
P1: Sending value 43 to MPI process 0
P2: Received value 42 from MPI process 0
P2: Sending value 44 to MPI process 0
P0: Received value 43 from MPI process 1
P0: Received value 44 from MPI process 2
P0: Received value 45 from MPI process 3
\end{verbatim}

\subsection*{Distribute and gather arrays}

In this part we will do exactly the same as above, except we will use a
double precision array as the data.
\begin{enumerate}
  \item Create a double precision array (A) of length $N=1024$ to use as
        buffer.
  \item On process 0 populate it with the numbers $0$ to $N-1$
        (using \texttt{A[i] = (double) i;}).
  \item Pass A on to all other processes (the workers).
  \item All workers should compute the $p$-norm of the array, where $p$ is
        the rank of the worker:
        \[
          \left( \sum_{i=0}^{N-1} A[i]^p \right)^{1/p}
        \]
  \item Then return the result as one double precision number per worker
        and print them all out on processor 0.
\end{enumerate}
\emph{Hints:}
\begin{itemize}
  \item To compute $x^y$ use the C function \texttt{pow(x, y)};
  \item Include the math headers in your program:
        \verb+#include <math.h>+
  \item You'll also need to link in the math library as follows:
        \texttt{mpicc progname.c -lm}
  %\item You might want to add many print statements reporting the
  %      communication pattern
\end{itemize}

If your program works it should produce something like this on four
processors:
\begin{verbatim}
P0/4: Initialised OK on host ninja-1
P3/4: Initialised OK on host ninja-4
P2/4: Initialised OK on host ninja-3
P1/4: Initialised OK on host ninja-2
P0: Received value 523776.00 from MPI process 1
P0: Received value 18904.76 from MPI process 2
P0: Received value 6497.76 from MPI process 3
\end{verbatim}

\end{document}