\documentclass[12pt]{article}

\usepackage{url}
\usepackage{times}
\usepackage[dvips]{graphicx}

%\textheight=25cm
\textwidth=14cm
%\topmargin=-4cm
%\oddsidemargin=0cm




\begin{document}

\title{Pypar Tutorial\\
Building a Parallel Program using Python}


\author{Ole Nielsen \\
Australian National University, Canberra \\
Ole.Nielsen@anu.edu.au}

\maketitle
\section*{Introduction}

This tutorial demonstrates the essential features of parallel programming
and shows how to build a simple Python MPI program using the MPI binding
\texttt{pypar}, available at
\url{http://datamining.anu.edu.au/software/pypar}.

It is assumed that \texttt{pypar} has been installed on your system.
If not, see the installation document (\texttt{pypar\_installation.pdf})
that comes with pypar
(or get it from the web site at
\url{http://datamining.anu.edu.au/~ole/pypar/pypar_installation.pdf}).

For a reference to all of Pypar's functionality, see the
reference manual (\texttt{pypar\_reference.pdf})
that comes with pypar (or get it from the web site at
\url{http://datamining.anu.edu.au/~ole/pypar/pypar_reference.pdf}).

\section{The Message Passing Interface --- MPI}
MPI is the de facto standard for parallel programming of both
distributed computer clusters and shared memory architectures.
MPI programming takes the form of special commands that are
imported from the MPI library. They tell the program
how many processes are available and which number has been
allocated to each process, and they provide functions for
communicating with other processes.
See \url{http://www.netlib.org/utk/papers/mpi-book/mpi-book.html} for
a comprehensive reference to MPI (based on
the C or Fortran programming languages).

It is said that MPI is both large and small. What is meant is that the
MPI standard has hundreds of functions in it. However, many of the
advanced routines represent functionality that can be ignored unless
one pursues added flexibility (data types), overlap of computation and
communication (nonblocking send/receive), modularity (groups,
communicators), or convenience (collective operations, topologies).
MPI is said to be small because the majority of useful and efficient
parallel programs are written using only a small essential subset of MPI.
In fact, it can be argued that parallel performance is best achieved by
keeping the parallelism simple.

Pypar is a Python binding to such a subset.


\section{A simple Pypar MPI program}
Let us start with a simple example: a Python program
that imports the pypar module and
reports the number of processors, the local id and the host name.
This is the Pypar equivalent of the infamous 'hello world'
program, if you like.
Make a file called \texttt{ni.py} using your favourite
editor\footnote{The author believes that your favourite editor should
be GNU/Emacs, but it is up to you, of course.}
and type in the following contents:
{\footnotesize
\begin{verbatim}
import pypar

p = pypar.rank()
P = pypar.size()
node = pypar.get_processor_name()

print "I am process %d of %d on node %s" %(p, P, node)

pypar.finalize()
\end{verbatim}}
\noindent and run it as follows:
\begin{verbatim}
mpirun -np 4 python ni.py
\end{verbatim}
You should get a greeting from each of the four nodes looking
something like
\begin{verbatim}
Pypar (version 1.9) initialised MPI OK with 4 processors
I am process 0 of 4 on node ninja-1
I am process 3 of 4 on node ninja-4
I am process 1 of 4 on node ninja-2
I am process 2 of 4 on node ninja-3
\end{verbatim}
The order of output may be arbitrary! If four copies of the
program are run, print is called four times. The order in which each
process prints its message is not determined; it depends on when each
process reaches that point in its execution of the program and on how
the messages travel across the network.
\emph{Note that all the print statements, though they come from different
processes, will send their output intact to your shell window; this
is generally true of output commands.
Input commands, like \texttt{raw\_input},
will only work on the process with rank zero.}


\subsection{The MPI execution model}

It is important to understand that in MPI (and therefore also in
Pypar), multiple identical copies of this program
will start simultaneously in separate processes.
These processes may run on one machine or on multiple machines.
This is a fundamental difference from ordinary programs,
where, if someone says ``run the program'', it is
assumed that there is only one instance of the program running.


In a nutshell, this program sets up a communication group of 4
processes, where each process gets its rank (\texttt{p}), prints it,
and exits.

The following walks through the equivalent C program line by line.

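The C listing itself (\texttt{hello.c}) is not shown here; the sketch below is
a reconstruction based on the walkthrough that follows and on the sample
output shown later, so treat the exact greeting text and variable names as
assumptions.
{\footnotesize
\begin{verbatim}
/* hello.c -- minimal MPI hello world (reconstructed sketch) */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int myid;

  MPI_Init(&argc, &argv);                  /* set up MPI              */
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);    /* get this process' rank  */

  printf("Sawatdii khrap thuk thuk khon (Process %d)\n", myid);

  MPI_Finalize();                          /* shut down MPI           */
  return 0;
}
\end{verbatim}}
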
The first line,
\texttt{\#include <stdio.h>},
should be familiar to all C programmers. It includes the standard
input/output routines like printf. The second line,
\texttt{\#include <mpi.h>},
includes the MPI functions. The file \texttt{mpi.h}
contains prototypes for all the
MPI routines in this program; this file is located in
\texttt{/usr/include/lam/mpi.h} in case you actually want to look at it.
The program starts with the \texttt{main} function, which takes the two
arguments \texttt{argc} and \texttt{argv} (as is normal in C): the argument
\texttt{argc} is
the number of command line arguments given to the program (including itself)
and \texttt{argv} is an array of strings containing them. For example,
argv[0] will contain the program name itself. We will not be using them
directly, but MPI needs access to them through \texttt{MPI\_Init}.
Then the program declares one integer variable, \texttt{myid}. The
first step of the program,
\begin{verbatim}
MPI_Init(&argc,&argv);
\end{verbatim}
calls MPI\_Init to set up everything.
This should be the first command executed in all MPI programs. This
routine takes pointers to \texttt{argc} and \texttt{argv},
looks at them, pulls out the purely
MPI-relevant things, and generally fixes them so you can use command line
arguments as normal.

\medskip

Next, the program runs \texttt{MPI\_Comm\_rank},
passing it the default communicator MPI\_COMM\_WORLD
(which is the group of \emph{all} processes started)
and a \emph{pointer} to \texttt{myid}. Passing a pointer is the way
C can change the value of the argument \texttt{myid}.
\texttt{MPI\_Comm\_rank} will set \texttt{myid} to the rank of the calling
process within that group.
Remember that in reality, several instances of this program
start up in several different processes when this program is run. Each
will receive a unique number from \texttt{MPI\_Comm\_rank}.
Because multiple copies are running, each will execute all lines
in the program including the \texttt{printf} statement which prints a
message and the rank.
After doing everything else, the program calls \texttt{MPI\_Finalize},
which generally terminates everything and shuts down MPI. This should be the
last command executed in all MPI programs.

\subsection*{Compile the MPI program}

Normally, C programs are compiled with a command such as
\texttt{gcc hello.c -o hello.x}. However, MPI programs need access to
additional libraries, which are provided through the wrapper program
\texttt{mpicc}. It is used just like the ordinary compiler. Try to run
\begin{verbatim}
mpicc hello.c -o hello.x
\end{verbatim}
If you made a syntactically correct program, a \emph{compiled}
program will appear with the name you specified after the -o option.
If not, you will see some error messages and you'll have to correct
those before continuing.

\subsection*{Run the MPI program}
In order to run a compiled MPI program, you must type:
\begin{verbatim}
mpirun -np <number of processes> [options] <program name and arguments>
\end{verbatim}
where you specify the number of processes on which you want to run
your parallel program, optional mpirun options,
and your program name and its expected arguments.
In this case try:
\begin{verbatim}
mpirun -np 1 hello.x
\end{verbatim}
This will run one copy of your program and the output should look like this:
\begin{verbatim}
Sawatdii khrap thuk thuk khon (Process 0)
\end{verbatim}

Now try to start four processes by running \texttt{mpirun -np 4 hello.x}.
You should see the following output:
\begin{verbatim}
Sawatdii khrap thuk thuk khon (Process 0)
Sawatdii khrap thuk thuk khon (Process 3)
Sawatdii khrap thuk thuk khon (Process 1)
Sawatdii khrap thuk thuk khon (Process 2)
\end{verbatim}
Note that the order of output may be arbitrary! If four copies of the
program are run, printf is called four times. The order in which each
process prints its message is not determined; it depends on when each
process reaches that point in its execution of the program and on how
the messages travel across the network. Your guess is as good as mine.
\emph{Note that all the printf's, though they come from different
processes, will send their output intact to your shell window; this
is generally true of output commands. Input commands, like scanf,
will only work on the process with rank zero.}

\subsubsection*{Exercise}
The MPI command
\texttt{MPI\_Comm\_size(MPI\_COMM\_WORLD, \&number\_of\_processes);} will
store the total number of processes started in the integer variable
\texttt{number\_of\_processes}. Modify the program
to give the following output:
\begin{verbatim}
Sawatdii khrap thuk thuk khon (Process 0 of 4)
Sawatdii khrap thuk thuk khon (Process 3 of 4)
Sawatdii khrap thuk thuk khon (Process 1 of 4)
Sawatdii khrap thuk thuk khon (Process 2 of 4)
\end{verbatim}
when started with four processes.

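A minimal sketch of the required change is shown below; the variable name
\texttt{number\_of\_processes} follows the exercise text, and the surrounding
program is assumed to be the hello world program from above.
{\footnotesize
\begin{verbatim}
int myid, number_of_processes;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
MPI_Comm_size(MPI_COMM_WORLD, &number_of_processes); /* total process count */

printf("Sawatdii khrap thuk thuk khon (Process %d of %d)\n",
       myid, number_of_processes);

MPI_Finalize();
\end{verbatim}}
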
\subsection*{Running on multiple computers}

So far we have only \emph{simulated} the parallelism by running
multiple processes on one machine.
We will now add host specific information to the program and then run it
in parallel.

Add the following declarations to your program:
\begin{verbatim}
int namelen;
char processor_name[MPI_MAX_PROCESSOR_NAME];
\end{verbatim}

Add the command
\begin{verbatim}
MPI_Get_processor_name(processor_name, &namelen);
\end{verbatim}
This will store the hostname of the processor executing the given process
along with the length of the hostname. Then modify the print statement to
\begin{verbatim}
printf("Sawatdii khrap thuk thuk khon (Process %d of %d running on %s)\n",
       myid, number_of_processes, processor_name);
\end{verbatim}
When you compile and run the program you should see the following output:
\begin{verbatim}
Sawatdii khrap thuk thuk khon (Process 0 of 4 running on ninja-n)
Sawatdii khrap thuk thuk khon (Process 1 of 4 running on ninja-n)
Sawatdii khrap thuk thuk khon (Process 2 of 4 running on ninja-n)
Sawatdii khrap thuk thuk khon (Process 3 of 4 running on ninja-n)
\end{verbatim}
We are still running all our MPI processes on one computer but we are
now ready to run this program in parallel.

Edit the file \texttt{\~{}/.lamhosts} to contain all machines in our network
and run \texttt{lamboot -v \~{}/.lamhosts} again. You should see the following
diagnostic message:
\begin{verbatim}
LAM 6.5.8/MPI 2 C++/ROMIO - Indiana University

Executing hboot on n0 (ninja-1 - 1 CPU)...
Executing hboot on n1 (ninja-2 - 1 CPU)...
Executing hboot on n2 (ninja-3 - 1 CPU)...
Executing hboot on n3 (ninja-4 - 1 CPU)...
Executing hboot on n4 (ninja-5 - 1 CPU)...
Executing hboot on n5 (ninja-6 - 1 CPU)...
Executing hboot on n6 (ninja-7 - 1 CPU)...
Executing hboot on n7 (ninja-8 - 1 CPU)...
topology done
\end{verbatim}
\noindent Finally try to run your program again and verify that it runs
on different nodes.
What happens if you run more processes than there are physical machines?


\section*{Optional: synchronise the clocks}

\textbf{\emph{Note: This is an optional exercise for those who can't get enough!}}

The clocks in our cluster are not synchronised: they show different times.
Try for example to run
\begin{verbatim}
ssh -x ninja-2 date; ssh -x ninja-5 date; ssh -x ninja-3 date; date
\end{verbatim}
You will see that the times listed differ significantly.

Try searching the web for terms like
\texttt{synchronize time network}
and see if you can figure out how to do it.

\section*{MPI resources}

\begin{itemize}
\item \url{http://www.cs.appstate.edu/~can/classes/5530} (excellent introduction
and many useful links)
\item \url{http://www-unix.mcs.anl.gov/mpi/mpich} (the MPICH implementation)
\item \url{http://www.lam-mpi.org} (the LAM-MPI implementation)
\item \url{http://www.netlib.org/utk/papers/mpi-book/mpi-book.html}\\
(the complete MPI reference)
\end{itemize}



\section*{Exercise 6}

In this exercise we will continue the introduction to MPI
by using the local id to differentiate work and
by familiarising ourselves with the basic send and receive
primitives.
We will then proceed to measure the fundamental characteristics of
the network: namely latency and bandwidth.



\subsection*{Using local id to differentiate work}
In Exercise 5 we had every processor write to the screen. This
is convenient for debugging purposes
and I suggest you start every parallel program
with a message from each processor about its rank, the total number
of processors and the hostname it is running on (see below).

However, it is often also desirable to have only one
processor write to the screen,
for example to report the final result of a calculation.

\subsubsection*{Exercise}
Write a parallel program that
produces something like the following output when run on four processors:
\begin{verbatim}
P2/4: Initialised OK on host ninja-6
P0/4: Initialised OK on host ninja-4
P0: Program terminating
P1/4: Initialised OK on host ninja-5
P3/4: Initialised OK on host ninja-7
\end{verbatim}
Arrange the program such that
all processors report on their initialisation but
only one, processor 0, reports on its termination just prior
to \texttt{MPI\_Finalize()}.
Note, as in Exercise 5, that the order of output is
arbitrary.
%\emph{Hint: Use the value of myid to decide whether to print or not}.

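A minimal sketch of the rank test that produces this pattern is shown below;
the exact message strings are illustrative, and the hostname code is assumed
to come from the earlier ``Running on multiple computers'' section.
{\footnotesize
\begin{verbatim}
/* Every process reports its initialisation ... */
printf("P%d/%d: Initialised OK on host %s\n",
       myid, number_of_processes, processor_name);

/* ... but only process 0 reports the termination. */
if (myid == 0) {
    printf("P%d: Program terminating\n", myid);
}
MPI_Finalize();
\end{verbatim}}
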
\subsection*{A simple send and receive pair}
In this part we will use the fundamental MPI send and receive commands
to communicate among the processors. The commands are:
\begin{itemize}
\item \texttt{MPI\_Send}:
This routine performs a basic send; this routine may block until
the message
is received, depending on the specific implementation of MPI.
\begin{verbatim}
int MPI_Send(void* buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)

Input:
  buf      - initial address of send buffer
  count    - number of elements in send buffer (nonnegative integer)
  datatype - datatype of each send buffer element
  dest     - rank of destination (integer)
  tag      - message tag (integer)
  comm     - communicator
\end{verbatim}
\textbf{Example}: To send one integer, A say, to processor 3 with tag 13:
\begin{verbatim}
MPI_Send(&A, 1, MPI_INT, 3, 13, MPI_COMM_WORLD);
\end{verbatim}
\item \texttt{MPI\_Recv}:
This routine performs a basic receive.
\begin{verbatim}
int MPI_Recv(void* buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm,
             MPI_Status *status)

Input:
  count    - maximum number of elements in receive buffer (integer)
  datatype - datatype of each receive buffer element
  source   - rank of source (integer)
  tag      - message tag (integer)
  comm     - communicator

Output:
  buf      - initial address of receive buffer
  status   - status object, provides information about the
             message received;
             status is a structure of type MPI_Status, the element
             status.MPI_SOURCE is the source of the message received,
             and the element status.MPI_TAG is the tag value.
\end{verbatim}
\textbf{Example}: To receive one integer, B say, from processor 7
with tag 13:
\begin{verbatim}
MPI_Recv(&B, 1, MPI_INT, 7, 13, MPI_COMM_WORLD, &status);
\end{verbatim}
\end{itemize}

\noindent These calls are described in more detail in Ian Foster's online book
in \textbf{MPI Basics}:
\url{http://www-unix.mcs.anl.gov/dbpp/text/node96.html}.
The chapter about \textbf{Asynchronous communication},
\url{http://www-unix.mcs.anl.gov/dbpp/text/node98.html}, describes how
to query the status object.

MPI defines the following constants for use with \texttt{MPI\_Recv}:
\begin{itemize}
\item \texttt{MPI\_ANY\_SOURCE}: Can be used to specify that a message
can be received from anywhere instead of a specific source.
\item \texttt{MPI\_ANY\_TAG}: Can be used to specify that a message
can have any tag instead of a specific tag.
\end{itemize}

\subsubsection*{Exercise}
The following program segment implements a very simple
communication pattern: processor 0 sends a number to processor 1 and
both output a diagnostic message.

Add to your previous program the declarations:
\begin{verbatim}
int A, B, source, destination, tag=13;
MPI_Status status;
\end{verbatim}
and the following code segment
\begin{verbatim}
if (myid == 0) {
  A = 42;
  destination = 1;
  printf("P%d: Sending value %d to MPI process %d\n",
         myid, A, destination);
  MPI_Send(&A, 1, MPI_INT, destination, tag, MPI_COMM_WORLD);
} else if (myid == 1) {
  source = 0;
  MPI_Recv(&B, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
  printf("P%d: Received value %d from MPI process %d\n", myid, B, source);
}
\end{verbatim}
Make sure it compiles and runs on 2 processors.
Verify that your output looks something like
\begin{verbatim}
P0/2: Initialised OK on host ninja-1
P0: Sending value 42 to MPI process 1
P0: Terminating
P1/2: Initialised OK on host ninja-2
P1: Received value 42 from MPI process 0
\end{verbatim}



\subsection*{Send messages around in a ring - Exercise}

Write an MPI program which passes data
around in a ring structure: from process 0 to process 1, and so on.
When the message reaches the last processor it should be passed back to
processor 0 - this forms a 'communication ring'.
Every process should add one to the value before passing it on
and write out a diagnostic message.

With a starting value of 42 and running on four processors your program
should produce the following output:
\begin{verbatim}
P0/4: Initialised OK on host ninja-1
P0: Sending value 42 to MPI process 1
P1/4: Initialised OK on host ninja-2
P3/4: Initialised OK on host ninja-4
P1: Received value 42 from MPI process 0
P1: Sending value 43 to MPI process 2
P2/4: Initialised OK on host ninja-3
P2: Received value 43 from MPI process 1
P2: Sending value 44 to MPI process 3
P3: Received value 44 from MPI process 2
P3: Sending value 45 to MPI process 0
P0: Received value 45 from MPI process 3
\end{verbatim}
\emph{Hint}: You can use the modulo operator to calculate
the destination such that the next processor after the last becomes 0
(see the sketch below):
\begin{verbatim}
destination = (myid + 1) % number_of_processes;
\end{verbatim}


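Here is a minimal sketch of the ring exchange, assuming \texttt{myid} and
\texttt{number\_of\_processes} have been set up as in the previous exercises;
variable names and messages are illustrative.
{\footnotesize
\begin{verbatim}
int value, source, destination, tag = 13;
MPI_Status status;

destination = (myid + 1) % number_of_processes;
source      = (myid - 1 + number_of_processes) % number_of_processes;

if (myid == 0) {
  value = 42;                                 /* start the ring       */
  printf("P%d: Sending value %d to MPI process %d\n",
         myid, value, destination);
  MPI_Send(&value, 1, MPI_INT, destination, tag, MPI_COMM_WORLD);
  MPI_Recv(&value, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
  printf("P%d: Received value %d from MPI process %d\n", myid, value, source);
} else {
  MPI_Recv(&value, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
  printf("P%d: Received value %d from MPI process %d\n", myid, value, source);
  value = value + 1;                          /* add one and pass on  */
  printf("P%d: Sending value %d to MPI process %d\n",
         myid, value, destination);
  MPI_Send(&value, 1, MPI_INT, destination, tag, MPI_COMM_WORLD);
}
\end{verbatim}}
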
\subsection*{Timing}
A main reason for doing parallel computing is faster execution.
Therefore \emph{timing} is an integral part of this game.
MPI provides the function \texttt{MPI\_Wtime}, which returns the elapsed
wall clock time in seconds.
To time a code segment with \texttt{MPI\_Wtime} do something like this:
\begin{verbatim}
double t0;

t0 = MPI_Wtime();
// (Computations)
printf("Time = %.6f sec\n", MPI_Wtime() - t0);
\end{verbatim}

\subsubsection*{Exercise}
Time the previous 'ring' program in such a way that only process 0
measures the time and writes it to the screen.

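A minimal sketch of how process 0 alone can time the ring exchange from the
previous exercise:
{\footnotesize
\begin{verbatim}
double t0 = 0.0;

if (myid == 0) t0 = MPI_Wtime();    /* start the clock on process 0 only */

/* ... the ring communication from the previous exercise goes here ... */

if (myid == 0)
  printf("P%d: Ring completed in %.6f sec\n", myid, MPI_Wtime() - t0);
\end{verbatim}}
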
\subsection*{Measure network latency and bandwidth}

The time it takes to transmit a message can be approximated by the model
\[
  t = t_l + \alpha t_b
\]
where
\begin{itemize}
\item $t_l$ is the \emph{latency}. The latency is the
time it takes to start the transmission before anything
gets communicated. This involves opening the network connection,
buffering the message and so on.
\item $t_b$ is the time it takes to communicate one byte once
the connection is open. This is related to the \emph{bandwidth} $B$,
defined as the number of bytes transmitted per second, as
\[
  B = \frac{1}{t_b}
\]
\item $\alpha$ is the number of bytes communicated.
\end{itemize}

\subsubsection*{Exercise}
\textbf{Can you work out what the network latency is?}
\emph{Hint:} Send a very small message (e.g.\ one integer)
from one processor to another and back again a number of times
and measure the time.

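A possible sketch of such a ping-pong test between processes 0 and 1 is shown
below; the number of repetitions is an arbitrary choice. The round trip time
divided by two and by the number of repetitions approximates the latency,
since almost no data is transmitted.
{\footnotesize
\begin{verbatim}
int i, x = 0, repeats = 1000, tag = 17;
double t0;
MPI_Status status;

if (myid == 0) {
  t0 = MPI_Wtime();
  for (i = 0; i < repeats; i++) {
    MPI_Send(&x, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
    MPI_Recv(&x, 1, MPI_INT, 1, tag, MPI_COMM_WORLD, &status);
  }
  printf("Estimated latency = %.9f sec\n",
         (MPI_Wtime() - t0) / (2.0 * repeats));
} else if (myid == 1) {
  for (i = 0; i < repeats; i++) {
    MPI_Recv(&x, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
    MPI_Send(&x, 1, MPI_INT, 0, tag, MPI_COMM_WORLD);
  }
}
\end{verbatim}}
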
\subsubsection*{Exercise}
\textbf{Can you work out how many megabytes the network can transmit per
second?}
Create arrays of varying size, send them from one processor
to another and back, and measure the time (see the sketch after the list).
\emph{Hints:}
\begin{itemize}
\item To create an array of double precision floating point numbers
do the following:
\begin{verbatim}
#define MAX_LEN  500000   /* Largest block */
double A[MAX_LEN];
\end{verbatim}
\item To populate the array with pseudo random numbers:
\begin{verbatim}
for (j=0; j<MAX_LEN; j++) A[j]=rand();
\end{verbatim}
\item To send only part of the array use the second argument in
MPI\_Send and MPI\_Recv to control how much information is transferred.
\item MPI\_Send and MPI\_Recv require a \emph{pointer} to the first
element of the array. Since arrays are already defined as pointers,
simply put \texttt{A} rather than \texttt{\&A}, which is what
we used earlier for the
simple integer variable.
\item One double precision number takes up 8 bytes - use that in your
estimate.
\end{itemize}

Your measurements may fluctuate based on the network traffic and the
load on the machines.

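A sketch of one way to do the measurement, assuming the array \texttt{A} has
been declared and populated as in the hints above; the block sizes and the
single round trip per size are arbitrary choices.
{\footnotesize
\begin{verbatim}
int len, tag = 18;
double t0, seconds, megabytes;
MPI_Status status;

for (len = 1000; len <= MAX_LEN; len *= 10) {
  if (myid == 0) {
    t0 = MPI_Wtime();
    MPI_Send(A, len, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    MPI_Recv(A, len, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD, &status);
    seconds   = (MPI_Wtime() - t0) / 2.0;        /* one-way time       */
    megabytes = 8.0 * len / (seconds * 1.0e6);   /* 8 bytes per double */
    printf("%d doubles: %.6f sec, %.2f MB/sec\n", len, seconds, megabytes);
  } else if (myid == 1) {
    MPI_Recv(A, len, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, &status);
    MPI_Send(A, len, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD);
  }
}
\end{verbatim}}
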


\section*{Good MPI resources}
\begin{itemize}
\item MPI Data types:\\
\url{http://www.ats.ucla.edu/at/hpc/parallel_computing/mpi-intro.htm}
\item Online book by Ian Foster:\\
\url{http://www-unix.mcs.anl.gov/dbpp/text/node94.html}
\end{itemize}




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Identifying the processors}


%%%%%%%%%%%


It is almost impossible to demonstrate parallelism with an interactive
session at the command prompt as we did with the Python and Maple tutorials.
Therefore we must write a \emph{program} that can be executed in parallel.
Start an editor of your choice with a new file
(called \texttt{paraprime.py}, say). For example:
\begin{verbatim}
emacs paraprime.py &
\end{verbatim}

\noindent Write the following in the file and save it:
{\small \begin{verbatim}
import pypar

numproc = pypar.size()
myid = pypar.rank()

print "I am proc %d of %d" %(myid, numproc)
\end{verbatim}}

Then type the following at the Unix prompt to execute the
program normally on one processor:
{\small \begin{verbatim}
python paraprime.py
\end{verbatim}}
\noindent You should see the following output:
{\small \begin{verbatim}
> python paraprime.py
I am proc 0 of 1
\end{verbatim}}


\noindent Now let us try to run the same program on four processors:
{\small \begin{verbatim}
> prun -n 4 python paraprime.py
I am proc 0 of 4
I am proc 1 of 4
I am proc 2 of 4
I am proc 3 of 4
\end{verbatim}}
If you get a message from each of the four processors stating their
id as above, we know that everything works. We have a parallel program!

\section{Strategies for efficient parallel programming}
A final note on efficiency.

Parallel programming should lead to faster execution and/or
the ability to deal with larger problems.

The ultimate goal is to be \emph{P} times faster with \emph{P} processors.
However, the \emph{speedup} is usually less than \emph{P}. One must address
three critical issues to achieve good speedup:
\begin{itemize}
\item \textbf{Interprocessor communication:} The amount and the frequency
of messages should be kept as low as possible. Ensure that processors
have plenty of work to do between communications.
\item \textbf{Data distribution and load balancing:} If some processors
finish much sooner than others, the total execution time of the parallel
program is bounded by that of the slowest processor and we say that the
program is poorly load balanced.
Ensure that each processor gets its fair share of the workload.
\item \textbf{Sequential parts of a program:} If half of a program, say, is
inherently sequential, the speedup can never exceed 2 no matter how well
the remaining half is parallelised. This is known as Amdahl's law
(see the formula below).
Ensure that all cost intensive parts get parallelised.
\end{itemize}
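
For reference, Amdahl's law can be stated as follows: if a fraction $f$ of the
program is inherently sequential, the speedup on $P$ processors is bounded by
\[
  S(P) = \frac{1}{f + (1 - f)/P} \leq \frac{1}{f}.
\]
For example, with $f = 0.5$ (half the program sequential) the speedup can
never exceed $1/f = 2$, as stated above.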




\section*{Exercise 8}

In Exercises 6 and 7 you sent messages around in rings and back and forth
between two processors, and you also timed the network speed when
communicating between two processors.
The purpose of this exercise is to try a collective communication pattern
where one processor distributes data to and gathers data from all the others
using the basic MPI\_Send and MPI\_Recv.

\subsection*{Distribute and gather integers}

Create a program where process 0 distributes an integer to all other processes,
then receives an integer back from each of them.
The other processes should receive the integer from process 0, add their
rank to it and pass it back to process 0.

With some suitable diagnostic output your program should produce
something like this when run on four processors
(a sketch follows the sample output):

\begin{verbatim}
P0/4: Initialised OK on host ninja-1
P0: Sending value 42 to MPI process 1
P0: Sending value 42 to MPI process 2
P0: Sending value 42 to MPI process 3
P3/4: Initialised OK on host ninja-4
P1/4: Initialised OK on host ninja-2
P2/4: Initialised OK on host ninja-3
P3: Received value 42 from MPI process 0
P3: Sending value 45 to MPI process 0
P1: Received value 42 from MPI process 0
P1: Sending value 43 to MPI process 0
P2: Received value 42 from MPI process 0
P2: Sending value 44 to MPI process 0
P0: Received value 43 from MPI process 1
P0: Received value 44 from MPI process 2
P0: Received value 45 from MPI process 3
\end{verbatim}

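A minimal sketch of the distribute and gather pattern with point to point
calls; the starting value 42 and the variable names are illustrative.
{\footnotesize
\begin{verbatim}
int i, A, B, tag = 19;
MPI_Status status;

if (myid == 0) {
  A = 42;
  for (i = 1; i < number_of_processes; i++) {   /* distribute ...      */
    printf("P%d: Sending value %d to MPI process %d\n", myid, A, i);
    MPI_Send(&A, 1, MPI_INT, i, tag, MPI_COMM_WORLD);
  }
  for (i = 1; i < number_of_processes; i++) {   /* ... and gather back */
    MPI_Recv(&B, 1, MPI_INT, i, tag, MPI_COMM_WORLD, &status);
    printf("P%d: Received value %d from MPI process %d\n", myid, B, i);
  }
} else {
  MPI_Recv(&B, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
  printf("P%d: Received value %d from MPI process %d\n", myid, B, 0);
  B = B + myid;                                 /* add own rank        */
  printf("P%d: Sending value %d to MPI process %d\n", myid, B, 0);
  MPI_Send(&B, 1, MPI_INT, 0, tag, MPI_COMM_WORLD);
}
\end{verbatim}}
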
\subsection*{Distribute and gather arrays}

In this part we will do exactly the same as above except we will use
a double precision array as the data.

\begin{enumerate}
\item Create a double precision array (A) of length $N=1024$ to use as buffer.
\item On process 0, populate it with the numbers $0$ to $N-1$
(using A[i] = (double) i;).
\item Pass A on to all other processors (the workers).
\item All workers should compute the $p$-norm of the array,
where $p$ is the rank of the worker:
\[
\left(
  \sum_{i=0}^{N-1} A[i]^p
\right)^{1/p}
\]
\item Then return the result as one double precision number per worker and
print them all out on processor 0 (see the sketch after the hints).
\end{enumerate}

\emph{Hints:}
\begin{itemize}
\item To compute $x^y$ use the C function \texttt{pow(x, y)}.
\item Include the math headers in your program: \verb+#include <math.h>+
\item You'll also need to link in the math library as follows:
\texttt{mpicc progname.c -lm}
%\item You might want to add many print statements reporting the communication pattern in
\end{itemize}

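The norm computation on each worker might look like the sketch below,
assuming \texttt{A} has been received into a local buffer of length
\texttt{N} and that \texttt{tag} is a message tag agreed with process 0;
variable names are illustrative.
{\footnotesize
\begin{verbatim}
double sum = 0.0, result;
int i, p = myid;               /* the exponent is this worker's rank */

for (i = 0; i < N; i++) {
  sum += pow(A[i], (double) p);
}
result = pow(sum, 1.0 / (double) p);

/* return one double precision number to process 0 */
MPI_Send(&result, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD);
\end{verbatim}}
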

If your program works it should produce something like this on four processors:
\begin{verbatim}
P0/4: Initialised OK on host ninja-1
P3/4: Initialised OK on host ninja-4
P2/4: Initialised OK on host ninja-3
P1/4: Initialised OK on host ninja-2
P0: Received value 523776.00 from MPI process 1
P0: Received value 18904.76 from MPI process 2
P0: Received value 6497.76 from MPI process 3
\end{verbatim}




\end{document}