\documentclass[12pt]{article}

\usepackage{url}
\usepackage{times}
\usepackage[dvips]{graphicx}

%\textheight=25cm
\textwidth=14cm
%\topmargin=-4cm
%\oddsidemargin=0cm


\begin{document}

\title{Pypar Tutorial\\
Building a Parallel Program using Python}


\author{Ole Nielsen \\
Australian National University, Canberra \\
Ole.Nielsen@anu.edu.au}

\maketitle
\section*{Introduction}

This is a tutorial demonstrating essential
features of parallel programming and building
a simple Python MPI program using the MPI binding \texttt{pypar} available at
\url{http://datamining.anu.edu.au/software/pypar}.

It is assumed that \texttt{pypar} has been installed on your system.
If not, see the installation document (\texttt{pypar\_installation.pdf})
that comes with pypar
(or get it from the web site at
\url{http://datamining.anu.edu.au/~ole/pypar/pypar_installation.pdf}).

For a reference to all of Pypar's functionality, see the
reference manual (\texttt{pypar\_reference.pdf})
that comes with pypar (or get it from the web site at
\url{http://datamining.anu.edu.au/~ole/pypar/pypar_reference.pdf}).

\section{The Message Passing Interface --- MPI}
MPI is the de facto standard for parallel programming of both
distributed computer clusters and shared memory architectures.
MPI programming takes the form of special commands that are
imported from the MPI library. They provide information to the program
about how many processes are available, which number each process has been
allocated, and functions to communicate with other processes.
See \url{http://www.netlib.org/utk/papers/mpi-book/mpi-book.html} for
a comprehensive reference to MPI (based on
the C or Fortran programming languages).

It is said that MPI is both large and small. What is meant is that the
MPI standard has hundreds of functions in it. However, many of the
advanced routines represent functionality that can be ignored unless
one pursues added flexibility (data types), overlap of computation and
communication (nonblocking send/receive), modularity (groups,
communicators), or convenience (collective operations, topologies).
MPI is said to be small because the majority of useful and efficient
parallel programs are written using only a small essential subset of MPI.
In fact, it can be argued that parallel performance is best achieved by
keeping the parallelism simple.

Pypar is a Python binding to such a subset.

\section{A simple Pypar MPI program}
Let us start with a simple example: a Python program
that imports the pypar module and
reports the number of processors, the local id and the host name.
This is, if you like, the Pypar equivalent of the classic `hello world'
program.
Make a file called \texttt{ni.py} using your favourite
editor\footnote{The author believes that your favourite editor should
be GNU/Emacs, but it is up to you, of course.}
and type in the following contents:
{\footnotesize
\begin{verbatim}
import pypar

p = pypar.rank()
P = pypar.size()
node = pypar.get_processor_name()

print "I am process %d of %d on node %s" %(p, P, node)

pypar.finalize()
\end{verbatim}}
\noindent and run it as follows:
\begin{verbatim}
mpirun -np 4 python ni.py
\end{verbatim}
You should get a greeting from each of the four nodes looking
something like
\begin{verbatim}
Pypar (version 1.9) initialised MPI OK with 4 processors
I am process 0 of 4 on node ninja-1
I am process 3 of 4 on node ninja-4
I am process 1 of 4 on node ninja-2
I am process 2 of 4 on node ninja-3
\end{verbatim}
The order of output may be arbitrary! If four copies of the
program are run, print is called four times. The order in which each
process prints its message is not determined; it depends on when each
process reaches that point in its execution of the program and on how
the messages travel on the network.
\emph{Note that all the print statements, though they come from different
processes, will send their output intact to your shell window; this
is generally true of output commands.
Input commands, like \texttt{raw\_input},
will only work on the process with rank zero.}


\subsection{The MPI execution model}

It is important to understand that in MPI (and therefore also in
Pypar), multiple identical copies of this program
will start simultaneously in separate processes.
These processes may run on one machine or on multiple machines.
This is a fundamental difference from ordinary programs,
where, if someone says ``run the program'', it is
assumed that only one instance of the program is running.

In a nutshell, this program sets up a communication group of 4
processes, where each process gets its rank (\texttt{p}), prints it,
and exits.

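The following walkthrough refers to a C version of this `hello world'
program, which is not reproduced in the original text. A minimal sketch
consistent with the description below might look like this (the file name
\texttt{hello.c} is an assumption; the greeting string is taken from the
example output shown later in this tutorial):
\begin{verbatim}
/* hello.c -- minimal MPI example (sketch) */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int myid;

    MPI_Init(&argc, &argv);                /* Start up MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);  /* Rank of this process */

    printf("Sawatdii khrap thuk thuk khon (Process %d)\n", myid);

    MPI_Finalize();                        /* Shut down MPI */
    return 0;
}
\end{verbatim}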

The first line,
\texttt{\#include <stdio.h>},
should be familiar to all C programmers. It includes the standard
input/output routines like \texttt{printf}. The second line,
\texttt{\#include <mpi.h>},
includes the MPI functions. The file \texttt{mpi.h}
contains prototypes for all the
MPI routines in this program; this file is located in
\texttt{/usr/include/lam/mpi.h} in case you actually want to look at it.
The program starts with the \texttt{main} function, which takes the two
arguments \texttt{argc} and \texttt{argv} (as is normal in C): the argument
\texttt{argc} is
the number of command line arguments given to the program (including itself)
and \texttt{argv} is an array of strings containing them. For example,
\texttt{argv[0]} will contain the program name itself. We will not be using
them directly, but MPI needs access to them through \texttt{MPI\_Init}.
Then the program declares one integer variable, \texttt{myid}. The
first step of the program,
\begin{verbatim}
MPI_Init(&argc,&argv);
\end{verbatim}
calls \texttt{MPI\_Init} to set up everything.
This should be the first command executed in all MPI programs. This
routine takes pointers to \texttt{argc} and \texttt{argv},
looks at them, pulls out the purely
MPI-relevant things, and generally fixes them so you can use command line
arguments as normal.

\medskip

Next, the program runs \texttt{MPI\_Comm\_rank},
passing it the default communicator \texttt{MPI\_COMM\_WORLD}
(which is the group of \emph{all} processes started)
and a \emph{pointer} to \texttt{myid}. Passing a pointer is the way
C can change the value of the argument \texttt{myid}.
\texttt{MPI\_Comm\_rank} will set \texttt{myid} to the rank of the calling
process within this communicator.
Remember that in reality, several instances of this program
start up in several different processes when this program is run. Each
will receive a unique number from \texttt{MPI\_Comm\_rank}.
Because multiple copies are running, each will execute all lines
in the program including the \texttt{printf} statement which prints a
message and the rank.
After doing everything else, the program calls \texttt{MPI\_Finalize},
which generally terminates everything and shuts down MPI. This should be the
last command executed in all MPI programs.

\subsection*{Compile the MPI program}

Normally, C programs are compiled with a command such as
\texttt{gcc hello.c -o hello.x}. However, MPI programs need access to
additional libraries which are provided through the wrapper program
\texttt{mpicc}. It is used just like the ordinary compiler. Try to run
\begin{verbatim}
mpicc hello.c -o hello.x
\end{verbatim}
If your program is syntactically correct, a \emph{compiled}
program will appear with the name you specified after the \texttt{-o} option.
If not, you will see some error messages which you will have to correct
before continuing.

\subsection*{Run the MPI program}
In order to run an MPI compiled program, you must type:
\begin{verbatim}
mpirun -np <number of processes> [options] <program name and arguments>
\end{verbatim}
where you specify the number of processes on which you want to run
your parallel program, optional mpirun options,
and your program name and its expected arguments.
In this case try:
\begin{verbatim}
mpirun -np 1 hello.x
\end{verbatim}
This will run one copy of your program and the output should look like this:
\begin{verbatim}
Sawatdii khrap thuk thuk khon (Process 0)
\end{verbatim}

Now try to start four processes by running \texttt{mpirun -np 4 hello.x}.
You should see the following output:
\begin{verbatim}
Sawatdii khrap thuk thuk khon (Process 0)
Sawatdii khrap thuk thuk khon (Process 3)
Sawatdii khrap thuk thuk khon (Process 1)
Sawatdii khrap thuk thuk khon (Process 2)
\end{verbatim}
Note that the order of output may be arbitrary! If four copies of the
program are run, printf is called four times. The order in which each
process prints its message is not determined; it depends on when each
process reaches that point in its execution of the program and on how
the messages travel on the network. Your guess is as good as mine.
\emph{Note that all the printf's, though they come from different
processes, will send their output intact to your shell window; this
is generally true of output commands. Input commands, like \texttt{scanf},
will only work on the process with rank zero.}

\subsubsection*{Exercise}
The MPI command
\texttt{MPI\_Comm\_size(MPI\_COMM\_WORLD, \&number\_of\_processes);} will
store the total number of processes started in the integer variable
\texttt{number\_of\_processes}. Modify the program
to give the following output:
\begin{verbatim}
Sawatdii khrap thuk thuk khon (Process 0 of 4)
Sawatdii khrap thuk thuk khon (Process 3 of 4)
Sawatdii khrap thuk thuk khon (Process 1 of 4)
Sawatdii khrap thuk thuk khon (Process 2 of 4)
\end{verbatim}
when started with four processes.
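
The relevant calls could be combined as in the following sketch (the
variable names are assumptions; the added declaration and the modified
\texttt{printf} are the only changes needed):
\begin{verbatim}
int myid, number_of_processes;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);                 /* rank of this process */
MPI_Comm_size(MPI_COMM_WORLD, &number_of_processes);  /* total process count  */

printf("Sawatdii khrap thuk thuk khon (Process %d of %d)\n",
       myid, number_of_processes);
\end{verbatim}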


\subsection*{Running on multiple computers}

So far we have only \emph{simulated} the parallelism by running
multiple processes on one machine.
We will now add host specific information to the program and then run it
in parallel.

Add the following declarations to your program:
\begin{verbatim}
int namelen;
char processor_name[MPI_MAX_PROCESSOR_NAME];
\end{verbatim}

Add the command
\begin{verbatim}
MPI_Get_processor_name(processor_name, &namelen);
\end{verbatim}
This will store the hostname of the processor executing the given process
along with the length of the hostname. Then modify the print statement to
\begin{verbatim}
printf("Sawatdii khrap thuk thuk khon (Process %d of %d running on %s)\n",
       myid, number_of_processes, processor_name);
\end{verbatim}
Compiling and running the program you should see the following output:
\begin{verbatim}
Sawatdii khrap thuk thuk khon (Process 0 of 4 running on ninja-n)
Sawatdii khrap thuk thuk khon (Process 1 of 4 running on ninja-n)
Sawatdii khrap thuk thuk khon (Process 2 of 4 running on ninja-n)
Sawatdii khrap thuk thuk khon (Process 3 of 4 running on ninja-n)
\end{verbatim}
We are still running all our MPI processes on one computer but we are
now ready to run this program in parallel.

Edit the file \texttt{\~{}/.lamhosts} to contain all machines in our network
and run \texttt{lamboot -v \~{}/.lamhosts} again. You should see the following
diagnostic message:
\begin{verbatim}
LAM 6.5.8/MPI 2 C++/ROMIO - Indiana University

Executing hboot on n0 (ninja-1 - 1 CPU)...
Executing hboot on n1 (ninja-2 - 1 CPU)...
Executing hboot on n2 (ninja-3 - 1 CPU)...
Executing hboot on n3 (ninja-4 - 1 CPU)...
Executing hboot on n4 (ninja-5 - 1 CPU)...
Executing hboot on n5 (ninja-6 - 1 CPU)...
Executing hboot on n6 (ninja-7 - 1 CPU)...
Executing hboot on n7 (ninja-8 - 1 CPU)...
topology done
\end{verbatim}
\noindent Finally try to run your program again and verify that it runs
on different nodes.
What happens if you run more processes than there are physical machines?


\section*{Optional: synchronise the clocks}

\textbf{\emph{Note: This is an optional exercise for those who can't get enough!}}

The clocks in our cluster are not synchronised: they show different times.
Try for example to run
\begin{verbatim}
ssh -x ninja-2 date; ssh -x ninja-5 date; ssh -x ninja-3 date; date
\end{verbatim}
You will see that the times listed differ significantly.

Try searching the web for things like
\texttt{synchronize time network}
and see if you can figure out how to do it.

\section*{MPI resources}

\begin{itemize}
\item \url{http://www.cs.appstate.edu/~can/classes/5530} (excellent introduction
and many useful links)
\item \url{http://www-unix.mcs.anl.gov/mpi/mpich} (the MPICH implementation)
\item \url{http://www.lam-mpi.org} (the LAM-MPI implementation)
\item \url{http://www.netlib.org/utk/papers/mpi-book/mpi-book.html}\\
(the complete MPI reference)
\end{itemize}


\section*{Exercise 6}

In this exercise we will continue the introduction to MPI
by using the local id to differentiate work and
by familiarising ourselves with the basic send and receive
primitives.
We will then proceed to measure the fundamental characteristics of
the network: namely latency and bandwidth.


\subsection*{Using local id to differentiate work}
In Exercise 5 we had every processor write to the screen. This
is convenient for debugging purposes
and I suggest you start every parallel program
with a message from each processor about its rank, the total number
of processors and the hostname it is running on (see below).

However, it is often also desirable to have only one
processor write to the screen,
for example to report the final result of a calculation.

\subsubsection*{Exercise}
Write a parallel program that
produces something like the following output when run on four processors:
\begin{verbatim}
P2/4: Initialised OK on host ninja-6
P0/4: Initialised OK on host ninja-4
P0: Program terminating
P1/4: Initialised OK on host ninja-5
P3/4: Initialised OK on host ninja-7
\end{verbatim}
Arrange the program such that
all processors report on their initialisation but
only one, processor 0, reports on its termination just prior
to \texttt{MPI\_Finalize()}.
Note, as in Exercise 5, that the order of output is
arbitrary.
%\emph{Hint: Use the value of myid to decide whether to print or not}.
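
A minimal sketch of this pattern is given below (the exact message strings
are assumptions; \texttt{myid}, \texttt{number\_of\_processes} and
\texttt{processor\_name} are obtained as in the previous exercises):
\begin{verbatim}
/* Every process reports its initialisation ... */
printf("P%d/%d: Initialised OK on host %s\n",
       myid, number_of_processes, processor_name);

/* ... but only process 0 reports the termination */
if (myid == 0) {
    printf("P%d: Program terminating\n", myid);
}
MPI_Finalize();
\end{verbatim}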

\subsection*{A simple send and receive pair}
In this part we will use the fundamental MPI send and receive commands
to communicate among the processors. The commands are:
\begin{itemize}
\item \texttt{MPI\_Send}:
This routine performs a basic send; this routine may block until
the message
is received, depending on the specific implementation of MPI.
\begin{verbatim}
int MPI_Send(void* buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)

Input:
   buf      - initial address of send buffer
   count    - number of elements in send buffer (nonnegative integer)
   datatype - datatype of each send buffer element
   dest     - rank of destination (integer)
   tag      - message tag (integer)
   comm     - communicator
\end{verbatim}
\textbf{Example}: To send one integer, A say, to processor 3 with tag 13:
\begin{verbatim}
MPI_Send(&A, 1, MPI_INT, 3, 13, MPI_COMM_WORLD);
\end{verbatim}
\item \texttt{MPI\_Recv}:
This routine performs a basic receive.
\begin{verbatim}
int MPI_Recv(void* buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm,
             MPI_Status *status)

Input:
   count    - maximum number of elements in receive buffer (integer)
   datatype - datatype of each receive buffer element
   source   - rank of source (integer)
   tag      - message tag (integer)
   comm     - communicator

Output:
   buf      - initial address of receive buffer
   status   - status object, provides information about the
              message received;
              status is a structure of type MPI_Status: the element
              status.MPI_SOURCE is the source of the message received,
              and the element status.MPI_TAG is the tag value.
\end{verbatim}
\textbf{Example}: To receive one integer, B say, from processor 7
with tag 13:
\begin{verbatim}
MPI_Recv(&B, 1, MPI_INT, 7, 13, MPI_COMM_WORLD, &status);
\end{verbatim}
\end{itemize}

\noindent These calls are described in more detail in Ian Foster's online book
in \textbf{MPI Basics}:
\url{http://www-unix.mcs.anl.gov/dbpp/text/node96.html}.
The chapter about \textbf{Asynchronous communication},
\url{http://www-unix.mcs.anl.gov/dbpp/text/node98.html}, describes how
to query the status object.

MPI defines the following constants for use with \texttt{MPI\_Recv}:
\begin{itemize}
\item \texttt{MPI\_ANY\_SOURCE}: Can be used to specify that a message
can be received from anywhere instead of a specific source.
\item \texttt{MPI\_ANY\_TAG}: Can be used to specify that a message
can have any tag instead of a specific tag.
\end{itemize}
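
For example, a receive that accepts a message from any process with any tag,
and then inspects the status object to find out where it came from, might
look like this (a sketch; \texttt{B} and \texttt{status} are declared as in
the exercise below):
\begin{verbatim}
/* Accept one integer from any source with any tag */
MPI_Recv(&B, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);
printf("Received %d from process %d (tag %d)\n",
       B, status.MPI_SOURCE, status.MPI_TAG);
\end{verbatim}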

\subsubsection*{Exercise}
The following program segment implements a very simple
communication pattern: Processor 0 sends a number to processor 1 and
both output a diagnostic message.

Add to your previous program the declarations:
\begin{verbatim}
int A, B, source, destination, tag=13;
MPI_Status status;
\end{verbatim}
and the following code segment
\begin{verbatim}
if (myid == 0) {
  A = 42;
  destination = 1;
  printf("P%d: Sending value %d to MPI process %d\n",
         myid, A, destination);
  MPI_Send(&A, 1, MPI_INT, destination, tag, MPI_COMM_WORLD);
} else if (myid == 1) {
  source = 0;
  MPI_Recv(&B, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
  printf("P%d: Received value %d from MPI process %d\n", myid, B, source);
}
\end{verbatim}
Make sure it compiles and runs on two processors.
Verify that your output looks something like
\begin{verbatim}
P0/2: Initialised OK on host ninja-1
P0: Sending value 42 to MPI process 1
P0: Terminating
P1/2: Initialised OK on host ninja-2
P1: Received value 42 from MPI process 0
\end{verbatim}


\subsection*{Send messages around in a ring - Exercise}

Write an MPI program which passes data
around in a ring structure: from process 0 to process 1, then to
process 2, and so on.
When the message reaches the last processor it should be passed back to
processor 0 - this forms a `communication ring'.
Every process should add one to the value before passing it on
and write out a diagnostic message.

With a starting value of 42 and running on four processors your program
should produce the following output:
\begin{verbatim}
P0/4: Initialised OK on host ninja-1
P0: Sending value 42 to MPI process 1
P1/4: Initialised OK on host ninja-2
P3/4: Initialised OK on host ninja-4
P1: Received value 42 from MPI process 0
P1: Sending value 43 to MPI process 2
P2/4: Initialised OK on host ninja-3
P2: Received value 43 from MPI process 1
P2: Sending value 44 to MPI process 3
P3: Received value 44 from MPI process 2
P3: Sending value 45 to MPI process 0
P0: Received value 45 from MPI process 3
\end{verbatim}
\emph{Hint}: You can use the modulus operator (\texttt{\%}) to calculate
the destination such that the next processor after the last becomes 0:
\begin{verbatim}
destination = (myid + 1) % number_of_processes;
\end{verbatim}
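
One possible arrangement of the communication is sketched below (a code
segment only; \texttt{myid} and \texttt{number\_of\_processes} are assumed to
be set up as in the previous exercises, and the tag value is an arbitrary
choice):
\begin{verbatim}
int value, tag = 13;
int next     = (myid + 1) % number_of_processes;
int previous = (myid - 1 + number_of_processes) % number_of_processes;
MPI_Status status;

if (myid == 0) {
  value = 42;                      /* Process 0 starts the ring ... */
  printf("P%d: Sending value %d to MPI process %d\n", myid, value, next);
  MPI_Send(&value, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
  MPI_Recv(&value, 1, MPI_INT, previous, tag, MPI_COMM_WORLD, &status);
  printf("P%d: Received value %d from MPI process %d\n",
         myid, value, previous);   /* ... and receives the final value */
} else {
  MPI_Recv(&value, 1, MPI_INT, previous, tag, MPI_COMM_WORLD, &status);
  printf("P%d: Received value %d from MPI process %d\n",
         myid, value, previous);
  value = value + 1;               /* Add one before passing it on */
  printf("P%d: Sending value %d to MPI process %d\n", myid, value, next);
  MPI_Send(&value, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
}
\end{verbatim}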


\subsection*{Timing}
A main reason for doing parallel computing is faster execution.
Therefore \emph{timing} is an integral part of this game.
MPI provides the function \texttt{MPI\_Wtime} which returns the wall-clock
time in seconds.
To time a code segment with \texttt{MPI\_Wtime} do something like this:
\begin{verbatim}
double t0;

t0 = MPI_Wtime();
// (Computations)
printf("Time = %.6f sec\n", MPI_Wtime() - t0);
\end{verbatim}

\subsubsection*{Exercise}
Time the previous `ring' program in such a way that only process 0
measures the time and writes it to the screen.
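
A sketch of the timing structure (only the placement of the calls matters;
the ring communication itself is unchanged):
\begin{verbatim}
double t0;

if (myid == 0) t0 = MPI_Wtime();   /* Start the clock on process 0 only */

/* ... the ring communication from the previous exercise ... */

if (myid == 0)
  printf("Time = %.6f sec\n", MPI_Wtime() - t0);
\end{verbatim}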


\subsection*{Measure network latency and bandwidth}

The time it takes to transmit a message can be approximated by the model
\[
t = t_l + \alpha t_b
\]
where
\begin{itemize}
\item $t_l$ is the \emph{latency}. The latency is the
time it takes to start the transmission before anything
gets communicated. This involves opening the network connection,
buffering the message and so on.
\item $t_b$ is the time it takes to communicate one byte once
the connection is open. This is related to the \emph{bandwidth} $B$,
defined as the number of bytes transmitted per second, as
\[
B = \frac{1}{t_b}
\]
\item $\alpha$ is the number of bytes communicated.
\end{itemize}

\subsubsection*{Exercise}
\textbf{Can you work out what the network latency is?}
\emph{Hint:} Send a very small message (e.g.\ one integer)
from one processor to another and back again a number of times
and measure the time.
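
One way to do this is a `ping-pong' loop between two processes; half of the
average round-trip time for a very small message approximates the latency.
A sketch (the number of repetitions and the tag are arbitrary choices;
\texttt{myid} is set up as before):
\begin{verbatim}
int i, ping = 0, repeats = 1000, tag = 17;
double t0;
MPI_Status status;

if (myid == 0) {
  t0 = MPI_Wtime();
  for (i = 0; i < repeats; i++) {
    MPI_Send(&ping, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
    MPI_Recv(&ping, 1, MPI_INT, 1, tag, MPI_COMM_WORLD, &status);
  }
  printf("Estimated latency = %.6f sec\n",
         (MPI_Wtime() - t0) / (2.0 * repeats));
} else if (myid == 1) {
  for (i = 0; i < repeats; i++) {
    MPI_Recv(&ping, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
    MPI_Send(&ping, 1, MPI_INT, 0, tag, MPI_COMM_WORLD);
  }
}
\end{verbatim}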

\subsubsection*{Exercise}
\textbf{Can you work out how many megabytes the network can transmit per
second?}
Create arrays of varying size, send them from one processor
to another and back, and measure the time.
\emph{Hints} (a sketch combining them follows this list):
\begin{itemize}
\item To create an array of double precision floating point numbers
do the following:
\begin{verbatim}
#define MAX_LEN  500000    /* Largest block */
double A[MAX_LEN];
\end{verbatim}
\item To populate the array with pseudo random numbers:
\begin{verbatim}
for (j=0; j<MAX_LEN; j++) A[j]=rand();
\end{verbatim}
\item To send only part of the array use the second argument in
MPI\_Send and MPI\_Recv to control how much information is transferred.
\item MPI\_Send and MPI\_Recv require a \emph{pointer} to the first
element of the array. Since arrays are already defined as pointers,
simply put \texttt{A} rather than \texttt{\&A}, which is what
we used earlier for the
simple integer variable.
\item One double precision number takes up 8 bytes - use that in your
estimate.
\end{itemize}
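
A sketch combining these hints (the block lengths, the tag and the decision
to declare \texttt{A} at file scope are assumptions):
\begin{verbatim}
#define MAX_LEN 500000             /* Largest block */
double A[MAX_LEN];                 /* File scope to avoid stack limits */

/* Inside main, after MPI_Init, MPI_Comm_rank etc.: */
int j, len, tag = 18;
double t0;
MPI_Status status;

for (j = 0; j < MAX_LEN; j++) A[j] = rand();

for (len = 500; len <= MAX_LEN; len *= 10) {
  if (myid == 0) {
    t0 = MPI_Wtime();
    MPI_Send(A, len, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    MPI_Recv(A, len, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD, &status);
    printf("%d bytes in %.6f sec\n", 2 * len * 8, MPI_Wtime() - t0);
  } else if (myid == 1) {
    MPI_Recv(A, len, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, &status);
    MPI_Send(A, len, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD);
  }
}
\end{verbatim}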

Your measurements may fluctuate based on the network traffic and the
load on the machines.


\section*{Good MPI resources}
\begin{itemize}
\item MPI Data types:\\
\url{http://www.ats.ucla.edu/at/hpc/parallel_computing/mpi-intro.htm}
\item Online book by Ian Foster:\\
\url{http://www-unix.mcs.anl.gov/dbpp/text/node94.html}
\end{itemize}




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Identifying the processors}


%%%%%%%%%%


It is almost impossible to demonstrate parallelism with an interactive
session at the command prompt as we did with the Python and Maple tutorials.
Therefore we must write a \emph{program} that can be executed in parallel.
Start an editor of your choice with a new file
(called \texttt{paraprime.py}, say). For example:
\begin{verbatim}
emacs paraprime.py &
\end{verbatim}

\noindent Write the following in the file and save it:
{\small \begin{verbatim}
import pypar

numproc = pypar.size()
myid = pypar.rank()

print "I am proc %d of %d" %(myid, numproc)
\end{verbatim}}

Then type the following at the Unix prompt to execute the
program normally on one processor:
{\small \begin{verbatim}
python paraprime.py
\end{verbatim}}
\noindent You should see the following output:
{\small \begin{verbatim}
> python paraprime.py
I am proc 0 of 1
\end{verbatim}}


\noindent Now let us try to run the same program on four processors:
{\small \begin{verbatim}
> prun -n 4 python paraprime.py
I am proc 0 of 4
I am proc 1 of 4
I am proc 2 of 4
I am proc 3 of 4
\end{verbatim}}
If you get a message from each of the four processors stating their
id as above, we know that everything works. We have a parallel program!

\section{Strategies for efficient parallel programming}
A final note on efficiency.

Parallel programming should lead to faster execution and/or
the ability to deal with larger problems.

The ultimate goal is to be $P$ times faster with $P$ processors.
However, the \emph{speedup} is usually less than $P$. One must address
three critical issues to achieve good speedup:
\begin{itemize}
\item \textbf{Interprocessor communication:} The amount and the frequency
of messages should be kept as low as possible. Ensure that processors
have plenty of work to do between communications.
\item \textbf{Data distribution and load balancing:} If some processors
finish much sooner than others, the total execution time of the parallel
program is bounded by that of the slowest processor and we say that the
program is poorly load balanced.
Ensure that each processor gets its fair share of the work load.
\item \textbf{Sequential parts of a program:} If half of a program, say, is
inherently sequential, the speedup can never exceed 2 no matter how well
the remaining half is parallelised. This is known as Amdahl's law
(see the formula below this list).
Ensure that all cost intensive parts get parallelised.
\end{itemize}
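
For reference, Amdahl's law can be stated as follows: if a fraction $s$ of
the total work is inherently sequential, the speedup on $P$ processors is
bounded by
\[
S(P) = \frac{1}{s + (1-s)/P} < \frac{1}{s},
\]
so with $s = 1/2$ the speedup can never exceed 2, no matter how many
processors are used.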




\section*{Exercise 8}

In Exercises 6 and 7 you sent messages around in rings and back and forth
between two processors, and also timed the network speed when communicating
between two processors.
The purpose of this exercise is to try a collective communication pattern
where one processor distributes and gathers data from all the others using the
basic MPI\_Send and MPI\_Recv.

\subsection*{Distribute and gather integers}

Create a program where process 0 distributes an integer to all other processes,
then receives an integer back from each of them.
The other processes should receive the integer from process 0, add their
rank to it and pass it back to process 0.

With some suitable diagnostic output your program should produce
something like this when run on four processors:

\begin{verbatim}
P0/4: Initialised OK on host ninja-1
P0: Sending value 42 to MPI process 1
P0: Sending value 42 to MPI process 2
P0: Sending value 42 to MPI process 3
P3/4: Initialised OK on host ninja-4
P1/4: Initialised OK on host ninja-2
P2/4: Initialised OK on host ninja-3
P3: Received value 42 from MPI process 0
P3: Sending value 45 to MPI process 0
P1: Received value 42 from MPI process 0
P1: Sending value 43 to MPI process 0
P2: Received value 42 from MPI process 0
P2: Sending value 44 to MPI process 0
P0: Received value 43 from MPI process 1
P0: Received value 44 from MPI process 2
P0: Received value 45 from MPI process 3
\end{verbatim}
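
A sketch of one way to structure this (the tag is arbitrary; \texttt{myid}
and \texttt{number\_of\_processes} are set up as before, and the diagnostic
\texttt{printf}s are abbreviated):
\begin{verbatim}
int A = 42, B, p, tag = 13;
MPI_Status status;

if (myid == 0) {
  for (p = 1; p < number_of_processes; p++) {       /* Distribute */
    printf("P0: Sending value %d to MPI process %d\n", A, p);
    MPI_Send(&A, 1, MPI_INT, p, tag, MPI_COMM_WORLD);
  }
  for (p = 1; p < number_of_processes; p++) {       /* Gather */
    MPI_Recv(&B, 1, MPI_INT, p, tag, MPI_COMM_WORLD, &status);
    printf("P0: Received value %d from MPI process %d\n", B, p);
  }
} else {
  MPI_Recv(&B, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
  printf("P%d: Received value %d from MPI process 0\n", myid, B);
  B += myid;                                        /* Add own rank */
  printf("P%d: Sending value %d to MPI process 0\n", myid, B);
  MPI_Send(&B, 1, MPI_INT, 0, tag, MPI_COMM_WORLD);
}
\end{verbatim}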


\subsection*{Distribute and gather arrays}

In this part we will do exactly the same as above except we will use
a double precision array as the data.

\begin{enumerate}
\item Create a double precision array (A) of length $N=1024$ to use as a buffer.
\item On process 0 populate it with the numbers $0$ to $N-1$
(using \texttt{A[i] = (double) i;}).
\item Pass A on to all other processors (the workers).
\item All workers should compute the $p$-norm of the array,
where $p$ is the rank of the worker:
\[
\left(
\sum_{i=0}^{N-1} A[i]^p
\right)^{1/p}
\]
\item Then return the result as one double precision number per worker and
print them all out on processor 0.
\end{enumerate}

\emph{Hints} (a sketch of the worker side follows this list):
\begin{itemize}
\item To compute $x^y$ use the C function \texttt{pow(x, y)}.
\item Include the math headers in your program: \verb+#include <math.h>+
\item You'll also need to link in the math library as follows:
\texttt{mpicc progname.c -lm}
%\item You might want to add many print statements reporting the communication pattern in
\end{itemize}
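
A sketch of the worker side of this exercise (the tag is arbitrary; process 0
distributes \texttt{A} and collects the results exactly as in the integer
version above):
\begin{verbatim}
/* At the top of the file: */
#include <math.h>
#define N 1024

/* Worker side (myid > 0): receive A, compute the p-norm, send it back */
double A[N], sum = 0.0, result;
int i, tag = 13;
MPI_Status status;

MPI_Recv(A, N, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, &status);
for (i = 0; i < N; i++)
  sum += pow(A[i], (double) myid);
result = pow(sum, 1.0 / (double) myid);
MPI_Send(&result, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD);
\end{verbatim}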

If your program works it should produce something like this on four processors:
\begin{verbatim}
P0/4: Initialised OK on host ninja-1
P3/4: Initialised OK on host ninja-4
P2/4: Initialised OK on host ninja-3
P1/4: Initialised OK on host ninja-2
P0: Received value 523776.00 from MPI process 1
P0: Received value 18904.76 from MPI process 2
P0: Received value 6497.76 from MPI process 3
\end{verbatim}




\end{document}