1\documentclass[12pt]{article}
2
3\usepackage{url}
4\usepackage{times}
5\usepackage[dvips]{graphicx}
6\usepackage{underscore}
7
8%\textheight=25cm
9\textwidth=14cm
10%\topmargin=-4cm
11%\oddsidemargin=0cm
12
13
14
15\begin{document}
16
17\title{Pypar Tutorial\\
18Building a Parallel Program using Python}
19
20
21\author{Ole Nielsen \\
22Australian National University, Canberra \\
23Ole.Nielsen@anu.edu.au}
24
25\maketitle
26\section*{Introduction}
27
28This is a tutorial demonstrating essential
29features of parallel programming and building
30a simple Python MPI program using the MPI binding \texttt{pypar} available at
31\url{http://datamining.anu.edu.au/software/pypar}.
32
33It is assumed that \texttt{pypar} has been installed on your system.
34If not, see the installation document (\texttt{pypar\_installation.pdf})
35that comes with pypar
36(or get it from the web site at
37\url{http://datamining.anu.edu.au/~ole/pypar/pypar_installation.pdf}).
38
39For a reference to all of Pypar's functionality, see the
40reference manual (\texttt{pypar\_reference.pdf})
41that comes with pypar (or get it from the web site at
42\url{http://datamining.anu.edu.au/~ole/pypar/pypar_reference.pdf}).
43
44
45\section{The Message Passing Interface --- MPI}
MPI is the de facto standard for parallel programming of both
distributed computer clusters and shared memory architectures.
48MPI programming takes the form of special commands that are
49imported from the MPI library. They provide information to the program
50about how many processes are available, which number each process has been
51allocated and functions to communicate with other processes.
52See \url{http://www.netlib.org/utk/papers/mpi-book/mpi-book.html} for
53a comprehensive reference to MPI (based on
54the C or Fortran programming languages).
55
56It is said that MPI is both large and small. What is meant is that the
57MPI standard has hundreds of functions in it. However, many of the
58advanced routines represent functionality that can be ignored unless
59one pursues added flexibility (data types), overlap of computation and
60communication (nonblocking send/receive), modularity (groups,
61communicators), or convenience (collective operations, topologies).
62MPI is said to be small because the majority of useful and efficient
63parallel programs are written using only a small essential subset of MPI.
64In fact, it can be argued that parallel performance is best achieved by
65keeping the parallelism simple.
66
67Pypar is a Python binding to such a subset.
68
69
70\section{A simple Pypar MPI program}
71Let us start with a simple example: a Python program
72that imports the pypar module and
reports the number of processors, the local id and the host name.
This is the Pypar equivalent of the infamous 'hello world'
program, if you like.
76Make a file called \texttt{ni.py} using your favourite
editor\footnote{The author believes that your favourite editor should
be GNU/Emacs, but it is up to you, of course.}
79and type in the following contents:
80{\footnotesize
81\begin{verbatim}
82import pypar
83
84p = pypar.rank()
85P = pypar.size()
86node = pypar.get_processor_name()
87
88print "I am process %d of %d on node %s" %(p, P, node)
89
90pypar.finalize()
91\end{verbatim}}
92\noindent and run it as follows:
93\begin{verbatim}
94mpirun -np 4 python ni.py
95\end{verbatim}
96You should get a greeting from each of the four nodes looking
97something like
98\begin{verbatim}
99Pypar (version 1.9) initialised MPI OK with 4 processors
100I am process 0 of 4 on node ninja-1
101I am process 3 of 4 on node ninja-4
102I am process 1 of 4 on node ninja-2
103I am process 2 of 4 on node ninja-3
104\end{verbatim}
The order of output may be arbitrary!  If four copies of the
program are run, print is called four times. The order in which each
process prints its message is nondeterministic: it depends on when each
process reaches that point in its execution of the program and on how
the output travels over the network.
\emph{Note that all the print statements, though they come from different
  processes, will send their output intact to your shell window; this
  is generally true of output commands.
  Input commands, like \texttt{raw_input},
  will only work on the process with rank zero.}
115
116
117\subsection{The MPI execution model}
118
119It is important to understand that in MPI (and therefore also in
120Pypar), multiple identical copies of this program
121will start simultaneously in separate processes.
122These processes may run on one machine or on multiple machines.
This is a fundamental difference from ordinary programs,
where, if someone says ``run the program'', it is
assumed that only one instance of the program is running.
126
127
In a nutshell, this program sets up a communication group of 4
processes, where each process gets its rank (\texttt{p}), prints it,
and exits. The rest of this section walks through, line by line, how a
C MPI program that does the same thing works; the concepts carry over
directly to Pypar, which wraps these MPI calls.
134
135
136
137
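For reference, a minimal version of the C program being described might look
like the sketch below (the greeting string and the variable name \texttt{myid}
are taken from the surrounding text; the rest is a standard MPI skeleton):
\begin{verbatim}
#include <stdio.h>   /* Standard input/output, e.g. printf */
#include <mpi.h>     /* MPI function prototypes */

int main(int argc, char *argv[])
{
  int myid;

  MPI_Init(&argc, &argv);                /* Set up the MPI environment */
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);  /* Rank of this process */

  printf("Sawatdii khrap thuk thuk khon (Process %d)\n", myid);

  MPI_Finalize();                        /* Shut down MPI */
  return 0;
}
\end{verbatim}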
The first line,
\texttt{\#include <stdio.h>},
should be familiar to all C programmers. It includes the standard
input/output routines such as \texttt{printf}. The second line,
\texttt{\#include <mpi.h>},
includes the MPI functions. The file \texttt{mpi.h}
contains prototypes for all the
MPI routines used in this program; this file is located in
\texttt{/usr/include/lam/mpi.h} in case you actually want to look at it.
The program starts with the \texttt{main} function, which takes the two arguments
\texttt{argc} and \texttt{argv} (as is normal in C). The argument \texttt{argc} is
the number of command line arguments given to the program (including the program name itself)
and \texttt{argv} is an array of strings containing them. For example,
\texttt{argv[0]} contains the program name. We will not be using them
directly, but MPI needs access to them through \texttt{MPI\_Init}.
153Then the program declares one integer variable, \texttt{myid}. The
154first step of the program,
155\begin{verbatim}
156      MPI_Init(&argc,&argv);
157\end{verbatim}
158calls MPI\_Init to set up everything.
159This should be the first command executed in all MPI programs. This
160routine takes pointers to \texttt{argc} and \texttt{argv},
161looks at them, pulls out the purely
162MPI-relevant things, and generally fixes them so you can use command line
163arguments as normal.
164
165\medskip
166
167Next, the program runs \texttt{MPI\_Comm\_rank},
168passing it the default communicator MPI\_COMM\_WORLD
169(which is the group of \emph{all} processes started)
and a \emph{pointer} to \texttt{myid}. Passing a pointer is the way
C can change the value of the argument \texttt{myid}.
\texttt{MPI\_Comm\_rank} will set \texttt{myid} to the rank of the calling
process within that communicator.
174Remember that in reality, several instances of this program
175start up in several different processes when this program is run. Each
176will receive a unique number from \texttt{MPI\_Comm\_rank}.
177Because multiple copies are running, each will execute all lines
178in the program including the \texttt{printf} statement which prints a
179message and the rank.
180After doing everything else, the program calls \texttt{MPI\_Finalize},
181which generally terminates everything and shuts down MPI. This should be the
182last command executed in all MPI programs.
183
184\subsection*{Compile the MPI program}
185
186Normally, C programs are compiled with a command such as
187\texttt{gcc hello.c -o hello.x}. However, MPI programs need access to
188more libraries which are provided through the wrapper program
189\texttt{mpicc}. This can be used as usual. Try to run
190\begin{verbatim}
191  mpicc hello.c -o hello.x
192\end{verbatim}
If your program is syntactically correct, a \emph{compiled}
program will appear with the name you specified after the \texttt{-o} option.
If not, you will see some error messages and you'll have to correct
those before continuing.
197
198\subsection*{Run the MPI program}
199In order to run an MPI compiled program, you must type:
200\begin{verbatim}
201mpirun -np <number of processes> [options] <program name and arguments>
202\end{verbatim}
203where you specify the number of processes on which you want to run
204your parallel program, optional mpirun options,
205and your program name and its expected arguments.
206In this case try:
207\begin{verbatim}
208    mpirun -np 1 hello.x
209\end{verbatim}
This will run one copy of your program and the output should look like this:
211\begin{verbatim}
212Sawatdii khrap thuk thuk khon (Process 0)
213\end{verbatim}
214
215Now try to start four processes by running \texttt{mpirun -np 4 hello.x}.
216You should see the following output:
217\begin{verbatim}
218Sawatdii khrap thuk thuk khon (Process 0)
219Sawatdii khrap thuk thuk khon (Process 3)
220Sawatdii khrap thuk thuk khon (Process 1)
221Sawatdii khrap thuk thuk khon (Process 2)
222\end{verbatim}
Note that the order of output may be arbitrary!  If four copies of the
program are run, \texttt{printf} is called four times.  The order in which each
process prints its message is nondeterministic: it depends on when each
process reaches that point in its execution of the program and on how
the output travels over the network. Your guess is as good as mine.
228\emph{Note that all the printf's, though they come from different
229  processes, will send their output intact to your shell window; this
230  is generally true of output commands. Input commands, like scanf,
231  will only work on the process with rank zero.}
232
233\subsubsection*{Exercise}
The MPI command \texttt{MPI\_Comm\_size(MPI\_COMM\_WORLD, \&number\_of\_processes);} will
store the total number of processes started in the integer variable
\texttt{number\_of\_processes}. Modify the program
to give the following output (a small sketch of the call follows the listing):
238\begin{verbatim}
239Sawatdii khrap thuk thuk khon (Process 0 of 4)
240Sawatdii khrap thuk thuk khon (Process 3 of 4)
241Sawatdii khrap thuk thuk khon (Process 1 of 4)
242Sawatdii khrap thuk thuk khon (Process 2 of 4)
243\end{verbatim}
244when started with four processes.
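The declaration and the call might, for example, look like this (a minimal
sketch; the variable names follow the exercise text above):
\begin{verbatim}
int myid, number_of_processes;

MPI_Comm_rank(MPI_COMM_WORLD, &myid);                 /* rank of this process */
MPI_Comm_size(MPI_COMM_WORLD, &number_of_processes);  /* total number of processes */
\end{verbatim}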
245
246
247\subsection*{Running on multiple computers}
248
249So far we have only \emph{simulated} the parallelism by running
250multiple processes on one machine.
251We will now add host specific information to the program and then run it
252in parallel.
253
254Add the following declarations to your program:
255\begin{verbatim}
256int  namelen;
257char processor_name[MPI_MAX_PROCESSOR_NAME];
258\end{verbatim}
259
260Add the command
261\begin{verbatim}
262MPI_Get_processor_name(processor_name, &namelen);
263\end{verbatim}
264This will store the hostname of the processor executing the given process
265along with the length of the hostname. Then modify the print statement to
266\begin{verbatim}
267  printf("Sawatdii khrap thuk thuk khon (Process %d of %d running on %s)\n",
268         myid, number_of_processes, processor_name);
269\end{verbatim}
Compiling and running the program, you should see the following output:
271\begin{verbatim}
272Sawatdii khrap thuk thuk khon (Process 0 of 4 running on ninja-n)
273Sawatdii khrap thuk thuk khon (Process 1 of 4 running on ninja-n)
274Sawatdii khrap thuk thuk khon (Process 2 of 4 running on ninja-n)
275Sawatdii khrap thuk thuk khon (Process 3 of 4 running on ninja-n)
276\end{verbatim}
We are still running all our MPI processes on one computer, but we are
now ready to run this program in parallel.
279
Edit the file \texttt{\textasciitilde/.lamhosts} to contain all machines in our network
and run \texttt{lamboot -v \textasciitilde/.lamhosts} again. You should see the following
282diagnostic message:
283\begin{verbatim}
284LAM 6.5.8/MPI 2 C++/ROMIO - Indiana University
285
286Executing hboot on n0 (ninja-1 - 1 CPU)...
287Executing hboot on n1 (ninja-2 - 1 CPU)...
288Executing hboot on n2 (ninja-3 - 1 CPU)...
289Executing hboot on n3 (ninja-4 - 1 CPU)...
290Executing hboot on n4 (ninja-5 - 1 CPU)...
291Executing hboot on n5 (ninja-6 - 1 CPU)...
292Executing hboot on n6 (ninja-7 - 1 CPU)...
293Executing hboot on n7 (ninja-8 - 1 CPU)...
294topology done
295\end{verbatim}
296\noindent Finally try to run your program again and verify that it runs
297on different nodes.
298What happens if you run more processes than there are physical machines?
299
300
301\section*{Optional: synchronise the clocks}
302
303\textbf{\emph{Note: This an optional exercise for those who can't get enough!}}
304
The clocks in our cluster are not synchronised: they show different times.
306Try for example to run
\begin{verbatim}
ssh -x ninja-2 date; ssh -x ninja-5 date; ssh -x ninja-3 date; date
\end{verbatim}
You will see that the times listed differ significantly.
311
Try searching for things like
\texttt{synchronize time network}
and see if you can figure out how to do it.
315
316
317\section*{MPI resources}
318
319\begin{itemize}
  \item \url{http://www.cs.appstate.edu/~can/classes/5530} (excellent introduction
  and many useful links)
  \item \url{http://www-unix.mcs.anl.gov/mpi/mpich} (The MPICH implementation)
  \item \url{http://www.lam-mpi.org} (The LAM-MPI implementation)
  \item \url{http://www.netlib.org/utk/papers/mpi-book/mpi-book.html}\\
    (The complete MPI reference)
326\end{itemize}
327
328
329
330\section*{Exercise 6}
331
332In this exercise we will continue the introduction to MPI
333by using the local id to differentiate work and
334by familiarising ourselves with the basic send and receive
335primitives.
336We will then proceed to measure the fundamental characteristics of
337the network: namely latency and bandwidth.
338
339
340
341\subsection*{Using local id to differentiate work}
342In Exercise 5 we had every processor write to the screen. This
343is convenient for debugging purposes
344and I suggest you start every parallel program
345with a message from each processor about its rank, the total number
346of processors and the hostname it is running on (see below).
347
However, it is often also desirable to have only one
processor write to the screen,
for example to report the final result of a calculation, as in the following sketch.
351
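A minimal sketch of the pattern (assuming \texttt{myid} holds the rank, as in
the earlier exercises):
\begin{verbatim}
/* Only the process with rank 0 reports */
if (myid == 0) {
  printf("P%d: Program terminating\n", myid);
}
\end{verbatim}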
352\subsubsection*{Exercise}
353Write a parallel program that
354produces something like the following output when run on four processors:
355\begin{verbatim}
356  P2/4: Initialised OK on host ninja-6
357  P0/4: Initialised OK on host ninja-4
358  P0: Program terminating
359  P1/4: Initialised OK on host ninja-5
360  P3/4: Initialised OK on host ninja-7
361\end{verbatim}
362Arrange the program such that
363all processors report on their initialisation but
364only one, processor 0, reports on its termination just prior
365to \texttt{MPI\_Finalize()}.
366Note, as in Exercise 5, that the order of output is
367arbitrary.
368%\emph{Hint: Use the value of myid to decide whether to print or not}.
369
370\subsection*{A simple send and receive pair}
371In this part we will use the fundamental MPI send and receive commands
372to communicate among the processors. The commands are:
373\begin{itemize}
374  \item \texttt{MPI\_Send}:
  This routine performs a basic send; it may block until
  the message
  is received, depending on the specific implementation of MPI.
378  \begin{verbatim}
379  int MPI_Send(void* buf, int count, MPI_Datatype datatype,
380                     int dest, int tag, MPI_Comm comm)
381  Input:
382      buf  - initial address of send buffer
383      count - number of elements in send buffer (nonnegative integer)
384      datatype - datatype of each send buffer element
385      dest - rank of destination (integer)
386      tag  - message tag (integer)
387      comm - communicator
388  \end{verbatim}
389  \textbf{Example}: To send one integer, A say, to processor 3 with tag 13:
390  \begin{verbatim}
391    MPI_Send(&A, 1, MPI_INT, 3, 13, MPI_COMM_WORLD);
392  \end{verbatim}
393  \item \texttt{MPI\_Recv}:
394    This routine performs a basic receive.
395  \begin{verbatim}
396  int MPI_Recv(void* buf, int count, MPI_Datatype datatype,
397                     int source, int tag, MPI_Comm comm,
398                     MPI_Status *status)
399
400  Input:
401      count - maximum number of elements in receive buffer
402             (integer)
403      datatype - datatype of each receive buffer element
404      source - rank of source (integer)
405      tag  - message tag (integer)
406      comm - communicator
407
408  Output:
409      buf  - initial address of receive buffer
410      status - status object, provides information about
411               message received;
412      status is a structure of type MPI_Status, the element
413      status.MPI_SOURCE is the source of the message received,
414      and the element status.MPI_TAG is the tag value.
415
416  \end{verbatim}
417  \textbf{Example}: To receive one integer, B say, from processor 7
418  with tag 13:
419  \begin{verbatim}
420    MPI_Recv(&B, 1, MPI_INT, 7, 13, MPI_COMM_WORLD, &status);
421  \end{verbatim}
422\end{itemize}
423
\noindent These calls are described in more detail in Ian Foster's online book,
in the chapter \textbf{MPI Basics}:
\url{http://www-unix.mcs.anl.gov/dbpp/text/node96.html}.
427The chapter about \textbf{Asynchronous communication},
428\url{http://www-unix.mcs.anl.gov/dbpp/text/node98.html}, describes how
429to query the Status Object.
430
431MPI defines the following constants for use with \texttt{MPI\_Recv}:
432\begin{itemize}
433  \item \texttt{MPI\_ANY\_SOURCE}: Can be used to specify that a message
434  can be received from anywhere instead of a specific source.
435  \item \texttt{MPI\_ANY\_TAG}: Can be used to specify that a message
436  can have any tag instead of a specific tag.
437\end{itemize}
438
439\subsubsection*{Exercise}
The following program segment implements a very simple
communication pattern: processor 0 sends a number to processor 1 and
both output a diagnostic message.
443
444Add to your previous program the declarations:
445\begin{verbatim}
446  int A, B, source, destination, tag=13;
447  MPI_Status status;
448\end{verbatim}
449and the following code segment
450\begin{verbatim}
  if (myid == 0) {
    A = 42;
    destination = 1;
    printf("P%d: Sending value %d to MPI process %d\n",
           myid, A, destination);
    MPI_Send(&A, 1, MPI_INT, destination, tag, MPI_COMM_WORLD);
  } else if (myid == 1) {
    source = 0;
    MPI_Recv(&B, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
    printf("P%d: Received value %d from MPI process %d\n", myid, B, source);
  }
462\end{verbatim}
Make sure it compiles and runs on 2 processors.
464Verify that your output looks something like
465\begin{verbatim}
466P0/2: Initialised OK on host ninja-1
467P0: Sending value 42 to MPI process 1
468P0: Terminating
469P1/2: Initialised OK on host ninja-2
470P1: Received value 42 from MPI process 0
471\end{verbatim}
472
473
474
475\subsection*{Send messages around in a ring - Exercise}
476
Write an MPI program which passes data
around in a ring structure: from process 0 to process 1, then to process 2 and so on.
When the message reaches the last processor it should be passed back to
processor 0, forming a 'communication ring'.
Every process should add one to the value before passing it on
and write out a diagnostic message.
483
484With a starting value of 42 and running on four processors your program
485should produce the following output:
486\begin{verbatim}
487P0/4: Initialised OK on host ninja-1
488P0: Sending value 42 to MPI process 1
489P1/4: Initialised OK on host ninja-2
490P3/4: Initialised OK on host ninja-4
491P1: Received value 42 from MPI process 0
492P1: Sending value 43 to MPI process 2
493P2/4: Initialised OK on host ninja-3
494P2: Received value 43 from MPI process 1
495P2: Sending value 44 to MPI process 3
496P3: Received value 44 from MPI process 2
497P3: Sending value 45 to MPI process 0
498P0: Received value 45 from MPI process 3
499\end{verbatim}
\emph{Hint}: You can use the modulus operator (\texttt{\%}) to calculate
the destination such that the next processor after the last becomes 0
(a sketch of one possible solution follows the hint):
\begin{verbatim}
  destination = (myid + 1) % number_of_processes;
\end{verbatim}
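One possible way to structure the communication is sketched below. This is
only a sketch: it assumes \texttt{myid}, \texttt{number\_of\_processes},
\texttt{tag} and \texttt{status} are declared as in the previous exercises, and
it only works when run on at least two processes.
\begin{verbatim}
int value;

if (myid == 0) {
  value = 42;
  printf("P%d: Sending value %d to MPI process %d\n", myid, value, 1);
  MPI_Send(&value, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
  MPI_Recv(&value, 1, MPI_INT, number_of_processes - 1, tag,
           MPI_COMM_WORLD, &status);
  printf("P%d: Received value %d from MPI process %d\n",
         myid, value, number_of_processes - 1);
} else {
  int destination = (myid + 1) % number_of_processes;

  MPI_Recv(&value, 1, MPI_INT, myid - 1, tag, MPI_COMM_WORLD, &status);
  printf("P%d: Received value %d from MPI process %d\n",
         myid, value, myid - 1);
  value = value + 1;
  printf("P%d: Sending value %d to MPI process %d\n",
         myid, value, destination);
  MPI_Send(&value, 1, MPI_INT, destination, tag, MPI_COMM_WORLD);
}
\end{verbatim}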
505
506
507\subsection*{Timing}
A main reason for doing parallel computing is faster execution.
Therefore \emph{timing} is an integral part of this game.
MPI provides a function \texttt{MPI\_Wtime} which returns the wall clock time in seconds.
511To time a code segment with \texttt{MPI\_Wtime} do something like this:
512\begin{verbatim}
513  double t0;
514
515  t0=MPI_Wtime();
516  // (Computations)
517  printf("Time = %.6f sec\n", MPI_Wtime() - t0);
518\end{verbatim}
519
520\subsubsection*{Exercise}
Time the previous 'ring' program in such a way that only process 0
measures the time and writes it to the screen.
523
524
525\subsection*{Measure network latency and bandwidth}
526
527The time it takes to transmit a message can be approximated by the model
528\[
529  t = t_l + \alpha t_b
530\]
531where
532\begin{itemize}
533  \item $t_l$ is the \emph{latency}. The latency is the
534  time it takes to start the transmission before anything
535  gets communicated. This involves opening the network connection,
536  buffering the message and so on.
537  \item $t_b$ is the time it takes to communicate one byte once
538  the connection is open. This is related to the \emph{bandwidth} $B$,
539  defined as the number of bytes transmitted per second, as
540  \[
541    B = \frac{1}{t_b}
542  \]
543  \item $\alpha$ is the number of bytes communicated.
544\end{itemize}
545
546\subsubsection*{Exercise}
547\textbf{Can you work out what the network latency is?}
\emph{Hint:} Send a very small message (e.g.\ one integer)
from one processor to another and back again a number of times
and measure the time.
551
552\subsubsection*{Exercise}
553\textbf{Can you work out how many megabytes the network can transmit per
554second?}
Create arrays of varying size, send them from one processor
to another and back, and measure the time (a code sketch follows the hints below).
\emph{Hints:}
558\begin{itemize}
559  \item To create an array of double precision floating point numbers
560    do the following.
561    \begin{verbatim}
562      #define MAX_LEN  500000    /* Largest block */
563      double A[MAX_LEN];
564    \end{verbatim}
565  \item To populate the array with pseudo random numbers:
566  \begin{verbatim}
567   for (j=0; j<MAX_LEN; j++) A[j]=rand();
568   \end{verbatim}
569   \item To send only part of the array use the second argument in
570   MPI\_Send and MPI\_Recv to control how much information is transferred.
571   \item MPI\_Send and MPI\_Recv require a \emph{pointer} to the first
   element of the array. Since an array name in C already acts as a pointer
   to its first element, simply pass \texttt{A} rather than \texttt{\&A}, which is what
   we used earlier for the
   simple integer variable.
576   \item One double precision number takes up 8 bytes - use that in your
577   estimate.
578\end{itemize}
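A common approach is a ping-pong test: process 0 sends a block of a given size
to process 1, which immediately sends it back; the round trip is repeated many
times so that the average time can be related to the model above. The fragment
below is only a sketch, assuming the usual \texttt{myid}, \texttt{tag} and
\texttt{status} variables from the previous exercises. With \texttt{len} set to
1 the time per message approximates the latency $t_l$; with large \texttt{len}
it is dominated by the bandwidth term.
\begin{verbatim}
#define MAX_LEN 500000     /* Largest block (number of doubles) */
double A[MAX_LEN];
double t0, elapsed;
int j, repeats = 100, len = MAX_LEN;

for (j = 0; j < MAX_LEN; j++) A[j] = rand();   /* Populate the buffer */

t0 = MPI_Wtime();
for (j = 0; j < repeats; j++) {
  if (myid == 0) {
    MPI_Send(A, len, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    MPI_Recv(A, len, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD, &status);
  } else if (myid == 1) {
    MPI_Recv(A, len, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, &status);
    MPI_Send(A, len, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD);
  }
}
elapsed = MPI_Wtime() - t0;

if (myid == 0) {
  /* Each repeat sends len*8 bytes out and the same amount back */
  printf("Bytes: %d, time per one-way message: %.6f sec\n",
         8 * len, elapsed / (2 * repeats));
}
\end{verbatim}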
579
580Your measurements may fluctuate based on the network traffic and the
581load on the machines.
582
583
584\section*{Good MPI resources}
585\begin{itemize}
586  \item MPI Data types:\\
587    \url{http://www.ats.ucla.edu/at/hpc/parallel_computing/mpi-intro.htm}
588  \item Online book by Ian Foster:\\
589    \url{http://www-unix.mcs.anl.gov/dbpp/text/node94.html}
590\end{itemize}
591
592
593
594
595
596%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
597\section{Identifying the processors}
598
599
600%%%%%%%%%%%
601
602
603It is almost impossible to demonstrate parallelism with an interactive
604session at the command prompt as we did with the Python and Maple tutorials.
605Therefore we must write a \emph{program} that can be executed in parallel.
606Start an editor of your choice with a new file
607(called \texttt{paraprime.py}, say). For example:
608\begin{verbatim}
609  emacs paraprime.py &
610\end{verbatim}
611
612\noindent Write the following in the file and save it
613{\small \begin{verbatim}
614import pypar
615
616numproc = pypar.size()
617myid =    pypar.rank()
618
619print "I am proc %d of %d" %(myid, numproc)
620\end{verbatim}}
621
622Then type the following at the Unix prompt to execute the
623program normally on one processor
624{\small \begin{verbatim}
625  python paraprime.py
626\end{verbatim}}
627\noindent You should see the following output
628{\small \begin{verbatim}
629> python paraprime.py
630I am proc 0 of 1
631\end{verbatim}}
632
633
634\noindent Now let us try and run the same program on four processors:
635{\small \begin{verbatim}
636> prun -n 4 python paraprime.py
637I am proc 0 of 4
638I am proc 1 of 4
639I am proc 2 of 4
640I am proc 3 of 4
641\end{verbatim}}
642If you get a message from each of the four processors stating their
643id as above, we know that everything works. We have a parallel program!
644
645\section{Strategies for efficient parallel programming}
646A final note on efficiency.
647
648Parallel programming should lead to faster execution and/or
649ability to deal with larger problems.
650
651The ultimate goal is to be \emph{P} times faster with \emph{P} processors.
652However \emph{speedup} is usually less than \emph{P}. One must address
653three critical issues to achieve good speedup:
654\begin{itemize}
655  \item \textbf{Interprocessor communication:} The amount and the frequency
656  of messages should be kept as low as possible. Ensure that processors
657  have plenty of work to do between communications.
658  \item \textbf{Data distribution and load balancing:} If some processors
659  finish much sooner than others, the total execution time of the parallel
660  program is bounded by that of the slowest processor and we say that the
661  program is poorly load balanced.
  Ensure that each processor gets its fair share of the work load.
  \item \textbf{Sequential parts of a program:} If half of a program, say, is
  inherently sequential, the speedup can never exceed 2 no matter how well
  the remaining half is parallelised. This is known as Amdahl's law
  (see the formula just after this list).
  Ensure that all cost intensive parts get parallelised.
667\end{itemize}
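The last point can be made quantitative. If a fraction $f$ of the execution
time is inherently sequential, then with $P$ processors the best possible
speedup is
\[
  S(P) = \frac{1}{f + (1-f)/P} < \frac{1}{f},
\]
so with $f = 1/2$ the speedup can never exceed 2, however many processors are used.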
668
669
670
671
672
673\section*{Exercise 8}
674
In Exercises 6 and 7 you sent messages around in rings and back and forth between
two processors, and you also timed the network speed when communicating between two
processors.
The purpose of this exercise is to try a collective communication pattern
where one processor sends data to, and gathers data from, all the others using the
basic MPI\_Send and MPI\_Recv.
681
682\subsection*{Distribute and gather integers}
683
684Create a program where process 0 distributes an integer to all other processes,
685then receives an integer back from each of them.
686The other processes should receive the integer from process 0, add their
687rank to it and pass it back to process 0.
688
With some suitable diagnostic output your program should produce
something like this when run on four processors (a sketch of one possible
communication pattern follows the listing):
691
692\begin{verbatim}
693P0/4: Initialised OK on host ninja-1
694P0: Sending value 42 to MPI process 1
695P0: Sending value 42 to MPI process 2
696P0: Sending value 42 to MPI process 3
697P3/4: Initialised OK on host ninja-4
698P1/4: Initialised OK on host ninja-2
699P2/4: Initialised OK on host ninja-3
700P3: Received value 42 from MPI process 0
701P3: Sending value 45 to MPI process 0
702P1: Received value 42 from MPI process 0
703P1: Sending value 43 to MPI process 0
704P2: Received value 42 from MPI process 0
705P2: Sending value 44 to MPI process 0
706P0: Received value 43 from MPI process 1
707P0: Received value 44 from MPI process 2
708P0: Received value 45 from MPI process 3
709\end{verbatim}
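The following fragment sketches one possible master/worker exchange. As before,
this is only a sketch and assumes \texttt{myid}, \texttt{number\_of\_processes},
\texttt{tag} and \texttt{status} are declared as in the earlier exercises.
\begin{verbatim}
int value, result, p;

if (myid == 0) {
  value = 42;
  for (p = 1; p < number_of_processes; p++) {
    printf("P%d: Sending value %d to MPI process %d\n", myid, value, p);
    MPI_Send(&value, 1, MPI_INT, p, tag, MPI_COMM_WORLD);
  }
  for (p = 1; p < number_of_processes; p++) {
    MPI_Recv(&result, 1, MPI_INT, p, tag, MPI_COMM_WORLD, &status);
    printf("P%d: Received value %d from MPI process %d\n", myid, result, p);
  }
} else {
  MPI_Recv(&value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
  printf("P%d: Received value %d from MPI process %d\n", myid, value, 0);
  value = value + myid;                 /* Add own rank before returning */
  printf("P%d: Sending value %d to MPI process %d\n", myid, value, 0);
  MPI_Send(&value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD);
}
\end{verbatim}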
710
711
712\subsection*{Distribute and gather arrays}
713
714In this part we will do exactly the same as above except we will use
715a double precision array as the data.
716
717\begin{enumerate}
718  \item Create a double precision array (A) of length $N=1024$ to use as buffer
719  \item On process 0 populate it with numbers $0$ to $N-1$
720  (using A[i] = (double) i;)
721  \item Pass A on to all other processors (the workers).
722  \item All workers should compute the $p$-norm of the array
723  where $p$ is the rank of the worker:
  \[
    \left(
      \sum_{i=0}^{N-1} A[i]^p
    \right)^{1/p}
  \]
729  \item Then return the result as one double precision number per worker and
730  print them all out on processor 0.
731\end{enumerate}
732
733\emph{Hints:}
734\begin{itemize}
  \item To compute $x^y$ use the C function \texttt{pow(x, y)}
  (a sketch of the norm computation follows this list);
  \item Include the math headers in your program: \verb+#include <math.h>+
  \item You'll also need to link in the math library as follows:
     \texttt{mpicc progname.c -lm}
739  %\item You might want to add many print statements reporting the communication pattern in
740\end{itemize}
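On a worker, the norm computation might look roughly like this (a sketch only;
\texttt{A}, $N$ and \texttt{myid} as defined above, and the exponent $p$ is the
worker's rank, which is at least 1 here):
\begin{verbatim}
/* Requires #include <math.h> and linking with -lm */
double sum = 0.0, norm;
int i, p = myid;

for (i = 0; i < N; i++)
  sum = sum + pow(A[i], (double) p);
norm = pow(sum, 1.0 / (double) p);
\end{verbatim}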
741
If your program works it should produce something like this on four processors:
743\begin{verbatim}
744P0/4: Initialised OK on host ninja-1
745P3/4: Initialised OK on host ninja-4
746P2/4: Initialised OK on host ninja-3
747P1/4: Initialised OK on host ninja-2
748P0: Received value 523776.00 from MPI process 1
749P0: Received value 18904.76 from MPI process 2
750P0: Received value 6497.76 from MPI process 3
751\end{verbatim}
752
753
754
755
756\end{document}