\documentclass[12pt]{article}

\usepackage{url}
\usepackage{times}
\usepackage[dvips]{graphicx}

%\textheight=25cm
\textwidth=14cm
%\topmargin=-4cm
%\oddsidemargin=0cm


\begin{document}

\title{Pypar Tutorial\\
Building a Parallel Program using Python}


\author{Ole Nielsen \\
Australian National University, Canberra \\
Ole.Nielsen@anu.edu.au}

\maketitle
\section*{Introduction}

This is a tutorial demonstrating essential
features of parallel programming and building
a simple Python MPI program using the MPI binding \texttt{pypar} available at
\url{http://datamining.anu.edu.au/software/pypar}.

It is assumed that \texttt{pypar} has been installed on your system.
If not, see the installation document (\texttt{pypar\_installation.pdf})
that comes with pypar
(or get it from the web site at
\url{http://datamining.anu.edu.au/~ole/pypar/pypar_installation.pdf}).

For a reference to all of Pypar's functionality, see the
reference manual (\texttt{pypar\_reference.pdf})
that comes with pypar (or get it from the web site at
\url{http://datamining.anu.edu.au/~ole/pypar/pypar_reference.pdf}).

\section{The Message Passing Interface --- MPI}
MPI is the de facto standard for parallel programming of both
distributed computer clusters and shared memory architectures.
MPI programming takes the form of special commands that are
imported from the MPI library. They provide information to the program
about how many processes are available, which number each process has been
allocated, and functions to communicate with other processes.
See \url{http://www.netlib.org/utk/papers/mpi-book/mpi-book.html} for
a comprehensive reference to MPI (based on
the C or Fortran programming languages).

It is said that MPI is both large and small. What is meant is that the
MPI standard has hundreds of functions in it. However, many of the
advanced routines represent functionality that can be ignored unless
one pursues added flexibility (data types), overlap of computation and
communication (nonblocking send/receive), modularity (groups,
communicators), or convenience (collective operations, topologies).
MPI is said to be small because the majority of useful and efficient
parallel programs are written using only a small essential subset of MPI.
In fact, it can be argued that parallel performance is best achieved by
keeping the parallelism simple.

Pypar is a Python binding to such a subset.

\section{A simple Pypar MPI program}
Let us start with a simple example: a Python program
that imports the pypar module and
reports the number of processors, the local id and the host name.
This is, if you like, the Pypar equivalent of the classic `hello world'
program.
Make a file called \texttt{ni.py} using your favourite
editor\footnote{The author believes that your favourite editor should
be GNU/Emacs, but it is up to you, of course.}
and type in the following contents:
{\footnotesize
\begin{verbatim}
import pypar

p = pypar.rank()
P = pypar.size()
node = pypar.get_processor_name()

print "I am process %d of %d on node %s" %(p, P, node)

pypar.finalize()
\end{verbatim}}
\noindent and run it as follows:
\begin{verbatim}
mpirun -np 4 python ni.py
\end{verbatim}
You should get a greeting from each of the four nodes looking
something like
\begin{verbatim}
Pypar (version 1.9) initialised MPI OK with 4 processors
I am process 0 of 4 on node ninja-1
I am process 3 of 4 on node ninja-4
I am process 1 of 4 on node ninja-2
I am process 2 of 4 on node ninja-3
\end{verbatim}
The order of output may be arbitrary! If four copies of the
program are run, print is called four times. The order in which each
process prints its message is not determined; it depends on when each
process reaches that point in its execution of the program and on how
the messages travel on the network.
\emph{Note that all the print statements, though they come from different
processes, will send their output intact to your shell window; this
is generally true of output commands.
Input commands, like \texttt{raw\_input},
will only work on the process with rank zero.}


\subsection{The MPI execution model}

It is important to understand that in MPI (and therefore also in
Pypar), multiple identical copies of this program
will start simultaneously in separate processes.
These processes may run on one machine or on multiple machines.
This is a fundamental difference from ordinary programs,
where, if someone says ``run the program'', it is
assumed that only one instance of the program is running.

In a nutshell, this program sets up a communication group of 4
processes, where each process gets its rank (\texttt{p}), prints it,
and exits.

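The following walkthrough refers to a C version of this `hello world'
program, which is not reproduced in the original text. A minimal sketch
consistent with the description below might look like this (the file name
\texttt{hello.c} is an assumption; the greeting string is taken from the
example output shown later in this tutorial):
\begin{verbatim}
/* hello.c -- minimal MPI example (sketch) */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int myid;

    MPI_Init(&argc, &argv);                /* Start up MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);  /* Rank of this process */

    printf("Sawatdii khrap thuk thuk khon (Process %d)\n", myid);

    MPI_Finalize();                        /* Shut down MPI */
    return 0;
}
\end{verbatim}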

The first line,
\texttt{\#include <stdio.h>},
should be familiar to all C programmers. It includes the standard
input/output routines like \texttt{printf}. The second line,
\texttt{\#include <mpi.h>},
includes the MPI functions. The file \texttt{mpi.h}
contains prototypes for all the
MPI routines in this program; this file is located in
\texttt{/usr/include/lam/mpi.h} in case you actually want to look at it.
The program starts with the \texttt{main} function, which takes the two
arguments \texttt{argc} and \texttt{argv} (as is normal in C): the argument
\texttt{argc} is
the number of command line arguments given to the program (including itself)
and \texttt{argv} is an array of strings containing them. For example,
\texttt{argv[0]} will contain the program name itself. We will not be using
them directly, but MPI needs access to them through \texttt{MPI\_Init}.
Then the program declares one integer variable, \texttt{myid}. The
first step of the program,
\begin{verbatim}
MPI_Init(&argc,&argv);
\end{verbatim}
calls \texttt{MPI\_Init} to set up everything.
This should be the first command executed in all MPI programs. This
routine takes pointers to \texttt{argc} and \texttt{argv},
looks at them, pulls out the purely
MPI-relevant things, and generally fixes them so you can use command line
arguments as normal.

\medskip

Next, the program runs \texttt{MPI\_Comm\_rank},
passing it the default communicator \texttt{MPI\_COMM\_WORLD}
(which is the group of \emph{all} processes started)
and a \emph{pointer} to \texttt{myid}. Passing a pointer is the way
C can change the value of the argument \texttt{myid}.
\texttt{MPI\_Comm\_rank} will set \texttt{myid} to the rank of the calling
process within this communicator.
Remember that in reality, several instances of this program
start up in several different processes when this program is run. Each
will receive a unique number from \texttt{MPI\_Comm\_rank}.
Because multiple copies are running, each will execute all lines
in the program including the \texttt{printf} statement which prints a
message and the rank.
After doing everything else, the program calls \texttt{MPI\_Finalize},
which generally terminates everything and shuts down MPI. This should be the
last command executed in all MPI programs.

\subsection*{Compile the MPI program}

Normally, C programs are compiled with a command such as
\texttt{gcc hello.c -o hello.x}. However, MPI programs need access to
additional libraries which are provided through the wrapper program
\texttt{mpicc}. It is used just like the ordinary compiler. Try to run
\begin{verbatim}
mpicc hello.c -o hello.x
\end{verbatim}
If your program is syntactically correct, a \emph{compiled}
program will appear with the name you specified after the \texttt{-o} option.
If not, you will see some error messages which you will have to correct
before continuing.

\subsection*{Run the MPI program}
In order to run an MPI compiled program, you must type:
\begin{verbatim}
mpirun -np <number of processes> [options] <program name and arguments>
\end{verbatim}
where you specify the number of processes on which you want to run
your parallel program, optional mpirun options,
and your program name and its expected arguments.
In this case try:
\begin{verbatim}
mpirun -np 1 hello.x
\end{verbatim}
This will run one copy of your program and the output should look like this:
\begin{verbatim}
Sawatdii khrap thuk thuk khon (Process 0)
\end{verbatim}

Now try to start four processes by running \texttt{mpirun -np 4 hello.x}.
You should see the following output:
\begin{verbatim}
Sawatdii khrap thuk thuk khon (Process 0)
Sawatdii khrap thuk thuk khon (Process 3)
Sawatdii khrap thuk thuk khon (Process 1)
Sawatdii khrap thuk thuk khon (Process 2)
\end{verbatim}
Note that the order of output may be arbitrary! If four copies of the
program are run, printf is called four times. The order in which each
process prints its message is not determined; it depends on when each
process reaches that point in its execution of the program and on how
the messages travel on the network. Your guess is as good as mine.
\emph{Note that all the printf's, though they come from different
processes, will send their output intact to your shell window; this
is generally true of output commands. Input commands, like \texttt{scanf},
will only work on the process with rank zero.}

\subsubsection*{Exercise}
The MPI command
\texttt{MPI\_Comm\_size(MPI\_COMM\_WORLD, \&number\_of\_processes);} will
store the total number of processes started in the integer variable
\texttt{number\_of\_processes}. Modify the program
to give the following output:
\begin{verbatim}
Sawatdii khrap thuk thuk khon (Process 0 of 4)
Sawatdii khrap thuk thuk khon (Process 3 of 4)
Sawatdii khrap thuk thuk khon (Process 1 of 4)
Sawatdii khrap thuk thuk khon (Process 2 of 4)
\end{verbatim}
when started with four processes.
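
The relevant calls could be combined as in the following sketch (the
variable names are assumptions; the added declaration and the modified
\texttt{printf} are the only changes needed):
\begin{verbatim}
int myid, number_of_processes;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);                 /* rank of this process */
MPI_Comm_size(MPI_COMM_WORLD, &number_of_processes);  /* total process count  */

printf("Sawatdii khrap thuk thuk khon (Process %d of %d)\n",
       myid, number_of_processes);
\end{verbatim}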


\subsection*{Running on multiple computers}

So far we have only \emph{simulated} the parallelism by running
multiple processes on one machine.
We will now add host specific information to the program and then run it
in parallel.

Add the following declarations to your program:
\begin{verbatim}
int namelen;
char processor_name[MPI_MAX_PROCESSOR_NAME];
\end{verbatim}

Add the command
\begin{verbatim}
MPI_Get_processor_name(processor_name, &namelen);
\end{verbatim}
This will store the hostname of the processor executing the given process
along with the length of the hostname. Then modify the print statement to
\begin{verbatim}
printf("Sawatdii khrap thuk thuk khon (Process %d of %d running on %s)\n",
       myid, number_of_processes, processor_name);
\end{verbatim}
Compiling and running the program you should see the following output:
\begin{verbatim}
Sawatdii khrap thuk thuk khon (Process 0 of 4 running on ninja-n)
Sawatdii khrap thuk thuk khon (Process 1 of 4 running on ninja-n)
Sawatdii khrap thuk thuk khon (Process 2 of 4 running on ninja-n)
Sawatdii khrap thuk thuk khon (Process 3 of 4 running on ninja-n)
\end{verbatim}
We are still running all our MPI processes on one computer but we are
now ready to run this program in parallel.

Edit the file \texttt{\~{}/.lamhosts} to contain all machines in our network
and run \texttt{lamboot -v \~{}/.lamhosts} again. You should see the following
diagnostic message:
\begin{verbatim}
LAM 6.5.8/MPI 2 C++/ROMIO - Indiana University

Executing hboot on n0 (ninja-1 - 1 CPU)...
Executing hboot on n1 (ninja-2 - 1 CPU)...
Executing hboot on n2 (ninja-3 - 1 CPU)...
Executing hboot on n3 (ninja-4 - 1 CPU)...
Executing hboot on n4 (ninja-5 - 1 CPU)...
Executing hboot on n5 (ninja-6 - 1 CPU)...
Executing hboot on n6 (ninja-7 - 1 CPU)...
Executing hboot on n7 (ninja-8 - 1 CPU)...
topology done
\end{verbatim}
\noindent Finally try to run your program again and verify that it runs
on different nodes.
What happens if you run more processes than there are physical machines?


\section*{Optional: synchronise the clocks}

\textbf{\emph{Note: This is an optional exercise for those who can't get enough!}}

The clocks in our cluster are not synchronised: they show different times.
Try for example to run
\begin{verbatim}
ssh -x ninja-2 date; ssh -x ninja-5 date; ssh -x ninja-3 date; date
\end{verbatim}
You will see that the times listed differ significantly.

Try searching the web for things like
\texttt{synchronize time network}
and see if you can figure out how to do it.

\section*{MPI resources}

\begin{itemize}
\item \url{http://www.cs.appstate.edu/~can/classes/5530} (excellent introduction
and many useful links)
\item \url{http://www-unix.mcs.anl.gov/mpi/mpich} (the MPICH implementation)
\item \url{http://www.lam-mpi.org} (the LAM-MPI implementation)
\item \url{http://www.netlib.org/utk/papers/mpi-book/mpi-book.html}\\
(the complete MPI reference)
\end{itemize}


\section*{Exercise 6}

In this exercise we will continue the introduction to MPI
by using the local id to differentiate work and
by familiarising ourselves with the basic send and receive
primitives.
We will then proceed to measure the fundamental characteristics of
the network: namely latency and bandwidth.


\subsection*{Using local id to differentiate work}
In Exercise 5 we had every processor write to the screen. This
is convenient for debugging purposes
and I suggest you start every parallel program
with a message from each processor about its rank, the total number
of processors and the hostname it is running on (see below).

However, it is often also desirable to have only one
processor write to the screen,
for example to report the final result of a calculation.

\subsubsection*{Exercise}
Write a parallel program that
produces something like the following output when run on four processors:
\begin{verbatim}
P2/4: Initialised OK on host ninja-6
P0/4: Initialised OK on host ninja-4
P0: Program terminating
P1/4: Initialised OK on host ninja-5
P3/4: Initialised OK on host ninja-7
\end{verbatim}
Arrange the program such that
all processors report on their initialisation but
only one, processor 0, reports on its termination just prior
to \texttt{MPI\_Finalize()}.
Note, as in Exercise 5, that the order of output is
arbitrary.
%\emph{Hint: Use the value of myid to decide whether to print or not}.
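
A minimal sketch of this pattern is given below (the exact message strings
are assumptions; \texttt{myid}, \texttt{number\_of\_processes} and
\texttt{processor\_name} are obtained as in the previous exercises):
\begin{verbatim}
/* Every process reports its initialisation ... */
printf("P%d/%d: Initialised OK on host %s\n",
       myid, number_of_processes, processor_name);

/* ... but only process 0 reports the termination */
if (myid == 0) {
    printf("P%d: Program terminating\n", myid);
}
MPI_Finalize();
\end{verbatim}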

\subsection*{A simple send and receive pair}
In this part we will use the fundamental MPI send and receive commands
to communicate among the processors. The commands are:
\begin{itemize}
\item \texttt{MPI\_Send}:
This routine performs a basic send; this routine may block until
the message
is received, depending on the specific implementation of MPI.
\begin{verbatim}
int MPI_Send(void* buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)

Input:
   buf      - initial address of send buffer
   count    - number of elements in send buffer (nonnegative integer)
   datatype - datatype of each send buffer element
   dest     - rank of destination (integer)
   tag      - message tag (integer)
   comm     - communicator
\end{verbatim}
\textbf{Example}: To send one integer, A say, to processor 3 with tag 13:
\begin{verbatim}
MPI_Send(&A, 1, MPI_INT, 3, 13, MPI_COMM_WORLD);
\end{verbatim}
\item \texttt{MPI\_Recv}:
This routine performs a basic receive.
\begin{verbatim}
int MPI_Recv(void* buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm,
             MPI_Status *status)

Input:
   count    - maximum number of elements in receive buffer (integer)
   datatype - datatype of each receive buffer element
   source   - rank of source (integer)
   tag      - message tag (integer)
   comm     - communicator

Output:
   buf      - initial address of receive buffer
   status   - status object, provides information about the
              message received;
              status is a structure of type MPI_Status: the element
              status.MPI_SOURCE is the source of the message received,
              and the element status.MPI_TAG is the tag value.
\end{verbatim}
\textbf{Example}: To receive one integer, B say, from processor 7
with tag 13:
\begin{verbatim}
MPI_Recv(&B, 1, MPI_INT, 7, 13, MPI_COMM_WORLD, &status);
\end{verbatim}
\end{itemize}

\noindent These calls are described in more detail in Ian Foster's online book
in \textbf{MPI Basics}:
\url{http://www-unix.mcs.anl.gov/dbpp/text/node96.html}.
The chapter about \textbf{Asynchronous communication},
\url{http://www-unix.mcs.anl.gov/dbpp/text/node98.html}, describes how
to query the status object.

MPI defines the following constants for use with \texttt{MPI\_Recv}:
\begin{itemize}
\item \texttt{MPI\_ANY\_SOURCE}: Can be used to specify that a message
can be received from anywhere instead of a specific source.
\item \texttt{MPI\_ANY\_TAG}: Can be used to specify that a message
can have any tag instead of a specific tag.
\end{itemize}
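
For example, a receive that accepts a message from any process with any tag,
and then inspects the status object to find out where it came from, might
look like this (a sketch; \texttt{B} and \texttt{status} are declared as in
the exercise below):
\begin{verbatim}
/* Accept one integer from any source with any tag */
MPI_Recv(&B, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);
printf("Received %d from process %d (tag %d)\n",
       B, status.MPI_SOURCE, status.MPI_TAG);
\end{verbatim}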

\subsubsection*{Exercise}
The following program segment implements a very simple
communication pattern: Processor 0 sends a number to processor 1 and
both output a diagnostic message.

Add to your previous program the declarations:
\begin{verbatim}
int A, B, source, destination, tag=13;
MPI_Status status;
\end{verbatim}
and the following code segment
\begin{verbatim}
if (myid == 0) {
  A = 42;
  destination = 1;
  printf("P%d: Sending value %d to MPI process %d\n",
         myid, A, destination);
  MPI_Send(&A, 1, MPI_INT, destination, tag, MPI_COMM_WORLD);
} else if (myid == 1) {
  source = 0;
  MPI_Recv(&B, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
  printf("P%d: Received value %d from MPI process %d\n", myid, B, source);
}
\end{verbatim}
Make sure it compiles and runs on two processors.
Verify that your output looks something like
\begin{verbatim}
P0/2: Initialised OK on host ninja-1
P0: Sending value 42 to MPI process 1
P0: Terminating
P1/2: Initialised OK on host ninja-2
P1: Received value 42 from MPI process 0
\end{verbatim}


\subsection*{Send messages around in a ring - Exercise}

Write an MPI program which passes data
around in a ring structure: from process 0 to process 1, then to
process 2, and so on.
When the message reaches the last processor it should be passed back to
processor 0 - this forms a `communication ring'.
Every process should add one to the value before passing it on
and write out a diagnostic message.

With a starting value of 42 and running on four processors your program
should produce the following output:
\begin{verbatim}
P0/4: Initialised OK on host ninja-1
P0: Sending value 42 to MPI process 1
P1/4: Initialised OK on host ninja-2
P3/4: Initialised OK on host ninja-4
P1: Received value 42 from MPI process 0
P1: Sending value 43 to MPI process 2
P2/4: Initialised OK on host ninja-3
P2: Received value 43 from MPI process 1
P2: Sending value 44 to MPI process 3
P3: Received value 44 from MPI process 2
P3: Sending value 45 to MPI process 0
P0: Received value 45 from MPI process 3
\end{verbatim}
\emph{Hint}: You can use the modulus operator (\texttt{\%}) to calculate
the destination such that the next processor after the last becomes 0:
\begin{verbatim}
destination = (myid + 1) % number_of_processes;
\end{verbatim}
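
One possible arrangement of the communication is sketched below (a code
segment only; \texttt{myid} and \texttt{number\_of\_processes} are assumed to
be set up as in the previous exercises, and the tag value is an arbitrary
choice):
\begin{verbatim}
int value, tag = 13;
int next     = (myid + 1) % number_of_processes;
int previous = (myid - 1 + number_of_processes) % number_of_processes;
MPI_Status status;

if (myid == 0) {
  value = 42;                      /* Process 0 starts the ring ... */
  printf("P%d: Sending value %d to MPI process %d\n", myid, value, next);
  MPI_Send(&value, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
  MPI_Recv(&value, 1, MPI_INT, previous, tag, MPI_COMM_WORLD, &status);
  printf("P%d: Received value %d from MPI process %d\n",
         myid, value, previous);   /* ... and receives the final value */
} else {
  MPI_Recv(&value, 1, MPI_INT, previous, tag, MPI_COMM_WORLD, &status);
  printf("P%d: Received value %d from MPI process %d\n",
         myid, value, previous);
  value = value + 1;               /* Add one before passing it on */
  printf("P%d: Sending value %d to MPI process %d\n", myid, value, next);
  MPI_Send(&value, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
}
\end{verbatim}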


\subsection*{Timing}
A main reason for doing parallel computing is faster execution.
Therefore \emph{timing} is an integral part of this game.
MPI provides the function \texttt{MPI\_Wtime} which returns the wall-clock
time in seconds.
To time a code segment with \texttt{MPI\_Wtime} do something like this:
\begin{verbatim}
double t0;

t0 = MPI_Wtime();
// (Computations)
printf("Time = %.6f sec\n", MPI_Wtime() - t0);
\end{verbatim}

\subsubsection*{Exercise}
Time the previous `ring' program in such a way that only process 0
measures the time and writes it to the screen.
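
A sketch of the timing structure (only the placement of the calls matters;
the ring communication itself is unchanged):
\begin{verbatim}
double t0;

if (myid == 0) t0 = MPI_Wtime();   /* Start the clock on process 0 only */

/* ... the ring communication from the previous exercise ... */

if (myid == 0)
  printf("Time = %.6f sec\n", MPI_Wtime() - t0);
\end{verbatim}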


\subsection*{Measure network latency and bandwidth}

The time it takes to transmit a message can be approximated by the model
\[
t = t_l + \alpha t_b
\]
where
\begin{itemize}
\item $t_l$ is the \emph{latency}. The latency is the
time it takes to start the transmission before anything
gets communicated. This involves opening the network connection,
buffering the message and so on.
\item $t_b$ is the time it takes to communicate one byte once
the connection is open. This is related to the \emph{bandwidth} $B$,
defined as the number of bytes transmitted per second, as
\[
B = \frac{1}{t_b}
\]
\item $\alpha$ is the number of bytes communicated.
\end{itemize}

\subsubsection*{Exercise}
\textbf{Can you work out what the network latency is?}
\emph{Hint:} Send a very small message (e.g.\ one integer)
from one processor to another and back again a number of times
and measure the time.
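
One way to do this is a `ping-pong' loop between two processes; half of the
average round-trip time for a very small message approximates the latency.
A sketch (the number of repetitions and the tag are arbitrary choices;
\texttt{myid} is set up as before):
\begin{verbatim}
int i, ping = 0, repeats = 1000, tag = 17;
double t0;
MPI_Status status;

if (myid == 0) {
  t0 = MPI_Wtime();
  for (i = 0; i < repeats; i++) {
    MPI_Send(&ping, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
    MPI_Recv(&ping, 1, MPI_INT, 1, tag, MPI_COMM_WORLD, &status);
  }
  printf("Estimated latency = %.6f sec\n",
         (MPI_Wtime() - t0) / (2.0 * repeats));
} else if (myid == 1) {
  for (i = 0; i < repeats; i++) {
    MPI_Recv(&ping, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
    MPI_Send(&ping, 1, MPI_INT, 0, tag, MPI_COMM_WORLD);
  }
}
\end{verbatim}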

\subsubsection*{Exercise}
\textbf{Can you work out how many megabytes the network can transmit per
second?}
Create arrays of varying size, send them from one processor
to another and back, and measure the time.
\emph{Hints} (a sketch combining them follows this list):
\begin{itemize}
\item To create an array of double precision floating point numbers
do the following:
\begin{verbatim}
#define MAX_LEN  500000    /* Largest block */
double A[MAX_LEN];
\end{verbatim}
\item To populate the array with pseudo random numbers:
\begin{verbatim}
for (j=0; j<MAX_LEN; j++) A[j]=rand();
\end{verbatim}
\item To send only part of the array use the second argument in
MPI\_Send and MPI\_Recv to control how much information is transferred.
\item MPI\_Send and MPI\_Recv require a \emph{pointer} to the first
element of the array. Since arrays are already defined as pointers,
simply put \texttt{A} rather than \texttt{\&A}, which is what
we used earlier for the
simple integer variable.
\item One double precision number takes up 8 bytes - use that in your
estimate.
\end{itemize}
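
A sketch combining these hints (the block lengths, the tag and the decision
to declare \texttt{A} at file scope are assumptions):
\begin{verbatim}
#define MAX_LEN 500000             /* Largest block */
double A[MAX_LEN];                 /* File scope to avoid stack limits */

/* Inside main, after MPI_Init, MPI_Comm_rank etc.: */
int j, len, tag = 18;
double t0;
MPI_Status status;

for (j = 0; j < MAX_LEN; j++) A[j] = rand();

for (len = 500; len <= MAX_LEN; len *= 10) {
  if (myid == 0) {
    t0 = MPI_Wtime();
    MPI_Send(A, len, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    MPI_Recv(A, len, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD, &status);
    printf("%d bytes in %.6f sec\n", 2 * len * 8, MPI_Wtime() - t0);
  } else if (myid == 1) {
    MPI_Recv(A, len, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, &status);
    MPI_Send(A, len, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD);
  }
}
\end{verbatim}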

Your measurements may fluctuate based on the network traffic and the
load on the machines.


\section*{Good MPI resources}
\begin{itemize}
\item MPI Data types:\\
\url{http://www.ats.ucla.edu/at/hpc/parallel_computing/mpi-intro.htm}
\item Online book by Ian Foster:\\
\url{http://www-unix.mcs.anl.gov/dbpp/text/node94.html}
\end{itemize}




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Identifying the processors}


%%%%%%%%%%


It is almost impossible to demonstrate parallelism with an interactive
session at the command prompt as we did with the Python and Maple tutorials.
Therefore we must write a \emph{program} that can be executed in parallel.
Start an editor of your choice with a new file
(called \texttt{paraprime.py}, say). For example:
\begin{verbatim}
emacs paraprime.py &
\end{verbatim}

\noindent Write the following in the file and save it:
{\small \begin{verbatim}
import pypar

numproc = pypar.size()
myid = pypar.rank()

print "I am proc %d of %d" %(myid, numproc)
\end{verbatim}}

Then type the following at the Unix prompt to execute the
program normally on one processor:
{\small \begin{verbatim}
python paraprime.py
\end{verbatim}}
\noindent You should see the following output:
{\small \begin{verbatim}
> python paraprime.py
I am proc 0 of 1
\end{verbatim}}


\noindent Now let us try to run the same program on four processors:
{\small \begin{verbatim}
> prun -n 4 python paraprime.py
I am proc 0 of 4
I am proc 1 of 4
I am proc 2 of 4
I am proc 3 of 4
\end{verbatim}}
If you get a message from each of the four processors stating their
id as above, we know that everything works. We have a parallel program!

\section{Strategies for efficient parallel programming}
A final note on efficiency.

Parallel programming should lead to faster execution and/or
the ability to deal with larger problems.

The ultimate goal is to be $P$ times faster with $P$ processors.
However, the \emph{speedup} is usually less than $P$. One must address
three critical issues to achieve good speedup:
\begin{itemize}
\item \textbf{Interprocessor communication:} The amount and the frequency
of messages should be kept as low as possible. Ensure that processors
have plenty of work to do between communications.
\item \textbf{Data distribution and load balancing:} If some processors
finish much sooner than others, the total execution time of the parallel
program is bounded by that of the slowest processor and we say that the
program is poorly load balanced.
Ensure that each processor gets its fair share of the work load.
\item \textbf{Sequential parts of a program:} If half of a program, say, is
inherently sequential, the speedup can never exceed 2 no matter how well
the remaining half is parallelised. This is known as Amdahl's law
(see the formula below this list).
Ensure that all cost intensive parts get parallelised.
\end{itemize}
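
For reference, Amdahl's law can be stated as follows: if a fraction $s$ of
the total work is inherently sequential, the speedup on $P$ processors is
bounded by
\[
S(P) = \frac{1}{s + (1-s)/P} < \frac{1}{s},
\]
so with $s = 1/2$ the speedup can never exceed 2, no matter how many
processors are used.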




\section*{Exercise 8}

In Exercises 6 and 7 you sent messages around in rings and back and forth
between two processors, and also timed the network speed when communicating
between two processors.
The purpose of this exercise is to try a collective communication pattern
where one processor distributes and gathers data from all the others using the
basic MPI\_Send and MPI\_Recv.

\subsection*{Distribute and gather integers}

Create a program where process 0 distributes an integer to all other processes,
then receives an integer back from each of them.
The other processes should receive the integer from process 0, add their
rank to it and pass it back to process 0.

With some suitable diagnostic output your program should produce
something like this when run on four processors:

\begin{verbatim}
P0/4: Initialised OK on host ninja-1
P0: Sending value 42 to MPI process 1
P0: Sending value 42 to MPI process 2
P0: Sending value 42 to MPI process 3
P3/4: Initialised OK on host ninja-4
P1/4: Initialised OK on host ninja-2
P2/4: Initialised OK on host ninja-3
P3: Received value 42 from MPI process 0
P3: Sending value 45 to MPI process 0
P1: Received value 42 from MPI process 0
P1: Sending value 43 to MPI process 0
P2: Received value 42 from MPI process 0
P2: Sending value 44 to MPI process 0
P0: Received value 43 from MPI process 1
P0: Received value 44 from MPI process 2
P0: Received value 45 from MPI process 3
\end{verbatim}
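
A sketch of one way to structure this (the tag is arbitrary; \texttt{myid}
and \texttt{number\_of\_processes} are set up as before, and the diagnostic
\texttt{printf}s are abbreviated):
\begin{verbatim}
int A = 42, B, p, tag = 13;
MPI_Status status;

if (myid == 0) {
  for (p = 1; p < number_of_processes; p++) {       /* Distribute */
    printf("P0: Sending value %d to MPI process %d\n", A, p);
    MPI_Send(&A, 1, MPI_INT, p, tag, MPI_COMM_WORLD);
  }
  for (p = 1; p < number_of_processes; p++) {       /* Gather */
    MPI_Recv(&B, 1, MPI_INT, p, tag, MPI_COMM_WORLD, &status);
    printf("P0: Received value %d from MPI process %d\n", B, p);
  }
} else {
  MPI_Recv(&B, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
  printf("P%d: Received value %d from MPI process 0\n", myid, B);
  B += myid;                                        /* Add own rank */
  printf("P%d: Sending value %d to MPI process 0\n", myid, B);
  MPI_Send(&B, 1, MPI_INT, 0, tag, MPI_COMM_WORLD);
}
\end{verbatim}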


\subsection*{Distribute and gather arrays}

In this part we will do exactly the same as above except we will use
a double precision array as the data.

\begin{enumerate}
\item Create a double precision array (A) of length $N=1024$ to use as a buffer.
\item On process 0 populate it with the numbers $0$ to $N-1$
(using \texttt{A[i] = (double) i;}).
\item Pass A on to all other processors (the workers).
\item All workers should compute the $p$-norm of the array,
where $p$ is the rank of the worker:
\[
\left(
\sum_{i=0}^{N-1} A[i]^p
\right)^{1/p}
\]
\item Then return the result as one double precision number per worker and
print them all out on processor 0.
\end{enumerate}

\emph{Hints} (a sketch of the worker side follows this list):
\begin{itemize}
\item To compute $x^y$ use the C function \texttt{pow(x, y)}.
\item Include the math headers in your program: \verb+#include <math.h>+
\item You'll also need to link in the math library as follows:
\texttt{mpicc progname.c -lm}
%\item You might want to add many print statements reporting the communication pattern in
\end{itemize}
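
A sketch of the worker side of this exercise (the tag is arbitrary; process 0
distributes \texttt{A} and collects the results exactly as in the integer
version above):
\begin{verbatim}
/* At the top of the file: */
#include <math.h>
#define N 1024

/* Worker side (myid > 0): receive A, compute the p-norm, send it back */
double A[N], sum = 0.0, result;
int i, tag = 13;
MPI_Status status;

MPI_Recv(A, N, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, &status);
for (i = 0; i < N; i++)
  sum += pow(A[i], (double) myid);
result = pow(sum, 1.0 / (double) myid);
MPI_Send(&result, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD);
\end{verbatim}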

If your program works it should produce something like this on four processors:
\begin{verbatim}
P0/4: Initialised OK on host ninja-1
P3/4: Initialised OK on host ninja-4
P2/4: Initialised OK on host ninja-3
P1/4: Initialised OK on host ninja-2
P0: Received value 523776.00 from MPI process 1
P0: Received value 18904.76 from MPI process 2
P0: Received value 6497.76 from MPI process 3
\end{verbatim}




\end{document}