\documentclass[12pt]{article}

\usepackage{url}
\usepackage{times}
\usepackage[dvips]{graphicx}

%\textheight=25cm
\textwidth=14cm
%\topmargin=-4cm
%\oddsidemargin=0cm




\begin{document}

\title{Pypar Tutorial\\
Building a Parallel Program using Python}


\author{Ole Nielsen \\
Australian National University, Canberra \\
Ole.Nielsen@anu.edu.au}

\maketitle
\section*{Introduction}

This tutorial demonstrates the essential features of parallel programming
and shows how to build a simple Python MPI program using the MPI binding
\texttt{pypar}, available at
\url{http://datamining.anu.edu.au/software/pypar}.

It is assumed that \texttt{pypar} has been installed on your system.
If not, see the installation document (\texttt{pypar\_installation.pdf})
that comes with pypar
(or get it from the web site at
\url{http://datamining.anu.edu.au/~ole/pypar/pypar_installation.pdf}).

For a reference to all of Pypar's functionality, see the
reference manual (\texttt{pypar\_reference.pdf})
that comes with pypar (or get it from the web site at
\url{http://datamining.anu.edu.au/~ole/pypar/pypar_reference.pdf}).

\section{The Message Passing Interface --- MPI}
MPI is the de facto standard for parallel programming of both
distributed computer clusters and shared memory architectures.
MPI programming takes the form of special commands that are
imported from the MPI library. They tell the program
how many processes are available and which number has been
allocated to each process, and they provide functions for
communicating with other processes.
See \url{http://www.netlib.org/utk/papers/mpi-book/mpi-book.html} for
a comprehensive reference to MPI (based on
the C or Fortran programming languages).

It is said that MPI is both large and small. What is meant is that the
MPI standard has hundreds of functions in it. However, many of the
advanced routines represent functionality that can be ignored unless
one pursues added flexibility (data types), overlap of computation and
communication (nonblocking send/receive), modularity (groups,
communicators), or convenience (collective operations, topologies).
MPI is said to be small because the majority of useful and efficient
parallel programs are written using only a small essential subset of MPI.
In fact, it can be argued that parallel performance is best achieved by
keeping the parallelism simple.

Pypar is a Python binding to such a subset.


\section{A simple Pypar MPI program}
Let us start with a simple example: a Python program
that imports the pypar module and
reports the number of processors, the local id and the host name.
This is the Pypar equivalent of the infamous 'hello world'
program, if you like.
Make a file called \texttt{ni.py} using your favourite
editor\footnote{The author believes that your favourite editor should
be GNU/Emacs, but it is up to you, of course.}
and type in the following contents:
{\footnotesize
\begin{verbatim}
import pypar

p = pypar.rank()
P = pypar.size()
node = pypar.get_processor_name()

print "I am process %d of %d on node %s" %(p, P, node)

pypar.finalize()
\end{verbatim}}
\noindent and run it as follows:
\begin{verbatim}
mpirun -np 4 python ni.py
\end{verbatim}
You should get a greeting from each of the four nodes looking
something like
\begin{verbatim}
Pypar (version 1.9) initialised MPI OK with 4 processors
I am process 0 of 4 on node ninja-1
I am process 3 of 4 on node ninja-4
I am process 1 of 4 on node ninja-2
I am process 2 of 4 on node ninja-3
\end{verbatim}
The order of output may be arbitrary! If four copies of the
program are run, print is called four times. The order in which each
process prints its message is not determined; it depends on when each
process reaches that point in its execution of the program and on how
the messages travel across the network.
\emph{Note that all the print statements, though they come from different
processes, will send their output intact to your shell window; this
is generally true of output commands.
Input commands, like \texttt{raw\_input},
will only work on the process with rank zero.}


\subsection{The MPI execution model}

It is important to understand that in MPI (and therefore also in
Pypar), multiple identical copies of this program
will start simultaneously in separate processes.
These processes may run on one machine or on multiple machines.
This is a fundamental difference from ordinary programs,
where, if someone says ``run the program'', it is
assumed that there is only one instance of the program running.


In a nutshell, this program sets up a communication group of 4
processes, where each process gets its rank (\texttt{p}), prints it,
and exits.

The following walks through the equivalent C program line by line.

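The C listing itself (\texttt{hello.c}) is not shown here; the sketch below is
a reconstruction based on the walkthrough that follows and on the sample
output shown later, so treat the exact greeting text and variable names as
assumptions.
{\footnotesize
\begin{verbatim}
/* hello.c -- minimal MPI hello world (reconstructed sketch) */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int myid;

  MPI_Init(&argc, &argv);                  /* set up MPI              */
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);    /* get this process' rank  */

  printf("Sawatdii khrap thuk thuk khon (Process %d)\n", myid);

  MPI_Finalize();                          /* shut down MPI           */
  return 0;
}
\end{verbatim}}
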
The first line,
\texttt{\#include <stdio.h>},
should be familiar to all C programmers. It includes the standard
input/output routines like printf. The second line,
\texttt{\#include <mpi.h>},
includes the MPI functions. The file \texttt{mpi.h}
contains prototypes for all the
MPI routines in this program; this file is located in
\texttt{/usr/include/lam/mpi.h} in case you actually want to look at it.
The program starts with the \texttt{main} function, which takes the two
arguments \texttt{argc} and \texttt{argv} (as is normal in C): the argument
\texttt{argc} is
the number of command line arguments given to the program (including itself)
and \texttt{argv} is an array of strings containing them. For example,
argv[0] will contain the program name itself. We will not be using them
directly, but MPI needs access to them through \texttt{MPI\_Init}.
Then the program declares one integer variable, \texttt{myid}. The
first step of the program,
\begin{verbatim}
MPI_Init(&argc,&argv);
\end{verbatim}
calls MPI\_Init to set up everything.
This should be the first command executed in all MPI programs. This
routine takes pointers to \texttt{argc} and \texttt{argv},
looks at them, pulls out the purely
MPI-relevant things, and generally fixes them so you can use command line
arguments as normal.

\medskip

Next, the program runs \texttt{MPI\_Comm\_rank},
passing it the default communicator MPI\_COMM\_WORLD
(which is the group of \emph{all} processes started)
and a \emph{pointer} to \texttt{myid}. Passing a pointer is the way
C can change the value of the argument \texttt{myid}.
\texttt{MPI\_Comm\_rank} will set \texttt{myid} to the rank of the calling
process within that group.
Remember that in reality, several instances of this program
start up in several different processes when this program is run. Each
will receive a unique number from \texttt{MPI\_Comm\_rank}.
Because multiple copies are running, each will execute all lines
in the program including the \texttt{printf} statement which prints a
message and the rank.
After doing everything else, the program calls \texttt{MPI\_Finalize},
which generally terminates everything and shuts down MPI. This should be the
last command executed in all MPI programs.

\subsection*{Compile the MPI program}

Normally, C programs are compiled with a command such as
\texttt{gcc hello.c -o hello.x}. However, MPI programs need access to
additional libraries, which are provided through the wrapper program
\texttt{mpicc}. It is used just like the ordinary compiler. Try to run
\begin{verbatim}
mpicc hello.c -o hello.x
\end{verbatim}
If you made a syntactically correct program, a \emph{compiled}
program will appear with the name you specified after the -o option.
If not, you will see some error messages and you'll have to correct
those before continuing.

\subsection*{Run the MPI program}
In order to run a compiled MPI program, you must type:
\begin{verbatim}
mpirun -np <number of processes> [options] <program name and arguments>
\end{verbatim}
where you specify the number of processes on which you want to run
your parallel program, optional mpirun options,
and your program name and its expected arguments.
In this case try:
\begin{verbatim}
mpirun -np 1 hello.x
\end{verbatim}
This will run one copy of your program and the output should look like this:
\begin{verbatim}
Sawatdii khrap thuk thuk khon (Process 0)
\end{verbatim}

Now try to start four processes by running \texttt{mpirun -np 4 hello.x}.
You should see the following output:
\begin{verbatim}
Sawatdii khrap thuk thuk khon (Process 0)
Sawatdii khrap thuk thuk khon (Process 3)
Sawatdii khrap thuk thuk khon (Process 1)
Sawatdii khrap thuk thuk khon (Process 2)
\end{verbatim}
Note that the order of output may be arbitrary! If four copies of the
program are run, printf is called four times. The order in which each
process prints its message is not determined; it depends on when each
process reaches that point in its execution of the program and on how
the messages travel across the network. Your guess is as good as mine.
\emph{Note that all the printf's, though they come from different
processes, will send their output intact to your shell window; this
is generally true of output commands. Input commands, like scanf,
will only work on the process with rank zero.}

\subsubsection*{Exercise}
The MPI command
\texttt{MPI\_Comm\_size(MPI\_COMM\_WORLD, \&number\_of\_processes);} will
store the total number of processes started in the integer variable
\texttt{number\_of\_processes}. Modify the program
to give the following output:
\begin{verbatim}
Sawatdii khrap thuk thuk khon (Process 0 of 4)
Sawatdii khrap thuk thuk khon (Process 3 of 4)
Sawatdii khrap thuk thuk khon (Process 1 of 4)
Sawatdii khrap thuk thuk khon (Process 2 of 4)
\end{verbatim}
when started with four processes.

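A minimal sketch of the required change is shown below; the variable name
\texttt{number\_of\_processes} follows the exercise text, and the surrounding
program is assumed to be the hello world program from above.
{\footnotesize
\begin{verbatim}
int myid, number_of_processes;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
MPI_Comm_size(MPI_COMM_WORLD, &number_of_processes); /* total process count */

printf("Sawatdii khrap thuk thuk khon (Process %d of %d)\n",
       myid, number_of_processes);

MPI_Finalize();
\end{verbatim}}
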
\subsection*{Running on multiple computers}

So far we have only \emph{simulated} the parallelism by running
multiple processes on one machine.
We will now add host specific information to the program and then run it
in parallel.

Add the following declarations to your program:
\begin{verbatim}
int namelen;
char processor_name[MPI_MAX_PROCESSOR_NAME];
\end{verbatim}

Add the command
\begin{verbatim}
MPI_Get_processor_name(processor_name, &namelen);
\end{verbatim}
This will store the hostname of the processor executing the given process
along with the length of the hostname. Then modify the print statement to
\begin{verbatim}
printf("Sawatdii khrap thuk thuk khon (Process %d of %d running on %s)\n",
       myid, number_of_processes, processor_name);
\end{verbatim}
When you compile and run the program you should see the following output:
\begin{verbatim}
Sawatdii khrap thuk thuk khon (Process 0 of 4 running on ninja-n)
Sawatdii khrap thuk thuk khon (Process 1 of 4 running on ninja-n)
Sawatdii khrap thuk thuk khon (Process 2 of 4 running on ninja-n)
Sawatdii khrap thuk thuk khon (Process 3 of 4 running on ninja-n)
\end{verbatim}
We are still running all our MPI processes on one computer but we are
now ready to run this program in parallel.

Edit the file \texttt{\~{}/.lamhosts} to contain all machines in our network
and run \texttt{lamboot -v \~{}/.lamhosts} again. You should see the following
diagnostic message:
\begin{verbatim}
LAM 6.5.8/MPI 2 C++/ROMIO - Indiana University

Executing hboot on n0 (ninja-1 - 1 CPU)...
Executing hboot on n1 (ninja-2 - 1 CPU)...
Executing hboot on n2 (ninja-3 - 1 CPU)...
Executing hboot on n3 (ninja-4 - 1 CPU)...
Executing hboot on n4 (ninja-5 - 1 CPU)...
Executing hboot on n5 (ninja-6 - 1 CPU)...
Executing hboot on n6 (ninja-7 - 1 CPU)...
Executing hboot on n7 (ninja-8 - 1 CPU)...
topology done
\end{verbatim}
\noindent Finally try to run your program again and verify that it runs
on different nodes.
What happens if you run more processes than there are physical machines?


\section*{Optional: synchronise the clocks}

\textbf{\emph{Note: This is an optional exercise for those who can't get enough!}}

The clocks in our cluster are not synchronised: they show different times.
Try for example to run
\begin{verbatim}
ssh -x ninja-2 date; ssh -x ninja-5 date; ssh -x ninja-3 date; date
\end{verbatim}
You will see that the times listed differ significantly.

Try searching the web for terms like
\texttt{synchronize time network}
and see if you can figure out how to do it.

\section*{MPI resources}

\begin{itemize}
\item \url{http://www.cs.appstate.edu/~can/classes/5530} (excellent introduction
and many useful links)
\item \url{http://www-unix.mcs.anl.gov/mpi/mpich} (the MPICH implementation)
\item \url{http://www.lam-mpi.org} (the LAM-MPI implementation)
\item \url{http://www.netlib.org/utk/papers/mpi-book/mpi-book.html}\\
(the complete MPI reference)
\end{itemize}



\section*{Exercise 6}

In this exercise we will continue the introduction to MPI
by using the local id to differentiate work and
by familiarising ourselves with the basic send and receive
primitives.
We will then proceed to measure the fundamental characteristics of
the network: namely latency and bandwidth.



\subsection*{Using local id to differentiate work}
In Exercise 5 we had every processor write to the screen. This
is convenient for debugging purposes
and I suggest you start every parallel program
with a message from each processor about its rank, the total number
of processors and the hostname it is running on (see below).

However, it is often also desirable to have only one
processor write to the screen,
for example to report the final result of a calculation.

\subsubsection*{Exercise}
Write a parallel program that
produces something like the following output when run on four processors:
\begin{verbatim}
P2/4: Initialised OK on host ninja-6
P0/4: Initialised OK on host ninja-4
P0: Program terminating
P1/4: Initialised OK on host ninja-5
P3/4: Initialised OK on host ninja-7
\end{verbatim}
Arrange the program such that
all processors report on their initialisation but
only one, processor 0, reports on its termination just prior
to \texttt{MPI\_Finalize()}.
Note, as in Exercise 5, that the order of output is
arbitrary.
%\emph{Hint: Use the value of myid to decide whether to print or not}.

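A minimal sketch of the rank test that produces this pattern is shown below;
the exact message strings are illustrative, and the hostname code is assumed
to come from the earlier ``Running on multiple computers'' section.
{\footnotesize
\begin{verbatim}
/* Every process reports its initialisation ... */
printf("P%d/%d: Initialised OK on host %s\n",
       myid, number_of_processes, processor_name);

/* ... but only process 0 reports the termination. */
if (myid == 0) {
    printf("P%d: Program terminating\n", myid);
}
MPI_Finalize();
\end{verbatim}}
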
\subsection*{A simple send and receive pair}
In this part we will use the fundamental MPI send and receive commands
to communicate among the processors. The commands are:
\begin{itemize}
\item \texttt{MPI\_Send}:
This routine performs a basic send; this routine may block until
the message
is received, depending on the specific implementation of MPI.
\begin{verbatim}
int MPI_Send(void* buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)

Input:
  buf      - initial address of send buffer
  count    - number of elements in send buffer (nonnegative integer)
  datatype - datatype of each send buffer element
  dest     - rank of destination (integer)
  tag      - message tag (integer)
  comm     - communicator
\end{verbatim}
\textbf{Example}: To send one integer, A say, to processor 3 with tag 13:
\begin{verbatim}
MPI_Send(&A, 1, MPI_INT, 3, 13, MPI_COMM_WORLD);
\end{verbatim}
\item \texttt{MPI\_Recv}:
This routine performs a basic receive.
\begin{verbatim}
int MPI_Recv(void* buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm,
             MPI_Status *status)

Input:
  count    - maximum number of elements in receive buffer (integer)
  datatype - datatype of each receive buffer element
  source   - rank of source (integer)
  tag      - message tag (integer)
  comm     - communicator

Output:
  buf      - initial address of receive buffer
  status   - status object, provides information about the
             message received;
             status is a structure of type MPI_Status, the element
             status.MPI_SOURCE is the source of the message received,
             and the element status.MPI_TAG is the tag value.
\end{verbatim}
\textbf{Example}: To receive one integer, B say, from processor 7
with tag 13:
\begin{verbatim}
MPI_Recv(&B, 1, MPI_INT, 7, 13, MPI_COMM_WORLD, &status);
\end{verbatim}
\end{itemize}

\noindent These calls are described in more detail in Ian Foster's online book
in \textbf{MPI Basics}:
\url{http://www-unix.mcs.anl.gov/dbpp/text/node96.html}.
The chapter about \textbf{Asynchronous communication},
\url{http://www-unix.mcs.anl.gov/dbpp/text/node98.html}, describes how
to query the status object.

MPI defines the following constants for use with \texttt{MPI\_Recv}:
\begin{itemize}
\item \texttt{MPI\_ANY\_SOURCE}: Can be used to specify that a message
can be received from anywhere instead of a specific source.
\item \texttt{MPI\_ANY\_TAG}: Can be used to specify that a message
can have any tag instead of a specific tag.
\end{itemize}

\subsubsection*{Exercise}
The following program segment implements a very simple
communication pattern: processor 0 sends a number to processor 1 and
both output a diagnostic message.

Add to your previous program the declarations:
\begin{verbatim}
int A, B, source, destination, tag=13;
MPI_Status status;
\end{verbatim}
and the following code segment
\begin{verbatim}
if (myid == 0) {
  A = 42;
  destination = 1;
  printf("P%d: Sending value %d to MPI process %d\n",
         myid, A, destination);
  MPI_Send(&A, 1, MPI_INT, destination, tag, MPI_COMM_WORLD);
} else if (myid == 1) {
  source = 0;
  MPI_Recv(&B, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
  printf("P%d: Received value %d from MPI process %d\n", myid, B, source);
}
\end{verbatim}
Make sure it compiles and runs on 2 processors.
Verify that your output looks something like
\begin{verbatim}
P0/2: Initialised OK on host ninja-1
P0: Sending value 42 to MPI process 1
P0: Terminating
P1/2: Initialised OK on host ninja-2
P1: Received value 42 from MPI process 0
\end{verbatim}



\subsection*{Send messages around in a ring - Exercise}

Write an MPI program which passes data
around in a ring structure: from process 0 to process 1, and so on.
When the message reaches the last processor it should be passed back to
processor 0 - this forms a 'communication ring'.
Every process should add one to the value before passing it on
and write out a diagnostic message.

With a starting value of 42 and running on four processors your program
should produce the following output:
\begin{verbatim}
P0/4: Initialised OK on host ninja-1
P0: Sending value 42 to MPI process 1
P1/4: Initialised OK on host ninja-2
P3/4: Initialised OK on host ninja-4
P1: Received value 42 from MPI process 0
P1: Sending value 43 to MPI process 2
P2/4: Initialised OK on host ninja-3
P2: Received value 43 from MPI process 1
P2: Sending value 44 to MPI process 3
P3: Received value 44 from MPI process 2
P3: Sending value 45 to MPI process 0
P0: Received value 45 from MPI process 3
\end{verbatim}
\emph{Hint}: You can use the modulo operator to calculate
the destination such that the next processor after the last becomes 0
(see the sketch below):
\begin{verbatim}
destination = (myid + 1) % number_of_processes;
\end{verbatim}


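Here is a minimal sketch of the ring exchange, assuming \texttt{myid} and
\texttt{number\_of\_processes} have been set up as in the previous exercises;
variable names and messages are illustrative.
{\footnotesize
\begin{verbatim}
int value, source, destination, tag = 13;
MPI_Status status;

destination = (myid + 1) % number_of_processes;
source      = (myid - 1 + number_of_processes) % number_of_processes;

if (myid == 0) {
  value = 42;                                 /* start the ring       */
  printf("P%d: Sending value %d to MPI process %d\n",
         myid, value, destination);
  MPI_Send(&value, 1, MPI_INT, destination, tag, MPI_COMM_WORLD);
  MPI_Recv(&value, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
  printf("P%d: Received value %d from MPI process %d\n", myid, value, source);
} else {
  MPI_Recv(&value, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
  printf("P%d: Received value %d from MPI process %d\n", myid, value, source);
  value = value + 1;                          /* add one and pass on  */
  printf("P%d: Sending value %d to MPI process %d\n",
         myid, value, destination);
  MPI_Send(&value, 1, MPI_INT, destination, tag, MPI_COMM_WORLD);
}
\end{verbatim}}
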
\subsection*{Timing}
A main reason for doing parallel computing is faster execution.
Therefore \emph{timing} is an integral part of this game.
MPI provides the function \texttt{MPI\_Wtime}, which returns the elapsed
wall clock time in seconds.
To time a code segment with \texttt{MPI\_Wtime} do something like this:
\begin{verbatim}
double t0;

t0 = MPI_Wtime();
// (Computations)
printf("Time = %.6f sec\n", MPI_Wtime() - t0);
\end{verbatim}

\subsubsection*{Exercise}
Time the previous 'ring' program in such a way that only process 0
measures the time and writes it to the screen.

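A minimal sketch of how process 0 alone can time the ring exchange from the
previous exercise:
{\footnotesize
\begin{verbatim}
double t0 = 0.0;

if (myid == 0) t0 = MPI_Wtime();    /* start the clock on process 0 only */

/* ... the ring communication from the previous exercise goes here ... */

if (myid == 0)
  printf("P%d: Ring completed in %.6f sec\n", myid, MPI_Wtime() - t0);
\end{verbatim}}
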
\subsection*{Measure network latency and bandwidth}

The time it takes to transmit a message can be approximated by the model
\[
  t = t_l + \alpha t_b
\]
where
\begin{itemize}
\item $t_l$ is the \emph{latency}. The latency is the
time it takes to start the transmission before anything
gets communicated. This involves opening the network connection,
buffering the message and so on.
\item $t_b$ is the time it takes to communicate one byte once
the connection is open. This is related to the \emph{bandwidth} $B$,
defined as the number of bytes transmitted per second, as
\[
  B = \frac{1}{t_b}
\]
\item $\alpha$ is the number of bytes communicated.
\end{itemize}

\subsubsection*{Exercise}
\textbf{Can you work out what the network latency is?}
\emph{Hint:} Send a very small message (e.g.\ one integer)
from one processor to another and back again a number of times
and measure the time.

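A possible sketch of such a ping-pong test between processes 0 and 1 is shown
below; the number of repetitions is an arbitrary choice. The round trip time
divided by two and by the number of repetitions approximates the latency,
since almost no data is transmitted.
{\footnotesize
\begin{verbatim}
int i, x = 0, repeats = 1000, tag = 17;
double t0;
MPI_Status status;

if (myid == 0) {
  t0 = MPI_Wtime();
  for (i = 0; i < repeats; i++) {
    MPI_Send(&x, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
    MPI_Recv(&x, 1, MPI_INT, 1, tag, MPI_COMM_WORLD, &status);
  }
  printf("Estimated latency = %.9f sec\n",
         (MPI_Wtime() - t0) / (2.0 * repeats));
} else if (myid == 1) {
  for (i = 0; i < repeats; i++) {
    MPI_Recv(&x, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
    MPI_Send(&x, 1, MPI_INT, 0, tag, MPI_COMM_WORLD);
  }
}
\end{verbatim}}
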
\subsubsection*{Exercise}
\textbf{Can you work out how many megabytes the network can transmit per
second?}
Create arrays of varying size, send them from one processor
to another and back, and measure the time (see the sketch after the list).
\emph{Hints:}
\begin{itemize}
\item To create an array of double precision floating point numbers
do the following:
\begin{verbatim}
#define MAX_LEN  500000   /* Largest block */
double A[MAX_LEN];
\end{verbatim}
\item To populate the array with pseudo random numbers:
\begin{verbatim}
for (j=0; j<MAX_LEN; j++) A[j]=rand();
\end{verbatim}
\item To send only part of the array use the second argument in
MPI\_Send and MPI\_Recv to control how much information is transferred.
\item MPI\_Send and MPI\_Recv require a \emph{pointer} to the first
element of the array. Since arrays are already defined as pointers,
simply put \texttt{A} rather than \texttt{\&A}, which is what
we used earlier for the
simple integer variable.
\item One double precision number takes up 8 bytes - use that in your
estimate.
\end{itemize}

Your measurements may fluctuate based on the network traffic and the
load on the machines.

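A sketch of one way to do the measurement, assuming the array \texttt{A} has
been declared and populated as in the hints above; the block sizes and the
single round trip per size are arbitrary choices.
{\footnotesize
\begin{verbatim}
int len, tag = 18;
double t0, seconds, megabytes;
MPI_Status status;

for (len = 1000; len <= MAX_LEN; len *= 10) {
  if (myid == 0) {
    t0 = MPI_Wtime();
    MPI_Send(A, len, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    MPI_Recv(A, len, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD, &status);
    seconds   = (MPI_Wtime() - t0) / 2.0;        /* one-way time       */
    megabytes = 8.0 * len / (seconds * 1.0e6);   /* 8 bytes per double */
    printf("%d doubles: %.6f sec, %.2f MB/sec\n", len, seconds, megabytes);
  } else if (myid == 1) {
    MPI_Recv(A, len, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, &status);
    MPI_Send(A, len, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD);
  }
}
\end{verbatim}}
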


\section*{Good MPI resources}
\begin{itemize}
\item MPI Data types:\\
\url{http://www.ats.ucla.edu/at/hpc/parallel_computing/mpi-intro.htm}
\item Online book by Ian Foster:\\
\url{http://www-unix.mcs.anl.gov/dbpp/text/node94.html}
\end{itemize}




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Identifying the processors}


%%%%%%%%%%%


It is almost impossible to demonstrate parallelism with an interactive
session at the command prompt as we did with the Python and Maple tutorials.
Therefore we must write a \emph{program} that can be executed in parallel.
Start an editor of your choice with a new file
(called \texttt{paraprime.py}, say). For example:
\begin{verbatim}
emacs paraprime.py &
\end{verbatim}

\noindent Write the following in the file and save it:
{\small \begin{verbatim}
import pypar

numproc = pypar.size()
myid = pypar.rank()

print "I am proc %d of %d" %(myid, numproc)
\end{verbatim}}

Then type the following at the Unix prompt to execute the
program normally on one processor:
{\small \begin{verbatim}
python paraprime.py
\end{verbatim}}
\noindent You should see the following output:
{\small \begin{verbatim}
> python paraprime.py
I am proc 0 of 1
\end{verbatim}}


\noindent Now let us try to run the same program on four processors:
{\small \begin{verbatim}
> prun -n 4 python paraprime.py
I am proc 0 of 4
I am proc 1 of 4
I am proc 2 of 4
I am proc 3 of 4
\end{verbatim}}
If you get a message from each of the four processors stating their
id as above, we know that everything works. We have a parallel program!

\section{Strategies for efficient parallel programming}
A final note on efficiency.

Parallel programming should lead to faster execution and/or
the ability to deal with larger problems.

The ultimate goal is to be \emph{P} times faster with \emph{P} processors.
However, the \emph{speedup} is usually less than \emph{P}. One must address
three critical issues to achieve good speedup:
\begin{itemize}
\item \textbf{Interprocessor communication:} The amount and the frequency
of messages should be kept as low as possible. Ensure that processors
have plenty of work to do between communications.
\item \textbf{Data distribution and load balancing:} If some processors
finish much sooner than others, the total execution time of the parallel
program is bounded by that of the slowest processor and we say that the
program is poorly load balanced.
Ensure that each processor gets its fair share of the workload.
\item \textbf{Sequential parts of a program:} If half of a program, say, is
inherently sequential, the speedup can never exceed 2 no matter how well
the remaining half is parallelised. This is known as Amdahl's law
(see the formula below).
Ensure that all cost intensive parts get parallelised.
\end{itemize}
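
For reference, Amdahl's law can be stated as follows: if a fraction $f$ of the
program is inherently sequential, the speedup on $P$ processors is bounded by
\[
  S(P) = \frac{1}{f + (1 - f)/P} \leq \frac{1}{f}.
\]
For example, with $f = 0.5$ (half the program sequential) the speedup can
never exceed $1/f = 2$, as stated above.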




\section*{Exercise 8}

In Exercises 6 and 7 you sent messages around in rings and back and forth
between two processors, and you also timed the network speed when
communicating between two processors.
The purpose of this exercise is to try a collective communication pattern
where one processor distributes data to and gathers data from all the others
using the basic MPI\_Send and MPI\_Recv.

\subsection*{Distribute and gather integers}

Create a program where process 0 distributes an integer to all other processes,
then receives an integer back from each of them.
The other processes should receive the integer from process 0, add their
rank to it and pass it back to process 0.

With some suitable diagnostic output your program should produce
something like this when run on four processors
(a sketch follows the sample output):

\begin{verbatim}
P0/4: Initialised OK on host ninja-1
P0: Sending value 42 to MPI process 1
P0: Sending value 42 to MPI process 2
P0: Sending value 42 to MPI process 3
P3/4: Initialised OK on host ninja-4
P1/4: Initialised OK on host ninja-2
P2/4: Initialised OK on host ninja-3
P3: Received value 42 from MPI process 0
P3: Sending value 45 to MPI process 0
P1: Received value 42 from MPI process 0
P1: Sending value 43 to MPI process 0
P2: Received value 42 from MPI process 0
P2: Sending value 44 to MPI process 0
P0: Received value 43 from MPI process 1
P0: Received value 44 from MPI process 2
P0: Received value 45 from MPI process 3
\end{verbatim}

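A minimal sketch of the distribute and gather pattern with point to point
calls; the starting value 42 and the variable names are illustrative.
{\footnotesize
\begin{verbatim}
int i, A, B, tag = 19;
MPI_Status status;

if (myid == 0) {
  A = 42;
  for (i = 1; i < number_of_processes; i++) {   /* distribute ...      */
    printf("P%d: Sending value %d to MPI process %d\n", myid, A, i);
    MPI_Send(&A, 1, MPI_INT, i, tag, MPI_COMM_WORLD);
  }
  for (i = 1; i < number_of_processes; i++) {   /* ... and gather back */
    MPI_Recv(&B, 1, MPI_INT, i, tag, MPI_COMM_WORLD, &status);
    printf("P%d: Received value %d from MPI process %d\n", myid, B, i);
  }
} else {
  MPI_Recv(&B, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
  printf("P%d: Received value %d from MPI process %d\n", myid, B, 0);
  B = B + myid;                                 /* add own rank        */
  printf("P%d: Sending value %d to MPI process %d\n", myid, B, 0);
  MPI_Send(&B, 1, MPI_INT, 0, tag, MPI_COMM_WORLD);
}
\end{verbatim}}
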
\subsection*{Distribute and gather arrays}

In this part we will do exactly the same as above except we will use
a double precision array as the data.

\begin{enumerate}
\item Create a double precision array (A) of length $N=1024$ to use as buffer.
\item On process 0, populate it with the numbers $0$ to $N-1$
(using A[i] = (double) i;).
\item Pass A on to all other processors (the workers).
\item All workers should compute the $p$-norm of the array,
where $p$ is the rank of the worker:
\[
\left(
  \sum_{i=0}^{N-1} A[i]^p
\right)^{1/p}
\]
\item Then return the result as one double precision number per worker and
print them all out on processor 0 (see the sketch after the hints).
\end{enumerate}

\emph{Hints:}
\begin{itemize}
\item To compute $x^y$ use the C function \texttt{pow(x, y)}.
\item Include the math headers in your program: \verb+#include <math.h>+
\item You'll also need to link in the math library as follows:
\texttt{mpicc progname.c -lm}
%\item You might want to add many print statements reporting the communication pattern in
\end{itemize}

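The norm computation on each worker might look like the sketch below,
assuming \texttt{A} has been received into a local buffer of length
\texttt{N} and that \texttt{tag} is a message tag agreed with process 0;
variable names are illustrative.
{\footnotesize
\begin{verbatim}
double sum = 0.0, result;
int i, p = myid;               /* the exponent is this worker's rank */

for (i = 0; i < N; i++) {
  sum += pow(A[i], (double) p);
}
result = pow(sum, 1.0 / (double) p);

/* return one double precision number to process 0 */
MPI_Send(&result, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD);
\end{verbatim}}
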

If your program works it should produce something like this on four processors:
\begin{verbatim}
P0/4: Initialised OK on host ninja-1
P3/4: Initialised OK on host ninja-4
P2/4: Initialised OK on host ninja-3
P1/4: Initialised OK on host ninja-2
P0: Received value 523776.00 from MPI process 1
P0: Received value 18904.76 from MPI process 2
P0: Received value 6497.76 from MPI process 3
\end{verbatim}




\end{document}