\chapter{Running the Code in Parallel}

This chapter looks at how to run the code in parallel. The first section gives an overview of the main steps required to divide the mesh over the processors. The second section describes some example code and the final section discusses how to run the code on a few specific architectures.


\section {Partitioning the Mesh}\label{sec:part}
There are four main steps required to run the code in parallel. They are:
\begin{enumerate}
\item partition the mesh into a set of non-overlapping submeshes
(\code{pmesh_divide_metis} from {\tt pmesh_divide.py}),
\item build a \lq ghost\rq\ or communication layer of triangles
around each submesh and define the communication pattern (\code{build_submesh} from {\tt build_submesh.py}),
\item distribute the submeshes over the processors (\code{send_submesh}
and \code{rec_submesh} from {\tt build_commun.py}),
\item update the numbering scheme for each submesh assigned to a
processor (\code{build_local_mesh} from {\tt build_local.py}).
\end{enumerate}
See Figure \ref{fig:subpart}.

\begin{figure}[hbtp]
  \centerline{ \includegraphics[scale = 0.75]{figures/domain.eps}}
  \caption{The main steps used to divide the mesh over the processors.}
  \label{fig:subpart}
\end{figure}

\subsection {Subdividing the Global Mesh}\label{sec:part1}

The first step in parallelising the code is to subdivide the mesh
into partitions of roughly equal size. On a rectangular mesh this may be
done by a simple co-ordinate based dissection method, but on a complicated
domain such as the Merimbula mesh shown in Figure \ref{fig:mergrid}
a more sophisticated approach must be used. We use Pymetis, a
Python wrapper around the Metis
(\url{http://glaros.dtc.umn.edu/gkhome/metis/metis/overview})
partitioning library. The \code{pmesh_divide_metis} function defined
in {\tt pmesh_divide.py} uses Metis to divide the mesh for
parallel computation. Metis was chosen as the partitioner based on
the results in the paper \cite{gk:metis}.

\begin{figure}[hbtp]
  \centerline{ \includegraphics[scale = 0.75]{figures/mermesh.eps}}
  \caption{The Merimbula mesh.}
 \label{fig:mergrid}
\end{figure}

Figure \ref{fig:mergrid4} shows the Merimbula grid partitioned over
four processors. Note that one submesh may comprise several
unconnected mesh partitions. Table \ref{tbl:mer4} gives the element
distribution over the four processors while Table \ref{tbl:mer8}
shows the distribution over eight processors. These results imply
that Pymetis gives a reasonably well balanced partition of the mesh.

\begin{figure}[hbtp]
  \centerline{ \includegraphics[scale = 0.75]{figures/mermesh4c.eps}
  \includegraphics[scale = 0.75]{figures/mermesh4a.eps}}

  \centerline{ \includegraphics[scale = 0.75]{figures/mermesh4d.eps}
  \includegraphics[scale = 0.75]{figures/mermesh4b.eps}}
  \caption{The Merimbula grid partitioned over 4 processors using Metis.}
 \label{fig:mergrid4}
\end{figure}

\begin{table}
\caption{Running a 4-way test of
{\tt run_parallel_sw_merimbula_metis.py}}\label{tbl:mer4}
\begin{center}
\begin{tabular}{c|c c c c}
CPU & 0 & 1 & 2 & 3\\
Elements & 2757 & 2713 & 2761 & 2554\\
\% & 25.56\% & 25.16\% & 25.60\% & 23.68\%\\
\end{tabular}
\end{center}
\end{table}

\begin{table}
\caption{Running an 8-way test of
{\tt run_parallel_sw_merimbula_metis.py}} \label{tbl:mer8}
\begin{center}
\begin{tabular}{c|c c c c c c c c}
CPU & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7\\
Elements & 1229 & 1293 & 1352 & 1341 & 1349 & 1401 & 1413 & 1407\\
\% & 11.40\% & 11.99\% & 12.54\% & 12.43\% & 12.51\% & 12.99\% & 13.10\% & 13.05\%\\
\end{tabular}
\end{center}
\end{table}

The number of submeshes produced by Pymetis is equal to the number of
processors; Submesh $p$ will be assigned to Processor $p$.

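As a quick check of the load balance reported in Tables \ref{tbl:mer4} and \ref{tbl:mer8}, the element count assigned to each processor can be read off the \code{triangles_per_proc} list returned by \code{pmesh_divide_metis} (see Figure \ref{fig:code}). The helper below is an illustrative sketch only and is not part of ANUGA.
\begin{verbatim}
# Sketch only: summarise the load balance produced by pmesh_divide_metis.
# triangles_per_proc[p] is the number of triangles assigned to Submesh p.
def report_load_balance(triangles_per_proc):
    total = float(sum(triangles_per_proc))
    for p in range(len(triangles_per_proc)):
        n = triangles_per_proc[p]
        print 'CPU %d: %d elements (%.2f%%)' % (p, n, 100.0*n/total)
\end{verbatim}
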
\subsection {Building the Ghost Layer}\label{sec:part2}
The \code{build_submesh} function in {\tt build_submesh.py} is the
work-horse: it is responsible for setting up the communication pattern as
well as assigning the local numbering scheme for the submeshes.

Consider the example subpartitioning given in Figure
\ref{fig:subdomain}. During the \code{evolve} calculations Triangle
2 in Submesh 0 will need to access its neighbour Triangle 3 stored
in Submesh 1. The standard approach to this problem is to add an
extra layer of triangles, which we call ghost triangles. The ghost
triangles are read-only: they are never updated during the
calculations themselves and exist only to hold any extra information
that a processor may need to complete its calculations. The ghost
triangle values are instead updated through communication calls. Figure
\ref{fig:subdomaing} shows the submeshes with the extra layer of
ghost triangles.

\begin{figure}[hbtp]
  \centerline{ \includegraphics[scale = 0.6]{figures/subdomain.eps}}
  \caption{An example subpartitioning of a mesh.}
 \label{fig:subdomain}
\end{figure}


\begin{figure}[hbtp]
  \centerline{ \includegraphics[scale = 0.6]{figures/subdomainghost.eps}}
  \caption{An example subpartitioning with ghost triangles. The numbers
in brackets show the local numbering scheme that is calculated and
stored with the mesh, but not implemented until the local mesh is
built. See Section \ref{sec:part4}. }
 \label{fig:subdomaing}
\end{figure}

When partitioning the mesh we introduce new, dummy, boundary edges.
For example, Triangle 2 in Submesh 1 from Figure
\ref{fig:subdomaing} originally shared an edge with Triangle 1, but
after partitioning that edge becomes a boundary edge. These new
boundary edges are tagged as \code{ghost} and should, in
general, be assigned a type of \code{None}. The following piece of
code taken from {\tt run_parallel_advection.py} shows an example.
{\small \begin{verbatim}
T = Transmissive_boundary(domain)
domain.default_order = 2
domain.set_boundary({'left': T, 'right': T, 'bottom': T,
                     'top': T, 'ghost': None})
\end{verbatim}}

Looking at Figure \ref{fig:subdomaing} we see that after each
\code{evolve} step Processor 0 will have to send the updated values
for Triangle 2 and Triangle 4 to Processor 1, and similarly
Processor 1 will have to send the updated values for Triangle 3 and
Triangle 5 (recall that Submesh $p$ will be assigned to Processor
$p$). The \code{build_submesh} function builds a dictionary that
defines the communication pattern.

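The exact layout of this communication pattern is defined in {\tt build_submesh.py}; conceptually, for the subpartitioning in Figure \ref{fig:subdomaing}, the pattern for Processor 0 can be pictured as in the following illustrative sketch (not the actual data structure). The dictionaries appear in the later examples as \code{full_send_dict} and \code{ghost_recv_dict}.
\begin{verbatim}
# Illustrative sketch only: the dictionaries built by build_submesh are
# more detailed, but Processor 0 essentially records that it must
full_send_dict  = {1: [2, 4]}  # send full triangles 2 and 4 to Processor 1
ghost_recv_dict = {1: [3, 5]}  # receive ghosts of triangles 3 and 5 from 1
\end{verbatim}
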
Finally, the ANUGA code assumes that the triangles (and nodes etc.) are numbered consecutively starting from 0. Consequently, if Submesh 1 in Figure \ref{fig:subdomaing} were passed into the \code{evolve} calculations it would crash. The \code{build_submesh} function determines a local numbering scheme for each submesh, but it does not actually update the numbering; that is left to \code{build_local_mesh} (see Section \ref{sec:part4}).


\subsection {Sending the Submeshes}\label{sec:part3}

All of the functions described so far must be run in serial on
Processor 0. The next step is to start the parallel computation by
spreading the submeshes over the processors. The communication is
carried out by \code{send_submesh} and \code{rec_submesh} defined in
{\tt build_commun.py}. The \code{send_submesh} function should be
called on Processor 0 and sends Submesh $p$ to Processor $p$,
while \code{rec_submesh} should be called by Processor $p$ to
receive Submesh $p$ from Processor 0.

As an aside, the order of communication is very important. If someone
were to modify the \code{send_submesh} routine, the corresponding change
must be made to the \code{rec_submesh} routine.

While it is possible to get Processor 0 to communicate its submesh to
itself, this would be an expensive and unnecessary communication call. The
{\tt build_commun.py} file therefore also includes a function called
\code{extract_hostmesh} that should be called on Processor 0 to extract
Submesh 0.


\subsection {Building the Local Mesh}\label{sec:part4}
After using \code{send_submesh} and \code{rec_submesh}, Processor $p$ should have its own local copy of Submesh $p$; however, as stated previously, the triangle numbering will be incorrect on all processors except Processor $0$. The \code{build_local_mesh} function from {\tt build_local.py} primarily focuses on renumbering the information stored with the submesh, including the nodes, vertices and quantities. Figure \ref{fig:subdomainf} shows what the mesh on each processor may look like.


\begin{figure}[hbtp]
  \centerline{ \includegraphics[scale = 0.6]{figures/subdomainfinal.eps}}
  \caption{An example subpartitioning after the submeshes have been renumbered.}
 \label{fig:subdomainf}
\end{figure}

Note that ghost triangles are not stored in the domain any differently to the other triangles; the only way to differentiate them is to look at the communication pattern. This means that the {\tt inundation} code is essentially the same whether it is run in serial or in parallel; the only differences are a communication call at the end of each \code{evolve} step and a check to ensure that the ghost triangles are not used in the time stepping calculations.

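In practice the main loop on each processor therefore looks just like the serial one. The following is a rough sketch only, with illustrative \code{yieldstep} and \code{finaltime} values; see the listings in Chapter \ref{chap:code} for the authoritative versions.
\begin{verbatim}
# Sketch only: the evolve loop is unchanged from the serial case; the
# ghost triangle communication takes place at the end of each step.
for t in domain.evolve(yieldstep = 0.1, finaltime = 3.0):
    if myid == 0:            # report progress from one processor only
        domain.write_time()
\end{verbatim}
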
\section{Some Example Code}

Chapter \ref{chap:code} gives full listings of some example codes.

The first example in Section \ref{subsec:codeRPA} solves the advection equation on a
rectangular mesh. A rectangular mesh is highly structured so a coordinate based decomposition can be used and the partitioning is simply done by calling the
routine \code{parallel_rectangle} as shown below.
\begin{verbatim}
#######################
# Partition the mesh
#######################

# Build a unit mesh, subdivide it over numprocs processors with each
# submesh containing M*N nodes

points, vertices, boundary, full_send_dict, ghost_recv_dict = \
    parallel_rectangle(N, M, len1_g=1.0)
\end{verbatim}

Most simulations will not be done on a rectangular mesh, and the approach to subpartitioning the mesh is different to the one described above; however, this example may be of interest to those who want to measure the parallel efficiency of the code on their machine. A rectangular mesh should give a good load balance and is therefore an important first test problem.


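The values returned by \code{parallel_rectangle} feed straight into the parallel domain constructor. The following is a sketch only, assuming the \code{Parallel_Domain} class used in {\tt run_parallel_advection.py}; refer to the listing in Section \ref{subsec:codeRPA} for the exact arguments.
\begin{verbatim}
# Sketch only: build the parallel domain from the partition returned
# by parallel_rectangle.
domain = Parallel_Domain(points, vertices, boundary,
                         full_send_dict = full_send_dict,
                         ghost_recv_dict = ghost_recv_dict)
\end{verbatim}
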
A more \lq real life\rq\ mesh is the Merimbula mesh used in the code shown in Section \ref{subsec:codeRPMM}. This example also solves the advection equation. In this case the techniques described in Section \ref{sec:part} must be used to partition the mesh. Figure \ref{fig:code} shows the part of the code that is responsible for spreading the mesh over the processors. We now look at the code in detail.

\begin{figure}[htbp]
\begin{verbatim}
if myid == 0:

    # Read in the test files

    filename = 'merimbula_10785.tsh'

    mesh_full = pmesh_to_domain_instance(filename, Advection_Domain)
    mesh_full.set_quantity('stage', Set_Stage(756000.0,756500.0,4.0))

    # Subdivide the mesh

    nodes, triangles, boundary, triangles_per_proc, quantities = \
            pmesh_divide_metis(mesh_full, numprocs)

    # Build the mesh that should be assigned to each processor.
    # This includes ghost nodes and the communication pattern

    submesh = build_submesh(nodes, triangles, boundary, quantities, \
                            triangles_per_proc)

    # Send the mesh partition to the appropriate processor

    for p in range(1, numprocs):
      send_submesh(submesh, triangles_per_proc, p)

    # Build the local mesh for processor 0

    points, vertices, boundary, quantities, ghost_recv_dict, full_send_dict = \
              extract_hostmesh(submesh, triangles_per_proc)

else:
    # Read in the mesh partition that belongs to this
    # processor (note that the information is in the
    # correct form for the ANUGA data structure)

    points, vertices, boundary, quantities, ghost_recv_dict, full_send_dict = \
             rec_submesh(0)
\end{verbatim}
  \caption{A section of code taken from {\tt run_parallel_merimbula_metis.py} (Section \protect \ref{subsec:codeRPMM}) showing how to subdivide the mesh.}
 \label{fig:code}
\end{figure}
\newpage
\begin{itemize}
\item
These first few lines of code read in and define the (global) mesh. The \code{Set_Stage} function sets the initial conditions. See the code in Section \ref{subsec:codeRPMM} for the definition of \code{Set_Stage}.
\begin{verbatim}
    filename = 'merimbula_10785.tsh'
    mesh_full = pmesh_to_domain_instance(filename, Advection_Domain)
    mesh_full.set_quantity('stage', Set_Stage(756000.0,756500.0,4.0))
\end{verbatim}

\item \code{pmesh_divide_metis} divides the mesh into a set of non-overlapping subdomains as described in Section \ref{sec:part1}.
\begin{verbatim}
    nodes, triangles, boundary, triangles_per_proc, quantities = \
            pmesh_divide_metis(mesh_full, numprocs)
\end{verbatim}

\item The next step is to build a boundary layer of ghost triangles and define the communication pattern. This step is implemented by \code{build_submesh} as discussed in Section \ref{sec:part2}. The \code{submesh} variable contains a copy of the submesh for each processor.
\begin{verbatim}
    submesh = build_submesh(nodes, triangles, boundary, quantities, \
                            triangles_per_proc)
\end{verbatim}

\item The actual parallel communication starts when the submesh partitions are sent to the processors by calling \code{send_submesh}.
\begin{verbatim}
    for p in range(1, numprocs):
      send_submesh(submesh, triangles_per_proc, p)
\end{verbatim}


Each processor receives its subpartition by calling \code{rec_submesh}. The \code{rec_submesh} routine also calls \code{build_local_mesh}. The \code{build_local_mesh} routine described in Section \ref{sec:part4} ensures that the information is stored in a way that is compatible with the Domain data structure. This means, for example, that the triangles and nodes must be numbered consecutively starting from 0.

\begin{verbatim}
    points, vertices, boundary, quantities, ghost_recv_dict, full_send_dict = \
             rec_submesh(0)
\end{verbatim}

Note that the submesh is not received by, or sent to, Processor 0. Rather, \code{extract_hostmesh(submesh, triangles_per_proc)} simply extracts the mesh that has been assigned to Processor 0. Recall that \code{submesh} contains the list of submeshes to be assigned to each processor. This is described further in Section \ref{sec:part3}.
%The \code{build_local_mesh} renumbers the nodes
\begin{verbatim}
    points, vertices, boundary, quantities, ghost_recv_dict, full_send_dict = \
              extract_hostmesh(submesh, triangles_per_proc)
\end{verbatim}

\end{itemize}

\section{Running the Code}
\subsection{Compiling Pymetis and Metis}
Unlike the rest of ANUGA, Metis and its Python wrapper Pymetis are not built
by the \verb|compile_all.py| script. A makefile is provided to automate the
build process. Change to the \verb|ga/inundation/pymetis/| directory
and ensure that the subdirectory \verb|metis-4.0| exists and contains an
unmodified Metis 4.0 source tree. Under most varieties of Linux, build the
module by running \verb|make|. Under x86\_64 versions of Linux, build the
module by running \verb|make COPTIONS="-fPIC"|. Under Windows, build the
module by running \verb|make for_win32|. After the build completes, verify
that the module works by running the supplied PyUnit test case with
\verb|python test_metis.py|.
\subsection{Running the Job}
Communication between nodes running in parallel is performed by pypar, which
requires the following:
\begin{itemize}
\item Python 2.0 or later
\item Numeric Python (including RandomArray) matching the Python installation
\item Native MPI C library
\item Native C compiler
\end{itemize}
Jobs are started by running the appropriate commands for the local MPI
installation. Due to variations in MPI environments, specific details
regarding MPI commands are beyond the scope of this document. It is likely
that parallel jobs will need to be scheduled through some kind of queuing
system. Sample job scripts are available for adaptation in Section
\ref{sec:codeSJ}. They should be easily adaptable to any queuing system
derived from PBS, such as TORQUE.