\chapter{Performance Analysis}
The parallel performance of the ANUGA system was examined on the
APAC Linux Cluster (LC). Running MPI jobs with Python through the
batch system requires the specific settings listed below; a
submission script combining them is sketched after the list.
\begin{enumerate}
\item Comment out all module load statements from .bashrc, .cshrc,
.profile and .login.
\item In the script for submission through PBS, add the lines:
\begin{verbatim}
module load python
module load lam
\end{verbatim}
\item (Bash specific; translate if using csh) add
\verb|export LD_ASSUME_KERNEL=2.4.1| and
\verb|export PYTHONPATH=|(relevant directory).
\item Instead of the call to lrun, use:
\begin{verbatim}
lamboot
mpirun -np <number of processors> -x LD_LIBRARY_PATH,PYTHONPATH
,LD_ASSUME_KERNEL <python filename>
lamhalt
\end{verbatim}
The line beginning with `mpirun' and ending with `$<$python
filename$>$' should be on a single line.
\item Ensure that you have \verb|-l other=mpi| in the PBS directives.
\item Ensure that pypar 1.9.2 is accessible. At the time of writing,
it was in the SVN repository under the directory
ga/inundation/pypar\_dist. Move it to the directory pypar, so that
this pypar directory is accessible from the PYTHONPATH.
\end{enumerate}
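The settings above can be combined into a single PBS submission
script along the following lines. This is a sketch only: the queue
name, CPU count and resource directives are placeholders that will
vary between installations, and \verb|<relevant directory>| and
\verb|<python filename>| stand for the appropriate PYTHONPATH entry
and simulation script.
\begin{verbatim}
#!/bin/bash
#PBS -q normal
#PBS -l ncpus=8
#PBS -l other=mpi

module load python
module load lam

export LD_ASSUME_KERNEL=2.4.1
export PYTHONPATH=<relevant directory>

cd $PBS_O_WORKDIR
lamboot
mpirun -np 8 -x LD_LIBRARY_PATH,PYTHONPATH,LD_ASSUME_KERNEL <python filename>
lamhalt
\end{verbatim}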
The section of code that was profiled was the evolve function, which
handles the progression of the simulation in time.

The functions examined were those that took the largest amount of
time to execute. Efficiency values were calculated as follows:
\[
E = \frac{T_1}{n \times T_{n,1}} \times 100\%
\]
where $T_1$ is the time for the single-processor case, $n$ is the
number of processors and $T_{n,1}$ is the time for the first processor
of the $n$-CPU run. Results were generated using an 8-processor test
run, compared against the 1-processor test run. The test script used
was run\_parallel\_sw\_merimbula\_metis.py in
svn/ga/inundation/parallel.\\

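As a worked illustration of this formula (the timings here are
invented for the example, not measured values), the efficiency of a
hypothetical 8-processor run could be computed as follows:
\begin{verbatim}
# Illustration only: the timing values below are invented.
def parallel_efficiency(t_serial, t_first_proc, n):
    # E = T_1 / (n * T_{n,1}) expressed as a percentage
    return t_serial / (n * t_first_proc) * 100.0

# e.g. a 100 s single-processor run against an 8-processor run whose
# first processor spends 16 s in the same function:
print parallel_efficiency(100.0, 16.0, 8)   # prints 78.125
\end{verbatim}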
\begin{tabular}{l c}
Function name & Parallel efficiency\\
\hline
evolve (shallow\_water.py) & 79\%\\
update\_boundary (domain.py) & 52\%\\
distribute\_to\_vertices\_and\_edges (shallow\_water.py) & 140.6\%\\
evaluate (shallow\_water.py) & 57.8\%\\
compute\_fluxes (shallow\_water.py) & 192.2\%\\
\end{tabular}\\

While evolve was the main function of interest in the profiling
efforts, the other functions listed above displayed parallel
efficiencies that warranted further attention.

\subsection{update\_boundary}
Whenever the domain is split for distribution, an additional layer of
triangles is added along the divide. These `ghost' triangles hold
read-only data from another processor's section of the domain. The use
of a ghost layer is common practice and is designed to reduce the
communication frequency. Some edges of these ghost triangles will
contain boundary objects which get updated during the call to
update\_boundary. These are redundant calculations, as the information
from the update will be overwritten during the next communication
call. Increasing the number of processors increases the proportion of
such ghost triangles and the amount of time wasted updating their
boundaries. The low parallel efficiency of this function has prompted
optimisation efforts: ghost nodes are tagged so that they are never
updated by update\_boundary calls.
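The approach being taken is sketched below. This is illustrative
only: names such as tagged\_as\_ghost, boundary\_objects and
set\_boundary\_values are hypothetical and do not correspond directly
to the ANUGA data structures.
\begin{verbatim}
# Sketch of skipping ghost segments during the boundary update.
# Names such as tagged_as_ghost are hypothetical.
def update_boundary(domain):
    for (vol_id, edge_id), boundary_object in domain.boundary_objects:
        # Skip segments lying on ghost triangles; their values will be
        # overwritten by the next communication call anyway.
        if domain.tagged_as_ghost[vol_id]:
            continue
        q = boundary_object.evaluate(vol_id, edge_id)
        domain.set_boundary_values(vol_id, edge_id, q)
\end{verbatim}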
\subsection{distribute\_to\_vertices\_and\_edges}
Initially, the unusual speedup was thought to be a load-balancing
issue, but examination of the domain decomposition demonstrated that
this was not the case: all processors were showing this unusual
speedup. The profiler was then suspected, because C code cannot be
properly profiled by the Python profiler. The speedup was, however,
confirmed by timing with the hotshot profiler and with calls to
time.time().
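The time.time() check amounts to bracketing the suspect call with
wall-clock timestamps, along the following lines (a sketch only; the
call shown and its arguments are illustrative):
\begin{verbatim}
# Wall-clock timing used as a cross-check on the profilers.
import time

t0 = time.time()
domain.distribute_to_vertices_and_edges()   # call being investigated
t1 = time.time()
print 'distribute_to_vertices_and_edges: %.6f seconds' % (t1 - t0)
\end{verbatim}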

The main function called by distribute\_to\_vertices\_and\_edges is
balance\_deep\_and\_shallow\_c, which was profiled in the same way and
showed the same speedup.

The call to balance\_deep\_and\_shallow\_c was replaced by a call to
the equivalent Python function and timed using calls to time.time().
The total CPU time elapsed increased slightly as the number of
processors increased, which is expected. The algorithm part of the C
version of balance\_deep\_and\_shallow was then timed using calls to
gettimeofday(), with the following results:\\

\begin{tabular}{c c}
Number of processors & Total CPU time\\
\hline
1 & 0.482695999788\\
2 & 0.409433000023\\
4 & 0.4197840002601\\
8 & 0.4275599997492\\
\end{tabular}\\
For two processors and above, the arrays used by
balance\_deep\_and\_shallow fit into the L2 cache of an LC compute
node. The cache effect is therefore responsible for the speed
increase from one to two processors.
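A rough consistency check of this explanation is to compare the
per-processor working set against the L2 cache size. The figures
below are purely hypothetical (neither the mesh size nor the cache
size of an LC node is taken from this report) and only illustrate the
form of the argument:
\begin{verbatim}
# Hypothetical numbers only: illustrates the cache-fit argument.
L2_CACHE_BYTES = 512 * 1024       # assumed L2 size of a compute node
DOUBLES_PER_TRIANGLE = 12         # assumed working set per triangle
TOTAL_TRIANGLES = 10000           # assumed mesh size

for procs in (1, 2, 4, 8):
    triangles = TOTAL_TRIANGLES / procs
    working_set = triangles * DOUBLES_PER_TRIANGLE * 8
    print procs, working_set, working_set <= L2_CACHE_BYTES
\end{verbatim}
With these made-up numbers the working set only drops below the cache
size once the mesh is split over two or more processors, which is the
behaviour described above.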
\subsubsection{h\_limiter}
h\_limiter is a function called by balance\_deep\_and\_shallow.
Timing the internals of the C version of balance\_deep\_and\_shallow
means that the call to h\_limiter is not correctly timed. Timing
h\_limiter from the Python side resulted in unusual ($>$100\%)
parallel efficiency, whereas timing the algorithm part of the C
version of h\_limiter (i.e.\ not the boilerplate code required for the
Python-C interface, such as calls to PyArg\_ParseTuple) displayed the
expected slowdowns. It is therefore presumed that this unusual speedup
is due to how Python passes information to C functions.
\subsection{evaluate}
Evaluate is an overridable method of the different boundary classes
(transmissive, reflective, etc.). It contains relatively simple code,
which has limited scope for optimisation. The low parallel efficiency
of evaluate is due to it being called from update\_boundary, and to
the number of boundary objects increasing as the number of CPUs
increases. The parallel efficiency here should therefore improve once
update\_boundary is corrected.
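To illustrate the structure involved, a schematic boundary class is
sketched below. This is a simplified illustration, not the actual
ANUGA class hierarchy; the class names and signatures are assumptions.
\begin{verbatim}
# Schematic sketch of an overridable boundary class; simplified, not
# the actual ANUGA implementation.
class Boundary:
    def evaluate(self, vol_id=None, edge_id=None):
        raise NotImplementedError('Subclasses must override evaluate')

class DirichletBoundary(Boundary):
    """Return fixed values for the conserved quantities."""
    def __init__(self, conserved_quantities):
        self.conserved_quantities = conserved_quantities

    def evaluate(self, vol_id=None, edge_id=None):
        # Called once per boundary object from update_boundary, so the
        # total cost grows with the number of boundary segments.
        return self.conserved_quantities
\end{verbatim}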
\subsection{compute\_fluxes}
Like balance\_deep\_and\_shallow and h\_limiter, compute\_fluxes calls
a C function. The unusual parallel efficiency therefore has the same
presumed cause as for balance\_deep\_and\_shallow and h\_limiter: the
way Python calls code in C extensions.
\subsection{Conclusions}
The unusual speedup exhibited by compute\_fluxes,
balance\_deep\_and\_shallow and h\_limiter has been attributed to how
the Python runtime passes information to C routines. In future work we
plan to investigate this further to verify our conclusions.
---|