% source: inundation/parallel/documentation/results.tex @ 2697
% Last change: r2697, checked in by linda
\chapter{Performance Analysis}
The parallel performance of the ANUGA system was examined using the
APAC's Linux Cluster (LC). Running MPI jobs with Python through the
batch system requires the following settings:
\begin{enumerate}
\item Comment out all module load statements from .bashrc, .cshrc,
  .profile and .login.
\item In the script for submission through PBS, add the lines:
  \begin{verbatim}
    module load python
    module load lam
  \end{verbatim}
\item (Bash specific; translate if using csh) add export
  LD\_ASSUME\_KERNEL=2.4.1 and export PYTHONPATH=(relevant directory).
\item Instead of the call to lrun, use:
  \begin{verbatim}
    lamboot
    mpirun -np <number of processors> -x LD_LIBRARY_PATH,PYTHONPATH
    ,LD_ASSUME_KERNEL <python filename>
    lamhalt
  \end{verbatim}
  The line beginning with `mpirun' and ending with `$<$python
  filename$>$' should be on a single line.
\item Ensure that you have \verb|-l other=mpi| in the PBS directives.
\item Ensure that pypar 1.9.2 is accessible. At the time of writing,
  it was in the SVN repository under the directory
  ga/inundation/pypar\_dist. Move it to the directory pypar, so that
  this pypar directory is accessible from the PYTHONPATH.
\end{enumerate}
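Putting the steps above together, a complete submission script might
look like the following sketch. The walltime request, processor count
and PYTHONPATH value are placeholders to be adapted (the original
instructions leave the relevant directory unspecified); the script
name is the test script used later in this chapter.

```shell
#!/bin/bash
# Hypothetical PBS submission script combining the settings above.
# Resource requests and paths are placeholders -- adapt as required.
#PBS -l other=mpi
#PBS -l walltime=01:00:00

module load python
module load lam

export LD_ASSUME_KERNEL=2.4.1
export PYTHONPATH=$HOME/ga/inundation/parallel   # (relevant directory)

cd $PBS_O_WORKDIR
lamboot
mpirun -np 8 -x LD_LIBRARY_PATH,PYTHONPATH,LD_ASSUME_KERNEL run_parallel_sw_merimbula_metis.py
lamhalt
```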
The section of code profiled was the evolve function, which handles
the progression of the simulation in time.

The functions examined were those that took the largest share of the
execution time. Efficiency values were calculated as follows:
\[
E = \frac{T_1}{n \times T_{n,1}} \times 100\%
\]
where $T_1$ is the time for the single-processor case, $n$ is the
number of processors and $T_{n,1}$ is the time taken by the first
processor of the $n$-CPU run. Results were generated from an
8-processor test run compared against the 1-processor test run. The
test script used was run\_parallel\_sw\_merimbula\_metis.py in
svn/ga/inundation/parallel.\\
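As a worked example of the formula above, the following sketch (the
function name is ours, not part of ANUGA) computes the efficiency from
a pair of timings:

```python
def parallel_efficiency(t_1, t_n_1, n):
    """Parallel efficiency E = T_1 / (n * T_{n,1}) * 100%.

    t_1:   time for the single-processor run
    t_n_1: time on the first processor of the n-CPU run
    n:     number of processors
    """
    return t_1 / (n * t_n_1) * 100.0

# If each processor of an 8-CPU run takes one eighth of the serial
# time, the efficiency is an ideal 100%:
print(parallel_efficiency(80.0, 10.0, 8))  # -> 100.0
```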
\begin{tabular}{l c}
Function (file) & Parallel efficiency\\
\hline
evolve (shallow\_water.py) & 79\%\\
update\_boundary (domain.py) & 52\%\\
distribute\_to\_vertices\_and\_edges (shallow\_water.py) & 140.6\%\\
evaluate (shallow\_water.py) & 57.8\%\\
compute\_fluxes (shallow\_water.py) & 192.2\%\\
\end{tabular}\\

While evolve was the main function of interest in the profiling
efforts, the other functions listed above displayed parallel
efficiencies that warranted further attention.

\subsection{update\_boundary}
Whenever the domain is split for distribution, an additional layer of
triangles is added along each divide. These `ghost' triangles hold
read-only data from another processor's section of the domain. The use
of a ghost layer is common practice and is designed to reduce the
frequency of communication. Some edges of these ghost triangles
contain boundary objects that are updated during the call to
update\_boundary. These calculations are redundant, since the updated
information is overwritten during the next communication
call. Increasing the number of processors increases the proportion of
ghost triangles and hence the amount of time wasted updating their
boundaries. The low parallel efficiency of this function has prompted
optimisation efforts: ghost nodes are tagged so that they are never
updated by update\_boundary calls.
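The proposed optimisation can be sketched as follows; the class and
attribute names here are hypothetical stand-ins, not ANUGA's actual
API. Boundary objects attached to ghost triangles carry a tag, and
update\_boundary skips them:

```python
class Boundary:
    """Hypothetical stand-in for an ANUGA boundary object."""
    def __init__(self, is_ghost=False):
        self.is_ghost = is_ghost   # tag set when the domain is distributed
        self.updated = False

    def evaluate(self):
        self.updated = True

def update_boundary(boundaries):
    # Skip boundaries on ghost triangles: any values computed for them
    # would be overwritten by the next communication call anyway.
    for b in boundaries:
        if b.is_ghost:
            continue
        b.evaluate()

boundaries = [Boundary(), Boundary(is_ghost=True), Boundary()]
update_boundary(boundaries)
print([b.updated for b in boundaries])  # -> [True, False, True]
```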
\subsection{distribute\_to\_vertices\_and\_edges}
Initially, the unusual speedup was thought to be a load-balance issue,
but examination of the domain decomposition showed that this was not
the case: all processors exhibited the same unusual speedup. The
profiler itself was then suspected, since C code cannot be properly
profiled by the Python profiler. However, the speedup was confirmed by
timing with the hotshot profiler and with calls to time.time().

The main function called by distribute\_to\_vertices\_and\_edges is
balance\_deep\_and\_shallow\_c; it was profiled in the same way and
showed the same speedup.

The call to balance\_deep\_and\_shallow\_c was replaced by a call to
the equivalent Python function and timed using calls to time.time().
The total CPU time increased slightly as the number of processors
increased, which is expected. The algorithmic part of the C version of
balance\_deep\_and\_shallow was then timed using calls to
gettimeofday(), with the following results:\\

\begin{tabular}{c c}
  Number of processors & Total CPU time (s)\\
  \hline
  1 & 0.482695999788\\
  2 & 0.409433000023\\
  4 & 0.4197840002601\\
  8 & 0.4275599997492\\
\end{tabular}\\
For two processors and above, the arrays used by
balance\_deep\_and\_shallow fit into the L2 cache of an LC compute
node. This cache effect is therefore responsible for the speed
increase from one to two processors.
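The Python-side timings referred to above were obtained with calls to
time.time(); a minimal wrapper of the kind used (our sketch, not the
actual test harness) is:

```python
import time

def wall_time(func, *args):
    """Return (result, elapsed wall-clock seconds) for func(*args).

    time.time() measures wall-clock time, so per-processor results
    include communication waits as well as computation.
    """
    start = time.time()
    result = func(*args)
    return result, time.time() - start

result, elapsed = wall_time(sum, range(1000))
```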
\subsubsection{h\_limiter}
h\_limiter is a function called by balance\_deep\_and\_shallow.
Timing the internals of the C version of balance\_deep\_and\_shallow
does not correctly time the call to h\_limiter. Timing h\_limiter
from the Python side showed unusual ($>$100\%) parallel efficiency,
while timing the algorithmic part of the C version of h\_limiter
(i.e.\ excluding the boilerplate required for the Python--C interface,
such as calls to PyArg\_ParseTuple) showed the expected slowdown. It
is therefore presumed that the unusual speedup arises from the way
Python passes information to C functions.
\subsection{evaluate}
evaluate is a method overridden by the different boundary classes
(transmissive, reflective, etc.). It contains relatively simple code
with limited scope for optimisation. Its low parallel efficiency is
due to it being called from update\_boundary, and the number of
boundary objects grows as the number of CPUs increases. The parallel
efficiency here should therefore improve once update\_boundary is
corrected.
\subsection{compute\_fluxes}
Like balance\_deep\_and\_shallow and h\_limiter, compute\_fluxes calls
a C function. Its unusual parallel efficiency therefore has the same
presumed cause as that of balance\_deep\_and\_shallow and h\_limiter:
the way Python calls code in C extensions.
\subsection{Conclusions}
The unusual speedup exhibited by compute\_fluxes,
balance\_deep\_and\_shallow and h\_limiter has been traced to the way
the Python runtime passes information to C routines. In future work we
plan to investigate this further to verify our conclusions.