\chapter{Performance Analysis}

The parallel performance of the ANUGA system was examined using the APAC Linux Cluster (LC).
Running MPI jobs with Python through the batch system requires the following settings:
\begin{enumerate}
\item Comment out all module load statements from .bashrc, .cshrc, .profile and .login.
\item In the script for submission through PBS, add the lines:
\begin{verbatim}
module load python
module load lam
\end{verbatim}
\item (Bash specific; translate if using csh) add \verb|export LD_ASSUME_KERNEL=2.4.1|
and \verb|export PYTHONPATH=|(relevant directory).
\item Instead of the call to lrun, use:
\begin{verbatim}
lamboot
mpirun -np <number of processors> -x LD_LIBRARY_PATH,PYTHONPATH,LD_ASSUME_KERNEL <python filename>
lamhalt
\end{verbatim}
The line beginning with `mpirun' and ending with `$<$python filename$>$' should be on a single line.
\item Ensure that you have \verb|-l other=mpi| in the PBS directives.
\item Ensure that pypar 1.9.2 is accessible. At the time of writing, it was in the SVN
repository under the directory ga/inundation/pypar\_dist. Move it to the directory pypar,
so that this pypar directory is accessible from the PYTHONPATH.
\end{enumerate}

The section of code that was profiled was the evolve function, which handles the
progression of the simulation in time. The functions examined were those that took the
largest amount of time to execute. Efficiency values were calculated as
\[
E = \frac{T_1}{n \times T_{n,1}} \times 100\%,
\]
where $T_1$ is the time for the single-processor case, $n$ is the number of processors
and $T_{n,1}$ is the time taken by the first processor of the $n$-CPU run. Results were
generated from an 8-processor test run, compared against the 1-processor test run. The
test script used was run\_parallel\_sw\_merimbula\_metis.py in svn/ga/inundation/parallel.\\

\begin{tabular}{l c}
Function name & Parallel efficiency\\
\hline
evolve (shallow\_water.py) & 79\%\\
update\_boundary (domain.py) & 52\%\\
distribute\_to\_vertices\_and\_edges (shallow\_water.py) & 140.6\%\\
evaluate (shallow\_water.py) & 57.8\%\\
compute\_fluxes (shallow\_water.py) & 192.2\%\\
\end{tabular}\\

While evolve was the main function of interest in the profiling effort, the other
functions listed above displayed parallel efficiencies that warranted further attention.

\subsection{update\_boundary}
Whenever the domain is split for distribution, an additional layer of triangles is added
along each divide. These `ghost' triangles hold read-only data from another processor's
section of the domain. The use of a ghost layer is common practice and is designed to
reduce the communication frequency. Some edges of these ghost triangles have boundary
objects attached, which get updated during the call to update\_boundary. These are
redundant calculations, as the information from the update will be overwritten during the
next communication call. Increasing the number of processors increases the proportion of
ghost triangles and hence the amount of time wasted updating their boundaries. The low
parallel efficiency of this function has prompted an optimisation effort: ghost nodes are
tagged so that they are never updated by update\_boundary calls.
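A minimal sketch of this tagging idea is given below. The names used here (a
\verb|ghost_segments| set, a \verb|boundary_map| attribute and the setter method) are
hypothetical and do not reflect ANUGA's actual data structures; the sketch only
illustrates how boundary segments flagged as ghosts could be skipped inside
update\_boundary so that no redundant evaluate calls are made.
\begin{verbatim}
# Illustrative sketch only: skip boundary segments that lie in the ghost
# layer, so update_boundary never computes values that the next
# communication call will overwrite anyway.  The attribute and method
# names are placeholders, not ANUGA's real API.

def update_boundary(domain):
    """Update boundary values, ignoring segments tagged as ghosts."""
    for (vol_id, edge_id), boundary_object in domain.boundary_map.items():
        # Ghost triangles hold read-only copies of another processor's
        # data; evaluating their boundary objects is wasted work.
        if (vol_id, edge_id) in domain.ghost_segments:
            continue
        q = boundary_object.evaluate(vol_id, edge_id)
        domain.set_boundary_values(vol_id, edge_id, q)
\end{verbatim}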
\subsection{distribute\_to\_vertices\_and\_edges}
Initially, the unusual speedup was thought to be a load-balance issue, but examination of
the domain decomposition showed that this was not the case: all processors exhibited the
unusual speedup.

The profiler was then suspected, because C code cannot be properly profiled by the Python
profiler. The speedup was confirmed by timing with the hotshot profiler and with calls to
time.time(). The main function called by distribute\_to\_vertices\_and\_edges is
balance\_deep\_and\_shallow\_c; it was profiled in the same way and showed the same
speedup. The call to balance\_deep\_and\_shallow\_c was then replaced by a call to the
equivalent Python function and timed using calls to time.time(). The total CPU time
increased slightly as the number of processors increased, which is expected. The
algorithm part of the C version of balance\_deep\_and\_shallow was then timed using calls
to gettimeofday(), with the following results:\\

\begin{tabular}{c c}
Number of processors & Total CPU time (s)\\
\hline
1 & 0.482695999788\\
2 & 0.409433000023\\
4 & 0.4197840002601\\
8 & 0.4275599997492\\
\end{tabular}\\

For two processors and above, the arrays used by balance\_deep\_and\_shallow fit into the
L2 cache of an LC compute node. This cache effect is therefore responsible for the speed
increase from one to two processors.

\subsubsection{h\_limiter}
h\_limiter is a function called by balance\_deep\_and\_shallow. Timing the internals of
the C version of balance\_deep\_and\_shallow means that the call to h\_limiter is not
correctly timed. Timing h\_limiter from the Python side gave unusual ($>$100\%) parallel
efficiency, whereas timing the algorithm part of the C version of h\_limiter (i.e.\ 
excluding the boilerplate code required for the Python--C interface, such as calls to
PyArg\_ParseTuple) displayed the expected slowdown. It is therefore presumed that this
unusual speedup is due to how Python passes information to C functions.

\subsection{evaluate}
evaluate is an overridable method implemented by the different boundary classes
(transmissive, reflective, etc.). It contains relatively simple code, which has limited
scope for optimisation. The low parallel efficiency of evaluate arises because it is
called from update\_boundary, and the number of boundary objects increases with the
number of CPUs. The parallel efficiency here should therefore improve once
update\_boundary is corrected.

\subsection{compute\_fluxes}
Like balance\_deep\_and\_shallow and h\_limiter, compute\_fluxes calls a C function. The
unusual parallel efficiency is therefore presumed to have the same cause: the way Python
calls code in C extensions.

\subsection{Conclusions}
The unusual speedup exhibited by compute\_fluxes, balance\_deep\_and\_shallow and
h\_limiter has been traced to how the Python runtime passes information to C routines. In
future work we plan to investigate this further to verify our conclusions.
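As a rough illustration of the verification step used above, the sketch below shows one
way to time a call into a C extension from the Python side with time.time() and compare
it against the time attributed to the same call by the hotshot profiler. It assumes the
Python~2.x environment described earlier (hotshot is not available in later Python
versions), and the function passed in is a placeholder; only the timing pattern, not any
actual ANUGA interface, is intended.
\begin{verbatim}
# Illustrative sketch only: compare wall-clock timing of a call into a C
# extension with the hotshot profiler's report for the same call.  The
# profiled function is a stand-in for functions such as
# balance_deep_and_shallow_c, h_limiter or compute_fluxes.

import time
import hotshot
import hotshot.stats


def compare_timings(func, args, logfile='extension_call.prof'):
    # Wall-clock time as seen from the Python side.
    start = time.time()
    func(*args)
    wall = time.time() - start
    print 'time.time() elapsed: %.6f s' % wall

    # Time reported by the hotshot profiler for the same call.  The
    # profiler cannot see inside the C code, so a large disagreement
    # with the wall-clock figure points to a measurement artefact
    # rather than a genuine super-linear speedup.
    profiler = hotshot.Profile(logfile)
    profiler.runcall(func, *args)
    profiler.close()

    stats = hotshot.stats.load(logfile)
    stats.sort_stats('cumulative')
    stats.print_stats(5)
\end{verbatim}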