\chapter{Performance Analysis}

The parallel performance of the ANUGA system was examined using the APAC Linux Cluster (LC).
Running MPI jobs with Python through the batch system requires the following settings:
\begin{enumerate}
\item Comment out all module load statements from .bashrc, .cshrc, .profile and .login.
\item In the script for submission through PBS, add the lines:
\begin{verbatim}
module load python
module load lam
\end{verbatim}
\item (Bash specific; translate if using csh) add \verb|export LD_ASSUME_KERNEL=2.4.1|
and \verb|export PYTHONPATH=|(relevant directory).
\item Instead of the call to lrun, use:
\begin{verbatim}
lamboot
mpirun -np <number of processors> -x LD_LIBRARY_PATH,PYTHONPATH,LD_ASSUME_KERNEL <python filename>
lamhalt
\end{verbatim}
The line beginning with `mpirun' and ending with `$<$python filename$>$' should be on a single line.
\item Ensure that you have \verb|-l other=mpi| in the PBS directives.
\item Ensure that pypar 1.9.2 is accessible. At the time of writing, it was in the SVN
repository under the directory ga/inundation/pypar\_dist. Move it to the directory pypar,
so that this pypar directory is accessible from the PYTHONPATH.
\end{enumerate}

The section of code that was profiled was the evolve function, which handles the
progression of the simulation in time. The functions examined were those that took the
largest amount of time to execute. Efficiency values were calculated as
\[
E = \frac{T_1}{n \times T_{n,1}} \times 100\%,
\]
where $T_1$ is the time for the single-processor case, $n$ is the number of processors
and $T_{n,1}$ is the time taken by the first processor of the $n$-CPU run. Results were
generated from an 8-processor test run, compared against the 1-processor test run. The
test script used was run\_parallel\_sw\_merimbula\_metis.py in svn/ga/inundation/parallel.\\

\begin{tabular}{l c}
Function name & Parallel efficiency\\
\hline
evolve (shallow\_water.py) & 79\%\\
update\_boundary (domain.py) & 52\%\\
distribute\_to\_vertices\_and\_edges (shallow\_water.py) & 140.6\%\\
evaluate (shallow\_water.py) & 57.8\%\\
compute\_fluxes (shallow\_water.py) & 192.2\%\\
\end{tabular}\\

While evolve was the main function of interest in the profiling effort, the other
functions listed above displayed parallel efficiencies that warranted further attention.

\subsection{update\_boundary}
Whenever the domain is split for distribution, an additional layer of triangles is added
along each divide. These `ghost' triangles hold read-only data from another processor's
section of the domain. The use of a ghost layer is common practice and is designed to
reduce the communication frequency. Some edges of these ghost triangles have boundary
objects attached, which get updated during the call to update\_boundary. These are
redundant calculations, as the information from the update will be overwritten during the
next communication call. Increasing the number of processors increases the proportion of
ghost triangles and hence the amount of time wasted updating their boundaries. The low
parallel efficiency of this function has prompted an optimisation effort: ghost nodes are
tagged so that they are never updated by update\_boundary calls.
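A minimal sketch of this tagging idea is given below. The names used here (a
\verb|ghost_segments| set, a \verb|boundary_map| attribute and the setter method) are
hypothetical and do not reflect ANUGA's actual data structures; the sketch only
illustrates how boundary segments flagged as ghosts could be skipped inside
update\_boundary so that no redundant evaluate calls are made.
\begin{verbatim}
# Illustrative sketch only: skip boundary segments that lie in the ghost
# layer, so update_boundary never computes values that the next
# communication call will overwrite anyway.  The attribute and method
# names are placeholders, not ANUGA's real API.

def update_boundary(domain):
    """Update boundary values, ignoring segments tagged as ghosts."""
    for (vol_id, edge_id), boundary_object in domain.boundary_map.items():
        # Ghost triangles hold read-only copies of another processor's
        # data; evaluating their boundary objects is wasted work.
        if (vol_id, edge_id) in domain.ghost_segments:
            continue
        q = boundary_object.evaluate(vol_id, edge_id)
        domain.set_boundary_values(vol_id, edge_id, q)
\end{verbatim}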
\subsection{distribute\_to\_vertices\_and\_edges}
Initially, the unusual speedup was thought to be a load-balance issue, but examination of
the domain decomposition showed that this was not the case: all processors exhibited the
unusual speedup.

The profiler was then suspected, because C code cannot be properly profiled by the Python
profiler. The speedup was confirmed by timing with the hotshot profiler and with calls to
time.time(). The main function called by distribute\_to\_vertices\_and\_edges is
balance\_deep\_and\_shallow\_c; it was profiled in the same way and showed the same
speedup. The call to balance\_deep\_and\_shallow\_c was then replaced by a call to the
equivalent Python function and timed using calls to time.time(). The total CPU time
increased slightly as the number of processors increased, which is expected. The
algorithm part of the C version of balance\_deep\_and\_shallow was then timed using calls
to gettimeofday(), with the following results:\\

\begin{tabular}{c c}
Number of processors & Total CPU time (s)\\
\hline
1 & 0.482695999788\\
2 & 0.409433000023\\
4 & 0.4197840002601\\
8 & 0.4275599997492\\
\end{tabular}\\

For two processors and above, the arrays used by balance\_deep\_and\_shallow fit into the
L2 cache of an LC compute node. This cache effect is therefore responsible for the speed
increase from one to two processors.

\subsubsection{h\_limiter}
h\_limiter is a function called by balance\_deep\_and\_shallow. Timing the internals of
the C version of balance\_deep\_and\_shallow means that the call to h\_limiter is not
correctly timed. Timing h\_limiter from the Python side gave unusual ($>$100\%) parallel
efficiency, whereas timing the algorithm part of the C version of h\_limiter (i.e.\ 
excluding the boilerplate code required for the Python--C interface, such as calls to
PyArg\_ParseTuple) displayed the expected slowdown. It is therefore presumed that this
unusual speedup is due to how Python passes information to C functions.

\subsection{evaluate}
evaluate is an overridable method implemented by the different boundary classes
(transmissive, reflective, etc.). It contains relatively simple code, which has limited
scope for optimisation. The low parallel efficiency of evaluate arises because it is
called from update\_boundary, and the number of boundary objects increases with the
number of CPUs. The parallel efficiency here should therefore improve once
update\_boundary is corrected.

\subsection{compute\_fluxes}
Like balance\_deep\_and\_shallow and h\_limiter, compute\_fluxes calls a C function. The
unusual parallel efficiency is therefore presumed to have the same cause: the way Python
calls code in C extensions.

\subsection{Conclusions}
The unusual speedup exhibited by compute\_fluxes, balance\_deep\_and\_shallow and
h\_limiter has been traced to how the Python runtime passes information to C routines. In
future work we plan to investigate this further to verify our conclusions.
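As a rough illustration of the verification step used above, the sketch below shows one
way to time a call into a C extension from the Python side with time.time() and compare
it against the time attributed to the same call by the hotshot profiler. It assumes the
Python~2.x environment described earlier (hotshot is not available in later Python
versions), and the function passed in is a placeholder; only the timing pattern, not any
actual ANUGA interface, is intended.
\begin{verbatim}
# Illustrative sketch only: compare wall-clock timing of a call into a C
# extension with the hotshot profiler's report for the same call.  The
# profiled function is a stand-in for functions such as
# balance_deep_and_shallow_c, h_limiter or compute_fluxes.

import time
import hotshot
import hotshot.stats


def compare_timings(func, args, logfile='extension_call.prof'):
    # Wall-clock time as seen from the Python side.
    start = time.time()
    func(*args)
    wall = time.time() - start
    print 'time.time() elapsed: %.6f s' % wall

    # Time reported by the hotshot profiler for the same call.  The
    # profiler cannot see inside the C code, so a large disagreement
    # with the wall-clock figure points to a measurement artefact
    # rather than a genuine super-linear speedup.
    profiler = hotshot.Profile(logfile)
    profiler.runcall(func, *args)
    profiler.close()

    stats = hotshot.stats.load(logfile)
    stats.sort_stats('cumulative')
    stats.print_stats(5)
\end{verbatim}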