\chapter{Performance Analysis}
Examination of the parallel performance of the ANUGA system has been
carried out using the APAC's Linux Cluster (LC). Running MPI jobs with python
through the batch system requires specific settings as follows:
\begin{enumerate}
\item Comment out all module load statements from .bashrc, .cshrc,
.profile and .login.
\item In the script for submission through PBS, add the lines:
\begin{verbatim}
module load python
module load lam
\end{verbatim}
\item (Bash specific; translate if using csh) add export LD\_ASSUME\_KERNEL=2.4.1
and export PYTHONPATH=(relevant directory).
\item Instead of the call to lrun, use:
\begin{verbatim}
lamboot
mpirun -np <number of processors> -x LD_LIBRARY_PATH,PYTHONPATH
,LD_ASSUME_KERNEL <python filename>
lamhalt
\end{verbatim}
The line beginning with `mpirun' and ending with `$<$python
filename$>$' should be on a single line.
\item Ensure that you have \verb|-l other=mpi| in the PBS directives.
\item Ensure that pypar 1.9.2 is accessible. At the time of writing,
it was in the SVN repository under the directory
ga/inundation/pypar\_dist. Move it to the directory pypar, so that
this pypar directory is accessible from the PYTHONPATH.
\end{enumerate}
The section of code that was profiled was the evolve function, which
handles progression of the simulation in time.

The functions examined were the ones that took the largest amount of
time to execute. Efficiency values were calculated as follows:
\[
E = \frac{T_1}{n \times T_{n,1}} \times 100\%
\]
where $T_1$ is the time for the single processor case, $n$ is the
number of processors and $T_{n,1}$ is the time on the first processor
of the $n$ cpu run. Results were generated using an 8 processor test
run, compared against the 1 processor test run. The test script used
was run\_parallel\_sw\_merimbula\_metis.py in
svn/ga/inundation/parallel.\\
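The efficiency formula above can be expressed directly in python; the function name and the sample values below are illustrative, not taken from the test runs:

```python
def parallel_efficiency(t_serial, t_first_proc, n):
    """Parallel efficiency E = T_1 / (n * T_{n,1}) * 100%.

    t_serial     -- time for the single-processor run (T_1)
    t_first_proc -- time on the first processor of the n-cpu run (T_{n,1})
    n            -- number of processors
    """
    return t_serial / (n * t_first_proc) * 100.0

# A hypothetical serial run of 80 s against 12.5 s per processor on
# 8 cpus gives an efficiency of 80%.
efficiency = parallel_efficiency(80.0, 12.5, 8)
```

A value above 100\% indicates superlinear speedup, which is what several of the functions below exhibit.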

\begin{tabular}{l c}
Function name & Parallel efficiency\\
\hline
evolve (shallow\_water.py) & 79\%\\
update\_boundary (domain.py) & 52\%\\
distribute\_to\_vertices\_and\_edges (shallow\_water.py) & 140.6\%\\
evaluate (shallow\_water.py) & 57.8\%\\
compute\_fluxes (shallow\_water.py) & 192.2\%\\
\end{tabular}\\

While evolve was the main function of interest in the profiling
efforts, the other functions listed above displayed parallel
efficiencies that warranted further attention.

\subsection{update\_boundary}
Whenever the domain is split for distribution, an additional layer of
triangles is added along the divide. These `ghost' triangles hold
read-only data from another processor's section of the domain. The use
of a ghost layer is common practice and is designed to reduce the
communication frequency. Some edges of these ghost triangles contain
boundary objects which get updated during the call to
update\_boundary. These are redundant calculations, as information
from the update will be overwritten during the next communication
call. Increasing the number of processors increases the proportion of
ghost triangles and therefore the amount of time wasted updating the
boundaries of the ghost triangles. The low parallel efficiency of this
function has prompted optimisation efforts: tagging ghost nodes so
they are never updated by update\_boundary calls.
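The proposed tagging optimisation can be sketched as follows. This is a simplified illustration only; the names update\_boundary, MockBoundary and ghost\_ids are hypothetical and do not reflect ANUGA's actual API:

```python
class MockBoundary:
    """Stand-in for a boundary object; counts how often it is evaluated."""
    def __init__(self):
        self.evaluations = 0

    def evaluate(self):
        self.evaluations += 1

def update_boundary(boundary_objects, ghost_ids):
    """Update every boundary object except those tagged as ghosts.

    boundary_objects -- dict mapping triangle id -> boundary object
    ghost_ids        -- set of triangle ids in the ghost layer
    """
    for tri_id, boundary in boundary_objects.items():
        if tri_id in ghost_ids:
            # Ghost values are overwritten at the next communication
            # call, so evaluating them here would be wasted work.
            continue
        boundary.evaluate()
```

With the ghost triangles skipped, the per-call cost no longer grows with the size of the ghost layer as the processor count increases.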
\subsection{distribute\_to\_vertices\_and\_edges}
Initially, the unusual speedup was thought to be a load balance issue,
but examination of the domain decomposition showed that this was not
the case: all processors exhibited the unusual speedup. The profiler
was then suspected, because C code cannot be properly profiled by the
python profiler. The speedup was confirmed, however, by timing with
the hotshot profiler and with calls to time.time().
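Cross-checking the profiler against wall-clock time needs only a small helper along these lines (a sketch; the function timed is not part of ANUGA):

```python
import time

def timed(func, *args, **kwargs):
    """Call func once and return (result, elapsed wall-clock seconds)."""
    start = time.time()
    result = func(*args, **kwargs)
    return result, time.time() - start
```

For example, \verb|total, elapsed = timed(sum, range(1000))| times one call to sum. Because time.time() measures wall-clock time rather than instrumented bytecode, it is immune to the profiler's blind spot around C extension code.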

The main function called by distribute\_to\_vertices\_and\_edges is
balance\_deep\_and\_shallow\_c; it was profiled in the same way and
showed the same speedup.

The call to balance\_deep\_and\_shallow\_c was replaced by a call to
the equivalent python function and timed using calls to time.time().
The total CPU time elapsed increased slightly as the number of
processors increased, which is expected. The algorithm part of the C
version of balance\_deep\_and\_shallow was then timed using calls to
gettimeofday(), with the following results:\\

\begin{tabular}{c c}
Number of processors & Total CPU time (s)\\
\hline
1 & 0.482695999788\\
2 & 0.409433000023\\
4 & 0.4197840002601\\
8 & 0.4275599997492\\
\end{tabular}\\

For two processors and above, the arrays used by
balance\_deep\_and\_shallow fit into the L2 cache of an LC compute
node. This cache effect is therefore responsible for the speed
increase from one to two processors.
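The cache argument can be checked with a back-of-the-envelope estimate. The sketch below assumes double-precision values and three quantities stored at three vertices per triangle; the mesh size and the exact per-triangle storage are illustrative assumptions, not ANUGA's actual figures:

```python
def working_set_bytes(n_triangles, n_quantities=3, n_vertices=3):
    """Rough per-processor array size; doubles are 8 bytes each."""
    return n_triangles * n_quantities * n_vertices * 8

# Splitting the mesh over p processors divides the working set by
# roughly p, so arrays that overflow the L2 cache on one cpu can fit
# once the domain is distributed over two or more cpus.
size_1cpu = working_set_bytes(100000)       # 7.2 MB on one processor
size_2cpu = working_set_bytes(100000 // 2)  # 3.6 MB per processor
```

If the L2 cache sits between these two sizes, the two-processor run avoids main-memory traffic that the serial run pays for, which is consistent with the drop in CPU time from one to two processors above.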
\subsubsection{h\_limiter}
h\_limiter is a function called by balance\_deep\_and\_shallow.
Timing the internals of the C version of balance\_deep\_and\_shallow
does not correctly time the call to h\_limiter. Timing h\_limiter from
the python side resulted in unusual ($>$100\%) parallel efficiency,
while timing the algorithm part of the C version of h\_limiter
(i.e.\ excluding the boilerplate code required for the Python-C
interface, such as calls to PyArg\_ParseTuple) displayed the expected
slowdowns. It is therefore presumed that this unusual speedup is due
to how python passes information to C functions.
\subsection{evaluate}
Evaluate is an overridable method for the different boundary classes
(transmissive, reflective, etc.). It contains relatively simple code,
which has limited scope for optimisation. The low parallel efficiency
of evaluate arises because it is called from update\_boundary, and the
number of boundary objects increases with the number of CPUs.
Therefore, the parallel efficiency here should improve once
update\_boundary is corrected.
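The overridable-method structure described above can be illustrated schematically. This is a toy single-value sketch, not ANUGA's actual class definitions; a real reflective boundary reverses only the momentum component normal to the edge:

```python
class Boundary:
    """Base class; each boundary type overrides evaluate."""
    def evaluate(self, interior_value):
        raise NotImplementedError

class TransmissiveBoundary(Boundary):
    """Lets the interior state pass through the boundary unchanged."""
    def evaluate(self, interior_value):
        return interior_value

class ReflectiveBoundary(Boundary):
    """Reflects the interior state back across the boundary edge
    (here reduced to flipping the sign of a single value)."""
    def evaluate(self, interior_value):
        return -interior_value
```

update\_boundary calls evaluate once per boundary object, so its cost scales with the number of boundary objects, including the redundant ghost-layer ones noted earlier.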
\subsection{compute\_fluxes}
Like balance\_deep\_and\_shallow and h\_limiter, compute\_fluxes calls
a C function. The unusual parallel efficiency therefore has the same
presumed cause as in balance\_deep\_and\_shallow and h\_limiter: the
way python calls code in C extensions.
\subsection{Conclusions}
The unusual speedup exhibited by compute\_fluxes,
balance\_deep\_and\_shallow and h\_limiter has been traced to how the
python runtime passes information to C routines. In future work we
plan to investigate this further to verify our conclusions.