\chapter{Performance Analysis}
The parallel performance of the ANUGA system was examined on the
APAC Linux Cluster (LC). Running MPI jobs with Python through the
batch system requires the specific settings listed below; a
submission script combining them is sketched after the list.
\begin{enumerate}
\item Comment out all module load statements from .bashrc, .cshrc,
.profile and .login.
\item In the script for submission through PBS, add the lines:
\begin{verbatim}
module load python
module load lam
\end{verbatim}
\item (Bash specific; translate if using csh) add
\verb|export LD_ASSUME_KERNEL=2.4.1| and
\verb|export PYTHONPATH=|(relevant directory).
\item Instead of the call to lrun, use:
\begin{verbatim}
lamboot
mpirun -np <number of processors> -x LD_LIBRARY_PATH,PYTHONPATH
,LD_ASSUME_KERNEL <python filename>
lamhalt
\end{verbatim}
The line beginning with `mpirun' and ending with `$<$python
filename$>$' should be on a single line.
\item Ensure that you have \verb|-l other=mpi| in the PBS directives.
\item Ensure that pypar 1.9.2 is accessible. At the time of writing,
it was in the SVN repository under the directory
ga/inundation/pypar\_dist. Move it to the directory pypar, so that
this pypar directory is accessible from the PYTHONPATH.
\end{enumerate}
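The settings above can be combined into a single PBS submission
script along the following lines. This is a sketch only: the queue
name, CPU count and resource directives are placeholders that will
vary between installations, and \verb|<relevant directory>| and
\verb|<python filename>| stand for the appropriate PYTHONPATH entry
and simulation script.
\begin{verbatim}
#!/bin/bash
#PBS -q normal
#PBS -l ncpus=8
#PBS -l other=mpi

module load python
module load lam

export LD_ASSUME_KERNEL=2.4.1
export PYTHONPATH=<relevant directory>

cd $PBS_O_WORKDIR
lamboot
mpirun -np 8 -x LD_LIBRARY_PATH,PYTHONPATH,LD_ASSUME_KERNEL <python filename>
lamhalt
\end{verbatim}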
The section of code that was profiled was the evolve function, which
handles the progression of the simulation in time.

The functions examined were those that took the largest amount of
time to execute. Efficiency values were calculated as follows:
\[
E = \frac{T_1}{n \times T_{n,1}} \times 100\%
\]
where $T_1$ is the time for the single-processor case, $n$ is the
number of processors and $T_{n,1}$ is the time for the first processor
of the $n$-CPU run. Results were generated using an 8-processor test
run, compared against the 1-processor test run. The test script used
was run\_parallel\_sw\_merimbula\_metis.py in
svn/ga/inundation/parallel.\\

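As a worked illustration of this formula (the timings here are
invented for the example, not measured values), the efficiency of a
hypothetical 8-processor run could be computed as follows:
\begin{verbatim}
# Illustration only: the timing values below are invented.
def parallel_efficiency(t_serial, t_first_proc, n):
    # E = T_1 / (n * T_{n,1}) expressed as a percentage
    return t_serial / (n * t_first_proc) * 100.0

# e.g. a 100 s single-processor run against an 8-processor run whose
# first processor spends 16 s in the same function:
print parallel_efficiency(100.0, 16.0, 8)   # prints 78.125
\end{verbatim}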
\begin{tabular}{l c}
Function name & Parallel efficiency\\
\hline
evolve (shallow\_water.py) & 79\%\\
update\_boundary (domain.py) & 52\%\\
distribute\_to\_vertices\_and\_edges (shallow\_water.py) & 140.6\%\\
evaluate (shallow\_water.py) & 57.8\%\\
compute\_fluxes (shallow\_water.py) & 192.2\%\\
\end{tabular}\\

While evolve was the main function of interest in the profiling
efforts, the other functions listed above displayed parallel
efficiencies that warranted further attention.

\subsection{update\_boundary}
Whenever the domain is split for distribution, an additional layer of
triangles is added along the divide. These `ghost' triangles hold
read-only data from another processor's section of the domain. The use
of a ghost layer is common practice and is designed to reduce the
communication frequency. Some edges of these ghost triangles will
contain boundary objects which get updated during the call to
update\_boundary. These are redundant calculations, as the information
from the update will be overwritten during the next communication
call. Increasing the number of processors increases the proportion of
such ghost triangles and the amount of time wasted updating their
boundaries. The low parallel efficiency of this function has prompted
optimisation efforts: ghost nodes are tagged so that they are never
updated by update\_boundary calls.
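The approach being taken is sketched below. This is illustrative
only: names such as tagged\_as\_ghost, boundary\_objects and
set\_boundary\_values are hypothetical and do not correspond directly
to the ANUGA data structures.
\begin{verbatim}
# Sketch of skipping ghost segments during the boundary update.
# Names such as tagged_as_ghost are hypothetical.
def update_boundary(domain):
    for (vol_id, edge_id), boundary_object in domain.boundary_objects:
        # Skip segments lying on ghost triangles; their values will be
        # overwritten by the next communication call anyway.
        if domain.tagged_as_ghost[vol_id]:
            continue
        q = boundary_object.evaluate(vol_id, edge_id)
        domain.set_boundary_values(vol_id, edge_id, q)
\end{verbatim}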
\subsection{distribute\_to\_vertices\_and\_edges}
Initially, the unusual speedup was thought to be a load-balancing
issue, but examination of the domain decomposition demonstrated that
this was not the case: all processors were showing this unusual
speedup. The profiler was then suspected, because C code cannot be
properly profiled by the Python profiler. The speedup was, however,
confirmed by timing with the hotshot profiler and with calls to
time.time().
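The time.time() check amounts to bracketing the suspect call with
wall-clock timestamps, along the following lines (a sketch only; the
call shown and its arguments are illustrative):
\begin{verbatim}
# Wall-clock timing used as a cross-check on the profilers.
import time

t0 = time.time()
domain.distribute_to_vertices_and_edges()   # call being investigated
t1 = time.time()
print 'distribute_to_vertices_and_edges: %.6f seconds' % (t1 - t0)
\end{verbatim}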

The main function called by distribute\_to\_vertices\_and\_edges is
balance\_deep\_and\_shallow\_c, which was profiled in the same way and
showed the same speedup.

The call to balance\_deep\_and\_shallow\_c was replaced by a call to
the equivalent Python function and timed using calls to time.time().
The total CPU time elapsed increased slightly as the number of
processors increased, which is expected. The algorithm part of the C
version of balance\_deep\_and\_shallow was then timed using calls to
gettimeofday(), with the following results:\\

\begin{tabular}{c c}
Number of processors & Total CPU time\\
\hline
1 & 0.482695999788\\
2 & 0.409433000023\\
4 & 0.4197840002601\\
8 & 0.4275599997492\\
\end{tabular}\\
For two processors and above, the arrays used by
balance\_deep\_and\_shallow fit into the L2 cache of an LC compute
node. The cache effect is therefore responsible for the speed
increase from one to two processors.
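A rough consistency check of this explanation is to compare the
per-processor working set against the L2 cache size. The figures
below are purely hypothetical (neither the mesh size nor the cache
size of an LC node is taken from this report) and only illustrate the
form of the argument:
\begin{verbatim}
# Hypothetical numbers only: illustrates the cache-fit argument.
L2_CACHE_BYTES = 512 * 1024       # assumed L2 size of a compute node
DOUBLES_PER_TRIANGLE = 12         # assumed working set per triangle
TOTAL_TRIANGLES = 10000           # assumed mesh size

for procs in (1, 2, 4, 8):
    triangles = TOTAL_TRIANGLES / procs
    working_set = triangles * DOUBLES_PER_TRIANGLE * 8
    print procs, working_set, working_set <= L2_CACHE_BYTES
\end{verbatim}
With these made-up numbers the working set only drops below the cache
size once the mesh is split over two or more processors, which is the
behaviour described above.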
\subsubsection{h\_limiter}
h\_limiter is a function called by balance\_deep\_and\_shallow.
Timing the internals of the C version of balance\_deep\_and\_shallow
means that the call to h\_limiter is not correctly timed. Timing
h\_limiter from the Python side resulted in unusual ($>$100\%)
parallel efficiency, whereas timing the algorithm part of the C
version of h\_limiter (i.e.\ not the boilerplate code required for the
Python-C interface, such as calls to PyArg\_ParseTuple) displayed the
expected slowdowns. It is therefore presumed that this unusual speedup
is due to how Python passes information to C functions.
\subsection{evaluate}
Evaluate is an overridable method of the different boundary classes
(transmissive, reflective, etc.). It contains relatively simple code,
which has limited scope for optimisation. The low parallel efficiency
of evaluate is due to it being called from update\_boundary, and to
the number of boundary objects increasing as the number of CPUs
increases. The parallel efficiency here should therefore improve once
update\_boundary is corrected.
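To illustrate the structure involved, a schematic boundary class is
sketched below. This is a simplified illustration, not the actual
ANUGA class hierarchy; the class names and signatures are assumptions.
\begin{verbatim}
# Schematic sketch of an overridable boundary class; simplified, not
# the actual ANUGA implementation.
class Boundary:
    def evaluate(self, vol_id=None, edge_id=None):
        raise NotImplementedError('Subclasses must override evaluate')

class DirichletBoundary(Boundary):
    """Return fixed values for the conserved quantities."""
    def __init__(self, conserved_quantities):
        self.conserved_quantities = conserved_quantities

    def evaluate(self, vol_id=None, edge_id=None):
        # Called once per boundary object from update_boundary, so the
        # total cost grows with the number of boundary segments.
        return self.conserved_quantities
\end{verbatim}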
\subsection{compute\_fluxes}
Like balance\_deep\_and\_shallow and h\_limiter, compute\_fluxes calls
a C function. The unusual parallel efficiency therefore has the same
presumed cause as for balance\_deep\_and\_shallow and h\_limiter: the
way Python calls code in C extensions.
\subsection{Conclusions}
The unusual speedup exhibited by compute\_fluxes,
balance\_deep\_and\_shallow and h\_limiter has been attributed to how
the Python runtime passes information to C routines. In future work we
plan to investigate this further to verify our conclusions.
---|