Changeset 3168
- Timestamp: Jun 16, 2006, 10:43:28 AM
- Location: inundation/parallel/documentation
- Files: 2 edited
inundation/parallel/documentation/code/RunParallelAdvection.py
r2909 → r3168

  processor_name = pypar.Get_processor_name()

- N = 20
- M = 20

  #######################
  # Partition the mesh
  #######################
+ N = 20
+ M = 20

  # Build a unit mesh, subdivide it over numproces processors with each
inundation/parallel/documentation/results.tex
r2849 → r3168

  \chapter{Performance Analysis}

- Examination of the parallel performance of the ANUGA system has been
- done using the APAC's Linux Cluster (LC). Running MPI jobs with python
- through the batch system requires specific settings as follows:
- \begin{enumerate}
- \item Comment out all module load statements from .bashrc, .cshrc,
- .profile and .login.
- \item In the script for submission through PBS, add the lines:
- \begin{verbatim}
- module load python
- module load lam
- \end{verbatim}
- \item (Bash specific; translate if using csh) add export LD\_ASSUME\_KERNEL=2.4.1
- and export PYTHONPATH=(relevant directory).
- \item Instead of the call to lrun, use:
- \begin{verbatim}
- lamboot
- mpirun -np <number of processors> -x LD_LIBRARY_PATH,PYTHONPATH
- ,LD_ASSUME_KERNEL <python filename>
- lamhalt
- \end{verbatim}
- The line beginning with `mpirun' and ending with `$<$python filename$>$'
- should be on a single line.
- \item Ensure that you have \verb|-l other=mpi| in the PBS directives.
- \item Ensure that pypar 1.9.2 is accessible. At the time of writing,
- it was in the SVN repository under the directory ga/inundation/pypar\_dist.
- Move it to the directory pypar, so that this pypar directory is
- accessible from the PYTHONPATH.
- \end{enumerate}
- The section of code that was profiled was the evolve function, which
- handles progression of the simulation in time.

- The functions that have been examined were the ones that took the
- largest amount of time to execute. Efficiency values were calculated
- as follows:
+ To evaluate the performance of the code on a parallel machine we ran some
+ examples on a cluster of four nodes connected with PathScale InfiniPath HTX.
+ Each node has two AMD Opteron 275 processors (dual core, 2.2 GHz) and 4 GB
+ of main memory. The system achieves 60 Gigaflops with the Linpack benchmark,
+ which is about 85\% of peak performance.
+
+ For each test run we evaluated the parallel efficiency
  \[
- E = \frac{T_1}{n \times T_{n,1}} \times 100\%
+ E_n = \frac{T_1}{n T_n} \times 100,
  \]
- where $T_1$ is the time for the single processor case, $n$ is the number
- of processors and $T_{n,1}$ is the time for the first processor of the
- $n$ cpu run. Results were generated using an 8 processor test run,
- compared against the 1 processor test run. The test script used was the
- run\_parallel\_sw\_merimbula\_metis.py script in svn/ga/inundation/parallel.\\
+ where $T_n = \max_{0 \le i < n}\{t_i\}$, $n$ is the total number of
+ processors (subdomains) and $t_i$ is the time required to run the
+ {\tt evolve} code on processor $i$. Note that $t_i$ does not include the
+ time required to build and subpartition the mesh etc.; it only includes
+ the time required to do the evolve calculations (e.g.
+ \code{domain.evolve(yieldstep = 0.1, finaltime = 3.0)}).

- \begin{tabular}{l c}
- Function name & Parallel efficiency\\
- \hline
- evolve (shallow\_water.py) & 79\%\\
- update\_boundary (domain.py) & 52\%\\
- distribute\_to\_vertices\_and\_edges (shallow\_water.py) & 140.6\%\\
- evaluate (shallow\_water.py) & 57.8\%\\
- compute\_fluxes (shallow\_water.py) & 192.2\%\\
- \end{tabular}\\
+ \section{Advection, Rectangular Domain}

- While evolve was the main function of interest in the profiling efforts,
- the other functions listed above displayed parallel efficiencies that
- warranted further attention.
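The quoted Linpack figure can be cross-checked against the hardware
description in the added text above. Assuming the usual two floating-point
operations per clock cycle for each Opteron core (an assumption of ours; the
changeset does not state it), the peak rate of the system (4 nodes, 2
processors per node, 2 cores per processor, 2.2 GHz) and the resulting
fraction are

    \[
    R_{\rm peak} = 4 \times 2 \times 2 \times 2.2\,\mbox{GHz} \times
    2\,\mbox{flops/cycle} = 70.4\ \mbox{Gigaflops},
    \qquad
    \frac{60}{70.4} \approx 0.85,
    \]

which is consistent with the stated figure of about 85% of peak.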
+ The first example looked at the rectangular domain example given in
+ Section \ref{sec:codeRPA}, except that we changed \code{finaltime} to 1.0
+ (\code{domain.evolve(yieldstep = 0.1, finaltime = 1.0)}).

- \subsection{update\_boundary}
- Whenever the domain is split for distribution, an additional layer of
- triangles is added along the divide. These `ghost' triangles hold read-only
- data from another processor's section of the domain. The use of a ghost
- layer is common practice and is designed to reduce the communication
- frequency. Some edges of these ghost triangles will contain boundary
- objects which get updated during the call to update\_boundary. These are
- redundant calculations, as information from the update will be overwritten
- during the next communication call. Increasing the number of processors
- increases the ratio of such ghost triangles and the amount of time wasted
- updating the boundaries of the ghost triangles. The low parallel efficiency
- of this function has resulted in efforts towards optimisation, by tagging
- ghost nodes so they are never updated by update\_boundary calls.
- \subsection{distribute\_to\_vertices\_and\_edges}
- Initially, the unusual speedup was thought to be a load balance issue, but
- examination of the domain decomposition demonstrated that this was not the
- case. All processors were showing this unusual speedup. The profiler was
- then suspected, due to the fact that C code cannot be properly profiled by
- the python profiler. The speedup was confirmed by timing using the hotshot
- profiler and calls to time.time().
+ For this particular example we can control the mesh size by changing the
+ parameters \code{N} and \code{M} given in the following section of code
+ taken from Section \ref{sec:codeRPA}.

- The main function called by distribute\_to\_vertices\_and\_edges is
- balance\_deep\_and\_shallow\_c and was profiled in the same way, showing
- the same speedup.
+ \begin{verbatim}
+ #######################
+ # Partition the mesh
+ #######################
+ N = 20
+ M = 20

- The call to balance\_deep\_and\_shallow\_c was replaced by a call to the
- equivalent python function, and timed using calls to time.time(). The
- total CPU time elapsed increased slightly as processors increased, which
- is expected. The algorithm part of the C version of
- balance\_deep\_and\_shallow was then timed using calls to gettimeofday(),
- with the following results:\\
+ # Build a unit mesh, subdivide it over numproces processors with each
+ # submesh containing M*N nodes

- \begin{tabular}{c c}
- Number of processors & Total CPU time\\
- \hline
- 1 & 0.482695999788\\
- 2 & 0.409433000023\\
- 4 & 0.4197840002601\\
- 8 & 0.4275599997492\\
- \end{tabular}\\
- For two processors and above, the arrays used by balance\_deep\_and\_shallow
- fit into the L2 cache of an LC compute node. The cache effect is therefore
- responsible for the speed increase from one to two processors.
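The removed text above confirms its timings "using the hotshot profiler and
calls to time.time()". A small stand-alone sketch of that workflow is given
below; it is ours for illustration only, stand_in_kernel is a placeholder
rather than an ANUGA routine, and it assumes Python 2, since the hotshot
module was removed in Python 3.

    # Time a routine two ways: wall-clock via time.time() and via hotshot.
    import time
    import hotshot
    import hotshot.stats

    def stand_in_kernel(n=200000):
        # Placeholder for a routine such as balance_deep_and_shallow.
        s = 0.0
        for i in range(n):
            s = s + i * 0.5
        return s

    # 1. Wall-clock timing around the call.
    t0 = time.time()
    stand_in_kernel()
    print('time.time(): %.4f seconds' % (time.time() - t0))

    # 2. Profile the same call with hotshot and show the top entries.
    profiler = hotshot.Profile('kernel.prof')
    profiler.runcall(stand_in_kernel)
    profiler.close()
    stats = hotshot.stats.load('kernel.prof')
    stats.sort_stats('time', 'calls')
    stats.print_stats(5)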
- \subsubsection{h\_limiter}
- h\_limiter is a function called by balance\_deep\_and\_shallow. Timing the
- internals of the C version of balance\_deep\_and\_shallow results in the
- call to h\_limiter not being correctly timed. Timing h\_limiter from the
- python side resulted in unusual ($>$100\%) parallel efficiency, and timing
- the algorithm part of the C version of h\_limiter (i.e. not boilerplate
- code required for the Python-C interface, such as calls to
- PyArg\_ParseTuple) displayed expected slowdowns. Therefore, it is presumed
- that this unusual speedup is due to how python passes information to C
- functions.
- \subsection{evaluate}
- Evaluate is an overridable method for different boundary classes
- (transmissive, reflective, etc.). It contains relatively simple code, which
- has limited scope for optimisation. The low parallel efficiency of evaluate
- is due to it being called from update\_boundary, and to the number of
- boundary objects increasing as the number of CPUs increases. Therefore,
- the parallel efficiency here should improve when update\_boundary is
- corrected.
- \subsection{compute\_fluxes}
- Like balance\_deep\_and\_shallow and h\_limiter, compute\_fluxes calls a C
- function. Therefore, the unusual parallel efficiency has the same cause as
- balance\_deep\_and\_shallow and h\_limiter, presumed to be the way python
- calls code in C extensions.
- \subsection{Conclusions}
- The unusual speedup exhibited by compute\_fluxes, balance\_deep\_and\_shallow
- and h\_limiter has been traced to how the Python runtime passes information
- to C routines. In future work we plan to investigate this further to verify
- our conclusions.
+ points, vertices, boundary, full_send_dict, ghost_recv_dict = \
+     parallel_rectangle(N, M, len1_g=1.0)
+ \end{verbatim}

+ Tables \ref{tbl:rpa40}, \ref{tbl:rpa80} and \ref{tbl:rpa160} show the
+ efficiency results for different values of \code{N} and \code{M}. The
+ examples where $n \le 4$ were run on one node containing 4 processors; the
+ $n = 8$ example was run on 2 nodes. The communication within a node is
+ faster than the communication across nodes, so we would expect to see a
+ decrease in efficiency when we go from 4 to 8 processors. Furthermore, as
+ \code{N} and \code{M} are increased the ratio of ghost triangles to other
+ triangles decreases; another way to think of it is that the relative size
+ of the boundary layer decreases. Hence the amount of communication relative
+ to the amount of computation decreases and thus the efficiency should
+ increase.

+ The efficiency results shown here are competitive.
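The argument above, that the ghost-to-owned triangle ratio shrinks as N and M
grow, can be made concrete with a back-of-envelope estimate. The sketch below
is ours: it assumes the rectangle is cut into p strips with a ghost layer one
triangle deep on each internal interface, which is only an approximation of
what parallel_rectangle actually produces, so the numbers indicate the trend
rather than exact ratios.

    # Rough estimate (ours) of the ghost-to-owned triangle ratio for one
    # strip of an N x M rectangular mesh split into p strips.
    def ghost_ratio(N, M, p):
        owned = 2.0 * (N - 1) * (M - 1) / p    # triangles owned by one strip
        sides = min(2, p - 1)                  # interior strips have 2 ghost sides
        ghost = sides * 2.0 * (M - 1)          # one-triangle-deep ghost layer
        return ghost / owned

    for size in (20, 40, 80, 160):
        row = ['p=%d: %.2f' % (p, ghost_ratio(size, size, p)) for p in (2, 4, 8)]
        print('N = M = %3d   %s' % (size, '   '.join(row)))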
+ \begin{table}
+ \caption{Parallel Efficiency Results for the Advection Problem on a
+ Rectangular Domain with {\tt N} = 40, {\tt M} = 40.\label{tbl:rpa40}}
+ \begin{center}
+ \begin{tabular}{|c|c c|}\hline
+ $n$ & $T_n$ (sec) & $E_n$ (\%) \\\hline
+ 1 & 36.61 & \\
+ 2 & 18.76 & 98 \\
+ 4 & 10.16 & 90 \\
+ 8 & 6.39 & 72 \\\hline
+ \end{tabular}
+ \end{center}
+ \end{table}

+ \begin{table}
+ \caption{Parallel Efficiency Results for the Advection Problem on a
+ Rectangular Domain with {\tt N} = 80, {\tt M} = 80.\label{tbl:rpa80}}
+ \begin{center}
+ \begin{tabular}{|c|c c|}\hline
+ $n$ & $T_n$ (sec) & $E_n$ (\%) \\\hline
+ 1 & 282.18 & \\
+ 2 & 143.14 & 99 \\
+ 4 & 75.06 & 94 \\
+ 8 & 41.67 & 85 \\\hline
+ \end{tabular}
+ \end{center}
+ \end{table}

+ \begin{table}
+ \caption{Parallel Efficiency Results for the Advection Problem on a
+ Rectangular Domain with {\tt N} = 160, {\tt M} = 160.\label{tbl:rpa160}}
+ \begin{center}
+ \begin{tabular}{|c|c c|}\hline
+ $n$ & $T_n$ (sec) & $E_n$ (\%) \\\hline
+ 1 & 2200.35 & \\
+ 2 & 1126.35 & 97 \\
+ 4 & 569.49 & 97 \\
+ 8 & 304.43 & 90 \\\hline
+ \end{tabular}
+ \end{center}
+ \end{table}

+ Another way of measuring the performance of the code on a parallel machine
+ is to increase the problem size as the number of processors is increased,
+ so that the number of triangles per processor remains roughly the same. We
+ have not carried out measurements of this kind, as we usually have static
+ grids and it is not possible to increase the number of triangles.
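The E_n columns in the three tables follow directly from the tabulated
timings via the efficiency formula given earlier in the added text. A small
script of ours, with T_1 and T_n copied from the tables, reproduces them to
within about one percentage point of the tabulated whole numbers:

    # Reproduce the E_n column of the three tables from the quoted timings,
    # using E_n = T_1 / (n * T_n) * 100.
    runs = [
        ('N = M = 40',  36.61,   [(2, 18.76),   (4, 10.16),  (8, 6.39)]),
        ('N = M = 80',  282.18,  [(2, 143.14),  (4, 75.06),  (8, 41.67)]),
        ('N = M = 160', 2200.35, [(2, 1126.35), (4, 569.49), (8, 304.43)]),
    ]
    for label, t1, timings in runs:
        cells = ['n=%d: %.1f%%' % (n, 100.0 * t1 / (n * tn)) for n, tn in timings]
        print('%-12s %s' % (label, '   '.join(cells)))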