Changeset 3168


Timestamp: Jun 16, 2006, 10:43:28 AM (19 years ago)
Author: linda
Message: Added the efficiency results for advection on a rectangular domain

Location: inundation/parallel/documentation
Files: 2 edited

Legend: unmarked lines are unchanged; lines prefixed with + were added, lines prefixed with - were removed.
  • inundation/parallel/documentation/code/RunParallelAdvection.py

r2909 → r3168

 processor_name = pypar.Get_processor_name()

-N = 20
-M = 20
-
 #######################
 # Partition the mesh
 #######################
+N = 20
+M = 20

 # Build a unit mesh, subdivide it over numprocs processors with each
  • inundation/parallel/documentation/results.tex

r2849 → r3168

 \chapter{Performance Analysis}
-Examination of the parallel performance of the ANUGA system has been
-done using the APAC's Linux Cluster (LC). Running MPI jobs with Python
-through the batch system requires specific settings, as follows:
-\begin{enumerate}
-\item Comment out all module load statements from .bashrc, .cshrc,
-  .profile and .login.
-\item In the script for submission through PBS, add the lines:
-  \begin{verbatim}
-    module load python
-    module load lam
-  \end{verbatim}
-\item (Bash specific; translate if using csh) add export LD\_ASSUME\_KERNEL=2.4.1
-  and export PYTHONPATH=(relevant directory).
-\item Instead of the call to lrun, use:
-  \begin{verbatim}
-    lamboot
-    mpirun -np <number of processors> -x LD_LIBRARY_PATH,PYTHONPATH
-    ,LD_ASSUME_KERNEL <python filename>
-    lamhalt
-  \end{verbatim}
-  The line beginning with `mpirun' and ending with `$<$python
-  filename$>$' should be on a single line.
-\item Ensure that you have \verb|-l other=mpi| in the PBS directives.
-\item Ensure that pypar 1.9.2 is accessible. At the time of writing,
-  it was in the SVN repository under the directory
-  ga/inundation/pypar\_dist. Move it to the directory pypar, so that
-  this pypar directory is accessible from the PYTHONPATH.
-\end{enumerate}
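Taken together, the steps in the list above amount to a single submission script along the following lines. This is only a sketch: the processor count, paths and any further PBS resource directives are placeholders, and only the commands named in the list are used.

\begin{verbatim}
#!/bin/bash
#PBS -l other=mpi

module load python
module load lam

export LD_ASSUME_KERNEL=2.4.1
export PYTHONPATH=<relevant directory>

lamboot
mpirun -np <number of processors> -x LD_LIBRARY_PATH,PYTHONPATH,LD_ASSUME_KERNEL <python filename>
lamhalt
\end{verbatim}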
-The section of code that was profiled was the evolve function, which
-handles progression of the simulation in time.

-The functions examined were those that took the largest amount of
-time to execute. Efficiency values were calculated as follows:
+
+To evaluate the performance of the code on a parallel machine we ran
+some examples on a cluster of four nodes connected with PathScale
+InfiniPath HTX. Each node has two AMD Opteron 275 processors
+(dual-core, 2.2 GHz) and 4 GB of main memory. The system achieves 60
+Gigaflops with the Linpack benchmark, which is about 85\% of peak
+performance.
+
+For each test run we evaluated the parallel efficiency
 \[
-E = \frac{T_1}{n \times T_{n,1}} \times 100\%
+E_n = \frac{T_1}{n T_n} \times 100,
 \]
-where $T_1$ is the time for the single-processor case, $n$ is the
-number of processors and $T_{n,1}$ is the time for the first processor
-of the $n$-CPU run. Results were generated using an 8-processor test
-run, compared against the 1-processor test run. The test script used
-was run\_parallel\_sw\_merimbula\_metis.py in svn/ga/inundation/parallel.\\
+where $T_n = \max_{0 \le i < n}\{t_i\}$, $n$ is the total number of
+processors (subdomains) and $t_i$ is the time required to run the
+{\tt evolve} code on processor $i$. Note that $t_i$ does not include
+the time required to build and subpartition the mesh etc.; it only
+includes the time required to do the evolve calculations
+(e.g.\ \code{domain.evolve(yieldstep = 0.1, finaltime = 3.0)}).
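As an illustration of this definition, the following minimal sketch gathers the per-processor evolve times with pypar and applies the formula. It assumes a configured \code{domain} and a single-processor time already stored in \code{T1}; neither is constructed here.

\begin{verbatim}
import time
import pypar

t0 = time.time()
domain.evolve(yieldstep = 0.1, finaltime = 3.0)
t_i = time.time() - t0                # evolve time on this processor

if pypar.rank() == 0:
    times = [t_i]
    for p in range(1, pypar.size()):  # collect t_i from the others
        times.append(pypar.receive(p))
    T_n = max(times)                  # T_n = max_i t_i
    E_n = T1/(pypar.size()*T_n)*100   # T1 from a 1-processor run (assumed)
    print 'T_n = %.2f s  E_n = %.0f%%' % (T_n, E_n)
else:
    pypar.send(t_i, 0)
\end{verbatim}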

-\begin{tabular}{l c}
-Function name & Parallel efficiency\\
-\hline
-evolve (shallow\_water.py) & 79\%\\
-update\_boundary (domain.py) & 52\%\\
-distribute\_to\_vertices\_and\_edges (shallow\_water.py) & 140.6\%\\
-evaluate (shallow\_water.py) & 57.8\%\\
-compute\_fluxes (shallow\_water.py) & 192.2\%\\
-\end{tabular}\\
+\section{Advection, Rectangular Domain}

-While evolve was the main function of interest in the profiling
-efforts, the other functions listed above displayed parallel
-efficiencies that warranted further attention.
+The first example is the rectangular domain example given in
+Section \ref{sec:codeRPA}, except that we changed \code{finaltime}
+to 1.0 (\code{domain.evolve(yieldstep = 0.1, finaltime = 1.0)}).

-\subsection{update\_boundary}
-Whenever the domain is split for distribution, an additional layer of
-triangles is added along the divide. These `ghost' triangles hold
-read-only data from another processor's section of the domain. The
-use of a ghost layer is common practice and is designed to reduce the
-communication frequency. Some edges of these ghost triangles will
-contain boundary objects which get updated during the call to
-update\_boundary. These are redundant calculations, as information
-from the update will be overwritten during the next communication
-call. Increasing the number of processors increases the ratio of such
-ghost triangles and the amount of time wasted updating the boundaries
-of the ghost triangles. The low parallel efficiency of this function
-has led to optimisation efforts: tagging ghost nodes so that they are
-never updated by update\_boundary calls.
-\subsection{distribute\_to\_vertices\_and\_edges}
-Initially, the unusual speedup was thought to be a load-balance
-issue, but examination of the domain decomposition demonstrated that
-this was not the case: all processors showed the unusual speedup. The
-profiler was then suspected, because C code cannot be properly
-profiled by the Python profiler. The speedup was confirmed by timing
-with the hotshot profiler and with calls to time.time().
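The cross-check described above can be sketched as follows. This is a hypothetical illustration, not the original test harness; \code{f} is a stand-in for the function under test.

\begin{verbatim}
import time, hotshot, hotshot.stats

def f():                       # stand-in for the function under test
    sum(range(1000000))

prof = hotshot.Profile('f.prof')
prof.runcall(f)                # profiler-based timing
prof.close()
hotshot.stats.load('f.prof').print_stats()

t0 = time.time()
f()                            # plain wall-clock timing for comparison
print 'time.time(): %.4f s' % (time.time() - t0)
\end{verbatim}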
+For this particular example we can control the mesh size by changing
+the parameters \code{N} and \code{M} given in the following section
+of code, taken from Section \ref{sec:codeRPA}.

-The main function called by distribute\_to\_vertices\_and\_edges is
-balance\_deep\_and\_shallow\_c; it was profiled in the same way and
-showed the same speedup.
+\begin{verbatim}
+#######################
+# Partition the mesh
+#######################
+N = 20
+M = 20

-The call to balance\_deep\_and\_shallow\_c was replaced by a call to
-the equivalent Python function and timed using calls to time.time().
-The total CPU time elapsed increased slightly as the number of
-processors increased, which is expected. The algorithm part of the C
-version of balance\_deep\_and\_shallow was then timed using calls to
-gettimeofday(), with the following results:\\
+# Build a unit mesh, subdivide it over numprocs processors with each
+# submesh containing M*N nodes

-\begin{tabular}{c c}
-  Number of processors & Total CPU time (sec)\\
-  \hline
-  1 & 0.4827\\
-  2 & 0.4094\\
-  4 & 0.4198\\
-  8 & 0.4276\\
-\end{tabular}\\
-For two processors and above, the arrays used by
-balance\_deep\_and\_shallow fit into the L2 cache of an LC compute
-node. This cache effect is therefore responsible for the speed
-increase from one to two processors.
-\subsubsection{h\_limiter}
-h\_limiter is a function called by balance\_deep\_and\_shallow.
-Timing the internals of the C version of balance\_deep\_and\_shallow
-results in the call to h\_limiter not being correctly timed. Timing
-h\_limiter from the Python side resulted in unusual ($>$100\%)
-parallel efficiency, while timing the algorithm part of the C version
-of h\_limiter (i.e.\ not the boilerplate code required for the
-Python--C interface, such as calls to PyArg\_ParseTuple) displayed
-the expected slowdowns. Therefore, it is presumed that this unusual
-speedup is due to how Python passes information to C functions.
-\subsection{evaluate}
-evaluate is an overridable method for the different boundary classes
-(transmissive, reflective, etc.). It contains relatively simple code,
-which has limited scope for optimisation. The low parallel efficiency
-of evaluate is due to it being called from update\_boundary, and the
-number of boundary objects increases with the number of CPUs.
-Therefore, the parallel efficiency here should improve when
-update\_boundary is corrected.
-\subsection{compute\_fluxes}
-Like balance\_deep\_and\_shallow and h\_limiter, compute\_fluxes
-calls a C function. Therefore, the unusual parallel efficiency has
-the same cause as for balance\_deep\_and\_shallow and h\_limiter,
-presumed to be the way Python calls code in C extensions.
-\subsection{Conclusions}
-The unusual speedup exhibited by compute\_fluxes,
-balance\_deep\_and\_shallow and h\_limiter has been traced to how the
-Python runtime passes information to C routines. In future work we
-plan to investigate this further to verify our conclusions.
+points, vertices, boundary, full_send_dict, ghost_recv_dict = \
+    parallel_rectangle(N, M, len1_g=1.0)
+\end{verbatim}
+
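To generate the different mesh sizes reported below, only \code{N} and \code{M} need to change. One hypothetical way is to take them from the command line; the import is assumed to match the original script from Section \ref{sec:codeRPA}.

\begin{verbatim}
import sys

# Assumed to match the original script's import of parallel_rectangle.
from parallel_meshes import parallel_rectangle

# Mesh size from the command line, e.g. 40, 80 or 160.
N = M = int(sys.argv[1])

points, vertices, boundary, full_send_dict, ghost_recv_dict = \
    parallel_rectangle(N, M, len1_g=1.0)
\end{verbatim}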
+Tables \ref{tbl:rpa40}, \ref{tbl:rpa80} and \ref{tbl:rpa160} show the
+efficiency results for different values of \code{N} and \code{M}. The
+examples where $n \le 4$ were run on one node containing 4
+processors; the $n = 8$ example was run on 2 nodes. Communication
+within a node is faster than communication across nodes, so we expect
+a decrease in efficiency when we jump from 4 to 8 processors.
+Furthermore, as \code{N} and \code{M} are increased, the ratio of
+ghost triangles to other triangles decreases; in other words, the
+relative size of the boundary layer decreases. Hence the amount of
+communication relative to the amount of computation decreases, and
+thus the efficiency should increase.
+
+The efficiency results shown here are competitive.
+
+\begin{table}
+\caption{Parallel Efficiency Results for the Advection Problem on a
+Rectangular Domain with {\tt N} = 40, {\tt M} = 40.\label{tbl:rpa40}}
+\begin{center}
+\begin{tabular}{|c|c c|}\hline
+$n$ & $T_n$ (sec) & $E_n$ (\%) \\\hline
+1 & 36.61 & 100 \\
+2 & 18.76 & 98 \\
+4 & 10.16 & 90 \\
+8 & 6.39 & 72 \\\hline
+\end{tabular}
+\end{center}
+\end{table}
+
+\begin{table}
+\caption{Parallel Efficiency Results for the Advection Problem on a
+Rectangular Domain with {\tt N} = 80, {\tt M} = 80.\label{tbl:rpa80}}
+\begin{center}
+\begin{tabular}{|c|c c|}\hline
+$n$ & $T_n$ (sec) & $E_n$ (\%) \\\hline
+1 & 282.18 & 100 \\
+2 & 143.14 & 99 \\
+4 & 75.06 & 94 \\
+8 & 41.67 & 85 \\\hline
+\end{tabular}
+\end{center}
+\end{table}
+
+\begin{table}
+\caption{Parallel Efficiency Results for the Advection Problem on a
+Rectangular Domain with {\tt N} = 160, {\tt M} = 160.\label{tbl:rpa160}}
+\begin{center}
+\begin{tabular}{|c|c c|}\hline
+$n$ & $T_n$ (sec) & $E_n$ (\%) \\\hline
+1 & 2200.35 & 100 \\
+2 & 1126.35 & 97 \\
+4 & 569.49 & 97 \\
+8 & 304.43 & 90 \\\hline
+\end{tabular}
+\end{center}
+\end{table}
+
+Another way of measuring the performance of the code on a parallel
+machine is to increase the problem size as the number of processors
+is increased, so that the number of triangles per processor remains
+roughly the same. We have not carried out measurements of this kind,
+since we usually have static grids and it is not possible to simply
+increase the number of triangles.
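For such a scaled (weak-scaling) experiment, the textbook definition of efficiency, noted here only for reference and not measured in these runs, is
\[
E_n^{\mathrm{scaled}} = \frac{T_1}{T_n} \times 100,
\]
where $T_n$ is the time to solve a problem $n$ times larger than the single-processor problem on $n$ processors.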