Changeset 3168
- Timestamp: Jun 16, 2006, 10:43:28 AM
- Location: inundation/parallel/documentation
- Files: 2 edited
inundation/parallel/documentation/code/RunParallelAdvection.py
r2909 → r3168

  processor_name = pypar.Get_processor_name()

- N = 20
- M = 20

  #######################
  # Partition the mesh
  #######################
+ N = 20
+ M = 20

  # Build a unit mesh, subdivide it over numproces processors with each
inundation/parallel/documentation/results.tex
r2849 → r3168

  \chapter{Performance Analysis}

- Examination of the parallel performance of the ANUGA system has been
- done using the APAC's Linux Cluster (LC). Running MPI jobs with python
- through the batch system requires specific settings as follows:
- \begin{enumerate}
- \item Comment out all module load statements from .bashrc, .cshrc,
- .profile and .login.
- \item In the script for submission through PBS, add the lines:
- \begin{verbatim}
- module load python
- module load lam
- \end{verbatim}
- \item (Bash specific; translate if using csh) add export LD\_ASSUME\_KERNEL=2.4.1
- and export PYTHONPATH=(relevant directory).
- \item Instead of the call to lrun, use:
- \begin{verbatim}
- lamboot
- mpirun -np <number of processors> -x LD_LIBRARY_PATH,PYTHONPATH
- ,LD_ASSUME_KERNEL <python filename>
- lamhalt
- \end{verbatim}
- The line beginning with `mpirun' and ending with `$<$python filename$>$'
- should be on a single line.
- \item Ensure that you have \verb|-l other=mpi| in the PBS directives.
- \item Ensure that pypar 1.9.2 is accessible. At the time of writing,
- it was in the SVN repository under the directory ga/inundation/pypar\_dist.
- Move it to the directory pypar, so that this pypar directory is
- accessible from the PYTHONPATH.
- \end{enumerate}
- The section of code that was profiled was the evolve function, which
- handles progression of the simulation in time.

- The functions that have been examined were the ones that took the
- largest amount of time to execute. Efficiency values were calculated
- as follows:
+ To evaluate the performance of the code on a parallel machine we ran some
+ examples on a cluster of four nodes connected with PathScale InfiniPath HTX.
+ Each node has two AMD Opteron 275 processors (dual core, 2.2 GHz) and 4 GB
+ of main memory. The system achieves 60 Gigaflops with the Linpack benchmark,
+ which is about 85\% of peak performance.
+
+ For each test run we evaluated the parallel efficiency
  \[
- E = \frac{T_1}{n \times T_{n,1}} \times 100\%
+ E_n = \frac{T_1}{n T_n} \times 100,
  \]
- where $T_1$ is the time for the single processor case, $n$ is the number
- of processors and $T_{n,1}$ is the time for the first processor of the
- $n$ cpu run. Results were generated using an 8 processor test run,
- compared against the 1 processor test run. The test script used was the
- run\_parallel\_sw\_merimbula\_metis.py script in svn/ga/inundation/parallel.\\
+ where $T_n = \max_{0 \le i < n}\{t_i\}$, $n$ is the total number of
+ processors (subdomains) and $t_i$ is the time required to run the
+ {\tt evolve} code on processor $i$. Note that $t_i$ does not include the
+ time required to build and subpartition the mesh etc.; it only includes
+ the time required to do the evolve calculations (e.g.
+ \code{domain.evolve(yieldstep = 0.1, finaltime = 3.0)}).

- \begin{tabular}{l c}
- Function name & Parallel efficiency\\
- \hline
- evolve (shallow\_water.py) & 79\%\\
- update\_boundary (domain.py) & 52\%\\
- distribute\_to\_vertices\_and\_edges (shallow\_water.py) & 140.6\%\\
- evaluate (shallow\_water.py) & 57.8\%\\
- compute\_fluxes (shallow\_water.py) & 192.2\%\\
- \end{tabular}\\
+ \section{Advection, Rectangular Domain}

- While evolve was the main function of interest in the profiling efforts,
- the other functions listed above displayed parallel efficiencies that
- warranted further attention.
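The quoted Linpack figure can be cross-checked against the hardware
description in the added text above. Assuming the usual two floating-point
operations per clock cycle for each Opteron core (an assumption of ours; the
changeset does not state it), the peak rate of the system (4 nodes, 2
processors per node, 2 cores per processor, 2.2 GHz) and the resulting
fraction are

    \[
    R_{\rm peak} = 4 \times 2 \times 2 \times 2.2\,\mbox{GHz} \times
    2\,\mbox{flops/cycle} = 70.4\ \mbox{Gigaflops},
    \qquad
    \frac{60}{70.4} \approx 0.85,
    \]

which is consistent with the stated figure of about 85% of peak.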
+ The first example looked at the rectangular domain example given in
+ Section \ref{sec:codeRPA}, except that we changed \code{finaltime} to 1.0
+ (\code{domain.evolve(yieldstep = 0.1, finaltime = 1.0)}).

- \subsection{update\_boundary}
- Whenever the domain is split for distribution, an additional layer of
- triangles is added along the divide. These `ghost' triangles hold read-only
- data from another processor's section of the domain. The use of a ghost
- layer is common practice and is designed to reduce the communication
- frequency. Some edges of these ghost triangles will contain boundary
- objects which get updated during the call to update\_boundary. These are
- redundant calculations, as information from the update will be overwritten
- during the next communication call. Increasing the number of processors
- increases the ratio of such ghost triangles and the amount of time wasted
- updating the boundaries of the ghost triangles. The low parallel efficiency
- of this function has resulted in efforts towards optimisation, by tagging
- ghost nodes so they are never updated by update\_boundary calls.
- \subsection{distribute\_to\_vertices\_and\_edges}
- Initially, the unusual speedup was thought to be a load balance issue, but
- examination of the domain decomposition demonstrated that this was not the
- case. All processors were showing this unusual speedup. The profiler was
- then suspected, due to the fact that C code cannot be properly profiled by
- the python profiler. The speedup was confirmed by timing using the hotshot
- profiler and calls to time.time().
+ For this particular example we can control the mesh size by changing the
+ parameters \code{N} and \code{M} given in the following section of code
+ taken from Section \ref{sec:codeRPA}.

- The main function called by distribute\_to\_vertices\_and\_edges is
- balance\_deep\_and\_shallow\_c and was profiled in the same way, showing
- the same speedup.
+ \begin{verbatim}
+ #######################
+ # Partition the mesh
+ #######################
+ N = 20
+ M = 20

- The call to balance\_deep\_and\_shallow\_c was replaced by a call to the
- equivalent python function, and timed using calls to time.time(). The
- total CPU time elapsed increased slightly as processors increased, which
- is expected. The algorithm part of the C version of
- balance\_deep\_and\_shallow was then timed using calls to gettimeofday(),
- with the following results:\\
+ # Build a unit mesh, subdivide it over numproces processors with each
+ # submesh containing M*N nodes

- \begin{tabular}{c c}
- Number of processors & Total CPU time\\
- \hline
- 1 & 0.482695999788\\
- 2 & 0.409433000023\\
- 4 & 0.4197840002601\\
- 8 & 0.4275599997492\\
- \end{tabular}\\
- For two processors and above, the arrays used by balance\_deep\_and\_shallow
- fit into the L2 cache of an LC compute node. The cache effect is therefore
- responsible for the speed increase from one to two processors.
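The removed text above confirms its timings "using the hotshot profiler and
calls to time.time()". A small stand-alone sketch of that workflow is given
below; it is ours for illustration only, stand_in_kernel is a placeholder
rather than an ANUGA routine, and it assumes Python 2, since the hotshot
module was removed in Python 3.

    # Time a routine two ways: wall-clock via time.time() and via hotshot.
    import time
    import hotshot
    import hotshot.stats

    def stand_in_kernel(n=200000):
        # Placeholder for a routine such as balance_deep_and_shallow.
        s = 0.0
        for i in range(n):
            s = s + i * 0.5
        return s

    # 1. Wall-clock timing around the call.
    t0 = time.time()
    stand_in_kernel()
    print('time.time(): %.4f seconds' % (time.time() - t0))

    # 2. Profile the same call with hotshot and show the top entries.
    profiler = hotshot.Profile('kernel.prof')
    profiler.runcall(stand_in_kernel)
    profiler.close()
    stats = hotshot.stats.load('kernel.prof')
    stats.sort_stats('time', 'calls')
    stats.print_stats(5)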
- \subsubsection{h\_limiter}
- h\_limiter is a function called by balance\_deep\_and\_shallow. Timing the
- internals of the C version of balance\_deep\_and\_shallow results in the
- call to h\_limiter not being correctly timed. Timing h\_limiter from the
- python side resulted in unusual ($>$100\%) parallel efficiency, and timing
- the algorithm part of the C version of h\_limiter (i.e. not boilerplate
- code required for the Python-C interface, such as calls to
- PyArg\_ParseTuple) displayed expected slowdowns. Therefore, it is presumed
- that this unusual speedup is due to how python passes information to C
- functions.
- \subsection{evaluate}
- Evaluate is an overridable method for different boundary classes
- (transmissive, reflective, etc.). It contains relatively simple code, which
- has limited scope for optimisation. The low parallel efficiency of evaluate
- is due to it being called from update\_boundary, and to the number of
- boundary objects increasing as the number of CPUs increases. Therefore,
- the parallel efficiency here should improve when update\_boundary is
- corrected.
- \subsection{compute\_fluxes}
- Like balance\_deep\_and\_shallow and h\_limiter, compute\_fluxes calls a C
- function. Therefore, the unusual parallel efficiency has the same cause as
- balance\_deep\_and\_shallow and h\_limiter, presumed to be the way python
- calls code in C extensions.
- \subsection{Conclusions}
- The unusual speedup exhibited by compute\_fluxes, balance\_deep\_and\_shallow
- and h\_limiter has been traced to how the Python runtime passes information
- to C routines. In future work we plan to investigate this further to verify
- our conclusions.
+ points, vertices, boundary, full_send_dict, ghost_recv_dict = \
+     parallel_rectangle(N, M, len1_g=1.0)
+ \end{verbatim}

+ Tables \ref{tbl:rpa40}, \ref{tbl:rpa80} and \ref{tbl:rpa160} show the
+ efficiency results for different values of \code{N} and \code{M}. The
+ examples where $n \le 4$ were run on one node containing 4 processors; the
+ $n = 8$ example was run on 2 nodes. The communication within a node is
+ faster than the communication across nodes, so we would expect to see a
+ decrease in efficiency when we go from 4 to 8 processors. Furthermore, as
+ \code{N} and \code{M} are increased the ratio of ghost triangles to other
+ triangles decreases; another way to think of it is that the relative size
+ of the boundary layer decreases. Hence the amount of communication relative
+ to the amount of computation decreases and thus the efficiency should
+ increase.

+ The efficiency results shown here are competitive.
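The argument above, that the ghost-to-owned triangle ratio shrinks as N and M
grow, can be made concrete with a back-of-envelope estimate. The sketch below
is ours: it assumes the rectangle is cut into p strips with a ghost layer one
triangle deep on each internal interface, which is only an approximation of
what parallel_rectangle actually produces, so the numbers indicate the trend
rather than exact ratios.

    # Rough estimate (ours) of the ghost-to-owned triangle ratio for one
    # strip of an N x M rectangular mesh split into p strips.
    def ghost_ratio(N, M, p):
        owned = 2.0 * (N - 1) * (M - 1) / p    # triangles owned by one strip
        sides = min(2, p - 1)                  # interior strips have 2 ghost sides
        ghost = sides * 2.0 * (M - 1)          # one-triangle-deep ghost layer
        return ghost / owned

    for size in (20, 40, 80, 160):
        row = ['p=%d: %.2f' % (p, ghost_ratio(size, size, p)) for p in (2, 4, 8)]
        print('N = M = %3d   %s' % (size, '   '.join(row)))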
+ \begin{table}
+ \caption{Parallel Efficiency Results for the Advection Problem on a
+ Rectangular Domain with {\tt N} = 40, {\tt M} = 40.\label{tbl:rpa40}}
+ \begin{center}
+ \begin{tabular}{|c|c c|}\hline
+ $n$ & $T_n$ (sec) & $E_n$ (\%) \\\hline
+ 1 & 36.61 & \\
+ 2 & 18.76 & 98 \\
+ 4 & 10.16 & 90 \\
+ 8 & 6.39 & 72 \\\hline
+ \end{tabular}
+ \end{center}
+ \end{table}

+ \begin{table}
+ \caption{Parallel Efficiency Results for the Advection Problem on a
+ Rectangular Domain with {\tt N} = 80, {\tt M} = 80.\label{tbl:rpa80}}
+ \begin{center}
+ \begin{tabular}{|c|c c|}\hline
+ $n$ & $T_n$ (sec) & $E_n$ (\%) \\\hline
+ 1 & 282.18 & \\
+ 2 & 143.14 & 99 \\
+ 4 & 75.06 & 94 \\
+ 8 & 41.67 & 85 \\\hline
+ \end{tabular}
+ \end{center}
+ \end{table}

+ \begin{table}
+ \caption{Parallel Efficiency Results for the Advection Problem on a
+ Rectangular Domain with {\tt N} = 160, {\tt M} = 160.\label{tbl:rpa160}}
+ \begin{center}
+ \begin{tabular}{|c|c c|}\hline
+ $n$ & $T_n$ (sec) & $E_n$ (\%) \\\hline
+ 1 & 2200.35 & \\
+ 2 & 1126.35 & 97 \\
+ 4 & 569.49 & 97 \\
+ 8 & 304.43 & 90 \\\hline
+ \end{tabular}
+ \end{center}
+ \end{table}

+ Another way of measuring the performance of the code on a parallel machine
+ is to increase the problem size as the number of processors is increased,
+ so that the number of triangles per processor remains roughly the same. We
+ have not carried out measurements of this kind, as we usually have static
+ grids and it is not possible to increase the number of triangles.
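The E_n columns in the three tables follow directly from the tabulated
timings via the efficiency formula given earlier in the added text. A small
script of ours, with T_1 and T_n copied from the tables, reproduces them to
within about one percentage point of the tabulated whole numbers:

    # Reproduce the E_n column of the three tables from the quoted timings,
    # using E_n = T_1 / (n * T_n) * 100.
    runs = [
        ('N = M = 40',  36.61,   [(2, 18.76),   (4, 10.16),  (8, 6.39)]),
        ('N = M = 80',  282.18,  [(2, 143.14),  (4, 75.06),  (8, 41.67)]),
        ('N = M = 160', 2200.35, [(2, 1126.35), (4, 569.49), (8, 304.43)]),
    ]
    for label, t1, timings in runs:
        cells = ['n=%d: %.1f%%' % (n, 100.0 * t1 / (n * tn)) for n, tn in timings]
        print('%-12s %s' % (label, '   '.join(cells)))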