\chapter{Performance Analysis}

To evaluate the performance of the code on a parallel machine we ran some examples on a cluster of four nodes connected with PathScale InfiniPath HTX.
Each node has two AMD Opteron 275 processors (dual-core, 2.2 GHz) and 4 GB of main memory. The system achieves 60 Gigaflops with the Linpack benchmark, which is about 85\% of peak performance.

For each test run we evaluate the parallel efficiency as
\[
E_n = \frac{T_1}{nT_n} \times 100,
\]
where $T_n = \max_{0\le i < n}\{t_i\}$, $n$ is the total number of processors (submeshes) and $t_i$ is the time required to run the {\tt evolve} code on processor $i$. Note that $t_i$ does not include the time required to build and subpartition the mesh; it only includes the time required to do the evolve calculations (e.g. \code{domain.evolve(yieldstep = 0.1, finaltime = 3.0)}).
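
For example, using the timings reported in Table \ref{tbl:rpa40}, the two processor run gives $E_2 = 100 \times 36.61/(2 \times 18.76) \approx 98\%$. The efficiency calculation itself is simple; the following is only a minimal Python sketch (the single processor time $T_1$ and the per-processor times $t_i$ must of course be measured separately for each run):

\begin{verbatim}
def parallel_efficiency(T1, t):
    # T1 is the single processor time and t is the list of
    # per-processor evolve times t_i from an n processor run.
    n = len(t)
    Tn = max(t)             # T_n = max over all processors
    return 100.0*T1/(n*Tn)  # E_n as a percentage
\end{verbatim}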

\section{Advection, Rectangular Domain}

The first example is the rectangular domain problem given in Section \ref{subsec:codeRPA}, except that the final time was changed to 1.0 (\code{domain.evolve(yieldstep = 0.1, finaltime = 1.0)}).

For this particular example we can control the mesh size by changing the parameters \code{N} and \code{M} given in the following section of code taken from Section \ref{subsec:codeRPA}.

\begin{verbatim}
#######################
# Partition the mesh
#######################
N = 20
M = 20

# Build a unit mesh, subdivide it over numprocs processors with each
# submesh containing M*N nodes

points, vertices, boundary, full_send_dict, ghost_recv_dict =  \
    parallel_rectangle(N, M, len1_g=1.0)
\end{verbatim}
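
The timings $t_i$ reported in the tables below measure only the evolve call on each processor. The following is a minimal sketch of how such a measurement can be wrapped around the example above; the construction of \code{domain} from the submesh data follows Section \ref{subsec:codeRPA} and is elided, and \code{pypar} is assumed to be the module that supplies the processor rank.

\begin{verbatim}
import time
import pypar        # assumed to supply the processor rank

# ... build the advection domain from points, vertices, boundary,
#     full_send_dict and ghost_recv_dict as in the listing above ...

t0 = time.time()
domain.evolve(yieldstep = 0.1, finaltime = 1.0)
t_i = time.time() - t0       # evolve time t_i on this processor

print 'Processor %d: evolve time %.2f sec' % (pypar.rank(), t_i)
\end{verbatim}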

Tables \ref{tbl:rpa40}, \ref{tbl:rpa80} and \ref{tbl:rpa160} show the efficiency results for different values of \code{N} and \code{M}. The examples with $n \le 4$ were run on one Opteron node containing 4 processors, while the $n = 8$ example was run on 2 nodes (giving a total of 8 processors). Communication within a node is faster than communication across nodes, so we would expect to see a decrease in efficiency when going from 4 to 8 processors. Furthermore, as \code{N} and \code{M} are increased the ratio of exterior to interior triangles decreases, which in turn decreases the amount of communication relative to the amount of computation, and thus the efficiency should increase.
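
To make the last point concrete (a rough scaling argument, not a measurement): the total number of triangles grows like $NM$, whereas the number of exterior triangles along the submesh boundaries only grows like $N+M$, so the ratio of communication to computation behaves roughly like
\[
\frac{N+M}{NM} \rightarrow 0 \quad \mbox{as } N, M \rightarrow \infty .
\]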

The efficiency results shown here are competitive: even on 8 processors the efficiency remains above 70\% for the smallest mesh and reaches 90\% for the largest.

\begin{table}
\caption{Parallel Efficiency Results for the Advection Problem on a Rectangular Domain with {\tt N} = 40, {\tt M} = 40.\label{tbl:rpa40}}
\begin{center}
\begin{tabular}{|c|c c|}\hline
$n$ & $T_n$ (sec) & $E_n$ (\%) \\\hline
1 & 36.61 & \\
2 & 18.76 & 98 \\
4 & 10.16 & 90 \\
8 & 6.39 & 72 \\\hline
\end{tabular}
\end{center}
\end{table}

\begin{table}
\caption{Parallel Efficiency Results for the Advection Problem on a Rectangular Domain with {\tt N} = 80, {\tt M} = 80.\label{tbl:rpa80}}
\begin{center}
\begin{tabular}{|c|c c|}\hline
$n$ & $T_n$ (sec) & $E_n$ (\%) \\\hline
1 & 282.18 & \\
2 & 143.14 & 99 \\
4 & 75.06 & 94 \\
8 & 41.67 & 85 \\\hline
\end{tabular}
\end{center}
\end{table}

\begin{table}
\caption{Parallel Efficiency Results for the Advection Problem on a Rectangular Domain with {\tt N} = 160, {\tt M} = 160.\label{tbl:rpa160}}
\begin{center}
\begin{tabular}{|c|c c|}\hline
$n$ & $T_n$ (sec) & $E_n$ (\%) \\\hline
1 & 2200.35 & \\
2 & 1126.35 & 97 \\
4 & 569.49 & 97 \\
8 & 304.43 & 90 \\\hline
\end{tabular}
\end{center}
\end{table}

%Another way of measuring the performance of the code on a parallel machine is to increase the problem size as the number of processors is increased so that the number of triangles per processor remains roughly the same. We have not carried out measurements of this kind as we usually have static grids and it is not possible to increase the number of triangles.

\section{Advection, Merimbula Mesh}

We now look at another advection example, except this time the mesh comes from the Merimbula test problem. That is, we ran the code given in Section
\ref{subsec:codeRPMM}, except the final time was reduced to 10000
(\code{finaltime = 10000}). The results are given in Table \ref{tbl:rpm}.
These are good efficiency results, especially considering the structure of the
Merimbula mesh.
%Note that since we are solving an advection problem the amount of calculation
%done on each triangle is relatively low; when we move to other problems that
%involve more calculations we would expect the computation to communication ratio to increase and thus get an increase in efficiency.

\begin{table}
\caption{Parallel Efficiency Results for the Advection Problem on the
  Merimbula Mesh.\label{tbl:rpm}}
\begin{center}
\begin{tabular}{|c|c c|}\hline
$n$ & $T_n$ (sec) & $E_n$ (\%) \\\hline
1 & 145.17 & \\
2 & 77.52 & 94 \\
4 & 41.24 & 88 \\
8 & 22.96 & 79 \\\hline
\end{tabular}
\end{center}
\end{table}

\section{Shallow Water, Merimbula Mesh}

The final example we looked at is the shallow water equation on the
Merimbula mesh. We used the code listed in Section \ref{subsec:codeRPSMM}. The
results are listed in Table \ref{tbl:rpsm}. The efficiency results are not as
good as initially expected, so we profiled the code and found that
the problem lies with the \code{update_boundary} routine in the {\tt domain.py}
file. On one processor the \code{update_boundary} routine accounts for about
72\% of the total computation time and unfortunately it is difficult to
parallelise this routine. When Metis subpartitions the mesh it is possible
that one processor will only get a few boundary edges (some may not get any)
while another processor may contain a relatively large number of boundary
edges. The profiler indicated that when running the problem on 8 processors,
Processor 0 spent about 3.8 times as long on the \code{update_boundary}
calculations as Processor 7. This load imbalance reduced the parallel
efficiency.
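
As an illustration of how such a hotspot can be located, the evolve call can be wrapped in Python's standard profiler. This is only a sketch, not the exact procedure used for the measurements above; the \code{yieldstep} and \code{finaltime} arguments are placeholders for the actual run parameters.

\begin{verbatim}
import profile, pstats

# Profile only the evolve phase on this processor.
profile.run('domain.evolve(yieldstep = 0.1, finaltime = 10000)',
            'evolve.prof')

# List the routines that dominate the cumulative time,
# e.g. update_boundary in domain.py.
stats = pstats.Stats('evolve.prof')
stats.sort_stats('cumulative').print_stats(10)
\end{verbatim}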

Before running the shallow water equation calculations on a larger number of
processors we recommend that the \code{update_boundary} calculations be
optimised as much as possible to reduce the effect of the load imbalance.

\begin{table}
\caption{Parallel Efficiency Results for the Shallow Water Equation on the
  Merimbula Mesh.\label{tbl:rpsm}}
\begin{center}
\begin{tabular}{|c|c c|}\hline
$n$ & $T_n$ (sec) & $E_n$ (\%) \\\hline
1 & 7.04 & \\
2 & 3.62 & 97 \\
4 & 1.94 & 91 \\
8 & 1.15 & 77 \\\hline
\end{tabular}
\end{center}
\end{table}