\chapter{Performance Analysis}


To evaluate the performance of the code on a parallel machine we ran
some examples on a cluster of four nodes connected with PathScale
InfiniPath HTX. Each node has two AMD Opteron 275 processors
(dual-core, 2.2 GHz) and 4 GB of main memory. The system achieves 60
Gigaflops on the Linpack benchmark, which is about 85\% of peak
performance.
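
As a rough consistency check of that figure (assuming, for this sketch
only, that each Opteron core retires two double-precision floating point
operations per clock cycle; this assumption is not part of the benchmark
report above):
\[
R_{\mathrm{peak}} = 4 \mbox{ nodes} \times 4 \mbox{ cores} \times 2.2
\mbox{ GHz} \times 2 = 70.4 \mbox{ Gigaflops},
\qquad \frac{60}{70.4} \approx 85\%.
\]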

For each test run we evaluate the parallel efficiency as
\[
E_n = \frac{T_1}{nT_n} 100,
\]
where $T_n = \max_{0\le i < n}\{t_i\}$, $n$ is the total number of
processors (and hence submeshes) and $t_i$ is the time required to run the
{\tt evolve} code on processor $i$. Note that $t_i$ does not include the
time required to build and subpartition the mesh etc.; it only
includes the time required to do the evolve calculations (e.g.\
\code{domain.evolve(yieldstep = 0.1, finaltime = 3.0)}).
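
As an illustration of this definition, here is a minimal Python sketch that
evaluates $E_n$ from a list of per-processor evolve times. In the example
call, only the maximum time is taken from a measured run (the $n = 4$ entry
of Table \ref{tbl:rpa40}); the three smaller times are placeholders:

\begin{verbatim}
def parallel_efficiency(T1, times):
    """Return E_n in percent, given the sequential time T1 and the
    list of evolve times [t_0, ..., t_{n-1}] from the n processors."""
    n = len(times)
    Tn = max(times)            # the slowest processor sets the run time
    return T1/(n*Tn)*100.0

# T_1 = 36.61 s and T_4 = 10.16 s give an efficiency of about 90%.
E4 = parallel_efficiency(36.61, [10.16, 9.87, 9.91, 10.02])
\end{verbatim}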

\section{Advection, Rectangular Domain}

The first example looked at the rectangular domain example given in
Section \ref{subsec:codeRPA}, except we changed the final time
to 1.0 (\code{domain.evolve(yieldstep = 0.1, finaltime = 1.0)}).

For this particular example we can control the mesh size by changing
the parameters \code{N} and \code{M} given in the following section
of code taken from Section \ref{subsec:codeRPA}.

\begin{verbatim}
#######################
# Partition the mesh
#######################
N = 20
M = 20

# Build a unit mesh, subdivide it over numprocs processors with each
# submesh containing M*N nodes

points, vertices, boundary, full_send_dict, ghost_recv_dict =  \
    parallel_rectangle(N, M, len1_g=1.0)
\end{verbatim}
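
As noted above, the timings exclude the mesh building and partitioning. A
minimal sketch of how the evolve time $t_i$ could be recorded on each
processor (the use of the standard \code{time} module and the bare
\code{pass} inside the loop are illustrative choices, not part of the
listing in Section \ref{subsec:codeRPA}):

\begin{verbatim}
import time

t0 = time.time()
for t in domain.evolve(yieldstep = 0.1, finaltime = 1.0):
    pass                      # the full listing may report output here
t_i = time.time() - t0        # evolve time on this processor only
\end{verbatim}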

Tables \ref{tbl:rpa40}, \ref{tbl:rpa80} and \ref{tbl:rpa160} show
the efficiency results for different values of \code{N} and
\code{M}. The examples where $n \le 4$ were run on one Opteron node
containing 4 processors, while the $n = 8$ example was run on 2 nodes
(giving a total of 8 processors). The communication within a node is
faster than the communication across nodes, so we would expect to
see a decrease in efficiency when we jump from 4 to 8 processors.
Furthermore, as \code{N} and \code{M} are increased the ratio of
exterior to interior triangles decreases, which in turn decreases
the amount of communication relative to the amount of computation and
thus the efficiency should increase.
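
As a rough scaling sketch (treating each submesh as a structured block of
triangles, which is an idealisation of the actual partition), the number of
interior triangles grows like $NM$ while the number of triangles along the
submesh boundary grows like $N + M$, so
\[
\frac{\mbox{exterior triangles}}{\mbox{interior triangles}} \sim
\frac{N + M}{NM} = \frac{1}{N} + \frac{1}{M},
\]
which halves each time \code{N} and \code{M} are doubled.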

The efficiency results shown here are competitive: even on 8 processors
the efficiency stays above 70\%, and it improves with the mesh size as
expected.

\begin{table}
\caption{Parallel Efficiency Results for the Advection Problem on a
Rectangular Domain with {\tt N} = 40, {\tt M} =
40.\label{tbl:rpa40}}
\begin{center}
\begin{tabular}{|c|c c|}\hline
$n$ & $T_n$ (sec) & $E_n (\%)$ \\\hline
1 & 36.61 & \\
2 & 18.76 & 98 \\
4 & 10.16 & 90 \\
8 & 6.39 & 72 \\\hline
\end{tabular}
\end{center}
\end{table}

\begin{table}
\caption{Parallel Efficiency Results for the Advection Problem on a
Rectangular Domain with {\tt N} = 80, {\tt M} =
80.\label{tbl:rpa80}}
\begin{center}
\begin{tabular}{|c|c c|}\hline
$n$ & $T_n$ (sec) & $E_n (\%)$ \\\hline
1 & 282.18 & \\
2 & 143.14 & 99 \\
4 & 75.06 & 94 \\
8 & 41.67 & 85 \\\hline
\end{tabular}
\end{center}
\end{table}


\begin{table}
\caption{Parallel Efficiency Results for the Advection Problem on a
Rectangular Domain with {\tt N} = 160, {\tt M} =
160.\label{tbl:rpa160}}
\begin{center}
\begin{tabular}{|c|c c|}\hline
$n$ & $T_n$ (sec) & $E_n (\%)$ \\\hline
1 & 2200.35 & \\
2 & 1126.35 & 97 \\
4 & 569.49 & 97 \\
8 & 304.43 & 90 \\\hline
\end{tabular}
\end{center}
\end{table}


%Another way of measuring the performance of the code on a parallel machine is to increase the problem size as the number of processors are increased so that the number of triangles per processor remains roughly the same.  We have not carried out measurements of this kind as we usually have static grids and it is not possible to increase the number of triangles.

\section{Advection, Merimbula Mesh}

We now look at another advection example, except this time the mesh
comes from the Merimbula test problem. That is, we ran the code
given in Section \ref{subsec:codeRPMM}, except the final time was
reduced to 10000 (\code{finaltime = 10000}). The results are given
in Table \ref{tbl:rpm}. These are good efficiency results,
especially considering the structure of the Merimbula mesh.
%Note that since we are solving an advection problem the amount of calculation
%done on each triangle is relatively low; when we move to other problems that
%involve more calculations we would expect the computation to communication ratio to increase and thus get an increase in efficiency.

\begin{table}
\caption{Parallel Efficiency Results for the Advection Problem on the
  Merimbula Mesh.\label{tbl:rpm}}
\begin{center}
\begin{tabular}{|c|c c|}\hline
$n$ & $T_n$ (sec) & $E_n (\%)$ \\\hline
1 & 145.17 & \\
2 & 77.52 & 94 \\
4 & 41.24 & 88 \\
8 & 22.96 & 79 \\\hline
\end{tabular}
\end{center}
\end{table}

\section{Shallow Water, Merimbula Mesh}

The final example we looked at is the shallow water equation on the
Merimbula mesh. We used the code listed in Section \ref{subsec:codeRPSMM}. The
results are listed in Table \ref{tbl:rpsm}. The efficiency results are not as
good as initially expected, so we profiled the code and found that
the problem is with the \code{update_boundary} routine in the {\tt domain.py}
file. On one processor the \code{update_boundary} routine accounts for about
72\% of the total computation time and unfortunately it is difficult to
parallelise this routine. When Metis subpartitions the mesh it is possible
that one processor will get only a few boundary edges (some may not get any)
while another processor may contain a relatively large number of boundary
edges. The profiler indicated that when running the problem on 8 processors,
Processor 0 spent about 3.8 times longer doing the \code{update_boundary}
calculations than Processor 7. This load imbalance reduced the parallel
efficiency.
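
A minimal sketch of the kind of profiling run used to locate such a hotspot
(the choice of Python's standard \code{cProfile} module, the file name
\code{evolve.prof} and the evolve arguments are illustrative; they are not
taken from the original profiling run):

\begin{verbatim}
import cProfile, pstats

# Profile the evolve loop only, then list the ten most expensive calls
# by cumulative time (update_boundary should appear near the top).
cmd = 'for t in domain.evolve(yieldstep = 0.1, finaltime = 1.0): pass'
cProfile.run(cmd, 'evolve.prof')

stats = pstats.Stats('evolve.prof')
stats.sort_stats('cumulative').print_stats(10)
\end{verbatim}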

Before doing the shallow water equation calculations on a larger number of
processors we recommend that the \code{update_boundary} calculations be
optimised as much as possible to reduce the effect of the load imbalance.


\begin{table}
\caption{Parallel Efficiency Results for the Shallow Water Equation on the
  Merimbula Mesh.\label{tbl:rpsm}}
\begin{center}
\begin{tabular}{|c|c c|}\hline
$n$ & $T_n$ (sec) & $E_n (\%)$ \\\hline
1 & 7.04 & \\
2 & 3.62 & 97 \\
4 & 1.94 & 91 \\
8 & 1.15 & 77 \\\hline
\end{tabular}
\end{center}
\end{table}