\chapter{Performance Analysis}

To evaluate the performance of the code on a parallel machine we ran
some examples on a cluster of four nodes connected with PathScale
InfiniPath HTX. Each node has two AMD Opteron 275 processors
(dual-core, 2.2 GHz) and 4 GB of main memory. The system achieves 60
Gigaflops with the Linpack benchmark, which is about 85\% of peak
performance.

For each test run we evaluate the parallel efficiency as
\[
E_n = \frac{T_1}{n T_n} \times 100,
\]
where $T_n = \max_{0\le i < n}\{t_i\}$, $n$ is the total number of
processors (submeshes) and $t_i$ is the time required to run the {\tt
evolve} code on processor $i$. Note that $t_i$ does not include the
time required to build and subpartition the mesh; it only
includes the time required to do the evolve calculations (e.g.\
\code{domain.evolve(yieldstep = 0.1, finaltime = 3.0)}).
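
As a concrete illustration of the formula, the short Python sketch
below computes $T_n$ and $E_n$ from a list of per-processor {\tt
evolve} times. It is only a sketch: the serial time and the maximum
time match the {\tt N} = {\tt M} = 40 example in Table
\ref{tbl:rpa40}, but the remaining per-processor times are made-up
placeholders.

\begin{verbatim}
# Minimal sketch: parallel efficiency from per-processor evolve times.
def parallel_efficiency(t_serial, times):
    n = len(times)
    T_n = max(times)          # slowest processor sets the parallel time
    return 100.0 * t_serial / (n * T_n)

T_1 = 36.61                       # serial time for N = M = 40
t_i = [9.80, 10.16, 9.91, 10.05]  # hypothetical per-processor times
print(parallel_efficiency(T_1, t_i))  # about 90, as in the table
\end{verbatim}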

\section{Advection, Rectangular Domain}

The first example is the rectangular domain example given in
Section \ref{subsec:codeRPA}, except that we changed the final time
to 1.0 (\code{domain.evolve(yieldstep = 0.1, finaltime = 1.0)}).

For this particular example we can control the mesh size by changing
the parameters \code{N} and \code{M} given in the following section
of code taken from
Section \ref{subsec:codeRPA}.

\begin{verbatim}
#######################
# Partition the mesh
#######################
N = 20
M = 20

# Build a unit mesh, subdivide it over numprocs processors with each
# submesh containing M*N nodes

points, vertices, boundary, full_send_dict, ghost_recv_dict = \
    parallel_rectangle(N, M, len1_g=1.0)
\end{verbatim}

Tables \ref{tbl:rpa40}, \ref{tbl:rpa80} and \ref{tbl:rpa160} show
the efficiency results for different values of \code{N} and
\code{M}. The examples where $n \le 4$ were run on one Opteron node
containing 4 processors, while the $n = 8$ example was run on 2 nodes
(giving a total of 8 processors). The communication within a node is
faster than the communication across nodes, so we would expect to
see a decrease in efficiency when we go from 4 to 8 processors.
Furthermore, as \code{N} and \code{M} are increased the ratio of
exterior to interior triangles decreases, which in turn decreases
the amount of communication relative to the amount of computation,
so the efficiency should increase with the mesh size.
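
To make this scaling argument concrete, the following back-of-envelope
sketch estimates the exterior-to-interior triangle ratio for one
submesh. It assumes the rectangle (which contains $2NM$ triangles) is
split into $n$ roughly equal strips and that each submesh exchanges a
fixed number of ghost rows; the actual layout produced by
\code{parallel_rectangle} may differ, so only the trend matters here.

\begin{verbatim}
# Rough estimate only: exterior (ghost) to interior triangle ratio
# for one of n strip-shaped submeshes of an N x M rectangle.
def ghost_ratio(N, M, n, ghost_rows=2):
    interior = 2.0 * N * M / n        # triangles owned by the submesh
    exterior = ghost_rows * 2.0 * M   # triangles in its ghost layer
    return exterior / interior

for N in (40, 80, 160):
    print(N, ghost_ratio(N, N, 8))
# prints 0.4, 0.2 and 0.1: the ratio halves each time N and M double,
# so the relative communication cost drops as the mesh is refined.
\end{verbatim}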

The efficiency results are competitive: they stay above 70\% in all
cases and are 90\% or better, even on 8 processors, for the largest
({\tt N} = {\tt M} = 160) mesh.
\begin{table}
\caption{Parallel Efficiency Results for the Advection Problem on a
Rectangular Domain with {\tt N} = 40, {\tt M} =
40.\label{tbl:rpa40}}
\begin{center}
\begin{tabular}{|c|c c|}\hline
$n$ & $T_n$ (sec) & $E_n$ (\%) \\\hline
1 & 36.61 & \\
2 & 18.76 & 98 \\
4 & 10.16 & 90 \\
8 & 6.39 & 72 \\\hline
\end{tabular}
\end{center}
\end{table}

\begin{table}
\caption{Parallel Efficiency Results for the Advection Problem on a
Rectangular Domain with {\tt N} = 80, {\tt M} =
80.\label{tbl:rpa80}}
\begin{center}
\begin{tabular}{|c|c c|}\hline
$n$ & $T_n$ (sec) & $E_n$ (\%) \\\hline
1 & 282.18 & \\
2 & 143.14 & 99 \\
4 & 75.06 & 94 \\
8 & 41.67 & 85 \\\hline
\end{tabular}
\end{center}
\end{table}

\begin{table}
\caption{Parallel Efficiency Results for the Advection Problem on a
Rectangular Domain with {\tt N} = 160, {\tt M} =
160.\label{tbl:rpa160}}
\begin{center}
\begin{tabular}{|c|c c|}\hline
$n$ & $T_n$ (sec) & $E_n$ (\%) \\\hline
1 & 2200.35 & \\
2 & 1126.35 & 97 \\
4 & 569.49 & 97 \\
8 & 304.43 & 90 \\\hline
\end{tabular}
\end{center}
\end{table}

%Another way of measuring the performance of the code on a parallel machine is to increase the problem size as the number of processors are increased so that the number of triangles per processor remains roughly the same. We have not carried out measurements of this kind as we usually have static grids and it is not possible to increase the number of triangles.

\section{Advection, Merimbula Mesh}

We now look at another advection example, except this time the mesh
comes from the Merimbula test problem. That is, we ran the code
given in Section \ref{subsec:codeRPMM}, except that the final time was
reduced to 10000 (\code{finaltime = 10000}). The results are given
in Table \ref{tbl:rpm}. These are good efficiency results,
especially considering that the Merimbula mesh is unstructured, so
the submeshes are less regular than in the rectangular example.
%Note that since we are solving an advection problem the amount of calculation
%done on each triangle is relatively low; when we move to other problems that
%involve more calculations we would expect the computation to communication ratio to increase and thus get an increase in efficiency.

\begin{table}
\caption{Parallel Efficiency Results for the Advection Problem on the
Merimbula Mesh.\label{tbl:rpm}}
\begin{center}
\begin{tabular}{|c|c c|}\hline
$n$ & $T_n$ (sec) & $E_n$ (\%) \\\hline
1 & 145.17 & \\
2 & 77.52 & 94 \\
4 & 41.24 & 88 \\
8 & 22.96 & 79 \\\hline
\end{tabular}
\end{center}
\end{table}

\section{Shallow Water, Merimbula Mesh}

The final example we looked at is the shallow water equation on the
Merimbula mesh. We used the code listed in Section \ref{subsec:codeRPSMM}. The
results are listed in Table \ref{tbl:rpsm}. The efficiency results are not as
good as initially expected, so we profiled the code and found that
the problem lies with the \code{update_boundary} routine in the {\tt domain.py}
file. On one processor the \code{update_boundary} routine accounts for about
72\% of the total computation time, and unfortunately this routine is
difficult to parallelise. When Metis subpartitions the mesh it is possible
that one processor will receive only a few boundary edges (some may not
receive any) while another processor may contain a relatively large number of
boundary edges. The profiler indicated that, when running the problem on 8
processors, Processor 0 spent about 3.8 times more time doing the
\code{update_boundary} calculations than Processor 7. This load imbalance
reduced the parallel efficiency.
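
One way to collect a per-processor breakdown of this kind is with
Python's built-in profiler. The sketch below is illustrative only:
\code{domain} and \code{rank} (the processor number supplied by the
parallel layer) are assumed to already exist, the \code{evolve}
parameters are placeholders, and the evolve call is written as a
simple loop that should be adapted to match the driver script in
Section \ref{subsec:codeRPSMM}.

\begin{verbatim}
import cProfile, pstats

# Profile the evolve loop on this processor.
profiler = cProfile.Profile()
profiler.enable()
for t in domain.evolve(yieldstep = 0.1, finaltime = 1.0):
    pass
profiler.disable()

# Write one profile per processor, then report the time spent in
# update_boundary, sorted by cumulative time.
stats_file = 'profile_rank_%d.prof' % rank
profiler.dump_stats(stats_file)
pstats.Stats(stats_file).sort_stats('cumulative') \
                        .print_stats('update_boundary')
\end{verbatim}

Comparing these reports across processors shows the imbalance
directly, since the partitioner does not distribute the boundary
edges evenly.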

Before doing the shallow water calculations on a larger number of
processors we recommend that the \code{update_boundary} calculations be
optimised as much as possible to reduce the effect of the load imbalance.

\begin{table}
\caption{Parallel Efficiency Results for the Shallow Water Equation on the
Merimbula Mesh.\label{tbl:rpsm}}
\begin{center}
\begin{tabular}{|c|c c|}\hline
$n$ & $T_n$ (sec) & $E_n$ (\%) \\\hline
1 & 7.04 & \\
2 & 3.62 & 97 \\
4 & 1.94 & 91 \\
8 & 1.15 & 77 \\\hline
\end{tabular}
\end{center}
\end{table}