Changeset 3245
- Timestamp: Jun 27, 2006, 4:01:02 PM
- Location: inundation/parallel/documentation
- Files: 3 edited
inundation/parallel/documentation/parallel.tex
r3244 → r3245

Lines 28-32:
      The first step in parallelising the code is to subdivide the mesh
      into, roughly, equally sized partitions. On a rectangular mesh this may be
  -   done by a simple co-ordinate based dissection, but on a complicated
  +   done by a simple co-ordinate based dissection method, but on a complicated
      domain such as the Merimbula mesh shown in Figure \ref{fig:mergrid}
      a more sophisticated approach must be used. We use pymetis, a

Lines 92-96:
      \begin{figure}[hbtp]
      \centerline{ \includegraphics[scale = 0.6]{figures/subdomain.eps}}
  -   \caption{An example subpartioning of a mesh.}
  +   \caption{An example subpartitioning of a mesh.}
      \label{fig:subdomain}
      \end{figure}

Lines 99-103:
      \begin{figure}[hbtp]
      \centerline{ \includegraphics[scale = 0.6]{figures/subdomainghost.eps}}
  -   \caption{An example subpartioning with ghost triangles. The numbers in brackets shows the local numbering scheme that is calculated and stored with the mesh, but not implemented until the local mesh is built. See Section \ref{sec:part4}. }
  +   \caption{An example subpartitioning with ghost triangles. The numbers in brackets shows the local numbering scheme that is calculated and stored with the mesh, but not implemented until the local mesh is built. See Section \ref{sec:part4}. }
      \label{fig:subdomaing}
      \end{figure}

Lines 118-122:
      \subsection {Sending the Submeshes}\label{sec:part3}

  -   All of functions described so far must be run in serial on Processor 0. The next step is to start the parallel computation by spreading the submeshes over the processors. The communication is carried out by
  +   All of the functions described so far must be run in serial on Processor 0. The next step is to start the parallel computation by spreading the submeshes over the processors. The communication is carried out by
      \code{send_submesh} and \code{rec_submesh} defined in {\tt build_commun.py}.
      The \code{send_submesh} function should be called on Processor 0 and sends the Submesh $p$ to Processor $p$, while \code{rec_submesh} should be called by Processor $p$ to receive Submesh $p$ from Processor 0.

Lines 128-137:
      \subsection {Building the Local Mesh}\label{sec:part4}
  -   After using \code{send_submesh} and \code{rec_submesh}, Processor $p$ should have its own local copy of Submesh $p$, however as stated previously the triangle numbering will be incorrect on all processors except number $0$. The \code{build_local_mesh} function from {\tt build_local.py} primarily focuses on renumbering the information stored with the submesh; including the nodes, vertices and quantities. Figure \ref{fig:subdomainf} shows what the mesh in each processor may look like.
  +   After using \code{send_submesh} and \code{rec_submesh}, Processor $p$ should have its own local copy of Submesh $p$, however as stated previously the triangle numbering will be incorrect on all processors except Processor $0$. The \code{build_local_mesh} function from {\tt build_local.py} primarily focuses on renumbering the information stored with the submesh; including the nodes, vertices and quantities. Figure \ref{fig:subdomainf} shows what the mesh in each processor may look like.

      \begin{figure}[hbtp]
      \centerline{ \includegraphics[scale = 0.6]{figures/subdomainfinal.eps}}
  -   \caption{An example subpartioning after the submeshes have been renumbered.}
  +   \caption{An example subpartitioning after the submeshes have been renumbered.}
      \label{fig:subdomainf}
      \end{figure}
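The build_local_mesh step discussed in the hunks above renumbers the nodes, vertices and quantities of a submesh into a local scheme. The real routine in build_local.py also sets up the communication dictionaries; the fragment below is only a minimal, self-contained illustration of the renumbering idea, and its data layout and names are invented for the example rather than taken from the package.

# Minimal sketch of local renumbering, in the spirit of build_local_mesh.
# The data layout (triangles as triples of *global* node ids) is invented for
# illustration; the real submesh structure in build_local.py differs.

def renumber_submesh(triangles):
    """Map global node ids to a contiguous local numbering.

    triangles: list of (n0, n1, n2) tuples of global node ids
    returns:   (local_triangles, global_to_local) with local ids 0..k-1
    """
    global_to_local = {}
    local_triangles = []
    for tri in triangles:
        local_tri = []
        for gid in tri:
            if gid not in global_to_local:
                global_to_local[gid] = len(global_to_local)  # next free local id
            local_tri.append(global_to_local[gid])
        local_triangles.append(tuple(local_tri))
    return local_triangles, global_to_local

# Example: a submesh whose triangles reference scattered global node ids
submesh_triangles = [(104, 7, 52), (7, 52, 98), (52, 98, 104)]
local_tris, mapping = renumber_submesh(submesh_triangles)
print(local_tris)   # [(0, 1, 2), (1, 2, 3), (2, 3, 0)]
print(mapping)      # {104: 0, 7: 1, 52: 2, 98: 3}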
Lines 242-246 (r3244) → 242-247 (r3245):
      \end{verbatim}

  -   Note that the submesh is not received by, or sent to, Processor 0. Rather \code{hostmesh = extract_hostmesh(submesh)} simply extracts the mesh that has been assigned to Processor 0. Recall \code{submesh} contains the list of submeshes to be assigned to each processor. This is described further in Section \ref{sec:part3}. The \code{build_local_mesh} renumbers the nodes
  +   Note that the submesh is not received by, or sent to, Processor 0. Rather \code{hostmesh = extract_hostmesh(submesh)} simply extracts the mesh that has been assigned to Processor 0. Recall \code{submesh} contains the list of submeshes to be assigned to each processor. This is described further in Section \ref{sec:part3}.
  +   %The \code{build_local_mesh} renumbers the nodes
      \begin{verbatim}
      points, vertices, boundary, quantities, ghost_recv_dict, full_send_dict)=\

Lines 267-271 (r3244) → 268-272 (r3245):
      \begin{itemize}
      \item Python 2.0 or later
  -   \item Numeric Python (incling RandomArray) matching the Python installation
  +   \item Numeric Python (including RandomArray) matching the Python installation
      \item Native MPI C library
      \item Native C compiler
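For reference, the "co-ordinate based dissection method" mentioned in the first hunk of this file can, for a rectangular mesh, be as simple as recursively splitting the triangles at the median of their centroid coordinates. The sketch below only illustrates that idea and is not the pymetis-based partitioner the documentation actually describes; all names in it are invented for the example.

# Illustrative coordinate-based dissection: recursively bisect a set of
# triangle centroids at the median x or y coordinate until the requested
# number of roughly equal partitions is reached.  This is NOT the pymetis
# partitioner used by the package; it only sketches the simple approach
# that works on rectangular meshes.

def coordinate_dissection(centroids, nparts, axis=0):
    """centroids: list of (x, y); nparts: a power of two; returns index lists."""
    parts = [list(range(len(centroids)))]
    while len(parts) < nparts:
        new_parts = []
        for part in parts:
            part.sort(key=lambda i: centroids[i][axis])  # order along current axis
            half = len(part) // 2
            new_parts.append(part[:half])                # split at the median
            new_parts.append(part[half:])
        parts = new_parts
        axis = 1 - axis                                  # alternate x / y cuts
    return parts

# Example: centroids of a small structured mesh, split into 4 partitions
centroids = [(x + 0.5, y + 0.5) for x in range(4) for y in range(4)]
for k, part in enumerate(coordinate_dissection(centroids, 4)):
    print(k, sorted(part))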
inundation/parallel/documentation/report.tex
r3033 → r3245

Lines 55-58 (r3033) → 55-61 (r3245):

      \begin{abstract}
  +   This document describes work done by the authors as part of a consultancy with GA during 2005-2006. The paper serves as both a report for GA and a user manual.
  +
  +   The report contains a description of how the code was parallelised and it lists efficiency results for a few example runs. It also gives some examples codes showing how to run the Merimbula test problem in parallel and talks about more technical aspects such as compilation issues and batch scripts for submitting compute jobs on parallel machines.
      \end{abstract}
inundation/parallel/documentation/results.tex
r3198 → r3245

Lines 6-18:
      Each node has two AMD Opteron 275 (Dual-core 2.2 GHz Processors) and 4 GB of main memory. The system achieves 60 Gigaflops with the Linpack benchmark, which is about 85\% of peak performance.

  -   For each test run we evaluated the parallel efficiency
  +   For each test run we evaluate the parallel efficiency as
      \[
      E_n = \frac{T_1}{nT_n} 100,
      \]
  -   where $T_n = \max_{0\le i < n}\{t_i\}$, $n$ is the total number of processors (subdomains) and $t_i$ is the time required to run the {\tt evolve} code on processor $i$. Note that $t_i$ does not include the time required to build and subpartion the mesh etc., it only includes the time required to do the evolve calculations (eg. \code{domain.evolve(yieldstep = 0.1, finaltime = 3.0)}).
  +   where $T_n = \max_{0\le i < n}\{t_i\}$, $n$ is the total number of processors (submesh) and $t_i$ is the time required to run the {\tt evolve} code on processor $i$. Note that $t_i$ does not include the time required to build and subpartition the mesh etc., it only includes the time required to do the evolve calculations (eg. \code{domain.evolve(yieldstep = 0.1, finaltime = 3.0)}).

      \section{Advection, Rectangular Domain}

  -   The first example looked at the rectangular domain example given in Section \ref{subsec:codeRPA}, excecpt that we changed the finaltime time to 1.0 (\code{domain.evolve(yieldstep = 0.1, finaltime = 1.0)}).
  +   The first example looked at the rectangular domain example given in Section \ref{subsec:codeRPA}, except we changed the finaltime time to 1.0 (\code{domain.evolve(yieldstep = 0.1, finaltime = 1.0)}).

      For this particular example we can control the mesh size by changing the parameters \code{N} and \code{M} given in the following section of code taken from

Lines 33-37:
      \end{verbatim}

  -   Tables \ref{tbl:rpa40}, \ref{tbl:rpa80} and \ref{tbl:rpa160} show the efficiency results for different values of \code{N} and \code{M}. The examples where $n \le 4$ were run on one node containing 4 processors, the $n = 8$ example was run on 2 nodes. The communication within a node is faster than the communication across nodes, so we would expect to see a decrease in efficiency when we jump from 4 to 8 nodes. Furthermore, as \code{N} and \code{M} are increased the ratio of ghost triangles to other triangles decreases. Or anothr way to think of it is that the size of the boundary layer decreses. Hence the amount of communication relative the amount of computation decreases and thus the efficiency should increase.
  +   Tables \ref{tbl:rpa40}, \ref{tbl:rpa80} and \ref{tbl:rpa160} show the efficiency results for different values of \code{N} and \code{M}. The examples where $n \le 4$ were run on one Opteron node containing 4 processors, the $n = 8$ example was run on 2 nodes (giving a total of 8 processors). The communication within a node is faster than the communication across nodes, so we would expect to see a decrease in efficiency when we jump from 4 to 8 nodes. Furthermore, as \code{N} and \code{M} are increased the ratio of exterior to interior triangles decreases, which in-turn decreases the amount of communication relative the amount of computation and thus the efficiency should increase.

      The efficiency results shown here are competitive.
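The efficiency measure introduced in the first hunk above is straightforward to compute from the per-processor evolve times. A small self-contained sketch follows; the timing values in it are made up purely to illustrate the calculation.

# Parallel efficiency E_n = 100 * T_1 / (n * T_n), with T_n = max_i t_i,
# where t_i is the evolve time measured on processor i.
# The numbers below are invented purely to illustrate the calculation.

def parallel_efficiency(serial_time, per_processor_times):
    """Return E_n as a percentage, given T_1 and the list [t_0, ..., t_{n-1}]."""
    n = len(per_processor_times)
    t_n = max(per_processor_times)          # the slowest processor dominates
    return 100.0 * serial_time / (n * t_n)

T1 = 224.4                                  # evolve time on a single processor
runs = {
    2: [115.0, 118.2],                      # t_i for each processor in an n = 2 run
    4: [59.8, 61.3, 60.5, 62.0],
}
for n, times in sorted(runs.items()):
    print(n, round(parallel_efficiency(T1, times), 1))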
Lines 78-91 (r3198) → 78-86 (r3245):

  -   Another way of measuring the performance of the code on a parallel machine is
  -   to increase the problem size as the number of processors are increased so that
  -   the number of triangles per processor remains roughly the same. We have not
  -   carried out measurements of this kind as we usually have static grids and it
  -   is not possible to increase the number of triangles.
  +   %Another way of measuring the performance of the code on a parallel machine is to increase the problem size as the number of processors are increased so that the number of triangles per processor remains roughly the same. We have not carried out measurements of this kind as we usually have static grids and it is not possible to increase the number of triangles.

      \section{Advection, Merimbula Mesh}

  -   We now look at another advection example, except this time the mesh comes from
  -   the Merimbula test problem. In other words, we ran the code given in Section
  +   We now look at another advection example, except this time the mesh comes from the Merimbula test problem. That is, we ran the code given in Section
      \ref{subsec:codeRPMM}, except the final time was reduced to 10000
      (\code{finaltime = 10000}). The results are given in Table \ref{tbl:rpm}.

Lines 115-119 (r3198) → 110-114 (r3245):
      Merimbula mesh. We used the code listed in Section \ref{subsec:codeRPSMM}. The
      results are listed in Table \ref{tbl:rpsm}. The efficiency results are not as
  -   good as initally expected so we profiled the code and found that
  +   good as initially expected so we profiled the code and found that
      the problem is with the \code{update_boundary} routine in the {\tt domain.py}
      file. On one processor the \code{update_boundary} routine accounts for about
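The last hunk mentions profiling the code to locate the update_boundary hot spot. The report does not say which profiler was used; one generic way to rank routines by cumulative time with Python's standard library profiler is sketched below, where the work function is only a stand-in for an actual domain.evolve run.

# Generic profiling sketch using the standard library profiler (cProfile here;
# the older 'profile' module offers the same interface on very old Pythons).
# This is not necessarily how the report's measurements were made; it only
# illustrates ranking routines such as update_boundary by cumulative time.
import cProfile
import pstats

def work():
    total = 0.0
    for i in range(1, 200000):
        total += i ** 0.5       # dummy computation standing in for the evolve loop
    return total

cProfile.run('work()', 'evolve.prof')           # write raw profile data to a file
stats = pstats.Stats('evolve.prof')
stats.sort_stats('cumulative').print_stats(10)  # ten most expensive routines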