Changeset 3245


Timestamp:
Jun 27, 2006, 4:01:02 PM
Author:
linda
Message:

Committed "final version" of report on parallel implementation

Location:
inundation/parallel/documentation
Files:
3 edited

  • inundation/parallel/documentation/parallel.tex

    r3244 r3245  
      The first step in parallelising the code is to subdivide the mesh
      into, roughly, equally sized partitions. On a rectangular mesh this may be
    - done by a simple co-ordinate based dissection, but on a complicated
    + done by a simple co-ordinate based dissection method, but on a complicated
      domain such as the Merimbula mesh shown in Figure \ref{fig:mergrid}
      a more sophisticated approach must be used.  We use pymetis, a
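
A minimal sketch of the kind of graph partitioning this step performs, using the generic PyMetis bindings (pymetis.part_graph); the wrapper shipped with the inundation code may expose a different interface, and the neighbour list below is made up purely for illustration:

    # Sketch only: partition a triangle mesh by its dual graph with METIS.
    import pymetis

    def partition_triangles(neighbours, nparts):
        """neighbours[i] lists the triangles sharing an edge with triangle i."""
        # METIS balances the partition sizes while minimising the number of
        # cut edges, i.e. the triangle faces shared between processors.
        edge_cuts, parts = pymetis.part_graph(nparts, adjacency=neighbours)
        return parts            # parts[i] = processor assigned to triangle i

    # Four triangles in a strip, split over two processors.
    triangle_neighbours = [[1], [0, 2], [1, 3], [2]]
    print(partition_triangles(triangle_neighbours, 2))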
     
      \begin{figure}[hbtp]
        \centerline{ \includegraphics[scale = 0.6]{figures/subdomain.eps}}
    -   \caption{An example subpartioning of a mesh.}
    +   \caption{An example subpartitioning of a mesh.}
       \label{fig:subdomain}
      \end{figure}
     
      \begin{figure}[hbtp]
        \centerline{ \includegraphics[scale = 0.6]{figures/subdomainghost.eps}}
    -   \caption{An example subpartioning with ghost triangles. The numbers in brackets shows the local numbering scheme that is calculated and stored with the mesh, but not implemented until the local mesh is built. See Section \ref{sec:part4}. }
    +   \caption{An example subpartitioning with ghost triangles. The numbers in brackets show the local numbering scheme that is calculated and stored with the mesh, but not implemented until the local mesh is built. See Section \ref{sec:part4}.}
       \label{fig:subdomaing}
      \end{figure}
     
      \subsection {Sending the Submeshes}\label{sec:part3}

    - All of functions described so far must be run in serial on Processor 0. The next step is to start the parallel computation by spreading the submeshes over the processors. The communication is carried out by
    + All of the functions described so far must be run in serial on Processor 0. The next step is to start the parallel computation by spreading the submeshes over the processors. The communication is carried out by
      \code{send_submesh} and \code{rec_submesh} defined in {\tt build_commun.py}.
      The \code{send_submesh} function should be called on Processor 0 and sends the Submesh $p$ to Processor $p$, while \code{rec_submesh} should be called by Processor $p$ to receive Submesh $p$ from Processor 0.
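
The pattern is the usual master/worker distribution step. The sketch below illustrates it with mpi4py rather than the MPI wrapper actually used by build_commun.py, and build_all_submeshes is a hypothetical stand-in for whatever the partitioning step returns:

    # Illustrative only: distribute one submesh to each processor.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        submeshes = build_all_submeshes()        # hypothetical partitioning step
        for p in range(1, comm.Get_size()):
            comm.send(submeshes[p], dest=p)      # role of send_submesh
        my_submesh = submeshes[0]                # Processor 0 keeps its own part
    else:
        my_submesh = comm.recv(source=0)         # role of rec_submesh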
     

      \subsection {Building the Local Mesh}\label{sec:part4}
    - After using \code{send_submesh} and \code{rec_submesh}, Processor $p$ should have its own local copy of Submesh $p$, however as stated previously the triangle numbering will be incorrect on all processors except number $0$. The \code{build_local_mesh} function from {\tt build_local.py} primarily focuses on renumbering the information stored with the submesh; including the nodes, vertices and quantities. Figure \ref{fig:subdomainf} shows what the mesh in each processor may look like.
    + After using \code{send_submesh} and \code{rec_submesh}, Processor $p$ should have its own local copy of Submesh $p$; however, as stated previously, the triangle numbering will be incorrect on all processors except Processor $0$. The \code{build_local_mesh} function from {\tt build_local.py} primarily focuses on renumbering the information stored with the submesh, including the nodes, vertices and quantities. Figure \ref{fig:subdomainf} shows what the mesh on each processor may look like.


      \begin{figure}[hbtp]
        \centerline{ \includegraphics[scale = 0.6]{figures/subdomainfinal.eps}}
    -   \caption{An example subpartioning after the submeshes have been renumbered.}
    +   \caption{An example subpartitioning after the submeshes have been renumbered.}
       \label{fig:subdomainf}
      \end{figure}
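
The renumbering amounts to mapping global triangle and node identifiers onto a contiguous local numbering. A stripped-down illustration of the idea (not the actual build_local.py code, which also handles quantities, boundaries and the ghost communication dictionaries):

    # Illustrative only: renumber globally-numbered triangles to local indices.
    def renumber(global_node_ids, triangles):
        """triangles is a list of 3-tuples of global node ids."""
        # Assign local indices 0, 1, 2, ... in the order the nodes are stored.
        global_to_local = {g: l for l, g in enumerate(global_node_ids)}
        local_triangles = [tuple(global_to_local[g] for g in tri)
                           for tri in triangles]
        return global_to_local, local_triangles

    # Nodes 7, 12 and 40 of the full mesh become local nodes 0, 1 and 2.
    print(renumber([7, 12, 40], [(7, 12, 40)]))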
     
      \end{verbatim}

    - Note that the submesh is not received by, or sent to, Processor 0. Rather     \code{hostmesh = extract_hostmesh(submesh)} simply extracts the mesh that has been assigned to Processor 0. Recall \code{submesh} contains the list of submeshes to be assigned to each processor. This is described further in Section \ref{sec:part3}. The \code{build_local_mesh} renumbers the nodes
    + Note that the submesh is not received by, or sent to, Processor 0. Rather, \code{hostmesh = extract_hostmesh(submesh)} simply extracts the mesh that has been assigned to Processor 0. Recall \code{submesh} contains the list of submeshes to be assigned to each processor. This is described further in Section \ref{sec:part3}.
    + %The \code{build_local_mesh} renumbers the nodes
      \begin{verbatim}
          points, vertices, boundary, quantities, ghost_recv_dict, full_send_dict)=\
     
      \begin{itemize}
      \item Python 2.0 or later
    - \item Numeric Python (incling RandomArray) matching the Python installation
    + \item Numeric Python (including RandomArray) matching the Python installation
      \item Native MPI C library
      \item Native C compiler
  • inundation/parallel/documentation/report.tex

    r3033 r3245  

      \begin{abstract}
    + This document describes work done by the authors as part of a consultancy with GA during 2005--2006. It serves as both a report for GA and a user manual.
    +
    + The report describes how the code was parallelised and lists efficiency results for a few example runs. It also gives some example codes showing how to run the Merimbula test problem in parallel and discusses more technical aspects such as compilation issues and batch scripts for submitting compute jobs on parallel machines.
      \end{abstract}

  • inundation/parallel/documentation/results.tex

    r3198 r3245  
      Each node has two AMD Opteron 275 (Dual-core 2.2 GHz Processors) and 4 GB of main memory. The system achieves 60 Gigaflops with the Linpack benchmark, which is about 85\% of peak performance.

    - For each test run we evaluated the parallel efficiency
    + For each test run we evaluate the parallel efficiency as
      \[
      E_n = \frac{T_1}{nT_n} 100,
      \]
    - where $T_n = \max_{0\le i < n}\{t_i\}$, $n$ is the total number of processors (subdomains) and $t_i$ is the time required to run the {\tt evolve} code on processor $i$.  Note that $t_i$ does not include the time required to build and subpartion the mesh etc., it only includes the time required to do the evolve calculations (eg. \code{domain.evolve(yieldstep = 0.1, finaltime = 3.0)}).
    + where $T_n = \max_{0\le i < n}\{t_i\}$, $n$ is the total number of processors (submeshes) and $t_i$ is the time required to run the {\tt evolve} code on processor $i$. Note that $t_i$ does not include the time required to build and subpartition the mesh, etc.; it only includes the time required to do the evolve calculations (e.g.\ \code{domain.evolve(yieldstep = 0.1, finaltime = 3.0)}).

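
Read as a percentage, the measure compares the one-processor evolve time against $n$ times the slowest processor's evolve time. A small worked example of the formula, with made-up timings rather than figures from the report:

    # E_n = T_1 / (n * T_n) * 100, with T_n the slowest per-processor time.
    def parallel_efficiency(t_serial, evolve_times):
        n = len(evolve_times)
        t_n = max(evolve_times)          # slowest processor sets T_n
        return t_serial / (n * t_n) * 100.0

    # A 4-processor run whose slowest evolve takes 30 s against T_1 = 100 s
    # gives an efficiency of 100 / (4 * 30) * 100 = 83.3%.
    print(parallel_efficiency(100.0, [27.0, 29.5, 30.0, 28.2]))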

      \section{Advection, Rectangular Domain}

    - The first example looked at the rectangular domain example given in Section \ref{subsec:codeRPA}, excecpt that we changed the finaltime time to 1.0 (\code{domain.evolve(yieldstep = 0.1, finaltime = 1.0)}).
    + The first example looked at the rectangular domain example given in Section \ref{subsec:codeRPA}, except that we changed the final time to 1.0 (\code{domain.evolve(yieldstep = 0.1, finaltime = 1.0)}).

      For this particular example we can control the mesh size by changing the parameters \code{N} and \code{M} given in the following section of code taken from
     
      \end{verbatim}

    - Tables \ref{tbl:rpa40}, \ref{tbl:rpa80} and \ref{tbl:rpa160} show the efficiency results for different values of \code{N} and \code{M}. The examples where $n \le 4$ were run on one node containing 4 processors, the $n = 8$ example was run on 2 nodes. The communication within a node is faster than the communication across nodes, so we would expect to see a decrease in efficiency when we jump from 4 to 8 nodes. Furthermore, as \code{N} and \code{M} are increased the ratio of ghost triangles to other triangles decreases. Or anothr way to think of it is that the size of the boundary layer decreses. Hence the amount of communication relative the amount of computation decreases and thus the efficiency should increase.
    + Tables \ref{tbl:rpa40}, \ref{tbl:rpa80} and \ref{tbl:rpa160} show the efficiency results for different values of \code{N} and \code{M}. The examples where $n \le 4$ were run on one Opteron node containing 4 processors, while the $n = 8$ example was run on 2 nodes (giving a total of 8 processors). The communication within a node is faster than the communication across nodes, so we would expect to see a decrease in efficiency when we jump from 4 to 8 processors. Furthermore, as \code{N} and \code{M} are increased the ratio of exterior to interior triangles decreases, which in turn decreases the amount of communication relative to the amount of computation, so the efficiency should increase.

      The efficiency results shown here are competitive.
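
A back-of-envelope check of that scaling argument. The numbers below assume an N x M rectangular mesh of 2*N*M triangles cut into horizontal strips, with roughly one row of ghost triangles on each side of a strip; this is only a caricature of what the METIS partition actually produces:

    # Crude model: strip partition of a 2*N*M-triangle rectangular mesh.
    def ghost_fraction(N, M, n):
        owned = 2 * N * M / n            # triangles owned by one processor
        ghosts = 2 * 2 * M               # ghost rows above and below a strip
        return ghosts / owned

    for N in (40, 80, 160):
        # The communication-to-computation ratio shrinks as the mesh grows.
        print(N, round(ghost_fraction(N, N, 4), 3))   # 0.2, 0.1, 0.05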
     


    - Another way of measuring the performance of the code on a parallel machine is
    - to increase the problem size as the number of processors are increased so that
    - the number of triangles per processor remains roughly the same.  We have not
    - carried out measurements of this kind as we usually have static grids and it
    - is not possible to increase the number of triangles.
    + %Another way of measuring the performance of the code on a parallel machine is to increase the problem size as the number of processors are increased so that the number of triangles per processor remains roughly the same.  We have not carried out measurements of this kind as we usually have static grids and it is not possible to increase the number of triangles.

      \section{Advection, Merimbula Mesh}

    - We now look at another advection example, except this time the mesh comes from
    - the Merimbula test problem. Inother words, we ran the code given in Section
    + We now look at another advection example, except this time the mesh comes from the Merimbula test problem. That is, we ran the code given in Section
      \ref{subsec:codeRPMM}, except the final time was reduced to 10000
      (\code{finaltime = 10000}). The results are given in Table \ref{tbl:rpm}.
     
      Merimbula mesh. We used the code listed in Section \ref{subsec:codeRPSMM}. The
      results are listed in Table \ref{tbl:rpsm}. The efficiency results are not as
    - good as initally expected so we profiled the code and found that
    + good as initially expected, so we profiled the code and found that
      the problem is with the \code{update_boundary} routine in the {\tt domain.py}
      file. On one processor the \code{update_boundary} routine accounts for about
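
The report does not say which profiler was used. A sketch of how such a hot spot can be found with the standard-library profiler, assuming a domain object has already been built as in the earlier examples:

    # Illustrative only: profile an evolve run and rank routines by time.
    import cProfile
    import pstats

    def run_evolve(domain):
        # Same kind of evolve loop as in the examples quoted above.
        for t in domain.evolve(yieldstep = 0.1, finaltime = 3.0):
            pass

    def profile_evolve(domain):
        cProfile.runctx('run_evolve(domain)',
                        {'run_evolve': run_evolve}, {'domain': domain},
                        'evolve.prof')
        # Routines such as update_boundary show up near the top of this list.
        pstats.Stats('evolve.prof').sort_stats('cumulative').print_stats(10)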