Changeset 3245


Timestamp:
Jun 27, 2006, 4:01:02 PM
Author:
linda
Message:

Committed "final version" of report on parallel implementation

Location:
inundation/parallel/documentation
Files:
3 edited

  • inundation/parallel/documentation/parallel.tex

    r3244 r3245  
      The first step in parallelising the code is to subdivide the mesh
      into, roughly, equally sized partitions. On a rectangular mesh this may be
    - done by a simple co-ordinate based dissection, but on a complicated
    + done by a simple co-ordinate based dissection method, but on a complicated
      domain such as the Merimbula mesh shown in Figure \ref{fig:mergrid}
      a more sophisticated approach must be used.  We use pymetis, a
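
A minimal sketch of the kind of graph partitioning this step performs, using the generic PyMetis bindings (pymetis.part_graph); the wrapper shipped with the inundation code may expose a different interface, and the neighbour list below is made up purely for illustration:

    # Sketch only: partition a triangle mesh by its dual graph with METIS.
    import pymetis

    def partition_triangles(neighbours, nparts):
        """neighbours[i] lists the triangles sharing an edge with triangle i."""
        # METIS balances the partition sizes while minimising the number of
        # cut edges, i.e. the triangle faces shared between processors.
        edge_cuts, parts = pymetis.part_graph(nparts, adjacency=neighbours)
        return parts            # parts[i] = processor assigned to triangle i

    # Four triangles in a strip, split over two processors.
    triangle_neighbours = [[1], [0, 2], [1, 3], [2]]
    print(partition_triangles(triangle_neighbours, 2))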
     
      \begin{figure}[hbtp]
        \centerline{ \includegraphics[scale = 0.6]{figures/subdomain.eps}}
    -   \caption{An example subpartioning of a mesh.}
    +   \caption{An example subpartitioning of a mesh.}
       \label{fig:subdomain}
      \end{figure}
     
      \begin{figure}[hbtp]
        \centerline{ \includegraphics[scale = 0.6]{figures/subdomainghost.eps}}
    -   \caption{An example subpartioning with ghost triangles. The numbers in brackets shows the local numbering scheme that is calculated and stored with the mesh, but not implemented until the local mesh is built. See Section \ref{sec:part4}. }
    +   \caption{An example subpartitioning with ghost triangles. The numbers in brackets show the local numbering scheme that is calculated and stored with the mesh, but not implemented until the local mesh is built. See Section \ref{sec:part4}.}
       \label{fig:subdomaing}
      \end{figure}
     
      \subsection {Sending the Submeshes}\label{sec:part3}

    - All of functions described so far must be run in serial on Processor 0. The next step is to start the parallel computation by spreading the submeshes over the processors. The communication is carried out by
    + All of the functions described so far must be run in serial on Processor 0. The next step is to start the parallel computation by spreading the submeshes over the processors. The communication is carried out by
      \code{send_submesh} and \code{rec_submesh} defined in {\tt build_commun.py}.
      The \code{send_submesh} function should be called on Processor 0 and sends the Submesh $p$ to Processor $p$, while \code{rec_submesh} should be called by Processor $p$ to receive Submesh $p$ from Processor 0.
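
The pattern is the usual master/worker distribution step. The sketch below illustrates it with mpi4py rather than the MPI wrapper actually used by build_commun.py, and build_all_submeshes is a hypothetical stand-in for whatever the partitioning step returns:

    # Illustrative only: distribute one submesh to each processor.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        submeshes = build_all_submeshes()        # hypothetical partitioning step
        for p in range(1, comm.Get_size()):
            comm.send(submeshes[p], dest=p)      # role of send_submesh
        my_submesh = submeshes[0]                # Processor 0 keeps its own part
    else:
        my_submesh = comm.recv(source=0)         # role of rec_submesh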
     

      \subsection {Building the Local Mesh}\label{sec:part4}
    - After using \code{send_submesh} and \code{rec_submesh}, Processor $p$ should have its own local copy of Submesh $p$, however as stated previously the triangle numbering will be incorrect on all processors except number $0$. The \code{build_local_mesh} function from {\tt build_local.py} primarily focuses on renumbering the information stored with the submesh; including the nodes, vertices and quantities. Figure \ref{fig:subdomainf} shows what the mesh in each processor may look like.
    + After using \code{send_submesh} and \code{rec_submesh}, Processor $p$ should have its own local copy of Submesh $p$; however, as stated previously, the triangle numbering will be incorrect on all processors except Processor $0$. The \code{build_local_mesh} function from {\tt build_local.py} primarily focuses on renumbering the information stored with the submesh, including the nodes, vertices and quantities. Figure \ref{fig:subdomainf} shows what the mesh on each processor may look like.


      \begin{figure}[hbtp]
        \centerline{ \includegraphics[scale = 0.6]{figures/subdomainfinal.eps}}
    -   \caption{An example subpartioning after the submeshes have been renumbered.}
    +   \caption{An example subpartitioning after the submeshes have been renumbered.}
       \label{fig:subdomainf}
      \end{figure}
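
The renumbering amounts to mapping global triangle and node identifiers onto a contiguous local numbering. A stripped-down illustration of the idea (not the actual build_local.py code, which also handles quantities, boundaries and the ghost communication dictionaries):

    # Illustrative only: renumber globally-numbered triangles to local indices.
    def renumber(global_node_ids, triangles):
        """triangles is a list of 3-tuples of global node ids."""
        # Assign local indices 0, 1, 2, ... in the order the nodes are stored.
        global_to_local = {g: l for l, g in enumerate(global_node_ids)}
        local_triangles = [tuple(global_to_local[g] for g in tri)
                           for tri in triangles]
        return global_to_local, local_triangles

    # Nodes 7, 12 and 40 of the full mesh become local nodes 0, 1 and 2.
    print(renumber([7, 12, 40], [(7, 12, 40)]))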
     
      \end{verbatim}

    - Note that the submesh is not received by, or sent to, Processor 0. Rather     \code{hostmesh = extract_hostmesh(submesh)} simply extracts the mesh that has been assigned to Processor 0. Recall \code{submesh} contains the list of submeshes to be assigned to each processor. This is described further in Section \ref{sec:part3}. The \code{build_local_mesh} renumbers the nodes
    + Note that the submesh is not received by, or sent to, Processor 0. Rather, \code{hostmesh = extract_hostmesh(submesh)} simply extracts the mesh that has been assigned to Processor 0. Recall \code{submesh} contains the list of submeshes to be assigned to each processor. This is described further in Section \ref{sec:part3}.
    + %The \code{build_local_mesh} renumbers the nodes
      \begin{verbatim}
          points, vertices, boundary, quantities, ghost_recv_dict, full_send_dict)=\
     
      \begin{itemize}
      \item Python 2.0 or later
    - \item Numeric Python (incling RandomArray) matching the Python installation
    + \item Numeric Python (including RandomArray) matching the Python installation
      \item Native MPI C library
      \item Native C compiler
  • inundation/parallel/documentation/report.tex

    r3033 r3245  

      \begin{abstract}
    + This document describes work done by the authors as part of a consultancy with GA during 2005--2006. It serves as both a report for GA and a user manual.
    +
    + The report describes how the code was parallelised and lists efficiency results for a few example runs. It also gives some example codes showing how to run the Merimbula test problem in parallel and discusses more technical aspects such as compilation issues and batch scripts for submitting compute jobs on parallel machines.
      \end{abstract}

  • inundation/parallel/documentation/results.tex

    r3198 r3245  
      Each node has two AMD Opteron 275 (Dual-core 2.2 GHz Processors) and 4 GB of main memory. The system achieves 60 Gigaflops with the Linpack benchmark, which is about 85\% of peak performance.

    - For each test run we evaluated the parallel efficiency
    + For each test run we evaluate the parallel efficiency as
      \[
      E_n = \frac{T_1}{nT_n} 100,
      \]
    - where $T_n = \max_{0\le i < n}\{t_i\}$, $n$ is the total number of processors (subdomains) and $t_i$ is the time required to run the {\tt evolve} code on processor $i$.  Note that $t_i$ does not include the time required to build and subpartion the mesh etc., it only includes the time required to do the evolve calculations (eg. \code{domain.evolve(yieldstep = 0.1, finaltime = 3.0)}).
    + where $T_n = \max_{0\le i < n}\{t_i\}$, $n$ is the total number of processors (submeshes) and $t_i$ is the time required to run the {\tt evolve} code on processor $i$. Note that $t_i$ does not include the time required to build and subpartition the mesh, etc.; it only includes the time required to do the evolve calculations (e.g.\ \code{domain.evolve(yieldstep = 0.1, finaltime = 3.0)}).

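
Read as a percentage, the measure compares the one-processor evolve time against $n$ times the slowest processor's evolve time. A small worked example of the formula, with made-up timings rather than figures from the report:

    # E_n = T_1 / (n * T_n) * 100, with T_n the slowest per-processor time.
    def parallel_efficiency(t_serial, evolve_times):
        n = len(evolve_times)
        t_n = max(evolve_times)          # slowest processor sets T_n
        return t_serial / (n * t_n) * 100.0

    # A 4-processor run whose slowest evolve takes 30 s against T_1 = 100 s
    # gives an efficiency of 100 / (4 * 30) * 100 = 83.3%.
    print(parallel_efficiency(100.0, [27.0, 29.5, 30.0, 28.2]))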

      \section{Advection, Rectangular Domain}

    - The first example looked at the rectangular domain example given in Section \ref{subsec:codeRPA}, excecpt that we changed the finaltime time to 1.0 (\code{domain.evolve(yieldstep = 0.1, finaltime = 1.0)}).
    + The first example looked at the rectangular domain example given in Section \ref{subsec:codeRPA}, except that we changed the final time to 1.0 (\code{domain.evolve(yieldstep = 0.1, finaltime = 1.0)}).

      For this particular example we can control the mesh size by changing the parameters \code{N} and \code{M} given in the following section of code taken from
     
      \end{verbatim}

    - Tables \ref{tbl:rpa40}, \ref{tbl:rpa80} and \ref{tbl:rpa160} show the efficiency results for different values of \code{N} and \code{M}. The examples where $n \le 4$ were run on one node containing 4 processors, the $n = 8$ example was run on 2 nodes. The communication within a node is faster than the communication across nodes, so we would expect to see a decrease in efficiency when we jump from 4 to 8 nodes. Furthermore, as \code{N} and \code{M} are increased the ratio of ghost triangles to other triangles decreases. Or anothr way to think of it is that the size of the boundary layer decreses. Hence the amount of communication relative the amount of computation decreases and thus the efficiency should increase.
    + Tables \ref{tbl:rpa40}, \ref{tbl:rpa80} and \ref{tbl:rpa160} show the efficiency results for different values of \code{N} and \code{M}. The examples where $n \le 4$ were run on one Opteron node containing 4 processors, while the $n = 8$ example was run on 2 nodes (giving a total of 8 processors). The communication within a node is faster than the communication across nodes, so we would expect to see a decrease in efficiency when we jump from 4 to 8 processors. Furthermore, as \code{N} and \code{M} are increased the ratio of exterior to interior triangles decreases, which in turn decreases the amount of communication relative to the amount of computation, so the efficiency should increase.

      The efficiency results shown here are competitive.
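
A back-of-envelope check of that scaling argument. The numbers below assume an N x M rectangular mesh of 2*N*M triangles cut into horizontal strips, with roughly one row of ghost triangles on each side of a strip; this is only a caricature of what the METIS partition actually produces:

    # Crude model: strip partition of a 2*N*M-triangle rectangular mesh.
    def ghost_fraction(N, M, n):
        owned = 2 * N * M / n            # triangles owned by one processor
        ghosts = 2 * 2 * M               # ghost rows above and below a strip
        return ghosts / owned

    for N in (40, 80, 160):
        # The communication-to-computation ratio shrinks as the mesh grows.
        print(N, round(ghost_fraction(N, N, 4), 3))   # 0.2, 0.1, 0.05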
     


    - Another way of measuring the performance of the code on a parallel machine is
    - to increase the problem size as the number of processors are increased so that
    - the number of triangles per processor remains roughly the same.  We have not
    - carried out measurements of this kind as we usually have static grids and it
    - is not possible to increase the number of triangles.
    + %Another way of measuring the performance of the code on a parallel machine is to increase the problem size as the number of processors are increased so that the number of triangles per processor remains roughly the same.  We have not carried out measurements of this kind as we usually have static grids and it is not possible to increase the number of triangles.

      \section{Advection, Merimbula Mesh}

    - We now look at another advection example, except this time the mesh comes from
    - the Merimbula test problem. Inother words, we ran the code given in Section
    + We now look at another advection example, except this time the mesh comes from the Merimbula test problem. That is, we ran the code given in Section
      \ref{subsec:codeRPMM}, except the final time was reduced to 10000
      (\code{finaltime = 10000}). The results are given in Table \ref{tbl:rpm}.
     
      Merimbula mesh. We used the code listed in Section \ref{subsec:codeRPSMM}. The
      results are listed in Table \ref{tbl:rpsm}. The efficiency results are not as
    - good as initally expected so we profiled the code and found that
    + good as initially expected, so we profiled the code and found that
      the problem is with the \code{update_boundary} routine in the {\tt domain.py}
      file. On one processor the \code{update_boundary} routine accounts for about
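
The report does not say which profiler was used. A sketch of how such a hot spot can be found with the standard-library profiler, assuming a domain object has already been built as in the earlier examples:

    # Illustrative only: profile an evolve run and rank routines by time.
    import cProfile
    import pstats

    def run_evolve(domain):
        # Same kind of evolve loop as in the examples quoted above.
        for t in domain.evolve(yieldstep = 0.1, finaltime = 3.0):
            pass

    def profile_evolve(domain):
        cProfile.runctx('run_evolve(domain)',
                        {'run_evolve': run_evolve}, {'domain': domain},
                        'evolve.prof')
        # Routines such as update_boundary show up near the top of this list.
        pstats.Stats('evolve.prof').sort_stats('cumulative').print_stats(10)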