Changeset 3315


Timestamp:
Jul 11, 2006, 8:39:56 PM
Author:
steve
Message:
 
Location:
inundation/parallel
Files:
4 edited

  • inundation/parallel/documentation/parallel.tex

r3245 r3315

\end{figure}

Figure \ref{fig:mergrid4} shows the Merimbula grid partitioned over
four processors. Note that one submesh may comprise several
unconnected mesh partitions. Table \ref{tbl:mer4} gives the node
distribution over the four processors while Table \ref{tbl:mer8}
shows the distribution over eight processors. These results imply
that Pymetis gives a reasonably well balanced partition of the mesh.

\begin{figure}[hbtp]
     
  \centerline{ \includegraphics[scale = 0.75]{figures/mermesh4d.eps}
  \includegraphics[scale = 0.75]{figures/mermesh4b.eps}}
  \caption{The Merimbula grid partitioned over 4 processors using Metis.}
 \label{fig:mergrid4}
\end{figure}

\begin{table}
\caption{Running a 4-way test of
{\tt run_parallel_sw_merimbula_metis.py}}\label{tbl:mer4}
     
\end{table}

The number of submeshes found by Pymetis is equal to the number of
processors; Submesh $p$ will be assigned to Processor $p$.

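As a purely illustrative sketch (the variable names below are not part
of the ANUGA interface), the partition returned by Metis can be thought
of as a vector assigning each triangle to a submesh, from which the
per-processor triangle lists and counts follow directly:
\begin{verbatim}
# Illustrative sketch only: a Metis-style partition vector assigns each
# triangle to a submesh; Submesh p is then assigned to Processor p.
triangle_to_proc = [0, 0, 1, 1, 0, 1]    # hypothetical: 6 triangles, 2 submeshes

numprocs  = max(triangle_to_proc) + 1
submeshes = [[] for p in range(numprocs)]
for triangle_id in range(len(triangle_to_proc)):
    submeshes[triangle_to_proc[triangle_id]].append(triangle_id)

# A per-submesh triangle count, in the spirit of the triangles_per_proc
# argument that appears in the calls to build_submesh below.
triangles_per_proc = [len(s) for s in submeshes]
\end{verbatim}
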
\subsection {Building the Ghost Layer}\label{sec:part2}
The function \code{build_submesh}, defined in {\tt build_submesh.py},
is the work-horse and is responsible for setting up the communication
pattern as well as assigning the local numbering scheme for the
submeshes.

Consider the example subpartitioning given in Figure
\ref{fig:subdomain}. During the \code{evolve} calculations Triangle
2 in Submesh 0 will need to access its neighbour Triangle 3 stored
in Submesh 1. The standard approach to this problem is to add an
extra layer of triangles, which we call ghost triangles. The ghost
triangles are read-only; they should not be updated during the
calculations and are only there to hold any extra information that
a processor may need to complete its calculations. The ghost
triangle values are updated through communication calls. Figure
\ref{fig:subdomaing} shows the submeshes with the extra layer of
ghost triangles.

\begin{figure}[hbtp]
     
\begin{figure}[hbtp]
  \centerline{ \includegraphics[scale = 0.6]{figures/subdomainghost.eps}}
  \caption{An example subpartitioning with ghost triangles. The numbers
in brackets show the local numbering scheme that is calculated and
stored with the mesh, but not implemented until the local mesh is
built. See Section \ref{sec:part4}.}
 \label{fig:subdomaing}
\end{figure}

When partitioning the mesh we introduce new, dummy, boundary edges.
For example, Triangle 2 in Submesh 1 from Figure
\ref{fig:subdomaing} originally shared an edge with Triangle 1, but
after partitioning that edge becomes a boundary edge. These new
boundary edges are tagged as \code{ghost} and should, in general,
be assigned a type of \code{None}. The following piece of code taken
from {\tt run_parallel_advection.py} shows an example.
{\small \begin{verbatim}
T = Transmissive_boundary(domain)
domain.default_order = 2
domain.set_boundary( {'left': T, 'right': T, 'bottom': T, 'top': T,
                      'ghost': None} )
\end{verbatim}}

Looking at Figure \ref{fig:subdomaing} we see that after each
\code{evolve} step Processor 0 will have to send the updated values
for Triangle 2 and Triangle 4 to Processor 1, and similarly
Processor 1 will have to send the updated values for Triangle 3 and
Triangle 5 (recall that Submesh $p$ will be assigned to Processor
$p$). The \code{build_submesh} function builds a dictionary that
defines the communication pattern.

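To make the communication pattern concrete, the dictionaries below
spell it out for the example of Figure \ref{fig:subdomaing}. The names
and layout are illustrative only and need not match the dictionary
actually produced by \code{build_submesh}.
\begin{verbatim}
# Illustrative only: the names and layout of these dictionaries are
# hypothetical and need not match those built by build_submesh.

# Triangles whose updated values must be sent after each evolve step
send_triangles = {0: {1: [2, 4]},   # Processor 0 sends Triangles 2 and 4 to Processor 1
                  1: {0: [3, 5]}}   # Processor 1 sends Triangles 3 and 5 to Processor 0

# Ghost triangles whose values are received (read-only copies)
recv_triangles = {0: {1: [3, 5]},   # Processor 0 holds ghost copies of Triangles 3 and 5
                  1: {0: [2, 4]}}   # Processor 1 holds ghost copies of Triangles 2 and 4
\end{verbatim}
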
Finally, the ANUGA code assumes that the triangles (and nodes etc.) are
numbered consecutively starting from 0. Consequently, if Submesh 1 in
Figure \ref{fig:subdomaing} were passed into the \code{evolve}
calculations it would crash. The \code{build_submesh} function
determines a local numbering scheme for each submesh, but it does not
actually update the numbering; that is left to \code{build_local}.
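
The renumbering itself amounts to mapping the global triangle (and
node) identifiers held by a submesh onto consecutive integers starting
from 0, along the following lines (an illustrative sketch with
hypothetical identifiers, not the actual \code{build_local} code):
\begin{verbatim}
# Illustrative sketch of the renumbering, not the actual build_local code.
# Hypothetical global triangle identifiers held by a submesh, with the
# full triangles listed before the ghost triangles:
global_ids = [3, 5, 2, 4]

# Map each global identifier to a consecutive local identifier from 0
global_to_local = {}
for local_id in range(len(global_ids)):
    global_to_local[global_ids[local_id]] = local_id
# global_to_local is now {3: 0, 5: 1, 2: 2, 4: 3}
\end{verbatim}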
     
\subsection {Sending the Submeshes}\label{sec:part3}

All of the functions described so far must be run in serial on
Processor 0. The next step is to start the parallel computation by
spreading the submeshes over the processors. The communication is
carried out by \code{send_submesh} and \code{rec_submesh}, defined in
{\tt build_commun.py}. The \code{send_submesh} function should be
called on Processor 0 and sends Submesh $p$ to Processor $p$,
while \code{rec_submesh} should be called by Processor $p$ to
receive Submesh $p$ from Processor 0.

As an aside, the order of communication is very important. If someone
were to modify the \code{send_submesh} routine, the corresponding
change must be made to the \code{rec_submesh} routine.
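
The overall control flow is sketched below, where \code{myid} denotes
the processor rank and \code{numprocs} the number of processors. The
argument lists of \code{send_submesh} and \code{rec_submesh} are
abbreviated and illustrative only; see {\tt build_commun.py} for the
actual interfaces.
\begin{verbatim}
# Sketch of the distribution phase; the argument lists of send_submesh
# and rec_submesh are abbreviated and illustrative only -- see
# build_commun.py for the actual interfaces.
if myid == 0:
    # Processor 0 keeps the submesh assigned to it ...
    hostmesh = extract_hostmesh(submesh)
    # ... and sends Submesh p to Processor p for p = 1, ..., numprocs-1
    for p in range(1, numprocs):
        send_submesh(submesh, p)
else:
    # Processor p receives Submesh p from Processor 0
    submesh_p = rec_submesh(0)
\end{verbatim}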
     
\section{Some Example Code}

Chapter \ref{chap:code} gives full listings of some example codes.

The first example in Section \ref{subsec:codeRPA} solves the advection equation on a
rectangular mesh. A rectangular mesh is highly structured so a coordinate-based
decomposition can be used and the partitioning is simply done by calling the
routine \code{parallel_rectangle} as shown below.
\begin{verbatim}
#######################
     
\end{verbatim}

Most simulations will not be done on a rectangular mesh, and the approach to
subpartitioning the mesh is different from the one described above; however,
this example may be of interest to those who want to measure the parallel
efficiency of the code on their machine. A rectangular mesh should give a good
load balance and is therefore an important first test problem.

     
    # Build the mesh that should be assigned to each processor.
    # This includes ghost nodes and the communication pattern

    submesh = build_submesh(nodes, triangles, boundary, quantities, \
                            triangles_per_proc)
     
\newpage
\begin{itemize}
\item
These first few lines of code read in and define the (global) mesh. The \code{Set_Stage} function sets the initial conditions. See the code in Section \ref{subsec:codeRPMM} for the definition of \code{Set_Stage}.
\begin{verbatim}
     
\end{verbatim}

\item The next step is to build a boundary layer of ghost triangles and define the communication pattern. This step is implemented by \code{build_submesh} as discussed in Section \ref{sec:part2}. The \code{submesh} variable contains a copy of the submesh for each processor.
\begin{verbatim}
    submesh = build_submesh(nodes, triangles, boundary, quantities, \
                            triangles_per_proc)
\end{verbatim}

\item The actual parallel communication starts when the submesh partitions are sent to the processors by calling \code{send_submesh}.
\begin{verbatim}
    for p in range(1, numprocs):
     
\end{verbatim}

Note that the submesh is not received by, or sent to, Processor 0. Rather, \code{hostmesh = extract_hostmesh(submesh)} simply extracts the mesh that has been assigned to Processor 0. Recall that \code{submesh} contains the list of submeshes to be assigned to each processor. This is described further in Section \ref{sec:part3}.
%The \code{build_local_mesh} renumbers the nodes
\begin{verbatim}
  • inundation/parallel/documentation/report.tex

r3245 r3315

\begin{abstract}
This document describes work done by the authors as part of a
consultancy with GA during 2005--2006. The paper serves as both a
report for GA and a user manual.

The report contains a description of how the code was parallelised
and it lists efficiency results for a few example runs. It also
gives some example codes showing how to run the Merimbula test
problem in parallel and talks about more technical aspects such as
compilation issues and batch scripts for submitting compute jobs on
parallel machines.
\end{abstract}

  • inundation/parallel/documentation/results.tex

r3245 r3315

To evaluate the performance of the code on a parallel machine we ran
some examples on a cluster of four nodes connected with PathScale
InfiniPath HTX. Each node has two AMD Opteron 275 processors
(dual-core, 2.2 GHz) and 4 GB of main memory. The system achieves 60
Gigaflops with the Linpack benchmark, which is about 85\% of peak
performance.

For each test run we evaluate the parallel efficiency as
     
E_n = \frac{T_1}{nT_n}\times 100,
\]
where $T_n = \max_{0\le i < n}\{t_i\}$, $n$ is the total number of
processors (submeshes) and $t_i$ is the time required to run the
{\tt evolve} code on processor $i$. Note that $t_i$ does not include
the time required to build and subpartition the mesh etc.; it only
includes the time required to do the evolve calculations (e.g.
\code{domain.evolve(yieldstep = 0.1, finaltime = 3.0)}).

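As an illustrative sketch (the timing values are placeholders, not
measurements from this report), the efficiency can be computed from
the per-processor evolve times as follows:
\begin{verbatim}
# Sketch of the efficiency calculation; the timing values below are
# placeholders, not measurements from this report.
def parallel_efficiency(T1, times):
    # times[i] is the evolve time t_i on processor i
    n  = len(times)
    Tn = max(times)            # the slowest processor determines T_n
    return 100.0*T1/(n*Tn)

# A serial time of 100 s and four hypothetical parallel times give
# an efficiency of about 89%.
E4 = parallel_efficiency(100.0, [27.0, 26.5, 28.0, 27.5])
\end{verbatim}
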
\section{Advection, Rectangular Domain}

The first example looked at the rectangular domain example given in
Section \ref{subsec:codeRPA}, except we changed the final time to 1.0
(\code{domain.evolve(yieldstep = 0.1, finaltime = 1.0)}).

For this particular example we can control the mesh size by changing
the parameters \code{N} and \code{M} given in the following section
of code taken from Section \ref{subsec:codeRPA}.

     
\end{verbatim}

Tables \ref{tbl:rpa40}, \ref{tbl:rpa80} and \ref{tbl:rpa160} show
the efficiency results for different values of \code{N} and
\code{M}. The examples where $n \le 4$ were run on one Opteron node
containing 4 processors, while the $n = 8$ example was run on 2 nodes
(giving a total of 8 processors). The communication within a node is
faster than the communication across nodes, so we would expect to
see a decrease in efficiency when we jump from 4 to 8 processors.
Furthermore, as \code{N} and \code{M} are increased the ratio of
exterior to interior triangles decreases (the number of ghost
triangles grows with the length of the interface between submeshes,
whereas the number of interior triangles grows with the area), which
in turn decreases the amount of communication relative to the amount
of computation and thus the efficiency should increase.

The efficiency results shown here are competitive.

\begin{table}
\caption{Parallel Efficiency Results for the Advection Problem on a
Rectangular Domain with {\tt N} = 40, {\tt M} = 40.\label{tbl:rpa40}}
\begin{center}
\begin{tabular}{|c|c c|}\hline
\end{table}

\begin{table}
\caption{Parallel Efficiency Results for the Advection Problem on a
Rectangular Domain with {\tt N} = 80, {\tt M} = 80.\label{tbl:rpa80}}
\begin{center}
\begin{tabular}{|c|c c|}\hline
     


\begin{table}
\caption{Parallel Efficiency Results for the Advection Problem on a
Rectangular Domain with {\tt N} = 160, {\tt M} = 160.\label{tbl:rpa160}}
\begin{center}
\begin{tabular}{|c|c c|}\hline
     

%Another way of measuring the performance of the code on a parallel machine is to increase the problem size as the number of processors is increased so that the number of triangles per processor remains roughly the same. We have not carried out measurements of this kind as we usually have static grids and it is not possible to increase the number of triangles.

\section{Advection, Merimbula Mesh}

We now look at another advection example, except this time the mesh
comes from the Merimbula test problem. That is, we ran the code
given in Section \ref{subsec:codeRPMM}, except the final time was
reduced to 10000 (\code{finaltime = 10000}). The results are given
in Table \ref{tbl:rpm}. These are good efficiency results,
especially considering the structure of the Merimbula mesh.
%Note that since we are solving an advection problem the amount of calculation
%done on each triangle is relatively low, when we move to other problems that
%involve more calculations we would expect the computation to communication ratio to increase and thus get an increase in efficiency.

\begin{table}
\caption{Parallel Efficiency Results for the Advection Problem on the
  Merimbula Mesh.\label{tbl:rpm}}
     
Processor 0 spent about 3.8 times as long on the \code{update_boundary}
calculations as Processor 7. This load imbalance reduced the parallel
efficiency.

Before doing the shallow water equation calculations on a larger number of
     
optimised as much as possible to reduce the effect of the load imbalance.


\begin{table}
\caption{Parallel Efficiency Results for the Shallow Water Equation on the
  Merimbula Mesh.\label{tbl:rpsm}}
  • inundation/parallel/parallel_advection.py

r3184 r3315

    pass

- from pyvolution.advection import *
+ from pyvolution.advection_vtk import *
from Numeric import zeros, Float, Int, ones, allclose, array
import pypar