Changeset 3315 for inundation/parallel
- Timestamp: Jul 11, 2006, 8:39:56 PM
- Location: inundation/parallel
- Files: 4 edited
inundation/parallel/documentation/parallel.tex
r3245 → r3315

\end{figure}

Figure \ref{fig:mergrid4} shows the Merimbula grid partitioned over
four processors. Note that one submesh may comprise several
unconnected mesh partitions. Table \ref{tbl:mer4} gives the node
distribution over the four processors, while Table \ref{tbl:mer8}
shows the distribution over eight processors. These results imply
that Pymetis gives a reasonably well balanced partition of the mesh.

\begin{figure}[hbtp]
…
\centerline{\includegraphics[scale = 0.75]{figures/mermesh4d.eps}
            \includegraphics[scale = 0.75]{figures/mermesh4b.eps}}
\caption{The Merimbula grid partitioned over 4 processors using Metis.}
\label{fig:mergrid4}
\end{figure}

\begin{table}
\caption{Running a 4-way test of
{\tt run_parallel_sw_merimbula_metis.py}}\label{tbl:mer4}
…
\end{table}

The number of submeshes found by Pymetis is equal to the number of
processors; Submesh $p$ will be assigned to Processor $p$.
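Metis balances the number of triangles assigned to each submesh while
trying to keep neighbouring triangles on the same processor, which in
turn keeps the eventual communication small. The sketch below shows
how the triangle adjacency information that a graph partitioner works
from can be derived from the triangle list; the function name and
data layout are illustrative only and are not taken from the
{\tt pymetis} wrapper or the {\tt pmesh} code.
\begin{verbatim}
# Sketch only: build the triangle "dual graph" (triangles are the
# vertices, shared edges are the graph edges) that a partitioner such
# as Metis balances across processors.
def triangle_adjacency(triangles):
    """For each triangle, list the triangles sharing an edge with it."""
    edge_to_tris = {}
    for t, (a, b, c) in enumerate(triangles):
        for edge in ((a, b), (b, c), (c, a)):
            key = tuple(sorted(edge))
            edge_to_tris.setdefault(key, []).append(t)

    adjacency = [[] for _ in triangles]
    for tris in edge_to_tris.values():
        if len(tris) == 2:        # interior edge shared by two triangles
            t0, t1 = tris
            adjacency[t0].append(t1)
            adjacency[t1].append(t0)
    return adjacency

# Two triangles sharing the edge (1, 2)
triangles = [(0, 1, 2), (1, 3, 2)]
print(triangle_adjacency(triangles))   # [[1], [0]]
\end{verbatim}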
\subsection{Building the Ghost Layer}\label{sec:part2}
The function {\tt build_submesh.py} is the work-horse and is
responsible for setting up the communication pattern as well as
assigning the local numbering scheme for the submeshes.

Consider the example subpartitioning given in Figure
\ref{fig:subdomain}. During the \code{evolve} calculations, Triangle
2 in Submesh 0 will need to access its neighbour, Triangle 3, stored
in Submesh 1. The standard approach to this problem is to add an
extra layer of triangles, which we call ghost triangles. The ghost
triangles are read-only: they should not be updated during the
calculations; they are only there to hold any extra information that
a processor may need to complete its calculations. The ghost
triangle values are updated through communication calls. Figure
\ref{fig:subdomaing} shows the submeshes with the extra layer of
ghost triangles.

\begin{figure}[hbtp]
…

\begin{figure}[hbtp]
\centerline{\includegraphics[scale = 0.6]{figures/subdomainghost.eps}}
\caption{An example subpartitioning with ghost triangles. The numbers
in brackets show the local numbering scheme that is calculated and
stored with the mesh, but not implemented until the local mesh is
built. See Section \ref{sec:part4}.}
\label{fig:subdomaing}
\end{figure}

When partitioning the mesh we introduce new, dummy, boundary edges.
For example, Triangle 2 in Submesh 1 from Figure
\ref{fig:subdomaing} originally shared an edge with Triangle 1, but
after partitioning that edge becomes a boundary edge. These new
boundary edges are tagged as \code{ghost} and should, in general, be
assigned a type of \code{None}. The following piece of code, taken
from {\tt run_parallel_advection.py}, shows an example.
{\small \begin{verbatim}
T = Transmissive_boundary(domain)
domain.default_order = 2
domain.set_boundary( {'left': T, 'right': T, 'bottom': T,
                      'top': T, 'ghost': None} )
\end{verbatim}}

Looking at Figure \ref{fig:subdomaing} we see that after each
\code{evolve} step Processor 0 will have to send the updated values
for Triangle 2 and Triangle 4 to Processor 1, and similarly
Processor 1 will have to send the updated values for Triangle 3 and
Triangle 5 (recall that Submesh $p$ will be assigned to Processor
$p$). The \code{build_submesh} function builds a dictionary that
defines the communication pattern.
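The exact contents of this dictionary are defined in
{\tt build_submesh.py}; the sketch below merely illustrates the kind
of information it must record for the example in Figure
\ref{fig:subdomaing}. The key names are invented for this
illustration and are not the names used in the code.
\begin{verbatim}
# Illustration only: which (global) triangles must be exchanged after
# each evolve step for the two-submesh example in the figure.
communication_pattern = {
    # (sender, receiver) -> triangles whose updated values are sent
    'send':    {(0, 1): [2, 4],
                (1, 0): [3, 5]},
    # (receiver, sender) -> ghost triangles filled by the message
    'receive': {(1, 0): [2, 4],
                (0, 1): [3, 5]},
}

# After each evolve step only these triangles are communicated; all
# other ghost data on a processor is left untouched.
for (sender, receiver), tri_ids in communication_pattern['send'].items():
    print(sender, '->', receiver, ':', tri_ids)
\end{verbatim}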
Finally, the ANUGA code assumes that the triangles (and nodes etc.)
are numbered consecutively starting from 0. Consequently, if Submesh
1 in Figure \ref{fig:subdomaing} was passed into the \code{evolve}
calculations it would crash. The \code{build_submesh} function
determines a local numbering scheme for each submesh, but it does
not actually update the numbering; that is left to
\code{build_local}.

…

\subsection{Sending the Submeshes}\label{sec:part3}

All of the functions described so far must be run in serial on
Processor 0. The next step is to start the parallel computation by
spreading the submeshes over the processors. The communication is
carried out by \code{send_submesh} and \code{rec_submesh} defined in
{\tt build_commun.py}. The \code{send_submesh} function should be
called on Processor 0 and sends Submesh $p$ to Processor $p$, while
\code{rec_submesh} should be called by Processor $p$ to receive
Submesh $p$ from Processor 0.

As an aside, the order of communication is very important. If
someone were to modify the \code{send_submesh} routine, the
corresponding change must be made to the \code{rec_submesh} routine.
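To see why the ordering matters, consider the following minimal
sketch using the \code{pypar} calls \code{send} and \code{receive}
(the payload is dummy data rather than a real submesh;
\code{send_submesh} and \code{rec_submesh} transfer considerably more
information). If the receives on Processor $p$ are not issued in the
same order as the matching sends on Processor 0, the messages are
unpacked into the wrong variables, and with more complicated exchange
patterns the program can deadlock.
\begin{verbatim}
# Sketch only: matched send/receive ordering with pypar.
import pypar

myid = pypar.rank()
numprocs = pypar.size()

if myid == 0:
    for p in range(1, numprocs):
        nodes = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # dummy payload
        triangles = [[0, 1, 2]]                        # dummy payload
        # Processor p must issue its receives in exactly this order.
        pypar.send(nodes, p)
        pypar.send(triangles, p)
else:
    # Swapping these two receives would silently put the node list in
    # 'triangles' and the triangle list in 'nodes'.
    nodes = pypar.receive(0)
    triangles = pypar.receive(0)

pypar.finalize()
\end{verbatim}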
…

\section{Some Example Code}

Chapter \ref{chap:code} gives full listings of some example codes.

The first example, in Section \ref{subsec:codeRPA}, solves the
advection equation on a rectangular mesh. A rectangular mesh is
highly structured, so a coordinate based decomposition can be used
and the partitioning is simply done by calling the routine
\code{parallel_rectangle} as shown below.
\begin{verbatim}
#######################
…
\end{verbatim}

Most simulations will not be done on a rectangular mesh, and the
approach to subpartitioning the mesh is different from the one
described above; however, this example may be of interest to those
who want to measure the parallel efficiency of the code on their
machine. A rectangular mesh should give a good load balance and is
therefore an important first test problem.

…

# Build the mesh that should be assigned to each processor.
# This includes ghost nodes and the communication pattern

submesh = build_submesh(nodes, triangles, boundary, quantities, \
                        triangles_per_proc)

…

\newpage
\begin{itemize}
\item
These first few lines of code read in and define the (global) mesh.
The \code{Set_Stage} function sets the initial conditions. See the
code in \ref{subsec:codeRPMM} for the definition of \code{Set_Stage}.
\begin{verbatim}
…
\end{verbatim}

\item The next step is to build a boundary layer of ghost triangles
and define the communication pattern. This step is implemented by
\code{build_submesh}, as discussed in Section \ref{sec:part2}. The
\code{submesh} variable contains a copy of the submesh for each
processor.
\begin{verbatim}
submesh = build_submesh(nodes, triangles, boundary, quantities, \
                        triangles_per_proc)
\end{verbatim}

\item The actual parallel communication starts when the submesh
partitions are sent to the processors by calling \code{send_submesh}.
\begin{verbatim}
for p in range(1, numprocs):
…
\end{verbatim}

Note that the submesh is not received by, or sent to, Processor 0.
Rather, \code{hostmesh = extract_hostmesh(submesh)} simply extracts
the mesh that has been assigned to Processor 0. Recall that
\code{submesh} contains the list of submeshes to be assigned to each
processor. This is described further in Section \ref{sec:part3}.
%The \code{build_local_mesh} renumbers the nodes
\begin{verbatim}
…
inundation/parallel/documentation/report.tex
r3245 → r3315

\begin{abstract}
This document describes work done by the authors as part of a
consultancy with GA during 2005--2006. The paper serves as both a
report for GA and a user manual.

The report contains a description of how the code was parallelised
and it lists efficiency results for a few example runs. It also
gives some example codes showing how to run the Merimbula test
problem in parallel, and discusses more technical aspects such as
compilation issues and batch scripts for submitting compute jobs on
parallel machines.
\end{abstract}
inundation/parallel/documentation/results.tex
r3245 → r3315

To evaluate the performance of the code on a parallel machine we ran
some examples on a cluster of four nodes connected with PathScale
InfiniPath HTX. Each node has two AMD Opteron 275 (dual-core,
2.2 GHz) processors and 4 GB of main memory. The system achieves 60
Gigaflops on the Linpack benchmark, which is about 85\% of peak
performance.

For each test run we evaluate the parallel efficiency as
\[
E_n = \frac{T_1}{nT_n} 100,
\]
where $T_n = \max_{0\le i < n}\{t_i\}$, $n$ is the total number of
processors (one submesh per processor) and $t_i$ is the time required
to run the {\tt evolve} code on processor $i$. Note that $t_i$ does
not include the time required to build and subpartition the mesh
etc.; it only includes the time required to do the evolve
calculations (e.g.\ \code{domain.evolve(yieldstep = 0.1, finaltime =
3.0)}).
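As a concrete illustration of this definition, the following snippet
evaluates $E_n$ for a set of invented timings; the numbers are made
up and are not measurements from the cluster described above.
\begin{verbatim}
# E_n = T_1/(n T_n) * 100, where T_n is the time of the slowest
# processor.  The timings below are invented for illustration.
def parallel_efficiency(t_serial, timings):
    n = len(timings)
    T_n = max(timings)
    return 100.0 * t_serial / (n * T_n)

T_1 = 100.0                       # serial evolve time (seconds)
t = [26.0, 25.5, 27.0, 26.5]      # per-processor evolve times, n = 4
print(parallel_efficiency(T_1, t))    # about 92.6
\end{verbatim}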
\section{Advection, Rectangular Domain}

The first example looked at the rectangular domain example given in
Section \ref{subsec:codeRPA}, except that we changed the final time
to 1.0 (\code{domain.evolve(yieldstep = 0.1, finaltime = 1.0)}).

For this particular example we can control the mesh size by changing
the parameters \code{N} and \code{M} given in the following section
of code taken from Section \ref{subsec:codeRPA}.

…
\end{verbatim}

Tables \ref{tbl:rpa40}, \ref{tbl:rpa80} and \ref{tbl:rpa160} show
the efficiency results for different values of \code{N} and
\code{M}. The examples where $n \le 4$ were run on one Opteron node
containing 4 processors, while the $n = 8$ example was run on 2 nodes
(giving a total of 8 processors). The communication within a node is
faster than the communication across nodes, so we would expect to
see a decrease in efficiency when we jump from 4 to 8 processors.
Furthermore, as \code{N} and \code{M} are increased the ratio of
exterior to interior triangles decreases, which in turn decreases
the amount of communication relative to the amount of computation,
and thus the efficiency should increase.

The efficiency results shown here are competitive.

\begin{table}
\caption{Parallel Efficiency Results for the Advection Problem on a
Rectangular Domain with {\tt N} = 40, {\tt M} = 40.\label{tbl:rpa40}}
\begin{center}
\begin{tabular}{|c|c c|}\hline
…
\end{table}

\begin{table}
\caption{Parallel Efficiency Results for the Advection Problem on a
Rectangular Domain with {\tt N} = 80, {\tt M} = 80.\label{tbl:rpa80}}
\begin{center}
\begin{tabular}{|c|c c|}\hline
…

\begin{table}
\caption{Parallel Efficiency Results for the Advection Problem on a
Rectangular Domain with {\tt N} = 160, {\tt M} =
160.\label{tbl:rpa160}}
\begin{center}
\begin{tabular}{|c|c c|}\hline
…

%Another way of measuring the performance of the code on a parallel
%machine is to increase the problem size as the number of processors
%is increased so that the number of triangles per processor remains
%roughly the same. We have not carried out measurements of this kind
%as we usually have static grids and it is not possible to increase
%the number of triangles.

\section{Advection, Merimbula Mesh}

We now look at another advection example, except this time the mesh
comes from the Merimbula test problem. That is, we ran the code
given in Section \ref{subsec:codeRPMM}, except that the final time
was reduced to 10000 (\code{finaltime = 10000}). The results are
given in Table \ref{tbl:rpm}. These are good efficiency results,
especially considering the structure of the Merimbula mesh.
%Note that since we are solving an advection problem the amount of
%calculation done on each triangle is relatively low; when we move to
%other problems that involve more calculations we would expect the
%computation to communication ratio to increase and thus get an
%increase in efficiency.

\begin{table}
\caption{Parallel Efficiency Results for the Advection Problem on the
Merimbula Mesh.\label{tbl:rpm}}

…

Processor 0 spent about 3.8 times as long doing the
\code{update_boundary} calculations as Processor 7. This load
imbalance reduced the parallel efficiency.
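An imbalance of this kind can be exposed by timing the routine in
question separately on each processor. The sketch below shows one way
of doing so; the \code{work} function is a stand-in for the real
\code{update_boundary} call and the harness is illustrative only.
\begin{verbatim}
# Sketch only: time one routine on each processor to expose a load
# imbalance.  'work' stands in for the real update_boundary call;
# pypar is the MPI binding already used by the parallel code.
import time
import pypar

def timed(work, repeats=100):
    """Wall-clock time spent calling work() repeats times."""
    t0 = time.time()
    for i in range(repeats):
        work()
    return time.time() - t0

myid = pypar.rank()

# A dummy workload whose cost grows with the processor number imitates
# an imbalance; comparing the printed times shows how far the slowest
# processor (which determines T_n) lags behind the others.
elapsed = timed(lambda: sum(range(10000 * (myid + 1))))
print('Processor %d: %.3f seconds' % (myid, elapsed))

pypar.finalize()
\end{verbatim}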
This load imbalance reduced the parallel 122 efficiency. 152 efficiency. 123 153 124 154 Before doing the shallow equation calculations on a larger number of … … 126 156 optimised as much as possible to reduce the effect of the load imbalance. 127 157 128 129 \begin{table} 158 159 \begin{table} 130 160 \caption{Parallel Efficiency Results for the Shallow Water Equation on the 131 161 Merimbula Mesh.\label{tbl:rpsm}} -
inundation/parallel/parallel_advection.py
r3184 → r3315

 pass

-from pyvolution.advection import *
+from pyvolution.advection_vtk import *

 from Numeric import zeros, Float, Int, ones, allclose, array
 import pypar