\chapter{Performance Analysis}
Examination of the parallel performance of the ANUGA system has been
carried out using the APAC's Linux Cluster (LC). Running MPI jobs with python
through the batch system requires specific settings as follows:
\begin{enumerate}
\item Comment out all module load statements from .bashrc, .cshrc,
.profile and .login.
\item In the script for submission through PBS, add the lines:
\begin{verbatim}
module load python
module load lam
\end{verbatim}
\item (Bash specific; translate if using csh) add export LD\_ASSUME\_KERNEL=2.4.1
and export PYTHONPATH=(relevant directory).
\item Instead of the call to lrun, use:
\begin{verbatim}
lamboot
mpirun -np <number of processors> -x LD_LIBRARY_PATH,PYTHONPATH
,LD_ASSUME_KERNEL <python filename>
lamhalt
\end{verbatim}
The line beginning with `mpirun' and ending with `$<$python
filename$>$' should be on a single line.
\item Ensure that you have \verb|-l other=mpi| in the PBS directives.
\item Ensure that pypar 1.9.2 is accessible. At the time of writing,
it was in the SVN repository under the directory
ga/inundation/pypar\_dist. Move it to the directory pypar, so that
this pypar directory is accessible from the PYTHONPATH.
\end{enumerate}
The section of code that was profiled was the evolve function, which
handles progression of the simulation in time.

The functions examined were the ones that took the largest amount of
time to execute. Efficiency values were calculated as follows:
\[
E = \frac{T_1}{n \times T_{n,1}} \times 100\%
\]
where $T_1$ is the time for the single processor case, $n$ is the
number of processors and $T_{n,1}$ is the time on the first processor
of the $n$ cpu run. Results were generated using an 8 processor test
run, compared against the 1 processor test run. The test script used
was run\_parallel\_sw\_merimbula\_metis.py in
svn/ga/inundation/parallel.\\
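The efficiency formula above can be expressed directly in python; the function name and the sample values below are illustrative, not taken from the test runs:

```python
def parallel_efficiency(t_serial, t_first_proc, n):
    """Parallel efficiency E = T_1 / (n * T_{n,1}) * 100%.

    t_serial     -- time for the single-processor run (T_1)
    t_first_proc -- time on the first processor of the n-cpu run (T_{n,1})
    n            -- number of processors
    """
    return t_serial / (n * t_first_proc) * 100.0

# A hypothetical serial run of 80 s against 12.5 s per processor on
# 8 cpus gives an efficiency of 80%.
efficiency = parallel_efficiency(80.0, 12.5, 8)
```

A value above 100\% indicates superlinear speedup, which is what several of the functions below exhibit.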

\begin{tabular}{l c}
Function name & Parallel efficiency\\
\hline
evolve (shallow\_water.py) & 79\%\\
update\_boundary (domain.py) & 52\%\\
distribute\_to\_vertices\_and\_edges (shallow\_water.py) & 140.6\%\\
evaluate (shallow\_water.py) & 57.8\%\\
compute\_fluxes (shallow\_water.py) & 192.2\%\\
\end{tabular}\\

While evolve was the main function of interest in the profiling
efforts, the other functions listed above displayed parallel
efficiencies that warranted further attention.

\subsection{update\_boundary}
Whenever the domain is split for distribution, an additional layer of
triangles is added along the divide. These `ghost' triangles hold
read-only data from another processor's section of the domain. The use
of a ghost layer is common practice and is designed to reduce the
communication frequency. Some edges of these ghost triangles contain
boundary objects which get updated during the call to
update\_boundary. These are redundant calculations, as information
from the update will be overwritten during the next communication
call. Increasing the number of processors increases the proportion of
ghost triangles and therefore the amount of time wasted updating the
boundaries of the ghost triangles. The low parallel efficiency of this
function has prompted optimisation efforts: tagging ghost nodes so
they are never updated by update\_boundary calls.
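The proposed tagging optimisation can be sketched as follows. This is a simplified illustration only; the names update\_boundary, MockBoundary and ghost\_ids are hypothetical and do not reflect ANUGA's actual API:

```python
class MockBoundary:
    """Stand-in for a boundary object; counts how often it is evaluated."""
    def __init__(self):
        self.evaluations = 0

    def evaluate(self):
        self.evaluations += 1

def update_boundary(boundary_objects, ghost_ids):
    """Update every boundary object except those tagged as ghosts.

    boundary_objects -- dict mapping triangle id -> boundary object
    ghost_ids        -- set of triangle ids in the ghost layer
    """
    for tri_id, boundary in boundary_objects.items():
        if tri_id in ghost_ids:
            # Ghost values are overwritten at the next communication
            # call, so evaluating them here would be wasted work.
            continue
        boundary.evaluate()
```

With the ghost triangles skipped, the per-call cost no longer grows with the size of the ghost layer as the processor count increases.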
\subsection{distribute\_to\_vertices\_and\_edges}
Initially, the unusual speedup was thought to be a load balance issue,
but examination of the domain decomposition showed that this was not
the case: all processors exhibited the unusual speedup. The profiler
was then suspected, because C code cannot be properly profiled by the
python profiler. The speedup was confirmed, however, by timing with
the hotshot profiler and with calls to time.time().
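Cross-checking the profiler against wall-clock time needs only a small helper along these lines (a sketch; the function timed is not part of ANUGA):

```python
import time

def timed(func, *args, **kwargs):
    """Call func once and return (result, elapsed wall-clock seconds)."""
    start = time.time()
    result = func(*args, **kwargs)
    return result, time.time() - start
```

For example, \verb|total, elapsed = timed(sum, range(1000))| times one call to sum. Because time.time() measures wall-clock time rather than instrumented bytecode, it is immune to the profiler's blind spot around C extension code.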

The main function called by distribute\_to\_vertices\_and\_edges is
balance\_deep\_and\_shallow\_c; it was profiled in the same way and
showed the same speedup.

The call to balance\_deep\_and\_shallow\_c was replaced by a call to
the equivalent python function and timed using calls to time.time().
The total CPU time elapsed increased slightly as the number of
processors increased, which is expected. The algorithm part of the C
version of balance\_deep\_and\_shallow was then timed using calls to
gettimeofday(), with the following results:\\

\begin{tabular}{c c}
Number of processors & Total CPU time (s)\\
\hline
1 & 0.482695999788\\
2 & 0.409433000023\\
4 & 0.4197840002601\\
8 & 0.4275599997492\\
\end{tabular}\\

For two processors and above, the arrays used by
balance\_deep\_and\_shallow fit into the L2 cache of an LC compute
node. This cache effect is therefore responsible for the speed
increase from one to two processors.
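The cache argument can be checked with a back-of-the-envelope estimate. The sketch below assumes double-precision values and three quantities stored at three vertices per triangle; the mesh size and the exact per-triangle storage are illustrative assumptions, not ANUGA's actual figures:

```python
def working_set_bytes(n_triangles, n_quantities=3, n_vertices=3):
    """Rough per-processor array size; doubles are 8 bytes each."""
    return n_triangles * n_quantities * n_vertices * 8

# Splitting the mesh over p processors divides the working set by
# roughly p, so arrays that overflow the L2 cache on one cpu can fit
# once the domain is distributed over two or more cpus.
size_1cpu = working_set_bytes(100000)       # 7.2 MB on one processor
size_2cpu = working_set_bytes(100000 // 2)  # 3.6 MB per processor
```

If the L2 cache sits between these two sizes, the two-processor run avoids main-memory traffic that the serial run pays for, which is consistent with the drop in CPU time from one to two processors above.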
\subsubsection{h\_limiter}
h\_limiter is a function called by balance\_deep\_and\_shallow.
Timing the internals of the C version of balance\_deep\_and\_shallow
does not correctly time the call to h\_limiter. Timing h\_limiter from
the python side resulted in unusual ($>$100\%) parallel efficiency,
while timing the algorithm part of the C version of h\_limiter
(i.e.\ excluding the boilerplate code required for the Python-C
interface, such as calls to PyArg\_ParseTuple) displayed the expected
slowdowns. It is therefore presumed that this unusual speedup is due
to how python passes information to C functions.
\subsection{evaluate}
Evaluate is an overridable method for the different boundary classes
(transmissive, reflective, etc.). It contains relatively simple code,
which has limited scope for optimisation. The low parallel efficiency
of evaluate arises because it is called from update\_boundary, and the
number of boundary objects increases with the number of CPUs.
Therefore, the parallel efficiency here should improve once
update\_boundary is corrected.
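The overridable-method structure described above can be illustrated schematically. This is a toy single-value sketch, not ANUGA's actual class definitions; a real reflective boundary reverses only the momentum component normal to the edge:

```python
class Boundary:
    """Base class; each boundary type overrides evaluate."""
    def evaluate(self, interior_value):
        raise NotImplementedError

class TransmissiveBoundary(Boundary):
    """Lets the interior state pass through the boundary unchanged."""
    def evaluate(self, interior_value):
        return interior_value

class ReflectiveBoundary(Boundary):
    """Reflects the interior state back across the boundary edge
    (here reduced to flipping the sign of a single value)."""
    def evaluate(self, interior_value):
        return -interior_value
```

update\_boundary calls evaluate once per boundary object, so its cost scales with the number of boundary objects, including the redundant ghost-layer ones noted earlier.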
\subsection{compute\_fluxes}
Like balance\_deep\_and\_shallow and h\_limiter, compute\_fluxes calls
a C function. The unusual parallel efficiency therefore has the same
presumed cause as in balance\_deep\_and\_shallow and h\_limiter: the
way python calls code in C extensions.
\subsection{Conclusions}
The unusual speedup exhibited by compute\_fluxes,
balance\_deep\_and\_shallow and h\_limiter has been traced to how the
python runtime passes information to C routines. In future work we
plan to investigate this further to verify our conclusions.