Opened 18 years ago
Closed 15 years ago
#139 closed defect (fixed)
Model seems to die with dampier runs on cyclone, using 8 processors
Reported by: | nick | Owned by: | duncan |
---|---|---|---|
Priority: | low | Milestone: | |
Component: | Functionality and features | Version: | 1.0 |
Severity: | critical | Keywords: | |
Cc: | | | |
Description
I have this model
J:\inundation\data\western_australia\dampier_tsunami_scenario_2006\anuga\outputs\20070418_014035_run
and the run dies in FitInterpolate. No error is caught by the Python code, but the screen output is in "screen_output_0_8.txt".
It was run in parallel on cyclone, and here is the error message from the command line:
One of the processes started by mpirun has exited with a nonzero exit code. This typically indicates that the process finished in error. If your process did not finish in error, be sure to include a "return 0" or "exit(0)" in your C code before exiting the application.
PID 23136 failed on node n11 (192.168.255.242) due to signal 9.
I have run the same model again in
J:\inundation\data\western_australia\dampier_tsunami_scenario_2006\anuga\outputs\20070418_020329_run
and it fails in the same spot
Please Help
Change History (11)
comment:1 Changed 18 years ago by nick
- Owner changed from ole to duncan
comment:2 Changed 18 years ago by nick
comment:3 Changed 18 years ago by nick
J:\inundation\data\western_australia\dampier_tsunami_scenario_2006\anuga\outputs\20070418_035124_run
comment:4 Changed 18 years ago by Duncan <duncan@…>
I tried it on Tornado and it worked. Nick is running on cyclone. I'm trying it on cyclone now.
comment:5 Changed 18 years ago by Duncan <duncan@…>
Number of triangles is 342,366; number of vertex coordinates is 171,482 (based on the caching info).
comment:6 Changed 18 years ago by Duncan <duncan@…>
It worked on cyclone, since it used the cached result from the tornado run. I'm turning caching off.
comment:7 Changed 18 years ago by Duncan <duncan@…>
- Summary changed from Model seem to die in FitInterpolate to Model seem to die with dampier runs on cyclone, using 8 processors
It crashed on cyclone. It did not reach set_quantity elevation.
The final output messages are
+-------------------------------------------------------
| Wed Apr 18 06:59:45 2007. Caching statistics (storing)
+-------------------------------------------------------
| Function: _pmesh_to_domain
| Arguments: ('/d/xrd/gem/5/nhi/inundation//data/western_australia/dampier_tsunami_scenario_2006/anuga/meshes/dampier.msh', None)
| Reason: Dependencies have changed
| CPU time: 58.1 seconds
| Loading time: 1.74 seconds (estimated)
|
| Caching dir: /d/cit/1/dgray/.python_cache/
| Result file: _pmesh_to_domain[-6918262629748670285]_Result.z (4813783 bytes, compressed)
| Args file: _pmesh_to_domain[-6918262629748670285]_Args.z (113 bytes, compressed)
| Admin file: _pmesh_to_domain[-6918262629748670285]_Admin.z (492 bytes, compressed)
|
| Dependencies:
| /d/xrd/gem/5/nhi/inundation//data/western_australia/dampier_tsunami_scenario_2006/anuga/meshes/dampier.msh: Wed Apr 18 06:58:46 2007 11079948 bytes
+-------------------------------------------------------
General_mesh: Building basic mesh structure
General_mesh: Computing areas, normals and edgelenghts
(0/342366) (34237/342366) (68474/342366) (102711/342366) (136948/342366) (171185/342366) (205422/342366) (239659/342366) (273896/342366) (308133/342366)
Building vertex list
Initialising mesh
Mesh: Computing centroids and radii
(0/342366) (34237/342366) (68474/342366) (102711/342366) (136948/342366) (171185/342366) (205422/342366) (239659/342366) (273896/342366) (308133/342366)
Mesh: Building neigbour structure
The command prompt error message is
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit code. This typically indicates that the process finished in error. If your process did not finish in error, be sure to include a "return 0" or "exit(0)" in your C code before exiting the application.
PID 18580 failed on node n0 (192.168.255.253) due to signal 9.
-----------------------------------------------------------------------------
MPI_Recv: process in local group is dead (rank 2, comm 3)
Rank (4, MPI_COMM_WORLD): Call stack within LAM:
Rank (4, MPI_COMM_WORLD): - MPI_Recv()
Rank (4, MPI_COMM_WORLD): - MPI_Barrier()
Rank (4, MPI_COMM_WORLD): - MPI_Barrier()
Rank (4, MPI_COMM_WORLD): - main()
MPI_Recv: process in local group is dead (rank 3, comm 3)
Rank (6, MPI_COMM_WORLD): Call stack within LAM:
Rank (6, MPI_COMM_WORLD): - MPI_Recv()
Rank (6, MPI_COMM_WORLD): - MPI_Barrier()
Rank (6, MPI_COMM_WORLD): - MPI_Barrier()
Rank (6, MPI_COMM_WORLD): - main()
MPI_Recv: process in local group is dead (rank 1, comm 3)
Rank (2, MPI_COMM_WORLD): Call stack within LAM:
Rank (2, MPI_COMM_WORLD): - MPI_Recv()
Rank (2, MPI_COMM_WORLD): - MPI_Barrier()
Rank (2, MPI_COMM_WORLD): - MPI_Barrier()
Rank (2, MPI_COMM_WORLD): - main()
[2]+ Killed /usr/bin/mpirun -x PATH,PYTHONPATH,INUNDATIONHOME c0-7 python run_dampier.py
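Signal 9 in the mpirun message is SIGKILL on Linux. A process cannot catch or handle SIGKILL, which would be consistent with no error being caught by the Python code. Below is a minimal sketch of how a launching script could decode such an exit status, assuming a POSIX system and Python 3. `describe_exit` is a hypothetical helper and is not part of ANUGA or LAM/MPI.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Describe a subprocess exit status (hypothetical helper).

    On POSIX, subprocess reports a child killed by signal N
    as a return code of -N.
    """
    if returncode < 0:
        sig = signal.Signals(-returncode)
        return f"killed by {sig.name} (signal {sig.value})"
    return f"exited with code {returncode}"


# Stand-in for a crashed worker: a child process that sends SIGKILL to itself.
child = subprocess.run(
    [sys.executable, "-c",
     "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)
message = describe_exit(child.returncode)
print(message)
```

A process killed this way gets no chance to write a traceback, so the last lines of the screen output only show how far the run got before it was killed.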
comment:8 Changed 18 years ago by Duncan <duncan@…>
- Summary changed from Model seem to die with dampier runs on cyclone, using 8 processors to Model seems to die with dampier runs on cyclone, using 8 processors
comment:9 Changed 18 years ago by Duncan <duncan@…>
These are the output results from gem5\nhi\inundation\data\western_australia\dampier_tsunami_scenario_2006\anuga\outputs\20070418_014035_run (discussed in the initial description).
All the output files were last written to at 11:45 am.
____________________ node 0 ________________
Boundary: Number of boundary segments == 576
Boundary tags == ['side', 'back', 'ocean']
------------------------------------------------
starting to create boundary conditions
Setup initial conditions
Start Set quantity
Caching: looking for cached files /d/cit/1/cit/unixhome/nbartzis/.python_cache/_fit_to_mesh[-7931027587451355763]_{Result,Args,Admin}.z
+-----------------------------------------------------------
| Wed Apr 18 01:45:53 2007. Evaluating function _fit_to_mesh
+-----------------------------------------------------------
| Arguments: (Array: (250137, 2), Array: (499696, 3), '/d/xrd/gem/5/nhi/inundation/data/western_australia/dampier_tsunami_scenario_2006/anuga/topographies/dampier_combined_elevation.txt')
| Keyword Args: {'point_attributes': None, 'use_cache': True, 'attribute_name': None, 'verbose': True, 'max_read_lines': 500, 'acceptable_overshoot': 1.01, 'mesh_origin': None, 'alpha': 0.1, 'data_origin': None}
| Reason: No cached result
+-----------------------------------------------------------
Caching: looking for cached files /d/cit/1/cit/unixhome/nbartzis/.python_cache/_fit[592666254558874217]_{Result,Args,Admin}.z
+---------------------------------------------------
| Wed Apr 18 01:46:33 2007. Evaluating function _fit
+---------------------------------------------------
| Arguments: (Array: (250137, 2), Array: (499696, 3))
| Keyword Args: {'alpha': 0.1, 'mesh_origin': None, 'verbose': True}
| Reason: No cached result
+---------------------------------------------------
FitInterpolate: Building mesh
________________
______________________ Node 1,2,3,4,5 (the rest);
Boundary: Number of boundary segments == 576
Boundary tags == ['side', 'back', 'ocean']
------------------------------------------------
starting to create boundary conditions
_________________________
comment:10 Changed 18 years ago by Duncan <duncan@…>
- Priority changed from high to low
Nick ran this sequentially and it didn't crash.
Maybe two parallel processes sharing the same node's memory use just enough of it to exhaust the memory and crash.
It could be a memory error.
The workaround is to use Tornado.
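One way to test the memory hypothesis would be to log each process's peak memory at a few points in the run script. The following is a rough sketch only, assuming Python 3 on a Unix system. `peak_rss_mb` and `log_peak` are hypothetical helpers, the stage labels are placeholders, and the list allocation merely stands in for the real mesh-building work.

```python
import resource
import sys


def peak_rss_mb():
    """Return this process's peak resident set size in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_peak(stage):
    """Print and return the peak memory so far, tagged with a stage label."""
    mb = peak_rss_mb()
    print(f"[{stage}] peak memory so far: {mb:.1f} MB")
    return mb


before = log_peak("start")
data = [0.0] * 5_000_000  # placeholder workload standing in for mesh building
after = log_peak("after allocation")
```

If the peak reported just before the crash, multiplied by the number of processes on a node, approaches that node's physical memory, that would support the memory explanation.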
comment:11 Changed 15 years ago by nariman
- Resolution set to fixed
- Status changed from new to closed
I have just run the same model on one processor and it seems to pass.