Opened 18 years ago
Closed 16 years ago
#139 closed defect (fixed)
Model seems to die with dampier runs on cyclone, using 8 processors
Reported by: | nick | Owned by: | duncan |
---|---|---|---|
Priority: | low | Milestone: | |
Component: | Functionality and features | Version: | 1.0 |
Severity: | critical | Keywords: | |
Cc: |
Description
I have this model:
J:\inundation\data\western_australia\dampier_tsunami_scenario_2006\anuga\outputs\20070418_014035_run
The run dies in FitInterpolate. No error is caught by the Python code, but the screen output is in "screen_output_0_8.txt".
It was run in parallel on cyclone. Here is the error message from the command line:
One of the processes started by mpirun has exited with a nonzero exit code. This typically indicates that the process finished in error. If your process did not finish in error, be sure to include a "return 0" or "exit(0)" in your C code before exiting the application.
PID 23136 failed on node n11 (192.168.255.242) due to signal 9.
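On POSIX systems signal 9 is SIGKILL, which a process cannot catch or handle. That would be consistent with no error being caught by the Python code. The following is a minimal sketch of how a launcher script could report which signal ended a child process. The helper names `describe_signal` and `run_and_report` are hypothetical and are not part of ANUGA or LAM/MPI; the child command is a stand-in for a killed rank.

```python
import signal
import subprocess
import sys


def describe_signal(signum):
    """Return the symbolic name of a POSIX signal number."""
    return signal.Signals(signum).name


def run_and_report(cmd):
    """Run cmd and describe how it ended (hypothetical helper)."""
    proc = subprocess.run(cmd)
    if proc.returncode < 0:
        # On POSIX, subprocess reports death by signal as a negative code.
        return "killed by " + describe_signal(-proc.returncode)
    return "exited with code %d" % proc.returncode


if __name__ == "__main__":
    # Child that sends SIGKILL to itself, standing in for a killed MPI rank.
    child = [sys.executable, "-c",
             "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
    print(run_and_report(child))
```

A process that is killed this way gets no chance to print a traceback, so the last lines of its screen output are usually the only clue to where it stopped.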
I have run the same model again in
J:\inundation\data\western_australia\dampier_tsunami_scenario_2006\anuga\outputs\20070418_020329_run
and it fails in the same spot
Please Help
Change History (11)
comment:1 Changed 18 years ago by
Owner: | changed from ole to duncan |
---|
comment:2 Changed 18 years ago by
comment:3 Changed 18 years ago by
J:\inundation\data\western_australia\dampier_tsunami_scenario_2006\anuga\outputs\20070418_035124_run
comment:4 Changed 18 years ago by
I tried it on Tornado and it worked. Nick is running on cyclone. I'm trying it on cyclone now.
comment:5 Changed 18 years ago by
The number of triangles is 342,366 and the number of vertex coordinates is 171,482 (based on the caching info).
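For a rough sense of scale, here is a back-of-envelope sketch of how much memory the per-triangle arrays might need. The array names and the number of values per triangle are assumptions loosely based on the quantities named in the screen output (areas, normals, edge lengths, centroids, radii, neighbours), not figures taken from the ANUGA source, and 8-byte values are also assumed. Temporary structures built during fitting are not counted.

```python
BYTES_PER_VALUE = 8  # assumption: 64-bit floats and integers

# Assumed number of stored values per triangle for each array.
VALUES_PER_TRIANGLE = {
    "vertex_coordinates": 6,  # 3 vertices x (x, y)
    "normals": 6,             # 3 edges x (nx, ny)
    "edgelengths": 3,
    "areas": 1,
    "centroids": 2,
    "radii": 1,
    "neighbours": 3,
}


def estimate_mesh_bytes(n_triangles,
                        values_per_triangle=VALUES_PER_TRIANGLE,
                        bytes_per_value=BYTES_PER_VALUE):
    """Estimate bytes needed for the per-triangle arrays of one mesh copy."""
    return n_triangles * sum(values_per_triangle.values()) * bytes_per_value


def estimate_total_mib(n_triangles, n_processes):
    """Estimate MiB used if every process holds its own full mesh copy."""
    return estimate_mesh_bytes(n_triangles) * n_processes / 2**20


if __name__ == "__main__":
    one_copy = estimate_mesh_bytes(342366) / 2**20
    print("one mesh copy: %.1f MiB" % one_copy)
    print("eight copies:  %.1f MiB" % estimate_total_mib(342366, 8))
```

The estimate only matters if several processes on the same node each build the full mesh before it is distributed; whether that happens here is not confirmed by the ticket.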
comment:6 Changed 18 years ago by
It worked on cyclone, since it used the cached result from the tornado run. I'm turning caching off.
comment:7 Changed 18 years ago by
Summary: | Model seem to die in FitInterpolate → Model seem to die with dampier runs on cyclone, using 8 processors |
---|
It crashed on cyclone. It did not reach set_quantity elevation.
The final output messages are
+-------------------------------------------------------
| Wed Apr 18 06:59:45 2007. Caching statistics (storing)
+-------------------------------------------------------
| Function: _pmesh_to_domain
| Arguments: ('/d/xrd/gem/5/nhi/inundation//data/western_australia/dampier_tsunami_scenario_2006/anuga/meshes/dampier.msh', None)
| Reason: Dependencies have changed
| CPU time: 58.1 seconds
| Loading time: 1.74 seconds (estimated)
|
| Caching dir: /d/cit/1/dgray/.python_cache/
| Result file: _pmesh_to_domain[-6918262629748670285]_Result.z (4813783 bytes, compressed)
| Args file: _pmesh_to_domain[-6918262629748670285]_Args.z (113 bytes, compressed)
| Admin file: _pmesh_to_domain[-6918262629748670285]_Admin.z (492 bytes, compressed)
|
| Dependencies:
| /d/xrd/gem/5/nhi/inundation//data/western_australia/dampier_tsunami_scenario_2006/anuga/meshes/dampier.msh: Wed Apr 18 06:58:46 2007 11079948 bytes
+-------------------------------------------------------
General_mesh: Building basic mesh structure
General_mesh: Computing areas, normals and edgelenghts
(0/342366) (34237/342366) (68474/342366) (102711/342366) (136948/342366) (171185/342366) (205422/342366) (239659/342366) (273896/342366) (308133/342366)
Building vertex list
Initialising mesh
Mesh: Computing centroids and radii
(0/342366) (34237/342366) (68474/342366) (102711/342366) (136948/342366) (171185/342366) (205422/342366) (239659/342366) (273896/342366) (308133/342366)
Mesh: Building neigbour structure
The command prompt error message is
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit code. This typically indicates that the process finished in error. If your process did not finish in error, be sure to include a "return 0" or "exit(0)" in your C code before exiting the application.
PID 18580 failed on node n0 (192.168.255.253) due to signal 9.
-----------------------------------------------------------------------------
MPI_Recv: process in local group is dead (rank 2, comm 3)
Rank (4, MPI_COMM_WORLD): Call stack within LAM:
Rank (4, MPI_COMM_WORLD): - MPI_Recv()
Rank (4, MPI_COMM_WORLD): - MPI_Barrier()
Rank (4, MPI_COMM_WORLD): - MPI_Barrier()
Rank (4, MPI_COMM_WORLD): - main()
MPI_Recv: process in local group is dead (rank 3, comm 3)
Rank (6, MPI_COMM_WORLD): Call stack within LAM:
Rank (6, MPI_COMM_WORLD): - MPI_Recv()
Rank (6, MPI_COMM_WORLD): - MPI_Barrier()
Rank (6, MPI_COMM_WORLD): - MPI_Barrier()
Rank (6, MPI_COMM_WORLD): - main()
MPI_Recv: process in local group is dead (rank 1, comm 3)
Rank (2, MPI_COMM_WORLD): Call stack within LAM:
Rank (2, MPI_COMM_WORLD): - MPI_Recv()
Rank (2, MPI_COMM_WORLD): - MPI_Barrier()
Rank (2, MPI_COMM_WORLD): - MPI_Barrier()
Rank (2, MPI_COMM_WORLD): - main()
[2]+ Killed /usr/bin/mpirun -x PATH,PYTHONPATH,INUNDATIONHOME c0-7 python run_dampier.py
comment:8 Changed 18 years ago by
Summary: | Model seem to die with dampier runs on cyclone, using 8 processors → Model seems to die with dampier runs on cyclone, using 8 processors |
---|
comment:9 Changed 18 years ago by
These are the output results from gem5\nhi\inundation\data\western_australia\dampier_tsunami_scenario_2006\anuga\outputs\20070418_014035_run (discussed in the initial description).
All the output files were last written to at 11:45 am.
____________________ node 0 ________________
Boundary: Number of boundary segments == 576
Boundary tags == ['side', 'back', 'ocean']
------------------------------------------------
starting to create boundary conditions
Setup initial conditions
Start Set quantity
Caching: looking for cached files /d/cit/1/cit/unixhome/nbartzis/.python_cache/_fit_to_mesh[-7931027587451355763]_{Result,Args,Admin}.z
+-----------------------------------------------------------
| Wed Apr 18 01:45:53 2007. Evaluating function _fit_to_mesh
+-----------------------------------------------------------
| Arguments: (Array: (250137, 2), Array: (499696, 3), '/d/xrd/gem/5/nhi/inundation/data/western_australia/dampier_tsunami_scenario_2006/anuga/topographies/dampier_combined_elevation.txt')
| Keyword Args: {'point_attributes': None, 'use_cache': True, 'attribute_name': None, 'verbose': True, 'max_read_lines': 500, 'acceptable_overshoot': 1.01, 'mesh_origin': None, 'alpha': 0.1, 'data_origin': None}
| Reason: No cached result
+-----------------------------------------------------------
Caching: looking for cached files /d/cit/1/cit/unixhome/nbartzis/.python_cache/_fit[592666254558874217]_{Result,Args,Admin}.z
+---------------------------------------------------
| Wed Apr 18 01:46:33 2007. Evaluating function _fit
+---------------------------------------------------
| Arguments: (Array: (250137, 2), Array: (499696, 3))
| Keyword Args: {'alpha': 0.1, 'mesh_origin': None, 'verbose': True}
| Reason: No cached result
+---------------------------------------------------
FitInterpolate: Building mesh
________________
______________________ Node 1,2,3,4,5 (the rest);
Boundary: Number of boundary segments == 576
Boundary tags == ['side', 'back', 'ocean']
------------------------------------------------
starting to create boundary conditions
_________________________
comment:10 Changed 18 years ago by
Priority: | high → low |
---|
Nick ran this sequentially and it didn't crash.
Maybe two parallel processes sharing the same memory are just enough to use up all the memory and cause the crash.
It could be a memory error.
The workaround is to use Tornado.
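One way to test the memory hypothesis would be to print each process's peak memory use at a few points in the run script. The sketch below uses only the Python standard library. The helper names `peak_rss_mib` and `log_peak_memory` are hypothetical, and the rank argument is assumed to come from whatever MPI wrapper the script already uses.

```python
import resource
import sys


def peak_rss_mib():
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 2**20 if sys.platform == "darwin" else 2**10
    return peak / divisor


def log_peak_memory(label, rank=0):
    """Print and return a one-line peak memory report (hypothetical helper)."""
    message = "[rank %d] %s: peak RSS %.1f MiB" % (rank, label, peak_rss_mib())
    print(message)
    sys.stdout.flush()  # make sure the line is written before any later kill
    return message


if __name__ == "__main__":
    log_peak_memory("start of run")
    data = [0.0] * 5_000_000  # stand-in for building a large mesh
    log_peak_memory("after allocation")
```

Calling the helper just before and after the mesh is built, and again before set_quantity, would show whether memory climbs towards the node's limit before the process is killed.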
comment:11 Changed 16 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
I have just run the same model on one processor and it seems to pass.