Opened 13 years ago

Closed 11 years ago

#139 closed defect (fixed)

Model seems to die with dampier runs on cyclone, using 8 processors

Reported by: nick
Owned by: duncan
Priority: low
Milestone:
Component: Functionality and features
Version: 1.0
Severity: critical
Keywords:
Cc:

Description

I have this model:

J:\inundation\data\western_australia\dampier_tsunami_scenario_2006\anuga\outputs\20070418_014035_run

The run dies in FitInterpolate. No error is caught by the Python code, but the screen output is in "screen_output_0_8.txt".

It was run in parallel on cyclone, and here is the error message from the command line:


One of the processes started by mpirun has exited with a nonzero exit code. This typically indicates that the process finished in error. If your process did not finish in error, be sure to include a "return 0" or "exit(0)" in your C code before exiting the application.

PID 23136 failed on node n11 (192.168.255.242) due to signal 9.
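
For reference, signal 9 in the message above is SIGKILL on Linux, which a process cannot catch or handle, so the Python code would not get a chance to raise an exception. A minimal sketch, using only the standard library (Python 3.5 or later), for translating a signal number into its name; `describe_signal` is a hypothetical helper and not part of ANUGA:

```python
import signal


def describe_signal(signum):
    """Return the symbolic name of a signal number on this platform."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        # The number is not a known signal on this platform.
        return "unknown signal %d" % signum


print(describe_signal(9))
```

SIGKILL is sent from outside the process (by the operating system or by another user or tool), so the name alone does not say why the process was killed.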


I have run the same model again in

J:\inundation\data\western_australia\dampier_tsunami_scenario_2006\anuga\outputs\20070418_020329_run

and it fails in the same spot.

Please help.

Change History (11)

comment:1 Changed 13 years ago by nick

  • Owner changed from ole to duncan

comment:2 Changed 13 years ago by nick

I have just run the same model on one processor and it seems to pass.

comment:3 Changed 13 years ago by nick

J:\inundation\data\western_australia\dampier_tsunami_scenario_2006\anuga\outputs\20070418_035124_run

comment:4 Changed 13 years ago by Duncan <duncan@…>

I tried it on Tornado and it worked. Nick is running on cyclone. I'm trying it on cyclone now.

comment:5 Changed 13 years ago by Duncan <duncan@…>

The number of triangles is 342,366 and the number of vertex coordinates is 171,482 (based on the caching info).
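
As a rough back-of-the-envelope check, the sketch below estimates the memory needed for a few basic mesh arrays at this size. It assumes 8-byte values, three vertices per triangle, and that each of the 8 processes holds its own copy. The list of arrays is illustrative only and is not taken from the ANUGA source; the real footprint is likely to be larger.

```python
BYTES_PER_VALUE = 8  # assumption: 64-bit floats and integers


def array_bytes(rows, cols, bytes_per_value=BYTES_PER_VALUE):
    """Size in bytes of a dense rows x cols array."""
    return rows * cols * bytes_per_value


n_triangles = 342366  # from the caching info
n_vertices = 171482   # from the caching info

# Hypothetical set of per-mesh arrays, for illustration only.
arrays = {
    "triangles (vertex indices)": array_bytes(n_triangles, 3),
    "vertex coordinates (x, y)": array_bytes(n_vertices, 2),
    "centroids (x, y)": array_bytes(n_triangles, 2),
    "normals (3 edges, x and y)": array_bytes(n_triangles, 6),
}

total = sum(arrays.values())
for name, size in arrays.items():
    print("%-30s %6.1f MB" % (name, size / 1e6))
print("total per process: %.1f MB" % (total / 1e6))
print("total for 8 processes: %.1f MB" % (8 * total / 1e6))
```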

comment:6 Changed 13 years ago by Duncan <duncan@…>

It worked on cyclone because it used the cached result from the Tornado run. I'm turning caching off.

comment:7 Changed 13 years ago by Duncan <duncan@…>

  • Summary changed from Model seem to die in FitInterpolate to Model seem to die with dampier runs on cyclone, using 8 processors

It crashed on cyclone. It did not reach set_quantity elevation.

The final output messages are:

+-------------------------------------------------------
| Wed Apr 18 06:59:45 2007. Caching statistics (storing)
+-------------------------------------------------------
| Function:     _pmesh_to_domain
| Arguments:    ('/d/xrd/gem/5/nhi/inundation//data/western_australia/dampier_tsunami_scenario_2006/anuga/meshes/dampier.msh', None)
| Reason:       Dependencies have changed
| CPU time:     58.1 seconds
| Loading time: 1.74 seconds (estimated)
|
| Caching dir:  /d/cit/1/dgray/.python_cache/
| Result file:  _pmesh_to_domain[-6918262629748670285]_Result.z (4813783 bytes, compressed)
| Args file:    _pmesh_to_domain[-6918262629748670285]_Args.z (113 bytes, compressed)
| Admin file:   _pmesh_to_domain[-6918262629748670285]_Admin.z (492 bytes, compressed)
|
| Dependencies:  
|  /d/xrd/gem/5/nhi/inundation//data/western_australia/dampier_tsunami_scenario_2006/anuga/meshes/dampier.msh: Wed Apr 18 06:58:46 2007   11079948 bytes
+-------------------------------------------------------

General_mesh: Building basic mesh structure
General_mesh: Computing areas, normals and edgelenghts
(0/342366)
(34237/342366)
(68474/342366)
(102711/342366)
(136948/342366)
(171185/342366)
(205422/342366)
(239659/342366)
(273896/342366)
(308133/342366)
Building vertex list
Initialising mesh
Mesh: Computing centroids and radii
(0/342366)
(34237/342366)
(68474/342366)
(102711/342366)
(136948/342366)
(171185/342366)
(205422/342366)
(239659/342366)
(273896/342366)
(308133/342366)
Mesh: Building neigbour structure

The command prompt error message is:


-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code.  This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.

PID 18580 failed on node n0 (192.168.255.253) due to signal 9.
-----------------------------------------------------------------------------
MPI_Recv: process in local group is dead (rank 2, comm 3)
Rank (4, MPI_COMM_WORLD): Call stack within LAM:
Rank (4, MPI_COMM_WORLD):  - MPI_Recv()
Rank (4, MPI_COMM_WORLD):  - MPI_Barrier()
Rank (4, MPI_COMM_WORLD):  - MPI_Barrier()
Rank (4, MPI_COMM_WORLD):  - main()
MPI_Recv: process in local group is dead (rank 3, comm 3)
Rank (6, MPI_COMM_WORLD): Call stack within LAM:
Rank (6, MPI_COMM_WORLD):  - MPI_Recv()
Rank (6, MPI_COMM_WORLD):  - MPI_Barrier()
Rank (6, MPI_COMM_WORLD):  - MPI_Barrier()
Rank (6, MPI_COMM_WORLD):  - main()
MPI_Recv: process in local group is dead (rank 1, comm 3)
Rank (2, MPI_COMM_WORLD): Call stack within LAM:
Rank (2, MPI_COMM_WORLD):  - MPI_Recv()
Rank (2, MPI_COMM_WORLD):  - MPI_Barrier()
Rank (2, MPI_COMM_WORLD):  - MPI_Barrier()
Rank (2, MPI_COMM_WORLD):  - main()
[2]+  Killed                  /usr/bin/mpirun -x PATH,PYTHONPATH,INUNDATIONHOME c0-7 python run_dampier.py

comment:8 Changed 13 years ago by Duncan <duncan@…>

  • Summary changed from Model seem to die with dampier runs on cyclone, using 8 processors to Model seems to die with dampier runs on cyclone, using 8 processors

comment:9 Changed 13 years ago by Duncan <duncan@…>

These are the output results from gem5\nhi\inundation\data\western_australia\dampier_tsunami_scenario_2006\anuga\outputs\20070418_014035_run (discussed in the initial description).

All the output files were last written to at 11:45 am.
	____________________
Node 0:
	
	________________
	  Boundary:
    Number of boundary segments == 576
    Boundary tags == ['side', 'back', 'ocean']
------------------------------------------------

starting to create boundary conditions
Setup initial conditions
Start Set quantity
Caching: looking for cached files /d/cit/1/cit/unixhome/nbartzis/.python_cache/_fit_to_mesh[-7931027587451355763]_{Result,Args,Admin}.z
+-----------------------------------------------------------
| Wed Apr 18 01:45:53 2007. Evaluating function _fit_to_mesh
+-----------------------------------------------------------
| Arguments:    (Array: (250137, 2), Array: (499696, 3), '/d/xrd/gem/5/nhi/inundation/data/western_australia/dampier_tsunami_scenario_2006/anuga/topographies/dampier_combined_elevation.txt')
| Keyword Args: {'point_attributes': None, 'use_cache': True, 'attribute_name': None, 'verbose': True, 'max_read_lines': 500, 'acceptable_overshoot': 1.01, 'mesh_origin': None, 'alpha': 0.1, 'data_origin': None}
| Reason:       No cached result
+-----------------------------------------------------------

Caching: looking for cached files /d/cit/1/cit/unixhome/nbartzis/.python_cache/_fit[592666254558874217]_{Result,Args,Admin}.z
+---------------------------------------------------
| Wed Apr 18 01:46:33 2007. Evaluating function _fit
+---------------------------------------------------
| Arguments:    (Array: (250137, 2), Array: (499696, 3))
| Keyword Args: {'alpha': 0.1, 'mesh_origin': None, 'verbose': True}
| Reason:       No cached result
+---------------------------------------------------

FitInterpolate: Building mesh

	________________

______________________

Nodes 1, 2, 3, 4, 5 (the rest):

 Boundary:
    Number of boundary segments == 576
    Boundary tags == ['side', 'back', 'ocean']
------------------------------------------------

starting to create boundary conditions

_________________________

comment:10 Changed 13 years ago by Duncan <duncan@…>

  • Priority changed from high to low

Nick ran this sequentially and it didn't crash.

Maybe two parallel processes sharing the same memory are just enough to use all of it and cause the crash.

It could be a memory error.

The workaround is to use Tornado.
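
One way to test the memory hypothesis would be to print each process's peak memory use at a few points in the run script. A minimal sketch using the standard `resource` module (Unix only); `peak_memory_mb` and `report_peak_memory` are hypothetical helpers, not part of ANUGA, and the conversion assumes Linux, where `ru_maxrss` is reported in kilobytes:

```python
import resource
import sys


def peak_memory_mb():
    """Peak resident set size of this process in MB (assumes Linux, KB units)."""
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak_kb / 1024.0


def report_peak_memory(label, stream=sys.stdout):
    """Write one line with the current peak memory, then flush."""
    stream.write("%s: peak memory %.1f MB\n" % (label, peak_memory_mb()))
    # Flush so the line is not lost in a buffer if the process is killed.
    stream.flush()


report_peak_memory("after mesh construction")
```

Calling the helper before and after the mesh and fitting steps on each processor would show whether memory use grows towards the node's limit before the crash.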

comment:11 Changed 11 years ago by nariman

  • Resolution set to fixed
  • Status changed from new to closed