Opened 18 years ago

Closed 16 years ago

#139 closed defect (fixed)

Model seems to die with dampier runs on cyclone, using 8 processors

Reported by: nick
Owned by: duncan
Priority: low
Milestone:
Component: Functionality and features
Version: 1.0
Severity: critical
Keywords:
Cc:

Description

I have this model

J:\inundation\data\western_australia\dampier_tsunami_scenario_2006\anuga\outputs\20070418_014035_run

and the run dies in FitInterpolate. No error is caught by the Python code, but the screen output is in "screen_output_0_8.txt".

It was run in parallel on cyclone, and here is the error message from the command line:


One of the processes started by mpirun has exited with a nonzero exit code. This typically indicates that the process finished in error. If your process did not finish in error, be sure to include a "return 0" or "exit(0)" in your C code before exiting the application.

PID 23136 failed on node n11 (192.168.255.242) due to signal 9.


I have run the same model again in

J:\inundation\data\western_australia\dampier_tsunami_scenario_2006\anuga\outputs\20070418_020329_run

and it fails in the same spot

Please Help

Change History (11)

comment:1 Changed 18 years ago by nick

Owner: changed from ole to duncan

comment:2 Changed 18 years ago by nick

I have just run the same model on one processor and it seems to pass.

comment:3 Changed 18 years ago by nick

J:\inundation\data\western_australia\dampier_tsunami_scenario_2006\anuga\outputs\20070418_035124_run

comment:4 Changed 18 years ago by Duncan <duncan@…>

I tried it on Tornado and it worked. Nick is running on cyclone. I'm trying it on cyclone now.

comment:5 Changed 18 years ago by Duncan <duncan@…>

The number of triangles is 342,366 and the number of vertex coordinates is 171,482 (based on the caching info).
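
For a rough sense of scale, the raw array sizes implied by these counts can be estimated. This is a back-of-envelope sketch only: it assumes 8-byte floats for coordinates and 8-byte integers for triangle vertex indices, which the ticket does not confirm. The real footprint of the Python objects, neighbour structures and fitting matrices is likely to be much larger than the raw arrays.

```python
# Counts taken from the caching info quoted in this comment.
n_triangles = 342366
n_vertices = 171482

# Assumption: 8 bytes per stored number (not confirmed by the ticket).
BYTES_PER_NUMBER = 8

# Two coordinates (x, y) per vertex.
coords_bytes = n_vertices * 2 * BYTES_PER_NUMBER
# Three vertex indices per triangle.
triangles_bytes = n_triangles * 3 * BYTES_PER_NUMBER

raw_bytes = coords_bytes + triangles_bytes
raw_mb = raw_bytes / (1024.0 * 1024.0)

print("Raw mesh arrays: %d bytes (about %.1f MB)" % (raw_bytes, raw_mb))
```

If the raw arrays are this small, any memory exhaustion would have to come from derived structures or from several processes each building them at once.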

comment:6 Changed 18 years ago by Duncan <duncan@…>

It worked on cyclone, since it used the cached result from the tornado run. I'm turning caching off.

comment:7 Changed 18 years ago by Duncan <duncan@…>

Summary: changed from Model seem to die in FitInterpolate to Model seem to die with dampier runs on cyclone, using 8 processors

It crashed on cyclone. It did not reach set_quantity elevation.

The final output messages are

+-------------------------------------------------------
| Wed Apr 18 06:59:45 2007. Caching statistics (storing)
+-------------------------------------------------------
| Function:     _pmesh_to_domain
| Arguments:    ('/d/xrd/gem/5/nhi/inundation//data/western_australia/dampier_tsunami_scenario_2006/anuga/meshes/dampier.msh', None)
| Reason:       Dependencies have changed
| CPU time:     58.1 seconds
| Loading time: 1.74 seconds (estimated)
|
| Caching dir:  /d/cit/1/dgray/.python_cache/
| Result file:  _pmesh_to_domain[-6918262629748670285]_Result.z (4813783 bytes, compressed)
| Args file:    _pmesh_to_domain[-6918262629748670285]_Args.z (113 bytes, compressed)
| Admin file:   _pmesh_to_domain[-6918262629748670285]_Admin.z (492 bytes, compressed)
|
| Dependencies:  
|  /d/xrd/gem/5/nhi/inundation//data/western_australia/dampier_tsunami_scenario_2006/anuga/meshes/dampier.msh: Wed Apr 18 06:58:46 2007   11079948 bytes
+-------------------------------------------------------

General_mesh: Building basic mesh structure
General_mesh: Computing areas, normals and edgelenghts
(0/342366)
(34237/342366)
(68474/342366)
(102711/342366)
(136948/342366)
(171185/342366)
(205422/342366)
(239659/342366)
(273896/342366)
(308133/342366)
Building vertex list
Initialising mesh
Mesh: Computing centroids and radii
(0/342366)
(34237/342366)
(68474/342366)
(102711/342366)
(136948/342366)
(171185/342366)
(205422/342366)
(239659/342366)
(273896/342366)
(308133/342366)
Mesh: Building neigbour structure

The command prompt error message is


-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code.  This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.

PID 18580 failed on node n0 (192.168.255.253) due to signal 9.
-----------------------------------------------------------------------------
MPI_Recv: process in local group is dead (rank 2, comm 3)
Rank (4, MPI_COMM_WORLD): Call stack within LAM:
Rank (4, MPI_COMM_WORLD):  - MPI_Recv()
Rank (4, MPI_COMM_WORLD):  - MPI_Barrier()
Rank (4, MPI_COMM_WORLD):  - MPI_Barrier()
Rank (4, MPI_COMM_WORLD):  - main()
MPI_Recv: process in local group is dead (rank 3, comm 3)
Rank (6, MPI_COMM_WORLD): Call stack within LAM:
Rank (6, MPI_COMM_WORLD):  - MPI_Recv()
Rank (6, MPI_COMM_WORLD):  - MPI_Barrier()
Rank (6, MPI_COMM_WORLD):  - MPI_Barrier()
Rank (6, MPI_COMM_WORLD):  - main()
MPI_Recv: process in local group is dead (rank 1, comm 3)
Rank (2, MPI_COMM_WORLD): Call stack within LAM:
Rank (2, MPI_COMM_WORLD):  - MPI_Recv()
Rank (2, MPI_COMM_WORLD):  - MPI_Barrier()
Rank (2, MPI_COMM_WORLD):  - MPI_Barrier()
Rank (2, MPI_COMM_WORLD):  - main()
[2]+  Killed                  /usr/bin/mpirun -x PATH,PYTHONPATH,INUNDATIONHOME c0-7 python run_dampier.py

comment:8 Changed 18 years ago by Duncan <duncan@…>

Summary: changed from Model seem to die with dampier runs on cyclone, using 8 processors to Model seems to die with dampier runs on cyclone, using 8 processors

comment:9 Changed 18 years ago by Duncan <duncan@…>

These are the output results from gem5\nhi\inundation\data\western_australia\dampier_tsunami_scenario_2006\anuga\outputs\20070418_014035_run (discussed in the initial description).

All the output files were last written to at 11:45 am.
	____________________
 node 0
	
	________________
	  Boundary:
    Number of boundary segments == 576
    Boundary tags == ['side', 'back', 'ocean']
------------------------------------------------

starting to create boundary conditions
Setup initial conditions
Start Set quantity
Caching: looking for cached files /d/cit/1/cit/unixhome/nbartzis/.python_cache/_fit_to_mesh[-7931027587451355763]_{Result,Args,Admin}.z
+-----------------------------------------------------------
| Wed Apr 18 01:45:53 2007. Evaluating function _fit_to_mesh
+-----------------------------------------------------------
| Arguments:    (Array: (250137, 2), Array: (499696, 3), '/d/xrd/gem/5/nhi/inundation/data/western_australia/dampier_tsunami_scenario_2006/anuga/topographies/dampier_combined_elevation.txt')
| Keyword Args: {'point_attributes': None, 'use_cache': True, 'attribute_name': None, 'verbose': True, 'max_read_lines': 500, 'acceptable_overshoot': 1.01, 'mesh_origin': None, 'alpha': 0.1, 'data_origin': None}
| Reason:       No cached result
+-----------------------------------------------------------

Caching: looking for cached files /d/cit/1/cit/unixhome/nbartzis/.python_cache/_fit[592666254558874217]_{Result,Args,Admin}.z
+---------------------------------------------------
| Wed Apr 18 01:46:33 2007. Evaluating function _fit
+---------------------------------------------------
| Arguments:    (Array: (250137, 2), Array: (499696, 3))
| Keyword Args: {'alpha': 0.1, 'mesh_origin': None, 'verbose': True}
| Reason:       No cached result
+---------------------------------------------------

FitInterpolate: Building mesh

	________________

______________________

Node 1,2,3,4,5 (the rest);

 Boundary:
    Number of boundary segments == 576
    Boundary tags == ['side', 'back', 'ocean']
------------------------------------------------

starting to create boundary conditions

_________________________

comment:10 Changed 18 years ago by Duncan <duncan@…>

Priority: changed from high to low

Nick ran this sequentially and it didn't crash.

Maybe two parallel processes sharing the same memory are just enough to use all of it and cause the crash.

It could be a memory error.

The workaround is to use Tornado.
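
One way to test the memory hypothesis would be to have each process log its peak resident memory at key points in the run script. The following is a minimal sketch, assuming a Unix host and Python 3; `peak_memory_mb` and `log_peak_memory` are hypothetical helper names, not part of ANUGA. Signal 9 is SIGKILL, which the Linux out-of-memory killer sends, although other sources (a batch scheduler or a manual kill) can send it too.

```python
import resource
import signal
import sys


def peak_memory_mb():
    """Return this process's peak resident set size in MB.

    ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_peak_memory(label):
    """Print a labelled peak-memory reading (hypothetical helper)."""
    mb = peak_memory_mb()
    print("%s: peak memory %.1f MB" % (label, mb))
    return mb


# Signal 9, as reported by mpirun, maps to SIGKILL.
killed_by = signal.Signals(9).name
print("signal 9 is %s" % killed_by)

# Example call; in a real run this would sit before and after mesh building.
log_peak_memory("before building mesh")
```

Comparing the readings from a sequential run with those from each process in the 8-processor run would show whether the combined usage on a node approaches its physical memory.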

comment:11 Changed 16 years ago by nariman

Resolution: fixed
Status: changed from new to closed