Opened 17 years ago

Closed 17 years ago

#165 closed defect (fixed)

"glibc detected" error in exmouth when run in parallel

Reported by: nick
Owned by: ole
Priority: high
Milestone:
Component: Functionality and features
Version:
Severity: normal
Keywords:
Cc:

Description (last modified by duncan)

Can you provide any information about this error? It is the third time it has occurred with this code.

-----------------------
*** glibc detected *** double free or corruption (!prev): 0x000000000162fc10 ***


One of the processes started by mpirun has exited with a nonzero exit
code.  This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.

PID 8578 failed on node n10 (192.168.0.241) due to signal 6.
-----------------------------------------------------------------------------

It occurs when running an Exmouth model in parallel (the sequential run works).

This is the directory of the parallel run: J:\inundation\data\western_australia\exmouth_tsunami_scenario\anuga\outputs\20070525_034301_run_basic_1.4_nbartzis

This is the directory of the last serial run: J:\inundation\data\western_australia\exmouth_tsunami_scenario\anuga\outputs\20070525_033245_run_basic_1.4_nbartzis

+-------------------------------------------------------------
| Fri May 25 15:02:22 2007. Evaluating function _file_function
+-------------------------------------------------------------
| Argument:     '/d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww'
| Keyword Args: {'quantities': ['stage', 'xmomentum', 'ymomentum'], 'interpolation_points': Array: (187, 2), 'time_thinning': 12, 'domain_starttime': 0, 'verbose': True}
| Reason:       No cached result
+-------------------------------------------------------------

Reading /d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww
File_function data obtained from: /d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww
  References:
    Lower left corner: [189945.759564, 7488724.335279]
    Start time:   4000.000000
Building interpolation matrix from source mesh (648 vertices, 1086 triangles)
FitInterpolate: Building mesh
FitInterpolate: Building quad tree
Interpolating (187 interpolation points, 1034 timesteps). Timesteps were thinned by a factor of 12
 time step 0 of 1034
 time step 104 of 1034
 time step 208 of 1034
 time step 312 of 1034

Here is an example of the same problem with code that runs quicker:

\inundation\data\western_australia\exmouth_tsunami_scenario\anuga\outputs\20070530_062210_run_store_0_nbartzis

You should be able to execute "run_exmouth.py" from this directory, and it will create another directory with the output results.

There is one difference I can see: the code fails at a slightly different spot.

Find midpoint coordinates of entire boundary
Initialise file_function
Caching: looking for cached files /d/cit/1/cit/unixhome/nbartzis/.python_cache/_file_function[-1913934578119235209]_{Result,Args,Admin}.z
Caching: Dependencies are ['/d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww']
+-------------------------------------------------------------
| Tue Jun  5 08:49:22 2007. Evaluating function _file_function
+-------------------------------------------------------------
| Argument:     '/d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww'
| Keyword Args: {'quantities': ['stage', 'xmomentum', 'ymomentum'], 'interpolation_points': [[  192783.96757945  7566067.52060781]
 [  192597.7735207   7576851.25925281]
 [  192621.04777805  7575503.29192219]
 [  192807.2418368   7564719.55327719]
 [  192644.32203539  7574155.32459156]
 [  192714.14480742  7570111....
| Reason:       No cached result
+-------------------------------------------------------------

Reading /d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww
File_function data obtained from: /d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww
  References:
    Lower left corner: [189945.759564, 7488724.335279]
    Start time:   4000.000000

Processes 1, 2 and 3 fail at this point, while process 0 continues until it creates a .sww file. All the results are in the directory noted above.

Here's how process 0 fails:

+-------------------------------------------------------------
| Wed May 30 16:23:08 2007. Evaluating function _file_function
+-------------------------------------------------------------
| Argument:     '/d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww'
| Keyword Args: {'quantities': ['stage', 'xmomentum', 'ymomentum'], 'interpolation_points': Array: (305, 2), 'time_thinning': 48, 'domain_starttime': 0, 'verbose': True}
| Reason:       No cached result
+-------------------------------------------------------------

Reading /d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww
File_function data obtained from: /d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww
  References:
    Lower left corner: [189945.759564, 7488724.335279]
    Start time:   4000.000000

Change History (11)

comment:1 Changed 17 years ago by nick

  • Description modified (diff)

comment:2 Changed 17 years ago by nick

  • Description modified (diff)
  • Owner changed from ole to duncan
  • Summary changed from problem with C code in file_function to problem with C code in fit interpolate

comment:3 Changed 17 years ago by nick

  • Description modified (diff)

comment:4 Changed 17 years ago by nick

  • Description modified (diff)

comment:5 Changed 17 years ago by duncan

  • Description modified (diff)

Here's how

comment:6 Changed 17 years ago by duncan

  • Status changed from new to assigned
  • Summary changed from problem with C code in fit interpolate to glibc detected error in exmouth when run in parallel

In W:\inundation\data\western_australia\exmouth_tsunami_scenario\anuga\outputs\20070530_062210_run_store_0_nbartzis the sww file produced is filled with null values.

comment:7 Changed 17 years ago by duncan

  • Summary changed from glibc detected error in exmouth when run in parallel to "glibc detected" error in exmouth when run in parallel

comment:8 Changed 17 years ago by duncan

  • Owner changed from duncan to Ole
  • Status changed from assigned to new

Checking how past revisions go:

4530: bad
4504: bad
4490: bad
4478: bad
4473: bad
4471: bad
4470: ok
4468: ok
4458: ok
4438: ok

So change 4471 seems to be the problem.

I'll get Ole to check this. I'll give him the files that show the error.
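
The revision check above is a search for the first bad revision. A minimal sketch of doing that as a binary search is given here for illustration only. The function name `first_bad_revision` and the `is_bad` callback are hypothetical, and the lookup table stands in for actually checking out and running each revision. It assumes the failure is deterministic and that every revision after the first bad one is also bad. Comment 9 suggests that assumption may not hold for this bug.

```python
def first_bad_revision(revisions, is_bad):
    """Return the earliest revision for which is_bad(revision) is true.

    Assumes `revisions` is sorted in ascending order and that once a
    revision is bad, every later revision is bad as well.
    Returns None if even the newest revision is good.
    """
    lo, hi = 0, len(revisions) - 1
    if not is_bad(revisions[hi]):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid          # first bad revision is at mid or earlier
        else:
            lo = mid + 1      # first bad revision is after mid
    return revisions[lo]


# Results recorded in this comment, standing in for real test runs.
observed = {4438: False, 4458: False, 4468: False, 4470: False,
            4471: True, 4473: True, 4478: True, 4490: True,
            4504: True, 4530: True}

culprit = first_bad_revision(sorted(observed), observed.get)
print(culprit)
```

With an intermittent failure, a revision that happens to pass once can be misclassified as good, so each candidate would need to be run several times before trusting the result.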

comment:9 Changed 17 years ago by ole

  • Owner changed from Ole to duncan

I get the same behaviour whether I run the most recent revision or any of the older ones:

4468: bad
4458: bad
4438: bad

However, I am not sure I am seeing the original error, only that one processor stops prematurely.

Can you provide the output you are getting from this test, or verify where in the code it dies in your installation?

comment:10 Changed 17 years ago by duncan

  • Owner changed from duncan to ole

Here's the error message:

-----------------------
*** glibc detected *** double free or corruption (!prev): 0x000000000162fc10 ***


One of the processes started by mpirun has exited with a nonzero exit
code.  This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.

PID 8578 failed on node n10 (192.168.0.241) due to signal 6.
-----------------------------------------------------------------------------

Maybe check that the code you are running is the old revision.

comment:11 Changed 17 years ago by ole

  • Resolution set to fixed
  • Status changed from new to closed

Fixed in changeset:4536. The problem was that the inverted triangle structure in general mesh took ghost nodes and ghost triangles into account. By restricting this function to 'full' nodes and 'full' triangles, the problem went away.
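
As a rough illustration of that restriction (not the code from changeset:4536), the sketch below builds a node-to-triangle lookup that covers only full triangles and full nodes. The function name and both count parameters are hypothetical, and it assumes full triangles and full nodes are stored before their ghost counterparts.

```python
def build_vertex_to_triangles(triangles, number_of_full_triangles,
                              number_of_full_nodes):
    """Map each full node to the (triangle, vertex) pairs that use it.

    `triangles` is a list of (n0, n1, n2) node-index triples.
    Assumption: full triangles come first in the list and full nodes
    have the lowest indices, so ghosts are excluded by the two counts.
    """
    vertex_to_triangles = [[] for _ in range(number_of_full_nodes)]
    for tri_id in range(number_of_full_triangles):
        for vertex_id, node in enumerate(triangles[tri_id]):
            # Defensive: skip any ghost node referenced by a full triangle.
            if node < number_of_full_nodes:
                vertex_to_triangles[node].append((tri_id, vertex_id))
    return vertex_to_triangles


# Two full triangles followed by one ghost triangle that uses ghost node 4.
triangles = [(0, 1, 2), (1, 3, 2), (2, 3, 4)]
inverted = build_vertex_to_triangles(triangles,
                                     number_of_full_triangles=2,
                                     number_of_full_nodes=4)
print(inverted)
```

The result has exactly one entry per full node, so any loop sized by the number of full nodes cannot be driven past the end by ghost entries.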

The error arose because current_node exceeded N (the total number of nodes) in _average_vertex_values in quantity_ext.c, so the array A was indexed out of bounds. C does not check array bounds, so an out-of-bounds access is undefined behaviour rather than a clean error.
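
To make the failure mode concrete, here is a small sketch of an averaging loop with an explicit guard on current_node. It is loosely modelled on the description above and is not the code in quantity_ext.c. The parameter names are assumptions, and every node is assumed to belong to at least one triangle.

```python
def average_vertex_values(vertex_value_indices, triangles_per_node,
                          vertex_values):
    """Average the vertex values that share each node.

    `vertex_value_indices` lists positions in `vertex_values`, grouped
    node by node; `triangles_per_node[i]` says how many belong to node i.
    Assumption: every node has at least one entry.
    """
    n_nodes = len(triangles_per_node)
    averages = [0.0] * n_nodes
    current_node = 0
    count = 0
    total = 0.0
    for index in vertex_value_indices:
        # The guard that a plain C loop does not get for free.
        if current_node >= n_nodes:
            raise IndexError("more vertex entries than the node counts allow")
        total += vertex_values[index]
        count += 1
        if count == triangles_per_node[current_node]:
            averages[current_node] = total / count
            current_node += 1
            count = 0
            total = 0.0
    return averages


# Two triangles, (0, 1, 2) and (1, 3, 2), with one value per vertex.
vertex_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
vertex_value_indices = [0, 1, 3, 2, 5, 4]   # grouped by node 0, 1, 2, 3
triangles_per_node = [1, 2, 2, 1]
averages = average_vertex_values(vertex_value_indices, triangles_per_node,
                                 vertex_values)
print(averages)
```

If the index list contains more entries than the per-node counts account for, for instance because ghost triangles were included in one structure but not the other, the guard raises an error instead of letting the loop run past the last node.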
