Opened 18 years ago
Closed 18 years ago
#165 closed defect (fixed)
"glibc detected" error in exmouth when run in parallel
Reported by: | nick | Owned by: | ole |
---|---|---|---|
Priority: | high | Milestone: | |
Component: | Functionality and features | Version: | |
Severity: | normal | Keywords: | |
Cc: |
Description
Can you provide any information about this error? It is the third time it has occurred with this code.
-----------------------
*** glibc detected *** double free or corruption (!prev): 0x000000000162fc10 ***
One of the processes started by mpirun has exited with a nonzero exit code.
This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return 0" or "exit(0)" in your C code before exiting the application.
PID 8578 failed on node n10 (192.168.0.241) due to signal 6.
-----------------------------------------------------------------------------
It occurs when running an exmouth model in parallel (the sequential run works).
This is the directory of the parallel run: J:\inundation\data\western_australia\exmouth_tsunami_scenario\anuga\outputs\20070525_034301_run_basic_1.4_nbartzis
This is the directory of the last serial run: J:\inundation\data\western_australia\exmouth_tsunami_scenario\anuga\outputs\20070525_033245_run_basic_1.4_nbartzis
+-------------------------------------------------------------
| Fri May 25 15:02:22 2007. Evaluating function _file_function
+-------------------------------------------------------------
| Argument: '/d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww'
| Keyword Args: {'quantities': ['stage', 'xmomentum', 'ymomentum'], 'interpolation_points': Array: (187, 2), 'time_thinning': 12, 'domain_starttime': 0, 'verbose': True}
| Reason: No cached result
+-------------------------------------------------------------
Reading /d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww
File_function data obtained from: /d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww
References:
Lower left corner: [189945.759564, 7488724.335279]
Start time: 4000.000000
Building interpolation matrix from source mesh (648 vertices, 1086 triangles)
FitInterpolate: Building mesh
FitInterpolate: Building quad tree
Interpolating (187 interpolation points, 1034 timesteps).
Timesteps were thinned by a factor of 12
time step 0 of 1034
time step 104 of 1034
time step 208 of 1034
time step 312 of 1034
Here is an example of the same problem with code that runs more quickly:
\inundation\data\western_australia\exmouth_tsunami_scenario\anuga\outputs\20070530_062210_run_store_0_nbartzis
You should be able to execute "run_exmouth.py" from this directory, and it will create another directory with the output results.
There is one difference I can see: the code fails at a slightly different spot.
Find midpoint coordinates of entire boundary
Initialise file_function
Caching: looking for cached files /d/cit/1/cit/unixhome/nbartzis/.python_cache/_file_function[-1913934578119235209]_{Result,Args,Admin}.z
Caching: Dependencies are ['/d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww']
+-------------------------------------------------------------
| Tue Jun 5 08:49:22 2007. Evaluating function _file_function
+-------------------------------------------------------------
| Argument: '/d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww'
| Keyword Args: {'quantities': ['stage', 'xmomentum', 'ymomentum'], 'interpolation_points': [[ 192783.96757945 7566067.52060781] [ 192597.7735207 7576851.25925281] [ 192621.04777805 7575503.29192219] [ 192807.2418368 7564719.55327719] [ 192644.32203539 7574155.32459156] [ 192714.14480742 7570111....
| Reason: No cached result
+-------------------------------------------------------------
Reading /d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww
File_function data obtained from: /d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww
References:
Lower left corner: [189945.759564, 7488724.335279]
Start time: 4000.000000
Processes 1, 2 and 3 fail at this point, and processor 0 continues until it creates a .sww file. All the results are in the directory noted above.
Here's how processor 0 fails:
+-------------------------------------------------------------
| Wed May 30 16:23:08 2007. Evaluating function _file_function
+-------------------------------------------------------------
| Argument: '/d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww'
| Keyword Args: {'quantities': ['stage', 'xmomentum', 'ymomentum'], 'interpolation_points': Array: (305, 2), 'time_thinning': 48, 'domain_starttime': 0, 'verbose': True}
| Reason: No cached result
+-------------------------------------------------------------
Reading /d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww
File_function data obtained from: /d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww
References:
Lower left corner: [189945.759564, 7488724.335279]
Start time: 4000.000000
Change History (11)
comment:1 Changed 18 years ago by
Description: | modified (diff) |
---|
comment:2 Changed 18 years ago by
Description: | modified (diff) |
---|---|
Owner: | changed from ole to duncan |
Summary: | problem with C code in file_function → problem with C code in fit interpolate |
comment:3 Changed 18 years ago by
Description: | modified (diff) |
---|
comment:4 Changed 18 years ago by
Description: | modified (diff) |
---|
comment:5 Changed 18 years ago by
Description: | modified (diff) |
---|
comment:6 Changed 18 years ago by
Status: | new → assigned |
---|---|
Summary: | problem with C code in fit interpolate → glibc detected error in exmouth when ran in parallal |
In W:\inundation\data\western_australia\exmouth_tsunami_scenario\anuga\outputs\20070530_062210_run_store_0_nbartzis
the sww file produced is filled with null values.
comment:7 Changed 18 years ago by
Summary: | glibc detected error in exmouth when ran in parallal → "glibc detected" error in exmouth when ran in parallal |
---|
comment:8 Changed 18 years ago by
Owner: | changed from duncan to Ole |
---|---|
Status: | assigned → new |
Checking how past revisions go:
4530: bad
4504: bad
4490: bad
4478: bad
4473: bad
4471: bad
4470: ok
4468: ok
4458: ok
4438: ok
So change 4471 seems to be the problem.
I'll get Ole to check this. I'll give him the files that show the error.
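The revision check above amounts to finding the oldest revision that fails while every older tested revision passes. A minimal sketch of that reasoning, using only the results listed in this comment; the function name `first_bad_revision` is hypothetical and is not part of ANUGA or any tool used on this ticket:

```python
# Results as reported in this comment: True means "ok", False means "bad".
results = {
    4530: False, 4504: False, 4490: False, 4478: False, 4473: False,
    4471: False, 4470: True, 4468: True, 4458: True, 4438: True,
}


def first_bad_revision(results):
    """Return the oldest bad revision, or None if every revision is ok.

    Raises ValueError if an ok revision is newer than a bad one, because
    then no single change can be blamed from these results alone.
    """
    revisions = sorted(results)
    bad = [r for r in revisions if not results[r]]
    if not bad:
        return None
    candidate = bad[0]
    # Every tested revision newer than the candidate must also be bad.
    if any(results[r] for r in revisions if r > candidate):
        raise ValueError("results are not consistent with a single breaking change")
    return candidate


print(first_bad_revision(results))  # 4471
```

The conclusion only holds if each revision was really the one being run, which is why a clean checkout per revision matters when repeating this kind of test.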
comment:9 Changed 18 years ago by
Owner: | changed from Ole to duncan |
---|
I get the same behaviour whether running the most recent revision or any of the older ones:
4468: bad
4458: bad
4438: bad
However, I am not sure I am seeing the original error, only that one processor stops prematurely.
Can you provide the output you are getting from this test, or verify where in the code it dies in your installation?
comment:10 Changed 18 years ago by
Owner: | changed from duncan to ole |
---|
Here's the error message:
-----------------------
*** glibc detected *** double free or corruption (!prev): 0x000000000162fc10 ***
One of the processes started by mpirun has exited with a nonzero exit code.
This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return 0" or "exit(0)" in your C code before exiting the application.
PID 8578 failed on node n10 (192.168.0.241) due to signal 6.
-----------------------------------------------------------------------------
Maybe check that the code you are running is the old revision.
comment:11 Changed 18 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
Fixed in changeset:4536. The problem was that the inverted triangle structure in general mesh took ghost nodes and ghost triangles into account. By restricting this function to 'full' nodes and 'full' triangles, the problem went away.
The error arose from current_node exceeding N (the total number of nodes) in _average_vertex_values in quantity_ext.c, so that the array A was indexed out of bounds. This is not a good idea in C.
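As an illustration of the explanation above, here is a toy sketch in Python rather than the actual ANUGA code. It assumes a tiny mesh with four full nodes (0 to 3), one ghost node (4), two full triangles and one ghost triangle. The names `build_inverted_structure` and `average_vertex_values` and the `full_triangles` argument are hypothetical. Python raises an `IndexError` where unchecked C indexing would silently touch memory outside the array, which is the kind of damage glibc can later report as "double free or corruption".

```python
def build_inverted_structure(triangles, number_of_nodes, full_triangles=None):
    """Map each node to the (triangle, vertex slot) pairs that touch it.

    If full_triangles is given, only those triangle ids are used, so ghost
    triangles (and the ghost nodes they refer to) are left out.
    """
    included = range(len(triangles)) if full_triangles is None else full_triangles
    node_to_triangles = [[] for _ in range(number_of_nodes)]
    for t in included:
        for slot, node in enumerate(triangles[t]):
            # Explicit bounds check: this is what unchecked C code lacks.
            if not 0 <= node < number_of_nodes:
                raise IndexError(f"node {node} is outside 0..{number_of_nodes - 1}")
            node_to_triangles[node].append((t, slot))
    return node_to_triangles


def average_vertex_values(node_to_triangles, vertex_values):
    """Average, per node, the values of all triangle vertices sharing it."""
    averages = []
    for entries in node_to_triangles:
        total = sum(vertex_values[t][slot] for t, slot in entries)
        averages.append(total / len(entries) if entries else None)
    return averages


# Triangles 0 and 1 are full; triangle 2 is a ghost and refers to ghost node 4.
triangles = [(0, 1, 2), (1, 3, 2), (1, 4, 3)]
vertex_values = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
number_of_full_nodes = 4

# Restricted to full triangles, every node index stays below N.
structure = build_inverted_structure(triangles, number_of_full_nodes, [0, 1])
averages = average_vertex_values(structure, vertex_values)
print(averages)  # [1.0, 3.0, 4.5, 5.0]

# Including the ghost triangle reaches node 4, which exceeds N = 4.
try:
    build_inverted_structure(triangles, number_of_full_nodes)
except IndexError as error:
    print("out of bounds:", error)
```

The sketch only shows why mixing ghost entries with an array sized for full nodes leads to an out-of-range index; the real data layout in general mesh and quantity_ext.c may differ.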
Here's how