Opened 17 years ago
Closed 17 years ago
#165 closed defect (fixed)
"glibc detected" error in exmouth when run in parallel
Reported by: | nick | Owned by: | ole |
---|---|---|---|
Priority: | high | Milestone: | |
Component: | Functionality and features | Version: | |
Severity: | normal | Keywords: | |
Cc: |
Description (last modified by duncan)
Can you provide any information about this error? It is the third time it has occurred with this code.
-----------------------
*** glibc detected *** double free or corruption (!prev): 0x000000000162fc10 ***
One of the processes started by mpirun has exited with a nonzero exit code. This typically indicates that the process finished in error. If your process did not finish in error, be sure to include a "return 0" or "exit(0)" in your C code before exiting the application.
PID 8578 failed on node n10 (192.168.0.241) due to signal 6.
-----------------------------------------------------------------------------
It occurs when running an exmouth model in parallel (the sequential run works).
This is the directory of the parallel run: J:\inundation\data\western_australia\exmouth_tsunami_scenario\anuga\outputs\20070525_034301_run_basic_1.4_nbartzis
This is the directory of the last serial run: J:\inundation\data\western_australia\exmouth_tsunami_scenario\anuga\outputs\20070525_033245_run_basic_1.4_nbartzis
+-------------------------------------------------------------
| Fri May 25 15:02:22 2007. Evaluating function _file_function
+-------------------------------------------------------------
| Argument: '/d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww'
| Keyword Args: {'quantities': ['stage', 'xmomentum', 'ymomentum'], 'interpolation_points': Array: (187, 2), 'time_thinning': 12, 'domain_starttime': 0, 'verbose': True}
| Reason: No cached result
+-------------------------------------------------------------
Reading /d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww
File_function data obtained from: /d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww
References:
Lower left corner: [189945.759564, 7488724.335279]
Start time: 4000.000000
Building interpolation matrix from source mesh (648 vertices, 1086 triangles)
FitInterpolate: Building mesh
FitInterpolate: Building quad tree
Interpolating (187 interpolation points, 1034 timesteps).
Timesteps were thinned by a factor of 12
time step 0 of 1034
time step 104 of 1034
time step 208 of 1034
time step 312 of 1034
Here is an example of the same problem with code that runs quicker:
\inundation\data\western_australia\exmouth_tsunami_scenario\anuga\outputs\20070530_062210_run_store_0_nbartzis
You should be able to execute "run_exmouth.py" from this directory, and it will create another directory with the output results.
There is one difference I can see: the code fails at a slightly different spot.
Find midpoint coordinates of entire boundary
Initialise file_function
Caching: looking for cached files /d/cit/1/cit/unixhome/nbartzis/.python_cache/_file_function[-1913934578119235209]_{Result,Args,Admin}.z
Caching: Dependencies are ['/d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww']
+-------------------------------------------------------------
| Tue Jun 5 08:49:22 2007. Evaluating function _file_function
+-------------------------------------------------------------
| Argument: '/d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww'
| Keyword Args: {'quantities': ['stage', 'xmomentum', 'ymomentum'], 'interpolation_points': [[ 192783.96757945 7566067.52060781] [ 192597.7735207 7576851.25925281] [ 192621.04777805 7575503.29192219] [ 192807.2418368 7564719.55327719] [ 192644.32203539 7574155.32459156] [ 192714.14480742 7570111....
| Reason: No cached result
+-------------------------------------------------------------
Reading /d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww
File_function data obtained from: /d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww
References:
Lower left corner: [189945.759564, 7488724.335279]
Start time: 4000.000000
Processes 1, 2 and 3 fail at this point, and processor 0 continues until it creates a .sww file. All the results are in the directory noted above.
Here's how processor 0 fails:
+-------------------------------------------------------------
| Wed May 30 16:23:08 2007. Evaluating function _file_function
+-------------------------------------------------------------
| Argument: '/d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww'
| Keyword Args: {'quantities': ['stage', 'xmomentum', 'ymomentum'], 'interpolation_points': Array: (305, 2), 'time_thinning': 48, 'domain_starttime': 0, 'verbose': True}
| Reason: No cached result
+-------------------------------------------------------------
Reading /d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww
File_function data obtained from: /d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww
References:
Lower left corner: [189945.759564, 7488724.335279]
Start time: 4000.000000
Change History (11)
comment:1 Changed 17 years ago by nick
- Description modified (diff)
comment:2 Changed 17 years ago by nick
- Description modified (diff)
- Owner changed from ole to duncan
- Summary changed from problem with C code in file_function to problem with C code in fit interpolate
comment:3 Changed 17 years ago by nick
- Description modified (diff)
comment:4 Changed 17 years ago by nick
- Description modified (diff)
comment:5 Changed 17 years ago by duncan
- Description modified (diff)
comment:6 Changed 17 years ago by duncan
- Status changed from new to assigned
- Summary changed from problem with C code in fit interpolate to glibc detected error in exmouth when ran in parallal
In W:\inundation\data\western_australia\exmouth_tsunami_scenario\anuga\outputs\20070530_062210_run_store_0_nbartzis
the sww file produced is filled with null values.
comment:7 Changed 17 years ago by duncan
- Summary changed from glibc detected error in exmouth when ran in parallal to "glibc detected" error in exmouth when ran in parallal
comment:8 Changed 17 years ago by duncan
- Owner changed from duncan to Ole
- Status changed from assigned to new
Checking how past revisions go:
4530 bad
4504 bad
4490 bad
4478 bad
4473 bad
4471 bad
4470 ok
4468 ok
4458 ok
4438 ok
So change 4471 seems to be the problem.
I'll get Ole to check this. I'll give him the files that show the error.
comment:9 Changed 17 years ago by ole
- Owner changed from Ole to duncan
I get the same behaviour whether running the most recent revision or any of the older ones:
4468: bad
4458: bad
4438: bad
However, I am not sure I am seeing the original error, only that one processor stops prematurely.
Can you provide the output you are getting from this test, or verify where in the code it dies in your installation?
comment:10 Changed 17 years ago by duncan
- Owner changed from duncan to ole
Here's the error message:
-----------------------
*** glibc detected *** double free or corruption (!prev): 0x000000000162fc10 ***
One of the processes started by mpirun has exited with a nonzero exit code. This typically indicates that the process finished in error. If your process did not finish in error, be sure to include a "return 0" or "exit(0)" in your C code before exiting the application.
PID 8578 failed on node n10 (192.168.0.241) due to signal 6.
-----------------------------------------------------------------------------
Maybe check that the code you are running is the old revision.
comment:11 Changed 17 years ago by ole
- Resolution set to fixed
- Status changed from new to closed
Fixed in changeset:4536. The problem was that the inverted triangle structure in general mesh took ghost nodes and ghost triangles into account. Restricting this function to 'full' nodes and 'full' triangles made the problem go away.
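To illustrate the idea of restricting the inverted structure to full triangles, here is a minimal Python sketch. It is not the code from the changeset: the helper name `build_node_to_triangles`, its parameters, and the assumption that full triangles are stored before ghost triangles are all hypothetical and chosen only for illustration.

```python
def build_node_to_triangles(triangles, number_of_nodes, full_triangle_count=None):
    """Map each node index to the ids of the triangles that contain it.

    Hypothetical helper. `full_triangle_count` assumes full triangles come
    first in `triangles` and ghost triangles follow; this layout is an
    assumption of the sketch, not a statement about the real mesh code.
    """
    if full_triangle_count is None:
        full_triangle_count = len(triangles)
    node_to_triangles = [[] for _ in range(number_of_nodes)]
    # Only visit full triangles, so ghost nodes are never used as an index.
    for tri_id in range(full_triangle_count):
        for node in triangles[tri_id]:
            node_to_triangles[node].append(tri_id)
    return node_to_triangles


# Two full triangles (ids 0 and 1) followed by one ghost triangle (id 2)
# that refers to ghost node 4, which lies outside the 4 full nodes.
triangles = [(0, 1, 2), (1, 3, 2), (3, 4, 2)]
full_only = build_node_to_triangles(triangles, number_of_nodes=4,
                                    full_triangle_count=2)
```

In this sketch, including the ghost triangle would index node 4 in a structure sized for 4 nodes. Python raises an `IndexError` in that case, whereas the equivalent C code would silently read or write past the end of the array.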
The error arose from current_node exceeding N (the total number of nodes) in _average_vertex_values in quantity_ext.c, so that the array A was indexed out of its bounds. This is not a good idea to do in the C language.
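The kind of guard that would have turned this memory corruption into a clear error can be sketched in Python as follows. This is a simplified, hypothetical stand-in and does not reproduce the signature or logic of _average_vertex_values in quantity_ext.c; the function name and arguments are assumptions made for the example.

```python
def average_vertex_values(node_ids, values, number_of_nodes):
    """Average per-vertex values onto nodes, rejecting out-of-range node ids.

    Hypothetical simplification: `node_ids[i]` is the node that `values[i]`
    belongs to, and `number_of_nodes` plays the role of N.
    """
    sums = [0.0] * number_of_nodes
    counts = [0] * number_of_nodes
    for node, value in zip(node_ids, values):
        # Explicit bounds check: fail loudly instead of touching memory
        # outside the arrays (Python would also wrap negative indices).
        if not 0 <= node < number_of_nodes:
            raise IndexError(
                f"node {node} outside valid range 0..{number_of_nodes - 1}")
        sums[node] += value
        counts[node] += 1
    # Nodes with no contributions are reported as 0.0 in this sketch.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


averaged = average_vertex_values([0, 1, 1, 2], [1.0, 2.0, 4.0, 5.0],
                                 number_of_nodes=3)
```

Calling the function with a node id of 3 while `number_of_nodes` is 3 raises an `IndexError` immediately, rather than corrupting the heap and failing later inside glibc.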
Here's how