Opened 19 years ago
Closed 19 years ago
#165 closed defect (fixed)
"glibc detected" error in exmouth when run in parallel
| Reported by: | nick | Owned by: | ole |
|---|---|---|---|
| Priority: | high | Milestone: | |
| Component: | Functionality and features | Version: | |
| Severity: | normal | Keywords: | |
| Cc: | | | |
Description
Can you provide any information about this error? It is the third time it has occurred with this code.
-----------------------
*** glibc detected *** double free or corruption (!prev): 0x000000000162fc10 ***
One of the processes started by mpirun has exited with a nonzero exit code. This typically indicates that the process finished in error. If your process did not finish in error, be sure to include a "return 0" or "exit(0)" in your C code before exiting the application.
PID 8578 failed on node n10 (192.168.0.241) due to signal 6.
-----------------------------------------------------------------------------
It occurs when running an Exmouth model in parallel (the sequential run works).
This is the directory of the parallel run: J:\inundation\data\western_australia\exmouth_tsunami_scenario\anuga\outputs\20070525_034301_run_basic_1.4_nbartzis
This is the directory of the last serial run: J:\inundation\data\western_australia\exmouth_tsunami_scenario\anuga\outputs\20070525_033245_run_basic_1.4_nbartzis
+-------------------------------------------------------------
| Fri May 25 15:02:22 2007. Evaluating function _file_function
+-------------------------------------------------------------
| Argument: '/d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww'
| Keyword Args: {'quantities': ['stage', 'xmomentum', 'ymomentum'], 'interpolation_points': Array: (187, 2), 'time_thinning': 12, 'domain_starttime': 0, 'verbose': True}
| Reason: No cached result
+-------------------------------------------------------------
Reading /d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww
File_function data obtained from: /d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww
References:
Lower left corner: [189945.759564, 7488724.335279]
Start time: 4000.000000
Building interpolation matrix from source mesh (648 vertices, 1086 triangles)
FitInterpolate: Building mesh
FitInterpolate: Building quad tree
Interpolating (187 interpolation points, 1034 timesteps). Timesteps were thinned by a factor of 12
time step 0 of 1034
time step 104 of 1034
time step 208 of 1034
time step 312 of 1034
Here is an example of the same problem with code that runs more quickly:
\inundation\data\western_australia\exmouth_tsunami_scenario\anuga\outputs\20070530_062210_run_store_0_nbartzis
You should be able to execute "run_exmouth.py" from this directory, and it will create another directory with the output results.
There is one difference I can see: the code fails at a slightly different spot.
Find midpoint coordinates of entire boundary
Initialise file_function
Caching: looking for cached files /d/cit/1/cit/unixhome/nbartzis/.python_cache/_file_function[-1913934578119235209]_{Result,Args,Admin}.z
Caching: Dependencies are ['/d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww']
+-------------------------------------------------------------
| Tue Jun 5 08:49:22 2007. Evaluating function _file_function
+-------------------------------------------------------------
| Argument: '/d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww'
| Keyword Args: {'quantities': ['stage', 'xmomentum', 'ymomentum'], 'interpolation_points': [[ 192783.96757945 7566067.52060781]
[ 192597.7735207 7576851.25925281]
[ 192621.04777805 7575503.29192219]
[ 192807.2418368 7564719.55327719]
[ 192644.32203539 7574155.32459156]
[ 192714.14480742 7570111....
| Reason: No cached result
+-------------------------------------------------------------
Reading /d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww
File_function data obtained from: /d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww
References:
Lower left corner: [189945.759564, 7488724.335279]
Start time: 4000.000000
Processes 1, 2 and 3 fail at this point, while process 0 continues until it creates a .sww file. All the results are in the directory noted above.
Here's how process 0 fails:
+-------------------------------------------------------------
| Wed May 30 16:23:08 2007. Evaluating function _file_function
+-------------------------------------------------------------
| Argument: '/d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww'
| Keyword Args: {'quantities': ['stage', 'xmomentum', 'ymomentum'], 'interpolation_points': Array: (305, 2), 'time_thinning': 48, 'domain_starttime': 0, 'verbose': True}
| Reason: No cached result
+-------------------------------------------------------------
Reading /d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww
File_function data obtained from: /d/xrd/gem/5/nhi/inundation/data/western_australia/exmouth_tsunami_scenario/anuga/boundaries/urs/dampier/1_10000/exmouth_3854_17042007.sww
References:
Lower left corner: [189945.759564, 7488724.335279]
Start time: 4000.000000
Change History (11)
comment:1 Changed 19 years ago by
| Description: | modified (diff) |
|---|---|
comment:2 Changed 19 years ago by
| Description: | modified (diff) |
|---|---|
| Owner: | changed from ole to duncan |
| Summary: | problem with C code in file_function → problem with C code in fit interpolate |
comment:3 Changed 19 years ago by
| Description: | modified (diff) |
|---|---|
comment:4 Changed 19 years ago by
| Description: | modified (diff) |
|---|---|
comment:5 Changed 19 years ago by
| Description: | modified (diff) |
|---|---|
comment:6 Changed 19 years ago by
| Status: | new → assigned |
|---|---|
| Summary: | problem with C code in fit interpolate → glibc detected error in exmouth when ran in parallal |
In W:\inundation\data\western_australia\exmouth_tsunami_scenario\anuga\outputs\20070530_062210_run_store_0_nbartzis
the sww file produced is filled with null values.
comment:7 Changed 19 years ago by
| Summary: | glibc detected error in exmouth when ran in parallal → "glibc detected" error in exmouth when ran in parallal |
|---|---|
comment:8 Changed 19 years ago by
| Owner: | changed from duncan to Ole |
|---|---|
| Status: | assigned → new |
Checking how past revisions go:
4530 bad
4504 bad
4490 bad
4478 bad
4473 bad
4471 bad
4470 ok
4468 ok
4458 ok
4438 ok
So change 4471 seems to be the problem.
I'll get Ole to check this. I'll give him the files that show the error.
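The revision check above amounts to a search for the first bad revision. A minimal sketch of that search follows, assuming the results form a clean ok-then-bad sequence as recorded in this comment. The `RESULTS` table is copied from the list above; `first_bad_revision` and the `is_bad` predicate are hypothetical names, and in practice `is_bad` would check out a revision and run the parallel test rather than look up a recorded result.

```python
# Results as recorded in this comment, keyed by revision number.
RESULTS = {
    4438: "ok", 4458: "ok", 4468: "ok", 4470: "ok",
    4471: "bad", 4473: "bad", 4478: "bad", 4490: "bad",
    4504: "bad", 4530: "bad",
}


def first_bad_revision(revisions, is_bad):
    """Binary search for the earliest bad revision.

    Assumes `revisions` is sorted ascending and that every revision
    before the first bad one is ok and every later one is bad.
    Returns None if no revision is bad.
    """
    lo, hi = 0, len(revisions)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid          # first bad revision is at mid or earlier
        else:
            lo = mid + 1      # first bad revision is after mid
    return revisions[lo] if lo < len(revisions) else None


revisions = sorted(RESULTS)
culprit = first_bad_revision(revisions, lambda r: RESULTS[r] == "bad")
print(culprit)
```

With ten candidate revisions this needs at most four test runs, which matters when each run is a full parallel simulation.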
comment:9 Changed 19 years ago by
| Owner: | changed from Ole to duncan |
|---|---|
I get the same behaviour whether running the most recent revision or any of the older ones:
4468: bad
4458: bad
4438: bad
However, I am not sure I am seeing the original error, only that one processor stops prematurely.
Can you provide the output you are getting from this test, or verify where in the code it dies in your installation?
comment:10 Changed 19 years ago by
| Owner: | changed from duncan to ole |
|---|---|
Here's the error message:
-----------------------
*** glibc detected *** double free or corruption (!prev): 0x000000000162fc10 ***
One of the processes started by mpirun has exited with a nonzero exit code. This typically indicates that the process finished in error. If your process did not finish in error, be sure to include a "return 0" or "exit(0)" in your C code before exiting the application.
PID 8578 failed on node n10 (192.168.0.241) due to signal 6.
-----------------------------------------------------------------------------
Maybe check that the code you are running is the old revision.
comment:11 Changed 19 years ago by
| Resolution: | → fixed |
|---|---|
| Status: | new → closed |
Fixed in changeset:4536. The problem was that the inverted triangle structure in general mesh took ghost nodes and ghost triangles into account. By restricting this function to 'full' nodes and 'full' triangles, the problem went away.
The error arose because current_node exceeded N (the total number of nodes) in _average_vertex_values in quantity_ext.c, so the array A was indexed out of its bounds. This is not a good idea in the C language.
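The idea behind the fix can be illustrated with a small sketch. This is not the code from general mesh or quantity_ext.c. The function names, the `number_of_full_nodes` and `number_of_full_triangles` parameters, and the assumption that ghost nodes and ghost triangles are numbered after the full ones are all hypothetical, chosen only to show why restricting the inverted structure to full entities keeps every node index within bounds.

```python
def build_inverted_structure(triangles, number_of_full_nodes,
                             number_of_full_triangles):
    """Map each full node to the (triangle, vertex) pairs that touch it.

    Hypothetical helper: ghost triangles (index >= number_of_full_triangles)
    and ghost nodes (index >= number_of_full_nodes) are skipped, so no
    entry can ever refer to a node outside the full range.
    """
    node_to_vertices = [[] for _ in range(number_of_full_nodes)]
    for t in range(number_of_full_triangles):
        for v, node in enumerate(triangles[t]):
            if node < number_of_full_nodes:
                node_to_vertices[node].append((t, v))
    return node_to_vertices


def average_vertex_values(vertex_values, node_to_vertices):
    """Return one averaged value per full node (None for unused nodes)."""
    averages = []
    for touching in node_to_vertices:
        if not touching:
            averages.append(None)
            continue
        total = sum(vertex_values[t][v] for t, v in touching)
        averages.append(total / len(touching))
    return averages


# Two full triangles on nodes 0..3, plus one ghost triangle using ghost node 4.
triangles = [(0, 1, 2), (1, 2, 3), (2, 3, 4)]
vertex_values = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

structure = build_inverted_structure(triangles, number_of_full_nodes=4,
                                     number_of_full_triangles=2)
averages = average_vertex_values(vertex_values, structure)
print(averages)
```

In Python an out-of-range index raises IndexError; the equivalent access in C silently writes past the end of the array, which is consistent with the heap corruption glibc reported.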
