Opened 12 years ago

Closed 7 years ago

#18 closed enhancement (fixed)

checkpointing

Reported by: ole Owned by: nariman
Priority: normal Milestone: AnuGA ready for release
Component: Efficiency and optimisation Version:
Severity: normal Keywords:
Cc:

Description

Checkpointing (largely done in inundation/pyvolution/data_manager.py but needs to be revisited and properly tested)

Change History (9)

comment:1 Changed 12 years ago by ole

  • Component changed from Compilation and installation to Efficiency and optimisation

comment:2 Changed 11 years ago by anonymous

From another ticket:

Description by Nick <nick.bartzis@…>: Store all values of all variables to disk every 12-24 hours and develop some code to restart from this point.

This will avoid having to rerun the whole model if the node crashes.
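A minimal sketch of the periodic-save idea described above, not ANUGA code. The class name `PeriodicCheckpointer`, the `load_state` helper, and the use of `pickle` on a plain dictionary are all assumptions made for illustration; a real implementation would write the model quantities in whatever format the project settles on.

```python
import pickle
import time


class PeriodicCheckpointer:
    """Write a state dict to disk once at least `interval_seconds` have elapsed.

    Hypothetical helper: the name and the pickle format are illustrative only.
    """

    def __init__(self, path, interval_seconds, clock=time.time):
        self.path = path
        self.interval_seconds = interval_seconds
        self.clock = clock              # injectable so the logic can be tested
        self.last_saved = clock()

    def maybe_save(self, state):
        """Save `state` if the interval has elapsed; return True if written."""
        now = self.clock()
        if now - self.last_saved < self.interval_seconds:
            return False
        with open(self.path, "wb") as f:
            pickle.dump(state, f)
        self.last_saved = now
        return True


def load_state(path):
    """Read back the most recently saved state dict."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

The simulation loop would call `maybe_save(...)` after each step with an interval of, say, `12 * 3600` seconds, and a restarted run would begin from `load_state(...)` instead of the initial conditions.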

comment:3 Changed 9 years ago by ole

Rudy asked for this functionality 4 Dec 2007.

comment:4 Changed 8 years ago by ole

With flood models taking more than 4 weeks to complete, we really need this now. Looking at changeset:288 may be helpful and also the function sww2domain in data_manager.py. However, here's how I think it should work: Have a function

domain = restart_from_checkpoint()

which will restore stage, xmomentum, ymomentum, elevation, perhaps friction as well as mesh, time, starttime, name, georeference. The script must then redefine forcing terms and boundary conditions typically exactly as the original script did.

I imagine the function could take an sww file as input and use that. However, data in sww files is stored in single precision, and friction is currently not stored. Alternatively, if no sww file is specified, the function should look for a checkpoint file with the same format as sww but with data stored in double precision and only the last few timesteps present.
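To illustrate why single precision matters for a restart, here is a small sketch using only the Python standard library. The value of `stage` is made up, and `roundtrip` is a hypothetical helper; the point is only that a value written as a 4-byte float does not come back identical, whereas an 8-byte double does, so a run restarted from single-precision data would presumably not reproduce an uninterrupted run exactly.

```python
import struct


def roundtrip(value, fmt):
    """Pack `value` with the given struct format and unpack it again."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


stage = 12.3456789012345             # hypothetical water level in metres

single = roundtrip(stage, "<f")      # 4-byte float, as in single-precision storage
double = roundtrip(stage, "<d")      # 8-byte float, as proposed for checkpoint files

print("single-precision error:", abs(single - stage))
print("double round-trips exactly:", double == stage)
```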

Issues include what to do with parallel jobs.

comment:5 Changed 8 years ago by ole

  • Priority changed from low to high

comment:6 Changed 8 years ago by ole

Or perhaps name the function create_domain_from_checkpoint. A script could then look like this:

    try:
        domain = create_domain_from_checkpoint()
    except Exception:
        domain = create_domain_from_regions(...)
        domain.set_quantity('elevation', ...)
        # etc...

Do boundary conditions and forcing terms.

This could later become automatic in the generic ANUGA interface (see ticket:308)

Moreover, I think we should concentrate on a specific (NetCDF) format for checkpoint files containing quantities at the latest yieldstep including friction. This could also include the parallel communication lookup table. Otherwise, the format would reuse some of the sww code except it would use double precision.

Finally, I suggest that for each stored checkpoint file (domain.cpt) the previous one is kept just in case the latest file got corrupted. The name could be domain.backup.cpt.
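The backup scheme could be sketched roughly as follows. This is an assumption-laden illustration, not the implemented behaviour: `write_checkpoint` is a hypothetical name, the payload is treated as opaque bytes rather than NetCDF, and the temporary-file step is an extra safeguard not mentioned in the ticket.

```python
import os


def write_checkpoint(data, path="domain.cpt", backup_path="domain.backup.cpt"):
    """Write `data` (bytes) to `path`, keeping the previous file as a backup.

    Hypothetical helper. The new checkpoint is written to a temporary file
    first, so a crash during the write cannot damage the existing checkpoint.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())          # make sure the bytes reach the disk

    if os.path.exists(path):
        os.replace(path, backup_path)  # previous checkpoint becomes the backup
    os.replace(tmp_path, path)         # new checkpoint takes its place
```

A restart routine could then try `domain.cpt` first and fall back to `domain.backup.cpt` if the latest file cannot be read.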

comment:7 Changed 8 years ago by nariman

  • Priority changed from high to normal

comment:8 Changed 8 years ago by ole

  • Owner changed from ole to nariman

comment:9 Changed 7 years ago by hudson

  • Resolution set to fixed
  • Status changed from new to closed

Instructions for doing this are now on the ANUGA wiki.
