Opened 20 years ago
Closed 15 years ago
#18 closed enhancement (fixed)
checkpointing
Reported by: | ole | Owned by: | nariman |
---|---|---|---|
Priority: | normal | Milestone: | AnuGA ready for release |
Component: | Efficiency and optimisation | Version: | |
Severity: | normal | Keywords: | |
Cc: |
Description
Checkpointing (largely done in inundation/pyvolution/data_manager.py but needs to be revisited and properly tested)
Change History (9)
comment:1 Changed 20 years ago by
Component: | Compilation and installation → Efficiency and optimisation |
---|
comment:2 Changed 19 years ago by
comment:4 Changed 16 years ago by
With flood models taking more than 4 weeks to complete, we really need this now. Looking at changeset:288 may be helpful and also the function sww2domain in data_manager.py. However, here's how I think it should work: Have a function
domain = restart_from_checkpoint()
which will restore stage, xmomentum, ymomentum, elevation and perhaps friction, as well as mesh, time, starttime, name and georeference. The script must then redefine forcing terms and boundary conditions, typically exactly as the original script did.
I imagine the function could take an sww file as input and use that. However, sww data is stored in single precision and friction is currently not stored. Alternatively, if no sww file is specified, the function should look for a checkpoint file with the same format as sww but with data stored in double precision and only the last few timesteps present.
Issues include what to do with parallel jobs.
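To illustrate the single-precision concern raised above, here is a minimal sketch using only the Python standard library. It round-trips one value through a 4-byte and an 8-byte IEEE 754 float, which is roughly what happens when a quantity is written to and read back from a single- or double-precision file. The value of `stage` is a made-up number for illustration and the helper names are hypothetical, not part of ANUGA.

```python
import struct


def roundtrip_float32(x):
    """Pack x as a 4-byte IEEE 754 float and unpack it again."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def roundtrip_float64(x):
    """Pack x as an 8-byte IEEE 754 float and unpack it again."""
    return struct.unpack("<d", struct.pack("<d", x))[0]


# Hypothetical stage value in metres (illustrative only).
stage = 12.3456789012345

# Error introduced by storing the value and reading it back.
err32 = abs(roundtrip_float32(stage) - stage)
err64 = abs(roundtrip_float64(stage) - stage)

print(f"single precision round-trip error: {err32:.3e}")
print(f"double precision round-trip error: {err64:.3e}")
```

The double-precision round trip is lossless for a Python float, while the single-precision one perturbs the value slightly, so a run restarted from single-precision data would not continue from exactly the state it stopped in.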
comment:5 Changed 16 years ago by
Priority: | low → high |
---|
comment:6 Changed 16 years ago by
Or perhaps name the function create_domain_from_checkpoint. A script could then look like this:
try:
    domain = create_domain_from_checkpoint()
except Exception:
    domain = create_domain_from_regions(...)
    domain.set_quantity('elevation', ...)
    # etc...
Do boundary conditions and forcing terms.
This could later become automatic in the generic ANUGA interface (see ticket:308)
Moreover, I think we should concentrate on a specific (NetCDF) format for checkpoint files containing quantities, including friction, at the latest yieldstep. This could also include the parallel communication lookup table. Otherwise, the format would reuse some of the sww code except it would use double precision.
Finally, I suggest that for each stored checkpoint file (domain.cpt) the previous one is kept just in case the latest file got corrupted. The name could be domain.backup.cpt.
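The backup scheme suggested above could be sketched as follows. This is only an illustration under stated assumptions: the function `save_checkpoint` is hypothetical, the checkpoint payload is treated as opaque bytes rather than the NetCDF content proposed in this ticket, and only the file names domain.cpt and domain.backup.cpt come from the comment.

```python
import os
import tempfile


def save_checkpoint(data, directory, name="domain"):
    """Write a new checkpoint, keeping the previous one as <name>.backup.cpt.

    Hypothetical helper: `data` is opaque bytes standing in for the real
    checkpoint contents.
    """
    current = os.path.join(directory, name + ".cpt")
    backup = os.path.join(directory, name + ".backup.cpt")

    # Write to a temporary file first, so a crash during the write
    # cannot damage the existing checkpoint.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

    # Demote the existing checkpoint to backup, then promote the new file.
    if os.path.exists(current):
        os.replace(current, backup)
    os.replace(tmp_path, current)
    return current


# Usage: after two saves, the first state survives as the backup.
with tempfile.TemporaryDirectory() as workdir:
    save_checkpoint(b"state at yieldstep 1", workdir)
    save_checkpoint(b"state at yieldstep 2", workdir)
    files = sorted(os.listdir(workdir))
    with open(os.path.join(workdir, "domain.cpt"), "rb") as f:
        current_contents = f.read()
    with open(os.path.join(workdir, "domain.backup.cpt"), "rb") as f:
        backup_contents = f.read()

print(files)
```

Between the two renames there is a brief window in which only the backup exists, so a restart routine built on this sketch would need to fall back to domain.backup.cpt when domain.cpt is missing or unreadable.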
comment:7 Changed 16 years ago by
Priority: | high → normal |
---|
comment:8 Changed 16 years ago by
Owner: | changed from ole to nariman |
---|
comment:9 Changed 15 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
Instructions for doing this are now on the ANUGA wiki.
From another ticket:
Description by Nick <nick.bartzis@…>: Store all values of all variables to disk every 12-24 hours and develop some code to run from this point.
This will prevent having to rerun the whole model if the node crashes.
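One way to decide when a periodic checkpoint is due, as a rough sketch: a small timer object that is polled from the simulation loop and reports when the chosen wall-clock interval has elapsed. The class `CheckpointTimer` and its `clock` parameter are hypothetical names introduced for illustration and are not part of ANUGA. The demonstration uses a fake clock so that it runs instantly.

```python
import time


class CheckpointTimer:
    """Report when at least `interval_seconds` have passed since the last checkpoint."""

    def __init__(self, interval_seconds, clock=time.monotonic):
        self.interval = interval_seconds
        self.clock = clock
        self.last = clock()  # time of the most recent checkpoint (or start)

    def due(self):
        """Return True and reset the timer if a checkpoint should be written now."""
        now = self.clock()
        if now - self.last >= self.interval:
            self.last = now
            return True
        return False


# Demonstration with a fake clock that advances 6 hours per call.
fake_hours = iter([0, 6, 12, 18, 24, 30])
timer = CheckpointTimer(12 * 3600, clock=lambda: next(fake_hours) * 3600)

# Poll five times, as a simulation loop might do once per yieldstep.
decisions = [timer.due() for _ in range(5)]
print(decisions)
```

In a real run the default `time.monotonic` clock would be used, and the loop would write the checkpoint whenever `due()` returns True.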