(v0.103.0) Checkpointing simulations #4892
Conversation
Looks like tests will all pass 🎉 I'll start testing the checkpointing of increasingly complex simulations while iterating on the design! This way we'll be able to weed out most bugs and issues.
Ok I think this PR is finally ready for review! I went through and added checkpointing support + tests to different time steppers, models, schedules, turbulence closures, etc., but let me know if you feel like I missed something. This PR has a few breaking changes, so we should tag v0.103.0 once it's merged.
The PR is getting quite large and distributed CI is not working, so I'll test checkpointing support for distributed models in a subsequent PR.
There's been a lot of progress since this PR was opened, so I'm wondering if this is still true. I don't oppose this, but I also don't oppose saving non-prognostic information when it's easy, and throwing a warning to the user if we flag differences there. Mistakes happen and I think this sort of user-friendliness can avoid some headache. Again, only when it's easy to do so, though. I definitely don't think
```julia
set!(simulation, filepath)  # restore from specific file (no Checkpointer required)
set!(simulation, true)      # restore from latest checkpoint (requires Checkpointer)
set!(simulation, iteration) # restore from specific iteration (requires Checkpointer)
```
❤️
Just realized that `set!(simulation, true)` looks vague. As in, if you saw that line in a script randomly you wouldn't know what it does. So I might add kwargs so it instead looks like

```julia
set!(simulation; filepath="spinup.jld2")
set!(simulation; use_latest_checkpoint=true)
set!(simulation; iteration=12345)
```
Agreed!
Only qualm is that `false` means nothing.
An alternative is a symbol, e.g. `set!(simulation, :latest_checkpoint)`.
I do like the kwarg for the other options.
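For what it's worth, a rough sketch of how the kwarg interface could forward to the positional methods in this PR (the keyword names are just the ones floated above, not settled API):

```julia
# Hypothetical kwarg wrapper around the positional set! methods added in this PR.
# Keyword names (filepath, use_latest_checkpoint, iteration) are illustrative only.
function set!(sim::Simulation; filepath=nothing, use_latest_checkpoint=false, iteration=nothing)
    if !isnothing(filepath)
        return set!(sim, filepath)   # restore from a specific checkpoint file
    elseif use_latest_checkpoint
        return set!(sim, true)       # restore from the latest checkpoint
    elseif !isnothing(iteration)
        return set!(sim, iteration)  # restore from a specific iteration
    else
        throw(ArgumentError("Provide filepath, use_latest_checkpoint=true, or iteration."))
    end
end
```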
For cluster jobs with time limits, use `WallTimeInterval` to checkpoint based on elapsed wall-clock time rather than simulation time or iterations:

```julia
# Checkpoint every 30 minutes of wall-clock time
Checkpointer(model, schedule=WallTimeInterval(30minute), prefix="checkpoint")
```

This ensures checkpoints are saved regularly even if individual time steps vary significantly.
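For context, attaching this writer to a simulation follows the usual output-writer pattern (a sketch; the `:checkpointer` key name is arbitrary):

```julia
# Register the checkpointer with an existing simulation
simulation.output_writers[:checkpointer] =
    Checkpointer(model, schedule=WallTimeInterval(30minute), prefix="checkpoint")
```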
This may be a job for another PR, but should we add an option to write a checkpoint when the wall_time_limit is exceeded? It might be super useful exactly for this sort of cluster situation, such that we don't run any time step twice regardless of how many pickups a simulation ends up needing.
Definitely a good use case. I was thinking we can do something more general and checkpoint after a simulation stops running (whether it is done or not)?
Sounds good to me!
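Until something like that exists, a rough workaround might be to trigger one extra checkpoint manually after `run!` returns, whatever the stop reason was. This is only a sketch and assumes `write_output!` is the checkpointer's write entry point, which should be double-checked:

```julia
# Sketch (assumes write_output! is how output writers, including Checkpointer,
# are invoked internally; verify before relying on this).
using Oceananigans.OutputWriters: write_output!

run!(simulation)  # returns when any stop criterion (including wall_time_limit) is met

# Write a final checkpoint so no time step has to be repeated on pickup
write_output!(simulation.output_writers[:checkpointer], simulation.model)
```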
ext/OceananigansNCDatasetsExt.jl
```julia
    return (
        schedule = prognostic_state(writer.schedule),
        part = writer.part,
        windowed_time_averages = isempty(wta_outputs) ? nothing : wta_outputs,
    )
```
Do we wanna enforce the formatting we usually prefer in Oceananigans? Namely:

```julia
    return (schedule = prognostic_state(writer.schedule),
            part = writer.part,
            windowed_time_averages = isempty(wta_outputs) ? nothing : wta_outputs)
```
I prefer keeping it consistent (and changing it globally if we want) but it is not the most critical thing in the world
Yeah I can be a bit sloppy with coding style. I'll make a pass and use a coding style that is consistent with the rest of Oceananigans.jl.
```julia
    return (
        η = prognostic_state(fs.η),
        barotropic_velocities = prognostic_state(fs.barotropic_velocities),
        filtered_state = prognostic_state(fs.filtered_state),
        timestepper = prognostic_state(fs.timestepper),
    )
end
```
Same comment here about formatting
```julia
    return (
        ηᵐ = prognostic_state(ts.ηᵐ),
        ηᵐ⁻¹ = prognostic_state(ts.ηᵐ⁻¹),
        ηᵐ⁻² = prognostic_state(ts.ηᵐ⁻²),
        Uᵐ⁻¹ = prognostic_state(ts.Uᵐ⁻¹),
        Uᵐ⁻² = prognostic_state(ts.Uᵐ⁻²),
        Vᵐ⁻¹ = prognostic_state(ts.Vᵐ⁻¹),
        Vᵐ⁻² = prognostic_state(ts.Vᵐ⁻²),
    )
end
```
And here (and other places below as well)
```julia
#####
##### Checkpointing the JLD2Writer
#####

function prognostic_state(writer::JLD2Writer)
    # Only checkpoint WindowedTimeAverage outputs (which have accumulated state).
    wta_outputs = NamedTuple(name => prognostic_state(output)
                             for (name, output) in pairs(writer.outputs)
                             if output isa WindowedTimeAverage)

    return (
        schedule = prognostic_state(writer.schedule),
        part = writer.part,
        windowed_time_averages = isempty(wta_outputs) ? nothing : wta_outputs,
    )
end

function restore_prognostic_state!(writer::JLD2Writer, state)
    restore_prognostic_state!(writer.schedule, state.schedule)
    writer.part = state.part

    # Restore WindowedTimeAverage outputs if present
    if hasproperty(state, :windowed_time_averages) && !isnothing(state.windowed_time_averages)
        for (name, wta_state) in pairs(state.windowed_time_averages)
            if haskey(writer.outputs, name) && writer.outputs[name] isa WindowedTimeAverage
                restore_prognostic_state!(writer.outputs[name], wta_state)
            end
        end
    end

    return writer
end
```
Is this code exactly the same for all OutputWriters? If so, can we write it only once? In the future we might also have other supported formats (like zarr) and it'd be good to unify as much code as possible.
if there are small differences, we can open an issue to smooth out those wrinkles between the two --- effectively creating a "convention for output writers" with the benefits mentioned by @tomchor
Good catch. I should be able to combine prognostic_state and restore_prognostic_state! for multiple writers.
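If it does turn out to be identical, a unified version might look roughly like this (a sketch only; it assumes both writer types expose `schedule`, `part`, and `outputs` fields, and that `NetCDFWriter` is the NetCDF-side counterpart):

```julia
# Sketch of a single method shared by both writers; field names are assumed to match.
const CheckpointableWriter = Union{JLD2Writer, NetCDFWriter}

function prognostic_state(writer::CheckpointableWriter)
    # Only checkpoint WindowedTimeAverage outputs (which have accumulated state)
    wta_outputs = NamedTuple(name => prognostic_state(output)
                             for (name, output) in pairs(writer.outputs)
                             if output isa WindowedTimeAverage)

    return (schedule = prognostic_state(writer.schedule),
            part = writer.part,
            windowed_time_averages = isempty(wta_outputs) ? nothing : wta_outputs)
end
```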
@tomchor can you specify what you mean by "this" in your comment?
Possibly unrelated, but I'm wondering if we think there would be value in being as strict as humanly possible about the accuracy of picking up from a checkpoint. Otherwise, critical operational applications will have to implement their own testing system because they could not rely on the built-in Oceananigans system for it. If there is a need to relax strictness for the purpose of productivity, we might have a checkpointer "mode" (`strict = true` or `false`).
Great work with testing! Pretty comprehensive. My only suggestion here: maybe instead of writing two functions like `test_XYZ_hydrostatic()` and `test_XYZ_nonhydrostatic()`, should we write one function called `test_XYZ()` and pass the model as an argument? There might be some if/else statements in there, but I feel like this would be an easier test suite to keep track of and maintain in the long run, no?
I have combined some, but I definitely see room to combine more. I agree it'll make the tests easier to maintain in the future.
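A very rough sketch of what a single model-agnostic test could look like (all names and the setup here are hypothetical placeholders; the restore call uses the `set!(simulation, filepath)` form from this PR):

```julia
using Test, Oceananigans

# Hypothetical combined test; exact setup and assertions are placeholders.
function test_checkpointing(model_constructor)
    grid = RectilinearGrid(size=(4, 4, 4), extent=(1, 1, 1))
    model = model_constructor(; grid)
    simulation = Simulation(model, Δt=1, stop_iteration=10)
    simulation.output_writers[:checkpointer] =
        Checkpointer(model, schedule=IterationInterval(5), prefix="test_checkpoint")
    run!(simulation)

    # Build a fresh simulation and restore it from the last checkpoint file
    restored = Simulation(model_constructor(; grid), Δt=1, stop_iteration=10)
    set!(restored, "test_checkpoint_iteration10.jld2")
    @test all(interior(restored.model.velocities.u) .== interior(model.velocities.u))
    return nothing
end

@testset "Checkpointing" begin
    for model_constructor in (NonhydrostaticModel, HydrostaticFreeSurfaceModel)
        test_checkpointing(model_constructor)
    end
end
```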
Sorry, by
I'm not opposed to this, but I wonder if it's useful to save
There was always a distributed test failing at #4005 and I never understood why...
Thanks for the reviews! I made a TODO list to tackle and I'll request another round of reviews. Please feel free to add to it.
@glwagner Are you suggesting that we checkpoint the diagnostic state so it can be used to make sure the simulation was restored properly? Or are you suggesting checkpointing things like
@navidcy Right now it seems like distributed CI even fails to initialize here. I tried looking through Buildkite runs from #4005 and couldn't find a failure with an actual error. Maybe we shouldn't think too much about it until we can run distributed checkpointing tests with/after this PR.
This PR refactors how the `Checkpointer` works by now checkpointing simulations, rather than just models. This is needed as simulations (+ output writers, callbacks, etc.) all contain crucial information needed to properly restore/pick up a simulation and continue time stepping.

Basic design idea:

- `prognostic_state(obj)` returns a named tuple corresponding to the prognostic state of `obj`, and `restore_prognostic_state!(obj, state)` restores `obj` based on information contained in `state` (which is a named tuple and is read from a checkpoint file).
- Objects that need checkpointing implement `prognostic_state` and `restore_prognostic_state!`.

Right now I've only implemented proper checkpointing for the non-hydrostatic model, but it looks like it'll be straightforward to do it for hydrostatic and shallow water models. I'm working on adding comprehensive testing too.
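To make the interface concrete, a schedule like `TimeInterval` would only checkpoint its mutable counter (a sketch; the actual methods in the PR may differ in detail):

```julia
# Sketch of the two-function interface for a simple schedule: `interval` is fixed
# configuration, while `actuations` is the mutable, prognostic part.
prognostic_state(schedule::TimeInterval) = (; actuations = schedule.actuations)

function restore_prognostic_state!(schedule::TimeInterval, state)
    schedule.actuations = state.actuations
    return schedule
end
```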
Will continue working on this PR, but any feedback is very welcome!
Resolves #1249
Resolves #2866
Resolves #3670
Resolves #3845
Resolves #4516
Resolves #4857
Rhetorical aside
In general, the checkpointer assumes that the simulation setup is the same. So only prognostic state information that changes will be checkpointed (e.g. field data, `TimeInterval.actuations`, etc.). The approach I have been taking (based on #4857) is to only checkpoint the prognostic state.

Should we operate under this assumption? I think so, because not doing so can lead to a lot of undefined behavior. The checkpointer should not be responsible for checking that you set up the same simulation as the one that was checkpointed.

For example, take the `SpecifiedTimes` schedule. It has two properties, `times` and `previous_actuation`. Since `previous_actuation` changes as the simulation runs, only `previous_actuation` needs to be checkpointed.

This leads to the possibility of the user changing `times` and then picking up `previous_actuation`, which can lead to undefined behavior. I think this is fine, because the checkpointer only works assuming you set up the same simulation as the one that was checkpointed.

Checkpointing both `times` and `previous_actuation` would allow us to check that `times` is the same when restoring. But I don't think this is the checkpointer's responsibility.
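To make the trade-off concrete, the stricter alternative could look roughly like this (a sketch of what is *not* being done here, just to illustrate the check-and-warn idea):

```julia
# Sketch of the "check and warn" alternative: also checkpoint `times`, then compare
# it against the restored simulation's schedule and warn on mismatch.
prognostic_state(schedule::SpecifiedTimes) = (; times = schedule.times,
                                                previous_actuation = schedule.previous_actuation)

function restore_prognostic_state!(schedule::SpecifiedTimes, state)
    if haskey(state, :times) && state.times != schedule.times
        @warn "Checkpointed SpecifiedTimes differ from this simulation's schedule; restoring previous_actuation anyway."
    end
    schedule.previous_actuation = state.previous_actuation
    return schedule
end
```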