Commit
This commit adds support for checkpointing and restarting a ClimaLand simulation. The functionality is tested in a bucket experiment, where I verify that saving a simulation to disk and restarting it leads to the same state as running the simulation in one go. In the process, I had to bump some packages (because I am using the latest version of ClimaUtilities). I could not get ClimaLand to stay compatible with Julia 1.9, so I bumped the minimum version to 1.10. We should try to keep 1.10 as the minimum version because it is LTS.
Showing 11 changed files with 492 additions and 8 deletions.
@@ -0,0 +1,104 @@
## Restarting Simulations

`ClimaLand` provides functionality to save and load simulation checkpoints,
allowing you to restart simulations from a previous state. This is particularly
useful for long-running simulations or if you want to experiment with different
configurations starting from a specific point in the simulation.

### Saving Checkpoints

To save a simulation checkpoint, use the `ClimaLand.save_checkpoint` function.
It takes the current state `Y`, the simulation time `t`, and the output
directory as arguments. Optionally, you can pass the `ClimaLand` model object
as the `model` keyword argument, in which case the hash of the model is stored
in the checkpoint file. You can use this hash to check that you are restarting
the simulation with the same model that was used to generate the checkpoint.

```julia
ClimaLand.save_checkpoint(Y, t, output_dir; model)
```

Typically, this function is not called directly. Instead, it is invoked by a
callback.

In ClimaLand, you can automate saving checkpoints with the
`CheckpointCallback`. This callback lets you specify how often checkpoints are
saved and handles the saving during the simulation.

To use the `CheckpointCallback`, create an instance of it and pass it to the
`solve` function along with your other callbacks.

Example:

```julia
# ... your ClimaLand simulation setup ...

# Create a CheckpointCallback to save checkpoints every 6 hours
checkpoint_cb = CheckpointCallback(Dates.Hour(6), output_dir, start_date, t_start; model, dt)

# Add the callback to the callback set
cb = SciMLBase.CallbackSet(checkpoint_cb, other_callbacks...)

# Run the simulation with the callbacks, using the same timestep `dt`
# that was passed to the CheckpointCallback
sol = SciMLBase.solve(prob, ode_algo; dt = dt, callback = cb)

# ... your ClimaLand simulation analysis ...
```

In this example, the `CheckpointCallback` saves a checkpoint every 6 hours
during the simulation. You can change the checkpoint frequency (the first
argument, `Dates.Hour(6)` above) to control how often checkpoints are saved.
You can also pass the `ClimaLand` model object as `model` to store its hash in
the checkpoint file. This hash can be used later to check that you are
restarting the simulation with the same model that was used to generate the
checkpoint.

If `dt` is passed, `CheckpointCallback` also checks that it is consistent
with the checkpoint frequency.

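The text does not spell out what "consistent" means here. One plausible reading is that the checkpoint period should be a whole number of timesteps, so that a checkpoint always falls on a step boundary. The sketch below illustrates that reading only; `is_consistent` is a hypothetical helper (not part of ClimaLand), and `dt` is assumed to be given in seconds.

```julia
import Dates

# Hypothetical helper: true when the checkpoint period is a whole number of
# timesteps. Assumes `dt` is in seconds and the period is a fixed-length
# period (seconds, minutes, hours, days, weeks).
function is_consistent(checkpoint_period::Dates.Period, dt::Real)
    period_seconds = Dates.value(convert(Dates.Second, checkpoint_period))
    return dt > 0 && isinteger(period_seconds / dt)
end

is_consistent(Dates.Hour(6), 450)   # 21600 s / 450 s = 48 whole steps
is_consistent(Dates.Hour(6), 7000)  # 21600 s is not a multiple of 7000 s
```

A check like this catches a mismatch at setup time rather than partway through a long run.
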
### Restarting from a Checkpoint

To restart a simulation from a checkpoint, use the `ClimaLand.find_restart`
function to locate the most recent checkpoint file in the output directory
(it returns `nothing` if no checkpoint file is found). Then use the
`ClimaLand.read_checkpoint` function to load the state vector and simulation
time from the checkpoint file. If you pass `model`, a warning is issued when
its hash does not match the one stored in the file.

```julia
restart_file = ClimaLand.find_restart(output_dir)
Y, t = ClimaLand.read_checkpoint(restart_file; model)
```

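Because `find_restart` can return `nothing`, a driver script usually needs a fallback to the initial conditions. The following is a minimal sketch of that pattern, not part of ClimaLand: `restart_or_fresh`, `Y0`, and `t0` are hypothetical names, and the two lookup functions are passed in as arguments so the sketch runs without the package.

```julia
# Hypothetical helper. `find_restart` and `read_checkpoint` are injected so
# this runs standalone; in a real script they would wrap the ClimaLand
# functions of the same name.
function restart_or_fresh(find_restart, read_checkpoint, output_dir, Y0, t0)
    restart_file = find_restart(output_dir)
    # No checkpoint yet: start from the supplied initial conditions
    isnothing(restart_file) && return (Y0, t0)
    # Otherwise resume from the state and time stored in the checkpoint
    return read_checkpoint(restart_file)
end
```

In a real script this could be called as `Y, t = restart_or_fresh(ClimaLand.find_restart, f -> ClimaLand.read_checkpoint(f; model), output_dir, Y0, t0)`.
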
### Output Structure

`ClimaLand` uses the `OutputPathGenerator` from `ClimaUtilities` to manage
the output directory structure. By default, it uses the `ActiveLinkStyle`, which
creates a sequence of numbered subfolders within the base output directory.

For example, if your base output directory is `output`, the following structure
will be created:
```
output/
├── output_0000/
│   └── ... checkpoint files ...
├── output_0001/
│   └── ... checkpoint files ...
├── output_0002/
│   └── ... checkpoint files ...
└── output_active -> output_0002/
```

The `output_active` symbolic link always points to the most recent output
subfolder, making it easy to access the latest simulation results.

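As a rough illustration of this layout, the sketch below picks the highest-numbered subfolder using only the Julia standard library. `latest_output_dir` is a hypothetical helper written for this example; it is not how `ClimaUtilities` implements `detect_restart_file`, and it assumes the four-digit `output_NNNN` naming shown in the tree.

```julia
# Hypothetical helper: highest-numbered `output_NNNN` subfolder of `base`,
# or `nothing` if no such subfolder exists.
function latest_output_dir(base)
    numbered = filter(readdir(base)) do name
        occursin(r"^output_\d{4}$", name) && isdir(joinpath(base, name))
    end
    isempty(numbered) && return nothing
    # Zero-padded names sort correctly as plain strings
    return joinpath(base, maximum(numbered))
end
```

In practice, following the `output_active` link gives the same folder without scanning the directory.
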
#### Checkpoint File Structure

When using the `CheckpointCallback`, the checkpoints are saved as HDF5 files
within the numbered output subfolders. The files are named using the following
convention:
```
day<day_number>.<seconds_since_midnight>.hdf5
```
Here `<day_number>` is the number of whole days in the simulation time `t`
(in seconds) and `<seconds_since_midnight>` is the remainder, rounded down to
whole seconds. For example, a checkpoint saved at day 10, 3600 seconds after
midnight would be named `day10.3600.hdf5`.
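
The naming convention can be reproduced in a few lines. In the sketch below, `checkpoint_name` uses the same arithmetic as `save_checkpoint` in this commit, while `checkpoint_time` is a hypothetical inverse added for illustration (neither name exists in ClimaLand).

```julia
const SECONDS_PER_DAY = 60 * 60 * 24

# Same arithmetic as `save_checkpoint`; `t` is the simulation time in seconds.
function checkpoint_name(t)
    day = floor(Int, t / SECONDS_PER_DAY)
    sec = floor(Int, t % SECONDS_PER_DAY)
    return "day$day.$sec.hdf5"
end

# Hypothetical inverse: recover the time in whole seconds from a file name,
# or `nothing` if the name does not follow the convention.
function checkpoint_time(name)
    m = match(r"^day(\d+)\.(\d+)\.hdf5$", name)
    isnothing(m) && return nothing
    return parse(Int, m[1]) * SECONDS_PER_DAY + parse(Int, m[2])
end
```

When ordering checkpoints yourself, sorting by the parsed time is safer than sorting by name, because `day10...` sorts before `day2...` as a string.
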
@@ -0,0 +1,128 @@
import ClimaCore: InputOutput
import ClimaUtilities

"""
    ClimaLand.find_restart(output_dir)

Find the most recent restart file in the specified output directory.

This function utilizes `ClimaUtilities.OutputPathGenerator.detect_restart_file`
to locate the latest restart file within the output directory structure,
assuming the `ActiveLinkStyle` is used for managing output folders.

# Arguments
- `output_dir`: The base output directory where the simulation results are stored.

# Returns
- The path to the most recent restart file found, or `nothing` if no restart
  file is found.
"""
function find_restart(output_dir)
    return ClimaUtilities.OutputPathGenerator.detect_restart_file(
        ClimaUtilities.OutputPathGenerator.ActiveLinkStyle(),
        output_dir,
    )
end

"""
    _context_from_Y(Y)

Try extracting the context from the FieldVector `Y`.

Typically `Y` has a structure like:
```
Y
  .bucket
    .T
    .W
    .Ws
```

`_context_from_Y` tries obtaining the context from the first Field in the
hierarchy.
"""
function _context_from_Y(Y)
    a_model_type = getproperty(Y, first(propertynames(Y)))
    a_field = getproperty(a_model_type, first(propertynames(a_model_type)))
    return ClimaComms.context(a_field)
end

"""
    ClimaLand.save_checkpoint(Y, t, output_dir; model = nothing, comms_ctx = ClimaComms.context(Y))

Save a simulation checkpoint to an HDF5 file.

This function saves the current state of the simulation, including the state
vector `Y` and the current simulation time `t`, to an HDF5 file within the
specified output directory.

# Arguments
- `Y`: The state of the simulation.
- `t`: The current simulation time.
- `output_dir`: The directory where the checkpoint file will be saved.
- `model` (Optional): The ClimaLand model object. If provided, the hash of the
  model is stored in the checkpoint file so that it can be checked for
  consistency on restart. Defaults to `nothing`.
- `comms_ctx` (Optional): The ClimaComms context, used for distributed I/O
  operations. Defaults to the context of `model` if provided, otherwise to the
  context extracted from the state vector `Y`.
"""
function save_checkpoint(
    Y,
    t,
    output_dir;
    model = nothing,
    comms_ctx = isnothing(model) ? _context_from_Y(Y) :
                ClimaComms.context(model),
)
    # File name: whole days in `t`, then the remaining whole seconds
    day = floor(Int, t / (60 * 60 * 24))
    sec = floor(Int, t % (60 * 60 * 24))
    output_file = joinpath(output_dir, "day$day.$sec.hdf5")
    hdfwriter = InputOutput.HDF5Writer(output_file, comms_ctx)
    # If model was passed, store its hash; otherwise store the string "nothing"
    hash_model = isnothing(model) ? "nothing" : hash(model)
    InputOutput.write_attributes!(
        hdfwriter,
        "/",
        Dict("time" => t, "land_model_hash" => hash_model),
    )
    InputOutput.write!(hdfwriter, Y, "Y")
    Base.close(hdfwriter)
    return nothing
end

"""
    ClimaLand.read_checkpoint(file_path; model = nothing, context = ClimaComms.context())

Read a simulation checkpoint from an HDF5 file.

This function loads the simulation state from a previously saved checkpoint file.

# Arguments
- `file_path`: The path to the HDF5 checkpoint file.
- `model` (Optional): The ClimaLand model object. If provided, the hash of the
  model stored in the checkpoint file is compared with the hash of the provided
  model, and a warning is issued if they don't match. Defaults to `nothing`.
- `context` (Optional): The ClimaComms context, used for parallel I/O
  operations. Defaults to the context of `model` if provided, otherwise to the
  default ClimaComms context.

# Returns
- `Y`: The state vector loaded from the checkpoint file.
- `t`: The simulation time loaded from the checkpoint file.
"""
function read_checkpoint(
    file_path;
    model = nothing,
    context = isnothing(model) ? ClimaComms.context() :
              ClimaComms.context(model),
)
    hdfreader = InputOutput.HDF5Reader(file_path, context)
    Y = InputOutput.read_field(hdfreader, "Y")
    attributes = InputOutput.read_attributes(hdfreader, "/")
    # Warn (but do not fail) when the stored hash differs from the given model
    if !isnothing(model)
        if hash(model) != attributes["land_model_hash"]
            @warn "Restart file $(file_path) was constructed with a different land model"
        end
    end
    t = attributes["time"]
    Base.close(hdfreader)
    return Y, t
end
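
The consistency check in `read_checkpoint` can be exercised in isolation. The sketch below mimics it with a plain `Dict` standing in for the HDF5 attributes and a tuple standing in for the model. `model_matches` is a hypothetical helper, not part of ClimaLand, and it assumes the attribute comes back with the value that was written.

```julia
# Hypothetical helper mirroring the comparison in `read_checkpoint`.
# `attributes` stands in for the attributes read from the checkpoint file.
function model_matches(attributes, model)
    return attributes["land_model_hash"] == hash(model)
end

# Stand-in "models": any hashable value works for this illustration
model_a = (:bucket, 1)
model_b = (:bucket, 2)

with_hash = Dict{String, Any}("time" => 0.0, "land_model_hash" => hash(model_a))
# A checkpoint saved without a model stores the string "nothing"
without_hash = Dict{String, Any}("time" => 0.0, "land_model_hash" => "nothing")

model_matches(with_hash, model_a)     # same model: no warning would be issued
model_matches(with_hash, model_b)     # different model: warning
model_matches(without_hash, model_a)  # no stored hash: also treated as a mismatch
```

Under these assumptions, a checkpoint written without `model` and later read with `model` would also trigger the warning, since the string `"nothing"` never equals a hash.
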