Add support for checkpoints
This commit adds support for checkpointing and restarting a ClimaLand
simulation. The functionality is tested in a bucket experiment, where I
verify that saving a simulation to disk and restarting it leads to the
same state as running the simulation in one go.

In the process, I had to bump some packages (because I am using the
latest version of ClimaUtilities). I could not get ClimaLand to stay
compatible with Julia 1.9, so I bumped the minimum version to 1.10.

We should try to keep 1.10 as the minimum version because it is the LTS.
Sbozzolo committed Oct 23, 2024
1 parent 76a410c commit 8a8614f
Showing 11 changed files with 492 additions and 8 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/downgrade.yml
@@ -20,7 +20,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        version: ['1.9', '1.10', '1.11']
+        version: ['1.10', '1.11']
     steps:
       - uses: actions/checkout@v4
       - uses: julia-actions/setup-julia@latest
10 changes: 5 additions & 5 deletions Project.toml
@@ -43,21 +43,21 @@ ClimaComms = "0.6"
ClimaCore = "0.14.19"
ClimaDiagnostics = "0.2.5"
ClimaParams = "0.10.2"
-ClimaUtilities = "0.1.15"
+ClimaUtilities = "0.1.16"
DataFrames = "1.4"
Dates = "1"
DocStringExtensions = "0.9"
-Flux = "0.14.0"
+Flux = "0.14.13"
HTTP = "1.10"
Insolation = "0.9.2"
-Interpolations = "0.15"
+Interpolations = "0.15.1"
LazyArtifacts = "1"
LinearAlgebra = "1"
NCDatasets = "0.13.1, 0.14"
-SciMLBase = "2"
+SciMLBase = "2.34"
StaticArrays = "1.5"
StatsBase = "0.34"
SurfaceFluxes = "0.11, 0.12"
Thermodynamics = "0.12.4"
cuDNN = "1"
-julia = "1.9"
+julia = "1.10"
1 change: 1 addition & 0 deletions docs/make.jl
@@ -55,6 +55,7 @@ pages = Any[
"Tutorials" => tutorials,
"Standalone models" => standalone_models,
"Diagnostics" => diagnostics,
"Restarts" => "restarts.md",
"Contribution guide" => "Contributing.md",
"Repository structure" => "folderstructure.md",
"APIs" => apis,
104 changes: 104 additions & 0 deletions docs/src/restarts.md
@@ -0,0 +1,104 @@
## Restarting Simulations

`ClimaLand` provides functionality to save and load simulation checkpoints,
allowing you to restart simulations from a previous state. This is particularly
useful for long-running simulations or if you want to experiment with different
configurations starting from a specific point in the simulation.


### Saving Checkpoints

To save a simulation checkpoint, you can use the `ClimaLand.save_checkpoint`
function. This function takes the current state `Y`, the simulation time `t`,
and the output directory as arguments. Optionally, you can pass the
`ClimaLand` model object as the `model` keyword argument, in which case the
hash of the model is stored in the checkpoint file. You can use this
information to ensure that you are restarting the simulation with the same
model that was used to generate the checkpoint.

```julia
ClimaLand.save_checkpoint(Y, t, output_dir; model)
```

Typically, this function is not called directly. Instead, it is invoked through
a callback.

In ClimaLand, you can automate the process of saving checkpoints using the
`CheckpointCallback`. This callback allows you to specify the frequency at which
checkpoints are saved and handles the saving process during the simulation.

To use the `CheckpointCallback`, you need to create an instance of it and pass
it to the `solve` function along with your other callbacks.

Example:

```julia

# ... your ClimaLand simulation setup ...

# Create a CheckpointCallback to save checkpoints every 6 hours
checkpoint_cb = CheckpointCallback(Dates.Hour(6), output_dir, start_date, t_start; model, dt)

# Add the callback to the callback set
cb = SciMLBase.CallbackSet(checkpoint_cb, other_callbacks...)

# Run the simulation with the callbacks
sol = SciMLBase.solve(prob, ode_algo; dt = Δt, callback = cb)

# ... your ClimaLand simulation analysis ...
```

In this example, the `CheckpointCallback` saves a checkpoint every 6 hours
during the simulation. Change the first argument (here `Dates.Hour(6)`) to
control how often checkpoints are saved. If you pass the `ClimaLand` model
object as `model`, its hash is stored in each checkpoint file, so that on
restart you can verify that the model matches the one that generated the
checkpoint.

If `dt` is passed, `CheckpointCallback` will also check that it is consistent
with the checkpoint frequency.
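
The check itself is not shown in this diff. One plausible reading of
"consistent" is that the checkpoint period should be a whole number of
timesteps. The sketch below illustrates that reading only; the helper name
`period_is_multiple_of_dt` is hypothetical and is not part of `ClimaLand`, and
it assumes `dt` is given in seconds and the period is a fixed-length one (such
as hours or minutes).

```julia
import Dates

# Hypothetical helper, not part of ClimaLand: returns true when the checkpoint
# period is a whole multiple of the timestep `dt` (assumed to be in seconds).
function period_is_multiple_of_dt(period::Dates.Period, dt::Real)
    # Convert the fixed-length period to an integer number of seconds
    period_seconds = Dates.value(Dates.Second(period))
    return isinteger(period_seconds / dt)
end

period_is_multiple_of_dt(Dates.Hour(6), 450)   # 21600 / 450 is a whole number of steps
period_is_multiple_of_dt(Dates.Hour(6), 7000)  # 21600 / 7000 is not
```

With a timestep that does not divide the period, the solver would never land
exactly on a checkpoint time, which is presumably what the check guards against.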


### Restarting from a Checkpoint

To restart a simulation from a checkpoint, you can use the
`ClimaLand.find_restart` function to locate the most recent checkpoint file in
the output directory. Then, you can use the `ClimaLand.read_checkpoint` function
to load the state vector and simulation time from the checkpoint file.

```julia
restart_file = ClimaLand.find_restart(output_dir)
Y, t = ClimaLand.read_checkpoint(restart_file; model)
```
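
A typical driver script has to handle both the first run (no checkpoint yet)
and later runs (resume from the latest checkpoint). The following is a minimal
sketch of that branching, assuming `model` is an already-constructed
`ClimaLand` model. The helper name `state_for_run` and the `t_start` argument
are hypothetical; `find_restart`, `initialize`, and
`initialize_from_checkpoint` are the functions touched by this commit.

```julia
import ClimaLand

# Hypothetical helper: choose the starting state for a run.
# Returns `Y, p, coords, t0`, where `t0` is the time to start integrating from.
function state_for_run(model, output_dir, t_start)
    # `find_restart` returns `nothing` when no restart file is found
    restart_file = ClimaLand.find_restart(output_dir)
    if isnothing(restart_file)
        # Fresh run: the caller still has to set initial conditions on `Y`
        Y, p, coords = ClimaLand.initialize(model)
        return Y, p, coords, t_start
    end
    # Resume: `Y` and the time come from the checkpoint, `p` is rebuilt
    return ClimaLand.initialize_from_checkpoint(restart_file; model)
end
```

Because the model is passed along, `read_checkpoint` can warn when the stored
hash does not match the model used for the restart.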

### Output Structure

`ClimaLand` utilizes the `OutputPathGenerator` from `ClimaUtilities` to manage
the output directory structure. By default, it uses the `ActiveLinkStyle`, which
creates a sequence of numbered subfolders within the base output directory.

For example, if your base output directory is `output`, the following structure
will be created:
```
output/
├── output_0000/
│ └── ... checkpoint files ...
├── output_0001/
│ └── ... checkpoint files ...
├── output_0002/
│ └── ... checkpoint files ...
└── output_active -> output_0002/
```

The `output_active` symbolic link always points to the most recent output
subfolder, making it easy to access the latest simulation results.

#### Checkpoint File Structure

When using the `CheckpointCallback`, the checkpoints are saved as HDF5 files
within the numbered output subfolders. The files are named using the following
convention:
```
day<day_number>.<seconds_since_midnight>.hdf5
```
For example, a checkpoint saved at day 10, 3600 seconds after midnight would be
named `day10.3600.hdf5`.
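
To make the convention concrete, here is a small sketch that mirrors the
day/second arithmetic used by `save_checkpoint` in this commit, where `t` is
the simulation time in seconds. The function names `checkpoint_filename` and
`checkpoint_time` are hypothetical helpers written for illustration; the
inverse mapping in particular is an assumption and is not provided by
`ClimaLand`.

```julia
# Build the checkpoint file name from the simulation time `t` (in seconds),
# using the same arithmetic as `save_checkpoint`.
function checkpoint_filename(t)
    seconds_per_day = 60 * 60 * 24
    day = floor(Int, t / seconds_per_day)
    sec = floor(Int, t % seconds_per_day)
    return "day$day.$sec.hdf5"
end

# Hypothetical inverse: recover the (whole-second) simulation time from a
# checkpoint file name, or `nothing` if the name does not match the pattern.
function checkpoint_time(filename)
    m = match(r"^day(\d+)\.(\d+)\.hdf5$", basename(filename))
    isnothing(m) && return nothing
    return parse(Int, m[1]) * 60 * 60 * 24 + parse(Int, m[2])
end

checkpoint_filename(10 * 86400 + 3600)  # "day10.3600.hdf5"
```

Note that fractional seconds are truncated in the file name, so the `time`
attribute stored inside the HDF5 file remains the authoritative value.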
1 change: 1 addition & 0 deletions src/ClimaLand.jl
@@ -18,6 +18,7 @@ import ClimaUtilities.SpaceVaryingInputs: SpaceVaryingInput
import NCDatasets # Needed to load the ClimaUtilities.*VaryingInput
using .Domains
include("Artifacts.jl")
include("shared_utilities/checkpoints.jl")
include("shared_utilities/utils.jl")
include("shared_utilities/models.jl")
include("shared_utilities/drivers.jl")
128 changes: 128 additions & 0 deletions src/shared_utilities/checkpoints.jl
@@ -0,0 +1,128 @@
import ClimaCore: InputOutput
import ClimaUtilities

"""
    ClimaLand.find_restart(output_dir)

Find the most recent restart file in the specified output directory.

This function utilizes `ClimaUtilities.OutputPathGenerator.detect_restart_file`
to locate the latest restart file within the output directory structure,
assuming the `ActiveLinkStyle` is used for managing output folders.

# Arguments
- `output_dir`: The base output directory where the simulation results are stored.

# Returns
- The path to the most recent restart file found, or `nothing` if no restart
  file is found.
"""
function find_restart(output_dir)
    return ClimaUtilities.OutputPathGenerator.detect_restart_file(
        ClimaUtilities.OutputPathGenerator.ActiveLinkStyle(),
        output_dir,
    )
end

"""
    _context_from_Y(Y)

Try extracting the context from the FieldVector Y.

Typically Y has a structure like:
```
Y
  .bucket
    .T
    .W
    .Ws
```
`_context_from_Y` tries obtaining the context from the first Field in the
hierarchy.
"""
function _context_from_Y(Y)
    a_model_type = getproperty(Y, first(propertynames(Y)))
    a_field = getproperty(a_model_type, first(propertynames(a_model_type)))
    return ClimaComms.context(a_field)
end

"""
    ClimaLand.save_checkpoint(Y, t, output_dir; model = nothing, comms_ctx = ClimaComms.context(Y))

Save a simulation checkpoint to an HDF5 file.

This function saves the current state of the simulation, including the state
vector `Y` and the current simulation time `t`, to an HDF5 file within the
specified output directory.

# Arguments
- `Y`: The state of the simulation.
- `t`: The current simulation time.
- `output_dir`: The directory where the checkpoint file will be saved.
- `model` (Optional): The ClimaLand model object. If provided, the hash of the model
  will be stored in the checkpoint file. Defaults to `nothing`. This is used
  to check for consistency.
- `comms_ctx` (Optional): The ClimaComms context. This is used for distributed I/O
  operations. Defaults to the context of `model` when one is provided, otherwise
  to the context extracted from the state vector `Y`.
"""
function save_checkpoint(
    Y,
    t,
    output_dir;
    model = nothing,
    comms_ctx = isnothing(model) ? _context_from_Y(Y) :
                ClimaComms.context(model),
)
    # Split the simulation time into whole days and the remaining seconds
    day = floor(Int, t / (60 * 60 * 24))
    sec = floor(Int, t % (60 * 60 * 24))
    output_file = joinpath(output_dir, "day$day.$sec.hdf5")
    hdfwriter = InputOutput.HDF5Writer(output_file, comms_ctx)
    # If model was passed, add its hash, otherwise add the string "nothing"
    hash_model = isnothing(model) ? "nothing" : hash(model)
    InputOutput.write_attributes!(
        hdfwriter,
        "/",
        Dict("time" => t, "land_model_hash" => hash_model),
    )
    InputOutput.write!(hdfwriter, Y, "Y")
    Base.close(hdfwriter)
    return nothing
end

"""
    ClimaLand.read_checkpoint(file_path; model = nothing, context = ClimaComms.context())

Read a simulation checkpoint from an HDF5 file.

This function loads the simulation state from a previously saved checkpoint file.

# Arguments
- `file_path`: The path to the HDF5 checkpoint file.
- `model` (Optional): The ClimaLand model object. If provided, the hash of the model
  stored in the checkpoint file will be compared with the hash of the provided
  model and a warning will be issued if they don't match. Defaults to `nothing`.
- `context` (Optional): The ClimaComms context. This is used for parallel I/O
  operations. Defaults to the context of `model` when one is provided, otherwise
  to the default ClimaComms context.

# Returns
- `Y`: The state vector loaded from the checkpoint file.
- `t`: The simulation time loaded from the checkpoint file.
"""
function read_checkpoint(
    file_path;
    model = nothing,
    context = isnothing(model) ? ClimaComms.context() :
              ClimaComms.context(model),
)
    hdfreader = InputOutput.HDF5Reader(file_path, context)
    Y = InputOutput.read_field(hdfreader, "Y")
    attributes = InputOutput.read_attributes(hdfreader, "/")
    # Only warn (do not error) when the stored hash differs from the given model
    if !isnothing(model)
        if hash(model) != attributes["land_model_hash"]
            @warn "Restart file $(file_path) was constructed with a different land model"
        end
    end
    t = attributes["time"]
    Base.close(hdfreader)
    return Y, t
end
20 changes: 20 additions & 0 deletions src/shared_utilities/models.jl
@@ -447,6 +447,26 @@ function initialize(model::AbstractModel{FT}) where {FT}
return Y, p, coords
end


"""
    initialize_from_checkpoint(restart_file; model::AbstractModel)

Reads the prognostic state `Y` and the checkpoint time from `restart_file`,
creates the auxiliary state structure `p` (including the driver fields) for
`model`, and constructs the coordinates for the `model` domain.

Returns `Y, p, coords, t_checkpoint`.

We may need to consider this default more as we add diverse components and
`Simulations`.

TODO: Combine this function with initialize. We don't really need two.
"""
function initialize_from_checkpoint(restart_file; model)
    Y, t_checkpoint = read_checkpoint(restart_file; model)
    coords = Domains.coordinates(model)
    p = initialize_auxiliary(model, coords)
    p = add_drivers_to_cache(p, model, coords)
    return Y, p, coords, t_checkpoint
end


function ClimaComms.context(model::AbstractModel)
if :domain in propertynames(model)
return ClimaComms.context(model.domain)