Clean up documentation #38

Merged · 3 commits · Apr 30, 2024 · Changes from all commits
2 changes: 1 addition & 1 deletion docs/make.jl
```diff
@@ -19,9 +19,9 @@ makedocs(;
         "Getting Started" => "getting_started.md",
         "Running your Experiments" => "execution.md",
         "Distributed Execution" => "distributed.md",
+        "Cluster Execution" => "clusters.md",
         "Data Store" => "store.md",
         "Custom Snapshots" => "snapshots.md",
-        "Cluster Support" => "clusters.md",
         "Public API" => "api.md"
     ],
 )
```
11 changes: 11 additions & 0 deletions docs/src/api.md
````diff
@@ -11,6 +11,7 @@ merge_databases!
 ## Experiments
 ```@docs
 Experiment
+get_progress
 get_experiment
 get_experiments
 get_experiment_by_name
@@ -20,6 +21,7 @@ get_ratio_completed_trials_by_name
 ## Data Storage
 ```@docs
 get_global_store
+get_results_from_trial_global_database
 ```
 
 ## Trials
@@ -36,8 +38,17 @@ get_trials_ids_by_name
 SerialMode
 MultithreadedMode
 DistributedMode
+HeterogeneousMode
+MPIMode
 ```
 
+## Cluster Management
+```@docs
+Experimenter.Cluster.init
+
+```
+
+
 ## Snapshots
 ```@docs
 get_snapshots
````
111 changes: 107 additions & 4 deletions docs/src/clusters.md
@@ -1,11 +1,114 @@
# Cluster Execution

This package is most useful for running grid search trials in a cluster environment (e.g. an HPC), or on a single node with many CPUs.

There are two main ways to distribute your experiment over many processes: `DistributedMode` or `MPIMode`.

For those using a distributed cluster, we recommend that you launch your jobs using the [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface) functionality instead of the legacy [SLURM](https://slurm.schedmd.com/overview.html) support (see the [SLURM](#slurm) section below for details).

## MPI

### Installation

Most HPC environments provide their own MPI implementation. These implementations often take advantage of the proprietary interconnect (networking) between nodes, allowing for low-latency and high-throughput communication. To find your local HPC's implementation, you can browse the catalogue from a bash terminal using the [Environment Modules](https://modules.sourceforge.net/) package available on most HPC systems:
```bash
module avail
```
or, for a more directed search:
```bash
module spider mpi
```

You may have multiple versions available. If you are unsure which version to use, check your HPC's documentation, contact your local system administrator, or simply use what is available. OpenMPI is often a reliable choice.

You can load the version of MPI you would like by adding
```bash
module load mpi/latest
```
to your job script (remember to change `mpi/latest` to the package available on your system).


Make sure you have loaded the MPI version you wish to use by running the `module load ...` command in the same terminal, then open Julia with
```bash
julia --project
```
Run this command in the same directory as your project.

Now, you have to add the `MPI` package to your local environment using
```julia
import Pkg; Pkg.add("MPI")
```
Now you should be able to load `MPIPreferences` and tell MPI.jl to use your system binary:
```julia
using MPI.MPIPreferences

MPIPreferences.use_system_binary()
exit()
```
This should create a new `LocalPreferences.toml` file. We recommend adding this file to your `.gitignore` and not committing it to your repository.
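
To confirm the switch took effect, you can run a quick sanity check in a fresh Julia session. This is a minimal sketch, assuming the standard `MPIPreferences`/`MPI.jl` setup described above:
```julia
# Sanity check: run in a new Julia session, in the same project directory.
using MPIPreferences
@show MPIPreferences.binary  # expect "system" after use_system_binary()

using MPI
MPI.versioninfo()  # prints details of the MPI library MPI.jl is using
```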

### Job Scripts

When you are running on a cluster, write your job script so that you load MPI and precompile Julia before launching your job. An example job script could look like the following:

```bash
#!/bin/bash

#SBATCH --ntasks=8
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2048
#SBATCH --time=00:30:00
#SBATCH -o mpi_job_%j.out


module load mpi/latest
module load julia/1.10.2

# Precompile Julia first to avoid race conditions
julia --project --threads=4 -e 'import Pkg; Pkg.instantiate()'
julia --project --threads=4 -e 'import Pkg; Pkg.precompile()'

mpirun -n 8 julia --project --threads=4 my_experiment.jl
```

Use the above as a template and adjust the specifics to suit your workload and HPC.

!!! info
    Make sure that you launch your jobs with at least 2 processes (tasks), as one task is dedicated to coordinating the execution of trials and saving the results. With `--ntasks=8` as above, one process coordinates and seven run trials.

## Experiment file

As usual, you should write a script to define your experiment and run the configuration. Below is an example, which assumes there is another file called `run.jl` containing a function `run_trial` that takes a configuration dictionary and a trial `UUID`.

```julia
using Experimenter

config = Dict{Symbol,Any}(
:N => IterableVariable([Int(1e6), Int(2e6), Int(3e6)]),
:seed => IterableVariable([1234, 4321, 3467, 134234, 121]),
:sigma => 0.0001)
experiment = Experiment(
name="Test Experiment",
include_file="run.jl",
function_name="run_trial",
configuration=deepcopy(config)
)

db = open_db("experiments.db")

# Init the cluster
Experimenter.Cluster.init()

@execute experiment db MPIMode(1)
```

Note that we are calling `MPIMode(1)`, which requests a communication batch size of `1`. If your trials are small and you want each worker to process a batch of trials at a time, you can set this to a higher number.
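
For completeness, here is a minimal sketch of what the assumed `run.jl` could contain. The body of `run_trial` is hypothetical; the only conventions taken from above are that it receives the configuration dictionary and the trial `UUID`, and that results are returned as a dictionary to be saved:
```julia
# run.jl -- hypothetical trial implementation for the experiment above.
using Random

function run_trial(config::Dict{Symbol,Any}, trial_id)
    rng = Random.Xoshiro(config[:seed])
    N = config[:N]
    sigma = config[:sigma]
    # Stand-in workload: average of N noisy samples.
    total = 0.0
    for _ in 1:N
        total += sigma * randn(rng)
    end
    # Return the results to be stored against this trial.
    return Dict{Symbol,Any}(:estimate => total / N)
end
```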

## SLURM

!!! warning
    It is recommended that you use the MPI mode above to run jobs on a cluster, instead of relying on `ClusterManagers.jl`, which is much slower at launching and running jobs.

Normally when running on SLURM, one creates a bash script to tell the scheduler about the resource requirements for a job. The following is an example:
```bash
#!/bin/bash
@@ -77,7 +180,7 @@ We then modify the created `myrun.sh` file to the following:
#SBATCH --time=00:30:00
#SBATCH -o hpc/logs/job_%j.out

julia --project --threads=1 my_experiment.jl

# Optional: Remove the files created by ClusterManagers.jl
rm -fr julia-*.out
36 changes: 2 additions & 34 deletions docs/src/distributed.md
````diff
@@ -11,38 +11,6 @@ addprocs(8)
 ```
 As long as `nworkers()` shows more than one worker, your execution of trials will occur in parallel across these workers.
 
-## Configuring SLURM
-
-[SLURM](https://slurm.schedmd.com/overview.html) is one of the most popular schedulers on HPC clusters, which we can integrate with `Distributed.jl` to spawn our workers automatically. See [this gist](https://gist.github.com/JamieMair/0b1ffbd4ee424c173e6b42fe756e877a) for some scripts to make this process easier.
-
-Let's start with spawning your processes:
-```julia
-using Distributed
-using ClusterManagers
-num_tasks = parse(Int, ENV["SLURM_NTASKS"]) # One process per task
-cpus_per_task = parse(Int, ENV["SLURM_CPUS_PER_TASK"]) # Assign threads per process
-addprocs(SlurmManager(num_tasks),
-    exeflags=[
-        "--project",
-        "--threads=$cpus_per_task"]
-)
-```
-You can check out [`ClusterManagers.jl`](https://github.com/JuliaParallel/ClusterManagers.jl) for your own cluster software if you are not using SLURM, but the process will be similar to this.
-
-Once this has been done, simply include your file which configures and runs your experiment using the `DistributedMode` execution mode as detailed above, saved in a file like `run_script.jl`.
-
-For SLURM, you can make a script to submit, for example:
-```sh
-#!/bin/bash
-
-#SBATCH --ntasks=8
-#SBATCH --cpus-per-task=4
-#SBATCH --mem-per-cpu=2G
-#SBATCH --time=00:30:00
-
-module load julia/1.8.2
-
-julia --project run_script.jl
-```
-which can be saved to `launch_experiment.sh` and run with `sbatch launch_experiment.sh`. Note that you may need to include additional SBATCH directives like `--account` on your cluster. Check your cluster's documentation for more information.
+Once the workers have been added, make sure to change your execution mode to `DistributedMode` to take advantage of the parallelism.
+
+If you have access to an HPC cluster and would like to use multiple nodes, you can do this easily with `Experimenter.jl`; see more in [Cluster Execution](@ref).
````
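
Putting the remaining pieces together, a complete local run might look like the following sketch. The experiment definition is borrowed from the [Cluster Execution](@ref) page, and the worker count and flags are placeholders to adjust for your machine:
```julia
using Distributed
addprocs(4; exeflags=["--project", "--threads=1"])  # spin up 4 local workers
@assert nworkers() > 1  # trials will now run in parallel

using Experimenter

config = Dict{Symbol,Any}(
    :seed => IterableVariable([1234, 4321]),
    :sigma => 0.0001)
experiment = Experiment(
    name="Distributed Test",
    include_file="run.jl",
    function_name="run_trial",
    configuration=deepcopy(config)
)

db = open_db("experiments.db")
@execute experiment db DistributedMode
```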
21 changes: 13 additions & 8 deletions docs/src/execution.md
@@ -6,15 +6,15 @@ Once you have created an experiment you can run it with the `@execute` macro sup
```
Which will only execute trials from the experiment that have not been completed. It is up to you to implement how to continue your simulations from snapshots, using the Snapshots API.

## Single Node Parallel

There are two main ways of executing your experiments in parallel: multithreading (Threads) or multiprocessing (Distributed). The former has lower latency, but the latter scales across a cluster. If you are executing on a single computer, the easiest option is:
```julia
@execute experiment db MultithreadedMode
```
By default, this will use as many threads as you have enabled. You can set this using the environment variable `JULIA_NUM_THREADS`, or by starting Julia with `--threads=X`, replacing `X` with the number you want. You can check what your current setting is with `Threads.nthreads()`.
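
For example, a quick way to verify the thread count before executing:
```julia
# Start Julia with e.g. `julia --project --threads=8` (or set JULIA_NUM_THREADS).
using Base.Threads
println("Trials will run on $(nthreads()) threads")
```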

Alternatively, we can change the execution mode to `DistributedMode`:
```julia
@execute experiment db DistributedMode
```
@@ -23,20 +23,25 @@ This internally uses `pmap` from the `Distributed.jl` standard library, parallel
using Distributed
nworkers()
```
`Experimenter.jl` will not spin up processes for you; this is something you have to do yourself. See [Distributed Execution](@ref) for an in-depth example.

!!! info
    If your code has many [memory allocations](https://docs.julialang.org/en/v1/manual/performance-tips/#Measure-performance-with-@time-and-pay-attention-to-memory-allocation), it may be better to use `DistributedMode` instead of `MultithreadedMode`.

## Heterogeneous Execution

If you want each distributed worker to be able to run multiple jobs at the same time, you can select a heterogeneous execution scheduling mode, which will allow each worker to run multiple trials simultaneously using multithreading. An example use case for this is where you have multiple nodes, each with many cores, and you do not wish to pay the memory cost of each separate process. Additionally, you can load data in a single process and reuse it for each execution in the same process. This mode may also allow multiple trials to share resources, such as a GPU, which typically only supports one process.

To run this, you simply change the mode to the `HeterogeneousMode` option, providing the number of threads to use on each worker, e.g.
```julia
@execute experiment db HeterogeneousMode(2)
```
which will allow each distributed worker to run two trials simultaneously via multithreading. If this option is selected, it is encouraged that you enable multiple threads per worker when launching the process, e.g. with `addprocs`:
```julia
addprocs(4; exeflags=["--threads=2"])
```
Otherwise, each worker may only have access to a single thread and the overall performance throughput will be worse.
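
As a sketch of the data-reuse pattern mentioned above: each worker process can cache expensive state once and share it between the trials it runs concurrently. The names and the loading function here are hypothetical, and the lock guards the cache because `HeterogeneousMode` runs trials on separate threads within the same process:
```julia
# Hypothetical run.jl pattern for HeterogeneousMode: cache per-process data once.
const DATASET = Ref{Union{Nothing,Vector{Float64}}}(nothing)
const DATASET_LOCK = ReentrantLock()

function get_dataset!()
    lock(DATASET_LOCK) do
        if DATASET[] === nothing
            DATASET[] = rand(10_000)  # stand-in for an expensive one-off load
        end
    end
    return DATASET[]
end

function run_trial(config::Dict{Symbol,Any}, trial_id)
    data = get_dataset!()  # loaded once per process, shared across threaded trials
    return Dict{Symbol,Any}(:mean => sum(data) / length(data))
end
```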

<!-- TODO: Update the manifest to point to current experimenter version in docs -->
## MPI Execution

Most HPC clusters use a [Message Passing Interface](https://en.wikipedia.org/wiki/Message_Passing_Interface) implementation to handle communication between different processes and synchronise tasks. `Experimenter.jl` now has built-in support for execution via MPI, which has much lower overhead than the built-in `Distributed.jl` multiprocessing library. See more examples in the [Cluster Execution](@ref) page.
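
As a minimal sketch mirroring the fuller example on that page, an MPI run is just the usual experiment script with a cluster init call, launched under `mpirun`; the configuration here is a placeholder:
```julia
# script.jl -- launch with e.g. `mpirun -n 4 julia --project script.jl`.
using Experimenter

config = Dict{Symbol,Any}(:seed => IterableVariable([1, 2, 3]))
experiment = Experiment(
    name="MPI Test",
    include_file="run.jl",
    function_name="run_trial",
    configuration=deepcopy(config)
)

db = open_db("experiments.db")
Experimenter.Cluster.init()        # set up the MPI-based workers
@execute experiment db MPIMode(1)  # communication batch size of 1 trial
```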
2 changes: 1 addition & 1 deletion docs/src/getting_started.md
@@ -106,7 +106,7 @@ will not run any more trials, as they have already been completed. However, if t

## Saving part way

If your trials take a long time to finish and may be cancelled during their run, you can always implement a way to save a `Snapshot`, which allows you to save the data you need to restore a trial part way through running. An example setup for doing this is given in [Custom Snapshots](@ref).

## What is an `Experiment`?

10 changes: 6 additions & 4 deletions docs/src/index.md
@@ -7,11 +7,13 @@ CurrentModule = Experimenter
*A package for easily running experiments for different parameters and saving the results in a centralised database*

## Package Features
- Create a local SQLite database to store the results of your experiment, removing the need to keep track of 1000s of results files for each parameter configuration.
- Provides a standard structure for executing code across a range of parameters.
- Provides saving of results into the database using standard Julia types.
- Promotes writing a script that can be easily committed to a Git repository to keep track of results and parameters used throughout development.
- Provides an `@execute` macro that will execute an experiment (consisting of many trials with different parameters). Can execute serially, or in parallel with a choice of multithreading, multiprocessing, or even MPI.
- Provides an easy way to execute trials across a High Performance Cluster (HPC).
- Automatically skips completed trials, and provides a Snapshots API to allow partial progress to be saved and reloaded.

Head over to [Getting Started](@ref) to get an overview of this package.

@@ -22,9 +24,9 @@ Pages = [
"getting_started.md",
"execution.md",
"distributed.md",
"clusters.md",
"store.md",
"snapshots.md",
"clusters.md"
]
Depth = 2
```