Clean up documentation #38

Merged · 3 commits · Apr 30, 2024 · Changes from all commits
2 changes: 1 addition & 1 deletion docs/make.jl
```diff
@@ -19,9 +19,9 @@ makedocs(;
         "Getting Started" => "getting_started.md",
         "Running your Experiments" => "execution.md",
         "Distributed Execution" => "distributed.md",
+        "Cluster Execution" => "clusters.md",
         "Data Store" => "store.md",
         "Custom Snapshots" => "snapshots.md",
-        "Cluster Support" => "clusters.md",
         "Public API" => "api.md"
     ],
 )
```
11 changes: 11 additions & 0 deletions docs/src/api.md
````diff
@@ -11,6 +11,7 @@ merge_databases!
 ## Experiments
 ```@docs
 Experiment
+get_progress
 get_experiment
 get_experiments
 get_experiment_by_name
@@ -20,6 +21,7 @@ get_ratio_completed_trials_by_name
 ## Data Storage
 ```@docs
 get_global_store
+get_results_from_trial_global_database
 ```
 
 ## Trials
@@ -36,8 +38,17 @@ get_trials_ids_by_name
 SerialMode
 MultithreadedMode
 DistributedMode
+HeterogeneousMode
+MPIMode
 ```
 
+## Cluster Management
+```@docs
+Experimenter.Cluster.init
+
+```
+
+
 ## Snapshots
 ```@docs
 get_snapshots
````
111 changes: 107 additions & 4 deletions docs/src/clusters.md
@@ -1,11 +1,114 @@
# Cluster Execution

This package is most useful for running grid search trials in a cluster environment (e.g. an HPC), or on a single node with many CPUs.

There are two main ways to distribute your experiment over many processes: `DistributedMode` or `MPIMode`.

For those using a distributed cluster, we recommend that you launch your jobs using the [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface) functionality instead of the legacy [SLURM](https://slurm.schedmd.com/overview.html) support (see the [SLURM](#slurm) section below for details).

## MPI

### Installation

Most HPC environments provide their own MPI implementation. These implementations often take advantage of the proprietary interconnect (networking) between nodes, allowing for low-latency and high-throughput communication. To find your local HPC's implementation, you can browse the catalogue from a bash terminal using the [Environment Modules](https://modules.sourceforge.net/) package available on most HPC systems:
```bash
module avail
```
or, for a more directed search:
```bash
module spider mpi
```

You may have multiple versions available. If you are unsure which version to use, check your HPC's documentation, contact your local system administrator, or simply use what is available. OpenMPI is often a reliable choice.

You can load the version of MPI you would like by adding
```bash
module load mpi/latest
```
to your job script (remember to change `mpi/latest` to the package available on your system).


Make sure you have loaded the MPI version you wish to use by running the `module load ...` command in the same terminal, then open Julia with
```bash
julia --project
```
Run this command in the same directory as your project.

Now, you have to add the `MPI` package to your local environment using
```julia
import Pkg; Pkg.add("MPI")
```
Now you should be able to load `MPIPreferences` and tell MPI.jl to use your system binary:
```julia
using MPI.MPIPreferences

MPIPreferences.use_system_binary()
exit()
```
This should create a new `LocalPreferences.toml` file. We recommend adding this file to your `.gitignore` and not committing it to your repository.
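
To confirm the switch took effect, you can run a quick sanity check in a fresh Julia session. This is a minimal sketch, assuming the standard `MPIPreferences`/`MPI.jl` setup described above:
```julia
# Sanity check: run in a new Julia session, in the same project directory.
using MPIPreferences
@show MPIPreferences.binary  # expect "system" after use_system_binary()

using MPI
MPI.versioninfo()  # prints details of the MPI library MPI.jl is using
```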

### Job Scripts

When you are running on a cluster, write your job script so that you load MPI and precompile Julia before launching your job. An example job script could look like the following:

```bash
#!/bin/bash

#SBATCH --ntasks=8
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2048
#SBATCH --time=00:30:00
#SBATCH -o mpi_job_%j.out


module load mpi/latest
module load julia/1.10.2

# Precompile Julia first to avoid race conditions
julia --project --threads=4 -e 'import Pkg; Pkg.instantiate()'
julia --project --threads=4 -e 'import Pkg; Pkg.precompile()'

mpirun -n 8 julia --project --threads=4 my_experiment.jl
```

Use the above as a template and adjust the specifics to suit your workload and HPC.

!!! info
    Make sure that you launch your jobs with at least 2 processes (tasks), as one task is dedicated to coordinating the execution of trials and saving the results. With `--ntasks=8` as above, one process coordinates and seven run trials.

## Experiment file

As usual, you should write a script to define your experiment and run the configuration. Below is an example, which assumes there is another file called `run.jl` containing a function `run_trial` that takes a configuration dictionary and a trial `UUID`.

```julia
using Experimenter

config = Dict{Symbol,Any}(
:N => IterableVariable([Int(1e6), Int(2e6), Int(3e6)]),
:seed => IterableVariable([1234, 4321, 3467, 134234, 121]),
:sigma => 0.0001)
experiment = Experiment(
name="Test Experiment",
include_file="run.jl",
function_name="run_trial",
configuration=deepcopy(config)
)

db = open_db("experiments.db")

# Init the cluster
Experimenter.Cluster.init()

@execute experiment db MPIMode(1)
```

Note that we are calling `MPIMode(1)`, which requests a communication batch size of `1`. If your trials are small and you want each worker to process a batch of trials at a time, you can set this to a higher number.
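
For completeness, here is a minimal sketch of what the assumed `run.jl` could contain. The body of `run_trial` is hypothetical; the only conventions taken from above are that it receives the configuration dictionary and the trial `UUID`, and that results are returned as a dictionary to be saved:
```julia
# run.jl -- hypothetical trial implementation for the experiment above.
using Random

function run_trial(config::Dict{Symbol,Any}, trial_id)
    rng = Random.Xoshiro(config[:seed])
    N = config[:N]
    sigma = config[:sigma]
    # Stand-in workload: average of N noisy samples.
    total = 0.0
    for _ in 1:N
        total += sigma * randn(rng)
    end
    # Return the results to be stored against this trial.
    return Dict{Symbol,Any}(:estimate => total / N)
end
```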

## SLURM

!!! warning
    It is recommended that you use the MPI mode above to run jobs on a cluster, instead of relying on `ClusterManagers.jl`, which is much slower at launching and running jobs.

Normally when running on SLURM, one creates a bash script to tell the scheduler about the resource requirements for a job. The following is an example:
```bash
#!/bin/bash
@@ -77,7 +180,7 @@ We then modify the created `myrun.sh` file to the following:
#SBATCH --time=00:30:00
#SBATCH -o hpc/logs/job_%j.out

julia --project --threads=1 my_experiment.jl

# Optional: Remove the files created by ClusterManagers.jl
rm -fr julia-*.out
36 changes: 2 additions & 34 deletions docs/src/distributed.md
````diff
@@ -11,38 +11,6 @@ addprocs(8)
 ```
 As long as `nworkers()` shows more than one worker, your execution of trials will occur in parallel across these workers.
 
-## Configuring SLURM
-
-[SLURM](https://slurm.schedmd.com/overview.html) is one of the most popular schedulers on HPC clusters, which we can integrate with `Distributed.jl` to spawn our workers automatically. See [this gist](https://gist.github.com/JamieMair/0b1ffbd4ee424c173e6b42fe756e877a) for some scripts to make this process easier.
-
-Let's start with spawning your processes:
-```julia
-using Distributed
-using ClusterManagers
-num_tasks = parse(Int, ENV["SLURM_NTASKS"]) # One process per task
-cpus_per_task = parse(Int, ENV["SLURM_CPUS_PER_TASK"]) # Assign threads per process
-addprocs(SlurmManager(num_tasks),
-    exeflags=[
-        "--project",
-        "--threads=$cpus_per_task"]
-)
-```
-You can check out [`ClusterManagers.jl`](https://github.com/JuliaParallel/ClusterManagers.jl) for your own cluster software if you are not using SLURM, but the process will be similar to this.
-
-Once this has been done, simply include your file which configures and runs your experiment using the `DistributedMode` execution mode as detailed above, saved in a file like `run_script.jl`.
-
-For SLURM, you can make a script to submit, for example:
-```sh
-#!/bin/bash
-
-#SBATCH --ntasks=8
-#SBATCH --cpus-per-task=4
-#SBATCH --mem-per-cpu=2G
-#SBATCH --time=00:30:00
-
-module load julia/1.8.2
-
-julia --project run_script.jl
-```
-which can be saved to `launch_experiment.sh` and run with `sbatch launch_experiment.sh`. Note that you may need to include additional SBATCH directives like `--account` on your cluster. Check your cluster's documentation for more information.
+Once the workers have been added, make sure to change your execution mode to `DistributedMode` to take advantage of the parallelism.
+
+If you have access to an HPC cluster and would like to use multiple nodes, you can do this easily with `Experimenter.jl`; see more in [Cluster Execution](@ref).
````
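
Putting the remaining pieces together, a complete local run might look like the following sketch. The experiment definition is borrowed from the [Cluster Execution](@ref) page, and the worker count and flags are placeholders to adjust for your machine:
```julia
using Distributed
addprocs(4; exeflags=["--project", "--threads=1"])  # spin up 4 local workers
@assert nworkers() > 1  # trials will now run in parallel

using Experimenter

config = Dict{Symbol,Any}(
    :seed => IterableVariable([1234, 4321]),
    :sigma => 0.0001)
experiment = Experiment(
    name="Distributed Test",
    include_file="run.jl",
    function_name="run_trial",
    configuration=deepcopy(config)
)

db = open_db("experiments.db")
@execute experiment db DistributedMode
```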
21 changes: 13 additions & 8 deletions docs/src/execution.md
@@ -6,15 +6,15 @@ Once you have created an experiment you can run it with the `@execute` macro sup
```
Which will only execute trials from the experiment that have not been completed. It is up to you to implement how to continue your simulations from snapshots, using the Snapshots API.

## Single Node Parallel

There are two main ways of executing your experiments in parallel: multithreading (Threads) or multiprocessing (Distributed). The former has lower latency, but the latter scales across a cluster. If you are executing on a single computer, the easiest option is:
```julia
@execute experiment db MultithreadedMode
```
By default, this will use as many threads as you have enabled. You can set this using the environment variable `JULIA_NUM_THREADS`, or by starting Julia with `--threads=X`, replacing `X` with the number you want. You can check what your current setting is with `Threads.nthreads()`.
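
For example, a quick way to verify the thread count before executing:
```julia
# Start Julia with e.g. `julia --project --threads=8` (or set JULIA_NUM_THREADS).
using Base.Threads
println("Trials will run on $(nthreads()) threads")
```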

Alternatively, we can change the execution mode to `DistributedMode`:
```julia
@execute experiment db DistributedMode
```
@@ -23,20 +23,25 @@ This internally uses `pmap` from the `Distributed.jl` standard library, parallel
using Distributed
nworkers()
```
`Experimenter.jl` will not spin up processes for you; this is something you have to do yourself. See [Distributed Execution](@ref) for an in-depth example.

!!! info
    If your code has many [memory allocations](https://docs.julialang.org/en/v1/manual/performance-tips/#Measure-performance-with-@time-and-pay-attention-to-memory-allocation), it may be better to use `DistributedMode` instead of `MultithreadedMode`.

## Heterogeneous Execution

If you want each distributed worker to be able to run multiple jobs at the same time, you can select a heterogeneous execution scheduling mode, which will allow each worker to run multiple trials simultaneously using multithreading. An example use case for this is where you have multiple nodes, each with many cores, and you do not wish to pay the memory cost of each separate process. Additionally, you can load data in a single process and reuse it for each execution in the same process. This mode may also allow multiple trials to share resources, such as a GPU, which typically only supports one process.

To run this, you simply change the mode to the `HeterogeneousMode` option, providing the number of threads to use on each worker, e.g.
```julia
@execute experiment db HeterogeneousMode(2)
```
which will allow each distributed worker to run two trials simultaneously via multithreading. If this option is selected, it is encouraged that you enable multiple threads per worker when launching the process, e.g. with `addprocs`:
```julia
addprocs(4; exeflags=["--threads=2"])
```
Otherwise, each worker may only have access to a single thread and the overall performance throughput will be worse.
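
As a sketch of the data-reuse pattern mentioned above: each worker process can cache expensive state once and share it between the trials it runs concurrently. The names and the loading function here are hypothetical, and the lock guards the cache because `HeterogeneousMode` runs trials on separate threads within the same process:
```julia
# Hypothetical run.jl pattern for HeterogeneousMode: cache per-process data once.
const DATASET = Ref{Union{Nothing,Vector{Float64}}}(nothing)
const DATASET_LOCK = ReentrantLock()

function get_dataset!()
    lock(DATASET_LOCK) do
        if DATASET[] === nothing
            DATASET[] = rand(10_000)  # stand-in for an expensive one-off load
        end
    end
    return DATASET[]
end

function run_trial(config::Dict{Symbol,Any}, trial_id)
    data = get_dataset!()  # loaded once per process, shared across threaded trials
    return Dict{Symbol,Any}(:mean => sum(data) / length(data))
end
```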

<!-- TODO: Update the manifest to point to current experimenter version in docs -->
## MPI Execution

Most HPC clusters use a [Message Passing Interface](https://en.wikipedia.org/wiki/Message_Passing_Interface) implementation to handle communication between different processes and synchronise tasks. `Experimenter.jl` now has built-in support for execution via MPI, which has much lower overhead than the built-in `Distributed.jl` multiprocessing library. See more examples in the [Cluster Execution](@ref) page.
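
As a minimal sketch mirroring the fuller example on that page, an MPI run is just the usual experiment script with a cluster init call, launched under `mpirun`; the configuration here is a placeholder:
```julia
# script.jl -- launch with e.g. `mpirun -n 4 julia --project script.jl`.
using Experimenter

config = Dict{Symbol,Any}(:seed => IterableVariable([1, 2, 3]))
experiment = Experiment(
    name="MPI Test",
    include_file="run.jl",
    function_name="run_trial",
    configuration=deepcopy(config)
)

db = open_db("experiments.db")
Experimenter.Cluster.init()        # set up the MPI-based workers
@execute experiment db MPIMode(1)  # communication batch size of 1 trial
```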
2 changes: 1 addition & 1 deletion docs/src/getting_started.md
@@ -106,7 +106,7 @@ will not run any more trials, as they have already been completed. However, if t

## Saving part way

If your trials take a long time to finish and may be cancelled during their run, you can always implement a way to save a `Snapshot`, which allows you to save the data you need to restore a trial part way through running. An example setup for doing this is given in [Custom Snapshots](@ref).

## What is an `Experiment`?

10 changes: 6 additions & 4 deletions docs/src/index.md
@@ -7,11 +7,13 @@ CurrentModule = Experimenter
*A package for easily running experiments for different parameters and saving the results in a centralised database*

## Package Features
- Create a local SQLite database to store the results of your experiment, removing the need to keep track of 1000s of results files for each parameter configuration.
- Provides a standard structure for executing code across a range of parameters.
- Provides saving of results into the database using standard Julia types.
- Promotes writing a script that can be easily committed to a Git repository to keep track of results and parameters used throughout development.
- Provides an `@execute` macro that will execute an experiment (consisting of many trials with different parameters). Can execute serially, or in parallel with a choice of multithreading, multiprocessing, or even MPI.
- Provides an easy way to execute trials across a High Performance Cluster (HPC).
- Automatically skips completed trials, and provides a Snapshots API to allow partial progress to be saved and reloaded.

Head over to [Getting Started](@ref) to get an overview of this package.

@@ -22,9 +24,9 @@ Pages = [
"getting_started.md",
"execution.md",
"distributed.md",
"clusters.md",
"store.md",
"snapshots.md",
"clusters.md"
]
Depth = 2
```