To summarize: --ntasks=4 --gpus-per-task=1 causes SLURM to bind GPUs to tasks and limit each task to one visible device. I think all devices need to be visible within each task for CUDA IPC to work and to avoid staging transfers through the host between tasks.
If we use --ntasks=4 --gpus=4 instead, all GPUs are visible to every task. I think ClimaComms can then assign devices to individual tasks, but I'm not sure. The RDMA error has been difficult to reproduce because of an nsys error.
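For illustration, here is a minimal sketch of that per-task device-assignment pattern when all GPUs are visible, written with an assumed mpi4py/CuPy stack rather than ClimaComms itself; the local-rank scheme is an assumption about how devices could be split among tasks, not a statement of what ClimaComms does.

```python
# Sketch only (assumed mpi4py + CuPy stack, not ClimaComms): with --ntasks=4 --gpus=4
# every task sees all four GPUs, so each MPI rank selects one device by its local
# rank on the node while the other devices remain visible for CUDA IPC.
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
# Ranks on the same node get consecutive local ranks via a shared-memory sub-communicator.
local_comm = comm.Split_type(MPI.COMM_TYPE_SHARED)
local_rank = local_comm.Get_rank()

ndevices = cp.cuda.runtime.getDeviceCount()  # 4 under --gpus=4, but 1 under --gpus-per-task=1
cp.cuda.Device(local_rank % ndevices).use()  # make this rank's device the default

print(f"rank {comm.Get_rank()}: using device {local_rank % ndevices} of {ndevices} visible")
```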
The issue Simon found was fixed in SLURM 23.11 with the new flag --allow-sharing. Our GPUs now successfully do device-to-device communication across MPI processes.
For example: https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/255#018eaa55-1d98-4f43-a1be-b2199302f2c4
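For context, this kind of device-to-device transfer across ranks can be exercised with a small CUDA-aware MPI test; a minimal sketch with an assumed mpi4py/CuPy stack (not the ClimaAtmos driver) is below. It only stays on the device when the MPI build is CUDA-aware and the peer GPUs are visible to each task.

```python
# Minimal CUDA-aware MPI check (assumed mpi4py + CuPy stack, not the ClimaAtmos code):
# rank 0 sends a GPU-resident buffer directly to rank 1. With a CUDA-aware MPI and
# peer-visible devices this can stay device-to-device (IPC/NVLink or GPUDirect RDMA);
# otherwise MPI falls back to staging through host memory.
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank is assumed to have already selected its own device (see sketch above).
buf = cp.full(1 << 20, rank, dtype=cp.float32)  # ~4 MB GPU buffer filled with the rank id

if rank == 0:
    comm.Send(buf, dest=1, tag=0)               # device array handed straight to MPI
elif rank == 1:
    recv = cp.empty_like(buf)
    comm.Recv(recv, source=0, tag=0)
    assert bool((recv == 0).all()), "unexpected payload from rank 0"
    print("rank 1 received device buffer from rank 0")
```

Run with, e.g., mpiexec -n 2 python test_gpu_send.py (hypothetical file name) inside the SLURM allocation being tested.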