RDMA error on clima for 4 GPUs #54

Open
Sbozzolo opened this issue Apr 5, 2024 · 3 comments

Sbozzolo commented Apr 5, 2024

For example: https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/255#018eaa55-1d98-4f43-a1be-b2199302f2c4

[clima.gps.caltech.edu:3667226] Failed to register remote memory, rc=-1
@Sbozzolo added the bug label on Apr 5, 2024
@nefrathenrici self-assigned this on Apr 22, 2024
nefrathenrici commented

It looks like @simonbyrne already ran into this issue: open-mpi/ompi#11949

To summarize: --ntasks=4 --gpus-per-task=1 causes Slurm to bind GPUs to tasks, so each task sees only one device. I think all devices need to be visible within a single task for CUDA IPC to work and to avoid staging transfers through the host between tasks.

If we use --ntasks=4 --gpus=4 instead, all GPUs are visible to every task. I think ClimaComms can then assign devices to individual tasks, but I'm not sure. The RDMA error has been difficult to reproduce because of an nsys error.

[nefrathe@clima ~]$ srun -n 4 --gpus-per-task=1 printenv CUDA_VISIBLE_DEVICES
0
1
2
3
[nefrathe@clima ~]$ srun -n 4 --gpus=4 printenv CUDA_VISIBLE_DEVICES
0,1,2,3
0,1,2,3
0,1,2,3
0,1,2,3
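
If all four GPUs stay visible to every task (the --gpus=4 case), each rank still has to pick its own device. A minimal shell sketch of that idea, assuming a single node with four GPUs and using Slurm's standard SLURM_PROCID and SLURM_LOCALID variables (this only illustrates selection by local rank, not the actual ClimaComms mechanism):

srun --ntasks=4 --gpus=4 bash -c '
  # All four devices remain in CUDA_VISIBLE_DEVICES for every rank, so CUDA IPC
  # between ranks stays possible; each rank just picks the device matching its local ID.
  echo "rank ${SLURM_PROCID}: would use GPU ${SLURM_LOCALID} of ${CUDA_VISIBLE_DEVICES}"
'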

Sbozzolo commented May 4, 2024

> It looks like @simonbyrne already ran into this issue: open-mpi/ompi#11949 […]

The issue Simon found was fixed in Slurm 23.11 with the new --allow-sharing flag. Our GPUs now successfully do device-to-device communication across MPI processes.
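
A minimal sketch of the corresponding check, with the flag name taken verbatim from the comment above (the exact spelling, and whether it must be passed through --gres-flags on Slurm 23.11, should be verified against the cluster's srun documentation):

# Sketch only: flag name as written above; confirm against `man srun` for Slurm 23.11.
srun --ntasks=4 --gpus-per-task=1 --allow-sharing printenv CUDA_VISIBLE_DEVICES

Running this shows whether each task can now see all of the job's GPUs while keeping its own binding.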
