RDMA error on clima for 4 GPUs #54

Open
Sbozzolo opened this issue Apr 5, 2024 · 3 comments

Sbozzolo commented Apr 5, 2024

For example: https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/255#018eaa55-1d98-4f43-a1be-b2199302f2c4

[clima.gps.caltech.edu:3667226] Failed to register remote memory, rc=-1
@Sbozzolo added the bug label on Apr 5, 2024
@nefrathenrici self-assigned this on Apr 22, 2024
nefrathenrici commented

It looks like @simonbyrne already ran into this issue: open-mpi/ompi#11949

To summarize: --ntasks=4 --gpus-per-task=1 causes Slurm to bind GPUs to tasks, so each task sees only one device. I think all devices need to be visible within a single task for CUDA IPC to work and to avoid staging transfers through the host between tasks.

If we use --ntasks=4 --gpus=4 instead, all GPUs are visible to every task. I think ClimaComms can then assign devices to individual tasks, but I'm not sure. The RDMA error has been difficult to reproduce because of an nsys error.

[nefrathe@clima ~]$ srun -n 4 --gpus-per-task=1 printenv CUDA_VISIBLE_DEVICES
0
1
2
3
[nefrathe@clima ~]$ srun -n 4 --gpus=4 printenv CUDA_VISIBLE_DEVICES
0,1,2,3
0,1,2,3
0,1,2,3
0,1,2,3
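
If all four GPUs stay visible to every task (the --gpus=4 case), each rank still has to pick its own device. A minimal shell sketch of that idea, assuming a single node with four GPUs and using Slurm's standard SLURM_PROCID and SLURM_LOCALID variables (this only illustrates selection by local rank, not the actual ClimaComms mechanism):

srun --ntasks=4 --gpus=4 bash -c '
  # All four devices remain in CUDA_VISIBLE_DEVICES for every rank, so CUDA IPC
  # between ranks stays possible; each rank just picks the device matching its local ID.
  echo "rank ${SLURM_PROCID}: would use GPU ${SLURM_LOCALID} of ${CUDA_VISIBLE_DEVICES}"
'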

Sbozzolo commented May 4, 2024

> It looks like @simonbyrne already ran into this issue: open-mpi/ompi#11949 […]

The issue Simon found was fixed in Slurm 23.11 with the new --allow-sharing flag. Our GPUs now successfully do device-to-device communication across MPI processes.
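
A minimal sketch of the corresponding check, with the flag name taken verbatim from the comment above (the exact spelling, and whether it must be passed through --gres-flags on Slurm 23.11, should be verified against the cluster's srun documentation):

# Sketch only: flag name as written above; confirm against `man srun` for Slurm 23.11.
srun --ntasks=4 --gpus-per-task=1 --allow-sharing printenv CUDA_VISIBLE_DEVICES

Running this shows whether each task can now see all of the job's GPUs while keeping its own binding.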
