upgrade to `nccl>=2.19` across RAPIDS #102

Comments
Changed the title from "nccl>2.18.1 across RAPIDS" to "nccl>=2.18.1.1 across RAPIDS".
Contributes to rapidsai/build-planning#102

Some RAPIDS libraries are using `ncclCommSplit()`, which was introduced in `nccl==2.18.1.1`. This is part of a series of PRs across RAPIDS updating libraries' pins to `nccl>=2.18.1.1` to ensure they get a new-enough version that supports that.

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #723
Contributes to rapidsai/build-planning#102

Some RAPIDS libraries are using `ncclCommSplit()`, which was introduced in `nccl==2.18.1.1`. This is part of a series of PRs across RAPIDS updating libraries' pins to `nccl>=2.18.1.1` to ensure they get a new-enough version that supports that.

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Chuck Hastings (https://github.com/ChuckHastings)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - https://github.com/jakirkham

URL: #4661
Contributes to rapidsai/build-planning#102
Fixes #217

## Notes for Reviewers

### How I tested this

Temporarily added a CUDA 11.4.3 test job to CI here (the same specs as the failing nightly), by pointing at the branch from rapidsai/shared-workflows#246. Observed the exact same failures with CUDA 11.4 reported in rapidsai/build-planning#102.

```text
...
+ nccl  2.10.3.1  hcad2f07_0  rapidsai-nightly  125MB
...
./WHOLEGRAPH_CSR_WEIGHTED_SAMPLE_WITHOUT_REPLACEMENT_TEST: symbol lookup error: /opt/conda/envs/test/bin/gtests/libwholegraph/../../../lib/libwholegraph.so: undefined symbol: ncclCommSplit
sh -c exec "$0" ./WHOLEMEMORY_HANDLE_TEST
./WHOLEMEMORY_HANDLE_TEST: symbol lookup error: /opt/conda/envs/test/bin/gtests/libwholegraph/../../../lib/libwholegraph.so: undefined symbol: ncclCommSplit
sh -c exec "$0" ./GRAPH_APPEND_UNIQUE_TEST
```

([build link](https://github.com/rapidsai/wholegraph/actions/runs/10966022370/job/30453393224?pr=218))

Pushed a commit adding a floor of `nccl>=2.18.1.1`. Saw all tests pass with CUDA 11.4 😁

```text
...
+ nccl  2.22.3.1  hee583db_1  conda-forge  131MB
...
(various log messages showing all tests passed)
```

([build link](https://github.com/rapidsai/wholegraph/actions/runs/10966210441/job/30454147250?pr=218))

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - https://github.com/linhu-nv
  - https://github.com/jakirkham

URL: #218
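The `undefined symbol: ncclCommSplit` failures above can also be diagnosed directly against whichever NCCL a test environment solved to. A minimal sketch, assuming the library is installed under `${CONDA_PREFIX}/lib` (the path and exact filename are illustrative; the shared object is usually versioned, e.g. `libnccl.so.2`):

```shell
# List the dynamic symbols exported by the installed NCCL and look for ncclCommSplit.
# No match means the environment solved to an NCCL older than 2.18.1.1.
if nm -D "${CONDA_PREFIX}"/lib/libnccl.so* | grep -q ncclCommSplit; then
    echo "ncclCommSplit is available"
else
    echo "ncclCommSplit is missing; NCCL is likely older than 2.18.1.1"
fi
```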
Contributes to rapidsai/build-planning#102

Some RAPIDS libraries are using `ncclCommSplit()`, which was introduced in `nccl==2.18.1.1`. This is part of a series of PRs across RAPIDS updating libraries' pins to `nccl>=2.18.1.1` to ensure they get a new-enough version that supports that.

`cuvs` doesn't have any *direct* uses of NCCL... it only uses it via raft. This PR proposes removing `cuvs`'s dependency pinnings on NCCL, in favor of just using whatever it gets transitively via raft.

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #341
In cuGraph, as part of the NumPy 2 upgrade (rapidsai/cugraph#4615 (review)), we noticed the following logic:

```shell
# Starting from 2.2, PyTorch wheels depend on nvidia-nccl-cuxx>=2.19 wheel and
# dynamically link to NCCL. RAPIDS CUDA 11 CI images have an older NCCL version that
# might shadow the newer NCCL required by PyTorch during import (when importing
# `cupy` before `torch`).
if [[ "${NCCL_VERSION}" < "2.19" ]]; then
    PYTORCH_VER="2.1.0"
else
    PYTORCH_VER="2.3.0"
fi
```

Given this, am wondering if the minimum NCCL version should be 2.19?

cc @hcho3 (for awareness)
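For context, `NCCL_VERSION` in a check like the one above refers to the NCCL already present in the CI image or environment. A minimal sketch of how it could be read from a conda environment, assuming `jq` is on the PATH (the actual CI scripts may obtain it differently):

```shell
# Ask conda which nccl package the environment solved to and extract its version string.
NCCL_VERSION="$(conda list --json nccl | jq -r '.[0].version')"
echo "installed NCCL: ${NCCL_VERSION}"
```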
Contributes to rapidsai/build-planning#102

Some RAPIDS libraries are using `ncclCommSplit()`, which was introduced in `nccl==2.18.1.1`. This is part of a series of PRs across RAPIDS updating libraries' pins to `nccl>=2.18.1.1` to ensure they get a new-enough version that supports that.

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #2443
There are some competing requirements here: `wholegraph` needs `nccl>=2.18.1.1` to get `ncclCommSplit`, PyTorch 2.3 wheels expect `nccl>=2.19`, and we are very close to the 24.10 code freeze.

Given how close it is to 24.10 code freeze, here's my proposal: raise the floor to `nccl>=2.19` across RAPIDS for the 24.10 release.

@jakirkham @hcho3 what do you think?
Sounds good to me
Think if we are planning to bump the minimum across the board, we should go ahead and get it done.

Yes, we are at code freeze, and yes, it is not great that we are finding this change right before then. However, I think having the right intended behavior trumps that.

Would add that a user not constraining NCCL likely gets a much newer version of NCCL anyways. This is also true of our own testing. So the lower bound is there to prevent unexpected solves and incompatible environments.

One last note: wholegraph also has a PyTorch dependency, so it would need similar treatment to cuGraph.
The issue you've found is at the intersection of PyTorch and NCCL. That's why I support bumping the floor up further to `nccl>=2.19`.

I thought only cuGraph was affected on the PyTorch side, but wholegraph depends on PyTorch as well. So I guess we should do this there too.
@jakirkham in the interest of time, I started an internal thread about this.
Ok, let's start with RAFT since that is going into code freeze. Then we can work on the rest.

Edit: Didn't see your comment above when posting this. Will follow the internal thread.
Thanks James! 🙏 Now that we have a PR for RAFT (rapidsai/raft#2458), should we work on other cases (like Wholegraph)?
Yes, I'll do that right now. |
Think all the updates we want now have associated PRs:
Thanks James! 🙏 Also thank you for organizing that nice list 🙂
Follow-up to #218

This bumps the NCCL floor here slightly higher, to `>=2.19`. Part of a RAPIDS-wide update of that floor for the 24.10 release. See rapidsai/build-planning#102 (comment) for context.

cc @linhu-nv for awareness

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - https://github.com/jakirkham

URL: #223
Follow-up to #2443

As part of the work to support NumPy 2 across RAPIDS, we found reason to upgrade some libraries like `cugraph` to slightly newer NCCL (`>=2.19`). Context: rapidsai/build-planning#102 (comment)

This applies that same bump here, to keep the range of NCCL versions consistent across RAPIDS.

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - https://github.com/jakirkham
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #2458
Changed the title from "nccl>=2.18.1.1 across RAPIDS" to "nccl>=2.19 across RAPIDS".
Follow-up to #723

This bumps the NCCL floor here slightly higher, to `>=2.19`. Part of a RAPIDS-wide update of that floor for the 24.10 release. See rapidsai/build-planning#102 (comment) for context.

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

URL: #726
This is complete! Thanks very much for the help @jakirkham @hcho3 @alexbarghi-nv @cjnolet
## Description

`wholegraph` nightly conda-cpp-tests (amd64, rockylinux8, CUDA 11.4) have been failing with errors like this: (build link)

`ncclCommSplit` was first introduced in `nccl==2.18.1-1` (commit link) and is used unconditionally in `wholegraph`, but `wholegraph` (and other RAPIDS libraries) have a floor of `nccl>=2.9.9`.

In this particular job, `nccl=2.10.3.1` is getting installed. We haven't noticed this in other jobs or other projects' CI because in all of those, `nccl>=2.18.1` is getting installed.

In addition, PyTorch 2.3 requires `nccl>=2.19`: #102 (comment)

Across RAPIDS, we should raise the floor on `nccl` to `>=2.19`.

## Benefits of this work

- being able to use `wholegraph` in a CUDA 11.4 environment again

## Acceptance Criteria

- all RAPIDS libraries' `nccl` dependencies are pinned to `nccl>=2.19`

## Approach

- temporarily add the `wholegraph` nightlies to the PR matrices on a `shared-workflows` branch
- update `wholegraph`'s pinnings, open a PR with CI pointed at that branch, and confirm that CI passes
- point `wholegraph` CI back at `branch-24.10` of `shared-workflows`, delete that `shared-workflows` branch
- make similar changes in the other RAPIDS projects with `nccl` dependencies
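A quick way to sanity-check the rollout in any given environment is to compare the solved NCCL version against the intended floor. A minimal sketch, assuming conda and `jq` are available and `2.19` as the target floor:

```shell
# Read the NCCL version the environment actually solved to.
nccl_version="$(conda list --json nccl | jq -r '.[0].version')"
floor="2.19"
# sort -V orders version strings numerically; if the floor is not the smallest
# of the two, the installed NCCL is older than the floor.
if [[ "$(printf '%s\n%s\n' "${floor}" "${nccl_version}" | sort -V | head -n1)" != "${floor}" ]]; then
    echo "installed nccl ${nccl_version} is below the ${floor} floor" >&2
fi
```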