upgrade to `nccl>=2.19` across RAPIDS #102

jameslamb · 2024-09-20T18:49:56Z

Description

wholegraph nightly conda-cpp-tests (amd64, rockylinux8, CUDA 11.4) have been failing with errors like this:

./WHOLEGRAPH_CSR_WEIGHTED_SAMPLE_WITHOUT_REPLACEMENT_TEST: symbol lookup error: /opt/conda/envs/test/bin/gtests/libwholegraph/../../../lib/libwholegraph.so: undefined symbol: ncclCommSplit
sh -c exec "$0" ./WHOLEMEMORY_HANDLE_TEST 
./WHOLEMEMORY_HANDLE_TEST: symbol lookup error: /opt/conda/envs/test/bin/gtests/libwholegraph/../../../lib/libwholegraph.so: undefined symbol: ncclCommSplit
sh -c exec "$0" ./GRAPH_APPEND_UNIQUE_TEST

(build link)

ncclCommSplit was first introduced in nccl==2.18.1-1 (commit link) and is used unconditionally in wholegraph, but wholegraph (and other RAPIDS libraries) have a floor of nccl>=2.9.9.

In this particular job, nccl=2.10.3.1 is getting installed.

 + nccl                     2.10.3.1  hcad2f07_0                  rapidsai-nightly     125MB

We haven't noticed this in other jobs or other projects' CI because in all of those, nccl>=2.18.1 is getting installed.
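One way to confirm this kind of mismatch in a given environment is to check whether the NCCL shared library there exports the symbol at all. A minimal sketch, assuming `nm` (binutils) and `awk` are available; `exports_symbol` is a hypothetical helper name, and the library path is an assumption about where conda places NCCL:

```shell
# Hypothetical helper: reads `nm`-style output on stdin and succeeds
# if the named symbol appears in the last column.
exports_symbol() {
    awk -v sym="$1" '
        { name = $NF; sub(/@.*/, "", name); if (name == sym) found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# Assumed location of NCCL inside the conda environment under test.
lib="${CONDA_PREFIX:-/opt/conda/envs/test}/lib/libnccl.so"
if [ -e "${lib}" ]; then
    if nm -D --defined-only "${lib}" | exports_symbol ncclCommSplit; then
        echo "ncclCommSplit found in ${lib}"
    else
        echo "ncclCommSplit missing from ${lib}; NCCL is likely older than 2.18.1"
    fi
fi
```

If the symbol is missing, anything linked against it (like `libwholegraph.so` here) would be expected to fail at load time with the same `undefined symbol` error.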

In addition, pytorch 2.3 requires nccl>=2.19: #102 (comment)

Across RAPIDS, we should raise the floor on nccl to >=2.19.

Benefits of this work

prevents a source of runtime errors when using RAPIDS ML libraries
allows testing wholegraph in a CUDA 11.4 environment again

Acceptance Criteria

all RAPIDS libraries with explicit nccl dependencies are pinned to nccl>=2.19

Approach

create a branch in https://github.com/rapidsai/shared-workflows adding the test job that's failing in wholegraph nightlies to the PR matrices
update wholegraph's pinnings, open a PR with CI pointed at that branch, and confirm that CI passes
point wholegraph CI back at branch-24.10 of shared-workflows, delete that shared-workflows branch
put up PRs with similar pins across all of the RAPIDS libraries with direct nccl dependencies
merge all those PRs (can be in any order)

Changes

wholegraph (bump NCCL floor to 2.18.1.1, relax PyTorch pin wholegraph#218)
raft (bump NCCL floor to 2.18.1.1 raft#2443)
cugraph (bump NCCL floor to 2.18.1.1, include nccl.h where it's needed cugraph#4661)
cuvs (remove NCCL pins in build and test environments cuvs#341)
integration (bump NCCL floor to 2.18.1.1 integration#723)
cuopt (https://github.com/rapidsai/cuopt/pull/2017 ... not critical for RAPIDS 24.10)

Contributes to rapidsai/build-planning#102

Some RAPIDS libraries are using `ncclCommSplit()`, which was introduced in `nccl==2.18.1.1`. This is part of a series of PRs across RAPIDS updating libraries' pins to `nccl>=2.18.1.1` to ensure they get a new-enough version that supports that.

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #723

Contributes to rapidsai/build-planning#102

Some RAPIDS libraries are using `ncclCommSplit()`, which was introduced in `nccl==2.18.1.1`. This is part of a series of PRs across RAPIDS updating libraries' pins to `nccl>=2.18.1.1` to ensure they get a new-enough version that supports that.

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- Chuck Hastings (https://github.com/ChuckHastings)
- Vyas Ramasubramani (https://github.com/vyasr)
- https://github.com/jakirkham

URL: #4661

Contributes to rapidsai/build-planning#102

Fixes #217

## Notes for Reviewers

### How I tested this

Temporarily added a CUDA 11.4.3 test job to CI here (the same specs as the failing nightly), by pointing at the branch from rapidsai/shared-workflows#246.

Observed the exact same failures with CUDA 11.4 reported in rapidsai/build-planning#102.

```text
...
 + nccl 2.10.3.1 hcad2f07_0 rapidsai-nightly 125MB
...
./WHOLEGRAPH_CSR_WEIGHTED_SAMPLE_WITHOUT_REPLACEMENT_TEST: symbol lookup error: /opt/conda/envs/test/bin/gtests/libwholegraph/../../../lib/libwholegraph.so: undefined symbol: ncclCommSplit
sh -c exec "$0" ./WHOLEMEMORY_HANDLE_TEST
./WHOLEMEMORY_HANDLE_TEST: symbol lookup error: /opt/conda/envs/test/bin/gtests/libwholegraph/../../../lib/libwholegraph.so: undefined symbol: ncclCommSplit
sh -c exec "$0" ./GRAPH_APPEND_UNIQUE_TEST
```

([build link](https://github.com/rapidsai/wholegraph/actions/runs/10966022370/job/30453393224?pr=218))

Pushed a commit adding a floor of `nccl>=2.18.1.1`. Saw all tests pass with CUDA 11.4 😁

```text
...
 + nccl 2.22.3.1 hee583db_1 conda-forge 131MB
...
(various log messages showing all tests passed)
```

([build link](https://github.com/rapidsai/wholegraph/actions/runs/10966210441/job/30454147250?pr=218))

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)
- https://github.com/linhu-nv
- https://github.com/jakirkham

URL: #218

Contributes to rapidsai/build-planning#102

Some RAPIDS libraries are using `ncclCommSplit()`, which was introduced in `nccl==2.18.1.1`. This is part of a series of PRs across RAPIDS updating libraries' pins to `nccl>=2.18.1.1` to ensure they get a new-enough version that supports that.

`cuvs` doesn't have any *direct* uses of NCCL... it only uses it via raft. This PR proposes removing `cuvs`'s dependency pinnings on NCCL, in favor of just using whatever it gets transitively via raft.

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)
- Corey J. Nolet (https://github.com/cjnolet)

URL: #341

jakirkham · 2024-09-26T01:57:03Z

In cuGraph, as part of the NumPy 2 upgrade (rapidsai/cugraph#4615 (review)), we noticed the following logic:

# Starting from 2.2, PyTorch wheels depend on nvidia-nccl-cuxx>=2.19 wheel and
# dynamically link to NCCL. RAPIDS CUDA 11 CI images have an older NCCL version that
# might shadow the newer NCCL required by PyTorch during import (when importing
# `cupy` before `torch`).
if [[ "${NCCL_VERSION}" < "2.19" ]]; then
  PYTORCH_VER="2.1.0"
else
  PYTORCH_VER="2.3.0"
fi

Given this, am wondering if the minimum NCCL version should be 2.19?

cc @hcho3 (for awareness)
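A side note on the snippet above: inside `[[ ... ]]`, bash's `<` operator compares strings lexicographically, so a version such as `2.9.9` would sort after `2.19`. A minimal sketch of a version-aware alternative, assuming a `sort` that supports `-V` (as GNU coreutils does); `version_lt` and `pick_pytorch_ver` are hypothetical helper names, not part of cugraph's CI scripts:

```shell
# Hypothetical helper: succeeds when version $1 is strictly lower than $2.
# Relies on `sort -V` (version sort), an extension found in GNU coreutils.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Hypothetical wrapper mirroring the branch in the snippet above.
pick_pytorch_ver() {
    if version_lt "$1" "2.19"; then
        echo "2.1.0"
    else
        echo "2.3.0"
    fi
}

# Falls back to the version seen in the failing job when NCCL_VERSION is unset.
pick_pytorch_ver "${NCCL_VERSION:-2.10.3.1}"
```

With a comparison like this, the choice of PyTorch version follows the numeric ordering of the NCCL components rather than their character ordering.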

Contributes to rapidsai/build-planning#102

Some RAPIDS libraries are using `ncclCommSplit()`, which was introduced in `nccl==2.18.1.1`. This is part of a series of PRs across RAPIDS updating libraries' pins to `nccl>=2.18.1.1` to ensure they get a new-enough version that supports that.

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #2443

jameslamb · 2024-09-26T15:18:09Z

There are some competing requirements here:

we want to pin pytorch>=2.3 to support numpy>=2
pytorch>=2.3 requires NCCL >= 2.19
raft's floor was just set to NCCL >=2.18.1.1, in "bump NCCL floor to 2.18.1.1" (raft#2443)
raft goes into code freeze for 24.10 in 4 hours
raft is used widely, including in places that don't also use pytorch (like cuml)
we generally want the floor for NCCL to be the same across RAPIDS libraries if possible

Given how close it is to 24.10 code freeze for raft, I don't think we should raise raft's floor further to 2.19 at this point. I think it's ok for cugraph to have a slightly higher floor than raft for the 24.10 release... in practice, both libraries have been building against much newer versions of NCCL for a while anyway. For example, look at the most recent successful nightly build of cugraph on 24.10... it used NCCL 2.22.3.1:

    nccl:                             2.22.3.1-hd961488_1                        conda-forge

(build link)

Here's my proposal:

24.10:
- raft stays pinned to nccl>=2.18.1.1 (because it's about to enter code freeze)
- cugraph pins to nccl>=2.19 (to support pytorch>=2.3 and therefore numpy>=2)
24.12:
- all RAPIDS libraries with NCCL pins move to nccl>=2.19

@jakirkham @hcho3 what do you think?

hcho3 · 2024-09-26T15:21:05Z

Sounds good to me

jakirkham · 2024-09-26T16:22:41Z

Think if we are planning to bump the minimum across the board, we should go ahead and get it done

Yes we are at code freeze and yes it is not great we are finding this change before then. However I think having the right intended behavior trumps that

Would add that a user not constraining NCCL likely gets a much newer version of NCCL anyways. This is also true of our own testing. So the lower bound is there to prevent unexpected solves and incompatible environments

One last note: wholegraph also has a PyTorch dependency, so it would need similar treatment to cuGraph

jameslamb · 2024-09-26T16:35:02Z

> However I think having the right intended behavior trumps that

The issue you've found is at the intersection of PyTorch and NCCL. That's why I support bumping the floor up further to >=2.19 for the libraries that have PyTorch dependencies (and which, conveniently, are not in code freeze yet).

I thought that raft did not have a torch dependency... but was wrong about that 🙃

https://github.com/rapidsai/raft/blob/f37c41c54fc64a4e3689e5a61851ba3821800fee/python/pylibraft/pylibraft/common/outputs.py#L29-L36

So I guess we should do this there too.

jameslamb · 2024-09-26T16:42:10Z

@jakirkham in the interest of time, I started an internal thread about this.

jakirkham · 2024-09-26T16:46:51Z

Ok let's start with RAFT since that is going into code freeze

Then we can work on the rest

Edit: Didn't see your comment above when posting this. Will follow the internal thread

jakirkham · 2024-09-26T17:37:15Z

Thanks James! 🙏

Now that we have a PR for RAFT ( rapidsai/raft#2458 ), should we work on other cases (like Wholegraph)?

jameslamb · 2024-09-26T17:45:19Z

Yes, I'll do that right now.

jameslamb · 2024-09-26T18:07:16Z

Think all the updates we want now have associated PRs:

wholegraph: bump NCCL floor to 2.19 wholegraph#223
raft: bump NCCL floor to 2.19 raft#2458
cugraph: should be done together with NumPy 2 updates (Remove NumPy <2 pin cugraph#4615 (comment), cc @hcho3 )
cuvs: NCCL dropped as unneeded and later readded for a new feature needing NCCL
- NCCL was removed in remove NCCL pins in build and test environments cuvs#341
- NCCL was later readded for a new feature SNMG ANN cuvs#231 (comment)
integration: bump NCCL floor to 2.19 integration#726
cuopt: no action required
- NCCL being removed in https://github.com/rapidsai/cuopt/pull/2017

jakirkham · 2024-09-26T18:14:53Z

Thanks James! 🙏

Also thank you for organizing that nice list 🙂

@linhu-nv

Follow-up to #218

This bumps the NCCL floor here slightly higher, to `>=2.19`. Part of a RAPIDS-wide update of that floor for the 24.10 release. See rapidsai/build-planning#102 (comment) for context.

cc @linhu-nv for awareness

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- https://github.com/jakirkham

URL: #223

Follow-up to #2443

As part of the work to support NumPy 2 across RAPIDS, we found reason to upgrade some libraries like `cugraph` to slightly newer NCCL (`>=2.19`). Context: rapidsai/build-planning#102 (comment)

This applies that same bump here, to keep the range of NCCL versions consistent across RAPIDS.

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- https://github.com/jakirkham
- Corey J. Nolet (https://github.com/cjnolet)

URL: #2458

Follow-up to #723

This bumps the NCCL floor here slightly higher, to `>=2.19`. Part of a RAPIDS-wide update of that floor for the 24.10 release. See rapidsai/build-planning#102 (comment) for context.

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- Kyle Edwards (https://github.com/KyleFromNVIDIA)

URL: #726

jameslamb · 2024-09-30T15:51:42Z

This is complete! Thanks very much for the help @jakirkham @hcho3 @alexbarghi-nv @cjnolet

jameslamb self-assigned this Sep 20, 2024

jameslamb changed the title ~~upgrade to nccl>2.18.1 across RAPIDS~~ upgrade to nccl>=2.18.1.1 across RAPIDS Sep 20, 2024

This was referenced Sep 20, 2024

bump NCCL floor to 2.18.1.1 rapidsai/integration#723

Merged

bump NCCL floor to 2.18.1.1, include nccl.h where it's needed rapidsai/cugraph#4661

Merged

jakirkham added this to the v24.10 milestone Sep 24, 2024

jakirkham mentioned this issue Sep 26, 2024

Remove NumPy <2 pin rapidsai/cugraph#4615

Merged

jameslamb mentioned this issue Sep 26, 2024

bump NCCL floor to 2.19 rapidsai/raft#2458

Merged

This was referenced Sep 26, 2024

bump NCCL floor to 2.19 rapidsai/wholegraph#223

Merged

bump NCCL floor to 2.19 rapidsai/integration#726

Merged

jakirkham changed the title ~~upgrade to nccl>=2.18.1.1 across RAPIDS~~ upgrade to nccl>=2.19 across RAPIDS Sep 27, 2024

jameslamb closed this as completed Sep 30, 2024

jameslamb mentioned this issue Oct 2, 2024

SNMG ANN rapidsai/cuvs#231

Merged
