[NO-OOM] Exploration of cudf-spilling + UVM #1504

Draft
madsbk wants to merge 3 commits into branch-24.06

Conversation

madsbk (Member) commented Mar 21, 2024

In order to avoid out-of-memory errors, we try to combine cudf-spilling and UVM in this PR.

Goal

  • Don't affect performance when no spilling is needed.
  • Preserve the existing performance of cudf-spilling.
  • When cudf-spilling falls short, use UVM to avoid OOM failures.

The approach

  • Set up cudf to use a memory pool backed by managed memory (UVM).
  • Register a callback function that gets called every time the memory pool needs to expand.
  • When the memory pool wants to expand beyond the available device memory, the callback function triggers cudf spilling instead. Only if cudf cannot find any buffers to spill do we expand the memory pool. A rough sketch of this wiring is shown after this list.
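
For reference, here is a minimal Python sketch of the wiring, built from existing RMM and cudf pieces. Note that the PR itself hooks the pool's expansion path in C++; the closest off-the-shelf Python analogue, `rmm.mr.FailureCallbackResourceAdaptor`, fires on allocation failure rather than before pool growth, so this is only an approximation of the approach, and `spill_before_growing` is a hypothetical helper.

```python
import rmm
import cudf
from cudf.core.buffer.spill_manager import get_global_manager

cudf.set_option("spill", True)  # enable cudf's spilling machinery


def spill_before_growing(nbytes: int) -> bool:
    """Ask cudf to spill at least `nbytes` of device buffers.

    Returning True tells RMM to retry the allocation; returning False
    propagates the failure. In the PR itself, the choice between spilling
    and growing is made inside the pool resource instead.
    """
    manager = get_global_manager()
    if manager is None:  # spilling disabled
        return False
    return manager.spill_device_memory(nbytes=nbytes) > 0


# Pool backed by managed memory (UVM), with the spill callback wrapped around it.
pool = rmm.mr.PoolMemoryResource(rmm.mr.ManagedMemoryResource())
mr = rmm.mr.FailureCallbackResourceAdaptor(pool, spill_before_growing)
rmm.mr.set_current_device_resource(mr)
```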

Preliminary results

Running an extended version of @wence-'s join benchmark https://gist.github.com/madsbk/9589dedc45dbcfc828f7274ce3bdabc6 on a DGX-1 (32GiB).

Running with three configs:

  • cudf-spilling: --use-pool --base-memory-resource cuda --use-spilling
  • cudf-spilling+UVM: --use-pool --base-memory-resource managed --use-spilling
  • UVM-only: --use-pool --base-memory-resource managed --no-use-spilling

TL;DR

When spilling isn’t needed, we see some overhead spikes when going from cudf-spilling to cudf-spilling+UVM and UVM-only, but overall the performance is very much on par. It might be possible to avoid these spikes using cudaMemAdvise() (see the sketch below).
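
As an illustration of the cudaMemAdvise() idea, here is a hedged sketch using the cuda-python runtime bindings. `ptr` and `nbytes` are assumed to describe the pool's managed allocation (how to obtain them depends on how the pool exposes its blocks), and `prefer_device` is a hypothetical helper, not something this PR adds.

```python
from cuda import cudart  # cuda-python runtime bindings


def prefer_device(ptr: int, nbytes: int, device: int = 0) -> None:
    """Advise the driver to keep a managed range resident on `device`,
    so first touches after pool creation are less likely to page-fault."""
    (err,) = cudart.cudaMemAdvise(
        ptr,
        nbytes,
        cudart.cudaMemoryAdvise.cudaMemAdviseSetPreferredLocation,
        device,
    )
    if err != cudart.cudaError_t.cudaSuccess:
        raise RuntimeError(f"cudaMemAdvise failed: {err}")
```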

When spilling is needed, the performance of cudf-spilling and cudf-spilling+UVM is very similar, but again we see some overhead spikes.

Finally, when the peak memory usage exceeds device memory, cudf-spilling fails with an OOM error; in this case we need UVM. The performance of cudf-spilling+UVM and UVM-only is similar, but cudf-spilling+UVM is more variable.

Raw Numbers

Everything fits in device memory, no spilling

cudf-spilling
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py 400_000_000 --base-memory-resource cuda --use-pool --use-spilling
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7f253aa31df0>
use_pool=True
use_spilling=True
string_categoricals=False
Left table has 400000000 rows and is 7.45 GiB
Right table is 400000 rows and is 0.01 GiB
medium inner on int repeat=1: 0.34s
medium inner on int repeat=2: 0.34s
medium inner on int repeat=3: 0.34s
medium inner on int repeat=4: 0.34s
medium inner on int repeat=5: 0.34s
medium outer on int repeat=1: 0.34s
medium outer on int repeat=2: 0.34s
medium outer on int repeat=3: 0.34s
medium outer on int repeat=4: 0.34s
medium outer on int repeat=5: 0.34s
medium inner on factor repeat=1: 1.12s
medium inner on factor repeat=2: 1.12s
medium inner on factor repeat=3: 1.12s
medium inner on factor repeat=4: 1.12s
medium inner on factor repeat=5: 1.12s
cudf-spilling+UVM
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py 400_000_000 --base-memory-resource managed --use-pool --use-spilling
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7f274a8bd3f0>
use_pool=True
use_spilling=True
string_categoricals=False
Left table has 400000000 rows and is 7.45 GiB
Right table is 400000 rows and is 0.01 GiB
medium inner on int repeat=1: 0.57s
medium inner on int repeat=2: 0.34s
medium inner on int repeat=3: 0.34s
medium inner on int repeat=4: 0.34s
medium inner on int repeat=5: 0.34s
medium outer on int repeat=1: 0.52s
medium outer on int repeat=2: 0.34s
medium outer on int repeat=3: 0.34s
medium outer on int repeat=4: 0.34s
medium outer on int repeat=5: 0.34s
medium inner on factor repeat=1: 1.17s
medium inner on factor repeat=2: 1.12s
medium inner on factor repeat=3: 1.12s
medium inner on factor repeat=4: 1.12s
medium inner on factor repeat=5: 1.12s
UVM-only
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py 400_000_000 --base-memory-resource managed --use-pool
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7fab56ecd170>
use_pool=True
use_spilling=False
string_categoricals=False
Left table has 400000000 rows and is 7.45 GiB
Right table is 400000 rows and is 0.01 GiB
medium inner on int repeat=1: 0.57s
medium inner on int repeat=2: 0.34s
medium inner on int repeat=3: 0.34s
medium inner on int repeat=4: 0.34s
medium inner on int repeat=5: 0.34s
medium outer on int repeat=1: 0.52s
medium outer on int repeat=2: 0.34s
medium outer on int repeat=3: 0.34s
medium outer on int repeat=4: 0.34s
medium outer on int repeat=5: 0.34s
medium inner on factor repeat=1: 1.17s
medium inner on factor repeat=2: 1.12s
medium inner on factor repeat=3: 1.12s
medium inner on factor repeat=4: 1.12s
medium inner on factor repeat=5: 1.12s

Spilling is required

cudf-spilling
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py 600_000_000 --base-memory-resource cuda --use-pool --use-spilling
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7efd995998f0>
use_pool=True
use_spilling=True
string_categoricals=False
Left table has 600000000 rows and is 11.18 GiB
Right table is 600000 rows and is 0.01 GiB
medium inner on int repeat=1: 1.95s
medium inner on int repeat=2: 3.31s
medium inner on int repeat=3: 3.59s
medium inner on int repeat=4: 3.58s
medium inner on int repeat=5: 3.58s
medium outer on int repeat=1: 4.50s
medium outer on int repeat=2: 4.78s
medium outer on int repeat=3: 4.79s
medium outer on int repeat=4: 4.78s
medium outer on int repeat=5: 4.79s
medium inner on factor repeat=1: 6.67s
medium inner on factor repeat=2: 5.79s
medium inner on factor repeat=3: 5.77s
medium inner on factor repeat=4: 5.78s
medium inner on factor repeat=5: 5.77s
Spill Statistics (level=3):
  Spilling (level >= 1):
    gpu => cpu: 105.16GiB in 44.452s
    cpu => gpu: 88.36GiB in 10.987s
cudf-spilling+UVM
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py 600_000_000 --base-memory-resource managed --use-pool --use-spilling
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7f6283162160>
use_pool=True
use_spilling=True
string_categoricals=False
Left table has 600000000 rows and is 11.18 GiB
Right table is 600000 rows and is 0.01 GiB
medium inner on int repeat=1: 2.01s
medium inner on int repeat=2: 3.20s
medium inner on int repeat=3: 3.49s
medium inner on int repeat=4: 3.50s
medium inner on int repeat=5: 3.53s
medium outer on int repeat=1: 4.76s
medium outer on int repeat=2: 5.40s
medium outer on int repeat=3: 5.34s
medium outer on int repeat=4: 4.69s
medium outer on int repeat=5: 4.73s
medium inner on factor repeat=1: 6.57s
medium inner on factor repeat=2: 5.65s
medium inner on factor repeat=3: 5.66s
medium inner on factor repeat=4: 5.67s
medium inner on factor repeat=5: 5.67s
Spill Statistics (level=3):
  Spilling (level >= 1):
    gpu => cpu: 105.16GiB in 43.841s
    cpu => gpu: 88.36GiB in 12.214s
UVM-only
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py 600_000_000 --base-memory-resource managed --use-pool
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7f2743ff92b0>
use_pool=True
use_spilling=False
string_categoricals=False
Left table has 600000000 rows and is 11.18 GiB
Right table is 600000 rows and is 0.01 GiB
medium inner on int repeat=1: 1.55s
medium inner on int repeat=2: 4.07s
medium inner on int repeat=3: 4.23s
medium inner on int repeat=4: 1.90s
medium inner on int repeat=5: 2.06s
medium outer on int repeat=1: 4.67s
medium outer on int repeat=2: 7.65s
medium outer on int repeat=3: 4.59s
medium outer on int repeat=4: 4.47s
medium outer on int repeat=5: 5.40s
medium inner on factor repeat=1: 8.77s
medium inner on factor repeat=2: 9.29s
medium inner on factor repeat=3: 8.14s
medium inner on factor repeat=4: 13.85s
medium inner on factor repeat=5: 14.92s

UVM is required

cudf-spilling
OOM
cudf-spilling+UVM
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py 700_000_000 --base-memory-resource managed --use-pool --use-spilling
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7f1845c8b6f0>
use_pool=True
use_spilling=True
string_categoricals=False
Left table has 700000000 rows and is 13.04 GiB
Right table is 700000 rows and is 0.01 GiB
medium inner on int repeat=1: 3.11s
medium inner on int repeat=2: 10.05s
medium inner on int repeat=3: 5.95s
medium inner on int repeat=4: 8.77s
medium inner on int repeat=5: 5.42s
medium outer on int repeat=1: 10.37s
medium outer on int repeat=2: 4.88s
medium outer on int repeat=3: 11.70s
medium outer on int repeat=4: 4.95s
medium outer on int repeat=5: 9.82s
medium inner on factor repeat=1: 16.28s
medium inner on factor repeat=2: 18.24s
medium inner on factor repeat=3: 17.66s
medium inner on factor repeat=4: 17.84s
medium inner on factor repeat=5: 17.99s
Spill Statistics (level=3):
  Spilling (level >= 1):
    gpu => cpu: 3.92GiB in 1.626s
    cpu => gpu: 3.92GiB in 0.492s
UVM-only
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py 700_000_000 --base-memory-resource managed --use-pool 
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7f6423872520>
use_pool=True
use_spilling=False
string_categoricals=False
Left table has 700000000 rows and is 13.04 GiB
Right table is 700000 rows and is 0.01 GiB
medium inner on int repeat=1: 7.50s
medium inner on int repeat=2: 5.94s
medium inner on int repeat=3: 7.18s
medium inner on int repeat=4: 7.70s
medium inner on int repeat=5: 6.72s
medium outer on int repeat=1: 8.30s
medium outer on int repeat=2: 7.28s
medium outer on int repeat=3: 9.88s
medium outer on int repeat=4: 6.36s
medium outer on int repeat=5: 9.28s
medium inner on factor repeat=1: 16.43s
medium inner on factor repeat=2: 19.07s
medium inner on factor repeat=3: 18.61s
medium inner on factor repeat=4: 17.92s
medium inner on factor repeat=5: 18.42s

NB

The current code is hacky. If this approach is viable, we should implement the functionality in a standalone rmm resource.

github-actions bot added the Python (Related to RMM Python API) and cpp (Pertains to C++ code) labels on Mar 21, 2024
harrism (Member) commented Mar 21, 2024

@madsbk thank you for this! Would you be able to extract and summarize the important performance numbers? Perhaps in a graph or table to make clear the performance landscape?

madsbk (Member, Author) commented Mar 22, 2024

Some more results. I am now calling cudaMemPrefetchAsync() when initializing the memory pool in order to stabilize the performance.

I am measuring the total time of all the joins (5 repeats each). The timings of individual join repeats vary quite a bit when using UVM; this is expected since we do not reset the memory page locations between repeats.
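
For context, the prefetch mentioned above amounts to moving the pool's initial managed allocation onto the GPU up front (the "do_allocate(managed) - prefetched to device bytes: ..." lines in the raw output below). A hedged sketch of the equivalent call via the cuda-python runtime bindings, with `ptr`/`nbytes` standing in for the pool's initial allocation and `prefetch_to_device` as a hypothetical helper:

```python
from cuda import cudart  # cuda-python runtime bindings


def prefetch_to_device(ptr: int, nbytes: int, device: int = 0) -> None:
    """Prefetch a managed range to `device` on the default stream so the
    benchmark does not pay page-fault costs on first touch."""
    (err,) = cudart.cudaMemPrefetchAsync(ptr, nbytes, device, 0)
    if err != cudart.cudaError_t.cudaSuccess:
        raise RuntimeError(f"cudaMemPrefetchAsync failed: {err}")
```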

Rows   Backend              Total time (sec)
400M   cudf-spilling-only   9.00
400M   cudf-spilling+UVM    9.00
400M   UVM-only             8.99
600M   cudf-spilling-only   67.28
600M   cudf-spilling+UVM    67.56
600M   UVM-only             121.02
700M   cudf-spilling-only   out-of-memory
700M   cudf-spilling+UVM    162.75
700M   UVM-only             166.02
Raw data
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py --use-pool 400_000_000 --base-memory-resource cuda --use-spilling  
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7f9de448b0b0>
use_pool=True
use_spilling=True
string_categoricals=False
Left table has 400000000 rows and is 7.45 GiB
Right table is 400000 rows and is 0.01 GiB
medium inner on int repeat=1: 0.34s
medium inner on int repeat=2: 0.34s
medium inner on int repeat=3: 0.34s
medium inner on int repeat=4: 0.34s
medium inner on int repeat=5: 0.34s
medium outer on int repeat=1: 0.34s
medium outer on int repeat=2: 0.34s
medium outer on int repeat=3: 0.34s
medium outer on int repeat=4: 0.34s
medium outer on int repeat=5: 0.34s
medium inner on factor repeat=1: 1.12s
medium inner on factor repeat=2: 1.12s
medium inner on factor repeat=3: 1.12s
medium inner on factor repeat=4: 1.12s
medium inner on factor repeat=5: 1.12s
Total time: 9.00
Spill Statistics (level=3):
  Spilling (level >= 1): None
  Exposed buffers (level >= 2): None
Exception ignored in: <function RandomState.__del__ at 0x7f9e020fd440>
Traceback (most recent call last):
  File "/datasets/mkristensen/miniforge3/envs/cudf-0319/lib/python3.11/site-packages/cupy/random/_generator.py", line 65, in __del__
ImportError: sys.meta_path is None, Python is likely shutting down
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py --use-pool 400_000_000 --base-memory-resource managed --use-spilling  
do_allocate(managed) - prefetched to device bytes: 32212254720
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7f9741697060>
use_pool=True
use_spilling=True
string_categoricals=False
Left table has 400000000 rows and is 7.45 GiB
Right table is 400000 rows and is 0.01 GiB
medium inner on int repeat=1: 0.34s
medium inner on int repeat=2: 0.34s
medium inner on int repeat=3: 0.34s
medium inner on int repeat=4: 0.34s
medium inner on int repeat=5: 0.34s
medium outer on int repeat=1: 0.34s
medium outer on int repeat=2: 0.34s
medium outer on int repeat=3: 0.34s
medium outer on int repeat=4: 0.34s
medium outer on int repeat=5: 0.34s
medium inner on factor repeat=1: 1.12s
medium inner on factor repeat=2: 1.12s
medium inner on factor repeat=3: 1.12s
medium inner on factor repeat=4: 1.12s
medium inner on factor repeat=5: 1.12s
Total time: 9.00
Spill Statistics (level=3):
  Spilling (level >= 1): None
  Exposed buffers (level >= 2): None
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py --use-pool 400_000_000 --base-memory-resource managed  
do_allocate(managed) - prefetched to device bytes: 32212254720
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7fe06666f420>
use_pool=True
use_spilling=False
string_categoricals=False
Left table has 400000000 rows and is 7.45 GiB
Right table is 400000 rows and is 0.01 GiB
medium inner on int repeat=1: 0.34s
medium inner on int repeat=2: 0.34s
medium inner on int repeat=3: 0.34s
medium inner on int repeat=4: 0.34s
medium inner on int repeat=5: 0.34s
medium outer on int repeat=1: 0.34s
medium outer on int repeat=2: 0.34s
medium outer on int repeat=3: 0.34s
medium outer on int repeat=4: 0.34s
medium outer on int repeat=5: 0.34s
medium inner on factor repeat=1: 1.12s
medium inner on factor repeat=2: 1.12s
medium inner on factor repeat=3: 1.12s
medium inner on factor repeat=4: 1.12s
medium inner on factor repeat=5: 1.12s
Total time: 8.99
Exception ignored in: <function RandomState.__del__ at 0x7fe095fa5440>
Traceback (most recent call last):
  File "/datasets/mkristensen/miniforge3/envs/cudf-0319/lib/python3.11/site-packages/cupy/random/_generator.py", line 65, in __del__
ImportError: sys.meta_path is None, Python is likely shutting down
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py --use-pool 600_000_000 --base-memory-resource cuda --use-spilling  
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7fb388b3ab60>
use_pool=True
use_spilling=True
string_categoricals=False
Left table has 600000000 rows and is 11.18 GiB
Right table is 600000 rows and is 0.01 GiB
medium inner on int repeat=1: 1.91s
medium inner on int repeat=2: 3.18s
medium inner on int repeat=3: 3.47s
medium inner on int repeat=4: 3.47s
medium inner on int repeat=5: 3.45s
medium outer on int repeat=1: 4.36s
medium outer on int repeat=2: 4.62s
medium outer on int repeat=3: 4.63s
medium outer on int repeat=4: 4.62s
medium outer on int repeat=5: 4.62s
medium inner on factor repeat=1: 6.48s
medium inner on factor repeat=2: 5.61s
medium inner on factor repeat=3: 5.62s
medium inner on factor repeat=4: 5.62s
medium inner on factor repeat=5: 5.62s
Total time: 67.28
Spill Statistics (level=3):
  Spilling (level >= 1):
    gpu => cpu: 105.16GiB in 42.732s
    cpu => gpu: 88.36GiB in 10.561s
  Exposed buffers (level >= 2): None
Exception ignored in: <function RandomState.__del__ at 0x7fb3b8469440>
Traceback (most recent call last):
  File "/datasets/mkristensen/miniforge3/envs/cudf-0319/lib/python3.11/site-packages/cupy/random/_generator.py", line 65, in __del__
ImportError: sys.meta_path is None, Python is likely shutting down
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py --use-pool 600_000_000 --base-memory-resource managed --use-spilling  
do_allocate(managed) - prefetched to device bytes: 32212254720
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7fa22586d440>
use_pool=True
use_spilling=True
string_categoricals=False
Left table has 600000000 rows and is 11.18 GiB
Right table is 600000 rows and is 0.01 GiB
medium inner on int repeat=1: 1.90s
medium inner on int repeat=2: 3.20s
medium inner on int repeat=3: 3.49s
medium inner on int repeat=4: 3.47s
medium inner on int repeat=5: 3.48s
medium outer on int repeat=1: 4.41s
medium outer on int repeat=2: 4.64s
medium outer on int repeat=3: 4.65s
medium outer on int repeat=4: 4.64s
medium outer on int repeat=5: 4.64s
medium inner on factor repeat=1: 6.50s
medium inner on factor repeat=2: 5.64s
medium inner on factor repeat=3: 5.63s
medium inner on factor repeat=4: 5.64s
medium inner on factor repeat=5: 5.64s
Total time: 67.56
Spill Statistics (level=3):
  Spilling (level >= 1):
    gpu => cpu: 105.16GiB in 43.151s
    cpu => gpu: 88.36GiB in 10.759s
  Exposed buffers (level >= 2): None
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py --use-pool 600_000_000 --base-memory-resource managed
do_allocate(managed) - prefetched to device bytes: 32212254720
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7fbdb85b09f0>
use_pool=True
use_spilling=False
string_categoricals=False
Left table has 600000000 rows and is 11.18 GiB
Right table is 600000 rows and is 0.01 GiB
medium inner on int repeat=1: 4.30s
medium inner on int repeat=2: 6.42s
medium inner on int repeat=3: 5.75s
medium inner on int repeat=4: 8.27s
medium inner on int repeat=5: 3.92s
medium outer on int repeat=1: 4.50s
medium outer on int repeat=2: 5.38s
medium outer on int repeat=3: 4.46s
medium outer on int repeat=4: 4.49s
medium outer on int repeat=5: 5.30s
medium inner on factor repeat=1: 8.50s
medium inner on factor repeat=2: 15.25s
medium inner on factor repeat=3: 14.66s
medium inner on factor repeat=4: 14.97s
medium inner on factor repeat=5: 14.86s
Total time: 121.02
Exception ignored in: <function RandomState.__del__ at 0x7fbddc43d440>
Traceback (most recent call last):
  File "/datasets/mkristensen/miniforge3/envs/cudf-0319/lib/python3.11/site-packages/cupy/random/_generator.py", line 65, in __del__
ImportError: sys.meta_path is None, Python is likely shutting down
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py --use-pool 700_000_000 --base-memory-resource managed --use-spilling  
do_allocate(managed) - prefetched to device bytes: 32212254720
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7fad40a5d030>
use_pool=True
use_spilling=True
string_categoricals=False
Left table has 700000000 rows and is 13.04 GiB
Right table is 700000 rows and is 0.01 GiB
medium inner on int repeat=1: 3.33s
medium inner on int repeat=2: 9.67s
medium inner on int repeat=3: 5.67s
medium inner on int repeat=4: 8.68s
medium inner on int repeat=5: 5.60s
medium outer on int repeat=1: 10.25s
medium outer on int repeat=2: 4.99s
medium outer on int repeat=3: 11.59s
medium outer on int repeat=4: 5.17s
medium outer on int repeat=5: 9.68s
medium inner on factor repeat=1: 16.23s
medium inner on factor repeat=2: 18.07s
medium inner on factor repeat=3: 17.79s
medium inner on factor repeat=4: 17.93s
medium inner on factor repeat=5: 18.09s
Total time: 162.75
Spill Statistics (level=3):
  Spilling (level >= 1):
    gpu => cpu: 3.92GiB in 1.575s
    cpu => gpu: 3.92GiB in 0.480s
  Exposed buffers (level >= 2): None
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py --use-pool 700_000_000 --base-memory-resource managed
do_allocate(managed) - prefetched to device bytes: 32212254720
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7f7df52d5850>
use_pool=True
use_spilling=False
string_categoricals=False
Left table has 700000000 rows and is 13.04 GiB
Right table is 700000 rows and is 0.01 GiB
medium inner on int repeat=1: 9.01s
medium inner on int repeat=2: 5.57s
medium inner on int repeat=3: 9.24s
medium inner on int repeat=4: 7.02s
medium inner on int repeat=5: 7.45s
medium outer on int repeat=1: 7.36s
medium outer on int repeat=2: 8.85s
medium outer on int repeat=3: 8.60s
medium outer on int repeat=4: 7.80s
medium outer on int repeat=5: 7.39s
medium inner on factor repeat=1: 16.01s
medium inner on factor repeat=2: 18.34s
medium inner on factor repeat=3: 17.33s
medium inner on factor repeat=4: 17.81s
medium inner on factor repeat=5: 18.24s
Total time: 166.02
Exception ignored in: <function RandomState.__del__ at 0x7f7e16491440>
Traceback (most recent call last):
  File "/datasets/mkristensen/miniforge3/envs/cudf-0319/lib/python3.11/site-packages/cupy/random/_generator.py", line 65, in __del__
ImportError: sys.meta_path is None, Python is likely shutting down

The results show that, at least in this case, combining cudf-spilling with UVM clearly outperforms UVM-only without any real downside.

I think the next step is to implement this in a less intrusive way and test more workflows and hardware setups. For example, how does cudf-spilling+UVM perform when running on multiple GPUs using UCX?

@@ -203,7 +211,7 @@ class stream_ordered_memory_resource : public crtp<PoolResource>, public device_

   if (size <= 0) { return nullptr; }

-  lock_guard lock(mtx_);
+  // lock_guard lock(mtx_);

question: why comment out the lock?

madsbk (Member, Author) replied:

That is because of a deadlock that would otherwise trigger when the allocation results in cudf-spilling: cudf will find another buffer to spill and deallocate its memory, which also requires the lock.

I haven't given this too much thought, but I think this could be handled with a reentrant lock.

github-actions bot added the ci label on Apr 9, 2024
madsbk changed the base branch from branch-24.04 to branch-24.06 on April 9, 2024
Labels: ci, cpp (Pertains to C++ code), Python (Related to RMM Python API)
Project status: Blocked