Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Warnings from methods that raise exceptions and are marked nogil. #1365

Open
bdice opened this issue Oct 19, 2023 · 7 comments
Open

[BUG] Warnings from methods that raise exceptions and are marked nogil. #1365

bdice opened this issue Oct 19, 2023 · 7 comments
Labels
? - Needs Triage Need team to review and classify bug Something isn't working

Comments

@bdice
Copy link
Contributor

bdice commented Oct 19, 2023

Describe the bug
RMM is raising warnings during builds due to except being used on methods marked as nogil. These declarations are not compatible because the GIL must be acquired for exception checks. See build logs below (also linked here):

[6/22] Generating CXX source rmm/_cuda/stream.cxx
  performance hint: /__w/rmm/rmm/python/rmm/_cuda/stream.pyx:71:17: Exception check will always require the GIL to be acquired. Declare the function as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
  performance hint: /__w/rmm/rmm/python/rmm/_cuda/stream.pyx:78:30: Exception check will always require the GIL to be acquired.
  Possible solutions:
        1. Declare the function as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
        2. Use an 'int' return type on the function to allow an error code to be returned.
  performance hint: /__w/rmm/rmm/python/rmm/_cuda/stream.pyx:84:24: Exception check will always require the GIL to be acquired. Declare the function as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
  [7/22] Generating CXX source rmm/_lib/torch_allocator.cxx
  performance hint: /__w/rmm/rmm/python/rmm/_lib/torch_allocator.pyx:17:5: Exception check on 'deallocate' will always require the GIL to be acquired.
  Possible solutions:
        1. Declare the function as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
        2. Use an 'int' return type on the function to allow an error code to be returned.
  [8/22] Building CXX object rmm/_lib/CMakeFiles/logger.dir/logger.cxx.o
  [9/22] Building CXX object rmm/_lib/CMakeFiles/cuda_stream.dir/cuda_stream.cxx.o
  [10/22] Building CXX object rmm/_cuda/CMakeFiles/stream.dir/stream.cxx.o
  [11/22] Linking CXX shared module rmm/_lib/cuda_stream.cpython-310-x86_64-linux-gnu.so
  [12/22] Linking CXX shared module rmm/_lib/logger.cpython-310-x86_64-linux-gnu.so
  [13/22] Linking CXX shared module rmm/_cuda/stream.cpython-310-x86_64-linux-gnu.so
  [14/22] Building CXX object rmm/_lib/CMakeFiles/torch_allocator.dir/torch_allocator.cxx.o
  [15/22] Linking CXX shared module rmm/_lib/torch_allocator.cpython-310-x86_64-linux-gnu.so
  [16/22] Generating CXX source rmm/_lib/memory_resource.cxx
  performance hint: /__w/rmm/rmm/python/rmm/_lib/memory_resource.pyx:578:5: Exception check on '_deallocate_callback_wrapper' will always require the GIL to be acquired.
  Possible solutions:
        1. Declare the function as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
        2. Use an 'int' return type on the function to allow an error code to be returned.
  [17/22] Building CXX object rmm/_lib/CMakeFiles/memory_resource.dir/memory_resource.cxx.o
  [18/22] Linking CXX shared module rmm/_lib/memory_resource.cpython-310-x86_64-linux-gnu.so
  [19/22] Generating CXX source rmm/_lib/device_buffer.cxx
  performance hint: /__w/rmm/rmm/python/rmm/_lib/device_buffer.pyx:381:5: Exception check on '_copy_async' will always require the GIL to be acquired.
  Possible solutions:
        1. Declare the function as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
        2. Use an 'int' return type on the function to allow an error code to be returned.
  performance hint: /__w/rmm/rmm/python/rmm/_lib/device_buffer.pyx:85:68: Exception check will always require the GIL to be acquired. Declare the function as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
  performance hint: /__w/rmm/rmm/python/rmm/_lib/device_buffer.pyx:87:75: Exception check will always require the GIL to be acquired. Declare the function as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
  performance hint: /__w/rmm/rmm/python/rmm/_lib/device_buffer.pyx:89:38: Exception check will always require the GIL to be acquired. Declare the function as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
  performance hint: /__w/rmm/rmm/python/rmm/_lib/device_buffer.pyx:90:40: Exception check will always require the GIL to be acquired.
  Possible solutions:
        1. Declare the function as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
        2. Use an 'int' return type on the function to allow an error code to be returned.
  performance hint: /__w/rmm/rmm/python/rmm/_lib/device_buffer.pyx:443:70: Exception check will always require the GIL to be acquired. Declare the function as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
  performance hint: /__w/rmm/rmm/python/rmm/_lib/device_buffer.pyx:442:19: Exception check will always require the GIL to be acquired.
  Possible solutions:
        1. Declare the function as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
        2. Use an 'int' return type on the function to allow an error code to be returned.
  performance hint: /__w/rmm/rmm/python/rmm/_lib/device_buffer.pyx:486:70: Exception check will always require the GIL to be acquired. Declare the function as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
  performance hint: /__w/rmm/rmm/python/rmm/_lib/device_buffer.pyx:485:19: Exception check will always require the GIL to be acquired.
  Possible solutions:
        1. Declare the function as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
        2. Use an 'int' return type on the function to allow an error code to be returned.
  performance hint: /__w/rmm/rmm/python/rmm/_lib/device_buffer.pyx:519:72: Exception check will always require the GIL to be acquired. Declare the function as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
  performance hint: /__w/rmm/rmm/python/rmm/_lib/device_buffer.pyx:518:19: Exception check will always require the GIL to be acquired.
  Possible solutions:
        1. Declare the function as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
        2. Use an 'int' return type on the function to allow an error code to be returned.
@bdice bdice added bug Something isn't working ? - Needs Triage Need team to review and classify labels Oct 19, 2023
@bdice bdice changed the title [BUG] Exceptions raised by methods that are marked nogil. [BUG] Warnings from methods that raise exceptions and are marked nogil. Oct 19, 2023
@bdice
Copy link
Contributor Author

bdice commented Oct 19, 2023

I don't know how to reconcile this warning with

cdef void c_synchronize(self) except * nogil:
"""
Synchronize the CUDA stream.
This function *must* be called in a `with nogil` block
"""
self.view().synchronize()
def synchronize(self):
"""
Synchronize the CUDA stream
"""
with nogil:
self.c_synchronize()

It says the function must be called as nogil but I am unsure if this synchronize function is allowed to raise an exception...

@wence-
Copy link
Contributor

wence- commented Oct 23, 2023

tl;dr: I think we should replace the pattern:

cdef void c_foo(self) except * nogil:
    do_foo()
def foo(self):
   with nogil:
      self.c_foo()

With

def foo(self):
   with nogil:
      do_foo()

Let's think about what the intention is here, to understand why the function should be called in a nogil block. In this code, there are two different locks in play:

  1. The Python GIL
  2. The CUDA driver lock

So what we're trying to avoid is a lock ordering problem where one thread hold the GIL and is trying to acquire the driver lock while another holds the driver lock and is trying to acquire the GIL.

self.view().synchronize() calls cudaStreamSynchronize which needs to acquire the driver lock, and potentially throws depending on whether the synchronization produced an error. Without releasing the GIL, we're in Python so we must have the GIL as well. So we hold one lock and acquire the other one.

What could be happening on a different thread? A CUDA call could be in progress that (due to a user callback) needs to acquire the Python GIL while the driver lock is held. So we have a potential for deadlock (thread A holds the GIL and needs the driver lock, thread B holds the driver lock and needs the GIL).

So the solution must be (in RMM) to release the lock we don't need (the GIL) when we call into RMM's C++ layer. By hand, this would be something like:

void py_synchronize(...) 
{
  PyGIL_Release();
  try {
      synchronize(...);
  } catch (e) {
      PyGIL_Ensure():
      raise_as_python_exception(e);
      PyGIL_Release();
  }
}

Translating the C++ exception to a Python one can happily take the GIL because the assumption is that we've dropped whatever locks we took in the driver at the end of calling the C++-level synchronize function.

Here's a sketch of the different options we have (AIUI):

cdef extern from "foo.h" nogil:
    void do_foo() except + nogil


# - Call to C++ function is done with GIL released.
# - _Requires_ that the caller holds the GIL on entry
# - _Acquires_ the GIL it on exit
# - Caller checks for exception by doing PyErr_Occurred()
cdef void c_foo1() except *:
    with nogil:
        do_foo()


# Current status quo, produces warnings
# - Call to C++ function is done with current GIL state (caller must release GIL)
# - Acquires GIL if an exception was thrown in C++
# - Releases GIL on exit (if acquired due to exception)
# - Caller checks for exception by reacquiring the GIL and doing PyErr_Occurred()
cdef void c_foo2() except * nogil:
    do_foo()


# No warnings, but an exception does not produce a useful traceback
# Functionally identical to status quo except worse traceback
# - Call to C++ function is done with current GIL state (caller must release GIL)
# - Acquires GIL if an exception was thrown in C++
# - Releases GIL on exit (if acquired due to exception)
# - Caller does not check for exception (although PyErr is set, so a
#   later check will pick it up, but without a traceback)
cdef void c_foo3() noexcept nogil:
    do_foo()


# No warnings
# - Call to C++ function is done with current GIL state (caller must release GIL)
# - Acquires GIL if an exception was thrown in C++
# - Releases GIL on exit (if acquired due to exception)
# - Indicates an error occurred to the caller by returning -1
cdef int c_foo4() except -1 nogil:
    do_foo()


def foo(*, variant):
    if variant == 1:
        c_foo1()
    elif variant == 2:
        with nogil:
            c_foo2()
    elif variant == 3:
        with nogil:
            c_foo3()
    elif variant == 4:
        with nogil:
            c_foo4()
    elif variant == 5:
        # - Call to C++ is done with GIL released
        # - We're in Python so we have the GIL and reacquire it after
        #   this block.
        # - Exceptions are translated with GIL acquired
        with nogil:
            do_foo()

Of these, I think variant 5 is the best option. What we're kind of fighting here is that we have Cython-level functions which we want to catch and re-throw C++ exceptions (so that we can propagate up the call stack). But that's not possible, as soon as you leave extern cdef land Cython will convert any exception into a Python one. We could refactor everything to return exceptions through return values (that's the except -1 dance or variant 4).

This performance hint is kind of silly for our use case, all of the different approaches must acquire the GIL to translate the exception to a Python one. The only minor differences are in how the exception information is propagated to the caller.

I presume that we have synchonize(self) and c_synchronize(self) because we thought other people might use our Cython wrappings of the RMM C++ classes. But I couldn't find any such examples. Some rapids libraries use the rmm._lib wrappers which just export the C++ classes, but that's fine.

Consequently, what we need is the Python-exported functions on the RMM classes, which by definition are holding the GIL on entry, and must arrange to drop it, at which point they should directly call the extern cdef C++ definition. We can then remove the c_foo version of the functions completely.

Unfortunately, there is no way to mark an extern cdef function as to be called with the GIL dropped, so we must laboriously go through the codebase and ensure we're doing so manually.

@jakirkham
Copy link
Member

Another option is pin to Cython 0.29 until we have a chance to make these changes

@wence-
Copy link
Contributor

wence- commented Oct 26, 2023

They're not errors at the moment, so I think pinning would be a backward step (and we'd have to go through the rest of rapids and pin there too...)

@vyasr
Copy link
Contributor

vyasr commented Nov 3, 2023

I don't think this will ever be an error because it's perfectly valid code. It's just telling you that it might be slower than you anticipated because of acquiring the GIL. I wouldn't consider that grounds for reversion because all that reversion would do is silence the warning, not change the behavior.

@wence-
Copy link
Contributor

wence- commented Nov 6, 2023

I don't think this will ever be an error because it's perfectly valid code. It's just telling you that it might be slower than you anticipated because of acquiring the GIL. I wouldn't consider that grounds for reversion because all that reversion would do is silence the warning, not change the behavior.

Agreed, though I think we should consider how to restructure such that we don't get this warning so as we can keep track of cases where warnings might actually be important.

@vyasr
Copy link
Contributor

vyasr commented Nov 6, 2023

I agree. I haven't had a chance to read through your very thorough list of proposals, but I imagine we'll want to apply one of them to fix the issue properly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
? - Needs Triage Need team to review and classify bug Something isn't working
Projects
Status: Todo
Development

No branches or pull requests

4 participants