Tests hang on Windows (RX7900XT) #653

Open
Victorious3 opened this issue Jul 13, 2024 · 1 comment

@Victorious3

I'm trying to get AMDGPU to work on Windows with an RX 7900 XT.
The test output shows that both the GPU and my iGPU are found. However, the tests hang. After interrupting them, I got this output:

┌ Warning: MIOpen is unavailable, functionality will be disabled.
└ @ AMDGPU C:\Users\Vic\.julia\packages\AMDGPU\WqMSe\src\AMDGPU.jl:216
Julia Version 1.10.2
Commit bd47eca2c8 (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 32 × AMD Ryzen 9 7950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 32 virtual cores)
Environment:
  JULIA_LOAD_PATH = @;C:\Users\Vic\AppData\Local\Temp\jl_1YXWuJ
[ Info: AMDGPU versioninfo
┌───────────┬──────────────────┬───────────┬────────────────────────────────────────────────────────────────────────────
│ Available │ Name             │ Version   │ Path                                                                      ⋯
├───────────┼──────────────────┼───────────┼────────────────────────────────────────────────────────────────────────────
│     +     │ LLD              │ -         │ C:\\Program Files\\AMD\\ROCm\\6.1\\bin\\ld.lld.exe                        ⋯
│     +     │ Device Libraries │ -         │ C:\\Users\\Vic\\.julia\\artifacts\\5ad5ecb46e3c334821f54c1feecc6c152b7b6a ⋯
│     +     │ HIP              │ 5.7.32000 │ C:\\Windows\\SYSTEM32\\amdhip64.DLL                                       ⋯
│     +     │ rocBLAS          │ 4.1.2     │ C:\\Program Files\\AMD\\ROCm\\6.1\\bin\\rocblas.dll                       ⋯
│     +     │ rocSOLVER        │ 3.25.0    │ C:\\Program Files\\AMD\\ROCm\\6.1\\bin\\rocsolver.dll                     ⋯
│     +     │ rocALUTION       │ -         │ C:\\Program Files\\AMD\\ROCm\\6.1\\bin\\rocalution.dll                    ⋯
│     +     │ rocSPARSE        │ -         │ C:\\Program Files\\AMD\\ROCm\\6.1\\bin\\rocsparse.dll                     ⋯
│     +     │ rocRAND          │ 2.10.5    │ C:\\Program Files\\AMD\\ROCm\\6.1\\bin\\rocrand.dll                       ⋯
│     +     │ rocFFT           │ 1.0.27    │ C:\\Program Files\\AMD\\ROCm\\6.1\\bin\\rocfft.dll                        ⋯
│     -     │ MIOpen           │ -         │ -                                                                         ⋯
└───────────┴──────────────────┴───────────┴────────────────────────────────────────────────────────────────────────────
                                                                                                        1 column omitted

[ Info: AMDGPU devices
┌────┬─────────────────────────┬──────────┬───────────┬────────────┐
│ Id │                    Name │ GCN arch │ Wavefront │     Memory │
├────┼─────────────────────────┼──────────┼───────────┼────────────┤
│  1 │   AMD Radeon RX 7900 XT │  gfx1100 │        32 │ 19.984 GiB │
│  2 │ AMD Radeon(TM) Graphics │  gfx1036 │        32 │ 24.003 GiB │
└────┴─────────────────────────┴──────────┴───────────┴────────────┘

[ Info: Test suite info
┌─────────┬───────────────────────────────────────────────────────────────┬───────────────────────────────────────────────┐
│ Workers │                                                        Device │                                         Tests │
├─────────┼───────────────────────────────────────────────────────────────┼───────────────────────────────────────────────┤
│       2 │ HIPDevice(id=1, name=AMD Radeon RX 7900 XT, gcn_arch=gfx1100) │ core, hip, ext, gpuarrays, kernelabstractions │
└─────────┴───────────────────────────────────────────────────────────────┴───────────────────────────────────────────────┘
[ Info: Scanning for test items in project `AMDGPU` at paths: C:\Users\Vic\.julia\packages\AMDGPU\WqMSe
[ Info: Finished scanning for test items in 0.51 seconds. Scheduling 34 tests on pid 15192 with 2 worker processes and 1 threads per worker.
[ Info: Starting test workers
  Worker 27588:  [ Info: Starting test worker 2 on pid = 27588, with 1 threads
  Worker 27644:  [ Info: Starting test worker 1 on pid = 27644, with 1 threads
[ Info: Starting running test items
  Worker 27588:  18:57:42 | maxrss  0.5% | mem 21.1% | START ( 2/34) test item "gpuarrays - reductions/== isequal" at test\gpuarrays_tests.jl:57
  Worker 27644:  18:57:42 | maxrss  0.6% | mem 21.1% | START ( 1/34) test item "core" at test\core_tests.jl:1

     Testing Tests interrupted. Exiting the test process

  Worker 27588:  18:58:30 | maxrss  1.7% | mem 21.5% | DONE  ( 2/34) test item "gpuarrays - reductions/== isequal" 45.3 secs (72.7% compile, <0.1% recompile, 3.3% GC), 82.24 M allocs (4.883 GB)

Captured Logs for test item "core" at test\core_tests.jl:1 on worker 27644
:0:C:\constructicon\builds\gfx\eleven\24.10\drivers\compute\clr\hipamd\src\hip_fatbin.hpp:74  : 59097994949 us: [pid:27644 tid:0x7412] Invalid DeviceId less than 0
┌ Error: Worker(pid=27644, terminated=true, termsignal=0) died running test item "core". Recording test error.
└ @ ReTestItems C:\Users\Vic\.julia\packages\ReTestItems\VrjGK\src\ReTestItems.jl:585
  Worker 27588:  fatal: error thrown and no exception handler available.
InterruptException()


Captured logs for test setup "TSGPUArrays" (dependency of "gpuarrays - reductions/== isequal") at test\gpuarrays_tests.jl:1 on worker 27588
┌ Warning: MIOpen is unavailable, functionality will be disabled.
└ @ AMDGPU C:\Users\Vic\.julia\packages\AMDGPU\WqMSe\src\AMDGPU.jl:216
No Captured Logs for test item "gpuarrays - reductions/== isequal" at test\gpuarrays_tests.jl:57 on worker 27588
┌ Error: Worker(pid=27588, terminated=true, termsignal=15) timed out running test item "gpuarrays - reductions/== isequal" after 1800 seconds. Recording test error.
└ @ ReTestItems C:\Users\Vic\.julia\packages\ReTestItems\VrjGK\src\ReTestItems.jl:579
  Worker 35092:  [ Info: Starting test worker on pid = 35092, with 1 threads
  Worker 3380:  [ Info: Starting test worker on pid = 3380, with 1 threads
  Worker 35092:  20:00:10 | maxrss  0.6% | mem 37.0% | START ( 3/34) test item "core: device" at test\device_tests.jl:1
  Worker 3380:  20:00:10 | maxrss  0.6% | mem 37.0% | START ( 4/34) test item "gpuarrays - reductions/any all count" at test\gpuarrays_tests.jl:60

The relevant line seems to be:

:0:C:\constructicon\builds\gfx\eleven\24.10\drivers\compute\clr\hipamd\src\hip_fatbin.hpp:74  : 59097994949 us: [pid:27644 tid:0x7412] Invalid DeviceId less than 0

Do I have to explicitly give it a GPU to run on, or is this some other issue?

@pxl-th
Collaborator

pxl-th commented Jul 14, 2024

There are some known issues with multi-GPU setups; I'm not sure whether this is one of them:
#648

You can disable multi-GPU tests by setting HIP_VISIBLE_DEVICES=0 to see if the hangs disappear.
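A minimal sketch of one way to apply this from Julia itself, so that it works the same on Windows as elsewhere. It assumes AMDGPU is installed in the active default environment and that HIP device `0` is the RX 7900 XT. HIP indices are 0-based while AMDGPU's device table starts at 1, so this mapping is likely but not guaranteed. The helper name `single_gpu_test_cmd` is hypothetical. It relies only on Base's `addenv` and `Base.julia_cmd()`.

```julia
# Hypothetical helper: returns a `Cmd` that starts a fresh Julia process and
# runs the AMDGPU test suite with only one HIP device visible.
# Assumption: HIP device "0" is the discrete RX 7900 XT (HIP indices are
# 0-based; AMDGPU's device table lists Ids starting at 1).
function single_gpu_test_cmd(device::AbstractString = "0")
    # Base.julia_cmd() refers to the currently running Julia executable.
    # Single quotes keep the -e argument literal inside the Cmd.
    cmd = `$(Base.julia_cmd()) -e 'using Pkg; Pkg.test("AMDGPU")'`
    # addenv copies the current environment and adds/overrides the variable,
    # so only the child process sees the restricted device list.
    return addenv(cmd, "HIP_VISIBLE_DEVICES" => device)
end
```

From the REPL, `run(single_gpu_test_cmd())` then runs the suite. If the hang no longer occurs with a single visible device, that would point toward the multi-GPU problem in #648.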

Another possible cause is the same one behind the hangs we had with Navi 3 until recently. Those were fixed upstream, since they were driver/ROCm issues: #650 (comment)
