Trying to import AMDGPU fails with (an LLVM?) error #579

Open · pulkin opened this issue Jan 6, 2024 · 10 comments
@pulkin commented Jan 6, 2024

: CommandLine Error: Option 'disassemble' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options

Not sure where to approach this from. Platform is Fedora 39.

> dnf list installed "rocm-*"
Installed Packages
rocm-clinfo.x86_64          5.7.1-1.fc39   @updates
rocm-comgr.x86_64           17.0-3.fc39    @updates
rocm-comgr-devel.x86_64     17.0-3.fc39    @updates
rocm-device-libs.x86_64     17.1-1.fc39    @updates
rocm-hip.x86_64             5.7.1-1.fc39   @updates
rocm-hip-devel.x86_64       5.7.1-1.fc39   @updates
rocm-opencl.x86_64          5.7.1-1.fc39   @updates
rocm-runtime.x86_64         5.7.1-1.fc39   @updates
rocm-runtime-devel.x86_64   5.7.1-1.fc39   @updates

Traced it down to this:

julia> import Libdl

julia> Libdl.dlpath("libamdhip64")
"/usr/bin/../lib64/julia/../libamdhip64.so"

julia> Libdl.dlpath("libamdhip64")
: CommandLine Error: Option 'disassemble' registered more than once!
@pxl-th (Collaborator) commented Jan 7, 2024

My guess is that HIP is built with statically linked LLVM on Fedora, when it should probably be linked dynamically.
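
One way to check (a sketch on my part, using the path Libdl.dlpath printed above; not a definitive diagnostic) is to look for libLLVM among the library's dynamic dependencies:

# Using the path Libdl.dlpath resolved above. If a libLLVM-XX.so line shows
# up, HIP links LLVM dynamically; if nothing matches, LLVM is most likely
# compiled statically into libamdhip64 itself.
deps = read(`ldd /usr/bin/../lib64/julia/../libamdhip64.so`, String)
foreach(println, filter(l -> occursin("LLVM", l), split(deps, '\n')))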

@pxl-th (Collaborator) commented Jan 7, 2024

You can check if this is true by dev'ing the AMDGPU package with ]dev AMDGPU.
Then, in the ~/.julia/dev/AMDGPU.jl directory, create a LocalPreferences.toml file with the following content:

[AMDGPU]
use_artifacts = true

Then try importing it again. Artifacts don't include all the libraries and ship an older ROCm version, but at least you'll be able to confirm that a dynamically linked LLVM is what you need.

@pulkin (Author) commented Jan 7, 2024

Thanks. Tried creating ~/.julia/dev/AMDGPU/LocalPreferences.toml with that content, but the error is the same. The stack trace points to ~/.julia/dev/AMDGPU, so it seems like the right place to do it, but there is no effect. If I corrupt the LocalPreferences.toml layout, that is ignored as well.

@pulkin (Author) commented Jan 7, 2024

Managed to force artifacts through JULIA_AMDGPU_DISABLE_ARTIFACTS=false julia. It imports now; thank you!

@pxl-th (Collaborator) commented Jan 7, 2024

> Tried creating ~/.julia/dev/AMDGPU/LocalPreferences.toml with that content, but the error is the same. The stack trace points to ~/.julia/dev/AMDGPU, so it seems like the right place to do it, but there is no effect. If I corrupt the LocalPreferences.toml layout, that is ignored as well.

Forgot to mention that you then need to launch julia with project set to the AMDGPU.jl folder:

julia --project=~/.julia/dev/AMDGPU.jl

Otherwise, you should put that file where your current project is (and adjust the project path accordingly).
But for the global project it is better to use the environment variable, yes.
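
Equivalently (a sketch assuming the default ]dev checkout path ~/.julia/dev/AMDGPU; adjust it if your clone is named differently), you can activate that project from inside the REPL before importing:

import Pkg

# Activate the dev'ed checkout so that its LocalPreferences.toml is the one
# Preferences.jl picks up, then load the package.
Pkg.activate(expanduser("~/.julia/dev/AMDGPU"))
import AMDGPU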

The downside of artifacts is that you can only use Julia kernels, so things like matmul (rocBLAS) are not available (among other things).
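
"Julia kernels" here means kernels you write yourself and launch with @roc; a minimal sketch (assuming the workitemIdx/workgroupIdx/workgroupDim indexing API and a single workgroup to keep the launch simple):

using AMDGPU

# Elementwise addition written as a plain Julia kernel; no rocBLAS involved.
function vadd!(c, a, b)
    i = workitemIdx().x + (workgroupIdx().x - 1) * workgroupDim().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return
end

n = 256
a = ROCArray(rand(Float32, n))
b = ROCArray(rand(Float32, n))
c = similar(a)

@roc groupsize=n vadd!(c, a, b)   # one workgroup of n workitems
AMDGPU.synchronize()
@assert Array(c) ≈ Array(a) .+ Array(b)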

You're welcome :)

@pulkin (Author) commented Jan 7, 2024

Most tests pass; 263 errored, 19 broken. Back to the original issue: my system library does have libLLVM-17.so among its dependencies, though.

> ldd /usr/bin/../lib64/julia/../libamdhip64.so
	linux-vdso.so.1 (0x00007ffd6f2f4000)
	libamd_comgr.so.2 => /lib64/libamd_comgr.so.2 (0x00007f7b42800000)
	libhsa-runtime64.so.1 => /lib64/libhsa-runtime64.so.1 (0x00007f7b42400000)
	libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f7b448f8000)
	libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f7b42000000)
	libm.so.6 => /lib64/libm.so.6 (0x00007f7b4271f000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f7b448d4000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f7b41e1e000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f7b44924000)
	liblldELF.so.17 => /lib64/liblldELF.so.17 (0x00007f7b41a00000)
	liblldCommon.so.17 => /lib64/liblldCommon.so.17 (0x00007f7b431d4000)
	libclang-cpp.so.17 => /lib64/libclang-cpp.so.17 (0x00007f7b3d800000)
	libLLVM-17.so => /lib64/libLLVM-17.so (0x00007f7b36200000)
	libhsakmt.so.1 => /lib64/libhsakmt.so.1 (0x00007f7b431a6000)
	libelf.so.1 => /lib64/libelf.so.1 (0x00007f7b43189000)
	libdrm.so.2 => /lib64/libdrm.so.2 (0x00007f7b43172000)
	libz.so.1 => /lib64/libz.so.1 (0x00007f7b43158000)
	libffi.so.8 => /lib64/libffi.so.8 (0x00007f7b43148000)
	libedit.so.0 => /lib64/libedit.so.0 (0x00007f7b426e2000)
	libtinfo.so.6 => /lib64/libtinfo.so.6 (0x00007f7b426ad000)
	libdrm_amdgpu.so.1 => /lib64/libdrm_amdgpu.so.1 (0x00007f7b4313b000)
	libzstd.so.1 => /lib64/libzstd.so.1 (0x00007f7b42344000)

To give you some context, I am looking into porting this small CUDA PoC:

https://github.com/jinwen-yang/cuPDLP.jl/tree/master

to run on my 6600. It does not look like there is a lot to port (it uses zeros, norm, dot, and sparse arrays). But maybe sparse arrays, or something else, are out of the question until I resolve the original question?
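
For the dense parts, a rough sketch of how those operations map onto ROCArray (my own illustration, not from the PoC; whether dot/norm go through rocBLAS or a generic GPU fallback is exactly what the artifact question above affects):

using AMDGPU, LinearAlgebra

# Dense counterparts of the operations the PoC uses.
x = AMDGPU.zeros(Float32, 1024)     # like CUDA.zeros
y = ROCArray(rand(Float32, 1024))
d = dot(x, y)
nrm = norm(y)
# Sparse matrices would likely need rocSPARSE from a full ROCm install.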

@pxl-th (Collaborator) commented Jan 8, 2024

Mine looks like this:

$ ldd /opt/rocm/lib/libamdhip64.so
	linux-vdso.so.1 (0x00007ffec6387000)
	libamd_comgr.so.2 => /opt/rocm/lib/libamd_comgr.so.2 (0x00007f6a7d600000)
	libhsa-runtime64.so.1 => /opt/rocm/lib/libhsa-runtime64.so.1 (0x00007f6a7d200000)
	libnuma.so.1 => /lib/x86_64-linux-gnu/libnuma.so.1 (0x00007f6a87beb000)
	libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f6a7ce00000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f6a87b04000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f6a87ae2000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f6a7ca00000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f6a87c0f000)
	libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f6a87ac6000)
	libtinfo.so.6 => /lib/x86_64-linux-gnu/libtinfo.so.6 (0x00007f6a85fce000)
	libelf.so.1 => /lib/x86_64-linux-gnu/libelf.so.1 (0x00007f6a85fb0000)
	libdrm.so.2 => /opt/amdgpu/lib/x86_64-linux-gnu/libdrm.so.2 (0x00007f6a85f96000)
	libdrm_amdgpu.so.1 => /opt/amdgpu/lib/x86_64-linux-gnu/libdrm_amdgpu.so.1 (0x00007f6a87ab6000)

Not really sure what to suggest besides recompiling HIP without linking against LLVM, but then you'd need to change this line to point to your .so. I'm also not sure whether other ROCm libraries link against LLVM as well; MIOpen, for example, uses it to JIT-compile some of its kernels at runtime.

As an alternative approach, I prefer to get ROCm from the official install script, which links dynamically, but it does not have Fedora support.

@wgmitchener commented:

Making some connections: Here's a thread on a Julia forum about this issue. One post there suggests the problem comes from how the ROCm .so file is opened:

https://discourse.julialang.org/t/failed-while-initializing-amdgpu-jl-with-llvm-17-and-rocm-6-1-on-fedora-40/118341/3
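
If it really is about how the library is opened, one experiment (entirely my assumption, not a confirmed fix from that thread) is to dlopen the system library by its full path with RTLD_DEEPBIND, so that its libLLVM-17 symbols are bound inside that library instead of clashing with the LLVM Julia already has loaded:

import Libdl

# Hypothetical experiment: open Fedora's HIP runtime with deep binding so
# its own libLLVM-17 symbols take precedence within the library.
Libdl.dlopen("/usr/lib64/libamdhip64.so",
             Libdl.RTLD_LAZY | Libdl.RTLD_DEEPBIND)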

@billmclean commented:

I can confirm that patching find_rocm_library so it finds the libraries that Fedora installs under /usr/lib64 seems to solve the problem.

@pxl-th (Collaborator) commented Sep 29, 2024

We can then add a set of predefined locations to look in, instead of relying on dlopen.
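
Roughly along these lines (a hypothetical helper, not AMDGPU.jl's actual find_rocm_library; the directory list is only an example that includes Fedora's /usr/lib64):

# Try a list of known install prefixes first, and only fall back to letting
# dlopen search the default paths if nothing is found.
const ROCM_LIB_DIRS = ("/opt/rocm/lib", "/usr/lib64", "/usr/lib/x86_64-linux-gnu")

function locate_rocm_library(name::AbstractString)
    for dir in ROCM_LIB_DIRS
        for candidate in (joinpath(dir, name * ".so"), joinpath(dir, name))
            isfile(candidate) && return candidate
        end
    end
    return name  # let dlopen resolve it via the default search path
end

locate_rocm_library("libamdhip64")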
