
Add disk cache infrastructure for Julia 1.11 #557

Merged: 11 commits into master, Jun 28, 2024

Conversation

@vchuravy (Member) commented Apr 3, 2024:

Replaces #351.

Using JuliaLang/julia#52233 to ensure inference correctness. Note that the CodeInstances need to be precompiled, which we don't yet have nice infrastructure for.

Might need JuliaLang/julia#53943 so that we can encode the dependencies of the CodeInstance, instead of just GPUCompiler.

TODO:

  • Wire up tests
  • Add a precompile function
  • Hash CompilerConfig to be robust against compiler param changes (see the sketch below)
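
For the last item, a minimal sketch of what such a hash could look like (assuming CompilerConfig carries target and params fields; the PR's actual key derivation may differ):

function cache_key(cfg, ft::Type, tt::Type)
    key = hash(ft, hash(tt))      # function type and argument types
    key = hash(cfg.target, key)   # invalidate when the target changes
    key = hash(cfg.params, key)   # invalidate when compiler params change
    return string(key; base=16)
end

Note that Base.hash falls back to objectid for mutable structs, which is not stable across sessions, so a persistent on-disk key would need to hash the config fields structurally.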

codecov bot commented Apr 16, 2024:

Codecov Report

Attention: Patch coverage is 51.72414% with 28 lines in your changes missing coverage. Please review.

Project coverage is 71.80%. Comparing base (ee9077d) to head (3cfb3c2).

Files              Patch %   Lines
src/execution.jl   61.22%    19 Missing ⚠️
src/jlgen.jl       0.00%     9 Missing ⚠️

@@            Coverage Diff             @@
##           master     #557      +/-   ##
==========================================
- Coverage   74.75%   71.80%   -2.95%     
==========================================
  Files          24       24              
  Lines        3414     3469      +55     
==========================================
- Hits         2552     2491      -61     
- Misses        862      978     +116     


@vchuravy marked this pull request as ready for review April 17, 2024 03:23
@vchuravy (Member Author) commented:

Okay, the big caveat is that this will only work with code that has been precompiled, so we will need to make sure our inference caches are hot.

@vchuravy (Member Author) commented:

Using this small test module:

module GemmDenseCUDA

using PrecompileTools

import CUDA
import CUDA: i32

const BLOCK_SIZE = 32

# Naive GEMM: each thread computes one element of C.
function gemm!(A, B, C)
    row = (CUDA.blockIdx().x - Int32(1)) * CUDA.blockDim().x + CUDA.threadIdx().x
    col = (CUDA.blockIdx().y - Int32(1)) * CUDA.blockDim().y + CUDA.threadIdx().y

    sum = zero(eltype(C))

    if row <= size(A, 1) && col <= size(B, 2)
        for i in 1:size(A, 2)
            @inbounds sum += A[row, i] * B[i, col]
        end
        @inbounds C[row, col] = sum
    end

    return
end

# Warm the inference caches: compile the kernel for several element
# types during precompilation so the CodeInstances end up in the image.
@setup_workload let
    if CUDA.functional()
        A16 = CUDA.CuArray{Float16,2}(undef, 0, 0)
        A32 = CUDA.CuArray{Float32,2}(undef, 0, 0)
        A64 = CUDA.CuArray{Float64,2}(undef, 0, 0)
        @compile_workload begin
            CUDA.@cuda launch=false gemm!(A16, A16, A16)
            CUDA.@cuda launch=false gemm!(A32, A32, A32)
            CUDA.@cuda launch=false gemm!(A64, A64, A32)
        end
    end
end

end #module
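
Before the REPL tests below, the disk cache has to be switched on. A minimal sketch, assuming the disk_cache = true setting used later in this thread is a Preferences.jl package preference (the actual mechanism may differ):

using Preferences, GPUCompiler
set_preferences!(GPUCompiler, "disk_cache" => true)  # assumed preference name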

Testing in REPL

First compilation

julia> A32 = CUDA.CuArray{Float32,2}(undef, 0, 0);
julia> @time CUDA.@cuda launch=false gemm!(A32, A32, A32)
 12.064387 seconds (17.52 M allocations: 884.286 MiB, 2.53% gc time, 99.67% compilation time: 2% of which was recompilation)
CUDA.HostKernel for gemm!(CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1})

Second compilation

julia> A16 = CUDA.CuArray{Float16,2}(undef, 0, 0)
julia> @time CUDA.@cuda launch=false gemm!(A16, A16, A16)
  0.174178 seconds (258.09 k allocations: 14.019 MiB, 77.03% compilation time)
CUDA.HostKernel for gemm!(CuDeviceMatrix{Float16, 1}, CuDeviceMatrix{Float16, 1}, CuDeviceMatrix{Float16, 1})

Without disk-cache (just inference caching)

First compilation

julia> A32 = CUDA.CuArray{Float32,2}(undef, 0, 0);
julia> @time CUDA.@cuda launch=false GemmDenseCUDA.gemm!(A32, A32, A32)
  0.345229 seconds (93.13 k allocations: 6.010 MiB, 86.83% compilation time)
CUDA.HostKernel for gemm!(CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1})

Second compilation

julia> A16 = CUDA.CuArray{Float16,2}(undef, 0, 0)
0×0 CuArray{Float16, 2, CUDA.Mem.DeviceBuffer}
julia> @time CUDA.@cuda launch=false GemmDenseCUDA.gemm!(A16, A16, A16)
  0.058902 seconds (29.10 k allocations: 2.807 MiB, 30.30% compilation time)
CUDA.HostKernel for gemm!(CuDeviceMatrix{Float16, 1}, CuDeviceMatrix{Float16, 1}, CuDeviceMatrix{Float16, 1})

It would be interesting to see the impact of JuliaGPU/CUDA.jl#2325.

With disk-cache

First compilation

julia> A32 = CUDA.CuArray{Float32,2}(undef, 0, 0);
julia> @time CUDA.@cuda launch=false GemmDenseCUDA.gemm!(A32, A32, A32)
  0.070375 seconds (47.02 k allocations: 3.382 MiB, 96.08% compilation time)
CUDA.HostKernel for gemm!(CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1})

Second Compilation

julia> A16 = CUDA.CuArray{Float16,2}(undef, 0, 0)
julia> @time CUDA.@cuda launch=false GemmDenseCUDA.gemm!(A16, A16, A16)
  0.018031 seconds (21.87 k allocations: 2.063 MiB, 96.90% compilation time)
CUDA.HostKernel for gemm!(CuDeviceMatrix{Float16, 1}, CuDeviceMatrix{Float16, 1}, CuDeviceMatrix{Float16, 1})

Summary:
  • First compilation: 12s -> 0.34s -> 0.07s
  • Second compilation: 0.17s -> 0.058s -> 0.018s

@vchuravy (Member Author) commented:

With JuliaGPU/CUDA.jl#2325

julia> A32 = CUDA.CuArray{Float32,2}(undef, 0, 0)
0×0 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}

julia> @time CUDA.@cuda launch=false gemm!(A32, A32, A32)
  4.224223 seconds (6.91 M allocations: 343.977 MiB, 6.31% gc time, 99.18% compilation time)
CUDA.HostKernel for gemm!(CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1})

julia> A16 = CUDA.CuArray{Float16,2}(undef, 0, 0)
0×0 CuArray{Float16, 2, CUDA.Mem.DeviceBuffer}

julia> @time CUDA.@cuda launch=false gemm!(A16, A16, A16)
  0.205474 seconds (258.59 k allocations: 14.003 MiB, 79.12% compilation time)
CUDA.HostKernel for gemm!(CuDeviceMatrix{Float16, 1}, CuDeviceMatrix{Float16, 1}, CuDeviceMatrix{Float16, 1})

@vchuravy (Member Author) commented:

So precompiling with disk_cache = true leads to this interesting hang.

cmd: /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/julia 18641 running 2 of 2

signal (10): User defined signal 1
unknown function (ip: 0x7c1c1496f10e)
pthread_mutex_lock at /usr/lib/libc.so.6 (unknown line)
__gthread_mutex_lock at /usr/local/x86_64-linux-gnu/include/c++/9.1.0/x86_64-linux-gnu/bits/gthr-default.h:749 [inlined]
__gthread_recursive_mutex_lock at /usr/local/x86_64-linux-gnu/include/c++/9.1.0/x86_64-linux-gnu/bits/gthr-default.h:811 [inlined]
lock at /usr/local/x86_64-linux-gnu/include/c++/9.1.0/mutex:106 [inlined]
lock at /usr/local/x86_64-linux-gnu/include/c++/9.1.0/bits/unique_lock.h:141 [inlined]
unique_lock at /usr/local/x86_64-linux-gnu/include/c++/9.1.0/bits/unique_lock.h:71 [inlined]
Lock at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/usr/include/llvm/ExecutionEngine/Orc/ThreadSafeModule.h:42 [inlined]
getLock at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/usr/include/llvm/ExecutionEngine/Orc/ThreadSafeModule.h:69
jl_codegen_params_t at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/jitlayers.h:258 [inlined]
_jl_compile_codeinst at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/jitlayers.cpp:213
jl_generate_fptr_impl at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/jitlayers.cpp:528
jl_compile_method_internal at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/gf.c:2534 [inlined]
jl_compile_method_internal at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/gf.c:2421
_jl_invoke at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/gf.c:2938 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/gf.c:3123
handle_error at /home/vchuravy/.julia/packages/LLVM/bzSzE/src/core/context.jl:168
jfptr_handle_error_5213 at /home/vchuravy/.julia/compiled/v1.11/LLVM/e8NBy_INkA2.so (unknown line)
jlcapi_handle_error_5773 at /home/vchuravy/.julia/compiled/v1.11/LLVM/e8NBy_INkA2.so (unknown line)
_ZN4llvm18report_fatal_errorERKNS_5TwineEb at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm16SelectionDAGISel15CannotYetSelectEPNS_6SDNodeE at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm16SelectionDAGISel16SelectCodeCommonEPNS_6SDNodeEPKhj at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN12_GLOBAL__N_115X86DAGToDAGISel6SelectEPN4llvm6SDNodeE at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm16SelectionDAGISel22DoInstructionSelectionEv at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm16SelectionDAGISel17CodeGenAndEmitDAGEv at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm16SelectionDAGISel20SelectAllBasicBlocksERKNS_8FunctionE at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm16SelectionDAGISel20runOnMachineFunctionERNS_15MachineFunctionE.part.0 at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN12_GLOBAL__N_115X86DAGToDAGISel20runOnMachineFunctionERN4llvm15MachineFunctionE at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm19MachineFunctionPass13runOnFunctionERNS_8FunctionE.part.0 at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm13FPPassManager13runOnFunctionERNS_8FunctionE at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm13FPPassManager11runOnModuleERNS_6ModuleE at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm6legacy15PassManagerImpl3runERNS_6ModuleE at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
add_output_impl at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/aotcompile.cpp:1171
operator() at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/aotcompile.cpp:1477
operator() at /usr/local/x86_64-linux-gnu/include/c++/9.1.0/bits/std_function.h:690 [inlined]
lambda_trampoline at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/aotcompile.cpp:1347
unknown function (ip: 0x7c1c14972559)
unknown function (ip: 0x7c1c149efa3b)
unknown function (ip: (nil))
unknown function (ip: 0x7c1c1496eebc)
unknown function (ip: 0x7c1c149740e2)
uv_thread_join at /workspace/srcdir/libuv/src/unix/thread.c:294
add_output<jl_dump_native_impl(void*, char const*, char const*, char const*, char const*, ios_t*, ios_t*, jl_emission_params_t*)::<lambda(llvm::Module&)> > at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/aotcompile.cpp:1485
operator()<jl_dump_native_impl(void*, char const*, char const*, char const*, char const*, ios_t*, ios_t*, jl_emission_params_t*)::<lambda(llvm::Module&)> > at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/aotcompile.cpp:1645 [inlined]
jl_dump_native_impl at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/aotcompile.cpp:1790
ijl_write_compiler_output at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/precompile.c:168
ijl_atexit_hook at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/init.c:285
jl_repl_entrypoint at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/jlapi.c:1060
main at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/cli/loader_exe.c:58
unknown function (ip: 0x7c1c1490cccf)
__libc_start_main at /usr/lib/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
unknown function (ip: (nil))

@vchuravy (Member Author) commented:

Sigh. So far this has about a 50% chance of showing up:

ERROR: The following 1 direct dependency failed to precompile:

GemmDenseCUDA 

Failed to precompile GemmDenseCUDA [77fe268f-d775-43f7-b4ca-0f4dd283d536] to "/home/vchuravy/.julia/compiled/v1.11/GemmDenseCUDA/jl_3sDnKv".
LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.membar.sys

[26379] signal 6 (-6): Aborted
in expression starting at none:0
Allocations: 33542987 (Pool: 33542022; Big: 965); GC: 2

@vchuravy (Member Author) commented Apr 19, 2024:

The issue is that our threadfence_system function is not device-only, so a method could accidentally leak through to host compilation...

julia> CUDA.threadfence_system()
ERROR: LLVM error: Cannot select: intrinsic %llvm.nvvm.membar.sys
Stacktrace:
 [1] handle_error(reason::Cstring)
   @ LLVM ~/src/LLVM/src/core/context.jl:170
 [2] top-level scope
   @ REPL[2]:1
 [3] top-level scope
   @ ~/.julia/packages/CUDA/02Uw6/src/initialization.jl:206

The underlying issue is that during image creation we sometimes broaden the compilation horizon and then compile based on methods rather than CodeInstances, which is how a device-only intrinsic can reach host codegen (see the sketch below).
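
One general pattern for keeping such intrinsics out of host compilation (a hypothetical sketch, not the fix applied in this PR) is to give the function a host method that throws, and to provide the real implementation only through CUDA.jl's overlay method table:

# Host method: fails eagerly instead of reaching LLVM instruction
# selection during image generation.
threadfence_system() = error("threadfence_system is only valid in device code")

# Device method, visible only through the GPU compiler's method table.
CUDA.@device_override threadfence_system() =
    ccall("llvm.nvvm.membar.sys", llvmcall, Cvoid, ())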

@vchuravy (Member Author) commented:

WaterLily TGV example

Backend   Precompilation   Disk Cache   What               Time
CPU       –                –            First execution    5.32s (80% compilation)
CPU       –                –            Second execution   0.85s
CPU       ✓                –            First execution    2.48s (65% compilation)
CPU       ✓                –            Second execution   0.9s
CUDA      –                –            First execution    11.07s (70% compilation)
CUDA      –                –            Second execution   0.02s
CUDA      ✓                –            First execution    6.38s (46% compilation)
CUDA      ✓                –            Second execution   0.02s
CUDA      ✓                ✓            First execution    2.70s (97% compilation)
CUDA      ✓                ✓            Second execution   0.02s

ClimaOcean OMIP

Note that I tried to make the model as "small" as possible, since we pay the time in precompilation. But this is similar to the case we saw a year ago: 10 min of setup time before the model reached steady state. Without inference caching, the CUDA version needs 804s on my machine to reach steady state; with inference caching it needs 49s, and with this PR 12s.

Backend   Precompilation   Disk Cache   What             Time
CPU       –                –            Initialization   422s (88% compilation)
CPU       –                –            Time step 1      75s (100% compilation)
CPU       –                –            Time step 2      21s (100% compilation)
CPU       –                –            Time step 3      75s (100% compilation)
CPU       –                –            Time step 4      0.008s (100% compilation)
CPU       ✓                –            Initialization   52s (3% compilation)
CPU       ✓                –            Time step 1      0.021s
CPU       ✓                –            Time step 2      0.290s
CPU       ✓                –            Time step 3      0.014s
CPU       ✓                –            Time step 4      0.008s
CUDA      –                –            Initialization   596s
CUDA      –                –            Time step 1      75s
CUDA      –                –            Time step 2      34s
CUDA      –                –            Time step 3      99s
CUDA      –                –            Time step 4      0.017s
CUDA      ✓                –            Initialization   44s
CUDA      ✓                –            Time step 1      0.339s
CUDA      ✓                –            Time step 2      1.171s
CUDA      ✓                –            Time step 3      3.659s
CUDA      ✓                –            Time step 4      0.018s
CUDA      ✓                ✓            Initialization   12s
CUDA      ✓                ✓            Time step 1      0.023s
CUDA      ✓                ✓            Time step 2      0.207s
CUDA      ✓                ✓            Time step 3      0.029s
CUDA      ✓                ✓            Time step 4      0.017s

@maleadt (Member) left a comment:


A couple of minor naming nits. I don't have the time right now to test or review closely, but if it's working fine, feel free to merge so that it gets some exposure on the master branch.

(Inline review comments on src/execution.jl (4) and test/native_tests.jl (1); all resolved.)
@vchuravy merged commit 4178477 into master on Jun 28, 2024 (21 checks passed) and deleted the vc/diskcache3 branch.
@maleadt (Member) commented Jun 28, 2024:

Yay!

@vchuravy (Member Author) commented:

Ah, we might still need to add an interface for backends to opt out, e.g. Enzyme (sketch below).
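
Purely as an illustration of what such an opt-out could look like (hypothetical names, not an API in this PR):

# Backends would opt out of the disk cache by overriding this predicate.
disk_cache_enabled(@nospecialize(job::CompilerJob)) = true

# An Enzyme-style backend could then declare, for its own target type:
# disk_cache_enabled(job::CompilerJob{SomeEnzymeTarget}) = false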
