CUDSS.jl is a Julia interface to the NVIDIA cuDSS library. NVIDIA cuDSS provides three factorizations (LDU, LDLᵀ, LLᵀ) for solving sparse linear systems on GPUs.
Unlike other CUDA libraries that are commonly bundled together, cuDSS is currently in preview. For this reason, it is not included in CUDA.jl. To maintain consistency with the naming conventions used for other CUDA libraries (such as CUBLAS, CUSOLVER, CUSPARSE, etc.), we have named this interface CUDSS.jl.
CUDSS.jl can be installed and tested through the Julia package manager:
julia> ]
pkg> add CUDSS
pkg> test CUDSS
CUDSS.jl provides a structured approach for leveraging NVIDIA cuDSS functionalities.
It introduces the CudssSolver
type along with three core routines: cudss
, cudss_set
, and cudss_get
.
Additionally, specialized methods for the CuSparseMatrixCSR
type have been incorporated for cholesky
, ldlt
, lu
and \
.
To further enhance performance, in-place variants including cholesky!
, ldlt!
, lu!
and ldiv!
have been implemented.
These variants optimize performance by reusing the symbolic factorization as well as storage.
This ensures efficient solving of sparse linear systems on GPUs.
using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays, LinearAlgebra
T = Float64
n = 100
A_cpu = sprand(T, n, n, 0.05) + I
x_cpu = zeros(T, n)
b_cpu = rand(T, n)
A_gpu = CuSparseMatrixCSR(A_cpu)
x_gpu = CuVector(x_cpu)
b_gpu = CuVector(b_cpu)
solver = CudssSolver(A_gpu, "G", 'F')
cudss("analysis", solver, x_gpu, b_gpu)
cudss("factorization", solver, x_gpu, b_gpu)
cudss("solve", solver, x_gpu, b_gpu)
r_gpu = b_gpu - A_gpu * x_gpu
norm(r_gpu)
# In-place LU
d_gpu = rand(T, n) |> CuVector
A_gpu = A_gpu + Diagonal(d_gpu)
cudss_set(solver, A_gpu)
c_cpu = rand(T, n)
c_gpu = CuVector(c_cpu)
cudss("refactorization", solver, x_gpu, c_gpu)
cudss("solve", solver, x_gpu, c_gpu)
r_gpu = c_gpu - A_gpu * x_gpu
norm(r_gpu)
using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays, LinearAlgebra
T = Float64
R = real(T)
n = 100
p = 5
A_cpu = sprand(T, n, n, 0.05) + I
A_cpu = A_cpu + A_cpu'
X_cpu = zeros(T, n, p)
B_cpu = rand(T, n, p)
A_gpu = CuSparseMatrixCSR(A_cpu |> tril)
X_gpu = CuMatrix(X_cpu)
B_gpu = CuMatrix(B_cpu)
structure = T <: Real ? "S" : "H"
solver = CudssSolver(A_gpu, structure, 'L')
cudss("analysis", solver, X_gpu, B_gpu)
cudss("factorization", solver, X_gpu, B_gpu)
cudss("solve", solver, X_gpu, B_gpu)
R_gpu = B_gpu - CuSparseMatrixCSR(A_cpu) * X_gpu
norm(R_gpu)
# In-place LDLᵀ
d_gpu = rand(R, n) |> CuVector
A_gpu = A_gpu + Diagonal(d_gpu)
cudss_set(solver, A_gpu)
C_cpu = rand(T, n, p)
C_gpu = CuMatrix(C_cpu)
cudss("refactorization", solver, X_gpu, C_gpu)
cudss("solve", solver, X_gpu, C_gpu)
R_gpu = C_gpu - ( CuSparseMatrixCSR(A_cpu) + Diagonal(d_gpu) ) * X_gpu
norm(R_gpu)
using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays, LinearAlgebra
T = ComplexF64
R = real(T)
n = 100
p = 5
A_cpu = sprand(T, n, n, 0.01)
A_cpu = A_cpu * A_cpu' + I
X_cpu = zeros(T, n, p)
B_cpu = rand(T, n, p)
A_gpu = CuSparseMatrixCSR(A_cpu |> triu)
X_gpu = CuMatrix(X_cpu)
B_gpu = CuMatrix(B_cpu)
structure = T <: Real ? "SPD" : "HPD"
solver = CudssSolver(A_gpu, structure, 'U')
cudss("analysis", solver, X_gpu, B_gpu)
cudss("factorization", solver, X_gpu, B_gpu)
cudss("solve", solver, X_gpu, B_gpu)
R_gpu = B_gpu - CuSparseMatrixCSR(A_cpu) * X_gpu
norm(R_gpu)
# In-place LLᴴ
d_gpu = rand(R, n) |> CuVector
A_gpu = A_gpu + Diagonal(d_gpu)
cudss_set(solver, A_gpu)
C_cpu = rand(T, n, p)
C_gpu = CuMatrix(C_cpu)
cudss("refactorization", solver, X_gpu, C_gpu)
cudss("solve", solver, X_gpu, C_gpu)
R_gpu = C_gpu - ( CuSparseMatrixCSR(A_cpu) + Diagonal(d_gpu) ) * X_gpu
norm(R_gpu)