Threaded implementation of dense $\times$ sparse matrix multiplication, built on top of Polyester.jl.
Just install and import this package, and launch Julia with some threads, e.g. `julia --threads=auto`. Then any of the following will be accelerated:
```julia
using SparseArrays  # for sprand; also load this package

A = rand(1_000, 2_000); B = sprand(2_000, 30_000, 0.05);
buf = zeros(size(A, 1), size(B, 2))  # preallocate the output

fastdensesparsemul!(buf, A, B, 1, 0)
fastdensesparsemul_threaded!(buf, A, B, 1, 0)
fastdensesparsemul_outer!(buf, @view(A[:, 1]), B[1, :], 1, 0)
fastdensesparsemul_outer_threaded!(buf, @view(A[:, 1]), B[1, :], 1, 0)
```
The interface is adapted from the 5-parameter definition used by `mul!` and BLAS, i.e. the call updates `C` in-place to `α*A*B + β*C`.
Previously we tried to overload the `mul!` operator itself, which comes with some nice syntactic convenience, but it led to downstream problems with package precompilation and with checking whether our implementation is actually being used.
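Concretely, the 5-parameter convention looks like this (a small illustrative snippet; variable names and sizes are placeholders):

```julia
using LinearAlgebra, SparseArrays

A = rand(10, 20); B = sprand(20, 30, 0.2); C = zeros(10, 30)
α, β = 2.0, 0.5

# Standard-library definition of the 5-parameter form: C = α*A*B + β*C, in-place.
mul!(C, A, B, α, β)

# This package follows the same convention:
fastdensesparsemul!(C, A, B, α, β)
```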
I want to do dense $\times$ sparse multiplication, threaded, but:
- The SparseArrays.jl package doesn't support threaded multiplication.
- The IntelMKL.jl package doesn't seem to support dense $\times$ sparsecsc multiplication, although one can get similar performance using that package and transposing appropriately. It also comes with possible licensing issues and is vendor-specific.
- The ThreadedSparseCSR.jl package only supports sparsecsr $\times$ dense.
- The ThreadedSparseArrays.jl package only supports ThreadedSparseMatrixCSC $\times$ dense, and also currently doesn't install for me.
I haven't found an existing implementation for this, so I made one myself. In fact, Polyester.jl makes this super easy; the entire core is basically:
```julia
import SparseArrays: SparseMatrixCSC
import Polyester: @batch

function fastdensesparsemul_threaded!(C::AbstractMatrix, A::AbstractMatrix, B::SparseMatrixCSC,
                                      α::Number, β::Number)
    # Each output column C[:, j] depends only on column j of B,
    # so the columns can be processed independently across threads.
    @batch for j in axes(B, 2)
        C[:, j] .*= β
        C[:, j] .+= A * (α .* B[:, j])
    end
    return C
end
```
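As a quick sanity check (an illustrative snippet, not from the package's test suite), the result should match the standard 5-argument `mul!`:

```julia
using LinearAlgebra, SparseArrays

A = rand(100, 200); B = sprand(200, 300, 0.05)
C1, C2 = zeros(100, 300), zeros(100, 300)

fastdensesparsemul_threaded!(C1, A, B, 1.0, 0.0)
mul!(C2, A, B, 1.0, 0.0)  # single-threaded reference from the standard library
@assert C1 ≈ C2
```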
Notice that this approach doesn't make sense for matrix-vector multiplication (the loop would have just one element), so that case is not considered in this package; however, it does make sense for outer products.
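For illustration, a minimal sketch of how the threaded outer-product update could look (the hypothetical `outer_update!` below is not the package's actual implementation); it threads over the stored nonzeros of the sparse vector:

```julia
import SparseArrays: SparseVector, nonzeroinds, nonzeros
import Polyester: @batch

# Sketch: update C in-place to α*a*b' + β*C, with `a` dense and `b` sparse.
# Each stored nonzero of `b` touches a distinct column of C, so the columns
# can be updated independently across threads.
function outer_update!(C::AbstractMatrix, a::AbstractVector, b::SparseVector,
                       α::Number, β::Number)
    β == 1 || (C .*= β)  # scale the whole output first (5-parameter semantics)
    @batch for k in eachindex(nonzeroinds(b))
        j = nonzeroinds(b)[k]
        C[:, j] .+= (α * nonzeros(b)[k]) .* a
    end
    return C
end
```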
I haven't found much literature on the choice of CSC vs CSR storage specifically in the context of dense $\times$ sparse multiplication.
For matrix $\times$ matrix products, we see a speedup over `_spmul!` from the SparseArrays package of up to ~2x for all `M` in `[300, 30_000]`.
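A benchmark along these lines could be set up roughly as follows (a sketch: the dimensions, density, and the exact role of `M` are assumptions, not the package's benchmark script):

```julia
using BenchmarkTools, LinearAlgebra, SparseArrays

M = 3_000  # assumed sweep parameter for the matrix size
A = rand(M, M); B = sprand(M, M, 0.05); C = zeros(M, M)

# The 5-argument mul! for dense × SparseMatrixCSC falls through to
# SparseArrays' single-threaded kernel (`_spmul!` above).
@btime mul!($C, $A, $B, 1, 0)
@btime fastdensesparsemul_threaded!($C, $A, $B, 1, 0)
```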
We also compare against MKLSparse.jl. However, since MKLSparse.jl only supports sparse $\times$ dense multiplication, we first need to allocate extra buffers and transpose the dense matrix (these allocations are not measured in the `no_transpose` variant), and then compute essentially $C = (B^\top A^\top)^\top$.
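For concreteness, the transposed computation could look roughly like this (a sketch, assuming MKLSparse.jl accelerates sparse $\times$ dense products including the lazily transposed sparse factor; buffer handling in the actual benchmark may differ):

```julia
using MKLSparse, SparseArrays, LinearAlgebra

# Goal: C = A * B with A dense and B sparse, computed as C = (Bᵀ Aᵀ)ᵀ.
At = permutedims(A)      # transpose the dense factor (allocates)
Ct = transpose(B) * At   # sparse × dense product, the case MKL accelerates
C  = permutedims(Ct)     # transpose the result back (allocates)
```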