Try VectorizedRNG.jl #304
```julia
using StochasticDiffEq
using BenchmarkTools
α = 1
β = 1
u0 = [1/2]
f(du, u, p, t) = (du .= u)
g(du, u, p, t) = (du .= u)
dt = 1//2^4
tspan = (0.0, 1.0)
prob = SDEProblem(f, g, u0, tspan)
sol = solve(prob, EM(), dt = dt)
@code_warntype solve(prob, EM(), dt = dt)
@btime sol = solve(prob, EM(), dt = dt)
```

is the test case.
---

Note this is on top of #291.
---

Yikes. I'll look into it.
---

What is `local_rng()`?

```julia
help?> local_rng()
No documentation found.
```

But VectorizedRNG will allocate memory for it?

---

```julia
julia> local_rng(11212351235412351)
VectorizedRNG.Xoshift{4}(Ptr{UInt64} @0x9f56a135d27e6d00)

julia> local_rng(0)
VectorizedRNG.Xoshift{4}(Ptr{UInt64} @0x00005563e3387100)

julia> local_rng()
VectorizedRNG.Xoshift{4}(Ptr{UInt64} @0x00005563e3387100)
```

This is mostly to let it control alignment. It also means that, while the states are close together in memory, cache line boundaries should be aligned with different threads' local parts, so it should be fine.
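For concreteness, here is a minimal sketch of the per-thread usage this enables, assuming only that `local_rng()` returns the calling thread's own state (as shown above); none of the helper names below are VectorizedRNG API:

```julia
# Sketch: thread-local sampling. Each thread draws from its own Xoshift
# state; since states sit on separate cache lines, there is no false sharing.
using VectorizedRNG, Random, Base.Threads

function threaded_randn!(xs::Vector{Vector{Float64}})
    @threads for i in eachindex(xs)
        randn!(local_rng(), xs[i])  # local_rng() is this thread's state
    end
    return xs
end

threaded_randn!([Vector{Float64}(undef, 64) for _ in 1:nthreads()])
```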
---

If I remove the […]:

```
Test Summary:                    | Pass  Fail  Total
Two-dimensional Linear SDE Tests |    1    37     38
```

Benchmarks:

```julia
julia> @benchmark sol = solve($prob,EM(),dt=$dt)
BenchmarkTools.Trial:
  memory estimate:  23.88 KiB
  allocs estimate:  118
  --------------
  minimum time:     5.051 μs (0.00% GC)
  median time:      5.311 μs (0.00% GC)
  mean time:        6.844 μs (20.10% GC)
  maximum time:     503.748 μs (96.00% GC)
  --------------
  samples:          10000
  evals/sample:     6
```

This branch:

```julia
julia> @benchmark sol = solve($prob,EM(),dt=$dt)
BenchmarkTools.Trial:
  memory estimate:  6.22 KiB
  allocs estimate:  95
  --------------
  minimum time:     3.863 μs (0.00% GC)
  median time:      4.177 μs (0.00% GC)
  mean time:        4.991 μs (14.98% GC)
  maximum time:     709.685 μs (99.14% GC)
  --------------
  samples:          10000
  evals/sample:     8
```

small_overhead (#291):

```julia
julia> @benchmark sol = solve($prob,EM(),dt=$dt)
BenchmarkTools.Trial:
  memory estimate:  6.17 KiB
  allocs estimate:  95
  --------------
  minimum time:     2.902 μs (0.00% GC)
  median time:      3.150 μs (0.00% GC)
  mean time:        3.923 μs (18.48% GC)
  maximum time:     636.300 μs (99.32% GC)
  --------------
  samples:          10000
  evals/sample:     9
```

So it looks like this regresses performance relative to #291. What are the lengths of the involved arrays?
---

Awesome, that got it working.
---

Here, just 1. It's of course problem dependent, ranging anywhere from 1 to 100,000.
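Since the relevant length is problem dependent, one way to locate the crossover is to benchmark both RNGs across a range of sizes. A quick sketch (not from the PR; assumes VectorizedRNG and the standard libraries are loaded):

```julia
# Sketch: compare the default RNG against VectorizedRNG's thread-local RNG
# for normal sampling across array lengths from 1 to 100,000.
using VectorizedRNG, Random, BenchmarkTools

for n in (1, 10, 100, 1_000, 10_000, 100_000)
    x = Vector{Float64}(undef, n)
    t_default = @belapsed randn!($x)
    t_vrng    = @belapsed randn!(local_rng(), $x)
    println("n = $n: default $(t_default * 1e9) ns, VectorizedRNG $(t_vrng * 1e9) ns")
end
```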
---

Let me get that small-optimizations branch merged, and then this should be easier. All this PR adds on top of it is swapping out the RNG, so it's really this vs `Xorshifts.Xoroshiro128Plus`.
---

I assume that VectorizedRNG.jl only supports `Array`s of `Float64`? Could it also do out-of-place with static-array-like things?
---

I'm worried about the test failure that I got, and, while not a failure, these warnings did not appear when testing the small_overhead branch:

```
Solve and Plot
┌ Warning: Interrupted. Larger maxiters is needed.
└ @ DiffEqBase ~/.julia/packages/DiffEqBase/SMlFm/src/integrator_interface.jl:329
┌ Warning: Interrupted. Larger maxiters is needed.
└ @ DiffEqBase ~/.julia/packages/DiffEqBase/SMlFm/src/integrator_interface.jl:329
┌ Warning: Interrupted. Larger maxiters is needed.
└ @ DiffEqBase ~/.julia/packages/DiffEqBase/SMlFm/src/integrator_interface.jl:329
```

---

VectorizedRNG does reasonably well against the SmallCrush RNG test battery, but the sin approximation might need to be a little more accurate. Also, it's currently using a Xoshiro256Plus; maybe I should implement the Xoroshiro128Plus.
---

Ah.

```julia
julia> rng = Xorshifts.Xoroshiro128Plus()
RandomNumbers.Xorshifts.Xoroshiro128Plus(0xef4459b15a5cab2a, 0x33d487d5296eb0e2)

julia> x = [0.0];

julia> @benchmark randn!($x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.429 ns (0.00% GC)
  median time:      9.852 ns (0.00% GC)
  mean time:        9.880 ns (0.00% GC)
  maximum time:     41.759 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark randn!(local_rng(), $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     21.980 ns (0.00% GC)
  median time:      22.052 ns (0.00% GC)
  mean time:        22.084 ns (0.00% GC)
  maximum time:     48.035 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     996

julia> @benchmark randn!($rng, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.235 ns (0.00% GC)
  median time:      11.645 ns (0.00% GC)
  mean time:        11.665 ns (0.00% GC)
  maximum time:     21.735 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> x = Vector{Float64}(undef, 15);

julia> @benchmark randn!($x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     50.670 ns (0.00% GC)
  median time:      52.744 ns (0.00% GC)
  mean time:        52.787 ns (0.00% GC)
  maximum time:     88.798 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     990

julia> @benchmark randn!(local_rng(), $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     28.716 ns (0.00% GC)
  median time:      29.022 ns (0.00% GC)
  mean time:        29.019 ns (0.00% GC)
  maximum time:     52.158 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     995

julia> @benchmark randn!($rng, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     48.997 ns (0.00% GC)
  median time:      50.915 ns (0.00% GC)
  mean time:        50.944 ns (0.00% GC)
  maximum time:     67.167 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     989
```

That should be addressable for very short arrays, though:

```julia
julia> @btime randn(local_rng(), NTuple{1,NTuple{2,Core.VecElement{Float64}}})
  13.260 ns (0 allocations: 0 bytes)
((VecElement{Float64}(0.057132286810525984), VecElement{Float64}(0.7160102533566447)),)
```

It can generate a single uniform at a time:

```julia
julia> @btime rand(local_rng(), NTuple{1,NTuple{1,Core.VecElement{Float64}}})
  2.269 ns (0 allocations: 0 bytes)
((VecElement{Float64}(0.30066402503431744),),)
```

But that isn't an option right now for normals because of the Box-Muller algorithm. As Box-Muller is a really slow way to generate just 2 random normals anyway, we should fall back to the Ziggurat algorithm for small (non-SIMD) batches of normals. Alternatively, we could store a buffer of perhaps 256 random numbers and simply load from it, refilling the buffer every time it reaches the end.
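For reference, a sketch of the two ideas mentioned above (plain Julia, not VectorizedRNG code; `boxmuller`, `BufferedNormals`, and `next_normal!` are names made up for this example): the Box-Muller transform, which is why normals naturally come in pairs, and the simple refilling buffer.

```julia
using Random

# Box-Muller: two uniforms in, two independent standard normals out.
function boxmuller(u1, u2)
    r = sqrt(-2 * log(u1))
    θ = 2π * u2
    return r * cos(θ), r * sin(θ)
end

# Buffered alternative: pre-generate a block of normals and hand them out
# one at a time, refilling once the buffer is exhausted.
mutable struct BufferedNormals
    buf::Vector{Float64}
    pos::Int
end
BufferedNormals(n::Integer = 256) = BufferedNormals(randn(n), 1)

function next_normal!(b::BufferedNormals)
    if b.pos > length(b.buf)
        randn!(b.buf)  # refill the whole buffer in one batched draw
        b.pos = 1
    end
    z = b.buf[b.pos]
    b.pos += 1
    return z
end
```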
---

Currently, it only supports that. It typically generates `4*W64` numbers at a time:

```julia
julia> randn(local_rng(), NTuple{4,NTuple{VectorizedRNG.W64,Core.VecElement{Float64}}})
((VecElement{Float64}(-0.35070219008784215), VecElement{Float64}(-1.1500644689621624), VecElement{Float64}(-1.6140966900130058), VecElement{Float64}(-0.13218650868339507), VecElement{Float64}(-1.145717731296182), VecElement{Float64}(-1.4488616660755809), VecElement{Float64}(-1.219831252656312), VecElement{Float64}(0.26954048537845754)), (VecElement{Float64}(-0.8649042896705302), VecElement{Float64}(1.1491223266582307), VecElement{Float64}(-0.8821573749577012), VecElement{Float64}(0.19394881631501515), VecElement{Float64}(1.957643704171434), VecElement{Float64}(-2.927466363335935), VecElement{Float64}(-0.14172465407366736), VecElement{Float64}(-1.5318948796751526)), (VecElement{Float64}(-0.34632071461043773), VecElement{Float64}(0.17695680200558128), VecElement{Float64}(0.34350477272227437), VecElement{Float64}(1.2389961224532435), VecElement{Float64}(1.1252296319202757), VecElement{Float64}(1.8952112260924336), VecElement{Float64}(0.7779111642563302), VecElement{Float64}(-1.0682395626203594)), (VecElement{Float64}(-0.40499368481842707), VecElement{Float64}(-0.5697825884055833), VecElement{Float64}(0.9163026476719056), VecElement{Float64}(-1.6460932788182563), VecElement{Float64}(-1.071672933005311), VecElement{Float64}(0.6215884503640314), VecElement{Float64}(-0.09387081715702705), VecElement{Float64}(0.07085548151904485)))

julia> @benchmark randn(local_rng(), NTuple{4,NTuple{VectorizedRNG.W64,Core.VecElement{Float64}}})
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     35.995 ns (0.00% GC)
  median time:      36.122 ns (0.00% GC)
  mean time:        36.191 ns (0.00% GC)
  maximum time:     76.804 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     993

julia> @benchmark randn(local_rng(), NTuple{2,NTuple{VectorizedRNG.W64,Core.VecElement{Float64}}})
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.060 ns (0.00% GC)
  median time:      21.076 ns (0.00% GC)
  mean time:        21.101 ns (0.00% GC)
  maximum time:     30.223 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

julia> @benchmark randn(local_rng(), NTuple{1,NTuple{VectorizedRNG.W64,Core.VecElement{Float64}}})
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.668 ns (0.00% GC)
  median time:      16.484 ns (0.00% GC)
  mean time:        16.315 ns (0.00% GC)
  maximum time:     44.946 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998
```

So for out-of-place support, you just need another way to get these into your preferred data structure.
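As a hedged illustration of that last point (not API from the PR): the `VecElement` tuples can be flattened into a plain `NTuple{N,Float64}`, which is exactly the layout a static array could wrap.

```julia
using VectorizedRNG

# Draw one SIMD pair and unpack the VecElements into plain Float64s.
vals = randn(local_rng(), NTuple{1,NTuple{2,Core.VecElement{Float64}}})
flat = ntuple(i -> vals[1][i].value, Val(2))  # (z1, z2)::NTuple{2,Float64}
```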
---

It just seems to segfault though?
---

(force-pushed from `7bada79` to `e0771be`)
---

VectorizedRNG is only really good at batched sampling:

```julia
julia> using VectorizedRNG, BenchmarkTools, Random

julia> x = Vector{Float64}(undef, 1024);

julia> @btime rand!($(Random.default_rng()), $x);
  274.723 ns (0 allocations: 0 bytes)

julia> @btime rand!($(VectorizedRNG.local_rng()), $x);
  178.311 ns (0 allocations: 0 bytes)

julia> @btime randn!($(Random.default_rng()), $x);
  1.660 μs (0 allocations: 0 bytes)

julia> @btime randn!($(VectorizedRNG.local_rng()), $x);
  1.213 μs (0 allocations: 0 bytes)
```

So if it makes sense to do larger batches, we could consider this again.
---

Yeah, this is just so stale, and it's not generally the right thing to do, so there's no reason to keep the branch around. A special noise process can be made around this if big SPDEs run into this as the time-limiting part.
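To make the "special noise process" idea concrete, here is a hand-rolled Euler-Maruyama sketch (not DiffEqNoiseProcess API; `em!` and every name below are local to this example) showing where a batched VectorizedRNG draw would plug in for a large system:

```julia
using VectorizedRNG, Random

# Sketch: Euler-Maruyama with one batched normal draw per step.
function em!(u, f!, g!, dt, nsteps; p = nothing, t0 = 0.0)
    du, dg, dW = similar(u), similar(u), similar(u)
    t = t0
    sqdt = sqrt(dt)
    for _ in 1:nsteps
        f!(du, u, p, t)                # drift
        g!(dg, u, p, t)                # diffusion
        randn!(local_rng(), dW)        # the batched draw VectorizedRNG is good at
        @. u += du * dt + dg * sqdt * dW
        t += dt
    end
    return u
end

# A "big SPDE"-sized state, with the same f and g as the test case above.
u = fill(0.5, 100_000)
em!(u, (du, u, p, t) -> (du .= u), (du, u, p, t) -> (du .= u), 1/16, 16)
```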
---

I just seem to get segfaults though.