General performance improvements discussion #3062
Comments
Awesome! This is great and it all makes sense (using updated versions of libraries, especially those tuned to the system it's running on). Where can we help with the discussion? I saw the current suggestions, but I'm not sure what could actually be configured, so maybe it'd help to know exactly what things you're suggesting. One note: we've tested MKL in the past. It turns out that the implementation gets loose with IEEE floating-point rules, and although the wall time is faster, we found that the results differed.

I'm wary of things that are faster but produce different results. It requires a bit more effort to sort through what that means and actually assess whether it's faster.
I made an example. Unfortunately I had to directly include the binary. This checks, when running Linux, whether the system libm is out of date and, if so, loads the AMD implementation instead.
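For readers following along, a check like that can be as simple as asking glibc for its own version at run time. This is only a rough sketch of the idea, not the actual example above; `gnu_get_libc_version` is glibc-specific, and the 2.27 cutoff below is an arbitrary illustration:

```cpp
#include <gnu/libc-version.h>  // glibc-specific
#include <cstdio>

int main() {
  // Report the glibc version at run time; CentOS 7 ships 2.17, whose libm
  // is far slower than current releases for log/exp-heavy workloads.
  const char* ver = gnu_get_libc_version();  // e.g. "2.17"
  std::printf("glibc %s\n", ver);

  // Crude comparison: warn if the version looks old (cutoff is illustrative).
  int major = 0, minor = 0;
  if (std::sscanf(ver, "%d.%d", &major, &minor) == 2 && major == 2 && minor < 27)
    std::printf("old libm; AMD's libalm will likely give a big speedup here\n");
  return 0;
}
```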
I've had this experience with
That's interesting. I should really investigate this further for other projects. This isn't really a problem for me, but I should really try to figure out why they give different results once and for all.
Agree completely. Did you try other BLAS implementations to see? They likely also had different results, so it probably falls in the same bucket, but might be worth a gander.
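As an aside on the reproducibility question: if the differences come from MKL picking different instruction-set code paths per machine, MKL ships a "conditional numerical reproducibility" (CNR) mode that pins it to a conservative path. The sketch below uses `mkl_cbwr_set`/`MKL_CBWR_COMPATIBLE` as I understand that API, so double-check against the MKL docs; it assumes MKL is installed and linked (e.g. `-lmkl_rt`):

```cpp
#include <mkl.h>
#include <cstdio>

int main() {
  // Ask MKL for its most conservative, reproducible code path; this must be
  // called before any other MKL routine. The MKL_CBWR=COMPATIBLE environment
  // variable is the no-recompile equivalent.
  if (mkl_cbwr_set(MKL_CBWR_COMPATIBLE) == MKL_CBWR_SUCCESS)
    std::printf("MKL conditional numerical reproducibility enabled\n");
  else
    std::printf("CNR request not honored (check MKL version/threading)\n");
  return 0;
}
```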
Thanks @blackwer!! A couple of little notes:
Yes, so this has vexed us for a while. The `vari`/`var` setup looks roughly like:

```cpp
// abstract base class
struct vari_base {
  virtual void chain() = 0;
};

// internal impl
struct vari : vari_base {
  // vtable ptr for chain()
  double val_;
  double adj_;
};

// smart pointer like class
struct var {
  vari* vi_;
};
```
So these two sort of go hand in hand.
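To make the cost concrete, here is a minimal standalone sketch (not Stan's actual code; `multiply_vari`, `grad`, and the stack vector are made-up names for illustration) of how a concrete `vari` and the reverse-pass sweep over the `vari_base*` stack fit together. Every adjoint update goes through a virtual `chain()` call plus pointer dereferences, which is what blocks vectorization:

```cpp
#include <vector>

// abstract base class, as in the snippet above
struct vari_base {
  virtual void chain() = 0;
};

// scalar vari: value, adjoint, and an implicit vtable pointer
struct vari : vari_base {
  double val_;
  double adj_ = 0;
  explicit vari(double val) : val_(val) {}
  void chain() override {}  // leaves have nothing to propagate
};

// hypothetical vari recording c = a * b
struct multiply_vari : vari {
  vari* a_;
  vari* b_;
  multiply_vari(vari* a, vari* b) : vari(a->val_ * b->val_), a_(a), b_(b) {}
  // each adjoint update is a scalar operation reached through a vtable call
  // and two pointer dereferences
  void chain() override {
    a_->adj_ += b_->val_ * adj_;
    b_->adj_ += a_->val_ * adj_;
  }
};

// reverse pass: walk the stack backwards, one virtual call per vari
void grad(std::vector<vari_base*>& stack) {
  for (auto it = stack.rbegin(); it != stack.rend(); ++it) (*it)->chain();
}

int main() {
  vari a(2.0), b(3.0);
  multiply_vari c(&a, &b);
  c.adj_ = 1.0;                               // seed the output adjoint
  std::vector<vari_base*> stack{&a, &b, &c};  // construction order
  grad(stack);                                // a.adj_ == 3.0, b.adj_ == 2.0
}
```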
This is where the culmination of all the above leads to not being able to have vectorized instructions: since our base `vari` only holds scalar values, a matrix of `var` is really a collection of pointers to scattered `vari`. What we have been moving towards instead is a `vari` whose value and adjoint are whole matrices:

```cpp
struct vari {
  // vtable pointer to chain() method
  Eigen::MatrixXd val_;
  Eigen::MatrixXd adj_;
};
```

So then, when we do the reverse pass for matrix operations and access the values and adjoints of the matrix, we can actually have vectorized instructions. The only alternative to the above would involve adding packet math for our `var` type to `Eigen`.
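For contrast, a rough sketch (again with made-up names such as `matrix_vari` and `multiply_mat_vari`, and assuming Eigen headers are available; this is not Stan's actual implementation) of what the matrix layout buys: one `chain()` call updates whole adjoint matrices with Eigen expressions, which compile down to vectorized or BLAS-backed kernels instead of per-element virtual calls:

```cpp
#include <Eigen/Dense>

// matrix-valued vari: value and adjoint are whole Eigen matrices
struct matrix_vari {
  Eigen::MatrixXd val_;
  Eigen::MatrixXd adj_;
  explicit matrix_vari(const Eigen::MatrixXd& val)
      : val_(val), adj_(Eigen::MatrixXd::Zero(val.rows(), val.cols())) {}
};

// hypothetical vari recording C = A * B
struct multiply_mat_vari : matrix_vari {
  matrix_vari* a_;
  matrix_vari* b_;
  multiply_mat_vari(matrix_vari* a, matrix_vari* b)
      : matrix_vari(a->val_ * b->val_), a_(a), b_(b) {}
  // one chain() call updates entire adjoint matrices:
  // dL/dA = dL/dC * B^T, dL/dB = A^T * dL/dC
  void chain() {
    a_->adj_ += adj_ * b_->val_.transpose();
    b_->adj_ += a_->val_.transpose() * adj_;
  }
};

int main() {
  matrix_vari a(Eigen::MatrixXd::Random(64, 32)), b(Eigen::MatrixXd::Random(32, 16));
  multiply_mat_vari c(&a, &b);
  c.adj_.setOnes();  // seed the output adjoint
  c.chain();         // fills a.adj_ and b.adj_ with dense, vectorizable updates
}
```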
I haven't tried personally, but I do think others have. If you end up with any results (positive, negative, same), please let us know!
Summary:
At the suggestion of @bob-carpenter, I'm opening this as a discussion thread for how to improve the performance of Stan, both generally and on specific platforms. These improvements range from simple compile- and run-time tricks the user can perform in certain environments, which can be addressed with documentation and basic automation, to deep overhauls of the internals to exploit modern processor architectures. While my experience with the code is currently minimal, it seems Stan is already at a very good baseline for performance, but there is likely room for improvement (such as exploiting vectorization better and improving special function performance, for starters).
Description:
Stan is great, and I want to help make it better. This issue is a place to discuss how best to do so, and to provide various benchmarks on common systems I have access to, to see where performance is lagging. The main machines I have readily available for trials are:
AMD Ryzen 5800X (Ubuntu 21.04, Windows 10)
2X AMD EPYC 7742 64-Core Processor (CentOS 7)
2X Intel(R) Xeon(R) Gold 6128 CPU (CentOS 7)
2X Intel(R) Xeon(R) CPU E5-2680 v4 (CentOS 7)
2X Intel(R) Xeon(R) Gold 6148 CPU (CentOS 7)
Fujitsu A64FX (CentOS 8)
ThunderX2 99xx (CentOS 8)
Intel(R) Core(TM) i9-10885H CPU (Ubuntu 21.04)
Intel(R) Core(TM) i7-6920HQ (Big Sur)
Apple M1 (Big Sur)
Benchmarking tools:
I've been using these programs to do some simple benchmarks.
simple cmdstan benchmark
compare amdlibm to libm; you need to extract amd-libm somewhere and add the library to the library path as in the example script (a rough sketch of this kind of micro-benchmark follows below).
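This isn't the benchmark script referenced above, just a minimal sketch of the kind of `libm` micro-benchmark being compared; the array size and the functions timed (`exp`, `log`) are arbitrary choices. Building it once against the system `libm` and once with the `libalm` link flags from the suggestions section gives a rough A/B comparison:

```cpp
#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  // fill an array with values in a range where exp/log are well behaved
  const std::size_t n = 10'000'000;
  std::vector<double> x(n), y(n);
  for (std::size_t i = 0; i < n; ++i) x[i] = 1.0 + 1e-6 * static_cast<double>(i);

  auto t0 = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < n; ++i) y[i] = std::exp(x[i]);
  auto t1 = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < n; ++i) y[i] = std::log(y[i]);
  auto t2 = std::chrono::steady_clock::now();

  // print a checksum so the loops can't be optimized away entirely
  double sum = 0;
  for (double v : y) sum += v;
  std::printf("exp: %.3f s, log: %.3f s, checksum %.6f\n",
              std::chrono::duration<double>(t1 - t0).count(),
              std::chrono::duration<double>(t2 - t1).count(), sum);
}
```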
Results
This is currently mostly a more qualitative stub, with more results to come later. I'm only using `cmdstan` so far.

On machines running CentOS 7, `glibc` is painfully out of date. From a practical standpoint, this means that `libm` (`log`, `exp`, etc.) is also incredibly out of date (2.17). AMD provides a performance-tuned version of `libm`, which honestly is approximately the speed of current `libc` implementations (given their permissive license, the code might have been absorbed into `glibc`). However, it absolutely crushes old `libc` versions.

Other notes:
The reverse pass spends a noticeable amount of time on virtual dispatch while walking the `vari_base*` stack. This improves with more modern compilers, but is still probably ~5% of execution time on this example for the x86_64 machines. It could be better or worse; it's kinda painstaking to isolate because the stack is modified everywhere.
Optimized BLAS/LAPACK backends probably have to come from the user's environment, e.g. `conda` packages, since I doubt you want to auto-pull and build openblas/blis/mkl/whatever.

Current suggestions
It might be worth detecting the `libc` implementation (when appropriate) and suggesting the AMD implementation be loaded. This could also be pulled in directly, given their permissive license. The `make/local` I'm using in `cmdstan` contains

```make
LDFLAGS += -L/mnt/home/rblackwell/tmp/amd-libm/lib -Wl,-rpath,"/mnt/home/rblackwell/tmp/amd-libm/lib" -lalm -lm
```

to enable this feature, but this could be changed to `$(PWD)` or whatever the `make` equivalent is.
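One way to sanity-check that the AMD library actually gets picked up at run time is to ask the dynamic linker where the `exp` symbol resolves from. This is a small standalone sketch, not part of `cmdstan`; build it with the same `LDFLAGS` as above (plus `-ldl` on older glibc), and it should report `libalm` if the AMD library exports the standard names, which the link ordering above relies on:

```cpp
#include <dlfcn.h>  // dlsym, dladdr (glibc/BSD extension)
#include <cstdio>

int main() {
  // Ask the dynamic linker which loaded object provides the `exp` symbol this
  // process would call; with -lalm ahead of -lm it should be libalm.so.
  void* sym = dlsym(RTLD_DEFAULT, "exp");
  Dl_info info;
  if (sym != nullptr && dladdr(sym, &info) != 0 && info.dli_fname != nullptr)
    std::printf("exp resolved from %s\n", info.dli_fname);
  else
    std::printf("could not resolve exp\n");
}
```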
On the same machine/OS, I got significant performance bumps from using a more modern compiler (`gcc10` vs `gcc7`) and from swapping the `Eigen` backend out for `MKL`, both of which were relatively trivial. The combined effect of these three things cut execution time down from ~75s to ~45s, and it was all very low-hanging fruit. I didn't see any obvious mention of this in the documentation. The same modifications on my modern Ryzen system didn't do much of anything, though (I didn't try an older compiler; `MKL` and `libalm` didn't help noticeably). The code executed in about 28-30s, depending on configuration.
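For reference, the Eigen-to-MKL swap itself is just Eigen's documented `EIGEN_USE_MKL_ALL` switch plus linking MKL. The sketch below shows it standalone; the exact link line (e.g. `-lmkl_rt`) depends on the MKL install, and how this is threaded through a `cmdstan` `make/local` is not shown here:

```cpp
// EIGEN_USE_MKL_ALL must be defined before any Eigen header; it tells Eigen
// to route supported kernels (GEMM and friends) through MKL at compile time.
#define EIGEN_USE_MKL_ALL
#include <Eigen/Dense>
#include <cstdio>

int main() {
  Eigen::MatrixXd a = Eigen::MatrixXd::Random(512, 512);
  Eigen::MatrixXd b = Eigen::MatrixXd::Random(512, 512);
  Eigen::MatrixXd c = a * b;     // dispatched to MKL's dgemm when the macro is set
  std::printf("%f\n", c(0, 0));  // use the result so nothing is optimized away
}
```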
Current Version: v2.27.0