Doing the decomposition for every momentum draw is inefficient #2881
Comments
Yikes. We had explicitly discussed coding it originally so it didn't have this inefficiency. I guess that didn't make it to the code.
Should it only happen when the inverse metric is set?
Yes that makes sense. Right now it's accessed directly by other things: https://github.com/stan-dev/stan/blob/develop/src/stan/mcmc/hmc/nuts/adapt_diag_e_nuts.hpp#L32, but an accessor makes more sense.
Is a fix under way? This slows things down quite a bit when I use the Radon model for testing.
… once each time the inverse metric is set (instead of every sample, Issue #2881). This involved switching to setter/getters for interfacing with dense_e_point so I made the change for diag_e_point as well. Also changed set_metric verbiage to set_inv_metric.
The current design is intentional. The Cholesky is needed only once per transition, which is a relatively small cost compared to the many gradient evaluations needed within each transition. Saving the Cholesky decomposition introduces an additional memory burden. I am inclined to close this issue until less anecdotal evidence of hotspots in the current code is demonstrated.
That's a huge speedup for dense matrices, which is where we're already paying an O(N^2) memory penalty just for storing the dense metric. Is there a big memory penalty for the diagonal case? That shouldn't need a Cholesky decomposition.
I just put it there for symmetry and it was easy to get. We changed how the rng code is written but that's just a syntax thing: https://github.com/stan-dev/stan/pull/2894/files#diff-6a891b298c8e18322df7f82a1a362732R48

Edit: Oh, I missed the question, sorry. This uses no extra memory for the diagonal case.
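(As noted above, the diagonal case needs no decomposition at all: a momentum draw is just an element-wise scaling by the square roots of the inverse-metric entries. A minimal sketch of that idea, using a hypothetical helper rather than Stan's actual diag_e_point/diag_e_metric code, and assuming Eigen plus the standard <random> generators:)

```cpp
// Sketch only -- not Stan's diag_e_metric. With a diagonal inverse metric,
// p ~ N(0, M) with M_ii = 1 / inv_metric_i, so each component is a standard
// normal scaled by 1 / sqrt(inv_metric_i). No Cholesky is involved.
#include <Eigen/Dense>
#include <cmath>
#include <random>

Eigen::VectorXd sample_p_diag(const Eigen::VectorXd& inv_metric_diag,
                              std::mt19937& rng) {
  std::normal_distribution<double> std_normal(0.0, 1.0);
  Eigen::VectorXd p(inv_metric_diag.size());
  for (int i = 0; i < p.size(); ++i)
    p(i) = std_normal(rng) / std::sqrt(inv_metric_diag(i));
  return p;
}
```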
Nice! @betanalpha you cool with that?
Unfortunately I am not, as this example isn't relevant to the cases where a dense metric would be necessary. In particular the density, and hence the gradient, is artificially cheap due to the assumed independence, and the number of leapfrog steps is relatively small, which makes the overhead look more important than it is. A dense metric is useful when the target density has global correlations, and those typically cost at least O(N^2), not to mention requiring more leapfrog steps per numerical trajectory.

At the very least I would want to see how the overhead behaves for a correlated normal and Student t, say with Sigma_{ij} = rho^{|i - j|}, for a few choices of rho like 0.25, 0.5, and 0.75 and a span of dimensions, say 50, 100, and 500. That wouldn't be exhaustive by any means, but it would provide a much better picture of what a more realistic overhead would be.

Although it limits the practicality of testing, the choice to recompute was made with much higher-dimensional models in mind, i.e. tens of thousands to hundreds of thousands of parameters. It's at those larger scales where the memory starts to become problematic.
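(For reference, the suggested test covariance Sigma_{ij} = rho^{|i - j|} takes only a few lines to build; a sketch assuming Eigen, with N and rho drawn from the values suggested above:)

```cpp
// Sketch: build the suggested test covariance Sigma_ij = rho^|i - j|
// for a correlated normal (or Student t) target of dimension N.
#include <Eigen/Dense>
#include <cmath>

Eigen::MatrixXd test_covariance(int N, double rho) {
  Eigen::MatrixXd Sigma(N, N);
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      Sigma(i, j) = std::pow(rho, std::abs(i - j));
  return Sigma;
}

// e.g. test_covariance(100, 0.5), test_covariance(500, 0.75), ...
```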
Yeah, I was trying to pick an example that exaggerated the differences. Adaptation is turned off, but it's for a problem where the default will be fine. Just directly comparing the costs of doing the ideal thing (diagonal) to the expensive thing (dense). I hit this problem originally with the same model Yi did: https://github.com/bbbales2/cmdstan-warmup/tree/develop/examples/radon. It's just a super simple model and it makes sense to run it with diag, but it's artificially slow if you run it with dense because of this problem.

That's where you'd definitely only want to pre-compute the Cholesky, though. If N is the number of parameters, the Cholesky is O(N^3), right? So even if the memory is O(N^2), that's fine -- you're paying that cost to have the metric to begin with.
… and diagonal metrics (Issue #2881)
Summary:
For N parameters, an NxN Cholesky decomposition is computed every time a new momentum is drawn. In reality we only need to recompute it when the metric changes.
Description:
Check the code here: https://github.com/stan-dev/stan/blob/develop/src/stan/mcmc/hmc/hamiltonians/dense_e_metric.hpp#L54
That llt() only needs to be computed when z.inv_e_metric_ changes, not every time sample_p is called.
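(A minimal sketch of the proposed caching, with hypothetical class and method names rather than the actual dense_e_metric implementation: factor the inverse metric once in a setter and reuse that factor for every momentum draw, so each draw is a single O(N^2) triangular solve instead of an O(N^3) refactorization.)

```cpp
// Sketch only -- not Stan's dense_e_metric. The point: llt() runs when the
// inverse metric is set, not on every call to sample_p.
#include <Eigen/Dense>
#include <random>

class dense_metric_sketch {
 public:
  // Factor the inverse metric once, whenever adaptation updates it.
  void set_inv_metric(const Eigen::MatrixXd& inv_metric) {
    inv_metric_ = inv_metric;
    llt_ = inv_metric_.llt();  // O(N^3), but only when the metric changes
  }

  // Momentum draw: if inv_metric = L L^T, then L^{-T} u with u ~ N(0, I)
  // has covariance (L L^T)^{-1}, i.e. the metric itself.
  Eigen::VectorXd sample_p(std::mt19937& rng) const {
    std::normal_distribution<double> std_normal(0.0, 1.0);
    Eigen::VectorXd u(inv_metric_.rows());
    for (int i = 0; i < u.size(); ++i)
      u(i) = std_normal(rng);
    return llt_.matrixU().solve(u);  // one O(N^2) triangular solve
  }

 private:
  Eigen::MatrixXd inv_metric_;
  Eigen::LLT<Eigen::MatrixXd> llt_;
};
```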
Reproducible Steps:
Sampling a simple model with a large number of parameters should be sufficient. Run that model with a dense metric and compare the times; the difference should be noticeable once that Cholesky is precomputed.
Current Output:
Output is fine, just slow.
Expected Output:
Same output.
Current Version:
v2.21.0