
Doing the decomposition for every momentum draw is inefficient #2881

Open

bbbales2 opened this issue Jan 22, 2020 · 12 comments

@bbbales2
Member

Summary:

For N parameters, an N×N Cholesky decomposition is computed every time a new momentum is drawn. In reality we only need to recompute it when the metric changes.

Description:

Check the code here: https://github.com/stan-dev/stan/blob/develop/src/stan/mcmc/hmc/hamiltonians/dense_e_metric.hpp#L54

That llt() only needs to be computed when z.inv_e_metric_ changes, not every time sample_p is called.
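
To make the proposed fix concrete, here is a minimal sketch of the caching idea. The class and member names are hypothetical, not the actual Stan types: the point is simply to factorize the inverse metric once when it is set and reuse the cached factor on every momentum draw.

// Sketch only: hypothetical names, not the real dense_e_point/dense_e_metric.
#include <Eigen/Dense>
#include <random>

class dense_point_sketch {
 public:
  // Recompute the factorization only when the inverse metric changes.
  void set_inv_metric(const Eigen::MatrixXd& inv_metric) {
    inv_metric_ = inv_metric;
    inv_metric_llt_ = inv_metric_.llt();  // O(N^3), but done once per update
  }

  const Eigen::MatrixXd& inv_metric() const { return inv_metric_; }

  // Draw p ~ N(0, inv_metric^{-1}) from the cached factor: with
  // inv_metric = U^T U, p = U^{-1} u where u ~ N(0, I).
  template <class RNG>
  Eigen::VectorXd sample_p(RNG& rng) const {
    std::normal_distribution<double> std_normal(0.0, 1.0);
    Eigen::VectorXd u(inv_metric_.rows());
    for (Eigen::Index i = 0; i < u.size(); ++i)
      u(i) = std_normal(rng);
    return inv_metric_llt_.matrixU().solve(u);  // O(N^2) triangular solve
  }

 private:
  Eigen::MatrixXd inv_metric_;
  Eigen::LLT<Eigen::MatrixXd> inv_metric_llt_;  // cached Cholesky factor
};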

Reproducible Steps:

Sampling a simple model with a large number of parameters should be sufficient.

parameters {
  real x[500];
}

model {
  x ~ normal(0, 1);
}

Should do the trick. Run that model with:

./test sample num_warmup=0 adapt engaged=0 algorithm=hmc metric=dense_e

And compare the time with:

./test sample num_warmup=0 adapt engaged=0 algorithm=hmc metric=diag_e

The dense run is noticeably slower than the diagonal run, and the gap should shrink substantially once that Cholesky is precomputed.

Current Output:

Output is fine, just slow.

Expected Output:

Same output.

Current Version:

v2.21.0

@bob-carpenter
Contributor

Yikes. We had explicitly discussed coding it originally so it didn't have this inefficiency. I guess that didn't make it to the code.

@SteveBronder
Collaborator

Should it only happen when set_metric is called? Could also move the metric to a protected or private member so it's not accessed outside of set_metric

@bbbales2
Member Author

Should it only happen when set_metric is called? Could also move the metric to a protected or private member so it's not accessed outside of set_metric

Yes, that makes sense. Right now it's accessed directly elsewhere: https://github.com/stan-dev/stan/blob/develop/src/stan/mcmc/hmc/nuts/adapt_diag_e_nuts.hpp#L32, but an accessor makes more sense.
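
As a usage sketch building on the hypothetical dense_point_sketch above (names are illustrative only, not the actual adaptation classes), the adaptation code would go through the setter instead of writing the member directly, so the cached factor can never go stale:

#include <Eigen/Dense>

// Hypothetical adaptation update: the estimated posterior covariance becomes
// the new inverse metric, and the setter refactorizes it exactly once.
void update_inv_metric(dense_point_sketch& z,
                       const Eigen::MatrixXd& estimated_cov) {
  // before: z.inv_e_metric_ = estimated_cov;  // direct write leaves a stale factor
  z.set_inv_metric(estimated_cov);             // after: cached Cholesky stays in sync
}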

@yizhang-yiz

yizhang-yiz commented Feb 19, 2020

Is a fix under way? This slows things down quite a bit when I use the Radon model for testing.

bbbales2 added a commit that referenced this issue Mar 6, 2020
… once each time the inverse metric is set (instead of every sample, Issue #2881).

This involved switching to setters/getters for interfacing with dense_e_point, so I made the change for diag_e_point as well.

Also changed the set_metric naming to set_inv_metric.
@betanalpha
Contributor

The current design is intentional.

The Cholesky is needed only once per transition, which is a relatively small cost compared to the many gradient evaluations needed within each transition. Saving the Cholesky decomposition introduces an additional $\mathcal{O}(N^{2})$ memory burden, which is much more than the $\mathcal{O}(N)$ burden everywhere else, and becomes problematic for sufficiently large models. Without any explicit profiling demonstrating that the Cholesky is a substantial cost for typical models I don't see any strong motivation for the change.

I am inclined to close this issue until less anecdotal evidence of hotspots in the current code is demonstrated.

@SteveBronder
Collaborator

@bbbales2 would you mind running some benchmarks to see if #2894 makes things faster/slower for a large/small model?

@bbbales2
Member Author

Running the example code included at the top of this issue:

diagonal (pull #2894): 0.26s
dense (pull #2894): 0.65s

diagonal (develop): 0.25s
dense (develop): 3.5s

@bob-carpenter
Contributor

That's a huge speedup for dense matrices, which is where we're already paying an O(N^2) memory penalty just for storing the dense metric.

Is there a big memory penalty for the diagonal case? That shouldn't need a Cholesky decomposition.

@bbbales2
Member Author

bbbales2 commented Mar 17, 2020

Is there a big memory penalty for the diagonal case?

I just added it there for symmetry since it was easy to do. We changed how the RNG code is written, but that's just a syntax change: https://github.com/stan-dev/stan/pull/2894/files#diff-6a891b298c8e18322df7f82a1a362732R48

Edit: Oh, I missed the question, sorry. This uses no extra memory in the diagonal case.

@SteveBronder
Collaborator

Nice! @betanalpha you cool with that?

@betanalpha
Contributor

Unfortunately I am not, as this example isn't relevant to the cases where a dense metric would be necessary. In particular, the density, and hence the gradient, is artificially cheap due to the assumed independence, and the number of leapfrog steps per trajectory is relatively small, which makes the overhead look more important than it is. A dense metric is useful when the target density has global correlations, and evaluating such densities typically costs at least $\mathcal{O}(N^{2})$, not to mention requiring more leapfrog steps per numerical trajectory.

At the very least I would want to see how the overhead behaves for a correlated normal and Student-t, say with $\Sigma_{ij} = \rho^{|i - j|}$, for a few choices of $\rho$ like 0.25, 0.5, and 0.75, and a span of dimensions, say 50, 100, and 500. That wouldn't be exhaustive by any means, but it would provide a much better picture of what a more realistic overhead would be.
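
For concreteness, here is a small sketch (illustrative only, not code from the thread or from Stan) of constructing the covariance $\Sigma_{ij} = \rho^{|i - j|}$ described above, e.g. to generate data for such a benchmark:

#include <Eigen/Dense>
#include <cmath>
#include <cstdlib>

// Sigma_{ij} = rho^{|i - j|}: an AR(1)-style correlation matrix.
Eigen::MatrixXd ar1_covariance(int N, double rho) {
  Eigen::MatrixXd Sigma(N, N);
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      Sigma(i, j) = std::pow(rho, std::abs(i - j));
  return Sigma;
}
// e.g. ar1_covariance(500, 0.75) for the largest proposed test case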

Although it limits the practicality of testing, the choice to recompute was made with much higher-dimensional models in mind, i.e. tens of thousands or hundreds of thousands of parameters. It's at those larger scales that the memory starts to become problematic.

@bbbales2
Member Author

Yeah, I was trying to pick an example that exaggerated the difference.

Adaptation is turned off, but it's for a problem where the default will be fine. I'm just directly comparing the cost of the ideal thing (diagonal) to the cost of the expensive thing (dense).

I hit this problem originally with the same model Yi did: https://github.com/bbbales2/cmdstan-warmup/tree/develop/examples/radon . It's just a super simple model that makes sense to run with diag, but it's artificially slow if you run it with dense because of this problem.

much higher dimensional models in mind

That's exactly where you'd want to precompute the Cholesky, though. If N is the number of parameters, the Cholesky is O(N^3), right? So even if the memory is O(N^2), that's fine -- you're already paying that cost just to store the metric in the first place.
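
To spell out the cost accounting behind this point (a rough summary using standard dense linear-algebra costs, not figures from the thread):

$$
\begin{aligned}
\text{storing the dense inverse metric} &: \mathcal{O}(N^{2}) \text{ memory (already paid)} \\
\text{storing a cached Cholesky factor} &: \mathcal{O}(N^{2}) \text{ memory (same order)} \\
\text{one factorization per metric update} &: \mathcal{O}(N^{3}) \text{ flops} \\
\text{one factorization per momentum draw (current)} &: \mathcal{O}(N^{3}) \text{ flops per draw} \\
\text{one triangular solve per momentum draw (cached)} &: \mathcal{O}(N^{2}) \text{ flops per draw}
\end{aligned}
$$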
