
Doing the decomposition for every momentum draw is inefficient #2881

Open

bbbales2 opened this issue Jan 22, 2020 · 12 comments

@bbbales2
Member

Summary:

For N parameters, an N×N Cholesky decomposition is computed every time a new momentum is drawn. In reality we only need to recompute it when the metric changes.

Description:

Check the code here: https://github.com/stan-dev/stan/blob/develop/src/stan/mcmc/hmc/hamiltonians/dense_e_metric.hpp#L54

That llt() only needs to be computed when z.inv_e_metric_ changes, not every time sample_p is called.
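
To make the proposed fix concrete, here is a minimal sketch of the caching idea. The class and member names are hypothetical, not the actual Stan types: the point is simply to factorize the inverse metric once when it is set and reuse the cached factor on every momentum draw.

// Sketch only: hypothetical names, not the real dense_e_point/dense_e_metric.
#include <Eigen/Dense>
#include <random>

class dense_point_sketch {
 public:
  // Recompute the factorization only when the inverse metric changes.
  void set_inv_metric(const Eigen::MatrixXd& inv_metric) {
    inv_metric_ = inv_metric;
    inv_metric_llt_ = inv_metric_.llt();  // O(N^3), but done once per update
  }

  const Eigen::MatrixXd& inv_metric() const { return inv_metric_; }

  // Draw p ~ N(0, inv_metric^{-1}) from the cached factor: with
  // inv_metric = U^T U, p = U^{-1} u where u ~ N(0, I).
  template <class RNG>
  Eigen::VectorXd sample_p(RNG& rng) const {
    std::normal_distribution<double> std_normal(0.0, 1.0);
    Eigen::VectorXd u(inv_metric_.rows());
    for (Eigen::Index i = 0; i < u.size(); ++i)
      u(i) = std_normal(rng);
    return inv_metric_llt_.matrixU().solve(u);  // O(N^2) triangular solve
  }

 private:
  Eigen::MatrixXd inv_metric_;
  Eigen::LLT<Eigen::MatrixXd> inv_metric_llt_;  // cached Cholesky factor
};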

Reproducible Steps:

Sampling a simple model with a large number of parameters should be sufficient.

parameters {
  real x[500];
}

model {
  x ~ normal(0, 1);
}

Should do the trick. Run that model with:

./test sample num_warmup=0 adapt engaged=0 algorithm=hmc metric=dense_e

And compare the time with:

./test sample num_warmup=0 adapt engaged=0 algorithm=hmc metric=diag_e

The dense run is noticeably slower than the diagonal run, and the gap should shrink substantially once that Cholesky is precomputed.

Current Output:

Output is fine, just slow.

Expected Output:

Same output.

Current Version:

v2.21.0

@bob-carpenter
Contributor

Yikes. We had explicitly discussed coding it originally so it didn't have this inefficiency. I guess that didn't make it to the code.

@SteveBronder
Collaborator

Should it only happen when set_metric is called? Could also move the metric to a protected or private member so it's not accessed outside of set_metric

@bbbales2
Member Author

Should it only happen when set_metric is called? Could also move the metric to a protected or private member so it's not accessed outside of set_metric

Yes, that makes sense. Right now it's accessed directly elsewhere: https://github.com/stan-dev/stan/blob/develop/src/stan/mcmc/hmc/nuts/adapt_diag_e_nuts.hpp#L32, but an accessor makes more sense.
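
As a usage sketch building on the hypothetical dense_point_sketch above (names are illustrative only, not the actual adaptation classes), the adaptation code would go through the setter instead of writing the member directly, so the cached factor can never go stale:

#include <Eigen/Dense>

// Hypothetical adaptation update: the estimated posterior covariance becomes
// the new inverse metric, and the setter refactorizes it exactly once.
void update_inv_metric(dense_point_sketch& z,
                       const Eigen::MatrixXd& estimated_cov) {
  // before: z.inv_e_metric_ = estimated_cov;  // direct write leaves a stale factor
  z.set_inv_metric(estimated_cov);             // after: cached Cholesky stays in sync
}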

@yizhang-yiz

yizhang-yiz commented Feb 19, 2020

Is a fix under way? This slows things down quite a bit when I use the Radon model for testing.

bbbales2 added a commit that referenced this issue Mar 6, 2020
… once each time the inverse metric is set (instead of every sample, Issue #2881).

This involved switching to setters/getters for interfacing with dense_e_point, so I made the change for diag_e_point as well.

Also changed the set_metric naming to set_inv_metric.
@betanalpha
Contributor

The current design is intentional.

The Cholesky is needed only once per transition, which is a relatively small cost compared to the many gradient evaluations needed within each transition. Saving the Cholesky decomposition introduces an additional $\mathcal{O}(N^{2})$ memory burden, which is much more than the $\mathcal{O}(N)$ burden everywhere else, and becomes problematic for sufficiently large models. Without any explicit profiling demonstrating that the Cholesky is a substantial cost for typical models I don't see any strong motivation for the change.

I am inclined to close this issue until less anecdotal evidence of hotspots in the current code is demonstrated.

@SteveBronder
Collaborator

@bbbales2 would you mind running some benchmarks to see if #2894 makes things faster/slower for a large/small model?

@bbbales2
Member Author

Running the example code included at the top of this issue:

diagonal (pull #2894): 0.26s
dense (pull #2894): 0.65s

diagonal (develop): 0.25s
dense (develop): 3.5s

@bob-carpenter
Contributor

That's a huge speedup for dense matrices, which is where we're already paying an O(N^2) memory penalty just for storing the dense metric.

Is there a big memory penalty for the diagonal case? That shouldn't need a Cholesky decomposition.

@bbbales2
Member Author

bbbales2 commented Mar 17, 2020

Is there a big memory penalty for the diagonal case?

I just added it there for symmetry since it was easy to do. We changed how the RNG code is written, but that's just a syntax change: https://github.com/stan-dev/stan/pull/2894/files#diff-6a891b298c8e18322df7f82a1a362732R48

Edit: Oh, I missed the question, sorry. This uses no extra memory in the diagonal case.

@SteveBronder
Collaborator

Nice! @betanalpha you cool with that?

@betanalpha
Contributor

Unfortunately I am not, as this example isn't relevant to the cases where a dense metric would be necessary. In particular, the density, and hence the gradient, is artificially cheap due to the assumed independence, and the number of leapfrog steps per trajectory is relatively small, which makes the overhead look more important than it is. A dense metric is useful when the target density has global correlations, and evaluating such densities typically costs at least $\mathcal{O}(N^{2})$, not to mention requiring more leapfrog steps per numerical trajectory.

At the very least I would want to see how the overhead behaves for a correlated normal and Student-t, say with $\Sigma_{ij} = \rho^{|i - j|}$, for a few choices of $\rho$ like 0.25, 0.5, and 0.75, and a span of dimensions, say 50, 100, and 500. That wouldn't be exhaustive by any means, but it would provide a much better picture of what a more realistic overhead would be.
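
For concreteness, here is a small sketch (illustrative only, not code from the thread or from Stan) of constructing the covariance $\Sigma_{ij} = \rho^{|i - j|}$ described above, e.g. to generate data for such a benchmark:

#include <Eigen/Dense>
#include <cmath>
#include <cstdlib>

// Sigma_{ij} = rho^{|i - j|}: an AR(1)-style correlation matrix.
Eigen::MatrixXd ar1_covariance(int N, double rho) {
  Eigen::MatrixXd Sigma(N, N);
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      Sigma(i, j) = std::pow(rho, std::abs(i - j));
  return Sigma;
}
// e.g. ar1_covariance(500, 0.75) for the largest proposed test case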

Although it limits the practicality of testing, the choice to recompute was made with much higher-dimensional models in mind, i.e. tens of thousands or hundreds of thousands of parameters. It's at those larger scales that the memory starts to become problematic.

@bbbales2
Member Author

Yeah, I was trying to pick an example that exaggerated the difference.

Adaptation is turned off, but it's for a problem where the default will be fine. I'm just directly comparing the cost of the ideal thing (diagonal) to the cost of the expensive thing (dense).

I hit this problem originally with the same model Yi did: https://github.com/bbbales2/cmdstan-warmup/tree/develop/examples/radon . It's just a super simple model that makes sense to run with diag, but it's artificially slow if you run it with dense because of this problem.

much higher dimensional models in mind

That's exactly where you'd want to precompute the Cholesky, though. If N is the number of parameters, the Cholesky is O(N^3), right? So even if the memory is O(N^2), that's fine -- you're already paying that cost just to store the metric in the first place.
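
To spell out the cost accounting behind this point (a rough summary using standard dense linear-algebra costs, not figures from the thread):

$$
\begin{aligned}
\text{storing the dense inverse metric} &: \mathcal{O}(N^{2}) \text{ memory (already paid)} \\
\text{storing a cached Cholesky factor} &: \mathcal{O}(N^{2}) \text{ memory (same order)} \\
\text{one factorization per metric update} &: \mathcal{O}(N^{3}) \text{ flops} \\
\text{one factorization per momentum draw (current)} &: \mathcal{O}(N^{3}) \text{ flops per draw} \\
\text{one triangular solve per momentum draw (cached)} &: \mathcal{O}(N^{2}) \text{ flops per draw}
\end{aligned}
$$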
