Embedded Laplace approximation #755

Closed
charlesm93 opened this issue Feb 7, 2018 · 29 comments

@charlesm93
Contributor

Summary:

Want to create a Laplace approximation for the following distributions:

  • Normal
  • Poisson
  • Binomial
  • Negative binomial

Description:

Doing a Laplace approximation with the current algebraic solver (which uses a dogleg solver) is prohibitively slow. Instead we could use a Newton line-search method, coupled with some analytical derivatives. This line-search method would be specialized and therefore not exposed to the Stan language, but the approximator would be. Aki and Dan have a prototype in Stan, which uses autodiff. The C++ implementation would naturally be more optimized.
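For orientation, here is a rough sketch of the kind of Newton update involved (illustrative only: the dense LU solve and the names below are placeholders, not the proposed implementation).

#include <Eigen/Dense>

// Illustrative only: one Newton step for the mode of
//   p(theta | y) proportional to N(theta | 0, K) * p(y | theta),
//   theta_new = (K^{-1} + W)^{-1} (W * theta + grad),
// where grad = d/dtheta log p(y | theta) and W = -d^2/dtheta^2 log p(y | theta),
// taken diagonal here. A real implementation would use a Cholesky of the
// symmetrized system plus a line search, not a dense LU solve.
Eigen::VectorXd newton_step(const Eigen::MatrixXd& K,
                            const Eigen::VectorXd& grad,
                            const Eigen::VectorXd& W_diag,
                            const Eigen::VectorXd& theta) {
  Eigen::VectorXd b = W_diag.cwiseProduct(theta) + grad;
  // (K^{-1} + W)^{-1} b  ==  (I + K W)^{-1} K b
  Eigen::MatrixXd A = Eigen::MatrixXd::Identity(K.rows(), K.cols())
                      + K * W_diag.asDiagonal();
  return A.partialPivLu().solve(K * b);
}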

Current Version:

v2.17.0

@bob-carpenter
Contributor

Is the only use case you anticipate through the Stan language? Is the idea to have approximate distributions, like poisson_approx or something?

This issue eventually needs to specify what the signatures are of any functions that are being proposed for the math library.

@charlesm93: I assigned this to you for v3.0.0 (that's where I'm putting things we definitely want to do, but not necessarily for the next release). It helps to add labels to classify the issue.

@avehtari

This is part of the path towards INLA speed for Gaussian latent variable models in Stan. The anticipated first use case is compound functions for GLMs with Laplace integration over the weights. Issue #493 describes a compound function

y ~ logistic_regression(x, beta);

where speed-up is obtained with analytic gradients. With Laplace approximation it would be possible to add something like

y ~ logistic_regression_la(x, sigma);

which would assume a Gaussian prior on beta with sd sigma, and would use a Laplace approximation to integrate over beta. Instead of logistic, the first ones to implement would be Normal, Poisson, Binomial, and Negative-Binomial. Eventually there will be different structured Gaussian priors, like hierarchical, Markov random field, and Gaussian process priors. Ping @dpsimpson

@wds15
Contributor

wds15 commented Feb 11, 2018

@avehtari this sounds great... Shouldn't we have a different sigma for the intercept and the other regressors? I would appreciate that, since I don't much like the data-driven priors you essentially end up with if I can only give a single sigma and therefore have to transform my data. Maybe it's too early for these comments, but this is relevant from my perspective.

@avehtari

shouldn't we have a different sigma for the intercept and all other regressors?

Sure. I didn't say that sigma would be a scalar. We'll make a more detailed proposal soon.

@SteveBronder
Collaborator

SteveBronder commented Sep 27, 2020

copy pasta for email

One other little thing if you want to squeeze out the last little bits of speed. Note that Stan math functions always return an actual matrix, whereas one of the main benefits of Eigen is composing multiple expressions together so that at compile time it writes them out very efficiently. If you are working with matrices of doubles you should use Eigen methods for faster code. Here's a trick you can do with things like third_diff():

template <typename T>
Eigen::Matrix<T, Eigen::Dynamic, 1>
third_diff(const Eigen::Matrix<T, Eigen::Dynamic, 1>& theta) const {
  Eigen::VectorXd exp_theta = exp(theta);
  Eigen::VectorXd one = Eigen::VectorXd::Ones(theta.size());
  // Note the auto type!!
  auto nominator = exp_theta.cwiseProduct(exp_theta - one);
  auto denominator = (one + exp_theta).array().sqrt().matrix().cwiseProduct(one + exp_theta);
  return n_samples_.cwiseProduct((nominator.array() / denominator.array()).matrix());
}

So the main bit here is that nominator and denominator being auto means they are a wacky expression type like ColWiseProd<CWiseSubtract<Matrix, Matrix>, Matrix> that is not actually evaluated at that line. So when you pass those into the return line, the expression for the result is generated fully. (You could also just slap the code for nominator and denominator into the return line, but then it starts becoming a bit of a mouthful.) You should only do this in places where you use the expression once; otherwise it's better to eval it into a temporary, like you are currently doing with one. Here's an example that shows the weird expression types that nominator, denominator, and the return value end up with:

https://godbolt.org/z/7bE5n1

@SteveBronder
Collaborator

SteveBronder commented Sep 27, 2020

Also, writing the laplace vari using reverse_pass_callback() and var<matrix> for the covariance should be very efficient:

https://github.com/stan-dev/math/blob/try-laplace_approximation2/stan/math/laplace/laplace_pseudo_target.hpp#L29
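For illustration, here is a minimal sketch of that pattern (not the actual laplace vari; the function, its name, and its derivative are made up): compute the value with plain double matrices, then register a callback that fills in the adjoints during the reverse pass.

#include <stan/math/rev.hpp>

// Hypothetical example of the reverse_pass_callback() + var<matrix> pattern:
// f(K, c) = c * trace(K), with adjoints propagated by the registered callback.
inline stan::math::var scaled_trace(
    const stan::math::var_value<Eigen::MatrixXd>& K, const stan::math::var& c) {
  // Forward pass uses only double-valued matrices.
  stan::math::var ret = c.val() * K.val().trace();
  stan::math::reverse_pass_callback([K, c, ret]() mutable {
    // d ret / d K = c * I,  d ret / d c = trace(K)
    K.adj().diagonal().array() += ret.adj() * c.val();
    c.adj() += ret.adj() * K.val().trace();
  });
  return ret;
}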

@dpsimpson
Contributor

dpsimpson commented Sep 27, 2020 via email

@SteveBronder
Collaborator

Whups, good catch! I just searched for a Laplace issue and this was the nearest-looking one.

@charlesm93
Contributor Author

Good catch @dpsimpson. The reason we have two branches is because the first branch was written when we submitted the manuscript on the adjoint-differentiated Laplace approximation (with code discussed in the paper). The second branch came after discussion on how to improve the code before StanCon, but we didn't want to break "backward compatibility" with the paper.

We get to update the manuscript before its publication, so I'll link to the new branch and delete the old one.

@serban-nicusor-toptal serban-nicusor-toptal modified the milestones: 3.3.0++, 3.4.0++ Nov 3, 2020
@charlesm93 charlesm93 changed the title Laplace approximation Embedded Laplace approximation Dec 6, 2021
@charlesm93
Contributor Author

I deleted a bunch of files which I used for my research but which wouldn't make it into production code. Should we need to retrieve those files, the commit with those changes and the commit just prior to it are:

commit 031a81b8bcc9bfef3dd5c81f791895086db33765 (HEAD -> experimental/laplace, origin/experimental/laplace)
Author: Charles Margossian <[email protected]>
Date:   Sun Dec 5 19:23:31 2021 -0500

    Delete research tests which are / will not be unit tests.

commit b2e4b2fdfc5a8a0b638eebacb075a7d2e89e572a
Author: Charles Margossian <[email protected]>
Date:   Sun Dec 5 19:11:08 2021 -0500

    make laplace_marginal_bernoulli_logit_lpmf use general likelihood functor.

Both are dated December 5th. One file we might want to pull back out at some point is the draft of the negative binomial likelihood.

Currently, every unit test under test/unit/math/laplace/ passes (even though some tests are incomplete). I haven't touched the rng tests yet. These files use the analytical derivatives for the likelihood that I worked out in early prototypes of the embedded Laplace. This will be replaced with automatic differentiation, which is a tad more efficient and requires much less coding for every new likelihood we add. There is also a unit test for the user-specified likelihood, even though this feature is more experimental. I have already switched to autodiff for the function that computes the approximate marginal likelihood. Once the rng files are updated, we can delete a good chunk of code under laplace_likelihood_bernoulli_logit and laplace_likelihood_poisson_log.

Since we might want to further compare autodiff and analytical derivatives, I think it's fine to keep this code in for now.
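As a rough sketch of what the autodiff route looks like (the likelihood functor and calling code below are stand-ins, not the interface on the branch), the gradient and Hessian of a log likelihood with respect to theta can come from stan::math::hessian instead of hand-coded derivatives.

#include <stan/math/mix.hpp>
#include <Eigen/Dense>

// Hypothetical Poisson-log log-likelihood functor; stands in for a
// user- or library-supplied likelihood.
struct poisson_log_ll {
  Eigen::VectorXd y;          // counts per group
  Eigen::VectorXd n_samples;  // number of observations per group
  template <typename T>
  T operator()(const Eigen::Matrix<T, Eigen::Dynamic, 1>& theta) const {
    using std::exp;
    T lp = 0;
    for (int i = 0; i < theta.size(); ++i)
      lp += y(i) * theta(i) - n_samples(i) * exp(theta(i));
    return lp;
  }
};

// Usage sketch: gradient and (dense) Hessian via nested autodiff.
//   double lp;
//   Eigen::VectorXd grad;
//   Eigen::MatrixXd hess;
//   stan::math::hessian(poisson_log_ll{y, n_samples}, theta, lp, grad, hess);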

@charlesm93
Contributor Author

For any future modifications, the unit tests that currently pass (and should keep passing) are:

  • laplace_marginal_poisson_log_lpmf_test.cpp
  • laplace_marginal_bernoulli_logit_lpmf_test.cpp
  • laplace_marginal_lpdf_test.cpp

All these tests check for consistency between (first-order reverse-mode) autodiff and finite differences. For most tests I have benchmarks from GPstuff, but where indicated in the code a benchmark is missing.

@charlesm93
Contributor Author

charlesm93 commented Dec 6, 2021

Todo list for myself:

  • write a primer design doc (per @SteveBronder 's request)
  • complete unit tests for existing functions using benchmarks from GPstuff (demo for poisson log likelihood)
  • work out unit tests for rng functions (demo for poisson log likelihood)

EDIT: updated 01/16.

@dpsimpson
Contributor

dpsimpson commented Dec 6, 2021 via email

@charlesm93
Contributor Author

What would be the best way to test the rng? I can try to come up with limiting cases for each example, but I'm wondering if there is a general approach. How was this done for INLA and GPstuff?

@dpsimpson
Contributor

It's hard to work out how to test this as it currently stands, because the function does three things: computes the Laplace approximation (tested elsewhere), computes the mean and covariance matrix (not tested), and does the multi-normal rng (the function is tested, but this specific call is not). My temptation (which might not be very smart) would be to separate the computation of the mean and the covariance into an inline function and test it separately. Then you can test the rest of it in the same way that, say, multi_normal_prec_rng is tested.

@wds15
Contributor

wds15 commented Dec 6, 2021

Ripping code apart into testable things is not at all not smart from what I learned.

@dpsimpson
Contributor

The other option is to do a probabilistic test that runs the rng 1k+++ times and computes the mean and covariances, but that feels like it would be a very expensive test (and it would occasionally fail). I can't think of another way to test a multivariate distribution without direct access to that mean and covariance matrix.

@dpsimpson
Contributor

Actually, there is one other possibility, that depends on how the rng argument works. If you can control its state, you could compare the output from laplace_base_rng to the multi_normal_rng with analytically computed mean and covariance. That would probably be the sensible thing because they should be (floating point) identical.

@bob-carpenter
Contributor

Ripping code apart into testable things is not at all not smart from what I learned.

There are at least three concerns that need to be balanced:

  1. efficiency
  2. code readability and maintainability
  3. testability

I think @wds15 just means that (1) is usually opposed to (2) and (3).

@wds15
Contributor

wds15 commented Dec 6, 2021

One different approach, which I used for one of my R packages where I wanted to run SBC, is to proxy the test results from the SBC run. So what I do is:

  1. Run SBC on a cluster and store the final histograms needed to calculate the chi-square statistic. This is stored with a time-stamp.
  2. The unit test then
    a) checks that the time-stamp of the stored SBC run is no more than half a year older than the runtime of the test, and
    b) tests for uniformity via a chi-square test, allowing some failures according to what I would expect.

This way I can have a massive SBC simulation whose key results are part of the test evidence, and I ensure that the SBC outputs are not too old. That's not ideal, but a compromise. Maybe a more sophisticated version of this would be useful here as well?

@dpsimpson
Contributor

dpsimpson commented Dec 7, 2021 via email

@charlesm93
Contributor Author

I ended up running with one of @dpsimpson's suggestions and implementing it in laplace_poisson_log_rng_test.cpp. I took a limiting case with a diagonal prior covariance and worked out the derivative of the objective function. This gets passed to algebra_solver to obtain the mean. The covariance of the Laplace approximation can then be calculated by hand (as a function of the mode).

I then ran a bunch of simulations (1e6) and checked estimators of the mean, variance, and covariance. On a two-group problem, this test takes 8 seconds to run. That's not ideal, but it's not horrendous.

I'm not sure how to control the rng state. Both laplace_poisson_log_rng and multi_normal_rng take an rng argument of type

boost::random::mt19937

but I couldn't fix its state. I'm assuming it's possible, in which case we can implement Dan's second proposal. It's a bit of work to get the analytical results, but only a bit, and I think it's good to have unit tests which rely on analytical calculations.

@charlesm93
Contributor Author

Thanks @SteveBronder for showing me how to control the random seed. Using this, I was able to produce the same sample using the embedded Laplace and multi_normal_rng. The mean and covariance for the latter are calculated (semi-)analytically.

  boost::random::mt19937 rng;
  rng.seed(1954);
  Eigen::MatrixXd theta_pred
    = laplace_poisson_log_rng(sums, n_samples, covariance_function, phi,
                              x_dummy, d0, di0, theta_0, rng);

  // Compute exact mean and covariance.
  Eigen::VectorXd theta_root
    = algebra_solver(stationary_point(), theta_0, phi, d0, di0);
  Eigen::MatrixXd K_laplace
    = laplace_covariance(theta_root, phi);

  rng.seed(1954);
  Eigen::MatrixXd theta_benchmark
    = multi_normal_rng(theta_root, K_laplace, rng);

  double tol = 1e-3;
  EXPECT_NEAR(theta_benchmark(0), theta_pred(0), tol);
  EXPECT_NEAR(theta_benchmark(1), theta_pred(1), tol);

Is this enough? I think this checks the boxes, i.e. making sure the rng function generates a sample from the correct approximating normal distribution.

@SteveBronder
Collaborator

Seems good! Is this all up on a branch somewhere?

@charlesm93
Contributor Author

Yes. experimental/laplace.

@SteveBronder
Collaborator

SteveBronder commented Feb 18, 2022

Just making a note that we removed an experimental feature to flip between diagonal and non-diagonal covariance estimates in commit 6b97f1f. We also removed the do_line_search argument there (and just use max_line_search > 0 to decide whether to run the line search).

@charlesm93
Contributor Author

charlesm93 commented May 18, 2022

I'm going to work on the following update:

  • constrain hessian_block_size > 0. Right now, hessian_block_size = 0 is used to store W in a vector, as opposed to the sparse Eigen matrix you would get with hessian_block_size = 1. I find using a vector more efficient than using a sparse matrix, but having both the 0 and 1 cases is confusing. So hessian_block_size = 1 will now use a vector, and 0 will no longer be an acceptable value.

Update: ... and job done.

Next up: update the notebook and the examples therein (and done!)

@charlesm93
Contributor Author

Up next, I want to tackle the sparse covariance case. Of interest is the scenario where the Hessian and the covariance are both sparse, as is the case in a model from our BLS colleagues. In this scenario the B-matrix (the one matrix on which we perform a Cholesky decomposition) also becomes sparse. If B is diagonal all operations become O(n), as opposed to O(n^3). More generally, if m is the largest block size between the Hessian and the covariance, the complexity of the method would be O(nm^2).
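For intuition only, here is a toy sketch of the diagonal special case (assuming the inner solve takes the usual (K^{-1} + W)^{-1} b form; this is not the planned production code).

#include <Eigen/Dense>

// Toy illustration of the O(n) claim: when both W and the prior covariance K
// are diagonal, the solve (K^{-1} + W)^{-1} b needs no dense Cholesky at all;
// it reduces to elementwise operations.
Eigen::VectorXd solve_diag(const Eigen::VectorXd& K_diag,
                           const Eigen::VectorXd& W_diag,
                           const Eigen::VectorXd& b) {
  return b.cwiseQuotient(K_diag.cwiseInverse() + W_diag);
}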

Before implementing this at a production level, I want to build a proof of concept. My idea would be to add an optional argument:

  • cov_block_size, which defaults to n.
  • As a first step, only handle the special case where both hessian_block_size = 1 and cov_block_size = 1 in the Newton solver, and test this. If this works, update the autodiff method.
  • As a test, I'll start with the disease map model but using a diagonal covariance.

@charlesm93
Contributor Author

Moving this to #3065
