
extend gpdfit to return marginal posterior of k #186 #188

Closed
wants to merge 3 commits
Conversation

@ozanstats commented Oct 10, 2021

Addresses #186 (cc @jgabry @avehtari).

Extended gpdfit() to return the marginal posterior density of k in addition to k-hat and sigma-hat. We could also create a separate function, but this seemed like the better approach to me.

  1. Updated the profile log-likelihood calculation to use an outer product (%o%) instead of vapply, which makes it more efficient (see the standalone sketch at the end of this comment):
# switched to
k <- matrixStats::rowMeans2(log1p(-theta %o% x))
N * (log(-theta / k) - k - 1)

# instead of
a <- -theta
k <- vapply(a, FUN = function(a_i) mean(log1p(a_i * x)), FUN.VALUE = numeric(1))
N * (log(a / k) - k - 1)
  2. Dropped the lx() function and moved its updated contents into gpdfit() to avoid repeated calculations, since the code example in the issue thread actually computes the k values twice:
# first (within the `lx()` function)
l_theta <- N * lx(theta, x) # profile log-lik

# second 
ks <- numeric(length=M)
for (i in 1:M) {
  ks[i] <- mean.default(log1p(-theta[i] * x))
}
  3. Suggested new names for the gpdfit() output components, following the notation of Zhang, J., and Stephens, M. A. (2009):
# switched to
nlist(k_hat, sigma_hat, k, w_theta)

# instead of
nlist(k, sigma)

This name change shouldn't really cause backward-compatibility issues, but it's not that important anyway; I'll use whatever you believe is more appropriate.
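To make items 1 and 2 concrete, here is a self-contained sketch (toy data and a toy quadrature grid, not the actual gpdfit() internals) showing that the outer-product version reproduces the loop:

# sketch only: toy data and toy quadrature grid
set.seed(1)
x <- sort(exp(rnorm(100)))                         # toy sample
N <- length(x)
theta <- seq(-1, -0.01, length.out = 20) / max(x)  # toy grid of theta values

# one pass: k for every theta via an outer product
k <- matrixStats::rowMeans2(log1p(-theta %o% x))
l_theta <- N * (log(-theta / k) - k - 1)           # profile log-likelihood

# equivalent loop version, computing each k separately
l_loop <- vapply(theta, function(t) {
  k_i <- mean(log1p(-t * x))
  N * (log(-t / k_i) - k_i - 1)
}, numeric(1))

all.equal(l_theta, l_loop)  # TRUE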

@avehtari (Collaborator) left a comment

Thanks for the PR! I'm ok with the name changes.

Based on the discussion in the issue thread, we still need to decide whether we want to return normalized quadrature weights or normalized densities.

@ozanstats (Author)

@avehtari - I can update the function to estimate the normalization term with the trapezoidal rule once you decide what to return. Below are some options:

# normalization term estimation with the trapezoidal rule
c <- 2 / sum((w_theta[-M] + w_theta[-1]) * (k[-M] - k[-1]))

# normalized densities
d_theta <- c * w_theta

# some return options:
nlist(k_hat, sigma_hat, k, w_theta) # normalized quadrature weights
nlist(k_hat, sigma_hat, k, d_theta) # normalized densities
nlist(k_hat, sigma_hat, k, w_theta, c) # normalized quadrature weights + density normalization term
nlist(k_hat, sigma_hat, k, w_theta, d_theta) # normalized quadrature weights + normalized densities
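As a rough standalone check of the normalization (made-up grid and stand-in weights; assuming the k values decrease along the grid, which matches the k[-M] - k[-1] sign convention in the snippet above):

# toy check: hypothetical decreasing k grid and stand-in weights
k <- seq(0.9, 0.1, length.out = 50)
w_theta <- dnorm(k, mean = 0.5, sd = 0.2)
M <- length(k)

c <- 2 / sum((w_theta[-M] + w_theta[-1]) * (k[-M] - k[-1]))
d_theta <- c * w_theta

# the normalized densities now integrate to ~1 over the grid
sum((d_theta[-M] + d_theta[-1]) * (k[-M] - k[-1])) / 2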

@ozanstats (Author)

I noticed a small side effect of this recent commit: when the estimation doesn't work, k-hat and sigma-hat are no longer necessarily Inf and NaN, respectively:

set.seed(147)
x <- exp(rnorm(10))

# CRAN version (without the commit)
gpdfit(x, sort_x = FALSE)
$k
[1] Inf

$sigma
[1] NaN

# GitHub version on master (with the commit)
gpdfit(x, sort_x = FALSE)
$k
[1] NA

$sigma
[1] NA

Should we avoid NAs and always return Inf and NaN here? I can also suppress the "NaNs produced" warnings that occur in these cases.
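For instance, a minimal guard could look like this (a sketch only; `fit_failed` is a placeholder for however the failure is detected inside gpdfit()):

# sketch: restore Inf/NaN on estimation failure
# `fit_failed` stands in for whatever condition signals a failed fit
if (fit_failed || !is.finite(k_hat)) {
  k_hat <- Inf      # extremely long tail
  sigma_hat <- NaN  # sigma is undefined in this case
}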

@avehtari (Collaborator)

> Should we avoid NAs and always return Inf and NaN here?

I would prefer khat=Inf when the failure is due to an extremely long tail. At least in the cases where I've seen Inf appearing, the tail was very long.

@jgabry (Member) commented Oct 11, 2021

Thanks for the PR!

> Should we avoid NAs and always return Inf and NaN here?

Yeah, I think Inf for k and NaN for sigma would be better, thanks!

> Based on the discussion in the issue thread, we still need to decide whether we want to return normalized quadrature weights or normalized densities.

I haven't had a chance to think much about this. I'm ok with whatever @avehtari decides.

@avehtari (Collaborator)

Thinking about it for a moment, I believe normalized densities would make more sense; and if someone would like normalized quadrature weights, it's easier to obtain them from normalized densities than vice versa.

@ParadaCarleton

Quick question -- is the gpd_fit posterior actually a good one? The estimator is good, but because the prior is derived from the data, the posterior will be overconfident. This doesn't matter much if you just need k-hat, but I think you'd have to test how much the empirical prior affects the overconfidence.

@avehtari (Collaborator)

I have tested this. With sample sizes typical for gpd_fit use, and with tail thicknesses in the interesting region (<0.7), the gpd_fit approximation is very good, with no practical difference compared to the marginals of the full posterior computed with Stan. I guess I should add this comment to the PSIS paper revision.

Autumn has been very busy, but I hope to be able to go through some PRs and issues in a few weeks.

@avehtari (Collaborator) commented Jan 6, 2022

Thanks to a question by @sethaxen, I realized that what is named "densities" above are not yet densities. The algorithm originally computes quadrature weights, and for those the most natural scale would be normalized weights. To compute densities, it is not enough to compute the normalization term, because each quadrature point represents a different volume of the parameter space. I'll prepare a note with an illustration and code. We can then discuss what we should return and how to document clearly what we are returning.

@avehtari (Collaborator)

Here is an illustration:

[figure: likelihood and posterior contours with the profile line and quadrature points]

The yellow contours show the likelihood, the red dashed contours show the posterior, the blue line is the profile line, and the blue dots are the quadrature points. w_theta are the normalized likelihood values at the quadrature points, and the effect of the prior comes from the non-equidistant spacing of those points. To plot the marginal posterior of k, we need to evaluate the posterior density at the quadrature points (or at any other evaluation points).

The following code change returns the quadrature weights k_w, which are specific to the quadrature points k, and the posterior densities k_d evaluated at those points:

  # quadrature weights for k are same as for theta
  k_w <- w_theta
  # quadrature weights are just the normalized likelihoods
  # we get the unnormalized posterior by multiplying these by the prior
  # dgpd() is the GPD density (extraDistr::dgpd in these experiments);
  # prior and xstar come from the gpdfit() internals
  k_d <- k_w * dgpd(-theta, mu = -1 / x[N], sigma = 1 / prior / xstar, xi = 0.5)
  # normalize using the trapezoidal rule
  Z <- sum((k_d[-M] + k_d[-1]) * (k[-M] - k[-1])) / 2
  k_d <- k_d / Z

  nlist(k_hat, sigma_hat, k, k_w, k_d)
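With that return value, plotting the profile approximation of the marginal posterior of k is straightforward (a usage sketch; `fit` is a hypothetical result of the modified gpdfit()):

fit <- gpdfit(x)  # hypothetical modified version returning k, k_w, k_d
plot(fit$k, fit$k_d, type = "l",
     xlab = "k", ylab = "density",
     main = "Profile approximation of the marginal posterior of k")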

Here is the exact marginal posterior (red dashed) and the profile posterior (blue) corresponding to the figure above:

[figure: exact marginal posterior vs. profile posterior]

The following plot shows comparisons with one data realization each, using M = 20, 100, 1000 and k = 0.3, 0.5, 0.7, 0.9:

[figure: grid of comparisons over M and k]

The exact marginal posterior is wider than the profile posterior, as expected, but the profile posterior is quite close and thus useful for presenting the uncertainty. In my experiments I used extraDistr::dgpd, but to avoid the dependency we should implement it in loo. Computing the prior added maybe 10% or less computation time, so the marginal approximation could also be moved to a separate function.

@ParadaCarleton

> I have tested this. With sample sizes typical for gpd_fit use, and with tail thicknesses in the interesting region (<0.7), the gpd_fit approximation is very good, with no practical difference compared to the marginals of the full posterior computed with Stan. I guess I should add this comment to the PSIS paper revision.
>
> Autumn has been very busy, but I hope to be able to go through some PRs and issues in a few weeks.

I see; does that hold even when the samples aren't IID? That would also tend to make the posterior too narrow. Perhaps "correcting" the log-likelihood by multiplying it by r_eff would fix that problem well enough in practice, though I'm not sure such an approach would be fully justified.
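For concreteness, the tempering idea in code form (a sketch only; r_eff is a hypothetical relative-efficiency input, this is a heuristic rather than a justified correction as said above, and theta, k, and N are as in the profile log-likelihood snippets earlier in the thread):

# sketch: temper the profile log-likelihood by the relative efficiency r_eff,
# which is equivalent to raising the likelihood to the power r_eff
l_theta <- r_eff * N * (log(-theta / k) - k - 1)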

I also wonder how Zhang's method compares to a Laplace approximation of the posterior in terms of performance.

@avehtari (Collaborator)

> I see; does that hold even when the samples aren't IID? That would also tend to make the posterior too narrow.

True. I don't think this is an issue, as the plan at the moment is just to provide some indication of the uncertainty. Of course, if you plan to use it for something that needs more accuracy, then follow the usual extreme-value analysis procedure and thin the chains to approximately independent draws. For the Pareto-k diagnostic and PSIS, a too-narrow posterior does no harm as long as there is no big change in the posterior mean, which seems to be the case based on experiments with dependencies typical for MCMC from Stan. If you are worried, just thin.

> I also wonder how Zhang's method compares to a Laplace approximation of the posterior in terms of performance.

Not a huge difference, but it doesn't matter much either way, because the posterior is much more skewed than a normal, so the posterior mode is clearly different from the posterior mean.

@ozanstats (Author)

  • Tried to update gpdfit() to reflect the improvements @avehtari mentioned above
  • Implemented the full suite of dgpd()/pgpd()/qgpd()/rgpd() functions (an illustrative dgpd() sketch follows below)
  • Cleaned up and condensed some internal helpers

@jgabry: the new functions pass my local checks and the existing tests, but it'd be great if you could also give them a try, perhaps with some fresh test cases.
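For reference, here is a minimal dgpd() sketch following the extraDistr-style parameterization (mu = location, sigma = scale, xi = shape); this is an illustration, not the implementation in the PR:

# sketch of a GPD density in the extraDistr parameterization
dgpd <- function(x, mu = 0, sigma = 1, xi = 0, log = FALSE) {
  z <- (x - mu) / sigma
  if (xi == 0) {
    # exponential limit as xi -> 0
    lp <- ifelse(z >= 0, -z - log(sigma), -Inf)
  } else {
    # density is positive only where z >= 0 and 1 + xi * z > 0
    lp <- ifelse(z >= 0 & 1 + xi * z > 0,
                 -(1 / xi + 1) * log1p(xi * z) - log(sigma),
                 -Inf)
  }
  if (log) lp else exp(lp)
}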

@avehtari (Collaborator)

Sorry this has been left hanging for such a long time. After some consideration, I think we shouldn't change the existing output variable names (i.e., k -> k_hat, sigma -> sigma_hat), as the function may already be in use (at least I'm using it in various simulations), and renaming would break that code.

I'm also thinking that maybe we should have a separate function for the marginal posterior. It is almost just a by-product, but it still adds a bit of computation and memory usage, so it might be better to keep the streamlined version for PSIS-LOO and have a separate function for the cases where someone wants to investigate further. That would also make it possible to add a computationally more intensive but more accurate marginal posterior computation later.

@avehtari (Collaborator)

Adding a comment that this will very likely go to the posterior package instead, so there is no need to merge it into loo.

@ozanstats closed this by deleting the head repository on May 8, 2023.