Add a function to return marginal posterior of k #186

avehtari · 2021-09-13T19:28:03Z

A few times users have asked how much variability there can be in khat estimate if they would re-run MCMC. We can estimate that by looking at the marginal posterior of k. By re-using existing code (functions in gpdfit.R and nlist from helpers.R), the following code snippet returns vectors of ks and corresponding densities

gpdkpost <- function(x, wip = TRUE, min_grid_pts = 30, sort_x = TRUE) {
  # See section 4 of Zhang and Stephens (2009)
  if (sort_x) {
    x <- sort.int(x)
  }
  N <- length(x)
  prior <- 3
  M <- min_grid_pts + floor(sqrt(N))
  jj <- seq_len(M)
  xstar <- x[floor(N / 4 + 0.5)] # first quartile of sample
  theta <- 1 / x[N] + (1 - sqrt(M / (jj - 0.5))) / prior / xstar
  l_theta <- N * lx(theta, x) # profile log-lik
  ws <- 1 / vapply(jj, FUN.VALUE = numeric(1), FUN = function(j) {
    sum(exp(l_theta - l_theta[j]))
  })
  theta_hat <- sum(theta * ws)
  k <- mean.default(log1p(-theta_hat * x))
  ks <- numeric(length=M)
  for (i in 1:M) {
    ks[i] <- mean.default(log1p(-theta[i] * x))
  }
  sigma <- -k / theta_hat

  if (wip) {
    k <- adjust_k_wip(k, n = N)
  }

  if (is.nan(k)) {
    k <- Inf
  }

  nlist(k, sigma, ks, ws)
}

Demonstration with fake data

x <- sort(exp(rnorm(10000)*4),decreasing=TRUE)[1:100]
(foo <- gpdkpost(x))
qplot(q$ks,q$ws)+geom_line()+labs(x="k",y="unnormalized marginal posterior density")

I haven't usually looked at these, as there are so many k's to look at and I have some sense of the variability. But for the users who have not been playing around with Pareto as much, this could be a useful tool to have.

The text was updated successfully, but these errors were encountered:

yao-yl · 2021-09-13T20:38:50Z

Would Pr(k>0.7) be a more reliable diagnosis with small MC sample size?

…

On Mon, Sep 13, 2021 at 3:28 PM Aki Vehtari ***@***.***> wrote: A few times users have asked how much variability there can be in khat estimate if they would re-run MCMC. We can estimate that by looking at the marginal posterior of k. By re-using existing code (functions in gpdfit.R and nlist from helpers.R), the following code snippet returns vectors of ks and corresponding densities gpdkpost <- function(x, wip = TRUE, min_grid_pts = 30, sort_x = TRUE) { # See section 4 of Zhang and Stephens (2009) if (sort_x) { x <- sort.int(x) } N <- length(x) prior <- 3 M <- min_grid_pts + floor(sqrt(N)) jj <- seq_len(M) xstar <- x[floor(N / 4 + 0.5)] # first quartile of sample theta <- 1 / x[N] + (1 - sqrt(M / (jj - 0.5))) / prior / xstar l_theta <- N * lx(theta, x) # profile log-lik ws <- 1 / vapply(jj, FUN.VALUE = numeric(1), FUN = function(j) { sum(exp(l_theta - l_theta[j])) }) theta_hat <- sum(theta * ws) k <- mean.default(log1p(-theta_hat * x)) ks <- numeric(length=M) for (i in 1:M) { ks[i] <- mean.default(log1p(-theta[i] * x)) } sigma <- -k / theta_hat if (wip) { k <- adjust_k_wip(k, n = N) } if (is.nan(k)) { k <- Inf } nlist(k, sigma, ks, ws) } Demonstration with fake data x <- sort(exp(rnorm(10000)*4),decreasing=TRUE)[1:100] (foo <- gpdkpost(x)) qplot(q$ks,q$ws)+geom_line()+labs(x="k",y="unnormalized marginal posterior density") [image: image] <https://user-images.githubusercontent.com/6705400/133144297-0f5b4cdf-9e0d-4026-ac83-57ec23770469.png> I haven't usually looked at these, as there are so many k's to look at and I have some sense of the variability. But for the users who have not been playing around with Pareto as much, this could be a useful tool to have. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#186>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AFQE724IZD7GQYYZL27ERI3UBZGE5ANCNFSM5D6Q3LNA> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

avehtari · 2021-09-13T20:54:48Z

Based on my extensive experiments, no. It just maps the diagnostic to a different scalar, which seems to have weaker discriminatory power. khat seems to be the best single value, but more summary statistics jointly or looking at the marginal will provide more information.

ozanstats · 2021-10-10T10:19:24Z

This small PR presents a suggestion to extend the existing gpdfit() function to also return the marginal posterior density of k using the code @avehtari posted above with a few tweaks (cc: @jgabry).

Note: Although the plot above was labeled "unnormalized", my understanding is that the resulting density vector (w_theta) is actually normalized.

library(ggplot2)

set.seed(147)
x <- sort(exp(rnorm(10000) * 4), decreasing = TRUE)[seq_len(100)]

(gpd <- gpdfit(x)) # updated version

ggplot(mapping = aes(x = gpd$k, y = gpd$w_theta)) +
  geom_point() +
  geom_line() +
  labs(
    title = "Normalized marginal posterior density of k",
    x = "k",
    y = "Density"
  ) +
  geom_vline(xintercept = gpd$k_hat, color = "red") +
  annotate(
    geom = "label",
    label = as.expression(bquote(hat(k) == .(round(gpd$k_hat, 4)))),
    x = gpd$k_hat,
    y = 0, 
    color = "red",
    size = 3.5
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

avehtari · 2021-10-10T11:41:38Z

Thanks for the PR! I'll check it soon.

Note: Although the plot above was labeled "unnormalized", my understanding is that the resulting density vector (w_theta) is actually normalized.

The sum of w_theta vector is 1, so that the quadrature weights are normalized, but w_theta are not normalized densities. Your plot makes it easy to check this. The integral (area under the curve) can be approximated by a triangle with the base from 0.5 to 1.25 and the height of 0.13. The area of that triangle is 0.750.13/2 ≈ 0.049 ≠ 1. If the values were normalized densities, the maximum density value in this example should be close to 2.7, so that 0.752.7/2 ≈ 1. To return normalized densities, we could estimate the normalization term quite accurately with the trapezoidal rule https://en.wikipedia.org/wiki/Trapezoidal_rule (taking into account that the evaluation points are not equidistant).

ozanstats · 2021-10-10T21:50:41Z

Sounds good, I added this normalization term estimation with the trapezoidal rule to the PR thread along with a couple of implementation questions.

ozanstats mentioned this issue Oct 10, 2021

extend gpdfit to return marginal posterior of k #186 #188

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a function to return marginal posterior of k #186

Add a function to return marginal posterior of k #186

avehtari commented Sep 13, 2021

yao-yl commented Sep 13, 2021 via email

avehtari commented Sep 13, 2021

ozanstats commented Oct 10, 2021 •

edited

Loading

avehtari commented Oct 10, 2021

ozanstats commented Oct 10, 2021

Add a function to return marginal posterior of k #186

Add a function to return marginal posterior of k #186

Comments

avehtari commented Sep 13, 2021

yao-yl commented Sep 13, 2021 via email

avehtari commented Sep 13, 2021

ozanstats commented Oct 10, 2021 • edited Loading

avehtari commented Oct 10, 2021

ozanstats commented Oct 10, 2021

ozanstats commented Oct 10, 2021 •

edited

Loading