The R package nbfar
implements Negative Binomial factor regression models that allow the estimation of
structured (sparse) associations between a feature matrix X and overdispersed count data Y.
The package has been developed with microbiome count data Y in mind and can be used, e.g., to associate host or environmental covariates with microbial abundances.
Currently, two models are available
- Negative Binomial reduced rank regression (NB-RRR)
- Negative Binomial co-sparse factor regression (NB-FAR).
The underlying structure of the models are illustrated in the Figure below.
Both models result in a type of joint biclustering structure linking features to count responses.
Microbiome data example - linking host phenotype data to microbial abundances of the American Gut project
Using nbfar
we analyzed the American Gut Project data
and Vioscreen information to identify robust links between diet and life style
features and broad abundance patterns of microbial families.
The manually curated data file as 'phyloseq' object is available in the manuscript_file folder. After downloading raw data from Qiita, we have taken following steps to obtain a subset of the microbiome:
- Raw american gut data
- Subset of microbiome data fron fecal samples
- Subset of microbiome data with vioscreen variables
- Final curated microbiome data with vioscreen variables is available here.
- Subset of microbiome data with vioscreen variables
- Subset of microbiome data fron fecal samples
Some of our findings are summarized in the Figure below where we show microbial families and host-associated feature
biclusters automatically identified by nbfar
.
The nbfar
package is currently available on GitHub and can be installed as follows.
The package Rcpp
is required for installation.
# Install packages
install.packages("Rcpp", repos="https://rcppcore.github.io/drat")
devtools::install_github('amishra-stats/nbfar/nbfar', force = TRUE)
To use the library, we also rely on parallel computation.
# load library
library(nbfar)
library(RcppParallel)
We showcase the usage of nbfar and nbrrr on simulated data.
## ## -----------------------Simulation settings --------------------
## Simulation setting:
## n: sample size
## p: number of predictors
## pz: number of control variable other than intercept parameter
## q: number of outcome variables
## nrank: true rank of the model
## snr: signal to noise ratio to be used in generating gaussian outcome variables
## nlam: number of lambda values to be fitted
## rank.est: maximum estimated rank to be specified to the model
## s: multiplicative factor of singular values
## nthread: number of parallel thread can be used in parallel for cross validation
## The simulation was replicated 100 times under each setting as detailed in the paper.
#
## Model specification:
p <- 50;
example_seed <- 123
xp = 30
set.seed(example_seed)
n <- 200
nrank <- 3 # true rank
rank.est <- 5 # estimated rank
nlam <- 40 # number of tuning parameter
s = 0.5; q <- 30
sp = xp/p
nthread = 1
## -------------- Generate low-rank and sparse coefficient matrix ---
## D: singular values
## U: sparse left singular vectors
## V: sparse right singular vectors
D <- rep(0, nrank)
V <- matrix(0, ncol = nrank, nrow = q)
U <- matrix(0, ncol = nrank, nrow = p)
#
U[, 1] <- c(sample(c(1, -1), 8, replace = TRUE), rep(0, p - 8))
U[, 2] <- c(rep(0, 5), sample(c(1, -1), 9, replace = TRUE), rep(0, p - 14))
U[, 3] <- c(rep(0, 11), sample(c(1, -1), 9, replace = TRUE), rep(0, p - 20))
#
# for similar type response type setting
V[, 1] <- c(rep(0, 8), sample(c(1, -1), 8, replace = TRUE) *
runif(8, 0.3, 1), rep(0, q - 16))
V[, 2] <- c(rep(0, 20), sample(c(1, -1), 8, replace = TRUE) *
runif(8, 0.3, 1), rep(0, q - 28))
V[, 3] <- c( sample(c(1, -1), 5, replace = TRUE) *
runif(5, 0.3, 1), rep(0, 23),
sample(c(1, -1), 2, replace = TRUE) *
runif(2, 0.3, 1), rep(0, q - 30))
U[, 1:3] <- apply(U[, 1:3], 2, function(x) x / sqrt(sum(x^2)))
V[, 1:3] <- apply(V[, 1:3], 2, function(x) x / sqrt(sum(x^2)))
#
D <- s * c(4, 6, 5) # signal strength varries as per the value of s
or <- order(D, decreasing = T)
U <- U[, or]
V <- V[, or]
D <- D[or]
C <- U %*% (D * t(V)) # simulated coefficient matrix
intercept <- rep(0.5, q) # specifying intercept to the model:
C0 <- rbind(intercept, C)
## ----- Simulate data -----
Xsigma <- 0.5^abs(outer(1:p, 1:p, FUN = "-"))
sim.sample <- nbfar_sim(U, D, V, n, Xsigma, C0,disp = 0.75, depth = 10) # Simulated sample
X <- sim.sample$X[1:n, ] # simulated predictors (training)
Y <- sim.sample$Y[1:n, ] # simulated responses (training)
# 1000 test sample data
sim.sample <- nbfar_sim(U, D, V, 1000, Xsigma, C0, disp = 0.75, depth = 10)
Xt <- sim.sample$X # simulated predictors (test)
Yt <- sim.sample$Y # simulated predictors (test)
X0 <- cbind(1, X) # 1st column accounting for intercept
# Simulate data with 20% missing entries
miss <- 0.10 # Proportion of entries missing
t.ind <- sample.int(n * q, size = miss * n * q)
y <- as.vector(Y)
y[t.ind] <- NA
Ym <- matrix(y, n, q) # 20% of entries are missing at random
# Model fit: (full data)
set.seed(example_seed)
control_r3 <- nbfar_control(initmaxit = 10000, initepsilon = 1e-5,
objI = 1)
nbrrr_test <- nbrrr(Y, X, maxrank = 5, control = control_r3, nfold = 5, trace = F)
# Model fit: (missing data)
set.seed(example_seed)
control_r3 <- nbfar_control(initmaxit = 10000, initepsilon = 1e-5,
objI = 1)
nbrrr_testm <- nbrrr(Ym, X, maxrank = 5, control = control_r3, nfold = 5,trace = F)
# Model fit: (full data)
RcppParallel::setThreadOptions(numThreads = nthread)
set.seed(example_seed)
control_nbfar <- nbfar_control(gamma0 = 1, spU = sp, spV = 20/q,
maxit = 2000, lamMaxFac = 1e-2,
lamMinFac = 1e-7, epsilon = 1e-4,
objI = 0,
initmaxit = 10000, initepsilon = 1e-7)
nbfar_test <- nbfar(Y, X, maxrank = rank.est, nlambda = nlam,
cIndex = NULL,
ofset = NULL, control = control_nbfar, nfold = 5,
PATH = FALSE, nthread = nthread,trace = F)
# Model fit: (missing data)
RcppParallel::setThreadOptions(numThreads = nthread)
set.seed(example_seed)
control_nbfar <- nbfar_control(gamma0 = 1, spU = sp, spV = 20/q,
maxit = 2000, lamMaxFac = 1e-2,
lamMinFac = 1e-7, epsilon = 1e-4,
objI = 0,
initmaxit = 10000, initepsilon = 1e-7)
nbfar_testm <- nbfar(Ym, X, maxrank = rank.est, nlambda = nlam,
cIndex = NULL,
ofset = NULL, control = control_nbfar, nfold = 5,
PATH = FALSE, nthread = nthread,trace = F)
- Contributions and suggestions to the software are always welcome. Please consult our contribution guidelines prior to submitting a pull request.
- Report issues or problems with the software using github’s issue tracker.
- Contributors must adhere to the Code of Conduct.
We thank Dr. Andreas Buja for useful comments on the project.
You can also contact us via email