This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

A Flexible Zero-Inflated Poisson-Gamma Model with application to microbiome read counts

Roulan Jiang    Xiang Zhanlabel=e2 [    mark]zhanx@bjmu.edu.cn    Tianying Wanglabel=e3 [    mark]tianyingw@tsinghua.edu.cn Center for Statistical Science, Tsinghua University, Beijing 100084, China. Department of Industrial Engineering, Tsinghua University, Beijing 100084, China. Department of Biostatistics, School of Public Health, Peking University, Beijing 100191, China. Beijing International Center for Mathematical Research, Peking University, Beijing 100871, China.
Abstract

Abstract

In microbiome studies, it is of interest to use a sample from a population of microbes, such as the gut microbiota community, to estimate the population proportion of these taxa. However, due to biases introduced in sampling and preprocessing steps, these observed taxa abundances may not reflect true taxa abundance patterns in the ecosystem. Repeated measures including longitudinal study designs may be potential solutions to mitigate the discrepancy between observed abundances and true underlying abundances. Yet, widely observed zero-inflation and over-dispersion issues can distort downstream statistical analyses aiming to associate taxa abundances with covariates of interest. To this end, we propose a Zero-Inflated Poisson Gamma (ZIPG) framework to address the aforementioned challenges. From a perspective of measurement errors, we accommodate the discrepancy between observations and truths by decomposing the mean parameter in Poisson regression into a true abundance level and a multiplicative measurement of sampling variability from the microbial ecosystem. Then, we provide flexible modeling by connecting both mean abundance and the variability to different covariates, and build valid statistical inference procedures for both parameter estimation and hypothesis testing. Through comprehensive simulation studies and real data applications, the proposed ZIPG method provides significant insights into distinguished differential variability and abundance.

Longitudinal data,
Measurement error,
Poisson Gamma distribution,
Sequence count data,
Zero-inflation,
keywords:
\startlocaldefs\endlocaldefs

, and

1 Introduction

The human microbiome consists of the collection of all microbes living in or on the human body and plays an important role in maintaining human health (Manor et al., 2020). Quantification of the microbiome usually proceeds by 16s rRNA sequencing or metagenomic shotgun sequencing, where sequence read counts are often summarized into a taxa count table. Here the word taxa generically refers to features such as operational taxonomic units or other taxonomic or functional groupings of bacterial sequences. A crucial task in microbiome research is to link these taxa counts to covariates of interest (e.g., disease status, health outcomes, and environmental conditions) via statistical analysis (Li, 2015). To achieve this goal, one needs first to address some common challenges, such as zero inflation and over-dispersion in observed taxa counts, and more importantly, the discrepancy between observed taxa abundances in samples and true abundances in the underlying microbial ecosystem, such as the gut microbiota community, to guarantee rigor and reproducibility of the analysis results (Willis, 2019).

Owing to biases introduced in sampling extraction, polymerase chain reaction (PCR) amplification, sequencing, bioinformatics prepossessing, and other possible experimental procedures, observed taxa abundances may not well reflect unobserved true abundances in the ecosystem. While multiple versatile statistical methods have been proposed to address the aforementioned issue for microbiome compositional data (Shi et al., 2022; Martin et al., 2020), measurement errors actually occur at latent count variables rather than proportions. Such a compositional transformation may lose some variation/dispersion information that is important to subsequent statistical analysis (McMurdie and Holmes, 2014; Xu et al., 2021; Li et al., 2021). Moreover, recent research indicates that it is possible to quantify microbial load (and hence the absolute abundance of each taxon) using flow cytometry (Vandeputte et al., 2017). Following this research vein, we will propose valid statistical inference for microbiome count data accommodating the discrepancy between observed sample abundances and underlying true abundances. Specifically, motivated by a recent inference procedure based on multiple rarefaction-based re-samplings (Hu et al., 2021), we take samples with repeated measures (or longitudinal measurements) to account for sampling fluctuations.

Like many high-throughput DNA sequencing assays exhibiting high sparsity, microbiome experiments often have about 50% or more zero measurements (Silverman et al., 2020). There are, in general, two types of approaches to handle these zeros in microbiome sequence count data. One is to impute zeros based on missing data scheme assumptions (Martın-Fernandez et al., 2011) or random matrix low-rank assumptions (Cao et al., 2020). The imputation approach is often coupled with downstream log-ratio-based compositional data analysis. The other approach is to propose a two-part model with a point probability mass at zero along with another parametric distribution. Examples include zero-inflated Poisson, zero-inflated negative binomial and many others (Xu et al., 2021; Zhang and Yi, 2020; Li et al., 2018; Tang and Chen, 2019; Zeng et al., 2022). While both approaches are popular in microbiome data analysis, a recent study demonstrates that a potential limitation of imputing zeros is that violation of underlying assumptions may distort downstream statistical analysis (Silverman et al., 2020). To this end, we will take the two-part modeling strategy to handle excessive zeros in microbiome data in this paper.

Suppose samples are collected from nn subjects, and every subject could have multiple measurements with counts of KK taxa measured in each sample. For i=1,,ni=1,...,n, j=1,,nij=1,...,n_{i} and k=1,,Kk=1,...,K, denote WijkW_{ijk} as the count of the kkth taxon in the jjth sample/measurement from the iith subject. For ease of presentation, we assume hereafter that sample jj, with j=1,,nij=1,...,n_{i}, was collected longitudinally from subject ii, without loss of generality. The sequencing depth, or library size, of each sample is Mij=k=1KWijkM_{ij}=\sum_{k=1}^{K}W_{ijk}. To account for the aforementioned excessive zeros issue, the zero-inflated Poisson distribution has been proposed to model microbiome counts (Xu et al., 2021):

Wijk{0withprobabilitypijkPoisson(λijk)withprobability 1pijk,W_{ijk}\sim\left\{\begin{matrix}0&{\rm with\ probability}\ p_{ijk}\\ {\rm Poisson}(\lambda_{ijk})&{\rm\ \ with\ probability}\ 1-p_{ijk},\end{matrix}\right. (1)

where λijk\lambda_{ijk} represents the mean abundance of taxon kk on the jjth observation from subject ii, and pijkp_{ijk} is the probability mass to model excessive zeros. A major critique of Poisson models is the failure to accommodate over-dispersion, which has been widely observed for sequence count data, including microbiome data. An alternative of Poisson is the negative binomial distribution, originally proposed for RNA sequence count data (Robinson et al., 2010; Love et al., 2014) and recently extended to microbiome data (Zhang and Yi, 2020). The zero-inflated negative binomial (ZINB) distribution is given by:

Wijk{0withprobabilitypijkNB(μijk,αk)withprobability 1pijk,W_{ijk}\sim\left\{\begin{matrix}0&{\rm with\ probability}\ p_{ijk}\\ {\rm NB}(\mu_{ijk},\alpha_{k})&{\rm\ \ with\ probability}\ 1-p_{ijk},\end{matrix}\right. (2)

where λijk\lambda_{ijk} and αk\alpha_{k} are the mean parameter and (over-)dispersion parameter of the negative binomial distribution, respectively. ZINB can also be expressed as a Gamma prior upon λijk\lambda_{ijk} in ZIP with E(λijk)=μijkE(\lambda_{ijk})=\mu_{ijk} and Var(λijk)=μijk2αkVar(\lambda_{ijk})=\mu_{ijk}^{2}\alpha_{k}. That is, λijkGamma(αk1,μijkαk),\lambda_{ijk}\sim{\rm Gamma}\left(\alpha_{k}^{-1},\mu_{ijk}\alpha_{k}\right), where αk\alpha_{k} is a nuisance parameter depending on the kkth taxon only.

Host factors, like disease status and dietary regimes, can impact microbiome stability, referred to as dysbiosis to describe the imbalance of microbiome community during some unhealthy conditions compared to normal ones (Petersen and Round, 2014). Thus, it is of our interest to investigate the relationship between the taxa abundance variation and covariates, which is naturally caused by microbiome stability perturbation yet overlooked in the ZINB model. For illustration purposes, we use the taxon Burkholderiales bacterium from the diet-microbiome study (Johnson et al., 2019) to show potential abundance variation associated with covariates. One primary goal of the diet-microbiome study is to analyze the microbial difference between alcohol drinkers and teetotalers. According to box plots for the raw data and the predicted distribution using the ZINB model by R package pscl (Zeileis et al., 2008), we observe that two groups have similar mean abundances, but the variance among alcohol drinkers is evidently larger, which can not be captured by pscl (Figure 1). Therefore, it is crucial to consider variation in microbiome count data to obtain accurate and robust analysis results.

Refer to caption
(a) Raw data
Refer to caption
(b) pscl predicted data
Figure 1: Box plot for log of relative abundance of Burkholderiales bacterium in alcohol (ALC = 1) and non-alcohol (ALC = 0) groups.

To address potential limitations, we propose a Zero-Inflated Poisson-Gamma (ZIPG) framework, which provides flexible modeling by connecting both the mean abundance and its dispersion with different sets of covariates, respectively. First, we consider a hierarchical model by adjusting the mean parameter λijk\lambda_{ijk} of ZIP (i.e., model (1)) with a multiplicative factor UijkU_{ijk}, whose distribution is Gamma and can be viewed as a multiplicative measurement error. Further, we construct the ZIPG model and connect mean parameter λijk\lambda_{ijk} and variation factor UijkU_{ijk} to different sets of covariates. Note that our model is different from the traditional Bayesian expression of negative-binomial in the sense that we model the mean and variability of taxa abundance separately, providing a meaningful explanation from the individual-level variation perspective.

Our contributions are two-fold. First, our ZIPG framework provides flexible modeling of microbiome sequence counts with repeated measures and allows us to analyze how different sets of variables affect both mean taxa abundance and its dispersion. Second, within the ZIPG framework, we develop inference procedures, including point and interval estimates and hypothesis testing, to examine the relationship between microbial taxa abundances and covariates of interest. By introducing the variation factor as a multiplicative measurement error term, our ZIPG method is able to capture higher-order moment information of taxa abundance and has been shown to be more powerful than ZINB-based methods. Through extensive simulations, we illustrate that existing ZINB-based methods could have severely inflated type I error when differential variability exists, whereas ZIPG can control type I error around the nominal level. When analyzing two real microbiome data sets, ZIPG identified more significant taxa than two ZINB models under the same nominal false discovery rate level, and also distinguished differential variability and differential abundance, providing more insights for further biological or biomedical functional investigations.

The rest of this paper is organized as follows. In Section 2, we introduce our ZIPG model and discuss some parameters of interest. ZIPG model fitting and hypothesis testing procedures are proposed in Section 3. In Section 4, we demonstrate the superior performance of our approach under different simulation settings. We apply our method to the data from a vaginal microbiome study and a diet-microbiome study in Section 5, and conclude it with a brief discussion in Section 6.

2 Model and Notation

2.1 Zero-Inflated Poisson Gamma Model

For taxon count WijkW_{ijk}, we decompose the mean of Poisson distribution in eq (1) into a true abundance level λijk\lambda_{ijk} and a multiplicative factor UijkU_{ijk} and consider the following hierarchical Zero-Inflated Poisson-Gamma (ZIPG) model:

WijkUijk\displaystyle{W}_{ijk}\mid U_{ijk} \displaystyle\sim {0withprobabilitypkPoisson(λijkUijk)withprobability 1pk,\displaystyle\left\{\begin{matrix}\phantom{-}0&\phantom{-}{\rm with\ probability}\ p_{k}\\ {\rm Poisson}(\lambda_{ijk}U_{ijk})\ &{\rm with\ probability}\ 1-p_{k},\\ \end{matrix}\right. (3)
Uijk\displaystyle U_{ijk} \displaystyle\sim Gamma(θik1,θik),\displaystyle{\rm Gamma}(\theta_{ik}^{-1},\theta_{ik}),

where λijk\lambda_{ijk} represents the true abundance level for taxon kk on the jjth observation from subject ii, and pkp_{k} denotes the zero-inflation parameter describing the probability of true zero occurrence of taxon kk. UijkU_{ijk} follows Gamma distribution with the same rate parameter and shape parameter θik1\theta_{ik}^{-1}. This factor does not change the average abundance level given the fact that E(Uijk)=1E(U_{ijk})=1. On the other hand, Var(Uijk)=θikVar(U_{ijk})=\theta_{ik} allows extra variation of the observed abundance WijkW_{ijk} around the average level λijk\lambda_{ijk}, a phenomenon described as the deviation of observed abundance to unobserved true abundance. It is of note that the variability term, θik\theta_{ik}, remains the same for all measurements (i.e., j=1,,nij=1,...,n_{i}) across individual ii. Thus it reflects the stability of taxon kk in the microbial system of individual ii. This is motivated by the assumption about metagenomic sequencing bias being taxon-specific but not sample-specific made in literature (McLaren et al., 2019). Finally, we assume the zero-inflation parameter pkp_{k} is only taxon-specific and is common across samples (jj) and individuals (ii). This is because many experimental factors in sequencing can introduce the measurement of zeros (Silverman et al., 2020) and hence it is less appealing to link zero inflation parameter pkp_{k} to other covariates (e.g., biological or environmental conditions) possessed by individuals or samples. We have checked the sensitivity of ZIPG under model misspecification when this assumption is violated by comparing the performance of the current ZIPG model (3) to a full ZIPG model (denoted as ZIPG-full), which replaces pkp_{k} by pijkp_{ijk} and links pijkp_{ijk} to covariates. Results in Section 4.4 indicates that the current ZIPG model tends to have better model-fitting performance than ZIPG-full, which is consistent with previous empirical conclusions that modeling zero inflation in sequence count data should be careful (with respect to underlying zero generating process) and numerical evidence tends to favor simpler models (Silverman et al., 2020).

To further explore the new ZIPG framework, let WijkPoisPoisson(λijk)W_{ijk}^{\rm Pois}\sim{\rm Poisson}(\lambda_{ijk}) denote as the random variable generated from the Poisson distribution and WijkPGW_{ijk}^{\rm PG} as the random variable generated from the Poisson-Gamma part in eq (3). We have

E(WijkPG)=E(WijkPois)=λijk,\displaystyle E(W_{ijk}^{\rm PG})=E(W_{ijk}^{\rm Pois})=\lambda_{ijk},
Var(WijkPG)=λijk(1+λijkθik)=Var(WijkPois)(1+λijkθik).\displaystyle Var(W_{ijk}^{\rm PG})=\lambda_{ijk}(1+\lambda_{ijk}\theta_{ik})=Var(W_{ijk}^{\rm Pois})(1+\lambda_{ijk}\theta_{ik}).

Thus, the mean of the Poisson-Gamma distribution is the same as the regular Poisson distribution, but its variance is multiplied by (1+λijkθik)(1+\lambda_{ijk}\theta_{ik}) to account for the over-dispersion caused by the multiplicative measurement error factor UijkU_{ijk}. We also observe the similar phenomenon of using a more sophisticated hierarchical model to account for over-dispersion in microbiome data analysis, such as the Beta-Binomial distribution (Martin et al., 2020) and Dirichlet-multinomial distribution (La Rosa et al., 2012). In this paper, we refer to λijk\lambda_{ijk} as the abundance mean parameter and θik\theta_{ik} as the abundance dispersion parameter.

2.2 Parameters of Interest

A critical task in microbiome research is to explore the relationship between taxa abundances and covariates of interest. Compared to noisy observed counts WijkW_{ijk}, it is more interesting to investigate the association between key parameters (i.e., λijk\lambda_{ijk} and θik\theta_{ik}) of the underlying taxa abundance distribution and covariates of interest. To this end, we connect the mean parameter and dispersion parameter with different sets of covariates, respectively. In the repeated measures or longitudinal study design considered in the current paper, some covariates vary across different samples within the same subject, such as dietary intake, and we refer to them as “time-dependent” covariates. Other covariates, which do not change during the study, such as the treatment group assigned at the beginning of the study, are referred to as “time-independent” covariates. Since the dispersion parameter θik\theta_{ik} describes the deviation of short-term abundance from long-term mean abundance λijk\lambda_{ijk}, we propose to link it to time-independent covariates, supported by the evidence of microbiome stability perturbation in Morgan et al. (2012) and Couch et al. (2021). For the mean abundance λijk\lambda_{ijk}, it can be linked to either time-dependent covariates or time-independent covariates, or both. Therefore, we define the following link functions:

g(λijk)=βk,0+𝑿𝒊𝒋T𝜷𝒌+log(Mij),g(θik)=βk,0+𝑿𝒊𝑻𝜷𝒌,g(\lambda_{ijk})=\beta_{k,0}+\boldsymbol{X_{ij}}^{T}\boldsymbol{\beta_{k}}+\log(M_{ij}),\quad g^{*}(\theta_{ik})=\beta^{*}_{k,0}+\boldsymbol{X^{*T}_{i}}\boldsymbol{\beta^{*}_{k}}, (4)

where 𝑿𝒊𝒋d1\boldsymbol{X_{ij}}\in\mathbb{R}^{d_{1}} is a vector of covariates associated with λijk\lambda_{ijk}, 𝑿𝒊d2\boldsymbol{X^{*}_{i}}\in\mathbb{R}^{d_{2}} include covariates associated with θik\theta_{ik}, 𝜷𝒌=(β1,,βd1)T\boldsymbol{\beta_{k}}=(\beta_{1},...,\beta_{d_{1}})^{T} and 𝜷𝒌=(β1,,βd2)T\boldsymbol{\beta_{k}^{*}}=(\beta_{1}^{*},...,\beta_{d_{2}}^{*})^{T} are regression coefficients of interest. We allow overlapped covariates in two models. For ease of presentation, we term the two models in (4) as ZIPG mean model and ZIPG dispersion model, respectively. The log(Mij)\log(M_{ij}) term in the mean model accounts for the effect of sequencing depth variation on mean abundances. The same offset is used in previous ZINB models (Zhang and Yi, 2020; Robinson et al., 2010), and the log of median-of-ratios is another possible candidate for offset used in literature (Xu et al., 2021; Love et al., 2014). While most existing ZINB-based methods (2) models μijk\mu_{ijk} but treat αk\alpha_{k} as a nuisance parameter, our approach allows additional time-independent covariates linked to the dispersion of abundance, which would lead to better model fit and more powerful association analysis as will be shown later in this paper. According to previous discussions, we suggest including time-independent covariates such as demographic and lifestyle-related variables in 𝑿𝒊\boldsymbol{X^{*}_{i}}. All covariates of interest, regardless of time-dependent or not, shall be included in 𝑿𝒊𝒋\boldsymbol{X_{ij}}. By testing the coefficients 𝜷𝒌\boldsymbol{\beta_{k}} and 𝜷𝒌\boldsymbol{\beta^{*}_{k}}, we can detect differential abundance and differential variability impacted by physiological status or host environment, respectively. Finally, if we do not include any covariates in 𝑿𝒊\boldsymbol{X^{*}_{i}}, our ZIPG model will degenerate to ZINB (i.e., model (2)) with θik=αk\theta_{ik}=\alpha_{k} for any subject. Throughout this paper we choose a logarithmic link function g(x)=g(x):=log(x)g(x)=g^{*}(x):=\log(x) to ensure λijk>0\lambda_{ijk}>0 and θik>0\theta_{ik}>0.

The proposed hierarchical ZIPG model has several key advantages. First, to handle over-dispersion in microbiome count data, we decompose abundances into the long-term true abundance and its individual-level variation through a multiplicative factor. Second, we allow different sets of variables associated with the mean and the variation of abundance and provide explanations for variations in the individual-level microbial system. Thus, the proposed method is not only able to test the change of the mean abundance but also the microbiome stability affected by physiological status or host environment, which cannot be achieved by existing ZIP models. In addition, parameter θik\theta_{ik} also controls the skewness and kurtosis of the Gamma distribution. That is, we allow the higher-order moments (or the shape of the distribution) to be linked to covariates, which is another feature that is typically missed in existing models.

3 Statistical Inference in ZIPG

In this section, we develop statistical inference procedures in ZIPG, including parameter point estimation, interval estimation, and hypothesis testing. For ease of presenting, we omit the subscript kk and simply denote the parameter set associated with taxon kk as 𝛀=(β0,𝜷T,β0,𝜷T,γ)T\boldsymbol{\Omega}=(\beta_{0},\boldsymbol{\beta}^{T},\beta^{*}_{0},\boldsymbol{\beta^{*}}^{T},\gamma)^{T}, where γ=log{p/(1p)}\gamma=\log\left\{p/(1-p)\right\} is the logit transformation of zero-inflated parameter pp in ZIPG model (3).

3.1 Model Fitting

Given covariates 𝑿\boldsymbol{X} and 𝑿\boldsymbol{X^{*}}, observed count data 𝑾\boldsymbol{W} and sequencing depth 𝑴\boldsymbol{M}, we write the log-likelihood of 𝛀\boldsymbol{\Omega} as follows:

L(𝛀𝑾)\displaystyle L(\boldsymbol{\Omega}\mid\boldsymbol{W}) =\displaystyle= i=1nj=1ni[I(Wij=0)log{exp(γ)+PPG(Wij𝛀)}\displaystyle\sum_{i=1}^{n}\sum_{j=1}^{n_{i}}\left[I(W_{ij}=0)\log\left\{\exp(\gamma)+P_{\rm PG}(W_{ij}\mid\boldsymbol{\Omega})\right\}\right. (5)
+I(Wij>0)log{PPG(Wij𝛀)}log{exp(γ)+1}],\displaystyle\hskip 5.69054pt\left.+I(W_{ij}>0)\log\left\{P_{\rm PG}(W_{ij}\mid\boldsymbol{\Omega})\right\}-\log\left\{\exp(\gamma)+1\right\}\right],
withPPG(Wij𝛀)\displaystyle{\rm with\ \ }P_{\rm PG}(W_{ij}\mid\boldsymbol{\Omega}) =\displaystyle= Γ(Wij+θi1)Γ(Wij+1)Γ(θi1)(λijθi)Wij(1+λijθi)θi1+Wij,\displaystyle\frac{\Gamma(W_{ij}+\theta_{i}^{-1})}{\Gamma(W_{ij}+1)\Gamma(\theta_{i}^{-1})}\frac{(\lambda_{ij}\theta_{i})^{W_{ij}}}{(1+\lambda_{ij}\theta_{i})^{\theta_{i}^{-1}+W_{ij}}},

where λij\lambda_{ij} and θi\theta_{i} are functions of 𝛀\boldsymbol{\Omega} defined in eq (4) and Γ(x)=0+tx1exp(t)𝑑t\Gamma(x)=\int_{0}^{+\infty}t^{x-1}\exp(-t)dt is the Gamma function. The log-likelihood eq (5) is non-concave in 𝛀\boldsymbol{\Omega} (see Section 1.1 of Supplements). In practice, we found that directly maximizing eq (5) can cause trouble in distinguishing zeros from the Poisson-Gamma part and the other zero-inflation part of the ZIPG model, leading to an unreasonably low estimator of γ\gamma. A similar phenomenon has been observed for pscl, with more discussions provided in Section 4. Therefore, we use the EM algorithm for a reliable estimator of 𝛀\boldsymbol{\Omega}.

Let zijz_{ij} be the latent variable, where zij=1z_{ij}=1 indicates WijW_{ij} is generated from zero-inflated part with probability p=exp(γ)/(exp(γ)+1)p=\exp(\gamma)/(\exp(\gamma)+1), and zij=0z_{ij}=0 indicates WijW_{ij} is generated from the Poisson-Gamma distribution with probability 1p1-p. The log-likelihood with complete data {Wij,Mij,Xij,Xij,zij}\{W_{ij},M_{ij},X_{ij},X_{ij}^{*},z_{ij}\} for i=1,,ni=1,...,n and j=1,,nij=1,...,n_{i} is written as:

L(𝛀𝑾,𝒛)=i,j[zijlog(p)+(1zij)log{(1p)PPG(Wij𝛀)}].L(\boldsymbol{\Omega}\mid\boldsymbol{W},\boldsymbol{z})=\sum_{i,j}\left[z_{ij}\log(p)+(1-z_{ij})\log\left\{(1-p)P_{\rm PG}(W_{ij}\mid\boldsymbol{\Omega})\right\}\right]. (6)

The detailed procedure of the EM algorithm is provided in Algorithm 1. We first initialize 𝛀(0)\boldsymbol{\Omega}^{(0)} by the results of zeroinfl in pscl with 𝜷=𝟎\boldsymbol{\beta^{*}}=\boldsymbol{0}. pij(0)p_{ij}^{(0)} is adjusted to the proportion of observed zeros of 𝑾\boldsymbol{W} to avoid the local maximum at the start point. Then we can repeat the E-step and M-step until convergence or the maximum number of iterations tmaxt_{\rm max} is reached. For the tt-th iteration, in M-step, we update 𝛀\boldsymbol{\Omega} by maximizing eq (6) using BFGS in R function optim (Broyden (1970); Fletcher (1970); Goldfarb (1970); Shanno (1970)) given latent variable 𝒛(t1)\boldsymbol{z}^{(t-1)}. The gradient of eq (6) is applied to improve the computational efficiency (see Section 1.2 of Supplements).

𝛀(t)=argmax𝛀L(𝛀𝑾,𝒛(t1)).\displaystyle{\boldsymbol{\Omega}^{(t)}}=\mathop{\arg\max}_{\boldsymbol{\Omega}}L(\boldsymbol{\Omega}\mid\boldsymbol{W},\boldsymbol{z}^{(t-1)}). (7)

In E-step, we update latent variables zijz_{ij} by their conditional expectations given 𝛀(t)\boldsymbol{\Omega}^{(t)} estimated from M-step:

zij(t)=E{zijWij,𝛀(t)}\displaystyle{z}_{ij}^{(t)}=E\left\{{z}_{ij}\mid W_{ij},\boldsymbol{\Omega}^{(t)}\right\}
=\displaystyle= P(Wijzij=1,𝛀(t))P(zij=1𝛀(t))P(Wijzij=1,𝛀(t))P(zij=1𝛀(t))+P(Wijzij=0,𝛀(t))P(zij=0𝛀(t))\displaystyle\frac{P(W_{ij}\mid{z}_{ij}=1,\boldsymbol{\Omega}^{(t)})P({z}_{ij}=1\mid\boldsymbol{\Omega}^{(t)})}{P(W_{ij}\mid{z}_{ij}=1,\boldsymbol{\Omega}^{(t)})P({z}_{ij}=1\mid\boldsymbol{\Omega}^{(t)})+P(W_{ij}\mid{z}_{ij}=0,\boldsymbol{\Omega}^{(t)})P({z}_{ij}=0\mid\boldsymbol{\Omega}^{(t)})}
=\displaystyle= I(Wij=0)p(t)I(Wij=0)p(t)+PPG(Wij𝛀(t))(1p(t)).\displaystyle\frac{I(W_{ij}=0)p^{(t)}}{I(W_{ij}=0)p^{(t)}+P_{\rm PG}(W_{ij}\mid\boldsymbol{\Omega}^{(t)})(1-p^{(t)})}.

Substituting latent variables zij(t)=z(Wij,𝛀(t))z_{ij}^{(t)}=z(W_{ij},\boldsymbol{\Omega}^{(t)}) into eq (6), we can write the expectation of log-likelihood on a single observation as a new function

Qij(𝛀𝛀(t))\displaystyle Q_{ij}(\boldsymbol{\Omega}\mid\boldsymbol{\Omega}^{(t)}) =\displaystyle= E{L(𝛀Wij,zij)Wij,𝛀(t)}\displaystyle E\left\{L(\boldsymbol{\Omega}\mid W_{ij},z_{ij})\mid W_{ij},\boldsymbol{\Omega}^{(t)}\right\}
=\displaystyle= z(𝛀(t),Wij)logp+{1z(𝛀(t),Wij)}log{(1p)PPG(Wij𝛀)},\displaystyle z(\boldsymbol{\Omega}^{(t)},W_{ij})\log p+\left\{1-z(\boldsymbol{\Omega}^{(t)},W_{ij})\right\}\log\left\{(1-p)P_{\rm PG}(W_{ij}\mid\boldsymbol{\Omega})\right\},

then Q(𝛀𝛀(t))=ijQij(𝛀𝛀(t))Q(\boldsymbol{\Omega}\mid\boldsymbol{\Omega}^{(t)})=\sum_{ij}Q_{ij}(\boldsymbol{\Omega}\mid\boldsymbol{\Omega}^{(t)}) is the quantity maximized in M-step equivalently.

According to Theorem 6 in Wu (1983), since Q(𝛀𝛀)/𝛀\partial Q(\boldsymbol{\Omega}\mid\boldsymbol{\Omega}^{*})/\partial\boldsymbol{\Omega} is continuous in 𝛀\boldsymbol{\Omega} and 𝛀\boldsymbol{\Omega}^{*}, and our algorithm ensures Q(𝛀𝛀(t))/𝛀𝛀=𝛀(t+1)=0\partial Q(\boldsymbol{\Omega}\mid\boldsymbol{\Omega}^{(t)})/\partial\boldsymbol{\Omega}\mid_{\boldsymbol{\Omega}=\boldsymbol{\Omega}^{(t+1)}}=\textbf{0} in each iteration, then EM estimator will converge to a stationary point of L(𝛀𝑾)L(\boldsymbol{\Omega}\mid\boldsymbol{W}). Under mild conditions, EM Algorithm converges to the local maximum (Wu, 1983, Theorem 3). Based on our suggested initialization, namely the MLE assuming 𝜷=0\boldsymbol{\beta^{*}}=0 with adjusted p(0)p^{(0)}, numerical studies suggest that our estimators are nearly unbiased.

Algorithm 1 ZIPG Expectation Maximization Algorithm
𝑾,𝑴,𝑿,𝑿\boldsymbol{W},\boldsymbol{M},\boldsymbol{X},\boldsymbol{X^{*}}, the maximum iterations tmaxt_{\max}, a tolerance ϵtol\epsilon_{\rm tol}.
Initialize 𝛀(0)\boldsymbol{\Omega}^{(0)} by adjusted pscl estimation regardless of 𝑿\boldsymbol{X^{*}}, set t=0t=0.
Initialize 𝒛(0)\boldsymbol{z}^{(0)} with 𝛀(0)\boldsymbol{\Omega}^{(0)} based on eq (3.1).
Calculate L(0)=L(𝛀(0)𝑾,𝒛(0))L^{(0)}=L(\boldsymbol{\Omega}^{(0)}\mid\boldsymbol{W},\boldsymbol{z}^{(0)}) as eq (6).
while t<tmaxand|L(t)L(0)|/|L(0)|<ϵtolt<t_{\max}\ \text{and}\ |L^{(t)}-L^{(0)}|/|L^{(0)}|<\epsilon_{\rm tol} do
     Given 𝒛(t1)\boldsymbol{z}^{(t-1)}, estimate 𝛀(t)\boldsymbol{\Omega}^{(t)} by eq (7) (M-step).
     Get the maximized L(t)=L(𝛀(t)𝑾,𝒛(t1))L^{(t)}=L(\boldsymbol{\Omega}^{(t)}\mid\boldsymbol{W},\boldsymbol{z}^{(t-1)}) as in eq (6).
     Given 𝛀(t)\boldsymbol{\Omega}^{(t)}, update 𝒛(t)\boldsymbol{z}^{(t)} by eq (3.1) (E-step).
     t=t+1t=t+1.
end while
Return 𝛀(t)\boldsymbol{\Omega}^{(t)}.

3.2 Hypothesis Testing

In this paper, proving the asymptotic normality of the EM estimator is less of our interest, and despite its lack of rigorous theoretical justification, the EM estimator has been essentially treated as MLE in literature. Thus, we treat EM estimator 𝛀^\boldsymbol{\hat{\Omega}} as MLE and construct the Wald test statistics and confidence interval based on the asymptotic normality of MLE. Further, we carefully consider the potential practical issues and evaluate different bootstrap methods and interval construction strategies with extensive numerical studies.

Consider the null hypothesis H0:𝑨𝛀=𝒃H_{0}:\boldsymbol{A\boldsymbol{\Omega}=b} as the general form to test arbitrary subsets and linear combinations of parameters within 𝛀\boldsymbol{\Omega}, where 𝑨r×(d1+d2+3)\boldsymbol{A}\in\mathbb{R}^{r\times(d_{1}+d_{2}+3)} has full rank, r<d1+d2+3r<d_{1}+d_{2}+3, and 𝒃r\boldsymbol{b}\in\mathbb{R}^{r}. The classical Wald test statistic can be constructed as

TWald=N(𝑨𝛀^𝒃)T(𝑨𝑽𝑨T)1(𝑨𝛀^𝒃),{T}_{\rm Wald}=N(\boldsymbol{A{\boldsymbol{\hat{\Omega}}}-b})^{T}(\boldsymbol{A}\boldsymbol{V}\boldsymbol{A}^{T})^{-1}(\boldsymbol{A{\boldsymbol{\hat{\Omega}}}-b}),

where 𝑽\boldsymbol{V} is the asymptotic covariance matrix of 𝛀\boldsymbol{\Omega}, NN is the sample size, TWaldT_{\rm Wald} is asymptotic χr2\chi^{2}_{r} under the null hypothesis. In practice, we found that directly using the inverse of observed information derived from the last M-step might underestimate the variance matrix (see Section 3.1 of the Supplements). Thus, we propose to use nonparametric bootstrap to estimate the covariance matrix and construct the test statistic as in Algorithm 2.

Algorithm 2 ZIPG Bootstrap Wald Test
𝑾,𝑴,𝑿,𝑿\boldsymbol{W},\boldsymbol{M},\boldsymbol{X},\boldsymbol{X^{*}}, bootstrap replicates BB.
Estimate 𝛀^𝟎\boldsymbol{\hat{\Omega}^{0}} by EM (Algorithm 1).
for b=1,,Bb=1,\ldots,B do
     Randomly draw samples regarding all measurements from original data at the same sample size with replacement, thus we get 𝑾𝒃,𝑴𝒃,𝑿𝒃,𝑿𝒃\boldsymbol{W^{b}},\boldsymbol{M^{b}},\boldsymbol{X^{b}},\boldsymbol{{X^{*}}^{b}}.
     Estimate 𝛀^𝒃\boldsymbol{\hat{\Omega}^{b}} using 𝑾𝒃,𝑴𝒃,𝑿𝒃,𝑿𝒃\boldsymbol{W^{b}},\boldsymbol{M^{b}},\boldsymbol{X^{b}},\boldsymbol{{X^{*}}^{b}} by EM (Algorithm 1).
end for
Compute the covariance matrix of 𝛀^𝒃\boldsymbol{\hat{\Omega}^{b}} as 𝑽^=Var(𝛀^𝒃)\boldsymbol{\hat{V}}=Var(\boldsymbol{\hat{\Omega}^{b}}) .
Compute
T^Wald=(𝑨𝛀^𝟎𝒃)T(𝑨𝑽^𝑨T)1(𝑨𝛀^𝟎𝒃).\hat{T}_{\rm Wald}=(\boldsymbol{A{\boldsymbol{\hat{\Omega}^{0}}}-b})^{T}(\boldsymbol{A}\boldsymbol{\hat{V}}\boldsymbol{A}^{T})^{-1}(\boldsymbol{A{\boldsymbol{\hat{\Omega}^{0}}}-b}).

The 100(1α)%(1-\alpha)\% confidence interval for any single parameter β𝛀\beta\in\boldsymbol{\Omega} can also be obtained through nonparametric bootstrap Wald test as (β^zα/2SD(βb),β^+zα/2SD(βb))\left(\hat{\beta}-z_{\alpha/2}{\rm SD}(\beta^{b}),\hat{\beta}+z_{\alpha/2}{\rm SD}(\beta^{b})\right) based on the standard error (SD) of βb𝛀𝒃\beta^{b}\in\boldsymbol{\Omega^{b}} from BB bootstrap samples. When the sample size is small or when the collected data is unbalanced (e.g., Romero data in Section 5), parametric bootstrap could be applied for more robust results as in Martin et al. (2020). Thus, we also developed the Wald test based on parametric bootstrap (i.e., ZIPG-pbWald). That is, we simulate bootstrap samples from the ZIPG model with its parameters estimated under H0H_{0}. A detailed algorithm is provided in Section 2 of the Supplements.

There are multiple ways to construct test statistics and confidence intervals. We conducted extensive simulations studies evaluating the performance of (1) the Wald test without bootstrap (ZIPG-Wald) and the likelihood ratio test (ZIPG-LRT); (2) nonparametric and parametric bootstrap through different construction strategies, such as normality-based/quantile-based/BCa\rm BC_{a}(Efron and Tibshirani, 1994) intervals; (3) resampling schemes for ZIPG-bWald, such as resampling based on measurements or subjects. Simulation results suggest that bootstrap-based Wald tests (i.e., ZIPG-bWald and ZIPG-pbWald) with normality-based confidence intervals are desired, and resampling based on measurements yields satisfactory results. More discussions are provided in Section 4.5.

4 Simulation Studies

We have conducted comprehensive numerical studies to evaluate the performance of ZIPG in terms of both hypothesis testing and point/interval estimation. We evaluated both ZIPG with bootstrap Wald test (denoted as ZIPG-bWald) and parametric bootstrap Wald test (denoted as ZIPG-pbWald), and then we compared them with ZINB-based methods, including NBZIMM (Zhang and Yi, 2020) and pscl (Zeileis et al., 2008). We also assessed the Poisson-Gamma (PG) model implemented by glmmtmb (Brooks et al., 2017) with the Wald test (no bootstrap). PG can link covariates to λ\lambda and θ\theta as in ZIPG, but it is not adjusted for inflated zeros. We summarized all methods compared in Section 4 in Table 1.

Besides hypothesis testing and estimation, we further check the sensitivity of proposed ZIPG inference procedures under misspecified models. In particular, we evaluated the performance of ZIPG when the true zero proportion pp is also associated with covariates (Section 4.4), and when data is generated from Poisson-Gamma distribution or zero-inflated Beta-Binomial distribution (see Section 3.5 of the Supplements).

Table 1: Summary of methods evaluated in simulation studies.
Method λ\lambda θ\theta Zero-Inflation Notes
ZIPG (proposed) \checkmark \checkmark \checkmark (parametric) bootstrap-based test
pscl \checkmark ×\times \checkmark
NBZIMM \checkmark ×\times \checkmark consider random effect on λ\lambda
PG \checkmark \checkmark ×\times

4.1 Simulation Settings

We simulated n=20n=20 subjects with m={5,25}m=\{5,25\} measurements for each subject, with a total sample size of N={100,500}N=\{100,500\}. For i=1,,ni=1,...,n and j=1,,mj=1,...,m, we generated covariates 𝑿𝒊𝒋=(Xi,1,Xij,2)\boldsymbol{X_{ij}}=(X_{i,1},X_{ij,2}) associated with the mean parameter λij\lambda_{ij} and 𝑿𝒊=Xi,1\boldsymbol{X^{*}_{i}}=X_{i,1} associated with the dispersion parameter θi\theta_{i}, where Xi,1X_{i,1} was a time-independent indicator sampled from a Bernoulli distribution with equal probabilities and served as the group indicator for each subject, such as pregnant or non-pregnant, alcohol drinker or non-alcohol drinker. Xij,2X_{ij,2} was a time-dependent longitudinal measurement generated from Xij,2=Xi,2+ϵijX_{ij,2}=X_{i,2}+\epsilon_{ij}, where Xi,2𝒩(0,1)X_{i,2}\sim\mathcal{N}(0,1) and ϵij𝒩(0,0.1)\epsilon_{ij}\sim\mathcal{N}(0,0.1), representing covariates with variation in different measurements, such as calorie intake. Sequencing depths MijM_{ij}’s were generated based on the empirical distribution of observed sequencing depths in the Romero data set (Romero et al., 2014) analyzed later in Section 5. Finally, the observed count data 𝑾\boldsymbol{W} was generated based on models (3)-(4) with 𝜷\boldsymbol{\beta} and 𝜷\boldsymbol{\beta^{*}} specified as below.

To imitate real-world microbial data, we set (β0,β0)=(4.23,0.6)(\beta_{0},\beta^{*}_{0})=(-4.23,0.6), guided by the ZIPG estimates in Romero data. We investigate performance of ZIPG under different model parameter configurations (Table 2). In each simulation setting, we explored three different values of the zero-inflation proportion p={0.3,0.5,0.7}p=\{0.3,0.5,0.7\}, which is equivalent to γ={0.847,0,0.847}\gamma=\{-0.847,0,0.847\}. For power analysis, we set p=0.5p=0.5 and vary the parameter of interest under each null hypothesis from 0 to 1.8. We compare ZIPG-bWald, ZIPG-pbWald, pscl, NBZIMM, and PG for the inference of β1\beta_{1} and only ZIPG-bWald, ZIPG-pbWald, and PG for the inference on β1\beta^{*}_{1}, because inference of the dispersion parameter is not applicable in pscl and NBZIMM. Results are presented based on L=1,000L=1,000 Monte Carlo replicates for each scenario. The performances of ZIPG-bWald and ZIPG-pbWald are evaluated using B=200B=200 bootstrap samples, which are numerically sufficient and stable based on our experience.

Table 2: Summary for simulation settings.
H0H_{0} β0\beta_{0} β1\beta_{1} β2\beta_{2} β0\beta^{*}_{0} β1\beta^{*}_{1} pp
type I error settings
β1=0\beta_{1}=0 -4.23 0 0.45 0.6 {0,0.5,1} {0.3,0.5,0.7}
β1=0\beta_{1}^{*}=0 -4.23 {0,0.5,1} 0.45 0.6 0 {0.3,0.5,0.7}
β2=0\beta_{2}=0 -4.23 1 0 0.6 {0,0.5,1} {0.3,0.5,0.7}
power settings
β1=0\beta_{1}=0 -4.23 {0,0.2,…,1.8} 0.45 0.6 1 0.5
β1=0\beta_{1}^{*}=0 -4.23 1 0.45 0.6 {0,0.2,…,1.8} 0.5
β2=0\beta_{2}=0 -4.23 1 {0,0.2,…,1.8} 0.6 1 0.5

4.2 Hypothesis Testing Results

Type I error analysis for β1\beta_{1} and β1\beta_{1}^{*}.

For H0:β1=0H_{0}:\beta_{1}=0 in the mean model, we observe that both pscl and NBZIMM have inflated type I errors under all simulation scenarios, and PG is too conservative regardless of the proportion pp of inflated zeros (Figure 2(a)). In particular, the type I error of pscl and NBZIMM increase significantly with the increase of β1\beta_{1}^{*}, suggesting that ignoring differential variability could lead to severely increased false positives for the mean model. Of note, the type I error of ZIPG with the bootstrap-based Wald test (ZIPG-bWald) is slightly inflated in a few cases with a small total number of observations (N=100N=100), while its results are satisfactory with a larger sample size (N=500N=500). In general, the type I error of ZIPG is robustly controlled at the nominal level α=0.05\alpha=0.05, regardless of the change of β1\beta_{1}^{*} and pp.

For H0:β1=0H_{0}:\beta_{1}^{*}=0 in the dispersion model, we observe that ZIPG-bWald maintains a controlled type I error for β1\beta^{*}_{1}, yet a little conservative when N=100N=100, while ZIPG-pbWald controls type I error to 0.05 more robustly (Figure 2(b)). Therefore, we suggest using ZIPG-pbWald as an alternative for hypothesis testing in small-sample scenarios. However, when we have a larger sample size (N=500N=500), both ZIPG-bWald and ZIPG-pbWald perform similarly.

Refer to caption
(a) Type I error of H0:β1=0H_{0}:\beta_{1}=0
Refer to caption
(b) Type I error of H0:β1=0H_{0}:\beta_{1}^{*}=0
Figure 2: Type I error results for (a) β1\beta_{1} and (b) β1\beta_{1}^{*}. The significance level is α=0.05\alpha=0.05.

Power analysis for β1\beta_{1} and β1\beta_{1}^{*}.

We present power results of testing H0:β1=0H_{0}:\beta_{1}=0 (Figure 3(a)) and H0:β1=0H_{0}:\beta_{1}^{*}=0 (Figure 3(b)). Since pscl, NBZIMM, and PG fail to preserve the nominal type I error, we do not evaluate their empirical power in the power analysis. For both null hypotheses, the power curves increase with the increase of sample size and true effect size. ZIPG-pbWald and ZIPG-bWald perform similarly in larger sample cases, while ZIPG-pbWald is relatively more powerful for detecting differential variability (H0:β1=0H_{0}:\beta_{1}^{*}=0) with a small sample size (N=100N=100). We only report the common covariates in both the mean model and dispersion model in the main text, and hypothesis testing results on covariates only in the mean model are reported in Section 3.3 of the Supplements.

Refer to caption
(a) Power of rejecting H0:β1=0H_{0}:\beta_{1}=0
Refer to caption
(b) Power of rejecting H0:β1=0H_{0}:\beta_{1}^{*}=0
Figure 3: Power curves of rejecting null hypothesis with N=100N=100 (solid lines) and N=500N=500 (dash lines). With p=0.5p=0.5, the proportion of observed zeros decreased from 0.606 to 0.582 with the increase of β1\beta_{1} in (a), while it increased from 0.536 to 0.653 with the increase of β1\beta_{1}^{*} in (b).

4.3 Point and Interval Estimation Results

To demonstrate the advantage of ZIPG regarding point estimators and confidence intervals, we report the average bias (e.g., {l(β^lβ)}/L\{\sum_{l}(\hat{\beta}_{l}-\beta)\}/L) with its standard error over l=1,,L=1000l=1,\ldots,L=1000 Monte Carlo replicates, average bootstrap standard error (avg-SE, e.g., {lSE^(β^l)}/L\{\sum_{l}\widehat{SE}(\hat{\beta}_{l})\}/L), root mean squared error (RMSE, e.g., {l(β^lβ)2)}/L\sqrt{\{\sum_{l}(\hat{\beta}_{l}-\beta)^{2})\}/L}), and converge rate of confidence intervals (CR) for the settings with β1=0,β1=1\beta_{1}=0,\beta^{*}_{1}=1 (Figure 2(a)) and β1=1,β1=0\beta_{1}=1,\beta^{*}_{1}=0 (Figure 2(b)) with p={0.5,0.7}p=\{0.5,0.7\} and N=500N=500. Results under other settings are similar and hence not reported.

In Table 3, we observe that ZIPG-bWald has the smallest bias of β1\beta_{1}, β1\beta_{1}^{*} and γ\gamma among all methods when β1=1\beta_{1}^{*}=1. In addition, ZIPG is often more efficient than the other two methods, providing a smaller RMSE. For β1\beta_{1} and β1\beta_{1}^{*}, ZIPG-bWald always maintains a valid confidence interval with its coverage rate close to nominal level 0.95, whereas pscl and NBZIMM provide underestimated confidence intervals in most cases, and PG estimate β1\beta_{1}^{*} with strong bias and provide a more conservative CI for β1\beta_{1}. Additional results with setting β1=1,β1=0\beta_{1}=1,\beta^{*}_{1}=0 corresponds to Figure 2(b) are presented in Section 3.4 of the Supplements.

Table 3: Average bias and its SE, average standard error, RMSE, and the empirical coverage rate (CR) of β1\beta_{1} and β1\beta_{1}^{*} estimators. Simulation parameters are set as β1=0,β1=1\beta_{1}=0,\beta^{*}_{1}=1 corresponding to Figure 2(a); p={0.5,0.7}p=\{0.5,0.7\} or γ={0,0.847}\gamma=\{0,0.847\}. NBZIMM does not report inference and CI on γ\gamma. The proportion of zeros observed in each simulation setting is denoted as pobsp_{\rm obs}.
Method avg-bias(SE) avg-SE RMSE CR
N=500N=500, β1=0\beta_{1}=0, β1=1\beta_{1}^{*}=1, p=0.5p=0.5, pobs=0.61p_{\rm obs}=0.61
β1\beta_{1} ZIPG-bWald -0.013(0.254) 0.254 0.254 0.936
pscl 0.216(0.252) 0.23 0.332 0.826
NBZIMM 0.084(0.274) 0.235 0.286 0.877
PG -0.016(0.257) 0.307 0.258 0.977
β1\beta_{1}^{*} ZIPG-bWald -0.009(0.246) 0.256 0.246 0.954
PG -0.449(0.169) 0.165 0.48 0.231
γ\gamma ZIPG-bWald -0.004(0.145) 0.158 0.145 0.968
pscl 0.108(0.152) 0.154 0.186 0.828
NBZIMM -0.245(0.454) - 0.516 -
Method avg-bias(SE) avg-SE RMSE CR
N=500N=500, β1=0\beta_{1}=0, β1=1\beta_{1}^{*}=1, p=0.7p=0.7, pobs=0.77p_{\rm obs}=0.77
β1\beta_{1} ZIPG-bWald -0.027(0.326) 0.341 0.327 0.952
pscl 0.220(0.311) 0.299 0.381 0.874
NBZIMM -0.07(0.408) 0.363 0.414 0.917
PG -0.037(0.34) 0.432 0.342 0.988
β1\beta_{1}^{*} ZIPG-bWald 0.006(0.341) 0.373 0.341 0.959
PG -0.536(0.21) 0.211 0.576 0.282
γ\gamma ZIPG-bWald 0.007(0.158) 0.165 0.158 0.958
pscl 0.047(0.872) 0.153 0.873 0.864
NBZIMM -0.294(0.604) - 0.671 -

4.4 Model Sensitivity Analysis

While ZIPG assumes that all differential variability comes from the Poisson-Gamma part (i.e., θ\theta) and the true zero proportion pp is only taxon-specific, one may wonder how the model fits when these assumptions are violated. Here, we evaluate ZIPG under the misspecified model, in which the true zero proportion pp is also associated with covariates (denoted as “ZIPG-full”). Of note, ZIPG-full is equivalent to Omnibus (Chen et al., 2018) from a hypothesis testing perspective, whereas the Omnibus test does not provide point/interval estimation and hypothesis test for each parameter separately. We consider the group covariates X1X_{1} only and sample size N=500N=500. We use a logistic link logit(p)=γ0+γ1X1{\rm logit}(p)=\gamma_{0}+\gamma_{1}X_{1} with (β0,β0,γ0)=(4.23,0.6,0.847)(\beta_{0},\beta_{0}^{*},\gamma_{0})=(-4.23,0.6,-0.847), and then we report the performance of ZIPG with γ1\gamma_{1} increasing from 0 to 2.5, which is equivalent to increasing pp from 0.300.30 to 0.840.84 in the group with X1=1X_{1}=1.

For two model fittings, i.e., ZIPG and ZIPG-full, we report the proportion of simulations suggesting better BIC from ZIPG (Figure 4). Over 1000 replicated simulations in each setting, ZIPG has a smaller BIC than ZIPG-full in most cases. Moreover, we also use Kolmogorov-Smirnov test (Kolmogorov, 1933; Smirnoff, 1939) to compare the ZIPG predicted distribution with the simulated observations: 100% replicates report insignificant differences between the two distributions (p>0.05p>0.05) , suggesting that ZIPG-predicted distribution has no difference to the observed samples.

Refer to caption
Figure 4: Proportion of ZIPG having smaller BIC than ZIPG-full in each simulation setting when N=500N=500.

We also evaluate ZIPG’s performance under other misspecified models: (1) Poisson-Gamma without zero inflation and (2) zero-inflated Beta-Binomial model (see Section 3.5 of the Supplements). In both scenarios, ZIPG preserves nominal type I error and retains its superior power in detecting differential abundance/variability, especially for large-sample cases.

4.5 Additional Simulation Results

We conduct additional simulations to compare multiple ways of constructing test statistics and confidence intervals. For hypothesis testing, we investigate different test statistics for the proposed ZIPG (Section 3.1 of the Supplements), including the Wald test without bootstrap (ZIPG-Wald) and the likelihood ratio test (ZIPG-LRT). Simulation results suggest that ZIPG with bootstrap-based Wald tests (i.e., ZIPG-bWald and ZIPG-pbWald) are desired. For confidence intervals, we compare the coverage rate for both nonparametric and parametric bootstrap through different construction strategies, such as normality-based/quantile-based/BCa\rm BC_{a}(Efron and Tibshirani, 1994) intervals. Results show that the normality-based confidence interval often has close-to-nominal coverage with a low computational cost (Section 3.2 of the Supplements).

We further evaluate different resampling schemes for ZIPG-bWald, such as resampling based on measurements or subjects. Given the total sample size N=200N=200, results suggest no significant difference between the two strategies (Section 3.6 of the Supplements).

Additional simulations about how measurement times affect ZIPG performance are provided in Section 3.6 of the Supplements. We consider the following two scenarios: (1) when the numbers of measurements per subject are very small (i.e., m=2m=2) and (2) when the numbers of measurements per subject are unequal. Results show that the proposed ZIPG method is valid for both cases above.

5 Data Analysis

In this section, we analyze two microbiome data sets, Romero (Romero et al., 2014) and Dietary (Johnson et al., 2019), to investigate how physical conditions impact microbiome stability in specific taxa. The proposed ZIPG method with bootstrap-based Wald test is compared to pscl (Zeileis et al., 2008) and DESeq2 (Love et al., 2014) from two perspectives: identification of taxa related to covariates and goodness of fit for prediction models. In addition, we also report hypothesis testing results from Omnibus (Chen et al., 2018) to validate the results of ZIPG. We also performed NBZIMM, but it detected only a few taxa with their default subject-level random effect, and hence we present the corresponding results in Section 4.3 of the Supplements.

5.1 Data Description

Romero

is a longitudinal case-control study including 16s rRNA gene sequence-based vaginal microbiota from 22 pregnant and 32 non-pregnant women with samples collected from each subject over intervals of weeks, resulting in 143 taxa and N=900N=900 longitudinal samples ( 139 measurements from pregnant women and 761 measurements from non-pregnant women.) To investigate how taxa are impacted by pregnant status and age in this data, we set the covariates matrix 𝑿=𝑿=(X1X2X3)\boldsymbol{X^{*}}=\boldsymbol{X}=(X_{1}\ X_{2}\ X_{3}), where X1X_{1} is a binary indicator of pregnant status, X2X_{2} is the observational age, and X3X_{3} is an indicator for the race (white or others) but not of our main interest. Note that both pregnant status and age are not changed for each person during data collection.

Dietary

is diet-microbiome data with shotgun metagenomic sequencing results of fecal samples and daily dietary records of 34 subjects on 17 consecutive days. There are total N=475N=475 samples with both microbiome data and dietary records available. In this data, the main analysis of interest is how alcohol affects the microbiome variability. We created a binary indicator for 25 alcohol drinkers and 9 teetotalers, and included this variable in both 𝑿\boldsymbol{X^{*}} and 𝑿\boldsymbol{X}. To account for the impact of other dietary intakes, we also include the first two principal components of the macronutrient matrix in 𝑿\boldsymbol{X}.

In microbiome studies, it is common to filter out taxa with extremely low abundance (i.e., pobs>0.9p_{\rm obs}>0.9) for more stable and reliable results (Wadsworth et al., 2017; Jiang et al., 2021; Zhang and Yi, 2020). Further, taxa with pobs<0.1p_{\rm obs}<0.1 are likely to have little or no zero inflation and can be modeled by other existing methods. Though ZIPG can still be performed with satisfactory results (see Section 3.5 of the Supplements), this group of taxa is not of our main interest. For both Romero and Dietary data, we analyze the taxa with 0.1<pobs<0.90.1<p_{\rm obs}<0.9, which results in 25 taxa in Romero and 52 taxa in Dietary. More details about pobsp_{\rm obs} in Romero and Dietary can be found in Section 4.1 of the Supplements. To account for multiple testing, we report the results with the controlled false discovery rate (FDR<0.05<0.05) using the method of Benjamini and Hochberg (1995).

5.2 Results on Hypothesis Testing

We first present numbers of identified taxa regarding the covariates of interest in the two studies (Figure 5). For ZIPG, we show the set of taxa associated with 𝑿\boldsymbol{X} in the mean model (denoted as ZIPG β\beta) and 𝑿\boldsymbol{X^{*}} in the dispersion model (denoted as ZIPG β\beta^{*}), separately. In Romero, we observe that pregnant subjects are clustered under age 35, while non-pregnant subjects are collected in a much wider range of ages. Owing to the unbalanced sample that pregnant women have fewer measurements and are often younger than non-pregnant women, we use parametric bootstrap (i.e., ZIPG-pbWald) for more stable results. ZIPG identified 18 taxa with differential abundance (H0:β1=0H_{0}:\beta_{1}=0) and 17 taxa with differential variability (H0:β1=0H_{0}:\beta_{1}^{*}=0) associated with pregnancy, while 13 taxa are associated with pregnant status with both abundance and variability (Figure 5(a)). Most of the taxa found by pscl and DESeq2 are also identified by ZIPG, while ZIPG also identified 3 additional taxa with significant differential variability, which are not detected by other methods. Further, ZIPG identified totally additionally 6 taxa associated to age with differential abundance (H0:β2=0H_{0}:\beta_{2}=0) or differential variability (H0:β2=0H_{0}:\beta_{2}^{*}=0) , compared to other methods (Figure 5(b)).

In Dietary, ZIPG is performed using nonparametric bootstrap Wald test (i.e., ZIPG-bWald), because the data is balanced and the sample size is sufficient. ZIPG identified 33 taxa with differential abundance (H0:β1=0H_{0}:\beta_{1}=0) and 24 taxa with differential variability (H0:β1=0H_{0}:\beta_{1}^{*}=0) associated to the alcohol intake (Figure 5(c)). Compared to other methods, ZIPG discovered 5 extra taxa only associated with differential variability and 1 extra taxon associated with both differential abundance and differential variability.

Refer to caption
(a) Romero Pregnant
Refer to caption
(b) Romero Age
Refer to caption
(c) Dietary ALC
Figure 5: The numbers of taxa with significant difference detected by ZIPG, DESeq2 and pscl after controlling FDR<0.05<0.05 regarding H0:β=0H_{0}:\beta=0 and H0:β=0H_{0}:\beta^{*}=0 in Romero (5(a) and 5(b)) and Dietary (5(c)).

We further use the Omnibus test (Chen et al., 2018) as verification for ZIPG detected taxa. The Omnibus test links covariates of interest to all three parameters in the Negative-Binomial distribution and rejects the null hypothesis if any covariate is associated with any of the parameters. Thus, taxa identified by both ZIPG and Omnibus are less likely to be false positives. As expected, in Romero, all taxa identified by ZIPG are also detected by Omnibus. In Dietary, 35 out of 41 taxa identified by ZIPG are also detected by Omnibus. Though the Omnibus test detected more taxa than ZIPG, it is worth pointing out that the Omnibus test can not distinguish differential abundance and differential variability and does not provide point/interval estimation as ZIPG does. Details of taxa detected by each method and estimation results for those taxa regarding parameters of interest are shown in Section 4.3 of the Supplements.

5.3 Analysis on Model Fitting

To visualize the differential variability tested by ZIPG (i.e., H0:β=0H_{0}:\beta^{*}=0), we further analyze the result of Bifidobacteriaceae and Lactobacillus.vaginalis from Romero, and Burkholderiales bacterium and Alistipes indistinctus from Dietary as examples. Other taxa with differential variability identified by ZIPG have similar conclusions.

First, we compare the goodness of fit of results from fitted models to the empirical distribution of the relative abundance. The log of relative abundance observed in real data (i.e., Bifidobacteriaceae from Romero) is compared to the predicted distribution according to the estimated model by ZIPG, pscl, and DESeq2 (Figure 6). For boxplots, we generated samples from each predicted distribution with a quintuple of the observed sample sizes for a better visualization at the tail (e.g., Figure 6(6(a))). All three methods can estimate the median of two groups (pregnant and non-pregnant) accurately, but pscl and DESeq2 cannot distinguish the overdispersion between the two groups, as the box length for the pregnant and non-pregnant groups are similar. On the contrary, ZIPG identified the differential variability of this taxon associated with the factor pregnant with p=0.00256p=0.00256 regarding the null hypothesis H0:β1=0H_{0}:\beta_{1}^{*}=0, and thus its simulated data matches the real data better, providing a shorter interquartile range with a long tail in the pregnant group. We also present the empirical cumulative distribution functions (ECDF) of the log of the relative abundance for the pregnant group, using sampled data from fitted ZIPG, pscl, and DESeq2 models (Figure 6(6(b))). It has been shown that ZIPG also fits the real data better than other methods. Quantile-quantile plots for real data versus predicted distribution also show that ZIPG models the entire distribution better (Section 4.2 of the Supplements).

For Lactobacillus.vaginalis in Romero, we present the relative abundance based on the simulated distributions of the fluctuation factor UGamma(θi1,θi)U\sim{\rm Gamma}(\theta_{i}^{-1},\theta_{i}) from ZIPG and pscl at four representative ages (Figure 7). ZIPG identified that the differential variability of this taxon is associated to age with pp ¡ 0.001 regarding H0:β2=0H_{0}:\beta_{2}^{*}=0. Accordingly, we observe that the shape of the fitted distribution changes with the increase of age. However, pscl is not able to model the change in the entire shape of distribution as the average relative abundance remained the same in this group. The ECDF plot of the pregnant group again shows that ZIPG fitted model is closer to the empirical distribution.

Refer to caption
(a) Box plot of predicted distribution
Refer to caption
(b) ECDF (Pregnant)
Figure 6: Bifidobacteriaceae in Romero: (a) box plot for the predicted distribution and the real observed counts (the log of relative abundance is presented with zero count samples adjusted to 0.5 in the pregnant and non-pregnant group), (b) ECDF in the pregnant group.
Refer to caption
(a) Predicted distribution (Pregnant, White)
Refer to caption
(b) ECDF (Pregnant)
Figure 7: Lactobacillus vaginalis in Romero: (a) the half-violin curves of the fluctuation factor UU for the pregnant, white women generating from parameters estimated by ZIPG and pscl respectively, comparing to the raw count data divided by its mean (“+”). (b) ECDF of predicted and real observed counts in the pregnant group.
Refer to caption
(a) Burkholderiales bacterium
Refer to caption
(b) Alistipes indistinctus
Figure 8: Box plot for the predicted distribution and the real observed counts. We plot the log of relative abundance with zero count samples adjusted to 0.5 in the alcohol and non-alcohol groups for (a) Burkholderiales bacterium and (b) Alistipes indistinctus in Dietary.

In Dietary, we present the box plots for Burkholderiales bacterium and Alistipes indistinctus to show the differential variability between the alcohol and non-alcohol drinkers (Figure 8). For Burkholderiales bacterium, the differential abundance in the two groups (i.e., alcohol and non-alcohol drinkers) are similar, but the overdispersion in alcohol drinkers is obviously larger than that in teetotal subjects. ZIPG can distinguish the differential variability in two groups with p=0.00293p=0.00293 (H0:β1=0H_{0}:\beta_{1}^{*}=0). ZIPG also provides a shorter interquartile in the non-alcohol drinker group, which is consistent with the raw data. On the contrary, pscl failed to detect any differential abundance (p=0.636p=0.636), while DESeq2 provided p=0.049p=0.049 which is significant but much larger than the p-value of ZIPG. For Alistipes indistinctus, though pscl and DESeq2 identified the differential abundance between two groups, both of them did not approximate the data from a distributional perspective because of the ignorance of differential variability. In contrast, ZIPG can distinguish it with pp = 8.54e-6 (H0:β1=0H_{0}:\beta_{1}^{*}=0), and provide a long box for the non-alcohol group showing their small overdispersion and a median more closer to real data.

6 Discussion

In this paper, we propose a Zero-Inflated Poisson-Gamma model for microbiome count data analysis. We decompose the zero-inflated Poisson model and factor the Poisson mean as λijk\lambda_{ijk} for the average abundance level and a multiplicative factor UijkU_{ijk} following gamma distribution controlled by variation parameter θik\theta_{ik}, which accounts for individual-level microbiome abundance variation around λijk\lambda_{ijk}. In traditional ZINB regression, the dispersion parameter is often treated as a nuisance parameter. Our model allows different sets of covariates to be linked to λijk\lambda_{ijk} and θik\theta_{ik} and provides a valid test, outperforming other negative-binomial-based models such as pscl and NBZIMM. To our knowledge, the ZINB-based Omnibus (Chen et al., 2018) method may be the first and only paper that links dispersion to covariates. However, the Omnibus test can not distinguish differential abundance and differential variability. In comparison, we test differential abundance and differential variability separately for longitudinal data and provide valid confidence intervals for each parameter. Moreover, other potential distributions for modeling the multiplicative factor UijkU_{ijk} are worth future exploring, including the mixing distribution of log-normals. However, the difficulty in distinguishing two sources of zeros always exists when the overdispersion is large.

Though linking γk\gamma_{k} to covariates is proposed in pscl and NBZIMM, it is not suggested in our ZIPG model based on two reasons. First, the mechanism of zero-inflation γk\gamma_{k} does not have explicit biological interpretation, while θik\theta_{ik} can be explained as individual-level microbiome stability in longitudinal data. Through simulations, we have shown that linking γk\gamma_{k} to covariates is not preferred from the model selection perspective, even if both θik\theta_{ik} and γk\gamma_{k} are covariates-dependent. Second, the increment in either γk\gamma_{k} or θik\theta_{ik} will lead to the increment of zeros in observed data, and thus linking both parameters to covariates simultaneously will make the inference more challenging and unreliable.

Some promising future work could be incorporating auxiliary information from other taxa. One possible way is to assume the taxon-specific dispersion parameters θik\theta_{ik}’s of closely related taxa (e.g., taxa in the same phylogenetic branch) are impacted by covariates 𝑿𝒊\boldsymbol{X_{i}^{*}} identically and share the same coefficient 𝜷\boldsymbol{\beta^{*}}. In addition, inference on a group of taxa in a joint multivariate model is also worth future investigation.


Supplementary Materials

Supplements:

The supplemental materials (ZIPG-appendix.pdf) include mathematical details of the non-concavity of the log-likelihood, analytical expressions of gradient, details of the parametric bootstrap algorithm, and supplementary figures and tables for additional simulation and real-world data analysis.

R code for ZIPG:

The R code for ZIPG is available based on request. We will submit the R package to CRAN once the paper is accepted.

References

  • Benjamini and Hochberg (1995) Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological) 57(1), 289–300.
  • Brooks et al. (2017) Brooks, M. E., K. Kristensen, K. J. Van Benthem, A. Magnusson, C. W. Berg, A. Nielsen, H. J. Skaug, M. Machler, and B. M. Bolker (2017). glmmtmb balances speed and flexibility among packages for zero-inflated generalized linear mixed modeling. The R journal 9(2), 378–400.
  • Broyden (1970) Broyden, C. G. (1970). The convergence of a class of double-rank minimization algorithms 1. general considerations. IMA Journal of Applied Mathematics 6(1), 76–90.
  • Cao et al. (2020) Cao, Y., A. Zhang, and H. Li (2020). Multisample estimation of bacterial composition matrices in metagenomics data. Biometrika 107(1), 75–92.
  • Chen et al. (2018) Chen, J., E. King, R. Deek, Z. Wei, Y. Yu, D. Grill, and K. Ballman (2018). An omnibus test for differential distribution analysis of microbiome sequencing data. Bioinformatics 34(4), 643–651.
  • Couch et al. (2021) Couch, C. E., K. Stagaman, R. S. Spaan, H. J. Combrink, T. J. Sharpton, B. R. Beechler, and A. E. Jolles (2021). Diet and gut microbiome enterotype are associated at the population level in african buffalo. Nature communications 12(1), 1–11.
  • Efron and Tibshirani (1994) Efron, B. and R. J. Tibshirani (1994). An introduction to the bootstrap. CRC press.
  • Fletcher (1970) Fletcher, R. (1970). A new approach to variable metric algorithms. The computer journal 13(3), 317–322.
  • Goldfarb (1970) Goldfarb, D. (1970). A family of variable-metric methods derived by variational means. Mathematics of computation 24(109), 23–26.
  • Hu et al. (2021) Hu, Y.-J., A. Lane, and G. A. Satten (2021). A rarefaction-based extension of the ldm for testing presence–absence associations in the microbiome. Bioinformatics 37(12), 1652–1657.
  • Jiang et al. (2021) Jiang, S., G. Xiao, A. Y. Koh, J. Kim, Q. Li, and X. Zhan (2021). A Bayesian zero-inflated negative binomial regression model for the integrative analysis of microbiome data. Biostatistics 22(3), 522–540.
  • Johnson et al. (2019) Johnson, A. J., P. Vangay, G. A. Al-Ghalith, B. M. Hillmann, T. L. Ward, R. R. Shields-Cutler, A. D. Kim, A. K. Shmagel, A. N. Syed, P. M. C. Students, et al. (2019). Daily sampling reveals personalized diet-microbiome associations in humans. Cell host & microbe 25(6), 789–802.
  • Kolmogorov (1933) Kolmogorov, A. (1933). Sulla determinazione empirica di una lgge di distribuzione. Inst. Ital. Attuari, Giorn. 4, 83–91.
  • La Rosa et al. (2012) La Rosa, P. S., J. P. Brooks, E. Deych, E. L. Boone, D. J. Edwards, Q. Wang, E. Sodergren, G. Weinstock, and W. D. Shannon (2012). Hypothesis testing and power calculations for taxonomic-based human microbiome data. PloS one 7(12), e52078.
  • Li (2015) Li, H. (2015). Microbiome, metagenomics, and high-dimensional compositional data analysis. Annual Review of Statistics and Its Application 2, 73–94.
  • Li et al. (2018) Li, Z., K. Lee, M. R. Karagas, J. C. Madan, A. G. Hoen, A. J. O’malley, and H. Li (2018). Conditional regression based on a multivariate zero-inflated logistic-normal model for microbiome relative abundance data. Statistics in biosciences 10(3), 587–608.
  • Li et al. (2021) Li, Z., L. Tian, A. J. O’Malley, M. R. Karagas, A. G. Hoen, B. C. Christensen, J. C. Madan, Q. Wu, R. Z. Gharaibeh, C. Jobin, et al. (2021). Ifaa: robust association identification and inference for absolute abundance in microbiome analyses. Journal of the American Statistical Association 116(536), 1595–1608.
  • Love et al. (2014) Love, M. I., W. Huber, and S. Anders (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15(12), 550.
  • Manor et al. (2020) Manor, O., C. L. Dai, S. A. Kornilov, B. Smith, N. D. Price, J. C. Lovejoy, S. M. Gibbons, and A. T. Magis (2020). Health and disease markers correlate with gut microbiome composition across thousands of people. Nature communications 11(1), 1–12.
  • Martin et al. (2020) Martin, B. D., D. Witten, and A. D. Willis (2020). Modeling microbial abundances and dysbiosis with beta-binomial regression. The annals of applied statistics 14(1), 94.
  • Martın-Fernandez et al. (2011) Martın-Fernandez, J. A., J. Palarea-Albaladejo, and R. A. Olea (2011). Dealing with zeros. Compositional data analysis: Theory and applications, 43–58.
  • McLaren et al. (2019) McLaren, M. R., A. D. Willis, and B. J. Callahan (2019). Consistent and correctable bias in metagenomic sequencing experiments. Elife 8, e46923.
  • McMurdie and Holmes (2014) McMurdie, P. J. and S. Holmes (2014). Waste not, want not: why rarefying microbiome data is inadmissible. PLoS computational biology 10(4), e1003531.
  • Morgan et al. (2012) Morgan, X. C., T. L. Tickle, H. Sokol, D. Gevers, K. L. Devaney, D. V. Ward, J. A. Reyes, S. A. Shah, N. LeLeiko, S. B. Snapper, et al. (2012). Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome biology 13(9), 1–18.
  • Petersen and Round (2014) Petersen, C. and J. L. Round (2014). Defining dysbiosis and its influence on host immunity and disease. Cellular microbiology 16(7), 1024–1033.
  • Robinson et al. (2010) Robinson, M. D., D. J. McCarthy, and G. K. Smyth (2010, January). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26(1), 139–140.
  • Romero et al. (2014) Romero, R., S. S. Hassan, P. Gajer, A. L. Tarca, D. W. Fadrosh, L. Nikita, M. Galuppi, R. F. Lamont, P. Chaemsaithong, J. Miranda, et al. (2014). The composition and stability of the vaginal microbiota of normal pregnant women is different from that of non-pregnant women. Microbiome 2(1), 1–19.
  • Shanno (1970) Shanno, D. F. (1970). Conditioning of quasi-newton methods for function minimization. Mathematics of computation 24(111), 647–656.
  • Shi et al. (2022) Shi, P., Y. Zhou, and A. R. Zhang (2022). High-dimensional log-error-in-variable regression with applications to microbial compositional data analysis. Biometrika 109(2), 405–420.
  • Silverman et al. (2020) Silverman, J. D., K. Roche, S. Mukherjee, and L. A. David (2020). Naught all zeros in sequence count data are the same. Computational and structural biotechnology journal 18, 2789–2798.
  • Smirnoff (1939) Smirnoff, N. (1939). Sur les écarts de la courbe de distribution empirique. Matematicheskii Sbornik 48(1), 3–26.
  • Tang and Chen (2019) Tang, Z.-Z. and G. Chen (2019). Zero-inflated generalized dirichlet multinomial regression model for microbiome compositional data analysis. Biostatistics 20(4), 698–713.
  • Vandeputte et al. (2017) Vandeputte, D., G. Kathagen, K. D’hoe, S. Vieira-Silva, M. Valles-Colomer, J. Sabino, J. Wang, R. Y. Tito, L. De Commer, Y. Darzi, et al. (2017). Quantitative microbiome profiling links gut community variation to microbial load. Nature 551(7681), 507–511.
  • Wadsworth et al. (2017) Wadsworth, W. D., R. Argiento, M. Guindani, J. Galloway-Pena, S. A. Shelburne, and M. Vannucci (2017). An integrative Bayesian Dirichlet-multinomial regression model for the analysis of taxonomic abundances in microbiome data. BMC Bioinformatics 18(1), 94.
  • Willis (2019) Willis, A. D. (2019). Rigorous statistical methods for rigorous microbiome science. MSystems 4(3), e00117–19.
  • Wu (1983) Wu, C. J. (1983). On the convergence properties of the em algorithm. The Annals of statistics, 95–103.
  • Xu et al. (2021) Xu, T., R. T. Demmer, and G. Li (2021). Zero-inflated poisson factor model with application to microbiome read counts. Biometrics 77(1), 91–101.
  • Zeileis et al. (2008) Zeileis, A., C. Kleiber, and S. Jackman (2008). Regression models for count data in R. Journal of Statistical Software 27(8).
  • Zeng et al. (2022) Zeng, Y., D. Pang, H. Zhao, and T. Wang (2022). A zero-inflated logistic normal multinomial model for extracting microbial compositions. Journal of the American Statistical Association, 1–14.
  • Zhang and Yi (2020) Zhang, X. and N. Yi (2020). Fast zero-inflated negative binomial mixed modeling approach for analyzing longitudinal metagenomics data. Bioinformatics 36(8), 2345–2351.