

Modeling complex measurement error in microbiome experiments to estimate relative abundances and detection effects

David S Clausen and Amy D Willis
Department of Biostatistics, University of Washington
The authors gratefully acknowledge support from the National Institute of General Medical Sciences (R35 GM133420). Correspondence: adwillis@uw.edu
Abstract

Accurate estimates of microbial species abundances are needed to advance our understanding of the role that microbiomes play in human and environmental health. However, artificially constructed microbiomes demonstrate that intuitive estimators of microbial relative abundances are biased. To address this, we propose a semiparametric method to estimate relative abundances, species detection effects, and/or cross-sample contamination in microbiome experiments. We show that certain experimental designs result in identifiable model parameters, and we present consistent estimators and asymptotically valid inference procedures. Notably, our procedure can estimate relative abundances on the boundary of the simplex. We demonstrate the utility of the method for comparing experimental protocols, removing cross-sample contamination, and estimating species’ detectability.

1 Introduction

Next generation sequencing (NGS) has profoundly impacted the study of microbial communities (microbiomes), which reside in host-associated environments (such as the human body) as well as natural environments (such as soils, salt- and freshwater systems, and aquifers). NGS identifies contiguous genetic sequences, which can be grouped and counted to provide a measure of their abundances. However, even when sequences can be contextualized using databases of microbial DNA, NGS measurements of microbial abundances are not proportional to the abundances of organisms in their originating community. In this paper, we consider the problem of recovering microbial relative abundances from noisy and distorted NGS abundance measurements.

Two main avenues of statistical research have considered the analysis of microbial abundance data from NGS: batch effects removal and differential abundance. Batch effects are systematic distortions in observed abundance data due either to true biological variation (e.g., cage/tank effects) or measurement error (e.g., variation in extraction kits). Tools to remove batch effects from microbiome data include both methods adapted from RNA-seq and microarray analysis (e.g., Leek and Storey, 2007; Johnson et al., 2007; Gagnon-Bartsch and Speed, 2012; Sims et al., 2008) as well as microbiome-specific approaches (Gibbons et al., 2018; Dai et al., 2019). A key motivation for batch effect removal is to allow for more precise comparisons of biological differences between microbial communities, and methods for “differential abundance” offer one avenue for assessing difference. Differential abundance methods typically aim to detect microbial units (such as species or gene abundances) that are present in the data at different average levels across groups. Differential abundance methods differ widely in their generative models for the data (Love et al., 2014; Martin et al., 2019; Li et al., 2021) and/or their approaches to transforming the data before performing regression analyses (Fernandes et al., 2014; Mandal et al., 2015; Mallick et al., 2021).

In this paper, we consider a distinct goal from batch effects removal and differential abundance: modeling the relationship between NGS measurements and the true microbial composition of the community that was sequenced. Our methodology can estimate relative abundances of individual microbial species in specimens by correcting for the unequal detectability of microbial species, accounting for unequal depths of sequencing across samples, and removing batch-specific contamination in samples. In addition, it can estimate species detectabilities, sample intensities, and cross-sample contamination. Our model incorporates information about shared origins of samples (e.g., replicates and dilution series), shared processing of samples (e.g., sequencing or DNA extraction batches) and known information about sample composition (e.g., mock communities and reference protocols), allowing us to identify and estimate statistical parameters that are not accounted for in most models for microbial abundances (e.g., species detectabilities and contamination). While methods exist to estimate either species detectabilities (McLaren et al., 2019; Silverman et al., 2021; Zhao and Satten, 2021) or contamination (Knights et al., 2011; Shenhav et al., 2019; Davis et al., 2018), to our knowledge, no methods can simultaneously estimate species detectabilities, true microbial relative abundances, sample intensities, and contamination.

Because we consider estimation of relative abundances, and we do not assume that all species are present in all samples, our method entails estimation of a parameter that may fall on the boundary of its parameter space. We use estimating equations to construct our estimator, prove semiparametric consistency and existence of limiting laws, construct a highly stable algorithm to find a solution that may fall on the boundary, and develop uncertainty quantification that accounts for the boundary problem. Therefore, in addition to its scientific contribution, our methodology employs contemporary statistical, probabilistic, and algorithmic tools.

We begin by constructing our mean model (Section 2), and discussing identifiability (Section 3), estimation (Section 4), and asymptotic guarantees (Section 5). We demonstrate applications of our method to relative abundance estimation, contamination removal, and the comparison of multiple experimental protocols (Section 6), and evaluate its performance under simulation (Section 7). We conclude with a discussion of our approach and areas for future research (Section 8). Software implementing the methodology is available as an R package at https://github.com/statdivlab/tinyvamp. Code to reproduce all simulations and data analyses is available at https://github.com/statdivlab/tinyvamp_supplementary.

2 A measurement error model for microbiome data

Here we propose a model to connect observed microbial abundances to true relative abundances. We use the term “read” for the observed measurements (which are not required to be integer-valued), “taxon” for the categorical unit under study (e.g., species, strains or cell types), “sample” for the unit of sequencing, and “specimen” for the unique source of genetic material (which may be repeatedly sampled).

Let $W_{i}=(W_{i1},\dots,W_{iJ})$ denote observed reads from taxa $1,\dots,J$ in sample $i$. A common modeling assumption for high-throughput sequencing data is

$$\mathbb{E}[W_{i}\,|\,\mathbf{p}_{i},\gamma_{i}] = \exp(\gamma_{i})\,\mathbf{p}_{i} \qquad (1)$$

where $\mathbf{p}_{i}\in\mathbb{S}^{J-1}$ is the unknown vector of relative abundances of taxa $1,\dots,J$ and $\exp(\gamma_{i})\in\mathbb{R}^{+}$ is a sampling intensity parameter (throughout, we let $\mathbb{S}^{J-1}$ denote the closed $(J-1)$-dimensional simplex). Unfortunately, microbial taxa are not detected equally well (Figure 1, left; McLaren et al. (2019) and references therein). To account for this, we begin by considering the model

$$\mathbb{E}[W_{i}\,|\,\mathbf{p}_{i},\beta,\gamma_{i}] = \exp(\gamma_{i})\,(\exp(\beta)\circ\mathbf{p}_{i}) \qquad (2)$$

where $\beta=(\beta_{1},\dots,\beta_{J})$ represents the detection effects for taxa $1,\dots,J$ in a given experiment ($\circ$ denotes element-wise multiplication). As an identifiability constraint, we set $\beta_{J}=0$, and so interpret $\exp(\beta_{j})$, $j=1,\dots,J$, as the degree of multiplicative over- or under-detection of taxon $j$ relative to taxon $J$. We discuss identifiability in more detail in Section 3.
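To make the effect of $\beta$ concrete, the following sketch (Python with numpy; all numeric values are hypothetical, and the accompanying tinyvamp software is an R package, not this code) computes the mean of model (2) for a single sample and shows how unequal detection biases the observed proportions:

```python
import numpy as np

# Hypothetical values illustrating model (2) for a single sample.
p_i = np.array([0.5, 0.3, 0.2])    # true relative abundances (sum to 1)
beta = np.array([0.7, -1.2, 0.0])  # log detection effects; beta_J = 0 (reference taxon)
gamma_i = np.log(10_000)           # log sampling intensity (about 10,000 expected reads)

# E[W_i | p_i, beta, gamma_i] = exp(gamma_i) * (exp(beta) ∘ p_i)
mu_i = np.exp(gamma_i) * (np.exp(beta) * p_i)

# The expected observed proportions differ from p_i: taxon 1 is
# over-detected and taxon 2 under-detected relative to the reference.
observed_props = mu_i / mu_i.sum()
```

Under these made-up detection effects, taxon 1's expected observed proportion exceeds its true proportion of 0.5, illustrating the systematic bias sketched in Figure 1 (left).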

Figure 1: Two features of microbiome measurement data that are accounted for in our model: (left) observed relative abundances are predictably biased relative to true relative abundances; and (right) contamination is associated with biomass. Here the proportion of reads due to contamination is larger in lower-biomass than in higher-biomass samples, reflecting the greater impact of an approximately constant amount of contamination on lower-biomass samples.

We now generalize model (2) to multiple samples and one or more experimental protocols. For a study involving $n$ samples of $K$ unique specimens ($n\geq K$), we define the sample design matrix $\mathbf{Z}\in\mathbb{R}^{n\times K}$ to link samples to specimens ($\mathbf{Z}_{i\cdot}\in\mathbb{S}^{K-1}$). In most experiments, $Z_{ik}=\mathbf{1}\{\text{sample } i \text{ taken from specimen } k\}$, but more complex designs are also possible (e.g., $Z_{ik}=Z_{ik'}=\frac{1}{2}$ denotes that sample $i$ is a 1:1 mixture of specimens $k$ and $k'$). Letting $\mathbf{p}$ be a $K\times J$ matrix such that the $k$-th row of $\mathbf{p}$ gives the true relative abundances of taxa in specimen $k$, we have that the relative abundance vector for sample $i$ is $(\mathbf{Z}\mathbf{p})_{i\cdot}$. We allow differing detection effects across samples to be specified via the detection design matrix $\mathbf{X}\in\mathbb{R}^{n\times p}$. For example, if samples are processed using one of $p$ different protocols, we might specify $X_{iq}=\mathbf{1}\{\text{sample } i \text{ processed with protocol } q\}$. Accordingly, we now consider a detection effect matrix $\boldsymbol{\beta}\in\mathbb{R}^{p\times J}$, with $\boldsymbol{\beta}_{\cdot J}=\mathbf{0}_{p}$ as before. Therefore, one generalization of model (2) is

$$\mathbb{E}[\text{reads due to contributing samples}\,|\,\boldsymbol{\beta},\mathbf{p},\boldsymbol{\gamma},\mathbf{Z},\mathbf{X}] = \mathbf{D}_{\boldsymbol{\gamma}}(\mathbf{Z}\mathbf{p})\circ\exp(\mathbf{X}\boldsymbol{\beta}), \qquad (3)$$

where $\mathbf{D}_{\boldsymbol{\gamma}}=\text{diag}(\exp(\boldsymbol{\gamma}))$, and where exponentiation is element-wise.
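As a concrete example of the design matrices just introduced, suppose (hypothetically) that four samples are drawn from two specimens and processed under two protocols; $\mathbf{Z}$ and $\mathbf{X}$ can then be built as indicator matrices:

```python
import numpy as np

# Hypothetical design: 4 samples, 2 specimens, 2 protocols.
specimen_of_sample = [0, 0, 1, 1]  # samples 1-2 from specimen 1, samples 3-4 from specimen 2
protocol_of_sample = [0, 1, 0, 1]  # each specimen processed under both protocols

Z = np.eye(2)[specimen_of_sample]  # Z_ik = 1{sample i taken from specimen k}
X = np.eye(2)[protocol_of_sample]  # X_iq = 1{sample i processed with protocol q}

# A 1:1 mixture of the two specimens would instead have a Z row of [0.5, 0.5].
```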

We now extend this model to reflect contributions of contaminant sources (Figure 1, right). We consider $\tilde{K}$ sources of contamination with relative abundance profiles given in the rows of $\tilde{\mathbf{p}}\in\mathbb{R}^{\tilde{K}\times J}$. To link sources of contamination to samples, we let $\tilde{\mathbf{Z}}\in\mathbb{R}^{n\times\tilde{K}}$ be a spurious read design matrix. Most commonly we expect $\tilde{Z}_{i\tilde{k}}=\mathbf{1}\{\text{source } \tilde{k} \text{ may contribute reads to sample } i\}$, but we give an example of an analysis with a more complex $\tilde{\mathbf{Z}}$ in Section 6.2. Then, along with contaminant read intensities $\tilde{\boldsymbol{\gamma}}=[\tilde{\gamma}_{1},\dots,\tilde{\gamma}_{\tilde{K}}]^{T}$, we propose to model

$$\mathbb{E}[\text{reads due to spurious sources}\,|\,\boldsymbol{\gamma},\tilde{\boldsymbol{\gamma}},\tilde{\mathbf{p}},\tilde{\mathbf{Z}}] = \mathbf{D}_{\boldsymbol{\gamma},\boldsymbol{\alpha}}\tilde{\mathbf{Z}}\left[\tilde{\mathbf{p}}\circ\exp(\tilde{\boldsymbol{\gamma}}\mathbf{1}^{T}_{J})\right]. \qquad (4)$$

Here, $\mathbf{D}_{\boldsymbol{\gamma},\boldsymbol{\alpha}}$ is a diagonal matrix with diagonal elements equal to $\exp(\boldsymbol{\gamma}+\mathbf{V}\boldsymbol{\alpha})$ for design matrix $\mathbf{V}\in\mathbb{R}^{n\times K_{\alpha}}$ and parameter $\boldsymbol{\alpha}\in\mathbb{R}^{K_{\alpha}}$. As discussed in more detail in Section 6.2, $\mathbf{V}\boldsymbol{\alpha}$ allows the intensity of contamination to vary across sources, but typically $\boldsymbol{\alpha}=\mathbf{0}_{K_{\alpha}}$. Additionally, while we could incorporate a detection design matrix for contaminant reads (replacing (4) with $\mathbf{D}_{\boldsymbol{\gamma},\boldsymbol{\alpha}}\tilde{\mathbf{Z}}[\tilde{\mathbf{p}}\circ\exp(\tilde{\boldsymbol{\gamma}}\mathbf{1}_{J}^{T}+\tilde{\mathbf{X}}\boldsymbol{\beta})]$ for $\tilde{\mathbf{X}}\in\mathbb{R}^{\tilde{K}\times p}$), for most practical applications it is sufficient to identify $\tilde{\mathbf{p}}$ up to detection effects. Therefore, combining models (3) and (4), we propose the following mean model for next-generation sequencing data $\mathbf{W}\in\mathbb{R}^{n\times J}$:

$$\boldsymbol{\mu}:=\mathbb{E}[\mathbf{W}\,|\,\boldsymbol{\beta},\mathbf{p},\boldsymbol{\gamma},\tilde{\mathbf{p}},\tilde{\boldsymbol{\gamma}}] = \mathbf{D}_{\boldsymbol{\gamma}}(\mathbf{Z}\mathbf{p})\circ\exp(\mathbf{X}\boldsymbol{\beta}) + \mathbf{D}_{\boldsymbol{\gamma},\boldsymbol{\alpha}}\tilde{\mathbf{Z}}\left[\tilde{\mathbf{p}}\circ\exp(\tilde{\boldsymbol{\gamma}}\mathbf{1}_{J}^{T})\right]. \qquad (5)$$
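The full mean model (5) is straightforward to evaluate numerically. The sketch below (Python/numpy; the function and argument names are ours, not from the tinyvamp package) computes $\boldsymbol{\mu}$ for given parameter values, taking $\boldsymbol{\alpha}=\mathbf{0}$ as the default:

```python
import numpy as np

def mean_model(Z, p, X, beta, gamma, Z_tilde, p_tilde, gamma_tilde, V=None, alpha=None):
    """Sketch of mean model (5):
    mu = D_gamma (Z p) ∘ exp(X beta) + D_{gamma,alpha} Z_tilde [p_tilde ∘ exp(gamma_tilde 1^T)].
    Arguments are numpy arrays dimensioned as in Section 2."""
    d_gamma = np.exp(gamma)                          # diagonal of D_gamma, length n
    sample_part = d_gamma[:, None] * (Z @ p) * np.exp(X @ beta)
    if V is None or alpha is None:
        d_ga = d_gamma                               # the typical alpha = 0 case
    else:
        d_ga = np.exp(gamma + V @ alpha)
    contam_part = d_ga[:, None] * (Z_tilde @ (p_tilde * np.exp(gamma_tilde)[:, None]))
    return sample_part + contam_part
```

For instance, with two samples, identity $\mathbf{Z}$, no detection effects, and one contaminant source of intensity 0.1, each row of the output is the specimen's composition plus the shared contaminant contribution.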

3 Model identifiability

Our highly flexible mean model (5) encompasses a wide variety of experimental designs and targets of estimation. As a result, without additional knowledge, the parameters in the mean model (5) will generally be unidentifiable. We provide a general condition for identifiability in SI Lemma 1, and prove identifiability in the following specific cases:

  (a) Estimating relative detectability under different sampling protocols using specimens of unknown composition (see SI Section 9.2 for identifiability; Section 6.1 for analysis)

  (b) Estimating contamination using dilution series (see SI Section 9.3 for identifiability; Section 6.2 for analysis)

  (c) Estimating contamination, detectability and composition when some samples have known composition (see SI Section 9.4 for identifiability; Section 7 for analysis)

In addition to knowledge of $\mathbf{X}$, $\mathbf{Z}$ and $\tilde{\mathbf{Z}}$, constraints on the entries of $\boldsymbol{\beta}$ and/or $\mathbf{p}$ can result in the identifiability of model parameters. For example, in (c), some rows of $\mathbf{p}$ are known. As mentioned previously, an identifiability constraint on $\boldsymbol{\beta}$ (e.g., $\boldsymbol{\beta}_{\cdot J}=\mathbf{0}_{p}$) is always necessary.

4 Estimation and optimization

We propose to estimate the parameters $\boldsymbol{\theta}^{\star}:=(\boldsymbol{\theta},\boldsymbol{\gamma})$ as M-estimators, where $\boldsymbol{\theta}:=(\boldsymbol{\beta},\mathbf{p},\tilde{\mathbf{p}},\tilde{\boldsymbol{\gamma}},\boldsymbol{\alpha})$. We use likelihoods to define estimating equations but do not require or assume that the distribution of $\mathbf{W}$ lies in any particular parametric class. We show consistency and weak convergence of our estimators of $\boldsymbol{\theta}$ under mild conditions in SI Section 10. We use $\boldsymbol{\theta}_{0}$ to denote the true value of $\boldsymbol{\theta}$. Our unweighted objective is given by a Poisson log-likelihood:

$$M^{\star}_{n}(\boldsymbol{\theta}^{\star}):=\frac{1}{n}l_{n}(\boldsymbol{\theta}^{\star})=\frac{1}{n}\mathbf{1}^{T}\left[\text{vec}(\mathbf{W})\circ\log\left(\text{vec}(\boldsymbol{\mu}(\boldsymbol{\theta}^{\star}))\right)-\text{vec}(\boldsymbol{\mu}(\boldsymbol{\theta}^{\star}))\right]. \qquad (6)$$
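For reference, objective (6) can be computed directly from $\mathbf{W}$ and $\boldsymbol{\mu}(\boldsymbol{\theta}^{\star})$; the sketch below (our own helper, not from the package) treats $0\cdot\log 0$ as $0$, as is standard for Poisson log-likelihoods:

```python
import numpy as np

def poisson_objective(W, mu):
    """Unweighted objective (6): (1/n) 1^T [vec(W) ∘ log vec(mu) - vec(mu)].
    Entries with W_ij = 0 contribute only -mu_ij (0·log 0 taken as 0)."""
    n = W.shape[0]
    w, m = W.ravel(), mu.ravel()
    safe_log = np.log(np.where(m > 0, m, 1.0))  # avoid log(0) where W_ij = 0
    return float(np.sum(np.where(w > 0, w * safe_log, 0.0) - m)) / n
```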

We use $M_{n}(\boldsymbol{\theta})$ to denote the profile log-likelihood $\sup_{\boldsymbol{\gamma}\in\mathbb{R}^{n}}M^{\star}_{n}(\boldsymbol{\theta},\boldsymbol{\gamma})$, treating the elements of $\boldsymbol{\gamma}\in\mathbb{R}^{n}$ as sample-specific nuisance parameters. Consistency of $\hat{\boldsymbol{\theta}}$ for $\boldsymbol{\theta}_{0}$ does not require $W_{ij}$ to follow a Poisson distribution; however, the proposed estimator will be inefficient if the relationship between $\mathbb{E}[W_{ij}|Z_{i},X_{i},\tilde{Z}_{i},\gamma_{i},\boldsymbol{\theta}_{0}]$ and $\text{Var}[W_{ij}|Z_{i},X_{i},\tilde{Z}_{i},\gamma_{i},\boldsymbol{\theta}_{0}]$ is not linear (McCullagh, 1983). Therefore, to obtain a more efficient estimator, we also consider maximizing a reweighted Poisson log-likelihood, with weights chosen on the basis of a flexibly estimated mean-variance relationship. We motivate our specific choice of weights via the Poisson score equations (see SI Section 9). For a weighting vector $\hat{\mathbf{v}}\in\mathbb{R}_{+}^{nJ}$ such that $\mathbf{1}^{T}\hat{\mathbf{v}}=nJ$, we define the weighted Poisson log-likelihood as

$$M_{n}^{\star\hat{\mathbf{v}}}(\boldsymbol{\theta}^{\star}):=\frac{1}{n}\hat{\mathbf{v}}^{T}\left[\text{vec}(\mathbf{W})\circ\log\left(\text{vec}(\boldsymbol{\mu}(\boldsymbol{\theta}^{\star}))\right)-\text{vec}(\boldsymbol{\mu}(\boldsymbol{\theta}^{\star}))\right]. \qquad (7)$$

We define $M^{\hat{\mathbf{v}}_{n}}_{n}(\boldsymbol{\theta})$ by analogy with $M_{n}(\boldsymbol{\theta})$ above. We select $\hat{\mathbf{v}}$ via a centered isotonic regression (Oron and Flournoy, 2017) of the squared residuals $\text{vec}[(\mathbf{W}-\hat{\boldsymbol{\mu}})^{2}]$ on the fitted means $\text{vec}[\hat{\boldsymbol{\mu}}]$ obtained from the unweighted objective. Full details are given in SI Section 10, but briefly, we set $\hat{v}_{ij}\propto\frac{\hat{\mu}_{ij}+1}{\hat{\sigma}^{2}_{ij}+1}$, where $\hat{\sigma}^{2}_{ij}$ is the monotone regression fitted value for $\mu_{ij}$.
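The weighting step can be sketched as follows. We use a plain pool-adjacent-violators (PAVA) fit in place of the centered isotonic regression of Oron and Flournoy (2017), so this is a simplification of the procedure described above, with hypothetical function names:

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators: least-squares nondecreasing fit to y.
    (The paper uses *centered* isotonic regression; plain PAVA shown here.)"""
    fits, weights = [], []
    for val in y:
        fits.append(float(val))
        weights.append(1.0)
        while len(fits) > 1 and fits[-2] > fits[-1]:
            # pool the two violating blocks into their weighted average
            w = weights[-1] + weights[-2]
            v = (fits[-1] * weights[-1] + fits[-2] * weights[-2]) / w
            fits.pop(); weights.pop()
            fits[-1], weights[-1] = v, w
    return np.repeat(fits, np.array(weights).astype(int))

def estimate_weights(mu_hat, W):
    """v_ij ∝ (mu_hat_ij + 1)/(sigma2_hat_ij + 1), normalized to sum to nJ."""
    mu = mu_hat.ravel()
    r2 = ((W - mu_hat) ** 2).ravel()       # squared residuals
    order = np.argsort(mu)
    sig2 = np.empty_like(r2)
    sig2[order] = pava(r2[order])          # monotone variance fit in mu
    v = (mu + 1.0) / (sig2 + 1.0)
    return v * v.size / v.sum()            # enforce 1^T v = nJ
```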

The estimators defined by optima of the weighted or unweighted Poisson likelihoods given above are consistent for the true value of $\boldsymbol{\theta}$ and converge weakly to well-defined limiting distributions at $\sqrt{n}$ rate, though the form of this distribution in general depends on the true value of $\boldsymbol{\theta}$. We leverage an approach from Van der Vaart (2000) to prove consistency, and we combine a bracketing argument with a directional delta method theorem of Dümbgen (1993) to show weak convergence. Proofs are given in SI Section 11.

Computing maximum (weighted) likelihood estimates of $\boldsymbol{\theta}^{\star}$ is a constrained optimization problem, as the relative abundance parameters in our model are simplex-valued, and the estimate may lie on the boundary of the simplex. Therefore, we minimize $f_{n}(\boldsymbol{\theta}^{\star})=-M^{\star}_{n}(\boldsymbol{\theta}^{\star})$ or $f_{n}(\boldsymbol{\theta}^{\star})=-M^{\star\hat{\mathbf{v}}_{n}}_{n}(\boldsymbol{\theta}^{\star})$ in two steps. In the first step, we employ the barrier method, converting our constrained optimization problem into a sequence of unconstrained optimizations, permitting solutions progressively closer to the boundary. That is, for barrier penalty parameter $t$, we update $\boldsymbol{\theta}^{\star}$ as

$$\boldsymbol{\theta}^{\star(r+1)} = \underset{\boldsymbol{\theta}^{\star}}{\text{arg min}}\left(f_{n}(\boldsymbol{\theta}^{\star})+\frac{1}{t^{(r)}}\left[\sum_{k=1}^{K}\sum_{j=1}^{J}-\log p_{kj}+\sum_{\tilde{k}=1}^{\tilde{K}}\sum_{j=1}^{J}-\log\tilde{p}_{\tilde{k}j}\right]\right) \qquad (8)$$
$$\text{subject to } \sum_{j=1}^{J}p_{kj}=1 \text{ and } \sum_{j=1}^{J}\tilde{p}_{\tilde{k}j}=1 \text{ for all } k,\tilde{k} \text{ s.t. } \mathbf{p}_{k},\tilde{\mathbf{p}}_{\tilde{k}} \text{ unknown} \qquad (9)$$

and set $t^{(r+1)}=at^{(r)}$, where $a>1$ is a prespecified incrementing factor, iterating until $t^{(r)}>t_{\text{cutoff}}$ for large $t_{\text{cutoff}}$. In practice we find that $t^{(0)}=1$, $a=10$, and $t_{\text{cutoff}}=10^{12}$ yield good performance. We enforce the sum-to-one constraints (9) by reparametrizing $\mathbf{p}$ and $\tilde{\mathbf{p}}$ as $\boldsymbol{\rho}$ and $\tilde{\boldsymbol{\rho}}$, with $\rho_{kj}:=\log\frac{p_{kj}}{p_{kJ}}$ and $\tilde{\rho}_{\tilde{k}j}:=\log\frac{\tilde{p}_{\tilde{k}j}}{\tilde{p}_{\tilde{k}J}}$ for $j=1,\dots,J-1$, which are well-defined because of the logarithmic penalty terms in (8).
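The log-ratio reparametrization used to enforce the sum-to-one constraints has a simple closed-form inverse (an additive log-ratio transform); a minimal sketch, with the function name ours:

```python
import numpy as np

def rho_to_p(rho):
    """Map rho_j = log(p_j / p_J), j = 1, ..., J-1, back to the simplex."""
    z = np.concatenate([rho, [0.0]])  # the reference coordinate rho_J is 0
    e = np.exp(z - z.max())           # subtract the max for numerical stability
    return e / e.sum()
```

The barrier penalty in (8) keeps every $p_{kj}$ strictly positive during this first step, so the log-ratios are finite; exact zeros are only permitted in the second optimization step.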

In the second step of our optimization procedure, we apply a constrained Newton algorithm within an augmented Lagrangian algorithm to allow elements of $\hat{\mathbf{p}}$ and $\hat{\tilde{\mathbf{p}}}$ to equal zero. Iteratively in each row $\mathbf{p}_{k}$ of $\mathbf{p}$, we approximately solve

$$\underset{\mathbf{p}_{k}}{\text{arg min}}\ f_{n}(\mathbf{p}_{k}) \quad \text{subject to } \sum_{j=1}^{J}p_{kj}=1,\ p_{kj}\geq 0 \text{ for } j=1,\dots,J \qquad (10)$$

where we write $f_{n}(\mathbf{p}_{k})$ as a function of only $\mathbf{p}_{k}$ to reflect that we fix all other parameters at values obtained in previous optimization steps. We choose update directions for $\mathbf{p}_{k}$ via an augmented Lagrangian algorithm of Bazaraa (2006) applied to

$$\mathcal{L}_{k}:=Q_{k}^{(t)}+\nu\left[\sum_{j=1}^{J}p_{kj}-1\right]+\mu\left[\sum_{j=1}^{J}p_{kj}-1\right]^{2} \qquad (11)$$

where $Q_{k}^{(t)}$ is a quadratic approximation to $f_{n}(\mathbf{p}_{k})$ at $\mathbf{p}_{k}^{(t)}$ and $\nu$ and $\mu$ are chosen using the algorithm of Bazaraa (2006). The augmented Lagrangian algorithm iteratively updates $\nu$ and $\mu$ until solutions to $\mathcal{L}_{k}$ satisfy $|\sum_{j=1}^{J}p_{kj}-1|<\epsilon$ for a small prespecified value of $\epsilon$ (we use $10^{-10}$ by default). Within each iteration of the augmented Lagrangian algorithm, we minimize $\mathcal{L}_{k}$ via fast non-negative least squares to preserve nonnegativity of $\mathbf{p}_{k}$. Through the augmented Lagrangian algorithm, we obtain a value $\mathbf{p}_{\mathcal{L}_{k}}^{(t)}$ of $\mathbf{p}_{k}$ that minimizes $\mathcal{L}_{k}^{(t)}$ (at final values of $\nu$ and $\mu$) subject to nonnegativity constraints. Our update direction for $\mathbf{p}_{k}$ is then given by $\mathbf{s}_{k}^{(t)}=\mathbf{p}^{(t)}_{\mathcal{L}_{k}}-\mathbf{p}^{(t)}_{k}$. We conduct a backtracking line search in direction $\mathbf{s}_{k}^{(t)}$ to find an update $\mathbf{p}_{k}^{(t+1)}$ that decreases $f_{n}(\mathbf{p}_{k})$.
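The final backtracking step is a standard line search; a generic sketch (ignoring the simplex bookkeeping handled by the augmented Lagrangian step, and with all names ours) is:

```python
import numpy as np

def backtracking(f, x, s, f_x, shrink=0.5, max_iter=50):
    """Shrink the step along direction s until the objective decreases.
    Returns x unchanged if no decrease is found (a conservative fallback)."""
    t = 1.0
    for _ in range(max_iter):
        if f(x + t * s) < f_x:
            return x + t * s
        t *= shrink
    return x
```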

5 Inference for 𝐩\mathbf{p} and 𝜷\boldsymbol{\beta}

We now address the construction of confidence intervals and hypothesis tests. We focus on the parameters $\boldsymbol{\beta}$ and $\mathbf{p}$, which we believe to be the most common targets for inference. To derive both marginal confidence intervals and more complex hypothesis tests, we consider a general setting in which we observe some estimate $\hat{\phi}=\phi(\mathbb{P}_{n})$ of a population quantity $\phi=\phi(P)$, where $\mathbb{P}_{n}$ is the empirical distribution corresponding to a sample $\{(\mathbf{W}_{i},\mathbf{Z}_{i},\mathbf{X}_{i},\tilde{\mathbf{Z}}_{i})\}_{i=1}^{n}$, $P$ is its population analogue, and $\phi$ is a Hadamard directionally differentiable map into the parameter space $\Theta$ or into $\mathbb{R}$. To derive marginal confidence intervals for $\mathbf{p}$ and $\boldsymbol{\beta}$, we let $\phi(\mathbb{P}_{n})=\hat{\boldsymbol{\theta}}_{n}$ with $\phi(P)=\boldsymbol{\theta}_{0}$. For hypothesis tests involving multiple parameters, we specify $\phi(\mathbb{P}_{n})$ as $2[\sup_{\boldsymbol{\theta}\in\Theta}M_{n}(\boldsymbol{\theta})-\sup_{\boldsymbol{\theta}\in\Theta_{0}}M_{n}(\boldsymbol{\theta})]$ with population analogue $\phi(P)=0$ under $H_{0}:\boldsymbol{\theta}\in\Theta_{0}$. In each case, we estimate the asymptotic distribution of $a(n)(\phi(\mathbb{P}_{n})-\phi(P))$ for an appropriately chosen $a(n)\rightarrow\infty$.

As our model includes parameters that may lie on the boundary of the parameter space, the limiting distributions of our estimators and test statistics in general do not have a simple distributional form (Geyer, 1994), and the multinomial bootstrap will fail to produce asymptotically valid inference (Andrews, 2000). To address this, we employ a Bayesian subsampled bootstrap (Ishwaran et al., 2009), which consistently estimates the asymptotic distribution of our estimators even when the true parameter is on the boundary. Let $\mathbb{P}^{\boldsymbol{\xi}}_{n}$ be a weighted empirical distribution $\sum_{i=1}^{n}\xi_{i,n}\mathbf{1}_{\mathbf{W}_{i}}$ with weights $\boldsymbol{\xi}\sim\text{Dirichlet}(\frac{m}{n}\mathbf{1}_{n})$. Then the bootstrap estimator $a(m)(\phi(\mathbb{P}_{n}^{\boldsymbol{\xi}})-\phi(\mathbb{P}_{n}))$ converges weakly to the limiting distribution of $a(n)(\phi(\mathbb{P}_{n})-\phi(P))$ if we choose $m=m(n)$ such that $\lim_{n\rightarrow\infty}m=\infty$ and $\lim_{n\rightarrow\infty}\frac{m}{n}=0$ (Ishwaran et al., 2009). We explore the finite-sample behavior of the proposed bootstrap estimators with $m=\sqrt{n}$ in Section 7, finding good Type 1 error control.

To derive marginal confidence intervals for elements of $\boldsymbol{\theta}$, we let $a(n)=\sqrt{n}$ and $\phi(P)=\boldsymbol{\theta}(P)$. Then $\sqrt{m}(\hat{\boldsymbol{\theta}}^{\boldsymbol{\xi}}_{n}-\hat{\boldsymbol{\theta}}_{n})$ has the same limiting distribution as $\sqrt{n}(\hat{\boldsymbol{\theta}}-\boldsymbol{\theta})$. Therefore, for $\hat{L}^{q}_{c}$ the $c$-th bootstrap quantile of the $q$-th element of $\sqrt{m}(\hat{\boldsymbol{\theta}}^{\boldsymbol{\xi}}_{n}-\hat{\boldsymbol{\theta}}_{n})$ and $\hat{\theta}_{q}$ the $q$-th element of the maximum (weighted) likelihood estimate $\hat{\boldsymbol{\theta}}$, an asymptotically $100(1-\alpha)\%$ marginal confidence interval for $\theta_{q}$ is given by $\left(\hat{\theta}_{q}-\frac{1}{\sqrt{n}}\hat{L}^{q}_{1-\alpha/2},\ \hat{\theta}_{q}-\frac{1}{\sqrt{n}}\hat{L}^{q}_{\alpha/2}\right)$.
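The interval construction can be sketched as follows, using a scalar weighted-mean estimator as a stand-in for the much more involved weighted M-estimation step (all function and variable names here are ours, not the package's):

```python
import numpy as np

rng = np.random.default_rng(0)

def subsampled_bootstrap_ci(data, estimator, b=2000, alpha=0.05):
    """Sketch of the marginal CI via the Bayesian subsampled bootstrap:
    weights xi ~ Dirichlet((m/n) 1_n) with m = sqrt(n), and the interval
    (theta_hat - L_{1-a/2}/sqrt(n), theta_hat - L_{a/2}/sqrt(n)).
    `estimator` maps (data, weights) to a scalar estimate."""
    n = len(data)
    m = np.sqrt(n)
    theta_hat = estimator(data, np.full(n, 1.0 / n))
    draws = np.empty(b)
    for i in range(b):
        xi = rng.dirichlet(np.full(n, m / n))
        draws[i] = np.sqrt(m) * (estimator(data, xi) - theta_hat)
    lo_q, hi_q = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
    return theta_hat - hi_q / np.sqrt(n), theta_hat - lo_q / np.sqrt(n)
```

For example, with `estimator = lambda d, w: float(np.sum(w * d) / np.sum(w))` (a weighted mean), the routine returns an approximate 95% interval around the sample mean.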

As it may be of interest to test hypotheses about multiple parameters while leaving other parameters unrestricted (e.g., $\boldsymbol{\beta}=\mathbf{0}$ with unrestricted elements of $\mathbf{p}$), we also develop a procedure to test hypotheses of the form $H_{0}:\{\theta_{k}=c_{k}:k\in\mathcal{K}_{0}\}$ for a set of parameter indices $\mathcal{K}_{0}$ against alternatives with $\boldsymbol{\theta}$ unrestricted. Letting $\Theta_{0}$ denote the parameter space under $H_{0}$ and $\Theta$ the full parameter space, we conduct tests using the test statistic $T_{n}:=nT(\mathbb{P}_{n}):=2n[\sup_{\boldsymbol{\theta}\in\Theta}M_{n}(\boldsymbol{\theta})-\sup_{\boldsymbol{\theta}\in\Theta_{0}}M_{n}(\boldsymbol{\theta})]$. As noted above, $T_{n}$ is in general not asymptotically $\chi^{2}$ if (unknown) elements of $\mathbf{p}$ or $\tilde{\mathbf{p}}$ lie on the boundary, and so we instead approximate the null distribution of $T_{n}$ by bootstrap resampling from an empirical distribution projected onto an approximate null; this is closely related to the approach suggested by Hinkley (1988). Let $\mathring{\mu}_{ij}$ denote $\exp(-\gamma_{i})$ times the expectation of $W_{ij}$ under the full model; let $\mathring{\mu}_{ij}^{0}$ denote the analogous quantity under $H_{0}$; and define $W_{ij}^{0}=W_{ij}\frac{\mathring{\mu}_{ij}^{0}}{\mathring{\mu}_{ij}}$ if $\mathring{\mu}_{ij}>0$, and otherwise set $W_{ij}^{0}=W_{ij}=0$. In practice, we do not know $\mathring{\mu}_{ij}$ or $\mathring{\mu}_{ij}^{0}$, so we replace $W_{ij}^{0}$ with $\hat{W}_{ij}^{0}=W_{ij}\frac{\hat{\mathring{\mu}}_{ij}^{0}}{\hat{\mathring{\mu}}_{ij}}$, where $\hat{\mathring{\mu}}_{ij}$ and $\hat{\mathring{\mu}}_{ij}^{0}$ are, up to the proportionality constant $\exp(\hat{\gamma}_{i})$, the fitted means for $W_{ij}$ under the full and null models.
After constructing $\hat{\mathbf{W}}^{0}$, we rescale its rows so that the row sums of $\hat{\mathbf{W}}^{0}$ and $\mathbf{W}$ are equal. We then approximate the null distribution of $T$ via bootstrap draws of $mT(\mathbb{P}^{\boldsymbol{\xi}}_{0n})$, where $\mathbb{P}^{\boldsymbol{\xi}}_{0n}$ is the Bayesian subsampled bootstrap distribution on $\hat{\mathbf{W}}^{0}_{n}$. We reject $H_{0}$ at level $\alpha$ if the observed likelihood ratio test statistic is larger than the $1-\alpha$ quantile of the bootstrap estimate of its null distribution, or equivalently, if $T_{n}\geq\tilde{L}^{m}_{1-\alpha}$ for $\tilde{L}^{m}_{1-\alpha}$ the $1-\alpha$ quantile of $mT(\mathbb{P}^{\boldsymbol{\xi}}_{0n})$.
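The projection of $\mathbf{W}$ onto the approximate null can be sketched as follows (a simplification; `project_to_null` and its argument names are ours):

```python
import numpy as np

def project_to_null(W, mu_hat, mu0_hat):
    """Compute W0_ij = W_ij * mu0_ij / mu_ij where mu_ij > 0 (else 0),
    then rescale each row so its sum matches the corresponding row of W."""
    safe = np.where(mu_hat > 0, mu_hat, 1.0)
    W0 = np.where(mu_hat > 0, W * mu0_hat / safe, 0.0)
    row0 = W0.sum(axis=1, keepdims=True)
    scale = np.where(row0 > 0,
                     W.sum(axis=1, keepdims=True) / np.where(row0 > 0, row0, 1.0),
                     0.0)
    return W0 * scale
```

Rows whose fitted means agree under both models are left unchanged, while rows where the null shifts the fitted composition are reweighted and rescaled to preserve total reads.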

6 Data Examples

6.1 Comparing detection effects across experiments

We now demonstrate the utility of our model in comparing different experimental protocols for high-throughput sequencing of microbial communities. We consider data generated in the Phase 2 experiment of Costea et al. (2017) (see also McLaren et al. (2019)), wherein ten human fecal specimens (labeled $1,2,\dots,8$, A and B) were mixed with a synthetic community of 10 taxa and prepared for shotgun metagenomic sequencing according to three different sample preparations (labeled H, Q, and W; specimens A and B were only analyzed with preparation Q). The synthetic community was also sequenced alone. Raw sequencing data was processed into taxon abundance data using MetaPhlAn2 (Truong et al., 2015) by McLaren et al. (2019). In addition to sequencing data, taxon abundances in the synthetic community were also measured using flow cytometry. We treat both sequencing and flow cytometry measurements as outcomes $\mathbf{W}$. We are interested in comparing the detection of taxa in the synthetic community across protocols H, Q and W relative to flow cytometry. We are specifically interested in testing the null hypothesis that all sequencing protocols share the same detection effects. To accomplish this, we estimate the $3\times 10$ matrix $\boldsymbol{\beta}$ (we set $\boldsymbol{\beta}_{\cdot 10}=\mathbf{0}_{3}$ to ensure identifiability; see SI Section 9.2 for a proof of identifiability). Each row of $\boldsymbol{\beta}$ corresponds to a sequencing protocol, and each column corresponds to a taxon. For details regarding the specification of $\mathbf{X}$ and other model parameters, see SI Section 13.1. Under our model, $\exp(\beta_{1j})$, $\exp(\beta_{1j}+\beta_{2j})$, and $\exp(\beta_{1j}+\beta_{3j})$ give the degree of over- or under-detection of taxon $j$ relative to taxon 10 under protocols H, Q, and W, respectively. We compare this model to a submodel in which $\beta_{kj}=0$ for $k=2,3$ and all $j$.
Under this null hypothesis, taxon detections relative to flow cytometry do not differ across protocols.

To compare the predictive performance of these models, we perform 10-fold cross-validation on each model (see SI Section 13.2 for details). We use a bootstrapped likelihood ratio test to formally test our full model against the null submodel. We use the Bayesian subsampled bootstrap with $m=\sqrt{n}$ to illustrate its applicability; however, a multinomial bootstrap would also be appropriate here, as all parameters lie in the interior of the parameter space in this case. In addition, we report point estimates and bootstrapped marginal 95% confidence intervals for the detection effects estimated for each protocol under the full model. We also compare our results to MetaPhlAn2's “plug-in” estimate of each sample's composition.

Figure 2: 10-fold cross-validated estimates of relative abundance (y-axis) in Costea et al. (2017) samples that were measured by whole-genome sequencing, as well as plug-in estimates of relative abundance (bottom row). On the x-axis are relative abundance estimates obtained via flow cytometry (mean concentration of each taxon divided by the sum across taxa). The top row contains estimates produced by a model containing separate detection effects $\boldsymbol{\beta}$ for each protocol; the middle row contains output from a model assuming a single common detection effect across protocols; and the bottom row contains "plug-in" estimates from MetaPhlAn2 output. Results for protocols H, Q, and W are given in the leftmost, center, and rightmost columns, respectively. Within each pane, estimated relative abundances for the same sample are connected by line segments, and the line $y=x$ is indicated with a dotted line.

Figure 2 summarizes 10-fold cross-validated estimated relative abundances from the full model, the null model, and plug-in estimates. We observe substantially better model fit for the full model (top row) than for the null model (middle row). At each flow cytometric relative abundance, cross-validated estimates from the full model are generally centered around the line $y=x$ (dotted line), whereas estimates from the null model exhibit substantial bias for some taxa. A bootstrapped likelihood ratio test of the null model (i.e., $H_{0}:\boldsymbol{\beta}_{2\cdot}=\boldsymbol{\beta}_{3\cdot}=\mathbf{0}$) against the full model reflects this, and we reject the null with $p<0.001$. Both the full and null models outperform the plug-in estimates of sample composition (bottom row), which produce substantially biased estimates of relative abundance relative to a flow cytometry standard. We report point estimates and marginal 95% confidence intervals for $\boldsymbol{\beta}$ in SI Section 13.3.

Using the full model, we estimate relative abundances with substantially greater precision under protocol W (top right) than under either other protocol (top left and center). This appears to be primarily due to lower variability in measurements taken via protocol W (bottom row). Our finding of greater precision for protocol W contrasts with Costea et al. (2017), who recommend protocol Q as a "potential benchmark for new methods" on the basis of the median absolute error of centered-log-ratio-transformed plug-in estimates of relative abundance against flow cytometry measurements (as well as on the basis of cross-laboratory measurement reproducibility, which we do not examine here). The recommendations of Costea et al. (2017) are driven by the performance of plug-in estimators subject to considerable bias, whereas we are able to model and remove a large degree of bias and can hence focus on residual variation after bias correction. We also note that Costea et al. (2017) did not use MetaPhlAn2 to construct abundance estimates, which may partly account for our different conclusions.

6.2 Estimating contamination via dilution series

We next illustrate how to use our model to estimate and remove contamination in samples. We consider 16S rRNA sequencing data from Karstens et al. (2019), who generated 9 samples via three-fold dilutions of a synthetic community containing 8 distinct strains of bacteria which each account for 12.5% of the DNA in the community. Despite only 8 strains being present in the synthetic community, 248 total strains were identified based on sequencing (see SI Section 14.1 for data processing details). We refer to the 8 strains in the synthetic community as “target” taxa and other strains as “off-target.” Note that Karstens et al. (2019) identified one strain as a likely mutant of synthetic community member S. enterica, and we refer to this strain as S. [unclassified].

To evaluate the performance of our model, we perform three-fold cross-validation and estimate relative abundance in the hold-out fold. We consider $\mathbf{p}\in\mathbb{R}^{2\times 248}$, where the first row contains the known composition of the training fold $\left(\mathbf{0}^{T}_{240}~\frac{1}{8}\mathbf{1}_{8}^{T}\right)$ and the second row is unknown and the target of inference (see SI Section 14.2 for the full model specification). We evenly balance dilutions across folds, and we use our proposed reweighted estimator to fit a model that accounts for the serial dilutions. We set $\tilde{K}=1$ and let $\tilde{\mathbf{Z}}_{i}=3^{d_{i}}$, where $d_{i}$ is the number of three-fold dilutions sample $i$ has undergone. This model reflects the assumption that the ratio of expected contaminant reads to expected non-contaminant reads is proportional to $3^{d_{i}}$. To avoid improperly sharing information about contamination amounts across folds, we include a fixed, unknown parameter $\alpha\in\mathbb{R}$ in $\mathbf{D}_{\boldsymbol{\gamma},\boldsymbol{\alpha}}$. In particular, we let the design matrix $\mathbf{V}$ by which we premultiply $\alpha$ consist of a single column vector with $i$-th entry 1 if sample $i$ is in the held-out fold and 0 otherwise. This preserves information about relative dilution within the held-out fold without treating samples in the training and held-out folds as part of the same dilution series. That is, the source of contamination for the held-out fold is modeled to be the same as in the training fold, but the intensity of contamination for held-out sample $i$ having undergone $d_{i}$ three-fold dilutions is given by $\exp(\tilde{\gamma}+\alpha)3^{d_{i}}$, as compared to $\exp(\tilde{\gamma})3^{d_{i}}$ for a sample in the training set. We model a single differential detection effect $\beta_{j}$ for each of the 8 taxa in the synthetic community, setting $\beta_{J}=0$ for the reference taxon L. fermentum for identifiability. Because $\beta_{j}$ is not identifiable for off-target taxa, we also fix $\beta_{j}=0$ for $j=1,\ldots,240$ (see SI Section 9.3 for identifiability results).
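To make the dilution-scaled contamination concrete, the sketch below (our illustration; the function and parameter names are hypothetical, not from the paper's software) computes expected reads as on-target signal plus a contaminant component whose ratio to the signal grows as $3^{d_i}$, with the held-out-fold offset $\alpha$ entering as described above.

```python
import numpy as np

def expected_reads(gamma_i, d_i, p_specimen, p_contam, gamma_tilde, alpha=0.0):
    """Expected reads for one sample: specimen signal plus contamination.

    gamma_i     : log read intensity of sample i
    d_i         : number of three-fold dilutions sample i has undergone
    p_specimen  : relative abundance profile of the specimen
    p_contam    : relative abundance profile of the contamination source
    gamma_tilde : log contamination intensity (training fold)
    alpha       : extra log-intensity offset for held-out-fold samples
    """
    signal = np.exp(gamma_i) * np.asarray(p_specimen)
    # contaminant-to-signal ratio is exp(gamma_tilde + alpha) * 3**d_i
    contam = (np.exp(gamma_i) * np.exp(gamma_tilde + alpha)
              * 3.0 ** d_i * np.asarray(p_contam))
    return signal + contam
```

With specimen and contaminant profiles on disjoint taxa, each additional three-fold dilution multiplies the contaminant-to-signal read ratio by 3, and setting $\alpha$ multiplies it by $e^{\alpha}$.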

Figure 3: Data from Karstens et al. (2019), and estimates and summaries from the fitted model. (Top left) Observed read proportions by taxon and dilution, with the theoretical synthetic composition indicated by a dotted line. (Top right) The log-ratio of total contaminant reads to total non-contaminant reads (excluding S. [unclassified]). The dotted line has slope $\log(3)$, with intercept fit via least squares. (Bottom left) Fitted read proportions obtained from each cross-validation fold, with the theoretical synthetic composition indicated by a dotted line. Every fold produces abundance estimates that improve over observed read proportions. (Bottom right) The log-ratio of reads for target taxa to reference taxon L. fermentum is relatively constant across increasing numbers of dilutions.

Figure 3 shows data from Karstens et al. (2019) along with summaries of our analysis. Our estimate of the relative abundance of taxa in samples in the held-out fold, $\mathbf{p}_{2\cdot}$, improves on the performance of plug-in estimators (Figure 3, left panels) by taking into account two forms of structure in the Karstens et al. (2019) data. First, in each successive three-fold dilution, we observe approximately three times more contamination relative to the number of non-contaminant reads (Figure 3, top right). In addition, our model accounts for the degree of under- (or over-) detection of target taxa relative to L. fermentum. We observe that taxon detection is reasonably constant across dilutions (Figure 3, bottom right). However, we do observe greater variability in taxon detections at higher dilutions, most likely because we observe comparatively few reads ($\sum_{j=1}^{248}W_{\{i:d_{i}=1\}j}\approx 227{,}000$ while $\sum_{j=1}^{248}W_{\{i:d_{i}=8\}j}\approx 8{,}000$).

In terms of root mean squared error (RMSE), $\sqrt{\frac{1}{J}\sum_{j=1}^{J}(\hat{p}_{j}-p_{j})^{2}}$, our cross-validated estimates (0.0037, 0.0073, and 0.0033) substantially outperform the "plug-in" estimates given by sample read proportions in any of these dilutions (median 0.017; range 0.013–0.022). This is not an artifact of incorporating information from 3 samples in each cross-validation fold, as pooling reads across all samples yields an estimator with RMSE 0.014.
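For concreteness, the RMSE criterion above can be computed as follows (a trivial sketch; the two-taxon estimate and truth vectors are hypothetical).

```python
import numpy as np

def rmse(p_hat, p):
    """Root mean squared error between an estimated and a true composition."""
    p_hat, p = np.asarray(p_hat, dtype=float), np.asarray(p, dtype=float)
    return np.sqrt(np.mean((p_hat - p) ** 2))

rmse([0.50, 0.50], [0.25, 0.75])  # → 0.25
```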

Fitting a model to this relatively small dataset and evaluating its performance using cross-validation prohibits the reasonable construction of confidence intervals. Therefore, to evaluate the performance of our proposed approach to generating confidence intervals, we also fit a model which treats all samples as originating from a single specimen of unknown composition. $\boldsymbol{\beta}$ is not identifiable in this setting, so we set it equal to zero and do not estimate it. We set $\tilde{K}=1$ and $\tilde{\mathbf{Z}}_{i}=\exp(\gamma_{i})\times 3^{d_{i}}$ as before (the need for $\tilde{\alpha}$ is alleviated). Strikingly, under this model, marginal 95% confidence intervals for elements of $\mathbf{p}$ included zero for 238 out of 240 off-target taxa (empirical coverage of 99.2% when the true $p_{kj}=0$). No interval estimates for target taxa included zero. This suggests that applying our proposed approach to data from a dilution series experiment can aid in evaluating whether taxa detected by next-generation sequencing are actually present in a given specimen.

7 Simulations

7.1 Sample size and predictive performance

To evaluate the predictive performance of our model on microbiome datasets of varying size, we use data from 65 unique specimens consisting of synthetic communities of 1, 2, 3, 4, or 7 species combined in equal abundances. Brooks et al. (2015) performed 16S amplicon sequencing on 80 samples of these 65 specimens across two plates of 40 samples each. Very few reads in this dataset were ascribed to taxa outside the 7 present by design, so we limit our analysis to these taxa. To explore how prediction error varies with the number of samples of known composition, we fit models treating randomly selected subsets of $n_{\text{known}}\in\{3,5,10,20\}$ samples per plate as known. In each model, we included one source of unknown contamination for each plate and estimated a detection vector $\boldsymbol{\beta}\in\mathbb{R}^{J}$. For each $n_{\text{known}}$, we drew 100 independent sets of samples to be treated as known, requiring that each set satisfy an identifiability condition on $\boldsymbol{\beta}$ (see SI Section 9.4). On each set, we fit both the unweighted and reweighted models, treating $n_{\text{known}}$ samples as having known composition and $80-n_{\text{known}}$ samples as arising from unique specimens of unknown composition.

We observe similar RMSE for the reweighted and unweighted estimators. For $n_{\text{known}}$ equal to 3, 5, 10, and 20, we observe RMSE of 0.041, 0.037, 0.035, and 0.032 (to three decimal places) for both estimators. By comparison, the RMSE for the plug-in estimator $W_{ij}/\sum_{j=1}^{J}W_{ij}$ is 0.173. Notably, RMSE decreases but does not approach zero as larger numbers of samples are treated as known, which reflects that we estimated each relative abundance profile on the basis of a single sample.

With respect to correctly estimating $p_{kj}$ when $p_{kj}=0$, we again see very similar performance of the reweighted and unweighted Poisson estimators. Out of 100 sets, unweighted estimation yields $\hat{p}_{kj}=0$ for 53%, 55%, 59%, and 64% of $\{k,j\}$ pairs for which $p_{kj}=0$ ($n_{\text{known}}$ equal to 3, 5, 10, and 20, respectively). The corresponding figures for the reweighted estimator are 53%, 55%, 58%, and 63%. The plug-in estimator sets 51% of these relative abundances equal to zero. While the proportion of theoretically zero relative abundances estimated to be zero increases with the number of samples treated as known regardless of estimator, in general we do not expect this proportion to approach 1 as the number of known samples increases. We also note that our model is not designed to produce prediction intervals, and confidence intervals for a parameter estimated from a single observation are unlikely to have reasonable coverage. Finally, we acknowledge that despite the excellent performance of our model on the data of Brooks et al. (2015), our model does not fully capture the form of sample cross-contamination known as index-hopping (Hornung et al., 2019), which likely affects this data.

7.2 Type 1 error rate and power

To investigate the Type 1 error rate and power of tests based on reweighted and unweighted estimators, we simulate data arising from a set of hypothetical dilution series. In each simulated dataset, we observe reads from dilution series of four specimens: two specimens of known composition and two specimens of unknown composition (specimens A and B). Each dilution series consists of four samples: an undiluted sample from a specimen as well as a 9-, 81-, and 729-fold dilution of the specimen. We vary the number of taxa $J\in\{5,20\}$, as well as the magnitude of elements of $\boldsymbol{\beta}$, the number of samples, and the distribution of $W_{ij}\mid\mu_{ij}$. The identifiability results of SI Section 9.3 apply here.

We consider three different values of $\boldsymbol{\beta}\in\mathbb{R}^{J}$: $\boldsymbol{\beta}=0\cdot\boldsymbol{\beta}^{\star}$, $\boldsymbol{\beta}=\frac{1}{10}\boldsymbol{\beta}^{\star}$, and $\boldsymbol{\beta}=\boldsymbol{\beta}^{\star}$, where $\boldsymbol{\beta}^{\star}=(3,-1,1,-3,0)$ when $J=5$ and $\boldsymbol{\beta}^{\star}=(3,-1,1,-3,3,-1,\dots,-3,0)$ when $J=20$. We base the magnitude of entries of $\boldsymbol{\beta}^{\star}$ on the observed magnitude of entries of $\hat{\boldsymbol{\beta}}$ in our analysis of the Costea et al. (2017) data. We vary the number of samples between either a single dilution series from each specimen or three dilution series from each specimen (for a total of nine samples per specimen). We draw $W_{ij}\mid\mu_{ij}$ from either a Poisson($\mu_{ij}$) distribution or a Negative Binomial distribution with mean parameter $\mu_{ij}$ and size parameter $s=13$, a value chosen to approximate the Karstens et al. (2019) data via a linear regression of fitted mean-centered squared residuals. In all settings, we simulate $\{\gamma_{i}\}_{i=1}^{n}$ from a log-normal distribution with parameters $\mu=\min(13.5-1.5d_{i},12)$ and $\sigma^{2}=0.05$. These values were chosen based on observed trends in reads from target taxa in the Karstens et al. (2019) data. In all settings, the first specimen has true relative abundance proportional to $(1,2^{\frac{4}{J-1}},2^{2\cdot\frac{4}{J-1}},\dots,2^{4})$; that is, taxon $J$ is 16 times more abundant than taxon 1. The second specimen has true relative abundance proportional to $(2^{4},\dots,2^{\frac{4}{J-1}},1)$. When $J=5$, the first two taxa are absent from specimen A, and when $J=20$, the first eight taxa are absent from specimen A. Relative abundances of the remaining taxa form an increasing power series such that the first taxon present in nonzero abundance has relative abundance that is $1/100$-th of the relative abundance of taxon $J$. The relative abundance profile of specimen B is given by the relative abundance vector for specimen A in reverse order. We also simulate the degree of contamination as scaling with dilution: when comparing samples with the same read depth, on average a 9-fold diluted sample will contain 9 times more contaminant reads than an undiluted sample (see Section 5.2 and Figure 3, top right). We simulate contamination from a source containing equal relative abundances of all taxa. We set $\tilde{\mathbf{Z}}\in\mathbb{R}^{n}$ such that $\tilde{Z}_{i}=9^{d_{i}}$, where $d_{i}$ is the dilution level of sample $i$, and $\tilde{\gamma}=-3.7$, as we observe in the Karstens et al. (2019) data.
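The simulation design above can be sketched as follows. This is our reading of the setup, not the authors' code: in particular, we draw $\gamma_i$ as normal on the log scale so that $\exp(\gamma_i)$ is log-normal, and we omit detection effects (the $\boldsymbol{\beta}=0$ case). The function name and the `"negbin"` flag are our own.

```python
import numpy as np

rng = np.random.default_rng(1)

J, s = 5, 13                                  # number of taxa; negative binomial size
p_specimen = 2.0 ** (4.0 * np.arange(J) / (J - 1))
p_specimen /= p_specimen.sum()                # taxon J is 16x as abundant as taxon 1
p_contam = np.full(J, 1.0 / J)                # contamination source: equal abundances
gamma_tilde = -3.7                            # log contamination intensity

def simulate_sample(d, dist="poisson"):
    """Simulate reads for one sample at dilution level d (0 = undiluted)."""
    gamma = rng.normal(min(13.5 - 1.5 * d, 12.0), np.sqrt(0.05))
    # contaminant reads scale as 9**d relative to specimen reads
    mu = np.exp(gamma) * (p_specimen + np.exp(gamma_tilde) * 9.0 ** d * p_contam)
    if dist == "poisson":
        return rng.poisson(mu)
    # negative binomial with mean mu and size s
    return rng.negative_binomial(s, s / (s + mu))
```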

Figure 4: At left, the Type 1 error (top row) and power of our proposed likelihood ratio tests for both the unweighted and reweighted estimators. Performance of tests of $H_{0}:\boldsymbol{\beta}=0$ against a general alternative is summarized in terms of empirical rejection rates at level $\alpha=0.05$ (y-axis), with sample size plotted on the x-axis. Columns give the conditional distribution of the data (Poisson or Negative Binomial) and the number of taxa $J$. Rows specify whether data was simulated under the null $\boldsymbol{\beta}=0$, under a "weak" alternative with $\boldsymbol{\beta}=\frac{1}{10}\boldsymbol{\beta}^{\star}$ (i.e., $\boldsymbol{\beta}\neq 0$ of small magnitude), or under a "strong" alternative $\boldsymbol{\beta}=\boldsymbol{\beta}^{\star}$ (i.e., $\boldsymbol{\beta}\neq 0$ of larger magnitude). At right, the empirical rejection rate for marginal bootstrap tests of $H_{0}:p_{kj}=0$ at the 0.05 level versus the true value of $p_{kj}$ (x-axis). Columns and rows are as specified above.

Figure 4 summarizes our results from 250 simulations under each condition. Empirical performance of bootstrapped likelihood ratio tests of $H_{0}:\boldsymbol{\beta}=0$ against $H_{A}:\boldsymbol{\beta}\neq 0$ at level 0.05 (lefthand pane of Figure 4) reveals good performance, even at small sample sizes. Unsurprisingly, we observe improved Type 1 error control and power at larger sample sizes. Tests based on the reweighted estimator generally exhibited improved Type 1 error control compared to the unweighted estimator, with the greatest improvements observed for data simulated from a negative binomial distribution (mean Type 1 error across simulations was 0.15 for the unweighted estimator and 0.04 for the reweighted estimator). Surprisingly, Type 1 error control appears to improve when the number of taxa $J$ is larger (mean Type 1 error was 0.11 for $J=5$ and 0.05 for $J=20$). This may in part be a result of simulating $W_{ij}$ as conditionally independent given $\gamma_{i}$, covariates, and other parameters. Power to reject the null hypothesis is very high when the data generating process is Poisson, as well as for strong alternatives ($\boldsymbol{\beta}=\boldsymbol{\beta}^{\star}$) when the data follows a negative binomial distribution.

Empirical Type 1 error control for bootstrapped marginal tests of $H_{0}:p_{kj}=0$ against $H_{A}:p_{kj}>0$ at the 0.05 level (righthand pane of Figure 4) is generally no larger than nominal, with a median empirical Type 1 error of 0.02 across all conditions. We observe above-nominal Type 1 error in limited cases, primarily when the sample size is small and the data is Poisson-distributed (the largest observed Type 1 error rate, of 0.10, occurs for tests based on reweighted estimators at sample size $n=16$ and number of taxa $J=5$ when the data is Poisson-distributed). Power to reject the null is close to 1 for all non-zero values of $p_{kj}$ when the data is conditionally Poisson-distributed (median empirical power across conditions of 1; minimum of 0.88). When data is simulated as negative binomial, power unsurprisingly appears to increase with the magnitude of $p_{kj}$ for tests based on unweighted and reweighted estimators, with somewhat higher power for tests using reweighted estimators. The magnitude of $\boldsymbol{\beta}$ (rows of Figure 4) does not appear to affect the Type 1 error or power of our tests of $H_{0}:p_{kj}=0$. We also note that the empirical coverage of marginal bootstrapped confidence intervals for $p_{kj}$ is generally lower than nominal in our simulations (SI Figure 2). This aligns with the general performance of bootstrap percentile confidence intervals at small sample sizes, and we expect that generating confidence intervals by inverting bootstrapped likelihood ratio tests would yield better coverage in this case (unfortunately, this is not feasible for computational reasons).

8 Discussion

In this paper, we introduce a statistical method to model measurement error due to contamination and differential detection of taxa in microbiome experiments. Our method builds on previous work in several ways. By directly modeling the output of microbiome experiments, we do not rely on data transformations that discard information regarding measurement precision, such as ratio- or proportion-based transformations. This affords our method the key advantage of estimating relative abundances lying on the boundary of the simplex, which is typically precluded by transformation-based approaches. Accordingly, we implement inference tools appropriate to the non-standard parameter space that we consider. The advantage of estimating relative abundances on the boundary of the simplex is not purely theoretical, and we show that our interval estimates do indeed include boundary values, and demonstrate above-nominal empirical coverage in an analysis of data from Karstens et al. (2019). Furthermore, our reweighting estimator allows for flexible mean-variance relationships without the need to specify a parametric model. Our approach to parameter estimation does not assume that observations are counts, and therefore our method can be applied to a wide array of microbiome data types, including proportions, coverages and cell concentrations as well as counts. Finally, our method can accommodate complex experimental designs, including analysis of mixtures of samples, technical replicates, dilution series, detection effects that vary by experimental protocol or specimen type, and contamination impacting multiple samples. Contamination is commonly addressed via “pre-processing” data, thereby conditioning on the decontamination step. In contrast, by simultaneously estimating contamination along with all other model parameters, our approach captures holistic uncertainty in estimation.

Another advantage of our methodology is that we do not require the true composition of any specimens to be known. For example, our test of equal detection effects across protocols in Costea et al. (2017) can be performed without knowledge of specimen composition. Accordingly, our approach provides a framework for comparing experimental protocols in the absence of synthetic communities, which can be challenging to construct.

In addition, we expect that our method may have substantial utility applied to dilution series experiments, as illustrated in our analysis of data from Karstens et al. (2019). Dilution series are relatively low-cost and scalable (especially in comparison to synthetic communities) and may be especially advantageous when the impact of sample contamination on relative abundance estimates is of particular concern. With this said, we strongly recommend against testing the composite null $H_{0}:p_{kj}>0$ against the point alternative $H_{A}:p_{kj}=0$, as these hypotheses are statistically indistinguishable on the basis of any finite sample of reads. However, we are able to determine the degree to which our observations are consistent with the absence of any given taxon, and therefore we can meaningfully test $H_{0}:p_{kj}=0$ against a general alternative.

The focus of our paper was on $\mathbf{p}$ and $\boldsymbol{\beta}$ as targets of inference. Future research could investigate extensions to our model that connect relative abundances $\mathbf{p}$ to covariates of interest, allowing, for example, the comparison of average relative abundances across groups defined by covariates. Our proposed bootstrap procedures may also aid the propagation of uncertainty to group-level comparisons of relative abundances in downstream analyses. In addition, while we focus applications of our model on microbiome data, our model could be applied to a broad variety of data structures obtained from high-throughput sequencing, such as single-cell RNA-seq. We leave these applications to future work.

9 SI: Identifiability

As discussed in the main text, our model is flexible enough to encompass a wide variety of experimental designs and targets of estimation, but as a result, the parameters in model (5) are unidentifiable without additional constraints. In this section, we demonstrate identifiability for the three use cases of our model that we considered as illustrative examples. For reference, we provide here the definition of identifiability:

Definition 0.1.

A mean function $\mu:\mathcal{D}\subset\mathbb{R}^{q}\rightarrow\mathbb{R}^{N}$ is said to be identifiable if $\theta\neq\theta^{\prime}\Rightarrow\mu(\theta)\neq\mu(\theta^{\prime})$ for all $\theta,\theta^{\prime}\in\mathcal{D}$.

9.1 Preliminaries

We first consider situations where contamination is not modelled (e.g., Section 6.1), and as such, $\tilde{\mathbf{Z}}=0$. In this case, the following lemma provides a sufficient condition under which mean model (5) may be reparametrized as a log-linear model. Since a log-linear model is identifiable if and only if its design matrix has full column rank, this result provides a basis on which to establish identifiability for a large subclass of models.

Unless otherwise noted, we take $\beta_{\cdot J}=\mathbf{0}_{p}$ throughout, without loss of generality.

Lemma 1.

Let $\mathbb{E}[\mathbf{W}\mid\boldsymbol{\beta},\mathbf{p},\boldsymbol{\gamma}]=:\boldsymbol{\mu}(\boldsymbol{\beta},\mathbf{p},\boldsymbol{\gamma})$ denote a mean function for an $n\times J$ matrix $\mathbf{W}$ such that the $(i,j)$-th element of $\boldsymbol{\mu}(\boldsymbol{\beta},\mathbf{p},\boldsymbol{\gamma})$ is given by $\mu_{ij}(\boldsymbol{\theta})=\exp(\gamma_{i})Z_{i}p^{j}\exp(\mathbf{1}_{\{j<J\}}X_{i}\beta^{j})$, for $\gamma_{i}$ the $i$-th element of $\boldsymbol{\gamma}\in\mathbb{R}^{n}$, $p^{j}$ the $j$-th column of $\mathbf{p}$ (for $\mathbf{p}$ a $K\times J$ matrix with rows in the $(J-1)$-simplex), $\beta^{j}$ the $j$-th column of the $p\times(J-1)$ matrix $\boldsymbol{\beta}$, and $Z_{i}$ and $X_{i}$ the $i$-th rows of $\mathbf{Z}\in\mathbb{R}_{\geq 0}^{n\times K}$ and $\mathbf{X}\in\mathbb{R}^{n\times p}$, respectively. If each row of $\mathbf{p}$ lies in the interior of $\mathbb{S}^{J-1}$ and each row of $\mathbf{Z}$ contains a single nonzero entry $z_{i}$, equal to 1, then there exists a matrix $M\in\mathbb{R}^{nJ\times q}$ for $q=(p+K)(J-1)+n$ and an isomorphism $f:\mathbb{R}^{p\times(J-1)}\times\mathbf{S}^{K\times(J-1)}\times\mathbb{R}^{n}\rightarrow\mathbb{R}^{q}$ such that for all $(\boldsymbol{\beta},\mathbf{p},\boldsymbol{\gamma})$, $\mu_{ij}(\boldsymbol{\beta},\mathbf{p},\boldsymbol{\gamma})=\exp\big(M_{[ij]}f(\boldsymbol{\beta},\mathbf{p},\boldsymbol{\gamma})\big)$ for $M_{[ij]}$ the $((i-1)J+j)$-th row of $M$.

Proof.

It suffices to show that the result holds for an arbitrary element of $\boldsymbol{\mu}(\boldsymbol{\beta},\mathbf{p},\boldsymbol{\gamma})$, say $\mu_{ij}$. Let $l$ be the unique index such that $z_{il}:=1$. We have

\begin{align*}
\mu_{ij} &= \exp(\gamma_{i})Z_{i}p^{j}\exp(\mathbf{1}_{\{j<J\}}X_{i}\beta^{j}) \\
&= z_{il}p_{lj}\exp\big(\gamma_{i}+\mathbf{1}_{\{j<J\}}X_{i}\beta^{j}\big) \\
&= \exp\Big(\gamma_{i}+\log p_{lJ}+\mathbf{1}_{\{j<J\}}X_{i}\beta^{j}+\log\tfrac{p_{lj}}{p_{lJ}}\Big) \\
&=: \exp\big(\delta_{i}+\mathbf{1}_{\{j<J\}}X_{i}\beta^{j}+\mathbf{1}_{\{j<J\}}\rho_{lj}\big), \quad \text{for } \delta_{i}:=\gamma_{i}+\log p_{lJ} \text{ and } \rho_{lj}:=\log\tfrac{p_{lj}}{p_{lJ}}.
\end{align*}

We let $f(\boldsymbol{\beta},\mathbf{p},\boldsymbol{\gamma})=[\delta_{1},\dots,\delta_{n},\rho_{11},\dots,\rho_{1(J-1)},\dots,\rho_{K1},\dots,\rho_{K(J-1)},{\beta^{1}}^{T},\dots,{\beta^{J-1}}^{T}]^{T}$, which is trivially an isomorphism. $M_{[ij]}\in\mathbb{R}^{1\times q}$ is then given by

$$M_{[ij]}=[e_{i}^{T}:\mathbf{1}_{\{j<J\}}e_{l}^{T}\otimes e_{j}^{T}:\mathbf{1}_{\{j<J\}}e_{j}^{T}\otimes X_{i}] \tag{12}$$

where $\otimes$ is the Kronecker product, $e_{i}\in\mathbb{R}^{n}$ has $i$-th element equal to 1 and all others 0, $e_{l}\in\mathbb{R}^{K}$ has $l$-th element equal to 1 and all others 0, and $e_{j}\in\mathbb{R}^{J-1}$ has $j$-th element equal to 1 and all others zero (when $j=J$, $e_{j}$ contains only zeroes). Matrix multiplication then gives $\mu_{ij}(\boldsymbol{\beta},\mathbf{p},\boldsymbol{\gamma})=\exp\big(M_{[ij]}f(\boldsymbol{\beta},\mathbf{p},\boldsymbol{\gamma})\big)$. ∎
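As a sanity check on Lemma 1 (our toy numerical verification, not part of the paper), the following builds $f(\boldsymbol{\beta},\mathbf{p},\boldsymbol{\gamma})$ and the rows $M_{[ij]}$ from (12) for random parameters at small illustrative dimensions, and confirms $\mu_{ij}=\exp(M_{[ij]}f)$ elementwise.

```python
import numpy as np

rng = np.random.default_rng(0)

n, J, K, p_cov = 5, 4, 2, 3                   # illustrative dimensions

gamma = rng.normal(size=n)
P = rng.dirichlet(np.ones(J), size=K)         # K x J; rows interior of the simplex
beta = rng.normal(size=(p_cov, J - 1))
X = rng.normal(size=(n, p_cov))
Z = np.zeros((n, K))
Z[np.arange(n), rng.integers(K, size=n)] = 1.0  # one nonzero entry per row, equal to 1
l_of = Z.argmax(axis=1)                       # l(i): the specimen for sample i

def e(idx, dim):
    v = np.zeros(dim)
    v[idx] = 1.0
    return v

# f = (delta_1..delta_n, rho_11..rho_K(J-1), beta^1, ..., beta^(J-1))
delta = gamma + np.log(P[l_of, J - 1])
rho = np.log(P[:, :J - 1] / P[:, [J - 1]])
f = np.concatenate([delta, rho.ravel(), beta.T.ravel()])

def M_row(i, j):
    # 0-based taxon index j; j = J-1 is the reference taxon J (indicator zero)
    ej = e(j, J - 1) if j < J - 1 else np.zeros(J - 1)
    return np.concatenate([e(i, n), np.kron(e(l_of[i], K), ej), np.kron(ej, X[i])])

for i in range(n):
    for j in range(J):
        direct = np.exp(gamma[i]) * (Z[i] @ P[:, j])
        if j < J - 1:
            direct *= np.exp(X[i] @ beta[:, j])
        assert np.isclose(direct, np.exp(M_row(i, j) @ f))
```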

Theorem 1.

Under the conditions of Lemma 1, the mean function $\mu$ is identifiable if and only if $M$ has full column rank.

Proof.

We proceed by contradiction for both directions. ($\Leftarrow$) Suppose $\mu$ satisfies the criteria given in Lemma 1 and is not identifiable. For $\theta=(\boldsymbol{\beta},\mathbf{p},\boldsymbol{\gamma})$, this implies that $\mu(\theta)=\mu(\theta^{\prime})$ for some $\theta,\theta^{\prime}$ such that $\theta\neq\theta^{\prime}$. This in turn implies that $Mf(\theta)=Mf(\theta^{\prime})\Rightarrow M\big[f(\theta)-f(\theta^{\prime})\big]=\mathbf{0}\Rightarrow f(\theta)-f(\theta^{\prime})=\mathbf{0}$, since $M$ has full column rank. Since $f$ is an isomorphism, we have $\theta=\theta^{\prime}$, and therefore a contradiction. Hence $\mu$ must be identifiable if $M$ has full column rank. ($\Rightarrow$) Now suppose $M$ is not of full column rank. Then $Ma=\mathbf{0}$ for some $a\neq\mathbf{0}\in\mathbb{R}^{q}$. Letting $\eta$ be an arbitrary vector in $\mathbb{R}^{q}$, we have $M\eta=M(\eta+a)$ and hence $\mu(f^{-1}(\eta))=\mu(f^{-1}(\eta+a))$. Since $f$ is an isomorphism, we must have $f^{-1}(\eta)\neq f^{-1}(\eta+a)$, and hence the mean function is not identifiable. ∎

The above results apply when all rows of $\mathbf{p}$ are unknown. However, when some rows of $\mathbf{p}$ are known, we may establish identifiability of the unknown parameters under the conditions of Lemma 1 by considering the rank of the matrix $M^{\prime}$ obtained from $M$ by excluding the columns corresponding to any element of $\rho_{k}$ for $k$ such that $p_{k}$ is known.

9.2 Identifiability: Comparing detection effects across experiments

In Section 6.1, we analyzed data consisting of measurements taken via three sequencing protocols on human fecal samples containing a spiked-in artificial community in addition to measurements taken via flow cytometry on the spiked-in community (Costea et al., 2017). We constrained our analysis to taxa in the spiked-in community, which by design are present in every sample. In this setting, we may apply Theorem 1.

We first show identifiability of the mean function defined for models fit without cross-validation, and then address models fit via cross-validation. We assume familiarity with the experimental setup and notation introduced in Section 6.1. We adopt the identifiability constraint $\beta^{J}=0$.

9.2.1 Model without Cross-Validation

In the model fit without cross-validation, all samples originate from a single source. The mean for sample $i$ is given by

$$\boldsymbol{\mu}_{i}=\exp(\gamma_{i})[\mathbf{p}\circ\exp(X_{i}\boldsymbol{\beta})] \tag{13}$$

where $\gamma_{i}$ parametrizes sample intensity, $\boldsymbol{\beta}\in\mathbb{R}^{p\times J}$ is a matrix of detection effects (here, $p=3$), $\mathbf{p}$ is the unknown relative abundance profile for the sample from which our observations were taken, and

$$X_{i}=\begin{cases}\mathbf{0}&\text{if sample }i\text{ was measured by flow cytometry,}\\ [1~\mathbf{1}_{Q}~\mathbf{1}_{W}]&\text{otherwise}\end{cases}$$

for $\mathbf{1}_{Q}$ and $\mathbf{1}_{W}$ indicators of sample $i$ being processed according to protocol $Q$ or $W$, respectively. Thus, from (12), the form of $M$ is

$M_{[ij]} = \begin{cases} [e_{i}^{T} : e_{j}^{T} : \mathbf{0}^{T}] & \text{if sample } i \text{ was measured by flow cytometry,} \\ [e_{i}^{T} : e_{j}^{T} : \mathbf{1}_{\{j<J\}}\, e_{j}^{T} \otimes [1 ~ \mathbf{1}_{Q} ~ \mathbf{1}_{W}]] & \text{otherwise.} \end{cases}$

Hence the submatrix of $M$ consisting of rows corresponding to taxa $1$ through $J$ in sample $i$ is given by $M_i = [1_J e_i^T : I_{-J} : I_{-J} \otimes X_i]$, where $1_J$ is a column $J$-vector of ones and $I_{-J}$ is the $J \times J$ identity matrix with its $J$-th column removed. Since there are 4 unique rows of $X$, each appearing for at least one sample, the rows of the leftmost $(p+1)(J-1) = 4(J-1)$ columns of $M$ are spanned by the rows of

$\begin{bmatrix} I_{-J} & : & I_{-J} \otimes [0~0~0] \\ I_{-J} & : & I_{-J} \otimes [1~0~0] \\ I_{-J} & : & I_{-J} \otimes [1~1~0] \\ I_{-J} & : & I_{-J} \otimes [1~0~1] \end{bmatrix}.$

Since this matrix has full column rank, the matrix consisting of the leftmost $(p+1)(J-1)$ columns of $M$ must have full column rank. It remains to show that the rightmost $n$ columns of $M$ are linearly independent of the leftmost $(p+1)(J-1)$ columns and themselves compose a matrix with full column rank. The latter is immediate from inspection, and the former follows from the fact that the leftmost $(p+1)(J-1)$ columns of $M$ contain rows of zeroes (corresponding to observations on taxon $J$), whereas the rightmost $n$ columns contain no zero rows and hence must be linearly independent of the remaining columns. Theorem 1 therefore applies, and we have established identifiability of this model.
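
As a sanity check, the full-column-rank claim for this stacked matrix can be verified numerically for a small $J$; below is a minimal sketch in Python/NumPy (the value $J = 4$ is an arbitrary illustration, not taken from the data analysis):

```python
import numpy as np

J = 4  # number of taxa; small arbitrary example
I_minus_J = np.eye(J)[:, :J - 1]  # J x J identity with J-th column removed
# The 4 unique rows of the design X: flow cytometry, and protocols H, Q, W
X_rows = [np.array([0, 0, 0]), np.array([1, 0, 0]),
          np.array([1, 1, 0]), np.array([1, 0, 1])]

# Stack blocks [I_{-J} : I_{-J} kron x] for each unique design row x
blocks = [np.hstack([I_minus_J, np.kron(I_minus_J, x.reshape(1, -1))])
          for x in X_rows]
M_left = np.vstack(blocks)

# Full column rank iff rank equals the number of columns, (p+1)(J-1) = 4(J-1)
assert np.linalg.matrix_rank(M_left) == 4 * (J - 1)
```

Differencing the protocol blocks against the flow-cytometry block recovers each Kronecker column group separately, which is why the rank check succeeds for any $J \geq 2$.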

9.2.2 Cross-Validated Model

The cross-validated version of the above model differs only in the addition of an extra row of $\mathbf{p}$ corresponding to the composition of the specimen from which samples in the held-out fold are taken. This yields a model we can express as a log-linear model with nearly the same design matrix as above. Namely, we have $M$ such that the submatrix consisting of rows corresponding to sample $i$ is given by

$M_i = [1_J e_i^T : \mathbf{1}_{\{i \text{ held-in}\}} I_{-J} : \mathbf{1}_{\{i \text{ held-out}\}} I_{-J} : I_{-J} \otimes X_i]$

where $\mathbf{1}_{\{i \text{ held-in}\}}$ is an indicator that sample $i$ is in the held-in fold, and analogously for $\mathbf{1}_{\{i \text{ held-out}\}}$. Excluding the columns corresponding to $\boldsymbol{\gamma}$, there are now 8 rather than 4 unique rows of $M$:

$\begin{bmatrix} I_{-J} & : & \mathbf{0} & : & I_{-J} \otimes [0~0~0] \\ I_{-J} & : & \mathbf{0} & : & I_{-J} \otimes [1~0~0] \\ I_{-J} & : & \mathbf{0} & : & I_{-J} \otimes [1~1~0] \\ I_{-J} & : & \mathbf{0} & : & I_{-J} \otimes [1~0~1] \\ \mathbf{0} & : & I_{-J} & : & I_{-J} \otimes [0~0~0] \\ \mathbf{0} & : & I_{-J} & : & I_{-J} \otimes [1~0~0] \\ \mathbf{0} & : & I_{-J} & : & I_{-J} \otimes [1~1~0] \\ \mathbf{0} & : & I_{-J} & : & I_{-J} \otimes [1~0~1] \end{bmatrix}$

This is again a matrix with full column rank, and thus the columns of $M$ corresponding to $\boldsymbol{\rho}$ and $\boldsymbol{\beta}$ are linearly independent. Consequently, Theorem 1 applies, and the mean model is identifiable.

9.3 Identifiability: Estimating contamination via dilution series

In Section 6.2, we analyze a serial dilution of a specimen of known composition (Karstens et al., 2019). In this application, we estimated a contamination profile, and therefore the conditions of Theorem 1 do not apply. In the following sections we establish identifiability of the model directly. We consider the two cases discussed in the manuscript (fitting with and without cross-validation) separately.

9.3.1 Two sources; estimated via cross-validation

We first consider identifiability of parameters in the model fit using cross-validation in Section 14.2. For sample $i$ in the training set, the mean model is

$\boldsymbol{\mu}_{i} = \exp(\gamma_{i})[\mathbf{p}^{0} \circ \exp(\boldsymbol{\beta}) + \exp(\tilde{\gamma})3^{d_{i}}\tilde{\mathbf{p}}]$

where $\mathbf{p}^{0} = \left(\mathbf{0}^{T}_{240} ~ \frac{1}{8}\mathbf{1}_{8}^{T}\right)$ is the (known) true relative abundance vector of the synthetic community, $\exp(\tilde{\gamma})$ gives the (unknown) intensity of contamination in an undiluted sample from the reference, $\tilde{\mathbf{p}}$ is the (unknown) contaminant relative abundance vector, and $d_i$ represents the (known) number of three-fold dilutions sample $i$ has undergone. For sample $i$ in the held-out/test set, the mean model is

$\boldsymbol{\mu}_{i} = \exp(\gamma_{i})[\mathbf{p} \circ \exp(\boldsymbol{\beta}) + \exp(\tilde{\gamma} + \alpha)3^{d_{i}}\tilde{\mathbf{p}}]$   (14)

where $\mathbf{p}$ is the unknown relative abundance vector, and $\exp(\tilde{\gamma} + \alpha)$ is the (unknown) intensity of contamination in an undiluted test set sample.

The inclusion of a contaminant profile requires additional identifiability constraints. For example, as noted in the main text, we fix $\beta_j = 0$ for all $j$ such that $p^0_j = 0$. In addition, we require that there is a taxon, call it $j^\star$, that is present in the synthetic community ($p^0_{j^\star} > 0$), present in the test set sample ($p_{j^\star} > 0$), and absent from the contaminant profile ($\tilde{p}_{j^\star} = 0$). Finally, we require that the "reference taxon" for interpreting $\exp(\boldsymbol{\beta})$ is present in the synthetic community. We may choose any reference taxon other than $j^\star$. As usual, we take $\beta_{\cdot J} = \mathbf{0}_p$.

We proceed via contrapositive, beginning with the training fold. Suppose

$\exp(\gamma_{i})[\mathbf{p}^{0} \circ \exp(\boldsymbol{\beta}) + \exp(\tilde{\gamma})3^{d_{i}}\tilde{\mathbf{p}}] = \exp(\gamma'_{i})[\mathbf{p}^{0} \circ \exp(\boldsymbol{\beta}') + \exp(\tilde{\gamma}')3^{d_{i}}\tilde{\mathbf{p}}'].$   (15)

Equation (15) for taxon $j^\star$ gives $\exp(\gamma_{i})\exp(\beta_{j^\star}) = \exp(\gamma'_{i})\exp(\beta'_{j^\star})$, and thus

$\frac{\mathbf{p}^{0} \circ \exp(\boldsymbol{\beta}) + \exp(\tilde{\gamma})3^{d_{i}}\tilde{\mathbf{p}}}{\exp(\beta_{j^\star})} = \frac{\mathbf{p}^{0} \circ \exp(\boldsymbol{\beta}') + \exp(\tilde{\gamma}')3^{d_{i}}\tilde{\mathbf{p}}'}{\exp(\beta'_{j^\star})}.$   (16)

Now, consider distinct samples $i, i'$ in the training fold such that $d_{i'} \neq d_i$. Subtracting (16) for sample $i$, rescaled by $3^{d_{i'} - d_i}$, from (16) for sample $i'$ cancels the $\tilde{\mathbf{p}}$ terms and yields

$(1 - 3^{d_{i'} - d_{i}})\frac{\mathbf{p}^{0} \circ \exp(\boldsymbol{\beta})}{\exp(\beta_{j^\star})} = (1 - 3^{d_{i'} - d_{i}})\frac{\mathbf{p}^{0} \circ \exp(\boldsymbol{\beta}')}{\exp(\beta'_{j^\star})}$   (17)

and thus $\beta_{j} - \beta_{j^\star} = \beta'_{j} - \beta'_{j^\star}$ for all $j$ such that $p^0_j > 0$. Combining this with $\beta_J = \beta'_J = 0$, noting that $p^0_J > 0$, and recalling that all other elements of $\boldsymbol{\beta}$ are constrained to equal zero, we have that $\boldsymbol{\beta} = \boldsymbol{\beta}'$. Plugging this into equation (16) gives $\exp(\tilde{\gamma})3^{d_{i}}\tilde{\mathbf{p}} = \exp(\tilde{\gamma}')3^{d_{i}}\tilde{\mathbf{p}}'$, and summing over $j$ and simplifying gives both $\tilde{\gamma} = \tilde{\gamma}'$ and $\tilde{\mathbf{p}} = \tilde{\mathbf{p}}'$. Since all other parameters in (15) are now identified, we have $\gamma_i = \gamma'_i$ for $i$ in the training fold. Thus all parameters are identified for the training fold.
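
The key cancellation in this argument, that rescaled subtraction across dilution levels removes the contamination term, can be checked numerically. A minimal sketch with illustrative (made-up) parameter values rather than values from the real data:

```python
import numpy as np

# Illustrative parameters: J = 5 taxa, taxon 1 absent from the known profile,
# beta_j = 0 where p0_j = 0 and beta_J = 0, contaminant zero at j* (index 1)
p0 = np.array([0.0, 0.25, 0.25, 0.25, 0.25])
beta = np.array([0.0, 0.7, -0.3, 0.4, 0.0])
p_tilde = np.array([0.6, 0.0, 0.1, 0.3, 0.0])
gamma_t = -1.0  # log contamination intensity
js = 1          # 0-based index of j*

def normalized_mean(d):
    # Left-hand side of (16): mean after d three-fold dilutions,
    # divided by exp(beta_{j*})
    return (p0 * np.exp(beta)
            + np.exp(gamma_t) * 3.0**d * p_tilde) / np.exp(beta[js])

d_i, d_ip = 0, 2
diff = normalized_mean(d_ip) - 3.0**(d_ip - d_i) * normalized_mean(d_i)
target = (1 - 3.0**(d_ip - d_i)) * p0 * np.exp(beta) / np.exp(beta[js])
assert np.allclose(diff, target)   # contamination cancels, as in (17)
```

The residual `diff` is exactly the right-hand side of (17), so the detection effects are recoverable up to the reference taxon regardless of the contamination intensity.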

Turning now to the held-out test fold, we again proceed by contrapositive. Since $\tilde{\mathbf{p}}$, $\boldsymbol{\beta}$, and $\tilde{\gamma}$ are identifiable from the training fold, we consider

$\exp(\gamma_{i})[\mathbf{p} \circ \exp(\boldsymbol{\beta}) + \exp(\tilde{\gamma} + \alpha)3^{d_{i}}\tilde{\mathbf{p}}] = \exp(\gamma'_{i})[\mathbf{p}' \circ \exp(\boldsymbol{\beta}) + \exp(\tilde{\gamma} + \alpha')3^{d_{i}}\tilde{\mathbf{p}}].$   (18)

As above, we divide by the mean for taxon $j^\star$ on both sides to obtain

$\frac{\mathbf{p} \circ \exp(\boldsymbol{\beta}) + \exp(\tilde{\gamma} + \alpha)3^{d_{i}}\tilde{\mathbf{p}}}{p_{j^\star}\exp(\beta_{j^\star})} = \frac{\mathbf{p}' \circ \exp(\boldsymbol{\beta}) + \exp(\tilde{\gamma} + \alpha')3^{d_{i}}\tilde{\mathbf{p}}}{p'_{j^\star}\exp(\beta_{j^\star})}.$   (19)

We now subtract (19) for sample $i'$ from (19) for sample $i$ (for $d_i \neq d_{i'}$, with both $i$ and $i'$ in the test fold). The terms not involving $\tilde{\mathbf{p}}$ cancel, and comparing the coefficients of $\tilde{\mathbf{p}}$ and simplifying gives

$\frac{\exp(\alpha)}{p_{j^\star}} = \frac{\exp(\alpha')}{p'_{j^\star}}.$   (20)

Recall also that (18) for taxon $j^\star$ gives $\exp(\gamma_{i})p_{j^\star}\exp(\beta_{j^\star}) = \exp(\gamma'_{i})p'_{j^\star}\exp(\beta_{j^\star})$, and thus

$\frac{p_{j^\star}}{p'_{j^\star}} = \frac{\exp(\gamma'_{i})}{\exp(\gamma_{i})}.$   (21)

Additionally, by the same argument that yielded equation (17),

$(1 - 3^{d_{i'} - d_{i}})\frac{\mathbf{p} \circ \exp(\boldsymbol{\beta})}{p_{j^\star}\exp(\beta_{j^\star})} = (1 - 3^{d_{i'} - d_{i}})\frac{\mathbf{p}' \circ \exp(\boldsymbol{\beta})}{p'_{j^\star}\exp(\beta_{j^\star})},$   (22)

and therefore $\mathbf{p} \propto \mathbf{p}'$. Since both must sum to 1, $\mathbf{p} = \mathbf{p}'$. Combining this with (20) gives $\alpha = \alpha'$, and (21) gives $\gamma_i = \gamma'_i$ for $i$ in the test fold. Hence the parameters in the test fold are also identified.

9.3.2 Single source; no cross-validation

To investigate the empirical coverage of confidence intervals for the elements of $\mathbf{p}$, we also considered a model in which all samples are derived from a single source. No detection efficiencies are estimated in this model, which has mean

$\boldsymbol{\mu}_{i} = \exp(\gamma_{i})(\mathbf{p} + \exp(\tilde{\gamma})3^{d_{i}}\tilde{\mathbf{p}}).$   (23)

We demonstrate identifiability assuming again that for a known $j^\star$, $\tilde{p}_{j^\star} = 0$ and $p_{j^\star} > 0$. Taking the same approach as above, we have that

$\exp(\gamma_{i})(\mathbf{p} + \exp(\tilde{\gamma})3^{d_{i}}\tilde{\mathbf{p}}) = \exp(\gamma'_{i})(\mathbf{p}' + \exp(\tilde{\gamma}')3^{d_{i}}\tilde{\mathbf{p}}').$   (24)

Dividing this expression by the mean for taxon $j^\star$, and subtracting the resulting expression for sample $i'$ from the resulting expression for sample $i$, we have

$(3^{d_{i}} - 3^{d_{i'}})\frac{\exp(\tilde{\gamma})\tilde{\mathbf{p}}}{p_{j^\star}} = (3^{d_{i}} - 3^{d_{i'}})\frac{\exp(\tilde{\gamma}')\tilde{\mathbf{p}}'}{p'_{j^\star}}.$   (25)

Summing over $j$ and rearranging, we obtain $\frac{\exp(\tilde{\gamma})}{\exp(\tilde{\gamma}')} = \frac{p_{j^\star}}{p'_{j^\star}}$. Also, since $\tilde{p}_{j^\star} = 0$, we have $\exp(\gamma_{i})p_{j^\star} = \exp(\gamma'_{i})p'_{j^\star}$, and so $\exp(\gamma'_{i} - \gamma_{i}) = \exp(\tilde{\gamma} - \tilde{\gamma}')$. We substitute this into (24) to obtain

$\frac{\mathbf{p} + \exp(\tilde{\gamma})3^{d_{i}}\tilde{\mathbf{p}}}{\exp(\tilde{\gamma})} = \frac{\mathbf{p}' + \exp(\tilde{\gamma}')3^{d_{i}}\tilde{\mathbf{p}}'}{\exp(\tilde{\gamma}')}.$   (26)

Summing over $j$ gives $\frac{1 + \exp(\tilde{\gamma})3^{d_{i}}}{\exp(\tilde{\gamma})} = \frac{1 + \exp(\tilde{\gamma}')3^{d_{i}}}{\exp(\tilde{\gamma}')}$, and therefore $\tilde{\gamma} = \tilde{\gamma}'$; it follows that $\gamma_i = \gamma'_i$ and $p_{j^\star} = p'_{j^\star}$. Equation (25) then gives $\tilde{\mathbf{p}} = \tilde{\mathbf{p}}'$, and finally $\mathbf{p} = \mathbf{p}'$ follows from (24).

9.4 Identifiability: Estimating detectability, composition and contamination via samples of known composition

In Section 7.1, we used data from Brooks et al. (2015) to study the prediction error of our model as a function of the number of samples of known composition, randomly selecting sets of samples to treat as known. In this model, we have unknown detection efficiencies, an unknown contamination profile, an unknown contamination intensity, unknown sample detection effects, and unknown composition for some samples. We do, however, have known composition for some samples. Specifically, for sample $i$ of known composition, we have

$\boldsymbol{\mu}_{i} = \exp(\gamma_{i})\big[\mathbf{p}_{i}^{0} \circ \exp(\boldsymbol{\beta}) + \exp(\tilde{\gamma})\tilde{\mathbf{p}}\big]$   (27)

for $\{\gamma_i\}$, $\boldsymbol{\beta}$, $\tilde{\gamma}$ and $\tilde{\mathbf{p}}$ unknown and $\mathbf{p}_i^0$ known. For sample $i$ of unknown composition, we have

$\boldsymbol{\mu}_{i} = \exp(\gamma_{i})\big[\mathbf{p}_{i} \circ \exp(\boldsymbol{\beta}) + \exp(\tilde{\gamma})\tilde{\mathbf{p}}\big]$

with $\{\gamma_i\}$ and $\mathbf{p}_i$ unknown. Critically, $\boldsymbol{\beta}$, $\tilde{\gamma}$ and $\tilde{\mathbf{p}}$ are shared across the samples of known and unknown composition.

Consider a graph with $J$ nodes, each representing a taxon, and connect nodes $(j, j')$ if there exists a sample $i$ of known composition such that both $p_{ij} > 0$ and $p_{ij'} > 0$. For identifiability of $\boldsymbol{\beta}$, we require that this graph is connected. For simultaneous identifiability of detection efficiencies and contamination profiles, we also require that every taxon is absent from at least two samples of known composition, and we assume that the contamination profile $\tilde{\mathbf{p}}$ lies in the interior of the $(J-1)$-simplex.
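
This graph condition can be checked programmatically for any proposed set of known-composition samples; a minimal sketch (the example composition matrix below is hypothetical):

```python
import numpy as np
from collections import deque

def cooccurrence_graph_connected(P_known):
    """Check connectivity of the taxon co-occurrence graph: nodes j, j'
    are joined if some known-composition sample contains both taxa.
    P_known is an (n x J) array of known relative abundances."""
    n, J = P_known.shape
    adj = [set() for _ in range(J)]
    for i in range(n):
        present = np.flatnonzero(P_known[i] > 0)
        for j in present:
            adj[j].update(present)   # connect all co-occurring taxa
    # breadth-first search from taxon 0
    seen, queue = {0}, deque([0])
    while queue:
        j = queue.popleft()
        for jp in adj[j]:
            if jp not in seen:
                seen.add(jp)
                queue.append(jp)
    return len(seen) == J

# Hypothetical known compositions over J = 4 taxa
P = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.5, 0.5]])
assert cooccurrence_graph_connected(P)          # chain 0-1-2-3
assert not cooccurrence_graph_connected(P[:1])  # taxa 2, 3 isolated
```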

9.4.1 Identifiability in samples of known composition

We begin by demonstrating identifiability of the parameters present in the samples of known composition. Proceeding via contrapositive, if the mean model is not identifiable in $(\gamma_i, \boldsymbol{\beta}, \tilde{\gamma}, \tilde{\mathbf{p}})$, we must have that

$\exp(\gamma_{i})\big[\mathbf{p}_{i}^{0} \circ \exp(\boldsymbol{\beta}) + \exp(\tilde{\gamma})\tilde{\mathbf{p}}\big] = \exp(\gamma'_{i})\big[\mathbf{p}_{i}^{0} \circ \exp(\boldsymbol{\beta}') + \exp(\tilde{\gamma}')\tilde{\mathbf{p}}'\big]$

with $(\gamma_i, \boldsymbol{\beta}, \tilde{\gamma}, \tilde{\mathbf{p}}) \neq (\gamma'_i, \boldsymbol{\beta}', \tilde{\gamma}', \tilde{\mathbf{p}}')$.

We first demonstrate identifiability in $\tilde{\mathbf{p}}$. Define $\mathbf{j}(i) = \{j : p_{ij}^0 = 0\}$. Then $\exp(\gamma_i)\exp(\tilde{\gamma})\tilde{p}_{\mathbf{j}(i)} = \exp(\gamma'_i)\exp(\tilde{\gamma}')\tilde{p}'_{\mathbf{j}(i)}$ and $\exp(\gamma_{i'})\exp(\tilde{\gamma})\tilde{p}_{\mathbf{j}(i')} = \exp(\gamma'_{i'})\exp(\tilde{\gamma}')\tilde{p}'_{\mathbf{j}(i')}$ for any two samples $i$ and $i'$ of known composition. Then, for $j \in \mathbf{j}(i) \cap \mathbf{j}(i')$, rearranging gives $\tilde{p}_j = a_i \tilde{p}'_j = a_{i'} \tilde{p}'_j$, where $a_i := \exp(\gamma'_i - \gamma_i + \tilde{\gamma}' - \tilde{\gamma})$, so if $\tilde{p}_j > 0$, then $a_i = a_{i'}$. Since we assumed that all taxa are absent from at least two samples of known composition, the required samples $i$ and $i'$ exist for every taxon, and therefore $\tilde{\mathbf{p}} = a_i \tilde{\mathbf{p}}'$, which implies that $\tilde{\mathbf{p}} = \tilde{\mathbf{p}}'$ (both sides must sum to 1). Therefore, $\tilde{\mathbf{p}}$ is identifiable in this model.

Before showing identifiability in the remaining parameters $(\gamma_i, \boldsymbol{\beta}, \tilde{\gamma})$, we first prove the following lemma, which guarantees that identifiability of these model parameters does not depend on which $j^\star$ we choose in defining the identifiability constraint $\beta_{j^\star} = 0$ on $\boldsymbol{\beta}$.

Lemma 1.

For any $j^\star, j^\dagger \in \{1, \dots, J\}$, given $\mathbf{p}$ and $\tilde{\mathbf{p}}$, the model specified in (27) is identifiable in $(\boldsymbol{\gamma}, \boldsymbol{\beta}, \tilde{\gamma})$ under the constraint $\beta_{j^\dagger} = 0$ if and only if it is identifiable in these parameters under the constraint $\beta_{j^\star} = 0$.

Proof.

Let $\boldsymbol{\beta}_\dagger$ denote an element of $G_\dagger := \{x \in \mathbb{R}^J : x_{j^\dagger} = 0\}$, and similarly for $\boldsymbol{\beta}_\star$ and $G_\star$. Note that $f(\mathbf{x}) = \mathbf{x} - x_{j^\star}\mathbf{1}$ defines a bijection from $G_\dagger$ to $G_\star$. Since it is sufficient to show identifiability for an arbitrary sample $i$, we proceed by constructing a bijection $g : \mathbb{R} \times G_\dagger \times \mathbb{R} \rightarrow \mathbb{R} \times G_\star \times \mathbb{R}$ such that $\mu_i(\gamma_i, \boldsymbol{\beta}_\dagger, \tilde{\gamma}) = \mu_i(g(\gamma_i, \boldsymbol{\beta}_\dagger, \tilde{\gamma}))$. For arbitrary $(\gamma_i, \boldsymbol{\beta}_\dagger, \tilde{\gamma}) \in \mathbb{R} \times G_\dagger \times \mathbb{R}$,

$\mu_i(\gamma_i, \boldsymbol{\beta}_\dagger, \tilde{\gamma}) = \exp(\gamma_i)\big[\mathbf{p}_i^0 \circ \exp(\boldsymbol{\beta}_\dagger - \beta_{\dagger j^\star})\exp(\beta_{\dagger j^\star}) + \exp(\tilde{\gamma})\tilde{\mathbf{p}}\big]$
$= \exp(\gamma_i + \beta_{\dagger j^\star})\big[\mathbf{p}_i^0 \circ \exp(\boldsymbol{\beta}_\dagger - \beta_{\dagger j^\star}) + \exp(\tilde{\gamma} - \beta_{\dagger j^\star})\tilde{\mathbf{p}}\big]$
$= \mu_i(\gamma_i + \beta_{\dagger j^\star}, \boldsymbol{\beta}_\dagger - \beta_{\dagger j^\star}, \tilde{\gamma} - \beta_{\dagger j^\star}).$

That is, $g(\gamma_i, \boldsymbol{\beta}_\dagger, \tilde{\gamma}) := (\gamma_i + \beta_{\dagger j^\star}, \boldsymbol{\beta}_\dagger - \beta_{\dagger j^\star}, \tilde{\gamma} - \beta_{\dagger j^\star})$ defines a bijection from $\mathbb{R} \times G_\dagger \times \mathbb{R}$ to $\mathbb{R} \times G_\star \times \mathbb{R}$ such that $\mu_i(\gamma_i, \boldsymbol{\beta}_\dagger, \tilde{\gamma}) = \mu_i(g(\gamma_i, \boldsymbol{\beta}_\dagger, \tilde{\gamma}))$. As such, if we have that

$(\gamma_i, \boldsymbol{\beta}_\star, \tilde{\gamma}) \neq (\gamma'_i, \boldsymbol{\beta}'_\star, \tilde{\gamma}') \Rightarrow \mu_i(\gamma_i, \boldsymbol{\beta}_\star, \tilde{\gamma}) \neq \mu_i(\gamma'_i, \boldsymbol{\beta}'_\star, \tilde{\gamma}')$

under $\beta_{j^\star} = 0$ (for arbitrary $(\gamma_i, \boldsymbol{\beta}_\star, \tilde{\gamma}), (\gamma'_i, \boldsymbol{\beta}'_\star, \tilde{\gamma}') \in \mathbb{R} \times G_\star \times \mathbb{R}$), then the same must hold for arbitrary $(\gamma_i, \boldsymbol{\beta}_\dagger, \tilde{\gamma}), (\gamma'_i, \boldsymbol{\beta}'_\dagger, \tilde{\gamma}') \in \mathbb{R} \times G_\dagger \times \mathbb{R}$. This is because otherwise we could find a pair $g(\gamma_i, \boldsymbol{\beta}_\dagger, \tilde{\gamma}), g(\gamma'_i, \boldsymbol{\beta}'_\dagger, \tilde{\gamma}') \in \mathbb{R} \times G_\star \times \mathbb{R}$ such that

$\mu_i(g(\gamma_i, \boldsymbol{\beta}_\dagger, \tilde{\gamma})) = \mu_i(g(\gamma'_i, \boldsymbol{\beta}'_\dagger, \tilde{\gamma}'))$ with $g(\gamma_i, \boldsymbol{\beta}_\dagger, \tilde{\gamma}) \neq g(\gamma'_i, \boldsymbol{\beta}'_\dagger, \tilde{\gamma}')$,

which would be a contradiction. Therefore, identifiability under $\beta_{j^\star} = 0$ implies identifiability under $\beta_{j^\dagger} = 0$. Finally, it is immediate that lack of identifiability under $\beta_{j^\star} = 0$ implies lack of identifiability under $\beta_{j^\dagger} = 0$: if there exist $(\gamma_i, \boldsymbol{\beta}_\star, \tilde{\gamma}), (\gamma'_i, \boldsymbol{\beta}'_\star, \tilde{\gamma}') \in \mathbb{R} \times G_\star \times \mathbb{R}$ with $(\gamma_i, \boldsymbol{\beta}_\star, \tilde{\gamma}) \neq (\gamma'_i, \boldsymbol{\beta}'_\star, \tilde{\gamma}')$ and $\mu_i(\gamma_i, \boldsymbol{\beta}_\star, \tilde{\gamma}) = \mu_i(\gamma'_i, \boldsymbol{\beta}'_\star, \tilde{\gamma}')$, then $g^{-1}(\gamma_i, \boldsymbol{\beta}_\star, \tilde{\gamma}), g^{-1}(\gamma'_i, \boldsymbol{\beta}'_\star, \tilde{\gamma}')$ is a pair in $\mathbb{R} \times G_\dagger \times \mathbb{R}$ with the same property. ∎

We now show identifiability in $(\gamma_i, \boldsymbol{\beta}, \tilde{\gamma})$. For $i$ such that $p_{ij} = 0$ for at least one $j$, consider the identifiability constraint $\beta_{j^\star} = 0$ for some $j^\star$ such that $p_{ij^\star} > 0$ (which is without loss of generality by the above lemma). Then if $\mu_i(\gamma_i, \mathbf{p}, \boldsymbol{\beta}, \tilde{\gamma}, \tilde{\mathbf{p}}) = \mu_i(\gamma'_i, \mathbf{p}, \boldsymbol{\beta}', \tilde{\gamma}', \tilde{\mathbf{p}})$, we have $\exp(\gamma_i)\exp(\tilde{\gamma})\tilde{p}_j = \exp(\gamma'_i)\exp(\tilde{\gamma}')\tilde{p}_j$ for $j$ such that $p_{ij} = 0$, and thus (since $\tilde{p}_j > 0$) $\exp(\gamma_i)\exp(\tilde{\gamma}) = \exp(\gamma'_i)\exp(\tilde{\gamma}')$, from which we have

$\exp(-\tilde{\gamma})\big[\exp(\boldsymbol{\beta}) \circ \mathbf{p}_i + \exp(\tilde{\gamma})\tilde{\mathbf{p}}\big] = \exp(-\tilde{\gamma}')\big[\exp(\boldsymbol{\beta}') \circ \mathbf{p}_i + \exp(\tilde{\gamma}')\tilde{\mathbf{p}}\big]$
$\Rightarrow \exp(\boldsymbol{\beta} - \tilde{\gamma}) \circ \mathbf{p}_i = \exp(\boldsymbol{\beta}' - \tilde{\gamma}') \circ \mathbf{p}_i$ (cancelling $\tilde{\mathbf{p}}$ on both sides)
$\Rightarrow \exp(-\tilde{\gamma})p_{ij^\star} = \exp(-\tilde{\gamma}')p_{ij^\star}$ (taking the $j^\star$ component, where $\beta_{j^\star} = \beta'_{j^\star} = 0$)
$\Rightarrow \tilde{\gamma} = \tilde{\gamma}'$

since $p_{ij^\star} > 0$. Now choose another sample $i'$ with $p_{i'j^\star} > 0$ and at least one other nonzero entry of $\mathbf{p}_{i'}$ (by assumption such an $i'$ exists). If for taxon $j^\star$ we have

$\exp(\gamma_{i'})\big[\exp(\beta_{j^\star})p_{i'j^\star} + \exp(\tilde{\gamma})\tilde{p}_{j^\star}\big] = \exp(\gamma'_{i'})\big[\exp(\beta'_{j^\star})p_{i'j^\star} + \exp(\tilde{\gamma})\tilde{p}_{j^\star}\big],$   (28)

this immediately implies $\gamma_{i'} = \gamma'_{i'}$, since $p_{i'j^\star} > 0$ and $\beta_{j^\star} = \beta'_{j^\star}$ (as both equal 0 under our identifiability constraint). Now let $j'$ denote another taxon such that $p_{i'j'} > 0$. If we have

$\exp(\gamma_{i'})\big[\exp(\beta_{j'})p_{i'j'} + \exp(\tilde{\gamma})\tilde{p}_{j'}\big] = \exp(\gamma_{i'})\big[\exp(\beta'_{j'})p_{i'j'} + \exp(\tilde{\gamma})\tilde{p}_{j'}\big],$   (29)

this implies $\beta_{j'} = \beta'_{j'}$. For arbitrary $j, j'$, this argument establishes identifiability in $\beta_{j'}$ given identifiability in $\beta_j$, conditional on the existence of a sample $i$ of known composition such that $p_{ij} > 0$ and $p_{ij'} > 0$. Since we assume connectedness of the graph whose nodes are taxa $1, \dots, J$, with an edge between nodes $j$ and $j'$ if $p_{ij} > 0$ and $p_{ij'} > 0$ for some known sample $i$, we are guaranteed a path starting at taxon $j^\star$ and visiting every other taxon at least once, such that for any adjacent $j, j'$ on the path, there exists a known sample $i$ with $p_{ij} > 0$ and $p_{ij'} > 0$. Hence, by induction with base case $\beta_{j^\star} = \beta'_{j^\star}$, we have $\boldsymbol{\beta} = \boldsymbol{\beta}'$ whenever $\mu_i(\gamma_i, \mathbf{p}, \boldsymbol{\beta}, \tilde{\gamma}, \tilde{\mathbf{p}}) = \mu_i(\gamma'_i, \mathbf{p}, \boldsymbol{\beta}', \tilde{\gamma}, \tilde{\mathbf{p}})$ for arbitrary $i$. Finally, given identifiability of $\boldsymbol{\beta}$, $\tilde{\gamma}$, and $\tilde{\mathbf{p}}$, identifiability of $\gamma_i$ for any sample $i$ of known composition is immediate. Hence the mean model is identifiable in all parameters that appear in the mean model for samples treated as known.

9.4.2 Identifiability in samples of unknown composition

We now turn our attention to parameters appearing only in samples treated as having unknown composition. For such a sample, a lack of identifiability would entail

$\exp(\gamma_{i})\big[\exp(\boldsymbol{\beta}) \circ \mathbf{p}_{i} + \exp(\tilde{\gamma})\tilde{\mathbf{p}}\big] = \exp(\gamma'_{i})\big[\exp(\boldsymbol{\beta}) \circ \mathbf{p}'_{i} + \exp(\tilde{\gamma})\tilde{\mathbf{p}}\big]$

for $(\gamma_i, \mathbf{p}_i) \neq (\gamma'_i, \mathbf{p}'_i)$. This immediately gives us

$\exp(\gamma_{i})\big[\exp(\boldsymbol{\beta}) \circ \big(\mathbf{p}_{i} + \exp(\tilde{\gamma})\exp(-\boldsymbol{\beta}) \circ \tilde{\mathbf{p}}\big)\big] = \exp(\gamma'_{i})\big[\exp(\boldsymbol{\beta}) \circ \big(\mathbf{p}'_{i} + \exp(\tilde{\gamma})\exp(-\boldsymbol{\beta}) \circ \tilde{\mathbf{p}}\big)\big],$

from which we have

$\exp(\gamma_{i})\big[\mathbf{p}_{i} + \exp(\tilde{\gamma})\exp(-\boldsymbol{\beta}) \circ \tilde{\mathbf{p}}\big] = \exp(\gamma'_{i})\big[\mathbf{p}'_{i} + \exp(\tilde{\gamma})\exp(-\boldsymbol{\beta}) \circ \tilde{\mathbf{p}}\big].$

Summing over $j$, we obtain

$\exp(\gamma_{i})\big[1 + \exp(\tilde{\gamma})\exp(-\boldsymbol{\beta})^{T}\tilde{\mathbf{p}}\big] = \exp(\gamma'_{i})\big[1 + \exp(\tilde{\gamma})\exp(-\boldsymbol{\beta})^{T}\tilde{\mathbf{p}}\big],$

which implies $\gamma_i = \gamma'_i$, which in turn implies that $\mathbf{p}_i = \mathbf{p}'_i$ for $i$ a sample treated as of unknown composition. Hence the model is identifiable in these parameters as well.

Note that while we have given sufficient conditions for a model with a single source of contamination, the same argument applies to multiple (including plate-specific) sources of contamination, since $\boldsymbol{\beta}$ does not vary across plates.

10 SI: Additional details for reweighted estimator

In Section 4, we introduced a weighted Poisson log-likelihood with the weight for the likelihood contribution of $W_{ij}$ given by

$\hat{w}_{ij} = \frac{\hat{\mu}_{ij} + 1}{\hat{\sigma}^{2}_{ij} + 1}$

where $\hat{\mu}_{ij}$ is the fitted mean for $W_{ij}$ given read depth $W_{i\cdot}$ and parameters $\hat{\theta}$ estimated from a model fit to $\mathbf{W}$ via a Poisson likelihood (without reweighting), and $\hat{\sigma}^{2}_{ij}$ is a fitted value from a monotone regression of the squared residuals $(W_{ij} - \hat{\mu}_{ij})^2$ on the fitted means $\hat{\mu}_{ij}$ (with $i = 1, \dots, n$ and $j = 1, \dots, J$). In other words, $\hat{\sigma}^{2}_{ij}$ is an estimate of $\text{var}(W_{ij} \mid W_{i\cdot}, \theta)$.
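
To make the construction concrete, here is a small sketch of the weight computation on simulated data. All inputs below are simulated rather than drawn from a real fit, and scikit-learn's plain `IsotonicRegression` stands in for the centered isotonic regression we actually use:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)

# Simulated stand-ins for fitted means and observed counts; in practice
# mu_hat comes from the (unweighted) Poisson fit of Section 4.
mu_hat = rng.gamma(shape=2.0, scale=50.0, size=500)
W = rng.negative_binomial(n=5, p=5 / (5 + mu_hat))  # overdispersed counts

# Monotone regression of squared residuals on fitted means estimates
# sigma^2(mu); plain isotonic regression is used here as a stand-in for
# centered isotonic regression.
iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
sigma2_hat = iso.fit_transform(mu_hat, (W - mu_hat) ** 2)

# Stabilized weights (mu_hat + 1) / (sigma2_hat + 1)
w_hat = (mu_hat + 1) / (sigma2_hat + 1)
assert w_hat.shape == mu_hat.shape and np.all(w_hat > 0)
```

Because the simulated counts are negative binomial, the estimated variances exceed the means for large $\hat{\mu}_{ij}$, so the resulting weights fall below 1 there, downweighting the most overdispersed observations.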

To motivate why this reweighting is reasonable, we consider the case in which $\theta$ is in the interior of the parameter space $\Theta$. In this setting we can express the Poisson MLE as a solution to the following score equations:

$\begin{cases} \sum_{i,j} \frac{1}{\mu_{ij}}\big[\frac{\partial}{\partial\theta_{1}}\mu_{ij}\big](W_{ij} - \mu_{ij}) = 0 \\ \quad\vdots \\ \sum_{i,j} \frac{1}{\mu_{ij}}\big[\frac{\partial}{\partial\theta_{L}}\mu_{ij}\big](W_{ij} - \mu_{ij}) = 0 \end{cases}$

Equivalently, we can write

$\sum_{i,j} \frac{1}{\mu_{ij}}\mathbf{g}_{ij}(\theta) = 0,$

letting $\mathbf{g}_{ij} = \big[\frac{\partial}{\partial\theta^{T}}\mu_{ij}\big](W_{ij} - \mu_{ij})$. Hence, we can view this system of equations as a weighted sum of zero-expectation terms $\mathbf{g}_{ij}$ with weights given by $\frac{1}{\mu_{ij}}$, that is, one over a model-based estimate of $\text{Var}(W_{ij} - \mu_{ij})$. In this setting, if the Poisson mean-variance relationship holds and the score equations have a unique solution, we expect the estimator given by this solution to be asymptotically efficient (McCullagh, 1983), whereas when a different mean-variance relationship holds, in general we expect to lose efficiency. When the Poisson mean-variance relationship does not hold, we therefore expect to be able to improve efficiency by reweighting the score equations with a more flexible estimator of $\text{Var}(W_{ij} - \mu_{ij})$. To accomplish this, we use a consistent estimator of $\theta$, the Poisson MLE $\hat{\theta}$, to estimate $\boldsymbol{\mu}$ and $\text{Var}(W_{ij} \mid \mu_{ij})$. Specifically, we estimate $\sigma^{2}(\hat{\mu}_{ij}) := \text{Var}(W_{ij} \mid W_{i\cdot}, X_{i}, Z_{i}, \tilde{Z}_{i}, \boldsymbol{\beta}, \mathbf{p}, \tilde{\mathbf{p}}, \tilde{\boldsymbol{\gamma}})$ under the assumption that $\sigma^{2}(\cdot)$ is an increasing function, via a centered isotonic regression of $(W_{ij} - \hat{\mu}_{ij})^2$ on $\hat{\mu}_{ij}$. Weighting the log-likelihood contribution of $W_{ij}$, $l_{ij} := W_{ij}\log(\mu_{ij}) - \mu_{ij}$, by a factor of $\frac{\hat{\mu}_{ij}}{\hat{\sigma}^{2}_{ij}}$ then yields the reweighted score equations

i,jμ^ijσ^ij21μij𝐠ij(θ)=i,jμ^ijμij1σ^ij2𝐠ij(θ)\displaystyle\sum_{i,j}\frac{\hat{\mu}_{ij}}{\hat{\sigma}^{2}_{ij}}\frac{1}{\mu_{ij}}\mathbf{g}_{ij}(\theta)=\sum_{i,j}\frac{\hat{\mu}_{ij}}{\mu_{ij}}\frac{1}{\hat{\sigma}^{2}_{ij}}\mathbf{g}_{ij}(\theta)

in which each 𝐠ij\mathbf{g}_{ij} is, up to a factor of μ^ijμijp1\frac{\hat{\mu}_{ij}}{\mu_{ij}}\stackrel{{\scriptstyle p}}{{\rightarrow}}1, weighted by the inverse of a flexible estimate of Var(Wij|μij)\text{Var}(W_{ij}|\mu_{ij}). In practice, however, the weighting above may be unstable when μ^ij\hat{\mu}_{ij} and σ^ij2\hat{\sigma}^{2}_{ij} are small. Hence, we weight instead by μ^ij+1σ^ij2+1\frac{\hat{\mu}_{ij}+1}{\hat{\sigma}^{2}_{ij}+1}; this preserves the behavior of the weights when the estimated mean and variance are both large (where reweighting is typically most important) and stabilizes them when these quantities are small.
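The variance-estimation step above can be sketched in a few lines. The following is a minimal illustration in which plain isotonic regression via the pool-adjacent-violators algorithm stands in for the centered isotonic regression used in our procedure, and all data values are hypothetical:

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators: nondecreasing fit to y, assumed already
    ordered by the predictor. Blocks are merged until block means increase."""
    vals = list(np.asarray(y, dtype=float))  # block sums
    wts = [1.0] * len(vals)                  # block sizes
    i = 0
    while i < len(vals) - 1:
        if vals[i] / wts[i] > vals[i + 1] / wts[i + 1]:
            vals[i] += vals.pop(i + 1)
            wts[i] += wts.pop(i + 1)
            if i > 0:
                i -= 1
        else:
            i += 1
    out = []
    for v, w in zip(vals, wts):
        out.extend([v / w] * int(w))
    return np.array(out)

def reweighting_factors(w, mu_hat):
    """Estimate sigma^2(mu) by an isotonic regression of squared residuals
    on fitted means, then form stabilized weights (mu_hat+1)/(sigma2_hat+1)."""
    order = np.argsort(mu_hat)
    sigma2 = np.empty_like(mu_hat, dtype=float)
    sigma2[order] = pava((w - mu_hat)[order] ** 2)
    return (mu_hat + 1.0) / (sigma2 + 1.0)
```

A call such as `reweighting_factors(w, mu_hat)` then supplies the per-observation factors used to reweight the score equations.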

11 SI: Supporting theory for proposed model and estimators

Throughout this section, we will use the following notation:

  • 𝐖i=(Wi1,,WiJ)\mathbf{W}_{i}=(W_{i1},\dots,W_{iJ}): a measured outcome of interest in sample ii across taxa j=1,,Jj=1,\dots,J. We also use 𝐖\mathbf{W} without subscript ii where this does not lead to ambiguity

  • 𝐗i\mathbf{X}_{i} here denotes covariates (Zi,Xi,Z~i)(Z_{i},X_{i},\tilde{Z}_{i}) described in the main text

  • 𝒲\mathcal{W}: the support of 𝐖=(W1,,WJ)\mathbf{W}=(W_{1},\dots,W_{J})

  • 𝒳\mathcal{X}: the support of 𝐗=(X1,,Xp)\mathbf{X}=(X_{1},\dots,X_{p})

  • 𝒲\scaleobj0.75Σ\mathcal{W}_{\scaleobj{0.75}{\Sigma}}: the support of W\scaleobj0.75Σ:=j=1JWjW_{\scaleobj{0.75}{\Sigma}}:=\sum_{j=1}^{J}W_{j}

  • vv: a weighting function from 𝒲\scaleobj0.75Σ×𝒳\mathcal{W}_{\scaleobj{0.75}{\Sigma}}\times\mathcal{X} into >0J\mathbb{R}_{>0}^{J}. For simplicity of notation, we frequently suppress dependence on Wi\scaleobj0.75ΣW_{i\scaleobj{0.75}{\Sigma}} and 𝐗i\mathbf{X}_{i} and write vijv_{ij} to indicate vj(Wi\scaleobj0.75Σ,𝐗i)v_{j}(W_{i\scaleobj{0.75}{\Sigma}},\mathbf{X}_{i})

  • v^n\hat{v}_{n}: an empirical weighting function estimated from a sample of size nn

  • θ\theta: unknown parameters (𝐩,𝜷,𝐩~,𝜸~,𝜶)(\mathbf{p},\boldsymbol{\beta},\tilde{\mathbf{p}},\tilde{\boldsymbol{\gamma}},\boldsymbol{\alpha}); we denote the true value with θ0\theta_{0}

  • 𝝁θ=(μθ1μθJ)\boldsymbol{\mu}_{\theta}=(\mu_{\theta 1}\dots\mu_{\theta J}): a parametrization of the mean model given in equation (5) in main text; 𝔼[𝐖|𝐗,γ,θ]=exp(γ)𝝁θ(𝐗)\mathbb{E}[\mathbf{W}|\mathbf{X},\gamma,\theta]=\text{exp}(\gamma)\boldsymbol{\mu}_{\theta}(\mathbf{X}); when unambiguous, we suppress dependence on 𝐗\mathbf{X} and write 𝝁θ\boldsymbol{\mu}_{\theta}; we also use 𝝁θ\scaleobj0.75Σ\boldsymbol{\mu}_{\theta\scaleobj{0.75}{\Sigma}} to denote j=1Jμθj\sum_{j=1}^{J}\mu_{\theta j}

  • Mnv(θ)M^{v}_{n}(\theta): profile log-likelihood under weighting function vv, evaluated at θ\theta on a sample of size nn

  • Mv(θ)M^{v}(\theta): expected profile log-likelihood under weighting function vv, evaluated at θ\theta

  • mθvm_{\theta}^{v}: the profile log-likelihood under weighting function vv as a function from 𝒲×𝒳\mathcal{W}\times\mathcal{X} into \mathbb{R}; Mθv=𝔼𝐖,𝐗mθv(𝐖,𝐗):=PmθvM^{v}_{\theta}=\mathbb{E}_{\mathbf{W},\mathbf{X}}m^{v}_{\theta}(\mathbf{W},\mathbf{X}):=Pm^{v}_{\theta}, and similarly we can express Mnv(θ)M^{v}_{n}(\theta) in terms of mθvm^{v}_{\theta} and the empirical measure n\mathbb{P}_{n}: Mnv(θ)=nmθvM^{v}_{n}(\theta)=\mathbb{P}_{n}m^{v}_{\theta}

  • L()L^{\infty}(\mathcal{F}): the set of all uniformly bounded real functions on \mathcal{F}

11.1 Assumptions

  1. (A)

    We draw pairs (𝐖,𝐗)iidPθ0(\mathbf{W},\mathbf{X})\stackrel{{\scriptstyle\text{iid}}}{{\sim}}P_{\theta_{0}} where 𝐖\mathbf{W} has closed, bounded support 𝒲0J\mathcal{W}\subset\mathbb{R}_{\geq 0}^{J} and 𝐗\mathbf{X} has closed, bounded support 𝒳p\mathcal{X}\subset\mathbb{R}^{p}.

  2. (B)

    Letting θ\theta denote (𝐩,𝜷,𝐩~,𝜸~,𝜶)(\mathbf{p},\boldsymbol{\beta},\tilde{\mathbf{p}},\tilde{\boldsymbol{\gamma}},\boldsymbol{\alpha}), for a set of known functions from 𝒳\mathcal{X} to 0J\mathbb{R}^{J}_{\geq 0} {𝝁θ:θΘ}\{\boldsymbol{\mu}_{\theta}:\theta\in\Theta\} we have that 𝔼[𝐖|𝐗,W\scaleobj0.75Σ]=W\scaleobj0.75Σ𝝁θ0(𝐗)𝝁θ0\scaleobj0.75Σ(𝐗)\mathbb{E}[\mathbf{W}|\mathbf{X},W_{\scaleobj{0.75}{\Sigma}}]=W_{\scaleobj{0.75}{\Sigma}}\frac{\boldsymbol{\mu}_{\theta_{0}}(\mathbf{X})}{\boldsymbol{\mu}_{\theta_{0}\scaleobj{0.75}{\Sigma}}(\mathbf{X})} where θ0Θd\theta_{0}\in\Theta\subset\mathbb{R}^{d}, with μθ(x)\mu_{\theta}(x) differentiable in θ\theta for all x𝒳x\in\mathcal{X} and for each fixed θΘ\theta\in\Theta, μθ(x)\mu_{\theta}(x) a bounded function on 𝒳\mathcal{X}.

  3. (C)

    For almost all 𝐱𝒳\mathbf{x}\in\mathcal{X}, Pr([j=1JWj]>b|𝐗=x)=1\text{Pr}([\sum_{j=1}^{J}W_{j}]>b|\mathbf{X}=x)=1 for some b>0b>0.

Note: while the form of the mean model given above differs somewhat from the presentation in the main text, it in fact implies the form in the main text if we introduce random variable Γ\Gamma and let 𝔼[W\scaleobj0.75Σ|𝐗=𝐱,Γ=γ]=exp(γ)𝝁θ0\scaleobj0.75Σ\mathbb{E}[W_{\scaleobj{0.75}{\Sigma}}|\mathbf{X}=\mathbf{x},\Gamma=\gamma]=\text{exp}(\gamma)\boldsymbol{\mu}_{\theta_{0}\scaleobj{0.75}{\Sigma}}. However, since this construction in terms of Γ\Gamma is not necessary for the results that follow, we omit it.

11.2 Form of profile log-likelihood

We first derive the form of a log-likelihood in which the nuisance parameters {γi}i=1n\{\gamma_{i}\}_{i=1}^{n} have been profiled out, and we characterize the population analogue of this log-likelihood. The profile log-likelihood takes the following form:

Mnv(θ):\displaystyle M^{v}_{n}(\theta): =1ni=1nsupγi[j=1Jvij(Wijlog[exp(γi)μθj(Xi)]exp(γi)μθj(Xi))]\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\text{sup}_{\gamma_{i}\in\mathbb{R}}\Big{[}\sum_{j=1}^{J}v_{ij}\Big{(}W_{ij}\text{log}[\text{exp}(\gamma_{i})\mu_{\theta j}(X_{i})]-\text{exp}(\gamma_{i})\mu_{\theta j}(X_{i})\Big{)}\Big{]} (30)
=1ni=1nj=1J[vij(Wijlog[𝐯i𝐖i𝐯i𝝁θμθj]𝐯i𝐖i𝐯i𝝁θμθj)]\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{J}\Big{[}v_{ij}\Big{(}W_{ij}\text{log}[\frac{\mathbf{v}_{i}\cdot\mathbf{W}_{i}}{\mathbf{v}_{i}\cdot\boldsymbol{\mu}_{\theta}}\mu_{\theta j}]-\frac{\mathbf{v}_{i}\cdot\mathbf{W}_{i}}{\mathbf{v}_{i}\cdot\boldsymbol{\mu}_{\theta}}\mu_{\theta j}\Big{)}\Big{]} (31)

where we suppress dependence on 𝐗i\mathbf{X}_{i} for simplicity in the second row. We derive the profile likelihood in the second row via differentiation with respect to γi\gamma_{i}; the optimum is unique by strict concavity of aybexp(y)ay-b\text{exp}(y) in yy when a,b>0a,b>0. We use 𝐯i𝐖i\mathbf{v}_{i}\cdot\mathbf{W}_{i} to denote j=1JvijWij\sum_{j=1}^{J}v_{ij}W_{ij} and similarly for 𝐯i𝝁θ\mathbf{v}_{i}\cdot\boldsymbol{\mu}_{\theta}.
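As a quick sanity check on the profiling step, the closed form in (31) can be compared against a direct grid maximization over γi of the objective in (30); the data values below are hypothetical:

```python
import numpy as np

def profiled_contribution(w, mu, v):
    """Closed-form profile log-likelihood contribution for one sample,
    in which the optimal exp(gamma_i) equals (v . w) / (v . mu)."""
    fitted = (np.dot(v, w) / np.dot(v, mu)) * mu
    return np.sum(v * (w * np.log(fitted) - fitted))

def gamma_sup(w, mu, v, grid):
    """Direct maximization over gamma on a grid, for comparison."""
    return max(np.sum(v * (w * np.log(np.exp(g) * mu) - np.exp(g) * mu))
               for g in grid)

w = np.array([5.0, 2.0, 9.0])
mu = np.array([1.0, 0.5, 2.0])
v = np.array([1.0, 2.0, 0.5])
closed = profiled_contribution(w, mu, v)
approx = gamma_sup(w, mu, v, np.linspace(-3.0, 3.0, 20001))
assert abs(closed - approx) < 1e-4  # the grid supremum matches the closed form
```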

We now allow weights 𝐯i=(vi1,,viJ)\mathbf{v}_{i}=(v_{i1},\dots,v_{iJ}) to be given as a (bounded positive) function of 𝐗i\mathbf{X}_{i} and 𝐖i\scaleobj0.75Σ:=j=1JWij\mathbf{W}_{i\scaleobj{0.75}{\Sigma}}:=\sum_{j=1}^{J}W_{ij} and examine the population analogue Mv(θ)M^{v}(\theta) of of the weighted profile log-likelihood Mnv(θ)M^{v}_{n}(\theta).

Mv(θ)=\displaystyle M^{v}(\theta)= 𝔼𝐖,𝐗[j=1Jvij(Wijlog[𝐯i𝐖i𝐯i𝝁θμθj]𝐯i𝐖i𝐯i𝝁θμθj)]\displaystyle\mathbb{E}_{\mathbf{W},\mathbf{X}}\Big{[}\sum_{j=1}^{J}v_{ij}\Big{(}W_{ij}\text{log}[\frac{\mathbf{v}_{i}\cdot\mathbf{W}_{i}}{\mathbf{v}_{i}\cdot\boldsymbol{\mu}_{\theta}}\mu_{\theta j}]-\frac{\mathbf{v}_{i}\cdot\mathbf{W}_{i}}{\mathbf{v}_{i}\cdot\boldsymbol{\mu}_{\theta}}\mu_{\theta j}\Big{)}\Big{]} (32)
=𝔼𝐖\scaleobj0.75Σ,𝐗𝔼𝐖|𝐖\scaleobj0.75Σ,𝐗[j=1Jvij(Wijlog[𝐯i𝐖i𝐯i𝝁θμθj]𝐯i𝐖i𝐯i𝝁θμθj)]\displaystyle=\mathbb{E}_{\mathbf{W}_{\scaleobj{0.75}{\Sigma}},\mathbf{X}}\mathbb{E}_{\mathbf{W}|\mathbf{W}_{\scaleobj{0.75}{\Sigma}},\mathbf{X}}\Big{[}\sum_{j=1}^{J}v_{ij}\Big{(}W_{ij}\text{log}[\frac{\mathbf{v}_{i}\cdot\mathbf{W}_{i}}{\mathbf{v}_{i}\cdot\boldsymbol{\mu}_{\theta}}\mu_{\theta j}]-\frac{\mathbf{v}_{i}\cdot\mathbf{W}_{i}}{\mathbf{v}_{i}\cdot\boldsymbol{\mu}_{\theta}}\mu_{\theta j}\Big{)}\Big{]} (33)
=𝔼𝐖,𝐗[j=1JvijWijlog𝐯i𝐖i]\displaystyle=\mathbb{E}_{\mathbf{W},\mathbf{X}}\Big{[}\sum_{j=1}^{J}v_{ij}W_{ij}\text{log}\mathbf{v}_{i}\cdot\mathbf{W}_{i}\Big{]} (34)
+𝔼𝐖\scaleobj0.75Σ,𝐗[j=1Jvij(Wi\scaleobj0.75Σμθ0jμθ0\scaleobj0.75Σlogμθj𝐯i𝝁θWi\scaleobj0.75Σ𝐯i𝝁θ0μθ0\scaleobj0.75Σμθj𝐯i𝝁θ)]\displaystyle+\mathbb{E}_{\mathbf{W}_{\scaleobj{0.75}{\Sigma}},\mathbf{X}}\Big{[}\sum_{j=1}^{J}v_{ij}\Big{(}W_{i\scaleobj{0.75}{\Sigma}}\frac{\mu_{\theta_{0}j}}{\mu_{\theta_{0}\scaleobj{0.75}{\Sigma}}}\text{log}\frac{\mu_{\theta j}}{\mathbf{v}_{i}\cdot\boldsymbol{\mu}_{\theta}}-W_{i\scaleobj{0.75}{\Sigma}}\frac{\mathbf{v}_{i}\cdot\boldsymbol{\mu}_{\theta_{0}}}{\mu_{\theta_{0}\scaleobj{0.75}{\Sigma}}}\frac{\mu_{\theta j}}{\mathbf{v}_{i}\cdot\boldsymbol{\mu}_{\theta}}\Big{)}\Big{]} (35)
=C+𝔼𝐖\scaleobj0.75Σ,𝐗[Wi\scaleobj0.75Σ𝐯i𝝁θ0μθ0\scaleobj0.75Σj=1Jvij(μθ0j𝐯i𝝁θ0logμθj𝐯i𝝁θμθj𝐯i𝝁θ)]\displaystyle=C+\mathbb{E}_{\mathbf{W}_{\scaleobj{0.75}{\Sigma}},\mathbf{X}}\Big{[}W_{i\scaleobj{0.75}{\Sigma}}\frac{\mathbf{v}_{i}\cdot\boldsymbol{\mu}_{\theta_{0}}}{\mu_{\theta_{0}\scaleobj{0.75}{\Sigma}}}\sum_{j=1}^{J}v_{ij}\Big{(}\frac{\mu_{\theta_{0}j}}{\mathbf{v}_{i}\cdot\boldsymbol{\mu}_{\theta_{0}}}\text{log}\frac{\mu_{\theta j}}{\mathbf{v}_{i}\cdot\boldsymbol{\mu}_{\theta}}-\frac{\mu_{\theta j}}{\mathbf{v}_{i}\cdot\boldsymbol{\mu}_{\theta}}\Big{)}\Big{]} (36)

We note that the term in line (34) above depends on θ0\theta_{0} but not θ\theta; accordingly, we represent it with the constant CC in line (36).

11.3 Optimizer of profile likelihood

We now show that, under a suitable identifiability condition, a weak condition on 𝐖\mathbf{W}, and a condition on the weighting function vv, if the mean model given in assumption B holds at θ=θ0\theta=\theta_{0}, then the unique optimizer of the population criterion Mv(θ)M^{v}(\theta) is θ0\theta_{0}.

The additional conditions we need are as follows:

  1. (D)

For all θ,θΘ\theta,\theta^{\prime}\in\Theta and any a+a\in\mathbb{R}^{+}, θθ\theta\neq\theta^{\prime} implies that 𝝁θ(𝐱)a𝝁θ(𝐱)\boldsymbol{\mu}_{\theta}(\mathbf{x})\neq a\boldsymbol{\mu}_{\theta^{\prime}}(\mathbf{x}) for all 𝐱\mathbf{x} in some set A𝒳A\subset\mathcal{X} with P𝐗(A)>0P_{\mathbf{X}}(A)>0.

  2. (E)

    Weighting function v:𝒳×𝒲\scaleobj0.75Σ>0Jv:\mathcal{X}\times\mathcal{W}_{\scaleobj{0.75}{\Sigma}}\rightarrow\mathbb{R}_{>0}^{J}, where 𝒲\scaleobj0.75Σ\mathcal{W}_{\scaleobj{0.75}{\Sigma}} is the support of W\scaleobj0.75ΣW_{\scaleobj{0.75}{\Sigma}}, is continuous and bounded.

We will use the following simple lemma:

Lemma 1.

For every a0a\geq 0, the function defined by fa(b):=alog(b)bf_{a}(b):=a\text{log}(b)-b is uniquely maximized at b=ab=a (defining 0log0:=00\text{log}0:=0 and letting alog0:=a\text{log}0:=-\infty for every a>0a>0).

Proof.

First consider the case a>0a>0. Since fa(0)=f_{a}(0)=-\infty in this case and faf_{a} is finite for all b>0b>0, the maximum cannot occur at b=0b=0. Over b+b\in\mathbb{R}^{+}, 2b2f=ab2<0\frac{\partial^{2}}{\partial b^{2}}f=-\frac{a}{b^{2}}<0, so faf_{a} is strictly concave over +\mathbb{R}^{+} and hence attains a unique maximum. Setting bf=ab1=0\frac{\partial}{\partial b}f=\frac{a}{b}-1=0 shows that the maximum occurs at b=ab=a.

When a=0a=0, fa(b)={0 if b=0b if b>0f_{a}(b)=\begin{cases}0&\text{ if }b=0\\ -b&\text{ if }b>0\end{cases}, so faf_{a} is uniquely maximized at b=0b=0 since b<0-b<0 when b>0b>0. ∎
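The lemma is also easy to verify numerically; the grid and the values of aa below are arbitrary choices for illustration:

```python
import numpy as np

def f_a(a, b):
    """a*log(b) - b, with the conventions 0*log(0) = 0 and a*log(0) = -inf."""
    if b == 0.0:
        return 0.0 if a == 0.0 else -np.inf
    return a * np.log(b) - b

# the maximizer over a grid of b values should sit at b = a
grid = np.linspace(0.0, 20.0, 20001)  # spacing 1e-3; each a below lies on the grid
for a in [0.0, 0.5, 2.0, 7.0]:
    best = grid[np.argmax([f_a(a, b) for b in grid])]
    assert abs(best - a) < 1e-6
```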

Theorem 1.

Suppose that conditions (A) - (D) are met. Then for any weighting function satisfying (E), the criterion Mv()M^{v}(\cdot) defined above is uniquely optimized at θ0\theta_{0}.

Proof.

From above we have the form of the population criterion Mv(θ)M^{v}(\theta):

Mv(θ)=C+𝔼𝐖\scaleobj0.75Σ,𝐗[W\scaleobj0.75Σ𝐯𝝁θ0μθ0\scaleobj0.75Σj=1Jvij(μθ0j𝐯i𝝁θ0logμθj𝐯i𝝁θμθj𝐯i𝝁θ)]\displaystyle M^{v}(\theta)=C+\mathbb{E}_{\mathbf{W}_{\scaleobj{0.75}{\Sigma}},\mathbf{X}}\Big{[}W_{\scaleobj{0.75}{\Sigma}}\frac{\mathbf{v}\cdot\boldsymbol{\mu}_{\theta_{0}}}{\mu_{\theta_{0}\scaleobj{0.75}{\Sigma}}}\sum_{j=1}^{J}v_{ij}\Big{(}\frac{\mu_{\theta_{0}j}}{\mathbf{v}_{i}\cdot\boldsymbol{\mu}_{\theta_{0}}}\text{log}\frac{\mu_{\theta j}}{\mathbf{v}_{i}\cdot\boldsymbol{\mu}_{\theta}}-\frac{\mu_{\theta j}}{\mathbf{v}_{i}\cdot\boldsymbol{\mu}_{\theta}}\Big{)}\Big{]} (37)

For each fixed pair (𝐱,w\scaleobj0.75Σ)supp(𝐗,W\scaleobj0.75Σ)(\mathbf{x},w_{\scaleobj{0.75}{\Sigma}})\in\text{supp}(\mathbf{X},W_{\scaleobj{0.75}{\Sigma}}), denote by μθ0j𝐯𝝁θ0(𝐱,w\scaleobj0.75Σ)\frac{\mu_{\theta_{0}j}}{\mathbf{v}\cdot\boldsymbol{\mu}_{\theta_{0}}}(\mathbf{x},w_{\scaleobj{0.75}{\Sigma}}) the function μθ0j(𝐱)𝐯(𝐱,w\scaleobj0.75Σ)𝝁θ0(𝐱)\frac{\mu_{\theta_{0}j}(\mathbf{x})}{\mathbf{v}(\mathbf{x},w_{\scaleobj{0.75}{\Sigma}})\cdot\boldsymbol{\mu}_{\theta_{0}}(\mathbf{x})}. Then

hjv(𝐱,w\scaleobj0.75Σ,θ;θ0):=μθ0j𝐯𝝁θ0(𝐱,w\scaleobj0.75Σ)logμθj𝐯𝝁θ(𝐱,w\scaleobj0.75Σ)μθj𝐯𝝁θ(𝐱,w\scaleobj0.75Σ)\displaystyle h^{v}_{j}(\mathbf{x},w_{\scaleobj{0.75}{\Sigma}},\theta;\theta_{0}):=\frac{\mu_{\theta_{0}j}}{\mathbf{v}\cdot\boldsymbol{\mu}_{\theta_{0}}}(\mathbf{x},w_{\scaleobj{0.75}{\Sigma}})\text{log}\frac{\mu_{\theta j}}{\mathbf{v}\cdot\boldsymbol{\mu}_{\theta}}(\mathbf{x},w_{\scaleobj{0.75}{\Sigma}})-\frac{\mu_{\theta j}}{\mathbf{v}\cdot\boldsymbol{\mu}_{\theta}}(\mathbf{x},w_{\scaleobj{0.75}{\Sigma}}) (38)

is maximized when μθj𝐯i𝝁θ(𝐱,w\scaleobj0.75Σ)=μθ0j𝐯i𝝁θ0(𝐱,w\scaleobj0.75Σ)\frac{\mu_{\theta j}}{\mathbf{v}_{i}\cdot\boldsymbol{\mu}_{\theta}}(\mathbf{x},w_{\scaleobj{0.75}{\Sigma}})=\frac{\mu_{\theta_{0}j}}{\mathbf{v}_{i}\cdot\boldsymbol{\mu}_{\theta_{0}}}(\mathbf{x},w_{\scaleobj{0.75}{\Sigma}}) by Lemma 1.

Before proceeding, we show that <Mv(θ0)<-\infty<M^{v}(\theta_{0})<\infty. By definition, we have

Mv(θ0)\displaystyle M^{v}(\theta_{0}) =𝔼𝐖,𝐗supγj=1Jvj(Wjlog[exp(γ)𝐯𝐖𝐯𝝁θ0μθ0j]exp(γ)𝐯𝐖𝐯𝝁θ0μθ0j)\displaystyle=\mathbb{E}_{\mathbf{W},\mathbf{X}}\text{sup}_{\gamma\in\mathbb{R}}\sum_{j=1}^{J}v_{j}\big{(}W_{j}\text{log}\big{[}\text{exp}(\gamma)\frac{\mathbf{v}\cdot\mathbf{W}}{\mathbf{v}\cdot\boldsymbol{\mu}_{\theta_{0}}}\mu_{\theta_{0}j}\big{]}-\text{exp}(\gamma)\frac{\mathbf{v}\cdot\mathbf{W}}{\mathbf{v}\cdot\boldsymbol{\mu}_{\theta_{0}}}\mu_{\theta_{0}j}\big{)} (39)
𝔼𝐖,𝐗j=1Jvj(Wjlog[𝐯𝐖𝐯𝝁θ0μθ0j]𝐯𝐖𝐯𝝁θ0μθ0j) (setting γ=0)\displaystyle\geq\mathbb{E}_{\mathbf{W},\mathbf{X}}\sum_{j=1}^{J}v_{j}\big{(}W_{j}\text{log}\big{[}\frac{\mathbf{v}\cdot\mathbf{W}}{\mathbf{v}\cdot\boldsymbol{\mu}_{\theta_{0}}}\mu_{\theta_{0}j}\big{]}-\frac{\mathbf{v}\cdot\mathbf{W}}{\mathbf{v}\cdot\boldsymbol{\mu}_{\theta_{0}}}\mu_{\theta_{0}j}\big{)}\text{ (setting }\gamma=0) (40)
=𝔼𝐖,𝐗j=1JvjWjlog𝐯𝐖\displaystyle=\mathbb{E}_{\mathbf{W},\mathbf{X}}\sum_{j=1}^{J}v_{j}W_{j}\text{log}\mathbf{v}\cdot\mathbf{W} (41)
+𝔼W\scaleobj0.75Σ,𝐗𝐯𝝁θ0μθ0\scaleobj0.75ΣW\scaleobj0.75Σj=1J[μθ0j𝐯𝝁θ0logμθ0j𝐯𝝁θ0μθ0j𝐯𝝁θ0]\displaystyle+\mathbb{E}_{W_{\scaleobj{0.75}{\Sigma}},\mathbf{X}}\frac{\mathbf{v}\cdot\boldsymbol{\mu}_{\theta_{0}}}{\mu_{\theta_{0}\scaleobj{0.75}{\Sigma}}}W_{\scaleobj{0.75}{\Sigma}}\sum_{j=1}^{J}\big{[}\frac{\mu_{\theta_{0}j}}{\mathbf{v}\cdot\boldsymbol{\mu}_{\theta_{0}}}\text{log}\frac{\mu_{\theta_{0}j}}{\mathbf{v}\cdot\boldsymbol{\mu}_{\theta_{0}}}-\frac{\mu_{\theta_{0}j}}{\mathbf{v}\cdot\boldsymbol{\mu}_{\theta_{0}}}\big{]} (42)

The term in line (41) is equal to 𝔼𝐖,𝐗𝐯𝐖log𝐯𝐖\mathbb{E}_{\mathbf{W},\mathbf{X}}\mathbf{v}\cdot\mathbf{W}\text{log}\mathbf{v}\cdot\mathbf{W}, which as the integral of a bounded function over a bounded domain is finite. By assumption (C), we must have jμθ0j(𝐗)>0\sum_{j}\mu_{\theta_{0}j}(\mathbf{X})>0 almost surely, so the term 𝐯𝝁θ0μθ0\scaleobj0.75ΣW\scaleobj0.75Σ\frac{\mathbf{v}\cdot\boldsymbol{\mu}_{\theta_{0}}}{\mu_{\theta_{0}\scaleobj{0.75}{\Sigma}}}W_{\scaleobj{0.75}{\Sigma}} in line (42) is almost surely bounded by boundedness of μθ\mu_{\theta}, vv, and 𝐖\mathbf{W}. Inside the sum in this line, we have terms of the form alogaaa\text{log}a-a, which is a bounded function on any bounded set in 0\mathbb{R}_{\geq 0}. Hence line (42) is an integral of a bounded function over a bounded domain and so is also finite, so Mv(θ0)>M^{v}(\theta_{0})>-\infty.

Similarly,

Mv(θ0)\displaystyle M^{v}(\theta_{0}) =𝔼𝐖,𝐗supγj=1Jvj(Wjlog[exp(γ)𝐯𝐖𝐯𝝁θ0μθ0j]exp(γ)𝐯𝐖𝐯𝝁θ0μθ0j)\displaystyle=\mathbb{E}_{\mathbf{W},\mathbf{X}}\text{sup}_{\gamma\in\mathbb{R}}\sum_{j=1}^{J}v_{j}\big{(}W_{j}\text{log}\big{[}\text{exp}(\gamma)\frac{\mathbf{v}\cdot\mathbf{W}}{\mathbf{v}\cdot\boldsymbol{\mu}_{\theta_{0}}}\mu_{\theta_{0}j}\big{]}-\text{exp}(\gamma)\frac{\mathbf{v}\cdot\mathbf{W}}{\mathbf{v}\cdot\boldsymbol{\mu}_{\theta_{0}}}\mu_{\theta_{0}j}\big{)} (43)
𝔼𝐖,𝐗j=1Jvj(WjlogWjWj) by Lemma 1\displaystyle\leq\mathbb{E}_{\mathbf{W},\mathbf{X}}\sum_{j=1}^{J}v_{j}\big{(}W_{j}\text{log}W_{j}-W_{j})\text{ by Lemma 1} (44)
< since j=1Jvj(WjlogWjWj) is bounded on 𝒲×𝒳\displaystyle<\infty\text{ since }\sum_{j=1}^{J}v_{j}\big{(}W_{j}\text{log}W_{j}-W_{j}\big{)}\text{ is bounded on }\mathcal{W}\times\mathcal{X} (45)

Hence <Mv(θ0)<-\infty<M^{v}(\theta_{0})<\infty, which guarantees that the difference in the following argument is not of the form \infty-\infty.

Now, for any θΘ\theta\in\Theta with θθ0\theta\neq\theta_{0}, we have

Mv(θ)Mv(θ0)\displaystyle M^{v}(\theta)-M^{v}(\theta_{0})
=\displaystyle= Aw𝐯𝝁θ0μθ0\scaleobj0.75Σj=1J[hjv(𝐱,w\scaleobj0.75Σ,θ;θ0)hjv(𝐱,w\scaleobj0.75Σ,θ0;θ0)]dPW\scaleobj0.75Σ|𝐗(w\scaleobj0.75Σ)dP𝐗(𝐱)\displaystyle\int_{A}\int w\frac{\mathbf{v}\cdot\boldsymbol{\mu}_{\theta_{0}}}{\mu_{\theta_{0}\scaleobj{0.75}{\Sigma}}}\sum_{j=1}^{J}\Big{[}h^{v}_{j}(\mathbf{x},w_{\scaleobj{0.75}{\Sigma}},\theta;\theta_{0})-h^{v}_{j}(\mathbf{x},w_{\scaleobj{0.75}{\Sigma}},\theta_{0};\theta_{0})\Big{]}dP_{W_{\scaleobj{0.75}{\Sigma}}|\mathbf{X}}(w_{\scaleobj{0.75}{\Sigma}})dP_{\mathbf{X}}(\mathbf{x})
+\displaystyle+ ACw𝐯𝝁θ0μθ0\scaleobj0.75Σj=1J[hjv(𝐱,w\scaleobj0.75Σ,θ;θ0)hjv(𝐱,w\scaleobj0.75Σ,θ0;θ0)]dPW\scaleobj0.75Σ|𝐗(w\scaleobj0.75Σ)dP𝐗(𝐱)\displaystyle\int_{A^{C}}\int w\frac{\mathbf{v}\cdot\boldsymbol{\mu}_{\theta_{0}}}{\mu_{\theta_{0}\scaleobj{0.75}{\Sigma}}}\sum_{j=1}^{J}\Big{[}h^{v}_{j}(\mathbf{x},w_{\scaleobj{0.75}{\Sigma}},\theta;\theta_{0})-h^{v}_{j}(\mathbf{x},w_{\scaleobj{0.75}{\Sigma}},\theta_{0};\theta_{0})\Big{]}dP_{W_{\scaleobj{0.75}{\Sigma}}|\mathbf{X}}(w_{\scaleobj{0.75}{\Sigma}})dP_{\mathbf{X}}(\mathbf{x})
\displaystyle\leq Aw𝐯𝝁θ0μθ0\scaleobj0.75Σj=1J[hjv(𝐱,w\scaleobj0.75Σ,θ;θ0)hjv(𝐱,w\scaleobj0.75Σ,θ0;θ0)]dPW\scaleobj0.75Σ|𝐗(w\scaleobj0.75Σ)dP𝐗(𝐱)()\displaystyle\int_{A}\int w\frac{\mathbf{v}\cdot\boldsymbol{\mu}_{\theta_{0}}}{\mu_{\theta_{0}\scaleobj{0.75}{\Sigma}}}\sum_{j=1}^{J}\Big{[}h^{v}_{j}(\mathbf{x},w_{\scaleobj{0.75}{\Sigma}},\theta;\theta_{0})-h^{v}_{j}(\mathbf{x},w_{\scaleobj{0.75}{\Sigma}},\theta_{0};\theta_{0})\Big{]}dP_{W_{\scaleobj{0.75}{\Sigma}}|\mathbf{X}}(w_{\scaleobj{0.75}{\Sigma}})dP_{\mathbf{X}}(\mathbf{x})~{}~{}~{}(\boldsymbol{\star})
<\displaystyle< 0\displaystyle~{}0

The first inequality is a result of θ0\theta_{0} maximizing (not necessarily uniquely) hjvh^{v}_{j}; i.e., hjv(𝐱,θ;θ0)hjv(𝐱,θ0;θ0)0h^{v}_{j}(\mathbf{x},\theta;\theta_{0})-h^{v}_{j}(\mathbf{x},\theta_{0};\theta_{0})\leq 0. Strict inequality holds in the last line because the integrand in ()(\boldsymbol{\star}) is strictly negative, since hjv(𝐱,w\scaleobj0.75Σ,θ;θ0)hjv(𝐱,w\scaleobj0.75Σ,θ0;θ0)<0h^{v}_{j}(\mathbf{x},w_{\scaleobj{0.75}{\Sigma}},\theta;\theta_{0})-h^{v}_{j}(\mathbf{x},w_{\scaleobj{0.75}{\Sigma}},\theta_{0};\theta_{0})<0 for θθ0\theta\neq\theta_{0} on AA, and the term w𝐯𝝁θ0μθ0\scaleobj0.75Σw\frac{\mathbf{v}\cdot\boldsymbol{\mu}_{\theta_{0}}}{\mu_{\theta_{0}\scaleobj{0.75}{\Sigma}}} is a.s. strictly positive by assumption (C) and positivity of 𝐯\mathbf{v}. Hence Mv(θ0)>Mv(θ)M^{v}(\theta_{0})>M^{v}(\theta) for all θθ0\theta\neq\theta_{0} in Θ\Theta, so θ0\theta_{0} is the unique maximizer of the population criterion Mv()M^{v}(\cdot). ∎

11.4 Consistency of M-estimators

We apply theorem 5.14 of van der Vaart (1998) to show consistency of maximizers θ^nv\hat{\theta}_{n}^{v} of MnvM^{v}_{n} for θ0\theta_{0}. We also show consistency of estimators θ^nv^n\hat{\theta}_{n}^{\hat{v}_{n}}, where {v^n}\{\hat{v}_{n}\} is a sequence of random continuous positive bounded weighting functions converging uniformly in probability to vv.

We first require the following assumption on vv,v^n\hat{v}_{n}, and 𝝁\boldsymbol{\mu}:

  1. (F)

    supt𝒳×𝒲\scaleobj0.75Σ|v(t)v^n(t)|p0\text{sup}_{t\in\mathcal{X}\times\mathcal{W}_{\scaleobj{0.75}{\Sigma}}}|v(t)-\hat{v}_{n}(t)|\stackrel{{\scriptstyle p}}{{\rightarrow}}0 and every v^n\hat{v}_{n} is continuous, positive, and bounded.

  2. (G)

There exist δ>0\delta>0 and ϵ>0\epsilon>0 such that minjinf𝐱𝒳:μθ0j(x)>0μθj(x)μθ\scaleobj0.75Σ(x)ϵ\text{min}_{j}~{}\inf_{\mathbf{x}\in\mathcal{X}:\mu_{\theta_{0}j}(x)>0}~{}\frac{\mu_{\theta j}(x)}{\mu_{\theta\scaleobj{0.75}{\Sigma}}(x)}\geq\epsilon holds for all θHδ:=KδΘ\theta\in H_{\delta}:=K_{\delta}\cap\Theta where Kδ:={θ:d(θ,θ0)δ}K_{\delta}:=\{\theta:d(\theta,\theta_{0})\leq\delta\} is a closed δ\delta-neighborhood of θ0\theta_{0} in d\mathbb{R}^{d}. Moreover, μθ(x)\mu_{\theta}(x) is Lipschitz continuous in θ\theta on HδH_{\delta} for each x𝒳x\in\mathcal{X}.

Note: while assumption (G) can likely be loosened, we note that in practice it is not particularly restrictive. In particular, we emphasize that it in no way precludes θ0\theta_{0} from lying at the boundary of the parameter space; rather, it guarantees that nonzero means are bounded away from zero, which allows us to select neighborhoods of θ0\theta_{0} in which mθvm_{\theta}^{v} is bounded.

Theorem 2.

Suppose conditions (A) through (G) are satisfied. Then for d(θ,θ)=k=1d|arctan(θk)arctan(θk)|d(\theta,\theta^{\prime})=\sum_{k=1}^{d}|\text{arctan}(\theta_{k})-\text{arctan}(\theta^{\prime}_{k})|, we have Pr(d(θ^v,θ0)>ϵ)0Pr(d(\hat{\theta}^{v},\theta_{0})>\epsilon)\rightarrow 0 and Pr(d(θ^v^n,θ0)>ϵ)0Pr(d(\hat{\theta}^{\hat{v}_{n}},\theta_{0})>\epsilon)\rightarrow 0 for all ϵ>0\epsilon>0.

Proof.

We first compactify our parameter space Θ\Theta to obtain Θ¯\bar{\Theta} by allowing elements of unconstrained Euclidean parameters to take values in the extended reals.

A necessary condition for applying theorem 5.14 is that we have 𝔼supθUmθv<\mathbb{E}\text{sup}_{\theta\in U}m_{\theta}^{v}<\infty for mθv(𝐰,𝐱):=j=1Jvj(wjlog[𝐯𝐰𝐯μθμθj]𝐯𝐰𝐯μθμθj)m_{\theta}^{v}(\mathbf{w},\mathbf{x}):=\sum_{j=1}^{J}v_{j}\big{(}w_{j}\text{log}\big{[}\frac{\mathbf{v}\cdot\mathbf{w}}{\mathbf{v}\cdot\mu_{\theta}}\mu_{\theta_{j}}\big{]}-\frac{\mathbf{v}\cdot\mathbf{w}}{\mathbf{v}\cdot\mu_{\theta}}\mu_{\theta_{j}}\big{)} and a sufficiently small ball UΘU\subset\Theta. By Lemma 1, supθUmv(𝐰,𝐱)j=1Jvj(wjlogwjwj)\text{sup}_{\theta\in U}m^{v}(\mathbf{w},\mathbf{x})\leq\sum_{j=1}^{J}v_{j}\big{(}w_{j}\text{log}w_{j}-w_{j}\big{)}, which is bounded above since vv is bounded. Hence by assumption (A), 𝔼supθUmθvPj=1Jvj(wjlogwjwj)<\mathbb{E}\text{sup}_{\theta\in U}m_{\theta}^{v}\leq P\sum_{j=1}^{J}v_{j}\big{(}w_{j}\text{log}w_{j}-w_{j}\big{)}<\infty. We also require Mnv(θ^v)Mnv(𝜽𝟎)+oP(1)M_{n}^{v}(\hat{\theta}^{v})\geq M_{n}^{v}(\boldsymbol{\theta_{0}})+o_{P}(1), which is trivially satisfied since θ^v\hat{\theta}^{v} maximizes MnvM_{n}^{v}.

Then letting compact set K=Θ¯K=\bar{\Theta}, we can directly apply theorem 5.14 to obtain Pr(d(θ^v,θ0)ϵ)0Pr(d(\hat{\theta}^{v},\theta_{0})\geq\epsilon)\rightarrow 0 for any ϵ>0\epsilon>0.

To apply theorem 5.14 to θ^v^n\hat{\theta}^{\hat{v}_{n}}, we only need in addition to the above that Mnv(θ^v^n)Mnv(θ0)+oP(1)M_{n}^{v}(\hat{\theta}^{\hat{v}_{n}})\geq M_{n}^{v}(\theta_{0})+o_{P}(1).

For any fixed v^\hat{v}, we have

Mnv^(θ^nv^)\displaystyle M_{n}^{\hat{v}}(\hat{\theta}^{\hat{v}}_{n}) Mnv^(θ0)+oP(1)\displaystyle\geq M_{n}^{\hat{v}}(\theta_{0})+o_{P}(1) (46)
=Mnv(θ0)+oP(1)+(Mnv(θ0)Mnv^(θ0))\displaystyle=M_{n}^{v}(\theta_{0})+o_{P}(1)+(M_{n}^{v}(\theta_{0})-M_{n}^{\hat{v}}(\theta_{0})) (47)

However, the term (Mnv(θ0)Mnv^(θ0))=oP(1)(M_{n}^{v}(\theta_{0})-M_{n}^{\hat{v}}(\theta_{0}))=o_{P}(1) if we let v^=v^n\hat{v}=\hat{v}_{n} since, letting lij(θ):=Wjlogμθjμθ\scaleobj0.75Σμθjμθ\scaleobj0.75Σl_{ij}(\theta):=W_{j}\text{log}\frac{\mu_{\theta j}}{\mu_{\theta\scaleobj{0.75}{\Sigma}}}-\frac{\mu_{\theta j}}{\mu_{\theta\scaleobj{0.75}{\Sigma}}},

|Mnv^n(θ0)Mnv(θ0)|\displaystyle|M_{n}^{\hat{v}_{n}}(\theta_{0})-M_{n}^{v}(\theta_{0})| =|1ni=1nj=1J(v^n,ijvij)lij(θ0)|\displaystyle=\Big{|}\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{J}(\hat{v}_{n,ij}-v_{ij})l_{ij}(\theta_{0})\Big{|} (48)
supt𝒳×𝒲\scaleobj0.75Σ|v^n(t)v(t)|1ni=1nj=1J|lij(θ0)|p0\displaystyle\leq\text{sup}_{t\in\mathcal{X}\times\mathcal{W}_{\scaleobj{0.75}{\Sigma}}}|\hat{v}_{n}(t)-v(t)|\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{J}|l_{ij}(\theta_{0})|\stackrel{{\scriptstyle p}}{{\rightarrow}}0 (49)

since supt𝒳×𝒲\scaleobj0.75Σ|v^n(t)v(t)|p0\text{sup}_{t\in\mathcal{X}\times\mathcal{W}_{\scaleobj{0.75}{\Sigma}}}|\hat{v}_{n}(t)-v(t)|\stackrel{{\scriptstyle p}}{{\rightarrow}}0. Hence Mnv(θ^v^n)Mnv(θ0)+oP(1)M_{n}^{v}(\hat{\theta}^{\hat{v}_{n}})\geq M_{n}^{v}(\theta_{0})+o_{P}(1), so Pr(d(θ^v^n,θ0)ϵ)0Pr(d(\hat{\theta}^{\hat{v}_{n}},\theta_{0})\geq\epsilon)\rightarrow 0 for any ϵ>0\epsilon>0.
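The bounded metric dd used in Theorem 2 extends continuously to the compactified parameter space, since arctan maps the extended reals into [−π/2, π/2]. A minimal sketch:

```python
import numpy as np

def d(theta, theta_prime):
    """Bounded metric from Theorem 2: sum_k |arctan(theta_k) - arctan(theta'_k)|.
    Because arctan maps the extended reals into [-pi/2, pi/2], the distance is
    finite even between points of the compactified parameter space."""
    return float(np.sum(np.abs(np.arctan(theta) - np.arctan(theta_prime))))
```

For example, a parameter sent to infinity in one coordinate remains at finite distance (π/2 in that coordinate) from any finite point.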

11.5 Convergence in distribution of M-estimators

Theorem 3.

If assumptions A - G are met, n(θ^vθ0)\sqrt{n}(\hat{\theta}^{v}-\theta_{0}) converges in distribution to a tight limiting distribution in d\mathbb{R}^{d}.

Proof.

Without loss of generality, we consider δv:={mθv:θHδ}\mathcal{M}^{v}_{\delta}:=\{m_{\theta}^{v}:\theta\in H_{\delta}\} for HδH_{\delta} given in assumption G (since by theorem 2, with probability approaching 11, θ^nvHδ\hat{\theta}_{n}^{v}\in H_{\delta} as nn increases without bound).

We begin by considering criterion mθm_{\theta} without weighting.

Fix (w,x)supp(W,X)(w,x)\in\text{supp}(W,X). We then have, for any θHδ\theta\in H_{\delta},

mθ(w,x)\displaystyle m_{\theta}(w,x) =j=1J[wjlogμθj(x)μθ\scaleobj0.75Σ(x)μθj(x)μθ\scaleobj0.75Σ(x)]\displaystyle=\sum_{j=1}^{J}\Big{[}w_{j}\text{log}\frac{\mu_{\theta j}(x)}{\mu_{\theta\scaleobj{0.75}{\Sigma}}(x)}-\frac{\mu_{\theta j}(x)}{\mu_{\theta\scaleobj{0.75}{\Sigma}}(x)}\Big{]} (50)

By Lipschitz continuity of μθ(x)\mu_{\theta}(x) in θ\theta over HδH_{\delta}, mθ(w,x)m_{\theta}(w,x) is a Lipschitz continuous function at all θ\theta such that all elements of μθ(x)\mu_{\theta}(x) are nonzero, since g(z):=alog(z)zg(z):=a\text{log}(z)-z is Lipschitz continuous for zϵ>0z\geq\epsilon>0 and bounded a0a\geq 0. When one or more elements of μθ(x)\mu_{\theta}(x) are zero, they must be 0 identically over all θHδ\theta\in H_{\delta}; otherwise, by continuity of μθ(x)\mu_{\theta}(x) in θ\theta, we are able to select θHδ\theta^{\star}\in H_{\delta} such that μθj(x)\mu_{\theta^{\star}j}(x) is positive but arbitrarily close to zero, contradicting assumption G.

If 𝔼Wj=0\mathbb{E}W_{j}=0, then Wj=0W_{j}=0 with probability 1 because WjW_{j} is a nonnegative random variable. Hence, except possibly on a set of measure 0, we have

mθ(w,x)\displaystyle m_{\theta}(w,x) =j=1J𝟏μθ0j(x)>0[wjlogμθj(x)μθ\scaleobj0.75Σ(x)μθj(x)μθ\scaleobj0.75Σ(x)]\displaystyle=\sum_{j=1}^{J}\mathbf{1}_{\mu_{\theta_{0}j}(x)>0}\big{[}w_{j}\text{log}\frac{\mu_{\theta j}(x)}{\mu_{\theta\scaleobj{0.75}{\Sigma}}(x)}-\frac{\mu_{\theta j}(x)}{\mu_{\theta\scaleobj{0.75}{\Sigma}}(x)}\big{]} (51)

Accordingly, mθ(w,x)m_{\theta}(w,x) is Lipschitz continuous in θ\theta for all θHδ\theta\in H_{\delta}. Hence mθv(w,x)m^{v}_{\theta}(w,x), as a weighted sum of Lipschitz continuous functions with bounded weights, must also be Lipschitz continuous in θ\theta for all θHδ\theta\in H_{\delta}.

We now use a bracketing argument to show that δ\mathcal{M}_{\delta} is a Donsker class, which we will use together with a result due to Dümbgen (1993) to show that n(θ^nvθ0)\sqrt{n}(\hat{\theta}_{n}^{v}-\theta_{0}) converges to a well-defined limiting distribution in d\mathbb{R}^{d}.

By Lipschitz continuity of mθv(w,x)m^{v}_{\theta}(w,x) in θ\theta on HδH_{\delta}, we have

|mθv(w,x)mθv(w,x)|Cw,xθθ\displaystyle|m^{v}_{\theta}(w,x)-m^{v}_{\theta^{\prime}}(w,x)|\leq C_{w,x}||\theta-\theta^{\prime}|| (52)

for some Cw,x<C_{w,x}<\infty.

Hence we can consider

C(w,x)\displaystyle C(w,x) =inf{C:|mθv(w,x)mθv(w,x)|Cθθ|θ,θHδ}\displaystyle=\inf\{C:|m^{v}_{\theta}(w,x)-m^{v}_{\theta^{\prime}}(w,x)|\leq C||\theta-\theta^{\prime}||\big{|}\theta,\theta^{\prime}\in H_{\delta}\} (53)
=supθ,θHδfw,x(θ,θ)\displaystyle=\sup_{\theta,\theta^{\prime}\in H_{\delta}}f_{w,x}(\theta,\theta^{\prime}) (54)

where

fw,x(θ,θ)={|mθv(w,x)mθv(w,x)|θθθθlimsupθθ|mθv(w,x)mθv(w,x)|θθθ=θ\displaystyle f_{w,x}(\theta,\theta^{\prime})=\begin{cases}\frac{|m^{v}_{\theta}(w,x)-m^{v}_{\theta^{\prime}}(w,x)|}{||\theta-\theta^{\prime}||}&\theta\neq\theta^{\prime}\\ \lim\text{sup}_{\theta\rightarrow\theta^{\prime}}\frac{|m^{v}_{\theta}(w,x)-m^{v}_{\theta^{\prime}}(w,x)|}{||\theta-\theta^{\prime}||}&\theta=\theta^{\prime}\end{cases}

where Lipschitz continuity of mθv(w,x)m^{v}_{\theta}(w,x) guarantees the existence of the limit as θθ\theta^{\prime}\rightarrow\theta.

Hence by Lipschitz continuity of mθv(w,x)m^{v}_{\theta}(w,x) in θ\theta and continuity of mθv(w,x)m^{v}_{\theta}(w,x) in ww and xx, fw,x(θ,θ)f_{w,x}(\theta,\theta^{\prime}) is continuous in w,x,θw,x,\theta, and θ\theta^{\prime}, which in turn implies that C(w,x)C(w,x) is continuous on supp(W,X)\text{supp}(W,X) and hence is bounded on supp(W,X)\text{supp}(W,X) by compactness of supp(W,X)\text{supp}(W,X). Since by assumption supp(W,X)\text{supp}(W,X) is bounded, this implies that |C(w,x)|2|C(w,x)|^{2} is integrable, which is sufficient for the bracketing entropy of δ\mathcal{M}_{\delta} to be at most of order log(1/ϵ)\text{log}(1/\epsilon) (see example 19.7 of van der Vaart (1998)). Hence by theorem 19.5 of van der Vaart (1998), δ\mathcal{M}_{\delta} is Donsker.

Accordingly, we have {n(nmθvPmθv):mθMδ}\{\sqrt{n}(\mathbb{P}_{n}m_{\theta}^{v}-Pm_{\theta}^{v}):m_{\theta}\in M_{\delta}\} weakly converging to a tight Gaussian process in l(δ)l^{\infty}(\mathcal{M}_{\delta}) as nn\rightarrow\infty. By proposition 1 of Dümbgen (1993), this, taken with Hadamard directional differentiability of the map defined by g(F):=arg maxθHδF(θ)g(F):=\text{arg max}_{\theta\in H_{\delta}}F(\theta), gives us

n(θ^nvθ0):=n(g(nmθv)g(Pmθv))𝒥\displaystyle\sqrt{n}(\hat{\theta}^{v}_{n}-\theta_{0}):=\sqrt{n}(g(\mathbb{P}_{n}m_{\theta}^{v})-g(Pm_{\theta}^{v}))\rightsquigarrow\mathcal{J} (55)

for a well-defined limiting distribution 𝒥\mathcal{J} on d\mathbb{R}^{d}. ∎

Theorem 4.

If assumptions A - G are met, n(θ^v^nθ0)\sqrt{n}(\hat{\theta}^{\hat{v}_{n}}-\theta_{0}) converges in distribution to the same limit as n(θ^vθ0)\sqrt{n}(\hat{\theta}^{v}-\theta_{0}).

Proof.
n(Mnv^n(θ)Mv^n(θ))\displaystyle\sqrt{n}(M^{\hat{v}_{n}}_{n}(\theta)-M^{\hat{v}_{n}}(\theta)) (56)
=\displaystyle= n(Mnv(θ)Mv(θ)+[[Mnv^n(θ)Mnv(θ)][Mv^n(θ)Mv(θ)]])\displaystyle\sqrt{n}(M^{v}_{n}(\theta)-M^{v}(\theta)+\big{[}[M^{\hat{v}_{n}}_{n}(\theta)-M^{v}_{n}(\theta)]-[M^{\hat{v}_{n}}(\theta)-M^{v}(\theta)]\big{]}) (57)
=\displaystyle= n(Mnv(θ)Mv(θ))+n(nP)mθv^nv\displaystyle\sqrt{n}(M^{v}_{n}(\theta)-M^{v}(\theta))+\sqrt{n}(\mathbb{P}_{n}-P)m^{\hat{v}_{n}-v}_{\theta} (58)
=\displaystyle= n(Mnv(θ)Mv(θ))+oP(1)\displaystyle\sqrt{n}(M^{v}_{n}(\theta)-M^{v}(\theta))+o_{P}(1) (59)

since

n(nP)mθv^nv\displaystyle\sqrt{n}(\mathbb{P}_{n}-P)m^{\hat{v}_{n}-v}_{\theta} =n1ni=1nj=1J(v^n;ijvij)lij(θ)\displaystyle=\sqrt{n}\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{J}(\hat{v}_{n;ij}-v_{ij})l_{ij}(\theta) (60)
n1ni=1nj=1J|lij(θ)|suptsupp(𝐗,W\scaleobj0.75Σ)|v^n(t)v(t)|\displaystyle\leq\sqrt{n}\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{J}|l_{ij}(\theta)|~{}\text{sup}_{t\in\text{supp}(\mathbf{X},W_{\scaleobj{0.75}{\Sigma}})}|\hat{v}_{n}(t)-v(t)| (61)
p0\displaystyle\stackrel{{\scriptstyle p}}{{\rightarrow}}0 (62)

Note that arg maxθΘMv^n(θ)=arg maxθΘMv(θ)=θ0\text{arg max}_{\theta\in\Theta}M^{\hat{v}_{n}}(\theta)=\text{arg max}_{\theta\in\Theta}M^{v}(\theta)=\theta_{0} by theorem 1. ∎

12 SI: Optimization details

12.1 Reparametrization of barrier subproblem

Letting θ\theta^{\star} indicate the unknown parameters in our model after reparametrizing 𝐩\mathbf{p} and 𝐩~\tilde{\mathbf{p}} as 𝝆\boldsymbol{\rho} and 𝝆~\tilde{\boldsymbol{\rho}}, we now have the following unconstrained minimization problem:

$$\operatorname{arg\,min}_{\theta^{\star}}\; f_{n}(\theta^{\star})+\frac{1}{t^{(r)}}\Bigg[\sum_{k}\Bigg(\sum_{j=1}^{J-1}-\rho_{kj}+J\log\Big[1+\sum_{j=1}^{J-1}\exp(\rho_{kj})\Big]\Bigg)+\sum_{\tilde{k}}\Bigg(\sum_{j=1}^{J-1}-\tilde{\rho}_{\tilde{k}j}+J\log\Big[1+\sum_{j=1}^{J-1}\exp(\tilde{\rho}_{\tilde{k}j})\Big]\Bigg)\Bigg]$$   (63–64)
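The bracketed penalty terms are exactly the log-barrier $-\sum_{j=1}^{J}\log p_{kj}$ rewritten in the unconstrained parameters, assuming the softmax-style reparametrization $\rho_{kj}=\log(p_{kj}/p_{kJ})$ with taxon $J$ as reference. A quick numerical check of this identity (a sketch, not the package's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
J = 5
rho = rng.normal(size=J - 1)  # unconstrained parameters for one row of p

# map back to the simplex: p_j = exp(rho_j)/(1 + sum exp), p_J = 1/(1 + sum exp)
denom = 1.0 + np.exp(rho).sum()
p = np.append(np.exp(rho) / denom, 1.0 / denom)

barrier_p = -np.log(p).sum()  # log-barrier for p_kj > 0, written in p
barrier_rho = -rho.sum() + J * np.log(1.0 + np.exp(rho).sum())  # form above
```

Both quantities agree to machine precision, so the subproblem can be solved without explicit simplex constraints.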

12.2 Barrier algorithm

Barrier Algorithm

1. Initialize the penalty parameter $t$ at starting value $t^{(0)}$ and the parameters $\theta^{\star}$ at $\theta^{\star(0)}$. Set iteration $r=0$.

2. Using the current value $t^{(r)}$ of $t$ and starting at parameter estimate $\theta^{\star(r)}$, solve barrier subproblem $r$ (given in the main text) via Fisher scoring. Denote the solution of this subproblem $\theta^{\star(r+1)}$ and set $t^{(r+1)}=at^{(r)}$ for a prespecified $a>1$.

3. If $t^{(r+1)}>t_{\text{max}}$ for prespecified $t_{\text{max}}$, return $\theta^{\star(r+1)}$. Otherwise set iteration $r=r+1$ and return to step 2.
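As an illustration only, this schedule can be sketched on a toy objective over the simplex, with finite-difference gradient descent standing in for the Fisher-scoring inner solver (the function names, toy objective, and tuning constants below are our own, not the paper's software):

```python
import numpy as np

def to_simplex(rho):
    # inverse of the reparametrization rho_j = log(p_j / p_J)
    denom = 1.0 + np.exp(rho).sum()
    return np.append(np.exp(rho) / denom, 1.0 / denom)

def barrier_objective(rho, f, t):
    J = rho.size + 1
    # objective plus (1/t) times the log-barrier, written in rho
    return f(to_simplex(rho)) + (1.0 / t) * (
        -rho.sum() + J * np.log(1.0 + np.exp(rho).sum())
    )

def solve_subproblem(rho, f, t, steps=1500, lr=0.2, h=1e-6):
    # finite-difference gradient descent stands in for Fisher scoring
    for _ in range(steps):
        base = barrier_objective(rho, f, t)
        grad = np.array([
            (barrier_objective(rho + h * np.eye(rho.size)[j], f, t) - base) / h
            for j in range(rho.size)
        ])
        rho = rho - lr * grad
    return rho

def barrier_method(f, J, t0=1.0, a=10.0, t_max=1e6):
    rho, t = np.zeros(J - 1), t0
    while t <= t_max:            # steps 2-3: solve subproblem, then t <- a * t
        rho = solve_subproblem(rho, f, t)
        t *= a
    return to_simplex(rho)

target = np.array([0.5, 0.3, 0.2])
p_hat = barrier_method(lambda p: ((p - target) ** 2).sum(), J=3)
```

Each subproblem is warm-started at the previous solution, so as $t$ grows the barrier's influence shrinks and the iterates approach the constrained minimizer.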

12.3 Constrained Newton within Augmented Lagrangian Algorithm

We calculate update steps from $\mathcal{L}_{k}$, given in Section 4 of the main text, as follows:

Constrained Newton within Augmented Lagrangian Algorithm

1. Initialize penalty coefficients $\nu$ and $\mu$ at values $\nu^{(0)}$ and $\mu^{(0)}>0$.

2. Calculate proposed update $\mathbf{p}^{\text{update}}_{k}$ via nonnegative least squares on $\mathcal{L}_{k}$ using the current values of $\nu$ and $\mu$.

3. If $|\sum_{j=1}^{J}p^{\text{update}}_{kj}-1|<\delta$ for a prespecified tolerance $\delta>0$, set the update direction $\mathbf{s}_{k}:=\mathbf{p}^{\text{update}}_{k}-\mathbf{p}_{k}$ and proceed to step 4. Otherwise, update $\nu$ and $\mu$ via the algorithm given in Bazaraa (2006, p. 496) and return to step 2.

4. Perform a line search in direction $\mathbf{s}_{k}$ to determine the updated parameter value $\mathbf{p}_{k}^{\text{updated}}:=\mathbf{p}_{k}+\epsilon\mathbf{s}_{k}$ that decreases the objective $f_{n}(\mathbf{p}_{k})$ for some $0<\epsilon\leq 1$.
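A minimal sketch of the augmented Lagrangian idea on a toy quadratic objective, with a projected-gradient inner solve standing in for the nonnegative least squares step and omitting the direction-plus-line-search refinement (all names and tolerances here are illustrative):

```python
import numpy as np

def inner_solve(y, nu, mu, iters=1000):
    # projected gradient on the augmented Lagrangian stands in for the
    # nonnegative least squares inner solve; p >= 0 is kept by clipping
    p = np.clip(y, 0.0, None)
    lr = 1.0 / (1.0 + mu * y.size)  # inverse Lipschitz constant of the gradient
    for _ in range(iters):
        grad = (p - y) + (nu + mu * (p.sum() - 1.0))
        p = np.clip(p - lr * grad, 0.0, None)
    return p

def project_to_simplex_al(y, nu=0.0, mu=10.0, delta=1e-10, max_outer=100):
    """Minimize 0.5||p - y||^2 over the simplex: nonnegativity is handled
    inside the inner solve, the sum-to-one constraint via multiplier nu."""
    p = inner_solve(y, nu, mu)
    for _ in range(max_outer):
        gap = p.sum() - 1.0
        if abs(gap) < delta:   # step 3: constraint met within tolerance
            break
        nu += mu * gap         # first-order multiplier update, then re-solve
        p = inner_solve(y, nu, mu)
    return p

p = project_to_simplex_al(np.array([0.6, 0.3, -0.2]))
```

The multiplier update drives the sum-to-one violation to zero over the outer iterations while the inner solve enforces nonnegativity exactly.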

12.4 Quadratic approximation to $f_{n}(\mathbf{p}_{k})$

In Section 4 of the main text, we specify $\mathcal{L}_{k}$ in terms of a quadratic approximation $Q_{k}^{(t)}$ to the objective $f_{n}(\mathbf{p}_{k})$. In practice we construct $Q_{k}^{(t)}$ as a slightly modified Taylor expansion of $f_{n}(\cdot)$ around the current value $\mathbf{p}_{k}^{(t)}$: we use the gradient of $f_{n}$ with respect to $\mathbf{p}_{k}$ in the first-order term, and in the second-order term, in place of the Hessian, we use ($-1$ times) the Fisher information matrix in $\mathbf{p}_{k}$, regularized (for numerical stability) by adding the magnitude of the gradient times an identity matrix.
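For concreteness, one way to write this construction, assuming $f_{n}$ is a negative log-likelihood so that the Fisher information $\mathcal{I}(\mathbf{p}_{k}^{(t)})$ approximates its Hessian (signs below follow that convention and may differ from the paper's), is

```latex
Q_{k}^{(t)}(\mathbf{p}_{k})
  = f_{n}(\mathbf{p}_{k}^{(t)})
  + \mathbf{g}^{T}(\mathbf{p}_{k}-\mathbf{p}_{k}^{(t)})
  + \tfrac{1}{2}(\mathbf{p}_{k}-\mathbf{p}_{k}^{(t)})^{T}
    \big[\mathcal{I}(\mathbf{p}_{k}^{(t)}) + \lVert\mathbf{g}\rVert\,\mathbf{I}\big]
    (\mathbf{p}_{k}-\mathbf{p}_{k}^{(t)}),
\qquad \mathbf{g} := \nabla_{\mathbf{p}_{k}} f_{n}(\mathbf{p}_{k}^{(t)})
```

The $\lVert\mathbf{g}\rVert\,\mathbf{I}$ term shrinks as the gradient shrinks, so the regularization diminishes near a stationary point.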

13 SI: Analysis of Costea et al. (2017) data

13.1 Details of model specification

Costea et al. (2017) published two flow cytometric readings for every species in the synthetic community with the exception of V. cholerae, for which only one reading was published. For all taxa save V. cholerae, we take the mean reading as our observation, and we include the resulting vector of readings, augmented by the single reading for V. cholerae, as a row in $\mathbf{W}$. We anticipate that our use of mean readings represents a fairly small loss of information, as flow cytometric readings did not vary substantially within taxon. However, in a similar setting where multiple complete sets of flow cytometric readings across all taxa were available, we could include each set as a row of $\mathbf{W}$ to capture variability in these measurements.

To estimate detection effects relative to flow cytometry measurements, we specify $X_{1}=\mathbf{0}$. For $i\geq 2$, $\mathbf{X}_{i\cdot}=[1~\mathbf{1}_{Q}~\mathbf{1}_{W}]$, where $\mathbf{1}_{Q}$ is an indicator for sample $i$ being processed according to protocol Q, and similarly for $\mathbf{1}_{W}$.
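For concreteness, a design matrix of this form could be assembled as follows (the protocol labels and sample count here are invented for illustration; protocol H serves as the baseline):

```python
import numpy as np

# hypothetical protocol labels for six sequenced samples (H is the reference)
protocols = ["H", "H", "Q", "Q", "W", "W"]

# row 1 (the flow cytometry measurement) is all zeros; each remaining row
# gets an intercept plus indicators for protocols Q and W
rows = [[0, 0, 0]] + [[1, int(p == "Q"), int(p == "W")] for p in protocols]
X = np.array(rows)
```

The all-zero first row encodes the flow cytometry readings as the reference against which protocol-specific detection effects are measured.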

13.2 Cross-validation design

We construct folds for our 10-fold cross-validation on the Costea et al. (2017) data so that each fold includes all observations for a given specimen, with the exception of samples A and B, which we group together in a single fold. For each fold, we fit a model in which all observations in all other folds, along with the flow cytometry readings, are treated as arising from a common specimen (as in fact they do, save for the flow cytometry readings, which were taken on the specimens mixed to create the mock spike-in). We model each sample in the held-out fold as arising from a distinct specimen of unknown composition, allowing our model to estimate a different relative abundance profile for distinct samples processed according to different protocols.

13.3 Model summaries

Table 1: Point estimates and 95% bootstrap confidence intervals for protocol-specific detection effects $\boldsymbol{\beta}$ (with reference taxon Y. pseudotuberculosis) estimated from Costea et al. (2017) data
Taxon Protocol H Protocol Q Protocol W
B. hansenii -1.61 (-2.00 – -1.16) -1.55 (-1.75 – -1.31) -0.08 (-0.16 – 0.00)
C. difficile -0.18 (-0.30 – 0.01) -0.57 (-0.79 – -0.41) 1.23 (1.18 – 1.28)
C. perfringens 3.38 (3.27 – 3.57) 2.48 (2.31 – 2.62) 4.05 (4.03 – 4.07)
C. saccharolyticum -0.19 (-0.23 – -0.16) -0.01 (-0.12 – 0.10) -0.10 (-0.13 – -0.06)
F. nucleatum 2.37 (2.28 – 2.44) 0.14 (-0.16 – 0.42) 2.11 (2.05 – 2.16)
L. plantarum -2.62 (-2.96 – -2.12) 0.72 (0.60 – 0.93) 0.60 (0.56 – 0.63)
P. melaninogenica 4.17 (4.12 – 4.20) 3.88 (3.82 – 4.04) 4.25 (4.23 – 4.27)
S. enterica 2.49 (2.45 – 2.51) 2.74 (2.64 – 2.79) 2.48 (2.46 – 2.51)
V. cholerae 1.54 (1.50 – 1.56) 0.90 (0.78 – 0.99) 1.48 (1.44 – 1.50)

Table 1 provides point estimates and marginal 95% confidence intervals for the detection effects for each of protocols H, Q, and W estimated via the full model described above. This model was fit with reference taxon Y. pseudotuberculosis (i.e., under the constraint that the column of $\boldsymbol{\beta}$ corresponding to this taxon consists of 0 entries). Hence we interpret estimates in this table in terms of the degree of over- or under-detection relative to Y. pseudotuberculosis. For example, we estimate that, under repeated measurement via protocol H of samples consisting of 1:1 mixtures of B. hansenii and Y. pseudotuberculosis, the mean MetaPhlAn2 estimate of the relative abundance of B. hansenii will be $\exp(-1.61)\approx 0.20$ times as large as the mean estimate of the relative abundance of Y. pseudotuberculosis.
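Under this multiplicative interpretation of detection effects, the expected observed composition of a known mixture can be sketched as follows (`expected_observed` is a hypothetical helper, not part of our software):

```python
import numpy as np

def expected_observed(p_true, beta):
    # taxon j is over/under-detected by a factor exp(beta_j) relative to the
    # reference taxon, whose beta is constrained to 0
    w = p_true * np.exp(beta)
    return w / w.sum()

# 1:1 mixture of B. hansenii (estimated beta = -1.61 under protocol H)
# and the reference taxon Y. pseudotuberculosis (beta = 0 by constraint)
obs = expected_observed(np.array([0.5, 0.5]), np.array([-1.61, 0.0]))
ratio = obs[0] / obs[1]  # equals exp(-1.61), i.e. roughly 0.20
```

The ratio of observed relative abundances recovers $\exp(\beta_{j})$ exactly because the normalization cancels.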

14 SI: Analysis of Karstens et al. (2019) data

14.1 Preprocessing

We process raw read data reported by Karstens et al. (2019) using the DADA2 R package (version 1.20.0) (Callahan et al., 2016). We infer amplicon sequence variants using the dada function with option ‘pooled = TRUE’ and assign taxonomy with the assignSpecies function using a SILVA v138 training dataset downloaded from https://benjjneb.github.io/dada2/training.html (Quast et al., 2012).

14.2 Model Specification

We conduct a three-fold cross-validation of a model containing both contamination and detection effects. For each held-out fold $r$, letting $\mathbf{e}_{r}:=(\mathbf{1}_{[\text{sample 1 in fold }r]},\dots,\mathbf{1}_{[\text{sample }n\text{ in fold }r]})^{T}$, $\mathbf{d}=(3^{0},\dots,3^{8})^{T}$, and $\circ$ denote element-wise multiplication, we specify the model for this fold with

$$\begin{aligned}
\mathbf{Z} &= \begin{bmatrix}\mathbf{1}-\mathbf{e}_{r} & \mathbf{e}_{r}\end{bmatrix}\\
\tilde{\mathbf{Z}} &= \big[\mathbf{d}\circ(\mathbf{1}-\mathbf{e}_{r})\big]+\exp(\tilde{\alpha})\big[\mathbf{d}\circ\mathbf{e}_{r}\big]\\
\mathbf{X} &= \mathbf{1}\\
\tilde{\mathbf{X}} &= 0
\end{aligned}$$
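As an illustration, these design matrices could be constructed as below; the fold assignment and the value of $\tilde{\alpha}$ are placeholders (in the model, $\tilde{\alpha}$ is an unknown parameter):

```python
import numpy as np

n = 9
d = 3.0 ** np.arange(n)                       # dilution series d = (3^0, ..., 3^8)
fold = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])  # hypothetical 3-fold assignment
r = 0                                         # held-out fold
e_r = (fold == r).astype(float)               # indicator of membership in fold r

alpha_tilde = 0.0  # placeholder value only; estimated from data in practice

Z = np.column_stack([1.0 - e_r, e_r])         # one column per specimen
Z_tilde = d * (1.0 - e_r) + np.exp(alpha_tilde) * (d * e_r)
```

Each sample loads on exactly one of the two specimen columns of `Z`, while `Z_tilde` scales the contamination contribution by the dilution factor, adjusted by $\exp(\tilde{\alpha})$ for held-out samples.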

The relative abundance matrix $\mathbf{p}$ consists of two rows, the first of which is treated as fixed and known and contains the theoretical composition of the mock community used by Karstens et al. (2019). The second row is to be estimated from observations on samples in the held-out fold. $\tilde{\mathbf{p}}$ consists of a single row, the first 247 elements of which we treat as unknown. We fix $\tilde{p}_{248}=0$ as an identifiability constraint; identifiability problems arise here because all samples sequenced arise from the same specimen, so that for any choice of fixed $\tilde{\mathbf{p}}^{\star}$ and $\mathbf{p}$, we lack identifiability over the set $\{\tilde{\mathbf{p}}=a\tilde{\mathbf{p}}^{\star}+(1-a)\mathbf{p}:0\leq a\leq 1\}$. Briefly, we do not consider the assumption $\tilde{p}_{248}=0$ unrealistic; while distinguishing between contaminant and non-contaminant taxa is challenging in general, it is frequently not difficult to choose a single taxon that is unlikely to be a contaminant. Moreover, we anticipate that in most applied settings more than one specimen will be sequenced, so this identifiability problem will not arise.

$\boldsymbol{\beta}$ consists of a single row, the first $J-8=240$ elements of which (corresponding to contaminant taxa) are treated as fixed and known parameters equal to 0, as we cannot estimate detection efficiencies for contaminant taxa. The following 7 elements of $\boldsymbol{\beta}$ are treated as unknown parameters to be estimated from data, and $\beta_{248}$ is set equal to 0 as an identifiability constraint. $\tilde{\boldsymbol{\gamma}}$ is specified as a single unknown parameter in $\mathbb{R}$.

The full model fit without detection efficiencies is specified by treating $\boldsymbol{\beta}$ as fixed and known with all elements equal to 0. We treat all samples as arising from the same specimen, so $\mathbf{Z}=\mathbf{1}$, $\tilde{\mathbf{Z}}=\mathbf{d}$, and $\mathbf{p}$ consists of a single row treated as an unknown relative abundance. $\mathbf{X}$, $\tilde{\mathbf{X}}$, $\tilde{\mathbf{p}}$, and $\tilde{\gamma}$ are specified as above.

14.3 Additional summaries

Table 2: Entries of $\hat{\boldsymbol{\beta}}$ (reference taxon L. fermentum) estimated from Karstens et al. (2019) data
Taxon Estimate
P. aeruginosa -1.29
E. coli -0.20
S. enterica -0.48
E. faecium -2.09
S. aureus -3.05
L. monocytogenes -1.60
B. halotolerans -0.74

For each taxon for which $\beta_{j}$ is identifiable (i.e., taxa in the mock community), our model produces a point estimate $\hat{\beta}_{j}$, as shown in Table 2. (The reference taxon, L. fermentum, for which we enforce the identifiability constraint $\beta_{j}=0$, is excluded.) On the basis of this model, we estimate that in an equal mixture of E. coli and our reference taxon L. fermentum, sequenced by the method used by Karstens et al. (2019), we expect on average to observe $\exp(-0.20)\approx 0.82$ E. coli reads for each L. fermentum read. In an equal mixture of S. aureus and L. fermentum similarly sequenced, we expect on average to observe $\exp(-3.05)\approx 0.047$ S. aureus reads for each L. fermentum read.

15 SI: Simulation results based on Brooks et al. (2015) data

15.1 Figures

Figure 5 summarizes the performance of cross-validated models fit to the Brooks et al. (2015) data. Briefly, we observe very similar performance for the unweighted and weighted estimators, both in terms of root mean square error (RMSE) and the proportion of elements of $\mathbf{p}$ estimated to be 0. RMSE is generally smaller for models fit on larger training sets, but it does not approach zero as training set size increases. We also observe a strong relationship between RMSE and theoretical true relative abundance, which likely reflects a strong mean-variance relationship in the data.

We also generally observe a greater proportion of elements of $\mathbf{p}$ estimated to be zero with increasing training set size, although the degree to which this occurs depends on the taxon. Weighting does not appear to have a large impact on this measure of predictive performance.

Figure 5: Predictive performance of models fit on Brooks et al. (2015) data. The upper row shows root mean square error of relative abundance estimates by estimator, taxon, number of samples treated as known per plate, and true relative abundance. True relative abundance is given on the x-axis and root mean square error is plotted on the y-axis; for concision, true relative abundances equal to 1 are plotted at 0. Each column pane contains estimates for a different taxon, estimator is indicated with line type (solid for Poisson and dashed for weighted Poisson), and number of samples known per plate is indicated by color. In the lower row, the proportion of elements of $\mathbf{P}$ truly equal to zero that are estimated to be equal to zero is plotted on the y-axis of each pane, and the x-axis gives the number of samples per plate treated as known. Taxon and estimator are represented as in the upper row, and the proportion of nonzero elements of $\mathbf{W}$ corresponding to zero elements of $\mathbf{P}$ for each taxon is plotted as a dotted horizontal line.

16 SI: Simulations with Artificial Data

Figure 6 summarizes empirical coverage of marginal bootstrap 95% confidence intervals for elements $p_{kj}$ of $\mathbf{p}$ obtained from the simulations described in the main text. As discussed in the main text, coverage is high for $p_{kj}=0$ but falls when $p_{kj}>0$. Unsurprisingly, we observe higher coverage at larger sample sizes. Coverage of intervals based on reweighted estimators appears to be slightly lower than for unweighted estimators when data are Poisson-distributed (left-hand columns), but intervals using reweighted estimators substantially outperform unweighted intervals when data are negative binomially distributed, particularly when the number of taxa $J$ is larger.

Figure 6: Empirical coverage of marginal bootstrap 95% confidence intervals for $p_{kj}$ ($k=3,4$) versus the true value of $p_{kj}$. Coverage for tests based on unweighted and reweighted estimators is shown in grey and black, respectively. Sample size is indicated by line type (solid for $n=16$ and dotted for $n=48$). Columns give the conditional distribution of the data (Poisson or negative binomial) and the number of taxa $J$. Rows specify the row of $\mathbf{p}$ for which coverages are computed.

References

  • Andrews, D.W. (2000). Inconsistency of the bootstrap when a parameter is on the boundary of the parameter space. Econometrica, 399–405.
  • Bazaraa, M.S. (2006). Nonlinear Programming: Theory and Algorithms.
  • Brooks, J.P. et al. (2015). The truth about metagenomics: quantifying and counteracting bias in 16S rRNA studies. BMC Microbiology 15(1), 1–14.
  • Callahan, B.J. et al. (2016). DADA2: high-resolution sample inference from Illumina amplicon data. Nature Methods 13(7), 581.
  • Costea, P.I. et al. (2017). Towards standards for human fecal sample processing in metagenomic studies. Nature Biotechnology 35(11), 1069.
  • Dai, Z. et al. (2019). Batch effects correction for microbiome data with Dirichlet-multinomial regression. Bioinformatics 35(5), 807–814.
  • Davis, N.M. et al. (2018). Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data. Microbiome 6(1), 1–14.
  • Dümbgen, L. (1993). On nondifferentiable functions and the bootstrap. Probability Theory and Related Fields 95(1), 125–140.
  • Fernandes, A.D. et al. (2014). Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome 2, 1–13.
  • Gagnon-Bartsch, J.A. and Speed, T.P. (2012). Using control genes to correct for unwanted variation in microarray data. Biostatistics 13(3), 539–552.
  • Geyer, C.J. (1994). On the asymptotics of constrained M-estimation. The Annals of Statistics, 1993–2010.
  • Gibbons, S.M., Duvallet, C. and Alm, E.J. (2018). Correcting for batch effects in case-control microbiome studies. PLoS Computational Biology 14(4), e1006102.
  • Hinkley, D.V. (1988). Bootstrap methods. Journal of the Royal Statistical Society: Series B (Methodological) 50(3), 321–337.
  • Hornung, B.V., Zwittink, R.D. and Kuijper, E.J. (2019). Issues and current standards of controls in microbiome research. FEMS Microbiology Ecology 95(5), fiz045.
  • Ishwaran, H., James, L.F. and Zarepour, M. (2009). An alternative to the m out of n bootstrap. Journal of Statistical Planning and Inference 139(3), 788–801.
  • Johnson, W.E., Li, C. and Rabinovic, A. (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8(1), 118–127.
  • Karstens, L. et al. (2019). Controlling for contaminants in low-biomass 16S rRNA gene sequencing experiments. mSystems 4(4), e00290–19.
  • Knights, D. et al. (2011). Bayesian community-wide culture-independent microbial source tracking. Nature Methods 8(9), 761–763.
  • Leek, J.T. and Storey, J.D. (2007). Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics 3(9), e161.
  • Li, Z. et al. (2021). IFAA: robust association identification and inference for absolute abundance in microbiome analyses. Journal of the American Statistical Association 116(536), 1595–1608.
  • Love, M.I., Huber, W. and Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15, 1–21.
  • Mallick, H. et al. (2021). Multivariable association discovery in population-scale meta-omics studies. PLoS Computational Biology 17(11), e1009442.
  • Mandal, S. et al. (2015). Analysis of composition of microbiomes: a novel method for studying microbial composition. Microbial Ecology in Health and Disease 26(1), 27663.
  • Martin, B.D., Witten, D. and Willis, A.D. (2019). Modeling microbial abundances and dysbiosis with beta-binomial regression. arXiv.
  • McCullagh, P. (1983). Quasi-likelihood functions. The Annals of Statistics 11(1), 59–67.
  • McLaren, M.R., Willis, A.D. and Callahan, B.J. (2019). Consistent and correctable bias in metagenomic sequencing experiments. eLife 8.
  • Oron, A.P. and Flournoy, N. (2017). Centered isotonic regression: point and interval estimation for dose–response studies. Statistics in Biopharmaceutical Research 9(3), 258–267.
  • Quast, C. et al. (2012). The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Research 41(D1), D590–D596.
  • Shenhav, L. et al. (2019). FEAST: fast expectation-maximization for microbial source tracking. Nature Methods 16(7), 627–632.
  • Silverman, J.D. et al. (2021). Measuring and mitigating PCR bias in microbiota datasets. PLoS Computational Biology 17(7), e1009113.
  • Sims, A.H. et al. (2008). The removal of multiplicative, systematic bias allows integration of breast cancer gene expression datasets – improving meta-analysis and prediction of prognosis. BMC Medical Genomics 1(1), 1–14.
  • Truong, D.T. et al. (2015). MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nature Methods 12(10), 902–903.
  • Van der Vaart, A.W. (2000). Asymptotic Statistics, Volume 3. Cambridge University Press.
  • Zhao, N. and Satten, G.A. (2021). A log-linear model for inference on bias in microbiome studies. In Statistical Analysis of Microbiome Data, pp. 221–246. Springer.