
Asymptotically Optimal Knockoff Statistics via the Masked Likelihood Ratio

Asher Spector Department of Statistics, Stanford University, USA William Fithian Department of Statistics, UC Berkeley, USA
Abstract

In feature selection problems, knockoffs are synthetic controls for the original features. Employing knockoffs allows analysts to use nearly any variable importance measure or “feature statistic” to select features while rigorously controlling false positives. However, it is not clear which statistic maximizes power. In this paper, we argue that state-of-the-art lasso-based feature statistics often prioritize features that are unlikely to be discovered, leading to low power in real applications. Instead, we introduce masked likelihood ratio (MLR) statistics, which prioritize features according to one’s ability to distinguish each feature from its knockoff. Although no single feature statistic is uniformly most powerful in all situations, we show that MLR statistics asymptotically maximize the number of discoveries under a user-specified Bayesian model of the data. (Like all feature statistics, MLR statistics always provide frequentist error control.) This result places no restrictions on the problem dimensions and makes no parametric assumptions; instead, we require a “local dependence” condition that depends only on known quantities. In simulations and three real applications, MLR statistics outperform state-of-the-art feature statistics, including in settings where the Bayesian model is highly misspecified. We implement MLR statistics in the python package knockpy; our implementation is often faster than computing a cross-validated lasso.

1 Introduction

Given a design matrix 𝐗=(𝐗1,,𝐗p)n×p\mathbf{X}=(\mathbf{X}_{1},\dots,\mathbf{X}_{p})\in\mathbb{R}^{n\times p} and a response vector 𝐘n\mathbf{Y}\in\mathbb{R}^{n}, the task of controlled feature selection is, informally, to discover features that influence 𝐘\mathbf{Y} while controlling the false discovery rate (FDR). In this context, knockoffs (Barber and Candès,, 2015; Candès et al.,, 2018) are fake variables 𝐗~n×p\widetilde{\mathbf{X}}\in\mathbb{R}^{n\times p} which act as negative controls for the features 𝐗\mathbf{X}. Remarkably, employing knockoffs allows analysts to use nearly any machine learning model or test statistic, often known interchangeably as a “feature statistic” or “knockoff statistic,” to select features while exactly controlling the FDR. As a result, knockoffs has become popular in the analysis of genetic studies, financial data, clinical trials, and more (Sesia et al.,, 2018, 2019; Challet et al.,, 2021; Sechidis et al.,, 2021).

The flexibility of knockoffs has inspired the development of a variety of feature statistics based on penalized regression coefficients, sparse Bayesian models, random forests, neural networks, and more (see, e.g., Barber and Candès, (2015); Candès et al., (2018); Gimenez et al., (2019); Lu et al., (2018)). These feature statistics not only reflect different modeling assumptions, but more fundamentally, they estimate different quantities, including coefficient sizes, Bayesian posterior inclusion probabilities, and various other measures of variable importance. Yet there has been relatively little theoretical comparison of these methods, in large part because analyzing the power of knockoffs can be very challenging; see Section 1.4. In this work, we develop a principled approach and concrete methods for designing knockoff statistics that maximize power.

1.1 Review of model-X and fixed-X knockoffs

This section reviews the key elements of Model-X (MX) and Fixed-X (FX) knockoff methods.

Model-X (MX) knockoffs (Candès et al.,, 2018) is a method to test the hypotheses Hj:𝐗j𝐘𝐗jH_{j}:\mathbf{X}_{j}\perp\!\!\!\perp\mathbf{Y}\mid\mathbf{X}_{-j}, where 𝐗j{𝐗}j\mathbf{X}_{-j}\coloneqq\{\mathbf{X}_{\ell}\}_{\ell\neq j} denotes all features except 𝐗j\mathbf{X}_{j}, assuming that the law of 𝐗\mathbf{X} is known.111Note that this assumption can be relaxed to having a well-specified parametric model for 𝐗\mathbf{X} (Huang and Janson,, 2020), and knockoffs are known to be robust to misspecification of the law of 𝐗\mathbf{X} (Barber et al.,, 2020). Applying MX knockoffs requires three steps.

1. Constructing knockoffs. Valid MX knockoffs must satisfy two properties. First, the columns of 𝐗\mathbf{X} must be pairwise exchangeable with the corresponding columns of 𝐗~\widetilde{\mathbf{X}}, i.e. [𝐗j,𝐗~j,𝐗j,𝐗~j]=d[𝐗j,𝐗~j,𝐗~j,𝐗j][\mathbf{X}_{-j},\widetilde{\mathbf{X}}_{-j},\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}]\stackrel{{\scriptstyle\mathrm{d}}}{{=}}[\mathbf{X}_{-j},\widetilde{\mathbf{X}}_{-j},\widetilde{\mathbf{X}}_{j},\mathbf{X}_{j}] must hold for all j[p]j\in[p]. Second, we require that 𝐗~𝐘𝐗\widetilde{\mathbf{X}}\perp\!\!\!\perp\mathbf{Y}\mid\mathbf{X}, which holds if one constructs 𝐗~\widetilde{\mathbf{X}} without looking at 𝐘\mathbf{Y}. Informally, these constraints guarantee that 𝐗j,𝐗~j\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j} are “indistinguishable” under HjH_{j}. Sampling knockoffs can be challenging, but this problem is well studied (e.g., Bates et al.,, 2020).

2. Fitting feature statistics. Next, use any machine learning (ML) algorithm to fit feature importances Z=z([𝐗,𝐗~],𝐘)2pZ=z([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y})\in\mathbb{R}^{2p}, where ZjZ_{j} and Zj+pZ_{j+p} heuristically measure the “importance” of 𝐗j\mathbf{X}_{j} and 𝐗~j\widetilde{\mathbf{X}}_{j} in predicting 𝐘\mathbf{Y}. The only restriction is that swapping 𝐗j\mathbf{X}_{j} and 𝐗~j\widetilde{\mathbf{X}}_{j} must also swap the feature importances ZjZ_{j} and Zj+pZ_{j+p} without changing any of the other feature importances {Z}j,{Z+p}j\{Z_{\ell}\}_{\ell\neq j},\{Z_{\ell+p}\}_{\ell\neq j}. This restriction is satisfied by most ML algorithms, such as the lasso or various neural networks (Lu et al.,, 2018).

Given ZZ, we define the feature statistics W=w([𝐗,𝐗~],𝐘)pW=w([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y})\in\mathbb{R}^{p} via Wj=f(Zj,Zj+p)W_{j}=f(Z_{j},Z_{j+p}) where f(x,y)=f(y,x)f(x,y)=-f(y,x) is any antisymmetric function. E.g., the lasso coefficient difference (LCD) statistic sets Wj=|Zj||Zj+p|W_{j}=|Z_{j}|-|Z_{j+p}|, where ZjZ_{j} and Zj+pZ_{j+p} are coefficients from a lasso fit on [𝐗,𝐗~],𝐘[\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y}. Intuitively, when WjW_{j} is positive, this suggests that 𝐗j\mathbf{X}_{j} is more important than 𝐗~j\widetilde{\mathbf{X}}_{j} and thus is evidence against the null. Indeed, Steps 1-2 guarantee that the signs of the null {Wj}j=1p\{W_{j}\}_{j=1}^{p} are i.i.d. random signs.
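To make this construction concrete, the following minimal numpy sketch (an illustration, not the paper's implementation) combines fitted importances Z into feature statistics W using the difference-of-absolute-values choice of f, i.e., the LCD statistic.

```python
import numpy as np

def feature_statistics(Z):
    """Combine importances Z = (Z_1, ..., Z_2p) into W = (W_1, ..., W_p) via
    W_j = f(Z_j, Z_{j+p}) with the antisymmetric choice f(x, y) = |x| - |y| (LCD)."""
    Z = np.asarray(Z, dtype=float)
    p = Z.shape[0] // 2
    # Swapping X_j with its knockoff swaps Z_j and Z_{j+p}, which flips the sign of
    # W_j and leaves the other coordinates unchanged, as the validity condition requires.
    return np.abs(Z[:p]) - np.abs(Z[p:])
```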

3. Make rejections. Define the data-dependent threshold Tinf{t>0:#{j:Wjt}+1#{Wjt}q}T\coloneqq\inf\left\{t>0:\frac{\#\{j:W_{j}\leq-t\}+1}{\#\{W_{j}\geq t\}}\leq q\right\}, where inf\inf\emptyset\coloneqq\infty. Then, reject S{j:WjT}S\coloneqq\{j:W_{j}\geq T\}, which guarantees finite-sample FDR control at level q(0,1)q\in(0,1). Note this result does not require any assumptions about the law of 𝐘𝐗\mathbf{Y}\mid\mathbf{X}.
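For illustration, the short numpy sketch below computes this data-dependent threshold and the resulting rejection set. The max with one in the denominator is a standard implementation convenience that avoids dividing by zero; packages such as knockpy already provide this step, so the sketch is only for exposition.

```python
import numpy as np

def knockoff_threshold(W, q):
    """T = inf{t > 0 : (#{j : W_j <= -t} + 1) / #{j : W_j >= t} <= q}, with inf(empty set) = infinity."""
    for t in np.sort(np.abs(W[W != 0])):  # it suffices to scan t over the nonzero |W_j|
        fdp_hat = (np.sum(W <= -t) + 1) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf

def knockoff_rejections(W, q):
    """Rejection set S = {j : W_j >= T}."""
    return np.where(W >= knockoff_threshold(W, q))[0]
```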

Theorem 1.1 (Candès et al., (2018)).

Let PP^{\star} denote the unknown joint law of (𝐗,𝐘)(\mathbf{X},\mathbf{Y}), and suppose the law of 𝐗PX\mathbf{X}\sim P_{X}^{\star} is known, allowing one to construct valid knockoffs 𝐗~\tilde{\mathbf{X}}. 222Typically, one assumes that the observations are i.i.d. to construct valid knockoffs, but the i.i.d. assumption is not necessary as long as 𝐗~\widetilde{\mathbf{X}} are valid knockoffs. Then for any feature statistic ww,

FDR𝔼P[|S0|1|S|]q,\mathrm{FDR}\coloneqq\mathbb{E}_{P^{\star}}\left[\frac{|S\cap\mathcal{H}_{0}|}{1\vee|S|}\right]\leq q,

where 0={j[p]:Hj is true}\mathcal{H}_{0}=\{j\in[p]:H_{j}\text{ is true}\} is the set of nulls under PP^{\star}.

Fixed-X (FX) knockoffs (Barber and Candès,, 2015) treats 𝐗\mathbf{X} as nonrandom and yields exact FDR control under the Gaussian linear model 𝐘𝐗𝒩(𝐗β,σ2In)\mathbf{Y}\mid\mathbf{X}\sim\mathcal{N}(\mathbf{X}\beta,\sigma^{2}I_{n}). Fitting FX knockoffs is identical to the steps above with two exceptions:

  1. FX knockoffs need not satisfy the constraints in Step 1: instead, 𝐗~\widetilde{\mathbf{X}} must satisfy (i) 𝐗~T𝐗~=𝐗T𝐗\widetilde{\mathbf{X}}^{T}\widetilde{\mathbf{X}}=\mathbf{X}^{T}\mathbf{X} and (ii) 𝐗~T𝐗=𝐗T𝐗Δ\widetilde{\mathbf{X}}^{T}\mathbf{X}=\mathbf{X}^{T}\mathbf{X}-\Delta, for some diagonal matrix Δ\Delta satisfying 2𝐗T𝐗Δ2\mathbf{X}^{T}\mathbf{X}\succ\Delta.

  2. The feature importances ZZ can only depend on 𝐘\mathbf{Y} through [𝐗,𝐗~]T𝐘[\mathbf{X},\widetilde{\mathbf{X}}]^{T}\mathbf{Y}, which permits the use of many test statistics, but not all (for example, this prohibits the use of cross-validation).

Our theory applies to both MX and FX knockoffs, but we often focus on the MX context for brevity.

1.2 Theoretical problem statement

This section defines two types of optimal knockoff statistics: oracle statistics, which maximize the expected number of discoveries (ENDisc) for the true (unknown) data distribution PP^{\star}, and Bayes-optimal statistics, which maximize ENDisc  with respect to a prior distribution over PP^{\star}. We focus on the expected number of discoveries since it greatly simplifies the analysis and all feature statistics provably control the frequentist FDR anyway. However, Section 3.4 extends our analysis to consider the expected number of true discoveries.

Let Sw[p]S_{w}\subset[p] denote the discovery set using feature statistic ww on data ([𝐗,𝐗~],𝐘)([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y}). An oracle statistic maximizes the expected number of discoveries under PP^{\star} as defined below:

ENDisc(w)𝔼P[|Sw|].\texttt{ENDisc}^{\star}(w)\coloneqq\,\,\mathbb{E}_{P^{\star}}\left[|S_{w}|\right]. (1.1)

Next, let 𝒫={P(θ):θΘ}\mathcal{P}=\{P^{(\theta)}:\theta\in\Theta\} denote a model class of potential distributions for (𝐗,𝐘)(\mathbf{X},\mathbf{Y}) and let π:Θ0\pi:\Theta\to\mathbb{R}_{\geq 0} denote a prior density over 𝒫\mathcal{P}.333We implicitly assume all elements of 𝒫\mathcal{P} are consistent with the core assumptions of MX/FX knockoffs (see Section 1.1). For example, when employing MX knockoffs, all P(θ)𝒫P^{(\theta)}\in\mathcal{P} should specify the correct marginal law for XX. Although this is not necessary for our theoretical results, it is necessary for the computational techniques in Section 4. A Bayes-optimal statistic maximizes the average-case expected number of discoveries with respect to π\pi:

ENDiscπ(w)Θ𝔼P(θ)[|Sw|]π(θ)𝑑θ𝔼Pπ[|Sw|],\texttt{ENDisc}^{\pi}(w)\coloneqq\int_{\Theta}\mathbb{E}_{P^{(\theta)}}[|S_{w}|]\pi(\theta)d\theta\coloneqq\mathbb{E}_{P^{\pi}}[|S_{w}|], (1.2)

where above, PπP^{\pi} denotes the mixture distribution which first samples a parameter θΘ\theta^{\star}\in\Theta according to the prior π\pi and then samples (𝐗,𝐘)θP(θ)(\mathbf{X},\mathbf{Y})\mid\theta^{\star}\sim P^{(\theta^{\star})}. We refer to PπP^{\pi} as a “Bayesian model,” and we give a default choice of PπP^{\pi} (based on sparse generalized additive models) in Section 4.2. Our paper introduces statistics mlroracle\mathrm{mlr}^{\mathrm{oracle}} and mlrπ\mathrm{mlr}^{\pi} that asymptotically maximize ENDisc\texttt{ENDisc}^{\star} and ENDiscπ\texttt{ENDisc}^{\pi}, respectively.

While introducing a prior distribution may seem a strong assumption to some readers, Bayesian models are routinely used in applications where knockoffs are commonly applied, such as genetic fine-mapping (e.g., Guan and Stephens,, 2011; Weissbrod et al.,, 2020). Furthermore, in simulations and real applications, our approach yields significant power gains over pre-existing approaches even when the prior is highly misspecified. Lastly, we emphasize that Theorem 1.1 guarantees that using mlrπ\mathrm{mlr}^{\pi} as a feature statistic guarantees frequentist FDR control under the true law PP^{\star} of the data, even if PP^{\star} is not a member of 𝒫\mathcal{P}—and in this Type I error result, the conditional independence null hypotheses H1,,HpH_{1},\ldots,H_{p} are defined nonparametrically with respect to the unknown PP^{\star}, not with respect to PπP^{\pi}.

1.3 Contribution and overview of results

This paper develops masked likelihood ratio (MLR) statistics, a class of feature statistics that are asymptotically optimal, computationally efficient, and powerful in applications. To derive these statistics, we reformulate MX knockoffs as a guessing game on masked data D=(𝐘,{𝐗j,𝐗~j}j=1p),D=(\mathbf{Y},\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}_{j=1}^{p}), where the notation {𝐗j,𝐗~j}\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\} denotes an unordered set.444For brevity, this section only presents results for MX knockoffs. See Section 3 for analogous results for the FX case. After observing DD, the analyst must do as follows:

  • For each j[p]j\in[p], the analyst must produce a guess 𝐗^jn\widehat{\mathbf{X}}_{j}\in\mathbb{R}^{n} of the value of the feature 𝐗j\mathbf{X}_{j} based on DD. Note that given DD, 𝐗^j{𝐗j,𝐗~j}\widehat{\mathbf{X}}_{j}\in\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\} takes one of two values.

  • The analyst must then assign an order to their pp guesses, ideally from most to least promising.

  • The analyst may make kk discoveries if roughly kk of their first (1+q)k(1+q)k guesses are correct (according to the order they specify), where qq is the FDR level. Here, guess jj is “correct” if 𝐗^j=𝐗j\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j}.

We show that to maximize the expected number of discoveries, an asymptotically optimal strategy is:

  • For each j[p]j\in[p], guess the value 𝐗^j{𝐗j,𝐗~j}\widehat{\mathbf{X}}_{j}\in\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\} which is conditionally more likely given DD (see below).

  • Order the guesses in descending order of the probability that each guess is correct.

In the traditional language of knockoffs, this corresponds to using the masked data to compute the log-likelihood ratio between the two possible values of 𝐗j\mathbf{X}_{j} (namely {𝐗j,𝐗~j})\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}) given DD. Precisely, let PπP^{\pi} denote a Bayesian model as defined in Section 1.2. Then for any value 𝐝=(𝐲,{𝐱j,𝐱~j}j=1p)\mathbf{d}=(\mathbf{y},\{\mathbf{x}_{j},\widetilde{\mathbf{x}}_{j}\}_{j=1}^{p}) in the support of DD and any fixed 𝐱{𝐱j,𝐱~j}\mathbf{x}\in\{\mathbf{x}_{j},\widetilde{\mathbf{x}}_{j}\}, let Pjπ(𝐱𝐝)=Pπ(𝐗j=𝐱D=𝐝)P^{\pi}_{j}(\mathbf{x}\mid\mathbf{d})=P^{\pi}(\mathbf{X}_{j}=\mathbf{x}\mid D=\mathbf{d}) denote the conditional law of 𝐗j\mathbf{X}_{j} given DD. The masked likelihood ratio (MLR) statistic is defined as

MLRjπmlrjπ([𝐗,𝐗~],𝐘)log(Pjπ(𝐗jD)Pjπ(𝐗~jD)).\mathrm{MLR}_{j}^{\pi}\coloneqq\mathrm{mlr}_{j}^{\pi}([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y})\coloneqq\log\left(\frac{P^{\pi}_{j}(\mathbf{X}_{j}\mid D)}{P^{\pi}_{j}(\widetilde{\mathbf{X}}_{j}\mid D)}\right). (1.3)

In words, the numerator plugs the observed values of 𝐗j\mathbf{X}_{j} and DD into the conditional law of 𝐗jD\mathbf{X}_{j}\mid D under PπP^{\pi}, and the denominator plugs 𝐗~j\widetilde{\mathbf{X}}_{j} and DD into the same law. Since swapping 𝐗j\mathbf{X}_{j} and 𝐗~j\widetilde{\mathbf{X}}_{j} flips the sign of MLRjπ\mathrm{MLR}_{j}^{\pi} without changing the values of {MLRπ}j\{\mathrm{MLR}_{\ell}^{\pi}\}_{\ell\neq j}, this equation defines a valid knockoff statistic. See Section 1.4 for comparison to Katsevich and Ramdas, (2020)’s (unmasked) likelihood ratio statistic.

This paper gives three arguments motivating the use of MLR statistics.

1. Intuition: the right notion of variable importance. Existing feature statistics measure many different proxies for variable importance, ranging from regression coefficients to posterior inclusion probabilities. However, Section 2 shows that popular lasso-based methods incorrectly prioritize features 𝐗j\mathbf{X}_{j} that are predictive of 𝐘\mathbf{Y} but are nearly indistinguishable from their knockoffs 𝐗~j\widetilde{\mathbf{X}}_{j}, leading to low power in real applications. In contrast to conventional variable importances, MLR statistics instead estimate whether a feature 𝐗j\mathbf{X}_{j} is distinguishable from its knockoff 𝐗~j\widetilde{\mathbf{X}}_{j}.

2. Theory: asymptotic Bayes-optimality. Section 3.3 shows that MLR statistics asymptotically maximize the number of expected discoveries under PπP^{\pi}. Namely, under technical assumptions, Theorem 3.2 shows that for any valid feature statistic ww,

ENDiscπ(mlrπ)ENDiscπ(w)+o(# of non-nulls).\texttt{ENDisc}^{\pi}(\mathrm{mlr}^{\pi})\geq\texttt{ENDisc}^{\pi}(w)+o(\text{\# of non-nulls}). (1.4)

Our result applies to arbitrarily high-dimensional asymptotic regimes and allows PπP^{\pi} to take any form—we do not assume that 𝐘𝐗\mathbf{Y}\mid\mathbf{X} follows a linear model under PπP^{\pi}. Instead, we assume the signs of the MLR statistics satisfy a local dependency condition, similar to dependency conditions often assumed on p-values (Genovese and Wasserman,, 2004; Storey et al.,, 2004; Ferreira and Zwinderman,, 2006; Farcomeni,, 2007). Our condition does not involve unknown quantities, so it can be diagnosed in practice.

Despite the Bayesian nature of this optimality result, we emphasize that MLR statistics are valid knockoff statistics. Thus, if Smlr[p]S_{\mathrm{mlr}}\subset[p] are the discoveries made by the mlrπ\mathrm{mlr}^{\pi} feature statistic, Theorem 1.1 shows that the frequentist FDR is controlled in finite samples assuming only that 𝐗~\widetilde{\mathbf{X}} are valid knockoffs:

FDR𝔼P[|Smlr0|1|Smlr|]q.\mathrm{FDR}\coloneqq\mathbb{E}_{P^{\star}}\left[\frac{|S_{\mathrm{mlr}}\cap\mathcal{H}_{0}|}{1\vee|S_{\mathrm{mlr}}|}\right]\leq q. (1.5)

3. Empirical results. We demonstrate via simulations and three real data analyses that MLR statistics are powerful in practice, even when the user-specified Bayesian model PπP^{\pi} is highly misspecified.

  • We develop concrete instantiations of MLR statistics based on uninformative (sparse) priors for generalized additive models and binary GLMs. Our Python implementation is often faster than fitting a cross-validated lasso.

  • In simulations, MLR statistics outperform other state-of-the-art feature statistics, often by wide margins. Even when PπP^{\pi} is highly misspecified, MLR statistics often nearly match the performance of the oracle which sets Pπ=PP^{\pi}=P^{\star}. Furthermore, when 𝐘\mathbf{Y} has a highly nonlinear relationship with 𝐗\mathbf{X}, MLR statistics also outperform “black-box” feature statistics based on neural networks and random forests.

  • We replicate three knockoff-based analyses of drug resistance (Barber and Candès,, 2015), financial factor selection (Challet et al.,, 2021), and RNA-seq data (Li and Maathuis,, 2019). We find that MLR statistics (with an uninformative prior) make one to ten times more discoveries than the original analyses.

Overall, our results suggest that MLR statistics can substantially increase the power of knockoffs.

1.4 Related literature

The literature contains many feature statistics, which can (roughly) be separated into three categories. First, perhaps the most common feature statistics are based on penalized regression coefficients, notably the lasso signed maximum (LSM) and lasso coefficient difference (LCD) statistics (Barber and Candès,, 2015). Indeed, these lasso-based statistics are often used in applied work (e.g., Sesia et al.,, 2019) and have received much theoretical attention (Weinstein et al.,, 2017; Fan et al.,, 2020; Ke et al.,, 2020; Weinstein et al.,, 2020; Wang and Janson,, 2021). Perhaps surprisingly, we argue that many of these statistics target the wrong notion of variable importance, leading to reduced power. Second, some works have introduced Bayesian knockoff statistics (e.g., Candès et al.,, 2018; Ren and Candès,, 2020). MLR statistics have a Bayesian flavor but take a different form than previous statistics. Furthermore, our motivation differs from those of previous works: the real innovation of MLR statistics is to estimate a masked likelihood ratio, and we mainly use a Bayesian framework to quantify uncertainty about nuisance parameters (see Section 3.2). In contrast, previous works largely motivated Bayesian statistics as a way to incorporate prior information (Candès et al.,, 2018; Ren and Candès,, 2020). That said, an important special case of MLR statistics is similar to the “BVS” statistics from Candès et al., (2018), as discussed in Section 4. Third, many feature statistics take advantage of “black-box” ML to assign variable importances (e.g., Lu et al.,, 2018; Gimenez et al.,, 2019). Empirically, our implementation of MLR statistics based on regression splines outperforms “black-box” feature statistics in Section 5.

Previous power analyses for knockoffs have largely focused on showing the consistency of coefficient-difference feature statistics (Liu and Rigollet,, 2019; Fan et al.,, 2020; Spector and Janson,, 2022) or quantifying the power of coefficient-difference feature statistics assuming 𝐗\mathbf{X} has i.i.d. Gaussian entries (Weinstein et al.,, 2017, 2020; Wang and Janson,, 2021). Ke et al., (2020) also derive a phase diagram for LCD statistics assuming 𝐗\mathbf{X} is blockwise orthogonal. Our goal is different: to show that MLR statistics are asymptotically optimal, with particular focus on settings where the asymptotic power lies strictly between 0 and 11. Furthermore, the works above exclusively focus on Gaussian linear models, whereas our analysis places no explicit restrictions on the law of 𝐘𝐗\mathbf{Y}\mid\mathbf{X} or the dimensionality of the problem. Instead, we assume the signs of the MLR statistics satisfy a local dependency condition, similar to common dependency conditions on p-values (Genovese and Wasserman,, 2004; Storey et al.,, 2004; Ferreira and Zwinderman,, 2006; Farcomeni,, 2007). However, our proof technique is novel and specific to knockoffs.

Our theory builds on Li and Fithian, (2021), who developed knockoff\star, a provably optimal oracle statistic for FX knockoffs—in fact, oracle MLR statistics are equivalent to knockoff\star for FX knockoffs. Our work also builds on Katsevich and Ramdas, (2020), who showed that unmasked likelihood statistics maximize P(Wj>0)P^{\star}(W_{j}>0). MLR statistics also have this property, although we show the stronger result that MLR statistics maximize the expected number of overall discoveries. Another key difference is that unmasked likelihood statistics are not jointly valid knockoff statistics (see Appendix D.1). Thus, unmasked likelihood statistics do not yield provable FDR control, whereas MLR statistics do. Lastly, we note that the oracle procedures derived in these two works cannot be used in practice since they depend on unknown parameters. To our knowledge, MLR statistics are the first usable knockoff statistics with explicit optimality guarantees.

1.5 Notation and outline

Notation: Let 𝐗n×p\mathbf{X}\in\mathbb{R}^{n\times p} and 𝐘n\mathbf{Y}\in\mathbb{R}^{n} denote the design matrix and response vector in a feature selection problem with nn data points and pp features. We let the non-bold versions X=(X1,,Xp)pX=(X_{1},\dots,X_{p})\in\mathbb{R}^{p} and YY\in\mathbb{R} denote the features and response for any arbitrary observation. For kk\in\mathbb{N}, define [k]{1,,k}[k]\coloneqq\{1,\dots,k\}. For any Mm×kM\in\mathbb{R}^{m\times k} and J[k]J\subset[k], MJM_{J} denotes the columns of MM corresponding to the indices in JJ. Similarly, MJM_{-J} denotes the columns of MM which do not appear in JJ, and MjM_{-j} denotes all columns except column j[k]j\in[k]. For matrices M1n×k1,M2n×k2M_{1}\in\mathbb{R}^{n\times k_{1}},M_{2}\in\mathbb{R}^{n\times k_{2}}, [M1,M2]n×(k1+k2)[M_{1},M_{2}]\in\mathbb{R}^{n\times(k_{1}+k_{2})} denotes the column-wise concatenation of M1,M2M_{1},M_{2}. InI_{n} denotes the n×nn\times n identity. Throughout, PP^{\star} denotes the true (unknown) law of 𝐗,𝐘\mathbf{X},\mathbf{Y}, and PπP^{\pi} denotes a user-specified Bayesian model of the law of 𝐗,𝐘,θ\mathbf{X},\mathbf{Y},\theta^{\star} as defined in Section 1.2.

Outline: Section 2 gives intuition explaining why popular feature statistics may have low power, using an HIV resistance dataset as motivation. Section 3 introduces MLR statistics and presents our theoretical results. Section 4 discusses computation and suggests default choices of the Bayesian model PπP^{\pi}. Section 5 compares MLR statistics to competitors via simulations. Section 6 applies MLR statistics to three real datasets. Section 7 discusses future directions.

2 Intuition and motivation from an HIV drug resistance dataset

2.1 Intuition: what makes knockoffs powerful?

Given a vector of knockoff statistics WpW\in\mathbb{R}^{p}, the number of discoveries is determined as follows:

  • Step 1: Let σ:[p][p]\sigma:[p]\to[p] denote the permutation such that |Wσ(1)||Wσ(2)||Wσ(p)||W_{\sigma(1)}|\geq|W_{\sigma(2)}|\geq\dots\geq|W_{\sigma(p)}|.

  • Step 2: Let kk be the largest integer such that at least (k+1)/(1+q)\left\lceil(k+1)/(1+q)\right\rceil of the kk feature statistics Wσ(1),,Wσ(k)W_{\sigma(1)},\dots,W_{\sigma(k)} with the largest absolute values have positive signs. Then the analyst may discover the features corresponding to the positive signs among Wσ(1),,Wσ(k)W_{\sigma(1)},\dots,W_{\sigma(k)}.

The procedure above is equivalent to using the “data-dependent threshold” from Section 1.1 (see Barber and Candès,, 2015). This characterization suggests that to make many discoveries, we should:

  • Goal 1: Maximize the probability that each coordinate WjW_{j} has a positive sign. (Note that null coordinates are guaranteed to be symmetric.)

  • Goal 2: Assign absolute values such that coordinates WjW_{j} with larger absolute values also have higher probabilities of being positive. This ensures that for each kk, the kk feature statistics with the highest absolute values contain as many positive signs as possible, thus maximizing the number of discoveries. Although it is not yet clear how to formalize this goal, intuitively, we would like {|Wj|}j=1p\{|W_{j}|\}_{j=1}^{p} to have the same order as {P(Wj>0)}j=1p\{P^{\star}(W_{j}>0)\}_{j=1}^{p}.

We emphasize that the second goal is crucial to make discoveries when {P(Wj>0)}j=1p\{P^{\star}(W_{j}>0)\}_{j=1}^{p} is heterogeneous, as illustrated in Section 2.2. See also Appendix A for a simpler simulated example.

2.2 Motivation from the HIV drug resistance dataset

We now ask: do the most common choices of feature statistics used in the literature, LCD and LSM statistics, accomplish Goals 1-2? We argue no, using the HIV drug resistance dataset from Rhee et al., (2006) as an illustrative example. This dataset has been used as a benchmark in several papers about knockoffs, e.g., Barber and Candès, (2015); Romano et al., (2020), and we perform a complete analysis of this dataset in Section 6. For now, note that the design 𝐗\mathbf{X} consists of genotype data from n750n\approx 750 HIV samples, the response 𝐘\mathbf{Y} measures the resistance of each sample to a drug (in this case Indinavir), and we apply knockoffs to discover which genetic variants affect drug resistance—note our analysis exactly mimics that of Barber and Candès, (2015). As notation, let (β^(λ),β~(λ))2p(\hat{\beta}^{(\lambda)},\tilde{\beta}^{(\lambda)})\in\mathbb{R}^{2p} denote the estimated lasso coefficients fit on [𝐗,𝐗~][\mathbf{X},\widetilde{\mathbf{X}}] and 𝐘\mathbf{Y} with regularization parameter λ\lambda. Furthermore, let λ^j\hat{\lambda}_{j} (resp. λ~j)\tilde{\lambda}_{j}) denote the largest value of λ\lambda such that β^j(λ)0\hat{\beta}^{(\lambda)}_{j}\neq 0 (resp. β~j(λ)0\tilde{\beta}^{(\lambda)}_{j}\neq 0). Then the LCD and LSM statistics are defined as:

WjLCD=|β^j(λ)||β~j(λ)|,WjLSM=sign(λ^jλ~j)max(λ^j,λ~j).W_{j}^{\mathrm{LCD}}=|\hat{\beta}_{j}^{(\lambda)}|-|\tilde{\beta}_{j}^{(\lambda)}|,\,\,\,\,\,\,\,\,\,\,W_{j}^{\mathrm{LSM}}=\operatorname*{sign}(\hat{\lambda}_{j}-\tilde{\lambda}_{j})\max(\hat{\lambda}_{j},\tilde{\lambda}_{j}). (2.1)
Figure 1: We plot the first 5050 LCD, LSM, and MLR feature statistics sorted in descending order of absolute value when applied to the HIV drug resistance dataset for the drug Indinavir (IDV). For FDR level q=0.05q=0.05, all positive feature statistics to the left of the dotted black line are discoveries. This figure shows that when 𝐗\mathbf{X} is correlated, LCD and LSM statistics make few discoveries because they occasionally yield highly negative WW-statistics for highly predictive variables that have low-quality knockoffs, such as the “P90.M” variant from Section 2. In contrast, MLR statistics (defined in Section 3) deprioritize the P90.M variant; although they still do not discover P90.M, this deprioritization allows the discovery of 2222 other features. For visualization, we apply a monotone transformation to {Wj}\{W_{j}\} such that |Wj|1|W_{j}|\leq 1, which (provably) does not change the performance of knockoffs. See Appendix H for further details and corresponding plots for the other fifteen drugs in the dataset.
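Both statistics in Eq. (2.1) can be computed from lasso fits on the augmented design [X, X~]. The sketch below uses scikit-learn and is only a rough illustration: the entry values (the penalties at which each column first enters the path) are approximated on a finite grid, and scikit-learn's alpha parameterization differs from some conventions by a factor of the sample size.

```python
import numpy as np
from sklearn.linear_model import Lasso, lasso_path

def lcd_statistics(X, Xk, y, lam):
    """LCD: W_j = |beta_hat_j| - |beta_tilde_j| from a lasso fit on [X, Xk] at penalty lam."""
    p = X.shape[1]
    beta = Lasso(alpha=lam).fit(np.hstack([X, Xk]), y).coef_
    return np.abs(beta[:p]) - np.abs(beta[p:])

def lsm_statistics(X, Xk, y, n_alphas=200):
    """LSM: W_j = sign(lam_hat_j - lam_tilde_j) * max(lam_hat_j, lam_tilde_j), where
    lam_hat_j approximates the largest penalty at which feature j enters the lasso path."""
    p = X.shape[1]
    alphas, coefs, _ = lasso_path(np.hstack([X, Xk]), y, n_alphas=n_alphas)
    nonzero = np.abs(coefs) > 0                     # shape (2p, n_alphas); alphas are decreasing
    entry = np.where(nonzero.any(axis=1), alphas[np.argmax(nonzero, axis=1)], 0.0)
    lam_hat, lam_tilde = entry[:p], entry[p:]
    return np.sign(lam_hat - lam_tilde) * np.maximum(lam_hat, lam_tilde)
```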

As intuition, imagine that a feature 𝐗j\mathbf{X}_{j} appears to influence 𝐘\mathbf{Y}: however, due to high correlations within 𝐗\mathbf{X}, we must create a knockoff 𝐗~j\widetilde{\mathbf{X}}_{j} which is highly correlated with 𝐗j\mathbf{X}_{j}. For example, the “P90.M” variant in the HIV dataset is extremely predictive of resistance to Indinavir (IDV), as its OLS t-statistic is 8.95\approx 8.95. However, in the original analysis, P90.M is >99%>99\% correlated with its knockoff, so the lasso may select 𝐗~j\widetilde{\mathbf{X}}_{j} instead of 𝐗j\mathbf{X}_{j}. Furthermore, since the lasso induces sparsity, it is unlikely to select both 𝐗j\mathbf{X}_{j} and 𝐗~j\widetilde{\mathbf{X}}_{j} as they are highly correlated. Thus, WjLCDW_{j}^{\mathrm{LCD}} and WjLSMW_{j}^{\mathrm{LSM}} will have large absolute values, since 𝐗j\mathbf{X}_{j} appears significant, and a reasonably high probability of being negative, since 𝐗j𝐗~j\mathbf{X}_{j}\approx\widetilde{\mathbf{X}}_{j}. Indeed, the LCD and LSM statistics for P90.M have, respectively, the largest and second-largest absolute values among all genetic variants, but both statistics are negative because the lasso selected the knockoff instead of the feature.

Figure 1 shows that this misprioritization prevents the LCD and LSM statistics from making any discoveries when q=0.05q=0.05. Yet this problem is avoidable. If Corr(Xj,X~j)\operatorname{Corr}(X_{j},\widetilde{X}_{j}) is large and WjW_{j} may be negative, we can “deprioritize” WjW_{j} by lowering its absolute value. As shown by Figure 1, this is exactly what MLR statistics do for the P90.M variant. Although this does not allow us to discover P90.M, it does allow us to discover 2222 other features.

Remark 1 (Alternative solutions).

To our knowledge, this problem with lasso statistics has not been previously discussed (see Section 1.4). Once pointed out, there are many intuitive approaches that mitigate (but do not fully solve) this problem, such as studentizing the coefficients or adding a ridge penalty. These practical ideas may merit further exploration; however, we focus on obtaining optimal feature statistics. Furthermore, some may argue that the best solution is simply to ensure that P90.M is less correlated with its knockoff. We wholeheartedly agree that the SDP knockoff construction from Barber and Candès, (2015) is sub-optimal here, and thus, we also use an alternative knockoff construction in Section 6. Yet reasonably strong correlations between features and knockoffs are inevitable when the features are correlated (Dai and Barber,, 2016). However, knockoffs can still be powerful in these settings if the feature statistics properly account for the (known) dependencies among [𝐗,𝐗~][\mathbf{X},\widetilde{\mathbf{X}}]. MLR statistics are designed to do this.

3 Masked likelihood ratio statistics

This section introduces and analyzes MLR statistics. First, Section 3.1 introduces the notation needed to define MLR statistics. Then, Section 3.2 defines MLR statistics, and Section 3.3 shows that MLR statistics asymptotically maximize the expected number of discoveries. Finally, Section 3.4 introduces an adjusted MLR statistic that asymptotically maximizes the expected number of true discoveries.

3.1 Knockoffs as inference on masked data

Section 2 argued that to maximize power, we should assign WjW_{j} a large absolute value if and only if P(Wj>0)P^{\star}(W_{j}>0) is large. To do this, we must estimate P(Wj>0)P^{\star}(W_{j}>0) from the data, but we cannot use all the data for this purpose: e.g., we cannot directly adjust |W||W| based on sign(W)\operatorname*{sign}(W) without violating FDR control. To resolve this ambiguity, we reformulate knockoffs as inference on masked data.

Definition 3.1.

Suppose we observe data 𝐗,𝐘\mathbf{X},\mathbf{Y}, knockoffs 𝐗~\widetilde{\mathbf{X}}, and independent random noise UU. (UU may be used to fit a randomized feature statistic.) The masked data DD is defined as

D={(𝐘,{𝐗j,𝐗~j}j=1p,U) for model-X knockoffs(𝐗,𝐗~,{𝐗jT𝐘,𝐗~jT𝐘}j=1p,U) for fixed-X knockoffs.D=\begin{cases}(\mathbf{Y},\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}_{j=1}^{p},U)&\text{ for model-X knockoffs}\\ (\mathbf{X},\widetilde{\mathbf{X}},\{\mathbf{X}_{j}^{T}\mathbf{Y},\widetilde{\mathbf{X}}_{j}^{T}\mathbf{Y}\}_{j=1}^{p},U)&\text{ for fixed-X knockoffs.}\end{cases} (3.1)

As shown in Propositions 3.1-3.2, the masked data DD is all the data we may use when assigning magnitudes to WW, and knockoffs will be powerful when we can recover the full data from DD.

Proposition 3.1.

Let 𝐗~\widetilde{\mathbf{X}} be model-X knockoffs such that 𝐗j𝐗~j\mathbf{X}_{j}\neq\widetilde{\mathbf{X}}_{j} a.s. for j[p]j\in[p]. Then W=w([𝐗,𝐗~],𝐘)W=w([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y}) is a valid feature statistic if and only if:

  1. |W||W| is a function of the masked data DD.

  2. For all j[p]j\in[p], there exists a DD-measurable random vector 𝐗^j\widehat{\mathbf{X}}_{j} such that Wj>0W_{j}>0 if and only if 𝐗^j=𝐗j\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j}.

Proposition 3.1 reformulates knockoffs as a guessing game, where we produce a “guess” 𝐗^j{𝐗j,𝐗~j}\widehat{\mathbf{X}}_{j}\in\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\} of the value of 𝐗j\mathbf{X}_{j} based on D=(𝐘,{𝐗j,𝐗~j}j=1p)D=(\mathbf{Y},\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}_{j=1}^{p}). If our guess is right, meaning 𝐗^j=𝐗j\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j}, then we are rewarded and Wj>0W_{j}>0: else Wj<0W_{j}<0. To avoid highly negative WW-statistics, we should only assign WjW_{j} a large absolute value when we are confident that our “guess” 𝐗^j\widehat{\mathbf{X}}_{j} is correct. We discuss more implications of this result in the next section: for now, we obtain an analogous result for fixed-X knockoffs (similar to a result from Li and Fithian, (2021)) by substituting {𝐗jT𝐘,𝐗~jT𝐘}\{\mathbf{X}_{j}^{T}\mathbf{Y},\widetilde{\mathbf{X}}_{j}^{T}\mathbf{Y}\} for {𝐗j,𝐗~j}\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}.

Proposition 3.2.

Let 𝐗~\widetilde{\mathbf{X}} be fixed-X knockoffs satisfying 𝐗jT𝐘𝐗~jT𝐘\mathbf{X}_{j}^{T}\mathbf{Y}\neq\widetilde{\mathbf{X}}_{j}^{T}\mathbf{Y} a.s. for j[p]j\in[p]. Then W=w([𝐗,𝐗~],𝐘)W=w([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y}) is a valid feature statistic if and only if:

  1. |W||W| is a function of the masked data DD.

  2. For j[p]j\in[p], there exists a DD-measurable random variable RjR_{j} such that Wj>0W_{j}>0 if and only if Rj=𝐗jT𝐘R_{j}=\mathbf{X}_{j}^{T}\mathbf{Y}.

Remark 2.

Propositions 3.1-3.2 hold for knockoffs as defined in Barber and Candès, (2015); Candès et al., (2018). However, in the fixed-X case, one can also augment DD to include σ^2=(InH)𝐘22\hat{\sigma}^{2}=\|(I_{n}-H)\mathbf{Y}\|_{2}^{2}, where HH is the OLS projection matrix of [𝐗,𝐗~][\mathbf{X},\widetilde{\mathbf{X}}] while preserving validity (Chen et al.,, 2019; Li and Fithian,, 2021). Our theory also applies to this extension of the knockoffs framework.

3.2 Introducing masked likelihood ratio (MLR) statistics

We now introduce masked likelihood ratio (MLR) statistics in two steps. First, we introduce oracle MLR statistics, which depend on the unknown law PP^{\star} of the data. Then, we introduce Bayesian MLR statistics, which substitute PπP^{\pi} for PP^{\star}. Throughout, we focus on MX knockoffs, but analogous results for FX knockoffs merely replace {𝐗j,𝐗~j}\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\} with {𝐗jT𝐘,𝐗~jT𝐘}\{\mathbf{X}_{j}^{T}\mathbf{Y},\widetilde{\mathbf{X}}_{j}^{T}\mathbf{Y}\} (see Definition 3.2).

Step 1: Oracle MLR statistics. We now apply Proposition 3.1 to achieve the two intuitive optimality criteria from Section 2.

  • Goal 1 asks that we maximize P(Wj>0)P^{\star}(W_{j}>0). Proposition 3.1 shows that ensuring Wj>0W_{j}>0 is equivalent to correctly guessing the value of 𝐗j\mathbf{X}_{j} from the masked data D=(𝐘,{𝐗j,𝐗~j}j=1p)D=(\mathbf{Y},\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}_{j=1}^{p}). Thus, to maximize P(Wj>0D)=P(𝐗^j=𝐗jD)P^{\star}(W_{j}>0\mid D)=P^{\star}(\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j}\mid D), the analyst should guess the value 𝐗^j=argmax𝐱{𝐗j,𝐗~j}Pj(𝐱D)\widehat{\mathbf{X}}_{j}=\operatorname*{arg\,max}_{\mathbf{x}\in\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}}P_{j}^{\star}(\mathbf{x}\mid D) which maximizes the likelihood that the guess is correct.

  • Goal 2 asks us to order the guesses in descending order of P(Wj>0)P^{\star}(W_{j}>0), i.e., in descending order of the likelihood that each guess 𝐗^j\widehat{\mathbf{X}}_{j} is correct.

Both goals are achieved by using the masked data to compute a log-likelihood ratio between the two possible values of 𝐗j\mathbf{X}_{j} (namely {𝐗j,𝐗~j}\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}). This defines the oracle masked likelihood ratio:

MLRjoracle=log(Pj(𝐗jD)Pj(𝐗~jD)),\mathrm{MLR}_{j}^{\mathrm{oracle}}=\log\left(\frac{P_{j}^{\star}(\mathbf{X}_{j}\mid D)}{P_{j}^{\star}(\widetilde{\mathbf{X}}_{j}\mid D)}\right), (3.2)

where PjP𝐗jDP^{\star}_{j}\coloneqq P^{\star}_{\mathbf{X}_{j}\mid D} is the true (unknown) conditional law of 𝐗j\mathbf{X}_{j} given DD. Soon, Proposition 3.3 will verify that MLRoracle\mathrm{MLR}^{\mathrm{oracle}} achieves both goals above, and Section 3.3 shows that MLRjoracle\mathrm{MLR}_{j}^{\mathrm{oracle}} asymptotically maximizes the expected number of discoveries under PP^{\star} (under regularity conditions).

Step 2: Bayesian MLR statistics. MLRoracle\mathrm{MLR}^{\mathrm{oracle}} cannot be used in practice since it depends on PP^{\star}. A heuristic solution is to “plug in” an estimate P^\hat{P} for PP^{\star}. For example, given some model class 𝒫={P(θ):θΘ}\mathcal{P}=\{P^{(\theta)}:\theta\in\Theta\} of the law of (𝐘,𝐗)(\mathbf{Y},\mathbf{X}), one could estimate θ^\hat{\theta} using DD and replace PP^{\star} with P(θ^)P^{(\hat{\theta})}. However, this “plug-in” approach can perform poorly, since knockoffs are most popular in high-dimensional settings with significant uncertainty about the true value of any unknown parameters. Thus, to account for uncertainty, we suggest replacing PP^{\star} with a Bayesian model PπP^{\pi}.

Definition 3.2 (MLR statistics).

For any Bayesian model PπP^{\pi} (see Section 1.2), we define the model-X masked likelihood ratio (MLR) statistic below:

MLRjπmlrjπ([𝐗,𝐗~],𝐘)log(Pjπ(𝐗jD)Pjπ(𝐗~jD)) for model-X knockoffs,\mathrm{MLR}_{j}^{\pi}\coloneqq\mathrm{mlr}_{j}^{\pi}([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y})\coloneqq\log\left(\frac{P^{\pi}_{j}(\mathbf{X}_{j}\mid D)}{P^{\pi}_{j}(\widetilde{\mathbf{X}}_{j}\mid D)}\right)\text{ for model-X knockoffs,} (3.3)

where PjπP𝐗jDπP^{\pi}_{j}\coloneqq P^{\pi}_{\mathbf{X}_{j}\mid D} denotes the conditional law of 𝐗jD\mathbf{X}_{j}\mid D under PπP^{\pi}.

The fixed-X MLR statistic is analogous but replaces {𝐗j,𝐗~j}\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\} with {𝐗jT𝐘,𝐗~jT𝐘}\{\mathbf{X}_{j}^{T}\mathbf{Y},\widetilde{\mathbf{X}}_{j}^{T}\mathbf{Y}\}. In particular, if Pj,fxπP^{\pi}_{j,\mathrm{fx}} denotes the conditional law of 𝐗jT𝐘D\mathbf{X}_{j}^{T}\mathbf{Y}\mid D under PπP^{\pi}, then

MLRjπmlrjπ([𝐗,𝐗~],𝐘)log(Pj,fxπ(𝐗jT𝐘D)Pj,fxπ(𝐗~jT𝐘D)) for fixed-X knockoffs.\mathrm{MLR}_{j}^{\pi}\coloneqq\mathrm{mlr}_{j}^{\pi}([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y})\coloneqq\log\left(\frac{P^{\pi}_{j,\mathrm{fx}}(\mathbf{X}_{j}^{T}\mathbf{Y}\mid D)}{P^{\pi}_{j,\mathrm{fx}}(\widetilde{\mathbf{X}}_{j}^{T}\mathbf{Y}\mid D)}\right)\text{ for fixed-X knockoffs.} (3.4)
Lemma 3.1.

Equations (3.3) and (3.4) define valid MX and FX knockoff statistics, respectively.

To see how MLR statistics account for uncertainty about nuisance parameters, let π(θD)\pi(\theta\mid D) denote the posterior density of θD\theta^{\star}\mid D under PπP^{\pi}. We can write, e.g., in the model-X case:

MLRjπ=log(ΘPj(θ)(𝐗jD)π(θD)𝑑θΘPj(θ)(𝐗~jD)π(θD)𝑑θ).\mathrm{MLR}_{j}^{\pi}=\log\left(\frac{\int_{\Theta}P^{(\theta)}_{j}(\mathbf{X}_{j}\mid D)\pi(\theta\mid D)d\theta}{\int_{\Theta}P^{(\theta)}_{j}(\widetilde{\mathbf{X}}_{j}\mid D)\pi(\theta\mid D)d\theta}\right). (3.5)

Unlike the “plug-in” approach, MLRjπ\mathrm{MLR}_{j}^{\pi} does not rely on a single estimate of θ\theta—instead, it takes the weighted average of the likelihoods Pj(θ)(𝐗jD)P^{(\theta)}_{j}(\mathbf{X}_{j}\mid D), weighted by the posterior law of θD\theta\mid D under PπP^{\pi}.
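Equation (3.5) also suggests a numerical recipe: given draws from the posterior of theta given D and the corresponding per-draw likelihoods of the two candidate values (both of which Section 4 obtains via Gibbs sampling), one can average the likelihoods before taking the log ratio. The sketch below assumes these per-draw log-likelihoods are already available as inputs; the function name and inputs are illustrative, not part of the paper or of knockpy.

```python
import numpy as np
from scipy.special import logsumexp

def mlr_monte_carlo(log_lik_xj, log_lik_xkj):
    """Monte Carlo version of Eq. (3.5) for a single feature j.

    log_lik_xj[m]  = log P_j^{(theta_m)}(X_j       | D), with theta_m drawn from pi(theta | D)
    log_lik_xkj[m] = log P_j^{(theta_m)}(X_tilde_j | D)
    Returns the log ratio of posterior-averaged likelihoods, computed stably in log space."""
    # The 1/M normalization of each average cancels in the ratio.
    return logsumexp(log_lik_xj) - logsumexp(log_lik_xkj)
```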

We now verify that under PπP^{\pi}, MLR statistics achieve the intuitive criteria from Section 2 (Goals 1-2). This result applies to oracle MLR statistics, since MLRjoracle=MLRjπ\mathrm{MLR}_{j}^{\mathrm{oracle}}=\mathrm{MLR}_{j}^{\pi} in the special case where Pπ=PP^{\pi}=P^{\star}.

Proposition 3.3.

Let MLRπ\mathrm{MLR}^{\pi} be the MLR statistics with respect to a Bayesian model PπP^{\pi}. Then for any other feature statistic WW,

Pπ(MLRjπ>0D)Pπ(Wj>0D).P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)\geq P^{\pi}(W_{j}>0\mid D). (3.6)

Furthermore, {|MLRjπ|}j=1p\{|\mathrm{MLR}_{j}^{\pi}|\}_{j=1}^{p} has the same order as {Pπ(MLRjπ>0D)}j=1p\{P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)\}_{j=1}^{p}. More precisely,

Pπ(MLRjπ>0D)=exp(|MLRjπ|)1+exp(|MLRjπ|).P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)=\frac{\exp(|\mathrm{MLR}_{j}^{\pi}|)}{1+\exp(|\mathrm{MLR}_{j}^{\pi}|)}. (3.7)

Equation (3.7) shows that the absolute values |MLRjπ||\mathrm{MLR}_{j}^{\pi}| have the same order as Pπ(MLRjπ>0D)P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D), so under PπP^{\pi}, MLR statistics prioritize the hypotheses “correctly.” More generally, if 𝐗j\mathbf{X}_{j} is predictive of 𝐘\mathbf{Y} but 𝐗~j\widetilde{\mathbf{X}}_{j} is nearly indistinguishable from 𝐗j\mathbf{X}_{j}, |MLRjπ||\mathrm{MLR}_{j}^{\pi}| should be small, since 𝐗j𝐗~j\mathbf{X}_{j}\approx\widetilde{\mathbf{X}}_{j} suggests Pjπ(𝐗jD)Pjπ(𝐗~jD)P^{\pi}_{j}(\mathbf{X}_{j}\mid D)\approx P^{\pi}_{j}(\widetilde{\mathbf{X}}_{j}\mid D). Thus, MLR statistics should rarely be highly negative (see Figure 1).
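As a small numerical check of Eq. (3.7) (a toy illustration, not part of the paper's code), the conditional sign probabilities are a sigmoid transform of the magnitudes, so sorting hypotheses by |MLR| and by their probability of a positive sign gives the same ranking.

```python
import numpy as np
from scipy.special import expit  # expit(x) = exp(x) / (1 + exp(x))

mlr = np.array([2.3, -0.4, 0.1, 5.0])   # hypothetical MLR statistics
sign_probs = expit(np.abs(mlr))         # Eq. (3.7): P(MLR_j > 0 | D)
assert (np.argsort(-np.abs(mlr)) == np.argsort(-sign_probs)).all()
```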

Lastly, we make two connections to the literature. First, one proposal in Ren and Candès, (2020) also suggests ranking the hypotheses by (Wj>0|Wj|)\mathbb{P}(W_{j}>0\mid|W_{j}|) (see their footnote 8). That said, Ren and Candès, (2020) do not propose a feature statistic accomplishing this. Rather, they develop “adaptive knockoffs,” an extension of knockoffs that can be combined with any predefined feature statistic, including MLR or lasso statistics. Indeed, using better initial feature statistics should increase the power of adaptive knockoffs, so our contribution is both orthogonal and complementary to theirs (see Appendix D.2 for more details). Second, Katsevich and Ramdas, (2020) show that the unmasked likelihood statistic maximizes P(Wj>0)P^{\star}(W_{j}>0); indeed, our work builds on theirs. However, there are two key differences. First, unlike MLR statistics, the unmasked likelihood statistic is not a valid knockoff statistic even though it is marginally symmetric under the null (see Appendix D.1), so it does not provably control the FDR. Second, MLR statistics have additional guarantees on their magnitudes (Eq. 3.7), allowing us to show much stronger theoretical results in Section 3.3.

Remark 3.

Appendix F extends this section’s results to apply to group knockoffs (Dai and Barber,, 2016).

3.3 MLR statistics are asymptotically optimal

We now show that MLR statistics asymptotically maximize ENDiscπ\texttt{ENDisc}^{\pi}, the expected number of discoveries under PπP^{\pi}. Indeed, Proposition 3.3 might make one hope that MLR statistics exactly maximize ENDiscπ\texttt{ENDisc}^{\pi}, since MLR statistics exactly accomplish Goals 1-2 from Section 2. This intuition is correct under the conditional independence condition below (generalizing Li and Fithian, (2021) Proposition 2).

Proposition 3.4.

If {𝕀(MLRjπ>0)}j=1p\{\mathbb{I}(\mathrm{MLR}_{j}^{\pi}>0)\}_{j=1}^{p} are conditionally independent given DD under PπP^{\pi}, then
ENDiscπ(mlrπ)ENDiscπ(w)\texttt{ENDisc}^{\pi}(\mathrm{mlr}^{\pi})\geq\texttt{ENDisc}^{\pi}(w) for any valid feature statistic ww.

Furthermore, in Gaussian linear models, oracle MLR statistics satisfy this conditional independence condition, making them finite-sample optimal.

Proposition 3.5.

Suppose that (i) 𝐗~\widetilde{\mathbf{X}} are FX knockoffs or Gaussian conditional MX knockoffs (Huang and Janson,, 2020) and (ii) under PP^{\star}, 𝐘𝐗𝒩(𝐗β,σ2In)\mathbf{Y}\mid\mathbf{X}\sim\mathcal{N}(\mathbf{X}\beta,\sigma^{2}I_{n}). Then under PP^{\star}, {𝕀(MLRjoracle>0)}j=1pD\{\mathbb{I}(\mathrm{MLR}_{j}^{\mathrm{oracle}}>0)\}_{j=1}^{p}\mid D are conditionally independent.

Absent this independence condition, it may be possible to exploit dependencies among the coordinates of sign(MLRπ)\operatorname*{sign}(\mathrm{MLR}^{\pi}) to slightly improve power. Yet Appendix B.3 shows that to improve power even slightly seems to require pathological dependencies, making it hard to imagine that accounting for dependencies can substantially increase power in practice. Formally, we now show that MLR statistics are asymptotically optimal under regularity conditions on the dependence of sign(MLRπ)D\operatorname*{sign}(\mathrm{MLR}^{\pi})\mid D.

To this end, consider any asymptotic regime where we observe 𝐗(n)n×pn,𝐘(n)n\mathbf{X}^{(n)}\in\mathbb{R}^{n\times p_{n}},\mathbf{Y}^{(n)}\in\mathbb{R}^{n} and construct knockoffs 𝐗~(n)\widetilde{\mathbf{X}}^{(n)}. For each nn, let PnπP^{\pi}_{n} denote a Bayesian model based on a model class 𝒫(n)={P(θ):θΘ(n)}\mathcal{P}^{(n)}=\{P^{(\theta)}:\theta\in\Theta^{(n)}\} and prior density π(n):Θ(n)0\pi^{(n)}:\Theta^{(n)}\to\mathbb{R}_{\geq 0}. Let D(n)D^{(n)} denote the masked data (Definition 3.1). For a sequence of feature statistics W(n)=wn([𝐗(n),𝐗~(n)],𝐘(n))W^{(n)}=w_{n}([\mathbf{X}^{(n)},\widetilde{\mathbf{X}}^{(n)}],\mathbf{Y}^{(n)}), let S(n)(q)S^{(n)}(q) denote the rejection set of W(n)W^{(n)} when controlling the FDR at level qq. So far, we have made no assumptions about the law of 𝐘(n),𝐗(n)\mathbf{Y}^{(n)},\mathbf{X}^{(n)} under PnπP^{\pi}_{n}, and we allow the dimension pnp_{n} to grow arbitrarily with nn. To analyze the asymptotic behavior of MLR statistics under PnπP^{\pi}_{n}, we need two main assumptions.

Assumption 3.1 (Sparsity).

For θΘ(n)\theta\in\Theta^{(n)}, let sn(θ)s_{n}^{(\theta)} denote the number of non-nulls under P(θ)P^{(\theta)} and sn=Θsn(θ)π(θ)𝑑θs_{n}=\int_{\Theta}s_{n}^{(\theta)}\pi(\theta)d\theta denote the expected number of non-nulls under PnπP_{n}^{\pi}. We assume snlog(pn)5s_{n}\gg\log(p_{n})^{5} as nn\to\infty.

Assumption 3.1 allows for many previously studied sparsity regimes, such as polynomial (Donoho and Jin,, 2004; Ke et al.,, 2020) and linear (e.g., Weinstein et al.,, 2017) sparsity regimes.

Assumption 3.2 (Local dependence).

Under PnπP^{\pi}_{n}, the conditional covariance matrix of sign(MLRπ)\operatorname*{sign}(\mathrm{MLR}^{\pi}) given D(n)D^{(n)} decays exponentially off its diagonal. Formally, there exist constants C0,ρ(0,1)C\geq 0,\rho\in(0,1) such that

|CovPnπ(𝕀(MLRiπ>0),𝕀(MLRjπ>0)D(n))|Cρ|ij|.|\operatorname{Cov}_{P^{\pi}_{n}}(\mathbb{I}(\mathrm{MLR}^{\pi}_{i}>0),\mathbb{I}(\mathrm{MLR}^{\pi}_{j}>0)\mid D^{(n)})|\leq C\rho^{|i-j|}. (3.8)

Assumption 3.2 quantifies the requirement that sign(MLRπ)D(n)\operatorname*{sign}(\mathrm{MLR}^{\pi})\mid D^{(n)} are not “too” conditionally dependent. Similar local dependence conditions are common in the multiple testing literature (Genovese and Wasserman,, 2004; Storey et al.,, 2004; Ferreira and Zwinderman,, 2006; Farcomeni,, 2007), although previous assumptions are typically made about p-values. We justify this assumption below.

  1. This assumption is intuitively plausible because knockoffs guarantee that the null coordinates of sign(W)\operatorname*{sign}(W) are independent given DD under PP^{\star}, regardless of the correlations among 𝐗\mathbf{X} (Barber and Candès,, 2015). This independence also holds for non-null coordinates in Gaussian linear models (see Prop. 3.5). Appendix B.9 gives additional informal intuition explaining why this result often holds approximately under PnπP_{n}^{\pi} for both null and non-null variables.

  2. Empirically, CovPπ(sign(MLRπ)D)\operatorname{Cov}_{P^{\pi}}(\operatorname*{sign}(\mathrm{MLR}^{\pi})\mid D) is nearly indistinguishable from a diagonal matrix in all of our simulations and three real analyses. This suggests that Assumption 3.2 holds in practice.

  3. This assumption can also be diagnosed in real applications, since it depends only on PπP^{\pi}, which is specified by the analyst. To this end, Section 4 shows how to compute CovPπ(sign(MLRπ)D)\operatorname{Cov}_{P^{\pi}}(\operatorname*{sign}(\mathrm{MLR}^{\pi})\mid D) (a sketch of such a diagnostic appears after this list).

  4. Explicit analysis of the covariances in Eq. (3.8) is known to be challenging. Nonetheless, in Appendix B.8, we prove that Assumption 3.2 holds if the design matrix 𝐗\mathbf{X} is blockwise orthogonal, which is an important (if not entirely realistic) special case studied by Ke et al., (2020).

  5. Assumption 3.2 can also be substantially relaxed (see Appendix B.5). All we require is that the partial sums of {sign(MLRjπ)}j=1p\{\operatorname*{sign}(\mathrm{MLR}_{j}^{\pi})\}_{j=1}^{p} obey a strong law of large numbers conditional on DD.
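To illustrate items 2 and 3 above, the following hypothetical diagnostic (a sketch, not the paper's algorithm) estimates the conditional covariance matrix of the sign indicators from Monte Carlo draws taken conditional on the masked data, for example draws produced alongside the Gibbs sampler of Section 4, and checks the exponential-decay bound of Eq. (3.8) for candidate constants C and rho.

```python
import numpy as np

def check_local_dependence(sign_draws, C=1.0, rho=0.5):
    """sign_draws: (n_draws, p) array of indicators I(MLR_j > 0), sampled conditional on D.
    Returns the estimated Cov(sign(MLR) | D) and whether |Cov_ij| <= C * rho^|i-j| off the diagonal."""
    draws = np.asarray(sign_draws, dtype=float)
    cov = np.cov(draws, rowvar=False)                           # (p, p) empirical conditional covariance
    p = cov.shape[0]
    lag = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # |i - j| for each entry
    off_diag = ~np.eye(p, dtype=bool)
    ok = np.all(np.abs(cov[off_diag]) <= (C * rho ** lag)[off_diag])
    return cov, ok
```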

With these two assumptions, we show that MLR statistics asymptotically maximize Γq(wn)\Gamma_{q}(w_{n}), the expected number of discoveries normalized by the expected number of non-nulls:

Γq(wn)𝔼(𝐗(n),𝐘(n))Pnπ[|S(n)(q)|]sn.\Gamma_{q}(w_{n})\coloneqq\frac{\mathbb{E}_{(\mathbf{X}^{(n)},\mathbf{Y}^{(n)})\sim P^{\pi}_{n}}[|S^{(n)}(q)|]}{s_{n}}. (3.9)
Theorem 3.2.

Consider any high-dimensional asymptotic regime where we observe data 𝐗(n)n×pn,𝐘(n)n\mathbf{X}^{(n)}\in\mathbb{R}^{n\times p_{n}},\mathbf{Y}^{(n)}\in\mathbb{R}^{n} and knockoffs 𝐗~(n)\widetilde{\mathbf{X}}^{(n)} with D(n)D^{(n)} denoting the masked data. Let PnπP^{\pi}_{n} be a sequence of Bayesian models of the data satisfying Assumptions 3.1-3.2, and let mlrnπ([𝐗(n),𝐗~(n)],𝐘(n))\mathrm{mlr}_{n}^{\pi}([\mathbf{X}^{(n)},\widetilde{\mathbf{X}}^{(n)}],\mathbf{Y}^{(n)}) denote the MLR statistics with respect to PnπP^{\pi}_{n}. Let wn([𝐗(n),𝐗~(n)],𝐘(n))w_{n}([\mathbf{X}^{(n)},\widetilde{\mathbf{X}}^{(n)}],\mathbf{Y}^{(n)}) denote any other sequence of feature statistics.

Then, if the limits limnΓq(wn)\lim_{n\to\infty}\Gamma_{q}(w_{n}) and limnΓq(mlrnπ)\lim_{n\to\infty}\Gamma_{q}(\mathrm{mlr}_{n}^{\pi}) exist for q(0,1)q\in(0,1), we have that

limnΓq(mlrnπ)limnΓq(wn)\lim_{n\to\infty}\Gamma_{q}(\mathrm{mlr}_{n}^{\pi})\geq\lim_{n\to\infty}\Gamma_{q}(w_{n}) (3.10)

holds for all but countably many qq.

Theorem 3.2 shows that MLR statistics asymptotically maximize the (normalized) number of expected discoveries without any explicit assumptions on the relationship between 𝐘\mathbf{Y} and 𝐗\mathbf{X} or the dimensionality. Besides Assumptions 3.1-3.2, we also assume that the quantities we aim to study actually exist, i.e., limnΓq(wn)\lim_{n\to\infty}\Gamma_{q}(w_{n}) and limnΓq(mlrnπ)\lim_{n\to\infty}\Gamma_{q}(\mathrm{mlr}_{n}^{\pi}) exist—however, even this assumption can be relaxed (see Appendix B.5). Yet the weakest aspect of Theorem 3.2 is that MLR statistics are only provably optimal under PπP^{\pi}. If PjP_{j}^{\star} and PjπP_{j}^{\pi} are quite different, MLR statistics may not perform well. For this reason, Section 4.2 suggests practical choices of PπP^{\pi} that performed well empirically, even under misspecification.

3.4 Maximizing the expected number of true discoveries

We now introduce adjusted MLR (AMLR) statistics, which asymptotically maximize the number of expected true discoveries under PπP^{\pi}. Empirically, AMLR and MLR statistics perform similarly, but AMLR statistics are less intuitive and depend somewhat counterintuitively on the FDR level. (This is why our paper focuses mostly on MLR statistics.) Thus, for brevity, this section gives only a little intuition and a slightly informal theorem statement. Please see Appendix B.7 for a rigorous theorem statement.

We begin with notation. For θΘ\theta\in\Theta, 1(θ)[p]\mathcal{H}_{1}(\theta)\subset[p] denotes the set of non-nulls under P(θ)P^{(\theta)} and 1(θ)[p]\mathcal{H}_{1}(\theta^{\star})\subset[p] denotes the random set of non-nulls under PπP^{\pi}. Then, Pπ(MLRjπ>0,j1(θ)D)P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0,j\in\mathcal{H}_{1}(\theta^{\star})\mid D) is the conditional probability that MLRjπ\mathrm{MLR}_{j}^{\pi} is positive and the jjth feature is non-null given the masked data. Finally, define the following ratio νj\nu_{j}:

νj=Pπ(MLRjπ>0,j1(θ)D)(1+q)1Pπ(MLRjπ>0D).\nu_{j}=\frac{P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0,j\in\mathcal{H}_{1}(\theta^{\star})\mid D)}{(1+q)^{-1}-P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)}. (3.11)
Definition 3.3.

With this notation, we now define AMLR statistics {AMLRj}j=1p\{\mathrm{AMLR}_{j}\}_{j=1}^{p} in two cases.

  • Case 1: AMLRjπ=MLRjπ\mathrm{AMLR}_{j}^{\pi}=\mathrm{MLR}_{j}^{\pi} if Pπ(MLRjπ>0D)(1+q)1P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)\geq(1+q)^{-1}.

  • Case 2: Otherwise, with logit(x)log(x/(1x))\mathrm{logit}(x)\coloneqq\log(x/(1-x)), we define

    AMLRjπ=sign(MLRjπ)logit((1+q)1)logit1(νj).\mathrm{AMLR}_{j}^{\pi}=\operatorname*{sign}(\mathrm{MLR}_{j}^{\pi})\cdot\mathrm{logit}\left((1+q)^{-1}\right)\cdot\mathrm{logit}^{-1}\left(\nu_{j}\right). (3.12)

    By construction, all AMLR statistics in Case 2 have smaller absolute values than all statistics in Case 1. Note that Appendix E.4 shows how to compute AMLR statistics.

Corollary 3.1.

AMLR statistics from Definition 3.3 are valid knockoff statistics.

MLR and AMLR statistics have the same signs but different absolute values. To understand why, Appendix B.7 argues that maximizing the expected number of true discoveries can be formulated as a simple linear program where the “benefit” of prioritizing a feature is bjPπ(MLRjπ>0,j1(θ)D)b_{j}\coloneqq P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0,j\in\mathcal{H}_{1}(\theta^{\star})\mid D)—the probability that MLRjπ\mathrm{MLR}_{j}^{\pi} is positive and jj is non-null—and the “cost” is cj(1+q)1Pπ(MLRjπ>0D)c_{j}\coloneqq(1+q)^{-1}-P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D). The intuition is that to make kk discoveries, (1+q)1k\approx(1+q)^{-1}k of the kk feature statistics with the largest absolute values must have positive signs. Thus, cjc_{j} measures the difference between (1+q)1(1+q)^{-1} and the (conditional) probability that MLRjπ\mathrm{MLR}_{j}^{\pi} is positive. Feature jj has a negative cost cj<0c_{j}<0 if it produces a “surplus” of (1+q)1\geq(1+q)^{-1} positive signs in expectation.

The optimal solution to this problem is to (a) maximally prioritize all features with negative costs by giving them the highest absolute values—i.e., the features in Case 1 above—and (b) prioritize all other features in descending order of the benefit-cost ratio νj=bj/cj\nu_{j}=b_{j}/c_{j}. This is accomplished by the AMLR formulas in Definition 3.3. In contrast, MLR statistic magnitudes are a decreasing function of only the costs cjc_{j}. By incorporating the benefit bjb_{j}, AMLR statistics reduce the expected number of discoveries while increasing the expected number of true discoveries. See Appendix B.7 for further details.
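To make Definition 3.3 concrete, the sketch below computes AMLR statistics from the MLR statistics and from estimates of the benefits b_j; how those probabilities are estimated is deferred to Appendix E.4, so here they are simply assumed inputs.

```python
import numpy as np
from scipy.special import expit, logit

def amlr_statistics(mlr, b, q):
    """AMLR statistics of Definition 3.3.

    mlr: MLR statistics (length p).
    b:   estimates of P^pi(MLR_j > 0, j non-null | D) (length p).
    q:   nominal FDR level."""
    mlr, b = np.asarray(mlr, dtype=float), np.asarray(b, dtype=float)
    p_pos = expit(np.abs(mlr))           # Eq. (3.7): P^pi(MLR_j > 0 | D)
    thresh = 1.0 / (1.0 + q)
    amlr = mlr.copy()                    # Case 1: AMLR_j = MLR_j when p_pos >= 1/(1+q)
    case2 = p_pos < thresh
    nu = b[case2] / (thresh - p_pos[case2])   # benefit-cost ratio nu_j = b_j / c_j
    # Case 2: |AMLR_j| = logit(1/(1+q)) * sigmoid(nu_j), which is strictly smaller than
    # logit(1/(1+q)) <= |MLR_j| for every feature in Case 1.
    amlr[case2] = np.sign(mlr[case2]) * logit(thresh) * expit(nu)
    return amlr
```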

AMLR and MLR statistics are different but not too different, since typically, {Pπ(MLRjπ>0,j1(θ)D)}j=1p\{P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0,j\in\mathcal{H}_{1}(\theta^{\star})\mid D)\}_{j=1}^{p} has a similar order as {Pπ(MLRjπ>0D)}j=1p\{P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)\}_{j=1}^{p}. (When the orders are the same, AMLR and MLR statistics yield identical rejection sets.) Indeed, Figure 2 shows in a simple simulation that the power of AMLR and MLR statistics is nearly identical.

We now show that AMLR statistics asymptotically maximize power under PπP^{\pi} (see Appendix B.7 for a formal statement and proof). For any statistic w([𝐗,𝐗~],𝐘)w([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y}) with discovery set Sw[p]S_{w}\subset[p], let TPπ(w)\mathrm{TP}^{\pi}(w) denote the expected number of true positives under PπP^{\pi}:

TPπ(w)Θ𝔼P(θ)[|Sw1(θ)|]π(θ)𝑑θ=𝔼Pπ[|Sw1(θ)|].\mathrm{TP}^{\pi}(w)\,\coloneqq\,\int_{\Theta}\mathbb{E}_{P^{(\theta)}}\big{[}|S_{w}\cap\mathcal{H}_{1}(\theta)|\big{]}\,\pi(\theta)\,d\theta\,=\,\mathbb{E}_{P^{\pi}}[|S_{w}\cap\mathcal{H}_{1}(\theta^{\star})|]. (3.13)
Theorem 3.3 (Informal).

Suppose the conditions of Theorem 3.2 hold. Furthermore, suppose that (i) the local dependence condition in Assumption 3.2 holds when replacing 𝕀(MLRjπ>0)\mathbb{I}(\mathrm{MLR}_{j}^{\pi}>0) with 𝕀(MLRjπ>0,j1(θ))\mathbb{I}(\mathrm{MLR}_{j}^{\pi}>0,j\in\mathcal{H}_{1}(\theta^{\star})) and (ii) the coefficient of variation of the number of non-nulls |1(θ)||\mathcal{H}_{1}(\theta^{\star})| is bounded as nn\to\infty. Then for any sequence of feature statistics {wn}n\{w_{n}\}_{n\in\mathbb{N}},

TPπ(amlr)TPπ(wn)+o(sn),\mathrm{TP}^{\pi}(\mathrm{amlr})\geq\mathrm{TP}^{\pi}(w_{n})+o(s_{n}), (3.14)

where sns_{n} is the expected number of non-nulls under PnπP_{n}^{\pi}, as defined in Assumption 3.1.

Figure 2: In the AR(1) simulation setting from Section 5.1, this figure plots the power of MLR and AMLR statistics (using MVR fixed-X or model-X knockoffs) for different nominal FDR levels qq. It shows that both statistics have essentially the same power. See Section 5.1 for details.

4 Computing MLR statistics

4.1 General strategy

We now show how to compute Pjπ(𝐗jD)P^{\pi}_{j}(\mathbf{X}_{j}\mid D) and Pjπ(𝐗~jD)P^{\pi}_{j}(\widetilde{\mathbf{X}}_{j}\mid D) by Gibbs sampling from the law of 𝐗D\mathbf{X}\mid D under PπP^{\pi}. For brevity, we focus on the MX setting—Appendix E discusses the FX case.

The key idea is that conditional on 𝐗j\mathbf{X}_{-j} and the latent parameter θ\theta^{\star}, sampling from the law of 𝐗j𝐗j,θ,D\mathbf{X}_{j}\mid\mathbf{X}_{-j},\theta^{\star},D is easy. In particular, for any fixed 𝐝=(𝐲,{𝐱j,𝐱~j}j=1p)\mathbf{d}=(\mathbf{y},\{\mathbf{x}_{j},\widetilde{\mathbf{x}}_{j}\}_{j=1}^{p}), observing D=𝐝D=\mathbf{d} implies that 𝐗j\mathbf{X}_{j} must lie in {𝐱j,𝐱~j}\{\mathbf{x}_{j},\widetilde{\mathbf{x}}_{j}\}. Lemma E.1 shows that the conditional likelihood ratio equals:

Pπ(𝐗j=𝐱j𝐗j,θ=θ,D=𝐝)Pπ(𝐗j=𝐱~j𝐗j,θ=θ,D=𝐝)\displaystyle\frac{P^{\pi}(\mathbf{X}_{j}=\mathbf{x}_{j}\mid\mathbf{X}_{-j},\theta^{\star}=\theta,D=\mathbf{d})}{P^{\pi}(\mathbf{X}_{j}=\widetilde{\mathbf{x}}_{j}\mid\mathbf{X}_{-j},\theta^{\star}=\theta,D=\mathbf{d})} =P𝐘𝐗(θ)(𝐲𝐗j=𝐱j,𝐗j)P𝐘𝐗(θ)(𝐲𝐗j=𝐱~j,𝐗j).\displaystyle=\frac{P_{\mathbf{Y}\mid\mathbf{X}}^{(\theta)}(\mathbf{y}\mid\mathbf{X}_{j}=\mathbf{x}_{j},\mathbf{X}_{-j})}{P_{\mathbf{Y}\mid\mathbf{X}}^{(\theta)}(\mathbf{y}\mid\mathbf{X}_{j}=\widetilde{\mathbf{x}}_{j},\mathbf{X}_{-j})}. (4.1)

The right-hand side of Eq. (4.1) is easy to compute for most parametric models 𝒫\mathcal{P}, since it only involves computing the likelihood of 𝐘\mathbf{Y} given 𝐗\mathbf{X}. Thus, we can easily sample from the law of 𝐗jD,θ,𝐗j\mathbf{X}_{j}\mid D,\theta^{\star},\mathbf{X}_{-j}.

To sample from the law of 𝐗D\mathbf{X}\mid D, Algorithm 1 describes a Gibbs sampler which (i) for j[p]j\in[p], resamples from 𝐗j𝐘,𝐗j,θ\mathbf{X}_{j}\mid\mathbf{Y},\mathbf{X}_{-j},\theta^{\star} and (ii) resamples from the posterior of θ𝐘,𝐗\theta^{\star}\mid\mathbf{Y},\mathbf{X}. Step (ii) can be done using any off-the-shelf Bayesian sampler (Brooks et al.,, 2011), since this step is identical to a typical Bayesian regression. Lemma 4.1 shows that Algorithm 1 correctly computes the MLR statistics as nsamplen_{\mathrm{sample}}\to\infty under standard regularity conditions (Robert and Casella,, 2004). These mild conditions are satisfied by our default choices (Example 1), but they can also be relaxed further (see Appendix E.3).

Algorithm 1 Gibbs sampling meta-algorithm to compute MLR statistics.

Input:𝐘,𝐗,𝐗~\mathbf{Y},\mathbf{X},\widetilde{\mathbf{X}}, a model class {P(θ):θΘ}\{P^{(\theta)}:\theta\in\Theta\} and prior π:Θ0\pi:\Theta\to\mathbb{R}_{\geq 0}.

1:Initialize θ(0)π\theta^{(0)}\sim\pi and 𝐗j(0)indUnif({𝐗j,𝐗~j})\mathbf{X}_{j}^{(0)}\stackrel{{\scriptstyle\mathrm{ind}}}{{\sim}}\mathrm{Unif}(\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}) for j[p]j\in[p]. \triangleright Initialization
2:for i=1,2,,nsamplei=1,2,\dots,n_{\mathrm{sample}} do:
3:     Initialize 𝐗(i)=𝐗(i1)n×p\mathbf{X}^{(i)}=\mathbf{X}^{(i-1)}\in\mathbb{R}^{n\times p}.
4:     for j=1,,pj=1,\dots,p do: \triangleright Resample 𝐗(i)\mathbf{X}^{(i)}
5:         Set \eta_{j}^{(i)}=\log\left(P_{\mathbf{Y}\mid\mathbf{X}}^{(\theta^{(i-1)})}(\mathbf{Y}\mid[\mathbf{X}_{-j}^{(i)},\mathbf{X}_{j}])\right)-\log\left(P_{\mathbf{Y}\mid\mathbf{X}}^{(\theta^{(i-1)})}(\mathbf{Y}\mid[\mathbf{X}_{-j}^{(i)},\widetilde{\mathbf{X}}_{j}])\right).
6:         Define pj(i)=logit1(ηj(i))p_{j}^{(i)}=\mathrm{logit}^{-1}(\eta_{j}^{(i)}).
7:         Set 𝐗j(i)=𝐗j\mathbf{X}_{j}^{(i)}=\mathbf{X}_{j} with probability pj(i)p_{j}^{(i)}. Else set 𝐗j(i)=𝐗~j\mathbf{X}_{j}^{(i)}=\widetilde{\mathbf{X}}_{j}.      
8:     Sample θ(i)\theta^{(i)} from the law of θ𝐘,𝐗=𝐗(i)\theta^{\star}\mid\mathbf{Y},\mathbf{X}=\mathbf{X}^{(i)} under PπP^{\pi}. \triangleright Resample θ(i)\theta^{(i)}
9:Return MLRjπ=log(i=1nsamplepj(i))log(i=1nsample1pj(i))\mathrm{MLR}_{j}^{\pi}=\log\left(\sum_{i=1}^{n_{\mathrm{sample}}}p_{j}^{(i)}\right)-\log\left(\sum_{i=1}^{n_{\mathrm{sample}}}1-p_{j}^{(i)}\right), for j[p]j\in[p].
Lemma 4.1.

Let pj(i)p_{j}^{(i)} be defined as in Algorithm 1. Suppose that under PπP^{\pi}, (i) pj(i)(0,1)p_{j}^{(i)}\in(0,1) a.s. for j[p]j\in[p] and (ii) the support of θ𝐗,𝐘\theta^{\star}\mid\mathbf{X},\mathbf{Y} equals Θ\Theta. Then as nsamplen_{\mathrm{sample}}\to\infty,

log(i=1nsamplepj(i))log(i=1nsample1pj(i))pMLRjπlog(Pjπ(𝐗jD)Pjπ(𝐗~jD)).\log\left(\sum_{i=1}^{n_{\mathrm{sample}}}p_{j}^{(i)}\right)-\log\left(\sum_{i=1}^{n_{\mathrm{sample}}}1-p_{j}^{(i)}\right)\stackrel{{\scriptstyle\mathrm{p}}}{{\to}}\mathrm{MLR}_{j}^{\pi}\coloneqq\log\left(\frac{P_{j}^{\pi}(\mathbf{X}_{j}\mid D)}{P_{j}^{\pi}(\widetilde{\mathbf{X}}_{j}\mid D)}\right).
Remark 4.

Algorithm 1 also allows us to diagnose Assumption 3.2. Prop. 3.1 yields that if 𝐗^j=argmax𝐱{𝐗j,𝐗~j}Pjπ(𝐱D)\widehat{\mathbf{X}}_{j}=\operatorname*{arg\,max}_{\mathbf{x}\in\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}}P_{j}^{\pi}(\mathbf{x}\mid D), then

CovPπ(𝕀(MLRkπ>0),𝕀(MLRjπ>0)D)=CovPπ(𝕀(𝐗^k=𝐗k),𝕀(𝐗^j=𝐗j)D).\displaystyle\operatorname{Cov}_{P^{\pi}}(\mathbb{I}(\mathrm{MLR}_{k}^{\pi}>0),\mathbb{I}(\mathrm{MLR}_{j}^{\pi}>0)\mid D)=\operatorname{Cov}_{P^{\pi}}(\mathbb{I}(\widehat{\mathbf{X}}_{k}=\mathbf{X}_{k}),\mathbb{I}(\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j})\mid D). (4.2)

Thus, we can approximate the covariance above with the empirical covariance of {𝕀(𝐗j(i)=𝐗^j)}i=1nsample\{\mathbb{I}(\mathbf{X}_{j}^{(i)}=\widehat{\mathbf{X}}_{j})\}_{i=1}^{n_{\mathrm{sample}}} and {𝕀(𝐗k(i)=𝐗^k)}i=1nsample\{\mathbb{I}(\mathbf{X}_{k}^{(i)}=\widehat{\mathbf{X}}_{k})\}_{i=1}^{n_{\mathrm{sample}}}, where {𝐗(i)}i=1nsample\{\mathbf{X}^{(i)}\}_{i=1}^{n_{\mathrm{sample}}} are the samples from Algorithm 1.
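As a concrete version of this diagnostic, the sketch below estimates the covariance matrix in Eq. (4.2) from the Gibbs draws. It approximates 𝐗^j by a majority vote over the samples (a Monte Carlo stand-in for the posterior argmax in Prop. 3.1), and the helper name and array layout are ours.

```python
import numpy as np

def sign_cov_diagnostic(X_samples, Xorig):
    """Empirical version of Eq. (4.2). `X_samples` has shape (n_sample, n, p) and holds the
    Gibbs draws of X from Algorithm 1; `Xorig` is the observed feature matrix. We estimate
    X_hat_j as whichever of {X_j, Xtilde_j} appears more often in the draws and return the
    sample covariance matrix of the agreement indicators."""
    is_original = np.all(np.isclose(X_samples, Xorig[None]), axis=1)   # shape (n_sample, p)
    picks_original = is_original.mean(axis=0) >= 0.5                   # is X_hat_j == X_j?
    agree = np.where(picks_original, is_original, ~is_original).astype(float)
    return np.cov(agree, rowvar=False)                                 # (p, p) covariance
```

Large off-diagonal entries of the returned matrix would indicate that the local dependence condition (Assumption 3.2) is suspect for the data at hand.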

Remark 5.

In the special case of Gaussian linear models with a sparse prior on the coefficients β\beta, Algorithm 1 is similar in flavor to the “Bayesian Variable Selection” (BVS) feature statistic from Candès et al., (2018), although there are differences in the Gibbs sampler and the final estimand. Broadly, we see our work as complementary to theirs. Yet aside from technical details, a main difference is that Candès et al., (2018) seemed to argue that the advantage of BVS was to incorporate accurate prior information. In contrast, we argue that MLR statistics can improve power even without prior information (see Section 5) by estimating the right notion of variable importance.

4.2 A default choice of Bayesian model

Below, we describe a class of Bayesian models that is computationally efficient and can flexibly model both linear and nonlinear relationships. Note that to specify 𝒫\mathcal{P}, it suffices to model the law of 𝐘𝐗\mathbf{Y}\mid\mathbf{X}, since the law of 𝐗\mathbf{X} is assumed known in the MX case and 𝐗\mathbf{X} is fixed in the FX case.

Example 1 (Sparse generalized additive model).

For linear coefficients β(j)kj\beta^{(j)}\in\mathbb{R}^{k_{j}} and noise variance σ2\sigma^{2}\in\mathbb{R}, let θ=(β(1),,β(p),σ2)ΘK×0\theta=(\beta^{(1)},\dots,\beta^{(p)},\sigma^{2})\in\Theta\coloneqq\mathbb{R}^{K}\times\mathbb{R}_{\geq 0}. For a prespecified set of basis functions ϕj:kj\phi_{j}:\mathbb{R}\to\mathbb{R}^{k_{j}}, we consider the model class 𝒫={P(θ):θΘ}\mathcal{P}=\{P^{(\theta)}:\theta\in\Theta\} where

𝐘i𝐗ind𝒩(j=1pϕj(𝐗ij)Tβ(j),σ2) for i=1,,n under P(θ).\mathbf{Y}_{i}\mid\mathbf{X}\stackrel{{\scriptstyle\mathrm{ind}}}{{\sim}}\mathcal{N}\left(\sum_{j=1}^{p}\phi_{j}(\mathbf{X}_{ij})^{T}\beta^{(j)},\sigma^{2}\right)\text{ for }i=1,\dots,n\text{ under }P^{(\theta)}. (4.3)

By default, we take ϕj\phi_{j} to be the identity function, which reduces to a Gaussian linear model. However, if 𝐘\mathbf{Y} and 𝐗\mathbf{X} may have nonlinear relationships, we suggest taking ϕj()\phi_{j}(\cdot) to be the basis representation of regression splines (see Hastie et al., (2001) for review), as we do in Section 5.2. For the prior, we let π\pi denote the law of θ\theta^{\star} after sampling from the following process:

  • Sample hyperparameters p0Beta(a0,b0)p_{0}\sim\mathrm{Beta}(a_{0},b_{0}) (sparsity), τ2invGamma(aτ,bτ)\tau^{2}\sim\mathrm{invGamma}(a_{\tau},b_{\tau}) (signal size), and σ2invGamma(aσ,bσ)\sigma^{2}\sim\mathrm{invGamma}(a_{\sigma},b_{\sigma}) (noise variance). By default, we take a_{0}=b_{0}=b_{\tau}=b_{\sigma}=1 and a_{\tau}=a_{\sigma}=2.

  • Sample β(j)=BjZj\beta^{(j)}=B_{j}Z_{j} for Zjind𝒩(0,τ2Ikj)Z_{j}\stackrel{{\scriptstyle\mathrm{ind}}}{{\sim}}\mathcal{N}(0,\tau^{2}I_{k_{j}}) and Bji.i.d.Bern(1p0).B_{j}\stackrel{{\scriptstyle\mathrm{i.i.d.}}}{{\sim}}\mathrm{Bern}(1-p_{0}).

This group-sparse prior is effectively a “two-groups” model, as 𝐗j\mathbf{X}_{j} is null if and only if β(j)=0\beta^{(j)}=0. As shown in Section 5, using these hyperpriors allows us to adaptively estimate the sparsity level.
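For reference, here is a minimal sketch of sampling θ from the prior of Example 1. The function name and defaults are ours (the defaults match the hyperparameters listed above), and we assume the shape/scale parameterization of the inverse-gamma distribution.

```python
import numpy as np

def sample_prior(k_dims, a0=1.0, b0=1.0, a_tau=2.0, b_tau=1.0,
                 a_sigma=2.0, b_sigma=1.0, seed=0):
    """Draw theta = (beta^(1), ..., beta^(p), sigma^2) from the group-sparse prior of
    Example 1. `k_dims` lists the basis dimensions (k_1, ..., k_p)."""
    rng = np.random.default_rng(seed)
    p0 = rng.beta(a0, b0)                                  # sparsity level
    tau2 = 1.0 / rng.gamma(a_tau, 1.0 / b_tau)             # tau^2 ~ invGamma(a_tau, b_tau)
    sigma2 = 1.0 / rng.gamma(a_sigma, 1.0 / b_sigma)       # sigma^2 ~ invGamma(a_sigma, b_sigma)
    betas = [rng.binomial(1, 1.0 - p0) * rng.normal(0.0, np.sqrt(tau2), size=kj)
             for kj in k_dims]                             # beta^(j) = B_j * Z_j
    return betas, sigma2
```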

Standard techniques for “spike-and-slab” models (George and McCulloch,, 1997) allow us to compute the MLR statistics from Ex. 1 in O(nsamplenp)O(n_{\mathrm{sample}}np) operations (assuming j=1pkj=O(p)\sum_{j=1}^{p}k_{j}=O(p))—see Appendix E for review. This cost is cheaper than computing Gaussian MX or FX knockoffs, which requires O(np2+p3)O(np^{2}+p^{3}) operations. Fitting the LASSO has a comparable cost, which is O(nsamplenp)O(n_{\mathrm{sample}}np) using coordinate descent or O(np2)O(np^{2}) using the LARS algorithm (Efron et al.,, 2004).

Lastly, we can easily extend this algorithm to binary responses. In particular, using techniques from Albert and Chib, (1993), we can compute Gibbs updates in the same computational complexity when Pπ(Y=1X)=Φ(j=1pϕj(Xj)Tβ(j))P^{\pi}(Y=1\mid X)=\Phi\left(\sum_{j=1}^{p}\phi_{j}(X_{j})^{T}\beta^{(j)}\right), where Φ\Phi is the Gaussian CDF (see Appendix E for details).
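To illustrate the data-augmentation idea, the sketch below gives one plain Albert and Chib (1993) Gibbs update for a probit model with a Gaussian prior on the stacked coefficients. It is a simplified stand-in: the paper's sampler additionally handles the spike-and-slab structure and attains the computational complexity discussed above, which this vanilla O(p^3) update does not attempt.

```python
import numpy as np
from scipy import stats

def albert_chib_update(y, X, beta, tau2, rng):
    """One Albert-Chib data-augmentation update for probit regression with a N(0, tau2 * I)
    prior on beta. A sketch only; see Appendix E for the paper's full sampler."""
    mu = X @ beta
    # Latent z_i ~ N(mu_i, 1), truncated to (0, inf) if y_i = 1 and to (-inf, 0) if y_i = 0.
    lower = np.where(y == 1, -mu, -np.inf)
    upper = np.where(y == 1, np.inf, -mu)
    z = mu + stats.truncnorm.rvs(lower, upper, size=len(y), random_state=rng)
    # Conjugate Gaussian update: beta | z ~ N(V X'z, V) with V = (X'X + I / tau2)^{-1}.
    V = np.linalg.inv(X.T @ X + np.eye(X.shape[1]) / tau2)
    return rng.multivariate_normal(V @ (X.T @ z), V)
```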

5 Simulations

We now analyze the power of MLR statistics in simulations. Throughout, MLR statistics do not have accurate prior information: we use exactly the same choice of Bayesian model PπP^{\pi} (the default from Section 4.2) to compute MLR statistics in every plot. Also, we let 𝐗\mathbf{X} be highly correlated to test whether MLR statistics perform well even when Assumption 3.2 may fail. Nonetheless, MLR statistics uniformly outperform existing competitors.

The FDR level is q=0.05q=0.05. All plots have two standard deviation error bars, although the bars may be too small to be visible. In each plot, knockoffs provably control the frequentist FDR, so we only plot power. All code is available at https://github.com/amspector100/mlr_knockoff_paper.

5.1 Gaussian linear models

Figure 3: Power of MLR, LCD, and LSM statistics in a sparse Gaussian linear model with p=500p=500 and 5050 non-nulls. For FX knockoffs, MLR statistics almost exactly match the power of the oracle procedure which provably upper bounds the power of any feature statistic. For MX knockoffs, MLR statistics are slightly less powerful than the oracle, although they are still very powerful compared to the lasso-based statistics. Note that the power of knockoffs can be roughly constant in nn in the “SDP” setting: this is because SDP knockoffs sometimes have identifiability issues (Spector and Janson,, 2022). Also, note that the “LCD” and “LSM” curves completely overlap in two of the bottom panels, where both methods have zero power. See Appendix G for precise simulation details.

In this section, we sample 𝐘𝐗𝒩(𝐗β,In)\mathbf{Y}\mid\mathbf{X}\sim\mathcal{N}(\mathbf{X}\beta,I_{n}) for sparse β\beta. We draw Xi.i.d.𝒩(0,Σ)X\stackrel{{\scriptstyle\mathrm{i.i.d.}}}{{\sim}}\mathcal{N}(0,\Sigma) for two choices of Σ\Sigma. By default, Σ\Sigma corresponds to a highly correlated nonstationary AR(1) process, inspired by real genetic design matrices. However, we also analyze an “ErdosRenyi” covariance matrix where Σ\Sigma is 80%80\% sparse with the nonzero entries drawn uniformly at random. We compute both “SDP” and “MVR” knockoffs (Candès et al.,, 2018; Spector and Janson,, 2022) to show that MLR statistics perform well in both cases. See Appendix G for further simulation details.
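For readers who wish to reproduce a setting of this flavor, the following sketch generates data roughly matching the description above. The correlation, coefficient size, and dimensions are illustrative placeholders rather than the exact values from Appendix G, and we use a stationary AR(1) design as a simple stand-in for the nonstationary one.

```python
import numpy as np

def sample_linear_ar1(n=650, p=500, k=50, rho=0.9, coef=0.5, seed=0):
    """Generate X ~ N(0, Sigma) with an AR(1) covariance and Y | X ~ N(X beta, I_n)
    with k non-null coefficients. All parameter values are illustrative placeholders."""
    rng = np.random.default_rng(seed)
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.zeros(p)
    nonnull = rng.choice(p, size=k, replace=False)
    beta[nonnull] = coef * rng.choice([-1.0, 1.0], size=k)
    Y = X @ beta + rng.standard_normal(n)
    return X, Y, beta
```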

We compare four feature statistics. First, we compute MLR statistics using the default Bayesian model from Section 4—in plots, “MLR” refers to this version of MLR statistics. Second, we compute LCD and LSM statistics as described in Section 2. Lastly, we compute the oracle MLR statistics which have full knowledge of the true value of β\beta. Figure 3 shows the results while varying nn in low dimensions (using FX knockoffs) and high dimensions (using MX knockoffs). It shows that MLR statistics are substantially more powerful than the lasso-based statistics and, in the FX case, MLR statistics almost perfectly match the power of the oracle. Indeed, this result holds even for the “ErdosRenyi” covariance matrix, where 𝐗\mathbf{X} exhibits strong non-local dependencies (in contrast to Assumption 3.2). Furthermore, Figure 4 shows that MLR statistics are computationally efficient, often faster than a cross-validated lasso and comparable to the cost of computing FX knockoffs.

Figure 4: This figure shows the computation time for various feature statistics in the same setting as Figure 3, as well as the cost of computing knockoffs. It shows that MLR statistics are competitive with state-of-the-art feature statistics (in the model-X case) or comparable to the cost of computing knockoffs (in the fixed-X case).

Next, we analyze the performance of MLR statistics when the prior is misspecified. In Figure 5, we vary the sparsity (proportion of non-nulls) between 5%5\% and 40%40\%, and we draw the non-null coefficients as (i) heavy-tailed i.i.d. Laplace variables and (ii) “light-tailed” i.i.d. Unif([1/2,1/4][1/4,1/2])\mathrm{Unif}([-1/2,-1/4]\cup[1/4,1/2]) variables. In all cases, the MLR prior assumes the non-null coefficients are i.i.d. 𝒩(0,τ2)\mathcal{N}(0,\tau^{2}) with sparsity p0Beta(1,1)p_{0}\sim\mathrm{Beta}(1,1). Nonetheless, MLR statistics consistently outperform the lasso-based statistics and nearly match the performance of the oracle.

Figure 5: This figure shows the power of MLR, LCD, and LSM statistics when varying the sparsity level and drawing the non-null coefficients from a heavy-tailed (Laplace) and light-tailed (Uniform) distribution, with p=500p=500 and n=1250n=1250. The setting is otherwise identical to the AR1 setting from Figure 3. It shows that the MLR statistics perform well despite using the same (misspecified) prior in every setting.

Lastly, we verify that the local dependence condition assumed in Theorem 3.2 holds empirically. We consider the AR(1) setting but modify the parameters so that 𝐗\mathbf{X} is extremely highly correlated, with adjacent correlations drawn as i.i.d. Beta(50,1)\mathrm{Beta}(50,1) variables. We also consider a setting where 𝐗\mathbf{X} is equicorrelated with correlation 95%95\%. In both cases, Figure 6 shows that CovPπ(𝕀(MLRπ>0)D)\operatorname{Cov}_{P^{\pi}}(\mathbb{I}(\mathrm{MLR}^{\pi}>0)\mid D) has entries which decay off the main diagonal—in fact, the maximum off-diagonal covariance across both examples is 0.070.07. Please see Section 3.3 and Appendix B.9 for intuition behind this result, although we cannot perfectly explain it.

Figure 6: In the AR(1) and equicorrelated settings, this plot shows both the correlation matrix of 𝐗\mathbf{X} as well as the conditional covariance of signs of the MLR statistic MLRπ\mathrm{MLR}^{\pi}, computed as per Remark 4. It shows that even when 𝐗\mathbf{X} is very highly correlated, the signs of MLRπ\mathrm{MLR}^{\pi} are only locally dependent. Note in this plot, every feature is non-null and the power of knockoffs is 53%53\% and 10%10\% for the AR(1) and equicorrelated settings, respectively.

5.2 Generalized additive models

We now sample 𝐘𝐗𝒩(h(𝐗)β,In)\mathbf{Y}\mid\mathbf{X}\sim\mathcal{N}(h(\mathbf{X})\beta,I_{n}) for some non-linear function h:h:\mathbb{R}\to\mathbb{R} applied element-wise to 𝐗n×p\mathbf{X}\in\mathbb{R}^{n\times p}. We consider the AR(1) setting from Section 5.1 with four choices of hh: h(x)=sin(x),h(x)=cos(x),h(x)=x2,h(x)=x3h(x)=\sin(x),h(x)=\cos(x),h(x)=x^{2},h(x)=x^{3}. We compare six feature statistics: linear MLR statistics, MLR statistics based on cubic regression splines with one knot, oracle MLR statistics, the LCD, a random forest with swap importances as in Gimenez et al., (2019), and DeepPINK (Lu et al.,, 2018), which is based on a feedforward neural network. This setting is more challenging than the linear setting, since the feature statistics must learn (or approximate) the function hh. Thus, our simulations in this section are low-dimensional with n>pn>p, and we should not expect any of the other feature statistics to match the performance of the oracle MLR statistics.

Figure 7: This figure plots power in generalized additive models where 𝐘𝐗𝒩(h(𝐗)β,In)\mathbf{Y}\mid\mathbf{X}\sim\mathcal{N}(h(\mathbf{X})\beta,I_{n}), for h:h:\mathbb{R}\to\mathbb{R} applied elementwise to 𝐗\mathbf{X}. The x-facets show the choice of hh, and the y-facets show results for both MVR and SDP MX knockoffs; note p=200p=200 and there are 6060 non-nulls. For this plot only, we choose q=0.1q=0.1 because several of the competitors made zero discoveries at q=0.05q=0.05. See Appendix G for the plot with q=0.05q=0.05. Note that in the top “cubic” panel, the MLR (splines) statistic has 100%100\% power and overlaps with the oracle.

Figure 7 shows that “MLR (splines)” uniformly outperforms every other feature statistic, often by wide margins. Linear MLR and LCD statistics are powerless in the cos\cos and quadratic settings, where hh is an even function and thus the non-null features have no linear relationship with the response. However, in the sin\sin and cubic settings, linear MLR statistics outperform the LCD, suggesting that linear MLR statistics can be powerful under misspecification as long as there is some linear effect.

5.3 Logistic regression

Lastly, we consider logistic regression: we sample YXBern(s(Xβ))Y\mid X\sim\mathrm{Bern}(s(X^{\top}\beta)), where ss is the sigmoid function. We run the same simulation setting as Figure 3, except that YY is now binary and we consider low-dimensional settings, since inference in logistic regression is generally more challenging than in linear regression. Figure 8 shows that MLR statistics outperform the LCD, although there is a substantial gap between the MLR and oracle MLR statistics.

Figure 8: This plot shows the power of MLR statistics compared to the cross-validated LCD in logistic regression, with p=500p=500, 5050 non-nulls, and nn varied between 15001500 and 45004500. The setting is otherwise identical to Figure 3.

6 Real applications

In this section, we apply MLR statistics to three real datasets which have been previously analyzed using knockoffs. We use the same default choice of MLR statistics from our simulations in all three applications. In each case, MLR statistics have comparable or higher power than competitor statistics. All code and data are available at https://github.com/amspector100/mlr_knockoff_paper.

6.1 HIV drug resistance

We begin with the HIV drug resistance dataset from Rhee et al., (2006), which Barber and Candès, (2015), among others, previously analyzed using knockoffs. The dataset consists of genotype data from n750n\approx 750 HIV samples as well as drug resistance measurements for 1616 different drugs, and the goal is to discover genetic variants that affect drug resistance for each drug. Furthermore, Rhee et al., (2005) published treatment-selected mutation panels for this setting, so we can check whether the discoveries made by knockoffs are corroborated by this separate analysis.

We preprocess and model the data following Barber and Candès, (2015). Then, we apply FX knockoffs with LCD, LSM, and MLR statistics and FDR level q=0.05q=0.05. For both MVR and SDP knockoffs, Figure 9 shows the total number of discoveries made by each statistic, stratified by whether each discovery is corroborated by Rhee et al., (2005). For SDP knockoffs, the MLR statistics make nearly an order of magnitude more discoveries than the competitor methods with a comparable corroboration rate. For MVR knockoffs, MLR and LCD statistics perform roughly equally well, although MLR statistics make 5%\approx 5\% more discoveries with a slightly higher corroboration rate. Overall, in this setting, MLR statistics are competitive with and sometimes substantially outperform the lasso-based statistics. See Appendix H for specific results for each drug.

Figure 9: This figure shows the total number of discoveries made by the LCD, LSM, and MLR feature statistics in the HIV drug resistance dataset from Rhee et al., (2006), summed across all 1616 drugs.

6.2 Financial factor selection

In finance, analysts often aim to select factors that drive the performance of a particular asset. Challet et al., (2021) applied FX knockoffs to factor selection, and as a benchmark, they tested which US equities explain the performance of an index fund for the energy sector (XLE). Here, the ground truth is available since the index fund is a weighted combination of a known list of stocks.

We perform the same analysis for ten index funds tracking key sectors of the US economy, including energy and technology (see Appendix H). Here, 𝐘\mathbf{Y} is the index fund’s daily log return and 𝐗\mathbf{X} contains the daily log returns of each stock in the S&P 500 since 20132013, so p500p\approx 500 and n2300n\approx 2300. We compute fixed-X MVR and SDP knockoffs and apply LCD, LSM, and MLR statistics. Figure 10 shows the number of true and false discoveries summed across all index funds with q=0.05q=0.05. In particular, MLR statistics make 35%35\% and 78%78\% more discoveries than the LCD for MVR and SDP knockoffs, respectively, and the LSM makes fewer than one-fifth as many discoveries as the MLR statistics. Thus, MLR statistics substantially outperform the lasso-based statistics. Appendix H also shows that the FDP (averaged across all index funds) is well below 5%5\% for each method.

Figure 10: This figure shows the total number of discoveries made by each method in the fund replication dataset inspired by Challet et al., (2021), summed across all ten index funds. See Appendix H for a table showing that the average FDP for each method is below the nominal level of q=0.05q=0.05.

6.3 Graphical model discovery for gene networks

Lastly, we consider the problem of recovering a gene network from single-cell RNAseq data. Our analysis follows Li and Maathuis, (2019), who model gene expression log counts as a Gaussian graphical model (see Li and Maathuis, (2019) for justification of the Gaussian assumption). In particular, they develop an extension of FX knockoffs that detects edges in Gaussian graphical models while controlling the FDR across discovered edges. They applied this method to RNAseq data from Zheng et al., (2017). The ground truth is not available, so following Li and Maathuis, (2019), we only evaluate methods based on the number of discoveries they make.

We replicate this analysis and compare LCD, LSM, and MLR statistics. Figure 11 plots the number of discoveries as a function of q[0,0.5]q\in[0,0.5]. MLR statistics make the most discoveries for nearly every value of qq, although often by a small margin. For small qq, the LSM statistic performs poorly, and for large qq, the LCD statistic performs poorly, whereas the MLR statistic is consistently powerful.

Figure 11: This figure shows the number of discoveries made by LCD, LSM, and MLR statistics when used to detect edges in a Gaussian graphical model for gene expression data, as in Li and Maathuis, (2019).

7 Discussion

This paper introduces masked likelihood ratio statistics, a class of asymptotically Bayes-optimal knockoff statistics. We show in simulations and three applications that MLR statistics are efficient and powerful. However, our work leaves open several directions for future research.

  • MLR statistics are asymptotically Bayes optimal. However, it might be worthwhile to develop minimax-optimal knockoff statistics, e.g., by computing a “least-favorable” prior.

  • Our theory requires a “local dependence” condition which is challenging to verify analytically, although it can be diagnosed using the data at hand. It might be interesting to investigate (i) precisely when this condition holds and (ii) whether MLR statistics are still optimal when it fails.

  • We only consider classes of MLR statistics designed for binary GLMs and generalized additive models. However, other types of MLR statistics could be more powerful, e.g., those based on Bayesian additive regression trees (Chipman et al.,, 2010).

  • In practice, analysts may prefer to discover features with large effect sizes. E.g., in Section 2, the P90.M variant has a large estimated OLS coefficient; thus, while it is particularly hard to discover, it may be particularly valuable to discover. In principle, the Bayesian framework in Section 3.4 could be used to find knockoff statistics which asymptotically maximize many different notions of power, e.g., the sum of squared coefficient sizes across all discovered variables.

8 Acknowledgements

The authors thank John Cherian, Kevin Guo, Lucas Janson, Lihua Lei, Basil Saeed, Anav Sood, and Timothy Sudijono for valuable comments. A.S. is partially supported by a Citadel GQS PhD Fellowship, the Two Sigma Graduate Fellowship Fund, and an NSF Graduate Research Fellowship. W.F. is partially supported by NSF grant DMS-1916220 and a Hellman Fellowship from Berkeley.

References

  • Albert and Chib, (1993) Albert, J. H. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669–679.
  • Barber and Candès, (2015) Barber, R. F. and Candès, E. J. (2015). Controlling the false discovery rate via knockoffs. Ann. Statist., 43(5):2055–2085.
  • Barber et al., (2020) Barber, R. F., Candès, E. J., and Samworth, R. J. (2020). Robust inference with knockoffs. Ann. Statist., 48(3):1409–1431.
  • Bates et al., (2020) Bates, S., Candès, E., Janson, L., and Wang, W. (2020). Metropolized knockoff sampling. Journal of the American Statistical Association, 0(0):1–15.
  • Brooks et al., (2011) Brooks, S., Gelman, A., Jones, G., and Meng, X. (2011). Handbook of Markov Chain Monte Carlo. CRC Press, United States.
  • Candès et al., (2018) Candès, E., Fan, Y., Janson, L., and Lv, J. (2018). Panning for gold: Model-X knockoffs for high-dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B, 80(3):551–577.
  • Challet et al., (2021) Challet, D., Bongiorno, C., and Pelletier, G. (2021). Financial factors selection with knockoffs: Fund replication, explanatory and prediction networks. Physica A: Statistical Mechanics and its Applications, 580:126105.
  • Chen et al., (2019) Chen, J., Hou, A., and Hou, T. Y. (2019). A prototype knockoff filter for group selection with FDR control. Information and Inference: A Journal of the IMA, 9(2):271–288.
  • Chipman et al., (2010) Chipman, H. A., George, E. I., and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266 – 298.
  • Dai and Barber, (2016) Dai, R. and Barber, R. (2016). The knockoff filter for fdr control in group-sparse and multitask regression. In Balcan, M. F. and Weinberger, K. Q., editors, International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1851–1859, New York, New York, USA. PMLR.
  • Donoho and Jin, (2004) Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. The Annals of Statistics, 32(3):962 – 994.
  • Doukhan and Neumann, (2007) Doukhan, P. and Neumann, M. H. (2007). Probability and moment inequalities for sums of weakly dependent random variables, with applications. Stochastic Processes and their Applications, 117(7):878–903.
  • Efron et al., (2004) Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2):407 – 499.
  • Fan et al., (2020) Fan, Y., Demirkaya, E., Li, G., and Lv, J. (2020). Rank: Large-scale inference with graphical nonlinear knockoffs. Journal of the American Statistical Association, 115(529):362–379. PMID: 32742045.
  • Farcomeni, (2007) Farcomeni, A. (2007). Some results on the control of the false discovery rate under dependence. Scandinavian Journal of Statistics, 34(2):275–297.
  • Ferreira and Zwinderman, (2006) Ferreira, J. A. and Zwinderman, A. H. (2006). On the Benjamini–Hochberg method. The Annals of Statistics, 34(4):1827 – 1849.
  • Fithian and Lei, (2020) Fithian, W. and Lei, L. (2020). Conditional calibration for false discovery rate control under dependence.
  • Genovese and Wasserman, (2004) Genovese, C. and Wasserman, L. (2004). A stochastic process approach to false discovery control. The Annals of Statistics, 32(3):1035 – 1061.
  • George and McCulloch, (1997) George, E. I. and McCulloch, R. E. (1997). Approaches for bayesian variable selection. Statistica Sinica, 7(2):339–373.
  • Gimenez et al., (2019) Gimenez, J. R., Ghorbani, A., and Zou, J. Y. (2019). Knockoffs for the mass: New feature importance statistics with false discovery guarantees. In Chaudhuri, K. and Sugiyama, M., editors, The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, volume 89 of Proceedings of Machine Learning Research, pages 2125–2133. PMLR.
  • Guan and Stephens, (2011) Guan, Y. and Stephens, M. (2011). Bayesian variable selection regression for genome-wide association studies and other large-scale problems. The Annals of Applied Statistics, 5(3):1780 – 1815.
  • Hastie et al., (2001) Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA.
  • Huang and Janson, (2020) Huang, D. and Janson, L. (2020). Relaxing the assumptions of knockoffs by conditioning. Ann. Statist., 48(5):3021–3042.
  • Katsevich and Ramdas, (2020) Katsevich, E. and Ramdas, A. (2020). On the power of conditional independence testing under model-x.
  • Ke et al., (2020) Ke, Z. T., Liu, J. S., and Ma, Y. (2020). Power of fdr control methods: The impact of ranking algorithm, tampered design, and symmetric statistic. arXiv preprint: arXiv:2010.08132.
  • Li and Maathuis, (2019) Li, J. and Maathuis, M. H. (2019). Ggm knockoff filter: False discovery rate control for gaussian graphical models.
  • Li and Fithian, (2021) Li, X. and Fithian, W. (2021). Whiteout: when do fixed-x knockoffs fail?
  • Liu and Rigollet, (2019) Liu, J. and Rigollet, P. (2019). Power analysis of knockoff filters for correlated designs. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
  • Lu et al., (2018) Lu, Y. Y., Fan, Y., Lv, J., and Noble, W. S. (2018). Deeppink: reproducible feature selection in deep neural networks. In Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 8690–8700.
  • Martello and Toth, (1990) Martello, S. and Toth, P. (1990). Knapsack problems: algorithms and computer implementations. John Wiley & Sons, Inc., USA.
  • Polson et al., (2013) Polson, N. G., Scott, J. G., and Windle, J. (2013). Bayesian inference for logistic models using pólya–gamma latent variables. Journal of the American Statistical Association, 108(504):1339–1349.
  • Ren and Candès, (2020) Ren, Z. and Candès, E. J. (2020). Knockoffs with side information. Annals of Applied Statistics.
  • Rhee et al., (2005) Rhee, S.-Y., Fessel, W. J., Zolopa, A. R., Hurley, L., Liu, T., Taylor, J., Nguyen, D. P., Slome, S., Klein, D., Horberg, M., Flamm, J., Follansbee, S., Schapiro, J. M., and Shafer, R. W. (2005). Hiv-1 protease and reverse-transcriptase mutations: Correlations with antiretroviral therapy in subtype b isolates and implications for drug-resistance surveillance. The Journal of Infectious Diseases, 192(3):456–465.
  • Rhee et al., (2006) Rhee, S.-Y., Taylor, J., Wadhera, G., Ben-Hur, A., Brutlag, D. L., and Shafer, R. W. (2006). Genotypic predictors of human immunodeficiency virus type 1 drug resistance. Proceedings of the National Academy of Sciences, 103(46):17355–17360.
  • Robert and Casella, (2004) Robert, C. and Casella, G. (2004). Monte Carlo statistical methods. Springer Verlag.
  • Romano et al., (2020) Romano, Y., Sesia, M., and Candès, E. (2020). Deep knockoffs. Journal of the American Statistical Association, 115(532):1861–1872.
  • Sechidis et al., (2021) Sechidis, K., Kormaksson, M., and Ohlssen, D. (2021). Using knockoffs for controlled predictive biomarker identification. Statistics in Medicine, 40(25):5453–5473.
  • Sesia et al., (2019) Sesia, M., Katsevich, E., Bates, S., Candès, E., and Sabatti, C. (2019). Multi-resolution localization of causal variants across the genome. bioRxiv.
  • Sesia et al., (2018) Sesia, M., Sabatti, C., and Candès, E. J. (2018). Gene hunting with hidden Markov model knockoffs. Biometrika, 106(1):1–18.
  • Spector and Janson, (2022) Spector, A. and Janson, L. (2022). Powerful knockoffs via minimizing reconstructability. The Annals of Statistics, 50(1):252 – 276.
  • Storey et al., (2004) Storey, J. D., Taylor, J. E., and Siegmund, D. (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(1):187–205.
  • Wang and Janson, (2021) Wang, W. and Janson, L. (2021). A high-dimensional power analysis of the conditional randomization test and knockoffs. Biometrika.
  • Weinstein et al., (2017) Weinstein, A., Barber, R. F., and Candès, E. J. (2017). A power analysis for knockoffs under Gaussian designs. IEEE Transactions on Information Theory.
  • Weinstein et al., (2020) Weinstein, A., Su, W. J., Bogdan, M., Barber, R. F., and Candès, E. J. (2020). A power analysis for knockoffs with the lasso coefficient-difference statistic. arXiv.
  • Weissbrod et al., (2020) Weissbrod, O., Hormozdiari, F., Benner, C., Cui, R., Ulirsch, J., Gazal, S., Schoech, A., van de Geijn, B., Reshef, Y., Márquez-Luna, C., O’Connor, L., Pirinen, M., Finucane, H. K., and Price, A. L. (2020). Functionally-informed fine-mapping and polygenic localization of complex trait heritability. Nature Genetics, 52:1355–1363.
  • Zheng et al., (2017) Zheng, G., Terry, J., Belgrader, P., Ryvkin, P., Bent, Z., Wilson, R., Ziraldo, S., Wheeler, T., McDermott, G., Zhu, J., Gregory, M., Shuga, J., Montesclaros, L., Underwood, J., Masquelier, D., Nishimura, S., Schnall-Levin, M., Wyatt, P., Hindson, C., Bharadwaj, R., Wong, A., Ness, K., Beppu, L., Deeg, H., McFarland, C., Loeb, K., Valente, W., Ericson, N., Stevens, E., Radich, J., Mikkelsen, T., Hindson, B., and Bielas, J. (2017). Massively parallel digital transcriptional profiling of single cells. Nature Communications.

Appendix A An illustration of the importance of the order of |W||W|

Section 2.1 argues intuitively that a good knockoff statistic should roughly achieve the following goals:

  1. For each jj, it should maximize P(Wj>0)P^{\star}(W_{j}>0).

  2. The order of {|Wj|}j=1p\{|W_{j}|\}_{j=1}^{p} should match the order of {P(Wj>0)}j=1p\{P^{\star}(W_{j}>0)\}_{j=1}^{p}—i.e., |Wj||W_{j}| should be an increasing function of P(Wj>0)P^{\star}(W_{j}>0).

Section 3 formalizes (and slightly modifies) these goals to develop an asymptotically optimal test statistic. However, to build intuition, we now give a concrete (if contrived) example showing the importance of the second goal.

Consider a setting with p=50p=50 features where 2525 features have a large signal size with P(Wj>0)=99.9%P^{\star}(W_{j}>0)=99.9\%, 1010 features have a moderate signal size with P(Wj>0)=75%P^{\star}(W_{j}>0)=75\%, the last 1515 features are null with P(Wj>0)=50%P^{\star}(W_{j}>0)=50\%, and {sign(Wj)}j=1p\{\operatorname*{sign}(W_{j})\}_{j=1}^{p} are independent. In this case, what absolute values should WW take to maximize power?

To make any discoveries and control the FDR at level q=0.05q=0.05, we must ensure that >95%>95\% of the kk feature statistics with the largest absolute values have positive signs (for some k20k\geq 20 due to the ceiling function in Step 2 in Section 2.1). Since only the features with large signal sizes have a >95%>95\% chance of being positive, making any discoveries is extremely unlikely unless the features with large signal sizes generally have the highest absolute values. Figure 12 illustrates this argument—in the “random prioritization” setting, we sample |Wj|i.i.d.Unif(0,1)|W_{j}|\stackrel{{\scriptstyle\mathrm{i.i.d.}}}{{\sim}}\mathrm{Unif}(0,1), and in the “oracle prioritization” setting, we set |Wj|=P(Wj>0)|W_{j}|=P^{\star}(W_{j}>0). As expected, knockoffs makes zero discoveries with random prioritization and 30\approx 30 discoveries with oracle prioritization.

Figure 12: In a simple (if contrived) example from Appendix A, this figure illustrates the importance of ensuring that {|Wj|}j=1p\{|W_{j}|\}_{j=1}^{p} has roughly the same order as {P(Wj>0)}j=1p\{P^{\star}(W_{j}>0)\}_{j=1}^{p}. The dotted black line shows the discovery threshold.

Appendix B Main proofs and interpretation

In this section, we prove the main results of the paper. We also offer additional discussion of these results.

B.1 Knockoffs as inference on masked data

In this section, we prove Propositions 3.1 and 3.2, Lemma 3.1, and one more related corollary which will be useful when proving Theorem 3.2. As notation, for any matrices M1,M2n×pM_{1},M_{2}\in\mathbb{R}^{n\times p}, let [M1,M2]swap(j)[M_{1},M_{2}]_{{\text{swap}(j)}} denote the matrix [M1,M2][M_{1},M_{2}] but with the jjth column of M1M_{1} and M2M_{2} swapped: similarly, [M1,M2]swap(J)[M_{1},M_{2}]_{{\text{swap}(J)}} swaps all columns jJj\in J of M1M_{1} and M2M_{2}.

Proposition 3.1.

Let 𝐗~\widetilde{\mathbf{X}} be model-X knockoffs such that 𝐗j𝐗~j\mathbf{X}_{j}\neq\widetilde{\mathbf{X}}_{j} a.s. for j[p]j\in[p]. Then W=w([𝐗,𝐗~],𝐘)W=w([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y}) is a valid feature statistic if and only if:

  1. |W||W| is a function of the masked data DD.

  2. For all j[p]j\in[p], there exists a DD-measurable random vector 𝐗^j\widehat{\mathbf{X}}_{j} such that Wj>0W_{j}>0 if and only if 𝐗^j=𝐗j\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j}.

Proof.

Forward direction: Suppose WW is a valid feature statistic; we will now show conditions (i) and (ii). To show (i), note that observing {𝐗j,𝐗~j}j=1p\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}_{j=1}^{p} is equivalent to observing [𝐗,𝐗~]swap(J)[\mathbf{X},\widetilde{\mathbf{X}}]_{{\text{swap}(J)}} for some unobserved J[p]J\subset[p] chosen uniformly at random. Define [𝐗(1),𝐗(2)][𝐗,𝐗~]swap(J)[\mathbf{X}^{(1)},\mathbf{X}^{(2)}]\coloneqq[\mathbf{X},\widetilde{\mathbf{X}}]_{{\text{swap}(J)}} and let W=w([𝐗(1),𝐗(2)],𝐘)W^{\prime}=w([\mathbf{X}^{(1)},\mathbf{X}^{(2)}],\mathbf{Y}). Then by the swap invariance property of knockoffs, we have that |W|=|W||W|=|W^{\prime}|. Since |W||W^{\prime}| is a function of DD, this implies |W||W| is a function of DD as well, which proves (i).

To show (ii), we construct 𝐗^j\widehat{\mathbf{X}}_{j} as follows. Let OjnO_{j}\in\mathbb{R}^{n} be any “other” random vector chosen such that Oj{𝐗j(1),𝐗j(2)}O_{j}\not\in\{\mathbf{X}_{j}^{(1)},\mathbf{X}_{j}^{(2)}\}. Then define

𝐗^j{𝐗j(1)Wj>0𝐗j(2)Wj<0OjWj=0.\widehat{\mathbf{X}}_{j}\coloneqq\begin{cases}\mathbf{X}_{j}^{(1)}&W_{j}^{\prime}>0\\ \mathbf{X}_{j}^{(2)}&W_{j}^{\prime}<0\\ O_{j}&W_{j}^{\prime}=0.\end{cases}

Intuitively, we set 𝐗^j=Oj\widehat{\mathbf{X}}_{j}=O_{j} if and only if Wj=0Wj=0W_{j}=0\Leftrightarrow W_{j}^{\prime}=0, since this will guarantee that 𝐗^j𝐗j\widehat{\mathbf{X}}_{j}\neq\mathbf{X}_{j} whenever Wj=0W_{j}=0.

Note that 𝐗^j\widehat{\mathbf{X}}_{j} is a function of [𝐗(1),𝐗(2)],𝐘[\mathbf{X}^{(1)},\mathbf{X}^{(2)}],\mathbf{Y} and therefore is DD-measurable. To show that 𝐗^j\widehat{\mathbf{X}}_{j} is well-defined (does not depend on JJ), note that 𝐗^j{𝐗j,𝐗~j,Oj}\widehat{\mathbf{X}}_{j}\in\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j},O_{j}\} can only take one of three values conditional on DD. Thus, it suffices to show that the events 𝐗^j=Oj\widehat{\mathbf{X}}_{j}=O_{j} and 𝐗^j=𝐗j\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j} do not depend on the random set JJ.

To show that the event 𝐗^j=Oj\widehat{\mathbf{X}}_{j}=O_{j} does not depend on JJ, recall 𝐗^j=Oj\widehat{\mathbf{X}}_{j}=O_{j} iff |Wj|=0|W_{j}^{\prime}|=0; since |Wj|=|Wj||W_{j}^{\prime}|=|W_{j}|, this event does not depend on JJ.

To show that the event 𝐗^j=𝐗j\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j} does not depend on JJ, it suffices to show 𝐗^j=𝐗j\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j} if and only if Wj>0W_{j}>0, which also shows (ii). There are two cases. In the first case, if jJj\not\in J, then 𝐗j(1)=𝐗j\mathbf{X}_{j}^{(1)}=\mathbf{X}_{j} by definition of 𝐗(1)\mathbf{X}^{(1)} and also Wj=WjW_{j}^{\prime}=W_{j} by the “flip-sign” property of ww. Thus 𝐗^j=𝐗j(1)=𝐗j\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j}^{(1)}=\mathbf{X}_{j} if and only if Wj>0W_{j}>0. The second case is analogous: if jJj\in J, then Wj=WjW_{j}^{\prime}=-W_{j}, so 𝐗^j=𝐗j(2)=𝐗j\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j}^{(2)}=\mathbf{X}_{j} if and only if Wj<0Wj>0W_{j}^{\prime}<0\Leftrightarrow W_{j}>0. In both cases, Wj>0W_{j}>0 if and only if 𝐗^j=𝐗j\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j}, proving (ii).

Backwards direction: To show W=w([𝐗,𝐗~],𝐘)W=w([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y}) is a valid feature statistic, it suffices to show the flip-sign property, namely that Ww([𝐗,𝐗~]swap(J),𝐘)=1JWW^{\prime}\coloneqq w([\mathbf{X},\widetilde{\mathbf{X}}]_{{\text{swap}(J)}},\mathbf{Y})=-1_{J}\odot W, where \odot denotes elementwise multiplication and 1J-1_{J} is the vector of all ones but with negative ones at the indices in JJ. To do this, note that DD is invariant to swaps of 𝐗\mathbf{X} and 𝐗~\widetilde{\mathbf{X}}, so |W|=|W||W|=|W^{\prime}| because by assumption |W|,|W||W|,|W^{\prime}| are a function of DD. Furthermore, for any j[p]j\in[p], we have that Wj>0W_{j}>0 if and only if 𝐗^j=𝐗j\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j}; however, since 𝐗^j\widehat{\mathbf{X}}_{j} is also a function of DD, we have that sign(Wj)=sign(Wj)\operatorname*{sign}(W_{j})=\operatorname*{sign}(W^{\prime}_{j}) if and only if jJj\not\in J. This completes the proof. ∎

The proof of Proposition 3.2 is identical to the proof of Proposition 3.1, so we omit it for brevity.

We now prove Lemma 3.1.

Lemma 3.1.

Equations (3.3) and (3.4) define valid MX and FX knockoff statistics, respectively.

Proof.

For the MX case, we will show that for any J[p]J\subset[p], mlrπ([𝐗,𝐗~]swap(J),𝐘)=1Jmlrπ([𝐗,𝐗~],𝐘)\mathrm{mlr}^{\pi}([\mathbf{X},\widetilde{\mathbf{X}}]_{{\text{swap}(J)}},\mathbf{Y})=-1_{J}\odot\mathrm{mlr}^{\pi}([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y}), where \odot denotes elementwise multiplication and 1Jp-1_{J}\in\mathbb{R}^{p} is the vector of ones but with negative ones at the indices in JJ. To show this, note that the masked data DD is invariant to swaps. Therefore, applying Eq. 3.3 yields

mlrjπ([𝐗,𝐗~]swap(J),𝐘)={log(Pjπ(𝐗jD)Pjπ(𝐗~jD))jJlog(Pjπ(𝐗~jD)Pjπ(𝐗jD))jJ.\mathrm{mlr}_{j}^{\pi}([\mathbf{X},\widetilde{\mathbf{X}}]_{{\text{swap}(J)}},\mathbf{Y})=\begin{cases}\log\left(\frac{P_{j}^{\pi}(\mathbf{X}_{j}\mid D)}{P_{j}^{\pi}(\widetilde{\mathbf{X}}_{j}\mid D)}\right)&j\not\in J\\ \log\left(\frac{P_{j}^{\pi}(\widetilde{\mathbf{X}}_{j}\mid D)}{P_{j}^{\pi}(\mathbf{X}_{j}\mid D)}\right)&j\in J.\end{cases} (B.1)

Since log(x/y)=log(x)log(y)\log(x/y)=\log(x)-\log(y) is an antisymmetric function, this proves that

mlrjπ([𝐗,𝐗~]swap(J),𝐘)={mlrjπ([𝐗,𝐗~],𝐘)jJmlrjπ([𝐗,𝐗~],𝐘)jJ.\mathrm{mlr}_{j}^{\pi}([\mathbf{X},\widetilde{\mathbf{X}}]_{{\text{swap}(J)}},\mathbf{Y})=\begin{cases}\mathrm{mlr}_{j}^{\pi}([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y})&j\not\in J\\ -\mathrm{mlr}_{j}^{\pi}([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y})&j\in J.\end{cases} (B.2)

which completes the proof. The proof for the FX case is analogous. ∎

Finally, the following corollary of Propositions 3.1 and 3.2 will be important when proving Theorem 3.2.

Corollary B.1.

Let W,WW,W^{\prime} be two knockoff feature statistics. Then in the same setting as Propositions 3.1 and 3.2, the event sign(Wj)=sign(Wj)\operatorname*{sign}(W_{j})=\operatorname*{sign}(W_{j}^{\prime}) is a deterministic function of the masked data DD.

Proof.

We give the proof for the model-X case, and the fixed-X case is analogous. First, note that the events Wj=0W_{j}=0 and Wj=0W_{j}^{\prime}=0 are DD-measurable events since |Wj|,|Wj||W_{j}|,|W_{j}^{\prime}| are DD-measurable by Proposition 3.1. Therefore, the only non-trivial case is the case where Wj,Wj0W_{j},W_{j}^{\prime}\neq 0, which we now consider.

By Proposition (3.1), there exist DD-measurable vectors 𝐗^j,𝐗^j\widehat{\mathbf{X}}_{j},\widehat{\mathbf{X}}_{j}^{\prime} such that Wj>0𝐗^j=𝐗jW_{j}>0\Leftrightarrow\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j} and Wj>0𝐗^j=𝐗jW_{j}^{\prime}>0\Leftrightarrow\widehat{\mathbf{X}}_{j}^{\prime}=\mathbf{X}_{j}. Since 𝐗^j,𝐗^j\widehat{\mathbf{X}}_{j},\widehat{\mathbf{X}}_{j}^{\prime} must take one of exactly two distinct values, this implies that

sign(Wj)=sign(Wj)𝐗^j=𝐗^j\operatorname*{sign}(W_{j})=\operatorname*{sign}(W_{j}^{\prime})\Leftrightarrow\widehat{\mathbf{X}}_{j}=\widehat{\mathbf{X}}_{j}^{\prime}

where the right-most expression is DD-measurable since 𝐗^j,𝐗^j\widehat{\mathbf{X}}_{j},\widehat{\mathbf{X}}_{j}^{\prime} are DD-measurable. This completes the proof. ∎

B.2 Proof of Proposition 3.3

Proposition 3.3.

Given data 𝐘,𝐗\mathbf{Y},\mathbf{X} and knockoffs 𝐗~\widetilde{\mathbf{X}}, let MLRπ\mathrm{MLR}^{\pi} be the MLR statistics with respect to some Bayesian model PπP^{\pi}. Let WW be any other valid knockoff feature statistic. Then,

Pπ(MLRjπ>0D)Pπ(Wj>0D).P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)\geq P^{\pi}(W_{j}>0\mid D). (3.6)

Furthermore, {|MLRjπ|}j=1p\{|\mathrm{MLR}_{j}^{\pi}|\}_{j=1}^{p} has the same order as {Pπ(MLRjπ>0D)}j=1p\{P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)\}_{j=1}^{p}. More precisely,

Pπ(MLRjπ>0D)=exp(|MLRjπ|)1+exp(|MLRjπ|).P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)=\frac{\exp(|\mathrm{MLR}_{j}^{\pi}|)}{1+\exp(|\mathrm{MLR}_{j}^{\pi}|)}. (3.7)
Proof.

First, we prove Eq. (3.6). Let 𝐗^j=argmax𝐱{𝐗j,𝐗~j}Pjπ(𝐱D)\widehat{\mathbf{X}}_{j}^{\star}=\operatorname*{arg\,max}_{\mathbf{x}\in\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}}P_{j}^{\pi}(\mathbf{x}\mid D) be the “best guess” of the value of 𝐗j\mathbf{X}_{j} based on DD, and observe that by definition MLRjπlog(Pjπ(𝐗jD))log(Pjπ(𝐗~jD))>0\mathrm{MLR}_{j}^{\pi}\coloneqq\log(P_{j}^{\pi}(\mathbf{X}_{j}\mid D))-\log(P_{j}^{\pi}(\widetilde{\mathbf{X}}_{j}\mid D))>0 if and only if 𝐗^j=𝐗j\widehat{\mathbf{X}}_{j}^{\star}=\mathbf{X}_{j}. Similarly, Proposition 3.1 proves that there exists some alternative DD-measurable random variable 𝐗^j\widehat{\mathbf{X}}_{j} such that Wj>0W_{j}>0 if and only if 𝐗^j=𝐗j\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j}. However, we note that by definition of 𝐗^j\widehat{\mathbf{X}}_{j}^{\star},

Pπ(MLRjπ>0D)=Pπ(𝐗^j=𝐗jD)Pπ(𝐗^j=𝐗jD)=Pπ(Wj>0D),P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)=P^{\pi}(\widehat{\mathbf{X}}_{j}^{\star}=\mathbf{X}_{j}\mid D)\geq P^{\pi}(\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j}\mid D)=P^{\pi}(W_{j}>0\mid D), (B.3)

which completes the proof of Eq. (3.6).

To prove Eq. (3.7), observe Pjπ(𝐗jD)=1Pjπ(𝐗~jD)P_{j}^{\pi}(\mathbf{X}_{j}\mid D)=1-P_{j}^{\pi}(\widetilde{\mathbf{X}}_{j}\mid D), since conditional on DD we observe the set {𝐗j,𝐗~j}\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}. Therefore,

|MLRjπ|=log(max𝐱{𝐗j,𝐗~j}Pjπ(𝐱D)1max𝐱{𝐗j,𝐗~j}Pjπ(𝐱D))=log(Pπ(MLRjπ>0D)1Pπ(MLRjπ>0D)),|\mathrm{MLR}_{j}^{\pi}|=\log\left(\frac{\max_{\mathbf{x}\in\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}}P_{j}^{\pi}(\mathbf{x}\mid D)}{1-\max_{\mathbf{x}\in\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}}P_{j}^{\pi}(\mathbf{x}\mid D)}\right)=\log\left(\frac{P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)}{1-P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)}\right),

where the second step uses the fact that Pπ(MLRjπ>0D)=Pπ(𝐗^j=𝐗jD)=max𝐱{𝐗j,𝐗~j}Pjπ(𝐱D)P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)=P^{\pi}\left(\widehat{\mathbf{X}}_{j}^{\star}=\mathbf{X}_{j}\mid D\right)=\max_{\mathbf{x}\in\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}}P_{j}^{\pi}(\mathbf{x}\mid D) for 𝐗^j\widehat{\mathbf{X}}_{j}^{\star} as defined above. This completes the proof for model-X knockoffs; the proof in the fixed-X case is analogous and just replaces {𝐗j,𝐗~j}\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\} with {𝐗jT𝐘,𝐗~jT𝐘}\{\mathbf{X}_{j}^{T}\mathbf{Y},\widetilde{\mathbf{X}}_{j}^{T}\mathbf{Y}\}. ∎

B.3 How far from optimality are MLR statistics in finite samples?

Our main result (Theorem 3.2) shows that MLR statistics asymptotically maximize the number of discoveries made by knockoffs under PπP^{\pi}. However, before rigorously proving Theorem 3.2, we give intuition suggesting that MLR statistics are likely nearly optimal even in finite samples.

Recall from Section 3.2 that in finite samples, MLR statistics (i) maximize Pπ(MLRjπ>0D)P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D) for each j[p]j\in[p] and (ii) ensure that the absolute values of the feature statistics {|MLRjπ|}j=1p\{|\mathrm{MLR}_{j}^{\pi}|\}_{j=1}^{p} have the same order as the probabilities {Pπ(MLRjπ>0)}j=1p\{P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0)\}_{j=1}^{p}. As per Proposition 3.4, this strategy is exactly optimal when the vector of signs sign(MLRπ)\operatorname*{sign}(\mathrm{MLR}^{\pi}) are conditionally independent given DD, but in general, it is possible to exploit conditional dependencies among the coordinates of sign(MLRπ)\operatorname*{sign}(\mathrm{MLR}^{\pi}) to slightly improve power. However, we argue below that it is challenging to even slightly improve power without unrealistically strong dependencies.

To see why this is the case, consider a very simple setting with p=6p=6 features and FDR level q=0.2q=0.2, so knockoffs will make discoveries exactly when the first five WW-statistics with the largest absolute values have positive signs. Suppose that W1,,W5DW_{1},\dots,W_{5}\mid D are perfectly correlated and satisfy Pπ(W1>0D)==Pπ(W5>0D)=70%P^{\pi}(W_{1}>0\mid D)=\dots=P^{\pi}(W_{5}>0\mid D)=70\%, and W6W1:5DW_{6}\perp\!\!\!\perp W_{1:5}\mid D satisfies Pπ(W6>0D)=90%P^{\pi}(W_{6}>0\mid D)=90\%. Since W6W_{6} has the highest chance of being positive, MLR statistics will assign it the highest absolute value, in which case knockoffs will make discoveries with probability 70%90%=63%70\%\cdot 90\%=63\%. However, in this example, knockoffs will be more powerful if we ensure that W1,,W5W_{1},\dots,W_{5} have the five largest absolute values, since their signs are perfectly correlated and thus Pπ(W1>0,,W5>0D)=70%>63%P^{\pi}(W_{1}>0,\dots,W_{5}>0\mid D)=70\%>63\%. (Note that there is nothing special about positive correlations in this example: one can also find similar examples where exploiting negative correlations among MLRπ\mathrm{MLR}^{\pi} can slightly increase power.)

This example has two properties which shed light on the more general situation. First, to even get a slight improvement in power required extremely strong dependencies among the coordinates of sign(MLRπ)\operatorname*{sign}(\mathrm{MLR}^{\pi}), which is not realistic. Indeed, empirically in Figure 6, the coordinates of sign(MLRπ)\operatorname*{sign}(\mathrm{MLR}^{\pi}) appear to be almost completely conditionally uncorrelated even when 𝐗\mathbf{X} is extremely highly correlated. Thus, although it may be possible to slightly improve power by exploiting dependencies among sign(MLRπ)\operatorname*{sign}(\mathrm{MLR}^{\pi}), the magnitude of the improvement in power is likely to be small. Second, the reason that it is possible to exploit dependencies to improve power in this case is because knockoffs has a “hard” threshold where one can only make any discoveries if one makes at least 5 discoveries, and exploiting conditional correlations among the vector sign(MLRπ)\operatorname*{sign}(\mathrm{MLR}^{\pi}) can slightly improve the probability that we reach that initial threshold. However, this “threshold” phenomenon is less important in situations where knockoffs are guaranteed to make at least a few discoveries; thus, if the number of discoveries grows with pp, this effect should be insignificant asymptotically.

B.4 Proof of Theorem 3.2

Notation: For any vector xnx\in\mathbb{R}^{n} and knk\leq n, we let x¯k1ki=1kxi\bar{x}_{k}\coloneqq\frac{1}{k}\sum_{i=1}^{k}x_{i} be the sample mean of the first kk elements of xx. For k>nk>n, we let x¯knkx¯\bar{x}_{k}\coloneqq\frac{n}{k}\bar{x} equal the sample mean of the vector xx plus knk-n additional zeros. Additionally, for xnx\in\mathbb{R}^{n} and a permutation κ:[n][n]\kappa:[n]\to[n], κ(x)\kappa(x) denotes the coordinates of xx permuted according to κ\kappa, so that κ(x)i=xκ(i)\kappa(x)_{i}=x_{\kappa(i)}. Throughout this section, all probabilities \mathbb{P} and expectations 𝔼\mathbb{E} are taken under PπP^{\pi}.

Main idea: There are two main ideas behind Theorem 3.2. First, for any feature statistic WW, we will compare the power of WW to the power of a “soft” version of the SeqStep procedure, which depends only on the conditional expectation of sign(W)\operatorname*{sign}(W) instead of the realized values of sign(W)\operatorname*{sign}(W). Roughly speaking, if the coordinates of sign(W)\operatorname*{sign}(W) obey a strong law of large numbers, the power of SeqStep and the power of the “soft” version of SeqStep will be the same asymptotically. Second, we will show that MLR statistics MLRπ\mathrm{MLR}^{\pi} exactly maximize the power of the “soft” version of the SeqStep procedure. Taken together, these two results imply that MLR statistics are asymptotically optimal.

To make this precise, for a feature statistic WW, let sorted(W)\mathrm{sorted}(W) denote WW sorted in decreasing order of its absolute values, and let R=𝕀(sorted(W)>0){0,1}pR=\mathbb{I}(\mathrm{sorted}(W)>0)\in\{0,1\}^{p} be the vector indicating where sorted(W)\mathrm{sorted}(W) has positive entries. The number of discoveries made by knockoffs only depends on RR. Indeed, for any vector η[0,1]p\eta\in[0,1]^{p} and any desired FDR level q(0,1)q\in(0,1), define

ψq(η)maxk{k:kkη¯k+1kη¯kq} and τq(η)=ψq(η)+11+q,\psi_{q}(\eta)\coloneqq\max_{k\in\mathbb{N}}\left\{k:\frac{k-k\bar{\eta}_{k}+1}{k\bar{\eta}_{k}}\leq q\right\}\text{ and }\tau_{q}(\eta)=\left\lceil\frac{\psi_{q}(\eta)+1}{1+q}\right\rceil, (B.4)

where by convention we set x0=\frac{x}{0}=\infty for any x>0x\in\mathbb{R}_{>0} and we remind the reader that for k>pk>p, η¯k=pkη¯\bar{\eta}_{k}=\frac{p}{k}\bar{\eta}. It turns out that knockoffs makes exactly τq(R)\tau_{q}(R) discoveries. For brevity, we refer the reader to Lemma B.3 of Spector and Janson, (2022) for a formal proof of this: however, to see this intuitively, note that kkR¯kk-k\bar{R}_{k} (resp. kR¯kk\bar{R}_{k}) counts the number of negative (resp. positive) entries in the first kk coordinates of sorted(W)\mathrm{sorted}(W), so this definition lines up with the definition of the data-dependent threshold in Section 1.1.
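For concreteness, a direct implementation of Eq. (B.4) is sketched below. The convention that ψ_q = τ_q = 0 when no k qualifies is ours, and we only scan k up to ⌊(1+q)p⌋ + 1, since the inequality in Eq. (B.4) requires k + 1 ≤ (1+q)kη̄_k ≤ (1+q)p and therefore cannot hold for larger k.

```python
import math
import numpy as np

def psi_tau(eta, q):
    """Compute psi_q(eta) and tau_q(eta) from Eq. (B.4) for eta in [0,1]^p, ordered by
    decreasing |W|. For k > p we use the padding convention bar(eta)_k = (p/k) * bar(eta),
    and x/0 = +infinity; we return psi = tau = 0 if no k qualifies (our convention)."""
    eta = np.asarray(eta, dtype=float)
    p = len(eta)
    k_max = math.floor((1 + q) * p) + 1          # no larger k can satisfy the inequality
    psi = 0
    for k in range(1, k_max + 1):
        num_pos = np.sum(eta[: min(k, p)])       # k * bar(eta)_k under the padding convention
        num_neg = k - num_pos
        if num_pos > 0 and (num_neg + 1) / num_pos <= q:
            psi = k
    tau = math.ceil((psi + 1) / (1 + q)) if psi > 0 else 0
    return psi, tau
```

Applying τ_q to the indicator vector R = 𝕀(sorted(W) > 0) recovers the number of discoveries made by knockoffs, while applying it to a vector of conditional probabilities gives the "soft" count used in the remainder of the proof.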

Now, let δ𝔼[RD][0,1]p\delta\coloneqq\mathbb{E}[R\mid D]\in[0,1]^{p} be the conditional expectation of RR given the masked data DD (defined in Equation 3.1). The “soft” version of SeqStep simply applies the functions ψq\psi_{q} and τq\tau_{q} to the conditional expectation δ\delta instead of the realized indicators RR. Intuitively speaking, our goal will be to apply a law of large numbers to show the following asymptotic result:

|τq(δ)τq(R)|=op(# of non-nulls).|\tau_{q}(\delta)-\tau_{q}(R)|=o_{p}\left(\#\text{ of non-nulls}\right).

Once we have shown this, it will be straightforward to show that MLR statistics are asymptotically optimal, since MLR statistics maximize τq(δ)\tau_{q}(\delta) in finite samples.

We now begin to prove Theorem 3.2 in earnest. In particular, the following pair of lemmas tells us that if R¯k\bar{R}_{k} converges to δ¯k\bar{\delta}_{k} uniformly in kk, then τq(δ)τq(R)\tau_{q}(\delta)\approx\tau_{q}(R).

Lemma B.2.

Let W=w([𝐗,𝐗~],𝐘)W=w([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y}) be any feature statistic with R,δ,ψq,τqR,\delta,\psi_{q},\tau_{q} as defined earlier. Fix any k0[p]k_{0}\in[p] and sufficiently small ϵ>0\epsilon>0 such that η3(1+q)ϵ<q\eta\coloneqq 3(1+q)\epsilon<q. Define the event

Ak0,ϵ={maxk0kp|R¯kδ¯k|ϵ}.A_{k_{0},\epsilon}=\left\{\max_{k_{0}\leq k\leq p}|\bar{R}_{k}-\bar{\delta}_{k}|\leq\epsilon\right\}.

Then on the event Ak0,ϵA_{k_{0},\epsilon}, we have that

11+3ϵτqη(R)k01τq(δ)(1+3ϵ)τq+η(R)+k0+1.\frac{1}{1+3\epsilon}\tau_{q-\eta}(R)-k_{0}-1\leq\tau_{q}(\delta)\leq(1+3\epsilon)\tau_{q+\eta}(R)+k_{0}+1. (B.5)

This implies that

|τq(R)τq(δ)|\displaystyle|\tau_{q}(R)-\tau_{q}(\delta)| p𝕀(Ak0,ϵc)+[τq+η(R)τqη(R)]+k0+1+3ϵτq+η(R).\displaystyle\leq p\mathbb{I}(A_{k_{0},\epsilon}^{c})+\big{[}\tau_{q+\eta}(R)-\tau_{q-\eta}(R)\big{]}+k_{0}+1+3\epsilon\tau_{q+\eta}(R). (B.6)
Proof.

Note the proof is entirely algebraic (there is no probabilistic content). We proceed in two steps, first showing Equation (B.5), then Equation (B.6).

Step 1: We now prove Equation (B.5). To start, define the sets

={k[p]:kkR¯k+1kR¯kq+η} and 𝒟={k[p]:kkδ¯k+1kδ¯kq}\mathcal{R}=\left\{k\in[p]:\frac{k-k\bar{R}_{k}+1}{k\bar{R}_{k}}\leq q+\eta\right\}\text{ and }\mathcal{D}=\left\{k\in[p]:\frac{k-k\bar{\delta}_{k}+1}{k\bar{\delta}_{k}}\leq q\right\}

and recall that by definition ψq+η(R)=max()\psi_{q+\eta}(R)=\max(\mathcal{R}), ψq(δ)=max(𝒟)\psi_{q}(\delta)=\max(\mathcal{D}). To analyze the difference between these quantities, fix any k𝒟k\in\mathcal{D}\setminus\mathcal{R}. Then by definition of 𝒟\mathcal{D} and \mathcal{R}, we know

kkδ¯k+1kδ¯kq<q+η<kkR¯k+1kR¯k.\displaystyle\frac{k-k\bar{\delta}_{k}+1}{k\bar{\delta}_{k}}\leq q<q+\eta<\frac{k-k\bar{R}_{k}+1}{k\bar{R}_{k}}.

However, Lemma B.3 (proved in a moment) tells us that this implies the following algebraic inequality:

δ¯kR¯kη3(1+q)=3(1+q)ϵ3(1+q)=ϵ.\bar{\delta}_{k}-\bar{R}_{k}\geq\frac{\eta}{3(1+q)}=\frac{3(1+q)\epsilon}{3(1+q)}=\epsilon.

However, on the event Ak0,ϵA_{k_{0},\epsilon} this cannot occur for any kk0k\geq k_{0}. Therefore, on the event Ak0,ϵA_{k_{0},\epsilon}, 𝒟{1,,k01}\mathcal{D}\setminus\mathcal{R}\subset\{1,\dots,k_{0}-1\}. This implies that

ψq(δ)ψq+η(R)=max(𝒟)max(){0max(𝒟)k0k01max(𝒟)<k0.\psi_{q}(\delta)-\psi_{q+\eta}(R)=\max(\mathcal{D})-\max(\mathcal{R})\leq\begin{cases}0&\max(\mathcal{D})\geq k_{0}\\ k_{0}-1&\max(\mathcal{D})<k_{0}.\end{cases} (B.7)

We can combine these conditions by writing that ψq(δ)ψq+η(R)k01\psi_{q}(\delta)-\psi_{q+\eta}(R)\leq k_{0}-1. Using the definition of τq()\tau_{q}(\cdot), we conclude

τq(δ)−τq+η(R)\displaystyle\tau_{q}(\delta)-\tau_{q+\eta}(R) =ψq(δ)+11+q−ψq+η(R)+11+q+η\displaystyle=\left\lceil\frac{\psi_{q}(\delta)+1}{1+q}\right\rceil-\left\lceil\frac{\psi_{q+\eta}(R)+1}{1+q+\eta}\right\rceil
1+ψq(δ)+11+q−ψq+η(R)+11+q+η\displaystyle\leq 1+\frac{\psi_{q}(\delta)+1}{1+q}-\frac{\psi_{q+\eta}(R)+1}{1+q+\eta}
2+ψq(δ)−ψq+η(R)1+q+(11+q−11+q+η)ψq+η(R)\displaystyle\leq 2+\frac{\psi_{q}(\delta)-\psi_{q+\eta}(R)}{1+q}+\left(\frac{1}{1+q}-\frac{1}{1+q+\eta}\right)\psi_{q+\eta}(R)
=2+ψq(δ)−ψq+η(R)1+q+3ϵ1+q+ηψq+η(R)\displaystyle=2+\frac{\psi_{q}(\delta)-\psi_{q+\eta}(R)}{1+q}+\frac{3\epsilon}{1+q+\eta}\psi_{q+\eta}(R) by def. of η\eta
2+k011+q+3ϵ1+q+ηψq+η(R)\displaystyle\leq 2+\frac{k_{0}-1}{1+q}+\frac{3\epsilon}{1+q+\eta}\psi_{q+\eta}(R) by Eq. (B.7)
k0+1+3ϵτq+η(R)\displaystyle\leq k_{0}+1+3\epsilon\tau_{q+\eta}(R) by def. of τq(R)\tau_{q}(R).

This proves the upper bound, namely that τq(δ)(1+3ϵ)τq+η(R)+k0+1\tau_{q}(\delta)\leq(1+3\epsilon)\tau_{q+\eta}(R)+k_{0}+1. To prove the lower bound, note that we can swap the role of RR and δ\delta and apply the upper bound to q=qηq^{\prime}=q-\eta. Then if we take η=3(1+q)ϵ<η<1\eta^{\prime}=3(1+q^{\prime})\epsilon<\eta<1, applying the upper bound yields

τq(R)(1+3ϵ)τq+η(δ)+k0+1.\tau_{q^{\prime}}(R)\leq(1+3\epsilon)\tau_{q^{\prime}+\eta^{\prime}}(\delta)+k_{0}+1.

Observe that τq()\tau_{q}(\cdot) is nondecreasing in qq. Furthermore, since η<η\eta^{\prime}<\eta, we have that q+η=qη+ηqq^{\prime}+\eta^{\prime}=q-\eta+\eta^{\prime}\leq q. Therefore, τq+η(δ)τq(δ)\tau_{q^{\prime}+\eta^{\prime}}(\delta)\leq\tau_{q}(\delta). Applying this result, we conclude

τqη(R)=τq(R)(1+3ϵ)τq+η(δ)+k0+1(1+3ϵ)τq(δ)+k0+1.\tau_{q-\eta}(R)=\tau_{q^{\prime}}(R)\leq(1+3\epsilon)\tau_{q^{\prime}+\eta^{\prime}}(\delta)+k_{0}+1\leq(1+3\epsilon)\tau_{q}(\delta)+k_{0}+1.

This implies the lower bound 11+3ϵτqη(R)k01τq(δ)\frac{1}{1+3\epsilon}\tau_{q-\eta}(R)-k_{0}-1\leq\tau_{q}(\delta).

Step 2: Now, we show that Equation (B.6) follows from Equation (B.5). To see this, we consider the two cases τq(δ)≥τq(R)\tau_{q}(\delta)\geq\tau_{q}(R) and τq(δ)≤τq(R)\tau_{q}(\delta)\leq\tau_{q}(R) and apply Equation (B.5) in each. In particular, on the event Ak0,ϵA_{k_{0},\epsilon}, we have:

|τq(δ)τq(R)|\displaystyle|\tau_{q}(\delta)-\tau_{q}(R)| ={τq(δ)τq(R)τq(δ)τq(R)τq(R)τq(δ)τq(δ)τq(R)\displaystyle=\begin{cases}\tau_{q}(\delta)-\tau_{q}(R)&\tau_{q}(\delta)\geq\tau_{q}(R)\\ \tau_{q}(R)-\tau_{q}(\delta)&\tau_{q}(\delta)\leq\tau_{q}(R)\end{cases}
{(1+3ϵ)τq+η(R)+k0+1τq(R)τq(δ)τq(R)τq(R)11+3ϵτqη(R)+k0+1τq(δ)τq(R)\displaystyle\leq\begin{cases}(1+3\epsilon)\tau_{q+\eta}(R)+k_{0}+1-\tau_{q}(R)&\tau_{q}(\delta)\geq\tau_{q}(R)\\ \tau_{q}(R)-\frac{1}{1+3\epsilon}\tau_{q-\eta}(R)+k_{0}+1&\tau_{q}(\delta)\leq\tau_{q}(R)\end{cases} by Eq. (B.5)
=k0+1+{τq+η(R)−τq(R)+3ϵτq+η(R)τq(δ)≥τq(R)τq(R)−τq−η(R)+3ϵ1+3ϵτq−η(R)τq(δ)≤τq(R)\displaystyle=k_{0}+1+\begin{cases}\tau_{q+\eta}(R)-\tau_{q}(R)+3\epsilon\tau_{q+\eta}(R)&\tau_{q}(\delta)\geq\tau_{q}(R)\\ \tau_{q}(R)-\tau_{q-\eta}(R)+\frac{3\epsilon}{1+3\epsilon}\tau_{q-\eta}(R)&\tau_{q}(\delta)\leq\tau_{q}(R)\end{cases}
k0+1+τq+η(R)τqη(R)+3ϵτq+η(R),\displaystyle\leq k_{0}+1+\tau_{q+\eta}(R)-\tau_{q-\eta}(R)+3\epsilon\tau_{q+\eta}(R),

where the last line follows because τq(R)\tau_{q}(R) is monotone in qq. This implies Equation (B.6), since |τq(δ)−τq(R)|≤p|\tau_{q}(\delta)-\tau_{q}(R)|\leq p holds trivially on the event Ak0,ϵcA_{k_{0},\epsilon}^{c}, as τq(δ),τq(R)∈[p]\tau_{q}(\delta),\tau_{q}(R)\in[p]. ∎

The following lemma proves a very simple algebraic inequality used in the proof of Lemma B.2.

Lemma B.3.

For any x,y∈[0,1]x,y\in[0,1], k∈ℕk\in\mathbb{N}, q∈(0,1)q\in(0,1), and any γ∈(0,1)\gamma\in(0,1), suppose that 1+k−kxkx≤q<q+γ≤1+k−kyky\frac{1+k-kx}{kx}\leq q<q+\gamma\leq\frac{1+k-ky}{ky}. Then

xyγ(1+q)(1+q+γ)γ3(1+q).x-y\geq\frac{\gamma}{(1+q)(1+q+\gamma)}\geq\frac{\gamma}{3(1+q)}.
Proof.

By assumption, x0x\neq 0, since otherwise 1+kkxkx=\frac{1+k-kx}{kx}=\infty by convention. For x>0x>0, we have that

1+kkxkxq1+kkxkqxxk+1k(1+q).\frac{1+k-kx}{kx}\leq q\implies 1+k-kx\leq kqx\implies x\geq\frac{k+1}{k(1+q)}. (B.8)

Now, there are two cases. If y=0y=0, the inequality holds trivially:

xy=xk+1k11+qγ3(1+q).x-y=x\geq\frac{k+1}{k}\cdot\frac{1}{1+q}\geq\frac{\gamma}{3(1+q)}.

Alternatively, if y>0y>0, we observe similarly to before that

1+kkykyq+γyk+1k(1+q+γ).\frac{1+k-ky}{ky}\geq q+\gamma\implies y\leq\frac{k+1}{k(1+q+\gamma)}. (B.9)

Combining Equations (B.8)–(B.9) yields the result:

xy\displaystyle x-y k+1k(11+q11+q+γ)=k+1kγ(1+q+γ)(1+q)γ3(1+q).\displaystyle\geq\frac{k+1}{k}\left(\frac{1}{1+q}-\frac{1}{1+q+\gamma}\right)=\frac{k+1}{k}\frac{\gamma}{(1+q+\gamma)(1+q)}\geq\frac{\gamma}{3(1+q)}.

We are now ready to prove Theorem 3.2. As a reminder, we consider an asymptotic regime with data 𝐗(n)n×pn,𝐘(n)n\mathbf{X}^{(n)}\in\mathbb{R}^{n\times p_{n}},\mathbf{Y}^{(n)}\in\mathbb{R}^{n} and knockoffs 𝐗~(n)\widetilde{\mathbf{X}}^{(n)}, where PnπP^{\pi}_{n} is the Bayesian model. We let D(n)D^{(n)} denote the masked data for knockoffs as defined in Section 3.1 and let sns_{n} denote the expected number of non-nulls under PnπP^{\pi}_{n}. We will analyze the limiting normalized number of discoveries of feature statistics W(n)=wn([𝐗(n),𝐗~(n)],𝐘)W^{(n)}=w_{n}([\mathbf{X}^{(n)},\widetilde{\mathbf{X}}^{(n)}],\mathbf{Y}) with rejection set S(n)(q)S^{(n)}(q), defined as the expected number of discoveries divided by the expected number of non-nulls:

Γq(wn)=𝔼Pnπ[|S(n)(q)|]sn.\Gamma_{q}(w_{n})=\frac{\mathbb{E}_{P^{\pi}_{n}}[|S^{(n)}(q)|]}{s_{n}}. (3.9)

For convenience, we restate Theorem 3.2 and then prove it.

Theorem 3.2.

For each nn, let MLRπ=mlrnπ([𝐗(n),𝐗~(n)],𝐘(n))\mathrm{MLR}^{\pi}=\mathrm{mlr}_{n}^{\pi}([\mathbf{X}^{(n)},\widetilde{\mathbf{X}}^{(n)}],\mathbf{Y}^{(n)}) denote the MLR statistics with respect to PnπP_{n}^{\pi} and let W=wn([𝐗(n),𝐗~(n)],𝐘(n))W=w_{n}([\mathbf{X}^{(n)},\widetilde{\mathbf{X}}^{(n)}],\mathbf{Y}^{(n)}) denote any other sequence of feature statistics. Assume the following:

  • limnΓq(mlrnπ)\lim_{n\to\infty}\Gamma_{q}(\mathrm{mlr}_{n}^{\pi}) and limnΓq(wn)\lim_{n\to\infty}\Gamma_{q}(w_{n}) exist for each q(0,1)q\in(0,1).

  • The expected number of non-nulls grows faster than log(pn)4\log(p_{n})^{4}. Formally, assume that for some γ>0\gamma>0, limnsnlog(pn)4+γ=\lim_{n\to\infty}\frac{s_{n}}{\log(p_{n})^{4+\gamma}}=\infty.

  • Conditional on D(n)D^{(n)}, the covariance between the signs of MLRπ\mathrm{MLR}^{\pi} decays exponentially off the diagonal. That is, there exist constants C0,ρ(0,1)C\geq 0,\rho\in(0,1) such that

    |CovPnπ(𝕀(MLRiπ>0),𝕀(MLRjπ>0)D(n))|Cρ|ij|.|\operatorname{Cov}_{P^{\pi}_{n}}(\mathbb{I}(\mathrm{MLR}^{\pi}_{i}>0),\mathbb{I}(\mathrm{MLR}^{\pi}_{j}>0)\mid D^{(n)})|\leq C\rho^{|i-j|}. (3.8)

Then for all but countably many q(0,1)q\in(0,1),

limnΓq(mlrnπ)limnΓq(wn).\lim_{n\to\infty}\Gamma_{q}(\mathrm{mlr}_{n}^{\pi})\geq\lim_{n\to\infty}\Gamma_{q}(w_{n}). (B.10)
Proof.

Note that in this proof, all expectations and probabilities are taken over PnπP_{n}^{\pi}.

The proof proceeds in three main steps, but we begin by introducing some notation and outlining the overall strategy. Following Lemma B.2, let R=𝕀(sorted(W)>0)R=\mathbb{I}(\mathrm{sorted}(W)>0) and Rπ=𝕀(sorted(MLRπ)>0)R^{\pi}=\mathbb{I}(\mathrm{sorted}(\mathrm{MLR}^{\pi})>0), and let δ=𝔼[R∣D(n)]\delta=\mathbb{E}[R\mid D^{(n)}] and δπ=𝔼[Rπ∣D(n)]\delta^{\pi}=\mathbb{E}[R^{\pi}\mid D^{(n)}] be their conditional expectations given the masked data. (Note that W,R,δ,MLRπ,RπW,R,\delta,\mathrm{MLR}^{\pi},R^{\pi} and δπ\delta^{\pi} all change with nn; however, we omit this dependency to lighten the notation.) As in Equation (B.4), we can write the numbers of discoveries made by WW and MLRπ\mathrm{MLR}^{\pi} as functions of RR and RπR^{\pi}, respectively, so:

Γq(mlrnπ)Γq(wn)=𝔼[τq(Rπ)]sn𝔼[τq(R)]sn.\Gamma_{q}(\mathrm{mlr}_{n}^{\pi})-\Gamma_{q}(w_{n})=\frac{\mathbb{E}\left[\tau_{q}(R^{\pi})\right]}{s_{n}}-\frac{\mathbb{E}\left[\tau_{q}(R)\right]}{s_{n}}.

We will show that the limit of this quantity is nonnegative, and the main idea is to make the approximations τq(Rπ)τq(δπ)\tau_{q}(R^{\pi})\approx\tau_{q}(\delta^{\pi}) and τq(R)τq(δ)\tau_{q}(R)\approx\tau_{q}(\delta). In particular, we can decompose

Γq(mlrnπ)Γq(wn)\displaystyle\Gamma_{q}(\mathrm{mlr}_{n}^{\pi})-\Gamma_{q}(w_{n}) =𝔼[τq(Rπ)τq(R)]sn\displaystyle=\frac{\mathbb{E}\left[\tau_{q}(R^{\pi})-\tau_{q}(R)\right]}{s_{n}}
=𝔼[τq(Rπ)τq(δπ)]sn+𝔼[τq(δπ)τq(δ)]sn+𝔼[τq(δ)τq(R)]sn\displaystyle=\frac{\mathbb{E}[\tau_{q}(R^{\pi})-\tau_{q}(\delta^{\pi})]}{s_{n}}+\frac{\mathbb{E}[\tau_{q}(\delta^{\pi})-\tau_{q}(\delta)]}{s_{n}}+\frac{\mathbb{E}\left[\tau_{q}(\delta)-\tau_{q}(R)\right]}{s_{n}}
𝔼[τq(δπ)τq(δ)]sn𝔼|τq(Rπ)τq(δπ)|sn𝔼|τq(δ)τq(R)|sn.\displaystyle\geq\frac{\mathbb{E}[\tau_{q}(\delta^{\pi})-\tau_{q}(\delta)]}{s_{n}}-\frac{\mathbb{E}|\tau_{q}(R^{\pi})-\tau_{q}(\delta^{\pi})|}{s_{n}}-\frac{\mathbb{E}|\tau_{q}(\delta)-\tau_{q}(R)|}{s_{n}}. (B.11)

In particular, Step 1 of the proof is to show that τq(δπ)τq(δ)\tau_{q}(\delta^{\pi})\geq\tau_{q}(\delta) holds deterministically, for fixed nn. This implies that the first term of Equation (B.11) is nonnegative for fixed nn. In Step 2, we show that as nn\to\infty, the second and third terms of Equation (B.11) vanish. In Step 3, we combine these results and take limits to yield the final result.

Step 1: In this step, we show that τq(δπ)τq(δ)\tau_{q}(\delta^{\pi})\geq\tau_{q}(\delta) holds deterministically for fixed nn. To do this, it suffices to show that δ¯kπδ¯k\bar{\delta}^{\pi}_{k}\geq\bar{\delta}_{k} for each k[pn]k\in[p_{n}]. To see this, recall that τq(δπ)\tau_{q}(\delta^{\pi}) and τq(δ)\tau_{q}(\delta) are increasing functions of ψq(δπ)\psi_{q}(\delta^{\pi}) and ψq(δ)\psi_{q}(\delta), as defined below:

ψq(δπ)=maxk∈ℕ{k:k−kδ¯kπ+1kδ¯kπ≤q} and ψq(δ)=maxk∈ℕ{k:k−kδ¯k+1kδ¯k≤q}\psi_{q}(\delta^{\pi})=\max_{k\in\mathbb{N}}\left\{k:\frac{k-k\bar{\delta}^{\pi}_{k}+1}{k\bar{\delta}^{\pi}_{k}}\leq q\right\}\text{ and }\psi_{q}(\delta)=\max_{k\in\mathbb{N}}\left\{k:\frac{k-k\bar{\delta}_{k}+1}{k\bar{\delta}_{k}}\leq q\right\} (B.12)

where for k>pnk>p_{n} we use the convention of “padding” δ\delta and δπ\delta^{\pi} with extra zeros, so (e.g.) δ¯k=pnkδ¯\bar{\delta}_{k}=\frac{p_{n}}{k}\bar{\delta} for k>pnk>p_{n}.

Since the function γkkγ+1kγ\gamma\mapsto\frac{k-k\gamma+1}{k\gamma} is decreasing in γ\gamma, δ¯kπδ¯k\bar{\delta}^{\pi}_{k}\geq\bar{\delta}_{k} implies that kkδ¯kπ+1kδ¯kπkkδ¯k+1kδ¯k\frac{k-k\bar{\delta}^{\pi}_{k}+1}{k\bar{\delta}^{\pi}_{k}}\leq\frac{k-k\bar{\delta}_{k}+1}{k\bar{\delta}_{k}} for each kk, and therefore ψq(δπ)ψq(δ)\psi_{q}(\delta^{\pi})\geq\psi_{q}(\delta), which implies τq(δπ)τq(δ)\tau_{q}(\delta^{\pi})\geq\tau_{q}(\delta). Thus, it suffices to show that δ¯kπδ¯k\bar{\delta}^{\pi}_{k}\geq\bar{\delta}_{k} holds for each kk.

Intuitively, it makes sense that δ¯kπ≥δ¯k\bar{\delta}^{\pi}_{k}\geq\bar{\delta}_{k} for each kk, since MLRπ\mathrm{MLR}^{\pi} maximizes (MLRjπ>0∣D)\mathbb{P}(\mathrm{MLR}_{j}^{\pi}>0\mid D) coordinate-wise and its absolute values are chosen so that δπ\delta^{\pi} is sorted in decreasing order. To prove this formally, we first argue that conditional on D(n)D^{(n)}, RπR^{\pi} is a deterministic function of RR. Recall that according to Corollary B.1, the event sign(Wj)≠sign(MLRjπ)\operatorname*{sign}(W_{j})\neq\operatorname*{sign}(\mathrm{MLR}_{j}^{\pi}) is completely determined by the masked data D(n)D^{(n)}. Furthermore, since RπR^{\pi} and RR are random permutations of the vectors 𝕀(MLRπ>0)\mathbb{I}(\mathrm{MLR}^{\pi}>0) and 𝕀(W>0)\mathbb{I}(W>0), respectively, where the random permutations only depend on |MLRπ||\mathrm{MLR}^{\pi}| and |W||W|, this implies there exists a random vector ξ∈{0,1}pn\xi\in\{0,1\}^{p_{n}} and a random permutation σ∈Spn\sigma\in S_{p_{n}} such that Rπ=ξ⊕σ(R)R^{\pi}=\xi\oplus\sigma(R) and ξ,σ\xi,\sigma are deterministic conditional on D(n)D^{(n)}. Note that here, ⊕\oplus denotes the generalized parity function, so

bx{xb=01xb=1 for x[0,1],b{0,1}b\oplus x\coloneqq\begin{cases}x&b=0\\ 1-x&b=1\end{cases}\text{ for }x\in[0,1],b\in\{0,1\} (B.13)

which guarantees that 00=11=00\oplus 0=1\oplus 1=0 and 01=10=10\oplus 1=1\oplus 0=1, etc.
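As a toy illustration of this relation (with ξ\xi and σ\sigma drawn arbitrarily here, whereas in the proof they are determined by D(n)D^{(n)}), the identity Rπ=ξ⊕σ(R)R^{\pi}=\xi\oplus\sigma(R) can be written as follows.

```python
import numpy as np

def parity(b, x):
    """Generalized parity from Eq. (B.13): returns x where b = 0 and 1 - x where b = 1."""
    b = np.asarray(b)
    x = np.asarray(x, dtype=float)
    return np.where(b == 1, 1.0 - x, x)

rng = np.random.default_rng(1)
p = 8
R = rng.integers(0, 2, size=p).astype(float)   # signs of W, sorted by |W|
xi = rng.integers(0, 2, size=p)                # sign flips (D-measurable in the proof)
sigma = rng.permutation(p)                     # re-ordering (D-measurable in the proof)
R_pi = parity(xi, R[sigma])                    # R^pi = xi (+) sigma(R), with sigma(R)_i = R_{sigma(i)}
```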

The intuition here is that, following Proposition 3.1, fitting a feature statistic WW is equivalent to observing D(n)D^{(n)}, assigning an ordering to the features, and then guessing which one of {𝐗j,𝐗~j}\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\} is the true feature and which is a knockoff, where Wj>0W_{j}>0 if and only if this “guess” is correct. Since these decisions are made as deterministic functions of D(n)D^{(n)}, MLRπ\mathrm{MLR}^{\pi} can only differ from WW in that (i) it may make different guesses, flipping the sign of WW (as represented by ξ\xi), and (ii) its absolute values may be sorted in a different order (as represented by σ\sigma).

Now, since ξ\xi and σ\sigma are deterministic functions of D(n)D^{(n)}, this implies that

δiπ=𝔼[RiπD(n)]=𝔼[ξiRσ(i)D(n)]={1δσ(i)ξi=1δσ(i)ξi=0.\delta^{\pi}_{i}=\mathbb{E}[R_{i}^{\pi}\mid D^{(n)}]=\mathbb{E}[\xi_{i}\oplus R_{\sigma(i)}\mid D^{(n)}]=\begin{cases}1-\delta_{\sigma(i)}&\xi_{i}=1\\ \delta_{\sigma(i)}&\xi_{i}=0.\end{cases}

However, by definition, 𝔼[RiπD(n)]=(sorted(MLRπ)i>0D(n))\mathbb{E}[R_{i}^{\pi}\mid D^{(n)}]=\mathbb{P}(\mathrm{sorted}(\mathrm{MLR}^{\pi})_{i}>0\mid D^{(n)}), and Proposition 3.3 implies that (MLRiπ>0D(n))0.5\mathbb{P}(\mathrm{MLR}^{\pi}_{i}>0\mid D^{(n)})\geq 0.5 for all i[pn]i\in[p_{n}]: since the ordering of MLRπ\mathrm{MLR}^{\pi} is deterministic conditional on D(n)D^{(n)}, this also implies δiπ=(sorted(MLRπ)i>0D(n))0.5\delta_{i}^{\pi}=\mathbb{P}(\mathrm{sorted}(\mathrm{MLR}^{\pi})_{i}>0\mid D^{(n)})\geq 0.5. Therefore, δiπδσ(i)\delta^{\pi}_{i}\geq\delta_{\sigma(i)} for each i[pn]i\in[p_{n}]. Additionally, by construction MLRπ\mathrm{MLR}^{\pi} ensures that δ1πδ2πδpnπ\delta^{\pi}_{1}\geq\delta^{\pi}_{2}\geq\dots\geq\delta^{\pi}_{p_{n}}. If δ(1),,δ(pn)\delta_{(1)},\dots,\delta_{(p_{n})} are the order statistics of δ\delta in decreasing order, this implies that δiπδ(i)\delta^{\pi}_{i}\geq\delta_{(i)} for all ii. Therefore,

δ¯kπ=1ki=1kδiπ1ki=1kδ(i)1ki=1nδi.\bar{\delta}^{\pi}_{k}=\frac{1}{k}\sum_{i=1}^{k}\delta^{\pi}_{i}\geq\frac{1}{k}\sum_{i=1}^{k}\delta_{(i)}\geq\frac{1}{k}\sum_{i=1}^{n}\delta_{i}.

By the previous analysis, this proves that τq(δπ)τq(δ)\tau_{q}(\delta^{\pi})\geq\tau_{q}(\delta).
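The conclusion of Step 1 can also be checked numerically. The sketch below assumes the knockoff_discoveries helper from the sketch following Equation (B.4) is in scope; δ\delta is an arbitrary vector of conditional sign probabilities for some feature statistic, and δπ\delta^{\pi} is formed by flipping each entry to max(δj,1−δj)\max(\delta_{j},1-\delta_{j}) and sorting in decreasing order, mirroring how MLR statistics choose their signs and ordering. The assertion encodes the finite-sample inequality τq(δπ)≥τq(δ)\tau_{q}(\delta^{\pi})\geq\tau_{q}(\delta) proved above.

```python
import numpy as np

rng = np.random.default_rng(2)
for _ in range(1000):
    # arbitrary conditional probabilities P(sorted(W)_j > 0 | D) for a generic statistic W
    delta = rng.uniform(0, 1, size=rng.integers(5, 200))
    # flip each coordinate to max(delta_j, 1 - delta_j) and sort in decreasing order
    delta_pi = np.sort(np.maximum(delta, 1.0 - delta))[::-1]
    q = rng.uniform(0.05, 0.5)
    # tau_q is the second return value of knockoff_discoveries (sketch after Eq. (B.4))
    assert knockoff_discoveries(delta_pi, q)[1] >= knockoff_discoveries(delta, q)[1]
```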

Step 2: In this step, we show that 𝔼|τq(δπ)τq(Rπ)|sn0\frac{\mathbb{E}|\tau_{q}(\delta^{\pi})-\tau_{q}(R^{\pi})|}{s_{n}}\to 0 for all but countably many q(0,1)q\in(0,1), as well as the analogous result for RR and δ\delta. We first prove the result for RπR^{\pi} and δπ\delta^{\pi}, and in particular, for any fixed v>0v>0, we will show that lim supn𝔼|τq(δπ)τq(Rπ)|snv\limsup_{n\to\infty}\frac{\mathbb{E}|\tau_{q}(\delta^{\pi})-\tau_{q}(R^{\pi})|}{s_{n}}\leq v. Since we will show this for any arbitrary v>0v>0, this implies 𝔼|τq(δπ)τq(Rπ)|sn0\frac{\mathbb{E}|\tau_{q}(\delta^{\pi})-\tau_{q}(R^{\pi})|}{s_{n}}\to 0.

We begin by applying Lemma B.2. In particular, fix any kn[pn]k_{n}\in[p_{n}], any ϵ>0\epsilon>0, and define

An={maxknkpn|R¯kπδ¯kπ|ϵ}.A_{n}=\left\{\max_{k_{n}\leq k\leq p_{n}}|\bar{R}_{k}^{\pi}-\bar{\delta}_{k}^{\pi}|\leq\epsilon\right\}.

Then by Lemma B.2,

|τq(Rπ)τq(δπ)|pn𝕀(Anc)+τq+η(Rπ)τqη(Rπ)+kn+1+3ϵτq+η(Rπ).|\tau_{q}(R^{\pi})-\tau_{q}(\delta^{\pi})|\leq p_{n}\mathbb{I}(A_{n}^{c})+\tau_{q+\eta}(R^{\pi})-\tau_{q-\eta}(R^{\pi})+k_{n}+1+3\epsilon\tau_{q+\eta}(R^{\pi}).

where η=3(1+q)ϵ\eta=3(1+q)\epsilon. Therefore,

𝔼|τq(Rπ)τq(δπ)|snpn(Anc)sn+kn+1sn+𝔼[τq+η(Rπ)]𝔼[τqη(Rπ)]sn+3ϵ𝔼[τq+η(Rπ)]sn.\frac{\mathbb{E}|\tau_{q}(R^{\pi})-\tau_{q}(\delta^{\pi})|}{s_{n}}\leq\frac{p_{n}\mathbb{P}(A_{n}^{c})}{s_{n}}+\frac{k_{n}+1}{s_{n}}+\frac{\mathbb{E}[\tau_{q+\eta}(R^{\pi})]-\mathbb{E}[\tau_{q-\eta}(R^{\pi})]}{s_{n}}+\frac{3\epsilon\mathbb{E}[\tau_{q+\eta}(R^{\pi})]}{s_{n}}. (B.14)

We now analyze these terms in order: while doing so, we will choose a sequence {kn}\{k_{n}\} and constant ϵ>0\epsilon>0 which guarantee the desired result. Note that eventually, our choice of ϵ\epsilon will depend on qq, so the convergence is not necessarily uniform, but that does not pose a problem for our proof.

First term: To start, we apply a finite-sample concentration result to bound (Anc)\mathbb{P}(A_{n}^{c}). In particular, we show in Corollary C.1 that if X1,…,XnX_{1},\dots,X_{n} are mean-zero, [−1,1][-1,1]-valued random variables satisfying the exponential decay condition from Equation (3.8), then there exists a constant C′>0C^{\prime}>0 depending only on CC and ρ\rho such that

(maxn0in|X¯i|t)nexp(Ct2n01/4).\mathbb{P}\left(\max_{n_{0}\leq i\leq n}|\bar{X}_{i}|\geq t\right)\leq n\exp(-C^{\prime}t^{2}n_{0}^{1/4}). (B.15)

Furthermore, Corollary C.1 shows that this result holds even if we permute X1,…,XnX_{1},\dots,X_{n} according to some arbitrary fixed permutation σ\sigma. Now, observe that conditional on D(n)D^{(n)}, the coordinates of Rπ−δπR^{\pi}-\delta^{\pi} are zero-mean, [−1,1][-1,1]-valued random variables, and the vector Rπ−δπR^{\pi}-\delta^{\pi} is a fixed (D(n)D^{(n)}-measurable) permutation of 𝕀(MLRπ>0)\mathbb{I}(\mathrm{MLR}^{\pi}>0) minus its conditional expectation. Since 𝕀(MLRπ>0)\mathbb{I}(\mathrm{MLR}^{\pi}>0) obeys the conditional exponential decay condition in Equation (3.8), we can apply Corollary C.1 to Rπ−δπR^{\pi}-\delta^{\pi}:

(AncD(n))pnexp(Cϵ2kn1/4)\mathbb{P}(A_{n}^{c}\mid D^{(n)})\leq p_{n}\exp(-C^{\prime}\epsilon^{2}k_{n}^{1/4}) (B.16)

which implies by the tower property that pn(Anc)pn2exp(Cϵ2kn1/4)p_{n}\mathbb{P}(A_{n}^{c})\leq p_{n}^{2}\exp(-C^{\prime}\epsilon^{2}k_{n}^{1/4}). Now, suppose we take

kn=log(pn)4+γ.k_{n}=\left\lceil\log(p_{n})^{4+\gamma}\right\rceil.

Then observe that ϵ\epsilon is fixed, so as nn\to\infty, kn1/4ϵ2=Ω(log(pn)1+γ/4)k_{n}^{1/4}\epsilon^{2}=\Omega(\log(p_{n})^{1+\gamma/4}). Thus

log(pn(Anc))2log(pn)Ω(log(pn)1+γ/4).\log(p_{n}\mathbb{P}(A_{n}^{c}))\leq 2\log(p_{n})-\Omega\left(\log(p_{n})^{1+\gamma/4}\right)\to-\infty.

Therefore, for this choice of knk_{n}, we have shown the stronger result that pn(Anc)0p_{n}\mathbb{P}(A_{n}^{c})\to 0. Of course, this implies pn(Anc)sn0\frac{p_{n}\mathbb{P}(A_{n}^{c})}{s_{n}}\to 0 as well.

Second term: This term is easy, as we assume in the statement that knsnlog(pn)4+γsn0\frac{k_{n}}{s_{n}}\sim\frac{\log(p_{n})^{4+\gamma}}{s_{n}}\to 0.

Third term: We will now show that for all but countably many q(0,1)q\in(0,1), for any sufficiently small ϵ\epsilon (and thus for any sufficiently small η\eta), lim supn𝔼[τq+η(Rπ)]𝔼[τqη(Rπ)]snv/2\limsup_{n\to\infty}\frac{\mathbb{E}[\tau_{q+\eta}(R^{\pi})]-\mathbb{E}[\tau_{q-\eta}(R^{\pi})]}{s_{n}}\leq v/2 for any fixed v>0v>0.

To do this, recall by assumption that for all q(0,1)q\in(0,1), we have that limn𝔼[τq(Rπ)]sn\lim_{n\to\infty}\frac{\mathbb{E}[\tau_{q}(R^{\pi})]}{s_{n}} exists and converges to some (extended) real number L(q)L(q). Furthermore, we show in Lemma C.2 that L(q)L(q) is always finite—this is intuitively a consequence of the fact that knockoffs controls the false discovery rate, and thus the expected number of discoveries cannot exceed the number of non-nulls by more than a constant factor. Importantly, since τq(Rπ)\tau_{q}(R^{\pi}) is increasing in qq, the function L(q)L(q) is increasing in qq for all q(0,1)q\in(0,1): therefore, it is continuous on (0,1)(0,1) except on a countable set.

Supposing that qq is a continuity point of L(q)L(q), there exists some β>0\beta>0 such that |qq|β|L(q)L(q)|v/4|q-q^{\prime}|\leq\beta\implies|L(q)-L(q^{\prime})|\leq v/4. Take ϵ\epsilon to be any positive constant such that ϵβ3(1+q)\epsilon\leq\frac{\beta}{3(1+q)} and thus ηβ\eta\leq\beta. Then we conclude

lim supn𝔼[τq+η(Rπ)]𝔼[τqη(Rπ)]sn\displaystyle\limsup_{n\to\infty}\frac{\mathbb{E}[\tau_{q+\eta}(R^{\pi})]-\mathbb{E}[\tau_{q-\eta}(R^{\pi})]}{s_{n}} =L(q+η)L(qη)\displaystyle=L(q+\eta)-L(q-\eta) because 𝔼[τq(Rπ)]snL(q) pointwise\displaystyle\text{ because }\frac{\mathbb{E}[\tau_{q}(R^{\pi})]}{s_{n}}\to L(q)\text{ pointwise}
v2.\displaystyle\leq\frac{v}{2}. by continuity

Fourth term: We now show that for all but countably many q(0,1)q\in(0,1), for any sufficiently small ϵ\epsilon, limn3ϵ𝔼[τq+η(Rπ)]sn=3ϵL(q+η)v/2\lim_{n\to\infty}\frac{3\epsilon\mathbb{E}[\tau_{q+\eta}(R^{\pi})]}{s_{n}}=3\epsilon L(q+\eta)\leq v/2. However, this is simple, since Lemma C.2 tells us that L(q)L(q) is finite and continuous except at countably many points. Thus, we can take ϵ\epsilon sufficiently small so that L(q+η)=L(q+3(1+q)ϵ)L(q)+1L(q+\eta)=L(q+3(1+q)\epsilon)\leq L(q)+1, and then also take ϵ<v6(L(q)+1)\epsilon<\frac{v}{6(L(q)+1)} so that 3ϵL(q+η)v/23\epsilon L(q+\eta)\leq v/2.

Combining the results for all four terms, we see the following: for each v>0v>0, there exists a sequence {kn}\{k_{n}\} and a constant ϵ\epsilon guaranteeing that

lim supn𝔼|τq(Rπ)τq(δπ)|snv.\displaystyle\limsup_{n\to\infty}\frac{\mathbb{E}|\tau_{q}(R^{\pi})-\tau_{q}(\delta^{\pi})|}{s_{n}}\leq v.

Since this holds for all v>0v>0, we conclude limn𝔼|τq(Rπ)τq(δπ)|sn=0\lim_{n\to\infty}\frac{\mathbb{E}|\tau_{q}(R^{\pi})-\tau_{q}(\delta^{\pi})|}{s_{n}}=0 as desired.

Lastly in this step, we need to show the same result for RR and δ\delta in place of RπR^{\pi} and δπ\delta^{\pi}. However, the proof for RR and δ\delta is identical to the proof for RπR^{\pi} and δπ\delta^{\pi}. The one subtlety worth mentioning is that we do not directly assume the exponential decay condition in Equation (3.8) for WW. However, as we argued in Step 1, we can write 𝕀(W>0)=ξ𝕀(MLRπ>0)\mathbb{I}(W>0)=\xi\oplus\mathbb{I}(\mathrm{MLR}^{\pi}>0) for some random vector ξ{0,1}pn\xi\in\{0,1\}^{p_{n}} which is a deterministic function of D(n)D^{(n)}. As a result, we have that

|Cov(𝕀(Wi>0),𝕀(Wj>0)D(n))|=|Cov(𝕀(MLRiπ>0),𝕀(MLRjπ>0)D(n))|Cρ|ij|.\displaystyle|\operatorname{Cov}(\mathbb{I}(W_{i}>0),\mathbb{I}(W_{j}>0)\mid D^{(n)})|=|\operatorname{Cov}(\mathbb{I}(\mathrm{MLR}^{\pi}_{i}>0),\mathbb{I}(\mathrm{MLR}^{\pi}_{j}>0)\mid D^{(n)})|\leq C\rho^{|i-j|}.

Thus, we also conclude that limn𝔼|τq(R)τq(δ)|sn=0\lim_{n\to\infty}\frac{\mathbb{E}|\tau_{q}(R)-\tau_{q}(\delta)|}{s_{n}}=0.

Step 3: Finishing the proof. Recall Equation (B.11), which states that

Γq(mlrnπ)Γq(wn)\displaystyle\Gamma_{q}(\mathrm{mlr}_{n}^{\pi})-\Gamma_{q}(w_{n}) 𝔼[τq(δπ)τq(δ)]sn𝔼|τq(Rπ)τq(δπ)|sn𝔼|τq(δ)τq(R)|sn.\displaystyle\geq\frac{\mathbb{E}[\tau_{q}(\delta^{\pi})-\tau_{q}(\delta)]}{s_{n}}-\frac{\mathbb{E}|\tau_{q}(R^{\pi})-\tau_{q}(\delta^{\pi})|}{s_{n}}-\frac{\mathbb{E}|\tau_{q}(\delta)-\tau_{q}(R)|}{s_{n}}. (B.11)

In Step 1, we showed that τq(δπ)τq(δ)\tau_{q}(\delta^{\pi})\geq\tau_{q}(\delta) for fixed nn. Furthermore, in Step 2, we showed that the second two terms vanish asymptotically. As a result, we take limits and conclude

lim infnΓq(mlrnπ)Γq(wn)0.\liminf_{n\to\infty}\Gamma_{q}(\mathrm{mlr}_{n}^{\pi})-\Gamma_{q}(w_{n})\geq 0.

Furthermore, since we assume that the limits limnΓq(mlrnπ),limnΓq(wn)\lim_{n\to\infty}\Gamma_{q}(\mathrm{mlr}_{n}^{\pi}),\lim_{n\to\infty}\Gamma_{q}(w_{n}) exist, this implies that

limnΓq(mlrnπ)limnΓq(wn)0.\lim_{n\to\infty}\Gamma_{q}(\mathrm{mlr}_{n}^{\pi})-\lim_{n\to\infty}\Gamma_{q}(w_{n})\geq 0.

This concludes the proof. ∎

B.5 Relaxing the assumptions in Theorem 3.2

In this section, we discuss a few ways to relax the assumptions in Theorem 3.2.

First, we can easily relax the assumption that the limits L(q)≔limnΓq(wn)L(q)\coloneqq\lim_{n\to\infty}\Gamma_{q}(w_{n}) and L⋆(q)≔limnΓq(mlrnπ)L^{\star}(q)\coloneqq\lim_{n\to\infty}\Gamma_{q}(\mathrm{mlr}_{n}^{\pi}) exist for each q∈(0,1)q\in(0,1). Indeed, the proof of Theorem 3.2 only uses this assumption to argue that there exists a sequence ηn→0\eta_{n}\to 0 such that L(q+ηn)→L(q),L(q−ηn)→L(q)L(q+\eta_{n})\to L(q),L(q-\eta_{n})\to L(q) (and similarly for L⋆(q)L^{\star}(q)). Thus, we do not need the limits L(q)L(q) to exist for every q∈(0,1)q\in(0,1); rather, the result of Theorem 3.2 will hold (e.g.) for any qq such that L(⋅),L⋆(⋅)L(\cdot),L^{\star}(\cdot) are continuous at qq. Intuitively, this means that the result in Theorem 3.2 holds except at points qq that delineate a “phase transition,” where the power of knockoffs jumps in a discontinuous fashion as qq increases.

Second, it is important to note that the precise form of the local dependency condition (3.8) is not crucial. Indeed, the proof of Theorem 3.2 only uses this condition to show that the partial sums of 𝕀(MLRπ>0)\mathbb{I}(\mathrm{MLR}^{\pi}>0) converge to their conditional mean given DD. To be precise, fix any permutation κ:[pn][pn]\kappa:[p_{n}]\to[p_{n}] and let R=𝕀(κ(MLRπ)>0)R=\mathbb{I}(\kappa(\mathrm{MLR}^{\pi})>0) where κ(MLRπ)\kappa(\mathrm{MLR}^{\pi}) permutes MLRπ\mathrm{MLR}^{\pi} according to κ\kappa. Let δ=𝔼[RD]\delta=\mathbb{E}[R\mid D]. Then the proof of Theorem 3.2 will go through exactly as written if we replace Equation (3.8) with the following condition:

(maxkn≤k≤pn|R¯k−δ¯k|≥ϵ∣D)=o(pn−1)\mathbb{P}\left(\max_{k_{n}\leq k\leq p_{n}}|\bar{R}_{k}-\bar{\delta}_{k}|\geq\epsilon\mid D\right)=o(p_{n}^{-1}) (B.17)

where ϵ>0\epsilon>0 is an arbitrary fixed constant and knk_{n} is some sequence satisfying kn→∞k_{n}\to\infty and knsn→0\frac{k_{n}}{s_{n}}\to 0.

The upshot is this: under any condition where each permutation of 𝕀(MLRπ>0)\mathbb{I}(\mathrm{MLR}^{\pi}>0) obeys a certain strong law of large numbers, we should expect Theorem 3.2 to hold. Although it is unusual to require that a strong law holds for any fixed permutation of a vector, in some cases there is a “worst-case” permutation such that if Equation (B.17) holds for that choice of κ\kappa, then it holds for every choice of κ\kappa. For example, in Corollary C.1, we show that if the exponential decay condition holds, then it suffices to show Equation (B.17) in the case where κ\kappa is the identity permutation, since the identity permutation places the most correlated coordinates of MLRπ\mathrm{MLR}^{\pi} next to each other.
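In practice, the conditional dependence structure of 𝕀(MLRπ>0)\mathbb{I}(\mathrm{MLR}^{\pi}>0) given DD can be inspected directly, since it depends only on known quantities. The sketch below assumes access to posterior draws of these sign indicators given the masked data (for example, from whatever sampler was used to compute the MLR statistics; the input name sign_samples is hypothetical) and reports the largest empirical conditional covariance at each lag, which can be compared against the exponential decay in Equation (3.8) or used to gauge the law-of-large-numbers behavior required by Equation (B.17).

```python
import numpy as np

def max_abs_cov_by_lag(sign_samples):
    """Largest empirical |Cov(I(MLR_i > 0), I(MLR_j > 0) | D)| at each lag |i - j|.

    sign_samples : (n_draws, p) array of posterior draws of I(MLR^pi > 0) given D.
    Returns an array whose d-th entry (d = 1, ..., p - 1) is the maximum absolute
    empirical covariance over all pairs (i, j) with |i - j| = d.
    """
    cov = np.cov(np.asarray(sign_samples, dtype=float), rowvar=False)
    p = cov.shape[0]
    lags = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    return np.array([np.abs(cov[lags == d]).max() for d in range(1, p)])
```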

B.6 Proof of Propositions 3.4-3.5

Proposition 3.4.

If {𝕀(MLRjπ>0)}j=1p\{\mathbb{I}(\mathrm{MLR}_{j}^{\pi}>0)\}_{j=1}^{p} are conditionally independent given DD under PπP^{\pi}, then
ENDiscπ(mlrπ)ENDiscπ(w)\texttt{ENDisc}^{\pi}(\mathrm{mlr}^{\pi})\geq\texttt{ENDisc}^{\pi}(w) for any valid feature statistic ww.

Proof.

Note that the proof here is essentially the same argument used in Proposition 2 of Li and Fithian, (2021), but for completeness we restate it here. Let W=w([𝐗,𝐗~],𝐘)W=w([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y}) denote any other feature statistic. Let S[p]S\subset[p] and Sπ[p]S^{\pi}\subset[p] denote the discovery sets based on WW and MLRπ\mathrm{MLR}^{\pi}.

It suffices to show that 𝔼Pπ[|S|]≤𝔼Pπ[|Sπ|]\mathbb{E}_{P^{\pi}}[|S|]\leq\mathbb{E}_{P^{\pi}}[|S^{\pi}|]. The argument from the proof of Theorem 3.2 (the beginning of Appendix B.4) shows that the number of discoveries |S||S| is a monotone function of ψq(𝕀(sorted(W)>0))\psi_{q}(\mathbb{I}(\mathrm{sorted}(W)>0)), where sorted(W)\mathrm{sorted}(W) denotes WW sorted in decreasing order of its absolute values. Therefore, it suffices to show that if R=𝕀(sorted(W)>0)R=\mathbb{I}(\mathrm{sorted}(W)>0) and Rπ=𝕀(sorted(MLRπ)>0)R^{\pi}=\mathbb{I}(\mathrm{sorted}(\mathrm{MLR}^{\pi})>0),

𝔼Pπ[ψq(R)]𝔼Pπ[ψq(Rπ)],\mathbb{E}_{P^{\pi}}[\psi_{q}(R)]\leq\mathbb{E}_{P^{\pi}}[\psi_{q}(R^{\pi})], (B.18)

where as defined in Eq. (B.4), ψq(η)maxk{k:kkη¯k+1kη¯kq}\psi_{q}(\eta)\coloneqq\max_{k}\left\{k:\frac{k-k\bar{\eta}_{k}+1}{k\bar{\eta}_{k}}\leq q\right\} for any η{0,1}p\eta\in\{0,1\}^{p}. To do this, we recall from the proof of Theorem 3.2 that there exists a DD-measurable vector ξ{0,1}p\xi\in\{0,1\}^{p} and a DD-measurable permutation σ:[p][p]\sigma:[p]\to[p] such that

R=σ(ξRπ)R=\sigma(\xi\oplus R^{\pi})

where for any vector xpx\in\mathbb{R}^{p}, σ(x)(xσ(1),,xσ(p))\sigma(x)\coloneqq(x_{\sigma(1)},\dots,x_{\sigma(p)}) and \oplus denotes the parity function (see Eq. B.13). We now make a few observations:

  1.

    We assume {𝕀(MLRjπ>0)}j=1p\{\mathbb{I}(\mathrm{MLR}_{j}^{\pi}>0)\}_{j=1}^{p} are conditionally independent. Since the magnitudes of MLRπ\mathrm{MLR}^{\pi} are DD-measurable by Proposition 3.1, RπR^{\pi} is equal to a DD-measurable permutation of 𝕀(MLRπ>0)\mathbb{I}(\mathrm{MLR}^{\pi}>0). Therefore, the entries of RπR^{\pi} are conditionally independent given DD.

  2.

    Since σ\sigma and ξ\xi are DD-measurable, the entries of RR are also conditionally independent given DD.

  3.

    Since MLRjπ\mathrm{MLR}_{j}^{\pi} maximizes Pπ(MLRjπ>0D)P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D) among all feature statistics by Proposition 3.3, this implies that for any jj, Pπ(Rjπ>0D)12P^{\pi}(R_{j}^{\pi}>0\mid D)\geq\frac{1}{2}. Thus, Pπ(Rj>0D)=Pπ(ξjRjπ>0D)Pπ(Rjπ>0D)P^{\pi}(R_{j}>0\mid D)=P^{\pi}(\xi_{j}\oplus R_{j}^{\pi}>0\mid D)\leq P^{\pi}(R_{j}^{\pi}>0\mid D) for all jj.

Thus, we can create a coupling R~\tilde{R} such that R~\tilde{R} has the same marginal law as σ(Rπ)\sigma(R^{\pi}), but R~R\tilde{R}\geq R a.s. (by the third observation above). This implies that kkR~¯k+1kR~¯kkkR¯k+1kR¯k\frac{k-k\bar{\tilde{R}}_{k}+1}{k\bar{\tilde{R}}_{k}}\leq\frac{k-k\bar{R}_{k}+1}{k\bar{R}_{k}} for all kk, and therefore ψq(R~)ψq(R)\psi_{q}(\tilde{R})\geq\psi_{q}(R). Therefore

𝔼Pπ[ψq(R)]𝔼[ψq(R~)]=𝔼Pπ[ψq(σ(Rπ))].\mathbb{E}_{P^{\pi}}[\psi_{q}(R)]\leq\mathbb{E}[\psi_{q}(\tilde{R})]=\mathbb{E}_{P^{\pi}}[\psi_{q}(\sigma(R^{\pi}))].

Therefore, to complete the proof, it suffices to show that 𝔼Pπ[ψq(σ(Rπ))]𝔼Pπ[ψq(Rπ)]\mathbb{E}_{P^{\pi}}[\psi_{q}(\sigma(R^{\pi}))]\leq\mathbb{E}_{P^{\pi}}[\psi_{q}(R^{\pi})]—to simplify notation, take R=σ(Rπ)R=\sigma(R^{\pi}), i.e., assume ξ=0\xi=0 without loss of generality. Note that by Proposition 3.3, the entries of δπ𝔼[RπD][0,1]p\delta^{\pi}\coloneqq\mathbb{E}[R^{\pi}\mid D]\in[0,1]^{p} are arranged in decreasing order. To show the desired result, let δ𝔼[RD]\delta\coloneqq\mathbb{E}[R\mid D] and fix any i<ji<j such that δi<δj\delta_{i}<\delta_{j} are “misordered” (i.e. not in decreasing order). It is sufficient to show that 𝔼[ψq(R)D]𝔼[ψq(Rswap({i,j}))D]\mathbb{E}[\psi_{q}(R)\mid D]\leq\mathbb{E}[\psi_{q}(R_{{\text{swap}(\{i,j\})}})\mid D], since conditional on DD, Rπ=σ1(R)R^{\pi}=\sigma^{-1}(R) is simply the result of iteratively swapping elements of RR to sort δ\delta in decreasing order.

To show this, for any ri,rj{0,1}r_{i},r_{j}\in\{0,1\}, let (R1,,ri,,rj,,Rp)(R_{1},\dots,r_{i},\dots,r_{j},\dots,R_{p}) denote the vector which replaces the iith and jjth entries of RR with rir_{i} and rjr_{j}, respectively. Since the entries of RDR\mid D are conditionally independent, after conditioning on R{i,j}R_{-\{i,j\}}, we can write out the relevant conditional expectations:

𝔼[ψq(R)ψq(Rswap({i,j}))D,R{i,j}]\displaystyle\mathbb{E}[\psi_{q}(R)-\psi_{q}(R_{{\text{swap}(\{i,j\})}})\mid D,R_{-\{i,j\}}]
=\displaystyle= ∑ri,rj∈{0,1}[ψq((R1,…,ri,…,rj,…,Rp))−ψq((R1,…,rj,…,ri,…,Rp))]δiri(1−δi)1−riδjrj(1−δj)1−rj\sum_{r_{i},r_{j}\in\{0,1\}}[\psi_{q}((R_{1},\dots,r_{i},\dots,r_{j},\dots,R_{p}))-\psi_{q}((R_{1},\dots,r_{j},\dots,r_{i},\dots,R_{p}))]\delta_{i}^{r_{i}}(1-\delta_{i})^{1-r_{i}}\delta_{j}^{r_{j}}(1-\delta_{j})^{1-r_{j}}
=\displaystyle= [ψq((R1,,1,,0,,Rp))ψq((R1,,0,,1,,Rp))][δi(1δj)δj(1δi)]\displaystyle\left[\psi_{q}((R_{1},\dots,1,\dots,0,\dots,R_{p}))-\psi_{q}((R_{1},\dots,0,\dots,1,\dots,R_{p}))\right]\left[\delta_{i}(1-\delta_{j})-\delta_{j}(1-\delta_{i})\right]
=\displaystyle= [ψq((R1,,1,,0,,Rp))ψq((R1,,0,,1,,Rp))][δiδj]\displaystyle\left[\psi_{q}((R_{1},\dots,1,\dots,0,\dots,R_{p}))-\psi_{q}((R_{1},\dots,0,\dots,1,\dots,R_{p}))\right]\left[\delta_{i}-\delta_{j}\right]
\displaystyle\leq 0.\displaystyle 0.

where the first equality uses conditional independence and the definition of expectation, the second equality cancels relevant terms, the third equality is simple algebra, and the final inequality uses the fact that δi<δj\delta_{i}<\delta_{j} by assumption but ψq((R1,,1,,0,,Rp))ψq((R1,,0,,1,,Rp))0\psi_{q}((R_{1},\dots,1,\dots,0,\dots,R_{p}))-\psi_{q}((R_{1},\dots,0,\dots,1,\dots,R_{p}))\geq 0. In particular, to see this latter point, define r(1,0)=(R1,,1,,0,,Rp),r(0,1)(R1,,0,,1,,Rp)r^{(1,0)}=(R_{1},\dots,1,\dots,0,\dots,R_{p}),r^{(0,1)}\coloneqq(R_{1},\dots,0,\dots,1,\dots,R_{p}) and note that the partial averages obey r¯k(1,0)r¯k(0,1)\bar{r}_{k}^{(1,0)}\geq\bar{r}_{k}^{(0,1)} for every k[p]k\in[p], which implies ψq(r(1,0))ψq(r(0,1))\psi_{q}(r^{(1,0)})\geq\psi_{q}(r^{(0,1)}) by definition of ψq\psi_{q}.

Thus, by the tower property, 𝔼[ψq(R)D]𝔼[ψq(Rswap({i,j}))D]\mathbb{E}[\psi_{q}(R)\mid D]\leq\mathbb{E}[\psi_{q}(R_{{\text{swap}(\{i,j\})}})\mid D], which completes the proof. ∎

Proposition 3.5.

Suppose that (i) 𝐗~\widetilde{\mathbf{X}} are FX knockoffs or Gaussian conditional MX knockoffs (Huang and Janson,, 2020) and (ii) under PP^{\star}, 𝐘∣𝐗∼𝒩(𝐗β,σ2In)\mathbf{Y}\mid\mathbf{X}\sim\mathcal{N}(\mathbf{X}\beta,\sigma^{2}I_{n}). Then under PP^{\star}, {𝕀(MLRjoracle>0)}j=1p∣D\{\mathbb{I}(\mathrm{MLR}_{j}^{\mathrm{oracle}}>0)\}_{j=1}^{p}\mid D are conditionally independent.

Proof.

This result is already proved for the fixed-X case by Li and Fithian, (2021), so we only prove it for the case where 𝐗~\widetilde{\mathbf{X}} are conditional Gaussian MX knockoffs. In particular, define 𝐗^jargmax𝐱{𝐗j,𝐗~j}Pj(𝐱D)\widehat{\mathbf{X}}_{j}^{\star}\coloneqq\operatorname*{arg\,max}_{\mathbf{x}\in\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}}P_{j}^{\star}(\mathbf{x}\mid D) and recall that MLRjoracle>0\mathrm{MLR}_{j}^{\mathrm{oracle}}>0 if and only if 𝐗^j=𝐗j\widehat{\mathbf{X}}_{j}^{\star}=\mathbf{X}_{j}. Therefore, to show that {MLRjoracle>0}j=1pD\{\mathrm{MLR}_{j}^{\mathrm{oracle}}>0\}_{j=1}^{p}\mid D are conditionally independent under PP^{\star}, it suffices to show that {𝐗j}j=1pD\{\mathbf{X}_{j}\}_{j=1}^{p}\mid D are conditionally independent under PP^{\star}.

Fix any value d=(𝐘(0),{𝐱j,𝐱~j}j=1p)d=(\mathbf{Y}^{(0)},\{\mathbf{x}_{j},\widetilde{\mathbf{x}}_{j}\}_{j=1}^{p}) and let 𝐗(0)n×p\mathbf{X}^{(0)}\in\mathbb{R}^{n\times p} denote a possible value for the design matrix which is consistent with observing D=dD=d in the sense that 𝐗j(0){𝐱j,𝐱~j}\mathbf{X}_{j}^{(0)}\in\{\mathbf{x}_{j},\widetilde{\mathbf{x}}_{j}\} for all j[p]j\in[p]. It suffices to show the factorization

P(𝐗=𝐗(0)D=d)j=1pqj(𝐗j(0))P^{\star}(\mathbf{X}=\mathbf{X}^{(0)}\mid D=d)\propto\prod_{j=1}^{p}q_{j}(\mathbf{X}_{j}^{(0)})

for some functions q1,,qp:n0q_{1},\dots,q_{p}:\mathbb{R}^{n}\to\mathbb{R}_{\geq 0} which may depend on the value dd. To do this, observe that

P(𝐗=𝐗(0)D=d)\displaystyle P^{\star}(\mathbf{X}=\mathbf{X}^{(0)}\mid D=d) P𝐘𝐗(𝐘(0)𝐗(0))P(𝐗=𝐗(0){𝐗j,𝐗~j}={𝐱j,𝐱~j}j=1p for j=1,,p)\displaystyle\propto P^{\star}_{\mathbf{Y}\mid\mathbf{X}}(\mathbf{Y}^{(0)}\mid\mathbf{X}^{(0)})\cdot P^{\star}(\mathbf{X}=\mathbf{X}^{(0)}\mid\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}=\{\mathbf{x}_{j},\widetilde{\mathbf{x}}_{j}\}_{j=1}^{p}\text{ for }j=1,\dots,p)
exp(12σ2𝐘(0)𝐗(0)β22)12p.\displaystyle\propto\exp\left(-\frac{1}{2\sigma^{2}}\|\mathbf{Y}^{(0)}-\mathbf{X}^{(0)}\beta\|_{2}^{2}\right)\frac{1}{2^{p}}.

where the last line uses the Gaussian linear model assumption that under PP^{\star}, 𝐘𝐗𝒩(𝐗β,σ2In)\mathbf{Y}\mid\mathbf{X}\sim\mathcal{N}(\mathbf{X}\beta,\sigma^{2}I_{n}) for some fixed βp,σ20\beta\in\mathbb{R}^{p},\sigma^{2}\geq 0 as well as the pairwise exchangeability of {𝐗j,𝐗~j}\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}. Continuing yields,

P(𝐗=𝐗(0)D=d)\displaystyle P^{\star}(\mathbf{X}=\mathbf{X}^{(0)}\mid D=d) exp(12σ2(βT𝐗(0)T𝐗(0)β2𝐘(0)T𝐗(0)β))\displaystyle\propto\exp\left(-\frac{1}{2\sigma^{2}}\left(\beta^{T}{\mathbf{X}^{(0)}}^{T}\mathbf{X}^{(0)}\beta-2{\mathbf{Y}^{(0)}}^{T}\mathbf{X}^{(0)}\beta\right)\right)
exp(𝐘(0)T𝐗(0)βσ2).\displaystyle\propto\exp\left(\frac{{\mathbf{Y}^{(0)}}^{T}\mathbf{X}^{(0)}\beta}{\sigma^{2}}\right).

Here, the last step uses the key assumption that 𝐗~\widetilde{\mathbf{X}} are conditional Gaussian MX knockoffs, in which case 𝐗T𝐗=𝐗~T𝐗~\mathbf{X}^{T}\mathbf{X}=\widetilde{\mathbf{X}}^{T}\widetilde{\mathbf{X}} and 𝐗T𝐗~=𝐗T𝐗S\mathbf{X}^{T}\widetilde{\mathbf{X}}=\mathbf{X}^{T}\mathbf{X}-S for some diagonal matrix Sp×pS\in\mathbb{R}^{p\times p}. In other words, the value of 𝐗T𝐗\mathbf{X}^{T}\mathbf{X} is DD-measurable, and thus conditional on D=dD=d, the value of βT𝐗(0)T𝐗(0)β\beta^{T}{\mathbf{X}^{(0)}}^{T}\mathbf{X}^{(0)}\beta is a constant. At this point, we conclude that

P(𝐗=𝐗(0)D=d)\displaystyle P^{\star}(\mathbf{X}=\mathbf{X}^{(0)}\mid D=d) j=1pexp(𝐘(0)T𝐗j(0)βjσ2).\displaystyle\propto\prod_{j=1}^{p}\exp\left(\frac{{\mathbf{Y}^{(0)}}^{T}\mathbf{X}_{j}^{(0)}\beta_{j}}{\sigma^{2}}\right).

This completes the proof by the factorization argument above. ∎
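Under the assumptions of this proof (a Gaussian linear model with known β\beta and σ2\sigma^{2}, and conditional Gaussian MX knockoffs), the factorization above gives each feature's conditional probability of being the true variable in its pair in closed form, and hence the sign probability of the oracle MLR statistic. The sketch below is only an illustration of this two-point posterior; the function and argument names are hypothetical and not taken from knockpy.

```python
import numpy as np

def oracle_pair_probs(Y, X_cand, X_tilde_cand, beta, sigma2):
    """P(X_j equals the first candidate of its pair | D) implied by the factorization above.

    Y                     : (n,) response vector
    X_cand, X_tilde_cand  : (n, p) arrays holding the two candidates of each pair {X_j, X_tilde_j}
    beta, sigma2          : oracle coefficients (p,) and noise variance (assumed known)

    Each factor is a two-point distribution proportional to exp(Y^T x beta_j / sigma2);
    MLR_j^oracle > 0 exactly when the candidate equal to the true X_j has probability >= 1/2
    (ties aside).
    """
    a = (Y @ X_cand) * beta / sigma2          # log-weight of the first candidate
    b = (Y @ X_tilde_cand) * beta / sigma2    # log-weight of the second candidate
    m = np.maximum(a, b)                      # stabilize the two-point softmax
    return np.exp(a - m) / (np.exp(a - m) + np.exp(b - m))
```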

B.7 Maximizing the expected number of true discoveries

Theorem 3.2 shows that MLR statistics asymptotically maximize the (normalized) expected number of discoveries, but not necessarily the expected number of true discoveries. In this section, we sketch the derivation of AMLR statistics and prove that they asymptotically maximize the expected number of true discoveries.

This section uses the notation introduced in Section B.4. All probabilities and expectations are taken over PπP^{\pi}. As a reminder, for any feature statistic WW, let R=𝕀(sorted(W)>0)R=\mathbb{I}(\mathrm{sorted}(W)>0), let δ=𝔼[RD]\delta=\mathbb{E}[R\mid D], and let ψq()\psi_{q}(\cdot) be as defined in Equation (B.4) so that knockoffs makes τq(R)=ψq(R)+11+q\tau_{q}(R)=\left\lceil\frac{\psi_{q}(R)+1}{1+q}\right\rceil discoveries.

B.7.1 Proof of Corollary 3.1

Corollary 3.1.

AMLR statistics from Definition 3.3 are valid knockoff statistics.

Proof.

The signs of the AMLR statistics are identical to the signs of the MLR statistics. Therefore, by Propositions 3.1 and 3.2 (in the MX and FX case, respectively), it suffices to show that the absolute values of the AMLR statistics are functions of the masked data. However, the AMLR statistics magnitudes are purely a function of (i) the magnitudes of the MLR statistics and (ii) νj\nu_{j}, which is the ratio of conditional probabilities given the masked data DD. These conditional probabilities by definition are functions of DD, and since MLR statistics are valid knockoff statistics by Lemma 3.1, the MLR magnitudes are also a function of the masked data DD by Proposition 3.1. Thus, the AMLR statistic magnitudes are a function of DD, which concludes the proof. ∎

B.7.2 Proof sketch and intuition

Proof sketch: The key idea behind the proof of Theorem 3.2 is to observe that:

  1.

    The number of discoveries τq(R)\tau_{q}(R) only depends on cumulative averages of RR, denoted R¯k=1ki=1kRi\bar{R}_{k}=\frac{1}{k}\sum_{i=1}^{k}R_{i}.

  2.

    As pp\to\infty, R¯ka.s.δ¯k\bar{R}_{k}\stackrel{{\scriptstyle\mathrm{a.s.}}}{{\to}}\bar{\delta}_{k} under suitable assumptions. Thus, τq(R)τq(δ)\tau_{q}(R)\approx\tau_{q}(\delta).

  3.

    If Rπ=𝕀(sorted(MLRπ)>0)R^{\pi}=\mathbb{I}(\mathrm{sorted}(\mathrm{MLR}^{\pi})>0) are MLR statistics with δπ=𝔼[RπD]\delta^{\pi}=\mathbb{E}[R^{\pi}\mid D], then RπR^{\pi} is asymptotically optimal because τq(δπ)τq(δ)\tau_{q}(\delta^{\pi})\geq\tau_{q}(\delta) holds in finite samples for any choice of δ\delta. Thus we conclude:

    τq(Rπ)τq(δπ)τq(δ)τq(R).\tau_{q}(R^{\pi})\approx\tau_{q}(\delta^{\pi})\geq\tau_{q}(\delta)\approx\tau_{q}(R). (B.19)

    In particular, this holds because MLR statistics ensure δπ\delta^{\pi} is in descending order.

To show a similar result for the number of true discoveries, we repeat the three steps used in the proof of Theorem 3.2. To do this, let IjI_{j} be the indicator that the feature corresponding to the jjth coordinate of RR is non-null, and let Bj=IjRjB_{j}=I_{j}R_{j} be the indicator that sorted(W)j>0\mathrm{sorted}(W)_{j}>0 and that the corresponding feature is non-null. Let b=𝔼[B∣D]b=\mathbb{E}[B\mid D]. Then:

  1.

    Let Tq(R,B)T_{q}(R,B) denote the number of true discoveries. We claim that Tq(R,B)T_{q}(R,B) is a function of the successive partial means of RR and BB. To see this, recall from Section B.4 that knockoffs will make τq(R)\tau_{q}(R) discoveries, and in particular it will make discoveries corresponding to any of the first ψq(R)\psi_{q}(R) coordinates of RR which are positive. Therefore,

    Tq(R,B)=j=1ψq(R)Bj=ψq(R)1ψq(R)j=1ψq(R)Bj.T_{q}(R,B)=\sum_{j=1}^{\psi_{q}(R)}B_{j}=\psi_{q}(R)\cdot\frac{1}{\psi_{q}(R)}\sum_{j=1}^{\psi_{q}(R)}B_{j}. (B.20)

    Since ψq(R)\psi_{q}(R) only depends on the successive averages of RR and the second term is itself a successive average of {Bj}\{B_{j}\}, this finishes the first step (see the short sketch following this list).

  2.

    The second step is to show that as pp\to\infty, B¯ka.s.b¯k,R¯ka.s.δ¯k\bar{B}_{k}\stackrel{{\scriptstyle\mathrm{a.s.}}}{{\to}}\bar{b}_{k},\bar{R}_{k}\stackrel{{\scriptstyle\mathrm{a.s.}}}{{\to}}\bar{\delta}_{k} and therefore Tq(R,B)Tq(δ,b)T_{q}(R,B)\approx T_{q}(\delta,b). This is done using the same techniques as Theorem 3.2, although it requires an extra assumption that BB also obeys the local dependency condition (3.8). Like the original condition, this condition also only depends on the posterior of BDB\mid D, so it can be diagnosed using the data at hand.

  3.

    To complete the proof, we define the adjusted MLR statistic AMLRπ∈ℝp\mathrm{AMLR}^{\pi}\in\mathbb{R}^{p} with corresponding R~,δ~,b~\tilde{R},\tilde{\delta},\tilde{b} such that Tq(δ~,b~)≥Tq(δ,b)T_{q}(\tilde{\delta},\tilde{b})\geq T_{q}(\delta,b) holds in finite samples for any other feature statistic WW. It is easy to see that AMLRπ\mathrm{AMLR}^{\pi} must have the same signs as the original MLR statistics MLRπ\mathrm{MLR}^{\pi}, since the signs of MLRπ\mathrm{MLR}^{\pi} maximize δπ\delta^{\pi} and bπb^{\pi} coordinatewise. However, the absolute values of AMLRπ\mathrm{AMLR}^{\pi} may differ from those of MLRπ\mathrm{MLR}^{\pi}, since it is not always true that sorting δ\delta in decreasing order maximizes Tq(δ,b)T_{q}(\delta,b).
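To make Equation (B.20) from step 1 concrete, here is a short sketch of Tq(R,B)T_{q}(R,B); it assumes the knockoff_discoveries helper from the sketch following Equation (B.4) is in scope, and it treats the non-null indicators BB as given.

```python
import numpy as np

def true_discoveries(R, B, q):
    """T_q(R, B) from Eq. (B.20): the number of non-null features among the first
    psi_q(R) coordinates (sorted by |W|) whose feature statistic is positive.
    psi_q(R) is the first return value of knockoff_discoveries."""
    psi, _ = knockoff_discoveries(R, q)
    B = np.asarray(B)
    return int(np.sum(B[:min(psi, len(B))]))
```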

It turns out that the absolute values of the AMLR statistics in Eq. (3.12) yield vectors δ~,b~\tilde{\delta},\tilde{b} which maximize Tq(δ~,b~)T_{q}(\tilde{\delta},\tilde{b}) up to an O(1)O(1) additive constant. Theorem B.6 formally proves this, but we now give some intuition.

Intuition: To determine the optimal absolute values for AMLR statistics, assume WLOG by relabelling the variables that |MLR1π||MLR2π||MLRpπ||\mathrm{MLR}_{1}^{\pi}|\geq|\mathrm{MLR}_{2}^{\pi}|\geq\dots\geq|\mathrm{MLR}_{p}^{\pi}|. Let S[p]S\subset[p] denote an optimization variable representing the set of variables with the KK largest absolute values for the AMLR statistics, for some KK. We will try to design SS such that we can make as many true discoveries within SS as possible.

As argued above, AMLR and MLR statistics should have the same signs. Thus, roughly speaking, we can discover all features with positive signs among SS whenever

1|S|jS𝕀(MLRjπ>0)(1+q)1.\frac{1}{|S|}\sum_{j\in S}\mathbb{I}(\mathrm{MLR}_{j}^{\pi}>0)\geq(1+q)^{-1}.

Making our usual approximation 𝕀(MLRjπ>0)≈𝔼[𝕀(MLRjπ>0)∣D]=δjπ\mathbb{I}(\mathrm{MLR}_{j}^{\pi}>0)\approx\mathbb{E}[\mathbb{I}(\mathrm{MLR}_{j}^{\pi}>0)\mid D]=\delta_{j}^{\pi}, this is equivalent to the constraint

∑j∈S(1+q)−1−δjπ≤0.\sum_{j\in S}(1+q)^{-1}-\delta_{j}^{\pi}\leq 0. (B.21)

Furthermore, if we can discover all of the features with positive signs in SS, we make exactly jSBjπjSbjπ\sum_{j\in S}B_{j}^{\pi}\approx\sum_{j\in S}b_{j}^{\pi} true discoveries, where BjπB_{j}^{\pi} is the indicator that the jjth MLR statistic is positive and the jjth feature is nonnull. Maximizing this approximate objective subject to the constraint in Eq. (B.21) yields the optimization problem:

maxS⊂[p]∑j∈Sbjπ s.t. ∑j∈S(1+q)−1−δjπ≤0.\max_{S\subset[p]}\sum_{j\in S}b_{j}^{\pi}\text{ s.t. }\sum_{j\in S}(1+q)^{-1}-\delta_{j}^{\pi}\leq 0. (B.22)

In other words, including jSj\in S has “benefit” bjπb_{j}^{\pi} and “cost” (1+q)1δjπ(1+q)^{-1}-\delta_{j}^{\pi}. This is a simple integer linear program with one constraint, often known as a “knapsack” problem. An approximately optimal solution to this problem is to do the following:

  • Include all variables with “negative” costs, meaning δjπ=Pπ(MLRjπ>0D)(1+q)1\delta_{j}^{\pi}=P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)\geq(1+q)^{-1}. This is accomplished by ensuring that these features have the largest absolute values.

  • Prioritize all other variables in descending order of the ratio of the benefit to the cost, bjπ(1+q)1δjπ\frac{b_{j}^{\pi}}{(1+q)^{-1}-\delta_{j}^{\pi}}.

This solution is indeed accomplished by the AMLR formula (Eq. (3.12)), which gives the highest absolute values to features with negative costs; then, all other absolute values have the same order as the benefit-to-cost ratios bjπ(1+q)1δjπ=νj=Pπ(MLRjπ>0,j1(θ)D)(1+q)1Pπ(MLRjπ>0D)\frac{b_{j}^{\pi}}{(1+q)^{-1}-\delta_{j}^{\pi}}=\nu_{j}=\frac{P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0,j\in\mathcal{H}_{1}(\theta^{\star})\mid D)}{(1+q)^{-1}-P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)}.
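The sketch below illustrates this prioritization. It reproduces only the ordering induced by the knapsack heuristic, not the exact AMLR absolute values from Eq. (3.12), and it breaks ties among the zero- or negative-cost features arbitrarily.

```python
import numpy as np

def amlr_priority_order(delta_pi, b_pi, q):
    """Order features as in the knapsack heuristic above.

    delta_pi : P(MLR_j^pi > 0 | D) for each feature
    b_pi     : P(MLR_j^pi > 0 and j non-null | D) for each feature
    q        : nominal FDR level

    Features with cost (1+q)^{-1} - delta_pi_j <= 0 are placed first; the remaining
    features are ranked in descending order of the benefit-to-cost ratio nu_j.
    """
    delta_pi = np.asarray(delta_pi, dtype=float)
    b_pi = np.asarray(b_pi, dtype=float)
    cost = 1.0 / (1.0 + q) - delta_pi
    nu = np.where(cost > 0, b_pi / np.maximum(cost, 1e-12), np.inf)
    return np.argsort(-nu, kind="stable")     # indices from highest to lowest priority
```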

B.7.3 Theorem statement and proof

We now show that AMLR statistics asymptotically maximize the true positive rate. To do this, we require two additional regularity conditions beyond those assumed in Theorem 3.2. First, we need a condition that the number of non-nulls under PnπP_{n}^{\pi} is not too heavy-tailed; namely, that its coefficient of variation is uniformly bounded.

Assumption B.1.

There exists a constant CC\in\mathbb{R} such that as nn\to\infty,

VarPnπ(|1(θ)|)snC,\frac{\sqrt{\mathrm{Var}_{P_{n}^{\pi}}(|\mathcal{H}_{1}(\theta^{\star})|)}}{s_{n}}\leq C,

where sn𝔼Pnπ[|1(θ)|]s_{n}\coloneqq\mathbb{E}_{P_{n}^{\pi}}[|\mathcal{H}_{1}(\theta^{\star})|] is the expected number of non-nulls under PnπP_{n}^{\pi}.

Assumption B.1 is needed for a technical reason. As we will see in Step 3 of the proof, combining this assumption with Lemma C.2 ensures that the normalized number of discoveries is uniformly integrable, which is necessary to show that certain error terms converge in L1L^{1} to zero. Nonetheless, this assumption is already quite mild, and it is satisfied in previously studied linear and polynomial sparsity regimes (Donoho and Jin,, 2004; Weinstein et al.,, 2017; Ke et al.,, 2020).

Next, we need an additional local dependence condition.

Assumption B.2 (Additional local dependence condition).

Let Ij+=𝕀(MLRjπ>0,j1(θ))I_{j}^{+}=\mathbb{I}(\mathrm{MLR}_{j}^{\pi}>0,j\in\mathcal{H}_{1}(\theta^{\star})) indicate the event that jj is non-null and MLRjπ\mathrm{MLR}_{j}^{\pi} is positive. Let Ij=𝕀(MLRjπ<0,j1(θ))I_{j}^{-}=\mathbb{I}(\mathrm{MLR}_{j}^{\pi}<0,j\in\mathcal{H}_{1}(\theta^{\star})) indicate the event that jj is non-null and MLRjπ\mathrm{MLR}_{j}^{\pi} is negative. We assume that there exist constants C0,ρ(0,1)C\geq 0,\rho\in(0,1) such that for all i,j[p]i,j\in[p]:

|CovPnπ(Ii+,Ij+D(n))|Cρ|ij|.|\operatorname{Cov}_{P^{\pi}_{n}}(I_{i}^{+},I_{j}^{+}\mid D^{(n)})|\leq C\rho^{|i-j|}. (B.23)
|CovPnπ(Ii,Ij+D(n))|Cρ|ij|.|\operatorname{Cov}_{P^{\pi}_{n}}(I_{i}^{-},I_{j}^{+}\mid D^{(n)})|\leq C\rho^{|i-j|}. (B.24)
|CovPnπ(Ii,IjD(n))|Cρ|ij|.|\operatorname{Cov}_{P^{\pi}_{n}}(I_{i}^{-},I_{j}^{-}\mid D^{(n)})|\leq C\rho^{|i-j|}. (B.25)

Assumption B.2 is needed because it implies that for any feature statistic WW, {𝕀(Wi>0,i1(θ))}i=1p\{\mathbb{I}(W_{i}>0,i\in\mathcal{H}_{1}(\theta^{\star}))\}_{i=1}^{p} obey the same local dependence condition.

Lemma B.5.

Assume Assumption B.2. Then for any feature statistic WW and all i,j[p]i,j\in[p],

|CovPnπ(𝕀(Wi>0,i∈ℋ1(θ⋆)),𝕀(Wj>0,j∈ℋ1(θ⋆))∣D(n))|≤Cρ|i−j|.|\operatorname{Cov}_{P_{n}^{\pi}}(\mathbb{I}(W_{i}>0,i\in\mathcal{H}_{1}(\theta^{\star})),\mathbb{I}(W_{j}>0,j\in\mathcal{H}_{1}(\theta^{\star}))\mid D^{(n)})|\leq C\rho^{|i-j|}.
Proof.

By Corollary B.1, the event sign(Wj)≠sign(MLRjπ)\operatorname*{sign}(W_{j})\neq\operatorname*{sign}(\mathrm{MLR}_{j}^{\pi}) is D(n)D^{(n)}-measurable. This implies that for each j∈[p]j\in[p], there is a deterministic (conditional on D(n)D^{(n)}) choice of Ij+,Ij−I_{j}^{+},I_{j}^{-} such that either 𝕀(Wj>0,j∈ℋ1(θ⋆))=Ij+\mathbb{I}(W_{j}>0,j\in\mathcal{H}_{1}(\theta^{\star}))=I_{j}^{+} or 𝕀(Wj>0,j∈ℋ1(θ⋆))=Ij−\mathbb{I}(W_{j}>0,j\in\mathcal{H}_{1}(\theta^{\star}))=I_{j}^{-}. As a result, we have that

|CovPnπ(𝕀(Wi>0,i1(θ)),𝕀(Wj>0,j1(θ))D(n))|\displaystyle|\operatorname{Cov}_{P_{n}^{\pi}}(\mathbb{I}(W_{i}>0,i\in\mathcal{H}_{1}(\theta^{\star})),\mathbb{I}(W_{j}>0,j\in\mathcal{H}_{1}(\theta^{\star}))\mid D^{(n)})|
\displaystyle\leq max(|CovPnπ(Ii+,Ij+∣D(n))|,|CovPnπ(Ii+,Ij−∣D(n))|,|CovPnπ(Ii−,Ij+∣D(n))|,|CovPnπ(Ii−,Ij−∣D(n))|)\max(|\operatorname{Cov}_{P^{\pi}_{n}}(I_{i}^{+},I_{j}^{+}\mid D^{(n)})|,|\operatorname{Cov}_{P^{\pi}_{n}}(I_{i}^{+},I_{j}^{-}\mid D^{(n)})|,|\operatorname{Cov}_{P^{\pi}_{n}}(I_{i}^{-},I_{j}^{+}\mid D^{(n)})|,|\operatorname{Cov}_{P^{\pi}_{n}}(I_{i}^{-},I_{j}^{-}\mid D^{(n)})|)
\displaystyle\leq Cρ|ij|\displaystyle C\rho^{|i-j|}

where the last step follows by Assumption B.2. ∎

Theorem B.6.

Suppose the conditions of Theorem 3.2 plus Assumptions B.1 and B.2 hold. Let amlrnπ\mathrm{amlr}_{n}^{\pi} denote the AMLR statistics with respect to PnπP_{n}^{\pi}. Then for any sequence of feature statistics {wn}n\{w_{n}\}_{n\in\mathbb{N}}, the following holds for all but countably many q[0,1]q\in[0,1]:

lim infnTPqπ(amlrnπ)TPqπ(wn)sn0,\liminf_{n\to\infty}\frac{\mathrm{TP}_{q}^{\pi}(\mathrm{amlr}_{n}^{\pi})-\mathrm{TP}_{q}^{\pi}(w_{n})}{s_{n}}\geq 0, (B.26)

where sns_{n} is the expected number of non-nulls under PnπP_{n}^{\pi} as defined in Assumption 3.1, and TPqπ(wn)\mathrm{TP}_{q}^{\pi}(w_{n}) is the expected number of true discoveries made by feature statistic wnw_{n} under PnπP_{n}^{\pi} as defined in Section B.7.

Proof.

The proof is in three steps.

Step 1: Notation and setup. Throughout, we use the notation and ideas from Section B.7.2 and the proof of Theorem 3.2 (Section B.4), although to ease readability, we will try to give reminders about notation when needed. In particular:

  • Define W=wn([𝐗(n),𝐗~(n)],𝐘(n))pnW=w_{n}([\mathbf{X}^{(n)},\widetilde{\mathbf{X}}^{(n)}],\mathbf{Y}^{(n)})\in\mathbb{R}^{p_{n}} and AMLRπ=\mathrm{AMLR}^{\pi}= amlrnπ([𝐗(n),𝐗~(n)],𝐘(n))pn\mathrm{amlr}_{n}^{\pi}([\mathbf{X}^{(n)},\widetilde{\mathbf{X}}^{(n)}],\mathbf{Y}^{(n)})\in\mathbb{R}^{p_{n}}. For simplicity, we suppress the dependence on nn.

  • Define R=𝕀(sorted(W)>0)R=\mathbb{I}(\mathrm{sorted}(W)>0) and R~=𝕀(sorted(AMLRπ)>0)\tilde{R}=\mathbb{I}(\mathrm{sorted}(\mathrm{AMLR}^{\pi})>0).

  • Let σ,σ~:[pn]→[pn]\sigma,\tilde{\sigma}:[p_{n}]\to[p_{n}] denote the random permutations such that σ(W)\sigma(W) and σ~(AMLRπ)\tilde{\sigma}(\mathrm{AMLR}^{\pi}) are sorted in descending order of absolute values; with this notation, note that Rj≔𝕀(Wσ(j)>0),R~j≔𝕀(AMLRσ~(j)π>0)R_{j}\coloneqq\mathbb{I}(W_{\sigma(j)}>0),\tilde{R}_{j}\coloneqq\mathbb{I}(\mathrm{AMLR}^{\pi}_{\tilde{\sigma}(j)}>0).

  • Let Ij=𝕀(σ(j)1(θ))I_{j}=\mathbb{I}(\sigma(j)\in\mathcal{H}_{1}(\theta^{\star})) and I~j𝕀(σ~(j)1(θ))\tilde{I}_{j}\coloneqq\mathbb{I}(\tilde{\sigma}(j)\in\mathcal{H}_{1}(\theta^{\star})) be the indicators that the feature statistic with the jjth largest absolute value of WW (resp. AMLRπ\mathrm{AMLR}^{\pi}) represents a non-null feature.

  • Let Bj≔RjIjB_{j}\coloneqq R_{j}I_{j} be the indicator that the feature with the jjth largest absolute value among WW is non-null and that its feature statistic is positive. Similarly, B~j≔R~jI~j\tilde{B}_{j}\coloneqq\tilde{R}_{j}\tilde{I}_{j} is the indicator that the feature with the jjth largest AMLR statistic is non-null and its AMLR statistic is positive. Let B,B~∈ℝpnB,\tilde{B}\in\mathbb{R}^{p_{n}} denote the vectors of these indicators.

  • Define δ~=𝔼[R~D(n)],δ=𝔼[RD(n)],b~=𝔼[B~D(n)]\tilde{\delta}=\mathbb{E}[\tilde{R}\mid D^{(n)}],\delta=\mathbb{E}[R\mid D^{(n)}],\tilde{b}=\mathbb{E}[\tilde{B}\mid D^{(n)}] and b=𝔼[BD(n)]b=\mathbb{E}[B\mid D^{(n)}] to be the conditional expectations of these quantities given the masked data.

  • Throughout, we only consider feature statistics whose values are nonzero a.s., because one can provably increase the power of knockoffs by ensuring that each coordinate of WW is nonzero.

Equation (B.20) shows that we can write

TPqπ(wn)𝔼Pnπ[|Swn1(θ)|]=𝔼Pnπ[Tq(R,B)] where Tq(R,B)ψq(R)B¯ψq(R)j=1ψq(R)Bj\mathrm{TP}_{q}^{\pi}(w_{n})\coloneqq\mathbb{E}_{P_{n}^{\pi}}[|S_{w_{n}}\cap\mathcal{H}_{1}(\theta^{\star})|]=\mathbb{E}_{P_{n}^{\pi}}[T_{q}(R,B)]\text{ where }T_{q}(R,B)\coloneqq\psi_{q}(R)\bar{B}_{\psi_{q}(R)}\coloneqq\sum_{j=1}^{\psi_{q}(R)}B_{j} (B.27)

and similarly TPqπ(amlrnπ)=𝔼Pnπ[Tq(R~,B~)]\mathrm{TP}_{q}^{\pi}(\mathrm{amlr}_{n}^{\pi})=\mathbb{E}_{P_{n}^{\pi}}[T_{q}(\tilde{R},\tilde{B})]. Therefore it suffices to show that

lim infn→∞sn−1𝔼Pnπ[Tq(R~,B~)−Tq(R,B)]≥0.\liminf_{n\to\infty}s_{n}^{-1}\mathbb{E}_{P_{n}^{\pi}}\left[T_{q}(\tilde{R},\tilde{B})-T_{q}(R,B)\right]\geq 0. (B.28)

To do this, we make the following approximation using the triangle inequality:

sn1(Tq(R~,B~)Tq(R,B))sn1(Tq(δ~,b~)Tq(δ,b)Term 1|Tq(R,B)Tq(δ,b)|Term 2|Tq(R~,B~)Tq(δ~,b~)|Term 3).\displaystyle s_{n}^{-1}\left(T_{q}(\tilde{R},\tilde{B})-T_{q}(R,B)\right)\geq s_{n}^{-1}\left(\underbrace{T_{q}(\tilde{\delta},\tilde{b})-T_{q}(\delta,b)}_{\text{Term 1}}-\underbrace{|T_{q}(R,B)-T_{q}(\delta,b)|}_{\text{Term 2}}-\underbrace{|T_{q}(\tilde{R},\tilde{B})-T_{q}(\tilde{\delta},\tilde{b})|}_{\text{Term 3}}\right). (B.29)

Step 2 of the proof shows that Term 1 is asymptotically positive, and Step 3 shows that Terms 2 and 3 are asymptotically negligible in expectation (i.e., of order o(sn)o(s_{n})). This is sufficient to complete the proof.

Step 2: Analyzing Term 1. In this step, we show that

lim infn𝔼[Tq(δ~,b~)Tq(δ,b)sn]0.\liminf_{n\to\infty}\mathbb{E}\left[\frac{T_{q}(\tilde{\delta},\tilde{b})-T_{q}(\delta,b)}{s_{n}}\right]\geq 0. (B.30)

To do this, Step 2a shows that we may assume sign(AMLRπ)=sign(W)\operatorname*{sign}(\mathrm{AMLR}^{\pi})=\operatorname*{sign}(W) and therefore δ~=σ~(σ−1(δ))\tilde{\delta}=\tilde{\sigma}(\sigma^{-1}(\delta)) and b~=σ~(σ−1(b))\tilde{b}=\tilde{\sigma}(\sigma^{-1}(b)) (with the usual notation that σ(δ)≔(δσ(1),…,δσ(p))\sigma(\delta)\coloneqq(\delta_{\sigma(1)},\dots,\delta_{\sigma(p)})). Step 2b then proves Eq. (B.30).

Step 2a: Define W^=|W|sign(AMLRπ)\hat{W}=|W|\cdot\operatorname*{sign}(\mathrm{AMLR}^{\pi}) to have the absolute values of WW and the signs of the AMLR statistics. We claim that if δ^,b^\hat{\delta},\hat{b} are defined analogously for W^\hat{W} instead of WW, then Tq(δ,b)Tq(δ^,b^)T_{q}(\delta,b)\leq T_{q}(\hat{\delta},\hat{b}).

To see this, we prove that (i) δjδ^j\delta_{j}\leq\hat{\delta}_{j} holds elementwise and (ii) bjb^jb_{j}\leq\hat{b}_{j} holds elementwise. Results (i) and (ii) complete the proof of Step 2(a) since Tq(δ,b)=j=1ψq(δ)bjT_{q}(\delta,b)=\sum_{j=1}^{\psi_{q}(\delta)}b_{j} is nondecreasing in its inputs (namely because bb is nonnegative and ψq(δ)\psi_{q}(\delta) is nondecreasing in its inputs).

To show (i), Proposition 3.3 shows that δjPnπ(Wσ(j)>0D(n))Pnπ(AMLRσ(j)>0D(n))=Pnπ(W^σ(j)>0D(n))=δ^j\delta_{j}\coloneqq P_{n}^{\pi}(W_{\sigma(j)}>0\mid D^{(n)})\leq P_{n}^{\pi}(\mathrm{AMLR}_{\sigma(j)}>0\mid D^{(n)})=P_{n}^{\pi}(\hat{W}_{\sigma(j)}>0\mid D^{(n)})=\hat{\delta}_{j}, where this argument also uses the facts that (a) MLR and AMLR statistics have the same signs and (b) the permutation σ\sigma depends only on |W||W| and thus is D(n)D^{(n)}-measurable.

To show (ii), we note that the law of total probability yields

δj=Pnπ(Wσ(j)>0Ij=1,D(n))Pnπ(Ij=1D(n))+12Pnπ(Ij=0D(n))\delta_{j}=P_{n}^{\pi}(W_{\sigma(j)}>0\mid I_{j}=1,D^{(n)})P_{n}^{\pi}(I_{j}=1\mid D^{(n)})+\frac{1}{2}P_{n}^{\pi}(I_{j}=0\mid D^{(n)})

where above we use the fact that Pnπ(Wσ(j)>0Ij=0,D(n))=12P_{n}^{\pi}(W_{\sigma(j)}>0\mid I_{j}=0,D^{(n)})=\frac{1}{2}—i.e., under the null, Wσ(j)W_{\sigma(j)} is conditionally symmetric. Thus, δjδ^j\delta_{j}\leq\hat{\delta}_{j} holds iff Pnπ(Wσ(j)>0Ij=1,D(n))Pnπ(W^σ(j)>0Ij=1,D(n))P_{n}^{\pi}(W_{\sigma(j)}>0\mid I_{j}=1,D^{(n)})\leq P_{n}^{\pi}(\hat{W}_{\sigma(j)}>0\mid I_{j}=1,D^{(n)}). Using this result, we conclude

bj\displaystyle b_{j} =Pnπ(Wσ(j)>0,Ij=1D(n))\displaystyle=P_{n}^{\pi}(W_{\sigma(j)}>0,I_{j}=1\mid D^{(n)})
=Pnπ(Wσ(j)>0Ij=1,D(n))Pnπ(Ij=1D(n))\displaystyle=P_{n}^{\pi}(W_{\sigma(j)}>0\mid I_{j}=1,D^{(n)})P_{n}^{\pi}(I_{j}=1\mid D^{(n)})
Pnπ(W^σ(j)>0Ij=1,D(n))Pnπ(Ij=1D(n))\displaystyle\leq P_{n}^{\pi}(\hat{W}_{\sigma(j)}>0\mid I_{j}=1,D^{(n)})P_{n}^{\pi}(I_{j}=1\mid D^{(n)}) since δjδ^j by result (i)\displaystyle\text{ since }\delta_{j}\leq\hat{\delta}_{j}\text{ by result (i)}
=b^j.\displaystyle=\hat{b}_{j}.

This proves that T(δ^,b^)T(δ,b)T(\hat{\delta},\hat{b})\geq T(\delta,b). In this step we seek to show that T(δ~,b~)T(δ,b)T(\tilde{\delta},\tilde{b})\geq T(\delta,b); since replacing WW with W^\hat{W} leaves the AMLR statistics (and hence δ~,b~\tilde{\delta},\tilde{b}) unchanged and can only increase T(δ,b)T(\delta,b), it suffices to prove this claim with W^\hat{W} in place of WW. Thus, we may assume that W=W^W=\hat{W} and hence sign(W)=sign(AMLRπ)\operatorname*{sign}(W)=\operatorname*{sign}(\mathrm{AMLR}^{\pi}). This implies that δ,b\delta,b and δ~,b~\tilde{\delta},\tilde{b} take the same values but in different orders; formally, δ~=σ~(σ1(δ))\tilde{\delta}=\tilde{\sigma}(\sigma^{-1}(\delta)) and b~=σ~(σ1(b))\tilde{b}=\tilde{\sigma}(\sigma^{-1}(b)).

Step 2b: Now we show Eq. (B.30). Recall that Tq(δ,b)=j=1ψq(δ)bjT_{q}(\delta,b)=\sum_{j=1}^{\psi_{q}(\delta)}b_{j} is the partial sum of the first ψq(δ)\psi_{q}(\delta) elements of bb, where ψq(δ)max{k:kkδ¯k+1kδ¯kq}\psi_{q}(\delta)\coloneqq\max\left\{k:\frac{k-k\bar{\delta}_{k}+1}{k\bar{\delta}_{k}}\leq q\right\} is the maximum integer such that kkδ¯k+1kδ¯kq\frac{k-k\bar{\delta}_{k}+1}{k\bar{\delta}_{k}}\leq q. It follows that Tq(δ,b)T_{q}(\delta,b) is bounded by the following quantity:

Tq(δ,b)maxS[p]jSbj s.t. |S||S|δ¯S+1|S|δ¯Sq,T_{q}(\delta,b)\leq\max_{S\subset[p]}\sum_{j\in S}b_{j}\text{ s.t. }\frac{|S|-|S|\bar{\delta}_{S}+1}{|S|\bar{\delta}_{S}}\leq q,

where the notation δ¯S=1|S|jSδj\bar{\delta}_{S}=\frac{1}{|S|}\sum_{j\in S}\delta_{j} denotes the average of δj\delta_{j} over the set SS. Indeed, this inequality follows because Tq(δ,b)T_{q}(\delta,b) is precisely the optimal value of this optimization problem when SS is constrained to be a contiguous set of the form {1,,k}\{1,\dots,k\} for some kk. Relaxing this constraint to allow SS to be an arbitrary subset of [p][p] can only increase the optimal value. Manipulating this optimization problem yields:

maxS[p]jSbj s.t. |S||S|δ¯S+1|S|δ¯Sq=\displaystyle\max_{S\subset[p]}\sum_{j\in S}b_{j}\text{ s.t. }\frac{|S|-|S|\bar{\delta}_{S}+1}{|S|\bar{\delta}_{S}}\leq q= maxS[p]jSbj s.t. |S||S|δ¯S+1(|S|δ¯S)q\displaystyle\max_{S\subset[p]}\sum_{j\in S}b_{j}\text{ s.t. }|S|-|S|\bar{\delta}_{S}+1\leq(|S|\bar{\delta}_{S})q
=\displaystyle= maxS[p]j=1p𝕀(jS)bj s.t. j=1p𝕀(jS)(1(1+q)δj)1\displaystyle\max_{S\subset[p]}\sum_{j=1}^{p}\mathbb{I}(j\in S)b_{j}\text{ s.t. }\sum_{j=1}^{p}\mathbb{I}(j\in S)\left(1-(1+q)\delta_{j}\right)\leq-1
=\displaystyle= maxS[p]j=1p𝕀(jS)bj s.t. j=1p𝕀(jS)(11+qδj)11+q.\displaystyle\max_{S\subset[p]}\sum_{j=1}^{p}\mathbb{I}(j\in S)b_{j}\text{ s.t. }\sum_{j=1}^{p}\mathbb{I}(j\in S)\left(\frac{1}{1+q}-\delta_{j}\right)\leq-\frac{1}{1+q}.

This is an integer linear program with pp integer decision variables xj𝕀(jS)x_{j}\coloneqq\mathbb{I}(j\in S) and one constraint:

=maxx1,,xp{0,1}j=1pbjxj s.t. j=1p(11+qδj)xj11+q.=\max_{x_{1},\dots,x_{p}\in\{0,1\}}\sum_{j=1}^{p}b_{j}x_{j}\text{ s.t. }\sum_{j=1}^{p}\left(\frac{1}{1+q}-\delta_{j}\right)x_{j}\leq-\frac{1}{1+q}.

Such problems—commonly referred to as knapsack problems—are well studied. The maximum value is bounded by the following greedy strategy:

  • Let Sobvious={j[p]:δj11+q}S_{\mathrm{obvious}}=\{j\in[p]:\delta_{j}\geq\frac{1}{1+q}\} be the set of coordinates such that the constraint coefficient on xjx_{j} is nonpositive. This is an “obvious” set because for jSobviousj\in S_{\mathrm{obvious}}, setting xj=1x_{j}=1 never decreases the objective value (since bj0b_{j}\geq 0) and never increases the constraint value (since (1+q)1δj0(1+q)^{-1}-\delta_{j}\leq 0). Thus, there is always an optimal solution with xj=1x_{j}=1 for all jSobviousj\in S_{\mathrm{obvious}}.

  • Let nobvious=|Sobvious|n_{\mathrm{obvious}}=|S_{\mathrm{obvious}}| denote the number of obvious coordinates, and let Snonobvious=[p]SobviousS_{\mathrm{non-obvious}}=[p]\setminus S_{\mathrm{obvious}} denote the non-obvious coordinates.

  • After setting xj=1x_{j}=1 for all coordinates jSobviousj\in S_{\mathrm{obvious}}, we should sort the coordinates in SnonobviousS_{\mathrm{non-obvious}} in descending order of the ratio bj(1+q)1δj\frac{b_{j}}{(1+q)^{-1}-\delta_{j}} and include as many coordinates of SnonobviousS_{\mathrm{non-obvious}} as possible until we hit the constraint that j=1pxj((1+q)1δj)11+q\sum_{j=1}^{p}x_{j}((1+q)^{-1}-\delta_{j})\leq-\frac{1}{1+q}. Then, if we include one additional coordinate, the value of this solution (which may violate the constraint by a small amount) upper bounds the optimal value of the overall integer program. Indeed, this is because this solution has an objective value at least as large as the optimal value of the relaxed LP, which only requires x1,,xp[0,1]x_{1},\dots,x_{p}\in[0,1] instead of x1,,xp{0,1}x_{1},\dots,x_{p}\in\{0,1\}, and the relaxed LP in turn upper bounds the integer program (Martello and Toth,, 1990). A short numerical sketch of this greedy bound appears after this list.
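
The following Python sketch implements the greedy bound just described on hypothetical toy inputs; it is illustrative only and is not used in the proof.

import numpy as np

def greedy_knapsack_bound(b, delta, q):
    # Upper bound sum_{j=1}^{k*+1} b_{gamma(j)} on the knapsack problem above, where gamma
    # puts S_obvious first and then sorts by the ratio b_j / ((1+q)^{-1} - delta_j).
    b, delta = np.asarray(b, dtype=float), np.asarray(delta, dtype=float)
    c = 1.0 / (1.0 + q) - delta                          # constraint coefficients
    obvious = np.where(c <= 0)[0]                        # S_obvious: delta_j >= 1/(1+q)
    rest = np.where(c > 0)[0]
    rest = rest[np.argsort(-b[rest] / c[rest])]          # descending ratio
    gamma = np.concatenate([obvious, rest]).astype(int)  # one valid choice of gamma
    csum = np.cumsum(c[gamma])
    feasible = np.where(csum <= -1.0 / (1.0 + q))[0]
    if len(feasible) == 0:
        return 0.0                                       # no feasible S exists, so T_q(delta, b) = 0
    k_star = int(feasible.max()) + 1                     # k* as a count of included coordinates
    return float(b[gamma[: min(k_star + 1, len(gamma))]].sum())

# Hypothetical toy inputs with b_j <= delta_j.
rng = np.random.default_rng(0)
delta = rng.uniform(0.3, 1.0, size=20)
b = delta * rng.uniform(0.5, 1.0, size=20)
print(greedy_knapsack_bound(b, delta, q=0.2))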

To make this strategy precise, let γ:[p][p]\gamma:[p]\to[p] denote any permutation with the following properties:

  • γ(1),,γ(nobvious)Sobvious\gamma(1),\dots,\gamma(n_{\mathrm{obvious}})\in S_{\mathrm{obvious}}. I.e., the first nobviousn_{\mathrm{obvious}} coordinates specified by γ\gamma are the set SobviousS_{\mathrm{obvious}}.

  • For nobvious<ijn_{\mathrm{obvious}}<i\leq j, bγ(i)(1+q)1δγ(i)bγ(j)(1+q)1δγ(j)\frac{b_{\gamma(i)}}{(1+q)^{-1}-\delta_{\gamma(i)}}\geq\frac{b_{\gamma(j)}}{(1+q)^{-1}-\delta_{\gamma(j)}}. I.e., γ\gamma orders the rest of the coordinates in descending order of the ratio bj(1+q)1δj\frac{b_{j}}{(1+q)^{-1}-\delta_{j}}.

Then, let kmax{k[p]:j=1k((1+q)1δγ(j))11+q}k^{\star}\coloneqq\max\left\{k\in[p]:\sum_{j=1}^{k}((1+q)^{-1}-\delta_{\gamma(j)})\leq-\frac{1}{1+q}\right\} denote the maximum value of kk such that setting xγ(1),,xγ(k)=1x_{\gamma(1)},\dots,x_{\gamma(k)}=1 yields a feasible solution to the integer LP. Then we have that

Tq(δ,b)maxx1,,xp{0,1}[j=1pbjxj s.t. j=1p(11+qδj)xj11+q]j=1k+1bγ(j).T_{q}(\delta,b)\leq\max_{x_{1},\dots,x_{p}\in\{0,1\}}\left[\sum_{j=1}^{p}b_{j}x_{j}\text{ s.t. }\sum_{j=1}^{p}\left(\frac{1}{1+q}-\delta_{j}\right)x_{j}\leq-\frac{1}{1+q}\right]\leq\sum_{j=1}^{k^{\star}+1}b_{\gamma(j)}.
Remark 6.

γ\gamma is not uniquely specified by the construction above, but the bound holds for any such γ\gamma.

Step 2a shows that for some permutation κ:[p][p]\kappa:[p]\to[p], we can write δ~=κ(δ)\tilde{\delta}=\kappa(\delta) and b~=κ(b)\tilde{b}=\kappa(b). We note that κ\kappa must satisfy the following:

  • κ(1),,κ(nobvious)Sobvious\kappa(1),\dots,\kappa(n_{\mathrm{obvious}})\in S_{\mathrm{obvious}}. This is because the AMLR statistics with the top absolute values are constructed to be the AMLR statistics such that Pnπ(AMLRjπ>0D)=Pnπ(MLRjπ>0D)(1+q)1P_{n}^{\pi}(\mathrm{AMLR}_{j}^{\pi}>0\mid D)=P_{n}^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)\geq(1+q)^{-1} (see Definition 3.3), which exactly coincides with the definition of the set SobviousS_{\mathrm{obvious}}.

  • For nobvious<ijn_{\mathrm{obvious}}<i\leq j, bκ(i)(1+q)1δκ(i)bκ(j)(1+q)1δκ(j)\frac{b_{\kappa(i)}}{(1+q)^{-1}-\delta_{\kappa(i)}}\geq\frac{b_{\kappa(j)}}{(1+q)^{-1}-\delta_{\kappa(j)}}. This again follows from Definition 3.3, as the absolute values of the AMLR statistics are explicitly chosen to guarantee this ordering.

In other words, κ\kappa satisfies the same properties as γ\gamma above. Thus, we may take γ=κ\gamma=\kappa. This yields the bound Tq(δ,b)j=1k+1bγ(j)T_{q}(\delta,b)\leq\sum_{j=1}^{k^{\star}+1}b_{\gamma(j)}. However, we also know that

Tq(δ~,b~)=j=1ψq(δ~)b~j=j=1ψq(δ~)bγ(j)=j=1kbγ(j),\displaystyle T_{q}(\tilde{\delta},\tilde{b})=\sum_{j=1}^{\psi_{q}(\tilde{\delta})}\tilde{b}_{j}=\sum_{j=1}^{\psi_{q}(\tilde{\delta})}b_{\gamma(j)}=\sum_{j=1}^{k^{\star}}b_{\gamma(j)},

where the last step follows because

ψq(δ~)=ψq(κ(δ))\displaystyle\psi_{q}(\tilde{\delta})=\psi_{q}(\kappa(\delta)) max{k[p]:kj=1kδγ(j)+1j=1kδγ(j)q}\displaystyle\coloneqq\max\left\{k\in[p]:\frac{k-\sum_{j=1}^{k}\delta_{\gamma(j)}+1}{\sum_{j=1}^{k}\delta_{\gamma(j)}}\leq q\right\} by definition
=max{k[p]:j=1k((1+q)1δγ(j))11+q}\displaystyle=\max\left\{k\in[p]:\sum_{j=1}^{k}((1+q)^{-1}-\delta_{\gamma(j)})\leq-\frac{1}{1+q}\right\} by algebraic manipulation
k.\displaystyle\coloneqq k^{\star}. by definition

To summarize, this implies that

Tq(δ,b)j=1k+1bγ(j)j=1kbγ(j)+1=Tq(δ~,b~)+1.T_{q}(\delta,b)\leq\sum_{j=1}^{k^{\star}+1}b_{\gamma(j)}\leq\sum_{j=1}^{k^{\star}}b_{\gamma(j)}+1=T_{q}(\tilde{\delta},\tilde{b})+1.

This completes the proof of Step 2, since as a consequence,

𝔼[Tq(δ~,b~)Tq(δ,b)sn]1sn0.\mathbb{E}\left[\frac{T_{q}(\tilde{\delta},\tilde{b})-T_{q}(\delta,b)}{s_{n}}\right]\geq-\frac{1}{s_{n}}\to 0. (B.31)

Step 3: In this step, we show that 𝔼[|Tq(δ,b)Tq(R,B)|]sn0\frac{\mathbb{E}[|T_{q}(\delta,b)-T_{q}(R,B)|]}{s_{n}}\to 0 holds for all but countably many q[0,1]q\in[0,1]. The same logic applies to the term involving Tq(R~,B~)Tq(δ~,b~)T_{q}(\tilde{R},\tilde{B})-T_{q}(\tilde{\delta},\tilde{b}).

To do this, we will essentially bound |Tq(δ,b)Tq(R,B)||T_{q}(\delta,b)-T_{q}(R,B)| by the quantities maxkkn|B¯kb¯k|\max_{k\geq k_{n}}|\bar{B}_{k}-\bar{b}_{k}| (for a sequence knk_{n} chosen below) and |ψq(R)ψq(δ)||\psi_{q}(R)-\psi_{q}(\delta)|; as we shall see, these are both oL1(sn)o_{L_{1}}(s_{n}). To begin, the triangle inequality yields:

|Tq(R,B)Tq(δ,b)|\displaystyle|T_{q}(R,B)-T_{q}(\delta,b)| =|ψq(R)B¯ψq(R)ψq(δ)b¯ψq(δ)|\displaystyle=\left|\psi_{q}(R)\bar{B}_{\psi_{q}(R)}-\psi_{q}(\delta)\bar{b}_{\psi_{q}(\delta)}\right| by definition
=|ψq(R)B¯ψq(R)ψq(R)b¯ψq(δ)+ψq(R)b¯ψq(δ)ψq(δ)b¯ψq(δ)|\displaystyle=\left|\psi_{q}(R)\bar{B}_{\psi_{q}(R)}-\psi_{q}(R)\bar{b}_{\psi_{q}(\delta)}+\psi_{q}(R)\bar{b}_{\psi_{q}(\delta)}-\psi_{q}(\delta)\bar{b}_{\psi_{q}(\delta)}\right|
ψq(R)|B¯ψq(R)b¯ψq(δ)|+b¯ψq(δ)|ψq(R)ψq(δ)|\displaystyle\leq\psi_{q}(R)|\bar{B}_{\psi_{q}(R)}-\bar{b}_{\psi_{q}(\delta)}|+\bar{b}_{\psi_{q}(\delta)}|\psi_{q}(R)-\psi_{q}(\delta)| triangle inequality
ψq(R)|B¯ψq(R)b¯ψq(δ)|Term (a)+|ψq(R)ψq(δ)|Term (b)\displaystyle\leq\underbrace{\psi_{q}(R)|\bar{B}_{\psi_{q}(R)}-\bar{b}_{\psi_{q}(\delta)}|}_{\text{Term (a)}}+\underbrace{|\psi_{q}(R)-\psi_{q}(\delta)|}_{\text{Term (b)}} since b¯ψq(δ)[0,1].\displaystyle\text{ since }\bar{b}_{\psi_{q}(\delta)}\in[0,1]. (B.32)

First, we bound Term (a). Applying the triangle inequality (again) plus a simple algebraic lemma yields:

ψq(R)|B¯ψq(R)b¯ψq(δ)|\displaystyle\psi_{q}(R)|\bar{B}_{\psi_{q}(R)}-\bar{b}_{\psi_{q}(\delta)}| ψq(R)|B¯ψq(R)b¯ψq(R)|+ψq(R)|b¯ψq(R)b¯ψq(δ)|\displaystyle\leq\psi_{q}(R)|\bar{B}_{\psi_{q}(R)}-\bar{b}_{\psi_{q}(R)}|+\psi_{q}(R)|\bar{b}_{\psi_{q}(R)}-\bar{b}_{\psi_{q}(\delta)}|
ψq(R)|B¯ψq(R)b¯ψq(R)|+2|ψq(R)ψq(δ)|\displaystyle\leq\psi_{q}(R)|\bar{B}_{\psi_{q}(R)}-\bar{b}_{\psi_{q}(R)}|+2|\psi_{q}(R)-\psi_{q}(\delta)| by Lemma B.7.\displaystyle\text{ by Lemma \ref{lem::algebra4powerresult}}.

The second line is not obvious, but it follows because Lemma B.7 shows that |b¯ψq(R)b¯ψq(δ)|2|ψq(R)ψq(δ)|max(ψq(R),ψq(δ))|\bar{b}_{\psi_{q}(R)}-\bar{b}_{\psi_{q}(\delta)}|\leq\frac{2|\psi_{q}(R)-\psi_{q}(\delta)|}{\max(\psi_{q}(R),\psi_{q}(\delta))}. This algebraic result follows intuitively because b¯k[0,1]\bar{b}_{k}\in[0,1] holds for all kk; thus, |b¯ψq(R)b¯ψq(δ)||\bar{b}_{\psi_{q}(R)}-\bar{b}_{\psi_{q}(\delta)}| cannot be large unless |ψq(R)ψq(δ)||\psi_{q}(R)-\psi_{q}(\delta)| is also large.

Combining this bound on Term (a) with the initial result in Eq. (B.32) yields:

sn1𝔼[|Tq(R,B)Tq(δ,b)|]3sn1𝔼[|ψq(R)ψq(δ)|]Term (b)+sn1𝔼[ψq(R)|B¯ψq(R)b¯ψq(R)|]Term (c).{s_{n}}^{-1}\mathbb{E}\left[|T_{q}(R,B)-T_{q}(\delta,b)|\right]\leq 3\underbrace{{s_{n}}^{-1}\mathbb{E}\left[|\psi_{q}(R)-\psi_{q}(\delta)|\right]}_{\text{Term (b)}}+\underbrace{{s_{n}}^{-1}\mathbb{E}\left[\psi_{q}(R)|\bar{B}_{\psi_{q}(R)}-\bar{b}_{\psi_{q}(R)}|\right]}_{\text{Term (c)}}.

To show that Term (b) vanishes, recall that Step 2 of the proof of Theorem 3.2 shows that sn1𝔼[|τq(R)τq(δ)|]0s_{n}^{-1}\mathbb{E}[|\tau_{q}(R)-\tau_{q}(\delta)|]\to 0 for all but countably many q[0,1]q\in[0,1]. Moreover, τq(R)=ψq(R)+11+q=11+qψq(R)+O(1)\tau_{q}(R)=\left\lceil\frac{\psi_{q}(R)+1}{1+q}\right\rceil=\frac{1}{1+q}\psi_{q}(R)+O(1) and similarly τq(δ)=11+qψq(δ)+O(1)\tau_{q}(\delta)=\frac{1}{1+q}\psi_{q}(\delta)+O(1). Therefore, sn1𝔼[|ψq(R)ψq(δ)|]0{s_{n}}^{-1}\mathbb{E}\left[|\psi_{q}(R)-\psi_{q}(\delta)|\right]\to 0 for all but countably many q[0,1]q\in[0,1].

Thus, it suffices to show that Term (c) vanishes. To do this, fix a sequence of integers {kn}n=1\{k_{n}\}_{n=1}^{\infty} such that knlog(pn)5k_{n}\sim\log(p_{n})^{5}. Separately considering the cases where ψq(R)<kn\psi_{q}(R)<k_{n} and ψq(R)kn\psi_{q}(R)\geq k_{n} yields

𝔼[sn1ψq(R)|B¯ψq(R)b¯ψq(R)|]\displaystyle\mathbb{E}\left[{s_{n}}^{-1}\psi_{q}(R)|\bar{B}_{\psi_{q}(R)}-\bar{b}_{\psi_{q}(R)}|\right] 𝔼[|B¯ψq(R)b¯ψq(R)|sn1kn]+𝔼[sn1ψq(R)maxkkn|B¯kb¯k|]\displaystyle\leq\mathbb{E}[|\bar{B}_{\psi_{q}(R)}-\bar{b}_{\psi_{q}(R)}|s_{n}^{-1}k_{n}]+\mathbb{E}[s_{n}^{-1}\psi_{q}(R)\max_{k\geq k_{n}}|\bar{B}_{k}-\bar{b}_{k}|]
2sn1kn+𝔼[sn1ψq(R)maxkkn|B¯kb¯k|]Term (d)\displaystyle\leq 2s_{n}^{-1}k_{n}+\underbrace{\mathbb{E}[s_{n}^{-1}\psi_{q}(R)\max_{k\geq k_{n}}|\bar{B}_{k}-\bar{b}_{k}|]}_{\text{Term (d)}}

where the second line follows because B¯ψq(R),b¯ψq(R)[0,1]\bar{B}_{\psi_{q}(R)},\bar{b}_{\psi_{q}(R)}\in[0,1]. Assumption 3.1 guarantees that sn1kn0s_{n}^{-1}k_{n}\to 0, so it suffices to show that Term (d) vanishes. To do this, we observe:

  • Assumption B.2 plus Lemma B.5 shows that BB satisfies an exponential decay condition, so we may apply Lemma C.1. Namely, for knlog(pn)5k_{n}\sim\log(p_{n})^{5}, maxkkn|B¯kb¯k|p0\max_{k\geq k_{n}}|\bar{B}_{k}-\bar{b}_{k}|\stackrel{{\scriptstyle\mathrm{p}}}{{\to}}0 (see Theorem 3.2, Step 2 for more details).

  • Lemma C.2 shows that there exists a universal constant CC such that 𝔼[τq(R)2]C𝔼Pnπ[|1(θ)|2]\mathbb{E}[\tau_{q}(R)^{2}]\leq C\mathbb{E}_{P_{n}^{\pi}}[|\mathcal{H}_{1}(\theta^{\star})|^{2}]. Note that ψq(R)(1+q)τq(R)\psi_{q}(R)\leq(1+q)\tau_{q}(R) and the coefficient of variation of |1(θ)||\mathcal{H}_{1}(\theta^{\star})| is bounded under Assumption B.1, so there exists another universal constant CC^{\prime} such that 𝔼Pnπ[ψq(R)2](1+q)2𝔼Pnπ[τq(R)2]Csn2\mathbb{E}_{P_{n}^{\pi}}[\psi_{q}(R)^{2}]\leq(1+q)^{2}\mathbb{E}_{P_{n}^{\pi}}[\tau_{q}(R)^{2}]\leq C^{\prime}s_{n}^{2}. This implies that sn1ψq(R){s_{n}}^{-1}\psi_{q}(R) is uniformly integrable since it has a uniformly bounded second moment.

  • The latter two observations imply that 𝔼[sn1ψq(R)maxkkn|B¯kb¯k|]0\mathbb{E}[s_{n}^{-1}\psi_{q}(R)\max_{k\geq k_{n}}|\bar{B}_{k}-\bar{b}_{k}|]\to 0, since sn1ψq(R)s_{n}^{-1}\psi_{q}(R) is uniformly integrable, and maxkkn|B¯kb¯k|2\max_{k\geq k_{n}}|\bar{B}_{k}-\bar{b}_{k}|\leq 2 is also uniformly bounded and vanishes in probability.

Together, this proves that for all but countably many q[0,1]q\in[0,1], 𝔼[|Tq(δ,b)Tq(R,B)|]sn0\frac{\mathbb{E}[|T_{q}(\delta,b)-T_{q}(R,B)|]}{s_{n}}\to 0. ∎

The following algebraic lemma is used at the end of Step 3 of the proof of Theorem B.6.

Lemma B.7.

For b=(b1,,bp)[0,1]pb=(b_{1},\dots,b_{p})\in[0,1]^{p}, fix k,[p]k,\ell\in[p]. Then

max(k,)|b¯kb¯|2|k|.\max(k,\ell)\cdot|\bar{b}_{k}-\bar{b}_{\ell}|\leq 2|k-\ell|.
Proof.

Define m=min(k,)m=\min(k,\ell) and M=max(k,)M=\max(k,\ell). The lemma holds trivially when m=Mm=M, so we may assume m<Mm<M. Then we have that

|b¯kb¯|\displaystyle\left|\bar{b}_{k}-\bar{b}_{\ell}\right| |1ki=1kbi1i=1bi|\displaystyle\coloneqq\left|\frac{1}{k}\sum_{i=1}^{k}b_{i}-\frac{1}{\ell}\sum_{i=1}^{\ell}b_{i}\right|
=|i=1mbi(1m1M)1Mi=m+1Mbi|\displaystyle=\left|\sum_{i=1}^{m}b_{i}\left(\frac{1}{m}-\frac{1}{M}\right)-\frac{1}{M}\sum_{i=m+1}^{M}b_{i}\right| by definition of m,M\displaystyle\text{ by definition of }m,M
i=1mbi(1m1M)+1Mi=m+1Mbi\displaystyle\leq\sum_{i=1}^{m}b_{i}\left(\frac{1}{m}-\frac{1}{M}\right)+\frac{1}{M}\sum_{i=m+1}^{M}b_{i} since bi0,m<M\displaystyle\text{ since }b_{i}\geq 0,m<M
m(1m1M)+1M(Mm)\displaystyle\leq m\left(\frac{1}{m}-\frac{1}{M}\right)+\frac{1}{M}\left(M-m\right) since bi1.\displaystyle\text{ since }b_{i}\leq 1.

Therefore we conclude that

M|b¯kb¯|Mm(1m1M)+MM(Mm)=2(Mm)=2|k|M|\bar{b}_{k}-\bar{b}_{\ell}|\leq Mm\left(\frac{1}{m}-\frac{1}{M}\right)+\frac{M}{M}(M-m)=2(M-m)=2|k-\ell|

which completes the proof. ∎
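
As a quick numerical sanity check of Lemma B.7 (purely illustrative, not part of the argument), one can verify the inequality on random inputs in Python:

import numpy as np

rng = np.random.default_rng(1)
for _ in range(10_000):
    p = int(rng.integers(2, 50))
    b = rng.uniform(0, 1, size=p)
    k, l = rng.integers(1, p + 1, size=2)
    bbar = lambda m: b[:m].mean()          # running average of the first m entries of b
    assert max(k, l) * abs(bbar(k) - bbar(l)) <= 2 * abs(k - l) + 1e-12
print("Lemma B.7 inequality held on all random draws.")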

B.8 Verifying the local dependence assumption in a simple setting

We now verify the local dependency condition (3.8) in the setting where 𝐗T𝐗\mathbf{X}^{T}\mathbf{X} is block-diagonal and σ2\sigma^{2} is known.

Proposition B.2.

Suppose 𝐗T𝐗\mathbf{X}^{T}\mathbf{X} is a block-diagonal matrix with maximum block size MM\in\mathbb{N}. Suppose PπP^{\pi} is any Bayesian model such that (i) the model class 𝒫\mathcal{P} is the class of all Gaussian models of the form 𝐘𝐗𝒩(𝐗β,σ2In)\mathbf{Y}\mid\mathbf{X}\sim\mathcal{N}(\mathbf{X}\beta,\sigma^{2}I_{n}) for θ=(β,σ2)Θp×0\theta=(\beta,\sigma^{2})\in\Theta\coloneqq\mathbb{R}^{p}\times\mathbb{R}_{\geq 0}, (ii) the coordinates of β\beta are marginally independent under PπP^{\pi} and (iii) σ2\sigma^{2} is a constant under PπP^{\pi}. Then if 𝐗~\widetilde{\mathbf{X}} are either fixed-X knockoffs or conditional Gaussian model-X knockoffs (Huang and Janson,, 2020), the coordinates of sign(MLRπ)\operatorname*{sign}(\mathrm{MLR}^{\pi}) are MM-dependent conditional on DD under PπP^{\pi}, implying that Equation (3.8) holds, e.g., with C=2MC=2^{M} and ρ=12\rho=\frac{1}{2}.

Proof.

Define R𝕀(MLRπ>0)R\coloneqq\mathbb{I}(\mathrm{MLR}^{\pi}>0). We will prove the stronger result that if J1,,Jm[p]J_{1},\dots,J_{m}\subset[p] are a partition of [p][p] corresponding to the blocks of 𝐗T𝐗\mathbf{X}^{T}\mathbf{X}, then RJ1,,RJmR_{J_{1}},\dots,R_{J_{m}} are jointly independent conditional on DD. As notation, suppose without loss of generality that J1,,JmJ_{1},\dots,J_{m} are contiguous subsets and 𝐗T𝐗=diag{Σ1,,Σm}\mathbf{X}^{T}\mathbf{X}=\mbox{$\mathrm{diag}\left\{\Sigma_{1},\dots,\Sigma_{m}\right\}$} for Σi|Ji|×|Ji|\Sigma_{i}\in\mathbb{R}^{|J_{i}|\times|J_{i}|}. All probabilities and expectations are taken under PπP^{\pi}.

We give the proof for model-X knockoffs; the proof for fixed-X knockoffs is quite similar. Recall by Proposition 3.1 that we can write Rj=𝕀(Wj>0)=𝕀(𝐗j=𝐗^j)R_{j}=\mathbb{I}(W_{j}>0)=\mathbb{I}(\mathbf{X}_{j}=\widehat{\mathbf{X}}_{j}) where 𝐗^j\widehat{\mathbf{X}}_{j} is a function of the masked data DD. Therefore, to show RJ1,,RJmR_{J_{1}},\dots,R_{J_{m}} are independent conditional on DD, it suffices to show 𝐗J1,,𝐗Jm\mathbf{X}_{J_{1}},\dots,\mathbf{X}_{J_{m}} are conditionally independent given DD. To do this, it will first be useful to note that the likelihood is

P𝐘𝐗(β,σ)(𝐘𝐗)\displaystyle P_{\mathbf{Y}\mid\mathbf{X}}^{(\beta,\sigma)}(\mathbf{Y}\mid\mathbf{X}) exp(12σ2𝐘𝐗β22)\displaystyle\propto\exp\left(-\frac{1}{2\sigma^{2}}||\mathbf{Y}-\mathbf{X}\beta||_{2}^{2}\right)
exp(2βT𝐗T𝐘βT𝐗T𝐗β2σ2)\displaystyle\propto\exp\left(\frac{2\beta^{T}\mathbf{X}^{T}\mathbf{Y}-\beta^{T}\mathbf{X}^{T}\mathbf{X}\beta}{2\sigma^{2}}\right)
i=1mexp(2βJiT𝐗JiT𝐘βJiTΣiβJi2σ2),\displaystyle\propto\prod_{i=1}^{m}\exp\left(\frac{2\beta_{J_{i}}^{T}\mathbf{X}_{J_{i}}^{T}\mathbf{Y}-\beta_{J_{i}}^{T}\Sigma_{i}\beta_{J_{i}}}{2\sigma^{2}}\right),

where above, we only include terms depending on 𝐗\mathbf{X}, since these are the only terms relevant to the later stages of the proof. A subtle but important observation in the calculation above is that we can verify that 𝐗T𝐗=diag{Σ1,,Σm}\mathbf{X}^{T}\mathbf{X}=\mbox{$\mathrm{diag}\left\{\Sigma_{1},\dots,\Sigma_{m}\right\}$} using only the masked data DD, without observing 𝐗\mathbf{X} itself. Indeed, this follows because for conditional Gaussian MX knockoffs, 𝐗~T𝐗~=𝐗T𝐗\widetilde{\mathbf{X}}^{T}\widetilde{\mathbf{X}}=\mathbf{X}^{T}\mathbf{X} and 𝐗~T𝐗\widetilde{\mathbf{X}}^{T}\mathbf{X} only differs from 𝐗T𝐗\mathbf{X}^{T}\mathbf{X} on the main diagonal (just like in the fixed-X case). With this observation in mind, we now abuse notation slightly and let p()p(\cdot\mid\cdot) denote an arbitrary conditional density under PπP^{\pi}. Observe that

p(𝐗D)\displaystyle p(\mathbf{X}\mid D) p(𝐗,𝐘{𝐗j,𝐗~j}j=1p)\displaystyle\propto p(\mathbf{X},\mathbf{Y}\mid\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}_{j=1}^{p})
=p(𝐗{𝐗j,𝐗~j}j=1p)p(𝐘𝐗) since 𝐘𝐗~𝐗\displaystyle=p(\mathbf{X}\mid\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}_{j=1}^{p})p(\mathbf{Y}\mid\mathbf{X})\,\,\,\,\,\,\,\,\,\,\text{ since }\mathbf{Y}\perp\!\!\!\perp\widetilde{\mathbf{X}}\mid\mathbf{X}
=12pp(𝐘𝐗) by pairwise exchangeability\displaystyle=\frac{1}{2^{p}}p(\mathbf{Y}\mid\mathbf{X})\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\text{ by pairwise exchangeability}
βp(β)p(𝐘𝐗,β)𝑑β\displaystyle\propto\int_{\beta}p(\beta)p(\mathbf{Y}\mid\mathbf{X},\beta)d\beta
βi=1mp(βJi)exp(βJiTΣiβJi2σ2)exp(βJiT𝐗JiT𝐘σ2)dβ\displaystyle\propto\int_{\beta}\prod_{i=1}^{m}p(\beta_{J_{i}})\exp\left(\frac{-\beta_{J_{i}}^{T}\Sigma_{i}\beta_{J_{i}}}{2\sigma^{2}}\right)\exp\left(\frac{\beta_{J_{i}}^{T}\mathbf{X}_{J_{i}}^{T}\mathbf{Y}}{\sigma^{2}}\right)d\beta
βJ1,βJmi=1mp(βJi)exp(βJiTΣiβJi2σ2)exp(βJiT𝐗JiT𝐘σ2)dβJ1dβJ2dβJm.\displaystyle\propto\int_{\beta_{J_{1}}}\dots,\int_{\beta_{J_{m}}}\prod_{i=1}^{m}p(\beta_{J_{i}})\exp\left(\frac{-\beta_{J_{i}}^{T}\Sigma_{i}\beta_{J_{i}}}{2\sigma^{2}}\right)\exp\left(\frac{\beta_{J_{i}}^{T}\mathbf{X}_{J_{i}}^{T}\mathbf{Y}}{\sigma^{2}}\right)d\beta_{J_{1}}\,d\beta_{J_{2}}\dots\,d\beta_{J_{m}}.

At this point, we can iteratively pull out parts of the product. In particular, define the following function:

qi(𝐗Ji)βJip(βJi)exp(βJiTΣiβJi2σ2)exp(βJiT𝐗JiT𝐘σ2)𝑑βJi.q_{i}(\mathbf{X}_{J_{i}})\coloneqq\int_{\beta_{J_{i}}}p(\beta_{J_{i}})\exp\left(\frac{-\beta_{J_{i}}^{T}\Sigma_{i}\beta_{J_{i}}}{2\sigma^{2}}\right)\exp\left(\frac{\beta_{J_{i}}^{T}\mathbf{X}_{J_{i}}^{T}\mathbf{Y}}{\sigma^{2}}\right)d\beta_{J_{i}}.

Since 𝐘,σ2\mathbf{Y},\sigma^{2} and Σi\Sigma_{i} are fixed, qi(𝐗Ji)q_{i}(\mathbf{X}_{J_{i}}) is a deterministic function of 𝐗Ji\mathbf{X}_{J_{i}} that does not depend on βJi\beta_{-J_{i}}. Therefore, we can iteratively integrate as below:

p(𝐗D)\displaystyle p(\mathbf{X}\mid D) βJ1,βJmi=1mp(βJi)exp(βJiTΣiβJi2σ2)exp(βJiT𝐗JiT𝐘σ2)dβJ1dβJ2dβJm\displaystyle\propto\int_{\beta_{J_{1}}}\dots,\int_{\beta_{J_{m}}}\prod_{i=1}^{m}p(\beta_{J_{i}})\exp\left(\frac{-\beta_{J_{i}}^{T}\Sigma_{i}\beta_{J_{i}}}{2\sigma^{2}}\right)\exp\left(\frac{\beta_{J_{i}}^{T}\mathbf{X}_{J_{i}}^{T}\mathbf{Y}}{\sigma^{2}}\right)d\beta_{J_{1}}\,d\beta_{J_{2}}\dots\,d\beta_{J_{m}}
=βJ1βJm1i=1m1p(βJi)exp(βJiTΣiβJi2σ2)exp(βJiT𝐗JiT𝐘σ2)qm(𝐗jm)dβJ1dβJ2dβJm1\displaystyle=\int_{\beta_{J_{1}}}\dots\int_{\beta_{J_{m-1}}}\prod_{i=1}^{m-1}p(\beta_{J_{i}})\exp\left(\frac{-\beta_{J_{i}}^{T}\Sigma_{i}\beta_{J_{i}}}{2\sigma^{2}}\right)\exp\left(\frac{\beta_{J_{i}}^{T}\mathbf{X}_{J_{i}}^{T}\mathbf{Y}}{\sigma^{2}}\right)q_{m}(\mathbf{X}_{j_{m}})d\beta_{J_{1}}\,d\beta_{J_{2}}\dots\,d\beta_{J_{m-1}}
=i=1mqi(𝐗Ji).\displaystyle=\prod_{i=1}^{m}q_{i}(\mathbf{X}_{J_{i}}).

This shows that 𝐗J1,,𝐗JmD\mathbf{X}_{J_{1}},\dots,\mathbf{X}_{J_{m}}\mid D are jointly (conditionally) independent since their density factors, thus completing the proof. For fixed-X knockoffs, the argument is analogous: one can show that the density of 𝐗T𝐘D\mathbf{X}^{T}\mathbf{Y}\mid D factors into blocks. ∎

B.9 Intuition for the local dependency condition and Figure 6

In Figure 6, we see that even when 𝐗\mathbf{X} is very highly correlated, CovPπ(sign(MLRπ)D)\operatorname{Cov}_{P^{\pi}}(\operatorname*{sign}(\mathrm{MLR}^{\pi})\mid D) is close to a diagonal matrix, indicating that the local dependency condition (3.8) holds well empirically. This result is striking and may be surprising at first; this section offers some intuition for why it holds.

For the sake of intuition, suppose that we are using model-X knockoffs and the Bayesian model PπP^{\pi} from Example 1 with the original features. Suppose we observe that the masked data DD is equal to some fixed value d=(𝐲,{𝐱j,𝐱~j}j=1p)d=(\mathbf{y},\{\mathbf{x}_{j},\widetilde{\mathbf{x}}_{j}\}_{j=1}^{p}). After observing D=dD=d, Appendix E shows how to sample from the posterior distribution 𝐗D=d\mathbf{X}\mid D=d via the following Gibbs sampling approach (a schematic code sketch follows the list below):

  • For each j[p]j\in[p], initialize βj(0)\beta_{j}^{(0)} and 𝐗j(0)\mathbf{X}_{j}^{(0)} to some value.

  • For i=1,,nsamplei=1,\dots,n_{\mathrm{sample}}:

    1. 1.

      Set β(i)=β(i1)\beta^{(i)}=\beta^{(i-1)} and 𝐗(i)=𝐗(i1)\mathbf{X}^{(i)}=\mathbf{X}^{(i-1)}.

    2. 2.

      For j[p]j\in[p]:

      1. (a)

        Resample 𝐗j(i)\mathbf{X}_{j}^{(i)} from the law of 𝐗j𝐗j=𝐗j(i),βj=βj(i),D=d\mathbf{X}_{j}\mid\mathbf{X}_{-j}=\mathbf{X}_{-j}^{(i)},\beta_{-j}=\beta_{-j}^{(i)},D=d. It may be helpful to recall 𝐗j(i){𝐱j,𝐱~j}\mathbf{X}_{j}^{(i)}\in\{\mathbf{x}_{j},\widetilde{\mathbf{x}}_{j}\} holds deterministically.

      2. (b)

        Resample βj(i)\beta_{j}^{(i)} from the law of βj𝐗=𝐗(i),βj=βj(i),D=d\beta_{j}\mid\mathbf{X}=\mathbf{X}^{(i)},\beta_{-j}=\beta_{-j}^{(i)},D=d.

  • Return samples 𝐗(1),,𝐗(nsample)\mathbf{X}^{(1)},\dots,\mathbf{X}^{(n_{\mathrm{sample}})}.
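
To make the structure of this sampler concrete, the Python sketch below implements the loop above in a simplified setting: a Gaussian linear model in which each coefficient has an independent Gaussian prior with known prior variance and known noise variance (rather than the spike-and-slab prior of Appendix E.2), with no burn-in or convergence diagnostics. The conjugate updates are standard Gaussian calculations written out only for illustration, and all function and variable names are ours.

import numpy as np

def gibbs_sketch(y, x, xk, sigma2=1.0, tau2=1.0, n_sample=500, seed=0):
    # Schematic Gibbs sampler for X | D = d under a plain Gaussian prior beta_j ~ N(0, tau2)
    # with known sigma2 (a simplification of the spike-and-slab prior in Appendix E.2).
    # x[:, j] and xk[:, j] store the observed pair {x_j, x_tilde_j}; the sampler infers
    # which of the two is the real feature.
    rng = np.random.default_rng(seed)
    n, p = x.shape
    X = x.copy()                                  # current guess for the true design matrix
    beta = np.zeros(p)
    counts = np.zeros(p)                          # iterations in which X_j^{(i)} = x_j
    for _ in range(n_sample):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]  # residual excluding feature j
            def score(v):
                # log N(r; 0, sigma2*I + tau2*v v^T), dropping terms common to both candidates
                return (tau2 * (v @ r) ** 2 / (2 * sigma2 * (sigma2 + tau2 * (v @ v)))
                        - 0.5 * np.log1p(tau2 * (v @ v) / sigma2))
            # Step 2(a): resample X_j in {x_j, x_tilde_j} with beta_j integrated out;
            # the prior over the two candidates is uniform by pairwise exchangeability.
            d = score(x[:, j]) - score(xk[:, j])
            p_real = np.exp(d - np.logaddexp(0.0, d))       # sigmoid(d), computed stably
            take_real = rng.random() < p_real
            X[:, j] = x[:, j] if take_real else xk[:, j]
            counts[j] += take_real
            # Step 2(b): resample beta_j given X, beta_{-j}, D (conjugate Gaussian update).
            s2 = 1.0 / ((X[:, j] @ X[:, j]) / sigma2 + 1.0 / tau2)
            beta[j] = rng.normal(s2 * (X[:, j] @ r) / sigma2, np.sqrt(s2))
    return counts / n_sample                      # Monte Carlo estimates of P(X_j = x_j | D)

# Hypothetical toy call; the "knockoff" columns are just independent noise for illustration.
rng = np.random.default_rng(1)
n, p = 200, 5
x_obs, x_knock = rng.normal(size=(n, p)), rng.normal(size=(n, p))
y_obs = x_obs @ np.array([2.0, 0.0, -1.5, 0.0, 1.0]) + rng.normal(size=n)
print(gibbs_sketch(y_obs, x_obs, x_knock, n_sample=200).round(2))

In the notation above, the returned estimate of Pπ(Xj=xj∣D) exceeding 1/2 corresponds to MLRjπ>0.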

Now, recall that MLRjπ>0\mathrm{MLR}_{j}^{\pi}>0 if and only if Pπ(𝐗j=𝐱jD)>1/2P^{\pi}(\mathbf{X}_{j}=\mathbf{x}_{j}\mid D)>1/2, i.e., if and only if the Gibbs sampler chooses 𝐗j(i)=𝐱j\mathbf{X}_{j}^{(i)}=\mathbf{x}_{j} rather than 𝐗j(i)=𝐱~j\mathbf{X}_{j}^{(i)}=\widetilde{\mathbf{x}}_{j} in the majority of iterations (in the limit). Thus, to analyze CovPπ(𝕀(MLRjπ>0),𝕀(MLRkπ>0)D)\operatorname{Cov}_{P^{\pi}}(\mathbb{I}(\mathrm{MLR}_{j}^{\pi}>0),\mathbb{I}(\mathrm{MLR}_{k}^{\pi}>0)\mid D) for some fixed jkj\neq k, we must ask the following question: does the value of 𝐗k(i){𝐱k,𝐱~k}\mathbf{X}_{k}^{(i)}\in\{\mathbf{x}_{k},\widetilde{\mathbf{x}}_{k}\} strongly affect how we resample 𝐗j(i)\mathbf{X}_{j}^{(i)}?

To answer this question, the following key fact (reviewed in Appendix E) is useful. At iteration ii, step 2(a), for any jj, let rj=𝐲𝐗j(i)βj(i)r_{j}=\mathbf{y}-\mathbf{X}_{-j}^{(i)}\beta_{-j}^{(i)} be the residuals excluding feature jj for the current value of 𝐗j\mathbf{X}_{-j} and βj\beta_{-j} in the Gibbs sampler. Then, standard calculations show that Pπ(𝐗j=𝐱jD=d,𝐗j=𝐗j(i),βj=βj(i))P^{\pi}(\mathbf{X}_{j}=\mathbf{x}_{j}\mid D=d,\mathbf{X}_{-j}=\mathbf{X}_{-j}^{(i)},\beta_{-j}=\beta_{-j}^{(i)}) only depends on 𝐗j(i)\mathbf{X}_{-j}^{(i)} and βj(i)\beta_{-j}^{(i)} through the following inner products:

αj𝐱jTrj=𝐱jT𝐲j𝐱jT𝐗(i)β(i)\alpha_{j}\coloneqq\mathbf{x}_{j}^{T}r_{j}=\mathbf{x}_{j}^{T}\mathbf{y}-\sum_{\ell\neq j}\mathbf{x}_{j}^{T}\mathbf{X}_{\ell}^{(i)}\beta_{\ell}^{(i)}
α~j𝐱~jTrj=𝐱~jT𝐲j𝐱~jT𝐗(i)β(i).\tilde{\alpha}_{j}\coloneqq\widetilde{\mathbf{x}}_{j}^{T}r_{j}=\widetilde{\mathbf{x}}_{j}^{T}\mathbf{y}-\sum_{\ell\neq j}\widetilde{\mathbf{x}}_{j}^{T}\mathbf{X}_{\ell}^{(i)}\beta_{\ell}^{(i)}.

Thus, the question we must answer is: how does the choice of 𝐗k(i){𝐱k,𝐱~k}\mathbf{X}_{k}^{(i)}\in\{\mathbf{x}_{k},\widetilde{\mathbf{x}}_{k}\} affect the value of αj,α~j\alpha_{j},\tilde{\alpha}_{j}? Heuristically, the answer is “not very much,” since 𝐗k(i)\mathbf{X}_{k}^{(i)} only appears above through inner products of the form 𝐱jT𝐗k(i)\mathbf{x}_{j}^{T}\mathbf{X}_{k}^{(i)} and 𝐱~jT𝐗k(i)\widetilde{\mathbf{x}}_{j}^{T}\mathbf{X}_{k}^{(i)}, and by definition of the knockoffs we know that 𝐱jT𝐱k𝐱jT𝐱~k\mathbf{x}_{j}^{T}\mathbf{x}_{k}\approx\mathbf{x}_{j}^{T}\widetilde{\mathbf{x}}_{k} and 𝐱~jT𝐱k𝐱~jT𝐱~k\widetilde{\mathbf{x}}_{j}^{T}\mathbf{x}_{k}\approx\widetilde{\mathbf{x}}_{j}^{T}\widetilde{\mathbf{x}}_{k}. Indeed, for fixed-X knockoffs, we know that this actually holds exactly, and for model-X knockoffs, the law of large numbers should ensure that these approximations are very accurate.

The main way that the choice of 𝐗k(i)\mathbf{X}_{k}^{(i)} can significantly influence the choice of 𝐗j(i)\mathbf{X}_{j}^{(i)} is that the choice of 𝐗k(i)\mathbf{X}_{k}^{(i)} may change the value of βk(i)\beta_{k}^{(i)}. In general, we expect this effect to be rather small, since in many highly-correlated settings, 𝐱k\mathbf{x}_{k} and 𝐱~k\widetilde{\mathbf{x}}_{k} are necessarily highly correlated and thus the choice of 𝐗k(i){𝐱k,𝐱~k}\mathbf{X}_{k}^{(i)}\in\{\mathbf{x}_{k},\widetilde{\mathbf{x}}_{k}\} should not affect the choice of βk(i)\beta_{k}^{(i)} too much. That said, there are a few known pathological settings where the choice of 𝐗k(i){𝐱k,𝐱~k}\mathbf{X}_{k}^{(i)}\in\{\mathbf{x}_{k},\widetilde{\mathbf{x}}_{k}\} does substantially change the estimated value of βk\beta_{k} (Chen et al., (2019); Spector and Janson, (2022)), and in these settings, the coordinates of sign(MLRπ)\operatorname*{sign}(\mathrm{MLR}^{\pi}) may be strongly conditionally dependent. The good news is that using MVR knockoffs instead of SDP knockoffs should ameliorate this problem (see Spector and Janson, (2022)).

Overall, we recognize that this explanation is purely heuristic and does not fully explain the results in Figure 6, but it may provide some intuitive insight. A more rigorous theoretical analysis of CovPπ(𝕀(MLRπ>0)D)\operatorname{Cov}_{P^{\pi}}(\mathbb{I}(\mathrm{MLR}^{\pi}>0)\mid D) would be interesting; we leave this to future work.

Appendix C Technical proofs

C.1 Key concentration results

The proof of Theorem 3.2 relies on the fact that the successive averages of the vector 𝕀(sorted(W)>0)p\mathbb{I}(\mathrm{sorted}(W)>0)\in\mathbb{R}^{p} converge uniformly to their conditional expectation given the masked data D(n)D^{(n)}. In this section, we give a brief proof of this result, which is essentially an application of Theorem 1 from Doukhan and Neumann, (2007). For convenience, we first restate a special case of this theorem (namely, the case where the random variables in question are bounded and we have bounds on pairwise correlations) before proving the corollary we use in Theorem 3.2.

Theorem C.1 (Doukhan and Neumann, (2007)).

Suppose that X1,,XnX_{1},\dots,X_{n} are mean-zero random variables taking values in [1,1][-1,1] such that Var(i=1nXi)C0n\operatorname{Var}\left(\sum_{i=1}^{n}X_{i}\right)\leq C_{0}n for a constant C0>0C_{0}>0. Let L1,L2<L_{1},L_{2}<\infty be constants such that for any iji\leq j,

|Cov(Xi,Xj)|4φ(ji)|\operatorname{Cov}(X_{i},X_{j})|\leq 4\varphi(j-i)

where {φ(k)}k\{\varphi(k)\}_{k\in\mathbb{N}} is a nonincreasing sequence satisfying

s=0(s+1)kφ(s)L1L2kk! for all k0.\sum_{s=0}^{\infty}(s+1)^{k}\varphi(s)\leq L_{1}L_{2}^{k}k!\text{ for all }k\geq 0.

Then for all t(0,1)t\in(0,1), there exists a universal constant C1>0C_{1}>0 only depending on C0,L1C_{0},L_{1} and L2L_{2} such that

(X¯nt)exp(t2n2C0n+C1t7/4n7/4)exp(Ct2n1/4),\mathbb{P}\left(\bar{X}_{n}\geq t\right)\leq\exp\left(-\frac{t^{2}n^{2}}{C_{0}n+C_{1}t^{7/4}n^{7/4}}\right)\leq\exp\left(-C^{\prime}t^{2}n^{1/4}\right),

where CC^{\prime} is a universal constant only depending on C0,L1,L2C_{0},L_{1},L_{2}.

If we take φ(s)=cρs\varphi(s)=c\rho^{s}, this yields the following corollary.

Corollary C.1.

Suppose that X1,,XnX_{1},\dots,X_{n} are mean-zero random variables taking values in [1,1][-1,1]. Suppose that for some C0,ρ(0,1)C\geq 0,\rho\in(0,1), the sequence satisfies

|Cov(Xi,Xj)|Cρ|ij|.|\operatorname{Cov}(X_{i},X_{j})|\leq C\rho^{|i-j|}. (C.1)

Then there exists a universal constant CC^{\prime} depending only on CC and ρ\rho such that

(X¯nt)exp(Ct2n1/4).\mathbb{P}(\bar{X}_{n}\geq t)\leq\exp\left(-C^{\prime}t^{2}n^{1/4}\right). (C.2)

Furthermore, let π:[n][n]\pi:[n]\to[n] be any permutation. For knk\leq n, define X¯k(π)1ki=1kXπ(i)\bar{X}_{k}^{(\pi)}\coloneqq\frac{1}{k}\sum_{i=1}^{k}X_{\pi(i)} to be the sample mean of the first kk random variables after permuting (X1,,Xn)(X_{1},\dots,X_{n}) according to π\pi. Then for any n0,t0n_{0}\in\mathbb{N},t\geq 0,

supπSn(maxn0kn|X¯k(π)|t)nexp(Ct2n01/4).\sup_{\pi\in S_{n}}\mathbb{P}\left(\max_{n_{0}\leq k\leq n}|\bar{X}_{k}^{(\pi)}|\geq t\right)\leq n\exp(-C^{\prime}t^{2}n_{0}^{1/4}). (C.3)

where SnS_{n} is the symmetric group.

Proof.

The proof of Equation (C.2) follows an observation of Doukhan and Neumann, (2007), where we note φ(s)=Cexp(as)\varphi(s)=C\exp(-as) for a=log(ρ)a=-\log(\rho). Then

s=0(s+1)kexp(as)s=0i=1k(s+i)exp(as)=dkdpk(11p)|p=exp(a)=k!1(1exp(a))k+1.\sum_{s=0}^{\infty}(s+1)^{k}\exp(-as)\leq\sum_{s=0}^{\infty}\prod_{i=1}^{k}(s+i)\exp(-as)=\frac{d^{k}}{dp^{k}}\left(\frac{1}{1-p}\right)\bigg{|}_{p=\exp(-a)}=k!\frac{1}{(1-\exp(-a))^{k+1}}.

As a result, s=0(s+1)kφ(s)C1exp(a)(11exp(a))kk!\sum_{s=0}^{\infty}(s+1)^{k}\varphi(s)\leq\frac{C}{1-\exp(-a)}\left(\frac{1}{1-\exp(-a)}\right)^{k}k!, so we take L1=C1exp(a)L_{1}=\frac{C}{1-\exp(-a)} and L2=11exp(a)L_{2}=\frac{1}{1-\exp(-a)}. Lastly, we observe that another geometric series argument yields

Var(i=1nXi)=i=1nj=1nCov(Xi,Xj)i=1nCj=1nρ|ij|nC21ρ.\operatorname{Var}\left(\sum_{i=1}^{n}X_{i}\right)=\sum_{i=1}^{n}\sum_{j=1}^{n}\operatorname{Cov}(X_{i},X_{j})\leq\sum_{i=1}^{n}C\sum_{j=1}^{n}\rho^{|i-j|}\leq nC\frac{2}{1-\rho}.

Thus, we take C0=2C1ρC_{0}=\frac{2C}{1-\rho} and apply Theorem C.1, which yields the first result. To prove Equation (C.3), the main idea is that we can apply Equation (C.2) to each sample mean |X¯k(π)||\bar{X}_{k}^{(\pi)}|, at which point Equation (C.3) follows from a union bound.

To prove this, note that if we rearrange (Xπ(1),,Xπ(k))(X_{\pi(1)},\dots,X_{\pi(k)}) into their “original order,” then these variables satisfy the condition in Equation (C.1). Formally, let A={π(1),,π(k)}A=\{\pi(1),\dots,\pi(k)\} and let ν:AA\nu:A\to A be the permutation such that ν(π(i))>ν(π(j))\nu(\pi(i))>\nu(\pi(j)) if and only if i>ji>j, for i,j[k]i,j\in[k]. Then define Yi=Xν(π(i))Y_{i}=X_{\nu(\pi(i))} for i[k]i\in[k], and note that

|Cov(Yi,Yj)|=|Cov(Xν(π(i)),Xν(π(j)))|Cρ|ν(π(i))ν(π(j))|Cρ|ij|,|\operatorname{Cov}(Y_{i},Y_{j})|=|\operatorname{Cov}(X_{\nu(\pi(i))},X_{\nu(\pi(j))})|\leq C\rho^{|\nu(\pi(i))-\nu(\pi(j))|}\leq C\rho^{|i-j|},

where in the last step, |ij||ν(π(i))ν(π(j))||i-j|\leq|\nu(\pi(i))-\nu(\pi(j))| follows by construction of ν\nu. Since Y¯k=X¯k(π)\bar{Y}_{k}=\bar{X}_{k}^{(\pi)} by construction, this means we may apply Equation (C.2) to X¯k(π)\bar{X}_{k}^{(\pi)} for each kk.

Thus, by Equation (C.2), for any πSn\pi\in S_{n},

(maxn0kn|X¯k(π)|t)k=n0n(|X¯k(π)|t)k=n0nexp(Ct2k1/4)nexp(Ct2n01/4).\mathbb{P}\left(\max_{n_{0}\leq k\leq n}|\bar{X}_{k}^{(\pi)}|\geq t\right)\leq\sum_{k=n_{0}}^{n}\mathbb{P}(|\bar{X}_{k}^{(\pi)}|\geq t)\leq\sum_{k=n_{0}}^{n}\exp(-C^{\prime}t^{2}k^{1/4})\leq n\exp(-C^{\prime}t^{2}n_{0}^{1/4}).

This completes the proof. ∎
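
As an illustration of Equation (C.3) (not used anywhere in the arguments), the Python sketch below simulates mean-zero, bounded random variables whose covariances decay geometrically, namely the signs of a Gaussian AR(1) process, and monitors the maximal partial mean after an arbitrary permutation. The parameter values are arbitrary.

import numpy as np

# Signs of a Gaussian AR(1) process: mean zero, values in {-1, +1}, and covariances
# decaying geometrically in |i - j|, as required by Equation (C.1).
rng = np.random.default_rng(0)
n, n0, phi = 20_000, 500, 0.9
z = np.empty(n)
z[0] = rng.normal()
for i in range(1, n):
    z[i] = phi * z[i - 1] + np.sqrt(1 - phi ** 2) * rng.normal()
x = np.sign(z)
perm = rng.permutation(n)                                 # an arbitrary permutation pi
partial_means = np.cumsum(x[perm]) / np.arange(1, n + 1)  # Xbar_k^{(pi)} for k = 1, ..., n
print(np.abs(partial_means[n0 - 1:]).max())               # max_{n0 <= k <= n} |Xbar_k^{(pi)}|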

C.2 Bounds on the expected number of false discoveries

The proof of Theorem 3.2 relied on the fact that limnΓq(wn)\lim_{n\to\infty}\Gamma_{q}(w_{n}) is finite whenever it exists. This is a consequence of the lemma below, which also proves a second moment bound needed for the uniform integrability argument in Step 3 of the proof of Theorem B.6.

Lemma C.2.

Fix any q(0,1)q\in(0,1). Then there exist universal constants C(q),C(2)(q)C(q),C^{(2)}(q)\in\mathbb{R} such that for any Bayesian model PπP^{\pi} and any valid knockoff statistic W=w([𝐗,𝐗~],𝐘)W=w([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y}) with discovery set S[p]S\subset[p]:

  1. 1.

    Γq(w)C(q)\Gamma_{q}(w)\leq C(q) where C(q)C(q) is a finite constant depending only on qq.

  2. 2.

    𝔼Pπ[|S|2]C(2)(q)𝔼[|1(θ)|2]\mathbb{E}_{P^{\pi}}[|S|^{2}]\leq C^{(2)}(q)\mathbb{E}[|\mathcal{H}_{1}(\theta^{\star})|^{2}].

Note that above, 1(θ)\mathcal{H}_{1}(\theta^{\star}) denotes the random set of non-nulls under PπP^{\pi}.

Proof.

Recall that PπP^{\pi} denotes the joint law of (𝐗,𝐘,θ)(\mathbf{X},\mathbf{Y},\theta^{\star}). Throughout the proof, all expectations and probabilities are taken over PπP^{\pi}. Our strategy is to condition on the nuisance parameters θ\theta^{\star}. In particular, let M(θ)=|1(θ)|M(\theta^{\star})=|\mathcal{H}_{1}(\theta^{\star})| denote the number of non-nulls. To show the first result, it suffices to show

𝔼[|S|θ]C(q)M(θ).\mathbb{E}\left[|S|\mid\theta^{\star}\right]\leq C(q)M(\theta^{\star}). (C.4)

Proving Equation (C.4) proves the first result because it implies by the tower property that 𝔼[|S|]C(q)𝔼[M(θ)]\mathbb{E}[|S|]\leq C(q)\mathbb{E}[M(\theta^{\star})], and therefore Γq(w)=𝔼[|S|]𝔼[M(θ)]C(q)\Gamma_{q}(w)=\frac{\mathbb{E}[|S|]}{\mathbb{E}[M(\theta^{\star})]}\leq C(q). For the second result, by the tower property it also suffices to show that

𝔼[|S|2θ]C(2)(q)M(θ)2.\mathbb{E}[|S|^{2}\mid\theta^{\star}]\leq C^{(2)}(q)M(\theta^{\star})^{2}. (C.5)

The rest of the proof proceeds conditionally on θ\theta^{\star}, so we are essentially in the fully frequentist setting. Thus, for the rest of the proof, we will abbreviate M(θ)M(\theta^{\star}) as MM. We will also assume the “worst-case” values for the non-null coordinates of WW: in particular, let WW^{\prime} denote WW but with all of the non-null coordinates replaced with the value \infty, and let S[p]S^{\prime}\subset[p] be the discovery set made when applying SeqStep to WW^{\prime}. These are the “worst-case” values in the sense that |S||S||S^{\prime}|\geq|S| deterministically (see Spector and Janson, (2022), Lemma B.4), so it suffices to show that 𝔼[|S|]C(q)M\mathbb{E}[|S^{\prime}|]\leq C(q)M and 𝔼[|S|2]C(2)(q)M2\mathbb{E}[|S^{\prime}|^{2}]\leq C^{(2)}(q)M^{2}.

As notation, let U=𝕀(sorted(W)>0)U=\mathbb{I}(\mathrm{sorted}(W^{\prime})>0) denote the signs of WW^{\prime} when sorted in descending order of absolute value. Following the notation in Equation (B.4), let ψ(U)=max{k:kkU¯k+1kU¯kq}\psi(U)=\max\left\{k:\frac{k-k\bar{U}_{k}+1}{k\bar{U}_{k}}\leq q\right\}, where U¯k=1ki=1min(k,p)Ui\bar{U}_{k}=\frac{1}{k}\sum_{i=1}^{\min(k,p)}U_{i}. This ensures that |S|=ψ(U)+11+qψ(U)|S^{\prime}|=\left\lceil\frac{\psi(U)+1}{1+q}\right\rceil\leq\psi(U) is the number of discoveries made by knockoffs (Spector and Janson, (2022), Lemma B.3). To prove the first result, it thus suffices to show 𝔼[ψ(U)]C(q)M\mathbb{E}[\psi(U)]\leq C(q)M. To do this, let K=M+11+qK=\left\lceil\frac{M+1}{1+q}\right\rceil and fix any integer cc\in\mathbb{N} (we will pick a specific value for cc later). Observe that

𝔼[ψ(U)]\displaystyle\mathbb{E}[\psi(U)] cK(ψ(U)cK)+k=cKk(ψ(U)=k)\displaystyle\leq cK\mathbb{P}(\psi(U)\leq cK)+\sum_{k=cK}^{\infty}k\mathbb{P}(\psi(U)=k) (C.6)
cK+k=cKk(Bin(kM,1/2)k+11+qM).\displaystyle\leq cK+\sum_{k=cK}^{\infty}k\mathbb{P}\left(\mathrm{Bin}(k-M,1/2)\geq\left\lceil\frac{k+1}{1+q}\right\rceil-M\right). (C.7)

where the second line follows because (i) the event ψ(U)=k\psi(U)=k implies that at least k+11+q\left\lceil\frac{k+1}{1+q}\right\rceil of the first kk coordinates of UU are positive and (ii) the knockoff flip-sign property guarantees that conditional on θ\theta^{\star}, the null coordinates of UU are i.i.d. random signs conditional on the values of the non-null coordinates of UU. (Without loss of generality, we may assume that the absolute values of WW^{\prime} are nonzero with probability one, since again, this only increases the number of discoveries made by knockoffs.) Thus, doing simple arithmetic, in the first kk coordinates of UU, there are kMk-M null i.i.d. signs, of which at least k+11+qM\left\lceil\frac{k+1}{1+q}\right\rceil-M must be positive, yielding the expression above with the Binomial probability.

We now apply Hoeffding’s inequality. To do so, we must ensure k+11+qM\left\lceil\frac{k+1}{1+q}\right\rceil-M is larger than the mean of a Bin(kM,1/2)\mathrm{Bin}(k-M,1/2) random variable. It turns out that it suffices to pick the value of cc to satisfy c>(11+q12)1c>\left(\frac{1}{1+q}-\frac{1}{2}\right)^{-1}. To see why, fix any kcKk\geq cK, so we may write k=cK+k=cK+\ell for some 0\ell\geq 0. Then for all such kk, we have

k+11+qMkM2\displaystyle\frac{k+1}{1+q}-M-\frac{k-M}{2} k(11+q12)M2\displaystyle\geq k\left(\frac{1}{1+q}-\frac{1}{2}\right)-\frac{M}{2}
=(cK+)(11+q12)M2\displaystyle=(cK+\ell)\left(\frac{1}{1+q}-\frac{1}{2}\right)-\frac{M}{2} since k=cK+\displaystyle\text{ since }k=cK+\ell
2KM2+(11+q12)\displaystyle\geq\frac{2K-M}{2}+\ell\left(\frac{1}{1+q}-\frac{1}{2}\right) since c(11+q12)1\displaystyle\text{ since }c\geq\left(\frac{1}{1+q}-\frac{1}{2}\right)^{-1}
(11+q12)\displaystyle\geq\ell\left(\frac{1}{1+q}-\frac{1}{2}\right) since KM1+qM2 by definition.\displaystyle\text{ since }K\geq\frac{M}{1+q}\geq\frac{M}{2}\text{ by definition.}

Thus, we may apply Hoeffding’s inequality for kcKk\geq cK. Indeed, for any 0\ell\geq 0, the previous result yields that for kcKk\geq cK:

(Bin(kM,1/2)k+11+qM)\displaystyle\mathbb{P}\left(\mathrm{Bin}(k-M,1/2)\geq\left\lceil\frac{k+1}{1+q}\right\rceil-M\right) (Bin(kM,1/2)kM2(kcK)(11+q12))\displaystyle\leq\mathbb{P}\left(\mathrm{Bin}(k-M,1/2)-\frac{k-M}{2}\geq(k-cK)\left(\frac{1}{1+q}-\frac{1}{2}\right)\right)
exp(2(kcK)2(11+q12)2).\displaystyle\leq\exp\left(-2(k-cK)^{2}\left(\frac{1}{1+q}-\frac{1}{2}\right)^{2}\right).

As notation, set αq=11+q12\alpha_{q}=\frac{1}{1+q}-\frac{1}{2}. Combining the previous equation with Eq. (C.7), we obtain

𝔼[ψ(U)]\displaystyle\mathbb{E}[\psi(U)] cK+k=cKkexp(2(kcK)2αq2)\displaystyle\leq cK+\sum_{k=cK}^{\infty}k\exp\left(-2(k-cK)^{2}\alpha_{q}^{2}\right)
=cK+=0(+cK)exp(22αq2)\displaystyle=cK+\sum_{\ell=0}^{\infty}(\ell+cK)\exp\left(-2\ell^{2}\alpha_{q}^{2}\right)
=cK+cK=0exp(22αq2)+=0exp(22αq2).\displaystyle=cK+cK\sum_{\ell=0}^{\infty}\exp(-2\ell^{2}\alpha_{q}^{2})+\sum_{\ell=0}^{\infty}\ell\exp(-2\ell^{2}\alpha_{q}^{2}).

Note that the sums =0exp(22αq2)\sum_{\ell=0}^{\infty}\ell\exp(-2\ell^{2}\alpha_{q}^{2}) and =0exp(22αq2)\sum_{\ell=0}^{\infty}\exp(-2\ell^{2}\alpha_{q}^{2}) are both convergent. As a result, 𝔼[ψ(U)]\mathbb{E}[\psi(U)] is bounded by a constant multiple of cKc1+qMcK\sim\frac{c}{1+q}M, where the constant depends on qq but nothing else. Since ψ(U)|S||S|\psi(U)\geq|S^{\prime}|\geq|S| as previously argued, this completes the proof.

For the second statement, we note that by the same argument as above, we have that

𝔼[ψ(U)2]\displaystyle\mathbb{E}[\psi(U)^{2}] (cK)2+k=cKk2(Bin(kM,1/2)k+11+qM)\displaystyle\leq(cK)^{2}+\sum_{k=cK}^{\infty}k^{2}\mathbb{P}\left(\mathrm{Bin}(k-M,1/2)\geq\left\lceil\frac{k+1}{1+q}\right\rceil-M\right)
(cK)2+=0(+cK)2exp(22αq2)\displaystyle\leq(cK)^{2}+\sum_{\ell=0}^{\infty}(\ell+cK)^{2}\exp\left(-2\ell^{2}\alpha_{q}^{2}\right)
=(cK)2(1+=0exp(22αq2))+cK=0exp(22αq2)+=02exp(22αq2).\displaystyle=(cK)^{2}\left(1+\sum_{\ell=0}^{\infty}\exp\left(-2\ell^{2}\alpha_{q}^{2}\right)\right)+cK\sum_{\ell=0}^{\infty}\ell\exp\left(-2\ell^{2}\alpha_{q}^{2}\right)+\sum_{\ell=0}^{\infty}\ell^{2}\exp\left(-2\ell^{2}\alpha_{q}^{2}\right).

Once again, the three series above are convergent and the value they converge to depends only on qq. Since cKMcK\sim M asymptotically, this implies that there exists some constant C(2)(q)C^{(2)}(q) such that 𝔼[ψ(U)2]C(2)(q)M2\mathbb{E}[\psi(U)^{2}]\leq C^{(2)}(q)M^{2}. This completes the proof since ψ(U)2|S|2\psi(U)^{2}\geq|S|^{2}. ∎
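
The first bound of Lemma C.2 can also be checked numerically in the worst case used in the proof, where the MM non-null coordinates of WW^{\prime} are sent to ++\infty (so their signs appear first and are positive) and the remaining signs are i.i.d. fair coin flips. The Python sketch below uses arbitrary toy parameters and is illustrative only.

import numpy as np

def psi(u, q):
    # psi(U) = max{k : (k - k*Ubar_k + 1) / (k*Ubar_k) <= q}, with the max over an empty set taken as 0.
    csum = np.cumsum(np.asarray(u, dtype=float))
    k = np.arange(1, len(u) + 1)
    feasible = (k - csum + 1) <= q * csum
    return int(k[feasible].max()) if feasible.any() else 0

rng = np.random.default_rng(0)
p, M, q, reps = 500, 20, 0.2, 2000
vals = []
for _ in range(reps):
    nulls = rng.integers(0, 2, size=p - M)       # i.i.d. null signs (1 = positive)
    u = np.concatenate([np.ones(M), nulls])      # worst-case sorted sign vector
    vals.append(psi(u, q))
print(np.mean(vals) / M)                         # estimate of E[psi(U)] / M, bounded by a constant C(q)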

Appendix D Additional comparison to prior work

D.1 Comparison to the unmasked likelihood ratio

In this section, we compare MLR statistics to the earlier unmasked likelihood statistic introduced by Katsevich and Ramdas, (2020), which this work builds upon. The upshot is that unmasked likelihood statistics give the most powerful “binary pp-values,” as shown by Katsevich and Ramdas, (2020), but do not yield jointly valid knockoff feature statistics in the sense required for the FDR control proof in Barber and Candès, (2015) and Candès et al., (2018).

In particular, we call a statistic Tj([𝐗,𝐗~],𝐘)T_{j}([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y}) a marginally symmetric knockoff statistic if TjT_{j} satisfies Tj([𝐗,𝐗~]swap(j),𝐘)=Tj([𝐗,𝐗~],𝐘)T_{j}([\mathbf{X},\widetilde{\mathbf{X}}]_{{\text{swap}(j)}},\mathbf{Y})=-T_{j}([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y}). Under the null, TjT_{j} is marginally symmetric, so the quantity kj=12+12𝕀(Tj0)k_{j}=\frac{1}{2}+\frac{1}{2}\mathbb{I}(T_{j}\leq 0) is a valid “binary pp-value” which only takes values in {1/2,1}\{1/2,1\}. Theorem 5 of Katsevich and Ramdas, (2020) shows that for any marginally symmetric knockoff statistic, P(kj=1/2)=P(Tj>0)P^{\star}(k_{j}=1/2)=P^{\star}(T_{j}>0) is maximized if Tj>0p𝐘𝐗(𝐘[𝐗j,𝐗j])>p𝐘𝐗(𝐘[𝐗~j,𝐗j])T_{j}>0\Leftrightarrow p^{\star}_{\mathbf{Y}\mid\mathbf{X}}(\mathbf{Y}\mid[\mathbf{X}_{j},\mathbf{X}_{-j}])>p^{\star}_{\mathbf{Y}\mid\mathbf{X}}(\mathbf{Y}\mid[\widetilde{\mathbf{X}}_{j},\mathbf{X}_{-j}]), where p𝐘𝐗p^{\star}_{\mathbf{Y}\mid\mathbf{X}} denotes the density of 𝐘𝐗\mathbf{Y}\mid\mathbf{X} under the true law PP^{\star} of the data. As such, one might initially hope to use the unmasked likelihood ratio as a knockoff statistic:

Wjunmasked=log(p𝐘𝐗(𝐘[𝐗j,𝐗j])p𝐘𝐗(𝐘[𝐗~j,𝐗j])).W_{j}^{\mathrm{unmasked}}=\log\left(\frac{p^{\star}_{\mathbf{Y}\mid\mathbf{X}}(\mathbf{Y}\mid[\mathbf{X}_{j},\mathbf{X}_{-j}])}{p^{\star}_{\mathbf{Y}\mid\mathbf{X}}(\mathbf{Y}\mid[\widetilde{\mathbf{X}}_{j},\mathbf{X}_{-j}])}\right).

However, a marginally symmetric knockoff statistic is not necessarily a valid knockoff feature statistic, which must satisfy the following stronger property (Barber and Candès,, 2015; Candès et al.,, 2018):

Wj([𝐗,𝐗~]swap(J),𝐘)={Wj([𝐗,𝐗~],𝐘)jJWj([𝐗,𝐗~],𝐘)jJ,W_{j}([\mathbf{X},\widetilde{\mathbf{X}}]_{{\text{swap}(J)}},\mathbf{Y})=\begin{cases}W_{j}([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y})&j\not\in J\\ -W_{j}([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y})&j\in J,\end{cases}

for any J[p]J\subset[p]. This flip-sign property guarantees that the signs of the null coordinates of WW are jointly i.i.d. and symmetric. However, the unmasked likelihood statistic does not satisfy this property, as changing the observed value of 𝐗i\mathbf{X}_{i} for iji\neq j will typically change the value of the likelihood p𝐘𝐗(𝐘[𝐗j,𝐗j])p^{\star}_{\mathbf{Y}\mid\mathbf{X}}(\mathbf{Y}\mid[\mathbf{X}_{j},\mathbf{X}_{-j}]).
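
To illustrate this last point numerically, the Python sketch below evaluates the unmasked likelihood ratio in a toy Gaussian model with known parameters and checks that swapping a different feature iji\neq j with its knockoff changes WjunmaskedW_{j}^{\mathrm{unmasked}}. The “knockoff” columns here are just independent Gaussian draws used to illustrate the functional form, not valid knockoffs, and all names are ours.

import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 5, 1.0
beta = np.array([2.0, -1.5, 0.0, 0.0, 1.0])       # "true" coefficients, assumed known for illustration
X = rng.normal(size=(n, p))
Xk = rng.normal(size=(n, p))                      # stand-in columns playing the role of knockoffs
y = X @ beta + sigma * rng.normal(size=n)

def loglik(design):
    # Gaussian log-likelihood of y given the design, up to an additive constant.
    return -0.5 * np.sum((y - design @ beta) ** 2) / sigma ** 2

def W_unmasked(X, Xk, j):
    # log p(y | [X_j, X_{-j}]) - log p(y | [Xk_j, X_{-j}])
    X_swapped = X.copy()
    X_swapped[:, j] = Xk[:, j]
    return loglik(X) - loglik(X_swapped)

j, i = 0, 1
w_before = W_unmasked(X, Xk, j)
X2, Xk2 = X.copy(), Xk.copy()
X2[:, i] = Xk[:, i]                               # swap feature i with its knockoff
Xk2[:, i] = X[:, i]
w_after = W_unmasked(X2, Xk2, j)
print(w_before, w_after)                          # generally unequal: the flip-sign property fails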

D.2 Comparison to the adaptive knockoff filter

In this section, we compare our methodological contribution, MLR statistics, to the adaptive knockoff filter described in Ren and Candès, (2020), namely their approach based on Bayesian modeling. The main point is that although MLR statistics and the procedure from Ren and Candès, (2020) have some intuitive similarities, the procedures are different and in fact complementary, since one could use the Bayesian adaptive knockoff filter from Ren and Candès, (2020) in combination with MLR statistics.

As review, recall from Section 3.2 that valid knockoff feature statistics WW as initially defined by Barber and Candès, (2015); Candès et al., (2018) must ensure that |W||W| is a function of the masked data DD, and thus |W||W| cannot explicitly depend on sign(W)\operatorname*{sign}(W). (It is also important to remember that |W||W| determines the order and “prioritization” of the SeqStep hypothesis testing procedure.) The key innovation of Ren and Candès, (2020) is to relax this restriction: in particular, they define a procedure where the analyst sequentially reveals the coordinates of sign(W)\operatorname*{sign}(W) in reverse order of their prioritization, and after each sign is revealed, the analyst may arbitrarily reorder the remaining hypotheses. The advantage of this approach is that revealing the sign of (e.g.) W1W_{1} may yield information that can be used to more accurately prioritize the hypotheses while still guaranteeing provable FDR control.

This raises the question: how should the analyst reorder the hypotheses after each coordinate of sign(W)\operatorname*{sign}(W) is revealed? One proposal from Ren and Candès, (2020) is to introduce an auxiliary Bayesian model for the relationship between sign(W)\operatorname*{sign}(W) and |Wj||W_{j}| (the authors also discuss the use of additional side information, although for brevity we do not discuss this here). For example, Ren and Candès, (2020) suggest using a two-groups model where

HjindBern(kj) and WjHj{𝒫1(Wj)Hj=1𝒫0(Wj)Hj=0.H_{j}\stackrel{{\scriptstyle\mathrm{ind}}}{{\sim}}\mathrm{Bern}(k_{j})\text{ and }W_{j}\mid H_{j}\sim\begin{cases}\mathcal{P}_{1}(W_{j})&H_{j}=1\\ \mathcal{P}_{0}(W_{j})&H_{j}=0.\end{cases} (D.1)

Above, HjH_{j} is the indicator of whether the jjth hypothesis is non-null, and 𝒫1\mathcal{P}_{1} and 𝒫0\mathcal{P}_{0} are (e.g.) unknown parametric distributions that the analyst fits as they observe sign(W)\operatorname*{sign}(W). With this notation, the proposal from Ren and Candès, (2020) can be roughly summarized as follows:

  1. 1.

    Fit an initial feature statistic WW, such as an LCD statistic, and observe |W||W|.

  2. 2.

    Fit an initial version of the model in Equation (D.1) and use it to compute γj(Wj>0,Hj=1|Wj|)\gamma_{j}\coloneqq\mathbb{P}(W_{j}>0,H_{j}=1\mid|W_{j}|).

  3. 3.

    Observe sign(Wj)\operatorname*{sign}(W_{j}) for j=argminj{γj:sign(Wj) has not yet been observed }j=\operatorname*{arg\,min}_{j}\{\gamma_{j}:\operatorname*{sign}(W_{j})\text{ has not yet been observed }\}.

  4. 4.

    Using sign(Wj)\operatorname*{sign}(W_{j}), update the model in Equation (D.1), update {γj}j=1p\{\gamma_{j}\}_{j=1}^{p}, and return to Step 3.

  5. 5.

    Terminate when all of sign(W)\operatorname*{sign}(W) has been revealed, at which point sign(W)\operatorname*{sign}(W) is passed to SeqStep in the reverse of the order that the signs were revealed.

Note that in Step 3, the reason Ren and Candès, (2020) choose jj to be the index minimizing (Wj>0,Hj=1|Wj|)\mathbb{P}(W_{j}>0,H_{j}=1\mid|W_{j}|) is that in Step 5, SeqStep processes sign(W)\operatorname*{sign}(W) in the reverse of the order in which the analyst revealed it. Thus, the analyst should reveal the least promising hypotheses first so that SeqStep encounters the most promising hypotheses first.
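
To fix ideas, the Python sketch below mirrors the bookkeeping of Steps 1 through 5. It is schematic only: the update_gamma argument stands in for the two-groups model fit of Ren and Candès, (2020), and the toy update rule shown is a placeholder of our own rather than their implementation.

import numpy as np

def adaptive_reveal_order(W, gamma_init, update_gamma):
    # Schematic version of Steps 1-5: repeatedly reveal sign(W_j) for the currently
    # least promising hypothesis, update the prioritization, and record the order.
    p = len(W)
    unrevealed = set(range(p))
    reveal_order = []
    gamma = np.array(gamma_init, dtype=float)     # estimates of P(W_j > 0, H_j = 1 | |W_j|)
    while unrevealed:
        j = min(unrevealed, key=lambda k: gamma[k])                # Step 3: smallest gamma_j first
        reveal_order.append(j)
        unrevealed.remove(j)
        gamma = update_gamma(gamma, j, np.sign(W[j]), np.abs(W))   # Step 4: refit the working model
    return reveal_order[::-1]   # Step 5: SeqStep processes hypotheses in reverse reveal order

def toy_update(gamma, j, sign_j, absW):
    # Placeholder update (ours, purely illustrative): nudge gamma toward the revealed
    # sign for hypotheses with similar |W|.
    weights = np.exp(-np.abs(absW - absW[j]))
    return 0.9 * gamma + 0.1 * weights * float(sign_j > 0)

W = np.array([3.1, -0.2, 2.4, 0.5, -1.7, 4.0])                     # hypothetical feature statistics
print(adaptive_reveal_order(W, gamma_init=np.abs(W), update_gamma=toy_update))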

The main similarity between this procedure and MLR statistics is that both procedures, roughly speaking, attempt to prioritize the hypotheses according to (Wj>0)\mathbb{P}(W_{j}>0), although we condition on the full masked data to maximize power. That said, there are two important differences. First, we and Ren and Candès, (2020) both use an auxiliary Bayesian model—however, we take probabilities over a Bayesian model of the full dataset, whereas Ren and Candès, (2020) only fit a working model of the law of WW. Using the full data as opposed to only the statistics WW should lead to much higher power—for example, if WW are poor feature statistics which do not contain much relevant information about the dataset, the procedure from Ren and Candès, (2020) will have low power. Thus, despite their initial similarity, these procedures are quite different.

The second and more important difference is that the procedure above is not a feature statistic. Rather, it is an extension of SeqStep that wraps on top of any initial feature statistic. This “adaptive knockoffs” procedure augments the power of any feature statistic, although if the initial feature statistic WW has many negative signs to begin with or its absolute values |W||W| are truly uninformative of its signs, the procedure may still be powerless. Since MLR statistics have provable optimality guarantees—namely, they maximize Pπ(Wj>0D)P^{\pi}(W_{j}>0\mid D) and make |Wj||W_{j}| a monotone function of Pπ(Wj>0D)P^{\pi}(W_{j}>0\mid D)—one might expect that using MLR statistics in place of a lasso statistic could improve the power of the adaptive knockoff filter. Similarly, using the adaptive knockoff filter in combination with MLR statistics could be more powerful than using MLR statistics alone.

Appendix E Gibbs sampling for MLR statistics

E.1 Proof of Eq. (4.1)

Lemma E.1.

Fix any constants 𝐱,𝐱~n×p,𝐲n\mathbf{x},\widetilde{\mathbf{x}}\in\mathbb{R}^{n\times p},\mathbf{y}\in\mathbb{R}^{n}, and θΘ\theta\in\Theta, and define 𝐝=(𝐲,{𝐱j,𝐱~j}j=1p)\mathbf{d}=(\mathbf{y},\{\mathbf{x}_{j},\widetilde{\mathbf{x}}_{j}\}_{j=1}^{p}). Then

Pπ(𝐗j=𝐱j𝐗j=𝐱j,θ=θ,D=𝐝)Pπ(𝐗j=𝐱~j𝐗j=𝐱j,θ=θ,D=𝐝)\displaystyle\frac{P^{\pi}(\mathbf{X}_{j}=\mathbf{x}_{j}\mid\mathbf{X}_{-j}=\mathbf{x}_{-j},\theta^{\star}=\theta,D=\mathbf{d})}{P^{\pi}(\mathbf{X}_{j}=\widetilde{\mathbf{x}}_{j}\mid\mathbf{X}_{-j}=\mathbf{x}_{-j},\theta^{\star}=\theta,D=\mathbf{d})} =P𝐘𝐗(θ)(𝐲𝐗j=𝐱j,𝐗j=𝐱j)P𝐘𝐗(θ)(𝐲𝐗j=𝐱~j,𝐗j=𝐱j).\displaystyle=\frac{P_{\mathbf{Y}\mid\mathbf{X}}^{(\theta)}(\mathbf{y}\mid\mathbf{X}_{j}=\mathbf{x}_{j},\mathbf{X}_{-j}=\mathbf{x}_{-j})}{P_{\mathbf{Y}\mid\mathbf{X}}^{(\theta)}(\mathbf{y}\mid\mathbf{X}_{j}=\widetilde{\mathbf{x}}_{j},\mathbf{X}_{-j}=\mathbf{x}_{-j})}.

as long as (𝐱,𝐱~,𝐲,θ)(\mathbf{x},\widetilde{\mathbf{x}},\mathbf{y},\theta) lies in the support of (𝐗,𝐗~,𝐘,θ)(\mathbf{X},\widetilde{\mathbf{X}},\mathbf{Y},\theta^{\star}) under PπP^{\pi}.

Proof.

Throughout this proof, we abuse notation and let probabilities of the form (e.g.) Pπ(𝐘=𝐲,𝐗=𝐱)P^{\pi}(\mathbf{Y}=\mathbf{y},\mathbf{X}=\mathbf{x}) denote the density of this event with respect to the base measure of PπP^{\pi}. The definition of conditional probability yields

Pπ(𝐗j=𝐱j𝐗j=𝐱j,θ=θ,D=𝐝)Pπ(𝐗j=𝐱~j𝐗j=𝐱j,θ=θ,D=𝐝)=Pπ(𝐗j=𝐱j,𝐗j=𝐱j,θ=θ,D=𝐝)Pπ(𝐗j=𝐱~j,𝐗j=𝐱j,θ=θ,D=𝐝).\displaystyle\frac{P^{\pi}(\mathbf{X}_{j}=\mathbf{x}_{j}\mid\mathbf{X}_{-j}=\mathbf{x}_{-j},\theta^{\star}=\theta,D=\mathbf{d})}{P^{\pi}(\mathbf{X}_{j}=\widetilde{\mathbf{x}}_{j}\mid\mathbf{X}_{-j}=\mathbf{x}_{-j},\theta^{\star}=\theta,D=\mathbf{d})}=\frac{P^{\pi}(\mathbf{X}_{j}=\mathbf{x}_{j},\mathbf{X}_{-j}=\mathbf{x}_{-j},\theta^{\star}=\theta,D=\mathbf{d})}{P^{\pi}(\mathbf{X}_{j}=\widetilde{\mathbf{x}}_{j},\mathbf{X}_{-j}=\mathbf{x}_{-j},\theta^{\star}=\theta,D=\mathbf{d})}.

By definition of D=𝐝D=\mathbf{d}, the event in the numerator is equivalent to the event [𝐗,𝐗~]=[𝐱,𝐱~],𝐘=𝐲,θ=θ[\mathbf{X},\widetilde{\mathbf{X}}]=[\mathbf{x},\widetilde{\mathbf{x}}],\mathbf{Y}=\mathbf{y},\theta^{\star}=\theta and the event in the denominator is equivalent to [𝐗,𝐗~]=[𝐱,𝐱~]swap(j),𝐘=𝐲,θ=θ[\mathbf{X},\widetilde{\mathbf{X}}]=[\mathbf{x},\widetilde{\mathbf{x}}]_{\mathrm{swap}(j)},\mathbf{Y}=\mathbf{y},\theta^{\star}=\theta. Plugging this in yields

=\displaystyle= Pπ([𝐗,𝐗~]=[𝐱,𝐱~],θ=θ,𝐘=𝐲)Pπ([𝐗,𝐗~]=[𝐱,𝐱~]swap(j),θ=θ,𝐘=𝐲)\displaystyle\frac{P^{\pi}([\mathbf{X},\widetilde{\mathbf{X}}]=[\mathbf{x},\widetilde{\mathbf{x}}],\theta^{\star}=\theta,\mathbf{Y}=\mathbf{y})}{P^{\pi}([\mathbf{X},\widetilde{\mathbf{X}}]=[\mathbf{x},\widetilde{\mathbf{x}}]_{\mathrm{swap}(j)},\theta^{\star}=\theta,\mathbf{Y}=\mathbf{y})}
=\displaystyle= π(θ)Pπ([𝐗,𝐗~]=[𝐱,𝐱~]θ=θ)Pπ(𝐘=𝐲θ=θ,[𝐗,𝐗~]=[𝐱,𝐱~])π(θ)Pπ([𝐗,𝐗~]=[𝐱,𝐱~]swap(j)θ=θ)Pπ(𝐘=𝐲θ=θ,[𝐗,𝐗~]=[𝐱,𝐱~]swap(j)).\displaystyle\frac{\pi(\theta)P^{\pi}([\mathbf{X},\widetilde{\mathbf{X}}]=[\mathbf{x},\widetilde{\mathbf{x}}]\mid\theta^{\star}=\theta)P^{\pi}(\mathbf{Y}=\mathbf{y}\mid\theta^{\star}=\theta,[\mathbf{X},\widetilde{\mathbf{X}}]=[\mathbf{x},\widetilde{\mathbf{x}}])}{\pi(\theta)P^{\pi}([\mathbf{X},\widetilde{\mathbf{X}}]=[\mathbf{x},\widetilde{\mathbf{x}}]_{\mathrm{swap}(j)}\mid\theta^{\star}=\theta)P^{\pi}(\mathbf{Y}=\mathbf{y}\mid\theta^{\star}=\theta,[\mathbf{X},\widetilde{\mathbf{X}}]=[\mathbf{x},\widetilde{\mathbf{x}}]_{\mathrm{swap}(j)})}.

where the second line uses the chain rule of conditional probability and the fact that θ\theta^{\star} has marginal density π\pi under PπP^{\pi}. Recall that by definition of PπP^{\pi} (see Section 1.2), the law of the data given θ=θ\theta^{\star}=\theta is simply P(θ)P^{(\theta)}. Furthermore, for all θΘ\theta\in\Theta, under P(θ)P^{(\theta)}, [𝐗,𝐗~][\mathbf{X},\widetilde{\mathbf{X}}] are assumed to be pairwise exchangeable because they are valid knockoffs (see footnote 3). Therefore, cancelling terms, we conclude

=\displaystyle= P(θ)(𝐘=𝐲[𝐗,𝐗~]=[𝐱,𝐱~])P(θ)(𝐘=𝐲[𝐗,𝐗~]=[𝐱,𝐱~]swap(j))\displaystyle\frac{P^{(\theta)}(\mathbf{Y}=\mathbf{y}\mid[\mathbf{X},\widetilde{\mathbf{X}}]=[\mathbf{x},\widetilde{\mathbf{x}}])}{P^{(\theta)}(\mathbf{Y}=\mathbf{y}\mid[\mathbf{X},\widetilde{\mathbf{X}}]=[\mathbf{x},\widetilde{\mathbf{x}}]_{\mathrm{swap}(j)})}
=P𝐘𝐗(θ)(𝐲𝐗j=𝐱j,𝐗j=𝐱j)P𝐘𝐗(θ)(𝐲𝐗j=𝐱~j,𝐗j=𝐱j).\displaystyle=\frac{P_{\mathbf{Y}\mid\mathbf{X}}^{(\theta)}(\mathbf{y}\mid\mathbf{X}_{j}=\mathbf{x}_{j},\mathbf{X}_{-j}=\mathbf{x}_{-j})}{P_{\mathbf{Y}\mid\mathbf{X}}^{(\theta)}(\mathbf{y}\mid\mathbf{X}_{j}=\widetilde{\mathbf{x}}_{j},\mathbf{X}_{-j}=\mathbf{x}_{-j})}.

where the last line follows since 𝐘𝐗~𝐗\mathbf{Y}\perp\!\!\!\perp\widetilde{\mathbf{X}}\mid\mathbf{X}, as 𝐗~\widetilde{\mathbf{X}} are valid knockoffs by assumption under P(θ)P^{(\theta)}. ∎

E.2 Derivation of Gibbs sampling updates

In this section, we derive the Gibbs sampling updates for the class of MLR statistics defined in Section 4.2. First, for convenience, we restate the model and choice of π\pi.

E.2.1 Model and prior

First, we consider the model-X case. For each j[p]j\in[p], let ϕj(𝐗j)n×kj\phi_{j}(\mathbf{X}_{j})\in\mathbb{R}^{n\times k_{j}} denote any vector of prespecified basis functions applied to 𝐗j\mathbf{X}_{j}. We assume the following additive model:

𝐘𝐗,β,σ2𝒩(j=1pϕj(𝐗j)β(j),σ2In)\mathbf{Y}\mid\mathbf{X},\beta,\sigma^{2}\sim\mathcal{N}\left(\sum_{j=1}^{p}\phi_{j}(\mathbf{X}_{j})\beta^{(j)},\sigma^{2}I_{n}\right)

with the following prior on β(j)kj\beta^{(j)}\in\mathbb{R}^{k_{j}}:

β(j)ind{0kjw.p. p0𝒩(0,τ2Ikj)w.p. 1p0.\beta^{(j)}\stackrel{{\scriptstyle\mathrm{ind}}}{{\sim}}\begin{cases}0\in\mathbb{R}^{k_{j}}&\text{w.p. }p_{0}\\ \mathcal{N}(0,\tau^{2}I_{k_{j}})&\text{w.p. }1-p_{0}.\end{cases}

with the usual hyperpriors

τ2invGamma(aτ,bτ),σ2invGamma(aσ,bσ) and p0Beta(a0,b0).\tau^{2}\sim\mathrm{invGamma}(a_{\tau},b_{\tau}),\sigma^{2}\sim\mathrm{invGamma}(a_{\sigma},b_{\sigma})\text{ and }p_{0}\sim\mathrm{Beta}(a_{0},b_{0}).

This is effectively a group spike-and-slab prior on β(j), which enforces group sparsity: either the whole vector equals zero or the whole vector is nonzero. We use this group spike-and-slab prior for two reasons. First, it reflects the intuition that ϕj is meant to represent only a single feature, and thus β(j) will likely be entirely zero (if 𝐗j is truly null) or entirely nonzero. Second, and more importantly, the group sparsity substantially improves the computational efficiency of the Gibbs sampler.

Lastly, for the fixed-X case, we assume exactly the same model but with the basis functions ϕj(⋅) chosen to be the identity. Thus, in the fixed-X case this model is a standard spike-and-slab Gaussian linear model (George and McCulloch,, 1997). It is worth noting that our implementation for the fixed-X case actually uses a slightly more general Gaussian mixture prior on βj, with density p(βj) = ∑_{k=0}^{m} p_k 𝒩(βj; 0, τ_k²), where τ_0 = 0 (so the k = 0 component is a point mass at zero), τ_k ∼ invGamma(a_k, b_k) independently for k ≥ 1, and (p_0,…,p_m) ∼ Dir(α). However, for brevity, we only derive the Gibbs updates for the case of two mixture components.
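To make the prior concrete, the following minimal Python sketch (using numpy) draws β from the group spike-and-slab prior above; the basis dimensions and hyperparameter values are arbitrary placeholders, not defaults from our implementation.

import numpy as np

def sample_group_spike_slab(k_list, p0, tau2, rng):
    # Draw beta = (beta^(1), ..., beta^(p)): each block is entirely zero with
    # probability p0 and otherwise N(0, tau2 * I_{k_j}), independently across j.
    blocks = []
    for k_j in k_list:
        if rng.random() < p0:
            blocks.append(np.zeros(k_j))                               # spike: whole block zero
        else:
            blocks.append(rng.normal(scale=np.sqrt(tau2), size=k_j))   # slab: Gaussian block
    return blocks

rng = np.random.default_rng(0)
beta_blocks = sample_group_spike_slab(k_list=[4, 4, 4], p0=0.9, tau2=0.25, rng=rng)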

E.2.2 Gibbs sampling updates

Following Section 4, we now review the details of the MLR Gibbs sampler, which samples from the posterior of (𝐗,β) given the masked data D = {𝐘, {𝐱j, 𝐱~j}j=1p}. (This is a standard derivation, but we review it here for the reader’s convenience.) As notation, let β denote the concatenation of {β(j)}j=1p, let β(−j) denote all of the coordinates of β except those of β(j), let γj denote the indicator that β(j) ≠ 0, and let ϕ(𝐗) ∈ ℝ^{n × ∑j kj} denote all of the basis functions concatenated together. Also note that although this section mostly uses the language of model-X knockoffs, when the basis functions ϕj(⋅) are the identity, the Gibbs updates we are about to describe satisfy the sufficiency property required for fixed-X statistics, and indeed the resulting Gibbs sampler is actually a valid implementation of the fixed-X MLR statistic.

To improve the convergence of the Gibbs sampler, we slightly modify the meta-algorithm in Algorithm 1 to marginalize over the value of β(j)\beta^{(j)} when resampling 𝐗j\mathbf{X}_{j}. To be precise, this means that instead of sampling 𝐗j𝐗j,β,σ2\mathbf{X}_{j}\mid\mathbf{X}_{-j},\beta,\sigma^{2}, we sample 𝐗j𝐗j,β(j)\mathbf{X}_{j}\mid\mathbf{X}_{-j},\beta^{(\mathrm{-}j)}. We derive this update in three steps, and along the way we derive the update for β(j)𝐗,β(j),D\beta^{(j)}\mid\mathbf{X},\beta^{(\mathrm{-}j)},D.

Step 1: First, we derive the update for γj𝐗,β(j),D\gamma_{j}\mid\mathbf{X},\beta^{(\mathrm{-}j)},D. Observe

(γj=0𝐗,β(j),D)(γj=1𝐗,β(j),D)\displaystyle\frac{\mathbb{P}(\gamma_{j}=0\mid\mathbf{X},\beta^{(\mathrm{-}j)},D)}{\mathbb{P}(\gamma_{j}=1\mid\mathbf{X},\beta^{(\mathrm{-}j)},D)} =p0p(𝐘𝐗,β(j),β(j)=0)(1p0)p(𝐘𝐗,β(j),β(j)0).\displaystyle=\frac{p_{0}p(\mathbf{Y}\mid\mathbf{X},\beta^{(\mathrm{-}j)},\beta^{(j)}=0)}{(1-p_{0})p(\mathbf{Y}\mid\mathbf{X},\beta^{(\mathrm{-}j)},\beta^{(j)}\neq 0)}.

Analyzing the numerator is easy, as the model specifies that if we let 𝐫=𝐘ϕ(𝐗j)β(j)\mathbf{r}=\mathbf{Y}-\phi(\mathbf{X}_{-j})\beta^{(\mathrm{-}j)}, then

p(𝐘𝐗,β(j),β(j)=0)det(σ2In)1/2exp(12σ2𝐫22).p(\mathbf{Y}\mid\mathbf{X},\beta^{(\mathrm{-}j)},\beta^{(j)}=0)\propto\det(\sigma^{2}I_{n})^{-1/2}\exp\left(-\frac{1}{2\sigma^{2}}\|\mathbf{r}\|_{2}^{2}\right).

For the denominator, observe that 𝐫,β(j)𝐗,β(j),β(j)0\mathbf{r},\beta^{(j)}\mid\mathbf{X},\beta^{(\mathrm{-}j)},\beta^{(j)}\neq 0 is jointly Gaussian: in particular,

(\beta^{(j)},\mathbf{r})\mid\mathbf{X},\beta^{(\mathrm{-}j)},\beta^{(j)}\neq 0\sim\mathcal{N}\left(0,\begin{bmatrix}\tau^{2}I_{k_{j}}&\tau^{2}\phi_{j}(\mathbf{X}_{j})^{T}\\ \tau^{2}\phi_{j}(\mathbf{X}_{j})&\tau^{2}\phi_{j}(\mathbf{X}_{j})\phi_{j}(\mathbf{X}_{j})^{T}+\sigma^{2}I_{n}\end{bmatrix}\right). (E.1)

To lighten notation, let Q_j ≔ I_{k_j} + (τ²/σ²) ϕ_j(𝐗_j)^T ϕ_j(𝐗_j). Using the above expression plus the Woodbury identity applied to the density of 𝐘 ∣ 𝐗, β(−j), β(j) ≠ 0, we conclude

(γj=0𝐗,β(j),D)(γj=1𝐗,β(j),D)=p01p0det(Qj)1/2exp(τ22σ4𝐫Tϕj(𝐗j)Qj1ϕj(𝐗j)T𝐫).\frac{\mathbb{P}(\gamma_{j}=0\mid\mathbf{X},\beta^{(\mathrm{-}j)},D)}{\mathbb{P}(\gamma_{j}=1\mid\mathbf{X},\beta^{(\mathrm{-}j)},D)}=\frac{p_{0}}{1-p_{0}}\det(Q_{j})^{1/2}\exp\left(-\frac{\tau^{2}}{2\sigma^{4}}\mathbf{r}^{T}\phi_{j}(\mathbf{X}_{j})Q_{j}^{-1}\phi_{j}(\mathbf{X}_{j})^{T}\mathbf{r}\right).

Since QjQ_{j} is a kj×kjk_{j}\times k_{j} matrix, this quantity can be computed relatively efficiently.
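For concreteness, here is a minimal numpy sketch of this Step 1 odds computation; phi_j stands for the n × k_j matrix ϕ_j(𝐗_j), r for the residual 𝐫 defined above, and the function and argument names are ours rather than part of any existing implementation.

import numpy as np

def gamma_odds(phi_j, r, p0, tau2, sigma2):
    # Odds P(gamma_j = 0 | X, beta^(-j), D) / P(gamma_j = 1 | X, beta^(-j), D)
    # via the formula above, with Q_j = I + (tau2/sigma2) * phi_j^T phi_j.
    k_j = phi_j.shape[1]
    Q_j = np.eye(k_j) + (tau2 / sigma2) * (phi_j.T @ phi_j)
    v = phi_j.T @ r                                  # phi_j(X_j)^T r
    quad = v @ np.linalg.solve(Q_j, v)               # r^T phi_j Q_j^{-1} phi_j^T r
    _, logdet = np.linalg.slogdet(Q_j)
    log_odds = (np.log(p0) - np.log1p(-p0)
                + 0.5 * logdet
                - tau2 / (2 * sigma2 ** 2) * quad)
    return np.exp(log_odds)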

Step 2: Next, we derive the distribution of β(j)𝐘,𝐗,β(j),γj\beta^{(j)}\mid\mathbf{Y},\mathbf{X},\beta^{(\mathrm{-}j)},\gamma_{j}. Of course, the case where γj=0\gamma_{j}=0 is trivial since then β(j)=0\beta^{(j)}=0 by definition: in the alternative case, note from Equation (E.1) that we have

β(j)𝐘,𝐗,β(j),γj=1𝒩(τ2σ2ϕjT𝐫τ4σ4ϕjTϕjQj1ϕjT𝐫,τ2Ikjτ4σ2ϕjTϕj+τ6σ4ϕjTϕjQj1ϕjTϕj),\beta^{(j)}\mid\mathbf{Y},\mathbf{X},\beta^{(\mathrm{-}j)},\gamma_{j}=1\sim\mathcal{N}\left(\frac{\tau^{2}}{\sigma^{2}}\phi_{j}^{T}\mathbf{r}-\frac{\tau^{4}}{\sigma^{4}}\phi_{j}^{T}\phi_{j}Q_{j}^{-1}\phi_{j}^{T}\mathbf{r},\tau^{2}I_{k_{j}}-\frac{\tau^{4}}{\sigma^{2}}\phi_{j}^{T}\phi_{j}+\frac{\tau^{6}}{\sigma^{4}}\phi_{j}^{T}\phi_{j}Q_{j}^{-1}\phi_{j}^{T}\phi_{j}\right),

where above, we use ϕj\phi_{j} as shorthand for ϕj(𝐗j)\phi_{j}(\mathbf{X}_{j}) to lighten notation.
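A direct transcription of this conditional Gaussian draw is sketched below; as before, phi_j denotes ϕ_j(𝐗_j) and r the residual, the names are ours, and in practice one might prefer a numerically more stable parameterization than forming the covariance matrix explicitly.

import numpy as np

def sample_beta_j(phi_j, r, tau2, sigma2, rng):
    # Draw beta^(j) | Y, X, beta^(-j), gamma_j = 1 from the Gaussian above.
    k_j = phi_j.shape[1]
    G = phi_j.T @ phi_j                              # k_j x k_j Gram matrix
    Q_j = np.eye(k_j) + (tau2 / sigma2) * G
    v = phi_j.T @ r
    mean = (tau2 / sigma2) * v - (tau2 ** 2 / sigma2 ** 2) * (G @ np.linalg.solve(Q_j, v))
    cov = (tau2 * np.eye(k_j)
           - (tau2 ** 2 / sigma2) * G
           + (tau2 ** 3 / sigma2 ** 2) * (G @ np.linalg.solve(Q_j, G)))
    return rng.multivariate_normal(mean, cov)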

Step 3: Lastly, we derive the update for 𝐗j given 𝐗−j, β(−j), D. In particular, for any vector 𝐱, let κ(𝐱) ≔ ℙ(γj = 0 ∣ 𝐗j = 𝐱, 𝐗−j, β(−j)). Then by the law of total probability and the same Woodbury calculations as before,

(𝐗j=𝐱𝐗j,β(j),D)\displaystyle\mathbb{P}(\mathbf{X}_{j}=\mathbf{x}\mid\mathbf{X}_{-j},\beta^{(\mathrm{-}j)},D)\propto p(𝐘𝐗j=𝐱,𝐗j,β(j))\displaystyle p(\mathbf{Y}\mid\mathbf{X}_{j}=\mathbf{x},\mathbf{X}_{-j},\beta^{(\mathrm{-}j)})
=κ(𝐱)p(𝐘𝐗j=𝐱,𝐗j,β(j),β(j)=0)\displaystyle=\kappa(\mathbf{x})p(\mathbf{Y}\mid\mathbf{X}_{j}=\mathbf{x},\mathbf{X}_{-j},\beta^{(\mathrm{-}j)},\beta^{(j)}=0)
+(1κ(𝐱))p(𝐘𝐗j=𝐱,𝐗j,β(j),β(j)0)\displaystyle+(1-\kappa(\mathbf{x}))p(\mathbf{Y}\mid\mathbf{X}_{j}=\mathbf{x},\mathbf{X}_{-j},\beta^{(\mathrm{-}j)},\beta^{(j)}\neq 0)
κ(𝐱)exp(12σ2𝐫22)\displaystyle\propto\kappa(\mathbf{x})\exp\left(-\frac{1}{2\sigma^{2}}\|\mathbf{r}\|_{2}^{2}\right)
\displaystyle+(1-\kappa(\mathbf{x}))\det(Q_{j}(\mathbf{x}))^{-1/2}\exp\left(-\frac{1}{2\sigma^{2}}\|\mathbf{r}\|_{2}^{2}+\frac{\tau^{2}}{2\sigma^{4}}\mathbf{r}^{T}\phi_{j}(\mathbf{x})Q_{j}(\mathbf{x})^{-1}\phi_{j}(\mathbf{x})^{T}\mathbf{r}\right)
\displaystyle\propto\kappa(\mathbf{x})+(1-\kappa(\mathbf{x}))\det(Q_{j}(\mathbf{x}))^{-1/2}\exp\left(\frac{\tau^{2}}{2\sigma^{4}}\mathbf{r}^{T}\phi_{j}(\mathbf{x})Q_{j}(\mathbf{x})^{-1}\phi_{j}(\mathbf{x})^{T}\mathbf{r}\right)

where above Qj(𝐱)=Ikj+τ2σ2ϕj(𝐱)Tϕj(𝐱)Q_{j}(\mathbf{x})=I_{k_{j}}+\frac{\tau^{2}}{\sigma^{2}}\phi_{j}(\mathbf{x})^{T}\phi_{j}(\mathbf{x}) as before.
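This resampling step can be sketched as follows: conditional on the masked data, 𝐗j can only equal 𝐱j or 𝐱~j, so we evaluate the unnormalized probability above at both candidates and draw accordingly. The helper names are ours; kappa plays the role of κ(𝐱) (which reduces to p0 under the prior above), and for strong signals one might prefer to compare the two candidates on the log scale to avoid overflow.

import numpy as np

def resample_xj(xj, xj_tilde, phi_j_fn, r, kappa, tau2, sigma2, rng):
    # Draw X_j from {x_j, xtilde_j} given X_{-j}, beta^(-j), D, using the
    # unnormalized probabilities kappa(x) + (1 - kappa(x)) det(Q_j(x))^{-1/2}
    #   * exp( (tau2 / (2 sigma2^2)) * r^T phi_j(x) Q_j(x)^{-1} phi_j(x)^T r ).
    def unnorm_prob(x):
        phi = phi_j_fn(x)                            # n x k_j basis expansion of the candidate
        k_j = phi.shape[1]
        Q = np.eye(k_j) + (tau2 / sigma2) * (phi.T @ phi)
        v = phi.T @ r
        quad = v @ np.linalg.solve(Q, v)
        slab = np.linalg.det(Q) ** -0.5 * np.exp(tau2 / (2 * sigma2 ** 2) * quad)
        return kappa + (1.0 - kappa) * slab
    w0, w1 = unnorm_prob(xj), unnorm_prob(xj_tilde)
    return xj if rng.random() < w0 / (w0 + w1) else xj_tilde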

The only other sampling steps required in the Gibbs sampler are to sample from the conditional distributions of σ2,τ2\sigma^{2},\tau^{2} and p0p_{0}; however, this is straightforward since we use conjugate hyperpriors for each of these parameters.

E.2.3 Extension to binary regression

We can easily extend the Gibbs sampler from the preceding section to handle a binary response via a latent variable approach. Indeed, consider first the case of probit regression, in which we observe 𝐳 = 𝕀(𝐘 ≥ 0) ∈ {0,1}^n instead of the continuous outcome 𝐘. Following Albert and Chib, (1993), the distribution of 𝐘 ∣ 𝐳, 𝐗, β is truncated normal, namely

\mathbf{Y}_{i}\mid\mathbf{z},\mathbf{X},\beta\stackrel{\mathrm{ind}}{\sim}\begin{cases}\mathrm{TruncNorm}(\mu_{i},\sigma^{2};(0,\infty))&\mathbf{z}_{i}=1\\ \mathrm{TruncNorm}(\mu_{i},\sigma^{2};(-\infty,0))&\mathbf{z}_{i}=0,\end{cases} (E.2)

where μ = ϕ(𝐗)β = 𝔼[𝐘 ∣ 𝐗, β]. Thus, when we observe a binary response 𝐳 instead of the continuous response 𝐘, we can employ the same Gibbs sampler as in Section E.2.2, except that after updating β(j) ∣ 𝐗, β(−j), 𝐘 we resample the latent variables 𝐘 according to Equation (E.2). This takes O(n) computation per iteration, since μ can also be updated in O(n) operations whenever 𝐗 or β changes. As a result, the computational complexity of this algorithm is the same as that of the algorithm in Section E.2.2. A similar formulation based on Pólya-Gamma random variables is available for the case of logistic regression (see Polson et al., (2013)).
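As an illustration, the latent-variable resampling step in Equation (E.2) can be implemented with scipy's truncated normal; mu and sigma2 denote μ and σ² above, and the function name is ours.

import numpy as np
from scipy.stats import truncnorm

def resample_latent_y(z, mu, sigma2, rng):
    # Resample Y_i | z, X, beta: truncated to (0, inf) when z_i = 1 and to
    # (-inf, 0) when z_i = 0. scipy's truncnorm takes standardized bounds.
    sigma = np.sqrt(sigma2)
    a = np.where(z == 1, (0.0 - mu) / sigma, -np.inf)   # lower bounds
    b = np.where(z == 1, np.inf, (0.0 - mu) / sigma)    # upper bounds
    return truncnorm.rvs(a, b, loc=mu, scale=sigma, random_state=rng)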

E.3 Proof and discussion of Lemma 4.1

Lemma 4.1.

Suppose that under PπP^{\pi}, (i) pj(i)(0,1)p_{j}^{(i)}\in(0,1) a.s. for j[p]j\in[p] and (ii) the support of θ𝐗,𝐘\theta^{\star}\mid\mathbf{X},\mathbf{Y} equals the support of the marginal law of θ\theta^{\star}. Then as nsamplen_{\mathrm{sample}}\to\infty,

\log\left(\sum_{i=1}^{n_{\mathrm{sample}}}p_{j}^{(i)}\right)-\log\left(\sum_{i=1}^{n_{\mathrm{sample}}}\left(1-p_{j}^{(i)}\right)\right)\stackrel{\mathrm{p}}{\to}\mathrm{MLR}_{j}^{\pi}\coloneqq\log\left(\frac{P_{j}^{\pi}(\mathbf{X}_{j}\mid D)}{P_{j}^{\pi}(\widetilde{\mathbf{X}}_{j}\mid D)}\right).
Proof.

This result holds because, by the derivations in Section 4, Algorithm 1 is a standard Gibbs sampler as defined in Algorithm A.40 of Robert and Casella, (2004): at each step, it samples from the conditional law of one unknown variable given all of the others (the unknown variables being 𝐗1,…,𝐗p and θ⋆), with every step performed conditionally on D. Furthermore, condition (i) implies that the support of 𝐗j ∣ D, 𝐗−j, θ⋆ does not depend on (θ⋆, 𝐗−j). As a result, the lemma is a direct consequence of Corollary 10.12 of Robert and Casella, (2004), applied conditionally on D. In particular, Corollary 10.12 proves that

1nsamplei=1nsamplepj(i)pPjπ(𝐗jD)\frac{1}{n_{\mathrm{sample}}}\sum_{i=1}^{n_{\mathrm{sample}}}p_{j}^{(i)}\stackrel{{\scriptstyle\mathrm{p}}}{{\to}}P_{j}^{\pi}(\mathbf{X}_{j}\mid D)

The result then follows from the continuous mapping theorem; note that condition (i) ensures P_j^π(𝐗j ∣ D) ∈ (0,1), so the continuous mapping theorem applies. ∎
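In practice, the estimator in Lemma 4.1 is a one-liner applied to the Gibbs output; a minimal sketch is below, where p_samples holds p_j^{(1)}, …, p_j^{(n_sample)} for a single feature j and the function name is ours.

import numpy as np

def mlr_from_gibbs(p_samples):
    # Estimate MLR_j^pi by log(sum_i p_j^(i)) - log(sum_i (1 - p_j^(i))).
    p = np.asarray(p_samples, dtype=float)
    return np.log(p.sum()) - np.log((1.0 - p).sum())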

We also note that the two assumptions of Lemma 4.1 are satisfied in Example 1. In particular, to verify that the support of 𝐗j ∣ D, θ⋆, 𝐗−j does not depend on θ⋆, observe that Eq. (4.1) implies that, conditional on D = 𝐝 with 𝐝 = (𝐲, {𝐱j, 𝐱~j}j=1p), and for any θ ∈ Θ,

\frac{p_{j}^{(i)}}{1-p_{j}^{(i)}}=\frac{P^{\pi}(\mathbf{X}_{j}=\mathbf{x}_{j}\mid D=\mathbf{d},\theta^{\star}=\theta,\mathbf{X}_{-j})}{P^{\pi}(\mathbf{X}_{j}=\widetilde{\mathbf{x}}_{j}\mid D=\mathbf{d},\theta^{\star}=\theta,\mathbf{X}_{-j})}=\frac{P^{(\theta)}_{\mathbf{Y}\mid\mathbf{X}}(\mathbf{y}\mid\mathbf{X}_{j}=\mathbf{x}_{j},\mathbf{X}_{-j})}{P^{(\theta)}_{\mathbf{Y}\mid\mathbf{X}}(\mathbf{y}\mid\mathbf{X}_{j}=\widetilde{\mathbf{x}}_{j},\mathbf{X}_{-j})}.

In Example 1, the numerator and denominator of the above likelihood ratio are Gaussian likelihoods, so the ratio is finite and strictly positive, and thus p_j^{(i)} ∈ (0,1). Similarly, Example 1 is a conjugate Gaussian (additive) spike-and-slab linear model. Well-known results for these models establish that the support of the Gibbs distributions for the linear coefficients β and hyperparameters σ², τ², p0 equals the support of the marginal distribution of these parameters (George and McCulloch,, 1997); see Appendix E.2 for detailed derivations illustrating this result.

There may, of course, be other Bayesian models Pπ for which the assumptions of Lemma 4.1 do not hold. In such settings, these assumptions can be relaxed; see Robert and Casella, (2004).

E.4 Computing AMLR statistics

The AMLR statistics are a deterministic function of the MLR statistics {MLRjπ}j=1p\{\mathrm{MLR}_{j}^{\pi}\}_{j=1}^{p} and {νj}j=1p\{\nu_{j}\}_{j=1}^{p} where

νjPπ(MLRjπ>0,j1(θ)D)(1+q)1Pπ(MLRjπ>0D).\nu_{j}\coloneqq\frac{P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0,j\in\mathcal{H}_{1}(\theta^{\star})\mid D)}{(1+q)^{-1}-P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)}.

By Proposition 3.3, Pπ(MLRjπ > 0 ∣ D) is a function of |MLRjπ|, so computing the denominator of νj is straightforward since we have already established how to compute {MLRjπ}j=1p. To compute the numerator, recall that Algorithm 1 samples from the joint posterior of θ⋆, 𝐗, 𝐗~ ∣ D. Therefore, we can use the empirical mean of the samples from Algorithm 1 to approximate this quantity:

Pπ(MLRjπ>0,j1(θ)D)1nsamplei=1nsample𝕀(𝐗^j=𝐗j(i),j1(θ(i)))P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0,j\in\mathcal{H}_{1}(\theta^{\star})\mid D)\approx\frac{1}{n_{\mathrm{sample}}}\sum_{i=1}^{n_{\mathrm{sample}}}\mathbb{I}(\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j}^{(i)},j\in\mathcal{H}_{1}(\theta^{(i)}))

where 𝐗^j = argmax_{𝐱 ∈ {𝐗j, 𝐗~j}} P_j^π(𝐱 ∣ D) is the MLR “guess” of the value of 𝐗j.
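A minimal sketch of this Monte Carlo estimate is given below. It assumes the explicit form Pπ(MLRjπ > 0 ∣ D) = logit^{-1}(|MLRjπ|) used for plotting in Appendix H.1, and the variable names (xj_samples, xj_guess, j_nonnull) are ours.

import numpy as np

def estimate_nu_j(mlr_j, xj_samples, xj_guess, j_nonnull, q):
    # mlr_j:       scalar MLR_j^pi
    # xj_samples:  (nsample, n) array of sampled values X_j^(i)
    # xj_guess:    (n,) array, Xhat_j = argmax over {X_j, Xtilde_j} of P_j^pi(x | D)
    # j_nonnull:   (nsample,) boolean array, indicator of j in H_1(theta^(i))
    p_pos = 1.0 / (1.0 + np.exp(-abs(mlr_j)))                 # P(MLR_j^pi > 0 | D)
    hits = np.all(xj_samples == xj_guess[None, :], axis=1)    # Xhat_j == X_j^(i)
    numerator = np.mean(hits & j_nonnull)                     # estimates P(MLR_j^pi > 0, j in H_1 | D)
    return numerator / ((1.0 + q) ** -1 - p_pos)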

Appendix F MLR statistics for group knockoffs

In this section, we describe how MLR statistics extend to the setting of group knockoffs (Dai and Barber,, 2016). In particular, for a partition G1,…,Gm ⊂ [p] of the features, group knockoffs allow analysts to test the group null hypotheses HGj: 𝐗Gj ⫫ 𝐘 ∣ 𝐗−Gj, which can be useful in settings where 𝐗 is highly correlated and there is not enough data to make discoveries at the level of individual variables. Specifically, knockoffs 𝐗~ are model-X group knockoffs if they satisfy the group pairwise-exchangeability condition [𝐗,𝐗~]swap(Gj) =d [𝐗,𝐗~] for each j ∈ [m]. Similarly, 𝐗~ are fixed-X group knockoffs if (i) 𝐗^T𝐗 = 𝐗~^T𝐗~ and (ii) S = 𝐗^T𝐗 − 𝐗~^T𝐗 is block-diagonal, where the blocks correspond to the groups G1,…,Gm. Given group knockoffs, one computes a single knockoff feature statistic for each group.
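As a small illustration of the fixed-X conditions above, the following sketch checks (i) and (ii) numerically for a candidate 𝐗~; here groups is a list of index arrays forming the partition G1,…,Gm, and the function name is ours.

import numpy as np

def is_fixed_x_group_knockoff(X, X_tilde, groups, tol=1e-8):
    # (i) X^T X = Xtilde^T Xtilde; (ii) S = X^T X - Xtilde^T X is block-diagonal,
    # with blocks given by the groups G_1, ..., G_m.
    gram_ok = np.allclose(X.T @ X, X_tilde.T @ X_tilde, atol=tol)
    S = X.T @ X - X_tilde.T @ X
    in_block = np.zeros(S.shape, dtype=bool)
    for G in groups:
        in_block[np.ix_(G, G)] = True
    block_ok = np.allclose(S[~in_block], 0.0, atol=tol)
    return gram_ok and block_ok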

MLR statistics extend naturally to the group knockoff setting because we can treat each group of features XGjX_{G_{j}} as a single compound feature. In particular, the masked data for group knockoffs is

D={(𝐘,{𝐗Gj,𝐗~Gj}j=1m) for model-X knockoffs(𝐗,𝐗~,{𝐗GjT𝐘,𝐗~GjT𝐘}j=1m) for fixed-X knockoffs,D=\begin{cases}(\mathbf{Y},\{\mathbf{X}_{G_{j}},\widetilde{\mathbf{X}}_{G_{j}}\}_{j=1}^{m})&\text{ for model-X knockoffs}\\ (\mathbf{X},\widetilde{\mathbf{X}},\{\mathbf{X}_{G_{j}}^{T}\mathbf{Y},\widetilde{\mathbf{X}}_{G_{j}}^{T}\mathbf{Y}\}_{j=1}^{m})&\text{ for fixed-X knockoffs,}\end{cases} (F.1)

and the corresponding MLR statistics are

MLRjπ=log(PGjπ(𝐗GjD)PGjπ(𝐗~GjD)) for model-X knockoffs,\mathrm{MLR}_{j}^{\pi}=\log\left(\frac{P_{G_{j}}^{\pi}(\mathbf{X}_{G_{j}}\mid D)}{P_{G_{j}}^{\pi}(\widetilde{\mathbf{X}}_{G_{j}}\mid D)}\right)\text{ for model-X knockoffs,}

where PGjπP_{G_{j}}^{\pi} above denotes the law of 𝐗GjD\mathbf{X}_{G_{j}}\mid D under PπP^{\pi}. For fixed-X knockoffs, we have

MLRjπ=log(PGj,fxπ(𝐗GjT𝐘D)PGj,fxπ(𝐗~GjT𝐘D)) for fixed-X knockoffs,\mathrm{MLR}_{j}^{\pi}=\log\left(\frac{P^{\pi}_{G_{j},\mathrm{fx}}(\mathbf{X}_{G_{j}}^{T}\mathbf{Y}\mid D)}{P^{\pi}_{G_{j},\mathrm{fx}}(\widetilde{\mathbf{X}}_{G_{j}}^{T}\mathbf{Y}\mid D)}\right)\text{ for fixed-X knockoffs,}

where PGj,fxπP^{\pi}_{G_{j},\mathrm{fx}} denotes the law of 𝐗GjT𝐘D\mathbf{X}_{G_{j}}^{T}\mathbf{Y}\mid D under PπP^{\pi}.

Throughout the paper, we have proved several optimality properties of MLR statistics, and if we treat 𝐗Gj\mathbf{X}_{G_{j}} as a single compound feature, all of these theoretical results (namely Proposition 3.3 and Theorem 3.2) immediately apply to group MLR statistics as well.

To compute group MLR statistics, we can use the same Gibbs sampling strategy as in Section E.2: one simply treats 𝐗Gj as the basis representation of a single compound feature and applies the same equations as derived previously. This method is implemented in knockpy.

Appendix G Additional details for the simulations

Figure 13: This plot is identical to Figure 7 except it shows the results for q=0.05q=0.05.

In this section, we describe the simulation settings in Section 5, and we also give the corresponding plot to Figure 7 which shows the results when q=0.05q=0.05. To start, we describe the simulation setting for each plot.

  1. 1.

    Sampling 𝐗: We sample each row of 𝐗 as an i.i.d. 𝒩(0,Σ) random vector in all simulations, with two choices of Σ (see the code sketch after this list). First, in the “AR(1)” setting, we take Σ to correspond to a nonstationary AR(1) Gaussian Markov chain, so 𝐗 has i.i.d. rows satisfying Xj ∣ X1,…,Xj−1 ∼ 𝒩(ρj Xj−1, 1) with ρj = min(0.99, Bj) for Bj i.i.d. ∼ Beta(5,1). Note that the AR(1) setting is the default used in any plot where the covariance matrix is not specified. Second, in the “ErdosRenyi” (ER) setting, we sample a random matrix V such that 80% of its off-diagonal entries (selected uniformly at random) are equal to zero; for the remaining entries, we sample Vij i.i.d. ∼ Unif((−1,−0.1) ∪ (0.1,1)). To ensure the final covariance matrix is positive definite, we set Σ = (V + V^T) + (0.1 + |λmin(V + V^T)|) Ip and then rescale Σ to be a correlation matrix.

  2. 2.

    Sampling β\beta: Unless otherwise specified in the plot, we randomly choose s=10%s=10\% of the entries of β\beta to be nonzero and sample the nonzero entries as i.i.d. Unif([τ,τ/2][τ/2,τ])\mathrm{Unif}([-\tau,-\tau/2]\cup[\tau/2,\tau]) random variables with τ=0.5\tau=0.5 by default. The exceptions are: (1) in Figure 2, we set τ=0.5\tau=0.5 and s=0.1s=0.1, (2) in Figure 5, we set τ=0.3\tau=0.3, vary ss between 0.050.05 and 0.40.4 as shown in the plot, and in some panels sample the non-null coefficients as Laplace(τ)\mathrm{Laplace}(\tau) random variables, (3) in Figure 7 we take τ=2\tau=2 and s=0.3s=0.3, (4) in Figure 8 we take τ=1\tau=1.

  3. 3.

    Sampling 𝐘\mathbf{Y}: Throughout we sample 𝐘𝐗𝒩(𝐗β,In)\mathbf{Y}\mid\mathbf{X}\sim\mathcal{N}(\mathbf{X}\beta,I_{n}), with only two exceptions. First, in Figure 7, we sample 𝐘𝐗𝒩(h(𝐗)β,In)\mathbf{Y}\mid\mathbf{X}\sim\mathcal{N}(h(\mathbf{X})\beta,I_{n}) where hh is a nonlinear function applied elementwise to 𝐗\mathbf{X}, for h(x)=sin(x),h(x)=cos(x),h(x)=x2h(x)=\sin(x),h(x)=\cos(x),h(x)=x^{2} and h(x)=x3h(x)=x^{3}. Second, in Figure 8, 𝐘\mathbf{Y} is binary and (Y=1X)=exp(XTβ)1+exp(XTβ)\mathbb{P}(Y=1\mid X)=\frac{\exp(X^{T}\beta)}{1+\exp(X^{T}\beta)}.

  4. 4.

    Sampling knockoffs: We sample MVR and SDP Gaussian knockoffs using the default parameters from knockpy version 1.3, both in the fixed-X and model-X case. Note that in the model-X case, we use the true covariance matrix Σ\Sigma to sample knockoffs, thus guaranteeing finite-sample FDR control.

  5. 5.

    Fitting feature statistics: We fit the following types of feature statistics throughout the simulations: LCD statistics, LSM statistics, a random forest with swap importances (Gimenez et al.,, 2019), DeepPINK (Lu et al.,, 2018), MLR statistics (linear variant), MLR statistics with splines, and the MLR oracle. In all cases, we use the default hyperparameters from knockpy version 1.3 and do not tune them, so the MLR statistics do not have well-specified priors. The exception is that the MLR oracle has access to the underlying data-generating process and the true coefficients β, which is why it is an “oracle.”
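The code sketch referenced in item 1 is given below: a minimal numpy rendering of the AR(1) row sampler and of the ER covariance construction. The helper names are ours, we read the eigenvalue shift as using the magnitude of the smallest eigenvalue, and the actual simulation code at the repository linked in Appendix H is authoritative.

import numpy as np

def sample_ar1_rows(n, p, rng):
    # "AR(1)" design: each row is a nonstationary Gaussian Markov chain with
    # X_j | X_{j-1} ~ N(rho_j X_{j-1}, 1) and rho_j = min(0.99, Beta(5,1)).
    rho = np.minimum(0.99, rng.beta(5, 1, size=p))
    X = np.zeros((n, p))
    X[:, 0] = rng.normal(size=n)
    for j in range(1, p):
        X[:, j] = rho[j] * X[:, j - 1] + rng.normal(size=n)
    return X

def erdos_renyi_corr(p, rng):
    # "ER" setting: ~20% of off-diagonal entries of V are Unif((-1,-0.1) U (0.1,1)),
    # V + V^T is shifted to be positive definite, then rescaled to a correlation matrix.
    V = np.zeros((p, p))
    nonzero = rng.random((p, p)) < 0.2
    np.fill_diagonal(nonzero, False)
    m = nonzero.sum()
    V[nonzero] = rng.choice([-1.0, 1.0], size=m) * rng.uniform(0.1, 1.0, size=m)
    A = V + V.T
    Sigma = A + (0.1 + abs(np.linalg.eigvalsh(A).min())) * np.eye(p)
    d = np.sqrt(np.diag(Sigma))
    return Sigma / np.outer(d, d)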

Now, recall that in Figure 7, we showed the results for q=0.1 because several competitor feature statistics made no discoveries at q=0.05. Figure 13 is the corresponding plot for q=0.05.

Appendix H Additional results for the real data applications

H.1 HIV drug resistance

For the HIV drug resistance application, Figures 16-21 show the same results as in Figure 1 but for all drugs in the protease inhibitor (PI), NRTI, and NNRTI drug classes; broadly, they show that MLR statistics have higher power because they ensure that the feature statistics with high absolute values are consistently positive, as discussed in Section 2. Note that in these plots, for the lasso-based statistics, we plot the normalized statistics Wj / max_i |Wi|, so that the absolute value of each statistic is at most one. Similarly, for the MLR statistics, instead of directly plotting the masked likelihood ratio as per Equation (1.3), we plot

Wj2(logit1(|MLRjπ|)0.5)=2((MLRjπ>0D)0.5)W_{j}^{\star\star}\coloneqq 2\left(\mathrm{logit}^{-1}(|\mathrm{MLR}_{j}^{\pi}|)-0.5\right)=2\left(\mathbb{P}(\mathrm{MLR}_{j}^{\pi}>0\mid D)-0.5\right)

because we find this quantity easier to interpret than a log likelihood ratio. In particular, Wj⋆⋆ ≈ 0 if and only if MLRjπ is roughly equally likely to be positive or negative given the masked data D, and Wj⋆⋆ ≈ 1 when MLRjπ is positive with probability near one given D.
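For reference, this plotting transformation is a one-liner; a sketch (with our own function name) is:

import numpy as np

def w_double_star(mlr):
    # W_j^** = 2 * (logit^{-1}(|MLR_j^pi|) - 0.5) = 2 * (P(MLR_j^pi > 0 | D) - 0.5).
    return 2.0 * (1.0 / (1.0 + np.exp(-np.abs(mlr))) - 0.5)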

Additionally, Figures 14 and 15 show the number of discoveries made by each feature statistic for SDP and MVR knockoffs, respectively, stratified by the drug in question. Note that the specific data analysis is identical to that of Barber and Candès, (2015) and Fithian and Lei, (2020) other than the choice of feature statistic—see either of those papers or https://github.com/amspector100/mlr_knockoff_paper for more details.

Figure 14: This figure shows the number of discoveries made by each feature statistic for each drug in the HIV drug resistance dataset, using SDP knockoffs.
Figure 15: This figure shows the number of discoveries made by each feature statistic for each drug in the HIV drug resistance dataset, using MVR knockoffs.
Figure 16: This figure is the same as Figure 1, except it shows results for drugs in the PI class using SDP knockoffs.
Figure 17: This figure is the same as Figure 1, except it shows results for drugs in the PI class using MVR knockoffs.
Figure 18: This figure is the same as Figure 1, except it shows results for drugs in the NRTI class using SDP knockoffs.
Figure 19: This figure is the same as Figure 1, except it shows results for drugs in the NRTI class using MVR knockoffs.
Figure 20: This figure is the same as Figure 1, except it shows results for drugs in the NNRTI class using SDP knockoffs.
Figure 21: This figure is the same as Figure 1, except it shows results for drugs in the NNRTI class using MVR knockoffs.

H.2 Financial factor selection

We now present a few additional details for the financial factor selection analysis from Section 6.2. First, we list the ten index funds we analyze: XLB (materials), XLC (communication services), XLE (energy), XLF (financials), XLK (information technology), XLP (consumer staples), XLRE (real estate), XLU (utilities), XLV (health care), and XLY (consumer discretionary). Second, for each feature statistic, Table 1 shows the average realized FDP across all ten analyses; as desired, the average FDP for each method is below the nominal level of q=0.05. All code is available at https://github.com/amspector100/mlr_knockoff_paper.

Knockoff Type   Feature Stat.   Average FDP
MVR             LCD             0.013636
MVR             LSM             0.004545
MVR             MLR             0.038571
SDP             LCD             0.000000
SDP             LSM             0.035000
SDP             MLR             0.039002
Table 1: This table shows the average FDP, defined above, for each method in the financial factor selection analysis from Section 6.2.