
Asymptotically Optimal Knockoff Statistics via the Masked Likelihood Ratio

Asher Spector Department of Statistics, Stanford University, USA William Fithian Department of Statistics, UC Berkeley, USA
Abstract

In feature selection problems, knockoffs are synthetic controls for the original features. Employing knockoffs allows analysts to use nearly any variable importance measure or “feature statistic” to select features while rigorously controlling false positives. However, it is not clear which statistic maximizes power. In this paper, we argue that state-of-the-art lasso-based feature statistics often prioritize features that are unlikely to be discovered, leading to low power in real applications. Instead, we introduce masked likelihood ratio (MLR) statistics, which prioritize features according to one’s ability to distinguish each feature from its knockoff. Although no single feature statistic is uniformly most powerful in all situations, we show that MLR statistics asymptotically maximize the number of discoveries under a user-specified Bayesian model of the data. (Like all feature statistics, MLR statistics always provide frequentist error control.) This result places no restrictions on the problem dimensions and makes no parametric assumptions; instead, we require a “local dependence” condition that depends only on known quantities. In simulations and three real applications, MLR statistics outperform state-of-the-art feature statistics, including in settings where the Bayesian model is highly misspecified. We implement MLR statistics in the python package knockpy; our implementation is often faster than computing a cross-validated lasso.

1 Introduction

Given a design matrix 𝐗=(𝐗1,,𝐗p)n×p\mathbf{X}=(\mathbf{X}_{1},\dots,\mathbf{X}_{p})\in\mathbb{R}^{n\times p} and a response vector 𝐘n\mathbf{Y}\in\mathbb{R}^{n}, the task of controlled feature selection is, informally, to discover features that influence 𝐘\mathbf{Y} while controlling the false discovery rate (FDR). In this context, knockoffs (Barber and Candès,, 2015; Candès et al.,, 2018) are fake variables 𝐗~n×p\widetilde{\mathbf{X}}\in\mathbb{R}^{n\times p} which act as negative controls for the features 𝐗\mathbf{X}. Remarkably, employing knockoffs allows analysts to use nearly any machine learning model or test statistic, often known interchangeably as a “feature statistic” or “knockoff statistic,” to select features while exactly controlling the FDR. As a result, knockoffs has become popular in the analysis of genetic studies, financial data, clinical trials, and more (Sesia et al.,, 2018, 2019; Challet et al.,, 2021; Sechidis et al.,, 2021).

The flexibility of knockoffs has inspired the development of a variety of feature statistics based on penalized regression coefficients, sparse Bayesian models, random forests, neural networks, and more (see, e.g., Barber and Candès, (2015); Candès et al., (2018); Gimenez et al., (2019); Lu et al., (2018)). These feature statistics not only reflect different modeling assumptions, but more fundamentally, they estimate different quantities, including coefficient sizes, Bayesian posterior inclusion probabilities, and various other measures of variable importance. Yet there has been relatively little theoretical comparison of these methods, in large part because analyzing the power of knockoffs can be very challenging; see Section 1.4. In this work, we develop a principled approach and concrete methods for designing knockoff statistics that maximize power.

1.1 Review of model-X and fixed-X knockoffs

This section reviews the key elements of Model-X (MX) and Fixed-X (FX) knockoff methods.

Model-X (MX) knockoffs (Candès et al.,, 2018) is a method to test the hypotheses Hj:𝐗j𝐘𝐗jH_{j}:\mathbf{X}_{j}\perp\!\!\!\perp\mathbf{Y}\mid\mathbf{X}_{-j}, where 𝐗j{𝐗}j\mathbf{X}_{-j}\coloneqq\{\mathbf{X}_{\ell}\}_{\ell\neq j} denotes all features except 𝐗j\mathbf{X}_{j}, assuming that the law of 𝐗\mathbf{X} is known.111Note that this assumption can be relaxed to having a well-specified parametric model for 𝐗\mathbf{X} (Huang and Janson,, 2020), and knockoffs are known to be robust to misspecification of the law of 𝐗\mathbf{X} (Barber et al.,, 2020). Applying MX knockoffs requires three steps.

1. Constructing knockoffs. Valid MX knockoffs must satisfy two properties. First, the columns of 𝐗\mathbf{X} must be pairwise exchangeable with the corresponding columns of 𝐗~\widetilde{\mathbf{X}}, i.e. [𝐗j,𝐗~j,𝐗j,𝐗~j]=d[𝐗j,𝐗~j,𝐗~j,𝐗j][\mathbf{X}_{-j},\widetilde{\mathbf{X}}_{-j},\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}]\stackrel{{\scriptstyle\mathrm{d}}}{{=}}[\mathbf{X}_{-j},\widetilde{\mathbf{X}}_{-j},\widetilde{\mathbf{X}}_{j},\mathbf{X}_{j}] must hold for all j[p]j\in[p]. Second, we require that 𝐗~𝐘𝐗\widetilde{\mathbf{X}}\perp\!\!\!\perp\mathbf{Y}\mid\mathbf{X}, which holds if one constructs 𝐗~\widetilde{\mathbf{X}} without looking at 𝐘\mathbf{Y}. Informally, these constraints guarantee that 𝐗j,𝐗~j\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j} are “indistinguishable” under HjH_{j}. Sampling knockoffs can be challenging, but this problem is well studied (e.g., Bates et al.,, 2020).

2. Fitting feature statistics. Next, use any machine learning (ML) algorithm to fit feature importances Z=z([𝐗,𝐗~],𝐘)2pZ=z([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y})\in\mathbb{R}^{2p}, where ZjZ_{j} and Zj+pZ_{j+p} heuristically measure the “importance” of 𝐗j\mathbf{X}_{j} and 𝐗~j\widetilde{\mathbf{X}}_{j} in predicting 𝐘\mathbf{Y}. The only restriction is that swapping 𝐗j\mathbf{X}_{j} and 𝐗~j\widetilde{\mathbf{X}}_{j} must also swap the feature importances ZjZ_{j} and Zj+pZ_{j+p} without changing any of the other feature importances {Z}j,{Z+p}j\{Z_{\ell}\}_{\ell\neq j},\{Z_{\ell+p}\}_{\ell\neq j}. This restriction is satisfied by most ML algorithms, such as the lasso or various neural networks (Lu et al.,, 2018).

Given ZZ, we define the feature statistics W=w([𝐗,𝐗~],𝐘)pW=w([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y})\in\mathbb{R}^{p} via Wj=f(Zj,Zj+p)W_{j}=f(Z_{j},Z_{j+p}) where f(x,y)=f(y,x)f(x,y)=-f(y,x) is any antisymmetric function. E.g., the lasso coefficient difference (LCD) statistic sets Wj=|Zj||Zj+p|W_{j}=|Z_{j}|-|Z_{j+p}|, where ZjZ_{j} and Zj+pZ_{j+p} are coefficients from a lasso fit on [𝐗,𝐗~],𝐘[\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y}. Intuitively, when WjW_{j} is positive, this suggests that 𝐗j\mathbf{X}_{j} is more important than 𝐗~j\widetilde{\mathbf{X}}_{j} and thus is evidence against the null. Indeed, Steps 1-2 guarantee that the signs of the null {Wj}j=1p\{W_{j}\}_{j=1}^{p} are i.i.d. random signs.
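To make this construction concrete, the following minimal numpy sketch (an illustration, not the paper's implementation) combines fitted importances Z into feature statistics W using the difference-of-absolute-values choice of f, i.e., the LCD statistic.

```python
import numpy as np

def feature_statistics(Z):
    """Combine importances Z = (Z_1, ..., Z_2p) into W = (W_1, ..., W_p) via
    W_j = f(Z_j, Z_{j+p}) with the antisymmetric choice f(x, y) = |x| - |y| (LCD)."""
    Z = np.asarray(Z, dtype=float)
    p = Z.shape[0] // 2
    # Swapping X_j with its knockoff swaps Z_j and Z_{j+p}, which flips the sign of
    # W_j and leaves the other coordinates unchanged, as the validity condition requires.
    return np.abs(Z[:p]) - np.abs(Z[p:])
```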

3. Make rejections. Define the data-dependent threshold Tinf{t>0:#{j:Wjt}+1#{Wjt}q}T\coloneqq\inf\left\{t>0:\frac{\#\{j:W_{j}\leq-t\}+1}{\#\{W_{j}\geq t\}}\leq q\right\}, where inf\inf\emptyset\coloneqq\infty. Then, reject S{j:WjT}S\coloneqq\{j:W_{j}\geq T\}, which guarantees finite-sample FDR control at level q(0,1)q\in(0,1). Note this result does not require any assumptions about the law of 𝐘𝐗\mathbf{Y}\mid\mathbf{X}.
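For illustration, the short numpy sketch below computes this data-dependent threshold and the resulting rejection set. The max with one in the denominator is a standard implementation convenience that avoids dividing by zero; packages such as knockpy already provide this step, so the sketch is only for exposition.

```python
import numpy as np

def knockoff_threshold(W, q):
    """T = inf{t > 0 : (#{j : W_j <= -t} + 1) / #{j : W_j >= t} <= q}, with inf(empty set) = infinity."""
    for t in np.sort(np.abs(W[W != 0])):  # it suffices to scan t over the nonzero |W_j|
        fdp_hat = (np.sum(W <= -t) + 1) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf

def knockoff_rejections(W, q):
    """Rejection set S = {j : W_j >= T}."""
    return np.where(W >= knockoff_threshold(W, q))[0]
```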

Theorem 1.1 (Candès et al., (2018)).

Let PP^{\star} denote the unknown joint law of (𝐗,𝐘)(\mathbf{X},\mathbf{Y}), and suppose the law of 𝐗PX\mathbf{X}\sim P_{X}^{\star} is known, allowing one to construct valid knockoffs 𝐗~\tilde{\mathbf{X}}. 222Typically, one assumes that the observations are i.i.d. to construct valid knockoffs, but the i.i.d. assumption is not necessary as long as 𝐗~\widetilde{\mathbf{X}} are valid knockoffs. Then for any feature statistic ww,

FDR𝔼P[|S0|1|S|]q,\mathrm{FDR}\coloneqq\mathbb{E}_{P^{\star}}\left[\frac{|S\cap\mathcal{H}_{0}|}{1\vee|S|}\right]\leq q,

where 0={j[p]:Hj is true}\mathcal{H}_{0}=\{j\in[p]:H_{j}\text{ is true}\} is the set of nulls under PP^{\star}.

Fixed-X (FX) knockoffs (Barber and Candès,, 2015) treats 𝐗\mathbf{X} as nonrandom and yields exact FDR control under the Gaussian linear model 𝐘𝐗𝒩(𝐗β,σ2In)\mathbf{Y}\mid\mathbf{X}\sim\mathcal{N}(\mathbf{X}\beta,\sigma^{2}I_{n}). Fitting FX knockoffs is identical to the steps above with two exceptions:

  1. FX knockoffs need not satisfy the constraints in Step 1: instead, 𝐗~\widetilde{\mathbf{X}} must satisfy (i) 𝐗~T𝐗~=𝐗T𝐗\widetilde{\mathbf{X}}^{T}\widetilde{\mathbf{X}}=\mathbf{X}^{T}\mathbf{X} and (ii) 𝐗~T𝐗=𝐗T𝐗Δ\widetilde{\mathbf{X}}^{T}\mathbf{X}=\mathbf{X}^{T}\mathbf{X}-\Delta, for some diagonal matrix Δ\Delta satisfying 2𝐗T𝐗Δ2\mathbf{X}^{T}\mathbf{X}\succ\Delta.

  2. The feature importances ZZ can only depend on 𝐘\mathbf{Y} through [𝐗,𝐗~]T𝐘[\mathbf{X},\widetilde{\mathbf{X}}]^{T}\mathbf{Y}, which permits the use of many test statistics, but not all (for example, this prohibits the use of cross-validation).

Our theory applies to both MX and FX knockoffs, but we often focus on the MX context for brevity.

1.2 Theoretical problem statement

This section defines two types of optimal knockoff statistics: oracle statistics, which maximize the expected number of discoveries (ENDisc) for the true (unknown) data distribution PP^{\star}, and Bayes-optimal statistics, which maximize ENDisc  with respect to a prior distribution over PP^{\star}. We focus on the expected number of discoveries since it greatly simplifies the analysis and all feature statistics provably control the frequentist FDR anyway. However, Section 3.4 extends our analysis to consider the expected number of true discoveries.

Let Sw[p]S_{w}\subset[p] denote the discovery set using feature statistic ww on data ([𝐗,𝐗~],𝐘)([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y}). An oracle statistic maximizes the expected number of discoveries under PP^{\star} as defined below:

ENDisc(w)𝔼P[|Sw|].\texttt{ENDisc}^{\star}(w)\coloneqq\,\,\mathbb{E}_{P^{\star}}\left[|S_{w}|\right]. (1.1)

Next, let 𝒫={P(θ):θΘ}\mathcal{P}=\{P^{(\theta)}:\theta\in\Theta\} denote a model class of potential distributions for (𝐗,𝐘)(\mathbf{X},\mathbf{Y}) and let π:Θ0\pi:\Theta\to\mathbb{R}_{\geq 0} denote a prior density over 𝒫\mathcal{P}.333We implicitly assume all elements of 𝒫\mathcal{P} are consistent with the core assumptions of MX/FX knockoffs (see Section 1.1). For example, when employing MX knockoffs, all P(θ)𝒫P^{(\theta)}\in\mathcal{P} should specify the correct marginal law for XX. Although this is not necessary for our theoretical results, it is necessary for the computational techniques in Section 4. A Bayes-optimal statistic maximizes the average-case expected number of discoveries with respect to π\pi:

ENDiscπ(w)Θ𝔼P(θ)[|Sw|]π(θ)𝑑θ𝔼Pπ[|Sw|],\texttt{ENDisc}^{\pi}(w)\coloneqq\int_{\Theta}\mathbb{E}_{P^{(\theta)}}[|S_{w}|]\pi(\theta)d\theta\coloneqq\mathbb{E}_{P^{\pi}}[|S_{w}|], (1.2)

where above, PπP^{\pi} denotes the mixture distribution which first samples a parameter θΘ\theta^{\star}\in\Theta according to the prior π\pi and then samples (𝐗,𝐘)θP(θ)(\mathbf{X},\mathbf{Y})\mid\theta^{\star}\sim P^{(\theta^{\star})}. We refer to PπP^{\pi} as a “Bayesian model,” and we give a default choice of PπP^{\pi} (based on sparse generalized additive models) in Section 4.2. Our paper introduces statistics mlroracle\mathrm{mlr}^{\mathrm{oracle}} and mlrπ\mathrm{mlr}^{\pi} that asymptotically maximize ENDisc\texttt{ENDisc}^{\star} and ENDiscπ\texttt{ENDisc}^{\pi}, respectively.

While introducing a prior distribution may seem a strong assumption to some readers, Bayesian models are routinely used in applications where knockoffs are commonly applied, such as genetic fine-mapping (e.g., Guan and Stephens,, 2011; Weissbrod et al.,, 2020). Furthermore, in simulations and real applications, our approach yields significant power gains over pre-existing approaches even when the prior is highly misspecified. Lastly, we emphasize that Theorem 1.1 guarantees that using mlrπ\mathrm{mlr}^{\pi} as a feature statistic guarantees frequentist FDR control under the true law PP^{\star} of the data, even if PP^{\star} is not a member of 𝒫\mathcal{P}—and in this Type I error result, the conditional independence null hypotheses H1,,HpH_{1},\ldots,H_{p} are defined nonparametrically with respect to the unknown PP^{\star}, not with respect to PπP^{\pi}.

1.3 Contribution and overview of results

This paper develops masked likelihood ratio (MLR) statistics, a class of feature statistics that are asymptotically optimal, computationally efficient, and powerful in applications. To derive these statistics, we reformulate MX knockoffs as a guessing game on masked data D=(𝐘,{𝐗j,𝐗~j}j=1p),D=(\mathbf{Y},\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}_{j=1}^{p}), where the notation {𝐗j,𝐗~j}\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\} denotes an unordered set.444For brevity, this section only presents results for MX knockoffs. See Section 3 for analogous results for the FX case. After observing DD, the analyst must do as follows:

  • For each j[p]j\in[p], the analyst must produce a guess 𝐗^jn\widehat{\mathbf{X}}_{j}\in\mathbb{R}^{n} of the value of the feature 𝐗j\mathbf{X}_{j} based on DD. Note that given DD, 𝐗^j{𝐗j,𝐗~j}\widehat{\mathbf{X}}_{j}\in\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\} takes one of two values.

  • The analyst must then assign an order to their pp guesses, ideally from most to least promising.

  • The analyst may make kk discoveries if roughly kk of their first (1+q)k(1+q)k guesses are correct (according to the order they specify), where qq is the FDR level. Here, guess jj is “correct” if 𝐗^j=𝐗j\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j}.

We show that to maximize the expected number of discoveries, an asymptotically optimal strategy is:

  • For each j[p]j\in[p], guess the value 𝐗^j{𝐗j,𝐗~j}\widehat{\mathbf{X}}_{j}\in\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\} which is conditionally more likely given DD (see below).

  • Order the guesses in descending order of the probability that each guess is correct.

In the traditional language of knockoffs, this corresponds to using the masked data to compute the log-likelihood ratio between the two possible values of 𝐗j\mathbf{X}_{j} (namely {𝐗j,𝐗~j})\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}) given DD. Precisely, let PπP^{\pi} denote a Bayesian model as defined in Section 1.2. Then for any value 𝐝=(𝐲,{𝐱j,𝐱~j}j=1p)\mathbf{d}=(\mathbf{y},\{\mathbf{x}_{j},\widetilde{\mathbf{x}}_{j}\}_{j=1}^{p}) in the support of DD and any fixed 𝐱{𝐱j,𝐱~j}\mathbf{x}\in\{\mathbf{x}_{j},\widetilde{\mathbf{x}}_{j}\}, let Pjπ(𝐱𝐝)=Pπ(𝐗j=𝐱D=𝐝)P^{\pi}_{j}(\mathbf{x}\mid\mathbf{d})=P^{\pi}(\mathbf{X}_{j}=\mathbf{x}\mid D=\mathbf{d}) denote the conditional law of 𝐗j\mathbf{X}_{j} given DD. The masked likelihood ratio (MLR) statistic is defined as

MLRjπmlrjπ([𝐗,𝐗~],𝐘)log(Pjπ(𝐗jD)Pjπ(𝐗~jD)).\mathrm{MLR}_{j}^{\pi}\coloneqq\mathrm{mlr}_{j}^{\pi}([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y})\coloneqq\log\left(\frac{P^{\pi}_{j}(\mathbf{X}_{j}\mid D)}{P^{\pi}_{j}(\widetilde{\mathbf{X}}_{j}\mid D)}\right). (1.3)

In words, the numerator plugs the observed values of 𝐗j\mathbf{X}_{j} and DD into the conditional law of 𝐗jD\mathbf{X}_{j}\mid D under PπP^{\pi}, and the denominator plugs 𝐗~j\widetilde{\mathbf{X}}_{j} and DD into the same law. Since swapping 𝐗j\mathbf{X}_{j} and 𝐗~j\widetilde{\mathbf{X}}_{j} flips the sign of MLRjπ\mathrm{MLR}_{j}^{\pi} without changing the values of {MLRπ}j\{\mathrm{MLR}_{\ell}^{\pi}\}_{\ell\neq j}, this equation defines a valid knockoff statistic. See Section 1.4 for comparison to Katsevich and Ramdas, (2020)’s (unmasked) likelihood ratio statistic.

This paper gives three arguments motivating the use of MLR statistics.

1. Intuition: the right notion of variable importance. Existing feature statistics measure many different proxies for variable importance, ranging from regression coefficients to posterior inclusion probabilities. However, Section 2 shows that popular lasso-based methods incorrectly prioritize features 𝐗j\mathbf{X}_{j} that are predictive of 𝐘\mathbf{Y} but are nearly indistinguishable from their knockoffs 𝐗~j\widetilde{\mathbf{X}}_{j}, leading to low power in real applications. In contrast to conventional variable importances, MLR statistics instead estimate whether a feature 𝐗j\mathbf{X}_{j} is distinguishable from its knockoff 𝐗~j\widetilde{\mathbf{X}}_{j}.

2. Theory: asymptotic Bayes-optimality. Section 3.3 shows that MLR statistics asymptotically maximize the number of expected discoveries under PπP^{\pi}. Namely, under technical assumptions, Theorem 3.2 shows that for any valid feature statistic ww,

ENDiscπ(mlrπ)ENDiscπ(w)+o(# of non-nulls).\texttt{ENDisc}^{\pi}(\mathrm{mlr}^{\pi})\geq\texttt{ENDisc}^{\pi}(w)+o(\text{\# of non-nulls}). (1.4)

Our result applies to arbitrarily high-dimensional asymptotic regimes and allows PπP^{\pi} to take any form—we do not assume that 𝐘𝐗\mathbf{Y}\mid\mathbf{X} follows a linear model under PπP^{\pi}. Instead, we assume the signs of the MLR statistics satisfy a local dependency condition, similar to dependency conditions often assumed on p-values (Genovese and Wasserman,, 2004; Storey et al.,, 2004; Ferreira and Zwinderman,, 2006; Farcomeni,, 2007). Our condition does not involve unknown quantities, so it can be diagnosed in practice.

Despite the Bayesian nature of this optimality result, we emphasize that MLR statistics are valid knockoff statistics. Thus, if Smlr[p]S_{\mathrm{mlr}}\subset[p] are the discoveries made by the mlrπ\mathrm{mlr}^{\pi} feature statistic, Theorem 1.1 shows that the frequentist FDR is controlled in finite samples assuming only that 𝐗~\widetilde{\mathbf{X}} are valid knockoffs:

FDR𝔼P[|Smlr0|1|Smlr|]q.\mathrm{FDR}\coloneqq\mathbb{E}_{P^{\star}}\left[\frac{|S_{\mathrm{mlr}}\cap\mathcal{H}_{0}|}{1\vee|S_{\mathrm{mlr}}|}\right]\leq q. (1.5)

3. Empirical results. We demonstrate via simulations and three real data analyses that MLR statistics are powerful in practice, even when the user-specified Bayesian model PπP^{\pi} is highly misspecified.

  • We develop concrete instantiations of MLR statistics based on uninformative (sparse) priors for generalized additive models and binary GLMs. Our Python implementation is often faster than fitting a cross-validated lasso.

  • In simulations, MLR statistics outperform other state-of-the-art feature statistics, often by wide margins. Even when PπP^{\pi} is highly misspecified, MLR statistics often nearly match the performance of the oracle which sets Pπ=PP^{\pi}=P^{\star}. Furthermore, when 𝐘\mathbf{Y} has a highly nonlinear relationship with 𝐗\mathbf{X}, MLR statistics also outperform “black-box” feature statistics based on neural networks and random forests.

  • We replicate three knockoff-based analyses of drug resistance (Barber and Candès,, 2015), financial factor selection (Challet et al.,, 2021), and RNA-seq data (Li and Maathuis,, 2019). We find that MLR statistics (with an uninformative prior) make one to ten times more discoveries than the original analyses.

Overall, our results suggest that MLR statistics can substantially increase the power of knockoffs.

1.4 Related literature

The literature contains many feature statistics, which can (roughly) be separated into three categories. First, perhaps the most common feature statistics are based on penalized regression coefficients, notably the lasso signed maximum (LSM) and lasso coefficient difference (LCD) statistics (Barber and Candès,, 2015). Indeed, these lasso-based statistics are often used in applied work (e.g., Sesia et al.,, 2019) and have received much theoretical attention (Weinstein et al.,, 2017; Fan et al.,, 2020; Ke et al.,, 2020; Weinstein et al.,, 2020; Wang and Janson,, 2021). Perhaps surprisingly, we argue that many of these statistics target the wrong notion of variable importance, leading to reduced power. Second, some works have introduced Bayesian knockoff statistics (e.g., Candès et al.,, 2018; Ren and Candès,, 2020). MLR statistics have a Bayesian flavor but take a different form than previous statistics. Furthermore, our motivation differs from those of previous works: the real innovation of MLR statistics is to estimate a masked likelihood ratio, and we mainly use a Bayesian framework to quantify uncertainty about nuisance parameters (see Section 3.2). In contrast, previous works largely motivated Bayesian statistics as a way to incorporate prior information (Candès et al.,, 2018; Ren and Candès,, 2020). That said, an important special case of MLR statistics is similar to the “BVS” statistics from Candès et al., (2018), as discussed in Section 4. Third, many feature statistics take advantage of “black-box” ML to assign variable importances (e.g., Lu et al.,, 2018; Gimenez et al.,, 2019). Empirically, our implementation of MLR statistics based on regression splines outperforms “black-box” feature statistics in Section 5.

Previous power analyses for knockoffs have largely focused on showing the consistency of coefficient-difference feature statistics (Liu and Rigollet,, 2019; Fan et al.,, 2020; Spector and Janson,, 2022) or quantifying the power of coefficient-difference feature statistics assuming 𝐗\mathbf{X} has i.i.d. Gaussian entries (Weinstein et al.,, 2017, 2020; Wang and Janson,, 2021). Ke et al., (2020) also derive a phase diagram for LCD statistics assuming 𝐗\mathbf{X} is blockwise orthogonal. Our goal is different: to show that MLR statistics are asymptotically optimal, with particular focus on settings where the asymptotic power lies strictly between 0 and 11. Furthermore, the works above exclusively focus on Gaussian linear models, whereas our analysis places no explicit restrictions on the law of 𝐘𝐗\mathbf{Y}\mid\mathbf{X} or the dimensionality of the problem. Instead, we assume the signs of the MLR statistics satisfy a local dependency condition, similar to common dependency conditions on p-values (Genovese and Wasserman,, 2004; Storey et al.,, 2004; Ferreira and Zwinderman,, 2006; Farcomeni,, 2007). However, our proof technique is novel and specific to knockoffs.

Our theory builds on Li and Fithian, (2021), who developed knockoff\star, a provably optimal oracle statistic for FX knockoffs—in fact, oracle MLR statistics are equivalent to knockoff\star for FX knockoffs. Our work also builds on Katsevich and Ramdas, (2020), who showed that unmasked likelihood statistics maximize P(Wj>0)P^{\star}(W_{j}>0). MLR statistics also have this property, although we show the stronger result that MLR statistics maximize the expected number of overall discoveries. Another key difference is that unmasked likelihood statistics are not jointly valid knockoff statistics (see Appendix D.1). Thus, unmasked likelihood statistics do not yield provable FDR control, whereas MLR statistics do. Lastly, we note that the oracle procedures derived in these two works cannot be used in practice since they depend on unknown parameters. To our knowledge, MLR statistics are the first usable knockoff statistics with explicit optimality guarantees.

1.5 Notation and outline

Notation: Let 𝐗n×p\mathbf{X}\in\mathbb{R}^{n\times p} and 𝐘n\mathbf{Y}\in\mathbb{R}^{n} denote the design matrix and response vector in a feature selection problem with nn data points and pp features. We let the non-bold versions X=(X1,,Xp)pX=(X_{1},\dots,X_{p})\in\mathbb{R}^{p} and YY\in\mathbb{R} denote the features and response for any arbitrary observation. For kk\in\mathbb{N}, define [k]{1,,k}[k]\coloneqq\{1,\dots,k\}. For any Mm×kM\in\mathbb{R}^{m\times k} and J[k]J\subset[k], MJM_{J} denotes the columns of MM corresponding to the indices in JJ. Similarly, MJM_{-J} denotes the columns of MM which do not appear in JJ, and MjM_{-j} denotes all columns except column j[k]j\in[k]. For matrices M1n×k1,M2n×k2M_{1}\in\mathbb{R}^{n\times k_{1}},M_{2}\in\mathbb{R}^{n\times k_{2}}, [M1,M2]n×(k1+k2)[M_{1},M_{2}]\in\mathbb{R}^{n\times(k_{1}+k_{2})} denotes the column-wise concatenation of M1,M2M_{1},M_{2}. InI_{n} denotes the n×nn\times n identity. Throughout, PP^{\star} denotes the true (unknown) law of 𝐗,𝐘\mathbf{X},\mathbf{Y}, and PπP^{\pi} denotes a user-specified Bayesian model of the law of 𝐗,𝐘,θ\mathbf{X},\mathbf{Y},\theta^{\star} as defined in Section 1.2.

Outline: Section 2 gives intuition explaining why popular feature statistics may have low power, using an HIV resistance dataset as motivation. Section 3 introduces MLR statistics and presents our theoretical results. Section 4 discusses computation and suggests default choices of the Bayesian model PπP^{\pi}. Section 5 compares MLR statistics to competitors via simulations. Section 6 applies MLR statistics to three real datasets. Section 7 discusses future directions.

2 Intuition and motivation from an HIV drug resistance dataset

2.1 Intuition: what makes knockoffs powerful?

Given a vector of knockoff statistics WpW\in\mathbb{R}^{p}, the number of discoveries is determined as follows:

  • Step 1: Let σ:[p][p]\sigma:[p]\to[p] denote the permutation such that |Wσ(1)||Wσ(2)||Wσ(p)||W_{\sigma(1)}|\geq|W_{\sigma(2)}|\geq\dots\geq|W_{\sigma(p)}|.

  • Step 2: Let kk be the largest integer such that at least (k+1)/(1+q)\left\lceil(k+1)/(1+q)\right\rceil of the kk feature statistics Wσ(1),,Wσ(k)W_{\sigma(1)},\dots,W_{\sigma(k)} with the largest absolute values have positive signs. Then the analyst may discover the features corresponding to the positive signs among Wσ(1),,Wσ(k)W_{\sigma(1)},\dots,W_{\sigma(k)}.

The procedure above is equivalent to using the “data-dependent threshold” from Section 1.1 (see Barber and Candès,, 2015). This characterization suggests that to make many discoveries, we should:

  • Goal 1: Maximize the probability that each coordinate WjW_{j} has a positive sign. (Note that null coordinates are guaranteed to be symmetric.)

  • Goal 2: Assign absolute values such that coordinates WjW_{j} with larger absolute values also have higher probabilities of being positive. This ensures that for each kk, the kk feature statistics with the highest absolute values contain as many positive signs as possible, thus maximizing the number of discoveries. Although it is not yet clear how to formalize this goal, intuitively, we would like {|Wj|}j=1p\{|W_{j}|\}_{j=1}^{p} to have the same order as {P(Wj>0)}j=1p\{P^{\star}(W_{j}>0)\}_{j=1}^{p}.

We emphasize that the second goal is crucial to make discoveries when {P(Wj>0)}j=1p\{P^{\star}(W_{j}>0)\}_{j=1}^{p} is heterogeneous, as illustrated in Section 2.2. See also Appendix A for a simpler simulated example.

2.2 Motivation from the HIV drug resistance dataset

We now ask: do the most common choices of feature statistics used in the literature, LCD and LSM statistics, accomplish Goals 1-2? We argue no, using the HIV drug resistance dataset from Rhee et al., (2006) as an illustrative example. This dataset has been used as a benchmark in several papers about knockoffs, e.g., Barber and Candès, (2015); Romano et al., (2020), and we perform a complete analysis of this dataset in Section 6. For now, note that the design 𝐗\mathbf{X} consists of genotype data from n750n\approx 750 HIV samples, the response 𝐘\mathbf{Y} measures the resistance of each sample to a drug (in this case Indinavir), and we apply knockoffs to discover which genetic variants affect drug resistance—note our analysis exactly mimics that of Barber and Candès, (2015). As notation, let (β^(λ),β~(λ))2p(\hat{\beta}^{(\lambda)},\tilde{\beta}^{(\lambda)})\in\mathbb{R}^{2p} denote the estimated lasso coefficients fit on [𝐗,𝐗~][\mathbf{X},\widetilde{\mathbf{X}}] and 𝐘\mathbf{Y} with regularization parameter λ\lambda. Furthermore, let λ^j\hat{\lambda}_{j} (resp. λ~j)\tilde{\lambda}_{j}) denote the largest value of λ\lambda such that β^j(λ)0\hat{\beta}^{(\lambda)}_{j}\neq 0 (resp. β~j(λ)0\tilde{\beta}^{(\lambda)}_{j}\neq 0). Then the LCD and LSM statistics are defined as:

WjLCD=|β^j(λ)||β~j(λ)|,WjLSM=sign(λ^jλ~j)max(λ^j,λ~j).W_{j}^{\mathrm{LCD}}=|\hat{\beta}_{j}^{(\lambda)}|-|\tilde{\beta}_{j}^{(\lambda)}|,\,\,\,\,\,\,\,\,\,\,W_{j}^{\mathrm{LSM}}=\operatorname*{sign}(\hat{\lambda}_{j}-\tilde{\lambda}_{j})\max(\hat{\lambda}_{j},\tilde{\lambda}_{j}). (2.1)
Figure 1: We plot the first 5050 LCD, LSM, and MLR feature statistics sorted in descending order of absolute value when applied to the HIV drug resistance dataset for the drug Indinavir (IDV). For FDR level q=0.05q=0.05, all positive feature statistics to the left of the dotted black line are discoveries. This figure shows that when 𝐗\mathbf{X} is correlated, LCD and LSM statistics make few discoveries because they occasionally yield highly negative WW-statistics for highly predictive variables that have low-quality knockoffs, such as the “P90.M” variant from Section 2. In contrast, MLR statistics (defined in Section 3) deprioritize the P90.M variant; although they still do not discover P90.M, this deprioritization allows the discovery of 2222 other features. For visualization, we apply a monotone transformation to {Wj}\{W_{j}\} such that |Wj|1|W_{j}|\leq 1, which (provably) does not change the performance of knockoffs. See Appendix H for further details and corresponding plots for the other fifteen drugs in the dataset.
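Both statistics in Eq. (2.1) can be computed from lasso fits on the augmented design [X, X~]. The sketch below uses scikit-learn and is only a rough illustration: the entry values (the penalties at which each column first enters the path) are approximated on a finite grid, and scikit-learn's alpha parameterization differs from some conventions by a factor of the sample size.

```python
import numpy as np
from sklearn.linear_model import Lasso, lasso_path

def lcd_statistics(X, Xk, y, lam):
    """LCD: W_j = |beta_hat_j| - |beta_tilde_j| from a lasso fit on [X, Xk] at penalty lam."""
    p = X.shape[1]
    beta = Lasso(alpha=lam).fit(np.hstack([X, Xk]), y).coef_
    return np.abs(beta[:p]) - np.abs(beta[p:])

def lsm_statistics(X, Xk, y, n_alphas=200):
    """LSM: W_j = sign(lam_hat_j - lam_tilde_j) * max(lam_hat_j, lam_tilde_j), where
    lam_hat_j approximates the largest penalty at which feature j enters the lasso path."""
    p = X.shape[1]
    alphas, coefs, _ = lasso_path(np.hstack([X, Xk]), y, n_alphas=n_alphas)
    nonzero = np.abs(coefs) > 0                     # shape (2p, n_alphas); alphas are decreasing
    entry = np.where(nonzero.any(axis=1), alphas[np.argmax(nonzero, axis=1)], 0.0)
    lam_hat, lam_tilde = entry[:p], entry[p:]
    return np.sign(lam_hat - lam_tilde) * np.maximum(lam_hat, lam_tilde)
```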

As intuition, imagine that a feature 𝐗j\mathbf{X}_{j} appears to influence 𝐘\mathbf{Y}: however, due to high correlations within 𝐗\mathbf{X}, we must create a knockoff 𝐗~j\widetilde{\mathbf{X}}_{j} which is highly correlated with 𝐗j\mathbf{X}_{j}. For example, the “P90.M” variant in the HIV dataset is extremely predictive of resistance to Indinavir (IDV), as its OLS t-statistic is 8.95\approx 8.95. However, in the original analysis, P90.M is >99%>99\% correlated with its knockoff, so the lasso may select 𝐗~j\widetilde{\mathbf{X}}_{j} instead of 𝐗j\mathbf{X}_{j}. Furthermore, since the lasso induces sparsity, it is unlikely to select both 𝐗j\mathbf{X}_{j} and 𝐗~j\widetilde{\mathbf{X}}_{j} as they are highly correlated. Thus, WjLCDW_{j}^{\mathrm{LCD}} and WjLSMW_{j}^{\mathrm{LSM}} will have large absolute values, since 𝐗j\mathbf{X}_{j} appears significant, and a reasonably high probability of being negative, since 𝐗j𝐗~j\mathbf{X}_{j}\approx\widetilde{\mathbf{X}}_{j}. Indeed, the LCD and LSM statistics for P90.M have, respectively, the largest and second-largest absolute values among all genetic variants, but both statistics are negative because the lasso selected the knockoff instead of the feature.

Figure 1 shows that this misprioritization prevents the LCD and LSM statistics from making any discoveries when q=0.05q=0.05. Yet this problem is avoidable. If Corr(Xj,X~j)\operatorname{Corr}(X_{j},\widetilde{X}_{j}) is large and WjW_{j} may be negative, we can “deprioritize” WjW_{j} by lowering its absolute value. As shown by Figure 1, this is exactly what MLR statistics do for the P90.M variant. Although this does not allow us to discover P90.M, it does allow us to discover 2222 other features.

Remark 1 (Alternative solutions).

To our knowledge, this problem with lasso statistics has not been previously discussed (see Section 1.4). Once pointed out, there are many intuitive approaches that mitigate (but do not fully solve) this problem, such as studentizing the coefficients or adding a ridge penalty. These practical ideas may merit further exploration; however, we focus on obtaining optimal feature statistics. Furthermore, some may argue that the best solution is simply to ensure that P90.M is less correlated with its knockoff. We wholeheartedly agree that the SDP knockoff construction from Barber and Candès, (2015) is sub-optimal here, and thus, we also use an alternative knockoff construction in Section 6. Yet reasonably strong correlations between features and knockoffs are inevitable when the features are correlated (Dai and Barber,, 2016). However, knockoffs can still be powerful in these settings if the feature statistics properly account for the (known) dependencies among [𝐗,𝐗~][\mathbf{X},\widetilde{\mathbf{X}}]. MLR statistics are designed to do this.

3 Masked likelihood ratio statistics

This section introduces and analyzes MLR statistics. First, Section 3.1 introduces the notation needed to define MLR statistics. Then, Section 3.2 defines MLR statistics, and Section 3.3 shows that MLR statistics asymptotically maximize the expected number of discoveries. Finally, Section 3.4 introduces an adjusted MLR statistic that asymptotically maximizes the expected number of true discoveries.

3.1 Knockoffs as inference on masked data

Section 2 argued that to maximize power, we should assign WjW_{j} a large absolute value if and only if P(Wj>0)P^{\star}(W_{j}>0) is large. To do this, we must estimate P(Wj>0)P^{\star}(W_{j}>0) from the data, but we cannot use all the data for this purpose: e.g., we cannot directly adjust |W||W| based on sign(W)\operatorname*{sign}(W) without violating FDR control. To resolve this ambiguity, we reformulate knockoffs as inference on masked data.

Definition 3.1.

Suppose we observe data 𝐗,𝐘\mathbf{X},\mathbf{Y}, knockoffs 𝐗~\widetilde{\mathbf{X}}, and independent random noise UU. (UU may be used to fit a randomized feature statistic.) The masked data DD is defined as

D={(𝐘,{𝐗j,𝐗~j}j=1p,U) for model-X knockoffs(𝐗,𝐗~,{𝐗jT𝐘,𝐗~jT𝐘}j=1p,U) for fixed-X knockoffs.D=\begin{cases}(\mathbf{Y},\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}_{j=1}^{p},U)&\text{ for model-X knockoffs}\\ (\mathbf{X},\widetilde{\mathbf{X}},\{\mathbf{X}_{j}^{T}\mathbf{Y},\widetilde{\mathbf{X}}_{j}^{T}\mathbf{Y}\}_{j=1}^{p},U)&\text{ for fixed-X knockoffs.}\end{cases} (3.1)

As shown in Propositions 3.1-3.2, the masked data DD is all the data we may use when assigning magnitudes to WW, and knockoffs will be powerful when we can recover the full data from DD.

Proposition 3.1.

Let 𝐗~\widetilde{\mathbf{X}} be model-X knockoffs such that 𝐗j𝐗~j\mathbf{X}_{j}\neq\widetilde{\mathbf{X}}_{j} a.s. for j[p]j\in[p]. Then W=w([𝐗,𝐗~],𝐘)W=w([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y}) is a valid feature statistic if and only if:

  1. |W||W| is a function of the masked data DD.

  2. For all j[p]j\in[p], there exists a DD-measurable random vector 𝐗^j\widehat{\mathbf{X}}_{j} such that Wj>0W_{j}>0 if and only if 𝐗^j=𝐗j\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j}.

Proposition 3.1 reformulates knockoffs as a guessing game, where we produce a “guess” 𝐗^j{𝐗j,𝐗~j}\widehat{\mathbf{X}}_{j}\in\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\} of the value of 𝐗j\mathbf{X}_{j} based on D=(𝐘,{𝐗j,𝐗~j}j=1p)D=(\mathbf{Y},\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}_{j=1}^{p}). If our guess is right, meaning 𝐗^j=𝐗j\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j}, then we are rewarded and Wj>0W_{j}>0: else Wj<0W_{j}<0. To avoid highly negative WW-statistics, we should only assign WjW_{j} a large absolute value when we are confident that our “guess” 𝐗^j\widehat{\mathbf{X}}_{j} is correct. We discuss more implications of this result in the next section: for now, we obtain an analogous result for fixed-X knockoffs (similar to a result from Li and Fithian, (2021)) by substituting {𝐗jT𝐘,𝐗~jT𝐘}\{\mathbf{X}_{j}^{T}\mathbf{Y},\widetilde{\mathbf{X}}_{j}^{T}\mathbf{Y}\} for {𝐗j,𝐗~j}\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}.

Proposition 3.2.

Let 𝐗~\widetilde{\mathbf{X}} be fixed-X knockoffs satisfying 𝐗jT𝐘𝐗~jT𝐘\mathbf{X}_{j}^{T}\mathbf{Y}\neq\widetilde{\mathbf{X}}_{j}^{T}\mathbf{Y} a.s. for j[p]j\in[p]. Then W=w([𝐗,𝐗~],𝐘)W=w([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y}) is a valid feature statistic if and only if:

  1. |W||W| is a function of the masked data DD.

  2. For j[p]j\in[p], there exists a DD-measurable random variable RjR_{j} such that Wj>0W_{j}>0 if and only if Rj=𝐗jT𝐘R_{j}=\mathbf{X}_{j}^{T}\mathbf{Y}.

Remark 2.

Propositions 3.1-3.2 hold for knockoffs as defined in Barber and Candès, (2015); Candès et al., (2018). However, in the fixed-X case, one can also augment DD to include σ^2=(InH)𝐘22\hat{\sigma}^{2}=\|(I_{n}-H)\mathbf{Y}\|_{2}^{2}, where HH is the OLS projection matrix of [𝐗,𝐗~][\mathbf{X},\widetilde{\mathbf{X}}] while preserving validity (Chen et al.,, 2019; Li and Fithian,, 2021). Our theory also applies to this extension of the knockoffs framework.

3.2 Introducing masked likelihood ratio (MLR) statistics

We now introduce masked likelihood ratio (MLR) statistics in two steps. First, we introduce oracle MLR statistics, which depend on the unknown law PP^{\star} of the data. Then, we introduce Bayesian MLR statistics, which substitute PπP^{\pi} for PP^{\star}. Throughout, we focus on MX knockoffs, but analogous results for FX knockoffs merely replace {𝐗j,𝐗~j}\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\} with {𝐗jT𝐘,𝐗~jT𝐘}\{\mathbf{X}_{j}^{T}\mathbf{Y},\widetilde{\mathbf{X}}_{j}^{T}\mathbf{Y}\} (see Definition 3.2).

Step 1: Oracle MLR statistics. We now apply Proposition 3.1 to achieve the two intuitive optimality criteria from Section 2.

  • Goal 1 asks that we maximize P(Wj>0)P^{\star}(W_{j}>0). Proposition 3.1 shows that ensuring Wj>0W_{j}>0 is equivalent to correctly guessing the value of 𝐗j\mathbf{X}_{j} from the masked data D=(𝐘,{𝐗j,𝐗~j}j=1p)D=(\mathbf{Y},\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}_{j=1}^{p}). Thus, to maximize P(Wj>0D)=P(𝐗^j=𝐗jD)P^{\star}(W_{j}>0\mid D)=P^{\star}(\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j}\mid D), the analyst should guess the value 𝐗^j=argmax𝐱{𝐗j,𝐗~j}Pj(𝐱D)\widehat{\mathbf{X}}_{j}=\operatorname*{arg\,max}_{\mathbf{x}\in\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}}P_{j}^{\star}(\mathbf{x}\mid D) which maximizes the likelihood that the guess is correct.

  • Goal 2 asks us to order the guesses in descending order of P(Wj>0)P^{\star}(W_{j}>0), i.e., in descending order of the likelihood that each guess 𝐗^j\widehat{\mathbf{X}}_{j} is correct.

Both goals are achieved by using the masked data to compute a log-likelihood ratio between the two possible values of 𝐗j\mathbf{X}_{j} (namely {𝐗j,𝐗~j}\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}). This defines the oracle masked likelihood ratio:

MLRjoracle=log(Pj(𝐗jD)Pj(𝐗~jD)),\mathrm{MLR}_{j}^{\mathrm{oracle}}=\log\left(\frac{P_{j}^{\star}(\mathbf{X}_{j}\mid D)}{P_{j}^{\star}(\widetilde{\mathbf{X}}_{j}\mid D)}\right), (3.2)

where PjP𝐗jDP^{\star}_{j}\coloneqq P^{\star}_{\mathbf{X}_{j}\mid D} is the true (unknown) conditional law of 𝐗j\mathbf{X}_{j} given DD. Soon, Proposition 3.3 will verify that MLRoracle\mathrm{MLR}^{\mathrm{oracle}} achieves both goals above, and Section 3.3 shows that MLRjoracle\mathrm{MLR}_{j}^{\mathrm{oracle}} asymptotically maximizes the expected number of discoveries under PP^{\star} (under regularity conditions).

Step 2: Bayesian MLR statistics. MLRoracle\mathrm{MLR}^{\mathrm{oracle}} cannot be used in practice since it depends on PP^{\star}. A heuristic solution is to “plug in” an estimate P^\hat{P} for PP^{\star}. For example, given some model class 𝒫={P(θ):θΘ}\mathcal{P}=\{P^{(\theta)}:\theta\in\Theta\} of the law of (𝐘,𝐗)(\mathbf{Y},\mathbf{X}), one could estimate θ^\hat{\theta} using DD and replace PP^{\star} with P(θ^)P^{(\hat{\theta})}. However, this “plug-in” approach can perform poorly, since knockoffs are most popular in high-dimensional settings with significant uncertainty about the true value of any unknown parameters. Thus, to account for uncertainty, we suggest replacing PP^{\star} with a Bayesian model PπP^{\pi}.

Definition 3.2 (MLR statistics).

For any Bayesian model PπP^{\pi} (see Section 1.2), we define the model-X masked likelihood ratio (MLR) statistic below:

MLRjπmlrjπ([𝐗,𝐗~],𝐘)log(Pjπ(𝐗jD)Pjπ(𝐗~jD)) for model-X knockoffs,\mathrm{MLR}_{j}^{\pi}\coloneqq\mathrm{mlr}_{j}^{\pi}([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y})\coloneqq\log\left(\frac{P^{\pi}_{j}(\mathbf{X}_{j}\mid D)}{P^{\pi}_{j}(\widetilde{\mathbf{X}}_{j}\mid D)}\right)\text{ for model-X knockoffs,} (3.3)

where PjπP𝐗jDπP^{\pi}_{j}\coloneqq P^{\pi}_{\mathbf{X}_{j}\mid D} denotes the conditional law of 𝐗jD\mathbf{X}_{j}\mid D under PπP^{\pi}.

The fixed-X MLR statistic is analogous but replaces {𝐗j,𝐗~j}\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\} with {𝐗jT𝐘,𝐗~jT𝐘}\{\mathbf{X}_{j}^{T}\mathbf{Y},\widetilde{\mathbf{X}}_{j}^{T}\mathbf{Y}\}. In particular, if Pj,fxπP^{\pi}_{j,\mathrm{fx}} denotes the conditional law of 𝐗jT𝐘D\mathbf{X}_{j}^{T}\mathbf{Y}\mid D under PπP^{\pi}, then

MLRjπmlrjπ([𝐗,𝐗~],𝐘)log(Pj,fxπ(𝐗jT𝐘D)Pj,fxπ(𝐗~jT𝐘D)) for fixed-X knockoffs.\mathrm{MLR}_{j}^{\pi}\coloneqq\mathrm{mlr}_{j}^{\pi}([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y})\coloneqq\log\left(\frac{P^{\pi}_{j,\mathrm{fx}}(\mathbf{X}_{j}^{T}\mathbf{Y}\mid D)}{P^{\pi}_{j,\mathrm{fx}}(\widetilde{\mathbf{X}}_{j}^{T}\mathbf{Y}\mid D)}\right)\text{ for fixed-X knockoffs.} (3.4)
Lemma 3.1.

Equations (3.3) and (3.4) define valid MX and FX knockoff statistics, respectively.

To see how MLR statistics account for uncertainty about nuisance parameters, let π(θD)\pi(\theta\mid D) denote the posterior density of θD\theta^{\star}\mid D under PπP^{\pi}. We can write, e.g., in the model-X case:

MLRjπ=log(ΘPj(θ)(𝐗jD)π(θD)𝑑θΘPj(θ)(𝐗~jD)π(θD)𝑑θ).\mathrm{MLR}_{j}^{\pi}=\log\left(\frac{\int_{\Theta}P^{(\theta)}_{j}(\mathbf{X}_{j}\mid D)\pi(\theta\mid D)d\theta}{\int_{\Theta}P^{(\theta)}_{j}(\widetilde{\mathbf{X}}_{j}\mid D)\pi(\theta\mid D)d\theta}\right). (3.5)

Unlike the “plug-in” approach, MLRjπ\mathrm{MLR}_{j}^{\pi} does not rely on a single estimate of θ\theta—instead, it takes the weighted average of the likelihoods Pj(θ)(𝐗jD)P^{(\theta)}_{j}(\mathbf{X}_{j}\mid D), weighted by the posterior law of θD\theta\mid D under PπP^{\pi}.
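Equation (3.5) also suggests a numerical recipe: given draws from the posterior of theta given D and the corresponding per-draw likelihoods of the two candidate values (both of which Section 4 obtains via Gibbs sampling), one can average the likelihoods before taking the log ratio. The sketch below assumes these per-draw log-likelihoods are already available as inputs; the function name and inputs are illustrative, not part of the paper or of knockpy.

```python
import numpy as np
from scipy.special import logsumexp

def mlr_monte_carlo(log_lik_xj, log_lik_xkj):
    """Monte Carlo version of Eq. (3.5) for a single feature j.

    log_lik_xj[m]  = log P_j^{(theta_m)}(X_j       | D), with theta_m drawn from pi(theta | D)
    log_lik_xkj[m] = log P_j^{(theta_m)}(X_tilde_j | D)
    Returns the log ratio of posterior-averaged likelihoods, computed stably in log space."""
    # The 1/M normalization of each average cancels in the ratio.
    return logsumexp(log_lik_xj) - logsumexp(log_lik_xkj)
```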

We now verify that under PπP^{\pi}, MLR statistics achieve the intuitive criteria from Section 2 (Goals 1-2). This result applies to oracle MLR statistics, since MLRjoracle=MLRjπ\mathrm{MLR}_{j}^{\mathrm{oracle}}=\mathrm{MLR}_{j}^{\pi} in the special case where Pπ=PP^{\pi}=P^{\star}.

Proposition 3.3.

Let MLRπ\mathrm{MLR}^{\pi} be the MLR statistics with respect to a Bayesian model PπP^{\pi}. Then for any other feature statistic WW,

Pπ(MLRjπ>0D)Pπ(Wj>0D).P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)\geq P^{\pi}(W_{j}>0\mid D). (3.6)

Furthermore, {|MLRjπ|}j=1p\{|\mathrm{MLR}_{j}^{\pi}|\}_{j=1}^{p} has the same order as {Pπ(MLRjπ>0D)}j=1p\{P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)\}_{j=1}^{p}. More precisely,

Pπ(MLRjπ>0D)=exp(|MLRjπ|)1+exp(|MLRjπ|).P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)=\frac{\exp(|\mathrm{MLR}_{j}^{\pi}|)}{1+\exp(|\mathrm{MLR}_{j}^{\pi}|)}. (3.7)

Equation (3.7) shows that the absolute values |MLRjπ||\mathrm{MLR}_{j}^{\pi}| have the same order as Pπ(MLRjπ>0D)P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D), so under PπP^{\pi}, MLR statistics prioritize the hypotheses “correctly.” More generally, if 𝐗j\mathbf{X}_{j} is predictive of 𝐘\mathbf{Y} but 𝐗~j\widetilde{\mathbf{X}}_{j} is nearly indistinguishable from 𝐗j\mathbf{X}_{j}, |MLRjπ||\mathrm{MLR}_{j}^{\pi}| should be small, since 𝐗j𝐗~j\mathbf{X}_{j}\approx\widetilde{\mathbf{X}}_{j} suggests Pjπ(𝐗jD)Pjπ(𝐗~jD)P^{\pi}_{j}(\mathbf{X}_{j}\mid D)\approx P^{\pi}_{j}(\widetilde{\mathbf{X}}_{j}\mid D). Thus, MLR statistics should rarely be highly negative (see Figure 1).
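As a small numerical check of Eq. (3.7) (a toy illustration, not part of the paper's code), the conditional sign probabilities are a sigmoid transform of the magnitudes, so sorting hypotheses by |MLR| and by their probability of a positive sign gives the same ranking.

```python
import numpy as np
from scipy.special import expit  # expit(x) = exp(x) / (1 + exp(x))

mlr = np.array([2.3, -0.4, 0.1, 5.0])   # hypothetical MLR statistics
sign_probs = expit(np.abs(mlr))         # Eq. (3.7): P(MLR_j > 0 | D)
assert (np.argsort(-np.abs(mlr)) == np.argsort(-sign_probs)).all()
```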

Lastly, we make two connections to the literature. First, one proposal in Ren and Candès, (2020) also suggests ranking the hypotheses by (Wj>0|Wj|)\mathbb{P}(W_{j}>0\mid|W_{j}|) (see their footnote 8). That said, Ren and Candès, (2020) do not propose a feature statistic accomplishing this. Rather, they develop “adaptive knockoffs,” an extension of knockoffs that can be combined with any predefined feature statistic, including MLR or lasso statistics. Indeed, using better initial feature statistics should increase the power of adaptive knockoffs, so our contribution is both orthogonal and complementary to theirs (see Appendix D.2 for more details). Second, Katsevich and Ramdas, (2020) show that the unmasked likelihood statistic maximizes P(Wj>0)P^{\star}(W_{j}>0); indeed, our work builds on theirs. However, there are two key differences. First, unlike MLR statistics, the unmasked likelihood statistic is not a valid knockoff statistic even though it is marginally symmetric under the null (see Appendix D.1), so it does not provably control the FDR. Second, MLR statistics have additional guarantees on their magnitudes (Eq. 3.7), allowing us to show much stronger theoretical results in Section 3.3.

Remark 3.

Appendix F extends this section’s results to apply to group knockoffs (Dai and Barber,, 2016).

3.3 MLR statistics are asymptotically optimal

We now show that MLR statistics asymptotically maximize ENDiscπ\texttt{ENDisc}^{\pi}, the expected number of discoveries under PπP^{\pi}. Indeed, Proposition 3.3 might make one hope that MLR statistics exactly maximize ENDiscπ\texttt{ENDisc}^{\pi}, since MLR statistics exactly accomplish Goals 1-2 from Section 2. This intuition is correct under the conditional independence condition below (generalizing Li and Fithian, (2021) Proposition 2).

Proposition 3.4.

If {𝕀(MLRjπ>0)}j=1p\{\mathbb{I}(\mathrm{MLR}_{j}^{\pi}>0)\}_{j=1}^{p} are conditionally independent given DD under PπP^{\pi}, then
ENDiscπ(mlrπ)ENDiscπ(w)\texttt{ENDisc}^{\pi}(\mathrm{mlr}^{\pi})\geq\texttt{ENDisc}^{\pi}(w) for any valid feature statistic ww.

Furthermore, in Gaussian linear models, oracle MLR statistics satisfy this conditional independence condition, making them finite-sample optimal.

Proposition 3.5.

Suppose that (i) 𝐗~\widetilde{\mathbf{X}} are FX knockoffs or Gaussian conditional MX knockoffs (Huang and Janson,, 2020) and (ii) under PP^{\star}, 𝐘𝐗𝒩(𝐗β,σ2In)\mathbf{Y}\mid\mathbf{X}\sim\mathcal{N}(\mathbf{X}\beta,\sigma^{2}I_{n}). Then under PP^{\star}, {𝕀(MLRjoracle>0)}j=1pD\{\mathbb{I}(\mathrm{MLR}_{j}^{\mathrm{oracle}}>0)\}_{j=1}^{p}\mid D are conditionally independent.

Absent this independence condition, it may be possible to exploit dependencies among the coordinates of sign(MLRπ)\operatorname*{sign}(\mathrm{MLR}^{\pi}) to slightly improve power. Yet Appendix B.3 shows that to improve power even slightly seems to require pathological dependencies, making it hard to imagine that accounting for dependencies can substantially increase power in practice. Formally, we now show that MLR statistics are asymptotically optimal under regularity conditions on the dependence of sign(MLRπ)D\operatorname*{sign}(\mathrm{MLR}^{\pi})\mid D.

To this end, consider any asymptotic regime where we observe 𝐗(n)n×pn,𝐘(n)n\mathbf{X}^{(n)}\in\mathbb{R}^{n\times p_{n}},\mathbf{Y}^{(n)}\in\mathbb{R}^{n} and construct knockoffs 𝐗~(n)\widetilde{\mathbf{X}}^{(n)}. For each nn, let PnπP^{\pi}_{n} denote a Bayesian model based on a model class 𝒫(n)={P(θ):θΘ(n)}\mathcal{P}^{(n)}=\{P^{(\theta)}:\theta\in\Theta^{(n)}\} and prior density π(n):Θ(n)0\pi^{(n)}:\Theta^{(n)}\to\mathbb{R}_{\geq 0}. Let D(n)D^{(n)} denote the masked data (Definition 3.1). For a sequence of feature statistics W(n)=wn([𝐗(n),𝐗~(n)],𝐘(n))W^{(n)}=w_{n}([\mathbf{X}^{(n)},\widetilde{\mathbf{X}}^{(n)}],\mathbf{Y}^{(n)}), let S(n)(q)S^{(n)}(q) denote the rejection set of W(n)W^{(n)} when controlling the FDR at level qq. So far, we have made no assumptions about the law of 𝐘(n),𝐗(n)\mathbf{Y}^{(n)},\mathbf{X}^{(n)} under PnπP^{\pi}_{n}, and we allow the dimension pnp_{n} to grow arbitrarily with nn. To analyze the asymptotic behavior of MLR statistics under PnπP^{\pi}_{n}, we need two main assumptions.

Assumption 3.1 (Sparsity).

For θΘ(n)\theta\in\Theta^{(n)}, let sn(θ)s_{n}^{(\theta)} denote the number of non-nulls under P(θ)P^{(\theta)} and sn=Θsn(θ)π(θ)𝑑θs_{n}=\int_{\Theta}s_{n}^{(\theta)}\pi(\theta)d\theta denote the expected number of non-nulls under PnπP_{n}^{\pi}. We assume snlog(pn)5s_{n}\gg\log(p_{n})^{5} as nn\to\infty.

Assumption 3.1 allows for many previously studied sparsity regimes, such as polynomial (Donoho and Jin,, 2004; Ke et al.,, 2020) and linear (e.g., Weinstein et al.,, 2017) sparsity regimes.

Assumption 3.2 (Local dependence).

Under PnπP^{\pi}_{n}, the conditional covariance matrix of sign(MLRπ)\operatorname*{sign}(\mathrm{MLR}^{\pi}) given D(n)D^{(n)} decays exponentially off its diagonal. Formally, there exist constants C0,ρ(0,1)C\geq 0,\rho\in(0,1) such that

|CovPnπ(𝕀(MLRiπ>0),𝕀(MLRjπ>0)D(n))|Cρ|ij|.|\operatorname{Cov}_{P^{\pi}_{n}}(\mathbb{I}(\mathrm{MLR}^{\pi}_{i}>0),\mathbb{I}(\mathrm{MLR}^{\pi}_{j}>0)\mid D^{(n)})|\leq C\rho^{|i-j|}. (3.8)

Assumption 3.2 quantifies the requirement that sign(MLRπ)D(n)\operatorname*{sign}(\mathrm{MLR}^{\pi})\mid D^{(n)} are not “too” conditionally dependent. Similar local dependence conditions are common in the multiple testing literature (Genovese and Wasserman,, 2004; Storey et al.,, 2004; Ferreira and Zwinderman,, 2006; Farcomeni,, 2007), although previous assumptions are typically made about p-values. We justify this assumption below.

  1. This assumption is intuitively plausible because knockoffs guarantee that the null coordinates of sign(W)\operatorname*{sign}(W) are independent given DD under PP^{\star}, regardless of the correlations among 𝐗\mathbf{X} (Barber and Candès,, 2015). This independence also holds for non-null coordinates in Gaussian linear models (see Prop. 3.5). Appendix B.9 gives additional informal intuition explaining why this result often holds approximately under PnπP_{n}^{\pi} for both null and non-null variables.

  2. Empirically, CovPπ(sign(MLRπ)D)\operatorname{Cov}_{P^{\pi}}(\operatorname*{sign}(\mathrm{MLR}^{\pi})\mid D) is nearly indistinguishable from a diagonal matrix in all of our simulations and three real analyses. This suggests that Assumption 3.2 holds in practice.

  3. This assumption can also be diagnosed in real applications, since it depends only on PπP^{\pi}, which is specified by the analyst. To this end, Section 4 shows how to compute CovPπ(sign(MLRπ)D)\operatorname{Cov}_{P^{\pi}}(\operatorname*{sign}(\mathrm{MLR}^{\pi})\mid D) (a sketch of such a diagnostic appears after this list).

  4. Explicit analysis of the covariances in Eq. (3.8) is known to be challenging. Nonetheless, in Appendix B.8, we prove that Assumption 3.2 holds if the design matrix 𝐗\mathbf{X} is blockwise orthogonal, which is an important (if not entirely realistic) special case studied by Ke et al., (2020).

  5. Assumption 3.2 can also be substantially relaxed (see Appendix B.5). All we require is that the partial sums of {sign(MLRjπ)}j=1p\{\operatorname*{sign}(\mathrm{MLR}_{j}^{\pi})\}_{j=1}^{p} obey a strong law of large numbers conditional on DD.
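To illustrate items 2 and 3 above, the following hypothetical diagnostic (a sketch, not the paper's algorithm) estimates the conditional covariance matrix of the sign indicators from Monte Carlo draws taken conditional on the masked data, for example draws produced alongside the Gibbs sampler of Section 4, and checks the exponential-decay bound of Eq. (3.8) for candidate constants C and rho.

```python
import numpy as np

def check_local_dependence(sign_draws, C=1.0, rho=0.5):
    """sign_draws: (n_draws, p) array of indicators I(MLR_j > 0), sampled conditional on D.
    Returns the estimated Cov(sign(MLR) | D) and whether |Cov_ij| <= C * rho^|i-j| off the diagonal."""
    draws = np.asarray(sign_draws, dtype=float)
    cov = np.cov(draws, rowvar=False)                           # (p, p) empirical conditional covariance
    p = cov.shape[0]
    lag = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # |i - j| for each entry
    off_diag = ~np.eye(p, dtype=bool)
    ok = np.all(np.abs(cov[off_diag]) <= (C * rho ** lag)[off_diag])
    return cov, ok
```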

With these two assumptions, we show that MLR statistics asymptotically maximize Γq(wn)\Gamma_{q}(w_{n}), the expected number of discoveries normalized by the expected number of non-nulls:

Γq(wn)𝔼(𝐗(n),𝐘(n))Pnπ[|S(n)(q)|]sn.\Gamma_{q}(w_{n})\coloneqq\frac{\mathbb{E}_{(\mathbf{X}^{(n)},\mathbf{Y}^{(n)})\sim P^{\pi}_{n}}[|S^{(n)}(q)|]}{s_{n}}. (3.9)
Theorem 3.2.

Consider any high-dimensional asymptotic regime where we observe data 𝐗(n)n×pn,𝐘(n)n\mathbf{X}^{(n)}\in\mathbb{R}^{n\times p_{n}},\mathbf{Y}^{(n)}\in\mathbb{R}^{n} and knockoffs 𝐗~(n)\widetilde{\mathbf{X}}^{(n)} with D(n)D^{(n)} denoting the masked data. Let PnπP^{\pi}_{n} be a sequence of Bayesian models of the data satisfying Assumptions 3.1-3.2, and let mlrnπ([𝐗(n),𝐗~(n)],𝐘(n))\mathrm{mlr}_{n}^{\pi}([\mathbf{X}^{(n)},\widetilde{\mathbf{X}}^{(n)}],\mathbf{Y}^{(n)}) denote the MLR statistics with respect to PnπP^{\pi}_{n}. Let wn([𝐗(n),𝐗~(n)],𝐘(n))w_{n}([\mathbf{X}^{(n)},\widetilde{\mathbf{X}}^{(n)}],\mathbf{Y}^{(n)}) denote any other sequence of feature statistics.

Then, if the limits limnΓq(wn)\lim_{n\to\infty}\Gamma_{q}(w_{n}) and limnΓq(mlrnπ)\lim_{n\to\infty}\Gamma_{q}(\mathrm{mlr}_{n}^{\pi}) exist for q(0,1)q\in(0,1), we have that

limnΓq(mlrnπ)limnΓq(wn)\lim_{n\to\infty}\Gamma_{q}(\mathrm{mlr}_{n}^{\pi})\geq\lim_{n\to\infty}\Gamma_{q}(w_{n}) (3.10)

holds for all but countably many qq.

Theorem 3.2 shows that MLR statistics asymptotically maximize the (normalized) number of expected discoveries without any explicit assumptions on the relationship between 𝐘\mathbf{Y} and 𝐗\mathbf{X} or the dimensionality. Besides Assumptions 3.1-3.2, we also assume that the quantities we aim to study actually exist, i.e., limnΓq(wn)\lim_{n\to\infty}\Gamma_{q}(w_{n}) and limnΓq(mlrnπ)\lim_{n\to\infty}\Gamma_{q}(\mathrm{mlr}_{n}^{\pi}) exist—however, even this assumption can be relaxed (see Appendix B.5). Yet the weakest aspect of Theorem 3.2 is that MLR statistics are only provably optimal under PπP^{\pi}. If PjP_{j}^{\star} and PjπP_{j}^{\pi} are quite different, MLR statistics may not perform well. For this reason, Section 4.2 suggests practical choices of PπP^{\pi} that performed well empirically, even under misspecification.

3.4 Maximizing the expected number of true discoveries

We now introduce adjusted MLR (AMLR) statistics, which asymptotically maximize the number of expected true discoveries under PπP^{\pi}. Empirically, AMLR and MLR statistics perform similarly, but AMLR statistics are less intuitive and depend somewhat counterintuitively on the FDR level. (This is why our paper focuses mostly on MLR statistics.) Thus, for brevity, this section gives only a little intuition and a slightly informal theorem statement. Please see Appendix B.7 for a rigorous theorem statement.

We begin with notation. For θΘ\theta\in\Theta, 1(θ)[p]\mathcal{H}_{1}(\theta)\subset[p] denotes the set of non-nulls under P(θ)P^{(\theta)} and 1(θ)[p]\mathcal{H}_{1}(\theta^{\star})\subset[p] denotes the random set of non-nulls under PπP^{\pi}. Then, Pπ(MLRjπ>0,j1(θ)D)P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0,j\in\mathcal{H}_{1}(\theta^{\star})\mid D) is the conditional probability that MLRjπ\mathrm{MLR}_{j}^{\pi} is positive and the jjth feature is non-null given the masked data. Finally, define the following ratio νj\nu_{j}:

νj=Pπ(MLRjπ>0,j1(θ)D)(1+q)1Pπ(MLRjπ>0D).\nu_{j}=\frac{P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0,j\in\mathcal{H}_{1}(\theta^{\star})\mid D)}{(1+q)^{-1}-P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)}. (3.11)
Definition 3.3.

With this notation, we now define AMLR statistics {AMLRj}j=1p\{\mathrm{AMLR}_{j}\}_{j=1}^{p} in two cases.

  • Case 1: AMLRjπ=MLRjπ\mathrm{AMLR}_{j}^{\pi}=\mathrm{MLR}_{j}^{\pi} if Pπ(MLRjπ>0D)(1+q)1P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)\geq(1+q)^{-1}.

  • Case 2: Otherwise, with logit(x)log(x/(1x))\mathrm{logit}(x)\coloneqq\log(x/(1-x)), we define

    AMLRjπ=sign(MLRjπ)logit((1+q)1)logit1(νj).\mathrm{AMLR}_{j}^{\pi}=\operatorname*{sign}(\mathrm{MLR}_{j}^{\pi})\cdot\mathrm{logit}\left((1+q)^{-1}\right)\cdot\mathrm{logit}^{-1}\left(\nu_{j}\right). (3.12)

    By construction, all AMLR statistics in Case 2 have smaller absolute values than all statistics in Case 1. Note that Appendix E.4 shows how to compute AMLR statistics.

Corollary 3.1.

AMLR statistics from Definition 3.3 are valid knockoff statistics.

MLR and AMLR statistics have the same signs but different absolute values. To understand why, Appendix B.7 argues that maximizing the expected number of true discoveries can be formulated as a simple linear program where the “benefit” of prioritizing a feature is bjPπ(MLRjπ>0,j1(θ)D)b_{j}\coloneqq P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0,j\in\mathcal{H}_{1}(\theta^{\star})\mid D)—the probability that MLRjπ\mathrm{MLR}_{j}^{\pi} is positive and jj is non-null—and the “cost” is cj(1+q)1Pπ(MLRjπ>0D)c_{j}\coloneqq(1+q)^{-1}-P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D). The intuition is that to make kk discoveries, (1+q)1k\approx(1+q)^{-1}k of the kk feature statistics with the largest absolute values must have positive signs. Thus, cjc_{j} measures the difference between (1+q)1(1+q)^{-1} and the (conditional) probability that MLRjπ\mathrm{MLR}_{j}^{\pi} is positive. Feature jj has a negative cost cj<0c_{j}<0 if it produces a “surplus” of (1+q)1\geq(1+q)^{-1} positive signs in expectation.

The optimal solution to this problem is to (a) maximally prioritize all features with negative costs by giving them the highest absolute values—i.e., the features in Case 1 above—and (b) prioritize all other features in descending order of the benefit-cost ratio νj=bj/cj\nu_{j}=b_{j}/c_{j}. This is accomplished by the AMLR formulas in Definition 3.3. In contrast, MLR statistic magnitudes are a decreasing function of only the costs cjc_{j}. By incorporating the benefit bjb_{j}, AMLR statistics reduce the expected number of discoveries while increasing the expected number of true discoveries. See Appendix B.7 for further details.
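To make Definition 3.3 concrete, the sketch below computes AMLR statistics from the MLR statistics and from estimates of the benefits b_j; how those probabilities are estimated is deferred to Appendix E.4, so here they are simply assumed inputs.

```python
import numpy as np
from scipy.special import expit, logit

def amlr_statistics(mlr, b, q):
    """AMLR statistics of Definition 3.3.

    mlr: MLR statistics (length p).
    b:   estimates of P^pi(MLR_j > 0, j non-null | D) (length p).
    q:   nominal FDR level."""
    mlr, b = np.asarray(mlr, dtype=float), np.asarray(b, dtype=float)
    p_pos = expit(np.abs(mlr))           # Eq. (3.7): P^pi(MLR_j > 0 | D)
    thresh = 1.0 / (1.0 + q)
    amlr = mlr.copy()                    # Case 1: AMLR_j = MLR_j when p_pos >= 1/(1+q)
    case2 = p_pos < thresh
    nu = b[case2] / (thresh - p_pos[case2])   # benefit-cost ratio nu_j = b_j / c_j
    # Case 2: |AMLR_j| = logit(1/(1+q)) * sigmoid(nu_j), which is strictly smaller than
    # logit(1/(1+q)) <= |MLR_j| for every feature in Case 1.
    amlr[case2] = np.sign(mlr[case2]) * logit(thresh) * expit(nu)
    return amlr
```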

AMLR and MLR statistics are different but not too different, since typically, {Pπ(MLRjπ>0,j1(θ)D)}j=1p\{P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0,j\in\mathcal{H}_{1}(\theta^{\star})\mid D)\}_{j=1}^{p} has a similar order as {Pπ(MLRjπ>0D)}j=1p\{P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)\}_{j=1}^{p}. (When the orders are the same, AMLR and MLR statistics yield identical rejection sets.) Indeed, Figure 2 shows in a simple simulation that the power of AMLR and MLR statistics is nearly identical.

We now show that AMLR statistics asymptotically maximize power under PπP^{\pi} (see Appendix B.7 for a formal statement and proof). For any statistic w([𝐗,𝐗~],𝐘)w([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y}) with discovery set Sw[p]S_{w}\subset[p], let TPπ(w)\mathrm{TP}^{\pi}(w) denote the expected number of true positives under PπP^{\pi}:

TPπ(w)Θ𝔼P(θ)[|Sw1(θ)|]π(θ)𝑑θ=𝔼Pπ[|Sw1(θ)|].\mathrm{TP}^{\pi}(w)\,\coloneqq\,\int_{\Theta}\mathbb{E}_{P^{(\theta)}}\big{[}|S_{w}\cap\mathcal{H}_{1}(\theta)|\big{]}\,\pi(\theta)\,d\theta\,=\,\mathbb{E}_{P^{\pi}}[|S_{w}\cap\mathcal{H}_{1}(\theta^{\star})|]. (3.13)
Theorem 3.3 (Informal).

Suppose the conditions of Theorem 3.2 hold. Furthermore, suppose that (i) the local dependence condition in Assumption 3.2 holds when replacing 𝕀(MLRjπ>0)\mathbb{I}(\mathrm{MLR}_{j}^{\pi}>0) with 𝕀(MLRjπ>0,j1(θ))\mathbb{I}(\mathrm{MLR}_{j}^{\pi}>0,j\in\mathcal{H}_{1}(\theta^{\star})) and (ii) the coefficient of variation of the number of non-nulls |1(θ)||\mathcal{H}_{1}(\theta^{\star})| is bounded as nn\to\infty. Then for any sequence of feature statistics {wn}n\{w_{n}\}_{n\in\mathbb{N}},

TPπ(amlr)TPπ(wn)+o(sn),\mathrm{TP}^{\pi}(\mathrm{amlr})\geq\mathrm{TP}^{\pi}(w_{n})+o(s_{n}), (3.14)

where sns_{n} is the expected number of non-nulls under PnπP_{n}^{\pi}, as defined in Assumption 3.1.

Figure 2: In the AR(1) simulation setting from Section 5.1, this figure plots the power of MLR and AMLR statistics (using MVR fixed-X or model-X knockoffs) for different nominal FDR levels qq. It shows that both statistics have essentially the same power. See Section 5.1 for details.

4 Computing MLR statistics

4.1 General strategy

We now show how to compute Pjπ(𝐗jD)P^{\pi}_{j}(\mathbf{X}_{j}\mid D) and Pjπ(𝐗~jD)P^{\pi}_{j}(\widetilde{\mathbf{X}}_{j}\mid D) by Gibbs sampling from the law of 𝐗D\mathbf{X}\mid D under PπP^{\pi}. For brevity, we focus on the MX setting—Appendix E discusses the FX case.

The key idea is that conditional on 𝐗j\mathbf{X}_{-j} and the latent parameter θ\theta^{\star}, sampling from the law of 𝐗j𝐗j,θ,D\mathbf{X}_{j}\mid\mathbf{X}_{-j},\theta^{\star},D is easy. In particular, for any fixed 𝐝=(𝐲,{𝐱j,𝐱~j}j=1p)\mathbf{d}=(\mathbf{y},\{\mathbf{x}_{j},\widetilde{\mathbf{x}}_{j}\}_{j=1}^{p}), observing D=𝐝D=\mathbf{d} implies that 𝐗j\mathbf{X}_{j} must lie in {𝐱j,𝐱~j}\{\mathbf{x}_{j},\widetilde{\mathbf{x}}_{j}\}. Lemma E.1 shows that the conditional likelihood ratio equals:

Pπ(𝐗j=𝐱j𝐗j,θ=θ,D=𝐝)Pπ(𝐗j=𝐱~j𝐗j,θ=θ,D=𝐝)\displaystyle\frac{P^{\pi}(\mathbf{X}_{j}=\mathbf{x}_{j}\mid\mathbf{X}_{-j},\theta^{\star}=\theta,D=\mathbf{d})}{P^{\pi}(\mathbf{X}_{j}=\widetilde{\mathbf{x}}_{j}\mid\mathbf{X}_{-j},\theta^{\star}=\theta,D=\mathbf{d})} =P𝐘𝐗(θ)(𝐲𝐗j=𝐱j,𝐗j)P𝐘𝐗(θ)(𝐲𝐗j=𝐱~j,𝐗j).\displaystyle=\frac{P_{\mathbf{Y}\mid\mathbf{X}}^{(\theta)}(\mathbf{y}\mid\mathbf{X}_{j}=\mathbf{x}_{j},\mathbf{X}_{-j})}{P_{\mathbf{Y}\mid\mathbf{X}}^{(\theta)}(\mathbf{y}\mid\mathbf{X}_{j}=\widetilde{\mathbf{x}}_{j},\mathbf{X}_{-j})}. (4.1)

The right-hand side of Eq. (4.1) is easy to compute for most parametric models 𝒫\mathcal{P}, since it only involves computing the likelihood of 𝐘\mathbf{Y} given 𝐗\mathbf{X}. Thus, we can easily sample from the law of 𝐗jD,θ,𝐗j\mathbf{X}_{j}\mid D,\theta^{\star},\mathbf{X}_{-j}.

To sample from the law of 𝐗D\mathbf{X}\mid D, Algorithm 1 describes a Gibbs sampler which (i) for j[p]j\in[p], resamples from 𝐗j𝐘,𝐗j,θ\mathbf{X}_{j}\mid\mathbf{Y},\mathbf{X}_{-j},\theta^{\star} and (ii) resamples from the posterior of θ𝐘,𝐗\theta^{\star}\mid\mathbf{Y},\mathbf{X}. Step (ii) can be done using any off-the-shelf Bayesian sampler (Brooks et al.,, 2011), since this step is identical to a typical Bayesian regression. Lemma 4.1 shows that Algorithm 1 correctly computes the MLR statistics as nsamplen_{\mathrm{sample}}\to\infty under standard regularity conditions (Robert and Casella,, 2004). These mild conditions are satisfied by our default choices (Example 1), but they can also be relaxed further (see Appendix E.3).

Algorithm 1 Gibbs sampling meta-algorithm to compute MLR statistics.

Input:𝐘,𝐗,𝐗~\mathbf{Y},\mathbf{X},\widetilde{\mathbf{X}}, a model class {P(θ):θΘ}\{P^{(\theta)}:\theta\in\Theta\} and prior π:Θ0\pi:\Theta\to\mathbb{R}_{\geq 0}.

1:Initialize θ(0)π\theta^{(0)}\sim\pi and 𝐗j(0)indUnif({𝐗j,𝐗~j})\mathbf{X}_{j}^{(0)}\stackrel{{\scriptstyle\mathrm{ind}}}{{\sim}}\mathrm{Unif}(\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}) for j[p]j\in[p]. \triangleright Initialization
2:for i=1,2,,nsamplei=1,2,\dots,n_{\mathrm{sample}} do:
3:     Initialize 𝐗(i)=𝐗(i1)n×p\mathbf{X}^{(i)}=\mathbf{X}^{(i-1)}\in\mathbb{R}^{n\times p}.
4:     for j=1,,pj=1,\dots,p do: \triangleright Resample 𝐗(i)\mathbf{X}^{(i)}
5:         Set \eta_{j}^{(i)}=\log\left(P_{\mathbf{Y}\mid\mathbf{X}}^{(\theta^{(i-1)})}(\mathbf{Y}\mid[\mathbf{X}_{-j}^{(i)},\mathbf{X}_{j}])\right)-\log\left(P_{\mathbf{Y}\mid\mathbf{X}}^{(\theta^{(i-1)})}(\mathbf{Y}\mid[\mathbf{X}_{-j}^{(i)},\widetilde{\mathbf{X}}_{j}])\right).
6:         Define pj(i)=logit1(ηj(i))p_{j}^{(i)}=\mathrm{logit}^{-1}(\eta_{j}^{(i)}).
7:         Set 𝐗j(i)=𝐗j\mathbf{X}_{j}^{(i)}=\mathbf{X}_{j} with probability pj(i)p_{j}^{(i)}. Else set 𝐗j(i)=𝐗~j\mathbf{X}_{j}^{(i)}=\widetilde{\mathbf{X}}_{j}.      
8:     Sample θ(i)\theta^{(i)} from the law of θ𝐘,𝐗=𝐗(i)\theta^{\star}\mid\mathbf{Y},\mathbf{X}=\mathbf{X}^{(i)} under PπP^{\pi}. \triangleright Resample θ(i)\theta^{(i)}
9:Return MLRjπ=log(i=1nsamplepj(i))log(i=1nsample1pj(i))\mathrm{MLR}_{j}^{\pi}=\log\left(\sum_{i=1}^{n_{\mathrm{sample}}}p_{j}^{(i)}\right)-\log\left(\sum_{i=1}^{n_{\mathrm{sample}}}1-p_{j}^{(i)}\right), for j[p]j\in[p].
Lemma 4.1.

Let pj(i)p_{j}^{(i)} be defined as in Algorithm 1. Suppose that under PπP^{\pi}, (i) pj(i)(0,1)p_{j}^{(i)}\in(0,1) a.s. for j[p]j\in[p] and (ii) the support of θ𝐗,𝐘\theta^{\star}\mid\mathbf{X},\mathbf{Y} equals Θ\Theta. Then as nsamplen_{\mathrm{sample}}\to\infty,

log(i=1nsamplepj(i))log(i=1nsample1pj(i))pMLRjπlog(Pjπ(𝐗jD)Pjπ(𝐗~jD)).\log\left(\sum_{i=1}^{n_{\mathrm{sample}}}p_{j}^{(i)}\right)-\log\left(\sum_{i=1}^{n_{\mathrm{sample}}}1-p_{j}^{(i)}\right)\stackrel{{\scriptstyle\mathrm{p}}}{{\to}}\mathrm{MLR}_{j}^{\pi}\coloneqq\log\left(\frac{P_{j}^{\pi}(\mathbf{X}_{j}\mid D)}{P_{j}^{\pi}(\widetilde{\mathbf{X}}_{j}\mid D)}\right).
Remark 4.

Algorithm 1 also allows us to diagnose Assumption 3.2. Prop. 3.1 yields that if 𝐗^j=argmax𝐱{𝐗j,𝐗~j}Pjπ(𝐱D)\widehat{\mathbf{X}}_{j}=\operatorname*{arg\,max}_{\mathbf{x}\in\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}}P_{j}^{\pi}(\mathbf{x}\mid D), then

CovPπ(𝕀(MLRkπ>0),𝕀(MLRjπ>0)D)=CovPπ(𝕀(𝐗^k=𝐗k),𝕀(𝐗^j=𝐗j)D).\displaystyle\operatorname{Cov}_{P^{\pi}}(\mathbb{I}(\mathrm{MLR}_{k}^{\pi}>0),\mathbb{I}(\mathrm{MLR}_{j}^{\pi}>0)\mid D)=\operatorname{Cov}_{P^{\pi}}(\mathbb{I}(\widehat{\mathbf{X}}_{k}=\mathbf{X}_{k}),\mathbb{I}(\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j})\mid D). (4.2)

Thus, we can approximate the covariance above with the empirical covariance of {𝕀(𝐗j(i)=𝐗^j)}i=1nsample\{\mathbb{I}(\mathbf{X}_{j}^{(i)}=\widehat{\mathbf{X}}_{j})\}_{i=1}^{n_{\mathrm{sample}}} and {𝕀(𝐗k(i)=𝐗^k)}i=1nsample\{\mathbb{I}(\mathbf{X}_{k}^{(i)}=\widehat{\mathbf{X}}_{k})\}_{i=1}^{n_{\mathrm{sample}}}, where {𝐗(i)}i=1nsample\{\mathbf{X}^{(i)}\}_{i=1}^{n_{\mathrm{sample}}} are the samples from Algorithm 1.
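As a concrete version of this diagnostic, the sketch below estimates the covariance matrix in Eq. (4.2) from the Gibbs draws. It approximates 𝐗^j by a majority vote over the samples (a Monte Carlo stand-in for the posterior argmax in Prop. 3.1), and the helper name and array layout are ours.

```python
import numpy as np

def sign_cov_diagnostic(X_samples, Xorig):
    """Empirical version of Eq. (4.2). `X_samples` has shape (n_sample, n, p) and holds the
    Gibbs draws of X from Algorithm 1; `Xorig` is the observed feature matrix. We estimate
    X_hat_j as whichever of {X_j, Xtilde_j} appears more often in the draws and return the
    sample covariance matrix of the agreement indicators."""
    is_original = np.all(np.isclose(X_samples, Xorig[None]), axis=1)   # shape (n_sample, p)
    picks_original = is_original.mean(axis=0) >= 0.5                   # is X_hat_j == X_j?
    agree = np.where(picks_original, is_original, ~is_original).astype(float)
    return np.cov(agree, rowvar=False)                                 # (p, p) covariance
```

Large off-diagonal entries of the returned matrix would indicate that the local dependence condition (Assumption 3.2) is suspect for the data at hand.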

Remark 5.

In the special case of Gaussian linear models with a sparse prior on the coefficients β\beta, Algorithm 1 is similar in flavor to the “Bayesian Variable Selection” (BVS) feature statistic from Candès et al., (2018), although there are differences in the Gibbs sampler and the final estimand. Broadly, we see our work as complementary to theirs. Yet aside from technical details, a main difference is that Candès et al., (2018) seemed to argue that the advantage of BVS was to incorporate accurate prior information. In contrast, we argue that MLR statistics can improve power even without prior information (see Section 5) by estimating the right notion of variable importance.

4.2 A default choice of Bayesian model

Below, we describe a class of Bayesian models that is computationally efficient and can flexibly model both linear and nonlinear relationships. Note that to specify 𝒫\mathcal{P}, it suffices to model the law of 𝐘𝐗\mathbf{Y}\mid\mathbf{X}, since the law of 𝐗\mathbf{X} is assumed known in the MX case and 𝐗\mathbf{X} is fixed in the FX case.

Example 1 (Sparse generalized additive model).

For linear coefficients β(j)kj\beta^{(j)}\in\mathbb{R}^{k_{j}} and noise variance σ2\sigma^{2}\in\mathbb{R}, let θ=(β(1),,β(p),σ2)ΘK×0\theta=(\beta^{(1)},\dots,\beta^{(p)},\sigma^{2})\in\Theta\coloneqq\mathbb{R}^{K}\times\mathbb{R}_{\geq 0}. For a prespecified set of basis functions ϕj:kj\phi_{j}:\mathbb{R}\to\mathbb{R}^{k_{j}}, we consider the model class 𝒫={P(θ):θΘ}\mathcal{P}=\{P^{(\theta)}:\theta\in\Theta\} where

𝐘i𝐗ind𝒩(j=1pϕj(𝐗ij)Tβ(j),σ2) for i=1,,n under P(θ).\mathbf{Y}_{i}\mid\mathbf{X}\stackrel{{\scriptstyle\mathrm{ind}}}{{\sim}}\mathcal{N}\left(\sum_{j=1}^{p}\phi_{j}(\mathbf{X}_{ij})^{T}\beta^{(j)},\sigma^{2}\right)\text{ for }i=1,\dots,n\text{ under }P^{(\theta)}. (4.3)

By default, we take ϕj\phi_{j} to be the identity function, which reduces to a Gaussian linear model. However, if 𝐘\mathbf{Y} and 𝐗\mathbf{X} may have nonlinear relationships, we suggest taking ϕj()\phi_{j}(\cdot) to be the basis representation of regression splines (see Hastie et al., (2001) for review), as we do in Section 5.2. For the prior, we let π\pi denote the law of θ\theta^{\star} after sampling from the following process:

  • Sample hyperparameters p0Beta(a0,b0)p_{0}\sim\mathrm{Beta}(a_{0},b_{0}) (sparsity), τ2invGamma(aτ,bτ)\tau^{2}\sim\mathrm{invGamma}(a_{\tau},b_{\tau}) (signal size), and σ2invGamma(aσ,bσ)\sigma^{2}\sim\mathrm{invGamma}(a_{\sigma},b_{\sigma}) (noise variance). By default, we take a_{0}=b_{0}=b_{\tau}=b_{\sigma}=1 and a_{\tau}=a_{\sigma}=2.

  • Sample β(j)=BjZj\beta^{(j)}=B_{j}Z_{j} for Zjind𝒩(0,τ2Ikj)Z_{j}\stackrel{{\scriptstyle\mathrm{ind}}}{{\sim}}\mathcal{N}(0,\tau^{2}I_{k_{j}}) and Bji.i.d.Bern(1p0).B_{j}\stackrel{{\scriptstyle\mathrm{i.i.d.}}}{{\sim}}\mathrm{Bern}(1-p_{0}).

This group-sparse prior is effectively a “two-groups” model, as 𝐗j\mathbf{X}_{j} is null if and only if β(j)=0\beta^{(j)}=0. As shown in Section 5, using these hyperpriors allows us to adaptively estimate the sparsity level.
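For reference, here is a minimal sketch of sampling θ from the prior of Example 1. The function name and defaults are ours (the defaults match the hyperparameters listed above), and we assume the shape/scale parameterization of the inverse-gamma distribution.

```python
import numpy as np

def sample_prior(k_dims, a0=1.0, b0=1.0, a_tau=2.0, b_tau=1.0,
                 a_sigma=2.0, b_sigma=1.0, seed=0):
    """Draw theta = (beta^(1), ..., beta^(p), sigma^2) from the group-sparse prior of
    Example 1. `k_dims` lists the basis dimensions (k_1, ..., k_p)."""
    rng = np.random.default_rng(seed)
    p0 = rng.beta(a0, b0)                                  # sparsity level
    tau2 = 1.0 / rng.gamma(a_tau, 1.0 / b_tau)             # tau^2 ~ invGamma(a_tau, b_tau)
    sigma2 = 1.0 / rng.gamma(a_sigma, 1.0 / b_sigma)       # sigma^2 ~ invGamma(a_sigma, b_sigma)
    betas = [rng.binomial(1, 1.0 - p0) * rng.normal(0.0, np.sqrt(tau2), size=kj)
             for kj in k_dims]                             # beta^(j) = B_j * Z_j
    return betas, sigma2
```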

Standard techniques for “spike-and-slab” models (George and McCulloch,, 1997) allow us to compute the MLR statistics from Ex. 1 in O(nsamplenp)O(n_{\mathrm{sample}}np) operations (assuming j=1pkj=O(p)\sum_{j=1}^{p}k_{j}=O(p))—see Appendix E for review. This cost is cheaper than computing Gaussian MX or FX knockoffs, which requires O(np2+p3)O(np^{2}+p^{3}) operations. Fitting the LASSO has a comparable cost, which is O(nsamplenp)O(n_{\mathrm{sample}}np) using coordinate descent or O(np2)O(np^{2}) using the LARS algorithm (Efron et al.,, 2004).

Lastly, we can easily extend this algorithm to binary responses. In particular, using techniques from Albert and Chib, (1993), we can compute Gibbs updates in the same computational complexity when Pπ(Y=1X)=Φ(j=1pϕj(Xj)Tβ(j))P^{\pi}(Y=1\mid X)=\Phi\left(\sum_{j=1}^{p}\phi_{j}(X_{j})^{T}\beta^{(j)}\right), where Φ\Phi is the Gaussian CDF (see Appendix E for details).
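To illustrate the data-augmentation idea, the sketch below gives one plain Albert and Chib (1993) Gibbs update for a probit model with a Gaussian prior on the stacked coefficients. It is a simplified stand-in: the paper's sampler additionally handles the spike-and-slab structure and attains the computational complexity discussed above, which this vanilla O(p^3) update does not attempt.

```python
import numpy as np
from scipy import stats

def albert_chib_update(y, X, beta, tau2, rng):
    """One Albert-Chib data-augmentation update for probit regression with a N(0, tau2 * I)
    prior on beta. A sketch only; see Appendix E for the paper's full sampler."""
    mu = X @ beta
    # Latent z_i ~ N(mu_i, 1), truncated to (0, inf) if y_i = 1 and to (-inf, 0) if y_i = 0.
    lower = np.where(y == 1, -mu, -np.inf)
    upper = np.where(y == 1, np.inf, -mu)
    z = mu + stats.truncnorm.rvs(lower, upper, size=len(y), random_state=rng)
    # Conjugate Gaussian update: beta | z ~ N(V X'z, V) with V = (X'X + I / tau2)^{-1}.
    V = np.linalg.inv(X.T @ X + np.eye(X.shape[1]) / tau2)
    return rng.multivariate_normal(V @ (X.T @ z), V)
```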

5 Simulations

We now analyze the power of MLR statistics in simulations. Throughout, MLR statistics do not have accurate prior information: we use exactly the same choice of Bayesian model PπP^{\pi} (the default from Section 4.2) to compute MLR statistics in every plot. Also, we let 𝐗\mathbf{X} be highly correlated to test whether MLR statistics perform well even when Assumption 3.2 may fail. Nonetheless, MLR statistics uniformly outperform existing competitors.

The FDR level is q=0.05q=0.05. All plots have two standard deviation error bars, although the bars may be too small to be visible. In each plot, knockoffs provably control the frequentist FDR, so we only plot power. All code is available at https://github.com/amspector100/mlr_knockoff_paper.

5.1 Gaussian linear models

Figure 3: Power of MLR, LCD, and LSM statistics in a sparse Gaussian linear model with p=500p=500 and 5050 non-nulls. For FX knockoffs, MLR statistics almost exactly match the power of the oracle procedure which provably upper bounds the power of any feature statistic. For MX knockoffs, MLR statistics are slightly less powerful than the oracle, although they are still very powerful compared to the lasso-based statistics. Note that the power of knockoffs can be roughly constant in nn in the “SDP” setting: this is because SDP knockoffs sometimes have identifiability issues (Spector and Janson,, 2022). Also, note that the “LCD” and “LSM” curves completely overlap in two of the bottom panels, where both methods have zero power. See Appendix G for precise simulation details.

In this section, we sample 𝐘𝐗𝒩(𝐗β,In)\mathbf{Y}\mid\mathbf{X}\sim\mathcal{N}(\mathbf{X}\beta,I_{n}) for sparse β\beta. We draw Xi.i.d.𝒩(0,Σ)X\stackrel{{\scriptstyle\mathrm{i.i.d.}}}{{\sim}}\mathcal{N}(0,\Sigma) for two choices of Σ\Sigma. By default, Σ\Sigma corresponds to a highly correlated nonstationary AR(1) process, inspired by real genetic design matrices. However, we also analyze an “ErdosRenyi” covariance matrix where Σ\Sigma is 80%80\% sparse with the nonzero entries drawn uniformly at random. We compute both “SDP” and “MVR” knockoffs (Candès et al.,, 2018; Spector and Janson,, 2022) to show that MLR statistics perform well in both cases. See Appendix G for further simulation details.
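For readers who wish to reproduce a setting of this flavor, the following sketch generates data roughly matching the description above. The correlation, coefficient size, and dimensions are illustrative placeholders rather than the exact values from Appendix G, and we use a stationary AR(1) design as a simple stand-in for the nonstationary one.

```python
import numpy as np

def sample_linear_ar1(n=650, p=500, k=50, rho=0.9, coef=0.5, seed=0):
    """Generate X ~ N(0, Sigma) with an AR(1) covariance and Y | X ~ N(X beta, I_n)
    with k non-null coefficients. All parameter values are illustrative placeholders."""
    rng = np.random.default_rng(seed)
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.zeros(p)
    nonnull = rng.choice(p, size=k, replace=False)
    beta[nonnull] = coef * rng.choice([-1.0, 1.0], size=k)
    Y = X @ beta + rng.standard_normal(n)
    return X, Y, beta
```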

We compare four feature statistics. First, we compute MLR statistics using the default Bayesian model from Section 4—in plots, “MLR” refers to this version of MLR statistics. Second, we compute LCD and LSM statistics as described in Section 2. Lastly, we compute the oracle MLR statistics which have full knowledge of the true value of β\beta. Figure 3 shows the results while varying nn in low dimensions (using FX knockoffs) and high dimensions (using MX knockoffs). It shows that MLR statistics are substantially more powerful than the lasso-based statistics and, in the FX case, MLR statistics almost perfectly match the power of the oracle. Indeed, this result holds even for the “ErdosRenyi” covariance matrix, where 𝐗\mathbf{X} exhibits strong non-local dependencies (in contrast to Assumption 3.2). Furthermore, Figure 4 shows that MLR statistics are computationally efficient, often faster than a cross-validated lasso and comparable to the cost of computing FX knockoffs.

Figure 4: This figure shows the computation time for various feature statistics in the same setting as Figure 3, as well as the cost of computing knockoffs. It shows that MLR statistics are competitive with state-of-the-art feature statistics (in the model-X case) or comparable to the cost of computing knockoffs (in the fixed-X case).

Next, we analyze the performance of MLR statistics when the prior is misspecified. In Figure 5, we vary the sparsity (proportion of non-nulls) between 5%5\% and 40%40\%, and we draw the non-null coefficients as (i) heavy-tailed i.i.d. Laplace variables and (ii) “light-tailed” i.i.d. Unif([1/2,1/4][1/4,1/2])\mathrm{Unif}([-1/2,-1/4]\cup[1/4,1/2]) variables. In all cases, the MLR prior assumes the non-null coefficients are i.i.d. 𝒩(0,τ2)\mathcal{N}(0,\tau^{2}) with sparsity p0Beta(1,1)p_{0}\sim\mathrm{Beta}(1,1). Nonetheless, MLR statistics consistently outperform the lasso-based statistics and nearly match the performance of the oracle.

Figure 5: This figure shows the power of MLR, LCD, and LSM statistics when varying the sparsity level and drawing the non-null coefficients from a heavy-tailed (Laplace) and light-tailed (Uniform) distribution, with p=500p=500 and n=1250n=1250. The setting is otherwise identical to the AR1 setting from Figure 3. It shows that the MLR statistics perform well despite using the same (misspecified) prior in every setting.

Lastly, we verify that the local dependence condition assumed in Theorem 3.2 holds empirically. We consider the AR(1) setting but modify the parameters so that 𝐗\mathbf{X} is extremely highly correlated, with adjacent correlations drawn as i.i.d. Beta(50,1)\mathrm{Beta}(50,1) variables. We also consider a setting where 𝐗\mathbf{X} is equicorrelated with correlation 95%95\%. In both cases, Figure 6 shows that CovPπ(𝕀(MLRπ>0)D)\operatorname{Cov}_{P^{\pi}}(\mathbb{I}(\mathrm{MLR}^{\pi}>0)\mid D) has entries which decay off the main diagonal—in fact, the maximum off-diagonal covariance across both examples is 0.070.07. Please see Section 3.3 and Appendix B.9 for intuition behind this result, although we cannot perfectly explain it.

Figure 6: In the AR(1) and equicorrelated settings, this plot shows both the correlation matrix of 𝐗\mathbf{X} as well as the conditional covariance of signs of the MLR statistic MLRπ\mathrm{MLR}^{\pi}, computed as per Remark 4. It shows that even when 𝐗\mathbf{X} is very highly correlated, the signs of MLRπ\mathrm{MLR}^{\pi} are only locally dependent. Note in this plot, every feature is non-null and the power of knockoffs is 53%53\% and 10%10\% for the AR(1) and equicorrelated settings, respectively.

5.2 Generalized additive models

We now sample 𝐘𝐗𝒩(h(𝐗)β,In)\mathbf{Y}\mid\mathbf{X}\sim\mathcal{N}(h(\mathbf{X})\beta,I_{n}) for some non-linear function h:h:\mathbb{R}\to\mathbb{R} applied element-wise to 𝐗n×p\mathbf{X}\in\mathbb{R}^{n\times p}. We consider the AR(1) setting from Section 5.1 with four choices of hh: h(x)=sin(x),h(x)=cos(x),h(x)=x2,h(x)=x3h(x)=\sin(x),h(x)=\cos(x),h(x)=x^{2},h(x)=x^{3}. We compare six feature statistics: linear MLR statistics, MLR statistics based on cubic regression splines with one knot, oracle MLR statistics, the LCD, a random forest with swap importances as in Gimenez et al., (2019), and DeepPINK (Lu et al.,, 2018), which is based on a feedforward neural network. This setting is more challenging than the linear setting, since the feature statistics must learn (or approximate) the function hh. Thus, our simulations in this section are low-dimensional with n>pn>p, and we should not expect any of the other feature statistics to match the performance of the oracle MLR statistics.

Figure 7: This figure plots power in generalized additive models where 𝐘𝐗𝒩(h(𝐗)β,In)\mathbf{Y}\mid\mathbf{X}\sim\mathcal{N}(h(\mathbf{X})\beta,I_{n}), for h:h:\mathbb{R}\to\mathbb{R} applied elementwise to 𝐗\mathbf{X}. The x-facets show the choice of hh, and the y-facets show results for both MVR and SDP MX knockoffs; note p=200p=200 and there are 6060 non-nulls. For this plot only, we choose q=0.1q=0.1 because several of the competitors made zero discoveries at q=0.05q=0.05. See Appendix G for the plot with q=0.05q=0.05. Note that in the top “cubic” panel, the MLR (splines) statistic has 100%100\% power and overlaps with the oracle.

Figure 7 shows that “MLR (splines)” uniformly outperforms every other feature statistic, often by wide margins. Linear MLR and LCD statistics are powerless in the cos\cos and quadratic settings, where hh is an even function and thus the non-null features have no linear relationship with the response. However, in the sin\sin and cubic settings, linear MLR statistics outperform the LCD, suggesting that linear MLR statistics can be powerful under misspecification as long as there is some linear effect.

5.3 Logistic regression

Lastly, we consider logistic regression: we sample YXBern(s(Xβ))Y\mid X\sim\mathrm{Bern}(s(X^{\top}\beta)), where ss is the sigmoid function. We run the same simulation setting as Figure 3, except that YY is now binary and we consider low-dimensional settings, since inference in logistic regression is generally more challenging than in linear regression. Figure 8 shows that MLR statistics outperform the LCD, although there is a substantial gap between the MLR and oracle MLR statistics.

Figure 8: This plot shows the power of MLR statistics compared to the cross-validated LCD in logistic regression, with p=500p=500, 5050 non-nulls, and nn varied between 15001500 and 45004500. The setting is otherwise identical to Figure 3.

6 Real applications

In this section, we apply MLR statistics to three real datasets which have been previously analyzed using knockoffs. We use the same default choice of MLR statistics from our simulations in all three applications. In each case, MLR statistics have comparable or higher power than competitor statistics. All code and data are available at https://github.com/amspector100/mlr_knockoff_paper.

6.1 HIV drug resistance

We begin with the HIV drug resistance dataset from Rhee et al., (2006), which Barber and Candès, (2015), among others, previously analyzed using knockoffs. The dataset consists of genotype data from n750n\approx 750 HIV samples as well as drug resistance measurements for 1616 different drugs, and the goal is to discover genetic variants that affect drug resistance for each drug. Furthermore, Rhee et al., (2005) published treatment-selected mutation panels for this setting, so we can check whether the discoveries made by knockoffs are corroborated by this separate analysis.

We preprocess and model the data following Barber and Candès, (2015). Then, we apply FX knockoffs with LCD, LSM, and MLR statistics and FDR level q=0.05q=0.05. For both MVR and SDP knockoffs, Figure 9 shows the total number of discoveries made by each statistic, stratified by whether each discovery is corroborated by Rhee et al., (2005). For SDP knockoffs, the MLR statistics make nearly an order of magnitude more discoveries than the competitor methods with a comparable corroboration rate. For MVR knockoffs, MLR and LCD statistics perform roughly equally well, although MLR statistics make 5%\approx 5\% more discoveries with a slightly higher corroboration rate. Overall, in this setting, MLR statistics are competitive with and sometimes substantially outperform the lasso-based statistics. See Appendix H for specific results for each drug.

Figure 9: This figure shows the total number of discoveries made by the LCD, LSM, and MLR feature statistics in the HIV drug resistance dataset from Rhee et al., (2006), summed across all 1616 drugs.

6.2 Financial factor selection

In finance, analysts often aim to select factors that drive the performance of a particular asset. Challet et al., (2021) applied FX knockoffs to factor selection, and as a benchmark, they tested which US equities explain the performance of an index fund for the energy sector (XLE). Here, the ground truth is available since the index fund is a weighted combination of a known list of stocks.

We perform the same analysis for ten index funds tracking key sectors of the US economy, including energy and technology (see Appendix H). Here, 𝐘\mathbf{Y} is the index fund’s daily log return and 𝐗\mathbf{X} contains the daily log returns of each stock in the S&P 500 since 20132013, so p500p\approx 500 and n2300n\approx 2300. We compute fixed-X MVR and SDP knockoffs and apply LCD, LSM, and MLR statistics. Figure 10 shows the number of true and false discoveries summed across all index funds with q=0.05q=0.05. In particular, MLR statistics make 35%35\% and 78%78\% more discoveries than the LCD for MVR and SDP knockoffs, respectively, and the LSM makes fewer than one-fifth as many discoveries as the MLR statistics. Thus, MLR statistics substantially outperform the lasso-based statistics. Appendix H also shows that the FDP (averaged across all index funds) is well below 5%5\% for each method.

Figure 10: This figure shows the total number of discoveries made by each method in the fund replication dataset inspired by Challet et al., (2021), summed across all ten index funds. See Appendix H for a table showing that the average FDP for each method is below the nominal level of q=0.05q=0.05.

6.3 Graphical model discovery for gene networks

Lastly, we consider the problem of recovering a gene network from single-cell RNAseq data. Our analysis follows Li and Maathuis, (2019), who model gene expression log counts as a Gaussian graphical model (see Li and Maathuis, (2019) for justification of the Gaussian assumption). In particular, they develop an extension of FX knockoffs that detects edges in Gaussian graphical models while controlling the FDR across discovered edges. They applied this method to RNAseq data from Zheng et al., (2017). The ground truth is not available, so following Li and Maathuis, (2019), we only evaluate methods based on the number of discoveries they make.

We replicate this analysis and compare LCD, LSM, and MLR statistics. Figure 11 plots the number of discoveries as a function of q[0,0.5]q\in[0,0.5]. MLR statistics make the most discoveries for nearly every value of qq, although often by a small margin. For small qq, the LSM statistic performs poorly, and for large qq, the LCD statistic performs poorly, whereas the MLR statistic is consistently powerful.

Figure 11: This figure shows the number of discoveries made by LCD, LSM, and MLR statistics when used to detect edges in a Gaussian graphical model for gene expression data, as in Li and Maathuis, (2019).

7 Discussion

This paper introduces masked likelihood ratio statistics, a class of asymptotically Bayes-optimal knockoff statistics. We show in simulations and three applications that MLR statistics are efficient and powerful. However, our work leaves open several directions for future research.

  • MLR statistics are asymptotically Bayes optimal. However, it might be worthwhile to develop minimax-optimal knockoff statistics, e.g., by computing a “least-favorable” prior.

  • Our theory requires a “local dependence” condition which is challenging to verify analytically, although it can be diagnosed using the data at hand. It might be interesting to investigate (i) precisely when this condition holds and (ii) whether MLR statistics are still optimal when it fails.

  • We only consider classes of MLR statistics designed for binary GLMs and generalized additive models. However, other types of MLR statistics could be more powerful, e.g., those based on Bayesian additive regression trees (Chipman et al.,, 2010).

  • In practice, analysts may prefer to discover features with large effect sizes. E.g., in Section 2, the P90.M variant has a large estimated OLS coefficient; thus, while it is particularly hard to discover, it may be particularly valuable to discover. In principle, the Bayesian framework in Section 3.4 could be used to find knockoff statistics which asymptotically maximize many different notions of power, e.g., the sum of squared coefficient sizes across all discovered variables.

8 Acknowledgements

The authors thank John Cherian, Kevin Guo, Lucas Janson, Lihua Lei, Basil Saeed, Anav Sood, and Timothy Sudijono for valuable comments. A.S. is partially supported by a Citadel GQS PhD Fellowship, the Two Sigma Graduate Fellowship Fund, and an NSF Graduate Research Fellowship. W.F. is partially supported by NSF grant DMS-1916220 and a Hellman Fellowship from Berkeley.

References

  • Albert and Chib, (1993) Albert, J. H. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669–679.
  • Barber and Candès, (2015) Barber, R. F. and Candès, E. J. (2015). Controlling the false discovery rate via knockoffs. Ann. Statist., 43(5):2055–2085.
  • Barber et al., (2020) Barber, R. F., Candès, E. J., and Samworth, R. J. (2020). Robust inference with knockoffs. Ann. Statist., 48(3):1409–1431.
  • Bates et al., (2020) Bates, S., Candès, E., Janson, L., and Wang, W. (2020). Metropolized knockoff sampling. Journal of the American Statistical Association, 0(0):1–15.
  • Brooks et al., (2011) Brooks, S., Gelman, A., Jones, G., and Meng, X. (2011). Handbook of Markov Chain Monte Carlo. CRC Press, United States.
  • Candès et al., (2018) Candès, E., Fan, Y., Janson, L., and Lv, J. (2018). Panning for gold: Model-X knockoffs for high-dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B, 80(3):551–577.
  • Challet et al., (2021) Challet, D., Bongiorno, C., and Pelletier, G. (2021). Financial factors selection with knockoffs: Fund replication, explanatory and prediction networks. Physica A: Statistical Mechanics and its Applications, 580:126105.
  • Chen et al., (2019) Chen, J., Hou, A., and Hou, T. Y. (2019). A prototype knockoff filter for group selection with FDR control. Information and Inference: A Journal of the IMA, 9(2):271–288.
  • Chipman et al., (2010) Chipman, H. A., George, E. I., and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266 – 298.
  • Dai and Barber, (2016) Dai, R. and Barber, R. (2016). The knockoff filter for fdr control in group-sparse and multitask regression. In Balcan, M. F. and Weinberger, K. Q., editors, International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1851–1859, New York, New York, USA. PMLR.
  • Donoho and Jin, (2004) Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. The Annals of Statistics, 32(3):962 – 994.
  • Doukhan and Neumann, (2007) Doukhan, P. and Neumann, M. H. (2007). Probability and moment inequalities for sums of weakly dependent random variables, with applications. Stochastic Processes and their Applications, 117(7):878–903.
  • Efron et al., (2004) Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2):407 – 499.
  • Fan et al., (2020) Fan, Y., Demirkaya, E., Li, G., and Lv, J. (2020). Rank: Large-scale inference with graphical nonlinear knockoffs. Journal of the American Statistical Association, 115(529):362–379. PMID: 32742045.
  • Farcomeni, (2007) Farcomeni, A. (2007). Some results on the control of the false discovery rate under dependence. Scandinavian Journal of Statistics, 34(2):275–297.
  • Ferreira and Zwinderman, (2006) Ferreira, J. A. and Zwinderman, A. H. (2006). On the Benjamini–Hochberg method. The Annals of Statistics, 34(4):1827 – 1849.
  • Fithian and Lei, (2020) Fithian, W. and Lei, L. (2020). Conditional calibration for false discovery rate control under dependence.
  • Genovese and Wasserman, (2004) Genovese, C. and Wasserman, L. (2004). A stochastic process approach to false discovery control. The Annals of Statistics, 32(3):1035 – 1061.
  • George and McCulloch, (1997) George, E. I. and McCulloch, R. E. (1997). Approaches for bayesian variable selection. Statistica Sinica, 7(2):339–373.
  • Gimenez et al., (2019) Gimenez, J. R., Ghorbani, A., and Zou, J. Y. (2019). Knockoffs for the mass: New feature importance statistics with false discovery guarantees. In Chaudhuri, K. and Sugiyama, M., editors, The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, volume 89 of Proceedings of Machine Learning Research, pages 2125–2133. PMLR.
  • Guan and Stephens, (2011) Guan, Y. and Stephens, M. (2011). Bayesian variable selection regression for genome-wide association studies and other large-scale problems. The Annals of Applied Statistics, 5(3):1780 – 1815.
  • Hastie et al., (2001) Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA.
  • Huang and Janson, (2020) Huang, D. and Janson, L. (2020). Relaxing the assumptions of knockoffs by conditioning. Ann. Statist., 48(5):3021–3042.
  • Katsevich and Ramdas, (2020) Katsevich, E. and Ramdas, A. (2020). On the power of conditional independence testing under model-x.
  • Ke et al., (2020) Ke, Z. T., Liu, J. S., and Ma, Y. (2020). Power of fdr control methods: The impact of ranking algorithm, tampered design, and symmetric statistic. arXiv preprint: arXiv:2010.08132.
  • Li and Maathuis, (2019) Li, J. and Maathuis, M. H. (2019). Ggm knockoff filter: False discovery rate control for gaussian graphical models.
  • Li and Fithian, (2021) Li, X. and Fithian, W. (2021). Whiteout: when do fixed-x knockoffs fail?
  • Liu and Rigollet, (2019) Liu, J. and Rigollet, P. (2019). Power analysis of knockoff filters for correlated designs. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
  • Lu et al., (2018) Lu, Y. Y., Fan, Y., Lv, J., and Noble, W. S. (2018). Deeppink: reproducible feature selection in deep neural networks. In Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 8690–8700.
  • Martello and Toth, (1990) Martello, S. and Toth, P. (1990). Knapsack problems: algorithms and computer implementations. John Wiley & Sons, Inc., USA.
  • Polson et al., (2013) Polson, N. G., Scott, J. G., and Windle, J. (2013). Bayesian inference for logistic models using pólya–gamma latent variables. Journal of the American Statistical Association, 108(504):1339–1349.
  • Ren and Candès, (2020) Ren, Z. and Candès, E. J. (2020). Knockoffs with side information. Annals of Applied Statistics.
  • Rhee et al., (2005) Rhee, S.-Y., Fessel, W. J., Zolopa, A. R., Hurley, L., Liu, T., Taylor, J., Nguyen, D. P., Slome, S., Klein, D., Horberg, M., Flamm, J., Follansbee, S., Schapiro, J. M., and Shafer, R. W. (2005). Hiv-1 protease and reverse-transcriptase mutations: Correlations with antiretroviral therapy in subtype b isolates and implications for drug-resistance surveillance. The Journal of Infectious Diseases, 192(3):456–465.
  • Rhee et al., (2006) Rhee, S.-Y., Taylor, J., Wadhera, G., Ben-Hur, A., Brutlag, D. L., and Shafer, R. W. (2006). Genotypic predictors of human immunodeficiency virus type 1 drug resistance. Proceedings of the National Academy of Sciences, 103(46):17355–17360.
  • Robert and Casella, (2004) Robert, C. and Casella, G. (2004). Monte Carlo statistical methods. Springer Verlag.
  • Romano et al., (2020) Romano, Y., Sesia, M., and Candès, E. (2020). Deep knockoffs. Journal of the American Statistical Association, 115(532):1861–1872.
  • Sechidis et al., (2021) Sechidis, K., Kormaksson, M., and Ohlssen, D. (2021). Using knockoffs for controlled predictive biomarker identification. Statistics in Medicine, 40(25):5453–5473.
  • Sesia et al., (2019) Sesia, M., Katsevich, E., Bates, S., Candès, E., and Sabatti, C. (2019). Multi-resolution localization of causal variants across the genome. bioRxiv.
  • Sesia et al., (2018) Sesia, M., Sabatti, C., and Candès, E. J. (2018). Gene hunting with hidden Markov model knockoffs. Biometrika, 106(1):1–18.
  • Spector and Janson, (2022) Spector, A. and Janson, L. (2022). Powerful knockoffs via minimizing reconstructability. The Annals of Statistics, 50(1):252 – 276.
  • Storey et al., (2004) Storey, J. D., Taylor, J. E., and Siegmund, D. (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(1):187–205.
  • Wang and Janson, (2021) Wang, W. and Janson, L. (2021). A high-dimensional power analysis of the conditional randomization test and knockoffs. Biometrika.
  • Weinstein et al., (2017) Weinstein, A., Barber, R. F., and Candès, E. J. (2017). A power analysis for knockoffs under Gaussian designs. IEEE Transactions on Information Theory.
  • Weinstein et al., (2020) Weinstein, A., Su, W. J., Bogdan, M., Barber, R. F., and Candès, E. J. (2020). A power analysis for knockoffs with the lasso coefficient-difference statistic. arXiv.
  • Weissbrod et al., (2020) Weissbrod, O., Hormozdiari, F., Benner, C., Cui, R., Ulirsch, J., Gazal, S., Schoech, A., van de Geijn, B., Reshef, Y., Márquez-Luna, C., O’Connor, L., Pirinen, M., Finucane, H. K., and Price, A. L. (2020). Functionally-informed fine-mapping and polygenic localization of complex trait heritability. Nature Genetics, 52:1355–1363.
  • Zheng et al., (2017) Zheng, G., Terry, J., Belgrader, P., Ryvkin, P., Bent, Z., Wilson, R., Ziraldo, S., Wheeler, T., McDermott, G., Zhu, J., Gregory, M., Shuga, J., Montesclaros, L., Underwood, J., Masquelier, D., Nishimura, S., Schnall-Levin, M., Wyatt, P., Hindson, C., Bharadwaj, R., Wong, A., Ness, K., Beppu, L., Deeg, H., McFarland, C., Loeb, K., Valente, W., Ericson, N., Stevens, E., Radich, J., Mikkelsen, T., Hindson, B., and Bielas, J. (2017). Massively parallel digital transcriptional profiling of single cells. Nature Communications.

Appendix A An illustration of the importance of the order of |W||W|

Section 2.1 argues intuitively that a good knockoff statistic should roughly achieve the following goals:

  1. For each jj, it should maximize P(Wj>0)P^{\star}(W_{j}>0).

  2. The order of {|Wj|}j=1p\{|W_{j}|\}_{j=1}^{p} should match the order of {P(Wj>0)}j=1p\{P^{\star}(W_{j}>0)\}_{j=1}^{p}—i.e., |Wj||W_{j}| should be an increasing function of P(Wj>0)P^{\star}(W_{j}>0).

Section 3 formalizes (and slightly modifies) these goals to develop an asymptotically optimal test statistic. However, to build intuition, we now give a concrete (if contrived) example showing the importance of the second goal.

Consider a setting with p=50p=50 features where 2525 features have a large signal size with P(Wj>0)=99.9%P^{\star}(W_{j}>0)=99.9\%, 1010 features have a moderate signal size with P(Wj>0)=75%P^{\star}(W_{j}>0)=75\%, the last 1515 features are null with P(Wj>0)=50%P^{\star}(W_{j}>0)=50\%, and {sign(Wj)}j=1p\{\operatorname*{sign}(W_{j})\}_{j=1}^{p} are independent. In this case, what absolute values should WW take to maximize power?

To make any discoveries and control the FDR at level q=0.05q=0.05, we must ensure that >95%>95\% of the kk feature statistics with the largest absolute values have positive signs (for some k20k\geq 20 due to the ceiling function in Step 2 in Section 2.1). Since only the features with large signal sizes have a >95%>95\% chance of being positive, making any discoveries is extremely unlikely unless the features with large signal sizes generally have the highest absolute values. Figure 12 illustrates this argument—in the “random prioritization” setting, we sample |Wj|i.i.d.Unif(0,1)|W_{j}|\stackrel{{\scriptstyle\mathrm{i.i.d.}}}{{\sim}}\mathrm{Unif}(0,1), and in the “oracle prioritization” setting, we set |Wj|=P(Wj>0)|W_{j}|=P^{\star}(W_{j}>0). As expected, knockoffs makes zero discoveries with random prioritization and 30\approx 30 discoveries with oracle prioritization.

Figure 12: In a simple (if contrived) example from Appendix A, this figure illustrates the importance of ensuring that {|Wj|}j=1p\{|W_{j}|\}_{j=1}^{p} has roughly the same order as {P(Wj>0)}j=1p\{P^{\star}(W_{j}>0)\}_{j=1}^{p}. The dotted black line shows the discovery threshold.

Appendix B Main proofs and interpretation

In this section, we prove the main results of the paper. We also offer additional discussion of these results.

B.1 Knockoffs as inference on masked data

In this section, we prove Propositions 3.1 and 3.2, Lemma 3.1, and one more related corollary which will be useful when proving Theorem 3.2. As notation, for any matrices M1,M2n×pM_{1},M_{2}\in\mathbb{R}^{n\times p}, let [M1,M2]swap(j)[M_{1},M_{2}]_{{\text{swap}(j)}} denote the matrix [M1,M2][M_{1},M_{2}] but with the jjth column of M1M_{1} and M2M_{2} swapped: similarly, [M1,M2]swap(J)[M_{1},M_{2}]_{{\text{swap}(J)}} swaps all columns jJj\in J of M1M_{1} and M2M_{2}.

Proposition 3.1.

Let 𝐗~\widetilde{\mathbf{X}} be model-X knockoffs such that 𝐗j𝐗~j\mathbf{X}_{j}\neq\widetilde{\mathbf{X}}_{j} a.s. for j[p]j\in[p]. Then W=w([𝐗,𝐗~],𝐘)W=w([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y}) is a valid feature statistic if and only if:

  1. |W||W| is a function of the masked data DD.

  2. For all j[p]j\in[p], there exists a DD-measurable random vector 𝐗^j\widehat{\mathbf{X}}_{j} such that Wj>0W_{j}>0 if and only if 𝐗^j=𝐗j\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j}.

Proof.

Forward direction: Suppose WW is a valid feature statistic; we will now show conditions (i) and (ii). To show (i), note that observing {𝐗j,𝐗~j}j=1p\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}_{j=1}^{p} is equivalent to observing [𝐗,𝐗~]swap(J)[\mathbf{X},\widetilde{\mathbf{X}}]_{{\text{swap}(J)}} for some unobserved J[p]J\subset[p] chosen uniformly at random. Define [𝐗(1),𝐗(2)][𝐗,𝐗~]swap(J)[\mathbf{X}^{(1)},\mathbf{X}^{(2)}]\coloneqq[\mathbf{X},\widetilde{\mathbf{X}}]_{{\text{swap}(J)}} and let W=w([𝐗(1),𝐗(2)],𝐘)W^{\prime}=w([\mathbf{X}^{(1)},\mathbf{X}^{(2)}],\mathbf{Y}). Then by the swap invariance property of knockoffs, we have that |W|=|W||W|=|W^{\prime}|. Since |W||W^{\prime}| is a function of DD, this implies |W||W| is a function of DD as well, which proves (i).

To show (ii), we construct 𝐗^j\widehat{\mathbf{X}}_{j} as follows. Let OjnO_{j}\in\mathbb{R}^{n} be any “other” random vector chosen such that Oj{𝐗j(1),𝐗j(2)}O_{j}\not\in\{\mathbf{X}_{j}^{(1)},\mathbf{X}_{j}^{(2)}\}. Then define

𝐗^j{𝐗j(1)Wj>0𝐗j(2)Wj<0OjWj=0.\widehat{\mathbf{X}}_{j}\coloneqq\begin{cases}\mathbf{X}_{j}^{(1)}&W_{j}^{\prime}>0\\ \mathbf{X}_{j}^{(2)}&W_{j}^{\prime}<0\\ O_{j}&W_{j}^{\prime}=0.\end{cases}

Intuitively, we set 𝐗^j=Oj\widehat{\mathbf{X}}_{j}=O_{j} if and only if Wj=0Wj=0W_{j}=0\Leftrightarrow W_{j}^{\prime}=0, since this will guarantee that 𝐗^j𝐗j\widehat{\mathbf{X}}_{j}\neq\mathbf{X}_{j} whenever Wj=0W_{j}=0.

Note that 𝐗^j\widehat{\mathbf{X}}_{j} is a function of [𝐗(1),𝐗(2)],𝐘[\mathbf{X}^{(1)},\mathbf{X}^{(2)}],\mathbf{Y} and therefore is DD-measurable. To show that 𝐗^j\widehat{\mathbf{X}}_{j} is well-defined (does not depend on JJ), note that 𝐗^j{𝐗j,𝐗~j,Oj}\widehat{\mathbf{X}}_{j}\in\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j},O_{j}\} can only take one of three values conditional on DD. Thus, it suffices to show that the events 𝐗^j=Oj\widehat{\mathbf{X}}_{j}=O_{j} and 𝐗^j=𝐗j\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j} do not depend on the random set JJ.

To show that the event 𝐗^j=Oj\widehat{\mathbf{X}}_{j}=O_{j} does not depend on JJ, recall 𝐗^j=Oj\widehat{\mathbf{X}}_{j}=O_{j} iff |Wj|=0|W_{j}^{\prime}|=0; since |Wj|=|Wj||W_{j}^{\prime}|=|W_{j}|, this event does not depend on JJ.

To show that the event 𝐗^j=𝐗j\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j} does not depend on JJ, it suffices to show 𝐗^j=𝐗j\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j} if and only if Wj>0W_{j}>0, which also shows (ii). There are two cases. In the first case, if jJj\not\in J, then 𝐗j(1)=𝐗j\mathbf{X}_{j}^{(1)}=\mathbf{X}_{j} by definition of 𝐗(1)\mathbf{X}^{(1)} and also Wj=WjW_{j}^{\prime}=W_{j} by the “flip-sign” property of ww. Thus 𝐗^j=𝐗j(1)=𝐗j\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j}^{(1)}=\mathbf{X}_{j} if and only if Wj>0W_{j}>0. The second case is analogous: if jJj\in J, then Wj=WjW_{j}^{\prime}=-W_{j}, so 𝐗^j=𝐗j(2)=𝐗j\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j}^{(2)}=\mathbf{X}_{j} if and only if Wj<0Wj>0W_{j}^{\prime}<0\Leftrightarrow W_{j}>0. In both cases, Wj>0W_{j}>0 if and only if 𝐗^j=𝐗j\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j}, proving (ii).

Backwards direction: To show W=w([𝐗,𝐗~],𝐘)W=w([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y}) is a valid feature statistic, it suffices to show the flip-sign property, namely that Ww([𝐗,𝐗~]swap(J),𝐘)=1JWW^{\prime}\coloneqq w([\mathbf{X},\widetilde{\mathbf{X}}]_{{\text{swap}(J)}},\mathbf{Y})=-1_{J}\odot W, where \odot denotes elementwise multiplication and 1J-1_{J} is the vector of all ones but with negative ones at the indices in JJ. To do this, note that DD is invariant to swaps of 𝐗\mathbf{X} and 𝐗~\widetilde{\mathbf{X}}, so |W|=|W||W|=|W^{\prime}| because by assumption |W|,|W||W|,|W^{\prime}| are a function of DD. Furthermore, for any j[p]j\in[p], we have that Wj>0W_{j}>0 if and only if 𝐗^j=𝐗j\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j}; however, since 𝐗^j\widehat{\mathbf{X}}_{j} is also a function of DD, we have that sign(Wj)=sign(Wj)\operatorname*{sign}(W_{j})=\operatorname*{sign}(W^{\prime}_{j}) if and only if jJj\not\in J. This completes the proof. ∎

The proof of Proposition 3.2 is identical to the proof of Proposition 3.1, so we omit it for brevity.

We now prove Lemma 3.1.

Lemma 3.1.

Equations (3.3) and (3.4) define valid MX and FX knockoff statistics, respectively.

Proof.

For the MX case, we will show that for any J[p]J\subset[p], mlrπ([𝐗,𝐗~]swap(J),𝐘)=1Jmlrπ([𝐗,𝐗~],𝐘)\mathrm{mlr}^{\pi}([\mathbf{X},\widetilde{\mathbf{X}}]_{{\text{swap}(J)}},\mathbf{Y})=-1_{J}\odot\mathrm{mlr}^{\pi}([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y}), where \odot denotes elementwise multiplication and 1Jp-1_{J}\in\mathbb{R}^{p} is the vector of ones but with negative ones at the indices in JJ. To show this, note that the masked data DD is invariant to swaps. Therefore, applying Eq. 3.3 yields

mlrjπ([𝐗,𝐗~]swap(J),𝐘)={log(Pjπ(𝐗jD)Pjπ(𝐗~jD))jJlog(Pjπ(𝐗~jD)Pjπ(𝐗jD))jJ.\mathrm{mlr}_{j}^{\pi}([\mathbf{X},\widetilde{\mathbf{X}}]_{{\text{swap}(J)}},\mathbf{Y})=\begin{cases}\log\left(\frac{P_{j}^{\pi}(\mathbf{X}_{j}\mid D)}{P_{j}^{\pi}(\widetilde{\mathbf{X}}_{j}\mid D)}\right)&j\not\in J\\ \log\left(\frac{P_{j}^{\pi}(\widetilde{\mathbf{X}}_{j}\mid D)}{P_{j}^{\pi}(\mathbf{X}_{j}\mid D)}\right)&j\in J.\end{cases} (B.1)

Since log(x/y)=log(x)log(y)\log(x/y)=\log(x)-\log(y) is an antisymmetric function, this proves that

mlrjπ([𝐗,𝐗~]swap(J),𝐘)={mlrjπ([𝐗,𝐗~],𝐘)jJmlrjπ([𝐗,𝐗~],𝐘)jJ.\mathrm{mlr}_{j}^{\pi}([\mathbf{X},\widetilde{\mathbf{X}}]_{{\text{swap}(J)}},\mathbf{Y})=\begin{cases}\mathrm{mlr}_{j}^{\pi}([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y})&j\not\in J\\ -\mathrm{mlr}_{j}^{\pi}([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y})&j\in J.\end{cases} (B.2)

which completes the proof. The proof for the FX case is analogous. ∎

Finally, the following corollary of Propositions 3.1 and 3.2 will be important when proving Theorem 3.2.

Corollary B.1.

Let W,WW,W^{\prime} be two knockoff feature statistics. Then in the same setting as Propositions 3.1 and 3.2, the event sign(Wj)=sign(Wj)\operatorname*{sign}(W_{j})=\operatorname*{sign}(W_{j}^{\prime}) is a deterministic function of the masked data DD.

Proof.

We give the proof for the model-X case, and the fixed-X case is analogous. First, note that the events Wj=0W_{j}=0 and Wj=0W_{j}^{\prime}=0 are DD-measurable events since |Wj|,|Wj||W_{j}|,|W_{j}^{\prime}| are DD-measurable by Proposition 3.1. Therefore, the only non-trivial case is the case where Wj,Wj0W_{j},W_{j}^{\prime}\neq 0, which we now consider.

By Proposition (3.1), there exist DD-measurable vectors 𝐗^j,𝐗^j\widehat{\mathbf{X}}_{j},\widehat{\mathbf{X}}_{j}^{\prime} such that Wj>0𝐗^j=𝐗jW_{j}>0\Leftrightarrow\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j} and Wj>0𝐗^j=𝐗jW_{j}^{\prime}>0\Leftrightarrow\widehat{\mathbf{X}}_{j}^{\prime}=\mathbf{X}_{j}. Since 𝐗^j,𝐗^j\widehat{\mathbf{X}}_{j},\widehat{\mathbf{X}}_{j}^{\prime} must take one of exactly two distinct values, this implies that

sign(Wj)=sign(Wj)𝐗^j=𝐗^j\operatorname*{sign}(W_{j})=\operatorname*{sign}(W_{j}^{\prime})\Leftrightarrow\widehat{\mathbf{X}}_{j}=\widehat{\mathbf{X}}_{j}^{\prime}

where the right-most expression is DD-measurable since 𝐗^j,𝐗^j\widehat{\mathbf{X}}_{j},\widehat{\mathbf{X}}_{j}^{\prime} are DD-measurable. This completes the proof. ∎

B.2 Proof of Proposition 3.3

Proposition 3.3.

Given data 𝐘,𝐗\mathbf{Y},\mathbf{X} and knockoffs 𝐗~\widetilde{\mathbf{X}}, let MLRπ\mathrm{MLR}^{\pi} be the MLR statistics with respect to some Bayesian model PπP^{\pi}. Let WW be any other valid knockoff feature statistic. Then,

Pπ(MLRjπ>0D)Pπ(Wj>0D).P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)\geq P^{\pi}(W_{j}>0\mid D). (3.6)

Furthermore, {|MLRjπ|}j=1p\{|\mathrm{MLR}_{j}^{\pi}|\}_{j=1}^{p} has the same order as {Pπ(MLRjπ>0D)}j=1p\{P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)\}_{j=1}^{p}. More precisely,

Pπ(MLRjπ>0D)=exp(|MLRjπ|)1+exp(|MLRjπ|).P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)=\frac{\exp(|\mathrm{MLR}_{j}^{\pi}|)}{1+\exp(|\mathrm{MLR}_{j}^{\pi}|)}. (3.7)
Proof.

First, we prove Eq. (3.6). Let 𝐗^j=argmax𝐱{𝐗j,𝐗~j}Pjπ(𝐱D)\widehat{\mathbf{X}}_{j}^{\star}=\operatorname*{arg\,max}_{\mathbf{x}\in\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}}P_{j}^{\pi}(\mathbf{x}\mid D) be the “best guess” of the value of 𝐗j\mathbf{X}_{j} based on DD, and observe that by definition MLRjπlog(Pjπ(𝐗jD))log(Pjπ(𝐗~jD))>0\mathrm{MLR}_{j}^{\pi}\coloneqq\log(P_{j}^{\pi}(\mathbf{X}_{j}\mid D))-\log(P_{j}^{\pi}(\widetilde{\mathbf{X}}_{j}\mid D))>0 if and only if 𝐗^j=𝐗j\widehat{\mathbf{X}}_{j}^{\star}=\mathbf{X}_{j}. Similarly, Proposition 3.1 proves that there exists some alternative DD-measurable random variable 𝐗^j\widehat{\mathbf{X}}_{j} such that Wj>0W_{j}>0 if and only if 𝐗^j=𝐗j\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j}. However, we note that by definition of 𝐗^j\widehat{\mathbf{X}}_{j}^{\star},

Pπ(MLRjπ>0D)=Pπ(𝐗^j=𝐗jD)Pπ(𝐗^j=𝐗jD)=Pπ(Wj>0D),P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)=P^{\pi}(\widehat{\mathbf{X}}_{j}^{\star}=\mathbf{X}_{j}\mid D)\geq P^{\pi}(\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j}\mid D)=P^{\pi}(W_{j}>0\mid D), (B.3)

which completes the proof of Eq. (3.6).

To prove Eq. (3.7), observe Pjπ(𝐗jD)=1Pjπ(𝐗~jD)P_{j}^{\pi}(\mathbf{X}_{j}\mid D)=1-P_{j}^{\pi}(\widetilde{\mathbf{X}}_{j}\mid D), since conditional on DD we observe the set {𝐗j,𝐗~j}\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}. Therefore,

|MLRjπ|=log(max𝐱{𝐗j,𝐗~j}Pjπ(𝐱D)1max𝐱{𝐗j,𝐗~j}Pjπ(𝐱D))=log(Pπ(MLRjπ>0D)1Pπ(MLRjπ>0D)),|\mathrm{MLR}_{j}^{\pi}|=\log\left(\frac{\max_{\mathbf{x}\in\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}}P_{j}^{\pi}(\mathbf{x}\mid D)}{1-\max_{\mathbf{x}\in\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}}P_{j}^{\pi}(\mathbf{x}\mid D)}\right)=\log\left(\frac{P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)}{1-P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)}\right),

where the second step uses the fact that Pπ(MLRjπ>0D)=Pπ(𝐗^j=𝐗jD)=max𝐱{𝐗j,𝐗~j}Pjπ(𝐱D)P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)=P^{\pi}\left(\widehat{\mathbf{X}}_{j}^{\star}=\mathbf{X}_{j}\mid D\right)=\max_{\mathbf{x}\in\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}}P_{j}^{\pi}(\mathbf{x}\mid D) for 𝐗^j\widehat{\mathbf{X}}_{j}^{\star} as defined above. This completes the proof for model-X knockoffs; the proof in the fixed-X case is analogous and just replaces {𝐗j,𝐗~j}\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\} with {𝐗jT𝐘,𝐗~jT𝐘}\{\mathbf{X}_{j}^{T}\mathbf{Y},\widetilde{\mathbf{X}}_{j}^{T}\mathbf{Y}\}. ∎

B.3 How far from optimality are MLR statistics in finite samples?

Our main result (Theorem 3.2) shows that MLR statistics asymptotically maximize the number of discoveries made by knockoffs under PπP^{\pi}. However, before rigorously proving Theorem 3.2, we give intuition suggesting that MLR statistics are likely nearly optimal even in finite samples.

Recall from Section 3.2 that in finite samples, MLR statistics (i) maximize Pπ(MLRjπ>0D)P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D) for each j[p]j\in[p] and (ii) ensure that the absolute values of the feature statistics {|MLRjπ|}j=1p\{|\mathrm{MLR}_{j}^{\pi}|\}_{j=1}^{p} have the same order as the probabilities {Pπ(MLRjπ>0)}j=1p\{P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0)\}_{j=1}^{p}. As per Proposition 3.4, this strategy is exactly optimal when the vector of signs sign(MLRπ)\operatorname*{sign}(\mathrm{MLR}^{\pi}) are conditionally independent given DD, but in general, it is possible to exploit conditional dependencies among the coordinates of sign(MLRπ)\operatorname*{sign}(\mathrm{MLR}^{\pi}) to slightly improve power. However, we argue below that it is challenging to even slightly improve power without unrealistically strong dependencies.

To see why this is the case, consider a very simple setting with p=6p=6 features and FDR level q=0.2q=0.2, so knockoffs will make discoveries exactly when the first five WW-statistics with the largest absolute values have positive signs. Suppose that W1,,W5DW_{1},\dots,W_{5}\mid D are perfectly correlated and satisfy Pπ(W1>0D)==Pπ(W5>0D)=70%P^{\pi}(W_{1}>0\mid D)=\dots=P^{\pi}(W_{5}>0\mid D)=70\%, and W6W1:5DW_{6}\perp\!\!\!\perp W_{1:5}\mid D satisfies Pπ(W6>0D)=90%P^{\pi}(W_{6}>0\mid D)=90\%. Since W6W_{6} has the highest chance of being positive, MLR statistics will assign it the highest absolute value, in which case knockoffs will make discoveries with probability 70%90%=63%70\%\cdot 90\%=63\%. However, in this example, knockoffs will be more powerful if we ensure that W1,,W5W_{1},\dots,W_{5} have the five largest absolute values, since their signs are perfectly correlated and thus Pπ(W1>0,,W5>0D)=70%>63%P^{\pi}(W_{1}>0,\dots,W_{5}>0\mid D)=70\%>63\%. (Note that there is nothing special about positive correlations in this example: one can also find similar examples where exploiting negative correlations among MLRπ\mathrm{MLR}^{\pi} can slightly increase power.)

This example has two properties which shed light on the more general situation. First, to even get a slight improvement in power required extremely strong dependencies among the coordinates of sign(MLRπ)\operatorname*{sign}(\mathrm{MLR}^{\pi}), which is not realistic. Indeed, empirically in Figure 6, the coordinates of sign(MLRπ)\operatorname*{sign}(\mathrm{MLR}^{\pi}) appear to be almost completely conditionally uncorrelated even when 𝐗\mathbf{X} is extremely highly correlated. Thus, although it may be possible to slightly improve power by exploiting dependencies among sign(MLRπ)\operatorname*{sign}(\mathrm{MLR}^{\pi}), the magnitude of the improvement in power is likely to be small. Second, the reason that it is possible to exploit dependencies to improve power in this case is because knockoffs has a “hard” threshold where one can only make any discoveries if one makes at least 5 discoveries, and exploiting conditional correlations among the vector sign(MLRπ)\operatorname*{sign}(\mathrm{MLR}^{\pi}) can slightly improve the probability that we reach that initial threshold. However, this “threshold” phenomenon is less important in situations where knockoffs are guaranteed to make at least a few discoveries; thus, if the number of discoveries grows with pp, this effect should be insignificant asymptotically.

B.4 Proof of Theorem 3.2

Notation: For any vector xnx\in\mathbb{R}^{n} and knk\leq n, we let x¯k1ki=1kxi\bar{x}_{k}\coloneqq\frac{1}{k}\sum_{i=1}^{k}x_{i} be the sample mean of the first kk elements of xx. For k>nk>n, we let x¯knkx¯\bar{x}_{k}\coloneqq\frac{n}{k}\bar{x} equal the sample mean of the vector xx plus knk-n additional zeros. Additionally, for xnx\in\mathbb{R}^{n} and a permutation κ:[n][n]\kappa:[n]\to[n], κ(x)\kappa(x) denotes the coordinates of xx permuted according to κ\kappa, so that κ(x)i=xκ(i)\kappa(x)_{i}=x_{\kappa(i)}. Throughout this section, all probabilities \mathbb{P} and expectations 𝔼\mathbb{E} are taken under PπP^{\pi}.

Main idea: There are two main ideas behind Theorem 3.2. First, for any feature statistic WW, we will compare the power of WW to the power of a “soft” version of the SeqStep procedure, which depends only on the conditional expectation of sign(W)\operatorname*{sign}(W) instead of the realized values of sign(W)\operatorname*{sign}(W). Roughly speaking, if the coordinates of sign(W)\operatorname*{sign}(W) obey a strong law of large numbers, the power of SeqStep and the power of the “soft” version of SeqStep will be the same asymptotically. Second, we will show that MLR statistics MLRπ\mathrm{MLR}^{\pi} exactly maximize the power of the “soft” version of the SeqStep procedure. Taken together, these two results imply that MLR statistics are asymptotically optimal.

To make this precise, for a feature statistic WW, let sorted(W)\mathrm{sorted}(W) denote WW sorted in decreasing order of its absolute values, and let R=𝕀(sorted(W)>0){0,1}pR=\mathbb{I}(\mathrm{sorted}(W)>0)\in\{0,1\}^{p} be the vector indicating where sorted(W)\mathrm{sorted}(W) has positive entries. The number of discoveries made by knockoffs only depends on RR. Indeed, for any vector η[0,1]p\eta\in[0,1]^{p} and any desired FDR level q(0,1)q\in(0,1), define

ψq(η)maxk{k:kkη¯k+1kη¯kq} and τq(η)=ψq(η)+11+q,\psi_{q}(\eta)\coloneqq\max_{k\in\mathbb{N}}\left\{k:\frac{k-k\bar{\eta}_{k}+1}{k\bar{\eta}_{k}}\leq q\right\}\text{ and }\tau_{q}(\eta)=\left\lceil\frac{\psi_{q}(\eta)+1}{1+q}\right\rceil, (B.4)

where by convention we set x0=\frac{x}{0}=\infty for any x>0x\in\mathbb{R}_{>0} and we remind the reader that for k>pk>p, η¯k=pkη¯\bar{\eta}_{k}=\frac{p}{k}\bar{\eta}. It turns out that knockoffs makes exactly τq(R)\tau_{q}(R) discoveries. For brevity, we refer the reader to Lemma B.3 of Spector and Janson, (2022) for a formal proof of this: however, to see this intuitively, note that kkR¯kk-k\bar{R}_{k} (resp. kR¯kk\bar{R}_{k}) counts the number of negative (resp. positive) entries in the first kk coordinates of sorted(W)\mathrm{sorted}(W), so this definition lines up with the definition of the data-dependent threshold in Section 1.1.
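For concreteness, a direct implementation of Eq. (B.4) is sketched below. The convention that ψ_q = τ_q = 0 when no k qualifies is ours, and we only scan k up to ⌊(1+q)p⌋ + 1, since the inequality in Eq. (B.4) requires k + 1 ≤ (1+q)kη̄_k ≤ (1+q)p and therefore cannot hold for larger k.

```python
import math
import numpy as np

def psi_tau(eta, q):
    """Compute psi_q(eta) and tau_q(eta) from Eq. (B.4) for eta in [0,1]^p, ordered by
    decreasing |W|. For k > p we use the padding convention bar(eta)_k = (p/k) * bar(eta),
    and x/0 = +infinity; we return psi = tau = 0 if no k qualifies (our convention)."""
    eta = np.asarray(eta, dtype=float)
    p = len(eta)
    k_max = math.floor((1 + q) * p) + 1          # no larger k can satisfy the inequality
    psi = 0
    for k in range(1, k_max + 1):
        num_pos = np.sum(eta[: min(k, p)])       # k * bar(eta)_k under the padding convention
        num_neg = k - num_pos
        if num_pos > 0 and (num_neg + 1) / num_pos <= q:
            psi = k
    tau = math.ceil((psi + 1) / (1 + q)) if psi > 0 else 0
    return psi, tau
```

Applying τ_q to the indicator vector R = 𝕀(sorted(W) > 0) recovers the number of discoveries made by knockoffs, while applying it to a vector of conditional probabilities gives the "soft" count used in the remainder of the proof.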

Now, let δ𝔼[RD][0,1]p\delta\coloneqq\mathbb{E}[R\mid D]\in[0,1]^{p} be the conditional expectation of RR given the masked data DD (defined in Equation 3.1). The “soft” version of SeqStep simply applies the functions ψq\psi_{q} and τq\tau_{q} to the conditional expectation δ\delta instead of the realized indicators RR. Intuitively speaking, our goal will be to apply a law of large numbers to show the following asymptotic result:

|τq(δ)τq(R)|=op(# of non-nulls).|\tau_{q}(\delta)-\tau_{q}(R)|=o_{p}\left(\#\text{ of non-nulls}\right).

Once we have shown this, it will be straightforward to show that MLR statistics are asymptotically optimal, since MLR statistics maximize τq(δ)\tau_{q}(\delta) in finite samples.

We now begin to prove Theorem 3.2 in earnest. In particular, the following pair of lemmas tells us that if R¯k\bar{R}_{k} converges to δ¯k\bar{\delta}_{k} uniformly in kk, then τq(δ)τq(R)\tau_{q}(\delta)\approx\tau_{q}(R).

Lemma B.2.

Let W=w([𝐗,𝐗~],𝐘)W=w([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y}) be any feature statistic with R,δ,ψq,τqR,\delta,\psi_{q},\tau_{q} as defined earlier. Fix any k0[p]k_{0}\in[p] and sufficiently small ϵ>0\epsilon>0 such that η3(1+q)ϵ<q\eta\coloneqq 3(1+q)\epsilon<q. Define the event

Ak0,ϵ={maxk0kp|R¯kδ¯k|ϵ}.A_{k_{0},\epsilon}=\left\{\max_{k_{0}\leq k\leq p}|\bar{R}_{k}-\bar{\delta}_{k}|\leq\epsilon\right\}.

Then on the event Ak0,ϵA_{k_{0},\epsilon}, we have that

11+3ϵτqη(R)k01τq(δ)(1+3ϵ)τq+η(R)+k0+1.\frac{1}{1+3\epsilon}\tau_{q-\eta}(R)-k_{0}-1\leq\tau_{q}(\delta)\leq(1+3\epsilon)\tau_{q+\eta}(R)+k_{0}+1. (B.5)

This implies that

|τq(R)τq(δ)|\displaystyle|\tau_{q}(R)-\tau_{q}(\delta)| p𝕀(Ak0,ϵc)+[τq+η(R)τqη(R)]+k0+1+3ϵτq+η(R).\displaystyle\leq p\mathbb{I}(A_{k_{0},\epsilon}^{c})+\big{[}\tau_{q+\eta}(R)-\tau_{q-\eta}(R)\big{]}+k_{0}+1+3\epsilon\tau_{q+\eta}(R). (B.6)
Proof.

Note the proof is entirely algebraic (there is no probabilistic content). We proceed in two steps, first showing Equation (B.5), then Equation (B.6).

Step 1: We now prove Equation (B.5). To start, define the sets

={k[p]:kkR¯k+1kR¯kq+η} and 𝒟={k[p]:kkδ¯k+1kδ¯kq}\mathcal{R}=\left\{k\in[p]:\frac{k-k\bar{R}_{k}+1}{k\bar{R}_{k}}\leq q+\eta\right\}\text{ and }\mathcal{D}=\left\{k\in[p]:\frac{k-k\bar{\delta}_{k}+1}{k\bar{\delta}_{k}}\leq q\right\}

and recall that by definition ψq+η(R)=max()\psi_{q+\eta}(R)=\max(\mathcal{R}), ψq(δ)=max(𝒟)\psi_{q}(\delta)=\max(\mathcal{D}). To analyze the difference between these quantities, fix any k𝒟k\in\mathcal{D}\setminus\mathcal{R}. Then by definition of 𝒟\mathcal{D} and \mathcal{R}, we know

kkδ¯k+1kδ¯kq<q+η<kkR¯k+1kR¯k.\displaystyle\frac{k-k\bar{\delta}_{k}+1}{k\bar{\delta}_{k}}\leq q<q+\eta<\frac{k-k\bar{R}_{k}+1}{k\bar{R}_{k}}.

However, Lemma B.3 (proved in a moment) tells us that this implies the following algebraic inequality:

δ¯kR¯kη3(1+q)=3(1+q)ϵ3(1+q)=ϵ.\bar{\delta}_{k}-\bar{R}_{k}\geq\frac{\eta}{3(1+q)}=\frac{3(1+q)\epsilon}{3(1+q)}=\epsilon.

However, on the event Ak0,ϵA_{k_{0},\epsilon} this cannot occur for any kk0k\geq k_{0}. Therefore, on the event Ak0,ϵA_{k_{0},\epsilon}, 𝒟{1,,k01}\mathcal{D}\setminus\mathcal{R}\subset\{1,\dots,k_{0}-1\}. This implies that

ψq(δ)ψq+η(R)=max(𝒟)max(){0max(𝒟)k0k01max(𝒟)<k0.\psi_{q}(\delta)-\psi_{q+\eta}(R)=\max(\mathcal{D})-\max(\mathcal{R})\leq\begin{cases}0&\max(\mathcal{D})\geq k_{0}\\ k_{0}-1&\max(\mathcal{D})<k_{0}.\end{cases} (B.7)

We can combine these conditions by writing that ψq(δ)ψq+η(R)k01\psi_{q}(\delta)-\psi_{q+\eta}(R)\leq k_{0}-1. Using the definition of τq()\tau_{q}(\cdot), we conclude

τq(δ)−τq+η(R)\displaystyle\tau_{q}(\delta)-\tau_{q+\eta}(R) =ψq(δ)+11+q−ψq+η(R)+11+q+η\displaystyle=\left\lceil\frac{\psi_{q}(\delta)+1}{1+q}\right\rceil-\left\lceil\frac{\psi_{q+\eta}(R)+1}{1+q+\eta}\right\rceil
1+ψq(δ)+11+q−ψq+η(R)+11+q+η\displaystyle\leq 1+\frac{\psi_{q}(\delta)+1}{1+q}-\frac{\psi_{q+\eta}(R)+1}{1+q+\eta}
2+ψq(δ)−ψq+η(R)1+q+(11+q−11+q+η)ψq+η(R)\displaystyle\leq 2+\frac{\psi_{q}(\delta)-\psi_{q+\eta}(R)}{1+q}+\left(\frac{1}{1+q}-\frac{1}{1+q+\eta}\right)\psi_{q+\eta}(R)
=2+ψq(δ)−ψq+η(R)1+q+3ϵ1+q+ηψq+η(R)\displaystyle=2+\frac{\psi_{q}(\delta)-\psi_{q+\eta}(R)}{1+q}+\frac{3\epsilon}{1+q+\eta}\psi_{q+\eta}(R) by def. of η\eta
2+k011+q+3ϵ1+q+ηψq+η(R)\displaystyle\leq 2+\frac{k_{0}-1}{1+q}+\frac{3\epsilon}{1+q+\eta}\psi_{q+\eta}(R) by Eq. (B.7)
k0+1+3ϵτq+η(R)\displaystyle\leq k_{0}+1+3\epsilon\tau_{q+\eta}(R) by def. of τq(R)\tau_{q}(R).

This proves the upper bound, namely that τq(δ)(1+3ϵ)τq+η(R)+k0+1\tau_{q}(\delta)\leq(1+3\epsilon)\tau_{q+\eta}(R)+k_{0}+1. To prove the lower bound, note that we can swap the role of RR and δ\delta and apply the upper bound to q=qηq^{\prime}=q-\eta. Then if we take η=3(1+q)ϵ<η<1\eta^{\prime}=3(1+q^{\prime})\epsilon<\eta<1, applying the upper bound yields

τq(R)(1+3ϵ)τq+η(δ)+k0+1.\tau_{q^{\prime}}(R)\leq(1+3\epsilon)\tau_{q^{\prime}+\eta^{\prime}}(\delta)+k_{0}+1.

Observe that τq()\tau_{q}(\cdot) is nondecreasing in qq. Furthermore, since η<η\eta^{\prime}<\eta, we have that q+η=qη+ηqq^{\prime}+\eta^{\prime}=q-\eta+\eta^{\prime}\leq q. Therefore, τq+η(δ)τq(δ)\tau_{q^{\prime}+\eta^{\prime}}(\delta)\leq\tau_{q}(\delta). Applying this result, we conclude

τqη(R)=τq(R)(1+3ϵ)τq+η(δ)+k0+1(1+3ϵ)τq(δ)+k0+1.\tau_{q-\eta}(R)=\tau_{q^{\prime}}(R)\leq(1+3\epsilon)\tau_{q^{\prime}+\eta^{\prime}}(\delta)+k_{0}+1\leq(1+3\epsilon)\tau_{q}(\delta)+k_{0}+1.

This implies the lower bound 11+3ϵτqη(R)k01τq(δ)\frac{1}{1+3\epsilon}\tau_{q-\eta}(R)-k_{0}-1\leq\tau_{q}(\delta).

Step 2: Now, we show that Equation (B.6) follows from Equation (B.5). To see this, we consider the two cases τq(δ)≥τq(R)\tau_{q}(\delta)\geq\tau_{q}(R) and τq(δ)≤τq(R)\tau_{q}(\delta)\leq\tau_{q}(R) and apply Equation (B.5) in each. In particular, on the event Ak0,ϵA_{k_{0},\epsilon}, we have:

|τq(δ)τq(R)|\displaystyle|\tau_{q}(\delta)-\tau_{q}(R)| ={τq(δ)τq(R)τq(δ)τq(R)τq(R)τq(δ)τq(δ)τq(R)\displaystyle=\begin{cases}\tau_{q}(\delta)-\tau_{q}(R)&\tau_{q}(\delta)\geq\tau_{q}(R)\\ \tau_{q}(R)-\tau_{q}(\delta)&\tau_{q}(\delta)\leq\tau_{q}(R)\end{cases}
{(1+3ϵ)τq+η(R)+k0+1τq(R)τq(δ)τq(R)τq(R)11+3ϵτqη(R)+k0+1τq(δ)τq(R)\displaystyle\leq\begin{cases}(1+3\epsilon)\tau_{q+\eta}(R)+k_{0}+1-\tau_{q}(R)&\tau_{q}(\delta)\geq\tau_{q}(R)\\ \tau_{q}(R)-\frac{1}{1+3\epsilon}\tau_{q-\eta}(R)+k_{0}+1&\tau_{q}(\delta)\leq\tau_{q}(R)\end{cases} by Eq. (B.5)
=k0+1+{τq+η(R)−τq(R)+3ϵτq+η(R)τq(δ)≥τq(R)τq(R)−τq−η(R)+3ϵ1+3ϵτq−η(R)τq(δ)≤τq(R)\displaystyle=k_{0}+1+\begin{cases}\tau_{q+\eta}(R)-\tau_{q}(R)+3\epsilon\tau_{q+\eta}(R)&\tau_{q}(\delta)\geq\tau_{q}(R)\\ \tau_{q}(R)-\tau_{q-\eta}(R)+\frac{3\epsilon}{1+3\epsilon}\tau_{q-\eta}(R)&\tau_{q}(\delta)\leq\tau_{q}(R)\end{cases}
k0+1+τq+η(R)τqη(R)+3ϵτq+η(R),\displaystyle\leq k_{0}+1+\tau_{q+\eta}(R)-\tau_{q-\eta}(R)+3\epsilon\tau_{q+\eta}(R),

where the last line follows because τq(R)\tau_{q}(R) is monotone in qq. This implies Equation (B.6), since |τq(δ)−τq(R)|≤p|\tau_{q}(\delta)-\tau_{q}(R)|\leq p holds trivially on the event Ak0,ϵcA_{k_{0},\epsilon}^{c}, as τq(δ),τq(R)∈[p]\tau_{q}(\delta),\tau_{q}(R)\in[p]. ∎

The following lemma proves a very simple algebraic inequality used in the proof of Lemma B.2.

Lemma B.3.

For any x,y∈[0,1]x,y\in[0,1], k∈ℕk\in\mathbb{N}, q∈(0,1)q\in(0,1), and any γ∈(0,1)\gamma\in(0,1), suppose that 1+k−kxkx≤q<q+γ≤1+k−kyky\frac{1+k-kx}{kx}\leq q<q+\gamma\leq\frac{1+k-ky}{ky}. Then

xyγ(1+q)(1+q+γ)γ3(1+q).x-y\geq\frac{\gamma}{(1+q)(1+q+\gamma)}\geq\frac{\gamma}{3(1+q)}.
Proof.

By assumption, x0x\neq 0, since otherwise 1+kkxkx=\frac{1+k-kx}{kx}=\infty by convention. For x>0x>0, we have that

1+kkxkxq1+kkxkqxxk+1k(1+q).\frac{1+k-kx}{kx}\leq q\implies 1+k-kx\leq kqx\implies x\geq\frac{k+1}{k(1+q)}. (B.8)

Now, there are two cases. If y=0y=0, the inequality holds trivially:

xy=xk+1k11+qγ3(1+q).x-y=x\geq\frac{k+1}{k}\cdot\frac{1}{1+q}\geq\frac{\gamma}{3(1+q)}.

Alternatively, if y>0y>0, we observe similarly to before that

1+kkykyq+γyk+1k(1+q+γ).\frac{1+k-ky}{ky}\geq q+\gamma\implies y\leq\frac{k+1}{k(1+q+\gamma)}. (B.9)

Combining Equations (B.8)–(B.9) yields the result:

xy\displaystyle x-y k+1k(11+q11+q+γ)=k+1kγ(1+q+γ)(1+q)γ3(1+q).\displaystyle\geq\frac{k+1}{k}\left(\frac{1}{1+q}-\frac{1}{1+q+\gamma}\right)=\frac{k+1}{k}\frac{\gamma}{(1+q+\gamma)(1+q)}\geq\frac{\gamma}{3(1+q)}.

We are now ready to prove Theorem 3.2. As a reminder, we consider an asymptotic regime with data 𝐗(n)n×pn,𝐘(n)n\mathbf{X}^{(n)}\in\mathbb{R}^{n\times p_{n}},\mathbf{Y}^{(n)}\in\mathbb{R}^{n} and knockoffs 𝐗~(n)\widetilde{\mathbf{X}}^{(n)}, where PnπP^{\pi}_{n} is the Bayesian model. We let D(n)D^{(n)} denote the masked data for knockoffs as defined in Section 3.1 and let sns_{n} denote the expected number of non-nulls under PnπP^{\pi}_{n}. We will analyze the limiting normalized number of discoveries of feature statistics W(n)=wn([𝐗(n),𝐗~(n)],𝐘)W^{(n)}=w_{n}([\mathbf{X}^{(n)},\widetilde{\mathbf{X}}^{(n)}],\mathbf{Y}) with rejection set S(n)(q)S^{(n)}(q), defined as the expected number of discoveries divided by the expected number of non-nulls:

Γq(wn)=𝔼Pnπ[|S(n)(q)|]sn.\Gamma_{q}(w_{n})=\frac{\mathbb{E}_{P^{\pi}_{n}}[|S^{(n)}(q)|]}{s_{n}}. (3.9)

For convenience, we restate Theorem 3.2 and then prove it.

Theorem 3.2.

For each nn, let MLRπ=mlrnπ([𝐗(n),𝐗~(n)],𝐘(n))\mathrm{MLR}^{\pi}=\mathrm{mlr}_{n}^{\pi}([\mathbf{X}^{(n)},\widetilde{\mathbf{X}}^{(n)}],\mathbf{Y}^{(n)}) denote the MLR statistics with respect to PnπP_{n}^{\pi} and let W=wn([𝐗(n),𝐗~(n)],𝐘(n))W=w_{n}([\mathbf{X}^{(n)},\widetilde{\mathbf{X}}^{(n)}],\mathbf{Y}^{(n)}) denote any other sequence of feature statistics. Assume the following:

  • limnΓq(mlrnπ)\lim_{n\to\infty}\Gamma_{q}(\mathrm{mlr}_{n}^{\pi}) and limnΓq(wn)\lim_{n\to\infty}\Gamma_{q}(w_{n}) exist for each q(0,1)q\in(0,1).

  • The expected number of non-nulls grows faster than log(pn)4\log(p_{n})^{4}. Formally, assume that for some γ>0\gamma>0, limnsnlog(pn)4+γ=\lim_{n\to\infty}\frac{s_{n}}{\log(p_{n})^{4+\gamma}}=\infty.

  • Conditional on D(n)D^{(n)}, the covariance between the signs of MLRπ\mathrm{MLR}^{\pi} decays exponentially off the diagonal. That is, there exist constants C0,ρ(0,1)C\geq 0,\rho\in(0,1) such that

    |CovPnπ(𝕀(MLRiπ>0),𝕀(MLRjπ>0)D(n))|Cρ|ij|.|\operatorname{Cov}_{P^{\pi}_{n}}(\mathbb{I}(\mathrm{MLR}^{\pi}_{i}>0),\mathbb{I}(\mathrm{MLR}^{\pi}_{j}>0)\mid D^{(n)})|\leq C\rho^{|i-j|}. (3.8)

Then for all but countably many q(0,1)q\in(0,1),

limnΓq(mlrnπ)limnΓq(wn).\lim_{n\to\infty}\Gamma_{q}(\mathrm{mlr}_{n}^{\pi})\geq\lim_{n\to\infty}\Gamma_{q}(w_{n}). (B.10)
Proof.

Note that in this proof, all expectations and probabilities are taken over PnπP_{n}^{\pi}.

The proof proceeds in three main steps, but we begin by introducing some notation and outlining the overall strategy. Following Lemma B.2, let R=𝕀(sorted(W)>0)R=\mathbb{I}(\mathrm{sorted}(W)>0) and Rπ=𝕀(sorted(MLRπ)>0)R^{\pi}=\mathbb{I}(\mathrm{sorted}(\mathrm{MLR}^{\pi})>0), and let δ=𝔼[R∣D(n)]\delta=\mathbb{E}[R\mid D^{(n)}] and δπ=𝔼[Rπ∣D(n)]\delta^{\pi}=\mathbb{E}[R^{\pi}\mid D^{(n)}] be their conditional expectations given the masked data. (Note that W,R,δ,MLRπ,RπW,R,\delta,\mathrm{MLR}^{\pi},R^{\pi} and δπ\delta^{\pi} all change with nn; however, we omit this dependency to lighten the notation.) As in Equation (B.4), we can write the numbers of discoveries made by WW and MLRπ\mathrm{MLR}^{\pi} as functions of RR and RπR^{\pi}, respectively, so:

Γq(mlrnπ)Γq(wn)=𝔼[τq(Rπ)]sn𝔼[τq(R)]sn.\Gamma_{q}(\mathrm{mlr}_{n}^{\pi})-\Gamma_{q}(w_{n})=\frac{\mathbb{E}\left[\tau_{q}(R^{\pi})\right]}{s_{n}}-\frac{\mathbb{E}\left[\tau_{q}(R)\right]}{s_{n}}.

We will show that the limit of this quantity is nonnegative, and the main idea is to make the approximations τq(Rπ)τq(δπ)\tau_{q}(R^{\pi})\approx\tau_{q}(\delta^{\pi}) and τq(R)τq(δ)\tau_{q}(R)\approx\tau_{q}(\delta). In particular, we can decompose

Γq(mlrnπ)Γq(wn)\displaystyle\Gamma_{q}(\mathrm{mlr}_{n}^{\pi})-\Gamma_{q}(w_{n}) =𝔼[τq(Rπ)τq(R)]sn\displaystyle=\frac{\mathbb{E}\left[\tau_{q}(R^{\pi})-\tau_{q}(R)\right]}{s_{n}}
=𝔼[τq(Rπ)τq(δπ)]sn+𝔼[τq(δπ)τq(δ)]sn+𝔼[τq(δ)τq(R)]sn\displaystyle=\frac{\mathbb{E}[\tau_{q}(R^{\pi})-\tau_{q}(\delta^{\pi})]}{s_{n}}+\frac{\mathbb{E}[\tau_{q}(\delta^{\pi})-\tau_{q}(\delta)]}{s_{n}}+\frac{\mathbb{E}\left[\tau_{q}(\delta)-\tau_{q}(R)\right]}{s_{n}}
𝔼[τq(δπ)τq(δ)]sn𝔼|τq(Rπ)τq(δπ)|sn𝔼|τq(δ)τq(R)|sn.\displaystyle\geq\frac{\mathbb{E}[\tau_{q}(\delta^{\pi})-\tau_{q}(\delta)]}{s_{n}}-\frac{\mathbb{E}|\tau_{q}(R^{\pi})-\tau_{q}(\delta^{\pi})|}{s_{n}}-\frac{\mathbb{E}|\tau_{q}(\delta)-\tau_{q}(R)|}{s_{n}}. (B.11)

In particular, Step 1 of the proof is to show that τq(δπ)τq(δ)\tau_{q}(\delta^{\pi})\geq\tau_{q}(\delta) holds deterministically, for fixed nn. This implies that the first term of Equation (B.11) is nonnegative for fixed nn. In Step 2, we show that as nn\to\infty, the second and third terms of Equation (B.11) vanish. In Step 3, we combine these results and take limits to yield the final result.

Step 1: In this step, we show that τq(δπ)τq(δ)\tau_{q}(\delta^{\pi})\geq\tau_{q}(\delta) holds deterministically for fixed nn. To do this, it suffices to show that δ¯kπδ¯k\bar{\delta}^{\pi}_{k}\geq\bar{\delta}_{k} for each k[pn]k\in[p_{n}]. To see this, recall that τq(δπ)\tau_{q}(\delta^{\pi}) and τq(δ)\tau_{q}(\delta) are increasing functions of ψq(δπ)\psi_{q}(\delta^{\pi}) and ψq(δ)\psi_{q}(\delta), as defined below:

ψq(δπ)=maxk∈ℕ{k:k−kδ¯kπ+1kδ¯kπ≤q} and ψq(δ)=maxk∈ℕ{k:k−kδ¯k+1kδ¯k≤q}\psi_{q}(\delta^{\pi})=\max_{k\in\mathbb{N}}\left\{k:\frac{k-k\bar{\delta}^{\pi}_{k}+1}{k\bar{\delta}^{\pi}_{k}}\leq q\right\}\text{ and }\psi_{q}(\delta)=\max_{k\in\mathbb{N}}\left\{k:\frac{k-k\bar{\delta}_{k}+1}{k\bar{\delta}_{k}}\leq q\right\} (B.12)

where for k>pnk>p_{n} we use the convention of “padding” δ\delta and δπ\delta^{\pi} with extra zeros, so (e.g.) δ¯k=pnkδ¯\bar{\delta}_{k}=\frac{p_{n}}{k}\bar{\delta} for k>pnk>p_{n}.

Since the function γkkγ+1kγ\gamma\mapsto\frac{k-k\gamma+1}{k\gamma} is decreasing in γ\gamma, δ¯kπδ¯k\bar{\delta}^{\pi}_{k}\geq\bar{\delta}_{k} implies that kkδ¯kπ+1kδ¯kπkkδ¯k+1kδ¯k\frac{k-k\bar{\delta}^{\pi}_{k}+1}{k\bar{\delta}^{\pi}_{k}}\leq\frac{k-k\bar{\delta}_{k}+1}{k\bar{\delta}_{k}} for each kk, and therefore ψq(δπ)ψq(δ)\psi_{q}(\delta^{\pi})\geq\psi_{q}(\delta), which implies τq(δπ)τq(δ)\tau_{q}(\delta^{\pi})\geq\tau_{q}(\delta). Thus, it suffices to show that δ¯kπδ¯k\bar{\delta}^{\pi}_{k}\geq\bar{\delta}_{k} holds for each kk.

Intuitively, it makes sense that δ¯kπ≥δ¯k\bar{\delta}^{\pi}_{k}\geq\bar{\delta}_{k} for each kk, since MLRπ\mathrm{MLR}^{\pi} maximizes (MLRjπ>0∣D)\mathbb{P}(\mathrm{MLR}_{j}^{\pi}>0\mid D) coordinate-wise and its absolute values are chosen so that δπ\delta^{\pi} is sorted in decreasing order. To prove this formally, we first argue that conditional on D(n)D^{(n)}, RπR^{\pi} is a deterministic function of RR. Recall that according to Corollary B.1, the event sign(Wj)≠sign(MLRjπ)\operatorname*{sign}(W_{j})\neq\operatorname*{sign}(\mathrm{MLR}_{j}^{\pi}) is completely determined by the masked data D(n)D^{(n)}. Furthermore, since RπR^{\pi} and RR are random permutations of the vectors 𝕀(MLRπ>0)\mathbb{I}(\mathrm{MLR}^{\pi}>0) and 𝕀(W>0)\mathbb{I}(W>0), respectively, where the random permutations only depend on |MLRπ||\mathrm{MLR}^{\pi}| and |W||W|, this implies there exists a random vector ξ∈{0,1}pn\xi\in\{0,1\}^{p_{n}} and a random permutation σ∈Spn\sigma\in S_{p_{n}} such that Rπ=ξ⊕σ(R)R^{\pi}=\xi\oplus\sigma(R) and ξ,σ\xi,\sigma are deterministic conditional on D(n)D^{(n)}. Note that here, ⊕\oplus denotes the generalized parity function, so

bx{xb=01xb=1 for x[0,1],b{0,1}b\oplus x\coloneqq\begin{cases}x&b=0\\ 1-x&b=1\end{cases}\text{ for }x\in[0,1],b\in\{0,1\} (B.13)

which guarantees that 00=11=00\oplus 0=1\oplus 1=0 and 01=10=10\oplus 1=1\oplus 0=1, etc.
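As a toy illustration of this relation (with ξ\xi and σ\sigma drawn arbitrarily here, whereas in the proof they are determined by D(n)D^{(n)}), the identity Rπ=ξ⊕σ(R)R^{\pi}=\xi\oplus\sigma(R) can be written as follows.

```python
import numpy as np

def parity(b, x):
    """Generalized parity from Eq. (B.13): returns x where b = 0 and 1 - x where b = 1."""
    b = np.asarray(b)
    x = np.asarray(x, dtype=float)
    return np.where(b == 1, 1.0 - x, x)

rng = np.random.default_rng(1)
p = 8
R = rng.integers(0, 2, size=p).astype(float)   # signs of W, sorted by |W|
xi = rng.integers(0, 2, size=p)                # sign flips (D-measurable in the proof)
sigma = rng.permutation(p)                     # re-ordering (D-measurable in the proof)
R_pi = parity(xi, R[sigma])                    # R^pi = xi (+) sigma(R), with sigma(R)_i = R_{sigma(i)}
```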

The intuition here is that, following Proposition 3.1, fitting a feature statistic WW is equivalent to observing D(n)D^{(n)}, assigning an ordering to the features, and then guessing which one of {𝐗j,𝐗~j}\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\} is the true feature and which is a knockoff, where Wj>0W_{j}>0 if and only if this “guess” is correct. Since these decisions are made as deterministic functions of D(n)D^{(n)}, MLRπ\mathrm{MLR}^{\pi} can only differ from WW in that (i) it may make different guesses, flipping the sign of WW (as represented by ξ\xi), and (ii) its absolute values may be sorted in a different order (as represented by σ\sigma).

Now, since ξ\xi and σ\sigma are deterministic functions of D(n)D^{(n)}, this implies that

δiπ=𝔼[RiπD(n)]=𝔼[ξiRσ(i)D(n)]={1δσ(i)ξi=1δσ(i)ξi=0.\delta^{\pi}_{i}=\mathbb{E}[R_{i}^{\pi}\mid D^{(n)}]=\mathbb{E}[\xi_{i}\oplus R_{\sigma(i)}\mid D^{(n)}]=\begin{cases}1-\delta_{\sigma(i)}&\xi_{i}=1\\ \delta_{\sigma(i)}&\xi_{i}=0.\end{cases}

However, by definition, 𝔼[RiπD(n)]=(sorted(MLRπ)i>0D(n))\mathbb{E}[R_{i}^{\pi}\mid D^{(n)}]=\mathbb{P}(\mathrm{sorted}(\mathrm{MLR}^{\pi})_{i}>0\mid D^{(n)}), and Proposition 3.3 implies that (MLRiπ>0D(n))0.5\mathbb{P}(\mathrm{MLR}^{\pi}_{i}>0\mid D^{(n)})\geq 0.5 for all i[pn]i\in[p_{n}]: since the ordering of MLRπ\mathrm{MLR}^{\pi} is deterministic conditional on D(n)D^{(n)}, this also implies δiπ=(sorted(MLRπ)i>0D(n))0.5\delta_{i}^{\pi}=\mathbb{P}(\mathrm{sorted}(\mathrm{MLR}^{\pi})_{i}>0\mid D^{(n)})\geq 0.5. Therefore, δiπδσ(i)\delta^{\pi}_{i}\geq\delta_{\sigma(i)} for each i[pn]i\in[p_{n}]. Additionally, by construction MLRπ\mathrm{MLR}^{\pi} ensures that δ1πδ2πδpnπ\delta^{\pi}_{1}\geq\delta^{\pi}_{2}\geq\dots\geq\delta^{\pi}_{p_{n}}. If δ(1),,δ(pn)\delta_{(1)},\dots,\delta_{(p_{n})} are the order statistics of δ\delta in decreasing order, this implies that δiπδ(i)\delta^{\pi}_{i}\geq\delta_{(i)} for all ii. Therefore,

δ¯kπ=1ki=1kδiπ1ki=1kδ(i)1ki=1nδi.\bar{\delta}^{\pi}_{k}=\frac{1}{k}\sum_{i=1}^{k}\delta^{\pi}_{i}\geq\frac{1}{k}\sum_{i=1}^{k}\delta_{(i)}\geq\frac{1}{k}\sum_{i=1}^{n}\delta_{i}.

By the previous analysis, this proves that τq(δπ)τq(δ)\tau_{q}(\delta^{\pi})\geq\tau_{q}(\delta).
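The conclusion of Step 1 can also be checked numerically. The sketch below assumes the knockoff_discoveries helper from the sketch following Equation (B.4) is in scope; δ\delta is an arbitrary vector of conditional sign probabilities for some feature statistic, and δπ\delta^{\pi} is formed by flipping each entry to max(δj,1−δj)\max(\delta_{j},1-\delta_{j}) and sorting in decreasing order, mirroring how MLR statistics choose their signs and ordering. The assertion encodes the finite-sample inequality τq(δπ)≥τq(δ)\tau_{q}(\delta^{\pi})\geq\tau_{q}(\delta) proved above.

```python
import numpy as np

rng = np.random.default_rng(2)
for _ in range(1000):
    # arbitrary conditional probabilities P(sorted(W)_j > 0 | D) for a generic statistic W
    delta = rng.uniform(0, 1, size=rng.integers(5, 200))
    # flip each coordinate to max(delta_j, 1 - delta_j) and sort in decreasing order
    delta_pi = np.sort(np.maximum(delta, 1.0 - delta))[::-1]
    q = rng.uniform(0.05, 0.5)
    # tau_q is the second return value of knockoff_discoveries (sketch after Eq. (B.4))
    assert knockoff_discoveries(delta_pi, q)[1] >= knockoff_discoveries(delta, q)[1]
```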

Step 2: In this step, we show that 𝔼|τq(δπ)τq(Rπ)|sn0\frac{\mathbb{E}|\tau_{q}(\delta^{\pi})-\tau_{q}(R^{\pi})|}{s_{n}}\to 0 for all but countably many q(0,1)q\in(0,1), as well as the analogous result for RR and δ\delta. We first prove the result for RπR^{\pi} and δπ\delta^{\pi}, and in particular, for any fixed v>0v>0, we will show that lim supn𝔼|τq(δπ)τq(Rπ)|snv\limsup_{n\to\infty}\frac{\mathbb{E}|\tau_{q}(\delta^{\pi})-\tau_{q}(R^{\pi})|}{s_{n}}\leq v. Since we will show this for any arbitrary v>0v>0, this implies 𝔼|τq(δπ)τq(Rπ)|sn0\frac{\mathbb{E}|\tau_{q}(\delta^{\pi})-\tau_{q}(R^{\pi})|}{s_{n}}\to 0.

We begin by applying Lemma B.2. In particular, fix any kn[pn]k_{n}\in[p_{n}], any ϵ>0\epsilon>0, and define

An={maxknkpn|R¯kπδ¯kπ|ϵ}.A_{n}=\left\{\max_{k_{n}\leq k\leq p_{n}}|\bar{R}_{k}^{\pi}-\bar{\delta}_{k}^{\pi}|\leq\epsilon\right\}.

Then by Lemma B.2,

|τq(Rπ)τq(δπ)|pn𝕀(Anc)+τq+η(Rπ)τqη(Rπ)+kn+1+3ϵτq+η(Rπ).|\tau_{q}(R^{\pi})-\tau_{q}(\delta^{\pi})|\leq p_{n}\mathbb{I}(A_{n}^{c})+\tau_{q+\eta}(R^{\pi})-\tau_{q-\eta}(R^{\pi})+k_{n}+1+3\epsilon\tau_{q+\eta}(R^{\pi}).

where η=3(1+q)ϵ\eta=3(1+q)\epsilon. Therefore,

𝔼|τq(Rπ)τq(δπ)|snpn(Anc)sn+kn+1sn+𝔼[τq+η(Rπ)]𝔼[τqη(Rπ)]sn+3ϵ𝔼[τq+η(Rπ)]sn.\frac{\mathbb{E}|\tau_{q}(R^{\pi})-\tau_{q}(\delta^{\pi})|}{s_{n}}\leq\frac{p_{n}\mathbb{P}(A_{n}^{c})}{s_{n}}+\frac{k_{n}+1}{s_{n}}+\frac{\mathbb{E}[\tau_{q+\eta}(R^{\pi})]-\mathbb{E}[\tau_{q-\eta}(R^{\pi})]}{s_{n}}+\frac{3\epsilon\mathbb{E}[\tau_{q+\eta}(R^{\pi})]}{s_{n}}. (B.14)

We now analyze these terms in order: while doing so, we will choose a sequence {kn}\{k_{n}\} and constant ϵ>0\epsilon>0 which guarantee the desired result. Note that eventually, our choice of ϵ\epsilon will depend on qq, so the convergence is not necessarily uniform, but that does not pose a problem for our proof.

First term: To start, we apply a finite-sample concentration result to bound (Anc)\mathbb{P}(A_{n}^{c}). In particular, we show in Corollary C.1 that if X1,…,XnX_{1},\dots,X_{n} are mean-zero, [−1,1][-1,1]-valued random variables satisfying the exponential decay condition from Equation (3.8), then there exists a constant C′>0C^{\prime}>0 depending only on CC and ρ\rho such that

(maxn0in|X¯i|t)nexp(Ct2n01/4).\mathbb{P}\left(\max_{n_{0}\leq i\leq n}|\bar{X}_{i}|\geq t\right)\leq n\exp(-C^{\prime}t^{2}n_{0}^{1/4}). (B.15)

Furthermore, Corollary C.1 shows that this result holds even if we permute X1,…,XnX_{1},\dots,X_{n} according to some arbitrary fixed permutation σ\sigma. Now, observe that conditional on D(n)D^{(n)}, the coordinates of Rπ−δπR^{\pi}-\delta^{\pi} are zero-mean, [−1,1][-1,1]-valued random variables, and the vector Rπ−δπR^{\pi}-\delta^{\pi} is a fixed (D(n)D^{(n)}-measurable) permutation of 𝕀(MLRπ>0)\mathbb{I}(\mathrm{MLR}^{\pi}>0) minus its conditional expectation. Since 𝕀(MLRπ>0)\mathbb{I}(\mathrm{MLR}^{\pi}>0) obeys the conditional exponential decay condition in Equation (3.8), we can apply Corollary C.1 to Rπ−δπR^{\pi}-\delta^{\pi}:

(AncD(n))pnexp(Cϵ2kn1/4)\mathbb{P}(A_{n}^{c}\mid D^{(n)})\leq p_{n}\exp(-C^{\prime}\epsilon^{2}k_{n}^{1/4}) (B.16)

which implies by the tower property that pn(Anc)pn2exp(Cϵ2kn1/4)p_{n}\mathbb{P}(A_{n}^{c})\leq p_{n}^{2}\exp(-C^{\prime}\epsilon^{2}k_{n}^{1/4}). Now, suppose we take

kn=log(pn)4+γ.k_{n}=\left\lceil\log(p_{n})^{4+\gamma}\right\rceil.

Then observe that ϵ\epsilon is fixed, so as nn\to\infty, kn1/4ϵ2=Ω(log(pn)1+γ/4)k_{n}^{1/4}\epsilon^{2}=\Omega(\log(p_{n})^{1+\gamma/4}). Thus

log(pn(Anc))2log(pn)Ω(log(pn)1+γ/4).\log(p_{n}\mathbb{P}(A_{n}^{c}))\leq 2\log(p_{n})-\Omega\left(\log(p_{n})^{1+\gamma/4}\right)\to-\infty.

Therefore, for this choice of knk_{n}, we have shown the stronger result that pn(Anc)0p_{n}\mathbb{P}(A_{n}^{c})\to 0. Of course, this implies pn(Anc)sn0\frac{p_{n}\mathbb{P}(A_{n}^{c})}{s_{n}}\to 0 as well.

Second term: This term is easy, as we assume in the statement that knsnlog(pn)4+γsn0\frac{k_{n}}{s_{n}}\sim\frac{\log(p_{n})^{4+\gamma}}{s_{n}}\to 0.

Third term: We will now show that for all but countably many q(0,1)q\in(0,1), for any sufficiently small ϵ\epsilon (and thus for any sufficiently small η\eta), lim supn𝔼[τq+η(Rπ)]𝔼[τqη(Rπ)]snv/2\limsup_{n\to\infty}\frac{\mathbb{E}[\tau_{q+\eta}(R^{\pi})]-\mathbb{E}[\tau_{q-\eta}(R^{\pi})]}{s_{n}}\leq v/2 for any fixed v>0v>0.

To do this, recall by assumption that for all q(0,1)q\in(0,1), we have that limn𝔼[τq(Rπ)]sn\lim_{n\to\infty}\frac{\mathbb{E}[\tau_{q}(R^{\pi})]}{s_{n}} exists and converges to some (extended) real number L(q)L(q). Furthermore, we show in Lemma C.2 that L(q)L(q) is always finite—this is intuitively a consequence of the fact that knockoffs controls the false discovery rate, and thus the expected number of discoveries cannot exceed the number of non-nulls by more than a constant factor. Importantly, since τq(Rπ)\tau_{q}(R^{\pi}) is increasing in qq, the function L(q)L(q) is increasing in qq for all q(0,1)q\in(0,1): therefore, it is continuous on (0,1)(0,1) except on a countable set.

Supposing that qq is a continuity point of L(q)L(q), there exists some β>0\beta>0 such that |qq|β|L(q)L(q)|v/4|q-q^{\prime}|\leq\beta\implies|L(q)-L(q^{\prime})|\leq v/4. Take ϵ\epsilon to be any positive constant such that ϵβ3(1+q)\epsilon\leq\frac{\beta}{3(1+q)} and thus ηβ\eta\leq\beta. Then we conclude

lim supn𝔼[τq+η(Rπ)]𝔼[τqη(Rπ)]sn\displaystyle\limsup_{n\to\infty}\frac{\mathbb{E}[\tau_{q+\eta}(R^{\pi})]-\mathbb{E}[\tau_{q-\eta}(R^{\pi})]}{s_{n}} =L(q+η)L(qη)\displaystyle=L(q+\eta)-L(q-\eta) because 𝔼[τq(Rπ)]snL(q) pointwise\displaystyle\text{ because }\frac{\mathbb{E}[\tau_{q}(R^{\pi})]}{s_{n}}\to L(q)\text{ pointwise}
v2.\displaystyle\leq\frac{v}{2}. by continuity

Fourth term: We now show that for all but countably many q(0,1)q\in(0,1), for any sufficiently small ϵ\epsilon, limn3ϵ𝔼[τq+η(Rπ)]sn=3ϵL(q+η)v/2\lim_{n\to\infty}\frac{3\epsilon\mathbb{E}[\tau_{q+\eta}(R^{\pi})]}{s_{n}}=3\epsilon L(q+\eta)\leq v/2. However, this is simple, since Lemma C.2 tells us that L(q)L(q) is finite and continuous except at countably many points. Thus, we can take ϵ\epsilon sufficiently small so that L(q+η)=L(q+3(1+q)ϵ)L(q)+1L(q+\eta)=L(q+3(1+q)\epsilon)\leq L(q)+1, and then also take ϵ<v6(L(q)+1)\epsilon<\frac{v}{6(L(q)+1)} so that 3ϵL(q+η)v/23\epsilon L(q+\eta)\leq v/2.

Combining the results for all four terms, we see the following: for each v>0v>0, there exists a sequence {kn}\{k_{n}\} and a constant ϵ\epsilon guaranteeing that

lim supn𝔼|τq(Rπ)τq(δπ)|snv.\displaystyle\limsup_{n\to\infty}\frac{\mathbb{E}|\tau_{q}(R^{\pi})-\tau_{q}(\delta^{\pi})|}{s_{n}}\leq v.

Since this holds for all v>0v>0, we conclude limn𝔼|τq(Rπ)τq(δπ)|sn=0\lim_{n\to\infty}\frac{\mathbb{E}|\tau_{q}(R^{\pi})-\tau_{q}(\delta^{\pi})|}{s_{n}}=0 as desired.

Lastly in this step, we need to show the same result for RR and δ\delta in place of RπR^{\pi} and δπ\delta^{\pi}. However, the proof for RR and δ\delta is identical to the proof for RπR^{\pi} and δπ\delta^{\pi}. The one subtlety worth mentioning is that we do not directly assume the exponential decay condition in Equation (3.8) for WW. However, as we argued in Step 1, we can write 𝕀(W>0)=ξ𝕀(MLRπ>0)\mathbb{I}(W>0)=\xi\oplus\mathbb{I}(\mathrm{MLR}^{\pi}>0) for some random vector ξ{0,1}pn\xi\in\{0,1\}^{p_{n}} which is a deterministic function of D(n)D^{(n)}. As a result, we have that

|Cov(𝕀(Wi>0),𝕀(Wj>0)D(n))|=|Cov(𝕀(MLRiπ>0),𝕀(MLRjπ>0)D(n))|Cρ|ij|.\displaystyle|\operatorname{Cov}(\mathbb{I}(W_{i}>0),\mathbb{I}(W_{j}>0)\mid D^{(n)})|=|\operatorname{Cov}(\mathbb{I}(\mathrm{MLR}^{\pi}_{i}>0),\mathbb{I}(\mathrm{MLR}^{\pi}_{j}>0)\mid D^{(n)})|\leq C\rho^{|i-j|}.

Thus, we also conclude that limn𝔼|τq(R)τq(δ)|sn=0\lim_{n\to\infty}\frac{\mathbb{E}|\tau_{q}(R)-\tau_{q}(\delta)|}{s_{n}}=0.

Step 3: Finishing the proof. Recall Equation (B.11), which states that

Γq(mlrnπ)Γq(wn)\displaystyle\Gamma_{q}(\mathrm{mlr}_{n}^{\pi})-\Gamma_{q}(w_{n}) 𝔼[τq(δπ)τq(δ)]sn𝔼|τq(Rπ)τq(δπ)|sn𝔼|τq(δ)τq(R)|sn.\displaystyle\geq\frac{\mathbb{E}[\tau_{q}(\delta^{\pi})-\tau_{q}(\delta)]}{s_{n}}-\frac{\mathbb{E}|\tau_{q}(R^{\pi})-\tau_{q}(\delta^{\pi})|}{s_{n}}-\frac{\mathbb{E}|\tau_{q}(\delta)-\tau_{q}(R)|}{s_{n}}. (B.11)

In Step 1, we showed that τq(δπ)τq(δ)\tau_{q}(\delta^{\pi})\geq\tau_{q}(\delta) for fixed nn. Furthermore, in Step 2, we showed that the second two terms vanish asymptotically. As a result, we take limits and conclude

lim infnΓq(mlrnπ)Γq(wn)0.\liminf_{n\to\infty}\Gamma_{q}(\mathrm{mlr}_{n}^{\pi})-\Gamma_{q}(w_{n})\geq 0.

Furthermore, since we assume that the limits limnΓq(mlrnπ),limnΓq(wn)\lim_{n\to\infty}\Gamma_{q}(\mathrm{mlr}_{n}^{\pi}),\lim_{n\to\infty}\Gamma_{q}(w_{n}) exist, this implies that

limnΓq(mlrnπ)limnΓq(wn)0.\lim_{n\to\infty}\Gamma_{q}(\mathrm{mlr}_{n}^{\pi})-\lim_{n\to\infty}\Gamma_{q}(w_{n})\geq 0.

This concludes the proof. ∎

B.5 Relaxing the assumptions in Theorem 3.2

In this section, we discuss a few ways to relax the assumptions in Theorem 3.2.

First, we can easily relax the assumption that the limits L(q)≔limnΓq(wn)L(q)\coloneqq\lim_{n\to\infty}\Gamma_{q}(w_{n}) and L⋆(q)≔limnΓq(mlrnπ)L^{\star}(q)\coloneqq\lim_{n\to\infty}\Gamma_{q}(\mathrm{mlr}_{n}^{\pi}) exist for each q∈(0,1)q\in(0,1). Indeed, the proof of Theorem 3.2 only uses this assumption to argue that there exists a sequence ηn→0\eta_{n}\to 0 such that L(q+ηn)→L(q),L(q−ηn)→L(q)L(q+\eta_{n})\to L(q),L(q-\eta_{n})\to L(q) (and similarly for L⋆(q)L^{\star}(q)). Thus, we do not need the limits L(q)L(q) to exist for every q∈(0,1)q\in(0,1); rather, the result of Theorem 3.2 will hold (e.g.) for any qq such that L(⋅),L⋆(⋅)L(\cdot),L^{\star}(\cdot) are continuous at qq. Intuitively, this means that the result in Theorem 3.2 holds except at points qq that delineate a “phase transition,” where the power of knockoffs jumps in a discontinuous fashion as qq increases.

Second, it is important to note that the precise form of the local dependency condition (3.8) is not crucial. Indeed, the proof of Theorem 3.2 only uses this condition to show that the partial sums of 𝕀(MLRπ>0)\mathbb{I}(\mathrm{MLR}^{\pi}>0) converge to their conditional mean given DD. To be precise, fix any permutation κ:[pn][pn]\kappa:[p_{n}]\to[p_{n}] and let R=𝕀(κ(MLRπ)>0)R=\mathbb{I}(\kappa(\mathrm{MLR}^{\pi})>0) where κ(MLRπ)\kappa(\mathrm{MLR}^{\pi}) permutes MLRπ\mathrm{MLR}^{\pi} according to κ\kappa. Let δ=𝔼[RD]\delta=\mathbb{E}[R\mid D]. Then the proof of Theorem 3.2 will go through exactly as written if we replace Equation (3.8) with the following condition:

(maxkn≤k≤pn|R¯k−δ¯k|≥ϵ∣D)=o(pn−1)\mathbb{P}\left(\max_{k_{n}\leq k\leq p_{n}}|\bar{R}_{k}-\bar{\delta}_{k}|\geq\epsilon\mid D\right)=o(p_{n}^{-1}) (B.17)

where ϵ>0\epsilon>0 is an arbitrary fixed constant and knk_{n} is some sequence satisfying kn→∞k_{n}\to\infty and knsn→0\frac{k_{n}}{s_{n}}\to 0.

The upshot is this: under any condition where each permutation of 𝕀(MLRπ>0)\mathbb{I}(\mathrm{MLR}^{\pi}>0) obeys a certain strong law of large numbers, we should expect Theorem 3.2 to hold. Although it is unusual to require that a strong law holds for any fixed permutation of a vector, in some cases there is a “worst-case” permutation such that if Equation (B.17) holds for that choice of κ\kappa, then it holds for every choice of κ\kappa. For example, in Corollary C.1, we show that if the exponential decay condition holds, then it suffices to show Equation (B.17) in the case where κ\kappa is the identity permutation, since the identity permutation places the most correlated coordinates of MLRπ\mathrm{MLR}^{\pi} next to each other.
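In practice, the conditional dependence structure of 𝕀(MLRπ>0)\mathbb{I}(\mathrm{MLR}^{\pi}>0) given DD can be inspected directly, since it depends only on known quantities. The sketch below assumes access to posterior draws of these sign indicators given the masked data (for example, from whatever sampler was used to compute the MLR statistics; the input name sign_samples is hypothetical) and reports the largest empirical conditional covariance at each lag, which can be compared against the exponential decay in Equation (3.8) or used to gauge the law-of-large-numbers behavior required by Equation (B.17).

```python
import numpy as np

def max_abs_cov_by_lag(sign_samples):
    """Largest empirical |Cov(I(MLR_i > 0), I(MLR_j > 0) | D)| at each lag |i - j|.

    sign_samples : (n_draws, p) array of posterior draws of I(MLR^pi > 0) given D.
    Returns an array whose d-th entry (d = 1, ..., p - 1) is the maximum absolute
    empirical covariance over all pairs (i, j) with |i - j| = d.
    """
    cov = np.cov(np.asarray(sign_samples, dtype=float), rowvar=False)
    p = cov.shape[0]
    lags = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    return np.array([np.abs(cov[lags == d]).max() for d in range(1, p)])
```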

B.6 Proof of Propositions 3.4-3.5

Proposition 3.4.

If {𝕀(MLRjπ>0)}j=1p\{\mathbb{I}(\mathrm{MLR}_{j}^{\pi}>0)\}_{j=1}^{p} are conditionally independent given DD under PπP^{\pi}, then
ENDiscπ(mlrπ)ENDiscπ(w)\texttt{ENDisc}^{\pi}(\mathrm{mlr}^{\pi})\geq\texttt{ENDisc}^{\pi}(w) for any valid feature statistic ww.

Proof.

Note that the proof here is essentially the same argument used in Proposition 2 of Li and Fithian, (2021), but for completeness we restate it here. Let W=w([𝐗,𝐗~],𝐘)W=w([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y}) denote any other feature statistic. Let S[p]S\subset[p] and Sπ[p]S^{\pi}\subset[p] denote the discovery sets based on WW and MLRπ\mathrm{MLR}^{\pi}.

It suffices to show that 𝔼Pπ[|S|]≤𝔼Pπ[|Sπ|]\mathbb{E}_{P^{\pi}}[|S|]\leq\mathbb{E}_{P^{\pi}}[|S^{\pi}|]. The argument from the proof of Theorem 3.2 (the beginning of Appendix B.4) shows that the number of discoveries |S||S| is a monotone function of ψq(𝕀(sorted(W)>0))\psi_{q}(\mathbb{I}(\mathrm{sorted}(W)>0)), where sorted(W)\mathrm{sorted}(W) denotes WW sorted in decreasing order of its absolute values. Therefore, it suffices to show that if R=𝕀(sorted(W)>0)R=\mathbb{I}(\mathrm{sorted}(W)>0) and Rπ=𝕀(sorted(MLRπ)>0)R^{\pi}=\mathbb{I}(\mathrm{sorted}(\mathrm{MLR}^{\pi})>0),

𝔼Pπ[ψq(R)]𝔼Pπ[ψq(Rπ)],\mathbb{E}_{P^{\pi}}[\psi_{q}(R)]\leq\mathbb{E}_{P^{\pi}}[\psi_{q}(R^{\pi})], (B.18)

where as defined in Eq. (B.4), ψq(η)maxk{k:kkη¯k+1kη¯kq}\psi_{q}(\eta)\coloneqq\max_{k}\left\{k:\frac{k-k\bar{\eta}_{k}+1}{k\bar{\eta}_{k}}\leq q\right\} for any η{0,1}p\eta\in\{0,1\}^{p}. To do this, we recall from the proof of Theorem 3.2 that there exists a DD-measurable vector ξ{0,1}p\xi\in\{0,1\}^{p} and a DD-measurable permutation σ:[p][p]\sigma:[p]\to[p] such that

R=σ(ξRπ)R=\sigma(\xi\oplus R^{\pi})

where for any vector xpx\in\mathbb{R}^{p}, σ(x)(xσ(1),,xσ(p))\sigma(x)\coloneqq(x_{\sigma(1)},\dots,x_{\sigma(p)}) and \oplus denotes the parity function (see Eq. B.13). We now make a few observations:

  1.

    We assume {𝕀(MLRjπ>0)}j=1p\{\mathbb{I}(\mathrm{MLR}_{j}^{\pi}>0)\}_{j=1}^{p} are conditionally independent. Since the magnitudes of MLRπ\mathrm{MLR}^{\pi} are DD-measurable by Proposition 3.1, RπR^{\pi} is equal to a DD-measurable permutation of 𝕀(MLRπ>0)\mathbb{I}(\mathrm{MLR}^{\pi}>0). Therefore, the entries of RπR^{\pi} are conditionally independent given DD.

  2.

    Since σ\sigma and ξ\xi are DD-measurable, the entries of RR are also conditionally independent given DD.

  3.

    Since MLRjπ\mathrm{MLR}_{j}^{\pi} maximizes Pπ(MLRjπ>0D)P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D) among all feature statistics by Proposition 3.3, this implies that for any jj, Pπ(Rjπ>0D)12P^{\pi}(R_{j}^{\pi}>0\mid D)\geq\frac{1}{2}. Thus, Pπ(Rj>0D)=Pπ(ξjRjπ>0D)Pπ(Rjπ>0D)P^{\pi}(R_{j}>0\mid D)=P^{\pi}(\xi_{j}\oplus R_{j}^{\pi}>0\mid D)\leq P^{\pi}(R_{j}^{\pi}>0\mid D) for all jj.

Thus, we can create a coupling R~\tilde{R} such that R~\tilde{R} has the same marginal law as σ(Rπ)\sigma(R^{\pi}), but R~R\tilde{R}\geq R a.s. (by the third observation above). This implies that kkR~¯k+1kR~¯kkkR¯k+1kR¯k\frac{k-k\bar{\tilde{R}}_{k}+1}{k\bar{\tilde{R}}_{k}}\leq\frac{k-k\bar{R}_{k}+1}{k\bar{R}_{k}} for all kk, and therefore ψq(R~)ψq(R)\psi_{q}(\tilde{R})\geq\psi_{q}(R). Therefore

𝔼Pπ[ψq(R)]𝔼[ψq(R~)]=𝔼Pπ[ψq(σ(Rπ))].\mathbb{E}_{P^{\pi}}[\psi_{q}(R)]\leq\mathbb{E}[\psi_{q}(\tilde{R})]=\mathbb{E}_{P^{\pi}}[\psi_{q}(\sigma(R^{\pi}))].

Therefore, to complete the proof, it suffices to show that 𝔼Pπ[ψq(σ(Rπ))]𝔼Pπ[ψq(Rπ)]\mathbb{E}_{P^{\pi}}[\psi_{q}(\sigma(R^{\pi}))]\leq\mathbb{E}_{P^{\pi}}[\psi_{q}(R^{\pi})]—to simplify notation, take R=σ(Rπ)R=\sigma(R^{\pi}), i.e., assume ξ=0\xi=0 without loss of generality. Note that by Proposition 3.3, the entries of δπ𝔼[RπD][0,1]p\delta^{\pi}\coloneqq\mathbb{E}[R^{\pi}\mid D]\in[0,1]^{p} are arranged in decreasing order. To show the desired result, let δ𝔼[RD]\delta\coloneqq\mathbb{E}[R\mid D] and fix any i<ji<j such that δi<δj\delta_{i}<\delta_{j} are “misordered” (i.e. not in decreasing order). It is sufficient to show that 𝔼[ψq(R)D]𝔼[ψq(Rswap({i,j}))D]\mathbb{E}[\psi_{q}(R)\mid D]\leq\mathbb{E}[\psi_{q}(R_{{\text{swap}(\{i,j\})}})\mid D], since conditional on DD, Rπ=σ1(R)R^{\pi}=\sigma^{-1}(R) is simply the result of iteratively swapping elements of RR to sort δ\delta in decreasing order.

To show this, for any ri,rj{0,1}r_{i},r_{j}\in\{0,1\}, let (R1,,ri,,rj,,Rp)(R_{1},\dots,r_{i},\dots,r_{j},\dots,R_{p}) denote the vector which replaces the iith and jjth entries of RR with rir_{i} and rjr_{j}, respectively. Since the entries of RDR\mid D are conditionally independent, after conditioning on R{i,j}R_{-\{i,j\}}, we can write out the relevant conditional expectations:

𝔼[ψq(R)ψq(Rswap({i,j}))D,R{i,j}]\displaystyle\mathbb{E}[\psi_{q}(R)-\psi_{q}(R_{{\text{swap}(\{i,j\})}})\mid D,R_{-\{i,j\}}]
=\displaystyle= ∑ri,rj∈{0,1}[ψq((R1,…,ri,…,rj,…,Rp))−ψq((R1,…,rj,…,ri,…,Rp))]δiri(1−δi)1−riδjrj(1−δj)1−rj\sum_{r_{i},r_{j}\in\{0,1\}}[\psi_{q}((R_{1},\dots,r_{i},\dots,r_{j},\dots,R_{p}))-\psi_{q}((R_{1},\dots,r_{j},\dots,r_{i},\dots,R_{p}))]\delta_{i}^{r_{i}}(1-\delta_{i})^{1-r_{i}}\delta_{j}^{r_{j}}(1-\delta_{j})^{1-r_{j}}
=\displaystyle= [ψq((R1,,1,,0,,Rp))ψq((R1,,0,,1,,Rp))][δi(1δj)δj(1δi)]\displaystyle\left[\psi_{q}((R_{1},\dots,1,\dots,0,\dots,R_{p}))-\psi_{q}((R_{1},\dots,0,\dots,1,\dots,R_{p}))\right]\left[\delta_{i}(1-\delta_{j})-\delta_{j}(1-\delta_{i})\right]
=\displaystyle= [ψq((R1,,1,,0,,Rp))ψq((R1,,0,,1,,Rp))][δiδj]\displaystyle\left[\psi_{q}((R_{1},\dots,1,\dots,0,\dots,R_{p}))-\psi_{q}((R_{1},\dots,0,\dots,1,\dots,R_{p}))\right]\left[\delta_{i}-\delta_{j}\right]
\displaystyle\leq 0.\displaystyle 0.

where the first equality uses conditional independence and the definition of expectation, the second equality cancels relevant terms, the third equality is simple algebra, and the final inequality uses the fact that δi<δj\delta_{i}<\delta_{j} by assumption but ψq((R1,,1,,0,,Rp))ψq((R1,,0,,1,,Rp))0\psi_{q}((R_{1},\dots,1,\dots,0,\dots,R_{p}))-\psi_{q}((R_{1},\dots,0,\dots,1,\dots,R_{p}))\geq 0. In particular, to see this latter point, define r(1,0)=(R1,,1,,0,,Rp),r(0,1)(R1,,0,,1,,Rp)r^{(1,0)}=(R_{1},\dots,1,\dots,0,\dots,R_{p}),r^{(0,1)}\coloneqq(R_{1},\dots,0,\dots,1,\dots,R_{p}) and note that the partial averages obey r¯k(1,0)r¯k(0,1)\bar{r}_{k}^{(1,0)}\geq\bar{r}_{k}^{(0,1)} for every k[p]k\in[p], which implies ψq(r(1,0))ψq(r(0,1))\psi_{q}(r^{(1,0)})\geq\psi_{q}(r^{(0,1)}) by definition of ψq\psi_{q}.

Thus, by the tower property, 𝔼[ψq(R)D]𝔼[ψq(Rswap({i,j}))D]\mathbb{E}[\psi_{q}(R)\mid D]\leq\mathbb{E}[\psi_{q}(R_{{\text{swap}(\{i,j\})}})\mid D], which completes the proof. ∎

Proposition 3.5.

Suppose that (i) 𝐗~\widetilde{\mathbf{X}} are FX knockoffs or Gaussian conditional MX knockoffs (Huang and Janson,, 2020) and (ii) under PP^{\star}, 𝐘∣𝐗∼𝒩(𝐗β,σ2In)\mathbf{Y}\mid\mathbf{X}\sim\mathcal{N}(\mathbf{X}\beta,\sigma^{2}I_{n}). Then under PP^{\star}, {𝕀(MLRjoracle>0)}j=1p∣D\{\mathbb{I}(\mathrm{MLR}_{j}^{\mathrm{oracle}}>0)\}_{j=1}^{p}\mid D are conditionally independent.

Proof.

This result is already proved for the fixed-X case by Li and Fithian, (2021), so we only prove it for the case where 𝐗~\widetilde{\mathbf{X}} are conditional Gaussian MX knockoffs. In particular, define 𝐗^jargmax𝐱{𝐗j,𝐗~j}Pj(𝐱D)\widehat{\mathbf{X}}_{j}^{\star}\coloneqq\operatorname*{arg\,max}_{\mathbf{x}\in\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}}P_{j}^{\star}(\mathbf{x}\mid D) and recall that MLRjoracle>0\mathrm{MLR}_{j}^{\mathrm{oracle}}>0 if and only if 𝐗^j=𝐗j\widehat{\mathbf{X}}_{j}^{\star}=\mathbf{X}_{j}. Therefore, to show that {MLRjoracle>0}j=1pD\{\mathrm{MLR}_{j}^{\mathrm{oracle}}>0\}_{j=1}^{p}\mid D are conditionally independent under PP^{\star}, it suffices to show that {𝐗j}j=1pD\{\mathbf{X}_{j}\}_{j=1}^{p}\mid D are conditionally independent under PP^{\star}.

Fix any value d=(𝐘(0),{𝐱j,𝐱~j}j=1p)d=(\mathbf{Y}^{(0)},\{\mathbf{x}_{j},\widetilde{\mathbf{x}}_{j}\}_{j=1}^{p}) and let 𝐗(0)n×p\mathbf{X}^{(0)}\in\mathbb{R}^{n\times p} denote a possible value for the design matrix which is consistent with observing D=dD=d in the sense that 𝐗j(0){𝐱j,𝐱~j}\mathbf{X}_{j}^{(0)}\in\{\mathbf{x}_{j},\widetilde{\mathbf{x}}_{j}\} for all j[p]j\in[p]. It suffices to show the factorization

P(𝐗=𝐗(0)D=d)j=1pqj(𝐗j(0))P^{\star}(\mathbf{X}=\mathbf{X}^{(0)}\mid D=d)\propto\prod_{j=1}^{p}q_{j}(\mathbf{X}_{j}^{(0)})

for some functions q1,,qp:n0q_{1},\dots,q_{p}:\mathbb{R}^{n}\to\mathbb{R}_{\geq 0} which may depend on the value dd. To do this, observe that

P(𝐗=𝐗(0)D=d)\displaystyle P^{\star}(\mathbf{X}=\mathbf{X}^{(0)}\mid D=d) P𝐘𝐗(𝐘(0)𝐗(0))P(𝐗=𝐗(0){𝐗j,𝐗~j}={𝐱j,𝐱~j}j=1p for j=1,,p)\displaystyle\propto P^{\star}_{\mathbf{Y}\mid\mathbf{X}}(\mathbf{Y}^{(0)}\mid\mathbf{X}^{(0)})\cdot P^{\star}(\mathbf{X}=\mathbf{X}^{(0)}\mid\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}=\{\mathbf{x}_{j},\widetilde{\mathbf{x}}_{j}\}_{j=1}^{p}\text{ for }j=1,\dots,p)
exp(12σ2𝐘(0)𝐗(0)β22)12p.\displaystyle\propto\exp\left(-\frac{1}{2\sigma^{2}}\|\mathbf{Y}^{(0)}-\mathbf{X}^{(0)}\beta\|_{2}^{2}\right)\frac{1}{2^{p}}.

where the last line uses the Gaussian linear model assumption that under PP^{\star}, 𝐘𝐗𝒩(𝐗β,σ2In)\mathbf{Y}\mid\mathbf{X}\sim\mathcal{N}(\mathbf{X}\beta,\sigma^{2}I_{n}) for some fixed βp,σ20\beta\in\mathbb{R}^{p},\sigma^{2}\geq 0 as well as the pairwise exchangeability of {𝐗j,𝐗~j}\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}. Continuing yields,

P(𝐗=𝐗(0)D=d)\displaystyle P^{\star}(\mathbf{X}=\mathbf{X}^{(0)}\mid D=d) exp(12σ2(βT𝐗(0)T𝐗(0)β2𝐘(0)T𝐗(0)β))\displaystyle\propto\exp\left(-\frac{1}{2\sigma^{2}}\left(\beta^{T}{\mathbf{X}^{(0)}}^{T}\mathbf{X}^{(0)}\beta-2{\mathbf{Y}^{(0)}}^{T}\mathbf{X}^{(0)}\beta\right)\right)
exp(𝐘(0)T𝐗(0)βσ2).\displaystyle\propto\exp\left(\frac{{\mathbf{Y}^{(0)}}^{T}\mathbf{X}^{(0)}\beta}{\sigma^{2}}\right).

Here, the last step uses the key assumption that 𝐗~\widetilde{\mathbf{X}} are conditional Gaussian MX knockoffs, in which case 𝐗T𝐗=𝐗~T𝐗~\mathbf{X}^{T}\mathbf{X}=\widetilde{\mathbf{X}}^{T}\widetilde{\mathbf{X}} and 𝐗T𝐗~=𝐗T𝐗S\mathbf{X}^{T}\widetilde{\mathbf{X}}=\mathbf{X}^{T}\mathbf{X}-S for some diagonal matrix Sp×pS\in\mathbb{R}^{p\times p}. In other words, the value of 𝐗T𝐗\mathbf{X}^{T}\mathbf{X} is DD-measurable, and thus conditional on D=dD=d, the value of βT𝐗(0)T𝐗(0)β\beta^{T}{\mathbf{X}^{(0)}}^{T}\mathbf{X}^{(0)}\beta is a constant. At this point, we conclude that

P(𝐗=𝐗(0)D=d)\displaystyle P^{\star}(\mathbf{X}=\mathbf{X}^{(0)}\mid D=d) j=1pexp(𝐘(0)T𝐗j(0)βjσ2).\displaystyle\propto\prod_{j=1}^{p}\exp\left(\frac{{\mathbf{Y}^{(0)}}^{T}\mathbf{X}_{j}^{(0)}\beta_{j}}{\sigma^{2}}\right).

This completes the proof by the factorization argument above. ∎
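Under the assumptions of this proof (a Gaussian linear model with known β\beta and σ2\sigma^{2}, and conditional Gaussian MX knockoffs), the factorization above gives each feature's conditional probability of being the true variable in its pair in closed form, and hence the sign probability of the oracle MLR statistic. The sketch below is only an illustration of this two-point posterior; the function and argument names are hypothetical and not taken from knockpy.

```python
import numpy as np

def oracle_pair_probs(Y, X_cand, X_tilde_cand, beta, sigma2):
    """P(X_j equals the first candidate of its pair | D) implied by the factorization above.

    Y                     : (n,) response vector
    X_cand, X_tilde_cand  : (n, p) arrays holding the two candidates of each pair {X_j, X_tilde_j}
    beta, sigma2          : oracle coefficients (p,) and noise variance (assumed known)

    Each factor is a two-point distribution proportional to exp(Y^T x beta_j / sigma2);
    MLR_j^oracle > 0 exactly when the candidate equal to the true X_j has probability >= 1/2
    (ties aside).
    """
    a = (Y @ X_cand) * beta / sigma2          # log-weight of the first candidate
    b = (Y @ X_tilde_cand) * beta / sigma2    # log-weight of the second candidate
    m = np.maximum(a, b)                      # stabilize the two-point softmax
    return np.exp(a - m) / (np.exp(a - m) + np.exp(b - m))
```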

B.7 Maximizing the expected number of true discoveries

Theorem 3.2 shows that MLR statistics asymptotically maximize the (normalized) expected number of discoveries, but not necessarily the expected number of true discoveries. In this section, we sketch the derivation of AMLR statistics and prove that they asymptotically maximize the expected number of true discoveries.

This section uses the notation introduced in Section B.4. All probabilities and expectations are taken over PπP^{\pi}. As a reminder, for any feature statistic WW, let R=𝕀(sorted(W)>0)R=\mathbb{I}(\mathrm{sorted}(W)>0), let δ=𝔼[RD]\delta=\mathbb{E}[R\mid D], and let ψq()\psi_{q}(\cdot) be as defined in Equation (B.4) so that knockoffs makes τq(R)=ψq(R)+11+q\tau_{q}(R)=\left\lceil\frac{\psi_{q}(R)+1}{1+q}\right\rceil discoveries.

B.7.1 Proof of Corollary 3.1

Corollary 3.1.

AMLR statistics from Definition 3.3 are valid knockoff statistics.

Proof.

The signs of the AMLR statistics are identical to the signs of the MLR statistics. Therefore, by Propositions 3.1 and 3.2 (in the MX and FX case, respectively), it suffices to show that the absolute values of the AMLR statistics are functions of the masked data. However, the AMLR statistics magnitudes are purely a function of (i) the magnitudes of the MLR statistics and (ii) νj\nu_{j}, which is the ratio of conditional probabilities given the masked data DD. These conditional probabilities by definition are functions of DD, and since MLR statistics are valid knockoff statistics by Lemma 3.1, the MLR magnitudes are also a function of the masked data DD by Proposition 3.1. Thus, the AMLR statistic magnitudes are a function of DD, which concludes the proof. ∎

B.7.2 Proof sketch and intuition

Proof sketch: The key idea behind the proof of Theorem 3.2 is to observe that:

  1.

    The number of discoveries τq(R)\tau_{q}(R) only depends on cumulative averages of RR, denoted R¯k=1ki=1kRi\bar{R}_{k}=\frac{1}{k}\sum_{i=1}^{k}R_{i}.

  2.

    As pp\to\infty, R¯ka.s.δ¯k\bar{R}_{k}\stackrel{{\scriptstyle\mathrm{a.s.}}}{{\to}}\bar{\delta}_{k} under suitable assumptions. Thus, τq(R)τq(δ)\tau_{q}(R)\approx\tau_{q}(\delta).

  3.

    If Rπ=𝕀(sorted(MLRπ)>0)R^{\pi}=\mathbb{I}(\mathrm{sorted}(\mathrm{MLR}^{\pi})>0) are MLR statistics with δπ=𝔼[RπD]\delta^{\pi}=\mathbb{E}[R^{\pi}\mid D], then RπR^{\pi} is asymptotically optimal because τq(δπ)τq(δ)\tau_{q}(\delta^{\pi})\geq\tau_{q}(\delta) holds in finite samples for any choice of δ\delta. Thus we conclude:

    τq(Rπ)τq(δπ)τq(δ)τq(R).\tau_{q}(R^{\pi})\approx\tau_{q}(\delta^{\pi})\geq\tau_{q}(\delta)\approx\tau_{q}(R). (B.19)

    In particular, this holds because MLR statistics ensure δπ\delta^{\pi} is in descending order.

To show a similar result for the number of true discoveries, we repeat the three steps used in the proof of Theorem 3.2. To do this, let IjI_{j} be the indicator that the feature corresponding to the jjth coordinate of RR is non-null, and let Bj=IjRjB_{j}=I_{j}R_{j} be the indicator that sorted(W)j>0\mathrm{sorted}(W)_{j}>0 and that the corresponding feature is non-null. Let b=𝔼[B∣D]b=\mathbb{E}[B\mid D]. Then:

  1.

    Let Tq(R,B)T_{q}(R,B) denote the number of true discoveries. We claim that Tq(R,B)T_{q}(R,B) is a function of the successive partial means of RR and BB. To see this, recall from Section B.4 that knockoffs will make τq(R)\tau_{q}(R) discoveries, and in particular it will make discoveries corresponding to any of the first ψq(R)\psi_{q}(R) coordinates of RR which are positive. Therefore,

    Tq(R,B)=j=1ψq(R)Bj=ψq(R)1ψq(R)j=1ψq(R)Bj.T_{q}(R,B)=\sum_{j=1}^{\psi_{q}(R)}B_{j}=\psi_{q}(R)\cdot\frac{1}{\psi_{q}(R)}\sum_{j=1}^{\psi_{q}(R)}B_{j}. (B.20)

    Since ψq(R)\psi_{q}(R) only depends on the successive averages of RR and the second term is itself a successive average of {Bj}\{B_{j}\}, this finishes the first step (see the short sketch following this list).

  2.

    The second step is to show that as pp\to\infty, B¯ka.s.b¯k,R¯ka.s.δ¯k\bar{B}_{k}\stackrel{{\scriptstyle\mathrm{a.s.}}}{{\to}}\bar{b}_{k},\bar{R}_{k}\stackrel{{\scriptstyle\mathrm{a.s.}}}{{\to}}\bar{\delta}_{k} and therefore Tq(R,B)Tq(δ,b)T_{q}(R,B)\approx T_{q}(\delta,b). This is done using the same techniques as Theorem 3.2, although it requires an extra assumption that BB also obeys the local dependency condition (3.8). Like the original condition, this condition also only depends on the posterior of BDB\mid D, so it can be diagnosed using the data at hand.

  3.

    To complete the proof, we define the adjusted MLR statistic AMLRπ∈ℝp\mathrm{AMLR}^{\pi}\in\mathbb{R}^{p} with corresponding R~,δ~,b~\tilde{R},\tilde{\delta},\tilde{b} such that Tq(δ~,b~)≥Tq(δ,b)T_{q}(\tilde{\delta},\tilde{b})\geq T_{q}(\delta,b) holds in finite samples for any other feature statistic WW. It is easy to see that AMLRπ\mathrm{AMLR}^{\pi} must have the same signs as the original MLR statistics MLRπ\mathrm{MLR}^{\pi}, since the signs of MLRπ\mathrm{MLR}^{\pi} maximize δπ\delta^{\pi} and bπb^{\pi} coordinatewise. However, the absolute values of AMLRπ\mathrm{AMLR}^{\pi} may differ from those of MLRπ\mathrm{MLR}^{\pi}, since it is not always true that sorting δ\delta in decreasing order maximizes Tq(δ,b)T_{q}(\delta,b).
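To make Equation (B.20) from step 1 concrete, here is a short sketch of Tq(R,B)T_{q}(R,B); it assumes the knockoff_discoveries helper from the sketch following Equation (B.4) is in scope, and it treats the non-null indicators BB as given.

```python
import numpy as np

def true_discoveries(R, B, q):
    """T_q(R, B) from Eq. (B.20): the number of non-null features among the first
    psi_q(R) coordinates (sorted by |W|) whose feature statistic is positive.
    psi_q(R) is the first return value of knockoff_discoveries."""
    psi, _ = knockoff_discoveries(R, q)
    B = np.asarray(B)
    return int(np.sum(B[:min(psi, len(B))]))
```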

It turns out that the absolute values of the AMLR statistics in Eq. (3.12) yield vectors δ~,b~\tilde{\delta},\tilde{b} which maximize Tq(δ~,b~)T_{q}(\tilde{\delta},\tilde{b}) up to an O(1)O(1) additive constant. Theorem B.6 formally proves this, but we now give some intuition.

Intuition: To determine the optimal absolute values for AMLR statistics, assume WLOG by relabelling the variables that |MLR1π||MLR2π||MLRpπ||\mathrm{MLR}_{1}^{\pi}|\geq|\mathrm{MLR}_{2}^{\pi}|\geq\dots\geq|\mathrm{MLR}_{p}^{\pi}|. Let S[p]S\subset[p] denote an optimization variable representing the set of variables with the KK largest absolute values for the AMLR statistics, for some KK. We will try to design SS such that we can make as many true discoveries within SS as possible.

As argued above, AMLR and MLR statistics should have the same signs. Thus, roughly speaking, we can discover all features with positive signs among SS whenever

1|S|jS𝕀(MLRjπ>0)(1+q)1.\frac{1}{|S|}\sum_{j\in S}\mathbb{I}(\mathrm{MLR}_{j}^{\pi}>0)\geq(1+q)^{-1}.

Making our usual approximation 𝕀(MLRjπ>0)≈𝔼[𝕀(MLRjπ>0)∣D]=δjπ\mathbb{I}(\mathrm{MLR}_{j}^{\pi}>0)\approx\mathbb{E}[\mathbb{I}(\mathrm{MLR}_{j}^{\pi}>0)\mid D]=\delta_{j}^{\pi}, this is equivalent to the constraint

∑j∈S(1+q)−1−δjπ≤0.\sum_{j\in S}(1+q)^{-1}-\delta_{j}^{\pi}\leq 0. (B.21)

Furthermore, if we can discover all of the features with positive signs in SS, we make exactly jSBjπjSbjπ\sum_{j\in S}B_{j}^{\pi}\approx\sum_{j\in S}b_{j}^{\pi} true discoveries, where BjπB_{j}^{\pi} is the indicator that the jjth MLR statistic is positive and the jjth feature is nonnull. Maximizing this approximate objective subject to the constraint in Eq. (B.21) yields the optimization problem:

maxS⊂[p]∑j∈Sbjπ s.t. ∑j∈S(1+q)−1−δjπ≤0.\max_{S\subset[p]}\sum_{j\in S}b_{j}^{\pi}\text{ s.t. }\sum_{j\in S}(1+q)^{-1}-\delta_{j}^{\pi}\leq 0. (B.22)

In other words, including jSj\in S has “benefit” bjπb_{j}^{\pi} and “cost” (1+q)1δjπ(1+q)^{-1}-\delta_{j}^{\pi}. This is a simple integer linear program with one constraint, often known as a “knapsack” problem. An approximately optimal solution to this problem is to do the following:

  • Include all variables with “negative” costs, meaning δjπ=Pπ(MLRjπ>0D)(1+q)1\delta_{j}^{\pi}=P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)\geq(1+q)^{-1}. This is accomplished by ensuring that these features have the largest absolute values.

  • Prioritize all other variables in descending order of the ratio of the benefit to the cost, bjπ(1+q)1δjπ\frac{b_{j}^{\pi}}{(1+q)^{-1}-\delta_{j}^{\pi}}.

This solution is indeed accomplished by the AMLR formula (Eq. (3.12)), which gives the highest absolute values to features with negative costs; then, all other absolute values have the same order as the benefit-to-cost ratios bjπ(1+q)1δjπ=νj=Pπ(MLRjπ>0,j1(θ)D)(1+q)1Pπ(MLRjπ>0D)\frac{b_{j}^{\pi}}{(1+q)^{-1}-\delta_{j}^{\pi}}=\nu_{j}=\frac{P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0,j\in\mathcal{H}_{1}(\theta^{\star})\mid D)}{(1+q)^{-1}-P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)}.
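The sketch below illustrates this prioritization. It reproduces only the ordering induced by the knapsack heuristic, not the exact AMLR absolute values from Eq. (3.12), and it breaks ties among the zero- or negative-cost features arbitrarily.

```python
import numpy as np

def amlr_priority_order(delta_pi, b_pi, q):
    """Order features as in the knapsack heuristic above.

    delta_pi : P(MLR_j^pi > 0 | D) for each feature
    b_pi     : P(MLR_j^pi > 0 and j non-null | D) for each feature
    q        : nominal FDR level

    Features with cost (1+q)^{-1} - delta_pi_j <= 0 are placed first; the remaining
    features are ranked in descending order of the benefit-to-cost ratio nu_j.
    """
    delta_pi = np.asarray(delta_pi, dtype=float)
    b_pi = np.asarray(b_pi, dtype=float)
    cost = 1.0 / (1.0 + q) - delta_pi
    nu = np.where(cost > 0, b_pi / np.maximum(cost, 1e-12), np.inf)
    return np.argsort(-nu, kind="stable")     # indices from highest to lowest priority
```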

B.7.3 Theorem statement and proof

We now show that AMLR statistics asymptotically maximize the true positive rate. To do this, we require two additional regularity conditions beyond those assumed in Theorem 3.2. First, we need a condition that the number of non-nulls under PnπP_{n}^{\pi} is not too heavy-tailed; namely, that its coefficient of variation is uniformly bounded.

Assumption B.1.

There exists a constant CC\in\mathbb{R} such that as nn\to\infty,

VarPnπ(|1(θ)|)snC,\frac{\sqrt{\mathrm{Var}_{P_{n}^{\pi}}(|\mathcal{H}_{1}(\theta^{\star})|)}}{s_{n}}\leq C,

where sn𝔼Pnπ[|1(θ)|]s_{n}\coloneqq\mathbb{E}_{P_{n}^{\pi}}[|\mathcal{H}_{1}(\theta^{\star})|] is the expected number of non-nulls under PnπP_{n}^{\pi}.

Assumption B.1 is needed for a technical reason. As we will see in Step 3 of the proof, combining this assumption with Lemma C.2 ensures that the normalized number of discoveries is uniformly integrable, which is necessary to show that certain error terms converge in L1L^{1} to zero. Nonetheless, this assumption is already quite mild, and it is satisfied in previously studied linear and polynomial sparsity regimes (Donoho and Jin,, 2004; Weinstein et al.,, 2017; Ke et al.,, 2020).

Next, we need an additional local dependence condition.

Assumption B.2 (Additional local dependence condition).

Let Ij+=𝕀(MLRjπ>0,j1(θ))I_{j}^{+}=\mathbb{I}(\mathrm{MLR}_{j}^{\pi}>0,j\in\mathcal{H}_{1}(\theta^{\star})) indicate the event that jj is non-null and MLRjπ\mathrm{MLR}_{j}^{\pi} is positive. Let Ij=𝕀(MLRjπ<0,j1(θ))I_{j}^{-}=\mathbb{I}(\mathrm{MLR}_{j}^{\pi}<0,j\in\mathcal{H}_{1}(\theta^{\star})) indicate the event that jj is non-null and MLRjπ\mathrm{MLR}_{j}^{\pi} is negative. We assume that there exist constants C0,ρ(0,1)C\geq 0,\rho\in(0,1) such that for all i,j[p]i,j\in[p]:

|CovPnπ(Ii+,Ij+D(n))|Cρ|ij|.|\operatorname{Cov}_{P^{\pi}_{n}}(I_{i}^{+},I_{j}^{+}\mid D^{(n)})|\leq C\rho^{|i-j|}. (B.23)
|CovPnπ(Ii,Ij+D(n))|Cρ|ij|.|\operatorname{Cov}_{P^{\pi}_{n}}(I_{i}^{-},I_{j}^{+}\mid D^{(n)})|\leq C\rho^{|i-j|}. (B.24)
|CovPnπ(Ii,IjD(n))|Cρ|ij|.|\operatorname{Cov}_{P^{\pi}_{n}}(I_{i}^{-},I_{j}^{-}\mid D^{(n)})|\leq C\rho^{|i-j|}. (B.25)

Assumption B.2 is needed because it implies that for any feature statistic WW, {𝕀(Wi>0,i1(θ))}i=1p\{\mathbb{I}(W_{i}>0,i\in\mathcal{H}_{1}(\theta^{\star}))\}_{i=1}^{p} obey the same local dependence condition.

Lemma B.5.

Assume Assumption B.2. Then for any feature statistic WW and all i,j[p]i,j\in[p],

|CovPnπ(𝕀(Wi>0,i∈ℋ1(θ⋆)),𝕀(Wj>0,j∈ℋ1(θ⋆))∣D(n))|≤Cρ|i−j|.|\operatorname{Cov}_{P_{n}^{\pi}}(\mathbb{I}(W_{i}>0,i\in\mathcal{H}_{1}(\theta^{\star})),\mathbb{I}(W_{j}>0,j\in\mathcal{H}_{1}(\theta^{\star}))\mid D^{(n)})|\leq C\rho^{|i-j|}.
Proof.

By Corollary B.1, the event sign(Wj)≠sign(MLRjπ)\operatorname*{sign}(W_{j})\neq\operatorname*{sign}(\mathrm{MLR}_{j}^{\pi}) is D(n)D^{(n)}-measurable. This implies that for each j∈[p]j\in[p], there is a deterministic (conditional on D(n)D^{(n)}) choice of Ij+,Ij−I_{j}^{+},I_{j}^{-} such that either 𝕀(Wj>0,j∈ℋ1(θ⋆))=Ij+\mathbb{I}(W_{j}>0,j\in\mathcal{H}_{1}(\theta^{\star}))=I_{j}^{+} or 𝕀(Wj>0,j∈ℋ1(θ⋆))=Ij−\mathbb{I}(W_{j}>0,j\in\mathcal{H}_{1}(\theta^{\star}))=I_{j}^{-}. As a result, we have that

|CovPnπ(𝕀(Wi>0,i1(θ)),𝕀(Wj>0,j1(θ))D(n))|\displaystyle|\operatorname{Cov}_{P_{n}^{\pi}}(\mathbb{I}(W_{i}>0,i\in\mathcal{H}_{1}(\theta^{\star})),\mathbb{I}(W_{j}>0,j\in\mathcal{H}_{1}(\theta^{\star}))\mid D^{(n)})|
\displaystyle\leq max(|CovPnπ(Ii+,Ij+∣D(n))|,|CovPnπ(Ii+,Ij−∣D(n))|,|CovPnπ(Ii−,Ij+∣D(n))|,|CovPnπ(Ii−,Ij−∣D(n))|)\max(|\operatorname{Cov}_{P^{\pi}_{n}}(I_{i}^{+},I_{j}^{+}\mid D^{(n)})|,|\operatorname{Cov}_{P^{\pi}_{n}}(I_{i}^{+},I_{j}^{-}\mid D^{(n)})|,|\operatorname{Cov}_{P^{\pi}_{n}}(I_{i}^{-},I_{j}^{+}\mid D^{(n)})|,|\operatorname{Cov}_{P^{\pi}_{n}}(I_{i}^{-},I_{j}^{-}\mid D^{(n)})|)
\displaystyle\leq Cρ|ij|\displaystyle C\rho^{|i-j|}

where the last step follows by Assumption B.2. ∎

Theorem B.6.

Suppose the conditions of Theorem 3.2 plus Assumptions B.1 and B.2 hold. Let amlrnπ\mathrm{amlr}_{n}^{\pi} denote the AMLR statistics with respect to PnπP_{n}^{\pi}. Then for any sequence of feature statistics {wn}n\{w_{n}\}_{n\in\mathbb{N}}, the following holds for all but countably many q[0,1]q\in[0,1]:

lim infnTPqπ(amlrnπ)TPqπ(wn)sn0,\liminf_{n\to\infty}\frac{\mathrm{TP}_{q}^{\pi}(\mathrm{amlr}_{n}^{\pi})-\mathrm{TP}_{q}^{\pi}(w_{n})}{s_{n}}\geq 0, (B.26)

where sns_{n} is the expected number of non-nulls under PnπP_{n}^{\pi} as defined in Assumption 3.1, and TPqπ(wn)\mathrm{TP}_{q}^{\pi}(w_{n}) is the expected number of true discoveries made by feature statistic wnw_{n} under PnπP_{n}^{\pi} as defined in Section B.7.

Proof.

The proof is in three steps.

Step 1: Notation and setup. Throughout, we use the notation and ideas from Section B.7.2 and the proof of Theorem 3.2 (Section B.4), although to ease readability, we will try to give reminders about notation when needed. In particular:

  • Define W=wn([𝐗(n),𝐗~(n)],𝐘(n))pnW=w_{n}([\mathbf{X}^{(n)},\widetilde{\mathbf{X}}^{(n)}],\mathbf{Y}^{(n)})\in\mathbb{R}^{p_{n}} and AMLRπ=\mathrm{AMLR}^{\pi}= amlrnπ([𝐗(n),𝐗~(n)],𝐘(n))pn\mathrm{amlr}_{n}^{\pi}([\mathbf{X}^{(n)},\widetilde{\mathbf{X}}^{(n)}],\mathbf{Y}^{(n)})\in\mathbb{R}^{p_{n}}. For simplicity, we suppress the dependence on nn.

  • Define R=𝕀(sorted(W)>0)R=\mathbb{I}(\mathrm{sorted}(W)>0) and R~=𝕀(sorted(AMLRπ)>0)\tilde{R}=\mathbb{I}(\mathrm{sorted}(\mathrm{AMLR}^{\pi})>0).

  • Let σ,σ~:[pn]→[pn]\sigma,\tilde{\sigma}:[p_{n}]\to[p_{n}] denote the random permutations such that σ(W)\sigma(W) and σ~(AMLRπ)\tilde{\sigma}(\mathrm{AMLR}^{\pi}) are sorted in descending order of absolute values; with this notation, note that Rj≔𝕀(Wσ(j)>0),R~j≔𝕀(AMLRσ~(j)π>0)R_{j}\coloneqq\mathbb{I}(W_{\sigma(j)}>0),\tilde{R}_{j}\coloneqq\mathbb{I}(\mathrm{AMLR}^{\pi}_{\tilde{\sigma}(j)}>0).

  • Let Ij=𝕀(σ(j)1(θ))I_{j}=\mathbb{I}(\sigma(j)\in\mathcal{H}_{1}(\theta^{\star})) and I~j𝕀(σ~(j)1(θ))\tilde{I}_{j}\coloneqq\mathbb{I}(\tilde{\sigma}(j)\in\mathcal{H}_{1}(\theta^{\star})) be the indicators that the feature statistic with the jjth largest absolute value of WW (resp. AMLRπ\mathrm{AMLR}^{\pi}) represents a non-null feature.

  • Let Bj≔RjIjB_{j}\coloneqq R_{j}I_{j} be the indicator that the feature with the jjth largest absolute value among WW is non-null and that its feature statistic is positive. Similarly, B~j≔R~jI~j\tilde{B}_{j}\coloneqq\tilde{R}_{j}\tilde{I}_{j} is the indicator that the feature with the jjth largest AMLR statistic is non-null and its AMLR statistic is positive. Let B,B~∈ℝpnB,\tilde{B}\in\mathbb{R}^{p_{n}} denote the vectors of these indicators.

  • Define δ~=𝔼[R~D(n)],δ=𝔼[RD(n)],b~=𝔼[B~D(n)]\tilde{\delta}=\mathbb{E}[\tilde{R}\mid D^{(n)}],\delta=\mathbb{E}[R\mid D^{(n)}],\tilde{b}=\mathbb{E}[\tilde{B}\mid D^{(n)}] and b=𝔼[BD(n)]b=\mathbb{E}[B\mid D^{(n)}] to be the conditional expectations of these quantities given the masked data.

  • Throughout, we only consider feature statistics whose values are nonzero a.s., because one can provably increase the power of knockoffs by ensuring that each coordinate of WW is nonzero.

Equation (B.20) shows that we can write

TPqπ(wn)𝔼Pnπ[|Swn1(θ)|]=𝔼Pnπ[Tq(R,B)] where Tq(R,B)ψq(R)B¯ψq(R)j=1ψq(R)Bj\mathrm{TP}_{q}^{\pi}(w_{n})\coloneqq\mathbb{E}_{P_{n}^{\pi}}[|S_{w_{n}}\cap\mathcal{H}_{1}(\theta^{\star})|]=\mathbb{E}_{P_{n}^{\pi}}[T_{q}(R,B)]\text{ where }T_{q}(R,B)\coloneqq\psi_{q}(R)\bar{B}_{\psi_{q}(R)}\coloneqq\sum_{j=1}^{\psi_{q}(R)}B_{j} (B.27)

and similarly TPqπ(amlrnπ)=𝔼Pnπ[Tq(R~,B~)]\mathrm{TP}_{q}^{\pi}(\mathrm{amlr}_{n}^{\pi})=\mathbb{E}_{P_{n}^{\pi}}[T_{q}(\tilde{R},\tilde{B})]. Therefore it suffices to show that

lim infn→∞sn−1𝔼Pnπ[Tq(R~,B~)−Tq(R,B)]≥0.\liminf_{n\to\infty}s_{n}^{-1}\mathbb{E}_{P_{n}^{\pi}}\left[T_{q}(\tilde{R},\tilde{B})-T_{q}(R,B)\right]\geq 0. (B.28)

To do this, we make the following approximation using the triangle inequality:

sn1(Tq(R~,B~)Tq(R,B))sn1(Tq(δ~,b~)Tq(δ,b)Term 1|Tq(R,B)Tq(δ,b)|Term 2|Tq(R~,B~)Tq(δ~,b~)|Term 3).\displaystyle s_{n}^{-1}\left(T_{q}(\tilde{R},\tilde{B})-T_{q}(R,B)\right)\geq s_{n}^{-1}\left(\underbrace{T_{q}(\tilde{\delta},\tilde{b})-T_{q}(\delta,b)}_{\text{Term 1}}-\underbrace{|T_{q}(R,B)-T_{q}(\delta,b)|}_{\text{Term 2}}-\underbrace{|T_{q}(\tilde{R},\tilde{B})-T_{q}(\tilde{\delta},\tilde{b})|}_{\text{Term 3}}\right). (B.29)

Step 2 of the proof shows that Term 1 is asymptotically positive, and Step 3 shows that Terms 2 and 3 are asymptotically negligible in expectation (i.e., of order o(sn)o(s_{n})). This is sufficient to complete the proof.

Step 2: Analyzing Term 1. In this step, we show that

lim infn𝔼[Tq(δ~,b~)Tq(δ,b)sn]0.\liminf_{n\to\infty}\mathbb{E}\left[\frac{T_{q}(\tilde{\delta},\tilde{b})-T_{q}(\delta,b)}{s_{n}}\right]\geq 0. (B.30)

To do this, Step 2a shows that we may assume sign(AMLRπ)=sign(W)\operatorname*{sign}(\mathrm{AMLR}^{\pi})=\operatorname*{sign}(W) and therefore δ~=σ~(σ−1(δ))\tilde{\delta}=\tilde{\sigma}(\sigma^{-1}(\delta)) and b~=σ~(σ−1(b))\tilde{b}=\tilde{\sigma}(\sigma^{-1}(b)) (with the usual notation that σ(δ)≔(δσ(1),…,δσ(p))\sigma(\delta)\coloneqq(\delta_{\sigma(1)},\dots,\delta_{\sigma(p)})). Step 2b then proves Eq. (B.30).

Step 2a: Define W^=|W|sign(AMLRπ)\hat{W}=|W|\cdot\operatorname*{sign}(\mathrm{AMLR}^{\pi}) to have the absolute values of WW and the signs of the AMLR statistics. We claim that if δ^,b^\hat{\delta},\hat{b} are defined analogously for W^\hat{W} instead of WW, then Tq(δ,b)Tq(δ^,b^)T_{q}(\delta,b)\leq T_{q}(\hat{\delta},\hat{b}).

To see this, we prove that (i) δjδ^j\delta_{j}\leq\hat{\delta}_{j} holds elementwise and (ii) bjb^jb_{j}\leq\hat{b}_{j} holds elementwise. Results (i) and (ii) complete the proof of Step 2(a) since Tq(δ,b)=j=1ψq(δ)bjT_{q}(\delta,b)=\sum_{j=1}^{\psi_{q}(\delta)}b_{j} is nondecreasing in its inputs (namely because bb is nonnegative and ψq(δ)\psi_{q}(\delta) is nondecreasing in its inputs).

To show (i), Proposition 3.3 shows that δjPnπ(Wσ(j)>0D(n))Pnπ(AMLRσ(j)>0D(n))=Pnπ(W^σ(j)>0D(n))=δ^j\delta_{j}\coloneqq P_{n}^{\pi}(W_{\sigma(j)}>0\mid D^{(n)})\leq P_{n}^{\pi}(\mathrm{AMLR}_{\sigma(j)}>0\mid D^{(n)})=P_{n}^{\pi}(\hat{W}_{\sigma(j)}>0\mid D^{(n)})=\hat{\delta}_{j}, where this argument also uses the facts that (a) MLR and AMLR statistics have the same signs and (b) the permutation σ\sigma depends only on |W||W| and thus is D(n)D^{(n)}-measurable.

To show (ii), we note that the law of total probability yields

δj=Pnπ(Wσ(j)>0Ij=1,D(n))Pnπ(Ij=1D(n))+12Pnπ(Ij=0D(n))\delta_{j}=P_{n}^{\pi}(W_{\sigma(j)}>0\mid I_{j}=1,D^{(n)})P_{n}^{\pi}(I_{j}=1\mid D^{(n)})+\frac{1}{2}P_{n}^{\pi}(I_{j}=0\mid D^{(n)})

where above we use the fact that Pnπ(Wσ(j)>0Ij=0,D(n))=12P_{n}^{\pi}(W_{\sigma(j)}>0\mid I_{j}=0,D^{(n)})=\frac{1}{2}—i.e., under the null, Wσ(j)W_{\sigma(j)} is conditionally symmetric. Thus, δjδ^j\delta_{j}\leq\hat{\delta}_{j} holds iff Pnπ(Wσ(j)>0Ij=1,D(n))Pnπ(W^σ(j)>0Ij=1,D(n))P_{n}^{\pi}(W_{\sigma(j)}>0\mid I_{j}=1,D^{(n)})\leq P_{n}^{\pi}(\hat{W}_{\sigma(j)}>0\mid I_{j}=1,D^{(n)}). Using this result, we conclude

bj\displaystyle b_{j} =Pnπ(Wσ(j)>0,Ij=1D(n))\displaystyle=P_{n}^{\pi}(W_{\sigma(j)}>0,I_{j}=1\mid D^{(n)})
=Pnπ(Wσ(j)>0Ij=1,D(n))Pnπ(Ij=1D(n))\displaystyle=P_{n}^{\pi}(W_{\sigma(j)}>0\mid I_{j}=1,D^{(n)})P_{n}^{\pi}(I_{j}=1\mid D^{(n)})
Pnπ(W^σ(j)>0Ij=1,D(n))Pnπ(Ij=1D(n))\displaystyle\leq P_{n}^{\pi}(\hat{W}_{\sigma(j)}>0\mid I_{j}=1,D^{(n)})P_{n}^{\pi}(I_{j}=1\mid D^{(n)}) since δjδ^j by result (i)\displaystyle\text{ since }\delta_{j}\leq\hat{\delta}_{j}\text{ by result (i)}
=b^j.\displaystyle=\hat{b}_{j}.

This proves that T(δ^,b^)T(δ,b)T(\hat{\delta},\hat{b})\geq T(\delta,b). In this step we seek to show that T(δ~,b~)T(δ,b)T(\tilde{\delta},\tilde{b})\geq T(\delta,b); since replacing WW with W^\hat{W} leaves the AMLR statistics (and hence δ~,b~\tilde{\delta},\tilde{b}) unchanged and can only increase T(δ,b)T(\delta,b), it suffices to prove this claim with W^\hat{W} in place of WW. Thus, we may assume that W=W^W=\hat{W} and hence sign(W)=sign(AMLRπ)\operatorname*{sign}(W)=\operatorname*{sign}(\mathrm{AMLR}^{\pi}). This implies that δ,b\delta,b and δ~,b~\tilde{\delta},\tilde{b} take the same values but in different orders; formally, δ~=σ~(σ1(δ))\tilde{\delta}=\tilde{\sigma}(\sigma^{-1}(\delta)) and b~=σ~(σ1(b))\tilde{b}=\tilde{\sigma}(\sigma^{-1}(b)).

Step 2b: Now we show Eq. (B.30). Recall that Tq(δ,b)=j=1ψq(δ)bjT_{q}(\delta,b)=\sum_{j=1}^{\psi_{q}(\delta)}b_{j} is the partial sum of the first ψq(δ)\psi_{q}(\delta) elements of bb, where ψq(δ)max{k:kkδ¯k+1kδ¯kq}\psi_{q}(\delta)\coloneqq\max\left\{k:\frac{k-k\bar{\delta}_{k}+1}{k\bar{\delta}_{k}}\leq q\right\} is the maximum integer such that kkδ¯k+1kδ¯kq\frac{k-k\bar{\delta}_{k}+1}{k\bar{\delta}_{k}}\leq q. It follows that Tq(δ,b)T_{q}(\delta,b) is bounded by the following quantity:

Tq(δ,b)maxS[p]jSbj s.t. |S||S|δ¯S+1|S|δ¯Sq,T_{q}(\delta,b)\leq\max_{S\subset[p]}\sum_{j\in S}b_{j}\text{ s.t. }\frac{|S|-|S|\bar{\delta}_{S}+1}{|S|\bar{\delta}_{S}}\leq q,

where the notation δ¯S=1|S|jSδj\bar{\delta}_{S}=\frac{1}{|S|}\sum_{j\in S}\delta_{j} denotes the average of δj\delta_{j} over the set SS. Indeed, this inequality follows because Tq(δ,b)T_{q}(\delta,b) is precisely the optimal value of this optimization problem when SS is constrained to be a contiguous set of the form {1,,k}\{1,\dots,k\} for some kk. Relaxing this constraint to allow SS to be an arbitrary subset of [p][p] can only increase the optimal value. Manipulating this optimization problem yields:

maxS[p]jSbj s.t. |S||S|δ¯S+1|S|δ¯Sq=\displaystyle\max_{S\subset[p]}\sum_{j\in S}b_{j}\text{ s.t. }\frac{|S|-|S|\bar{\delta}_{S}+1}{|S|\bar{\delta}_{S}}\leq q= maxS[p]jSbj s.t. |S||S|δ¯S+1(|S|δ¯S)q\displaystyle\max_{S\subset[p]}\sum_{j\in S}b_{j}\text{ s.t. }|S|-|S|\bar{\delta}_{S}+1\leq(|S|\bar{\delta}_{S})q
=\displaystyle= maxS[p]j=1p𝕀(jS)bj s.t. j=1p𝕀(jS)(1(1+q)δj)1\displaystyle\max_{S\subset[p]}\sum_{j=1}^{p}\mathbb{I}(j\in S)b_{j}\text{ s.t. }\sum_{j=1}^{p}\mathbb{I}(j\in S)\left(1-(1+q)\delta_{j}\right)\leq-1
=\displaystyle= maxS[p]j=1p𝕀(jS)bj s.t. j=1p𝕀(jS)(11+qδj)11+q.\displaystyle\max_{S\subset[p]}\sum_{j=1}^{p}\mathbb{I}(j\in S)b_{j}\text{ s.t. }\sum_{j=1}^{p}\mathbb{I}(j\in S)\left(\frac{1}{1+q}-\delta_{j}\right)\leq-\frac{1}{1+q}.

This is an integer linear program with pp integer decision variables xj𝕀(jS)x_{j}\coloneqq\mathbb{I}(j\in S) and one constraint:

=maxx1,,xp{0,1}j=1pbjxj s.t. j=1p(11+qδj)xj11+q.=\max_{x_{1},\dots,x_{p}\in\{0,1\}}\sum_{j=1}^{p}b_{j}x_{j}\text{ s.t. }\sum_{j=1}^{p}\left(\frac{1}{1+q}-\delta_{j}\right)x_{j}\leq-\frac{1}{1+q}.

Such problems—commonly referred to as knapsack problems—are well studied. The maximum value is bounded by the following greedy strategy:

  • Let Sobvious={j[p]:δj11+q}S_{\mathrm{obvious}}=\{j\in[p]:\delta_{j}\geq\frac{1}{1+q}\} be the set of coordinates such that the constraint coefficient on xjx_{j} is nonpositive. This is an “obvious” set because for jSobviousj\in S_{\mathrm{obvious}}, setting xj=1x_{j}=1 never decreases the objective value (since bj0b_{j}\geq 0) and never increases the constraint value (since (1+q)1δj0(1+q)^{-1}-\delta_{j}\leq 0). Thus, there is always an optimal solution with xj=1x_{j}=1 for all jSobviousj\in S_{\mathrm{obvious}}.

  • Let nobvious=|Sobvious|n_{\mathrm{obvious}}=|S_{\mathrm{obvious}}| denote the number of obvious coordinates, and let Snonobvious=[p]SobviousS_{\mathrm{non-obvious}}=[p]\setminus S_{\mathrm{obvious}} denote the non-obvious coordinates.

  • After setting xj=1x_{j}=1 for all coordinates jSobviousj\in S_{\mathrm{obvious}}, we should sort the coordinates in SnonobviousS_{\mathrm{non-obvious}} in descending order of the ratio bj(1+q)1δj\frac{b_{j}}{(1+q)^{-1}-\delta_{j}} and include as many coordinates of SnonobviousS_{\mathrm{non-obvious}} as possible until we hit the constraint that j=1pxj((1+q)1δj)11+q\sum_{j=1}^{p}x_{j}((1+q)^{-1}-\delta_{j})\leq-\frac{1}{1+q}. Then, if we include one additional coordinate, the value of this solution (which may violate the constraint by a small amount) upper bounds the optimal value of the overall integer program. Indeed, this is because this solution has an objective value at least as large as the optimal value of the relaxed LP, which only requires x1,,xp[0,1]x_{1},\dots,x_{p}\in[0,1] instead of x1,,xp{0,1}x_{1},\dots,x_{p}\in\{0,1\}, and the relaxed LP in turn upper bounds the integer program (Martello and Toth,, 1990). A short numerical sketch of this greedy bound appears after this list.
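
The following Python sketch implements the greedy bound just described on hypothetical toy inputs; it is illustrative only and is not used in the proof.

import numpy as np

def greedy_knapsack_bound(b, delta, q):
    # Upper bound sum_{j=1}^{k*+1} b_{gamma(j)} on the knapsack problem above, where gamma
    # puts S_obvious first and then sorts by the ratio b_j / ((1+q)^{-1} - delta_j).
    b, delta = np.asarray(b, dtype=float), np.asarray(delta, dtype=float)
    c = 1.0 / (1.0 + q) - delta                          # constraint coefficients
    obvious = np.where(c <= 0)[0]                        # S_obvious: delta_j >= 1/(1+q)
    rest = np.where(c > 0)[0]
    rest = rest[np.argsort(-b[rest] / c[rest])]          # descending ratio
    gamma = np.concatenate([obvious, rest]).astype(int)  # one valid choice of gamma
    csum = np.cumsum(c[gamma])
    feasible = np.where(csum <= -1.0 / (1.0 + q))[0]
    if len(feasible) == 0:
        return 0.0                                       # no feasible S exists, so T_q(delta, b) = 0
    k_star = int(feasible.max()) + 1                     # k* as a count of included coordinates
    return float(b[gamma[: min(k_star + 1, len(gamma))]].sum())

# Hypothetical toy inputs with b_j <= delta_j.
rng = np.random.default_rng(0)
delta = rng.uniform(0.3, 1.0, size=20)
b = delta * rng.uniform(0.5, 1.0, size=20)
print(greedy_knapsack_bound(b, delta, q=0.2))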

To make this strategy precise, let γ:[p][p]\gamma:[p]\to[p] denote any permutation with the following properties:

  • γ(1),,γ(nobvious)Sobvious\gamma(1),\dots,\gamma(n_{\mathrm{obvious}})\in S_{\mathrm{obvious}}. I.e., the first nobviousn_{\mathrm{obvious}} coordinates specified by γ\gamma are the set SobviousS_{\mathrm{obvious}}.

  • For nobvious<ijn_{\mathrm{obvious}}<i\leq j, bγ(i)(1+q)1δγ(i)bγ(j)(1+q)1δγ(j)\frac{b_{\gamma(i)}}{(1+q)^{-1}-\delta_{\gamma(i)}}\geq\frac{b_{\gamma(j)}}{(1+q)^{-1}-\delta_{\gamma(j)}}. I.e., γ\gamma orders the rest of the coordinates in descending order of the ratio bj(1+q)1δj\frac{b_{j}}{(1+q)^{-1}-\delta_{j}}.

Then, let kmax{k[p]:j=1k((1+q)1δγ(j))11+q}k^{\star}\coloneqq\max\left\{k\in[p]:\sum_{j=1}^{k}((1+q)^{-1}-\delta_{\gamma(j)})\leq-\frac{1}{1+q}\right\} denote the maximum value of kk such that setting xγ(1),,xγ(k)=1x_{\gamma(1)},\dots,x_{\gamma(k)}=1 yields a feasible solution to the integer LP. Then we have that

Tq(δ,b)maxx1,,xp{0,1}[j=1pbjxj s.t. j=1p(11+qδj)xj11+q]j=1k+1bγ(j).T_{q}(\delta,b)\leq\max_{x_{1},\dots,x_{p}\in\{0,1\}}\left[\sum_{j=1}^{p}b_{j}x_{j}\text{ s.t. }\sum_{j=1}^{p}\left(\frac{1}{1+q}-\delta_{j}\right)x_{j}\leq-\frac{1}{1+q}\right]\leq\sum_{j=1}^{k^{\star}+1}b_{\gamma(j)}.
Remark 6.

γ\gamma is not uniquely specified by the construction above, but the bound holds for any such γ\gamma.

Step 2a shows that for some permutation κ:[p][p]\kappa:[p]\to[p], we can write δ~=κ(δ)\tilde{\delta}=\kappa(\delta) and b~=κ(b)\tilde{b}=\kappa(b). We note that κ\kappa must satisfy the following:

  • κ(1),,κ(nobvious)Sobvious\kappa(1),\dots,\kappa(n_{\mathrm{obvious}})\in S_{\mathrm{obvious}}. This is because the AMLR statistics with the top absolute values are constructed to be the AMLR statistics such that Pnπ(AMLRjπ>0D)=Pnπ(MLRjπ>0D)(1+q)1P_{n}^{\pi}(\mathrm{AMLR}_{j}^{\pi}>0\mid D)=P_{n}^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)\geq(1+q)^{-1} (see Definition 3.3), which exactly coincides with the definition of the set SobviousS_{\mathrm{obvious}}.

  • For nobvious<ijn_{\mathrm{obvious}}<i\leq j, bκ(i)(1+q)1δκ(i)bκ(j)(1+q)1δκ(j)\frac{b_{\kappa(i)}}{(1+q)^{-1}-\delta_{\kappa(i)}}\geq\frac{b_{\kappa(j)}}{(1+q)^{-1}-\delta_{\kappa(j)}}. This again follows from Definition 3.3, as the absolute values of the AMLR statistics are explicitly chosen to guarantee this ordering.

In other words, κ\kappa satisfies the same properties as γ\gamma above. Thus, we may take γ=κ\gamma=\kappa. This yields the bound Tq(δ,b)j=1k+1bγ(j)T_{q}(\delta,b)\leq\sum_{j=1}^{k^{\star}+1}b_{\gamma(j)}. However, we also know that

Tq(δ~,b~)=j=1ψq(δ~)b~j=j=1ψq(δ~)bγ(j)=j=1kbγ(j),\displaystyle T_{q}(\tilde{\delta},\tilde{b})=\sum_{j=1}^{\psi_{q}(\tilde{\delta})}\tilde{b}_{j}=\sum_{j=1}^{\psi_{q}(\tilde{\delta})}b_{\gamma(j)}=\sum_{j=1}^{k^{\star}}b_{\gamma(j)},

where the last step follows because

ψq(δ~)=ψq(κ(δ))\displaystyle\psi_{q}(\tilde{\delta})=\psi_{q}(\kappa(\delta)) max{k[p]:kj=1kδγ(j)+1j=1kδγ(j)q}\displaystyle\coloneqq\max\left\{k\in[p]:\frac{k-\sum_{j=1}^{k}\delta_{\gamma(j)}+1}{\sum_{j=1}^{k}\delta_{\gamma(j)}}\leq q\right\} by definition
=max{k[p]:j=1k((1+q)1δγ(j))11+q}\displaystyle=\max\left\{k\in[p]:\sum_{j=1}^{k}((1+q)^{-1}-\delta_{\gamma(j)})\leq-\frac{1}{1+q}\right\} by algebraic manipulation
k.\displaystyle\coloneqq k^{\star}. by definition

To summarize, this implies that

Tq(δ,b)j=1k+1bγ(j)j=1kbγ(j)+1=Tq(δ~,b~)+1.T_{q}(\delta,b)\leq\sum_{j=1}^{k^{\star}+1}b_{\gamma(j)}\leq\sum_{j=1}^{k^{\star}}b_{\gamma(j)}+1=T_{q}(\tilde{\delta},\tilde{b})+1.

This completes the proof of Step 2, since as a consequence,

𝔼[Tq(δ~,b~)Tq(δ,b)sn]1sn0.\mathbb{E}\left[\frac{T_{q}(\tilde{\delta},\tilde{b})-T_{q}(\delta,b)}{s_{n}}\right]\geq-\frac{1}{s_{n}}\to 0. (B.31)

Step 3: In this step, we show that 𝔼[|Tq(δ,b)Tq(R,B)|]sn0\frac{\mathbb{E}[|T_{q}(\delta,b)-T_{q}(R,B)|]}{s_{n}}\to 0 holds for all but countably many q[0,1]q\in[0,1]. The same logic applies to the term involving Tq(R~,B~)Tq(δ~,b~)T_{q}(\tilde{R},\tilde{B})-T_{q}(\tilde{\delta},\tilde{b}).

To do this, we will essentially bound |Tq(δ,b)Tq(R,B)||T_{q}(\delta,b)-T_{q}(R,B)| by the quantities maxkkn|B¯kb¯k|\max_{k\geq k_{n}}|\bar{B}_{k}-\bar{b}_{k}| (for a sequence knk_{n} chosen below) and |ψq(R)ψq(δ)||\psi_{q}(R)-\psi_{q}(\delta)|; as we shall see, these are both oL1(sn)o_{L_{1}}(s_{n}). To begin, the triangle inequality yields:

|Tq(R,B)Tq(δ,b)|\displaystyle|T_{q}(R,B)-T_{q}(\delta,b)| =|ψq(R)B¯ψq(R)ψq(δ)b¯ψq(δ)|\displaystyle=\left|\psi_{q}(R)\bar{B}_{\psi_{q}(R)}-\psi_{q}(\delta)\bar{b}_{\psi_{q}(\delta)}\right| by definition
=|ψq(R)B¯ψq(R)ψq(R)b¯ψq(δ)+ψq(R)b¯ψq(δ)ψq(δ)b¯ψq(δ)|\displaystyle=\left|\psi_{q}(R)\bar{B}_{\psi_{q}(R)}-\psi_{q}(R)\bar{b}_{\psi_{q}(\delta)}+\psi_{q}(R)\bar{b}_{\psi_{q}(\delta)}-\psi_{q}(\delta)\bar{b}_{\psi_{q}(\delta)}\right|
ψq(R)|B¯ψq(R)b¯ψq(δ)|+b¯ψq(δ)|ψq(R)ψq(δ)|\displaystyle\leq\psi_{q}(R)|\bar{B}_{\psi_{q}(R)}-\bar{b}_{\psi_{q}(\delta)}|+\bar{b}_{\psi_{q}(\delta)}|\psi_{q}(R)-\psi_{q}(\delta)| triangle inequality
ψq(R)|B¯ψq(R)b¯ψq(δ)|Term (a)+|ψq(R)ψq(δ)|Term (b)\displaystyle\leq\underbrace{\psi_{q}(R)|\bar{B}_{\psi_{q}(R)}-\bar{b}_{\psi_{q}(\delta)}|}_{\text{Term (a)}}+\underbrace{|\psi_{q}(R)-\psi_{q}(\delta)|}_{\text{Term (b)}} since b¯ψq(δ)[0,1].\displaystyle\text{ since }\bar{b}_{\psi_{q}(\delta)}\in[0,1]. (B.32)

First, we bound Term (a). Applying the triangle inequality (again) plus a simple algebraic lemma yields:

ψq(R)|B¯ψq(R)b¯ψq(δ)|\displaystyle\psi_{q}(R)|\bar{B}_{\psi_{q}(R)}-\bar{b}_{\psi_{q}(\delta)}| ψq(R)|B¯ψq(R)b¯ψq(R)|+ψq(R)|b¯ψq(R)b¯ψq(δ)|\displaystyle\leq\psi_{q}(R)|\bar{B}_{\psi_{q}(R)}-\bar{b}_{\psi_{q}(R)}|+\psi_{q}(R)|\bar{b}_{\psi_{q}(R)}-\bar{b}_{\psi_{q}(\delta)}|
ψq(R)|B¯ψq(R)b¯ψq(R)|+2|ψq(R)ψq(δ)|\displaystyle\leq\psi_{q}(R)|\bar{B}_{\psi_{q}(R)}-\bar{b}_{\psi_{q}(R)}|+2|\psi_{q}(R)-\psi_{q}(\delta)| by Lemma B.7.\displaystyle\text{ by Lemma \ref{lem::algebra4powerresult}}.

The second line is not obvious, but it follows because Lemma B.7 shows that |b¯ψq(R)b¯ψq(δ)|2|ψq(R)ψq(δ)|max(ψq(R),ψq(δ))|\bar{b}_{\psi_{q}(R)}-\bar{b}_{\psi_{q}(\delta)}|\leq\frac{2|\psi_{q}(R)-\psi_{q}(\delta)|}{\max(\psi_{q}(R),\psi_{q}(\delta))}. This algebraic result follows intuitively because b¯k[0,1]\bar{b}_{k}\in[0,1] holds for all kk; thus, |b¯ψq(R)b¯ψq(δ)||\bar{b}_{\psi_{q}(R)}-\bar{b}_{\psi_{q}(\delta)}| cannot be large unless |ψq(R)ψq(δ)||\psi_{q}(R)-\psi_{q}(\delta)| is also large.

Combining this bound on Term (a) with the initial result in Eq. (B.32) yields:

sn1𝔼[|Tq(R,B)Tq(δ,b)|]3sn1𝔼[|ψq(R)ψq(δ)|]Term (b)+sn1𝔼[ψq(R)|B¯ψq(R)b¯ψq(R)|]Term (c).{s_{n}}^{-1}\mathbb{E}\left[|T_{q}(R,B)-T_{q}(\delta,b)|\right]\leq 3\underbrace{{s_{n}}^{-1}\mathbb{E}\left[|\psi_{q}(R)-\psi_{q}(\delta)|\right]}_{\text{Term (b)}}+\underbrace{{s_{n}}^{-1}\mathbb{E}\left[\psi_{q}(R)|\bar{B}_{\psi_{q}(R)}-\bar{b}_{\psi_{q}(R)}|\right]}_{\text{Term (c)}}.

To show that Term (b) vanishes, recall that Step 2 of the proof of Theorem 3.2 shows that sn1𝔼[|τq(R)τq(δ)|]0s_{n}^{-1}\mathbb{E}[|\tau_{q}(R)-\tau_{q}(\delta)|]\to 0 for all but countably many q[0,1]q\in[0,1]. Moreover, τq(R)=ψq(R)+11+q=11+qψq(R)+O(1)\tau_{q}(R)=\left\lceil\frac{\psi_{q}(R)+1}{1+q}\right\rceil=\frac{1}{1+q}\psi_{q}(R)+O(1) and similarly τq(δ)=11+qψq(δ)+O(1)\tau_{q}(\delta)=\frac{1}{1+q}\psi_{q}(\delta)+O(1). Therefore, sn1𝔼[|ψq(R)ψq(δ)|]0{s_{n}}^{-1}\mathbb{E}\left[|\psi_{q}(R)-\psi_{q}(\delta)|\right]\to 0 for all but countably many q[0,1]q\in[0,1].

Thus, it suffices to show that Term (c) vanishes. To do this, fix a sequence of integers {kn}n=1\{k_{n}\}_{n=1}^{\infty} such that knlog(pn)5k_{n}\sim\log(p_{n})^{5}. Separately considering the cases where ψq(R)<kn\psi_{q}(R)<k_{n} and ψq(R)kn\psi_{q}(R)\geq k_{n} yields

𝔼[sn1ψq(R)|B¯ψq(R)b¯ψq(R)|]\displaystyle\mathbb{E}\left[{s_{n}}^{-1}\psi_{q}(R)|\bar{B}_{\psi_{q}(R)}-\bar{b}_{\psi_{q}(R)}|\right] 𝔼[|B¯ψq(R)b¯ψq(R)|sn1kn]+𝔼[sn1ψq(R)maxkkn|B¯kb¯k|]\displaystyle\leq\mathbb{E}[|\bar{B}_{\psi_{q}(R)}-\bar{b}_{\psi_{q}(R)}|s_{n}^{-1}k_{n}]+\mathbb{E}[s_{n}^{-1}\psi_{q}(R)\max_{k\geq k_{n}}|\bar{B}_{k}-\bar{b}_{k}|]
2sn1kn+𝔼[sn1ψq(R)maxkkn|B¯kb¯k|]Term (d)\displaystyle\leq 2s_{n}^{-1}k_{n}+\underbrace{\mathbb{E}[s_{n}^{-1}\psi_{q}(R)\max_{k\geq k_{n}}|\bar{B}_{k}-\bar{b}_{k}|]}_{\text{Term (d)}}

where the second line follows because B¯ψq(R),b¯ψq(R)[0,1]\bar{B}_{\psi_{q}(R)},\bar{b}_{\psi_{q}(R)}\in[0,1]. Assumption 3.1 guarantees that sn1kn0s_{n}^{-1}k_{n}\to 0, so it suffices to show that Term (d) vanishes. To do this, we observe:

  • Assumption B.2 plus Lemma B.5 shows that BB satisfies an exponential decay condition, so we may apply Lemma C.1. Namely, for knlog(pn)5k_{n}\sim\log(p_{n})^{5}, maxkkn|B¯kb¯k|p0\max_{k\geq k_{n}}|\bar{B}_{k}-\bar{b}_{k}|\stackrel{{\scriptstyle\mathrm{p}}}{{\to}}0 (see Theorem 3.2, Step 2 for more details).

  • Lemma C.2 shows that there exists a universal constant CC such that 𝔼[τq(R)2]C𝔼Pnπ[|1(θ)|2]\mathbb{E}[\tau_{q}(R)^{2}]\leq C\mathbb{E}_{P_{n}^{\pi}}[|\mathcal{H}_{1}(\theta^{\star})|^{2}]. Note that ψq(R)(1+q)τq(R)\psi_{q}(R)\leq(1+q)\tau_{q}(R) and the coefficient of variation of |1(θ)||\mathcal{H}_{1}(\theta^{\star})| is bounded under Assumption B.1, so there exists another universal constant CC^{\prime} such that 𝔼Pnπ[ψq(R)2](1+q)2𝔼Pnπ[τq(R)2]Csn2\mathbb{E}_{P_{n}^{\pi}}[\psi_{q}(R)^{2}]\leq(1+q)^{2}\mathbb{E}_{P_{n}^{\pi}}[\tau_{q}(R)^{2}]\leq C^{\prime}s_{n}^{2}. This implies that sn1ψq(R){s_{n}}^{-1}\psi_{q}(R) is uniformly integrable since it has a uniformly bounded second moment.

  • The latter two observations imply that 𝔼[sn1ψq(R)maxkkn|B¯kb¯k|]0\mathbb{E}[s_{n}^{-1}\psi_{q}(R)\max_{k\geq k_{n}}|\bar{B}_{k}-\bar{b}_{k}|]\to 0, since sn1ψq(R)s_{n}^{-1}\psi_{q}(R) is uniformly integrable, and maxkkn|B¯kb¯k|2\max_{k\geq k_{n}}|\bar{B}_{k}-\bar{b}_{k}|\leq 2 is also uniformly bounded and vanishes in probability.

Together, this proves that for all but countably many q[0,1]q\in[0,1], 𝔼[|Tq(δ,b)Tq(R,B)|]sn0\frac{\mathbb{E}[|T_{q}(\delta,b)-T_{q}(R,B)|]}{s_{n}}\to 0. ∎

The following algebraic lemma is used at the end of Step 3 of the proof of Theorem B.6.

Lemma B.7.

For b=(b1,,bp)[0,1]pb=(b_{1},\dots,b_{p})\in[0,1]^{p}, fix k,[p]k,\ell\in[p]. Then

max(k,)|b¯kb¯|2|k|.\max(k,\ell)\cdot|\bar{b}_{k}-\bar{b}_{\ell}|\leq 2|k-\ell|.
Proof.

Define m=min(k,)m=\min(k,\ell) and M=max(k,)M=\max(k,\ell). The lemma holds trivially when m=Mm=M, so we may assume m<Mm<M. Then we have that

|b¯kb¯|\displaystyle\left|\bar{b}_{k}-\bar{b}_{\ell}\right| |1ki=1kbi1i=1bi|\displaystyle\coloneqq\left|\frac{1}{k}\sum_{i=1}^{k}b_{i}-\frac{1}{\ell}\sum_{i=1}^{\ell}b_{i}\right|
=|i=1mbi(1m1M)1Mi=m+1Mbi|\displaystyle=\left|\sum_{i=1}^{m}b_{i}\left(\frac{1}{m}-\frac{1}{M}\right)-\frac{1}{M}\sum_{i=m+1}^{M}b_{i}\right| by definition of m,M\displaystyle\text{ by definition of }m,M
i=1mbi(1m1M)+1Mi=m+1Mbi\displaystyle\leq\sum_{i=1}^{m}b_{i}\left(\frac{1}{m}-\frac{1}{M}\right)+\frac{1}{M}\sum_{i=m+1}^{M}b_{i} since bi0,m<M\displaystyle\text{ since }b_{i}\geq 0,m<M
m(1m1M)+1M(Mm)\displaystyle\leq m\left(\frac{1}{m}-\frac{1}{M}\right)+\frac{1}{M}\left(M-m\right) since bi1.\displaystyle\text{ since }b_{i}\leq 1.

Therefore we conclude that

M|b¯kb¯|Mm(1m1M)+MM(Mm)=2(Mm)=2|k|M|\bar{b}_{k}-\bar{b}_{\ell}|\leq Mm\left(\frac{1}{m}-\frac{1}{M}\right)+\frac{M}{M}(M-m)=2(M-m)=2|k-\ell|

which completes the proof. ∎
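
As a quick numerical sanity check of Lemma B.7 (purely illustrative, not part of the argument), one can verify the inequality on random inputs in Python:

import numpy as np

rng = np.random.default_rng(1)
for _ in range(10_000):
    p = int(rng.integers(2, 50))
    b = rng.uniform(0, 1, size=p)
    k, l = rng.integers(1, p + 1, size=2)
    bbar = lambda m: b[:m].mean()          # running average of the first m entries of b
    assert max(k, l) * abs(bbar(k) - bbar(l)) <= 2 * abs(k - l) + 1e-12
print("Lemma B.7 inequality held on all random draws.")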

B.8 Verifying the local dependence assumption in a simple setting

We now verify the local dependency condition (3.8) in the setting where 𝐗T𝐗\mathbf{X}^{T}\mathbf{X} is block-diagonal and σ2\sigma^{2} is known.

Proposition B.2.

Suppose 𝐗T𝐗\mathbf{X}^{T}\mathbf{X} is a block-diagonal matrix with maximum block size MM\in\mathbb{N}. Suppose PπP^{\pi} is any Bayesian model such that (i) the model class 𝒫\mathcal{P} is the class of all Gaussian models of the form 𝐘𝐗𝒩(𝐗β,σ2In)\mathbf{Y}\mid\mathbf{X}\sim\mathcal{N}(\mathbf{X}\beta,\sigma^{2}I_{n}) for θ=(β,σ2)Θp×0\theta=(\beta,\sigma^{2})\in\Theta\coloneqq\mathbb{R}^{p}\times\mathbb{R}_{\geq 0}, (ii) the coordinates of β\beta are marginally independent under PπP^{\pi} and (iii) σ2\sigma^{2} is a constant under PπP^{\pi}. Then if 𝐗~\widetilde{\mathbf{X}} are either fixed-X knockoffs or conditional Gaussian model-X knockoffs (Huang and Janson,, 2020), the coordinates of sign(MLRπ)\operatorname*{sign}(\mathrm{MLR}^{\pi}) are MM-dependent conditional on DD under PπP^{\pi}, implying that Equation (3.8) holds, e.g., with C=2MC=2^{M} and ρ=12\rho=\frac{1}{2}.

Proof.

Define R𝕀(MLRπ>0)R\coloneqq\mathbb{I}(\mathrm{MLR}^{\pi}>0). We will prove the stronger result that if J1,,Jm[p]J_{1},\dots,J_{m}\subset[p] are a partition of [p][p] corresponding to the blocks of 𝐗T𝐗\mathbf{X}^{T}\mathbf{X}, then RJ1,,RJmR_{J_{1}},\dots,R_{J_{m}} are jointly independent conditional on DD. As notation, suppose without loss of generality that J1,,JmJ_{1},\dots,J_{m} are contiguous subsets and 𝐗T𝐗=diag{Σ1,,Σm}\mathbf{X}^{T}\mathbf{X}=\mbox{$\mathrm{diag}\left\{\Sigma_{1},\dots,\Sigma_{m}\right\}$} for Σi|Ji|×|Ji|\Sigma_{i}\in\mathbb{R}^{|J_{i}|\times|J_{i}|}. All probabilities and expectations are taken under PπP^{\pi}.

We give the proof for model-X knockoffs; the proof for fixed-X knockoffs is quite similar. Recall by Proposition 3.1 that we can write Rj=𝕀(Wj>0)=𝕀(𝐗j=𝐗^j)R_{j}=\mathbb{I}(W_{j}>0)=\mathbb{I}(\mathbf{X}_{j}=\widehat{\mathbf{X}}_{j}) where 𝐗^j\widehat{\mathbf{X}}_{j} is a function of the masked data DD. Therefore, to show RJ1,,RJmR_{J_{1}},\dots,R_{J_{m}} are independent conditional on DD, it suffices to show 𝐗J1,,𝐗Jm\mathbf{X}_{J_{1}},\dots,\mathbf{X}_{J_{m}} are conditionally independent given DD. To do this, it will first be useful to note that the likelihood is

P𝐘𝐗(β,σ)(𝐘𝐗)\displaystyle P_{\mathbf{Y}\mid\mathbf{X}}^{(\beta,\sigma)}(\mathbf{Y}\mid\mathbf{X}) exp(12σ2𝐘𝐗β22)\displaystyle\propto\exp\left(-\frac{1}{2\sigma^{2}}||\mathbf{Y}-\mathbf{X}\beta||_{2}^{2}\right)
exp(2βT𝐗T𝐘βT𝐗T𝐗β2σ2)\displaystyle\propto\exp\left(\frac{2\beta^{T}\mathbf{X}^{T}\mathbf{Y}-\beta^{T}\mathbf{X}^{T}\mathbf{X}\beta}{2\sigma^{2}}\right)
i=1mexp(2βJiT𝐗JiT𝐘βJiTΣiβJi2σ2),\displaystyle\propto\prod_{i=1}^{m}\exp\left(\frac{2\beta_{J_{i}}^{T}\mathbf{X}_{J_{i}}^{T}\mathbf{Y}-\beta_{J_{i}}^{T}\Sigma_{i}\beta_{J_{i}}}{2\sigma^{2}}\right),

where above, we only include terms depending on 𝐗\mathbf{X}, since these are the only terms relevant to the later stages of the proof. A subtle but important observation in the calculation above is that we can verify that 𝐗T𝐗=diag{Σ1,,Σm}\mathbf{X}^{T}\mathbf{X}=\mbox{$\mathrm{diag}\left\{\Sigma_{1},\dots,\Sigma_{m}\right\}$} using only the masked data DD, without observing 𝐗\mathbf{X} itself. Indeed, this follows because for conditional Gaussian MX knockoffs, 𝐗~T𝐗~=𝐗T𝐗\widetilde{\mathbf{X}}^{T}\widetilde{\mathbf{X}}=\mathbf{X}^{T}\mathbf{X} and 𝐗~T𝐗\widetilde{\mathbf{X}}^{T}\mathbf{X} only differs from 𝐗T𝐗\mathbf{X}^{T}\mathbf{X} on the main diagonal (just like in the fixed-X case). With this observation in mind, we now abuse notation slightly and let p()p(\cdot\mid\cdot) denote an arbitrary conditional density under PπP^{\pi}. Observe that

p(𝐗D)\displaystyle p(\mathbf{X}\mid D) p(𝐗,𝐘{𝐗j,𝐗~j}j=1p)\displaystyle\propto p(\mathbf{X},\mathbf{Y}\mid\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}_{j=1}^{p})
=p(𝐗{𝐗j,𝐗~j}j=1p)p(𝐘𝐗) since 𝐘𝐗~𝐗\displaystyle=p(\mathbf{X}\mid\{\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j}\}_{j=1}^{p})p(\mathbf{Y}\mid\mathbf{X})\,\,\,\,\,\,\,\,\,\,\text{ since }\mathbf{Y}\perp\!\!\!\perp\widetilde{\mathbf{X}}\mid\mathbf{X}
=12pp(𝐘𝐗) by pairwise exchangeability\displaystyle=\frac{1}{2^{p}}p(\mathbf{Y}\mid\mathbf{X})\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\text{ by pairwise exchangeability}
βp(β)p(𝐘𝐗,β)𝑑β\displaystyle\propto\int_{\beta}p(\beta)p(\mathbf{Y}\mid\mathbf{X},\beta)d\beta
βi=1mp(βJi)exp(βJiTΣiβJi2σ2)exp(βJiT𝐗JiT𝐘σ2)dβ\displaystyle\propto\int_{\beta}\prod_{i=1}^{m}p(\beta_{J_{i}})\exp\left(\frac{-\beta_{J_{i}}^{T}\Sigma_{i}\beta_{J_{i}}}{2\sigma^{2}}\right)\exp\left(\frac{\beta_{J_{i}}^{T}\mathbf{X}_{J_{i}}^{T}\mathbf{Y}}{\sigma^{2}}\right)d\beta
βJ1,βJmi=1mp(βJi)exp(βJiTΣiβJi2σ2)exp(βJiT𝐗JiT𝐘σ2)dβJ1dβJ2dβJm.\displaystyle\propto\int_{\beta_{J_{1}}}\dots,\int_{\beta_{J_{m}}}\prod_{i=1}^{m}p(\beta_{J_{i}})\exp\left(\frac{-\beta_{J_{i}}^{T}\Sigma_{i}\beta_{J_{i}}}{2\sigma^{2}}\right)\exp\left(\frac{\beta_{J_{i}}^{T}\mathbf{X}_{J_{i}}^{T}\mathbf{Y}}{\sigma^{2}}\right)d\beta_{J_{1}}\,d\beta_{J_{2}}\dots\,d\beta_{J_{m}}.

At this point, we can iteratively pull out parts of the product. In particular, define the following function:

qi(𝐗Ji)βJip(βJi)exp(βJiTΣiβJi2σ2)exp(βJiT𝐗JiT𝐘σ2)𝑑βJi.q_{i}(\mathbf{X}_{J_{i}})\coloneqq\int_{\beta_{J_{i}}}p(\beta_{J_{i}})\exp\left(\frac{-\beta_{J_{i}}^{T}\Sigma_{i}\beta_{J_{i}}}{2\sigma^{2}}\right)\exp\left(\frac{\beta_{J_{i}}^{T}\mathbf{X}_{J_{i}}^{T}\mathbf{Y}}{\sigma^{2}}\right)d\beta_{J_{i}}.

Since 𝐘,σ2\mathbf{Y},\sigma^{2} and Σi\Sigma_{i} are fixed, qi(𝐗Ji)q_{i}(\mathbf{X}_{J_{i}}) is a deterministic function of 𝐗Ji\mathbf{X}_{J_{i}} that does not depend on βJi\beta_{-J_{i}}. Therefore, we can iteratively integrate as below:

p(𝐗D)\displaystyle p(\mathbf{X}\mid D) βJ1,βJmi=1mp(βJi)exp(βJiTΣiβJi2σ2)exp(βJiT𝐗JiT𝐘σ2)dβJ1dβJ2dβJm\displaystyle\propto\int_{\beta_{J_{1}}}\dots,\int_{\beta_{J_{m}}}\prod_{i=1}^{m}p(\beta_{J_{i}})\exp\left(\frac{-\beta_{J_{i}}^{T}\Sigma_{i}\beta_{J_{i}}}{2\sigma^{2}}\right)\exp\left(\frac{\beta_{J_{i}}^{T}\mathbf{X}_{J_{i}}^{T}\mathbf{Y}}{\sigma^{2}}\right)d\beta_{J_{1}}\,d\beta_{J_{2}}\dots\,d\beta_{J_{m}}
=βJ1βJm1i=1m1p(βJi)exp(βJiTΣiβJi2σ2)exp(βJiT𝐗JiT𝐘σ2)qm(𝐗jm)dβJ1dβJ2dβJm1\displaystyle=\int_{\beta_{J_{1}}}\dots\int_{\beta_{J_{m-1}}}\prod_{i=1}^{m-1}p(\beta_{J_{i}})\exp\left(\frac{-\beta_{J_{i}}^{T}\Sigma_{i}\beta_{J_{i}}}{2\sigma^{2}}\right)\exp\left(\frac{\beta_{J_{i}}^{T}\mathbf{X}_{J_{i}}^{T}\mathbf{Y}}{\sigma^{2}}\right)q_{m}(\mathbf{X}_{j_{m}})d\beta_{J_{1}}\,d\beta_{J_{2}}\dots\,d\beta_{J_{m-1}}
=i=1mqi(𝐗Ji).\displaystyle=\prod_{i=1}^{m}q_{i}(\mathbf{X}_{J_{i}}).

This shows that 𝐗J1,,𝐗JmD\mathbf{X}_{J_{1}},\dots,\mathbf{X}_{J_{m}}\mid D are jointly (conditionally) independent since their density factors, thus completing the proof. For fixed-X knockoffs, the argument is analogous: one can show that the density of 𝐗T𝐘D\mathbf{X}^{T}\mathbf{Y}\mid D factors into blocks. ∎

B.9 Intuition for the local dependency condition and Figure 6

In Figure 6, we see that even when 𝐗\mathbf{X} is very highly correlated, CovPπ(sign(MLRπ)D)\operatorname{Cov}_{P^{\pi}}(\operatorname*{sign}(\mathrm{MLR}^{\pi})\mid D) is close to a diagonal matrix, indicating that the local dependency condition (3.8) holds well empirically. This result is striking and may be surprising at first; this section offers some intuition for why it holds.

For the sake of intuition, suppose that we are using model-X knockoffs and the Bayesian model PπP^{\pi} from Example 1 with the original features. Suppose we observe that the masked data DD is equal to some fixed value d=(𝐲,{𝐱j,𝐱~j}j=1p)d=(\mathbf{y},\{\mathbf{x}_{j},\widetilde{\mathbf{x}}_{j}\}_{j=1}^{p}). After observing D=dD=d, Appendix E shows how to sample from the posterior distribution 𝐗D=d\mathbf{X}\mid D=d via the following Gibbs sampling approach (a schematic code sketch follows the list below):

  • For each j[p]j\in[p], initialize βj(0)\beta_{j}^{(0)} and 𝐗j(0)\mathbf{X}_{j}^{(0)} to some value.

  • For i=1,,nsamplei=1,\dots,n_{\mathrm{sample}}:

    1. 1.

      Set β(i)=β(i1)\beta^{(i)}=\beta^{(i-1)} and 𝐗(i)=𝐗(i1)\mathbf{X}^{(i)}=\mathbf{X}^{(i-1)}.

    2. 2.

      For j[p]j\in[p]:

      1. (a)

        Resample 𝐗j(i)\mathbf{X}_{j}^{(i)} from the law of 𝐗j𝐗j=𝐗j(i),βj=βj(i),D=d\mathbf{X}_{j}\mid\mathbf{X}_{-j}=\mathbf{X}_{-j}^{(i)},\beta_{-j}=\beta_{-j}^{(i)},D=d. It may be helpful to recall 𝐗j(i){𝐱j,𝐱~j}\mathbf{X}_{j}^{(i)}\in\{\mathbf{x}_{j},\widetilde{\mathbf{x}}_{j}\} holds deterministically.

      2. (b)

        Resample βj(i)\beta_{j}^{(i)} from the law of βj𝐗=𝐗(i),βj=βj(i),D=d\beta_{j}\mid\mathbf{X}=\mathbf{X}^{(i)},\beta_{-j}=\beta_{-j}^{(i)},D=d.

  • Return samples 𝐗(1),,𝐗(nsample)\mathbf{X}^{(1)},\dots,\mathbf{X}^{(n_{\mathrm{sample}})}.
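
To make the structure of this sampler concrete, the Python sketch below implements the loop above in a simplified setting: a Gaussian linear model in which each coefficient has an independent Gaussian prior with known prior variance and known noise variance (rather than the spike-and-slab prior of Appendix E.2), with no burn-in or convergence diagnostics. The conjugate updates are standard Gaussian calculations written out only for illustration, and all function and variable names are ours.

import numpy as np

def gibbs_sketch(y, x, xk, sigma2=1.0, tau2=1.0, n_sample=500, seed=0):
    # Schematic Gibbs sampler for X | D = d under a plain Gaussian prior beta_j ~ N(0, tau2)
    # with known sigma2 (a simplification of the spike-and-slab prior in Appendix E.2).
    # x[:, j] and xk[:, j] store the observed pair {x_j, x_tilde_j}; the sampler infers
    # which of the two is the real feature.
    rng = np.random.default_rng(seed)
    n, p = x.shape
    X = x.copy()                                  # current guess for the true design matrix
    beta = np.zeros(p)
    counts = np.zeros(p)                          # iterations in which X_j^{(i)} = x_j
    for _ in range(n_sample):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]  # residual excluding feature j
            def score(v):
                # log N(r; 0, sigma2*I + tau2*v v^T), dropping terms common to both candidates
                return (tau2 * (v @ r) ** 2 / (2 * sigma2 * (sigma2 + tau2 * (v @ v)))
                        - 0.5 * np.log1p(tau2 * (v @ v) / sigma2))
            # Step 2(a): resample X_j in {x_j, x_tilde_j} with beta_j integrated out;
            # the prior over the two candidates is uniform by pairwise exchangeability.
            d = score(x[:, j]) - score(xk[:, j])
            p_real = np.exp(d - np.logaddexp(0.0, d))       # sigmoid(d), computed stably
            take_real = rng.random() < p_real
            X[:, j] = x[:, j] if take_real else xk[:, j]
            counts[j] += take_real
            # Step 2(b): resample beta_j given X, beta_{-j}, D (conjugate Gaussian update).
            s2 = 1.0 / ((X[:, j] @ X[:, j]) / sigma2 + 1.0 / tau2)
            beta[j] = rng.normal(s2 * (X[:, j] @ r) / sigma2, np.sqrt(s2))
    return counts / n_sample                      # Monte Carlo estimates of P(X_j = x_j | D)

# Hypothetical toy call; the "knockoff" columns are just independent noise for illustration.
rng = np.random.default_rng(1)
n, p = 200, 5
x_obs, x_knock = rng.normal(size=(n, p)), rng.normal(size=(n, p))
y_obs = x_obs @ np.array([2.0, 0.0, -1.5, 0.0, 1.0]) + rng.normal(size=n)
print(gibbs_sketch(y_obs, x_obs, x_knock, n_sample=200).round(2))

In the notation above, the returned estimate of Pπ(Xj=xj∣D) exceeding 1/2 corresponds to MLRjπ>0.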

Now, recall that MLRjπ>0\mathrm{MLR}_{j}^{\pi}>0 if and only if Pπ(𝐗j=𝐱jD)>1/2P^{\pi}(\mathbf{X}_{j}=\mathbf{x}_{j}\mid D)>1/2, i.e., if and only if the Gibbs sampler chooses 𝐗j(i)=𝐱j\mathbf{X}_{j}^{(i)}=\mathbf{x}_{j} rather than 𝐗j(i)=𝐱~j\mathbf{X}_{j}^{(i)}=\widetilde{\mathbf{x}}_{j} in the majority of iterations (in the limit). Thus, to analyze CovPπ(𝕀(MLRjπ>0),𝕀(MLRkπ>0)D)\operatorname{Cov}_{P^{\pi}}(\mathbb{I}(\mathrm{MLR}_{j}^{\pi}>0),\mathbb{I}(\mathrm{MLR}_{k}^{\pi}>0)\mid D) for some fixed jkj\neq k, we must ask the following question: does the value of 𝐗k(i){𝐱k,𝐱~k}\mathbf{X}_{k}^{(i)}\in\{\mathbf{x}_{k},\widetilde{\mathbf{x}}_{k}\} strongly affect how we resample 𝐗j(i)\mathbf{X}_{j}^{(i)}?

To answer this question, the following key fact (reviewed in Appendix E) is useful. At iteration ii, step 2(a), for any jj, let rj=𝐲𝐗j(i)βj(i)r_{j}=\mathbf{y}-\mathbf{X}_{-j}^{(i)}\beta_{-j}^{(i)} be the residuals excluding feature jj for the current value of 𝐗j\mathbf{X}_{-j} and βj\beta_{-j} in the Gibbs sampler. Then, standard calculations show that Pπ(𝐗j=𝐱jD=d,𝐗j=𝐗j(i),βj=βj(i))P^{\pi}(\mathbf{X}_{j}=\mathbf{x}_{j}\mid D=d,\mathbf{X}_{-j}=\mathbf{X}_{-j}^{(i)},\beta_{-j}=\beta_{-j}^{(i)}) only depends on 𝐗j(i)\mathbf{X}_{-j}^{(i)} and βj(i)\beta_{-j}^{(i)} through the following inner products:

αj𝐱jTrj=𝐱jT𝐲j𝐱jT𝐗(i)β(i)\alpha_{j}\coloneqq\mathbf{x}_{j}^{T}r_{j}=\mathbf{x}_{j}^{T}\mathbf{y}-\sum_{\ell\neq j}\mathbf{x}_{j}^{T}\mathbf{X}_{\ell}^{(i)}\beta_{\ell}^{(i)}
α~j𝐱~jTrj=𝐱~jT𝐲j𝐱~jT𝐗(i)β(i).\tilde{\alpha}_{j}\coloneqq\widetilde{\mathbf{x}}_{j}^{T}r_{j}=\widetilde{\mathbf{x}}_{j}^{T}\mathbf{y}-\sum_{\ell\neq j}\widetilde{\mathbf{x}}_{j}^{T}\mathbf{X}_{\ell}^{(i)}\beta_{\ell}^{(i)}.

Thus, the question we must answer is: how does the choice of 𝐗k(i){𝐱k,𝐱~k}\mathbf{X}_{k}^{(i)}\in\{\mathbf{x}_{k},\widetilde{\mathbf{x}}_{k}\} affect the value of αj,α~j\alpha_{j},\tilde{\alpha}_{j}? Heuristically, the answer is “not very much,” since 𝐗k(i)\mathbf{X}_{k}^{(i)} only appears above through inner products of the form 𝐱jT𝐗k(i)\mathbf{x}_{j}^{T}\mathbf{X}_{k}^{(i)} and 𝐱~jT𝐗k(i)\widetilde{\mathbf{x}}_{j}^{T}\mathbf{X}_{k}^{(i)}, and by definition of the knockoffs we know that 𝐱jT𝐱k𝐱jT𝐱~k\mathbf{x}_{j}^{T}\mathbf{x}_{k}\approx\mathbf{x}_{j}^{T}\widetilde{\mathbf{x}}_{k} and 𝐱~jT𝐱k𝐱~jT𝐱~k\widetilde{\mathbf{x}}_{j}^{T}\mathbf{x}_{k}\approx\widetilde{\mathbf{x}}_{j}^{T}\widetilde{\mathbf{x}}_{k}. Indeed, for fixed-X knockoffs, we know that this actually holds exactly, and for model-X knockoffs, the law of large numbers should ensure that these approximations are very accurate.

The main way that the choice of 𝐗k(i)\mathbf{X}_{k}^{(i)} can significantly influence the choice of 𝐗j(i)\mathbf{X}_{j}^{(i)} is that the choice of 𝐗k(i)\mathbf{X}_{k}^{(i)} may change the value of βk(i)\beta_{k}^{(i)}. In general, we expect this effect to be rather small, since in many highly-correlated settings, 𝐱k\mathbf{x}_{k} and 𝐱~k\widetilde{\mathbf{x}}_{k} are necessarily highly correlated and thus the choice of 𝐗k(i){𝐱k,𝐱~k}\mathbf{X}_{k}^{(i)}\in\{\mathbf{x}_{k},\widetilde{\mathbf{x}}_{k}\} should not affect the choice of βk(i)\beta_{k}^{(i)} too much. That said, there are a few known pathological settings where the choice of 𝐗k(i){𝐱k,𝐱~k}\mathbf{X}_{k}^{(i)}\in\{\mathbf{x}_{k},\widetilde{\mathbf{x}}_{k}\} does substantially change the estimated value of βk\beta_{k} (Chen et al., (2019); Spector and Janson, (2022)), and in these settings, the coordinates of sign(MLRπ)\operatorname*{sign}(\mathrm{MLR}^{\pi}) may be strongly conditionally dependent. The good news is that using MVR knockoffs instead of SDP knockoffs should ameliorate this problem (see Spector and Janson, (2022)).

Overall, we recognize that this explanation is purely heuristic and does not fully explain the results in Figure 6, but it may provide some intuitive insight. A more rigorous theoretical analysis of CovPπ(𝕀(MLRπ>0)D)\operatorname{Cov}_{P^{\pi}}(\mathbb{I}(\mathrm{MLR}^{\pi}>0)\mid D) would be interesting; we leave this to future work.

Appendix C Technical proofs

C.1 Key concentration results

The proof of Theorem 3.2 relies on the fact that the successive averages of the vector 𝕀(sorted(W)>0)p\mathbb{I}(\mathrm{sorted}(W)>0)\in\mathbb{R}^{p} converge uniformly to their conditional expectation given the masked data D(n)D^{(n)}. In this section, we give a brief proof of this result, which is essentially an application of Theorem 1 from Doukhan and Neumann, (2007). For convenience, we first restate a special case of this theorem (namely, the case where the random variables in question are bounded and we have bounds on pairwise correlations) before proving the corollary we use in Theorem 3.2.

Theorem C.1 (Doukhan and Neumann, (2007)).

Suppose that X1,,XnX_{1},\dots,X_{n} are mean-zero random variables taking values in [1,1][-1,1] such that Var(i=1nXi)C0n\operatorname{Var}\left(\sum_{i=1}^{n}X_{i}\right)\leq C_{0}n for a constant C0>0C_{0}>0. Let L1,L2<L_{1},L_{2}<\infty be constants such that for any iji\leq j,

|Cov(Xi,Xj)|4φ(ji)|\operatorname{Cov}(X_{i},X_{j})|\leq 4\varphi(j-i)

where {φ(k)}k\{\varphi(k)\}_{k\in\mathbb{N}} is a nonincreasing sequence satisfying

s=0(s+1)kφ(s)L1L2kk! for all k0.\sum_{s=0}^{\infty}(s+1)^{k}\varphi(s)\leq L_{1}L_{2}^{k}k!\text{ for all }k\geq 0.

Then for all t(0,1)t\in(0,1), there exists a universal constant C1>0C_{1}>0 only depending on C0,L1C_{0},L_{1} and L2L_{2} such that

(X¯nt)exp(t2n2C0n+C1t7/4n7/4)exp(Ct2n1/4),\mathbb{P}\left(\bar{X}_{n}\geq t\right)\leq\exp\left(-\frac{t^{2}n^{2}}{C_{0}n+C_{1}t^{7/4}n^{7/4}}\right)\leq\exp\left(-C^{\prime}t^{2}n^{1/4}\right),

where CC^{\prime} is a universal constant only depending on C0,L1,L2C_{0},L_{1},L_{2}.

If we take φ(s)=cρs\varphi(s)=c\rho^{s}, this yields the following corollary.

Corollary C.1.

Suppose that X1,,XnX_{1},\dots,X_{n} are mean-zero random variables taking values in [1,1][-1,1]. Suppose that for some C0,ρ(0,1)C\geq 0,\rho\in(0,1), the sequence satisfies

|Cov(Xi,Xj)|Cρ|ij|.|\operatorname{Cov}(X_{i},X_{j})|\leq C\rho^{|i-j|}. (C.1)

Then there exists a universal constant CC^{\prime} depending only on CC and ρ\rho such that

(X¯nt)exp(Ct2n1/4).\mathbb{P}(\bar{X}_{n}\geq t)\leq\exp\left(-C^{\prime}t^{2}n^{1/4}\right). (C.2)

Furthermore, let π:[n][n]\pi:[n]\to[n] be any permutation. For knk\leq n, define X¯k(π)1ki=1kXπ(i)\bar{X}_{k}^{(\pi)}\coloneqq\frac{1}{k}\sum_{i=1}^{k}X_{\pi(i)} to be the sample mean of the first kk random variables after permuting (X1,,Xn)(X_{1},\dots,X_{n}) according to π\pi. Then for any n0,t0n_{0}\in\mathbb{N},t\geq 0,

supπSn(maxn0kn|X¯k(π)|t)nexp(Ct2n01/4).\sup_{\pi\in S_{n}}\mathbb{P}\left(\max_{n_{0}\leq k\leq n}|\bar{X}_{k}^{(\pi)}|\geq t\right)\leq n\exp(-C^{\prime}t^{2}n_{0}^{1/4}). (C.3)

where SnS_{n} is the symmetric group.

Proof.

The proof of Equation (C.2) follows an observation of Doukhan and Neumann, (2007), where we note φ(s)=Cexp(as)\varphi(s)=C\exp(-as) for a=log(ρ)a=-\log(\rho). Then

s=0(s+1)kexp(as)s=0i=1k(s+i)exp(as)=dkdpk(11p)|p=exp(a)=k!1(1exp(a))k+1.\sum_{s=0}^{\infty}(s+1)^{k}\exp(-as)\leq\sum_{s=0}^{\infty}\prod_{i=1}^{k}(s+i)\exp(-as)=\frac{d^{k}}{dp^{k}}\left(\frac{1}{1-p}\right)\bigg{|}_{p=\exp(-a)}=k!\frac{1}{(1-\exp(-a))^{k+1}}.

As a result, s=0(s+1)kφ(s)C1exp(a)(11exp(a))kk!\sum_{s=0}^{\infty}(s+1)^{k}\varphi(s)\leq\frac{C}{1-\exp(-a)}\left(\frac{1}{1-\exp(-a)}\right)^{k}k!, so we take L1=C1exp(a)L_{1}=\frac{C}{1-\exp(-a)} and L2=11exp(a)L_{2}=\frac{1}{1-\exp(-a)}. Lastly, we observe that another geometric series argument yields

Var(i=1nXi)=i=1nj=1nCov(Xi,Xj)i=1nCj=1nρ|ij|nC21ρ.\operatorname{Var}\left(\sum_{i=1}^{n}X_{i}\right)=\sum_{i=1}^{n}\sum_{j=1}^{n}\operatorname{Cov}(X_{i},X_{j})\leq\sum_{i=1}^{n}C\sum_{j=1}^{n}\rho^{|i-j|}\leq nC\frac{2}{1-\rho}.

Thus, we take C0=2C1ρC_{0}=\frac{2C}{1-\rho} and apply Theorem C.1, which yields the first result. To prove Equation (C.3), the main idea is that we can apply Equation (C.2) to each sample mean |X¯k(π)||\bar{X}_{k}^{(\pi)}|, at which point Equation (C.3) follows from a union bound.

To prove this, note that if we rearrange (Xπ(1),,Xπ(k))(X_{\pi(1)},\dots,X_{\pi(k)}) into their “original order,” then these variables satisfy the condition in Equation (C.1). Formally, let A={π(1),,π(k)}A=\{\pi(1),\dots,\pi(k)\} and let ν:AA\nu:A\to A be the permutation such that ν(π(i))>ν(π(j))\nu(\pi(i))>\nu(\pi(j)) if and only if i>ji>j, for i,j[k]i,j\in[k]. Then define Yi=Xν(π(i))Y_{i}=X_{\nu(\pi(i))} for i[k]i\in[k], and note that

|Cov(Yi,Yj)|=|Cov(Xν(π(i)),Xν(π(j)))|Cρ|ν(π(i))ν(π(j))|Cρ|ij|,|\operatorname{Cov}(Y_{i},Y_{j})|=|\operatorname{Cov}(X_{\nu(\pi(i))},X_{\nu(\pi(j))})|\leq C\rho^{|\nu(\pi(i))-\nu(\pi(j))|}\leq C\rho^{|i-j|},

where in the last step, |ij||ν(π(i))ν(π(j))||i-j|\leq|\nu(\pi(i))-\nu(\pi(j))| follows by construction of ν\nu. Since Y¯k=X¯k(π)\bar{Y}_{k}=\bar{X}_{k}^{(\pi)} by construction, this means we may apply Equation (C.2) to X¯k(π)\bar{X}_{k}^{(\pi)} for each kk.

Thus, by Equation (C.2), for any πSn\pi\in S_{n},

(maxn0kn|X¯k(π)|t)k=n0n(|X¯k(π)|t)k=n0nexp(Ct2k1/4)nexp(Ct2n01/4).\mathbb{P}\left(\max_{n_{0}\leq k\leq n}|\bar{X}_{k}^{(\pi)}|\geq t\right)\leq\sum_{k=n_{0}}^{n}\mathbb{P}(|\bar{X}_{k}^{(\pi)}|\geq t)\leq\sum_{k=n_{0}}^{n}\exp(-C^{\prime}t^{2}k^{1/4})\leq n\exp(-C^{\prime}t^{2}n_{0}^{1/4}).

This completes the proof. ∎
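
As an illustration of Equation (C.3) (not used anywhere in the arguments), the Python sketch below simulates mean-zero, bounded random variables whose covariances decay geometrically, namely the signs of a Gaussian AR(1) process, and monitors the maximal partial mean after an arbitrary permutation. The parameter values are arbitrary.

import numpy as np

# Signs of a Gaussian AR(1) process: mean zero, values in {-1, +1}, and covariances
# decaying geometrically in |i - j|, as required by Equation (C.1).
rng = np.random.default_rng(0)
n, n0, phi = 20_000, 500, 0.9
z = np.empty(n)
z[0] = rng.normal()
for i in range(1, n):
    z[i] = phi * z[i - 1] + np.sqrt(1 - phi ** 2) * rng.normal()
x = np.sign(z)
perm = rng.permutation(n)                                 # an arbitrary permutation pi
partial_means = np.cumsum(x[perm]) / np.arange(1, n + 1)  # Xbar_k^{(pi)} for k = 1, ..., n
print(np.abs(partial_means[n0 - 1:]).max())               # max_{n0 <= k <= n} |Xbar_k^{(pi)}|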

C.2 Bounds on the expected number of false discoveries

The proof of Theorem 3.2 relied on the fact that limnΓq(wn)\lim_{n\to\infty}\Gamma_{q}(w_{n}) is finite whenever it exists. This is a consequence of the lemma below, which also proves a second moment bound needed for the uniform integrability argument in Step 3 of the proof of Theorem B.6.

Lemma C.2.

Fix any q(0,1)q\in(0,1). Then there exist universal constants C(q),C(2)(q)C(q),C^{(2)}(q)\in\mathbb{R} such that for any Bayesian model PπP^{\pi} and any valid knockoff statistic W=w([𝐗,𝐗~],𝐘)W=w([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y}) with discovery set S[p]S\subset[p]:

  1. 1.

    Γq(w)C(q)\Gamma_{q}(w)\leq C(q) where C(q)C(q) is a finite constant depending only on qq.

  2. 2.

    𝔼Pπ[|S|2]C(2)(q)𝔼[|1(θ)|2]\mathbb{E}_{P^{\pi}}[|S|^{2}]\leq C^{(2)}(q)\mathbb{E}[|\mathcal{H}_{1}(\theta^{\star})|^{2}].

Note that above, 1(θ)\mathcal{H}_{1}(\theta^{\star}) denotes the random set of non-nulls under PπP^{\pi}.

Proof.

Recall that PπP^{\pi} denotes the joint law of (𝐗,𝐘,θ)(\mathbf{X},\mathbf{Y},\theta^{\star}). Throughout the proof, all expectations and probabilities are taken over PπP^{\pi}. Our strategy is to condition on the nuisance parameters θ\theta^{\star}. In particular, let M(θ)=|1(θ)|M(\theta^{\star})=|\mathcal{H}_{1}(\theta^{\star})| denote the number of non-nulls. To show the first result, it suffices to show

𝔼[|S|θ]C(q)M(θ).\mathbb{E}\left[|S|\mid\theta^{\star}\right]\leq C(q)M(\theta^{\star}). (C.4)

Proving Equation (C.4) proves the first result because it implies by the tower property that 𝔼[|S|]C(q)𝔼[M(θ)]\mathbb{E}[|S|]\leq C(q)\mathbb{E}[M(\theta^{\star})], and therefore Γq(w)=𝔼[|S|]𝔼[M(θ)]C(q)\Gamma_{q}(w)=\frac{\mathbb{E}[|S|]}{\mathbb{E}[M(\theta^{\star})]}\leq C(q). For the second result, by the tower property it also suffices to show that

𝔼[|S|2θ]C(2)(q)M(θ)2.\mathbb{E}[|S|^{2}\mid\theta^{\star}]\leq C^{(2)}(q)M(\theta^{\star})^{2}. (C.5)

The rest of the proof proceeds conditionally on θ\theta^{\star}, so we are essentially in the fully frequentist setting. Thus, for the rest of the proof, we will abbreviate M(θ)M(\theta^{\star}) as MM. We will also assume the “worst-case” values for the non-null coordinates of WW: in particular, let WW^{\prime} denote WW but with all of the non-null coordinates replaced with the value \infty, and let S[p]S^{\prime}\subset[p] be the discovery set made when applying SeqStep to WW^{\prime}. These are the “worst-case” values in the sense that |S||S||S^{\prime}|\geq|S| deterministically (see Spector and Janson, (2022), Lemma B.4), so it suffices to show that 𝔼[|S|]C(q)M\mathbb{E}[|S^{\prime}|]\leq C(q)M and 𝔼[|S|2]C(2)(q)M2\mathbb{E}[|S^{\prime}|^{2}]\leq C^{(2)}(q)M^{2}.

As notation, let U=𝕀(sorted(W)>0)U=\mathbb{I}(\mathrm{sorted}(W^{\prime})>0) denote the signs of WW^{\prime} when sorted in descending order of absolute value. Following the notation in Equation (B.4), let ψ(U)=max{k:kkU¯k+1kU¯kq}\psi(U)=\max\left\{k:\frac{k-k\bar{U}_{k}+1}{k\bar{U}_{k}}\leq q\right\}, where U¯k=1ki=1min(k,p)Ui\bar{U}_{k}=\frac{1}{k}\sum_{i=1}^{\min(k,p)}U_{i}. This ensures that |S|=ψ(U)+11+qψ(U)|S^{\prime}|=\left\lceil\frac{\psi(U)+1}{1+q}\right\rceil\leq\psi(U) is the number of discoveries made by knockoffs (Spector and Janson, (2022), Lemma B.3). To prove the first result, it thus suffices to show 𝔼[ψ(U)]C(q)M\mathbb{E}[\psi(U)]\leq C(q)M. To do this, let K=M+11+qK=\left\lceil\frac{M+1}{1+q}\right\rceil and fix any integer cc\in\mathbb{N} (we will pick a specific value for cc later). Observe that

𝔼[ψ(U)]\displaystyle\mathbb{E}[\psi(U)] cK(ψ(U)cK)+k=cKk(ψ(U)=k)\displaystyle\leq cK\mathbb{P}(\psi(U)\leq cK)+\sum_{k=cK}^{\infty}k\mathbb{P}(\psi(U)=k) (C.6)
cK+k=cKk(Bin(kM,1/2)k+11+qM).\displaystyle\leq cK+\sum_{k=cK}^{\infty}k\mathbb{P}\left(\mathrm{Bin}(k-M,1/2)\geq\left\lceil\frac{k+1}{1+q}\right\rceil-M\right). (C.7)

where the second line follows because (i) the event ψ(U)=k\psi(U)=k implies that at least k+11+q\left\lceil\frac{k+1}{1+q}\right\rceil of the first kk coordinates of UU are positive and (ii) the knockoff flip-sign property guarantees that conditional on θ\theta^{\star}, the null coordinates of UU are i.i.d. random signs conditional on the values of the non-null coordinates of UU. (Without loss of generality, we may assume that the absolute values of WW^{\prime} are nonzero with probability one, since again, this only increases the number of discoveries made by knockoffs.) Thus, doing simple arithmetic, in the first kk coordinates of UU, there are kMk-M null i.i.d. signs, of which at least k+11+qM\left\lceil\frac{k+1}{1+q}\right\rceil-M must be positive, yielding the expression above with the Binomial probability.

We now apply Hoeffding’s inequality. To do so, we must ensure k+11+qM\left\lceil\frac{k+1}{1+q}\right\rceil-M is larger than the mean of a Bin(kM,1/2)\mathrm{Bin}(k-M,1/2) random variable. It turns out that it suffices to pick the value of cc to satisfy c>(11+q12)1c>\left(\frac{1}{1+q}-\frac{1}{2}\right)^{-1}. To see why, fix any kcKk\geq cK, so we may write k=cK+k=cK+\ell for some 0\ell\geq 0. Then for all such kk, we have

k+11+qMkM2\displaystyle\frac{k+1}{1+q}-M-\frac{k-M}{2} k(11+q12)M2\displaystyle\geq k\left(\frac{1}{1+q}-\frac{1}{2}\right)-\frac{M}{2}
=(cK+)(11+q12)M2\displaystyle=(cK+\ell)\left(\frac{1}{1+q}-\frac{1}{2}\right)-\frac{M}{2} since k=cK+\displaystyle\text{ since }k=cK+\ell
2KM2+(11+q12)\displaystyle\geq\frac{2K-M}{2}+\ell\left(\frac{1}{1+q}-\frac{1}{2}\right) since c(11+q12)1\displaystyle\text{ since }c\geq\left(\frac{1}{1+q}-\frac{1}{2}\right)^{-1}
(11+q12)\displaystyle\geq\ell\left(\frac{1}{1+q}-\frac{1}{2}\right) since KM1+qM2 by definition.\displaystyle\text{ since }K\geq\frac{M}{1+q}\geq\frac{M}{2}\text{ by definition.}

Thus, we may apply Hoeffding’s inequality for kcKk\geq cK. Indeed, for any 0\ell\geq 0, the previous result yields that for kcKk\geq cK:

(Bin(kM,1/2)k+11+qM)\displaystyle\mathbb{P}\left(\mathrm{Bin}(k-M,1/2)\geq\left\lceil\frac{k+1}{1+q}\right\rceil-M\right) (Bin(kM,1/2)kM2(kcK)(11+q12))\displaystyle\leq\mathbb{P}\left(\mathrm{Bin}(k-M,1/2)-\frac{k-M}{2}\geq(k-cK)\left(\frac{1}{1+q}-\frac{1}{2}\right)\right)
exp(2(kcK)2(11+q12)2).\displaystyle\leq\exp\left(-2(k-cK)^{2}\left(\frac{1}{1+q}-\frac{1}{2}\right)^{2}\right).

As notation, set αq=11+q12\alpha_{q}=\frac{1}{1+q}-\frac{1}{2}. Combining the previous equation with Eq. (C.7), we obtain

𝔼[ψ(U)]\displaystyle\mathbb{E}[\psi(U)] cK+k=cKkexp(2(kcK)2αq2)\displaystyle\leq cK+\sum_{k=cK}^{\infty}k\exp\left(-2(k-cK)^{2}\alpha_{q}^{2}\right)
=cK+=0(+cK)exp(22αq2)\displaystyle=cK+\sum_{\ell=0}^{\infty}(\ell+cK)\exp\left(-2\ell^{2}\alpha_{q}^{2}\right)
=cK+cK=0exp(22αq2)+=0exp(22αq2).\displaystyle=cK+cK\sum_{\ell=0}^{\infty}\exp(-2\ell^{2}\alpha_{q}^{2})+\sum_{\ell=0}^{\infty}\ell\exp(-2\ell^{2}\alpha_{q}^{2}).

Note that the sums =0exp(22αq2)\sum_{\ell=0}^{\infty}\ell\exp(-2\ell^{2}\alpha_{q}^{2}) and =0exp(22αq2)\sum_{\ell=0}^{\infty}\exp(-2\ell^{2}\alpha_{q}^{2}) are both convergent. As a result, 𝔼[ψ(U)]\mathbb{E}[\psi(U)] is bounded by a constant multiple of cKc1+qMcK\sim\frac{c}{1+q}M, where the constant depends on qq but nothing else. Since ψ(U)|S||S|\psi(U)\geq|S^{\prime}|\geq|S| as previously argued, this completes the proof.

For the second statement, we note that by the same argument as above, we have that

𝔼[ψ(U)2]\displaystyle\mathbb{E}[\psi(U)^{2}] (cK)2+k=cKk2(Bin(kM,1/2)k+11+qM)\displaystyle\leq(cK)^{2}+\sum_{k=cK}^{\infty}k^{2}\mathbb{P}\left(\mathrm{Bin}(k-M,1/2)\geq\left\lceil\frac{k+1}{1+q}\right\rceil-M\right)
(cK)2+=0(+cK)2exp(22αq2)\displaystyle\leq(cK)^{2}+\sum_{\ell=0}^{\infty}(\ell+cK)^{2}\exp\left(-2\ell^{2}\alpha_{q}^{2}\right)
=(cK)2(1+=0exp(22αq2))+cK=0exp(22αq2)+=02exp(22αq2).\displaystyle=(cK)^{2}\left(1+\sum_{\ell=0}^{\infty}\exp\left(-2\ell^{2}\alpha_{q}^{2}\right)\right)+cK\sum_{\ell=0}^{\infty}\ell\exp\left(-2\ell^{2}\alpha_{q}^{2}\right)+\sum_{\ell=0}^{\infty}\ell^{2}\exp\left(-2\ell^{2}\alpha_{q}^{2}\right).

Once again, the three series above are convergent and the value they converge to depends only on qq. Since cKMcK\sim M asymptotically, this implies that there exists some constant C(2)(q)C^{(2)}(q) such that 𝔼[ψ(U)2]C(2)(q)M2\mathbb{E}[\psi(U)^{2}]\leq C^{(2)}(q)M^{2}. This completes the proof since ψ(U)2|S|2\psi(U)^{2}\geq|S|^{2}. ∎
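
The first bound of Lemma C.2 can also be checked numerically in the worst case used in the proof, where the MM non-null coordinates of WW^{\prime} are sent to ++\infty (so their signs appear first and are positive) and the remaining signs are i.i.d. fair coin flips. The Python sketch below uses arbitrary toy parameters and is illustrative only.

import numpy as np

def psi(u, q):
    # psi(U) = max{k : (k - k*Ubar_k + 1) / (k*Ubar_k) <= q}, with the max over an empty set taken as 0.
    csum = np.cumsum(np.asarray(u, dtype=float))
    k = np.arange(1, len(u) + 1)
    feasible = (k - csum + 1) <= q * csum
    return int(k[feasible].max()) if feasible.any() else 0

rng = np.random.default_rng(0)
p, M, q, reps = 500, 20, 0.2, 2000
vals = []
for _ in range(reps):
    nulls = rng.integers(0, 2, size=p - M)       # i.i.d. null signs (1 = positive)
    u = np.concatenate([np.ones(M), nulls])      # worst-case sorted sign vector
    vals.append(psi(u, q))
print(np.mean(vals) / M)                         # estimate of E[psi(U)] / M, bounded by a constant C(q)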

Appendix D Additional comparison to prior work

D.1 Comparison to the unmasked likelihood ratio

In this section, we compare MLR statistics to the earlier unmasked likelihood statistic introduced by Katsevich and Ramdas, (2020), which this work builds upon. The upshot is that unmasked likelihood statistics give the most powerful “binary pp-values,” as shown by Katsevich and Ramdas, (2020), but do not yield jointly valid knockoff feature statistics in the sense required for the FDR control proof in Barber and Candès, (2015) and Candès et al., (2018).

In particular, we call a statistic Tj([𝐗,𝐗~],𝐘)T_{j}([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y}) a marginally symmetric knockoff statistic if TjT_{j} satisfies Tj([𝐗,𝐗~]swap(j),𝐘)=Tj([𝐗,𝐗~],𝐘)T_{j}([\mathbf{X},\widetilde{\mathbf{X}}]_{{\text{swap}(j)}},\mathbf{Y})=-T_{j}([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y}). Under the null, TjT_{j} is marginally symmetric, so the quantity kj=12+12𝕀(Tj0)k_{j}=\frac{1}{2}+\frac{1}{2}\mathbb{I}(T_{j}\leq 0) is a valid “binary pp-value” which only takes values in {1/2,1}\{1/2,1\}. Theorem 5 of Katsevich and Ramdas, (2020) shows that for any marginally symmetric knockoff statistic, P(kj=1/2)=P(Tj>0)P^{\star}(k_{j}=1/2)=P^{\star}(T_{j}>0) is maximized if Tj>0p𝐘𝐗(𝐘[𝐗j,𝐗j])>p𝐘𝐗(𝐘[𝐗~j,𝐗j])T_{j}>0\Leftrightarrow p^{\star}_{\mathbf{Y}\mid\mathbf{X}}(\mathbf{Y}\mid[\mathbf{X}_{j},\mathbf{X}_{-j}])>p^{\star}_{\mathbf{Y}\mid\mathbf{X}}(\mathbf{Y}\mid[\widetilde{\mathbf{X}}_{j},\mathbf{X}_{-j}]), where p𝐘𝐗p^{\star}_{\mathbf{Y}\mid\mathbf{X}} denotes the density of 𝐘𝐗\mathbf{Y}\mid\mathbf{X} under the true law PP^{\star} of the data. As such, one might initially hope to use the unmasked likelihood ratio as a knockoff statistic:

Wjunmasked=log(p𝐘𝐗(𝐘[𝐗j,𝐗j])p𝐘𝐗(𝐘[𝐗~j,𝐗j])).W_{j}^{\mathrm{unmasked}}=\log\left(\frac{p^{\star}_{\mathbf{Y}\mid\mathbf{X}}(\mathbf{Y}\mid[\mathbf{X}_{j},\mathbf{X}_{-j}])}{p^{\star}_{\mathbf{Y}\mid\mathbf{X}}(\mathbf{Y}\mid[\widetilde{\mathbf{X}}_{j},\mathbf{X}_{-j}])}\right).

However, a marginally symmetric knockoff statistic is not necessarily a valid knockoff feature statistic, which must satisfy the following stronger property (Barber and Candès,, 2015; Candès et al.,, 2018):

Wj([𝐗,𝐗~]swap(J),𝐘)={Wj([𝐗,𝐗~],𝐘)jJWj([𝐗,𝐗~],𝐘)jJ,W_{j}([\mathbf{X},\widetilde{\mathbf{X}}]_{{\text{swap}(J)}},\mathbf{Y})=\begin{cases}W_{j}([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y})&j\not\in J\\ -W_{j}([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y})&j\in J,\end{cases}

for any J[p]J\subset[p]. This flip-sign property guarantees that the signs of the null coordinates of WW are jointly i.i.d. and symmetric. However, the unmasked likelihood statistic does not satisfy this property, as changing the observed value of 𝐗i\mathbf{X}_{i} for iji\neq j will typically change the value of the likelihood p𝐘𝐗(𝐘[𝐗j,𝐗j])p^{\star}_{\mathbf{Y}\mid\mathbf{X}}(\mathbf{Y}\mid[\mathbf{X}_{j},\mathbf{X}_{-j}]).
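
To illustrate this last point numerically, the Python sketch below evaluates the unmasked likelihood ratio in a toy Gaussian model with known parameters and checks that swapping a different feature iji\neq j with its knockoff changes WjunmaskedW_{j}^{\mathrm{unmasked}}. The “knockoff” columns here are just independent Gaussian draws used to illustrate the functional form, not valid knockoffs, and all names are ours.

import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 5, 1.0
beta = np.array([2.0, -1.5, 0.0, 0.0, 1.0])       # "true" coefficients, assumed known for illustration
X = rng.normal(size=(n, p))
Xk = rng.normal(size=(n, p))                      # stand-in columns playing the role of knockoffs
y = X @ beta + sigma * rng.normal(size=n)

def loglik(design):
    # Gaussian log-likelihood of y given the design, up to an additive constant.
    return -0.5 * np.sum((y - design @ beta) ** 2) / sigma ** 2

def W_unmasked(X, Xk, j):
    # log p(y | [X_j, X_{-j}]) - log p(y | [Xk_j, X_{-j}])
    X_swapped = X.copy()
    X_swapped[:, j] = Xk[:, j]
    return loglik(X) - loglik(X_swapped)

j, i = 0, 1
w_before = W_unmasked(X, Xk, j)
X2, Xk2 = X.copy(), Xk.copy()
X2[:, i] = Xk[:, i]                               # swap feature i with its knockoff
Xk2[:, i] = X[:, i]
w_after = W_unmasked(X2, Xk2, j)
print(w_before, w_after)                          # generally unequal: the flip-sign property fails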

D.2 Comparison to the adaptive knockoff filter

In this section, we compare our methodological contribution, MLR statistics, to the adaptive knockoff filter described in Ren and Candès, (2020), namely their approach based on Bayesian modeling. The main point is that although MLR statistics and the procedure from Ren and Candès, (2020) have some intuitive similarities, the procedures are different and in fact complementary, since one could use the Bayesian adaptive knockoff filter from Ren and Candès, (2020) in combination with MLR statistics.

As review, recall from Section 3.2 that valid knockoff feature statistics WW as initially defined by Barber and Candès, (2015); Candès et al., (2018) must ensure that |W||W| is a function of the masked data DD, and thus |W||W| cannot explicitly depend on sign(W)\operatorname*{sign}(W). (It is also important to remember that |W||W| determines the order and “prioritization” of the SeqStep hypothesis testing procedure.) The key innovation of Ren and Candès, (2020) is to relax this restriction: in particular, they define a procedure where the analyst sequentially reveals the coordinates of sign(W)\operatorname*{sign}(W) in reverse order of their prioritization, and after each sign is revealed, the analyst may arbitrarily reorder the remaining hypotheses. The advantage of this approach is that revealing the sign of (e.g.) W1W_{1} may yield information that can be used to more accurately prioritize the hypotheses while still guaranteeing provable FDR control.

This raises the question: how should the analyst reorder the hypotheses after each coordinate of sign(W)\operatorname*{sign}(W) is revealed? One proposal from Ren and Candès, (2020) is to introduce an auxiliary Bayesian model for the relationship between sign(W)\operatorname*{sign}(W) and |Wj||W_{j}| (the authors also discuss the use of additional side information, although for brevity we do not discuss this here). For example, Ren and Candès, (2020) suggest using a two-groups model where

HjindBern(kj) and WjHj{𝒫1(Wj)Hj=1𝒫0(Wj)Hj=0.H_{j}\stackrel{{\scriptstyle\mathrm{ind}}}{{\sim}}\mathrm{Bern}(k_{j})\text{ and }W_{j}\mid H_{j}\sim\begin{cases}\mathcal{P}_{1}(W_{j})&H_{j}=1\\ \mathcal{P}_{0}(W_{j})&H_{j}=0.\end{cases} (D.1)

Above, HjH_{j} is the indicator of whether the jjth hypothesis is non-null, and 𝒫1\mathcal{P}_{1} and 𝒫0\mathcal{P}_{0} are (e.g.) unknown parametric distributions that the analyst fits as they observe sign(W)\operatorname*{sign}(W). With this notation, the proposal from Ren and Candès, (2020) can be roughly summarized as follows:

  1. 1.

    Fit an initial feature statistic WW, such as an LCD statistic, and observe |W||W|.

  2. 2.

    Fit an initial version of the model in Equation (D.1) and use it to compute γj(Wj>0,Hj=1|Wj|)\gamma_{j}\coloneqq\mathbb{P}(W_{j}>0,H_{j}=1\mid|W_{j}|).

  3. 3.

    Observe sign(Wj)\operatorname*{sign}(W_{j}) for j=argminj{γj:sign(Wj) has not yet been observed }j=\operatorname*{arg\,min}_{j}\{\gamma_{j}:\operatorname*{sign}(W_{j})\text{ has not yet been observed }\}.

  4. 4.

    Using sign(Wj)\operatorname*{sign}(W_{j}), update the model in Equation (D.1), update {γj}j=1p\{\gamma_{j}\}_{j=1}^{p}, and return to Step 3.

  5. 5.

    Terminate when all of sign(W)\operatorname*{sign}(W) has been revealed, at which point sign(W)\operatorname*{sign}(W) is passed to SeqStep in the reverse of the order that the signs were revealed.

Note that in Step 3, the reason Ren and Candès, (2020) choose jj to be the index minimizing (Wj>0,Hj=1|Wj|)\mathbb{P}(W_{j}>0,H_{j}=1\mid|W_{j}|) is that in Step 5, SeqStep processes sign(W)\operatorname*{sign}(W) in the reverse of the order in which the analyst revealed it. Thus, the analyst should reveal the least promising hypotheses first so that SeqStep encounters the most promising hypotheses first.
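
To fix ideas, the Python sketch below mirrors the bookkeeping of Steps 1 through 5. It is schematic only: the update_gamma argument stands in for the two-groups model fit of Ren and Candès, (2020), and the toy update rule shown is a placeholder of our own rather than their implementation.

import numpy as np

def adaptive_reveal_order(W, gamma_init, update_gamma):
    # Schematic version of Steps 1-5: repeatedly reveal sign(W_j) for the currently
    # least promising hypothesis, update the prioritization, and record the order.
    p = len(W)
    unrevealed = set(range(p))
    reveal_order = []
    gamma = np.array(gamma_init, dtype=float)     # estimates of P(W_j > 0, H_j = 1 | |W_j|)
    while unrevealed:
        j = min(unrevealed, key=lambda k: gamma[k])                # Step 3: smallest gamma_j first
        reveal_order.append(j)
        unrevealed.remove(j)
        gamma = update_gamma(gamma, j, np.sign(W[j]), np.abs(W))   # Step 4: refit the working model
    return reveal_order[::-1]   # Step 5: SeqStep processes hypotheses in reverse reveal order

def toy_update(gamma, j, sign_j, absW):
    # Placeholder update (ours, purely illustrative): nudge gamma toward the revealed
    # sign for hypotheses with similar |W|.
    weights = np.exp(-np.abs(absW - absW[j]))
    return 0.9 * gamma + 0.1 * weights * float(sign_j > 0)

W = np.array([3.1, -0.2, 2.4, 0.5, -1.7, 4.0])                     # hypothetical feature statistics
print(adaptive_reveal_order(W, gamma_init=np.abs(W), update_gamma=toy_update))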

The main similarity between this procedure and MLR statistics is that both procedures, roughly speaking, attempt to prioritize the hypotheses according to (Wj>0)\mathbb{P}(W_{j}>0), although we condition on the full masked data to maximize power. That said, there are two important differences. First, we and Ren and Candès, (2020) both use an auxiliary Bayesian model—however, we take probabilities over a Bayesian model of the full dataset, whereas Ren and Candès, (2020) only fit a working model of the law of WW. Using the full data as opposed to only the statistics WW should lead to much higher power—for example, if WW are poor feature statistics which do not contain much relevant information about the dataset, the procedure from Ren and Candès, (2020) will have low power. Thus, despite their initial similarity, these procedures are quite different.

The second and more important difference is that the procedure above is not a feature statistic. Rather, it is an extension of SeqStep that wraps on top of any initial feature statistic. This “adaptive knockoffs” procedure augments the power of any feature statistic, although if the initial feature statistic WW has many negative signs to begin with or its absolute values |W||W| are truly uninformative of its signs, the procedure may still be powerless. Since MLR statistics have provable optimality guarantees—namely, they maximize Pπ(Wj>0D)P^{\pi}(W_{j}>0\mid D) and make |Wj||W_{j}| a monotone function of Pπ(Wj>0D)P^{\pi}(W_{j}>0\mid D)—one might expect that using MLR statistics in place of a lasso statistic could improve the power of the adaptive knockoff filter. Similarly, using the adaptive knockoff filter in combination with MLR statistics could be more powerful than using MLR statistics alone.

Appendix E Gibbs sampling for MLR statistics

E.1 Proof of Eq. (4.1)

Lemma E.1.

Fix any constants 𝐱,𝐱~n×p,𝐲n\mathbf{x},\widetilde{\mathbf{x}}\in\mathbb{R}^{n\times p},\mathbf{y}\in\mathbb{R}^{n}, and θΘ\theta\in\Theta, and define 𝐝=(𝐲,{𝐱j,𝐱~j}j=1p)\mathbf{d}=(\mathbf{y},\{\mathbf{x}_{j},\widetilde{\mathbf{x}}_{j}\}_{j=1}^{p}). Then

Pπ(𝐗j=𝐱j𝐗j=𝐱j,θ=θ,D=𝐝)Pπ(𝐗j=𝐱~j𝐗j=𝐱j,θ=θ,D=𝐝)\displaystyle\frac{P^{\pi}(\mathbf{X}_{j}=\mathbf{x}_{j}\mid\mathbf{X}_{-j}=\mathbf{x}_{-j},\theta^{\star}=\theta,D=\mathbf{d})}{P^{\pi}(\mathbf{X}_{j}=\widetilde{\mathbf{x}}_{j}\mid\mathbf{X}_{-j}=\mathbf{x}_{-j},\theta^{\star}=\theta,D=\mathbf{d})} =P𝐘𝐗(θ)(𝐲𝐗j=𝐱j,𝐗j=𝐱j)P𝐘𝐗(θ)(𝐲𝐗j=𝐱~j,𝐗j=𝐱j).\displaystyle=\frac{P_{\mathbf{Y}\mid\mathbf{X}}^{(\theta)}(\mathbf{y}\mid\mathbf{X}_{j}=\mathbf{x}_{j},\mathbf{X}_{-j}=\mathbf{x}_{-j})}{P_{\mathbf{Y}\mid\mathbf{X}}^{(\theta)}(\mathbf{y}\mid\mathbf{X}_{j}=\widetilde{\mathbf{x}}_{j},\mathbf{X}_{-j}=\mathbf{x}_{-j})}.

as long as (𝐱,𝐱~,𝐲,θ)(\mathbf{x},\widetilde{\mathbf{x}},\mathbf{y},\theta) lies in the support of (𝐗,𝐗~,𝐘,θ)(\mathbf{X},\widetilde{\mathbf{X}},\mathbf{Y},\theta^{\star}) under PπP^{\pi}.

Proof.

Throughout this proof, we abuse notation and let probabilities of the form (e.g.) Pπ(𝐘=𝐲,𝐗=𝐱)P^{\pi}(\mathbf{Y}=\mathbf{y},\mathbf{X}=\mathbf{x}) denote the density of this event with respect to the base measure of PπP^{\pi}. The definition of conditional probability yields

Pπ(𝐗j=𝐱j𝐗j=𝐱j,θ=θ,D=𝐝)Pπ(𝐗j=𝐱~j𝐗j=𝐱j,θ=θ,D=𝐝)=Pπ(𝐗j=𝐱j,𝐗j=𝐱j,θ=θ,D=𝐝)Pπ(𝐗j=𝐱~j,𝐗j=𝐱j,θ=θ,D=𝐝).\displaystyle\frac{P^{\pi}(\mathbf{X}_{j}=\mathbf{x}_{j}\mid\mathbf{X}_{-j}=\mathbf{x}_{-j},\theta^{\star}=\theta,D=\mathbf{d})}{P^{\pi}(\mathbf{X}_{j}=\widetilde{\mathbf{x}}_{j}\mid\mathbf{X}_{-j}=\mathbf{x}_{-j},\theta^{\star}=\theta,D=\mathbf{d})}=\frac{P^{\pi}(\mathbf{X}_{j}=\mathbf{x}_{j},\mathbf{X}_{-j}=\mathbf{x}_{-j},\theta^{\star}=\theta,D=\mathbf{d})}{P^{\pi}(\mathbf{X}_{j}=\widetilde{\mathbf{x}}_{j},\mathbf{X}_{-j}=\mathbf{x}_{-j},\theta^{\star}=\theta,D=\mathbf{d})}.

By definition of D=𝐝D=\mathbf{d}, the event in the numerator is equivalent to the event [𝐗,𝐗~]=[𝐱,𝐱~],𝐘=𝐲,θ=θ[\mathbf{X},\widetilde{\mathbf{X}}]=[\mathbf{x},\widetilde{\mathbf{x}}],\mathbf{Y}=\mathbf{y},\theta^{\star}=\theta and the event in the denominator is equivalent to [𝐗,𝐗~]=[𝐱,𝐱~]swap(j),𝐘=𝐲,θ=θ[\mathbf{X},\widetilde{\mathbf{X}}]=[\mathbf{x},\widetilde{\mathbf{x}}]_{\mathrm{swap}(j)},\mathbf{Y}=\mathbf{y},\theta^{\star}=\theta. Plugging this in yields

=\displaystyle= Pπ([𝐗,𝐗~]=[𝐱,𝐱~],θ=θ,𝐘=𝐲)Pπ([𝐗,𝐗~]=[𝐱,𝐱~]swap(j),θ=θ,𝐘=𝐲)\displaystyle\frac{P^{\pi}([\mathbf{X},\widetilde{\mathbf{X}}]=[\mathbf{x},\widetilde{\mathbf{x}}],\theta^{\star}=\theta,\mathbf{Y}=\mathbf{y})}{P^{\pi}([\mathbf{X},\widetilde{\mathbf{X}}]=[\mathbf{x},\widetilde{\mathbf{x}}]_{\mathrm{swap}(j)},\theta^{\star}=\theta,\mathbf{Y}=\mathbf{y})}
=\displaystyle= π(θ)Pπ([𝐗,𝐗~]=[𝐱,𝐱~]θ=θ)Pπ(𝐘=𝐲θ=θ,[𝐗,𝐗~]=[𝐱,𝐱~])π(θ)Pπ([𝐗,𝐗~]=[𝐱,𝐱~]swap(j)θ=θ)Pπ(𝐘=𝐲θ=θ,[𝐗,𝐗~]=[𝐱,𝐱~]swap(j)).\displaystyle\frac{\pi(\theta)P^{\pi}([\mathbf{X},\widetilde{\mathbf{X}}]=[\mathbf{x},\widetilde{\mathbf{x}}]\mid\theta^{\star}=\theta)P^{\pi}(\mathbf{Y}=\mathbf{y}\mid\theta^{\star}=\theta,[\mathbf{X},\widetilde{\mathbf{X}}]=[\mathbf{x},\widetilde{\mathbf{x}}])}{\pi(\theta)P^{\pi}([\mathbf{X},\widetilde{\mathbf{X}}]=[\mathbf{x},\widetilde{\mathbf{x}}]_{\mathrm{swap}(j)}\mid\theta^{\star}=\theta)P^{\pi}(\mathbf{Y}=\mathbf{y}\mid\theta^{\star}=\theta,[\mathbf{X},\widetilde{\mathbf{X}}]=[\mathbf{x},\widetilde{\mathbf{x}}]_{\mathrm{swap}(j)})}.

where the second line uses the chain rule of conditional probability and the fact that θ\theta^{\star} has marginal density π\pi under PπP^{\pi}. Recall that by definition of PπP^{\pi} (see Section 1.2), the law of the data given θ=θ\theta^{\star}=\theta is simply P(θ)P^{(\theta)}. Furthermore, for all θΘ\theta\in\Theta, under P(θ)P^{(\theta)}, [𝐗,𝐗~][\mathbf{X},\widetilde{\mathbf{X}}] are assumed to be pairwise exchangeable because they are valid knockoffs (see footnote 3). Therefore, cancelling terms, we conclude

=\displaystyle= P(θ)(𝐘=𝐲[𝐗,𝐗~]=[𝐱,𝐱~])P(θ)(𝐘=𝐲[𝐗,𝐗~]=[𝐱,𝐱~]swap(j))\displaystyle\frac{P^{(\theta)}(\mathbf{Y}=\mathbf{y}\mid[\mathbf{X},\widetilde{\mathbf{X}}]=[\mathbf{x},\widetilde{\mathbf{x}}])}{P^{(\theta)}(\mathbf{Y}=\mathbf{y}\mid[\mathbf{X},\widetilde{\mathbf{X}}]=[\mathbf{x},\widetilde{\mathbf{x}}]_{\mathrm{swap}(j)})}
=P𝐘𝐗(θ)(𝐲𝐗j=𝐱j,𝐗j=𝐱j)P𝐘𝐗(θ)(𝐲𝐗j=𝐱~j,𝐗j=𝐱j).\displaystyle=\frac{P_{\mathbf{Y}\mid\mathbf{X}}^{(\theta)}(\mathbf{y}\mid\mathbf{X}_{j}=\mathbf{x}_{j},\mathbf{X}_{-j}=\mathbf{x}_{-j})}{P_{\mathbf{Y}\mid\mathbf{X}}^{(\theta)}(\mathbf{y}\mid\mathbf{X}_{j}=\widetilde{\mathbf{x}}_{j},\mathbf{X}_{-j}=\mathbf{x}_{-j})}.

where the last line follows since 𝐘𝐗~𝐗\mathbf{Y}\perp\!\!\!\perp\widetilde{\mathbf{X}}\mid\mathbf{X}, as 𝐗~\widetilde{\mathbf{X}} are valid knockoffs by assumption under P(θ)P^{(\theta)}. ∎

E.2 Derivation of Gibbs sampling updates

In this section, we derive the Gibbs sampling updates for the class of MLR statistics defined in Section 4.2. First, for convenience, we restate the model and choice of π\pi.

E.2.1 Model and prior

First, we consider the model-X case. For each j[p]j\in[p], let ϕj(𝐗j)n×kj\phi_{j}(\mathbf{X}_{j})\in\mathbb{R}^{n\times k_{j}} denote any vector of prespecified basis functions applied to 𝐗j\mathbf{X}_{j}. We assume the following additive model:

𝐘𝐗,β,σ2𝒩(j=1pϕj(𝐗j)β(j),σ2In)\mathbf{Y}\mid\mathbf{X},\beta,\sigma^{2}\sim\mathcal{N}\left(\sum_{j=1}^{p}\phi_{j}(\mathbf{X}_{j})\beta^{(j)},\sigma^{2}I_{n}\right)

with the following prior on β(j)kj\beta^{(j)}\in\mathbb{R}^{k_{j}}:

β(j)ind{0kjw.p. p0𝒩(0,τ2Ikj)w.p. 1p0.\beta^{(j)}\stackrel{{\scriptstyle\mathrm{ind}}}{{\sim}}\begin{cases}0\in\mathbb{R}^{k_{j}}&\text{w.p. }p_{0}\\ \mathcal{N}(0,\tau^{2}I_{k_{j}})&\text{w.p. }1-p_{0}.\end{cases}

with the usual hyperpriors

τ2invGamma(aτ,bτ),σ2invGamma(aσ,bσ) and p0Beta(a0,b0).\tau^{2}\sim\mathrm{invGamma}(a_{\tau},b_{\tau}),\sigma^{2}\sim\mathrm{invGamma}(a_{\sigma},b_{\sigma})\text{ and }p_{0}\sim\mathrm{Beta}(a_{0},b_{0}).

This is effectively a group spike-and-slab prior on β(j), which enforces group sparsity: either the whole vector equals zero or the whole vector is nonzero. We use this group spike-and-slab prior for two reasons. First, it reflects the intuition that ϕj is meant to represent only a single feature, and thus β(j) will likely be entirely zero (if 𝐗j is truly null) or entirely nonzero. Second, and more importantly, the group sparsity substantially improves the computational efficiency of the Gibbs sampler.

Lastly, for the fixed-X case, we assume exactly the same model but with the basis functions ϕj(⋅) chosen to be the identity. Thus, in the fixed-X case this model is a standard spike-and-slab Gaussian linear model (George and McCulloch,, 1997). It is worth noting that our implementation for the fixed-X case actually uses a slightly more general Gaussian mixture prior on βj, with density p(βj) = ∑_{k=0}^{m} p_k 𝒩(βj; 0, τ_k²), where τ_0 = 0 (so the k = 0 component is a point mass at zero), τ_k ∼ invGamma(a_k, b_k) independently for k ≥ 1, and (p_0,…,p_m) ∼ Dir(α). However, for brevity, we only derive the Gibbs updates for the case of two mixture components.
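To make the prior concrete, the following minimal Python sketch (using numpy) draws β from the group spike-and-slab prior above; the basis dimensions and hyperparameter values are arbitrary placeholders, not defaults from our implementation.

import numpy as np

def sample_group_spike_slab(k_list, p0, tau2, rng):
    # Draw beta = (beta^(1), ..., beta^(p)): each block is entirely zero with
    # probability p0 and otherwise N(0, tau2 * I_{k_j}), independently across j.
    blocks = []
    for k_j in k_list:
        if rng.random() < p0:
            blocks.append(np.zeros(k_j))                               # spike: whole block zero
        else:
            blocks.append(rng.normal(scale=np.sqrt(tau2), size=k_j))   # slab: Gaussian block
    return blocks

rng = np.random.default_rng(0)
beta_blocks = sample_group_spike_slab(k_list=[4, 4, 4], p0=0.9, tau2=0.25, rng=rng)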

E.2.2 Gibbs sampling updates

Following Section 4, we now review the details of the MLR Gibbs sampler, which samples from the posterior of (𝐗,β) given the masked data D = {𝐘, {𝐱j, 𝐱~j}j=1p}. (This is a standard derivation, but we review it here for the reader’s convenience.) As notation, let β denote the concatenation of {β(j)}j=1p, let β(−j) denote all of the coordinates of β except those of β(j), let γj denote the indicator that β(j) ≠ 0, and let ϕ(𝐗) ∈ ℝ^{n × ∑j kj} denote all of the basis functions concatenated together. Also note that although this section mostly uses the language of model-X knockoffs, when the basis functions ϕj(⋅) are the identity, the Gibbs updates we are about to describe satisfy the sufficiency property required for fixed-X statistics, and indeed the resulting Gibbs sampler is actually a valid implementation of the fixed-X MLR statistic.

To improve the convergence of the Gibbs sampler, we slightly modify the meta-algorithm in Algorithm 1 to marginalize over the value of β(j)\beta^{(j)} when resampling 𝐗j\mathbf{X}_{j}. To be precise, this means that instead of sampling 𝐗j𝐗j,β,σ2\mathbf{X}_{j}\mid\mathbf{X}_{-j},\beta,\sigma^{2}, we sample 𝐗j𝐗j,β(j)\mathbf{X}_{j}\mid\mathbf{X}_{-j},\beta^{(\mathrm{-}j)}. We derive this update in three steps, and along the way we derive the update for β(j)𝐗,β(j),D\beta^{(j)}\mid\mathbf{X},\beta^{(\mathrm{-}j)},D.

Step 1: First, we derive the update for γj𝐗,β(j),D\gamma_{j}\mid\mathbf{X},\beta^{(\mathrm{-}j)},D. Observe

(γj=0𝐗,β(j),D)(γj=1𝐗,β(j),D)\displaystyle\frac{\mathbb{P}(\gamma_{j}=0\mid\mathbf{X},\beta^{(\mathrm{-}j)},D)}{\mathbb{P}(\gamma_{j}=1\mid\mathbf{X},\beta^{(\mathrm{-}j)},D)} =p0p(𝐘𝐗,β(j),β(j)=0)(1p0)p(𝐘𝐗,β(j),β(j)0).\displaystyle=\frac{p_{0}p(\mathbf{Y}\mid\mathbf{X},\beta^{(\mathrm{-}j)},\beta^{(j)}=0)}{(1-p_{0})p(\mathbf{Y}\mid\mathbf{X},\beta^{(\mathrm{-}j)},\beta^{(j)}\neq 0)}.

Analyzing the numerator is easy, as the model specifies that if we let 𝐫=𝐘ϕ(𝐗j)β(j)\mathbf{r}=\mathbf{Y}-\phi(\mathbf{X}_{-j})\beta^{(\mathrm{-}j)}, then

p(𝐘𝐗,β(j),β(j)=0)det(σ2In)1/2exp(12σ2𝐫22).p(\mathbf{Y}\mid\mathbf{X},\beta^{(\mathrm{-}j)},\beta^{(j)}=0)\propto\det(\sigma^{2}I_{n})^{-1/2}\exp\left(-\frac{1}{2\sigma^{2}}\|\mathbf{r}\|_{2}^{2}\right).

For the denominator, observe that 𝐫,β(j)𝐗,β(j),β(j)0\mathbf{r},\beta^{(j)}\mid\mathbf{X},\beta^{(\mathrm{-}j)},\beta^{(j)}\neq 0 is jointly Gaussian: in particular,

(\beta^{(j)},\mathbf{r})\mid\mathbf{X},\beta^{(\mathrm{-}j)},\beta^{(j)}\neq 0\sim\mathcal{N}\left(0,\begin{bmatrix}\tau^{2}I_{k_{j}}&\tau^{2}\phi_{j}(\mathbf{X}_{j})^{T}\\ \tau^{2}\phi_{j}(\mathbf{X}_{j})&\tau^{2}\phi_{j}(\mathbf{X}_{j})\phi_{j}(\mathbf{X}_{j})^{T}+\sigma^{2}I_{n}\end{bmatrix}\right). (E.1)

To lighten notation, let Q_j ≔ I_{k_j} + (τ²/σ²) ϕ_j(𝐗_j)^T ϕ_j(𝐗_j). Using the above expression plus the Woodbury identity applied to the density of 𝐘 ∣ 𝐗, β(−j), β(j) ≠ 0, we conclude

(γj=0𝐗,β(j),D)(γj=1𝐗,β(j),D)=p01p0det(Qj)1/2exp(τ22σ4𝐫Tϕj(𝐗j)Qj1ϕj(𝐗j)T𝐫).\frac{\mathbb{P}(\gamma_{j}=0\mid\mathbf{X},\beta^{(\mathrm{-}j)},D)}{\mathbb{P}(\gamma_{j}=1\mid\mathbf{X},\beta^{(\mathrm{-}j)},D)}=\frac{p_{0}}{1-p_{0}}\det(Q_{j})^{1/2}\exp\left(-\frac{\tau^{2}}{2\sigma^{4}}\mathbf{r}^{T}\phi_{j}(\mathbf{X}_{j})Q_{j}^{-1}\phi_{j}(\mathbf{X}_{j})^{T}\mathbf{r}\right).

Since QjQ_{j} is a kj×kjk_{j}\times k_{j} matrix, this quantity can be computed relatively efficiently.
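For concreteness, here is a minimal numpy sketch of this Step 1 odds computation; phi_j stands for the n × k_j matrix ϕ_j(𝐗_j), r for the residual 𝐫 defined above, and the function and argument names are ours rather than part of any existing implementation.

import numpy as np

def gamma_odds(phi_j, r, p0, tau2, sigma2):
    # Odds P(gamma_j = 0 | X, beta^(-j), D) / P(gamma_j = 1 | X, beta^(-j), D)
    # via the formula above, with Q_j = I + (tau2/sigma2) * phi_j^T phi_j.
    k_j = phi_j.shape[1]
    Q_j = np.eye(k_j) + (tau2 / sigma2) * (phi_j.T @ phi_j)
    v = phi_j.T @ r                                  # phi_j(X_j)^T r
    quad = v @ np.linalg.solve(Q_j, v)               # r^T phi_j Q_j^{-1} phi_j^T r
    _, logdet = np.linalg.slogdet(Q_j)
    log_odds = (np.log(p0) - np.log1p(-p0)
                + 0.5 * logdet
                - tau2 / (2 * sigma2 ** 2) * quad)
    return np.exp(log_odds)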

Step 2: Next, we derive the distribution of β(j)𝐘,𝐗,β(j),γj\beta^{(j)}\mid\mathbf{Y},\mathbf{X},\beta^{(\mathrm{-}j)},\gamma_{j}. Of course, the case where γj=0\gamma_{j}=0 is trivial since then β(j)=0\beta^{(j)}=0 by definition: in the alternative case, note from Equation (E.1) that we have

β(j)𝐘,𝐗,β(j),γj=1𝒩(τ2σ2ϕjT𝐫τ4σ4ϕjTϕjQj1ϕjT𝐫,τ2Ikjτ4σ2ϕjTϕj+τ6σ4ϕjTϕjQj1ϕjTϕj),\beta^{(j)}\mid\mathbf{Y},\mathbf{X},\beta^{(\mathrm{-}j)},\gamma_{j}=1\sim\mathcal{N}\left(\frac{\tau^{2}}{\sigma^{2}}\phi_{j}^{T}\mathbf{r}-\frac{\tau^{4}}{\sigma^{4}}\phi_{j}^{T}\phi_{j}Q_{j}^{-1}\phi_{j}^{T}\mathbf{r},\tau^{2}I_{k_{j}}-\frac{\tau^{4}}{\sigma^{2}}\phi_{j}^{T}\phi_{j}+\frac{\tau^{6}}{\sigma^{4}}\phi_{j}^{T}\phi_{j}Q_{j}^{-1}\phi_{j}^{T}\phi_{j}\right),

where above, we use ϕj\phi_{j} as shorthand for ϕj(𝐗j)\phi_{j}(\mathbf{X}_{j}) to lighten notation.
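A direct transcription of this conditional Gaussian draw is sketched below; as before, phi_j denotes ϕ_j(𝐗_j) and r the residual, the names are ours, and in practice one might prefer a numerically more stable parameterization than forming the covariance matrix explicitly.

import numpy as np

def sample_beta_j(phi_j, r, tau2, sigma2, rng):
    # Draw beta^(j) | Y, X, beta^(-j), gamma_j = 1 from the Gaussian above.
    k_j = phi_j.shape[1]
    G = phi_j.T @ phi_j                              # k_j x k_j Gram matrix
    Q_j = np.eye(k_j) + (tau2 / sigma2) * G
    v = phi_j.T @ r
    mean = (tau2 / sigma2) * v - (tau2 ** 2 / sigma2 ** 2) * (G @ np.linalg.solve(Q_j, v))
    cov = (tau2 * np.eye(k_j)
           - (tau2 ** 2 / sigma2) * G
           + (tau2 ** 3 / sigma2 ** 2) * (G @ np.linalg.solve(Q_j, G)))
    return rng.multivariate_normal(mean, cov)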

Step 3: Lastly, we derive the update for 𝐗j given 𝐗−j, β(−j), D. In particular, for any vector 𝐱, let κ(𝐱) ≔ ℙ(γj = 0 ∣ 𝐗j = 𝐱, 𝐗−j, β(−j)). Then by the law of total probability and the same Woodbury calculations as before,

(𝐗j=𝐱𝐗j,β(j),D)\displaystyle\mathbb{P}(\mathbf{X}_{j}=\mathbf{x}\mid\mathbf{X}_{-j},\beta^{(\mathrm{-}j)},D)\propto p(𝐘𝐗j=𝐱,𝐗j,β(j))\displaystyle p(\mathbf{Y}\mid\mathbf{X}_{j}=\mathbf{x},\mathbf{X}_{-j},\beta^{(\mathrm{-}j)})
=κ(𝐱)p(𝐘𝐗j=𝐱,𝐗j,β(j),β(j)=0)\displaystyle=\kappa(\mathbf{x})p(\mathbf{Y}\mid\mathbf{X}_{j}=\mathbf{x},\mathbf{X}_{-j},\beta^{(\mathrm{-}j)},\beta^{(j)}=0)
+(1κ(𝐱))p(𝐘𝐗j=𝐱,𝐗j,β(j),β(j)0)\displaystyle+(1-\kappa(\mathbf{x}))p(\mathbf{Y}\mid\mathbf{X}_{j}=\mathbf{x},\mathbf{X}_{-j},\beta^{(\mathrm{-}j)},\beta^{(j)}\neq 0)
κ(𝐱)exp(12σ2𝐫22)\displaystyle\propto\kappa(\mathbf{x})\exp\left(-\frac{1}{2\sigma^{2}}\|\mathbf{r}\|_{2}^{2}\right)
\displaystyle+(1-\kappa(\mathbf{x}))\det(Q_{j}(\mathbf{x}))^{-1/2}\exp\left(-\frac{1}{2\sigma^{2}}\|\mathbf{r}\|_{2}^{2}+\frac{\tau^{2}}{2\sigma^{4}}\mathbf{r}^{T}\phi_{j}(\mathbf{x})Q_{j}(\mathbf{x})^{-1}\phi_{j}(\mathbf{x})^{T}\mathbf{r}\right)
\displaystyle\propto\kappa(\mathbf{x})+(1-\kappa(\mathbf{x}))\det(Q_{j}(\mathbf{x}))^{-1/2}\exp\left(\frac{\tau^{2}}{2\sigma^{4}}\mathbf{r}^{T}\phi_{j}(\mathbf{x})Q_{j}(\mathbf{x})^{-1}\phi_{j}(\mathbf{x})^{T}\mathbf{r}\right)

where above Qj(𝐱)=Ikj+τ2σ2ϕj(𝐱)Tϕj(𝐱)Q_{j}(\mathbf{x})=I_{k_{j}}+\frac{\tau^{2}}{\sigma^{2}}\phi_{j}(\mathbf{x})^{T}\phi_{j}(\mathbf{x}) as before.
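This resampling step can be sketched as follows: conditional on the masked data, 𝐗j can only equal 𝐱j or 𝐱~j, so we evaluate the unnormalized probability above at both candidates and draw accordingly. The helper names are ours; kappa plays the role of κ(𝐱) (which reduces to p0 under the prior above), and for strong signals one might prefer to compare the two candidates on the log scale to avoid overflow.

import numpy as np

def resample_xj(xj, xj_tilde, phi_j_fn, r, kappa, tau2, sigma2, rng):
    # Draw X_j from {x_j, xtilde_j} given X_{-j}, beta^(-j), D, using the
    # unnormalized probabilities kappa(x) + (1 - kappa(x)) det(Q_j(x))^{-1/2}
    #   * exp( (tau2 / (2 sigma2^2)) * r^T phi_j(x) Q_j(x)^{-1} phi_j(x)^T r ).
    def unnorm_prob(x):
        phi = phi_j_fn(x)                            # n x k_j basis expansion of the candidate
        k_j = phi.shape[1]
        Q = np.eye(k_j) + (tau2 / sigma2) * (phi.T @ phi)
        v = phi.T @ r
        quad = v @ np.linalg.solve(Q, v)
        slab = np.linalg.det(Q) ** -0.5 * np.exp(tau2 / (2 * sigma2 ** 2) * quad)
        return kappa + (1.0 - kappa) * slab
    w0, w1 = unnorm_prob(xj), unnorm_prob(xj_tilde)
    return xj if rng.random() < w0 / (w0 + w1) else xj_tilde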

The only other sampling steps required in the Gibbs sampler are to sample from the conditional distributions of σ2,τ2\sigma^{2},\tau^{2} and p0p_{0}; however, this is straightforward since we use conjugate hyperpriors for each of these parameters.

E.2.3 Extension to binary regression

We can easily extend the Gibbs sampler from the preceding section to handle a binary response via a latent variable approach. Indeed, consider first the case of probit regression, in which we observe 𝐳 = 𝕀(𝐘 ≥ 0) ∈ {0,1}^n instead of the continuous outcome 𝐘. Following Albert and Chib, (1993), the distribution of 𝐘 ∣ 𝐳, 𝐗, β is truncated normal, namely

\mathbf{Y}_{i}\mid\mathbf{z},\mathbf{X},\beta\stackrel{\mathrm{ind}}{\sim}\begin{cases}\mathrm{TruncNorm}(\mu_{i},\sigma^{2};(0,\infty))&\mathbf{z}_{i}=1\\ \mathrm{TruncNorm}(\mu_{i},\sigma^{2};(-\infty,0))&\mathbf{z}_{i}=0,\end{cases} (E.2)

where μ = ϕ(𝐗)β = 𝔼[𝐘 ∣ 𝐗, β]. Thus, when we observe a binary response 𝐳 instead of the continuous response 𝐘, we can employ the same Gibbs sampler as in Section E.2.2, except that after updating β(j) ∣ 𝐗, β(−j), 𝐘 we resample the latent variables 𝐘 according to Equation (E.2). This takes O(n) computation per iteration, since μ can also be updated in O(n) operations whenever 𝐗 or β changes. As a result, the computational complexity of this algorithm is the same as that of the algorithm in Section E.2.2. A similar formulation based on Pólya-Gamma random variables is available for the case of logistic regression (see Polson et al., (2013)).
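As an illustration, the latent-variable resampling step in Equation (E.2) can be implemented with scipy's truncated normal; mu and sigma2 denote μ and σ² above, and the function name is ours.

import numpy as np
from scipy.stats import truncnorm

def resample_latent_y(z, mu, sigma2, rng):
    # Resample Y_i | z, X, beta: truncated to (0, inf) when z_i = 1 and to
    # (-inf, 0) when z_i = 0. scipy's truncnorm takes standardized bounds.
    sigma = np.sqrt(sigma2)
    a = np.where(z == 1, (0.0 - mu) / sigma, -np.inf)   # lower bounds
    b = np.where(z == 1, np.inf, (0.0 - mu) / sigma)    # upper bounds
    return truncnorm.rvs(a, b, loc=mu, scale=sigma, random_state=rng)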

E.3 Proof and discussion of Lemma 4.1

Lemma 4.1.

Suppose that under PπP^{\pi}, (i) pj(i)(0,1)p_{j}^{(i)}\in(0,1) a.s. for j[p]j\in[p] and (ii) the support of θ𝐗,𝐘\theta^{\star}\mid\mathbf{X},\mathbf{Y} equals the support of the marginal law of θ\theta^{\star}. Then as nsamplen_{\mathrm{sample}}\to\infty,

\log\left(\sum_{i=1}^{n_{\mathrm{sample}}}p_{j}^{(i)}\right)-\log\left(\sum_{i=1}^{n_{\mathrm{sample}}}\left(1-p_{j}^{(i)}\right)\right)\stackrel{\mathrm{p}}{\to}\mathrm{MLR}_{j}^{\pi}\coloneqq\log\left(\frac{P_{j}^{\pi}(\mathbf{X}_{j}\mid D)}{P_{j}^{\pi}(\widetilde{\mathbf{X}}_{j}\mid D)}\right).
Proof.

This result holds because, by the derivations in Section 4, Algorithm 1 is a standard Gibbs sampler as defined in Algorithm A.40 of Robert and Casella, (2004): at each step, it samples from the conditional law of one unknown variable given all of the others (the unknown variables being 𝐗1,…,𝐗p and θ⋆), with every step performed conditionally on D. Furthermore, condition (i) implies that the support of 𝐗j ∣ D, 𝐗−j, θ⋆ does not depend on (θ⋆, 𝐗−j). As a result, the lemma is a direct consequence of Corollary 10.12 of Robert and Casella, (2004), applied conditionally on D. In particular, Corollary 10.12 proves that

1nsamplei=1nsamplepj(i)pPjπ(𝐗jD)\frac{1}{n_{\mathrm{sample}}}\sum_{i=1}^{n_{\mathrm{sample}}}p_{j}^{(i)}\stackrel{{\scriptstyle\mathrm{p}}}{{\to}}P_{j}^{\pi}(\mathbf{X}_{j}\mid D)

The result then follows from the continuous mapping theorem; note that condition (i) ensures P_j^π(𝐗j ∣ D) ∈ (0,1), so the continuous mapping theorem applies. ∎
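In practice, the estimator in Lemma 4.1 is a one-liner applied to the Gibbs output; a minimal sketch is below, where p_samples holds p_j^{(1)}, …, p_j^{(n_sample)} for a single feature j and the function name is ours.

import numpy as np

def mlr_from_gibbs(p_samples):
    # Estimate MLR_j^pi by log(sum_i p_j^(i)) - log(sum_i (1 - p_j^(i))).
    p = np.asarray(p_samples, dtype=float)
    return np.log(p.sum()) - np.log((1.0 - p).sum())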

We also note that the two assumptions of Lemma 4.1 are satisfied in Example 1. In particular, to verify that the support of 𝐗j ∣ D, θ⋆, 𝐗−j does not depend on θ⋆, observe that Eq. (4.1) implies that, conditional on D = 𝐝 with 𝐝 = (𝐲, {𝐱j, 𝐱~j}j=1p), and for any θ ∈ Θ,

\frac{p_{j}^{(i)}}{1-p_{j}^{(i)}}=\frac{P^{\pi}(\mathbf{X}_{j}=\mathbf{x}_{j}\mid D=\mathbf{d},\theta^{\star}=\theta,\mathbf{X}_{-j})}{P^{\pi}(\mathbf{X}_{j}=\widetilde{\mathbf{x}}_{j}\mid D=\mathbf{d},\theta^{\star}=\theta,\mathbf{X}_{-j})}=\frac{P^{(\theta)}_{\mathbf{Y}\mid\mathbf{X}}(\mathbf{y}\mid\mathbf{X}_{j}=\mathbf{x}_{j},\mathbf{X}_{-j})}{P^{(\theta)}_{\mathbf{Y}\mid\mathbf{X}}(\mathbf{y}\mid\mathbf{X}_{j}=\widetilde{\mathbf{x}}_{j},\mathbf{X}_{-j})}.

In Example 1, the numerator and denominator of the above likelihood ratio are Gaussian likelihoods, so the ratio is finite and strictly positive, and thus p_j^{(i)} ∈ (0,1). Similarly, Example 1 is a conjugate Gaussian (additive) spike-and-slab linear model. Well-known results for these models establish that the support of the Gibbs distributions for the linear coefficients β and hyperparameters σ², τ², p0 equals the support of the marginal distribution of these parameters (George and McCulloch,, 1997); see Appendix E.2 for detailed derivations illustrating this result.

There may, of course, be other Bayesian models Pπ for which the assumptions of Lemma 4.1 do not hold. In such settings, these assumptions can be relaxed; see Robert and Casella, (2004).

E.4 Computing AMLR statistics

The AMLR statistics are a deterministic function of the MLR statistics {MLRjπ}j=1p\{\mathrm{MLR}_{j}^{\pi}\}_{j=1}^{p} and {νj}j=1p\{\nu_{j}\}_{j=1}^{p} where

νjPπ(MLRjπ>0,j1(θ)D)(1+q)1Pπ(MLRjπ>0D).\nu_{j}\coloneqq\frac{P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0,j\in\mathcal{H}_{1}(\theta^{\star})\mid D)}{(1+q)^{-1}-P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0\mid D)}.

By Proposition 3.3, Pπ(MLRjπ > 0 ∣ D) is a function of |MLRjπ|, so computing the denominator of νj is straightforward since we have already established how to compute {MLRjπ}j=1p. To compute the numerator, recall that Algorithm 1 samples from the joint posterior of θ⋆, 𝐗, 𝐗~ ∣ D. Therefore, we can use the empirical mean of the samples from Algorithm 1 to approximate this quantity:

Pπ(MLRjπ>0,j1(θ)D)1nsamplei=1nsample𝕀(𝐗^j=𝐗j(i),j1(θ(i)))P^{\pi}(\mathrm{MLR}_{j}^{\pi}>0,j\in\mathcal{H}_{1}(\theta^{\star})\mid D)\approx\frac{1}{n_{\mathrm{sample}}}\sum_{i=1}^{n_{\mathrm{sample}}}\mathbb{I}(\widehat{\mathbf{X}}_{j}=\mathbf{X}_{j}^{(i)},j\in\mathcal{H}_{1}(\theta^{(i)}))

where 𝐗^j = argmax_{𝐱 ∈ {𝐗j, 𝐗~j}} P_j^π(𝐱 ∣ D) is the MLR “guess” of the value of 𝐗j.
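A minimal sketch of this Monte Carlo estimate is given below. It assumes the explicit form Pπ(MLRjπ > 0 ∣ D) = logit^{-1}(|MLRjπ|) used for plotting in Appendix H.1, and the variable names (xj_samples, xj_guess, j_nonnull) are ours.

import numpy as np

def estimate_nu_j(mlr_j, xj_samples, xj_guess, j_nonnull, q):
    # mlr_j:       scalar MLR_j^pi
    # xj_samples:  (nsample, n) array of sampled values X_j^(i)
    # xj_guess:    (n,) array, Xhat_j = argmax over {X_j, Xtilde_j} of P_j^pi(x | D)
    # j_nonnull:   (nsample,) boolean array, indicator of j in H_1(theta^(i))
    p_pos = 1.0 / (1.0 + np.exp(-abs(mlr_j)))                 # P(MLR_j^pi > 0 | D)
    hits = np.all(xj_samples == xj_guess[None, :], axis=1)    # Xhat_j == X_j^(i)
    numerator = np.mean(hits & j_nonnull)                     # estimates P(MLR_j^pi > 0, j in H_1 | D)
    return numerator / ((1.0 + q) ** -1 - p_pos)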

Appendix F MLR statistics for group knockoffs

In this section, we describe how MLR statistics extend to the setting of group knockoffs (Dai and Barber,, 2016). In particular, for a partition G1,…,Gm ⊂ [p] of the features, group knockoffs allow analysts to test the group null hypotheses HGj: 𝐗Gj ⫫ 𝐘 ∣ 𝐗−Gj, which can be useful in settings where 𝐗 is highly correlated and there is not enough data to make discoveries at the level of individual variables. Specifically, knockoffs 𝐗~ are model-X group knockoffs if they satisfy the group pairwise-exchangeability condition [𝐗,𝐗~]swap(Gj) =d [𝐗,𝐗~] for each j ∈ [m]. Similarly, 𝐗~ are fixed-X group knockoffs if (i) 𝐗^T𝐗 = 𝐗~^T𝐗~ and (ii) S = 𝐗^T𝐗 − 𝐗~^T𝐗 is block-diagonal, where the blocks correspond to the groups G1,…,Gm. Given group knockoffs, one computes a single knockoff feature statistic for each group.
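As a small illustration of the fixed-X conditions above, the following sketch checks (i) and (ii) numerically for a candidate 𝐗~; here groups is a list of index arrays forming the partition G1,…,Gm, and the function name is ours.

import numpy as np

def is_fixed_x_group_knockoff(X, X_tilde, groups, tol=1e-8):
    # (i) X^T X = Xtilde^T Xtilde; (ii) S = X^T X - Xtilde^T X is block-diagonal,
    # with blocks given by the groups G_1, ..., G_m.
    gram_ok = np.allclose(X.T @ X, X_tilde.T @ X_tilde, atol=tol)
    S = X.T @ X - X_tilde.T @ X
    in_block = np.zeros(S.shape, dtype=bool)
    for G in groups:
        in_block[np.ix_(G, G)] = True
    block_ok = np.allclose(S[~in_block], 0.0, atol=tol)
    return gram_ok and block_ok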

MLR statistics extend naturally to the group knockoff setting because we can treat each group of features XGjX_{G_{j}} as a single compound feature. In particular, the masked data for group knockoffs is

D={(𝐘,{𝐗Gj,𝐗~Gj}j=1m) for model-X knockoffs(𝐗,𝐗~,{𝐗GjT𝐘,𝐗~GjT𝐘}j=1m) for fixed-X knockoffs,D=\begin{cases}(\mathbf{Y},\{\mathbf{X}_{G_{j}},\widetilde{\mathbf{X}}_{G_{j}}\}_{j=1}^{m})&\text{ for model-X knockoffs}\\ (\mathbf{X},\widetilde{\mathbf{X}},\{\mathbf{X}_{G_{j}}^{T}\mathbf{Y},\widetilde{\mathbf{X}}_{G_{j}}^{T}\mathbf{Y}\}_{j=1}^{m})&\text{ for fixed-X knockoffs,}\end{cases} (F.1)

and the corresponding MLR statistics are

MLRjπ=log(PGjπ(𝐗GjD)PGjπ(𝐗~GjD)) for model-X knockoffs,\mathrm{MLR}_{j}^{\pi}=\log\left(\frac{P_{G_{j}}^{\pi}(\mathbf{X}_{G_{j}}\mid D)}{P_{G_{j}}^{\pi}(\widetilde{\mathbf{X}}_{G_{j}}\mid D)}\right)\text{ for model-X knockoffs,}

where PGjπP_{G_{j}}^{\pi} above denotes the law of 𝐗GjD\mathbf{X}_{G_{j}}\mid D under PπP^{\pi}. For fixed-X knockoffs, we have

MLRjπ=log(PGj,fxπ(𝐗GjT𝐘D)PGj,fxπ(𝐗~GjT𝐘D)) for fixed-X knockoffs,\mathrm{MLR}_{j}^{\pi}=\log\left(\frac{P^{\pi}_{G_{j},\mathrm{fx}}(\mathbf{X}_{G_{j}}^{T}\mathbf{Y}\mid D)}{P^{\pi}_{G_{j},\mathrm{fx}}(\widetilde{\mathbf{X}}_{G_{j}}^{T}\mathbf{Y}\mid D)}\right)\text{ for fixed-X knockoffs,}

where PGj,fxπP^{\pi}_{G_{j},\mathrm{fx}} denotes the law of 𝐗GjT𝐘D\mathbf{X}_{G_{j}}^{T}\mathbf{Y}\mid D under PπP^{\pi}.

Throughout the paper, we have proved several optimality properties of MLR statistics, and if we treat 𝐗Gj\mathbf{X}_{G_{j}} as a single compound feature, all of these theoretical results (namely Proposition 3.3 and Theorem 3.2) immediately apply to group MLR statistics as well.

To compute group MLR statistics, we can use the same Gibbs sampling strategy as in Section E.2: one simply treats 𝐗Gj as the basis representation of a single compound feature and applies the same equations as derived previously. This method is implemented in knockpy.

Appendix G Additional details for the simulations

Figure 13: This plot is identical to Figure 7 except it shows the results for q=0.05q=0.05.

In this section, we describe the simulation settings in Section 5, and we also give the corresponding plot to Figure 7 which shows the results when q=0.05q=0.05. To start, we describe the simulation setting for each plot.

  1. 1.

    Sampling 𝐗: We sample each row of 𝐗 as an i.i.d. 𝒩(0,Σ) random vector in all simulations, with two choices of Σ (see the code sketch after this list). First, in the “AR(1)” setting, we take Σ to correspond to a nonstationary AR(1) Gaussian Markov chain, so 𝐗 has i.i.d. rows satisfying Xj ∣ X1,…,Xj−1 ∼ 𝒩(ρj Xj−1, 1) with ρj = min(0.99, Bj) for Bj i.i.d. ∼ Beta(5,1). Note that the AR(1) setting is the default used in any plot where the covariance matrix is not specified. Second, in the “ErdosRenyi” (ER) setting, we sample a random matrix V such that 80% of its off-diagonal entries (selected uniformly at random) are equal to zero; for the remaining entries, we sample Vij i.i.d. ∼ Unif((−1,−0.1) ∪ (0.1,1)). To ensure the final covariance matrix is positive definite, we set Σ = (V + V^T) + (0.1 + |λmin(V + V^T)|) Ip and then rescale Σ to be a correlation matrix.

  2. 2.

    Sampling β\beta: Unless otherwise specified in the plot, we randomly choose s=10%s=10\% of the entries of β\beta to be nonzero and sample the nonzero entries as i.i.d. Unif([τ,τ/2][τ/2,τ])\mathrm{Unif}([-\tau,-\tau/2]\cup[\tau/2,\tau]) random variables with τ=0.5\tau=0.5 by default. The exceptions are: (1) in Figure 2, we set τ=0.5\tau=0.5 and s=0.1s=0.1, (2) in Figure 5, we set τ=0.3\tau=0.3, vary ss between 0.050.05 and 0.40.4 as shown in the plot, and in some panels sample the non-null coefficients as Laplace(τ)\mathrm{Laplace}(\tau) random variables, (3) in Figure 7 we take τ=2\tau=2 and s=0.3s=0.3, (4) in Figure 8 we take τ=1\tau=1.

  3. 3.

    Sampling 𝐘\mathbf{Y}: Throughout we sample 𝐘𝐗𝒩(𝐗β,In)\mathbf{Y}\mid\mathbf{X}\sim\mathcal{N}(\mathbf{X}\beta,I_{n}), with only two exceptions. First, in Figure 7, we sample 𝐘𝐗𝒩(h(𝐗)β,In)\mathbf{Y}\mid\mathbf{X}\sim\mathcal{N}(h(\mathbf{X})\beta,I_{n}) where hh is a nonlinear function applied elementwise to 𝐗\mathbf{X}, for h(x)=sin(x),h(x)=cos(x),h(x)=x2h(x)=\sin(x),h(x)=\cos(x),h(x)=x^{2} and h(x)=x3h(x)=x^{3}. Second, in Figure 8, 𝐘\mathbf{Y} is binary and (Y=1X)=exp(XTβ)1+exp(XTβ)\mathbb{P}(Y=1\mid X)=\frac{\exp(X^{T}\beta)}{1+\exp(X^{T}\beta)}.

  4. 4.

    Sampling knockoffs: We sample MVR and SDP Gaussian knockoffs using the default parameters from knockpy version 1.3, both in the fixed-X and model-X case. Note that in the model-X case, we use the true covariance matrix Σ\Sigma to sample knockoffs, thus guaranteeing finite-sample FDR control.

  5. 5.

    Fitting feature statistics: We fit the following types of feature statistics throughout the simulations: LCD statistics, LSM statistics, a random forest with swap importances (Gimenez et al.,, 2019), DeepPINK (Lu et al.,, 2018), MLR statistics (linear variant), MLR statistics with splines, and the MLR oracle. In all cases, we use the default hyperparameters from knockpy version 1.3 and do not tune them, so the MLR statistics do not have well-specified priors. The exception is that the MLR oracle has access to the underlying data-generating process and the true coefficients β, which is why it is an “oracle.”
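The code sketch referenced in item 1 is given below: a minimal numpy rendering of the AR(1) row sampler and of the ER covariance construction. The helper names are ours, we read the eigenvalue shift as using the magnitude of the smallest eigenvalue, and the actual simulation code at the repository linked in Appendix H is authoritative.

import numpy as np

def sample_ar1_rows(n, p, rng):
    # "AR(1)" design: each row is a nonstationary Gaussian Markov chain with
    # X_j | X_{j-1} ~ N(rho_j X_{j-1}, 1) and rho_j = min(0.99, Beta(5,1)).
    rho = np.minimum(0.99, rng.beta(5, 1, size=p))
    X = np.zeros((n, p))
    X[:, 0] = rng.normal(size=n)
    for j in range(1, p):
        X[:, j] = rho[j] * X[:, j - 1] + rng.normal(size=n)
    return X

def erdos_renyi_corr(p, rng):
    # "ER" setting: ~20% of off-diagonal entries of V are Unif((-1,-0.1) U (0.1,1)),
    # V + V^T is shifted to be positive definite, then rescaled to a correlation matrix.
    V = np.zeros((p, p))
    nonzero = rng.random((p, p)) < 0.2
    np.fill_diagonal(nonzero, False)
    m = nonzero.sum()
    V[nonzero] = rng.choice([-1.0, 1.0], size=m) * rng.uniform(0.1, 1.0, size=m)
    A = V + V.T
    Sigma = A + (0.1 + abs(np.linalg.eigvalsh(A).min())) * np.eye(p)
    d = np.sqrt(np.diag(Sigma))
    return Sigma / np.outer(d, d)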

Now, recall that in Figure 7, we showed the results for q=0.1 because several competitor feature statistics made no discoveries at q=0.05. Figure 13 is the corresponding plot for q=0.05.

Appendix H Additional results for the real data applications

H.1 HIV drug resistance

For the HIV drug resistance application, Figures 16-21 show the same results as in Figure 1 but for all drugs in the protease inhibitor (PI), NRTI, and NNRTI drug classes; broadly, they show that MLR statistics have higher power because they ensure that the feature statistics with high absolute values are consistently positive, as discussed in Section 2. Note that in these plots, for the lasso-based statistics, we plot the normalized statistics Wj / max_i |Wi|, so that the absolute value of each statistic is at most one. Similarly, for the MLR statistics, instead of directly plotting the masked likelihood ratio as per Equation (1.3), we plot

Wj2(logit1(|MLRjπ|)0.5)=2((MLRjπ>0D)0.5)W_{j}^{\star\star}\coloneqq 2\left(\mathrm{logit}^{-1}(|\mathrm{MLR}_{j}^{\pi}|)-0.5\right)=2\left(\mathbb{P}(\mathrm{MLR}_{j}^{\pi}>0\mid D)-0.5\right)

because we find this quantity easier to interpret than a log likelihood ratio. In particular, Wj⋆⋆ ≈ 0 if and only if MLRjπ is roughly equally likely to be positive or negative given the masked data D, and Wj⋆⋆ ≈ 1 when MLRjπ is positive with probability near one given D.
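For reference, this plotting transformation is a one-liner; a sketch (with our own function name) is:

import numpy as np

def w_double_star(mlr):
    # W_j^** = 2 * (logit^{-1}(|MLR_j^pi|) - 0.5) = 2 * (P(MLR_j^pi > 0 | D) - 0.5).
    return 2.0 * (1.0 / (1.0 + np.exp(-np.abs(mlr))) - 0.5)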

Additionally, Figures 14 and 15 show the number of discoveries made by each feature statistic for SDP and MVR knockoffs, respectively, stratified by the drug in question. Note that the specific data analysis is identical to that of Barber and Candès, (2015) and Fithian and Lei, (2020) other than the choice of feature statistic—see either of those papers or https://github.com/amspector100/mlr_knockoff_paper for more details.

Figure 14: This figure shows the number of discoveries made by each feature statistic for each drug in the HIV drug resistance dataset, using SDP knockoffs.
Figure 15: This figure shows the number of discoveries made by each feature statistic for each drug in the HIV drug resistance dataset, using MVR knockoffs.
Figure 16: This figure is the same as Figure 1, except it shows results for drugs in the PI class using SDP knockoffs.
Figure 17: This figure is the same as Figure 1, except it shows results for drugs in the PI class using MVR knockoffs.
Figure 18: This figure is the same as Figure 1, except it shows results for drugs in the NRTI class using SDP knockoffs.
Figure 19: This figure is the same as Figure 1, except it shows results for drugs in the NRTI class using MVR knockoffs.
Figure 20: This figure is the same as Figure 1, except it shows results for drugs in the NNRTI class using SDP knockoffs.
Figure 21: This figure is the same as Figure 1, except it shows results for drugs in the NNRTI class using MVR knockoffs.

H.2 Financial factor selection

We now present a few additional details for the financial factor selection analysis from Section 6.2. First, we list the ten index funds we analyze: XLB (materials), XLC (communication services), XLE (energy), XLF (financials), XLK (information technology), XLP (consumer staples), XLRE (real estate), XLU (utilities), XLV (health care), and XLY (consumer discretionary). Second, for each feature statistic, Table 1 shows the average realized FDP across all ten analyses; as desired, the average FDP for each method is below the nominal level of q=0.05. All code is available at https://github.com/amspector100/mlr_knockoff_paper.

Knockoff Type   Feature Stat.   Average FDP
MVR             LCD             0.013636
MVR             LSM             0.004545
MVR             MLR             0.038571
SDP             LCD             0.000000
SDP             LSM             0.035000
SDP             MLR             0.039002
Table 1: This table shows the average FDP, defined above, for each method in the financial factor selection analysis from Section 6.2.