Asymptotically Optimal Knockoff Statistics via the Masked Likelihood Ratio
Abstract
In feature selection problems, knockoffs are synthetic controls for the original features. Employing knockoffs allows analysts to use nearly any variable importance measure or “feature statistic” to select features while rigorously controlling false positives. However, it is not clear which statistic maximizes power. In this paper, we argue that state-of-the-art lasso-based feature statistics often prioritize features that are unlikely to be discovered, leading to low power in real applications. Instead, we introduce masked likelihood ratio (MLR) statistics, which prioritize features according to one’s ability to distinguish each feature from its knockoff. Although no single feature statistic is uniformly most powerful in all situations, we show that MLR statistics asymptotically maximize the number of discoveries under a user-specified Bayesian model of the data. (Like all feature statistics, MLR statistics always provide frequentist error control.) This result places no restrictions on the problem dimensions and makes no parametric assumptions; instead, we require a “local dependence” condition that depends only on known quantities. In simulations and three real applications, MLR statistics outperform state-of-the-art feature statistics, including in settings where the Bayesian model is highly misspecified. We implement MLR statistics in the python package knockpy; our implementation is often faster than computing a cross-validated lasso.
1 Introduction
Given a design matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$ and a response vector $\mathbf{y} \in \mathbb{R}^n$, the task of controlled feature selection is, informally, to discover features that influence $\mathbf{y}$ while controlling the false discovery rate (FDR). In this context, knockoffs (Barber and Candès, 2015; Candès et al., 2018) are fake variables $\widetilde{\mathbf{X}}$ which act as negative controls for the features $\mathbf{X}$. Remarkably, employing knockoffs allows analysts to use nearly any machine learning model or test statistic, often known interchangeably as a “feature statistic” or “knockoff statistic,” to select features while exactly controlling the FDR. As a result, knockoffs has become popular in the analysis of genetic studies, financial data, clinical trials, and more (Sesia et al., 2018, 2019; Challet et al., 2021; Sechidis et al., 2021).
The flexibility of knockoffs has inspired the development of a variety of feature statistics based on penalized regression coefficients, sparse Bayesian models, random forests, neural networks, and more (see, e.g., Barber and Candès, (2015); Candès et al., (2018); Gimenez et al., (2019); Lu et al., (2018)). These feature statistics not only reflect different modeling assumptions, but more fundamentally, they estimate different quantities, including coefficient sizes, Bayesian posterior inclusion probabilities, and various other measures of variable importance. Yet there has been relatively little theoretical comparison of these methods, in large part because analyzing the power of knockoffs can be very challenging; see Section 1.4. In this work, we develop a principled approach and concrete methods for designing knockoff statistics that maximize power.
1.1 Review of model-X and fixed-X knockoffs
This section reviews the key elements of Model-X (MX) and Fixed-X (FX) knockoff methods.
Model-X (MX) knockoffs (Candès et al., 2018) is a method to test the conditional independence hypotheses $H_j: X_j \perp\!\!\!\perp y \mid X_{-j}$, where $X_{-j}$ denotes all features except $X_j$, assuming that the law of $X$ is known. (This assumption can be relaxed to having a well-specified parametric model for $y \mid X$ (Huang and Janson, 2020), and knockoffs are known to be robust to misspecification of the law of $X$ (Barber et al., 2020).) Applying MX knockoffs requires three steps.
1. Constructing knockoffs. Valid MX knockoffs $\widetilde{\mathbf{X}}$ must satisfy two properties. First, the columns of $\widetilde{\mathbf{X}}$ must be pairwise exchangeable with the corresponding columns of $\mathbf{X}$, i.e. $[\mathbf{X}, \widetilde{\mathbf{X}}]_{\mathrm{swap}(j)} \overset{d}{=} [\mathbf{X}, \widetilde{\mathbf{X}}]$ must hold for all $j$, where $\mathrm{swap}(j)$ exchanges the $j$th columns of $\mathbf{X}$ and $\widetilde{\mathbf{X}}$. Second, we require that $\widetilde{\mathbf{X}} \perp\!\!\!\perp \mathbf{y} \mid \mathbf{X}$, which holds if one constructs $\widetilde{\mathbf{X}}$ without looking at $\mathbf{y}$. Informally, these constraints guarantee that $\mathbf{x}_j$ and $\widetilde{\mathbf{x}}_j$ are “indistinguishable” under $H_j$. Sampling knockoffs can be challenging, but this problem is well studied (e.g., Bates et al., 2020).
2. Fitting feature statistics. Next, use any machine learning (ML) algorithm to fit feature importances $Z = (Z_1, \dots, Z_p, \widetilde{Z}_1, \dots, \widetilde{Z}_p)$, where $Z_j$ and $\widetilde{Z}_j$ heuristically measure the “importance” of $\mathbf{x}_j$ and $\widetilde{\mathbf{x}}_j$ in predicting $\mathbf{y}$. The only restriction is that swapping $\mathbf{x}_j$ and $\widetilde{\mathbf{x}}_j$ must also swap the feature importances $Z_j$ and $\widetilde{Z}_j$ without changing any of the other feature importances. This restriction is satisfied by most ML algorithms, such as the lasso or various neural networks (Lu et al., 2018).
Given $Z$, we define the feature statistics via $W_j = f(Z_j, \widetilde{Z}_j)$, where $f$ is any antisymmetric function, meaning $f(a, b) = -f(b, a)$. E.g., the lasso coefficient difference (LCD) statistic sets $W_j = |\hat\beta_j| - |\hat\beta_{j+p}|$, where $\hat\beta_j$ and $\hat\beta_{j+p}$ are the coefficients of $\mathbf{x}_j$ and $\widetilde{\mathbf{x}}_j$ from a lasso fit on $([\mathbf{X}, \widetilde{\mathbf{X}}], \mathbf{y})$. Intuitively, when $W_j$ is positive, this suggests that $\mathbf{x}_j$ is more important than $\widetilde{\mathbf{x}}_j$ and thus is evidence against the null. Indeed, Steps 1-2 guarantee that the signs of the null $W_j$ are i.i.d. random signs conditional on the absolute values $|W|$.
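To make Step 2 concrete, the sketch below computes LCD statistics from a cross-validated lasso fit on the concatenated design; the use of scikit-learn and the specific tuning choices are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def lcd_statistics(X, X_tilde, y, cv=5, random_state=0):
    """Lasso coefficient difference statistics W_j = |beta_j| - |beta_{j+p}|."""
    n, p = X.shape
    features = np.hstack([X, X_tilde])        # lasso fit on [X, X_tilde]
    beta = LassoCV(cv=cv, random_state=random_state).fit(features, y).coef_
    # Antisymmetric combination: swapping column j with its knockoff flips W_j.
    return np.abs(beta[:p]) - np.abs(beta[p:])
```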
3. Make rejections. Define the data-dependent threshold $\tau = \min\{t > 0 : \frac{1 + \#\{j: W_j \le -t\}}{\max(1, \#\{j: W_j \ge t\})} \le q\}$, where $q$ is the target FDR level. Then, reject $\{j : W_j \ge \tau\}$, which guarantees finite-sample FDR control at level $q$. Note this result does not require any assumptions about the law of $\mathbf{y} \mid \mathbf{X}$.
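A minimal sketch of Step 3, assuming `W` is a numpy array of feature statistics and `q` is the target FDR level; the helper name is hypothetical.

```python
import numpy as np

def knockoff_rejections(W, q):
    """Return the indices j with W_j >= tau, where tau is the data-dependent threshold."""
    for t in np.sort(np.abs(W[W != 0])):      # candidate thresholds t > 0, smallest first
        if (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t)) <= q:
            return np.flatnonzero(W >= t)     # reject {j : W_j >= tau}
    return np.array([], dtype=int)            # no valid threshold: make no rejections
```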
Theorem 1.1 (Candès et al., (2018)).
Let $P^\star$ denote the unknown joint law of $(X, y)$, and suppose the law of $X$ is known, allowing one to construct valid knockoffs $\widetilde{\mathbf{X}}$. (Typically, one assumes that the observations are i.i.d. to construct valid knockoffs, but the i.i.d. assumption is not necessary as long as $\widetilde{\mathbf{X}}$ are valid knockoffs.) Then for any feature statistic $W$,
$$\mathrm{FDR} := \mathbb{E}_{P^\star}\left[\frac{|\mathcal{S} \cap \mathcal{H}_0|}{\max(1, |\mathcal{S}|)}\right] \le q,$$
where $\mathcal{S} = \{j : W_j \ge \tau\}$ is the set of rejections and $\mathcal{H}_0$ is the set of nulls under $P^\star$.
Fixed-X (FX) knockoffs (Barber and Candès, 2015) treats $\mathbf{X}$ as nonrandom and yields exact FDR control under the Gaussian linear model $\mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X}\beta, \sigma^2 I_n)$. Fitting FX knockoffs is identical to the steps above with two exceptions:
1. FX knockoffs need not satisfy the constraints in Step 1: instead, $\widetilde{\mathbf{X}}$ must satisfy (i) $\widetilde{\mathbf{X}}^\top \widetilde{\mathbf{X}} = \mathbf{X}^\top \mathbf{X}$ and (ii) $\mathbf{X}^\top \widetilde{\mathbf{X}} = \mathbf{X}^\top \mathbf{X} - D$, for some diagonal matrix $D \succeq 0$ satisfying $2\,\mathbf{X}^\top \mathbf{X} - D \succeq 0$.
2. The feature importances $Z$ can only depend on $\mathbf{y}$ through $[\mathbf{X}, \widetilde{\mathbf{X}}]^\top \mathbf{y}$, which permits the use of many test statistics, but not all (for example, this prohibits the use of cross-validation).
Our theory applies to both MX and FX knockoffs, but we often focus on the MX context for brevity.
1.2 Theoretical problem statement
This section defines two types of optimal knockoff statistics: oracle statistics, which maximize the expected number of discoveries (ENDisc) for the true (unknown) data distribution $P^\star$, and Bayes-optimal statistics, which maximize ENDisc with respect to a prior distribution over a class of candidate distributions. We focus on the expected number of discoveries since it greatly simplifies the analysis and all feature statistics provably control the frequentist FDR anyway. However, Section 3.4 extends our analysis to consider the expected number of true discoveries.
Let $\mathcal{S}(W)$ denote the discovery set obtained by applying the feature statistic $W$ to the data. An oracle statistic $W^{\mathrm{oracle}}$ maximizes the expected number of discoveries under $P^\star$ as defined below:
$$W^{\mathrm{oracle}} \in \operatorname*{arg\,max}_{W} \; \mathbb{E}_{P^\star}\big[\,|\mathcal{S}(W)|\,\big], \qquad (1.1)$$
where the maximum is taken over all valid feature statistics.
Next, let $\{P_\theta : \theta \in \Theta\}$ denote a model class of potential distributions for the data and let $\pi$ denote a prior density over $\Theta$. (We implicitly assume all elements of the model class are consistent with the core assumptions of MX/FX knockoffs (see Section 1.1). For example, when employing MX knockoffs, all $P_\theta$ should specify the correct marginal law for $X$. Although this is not necessary for our theoretical results, it is necessary for the computational techniques in Section 4.) A Bayes-optimal statistic maximizes the average-case expected number of discoveries with respect to $\pi$:
$$W^{\pi} \in \operatorname*{arg\,max}_{W} \; \mathbb{E}_{P_\pi}\big[\,|\mathcal{S}(W)|\,\big], \qquad (1.2)$$
where above, $P_\pi$ denotes the mixture distribution which first samples a parameter $\theta \sim \pi$ and then samples the data from $P_\theta$. We refer to $P_\pi$ as a “Bayesian model,” and we give a default choice of $P_\pi$ (based on sparse generalized additive models) in Section 4.2. Our paper introduces statistics that asymptotically maximize (1.1) and (1.2), respectively: an oracle masked likelihood ratio statistic and its Bayesian counterpart.
While introducing a prior distribution may seem a strong assumption to some readers, Bayesian models are routinely used in applications where knockoffs are commonly applied, such as genetic fine-mapping (e.g., Guan and Stephens, 2011; Weissbrod et al., 2020). Furthermore, in simulations and real applications, our approach yields significant power gains over pre-existing approaches even when the prior is highly misspecified. Lastly, we emphasize that Theorem 1.1 guarantees that using an MLR statistic as a feature statistic yields frequentist FDR control under the true law of the data, even if $P^\star$ does not belong to the model class; in this Type I error result, the conditional independence null hypotheses are defined nonparametrically with respect to the unknown $P^\star$, not with respect to $P_\pi$.
1.3 Contribution and overview of results
This paper develops masked likelihood ratio (MLR) statistics, a class of feature statistics that are asymptotically optimal, computationally efficient, and powerful in applications. To derive these statistics, we reformulate MX knockoffs as a guessing game on the masked data $D = \big(\{\mathbf{x}_j, \widetilde{\mathbf{x}}_j\}_{j=1}^p, \mathbf{y}\big)$, where the notation $\{\cdot\}$ denotes an unordered set. (For brevity, this section only presents results for MX knockoffs; see Section 3 for analogous results for the FX case.) After observing $D$, the analyst must do as follows:
• For each $j$, the analyst must produce a guess of the value of the feature $\mathbf{x}_j$ based on $D$. Note that given $D$, $\mathbf{x}_j$ takes one of two values.
• The analyst must then assign an order to their guesses, ideally from most to least promising.
• The analyst may make $k$ discoveries if roughly a $\frac{1}{1+q}$ fraction of their first $k$ guesses are correct (according to the order they specify), where $q$ is the FDR level. Here, guess $j$ is “correct” if it equals the true value of $\mathbf{x}_j$.
We show that to maximize the expected number of discoveries, an asymptotically optimal strategy is:
• For each $j$, guess whichever of the two possible values of $\mathbf{x}_j$ is conditionally more likely given $D$ (see below).
• Order the guesses in descending order of the probability that each guess is correct.
In the traditional language of knockoffs, this corresponds to using the masked data $D$ to compute the log-likelihood ratio between the two possible values of $\mathbf{x}_j$ (namely $\mathbf{x}_j$ and $\widetilde{\mathbf{x}}_j$) given $D$. Precisely, let $P_\pi$ denote a Bayesian model as defined in Section 1.2, and for any fixed $D$, let $p_\pi(\cdot, \cdot \mid D)$ denote the conditional law of $(\mathbf{x}_j, \widetilde{\mathbf{x}}_j)$ given $D$ under $P_\pi$. The masked likelihood ratio (MLR) statistic is defined as
$$\mathrm{MLR}_j^{\pi} = \log \frac{p_\pi\big(\mathbf{x}_j, \widetilde{\mathbf{x}}_j \mid D\big)}{p_\pi\big(\widetilde{\mathbf{x}}_j, \mathbf{x}_j \mid D\big)}. \qquad (1.3)$$
In words, the numerator plugs the observed values of $\mathbf{x}_j$ and $\widetilde{\mathbf{x}}_j$ into the conditional law of $(\mathbf{x}_j, \widetilde{\mathbf{x}}_j)$ given $D$ under $P_\pi$, and the denominator plugs $\widetilde{\mathbf{x}}_j$ and $\mathbf{x}_j$ into the same law. Since swapping $\mathbf{x}_j$ and $\widetilde{\mathbf{x}}_j$ flips the sign of $\mathrm{MLR}_j^\pi$ without changing the values of the other coordinates, this equation defines a valid knockoff statistic. See Section 1.4 for comparison to Katsevich and Ramdas (2020)’s (unmasked) likelihood ratio statistic.
This paper gives three arguments motivating the use of MLR statistics.
1. Intuition: the right notion of variable importance. Existing feature statistics measure many different proxies for variable importance, ranging from regression coefficients to posterior inclusion probabilities. However, Section 2 shows that popular lasso-based methods incorrectly prioritize features that are predictive of $\mathbf{y}$ but are nearly indistinguishable from their knockoffs, leading to low power in real applications. In contrast to conventional variable importances, MLR statistics instead estimate whether a feature $\mathbf{x}_j$ is distinguishable from its knockoff $\widetilde{\mathbf{x}}_j$.
2. Theory: asymptotic Bayes-optimality. Section 3.3 shows that MLR statistics asymptotically maximize the number of expected discoveries under $P_\pi$. Namely, under technical assumptions, Theorem 3.2 shows that for any valid feature statistic $W$,
$$\lim_{p \to \infty} \frac{\mathbb{E}_{P_\pi}\big[\,|\mathcal{S}(\mathrm{MLR}^{\pi})|\,\big]}{\mathbb{E}_{P_\pi}\big[\#\,\text{non-nulls}\big]} \;\ge\; \lim_{p \to \infty} \frac{\mathbb{E}_{P_\pi}\big[\,|\mathcal{S}(W)|\,\big]}{\mathbb{E}_{P_\pi}\big[\#\,\text{non-nulls}\big]}. \qquad (1.4)$$
Our result applies to arbitrarily high-dimensional asymptotic regimes and allows $P_\pi$ to take any form: we do not assume that $\mathbf{y}$ follows a linear model under $P_\pi$. Instead, we assume the signs of the MLR statistics satisfy a local dependency condition, similar to dependency conditions often assumed on p-values (Genovese and Wasserman, 2004; Storey et al., 2004; Ferreira and Zwinderman, 2006; Farcomeni, 2007). Our condition does not involve unknown quantities, so it can be diagnosed in practice.
Despite the Bayesian nature of this optimality result, we emphasize that MLR statistics are valid knockoff statistics. Thus, if $\mathcal{S}(\mathrm{MLR}^\pi)$ are the discoveries made by the MLR feature statistic, Theorem 1.1 shows that the frequentist FDR is controlled in finite samples assuming only that $\widetilde{\mathbf{X}}$ are valid knockoffs:
$$\mathbb{E}_{P^\star}\left[\frac{|\mathcal{S}(\mathrm{MLR}^\pi) \cap \mathcal{H}_0|}{\max(1, |\mathcal{S}(\mathrm{MLR}^\pi)|)}\right] \le q. \qquad (1.5)$$
3. Empirical results. We demonstrate via simulations and three real data analyses that MLR statistics are powerful in practice, even when the user-specified Bayesian model is highly misspecified.
• We develop concrete instantiations of MLR statistics based on uninformative (sparse) priors for generalized additive models and binary GLMs. Our Python implementation is often faster than fitting a cross-validated lasso.
• In simulations, MLR statistics outperform other state-of-the-art feature statistics, often by wide margins. Even when $P_\pi$ is highly misspecified, MLR statistics often nearly match the performance of the oracle which sets $P_\pi = P^\star$. Furthermore, when $\mathbf{y}$ has a highly nonlinear relationship with $\mathbf{X}$, MLR statistics also outperform “black-box” feature statistics based on neural networks and random forests.
• We replicate three knockoff-based analyses of drug resistance (Barber and Candès, 2015), financial factor selection (Challet et al., 2021), and RNA-seq data (Li and Maathuis, 2019). We find that MLR statistics (with an uninformative prior) make one to ten times more discoveries than the original analyses.
Overall, our results suggest that MLR statistics can substantially increase the power of knockoffs.
1.4 Related literature
The literature contains many feature statistics, which can (roughly) be separated into three categories. First, perhaps the most common feature statistics are based on penalized regression coefficients, notably the lasso signed maximum (LSM) and lasso coefficient difference (LCD) statistics (Barber and Candès,, 2015). Indeed, these lasso-based statistics are often used in applied work (e.g., Sesia et al.,, 2019) and have received much theoretical attention (Weinstein et al.,, 2017; Fan et al.,, 2020; Ke et al.,, 2020; Weinstein et al.,, 2020; Wang and Janson,, 2021). Perhaps surprisingly, we argue that many of these statistics target the wrong notion of variable importance, leading to reduced power. Second, some works have introduced Bayesian knockoff statistics (e.g., Candès et al.,, 2018; Ren and Candès,, 2020). MLR statistics have a Bayesian flavor but take a different form than previous statistics. Furthermore, our motivation differs from those of previous works: the real innovation of MLR statistics is to estimate a masked likelihood ratio, and we mainly use a Bayesian framework to quantify uncertainty about nuisance parameters (see Section 3.2). In contrast, previous works largely motivated Bayesian statistics as a way to incorporate prior information (Candès et al.,, 2018; Ren and Candès,, 2020). That said, an important special case of MLR statistics is similar to the “BVS” statistics from Candès et al., (2018), as discussed in Section 4. Third, many feature statistics take advantage of “black-box” ML to assign variable importances (e.g., Lu et al.,, 2018; Gimenez et al.,, 2019). Empirically, our implementation of MLR statistics based on regression splines outperforms “black-box” feature statistics in Section 5.
Previous power analyses for knockoffs have largely focused on showing the consistency of coefficient-difference feature statistics (Liu and Rigollet,, 2019; Fan et al.,, 2020; Spector and Janson,, 2022) or quantifying the power of coefficient-difference feature statistics assuming has i.i.d. Gaussian entries (Weinstein et al.,, 2017, 2020; Wang and Janson,, 2021). Ke et al., (2020) also derive a phase diagram for LCD statistics assuming is blockwise orthogonal. Our goal is different: to show that MLR statistics are asymptotically optimal, with particular focus on settings where the asymptotic power lies strictly between and . Furthermore, the works above exclusively focus on Gaussian linear models, whereas our analysis places no explicit restrictions on the law of or the dimensionality of the problem. Instead, we assume the signs of the MLR statistics satisfy a local dependency condition, similar to common dependency conditions on p-values (Genovese and Wasserman,, 2004; Storey et al.,, 2004; Ferreira and Zwinderman,, 2006; Farcomeni,, 2007). However, our proof technique is novel and specific to knockoffs.
Our theory builds on Li and Fithian, (2021), who developed knockoff, a provably optimal oracle statistic for FX knockoffs—in fact, oracle MLR statistics are equivalent to knockoff for FX knockoffs. Our work also builds on Katsevich and Ramdas, (2020), who showed that unmasked likelihood statistics maximize . MLR statistics also have this property, although we show the stronger result that MLR statistics maximize the expected number of overall discoveries. Another key difference is that unmasked likelihood statistics are not jointly valid knockoff statistics (see Appendix D.1). Thus, unmasked likelihood statistics do not yield provable FDR control, whereas MLR statistics do. Lastly, we note that the oracle procedures derived in these two works cannot be used in practice since they depend on unknown parameters. To our knowledge, MLR statistics are the first usable knockoff statistics with explicit optimality guarantees.
1.5 Notation and outline
Notation: Let $\mathbf{X} \in \mathbb{R}^{n \times p}$ and $\mathbf{y} \in \mathbb{R}^n$ denote the design matrix and response vector in a feature selection problem with $n$ data points and $p$ features. We let the non-bold versions $X \in \mathbb{R}^p$ and $y \in \mathbb{R}$ denote the features and response for any arbitrary observation. For $m \in \mathbb{N}$, define $[m] = \{1, \dots, m\}$. For any $S \subseteq [p]$ and matrix $\mathbf{X}$, $\mathbf{X}_S$ denotes the columns of $\mathbf{X}$ corresponding to the indices in $S$. Similarly, $\mathbf{X}_{-S}$ denotes the columns of $\mathbf{X}$ which do not appear in $S$, and $\mathbf{X}_{-j}$ denotes all columns except column $j$. For matrices $\mathbf{A}, \mathbf{B}$ with the same number of rows, $[\mathbf{A}, \mathbf{B}]$ denotes the column-wise concatenation of $\mathbf{A}$ and $\mathbf{B}$. $I_n$ denotes the $n \times n$ identity. Throughout, $P^\star$ denotes the true (unknown) law of the data, and $P_\pi$ denotes a user-specified Bayesian model of the law of the data as defined in Section 1.2.
Outline: Section 2 gives intuition explaining why popular feature statistics may have low power, using an HIV resistance dataset as motivation. Section 3 introduces MLR statistics and presents our theoretical results. Section 4 discusses computation and suggests default choices of the Bayesian model . Section 5 compares MLR statistics to competitors via simulations. Section 6 applies MLR statistics to three real datasets. Section 7 discusses future directions.
2 Intuition and motivation from an HIV drug resistance dataset
2.1 Intuition: what makes knockoffs powerful?
Given a vector of knockoff statistics , the number of discoveries is determined as follows:
• Step 1: Let $\sigma$ denote the permutation which sorts the statistics in decreasing order of absolute value, so that $|W_{\sigma(1)}| \ge |W_{\sigma(2)}| \ge \dots \ge |W_{\sigma(p)}|$.
• Step 2: Let $k$ be the largest integer such that at least $\frac{k+1}{1+q}$ of the $k$ feature statistics with the largest absolute values have positive signs. Then the analyst may discover the features corresponding to the positive signs among $W_{\sigma(1)}, \dots, W_{\sigma(k)}$.
The procedure above is equivalent to using the “data-dependent threshold” from Section 1.1 (see Barber and Candès,, 2015). This characterization suggests that to make many discoveries, we should:
• Goal 1: Maximize the probability that each coordinate $W_j$ has a positive sign. (Note that the signs of null coordinates are guaranteed to be symmetric.)
• Goal 2: Assign absolute values such that coordinates with larger absolute values also have higher probabilities of being positive. This ensures that for each $k$, the $k$ feature statistics with the highest absolute values contain as many positive signs as possible, thus maximizing the number of discoveries. Although it is not yet clear how to formalize this goal, intuitively, we would like the ordering of $|W_j|$ to match the ordering of the probabilities that each $W_j$ is positive. (A short simulation illustrating the value of this ordering appears after this list.)
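The following toy simulation illustrates why Goal 2 matters: two vectors of statistics with identical signs but different orderings of their absolute values can yield very different numbers of discoveries. The numbers below are hypothetical and only meant to illustrate the rejection rule from Section 1.1.

```python
import numpy as np

def n_rejections(W, q):
    """Number of discoveries under the data-dependent threshold from Section 1.1."""
    for t in np.sort(np.abs(W[W != 0])):
        if (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t)) <= q:
            return int(np.sum(W >= t))
    return 0

rng = np.random.default_rng(0)
q = 0.1
# The first 100 coordinates behave like non-nulls (positive with probability 0.95);
# the remaining 400 behave like nulls (symmetric signs).
signs = np.concatenate([rng.choice([1.0, -1.0], 100, p=[0.95, 0.05]),
                        rng.choice([1.0, -1.0], 400)])
# "Good" ordering: large |W_j| exactly where the sign is likely to be positive.
W_good = signs * np.concatenate([rng.uniform(1, 2, 100), rng.uniform(0, 1, 400)])
# "Bad" ordering: identical signs, but |W_j| is unrelated to the sign probabilities.
W_bad = signs * rng.uniform(0, 2, 500)
print(n_rejections(W_good, q), n_rejections(W_bad, q))  # the good ordering discovers far more
```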
2.2 Motivation from the HIV drug resistance dataset
We now ask: do the most common choices of feature statistics used in the literature, LCD and LSM statistics, accomplish Goals 1-2? We argue no, using the HIV drug resistance dataset from Rhee et al. (2006) as an illustrative example. This dataset has been used as a benchmark in several papers about knockoffs, e.g., Barber and Candès (2015); Romano et al. (2020), and we perform a complete analysis of this dataset in Section 6. For now, note that the design $\mathbf{X}$ consists of genotype data from HIV samples, the response $\mathbf{y}$ measures the resistance of each sample to a drug (in this case Indinavir), and we apply knockoffs to discover which genetic variants affect drug resistance; our analysis exactly mimics that of Barber and Candès (2015). As notation, let $\hat\beta(\lambda)$ denote the estimated lasso coefficients fit on $[\mathbf{X}, \widetilde{\mathbf{X}}]$ and $\mathbf{y}$ with regularization parameter $\lambda$. Furthermore, let $\lambda_j$ (resp. $\widetilde{\lambda}_j$) denote the value of $\lambda$ at which $\mathbf{x}_j$ (resp. $\widetilde{\mathbf{x}}_j$) first enters the lasso path. Then the LCD and LSM statistics are defined as:
$$W_j^{\mathrm{LCD}} = \big|\hat\beta_j(\lambda_{\mathrm{CV}})\big| - \big|\hat\beta_{j+p}(\lambda_{\mathrm{CV}})\big|, \qquad W_j^{\mathrm{LSM}} = \max\big(\lambda_j, \widetilde{\lambda}_j\big) \cdot \operatorname{sign}\big(\lambda_j - \widetilde{\lambda}_j\big). \qquad (2.1)$$
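For readers unfamiliar with path-based statistics, the sketch below computes the λ-entry values from the lasso path via scikit-learn's `lars_path` and combines them with a signed-max rule; this is a common variant of the LSM construction and is not necessarily identical to the statistic used in the original analyses.

```python
import numpy as np
from sklearn.linear_model import lars_path

def lsm_statistics(X, X_tilde, y):
    """Signed-max statistics built from lasso-path entry values (illustrative variant)."""
    n, p = X.shape
    features = np.hstack([X, X_tilde])
    alphas, _, coefs = lars_path(features, y, method="lasso")
    # Entry value of each column: the largest alpha (on the computed path) at which
    # its coefficient is nonzero; alphas are returned in decreasing order.
    Z = np.zeros(2 * p)
    for j in range(2 * p):
        nonzero = np.flatnonzero(coefs[j] != 0)
        if nonzero.size > 0:
            Z[j] = alphas[nonzero[0]]
    return np.maximum(Z[:p], Z[p:]) * np.sign(Z[:p] - Z[p:])
```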
As intuition, imagine that a feature $\mathbf{x}_j$ appears to influence $\mathbf{y}$; however, due to high correlations within $\mathbf{X}$, we must create a knockoff $\widetilde{\mathbf{x}}_j$ which is highly correlated with $\mathbf{x}_j$. For example, the “P90.M” variant in the HIV dataset is extremely predictive of resistance to Indinavir (IDV), as it has a very large OLS t-statistic. However, in the original analysis, P90.M is highly correlated with its knockoff, so the lasso may select the knockoff instead of the feature. Furthermore, since the lasso induces sparsity, it is unlikely to select both the feature and its knockoff, as they are highly correlated. Thus, $W_j^{\mathrm{LCD}}$ and $W_j^{\mathrm{LSM}}$ will have large absolute values, since the feature appears significant, and a reasonably high probability of being negative, since the lasso may select the knockoff. Indeed, the LCD and LSM statistics for P90.M have, respectively, the largest and second-largest absolute values among all genetic variants, but both statistics are negative because the lasso selected the knockoff instead of the feature.
Figure 1 shows that this misprioritization prevents the LCD and LSM statistics from making any discoveries in this analysis. Yet this problem is avoidable. If $|W_j|$ is large but $W_j$ has a substantial probability of being negative, we can “deprioritize” feature $j$ by lowering $|W_j|$. As shown by Figure 1, this is exactly what MLR statistics do for the P90.M variant. Although this does not allow us to discover P90.M, it does allow us to discover other features.
Remark 1 (Alternative solutions).
To our knowledge, this problem with lasso statistics has not been previously discussed (see Section 1.4). Once pointed out, there are many intuitive approaches that mitigate (but do not fully solve) this problem, such as studentizing the coefficients or adding a ridge penalty. These practical ideas may merit further exploration; however, we focus on obtaining optimal feature statistics. Furthermore, some may argue that the best solution is simply to ensure that P90.M is less correlated with its knockoff. We wholeheartedly agree that the SDP knockoff construction from Barber and Candès, (2015) is sub-optimal here, and thus, we also use an alternative knockoff construction in Section 6. Yet reasonably strong correlations between features and knockoffs are inevitable when the features are correlated (Dai and Barber,, 2016). However, knockoffs can still be powerful in these settings if the feature statistics properly account for the (known) dependencies among . MLR statistics are designed to do this.
3 Masked likelihood ratio statistics
This section introduces and analyzes MLR statistics. First, Section 3.1 introduces the notation needed to define MLR statistics. Then, Section 3.2 defines MLR statistics, and Section 3.3 shows that MLR statistics asymptotically maximize the expected number of discoveries. Finally, Section 3.4 introduces an adjusted MLR statistic that asymptotically maximizes the expected number of true discoveries.
3.1 Knockoffs as inference on masked data
Section 2 argued that to maximize power, we should assign $W_j$ a large absolute value if and only if $\mathbb{P}(W_j > 0)$ is large. To do this, we must estimate $\mathbb{P}(W_j > 0)$ from the data, but we cannot use all the data for this purpose: e.g., we cannot directly adjust $|W_j|$ based on $\mathrm{sign}(W_j)$ without violating FDR control. To resolve this ambiguity, we reformulate knockoffs as inference on masked data.
Definition 3.1.
Suppose we observe data $(\mathbf{X}, \mathbf{y})$, knockoffs $\widetilde{\mathbf{X}}$, and independent random noise $U$. ($U$ may be used to fit a randomized feature statistic.) The masked data $D$ is defined as
$$D := \Big(\big\{\mathbf{x}_j, \widetilde{\mathbf{x}}_j\big\}_{j=1}^{p},\ \mathbf{y},\ U\Big). \qquad (3.1)$$
As shown in Propositions 3.1-3.2, the masked data is all the data we may use when assigning magnitudes to $W$, and knockoffs will be powerful when we can recover the full data from $D$.
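As a concrete illustration of Definition 3.1, one way to represent the masked data in code is to store each unordered pair {x_j, x̃_j} in a canonical (e.g., lexicographic) order, so that downstream computations cannot depend on which column is the real feature. The representation below is an assumption of this sketch, not a requirement of the theory.

```python
import numpy as np

def masked_data(X, X_tilde, y, rng=None):
    """Masked data: each pair {x_j, x_tilde_j} stored in a canonical order, plus y (and noise U)."""
    n, p = X.shape
    pairs = np.empty((n, p, 2))
    for j in range(p):
        a, b = X[:, j], X_tilde[:, j]
        # Lexicographic ordering of the two columns erases which one is the real feature.
        lo, hi = (a, b) if tuple(a) <= tuple(b) else (b, a)
        pairs[:, j, 0], pairs[:, j, 1] = lo, hi
    U = None if rng is None else rng.standard_normal(p)  # optional randomization noise
    return pairs, y, U
```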
Proposition 3.1.
Let $\widetilde{\mathbf{X}}$ be model-X knockoffs such that $\mathbf{x}_j \ne \widetilde{\mathbf{x}}_j$ a.s. for each $j$. Then $W$ is a valid feature statistic if and only if:
1. $|W|$ is a function of the masked data $D$.
2. For all $j$, there exists a $D$-measurable random vector $\hat{\mathbf{x}}_j$ (a “guess”) such that $W_j > 0$ if and only if $\hat{\mathbf{x}}_j = \mathbf{x}_j$.
Proposition 3.1 reformulates knockoffs as a guessing game, where we produce a “guess” $\hat{\mathbf{x}}_j$ of the value of $\mathbf{x}_j$ based on $D$. If our guess is right, meaning $\hat{\mathbf{x}}_j = \mathbf{x}_j$, then we are rewarded and $W_j > 0$; else $W_j < 0$. To avoid highly negative statistics, we should only assign $W_j$ a large absolute value when we are confident that our “guess” is correct. We discuss more implications of this result in the next section; for now, we obtain an analogous result for fixed-X knockoffs (similar to a result from Li and Fithian (2021)) by substituting the fixed-X analogue of the masked data.
Proposition 3.2.
Let $\widetilde{\mathbf{X}}$ be fixed-X knockoffs satisfying $\mathbf{x}_j \ne \widetilde{\mathbf{x}}_j$ a.s. for each $j$. Then $W$ is a valid feature statistic if and only if:
1. $|W|$ is a function of the (fixed-X) masked data.
2. For each $j$, there exists a masked-data-measurable random variable (a “guess”) such that $W_j > 0$ if and only if the guess is correct.
Remark 2.
Propositions 3.1-3.2 hold for knockoffs as defined in Barber and Candès, (2015); Candès et al., (2018). However, in the fixed-X case, one can also augment to include , where is the OLS projection matrix of while preserving validity (Chen et al.,, 2019; Li and Fithian,, 2021). Our theory also applies to this extension of the knockoffs framework.
3.2 Introducing masked likelihood ratio (MLR) statistics
We now introduce masked likelihood ratio (MLR) statistics in two steps. First, we introduce oracle MLR statistics, which depend on the unknown law of the data. Then, we introduce Bayesian MLR statistics, which substitute for . Throughout, we focus on MX knockoffs, but analogous results for FX knockoffs merely replace with (see Definition 3.2).
Step 1: Oracle MLR statistics. We now apply Proposition 3.1 to achieve the two intuitive optimality criteria from Section 2.
• Goal 1 asks that we maximize $\mathbb{P}_{P^\star}(W_j > 0)$. Proposition 3.1 shows that ensuring $W_j > 0$ is equivalent to correctly guessing the value of $\mathbf{x}_j$ from the masked data $D$. Thus, to maximize $\mathbb{P}_{P^\star}(W_j > 0)$, the analyst should guess the value which maximizes the likelihood that the guess is correct.
• Goal 2 asks us to order the guesses in descending order of $\mathbb{P}_{P^\star}(W_j > 0 \mid D)$, i.e., in descending order of the likelihood that each guess is correct.
Both goals are achieved by using the masked data to compute a log-likelihood ratio between the two possible values of $\mathbf{x}_j$ (namely $\mathbf{x}_j$ and $\widetilde{\mathbf{x}}_j$). This defines the oracle masked likelihood ratio:
$$\mathrm{MLR}_j^{\mathrm{oracle}} = \log \frac{p^\star\big(\mathbf{x}_j, \widetilde{\mathbf{x}}_j \mid D\big)}{p^\star\big(\widetilde{\mathbf{x}}_j, \mathbf{x}_j \mid D\big)}, \qquad (3.2)$$
where $p^\star(\cdot, \cdot \mid D)$ is the true (unknown) conditional law of $(\mathbf{x}_j, \widetilde{\mathbf{x}}_j)$ given $D$. Soon, Proposition 3.3 will verify that $\mathrm{MLR}^{\mathrm{oracle}}$ achieves both goals above, and Section 3.3 shows that $\mathrm{MLR}^{\mathrm{oracle}}$ asymptotically maximizes the expected number of discoveries under $P^\star$ (under regularity conditions).
Step 2: Bayesian MLR statistics. $\mathrm{MLR}^{\mathrm{oracle}}$ cannot be used in practice since it depends on $P^\star$. A heuristic solution is to “plug in” an estimate for $P^\star$. For example, given some model class $\{P_\theta\}$ of the law of the data, one could estimate $\hat\theta$ using the masked data and replace $P^\star$ with $P_{\hat\theta}$. However, this “plug-in” approach can perform poorly, since knockoffs are most popular in high-dimensional settings with significant uncertainty about the true value of any unknown parameters. Thus, to account for uncertainty, we suggest replacing $P^\star$ with a Bayesian model $P_\pi$.
Definition 3.2 (MLR statistics).
For any Bayesian model $P_\pi$ (see Section 1.2), we define the model-X masked likelihood ratio (MLR) statistic below:
$$\mathrm{MLR}_j^{\pi} := \log \frac{p_\pi\big(\mathbf{x}_j, \widetilde{\mathbf{x}}_j \mid D\big)}{p_\pi\big(\widetilde{\mathbf{x}}_j, \mathbf{x}_j \mid D\big)}, \qquad (3.3)$$
where $p_\pi(\cdot, \cdot \mid D)$ denotes the conditional law of $(\mathbf{x}_j, \widetilde{\mathbf{x}}_j)$ given $D$ under $P_\pi$.
The fixed-X MLR statistic is analogous but replaces with . In particular, if denotes the conditional law of under , then
(3.4) |
To see how MLR statistics account for uncertainty about nuisance parameters, let denote the posterior density of under . We can write, e.g., in the model-X case:
(3.5) |
Unlike the “plug-in” approach, does not rely on a single estimate of —instead, it takes the weighted average of the likelihoods , weighted by the posterior law of under .
We now verify that under , MLR statistics achieve the intuitive criteria from Section 2 (Goals 1-2). This result applies to oracle MLR statistics, since in the special case where .
Proposition 3.3.
Let be the MLR statistics with respect to a Bayesian model . Then for any other feature statistic ,
(3.6) |
Furthermore, has the same order as . More precisely,
(3.7) |
Equation (3.7) shows that the absolute values have the same order as , so under , MLR statistics prioritize the hypotheses “correctly.” More generally, if is predictive of but is nearly indistinguishable from , should be small, since suggests . Thus, MLR statistics should rarely be highly negative (see Figure 1).
Lastly, we make two connections to the literature. First, one proposal in Ren and Candès, (2020) also suggests ranking the hypotheses by (see their footnote 8). That said, Ren and Candès, (2020) do not propose a feature statistic accomplishing this. Rather, they develop “adaptive knockoffs,” an extension of knockoffs that can be combined with any predefined feature statistic, including MLR or lasso statistics. Indeed, using better initial feature statistics should increase the power of adaptive knockoffs, so our contribution is both orthogonal and complementary to theirs (see Appendix D.2 for more details). Second, Katsevich and Ramdas, (2020) show that the unmasked likelihood statistic maximizes ; indeed, our work builds on theirs. However, there are two key differences. First, unlike MLR statistics, the unmasked likelihood statistic is not a valid knockoff statistic even though it is marginally symmetric under the null (see Appendix D.1), so it does not provably control the FDR. Second, MLR statistics have additional guarantees on their magnitudes (Eq. 3.7), allowing us to show much stronger theoretical results in Section 3.3.
3.3 MLR statistics are asymptotically optimal
We now show that MLR statistics asymptotically maximize , the expected number of discoveries under . Indeed, Proposition 3.3 might make one hope that MLR statistics exactly maximize , since MLR statistics exactly accomplish Goals 1-2 from Section 2. This intuition is correct under the conditional independence condition below (generalizing Li and Fithian, (2021) Proposition 2).
Proposition 3.4.
If are conditionally independent given under , then
for any valid feature statistic .
Furthermore, in Gaussian linear models, oracle MLR statistics satisfy this conditional independence condition, making them finite-sample optimal.
Proposition 3.5.
Suppose that (i) are FX knockoffs or Gaussian conditional MX knockoffs (Huang and Janson,, 2020) and (ii) under , . Then under , are conditionally independent.
Absent this independence condition, it may be possible to exploit dependencies among the coordinates of to slightly improve power. Yet Appendix B.3 shows that to improve power even slightly seems to require pathological dependencies, making it hard to imagine that accounting for dependencies can substantially increase power in practice. Formally, we now show that MLR statistics are asymptotically optimal under regularity conditions on the dependence of .
To this end, consider any asymptotic regime where we observe and construct knockoffs . For each , let denote a Bayesian model based on a model class and prior density . Let denote the masked data (Definition 3.1). For a sequence of feature statistics , let denote the rejection set of when controlling the FDR at level . So far, we have made no assumptions about the law of under , and we allow the dimension to grow arbitrarily with . To analyze the asymptotic behavior of MLR statistics under , we need two main assumptions.
Assumption 3.1 (Sparsity).
For each $p$, let $s_p$ denote the number of non-nulls under $P_\theta$ and let $\mathbb{E}_{P_\pi}[s_p]$ denote the expected number of non-nulls under $P_\pi$. We assume $\mathbb{E}_{P_\pi}[s_p] \to \infty$ as $p \to \infty$.
Assumption 3.1 allows for many previously studied sparsity regimes, such as polynomial (Donoho and Jin,, 2004; Ke et al.,, 2020) and linear (e.g., Weinstein et al.,, 2017) sparsity regimes.
Assumption 3.2 (Local dependence).
Under $P_\pi$, the conditional covariance matrix of the signs of the MLR statistics given the masked data $D$ decays exponentially off its diagonal. Formally, there exist constants $C, c > 0$, not depending on $p$, such that
$$\Big|\mathrm{Cov}_{P_\pi}\big(\operatorname{sign}(\mathrm{MLR}_j^{\pi}),\ \operatorname{sign}(\mathrm{MLR}_k^{\pi}) \,\big|\, D\big)\Big| \;\le\; C e^{-c|j-k|} \quad \text{for all } j \ne k. \qquad (3.8)$$
Assumption 3.2 quantifies the requirement that are not “too” conditionally dependent. Similar local dependence conditions are common in the multiple testing literature (Genovese and Wasserman,, 2004; Storey et al.,, 2004; Ferreira and Zwinderman,, 2006; Farcomeni,, 2007), although previous assumptions are typically made about p-values. We justify this assumption below.
1. This assumption is intuitively plausible because knockoffs guarantee that the signs of the null coordinates of $W$ are independent given $D$ under $P^\star$, regardless of the correlations among the features (Barber and Candès, 2015). This independence also holds for non-null coordinates in Gaussian linear models (see Prop. 3.5). Appendix B.9 gives additional informal intuition explaining why this result often holds approximately under $P_\pi$ for both null and non-null variables.
2. Empirically, the conditional covariance matrix in Eq. (3.8) is nearly indistinguishable from a diagonal matrix in all of our simulations and three real analyses. This suggests that Assumption 3.2 holds in practice.
3. This assumption can also be diagnosed in real applications, since it depends only on $P_\pi$, which is specified by the analyst. To this end, Section 4 shows how to compute the relevant conditional law; a sketch of such a diagnostic appears after this list.
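Since the conditional covariance in Assumption 3.2 depends only on $P_\pi$ and the masked data, it can be estimated from the same posterior samples used to compute MLR statistics (see Section 4). The sketch below is a hypothetical diagnostic: it assumes `sign_samples` is an (iterations × p) array of posterior draws of the signs of the MLR statistics, recoded as ±1.

```python
import numpy as np

def local_dependence_diagnostic(sign_samples):
    """Summaries of the estimated Cov(sign(MLR) | D) to check off-diagonal decay."""
    cov = np.cov(sign_samples, rowvar=False)             # (p x p) empirical covariance
    p = cov.shape[0]
    max_offdiag = np.abs(cov - np.diag(np.diag(cov))).max()
    # Maximum absolute covariance at each lag |j - k|, to eyeball the decay.
    decay = [np.abs(np.diag(cov, k)).max() for k in range(1, min(p, 20))]
    return max_offdiag, decay
```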
With these two assumptions, we show that MLR statistics asymptotically maximize the expected number of discoveries normalized by the expected number of non-nulls:
$$\mathrm{ND}_p(W) := \frac{\mathbb{E}_{P_\pi}\big[\,|\mathcal{S}(W)|\,\big]}{\mathbb{E}_{P_\pi}\big[s_p\big]}. \qquad (3.9)$$
Theorem 3.2.
Consider any high-dimensional asymptotic regime where we observe data and knockoffs, with $D$ denoting the masked data. Let $P_\pi$ be a sequence of Bayesian models of the data satisfying Assumptions 3.1-3.2, and let $\mathrm{MLR}^{\pi}$ denote the MLR statistics with respect to $P_\pi$. Let $W$ denote any other sequence of feature statistics.
Then, if the limits $\lim_{p \to \infty} \mathrm{ND}_p(\mathrm{MLR}^{\pi})$ and $\lim_{p \to \infty} \mathrm{ND}_p(W)$ exist, we have that
$$\lim_{p \to \infty} \mathrm{ND}_p(\mathrm{MLR}^{\pi}) \;\ge\; \lim_{p \to \infty} \mathrm{ND}_p(W) \qquad (3.10)$$
holds for all but countably many FDR levels $q \in (0, 1)$.
Theorem 3.2 shows that MLR statistics asymptotically maximize the (normalized) number of expected discoveries without any explicit assumptions on the relationship between $\mathbf{X}$ and $\mathbf{y}$ or the dimensionality. Besides Assumptions 3.1-3.2, we also assume that the quantities we aim to study actually exist, i.e., the limits in (3.10) exist; however, even this assumption can be relaxed (see Appendix B.5). Yet the weakest aspect of Theorem 3.2 is that MLR statistics are only provably optimal under $P_\pi$. If $P_\pi$ and $P^\star$ are quite different, MLR statistics may not perform well. For this reason, Section 4.2 suggests practical choices of $P_\pi$ that performed well empirically, even under misspecification.
3.4 Maximizing the expected number of true discoveries
We now introduce adjusted MLR (AMLR) statistics, which asymptotically maximize the number of expected true discoveries under . Empirically, AMLR and MLR statistics perform similarly, but AMLR statistics are less intuitive and depend somewhat counterintuitively on the FDR level. (This is why our paper focuses mostly on MLR statistics.) Thus, for brevity, this section gives only a little intuition and a slightly informal theorem statement. Please see Appendix B.7 for a rigorous theorem statement.
We begin with notation. For , denotes the set of non-nulls under and denotes the random set of non-nulls under . Then, is the conditional probability that is positive and the th feature is non-null given the masked data. Finally, define the following ratio :
(3.11) |
Definition 3.3.
With this notation, we now define AMLR statistics in two cases.
• Case 1: if .
• Case 2: Otherwise, with , we define
(3.12) By construction, all AMLR statistics in Case 2 have smaller absolute values than all statistics in Case 1. Note that Appendix E.4 shows how to compute AMLR statistics.
Corollary 3.1.
AMLR statistics from Definition 3.3 are valid knockoff statistics.
MLR and AMLR statistics have the same signs but different absolute values. To understand why, Appendix B.7 argues that maximizing the expected number of true discoveries can be formulated as a simple linear program where the “benefit” of prioritizing a feature is —the probability that is positive and is non-null—and the “cost” is . The intuition is that to make discoveries, of the feature statistics with the largest absolute values must have positive signs. Thus, measures the difference between and the (conditional) probability that is positive. Feature has a negative cost if it produces a “surplus” of positive signs in expectation.
The optimal solution to this problem is to (a) maximally prioritize all features with negative costs by giving them the highest absolute values—i.e., the features in Case 1 above—and (b) prioritize all other features in descending order of the benefit-cost ratio . This is accomplished by the AMLR formulas in Definition 3.3. In contrast, MLR statistic magnitudes are a decreasing function of only the costs . By incorporating the benefit , AMLR statistics reduce the expected number of discoveries while increasing the expected number of true discoveries. See Appendix B.7 for further details.
AMLR and MLR statistics are different but not too different, since typically, has a similar order as . (When the orders are the same, AMLR and MLR statistics yield identical rejection sets.) Indeed, Figure 2 shows in a simple simulation that the power of AMLR and MLR statistics is nearly identical.
We now show that AMLR statistics asymptotically maximize power under $P_\pi$ (see Appendix B.7 for a formal statement and proof). For any statistic $W$ with discovery set $\mathcal{S}(W)$, let $\mathrm{ETP}(W)$ denote the expected number of true positives under $P_\pi$:
$$\mathrm{ETP}(W) := \mathbb{E}_{P_\pi}\big[\,|\mathcal{S}(W) \cap \mathcal{H}_1|\,\big], \qquad (3.13)$$
where $\mathcal{H}_1$ denotes the (random) set of non-nulls.
Theorem 3.3 (Informal).
Suppose the conditions of Theorem 3.2 hold. Furthermore, suppose that (i) the local dependence condition in Assumption 3.2 holds when replacing with and (ii) the coefficient of variation of the number of non-nulls is bounded as . Then for any sequence of feature statistics ,
(3.14) |
where is the expected number of non-nulls under , as defined in Assumption 3.1.
4 Computing MLR statistics
4.1 General strategy
We now show how to compute MLR and AMLR statistics by Gibbs sampling from the conditional law of the full data given the masked data under $P_\pi$. For brevity, we focus on the MX setting; Appendix E discusses the FX case.
The key idea is that conditional on $\mathbf{X}_{-j}$, $\mathbf{y}$, and the latent parameter $\theta$, sampling from the conditional law of $\mathbf{x}_j$ is easy. In particular, for any fixed $\theta$, observing the masked data implies that $\mathbf{x}_j$ must lie in the two-element set $\{\mathbf{x}_j^{\mathrm{obs}}, \widetilde{\mathbf{x}}_j^{\mathrm{obs}}\}$. Lemma E.1 shows that the conditional likelihood ratio equals:
$$\frac{\mathbb{P}\big(\mathbf{x}_j = \mathbf{x}_j^{\mathrm{obs}} \mid \theta, \mathbf{X}_{-j}, D\big)}{\mathbb{P}\big(\mathbf{x}_j = \widetilde{\mathbf{x}}_j^{\mathrm{obs}} \mid \theta, \mathbf{X}_{-j}, D\big)} \;=\; \frac{p_\theta\big(\mathbf{y} \mid \mathbf{x}_j = \mathbf{x}_j^{\mathrm{obs}}, \mathbf{X}_{-j}\big)}{p_\theta\big(\mathbf{y} \mid \mathbf{x}_j = \widetilde{\mathbf{x}}_j^{\mathrm{obs}}, \mathbf{X}_{-j}\big)}. \qquad (4.1)$$
The right-hand side of Eq. (4.1) is easy to compute for most parametric models $\{P_\theta\}$, since it only involves computing the likelihood of $\mathbf{y}$ given $\mathbf{X}$. Thus, we can easily sample from the conditional law of $\mathbf{x}_j$.
To sample from the conditional law of the full data given the masked data, Algorithm 1 describes a Gibbs sampler which (i) for $j = 1, \dots, p$, resamples $\mathbf{x}_j$ from its conditional law given $\theta$, $\mathbf{X}_{-j}$, and the masked data, and (ii) resamples $\theta$ from its posterior given the current imputation of $\mathbf{X}$. Step (ii) can be done using any off-the-shelf Bayesian sampler (Brooks et al., 2011), since this step is identical to a typical Bayesian regression. Lemma 4.1 shows that Algorithm 1 correctly computes the MLR statistics as the number of Gibbs iterations grows, under standard regularity conditions (Robert and Casella, 2004). These mild conditions are satisfied by our default choices (Example 1), but they can also be relaxed further (see Appendix E.3).
Input: masked data $D$, a model class $\{P_\theta\}$ and prior $\pi$.
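The sketch below illustrates the flavor of Algorithm 1 for a Gaussian linear model, under simplifying assumptions that differ from the paper's implementation: the noise variance, slab variance, and sparsity level are fixed rather than given hyperpriors, and the coefficient updates are the textbook spike-and-slab updates (George and McCulloch, 1997). The sampler alternates between resampling which column in each pair is the real feature and resampling the regression coefficients, and the MLR statistic is the resulting posterior log-odds.

```python
import numpy as np

def mlr_statistics_gibbs(X, X_tilde, y, sigma2=1.0, tau2=1.0, p0=0.1,
                         n_iter=2000, burn_in=500, seed=0):
    """Sketch of MLR statistics for a Gaussian linear model with a fixed
    spike-and-slab prior. The sampler only interacts with the unordered pairs
    {x_j, x_tilde_j} and y (the masked data): it alternates between (i) guessing
    which column in each pair is the real feature given the current coefficients
    and (ii) resampling the coefficients given the current guesses."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    pairs = np.stack([X, X_tilde], axis=2)            # pairs[:, j, k] for k in {0, 1}
    assign = rng.integers(2, size=p)                  # current guess for each pair
    cols = np.stack([pairs[:, j, assign[j]] for j in range(p)], axis=1)
    beta = np.zeros(p)
    resid = y - cols @ beta                           # equals y since beta = 0
    real_counts = np.zeros(p)                         # iterations where guess = observed X[:, j]

    for it in range(n_iter):
        for j in range(p):
            r_j = resid + cols[:, j] * beta[j]        # residual excluding feature j
            # (i) Resample the guess for pair j given beta, as in Eq. (4.1):
            # compare the Gaussian likelihood of r_j under the two candidate columns.
            if beta[j] != 0.0:
                ll = [-np.sum((r_j - pairs[:, j, k] * beta[j]) ** 2) / (2 * sigma2)
                      for k in (0, 1)]
                prob0 = 1.0 / (1.0 + np.exp(np.clip(ll[1] - ll[0], -700, 700)))
            else:
                prob0 = 0.5                           # both columns equally likely
            assign[j] = 0 if rng.random() < prob0 else 1
            cols[:, j] = pairs[:, j, assign[j]]

            # (ii) Spike-and-slab update of beta_j given the current design column.
            c = cols[:, j]
            s2 = 1.0 / (c @ c / sigma2 + 1.0 / tau2)  # conditional posterior variance
            mu = s2 * (c @ r_j) / sigma2              # conditional posterior mean
            log_bf = 0.5 * (np.log(s2 / tau2) + mu ** 2 / s2)   # slab-vs-spike Bayes factor
            p_incl = p0 / (p0 + (1 - p0) * np.exp(np.clip(-log_bf, -700, 700)))
            beta[j] = rng.normal(mu, np.sqrt(s2)) if rng.random() < p_incl else 0.0
            resid = r_j - cols[:, j] * beta[j]

        if it >= burn_in:
            real_counts += (assign == 0)

    n_kept = n_iter - burn_in
    # MLR_j: log-odds that the observed feature column (rather than its knockoff)
    # is the one the posterior believes is real; +1 smoothing avoids infinities.
    return np.log((real_counts + 1.0) / (n_kept - real_counts + 1.0))
```

Because the swap updates treat each pair {x_j, x̃_j} symmetrically and the initialization is random, the magnitudes of the returned statistics depend on the data only through the masked data, as required by Proposition 3.1.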
Lemma 4.1.
Let be defined as in Algorithm 1. Suppose that under , (i) a.s. for and (ii) the support of equals . Then as ,
Remark 5.
In the special case of Gaussian linear models with a sparse prior on the coefficients , Algorithm 1 is similar in flavor to the “Bayesian Variable Selection” (BVS) feature statistic from Candès et al., (2018), although there are differences in the Gibbs sampler and the final estimand. Broadly, we see our work as complementary to theirs. Yet aside from technical details, a main difference is that Candès et al., (2018) seemed to argue that the advantage of BVS was to incorporate accurate prior information. In contrast, we argue that MLR statistics can improve power even without prior information (see Section 5) by estimating the right notion of variable importance.
4.2 A default choice of Bayesian model
Below, we describe a class of Bayesian models that is computationally efficient and can flexibly model both linear and nonlinear relationships. Note that to specify $P_\theta$, it suffices to model the law of $\mathbf{y} \mid \mathbf{X}$, since the law of $\mathbf{X}$ is assumed known in the MX case and $\mathbf{X}$ is fixed in the FX case.
Example 1 (Sparse generalized additive model).
For linear coefficients $\beta = (\beta_1, \dots, \beta_p)$ and noise variance $\sigma^2$, let $\theta = (\beta, \sigma^2)$. For a prespecified set of basis functions $\phi$, we consider the model class $\{P_\theta\}$ where
$$\mathbf{y} \mid \mathbf{X} \sim \mathcal{N}\Big(\textstyle\sum_{j=1}^{p} \phi(\mathbf{x}_j)\,\beta_j,\ \sigma^2 I_n\Big), \qquad (4.3)$$
where $\phi(\mathbf{x}_j)$ applies the basis functions to $\mathbf{x}_j$ row-wise and $\beta_j$ denotes the corresponding block of coefficients.
By default, we take $\phi$ to be the identity function, which reduces to a Gaussian linear model. However, if $\mathbf{X}$ and $\mathbf{y}$ may have nonlinear relationships, we suggest taking $\phi$ to be the basis representation of regression splines (see Hastie et al. (2001) for review), as we do in Section 5.2. For the prior, we let $\pi$ denote the law of $\theta$ after sampling from the following process:
• Sample hyperparameters $p_0$ (sparsity), $\tau^2$ (signal size), and $\sigma^2$ (noise variance). By default, we take uninformative hyperpriors.
• Sample $\beta_j \overset{\mathrm{i.i.d.}}{\sim} (1 - p_0)\,\delta_0 + p_0\,\mathcal{N}(0, \tau^2)$ for $j = 1, \dots, p$.
This group-sparse prior is effectively a “two-groups” model, as feature $j$ is null if and only if $\beta_j = 0$. As shown in Section 5, using these hyperpriors allows us to adaptively estimate the sparsity level.
Standard techniques for “spike-and-slab” models (George and McCulloch,, 1997) allow us to compute the MLR statistics from Ex. 1 in operations (assuming )—see Appendix E for review. This cost is cheaper than computing Gaussian MX or FX knockoffs, which requires operations. Fitting the LASSO has a comparable cost, which is using coordinate descent or using the LARS algorithm (Efron et al.,, 2004).
5 Simulations
We now analyze the power of MLR statistics in simulations. Throughout, MLR statistics do not have accurate prior information: we use exactly the same choice of Bayesian model (the default from Section 4.2) to compute MLR statistics in every plot. Also, we let be highly correlated to test whether MLR statistics perform well even when Assumption 3.2 may fail. Nonetheless, MLR statistics uniformly outperform existing competitors.
The FDR level is . All plots have two standard deviation error bars, although the bars may be too small to be visible. In each plot, knockoffs provably control the frequentist FDR, so we only plot power. All code is available at https://github.com/amspector100/mlr_knockoff_paper.
5.1 Gaussian linear models
In this section, we sample $\mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X}\beta, \sigma^2 I_n)$ for sparse $\beta$. We draw the rows of $\mathbf{X}$ i.i.d. from $\mathcal{N}(0, \Sigma)$ for two choices of $\Sigma$. By default, $\Sigma$ corresponds to a highly correlated nonstationary AR(1) process, inspired by real genetic design matrices. However, we also analyze an “ErdosRenyi” covariance matrix where $\Sigma$ is generated from a sparse Erdős–Rényi graph with the nonzero entries drawn uniformly at random. We compute both “SDP” and “MVR” knockoffs (Candès et al., 2018; Spector and Janson, 2022) to show that MLR statistics perform well in both cases. See Appendix G for further simulation details.
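For concreteness, the snippet below generates data loosely resembling the AR(1) setting described above; the dimensions, correlation strengths, and sparsity level are placeholders rather than the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sparsity = 300, 500, 0.1

# Nonstationary AR(1)-style correlation: Corr(X_j, X_k) is the product of the
# adjacent correlations between indices j and k.
rhos = rng.uniform(0.7, 0.95, size=p - 1)          # placeholder correlation strengths
cum = np.concatenate([[1.0], np.cumprod(rhos)])    # cum[j] = product of rhos up to index j
idx = np.arange(p)
Sigma = cum[np.maximum.outer(idx, idx)] / cum[np.minimum.outer(idx, idx)]

# Sparse Gaussian linear model on top of the correlated design.
X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T
beta = np.zeros(p)
nonnull = rng.choice(p, size=int(sparsity * p), replace=False)
beta[nonnull] = rng.normal(0.0, 0.5, size=nonnull.size)
y = X @ beta + rng.standard_normal(n)
```

Knockoffs, feature statistics, and the selection step can then be applied to (X, y) as described in Sections 1.1 and 4; the authors' knockpy package provides implementations of these components.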
We compare four feature statistics. First, we compute MLR statistics using the default Bayesian model from Section 4—in plots, “MLR” refers to this version of MLR statistics. Second, we compute LCD and LSM statistics as described in Section 2. Lastly, we compute the oracle MLR statistics which have full knowledge of the true value of . Figure 3 shows the results while varying in low dimensions (using FX knockoffs) and high dimensions (using MX knockoffs). It shows that MLR statistics are substantially more powerful than the lasso-based statistics and, in the FX case, MLR statistics almost perfectly match the power of the oracle. Indeed, this result holds even for the “ErdosRenyi” covariance matrix, where exhibits strong non-local dependencies (in contrast to Assumption 3.2). Furthermore, Figure 4 shows that MLR statistics are computationally efficient, often faster than a cross-validated lasso and comparable to the cost of computing FX knockoffs.
Next, we analyze the performance of MLR statistics when the prior is misspecified. In Figure 5, we vary the sparsity (proportion of non-nulls) between and , and we draw the non-null coefficients as (i) heavy-tailed i.i.d. Laplace variables and (ii) “light-tailed” i.i.d. variables. In all cases, the MLR prior assumes the non-null coefficients are i.i.d. with sparsity . Nonetheless, MLR statistics consistently outperform the lasso-based statistics and nearly match the performance of the oracle.
Lastly, we verify that the local dependence condition assumed in Theorem 3.2 holds empirically. We consider the AR(1) setting but modify the parameters so that is extremely highly correlated, with adjacent correlations drawn as i.i.d. variables. We also consider a setting where is equicorrelated with correlation . In both cases, Figure 6 shows that has entries which decay off the main diagonal—in fact, the maximum off-diagonal covariance across both examples is . Please see Section 3.3 and Appendix B.9 for intuition behind this result, although we cannot perfectly explain it.
5.2 Generalized additive models
We now sample for some non-linear function applied element-wise to . We consider the AR(1) setting from Section 5.1 with four choices of : . We compare six feature statistics: linear MLR statistics, MLR based on cubic regression splines with one knot, the LCD, a random forest with swap importances as in Gimenez et al., (2019), and DeepPINK (Lu et al.,, 2018), which is based on a feedforward neural network. This setting is more challenging than the linear setting, since the feature statistics must learn (or approximate) the function . Thus, our simulations in this section are low-dimensional with , and we should not expect any feature statistic to match the performance of the oracle MLR statistics.
Figure 7 shows that “MLR (splines)” uniformly outperforms every other feature statistic, often by wide margins. Linear MLR and LCD statistics are powerless in the and quadratic settings, where is an even function and thus the non-null features have no linear relationship with the response. However, in the and cubic settings, linear MLR statistics outperform the LCD, suggesting that linear MLR statistics can be powerful under misspecification as long as there is some linear effect.
5.3 Logistic regression
Lastly, we now consider the setting of logistic regression, so we sample where is the sigmoid function. We run the same simulation setting as Figure 3, except that now is binary and we consider low-dimensional settings, since inference in logistic regression is generally more challenging than in linear regression. The results are shown by Figure 8, which shows that MLR statistics outperform the LCD, although there is a substantial gap between the performances of the MLR and oracle MLR statistics.
6 Real applications
In this section, we apply MLR statistics to three real datasets which have been previously analyzed using knockoffs. We use the same default choice of MLR statistics from our simulations in all three applications. In each case, MLR statistics have comparable or higher power than competitor statistics. All code and data are available at https://github.com/amspector100/mlr_knockoff_paper.
6.1 HIV drug resistance
We begin with the HIV drug resistance dataset from Rhee et al., (2006), which (e.g.) Barber and Candès, (2015) previously analyzed using knockoffs. The dataset consists of genotype data from HIV samples as well as drug resistance measurements for different drugs, and the goal is to discover genetic variants that affect drug resistance for each of the drugs. Furthermore, Rhee et al., (2005) published treatment-selected mutation panels for this setting, so we can check whether any discoveries made by knockoffs are corroborated by this separate analysis.
We preprocess and model the data following Barber and Candès, (2015). Then, we apply FX knockoffs with LCD, LSM, and MLR statistics and FDR level . For both MVR and SDP knockoffs, Figure 9 shows the total number of discoveries made by each statistic, stratified by whether each discovery is corroborated by Rhee et al., (2005). For SDP knockoffs, the MLR statistics make nearly an order of magnitude more discoveries than the competitor methods with a comparable corroboration rate. For MVR knockoffs, MLR and LCD statistics perform roughly equally well, although MLR statistics make more discoveries with a slightly higher corroboration rate. Overall, in this setting, MLR statistics are competitive with and sometimes substantially outperform the lasso-based statistics. See Appendix H for specific results for each drug.
6.2 Financial factor selection
In finance, analysts often aim to select factors that drive the performance of a particular asset. Challet et al., (2021) applied FX knockoffs to factor selection, and as a benchmark, they tested which US equities explain the performance of an index fund for the energy sector (XLE). Here, the ground truth is available since the index fund is a weighted combination of a known list of stocks.
We perform the same analysis for ten index funds of key sectors of the US economy, including energy, technology, and more (see Appendix H). Here, is the index fund’s daily log return and contains the daily log returns of each stock in the S&P 500 since , so and . We compute fixed-X MVR and SDP knockoffs and apply LCD, LSM, and MLR statistics. Figure 10 shows the number of true and false discoveries summed across all index funds with . In particular, MLR statistics make and more discoveries than the LCD for MVR and SDP knockoffs (respectively), and the LSM makes more than times fewer discoveries than the MLR statistics. Thus, MLR statistics substantially outperform the lasso-based statistics. Appendix H also shows that the FDP (averaged across all index funds) is well below for each method.
6.3 Graphical model discovery for gene networks
Lastly, we consider the problem of recovering a gene network from single-cell RNAseq data. Our analysis follows Li and Maathuis, (2019), who model gene expression log counts as a Gaussian graphical model (see Li and Maathuis, (2019) for justification of the Gaussian assumption). In particular, they develop an extension of FX knockoffs that detects edges in Gaussian graphical models while controlling the FDR across discovered edges. They applied this method to RNAseq data from Zheng et al., (2017). The ground truth is not available, so following Li and Maathuis, (2019), we only evaluate methods based on the number of discoveries they make.
We replicate this analysis and compare LCD, LSM, and MLR statistics. Figure 11 plots the number of discoveries as a function of . MLR statistics make the most discoveries for nearly every value of , although often by a small margin. For small , the LSM statistic performs poorly, and for large , the LCD statistic performs poorly, whereas the MLR statistic is consistently powerful.
7 Discussion
This paper introduces masked likelihood ratio statistics, a class of asymptotically Bayes-optimal knockoff statistics. We show in simulations and three applications that MLR statistics are efficient and powerful. However, our work leaves open several directions for future research.
• MLR statistics are asymptotically Bayes optimal. However, it might be worthwhile to develop minimax-optimal knockoff statistics, e.g., by computing a “least-favorable” prior.
• Our theory requires a “local dependency” condition which is challenging to verify analytically, although it can be diagnosed using the data at hand. It might be interesting to investigate (i) precisely when this condition holds and (ii) if MLR statistics are still optimal when it fails.
• We only consider classes of MLR statistics designed for binary GLMs and generalized additive models. However, other types of MLR statistics could be more powerful, e.g., those based on Bayesian additive regression trees (Chipman et al., 2010).
• In practice, analysts may prefer to discover features with large effect sizes. E.g., in Section 2, the P90.M variant has a large estimated OLS coefficient; thus, while it is particularly hard to discover, it may be particularly valuable to discover. In principle, the Bayesian framework in Section 3.4 could be used to find knockoff statistics which asymptotically maximize many different notions of power, e.g., the sum of squared coefficient sizes across all discovered variables.
8 Acknowledgements
The authors thank John Cherian, Kevin Guo, Lucas Janson, Lihua Lei, Basil Saeed, Anav Sood, and Timothy Sudijono for valuable comments. A.S. is partially supported by a Citadel GQS PhD Fellowship, the Two Sigma Graduate Fellowship Fund, and an NSF Graduate Research Fellowship. W.F. is partially supported by the NSF DMS-1916220 and a Hellman Fellowship from Berkeley.
References
- Albert and Chib, (1993) Albert, J. H. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669–679.
- Barber and Candès, (2015) Barber, R. F. and Candès, E. J. (2015). Controlling the false discovery rate via knockoffs. Ann. Statist., 43(5):2055–2085.
- Barber et al., (2020) Barber, R. F., Candès, E. J., and Samworth, R. J. (2020). Robust inference with knockoffs. Ann. Statist., 48(3):1409–1431.
- Bates et al., (2020) Bates, S., Candès, E., Janson, L., and Wang, W. (2020). Metropolized knockoff sampling. Journal of the American Statistical Association, 0(0):1–15.
- Brooks et al., (2011) Brooks, S., Gelman, A., Jones, G., and Meng, X. (2011). Handbook of Markov Chain Monte Carlo. CRC Press, United States.
- Candès et al., (2018) Candès, E., Fan, Y., Janson, L., and Lv, J. (2018). Panning for gold: Model-X knockoffs for high-dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B, 80(3):551–577.
- Challet et al., (2021) Challet, D., Bongiorno, C., and Pelletier, G. (2021). Financial factors selection with knockoffs: Fund replication, explanatory and prediction networks. Physica A: Statistical Mechanics and its Applications, 580:126105.
- Chen et al., (2019) Chen, J., Hou, A., and Hou, T. Y. (2019). A prototype knockoff filter for group selection with FDR control. Information and Inference: A Journal of the IMA, 9(2):271–288.
- Chipman et al., (2010) Chipman, H. A., George, E. I., and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266 – 298.
- Dai and Barber, (2016) Dai, R. and Barber, R. (2016). The knockoff filter for fdr control in group-sparse and multitask regression. In Balcan, M. F. and Weinberger, K. Q., editors, International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1851–1859, New York, New York, USA. PMLR.
- Donoho and Jin, (2004) Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. The Annals of Statistics, 32(3):962 – 994.
- Doukhan and Neumann, (2007) Doukhan, P. and Neumann, M. H. (2007). Probability and moment inequalities for sums of weakly dependent random variables, with applications. Stochastic Processes and their Applications, 117(7):878–903.
- Efron et al., (2004) Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2):407 – 499.
- Fan et al., (2020) Fan, Y., Demirkaya, E., Li, G., and Lv, J. (2020). Rank: Large-scale inference with graphical nonlinear knockoffs. Journal of the American Statistical Association, 115(529):362–379. PMID: 32742045.
- Farcomeni, (2007) Farcomeni, A. (2007). Some results on the control of the false discovery rate under dependence. Scandinavian Journal of Statistics, 34(2):275–297.
- Ferreira and Zwinderman, (2006) Ferreira, J. A. and Zwinderman, A. H. (2006). On the Benjamini–Hochberg method. The Annals of Statistics, 34(4):1827 – 1849.
- Fithian and Lei, (2020) Fithian, W. and Lei, L. (2020). Conditional calibration for false discovery rate control under dependence.
- Genovese and Wasserman, (2004) Genovese, C. and Wasserman, L. (2004). A stochastic process approach to false discovery control. The Annals of Statistics, 32(3):1035 – 1061.
- George and McCulloch, (1997) George, E. I. and McCulloch, R. E. (1997). Approaches for bayesian variable selection. Statistica Sinica, 7(2):339–373.
- Gimenez et al., (2019) Gimenez, J. R., Ghorbani, A., and Zou, J. Y. (2019). Knockoffs for the mass: New feature importance statistics with false discovery guarantees. In Chaudhuri, K. and Sugiyama, M., editors, The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, volume 89 of Proceedings of Machine Learning Research, pages 2125–2133. PMLR.
- Guan and Stephens, (2011) Guan, Y. and Stephens, M. (2011). Bayesian variable selection regression for genome-wide association studies and other large-scale problems. The Annals of Applied Statistics, 5(3):1780 – 1815.
- Hastie et al., (2001) Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA.
- Huang and Janson, (2020) Huang, D. and Janson, L. (2020). Relaxing the assumptions of knockoffs by conditioning. Ann. Statist., 48(5):3021–3042.
- Katsevich and Ramdas, (2020) Katsevich, E. and Ramdas, A. (2020). On the power of conditional independence testing under model-X.
- Ke et al., (2020) Ke, Z. T., Liu, J. S., and Ma, Y. (2020). Power of FDR control methods: The impact of ranking algorithm, tampered design, and symmetric statistic. arXiv preprint arXiv:2010.08132.
- Li and Maathuis, (2019) Li, J. and Maathuis, M. H. (2019). GGM knockoff filter: False discovery rate control for Gaussian graphical models.
- Li and Fithian, (2021) Li, X. and Fithian, W. (2021). Whiteout: When do fixed-X knockoffs fail?
- Liu and Rigollet, (2019) Liu, J. and Rigollet, P. (2019). Power analysis of knockoff filters for correlated designs. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
- Lu et al., (2018) Lu, Y. Y., Fan, Y., Lv, J., and Noble, W. S. (2018). DeepPINK: Reproducible feature selection in deep neural networks. In Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 8690–8700.
- Martello and Toth, (1990) Martello, S. and Toth, P. (1990). Knapsack problems: algorithms and computer implementations. John Wiley & Sons, Inc., USA.
- Polson et al., (2013) Polson, N. G., Scott, J. G., and Windle, J. (2013). Bayesian inference for logistic models using pólya–gamma latent variables. Journal of the American Statistical Association, 108(504):1339–1349.
- Ren and Candès, (2020) Ren, Z. and Candès, E. J. (2020). Knockoffs with side information. Annals of Applied Statistics.
- Rhee et al., (2005) Rhee, S.-Y., Fessel, W. J., Zolopa, A. R., Hurley, L., Liu, T., Taylor, J., Nguyen, D. P., Slome, S., Klein, D., Horberg, M., Flamm, J., Follansbee, S., Schapiro, J. M., and Shafer, R. W. (2005). HIV-1 protease and reverse-transcriptase mutations: Correlations with antiretroviral therapy in subtype B isolates and implications for drug-resistance surveillance. The Journal of Infectious Diseases, 192(3):456–465.
- Rhee et al., (2006) Rhee, S.-Y., Taylor, J., Wadhera, G., Ben-Hur, A., Brutlag, D. L., and Shafer, R. W. (2006). Genotypic predictors of human immunodeficiency virus type 1 drug resistance. Proceedings of the National Academy of Sciences, 103(46):17355–17360.
- Robert and Casella, (2004) Robert, C. and Casella, G. (2004). Monte Carlo statistical methods. Springer Verlag.
- Romano et al., (2020) Romano, Y., Sesia, M., and Candès, E. (2020). Deep knockoffs. Journal of the American Statistical Association, 115(532):1861–1872.
- Sechidis et al., (2021) Sechidis, K., Kormaksson, M., and Ohlssen, D. (2021). Using knockoffs for controlled predictive biomarker identification. Statistics in Medicine, 40(25):5453–5473.
- Sesia et al., (2019) Sesia, M., Katsevich, E., Bates, S., Candès, E., and Sabatti, C. (2019). Multi-resolution localization of causal variants across the genome. bioRxiv.
- Sesia et al., (2018) Sesia, M., Sabatti, C., and Candès, E. J. (2018). Gene hunting with hidden Markov model knockoffs. Biometrika, 106(1):1–18.
- Spector and Janson, (2022) Spector, A. and Janson, L. (2022). Powerful knockoffs via minimizing reconstructability. The Annals of Statistics, 50(1):252–276.
- Storey et al., (2004) Storey, J. D., Taylor, J. E., and Siegmund, D. (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(1):187–205.
- Wang and Janson, (2021) Wang, W. and Janson, L. (2021). A high-dimensional power analysis of the conditional randomization test and knockoffs. Biometrika.
- Weinstein et al., (2017) Weinstein, A., Barber, R. F., and Candès, E. J. (2017). A power analysis for knockoffs under Gaussian designs. IEEE Transactions on Information Theory.
- Weinstein et al., (2020) Weinstein, A., Su, W. J., Bogdan, M., Barber, R. F., and Candès, E. J. (2020). A power analysis for knockoffs with the lasso coefficient-difference statistic. arXiv.
- Weissbrod et al., (2020) Weissbrod, O., Hormozdiari, F., Benner, C., Cui, R., Ulirsch, J., Gazal, S., Schoech, A., van de Geijn, B., Reshef, Y., Márquez-Luna, C., O’Connor, L., Pirinen, M., Finucane, H. K., and Price, A. L. (2020). Functionally-informed fine-mapping and polygenic localization of complex trait heritability. Nature Genetics, 52:1355–1363.
- Zheng et al., (2017) Zheng, G., Terry, J., Belgrader, P., Ryvkin, P., Bent, Z., Wilson, R., Ziraldo, S., Wheeler, T., McDermott, G., Zhu, J., Gregory, M., Shuga, J., Montesclaros, L., Underwood, J., Masquelier, D., Nishimura, S., Schnall-Levin, M., Wyatt, P., Hindson, C., Bharadwaj, R., Wong, A., Ness, K., Beppu, L., Deeg, H., McFarland, C., Loeb, K., Valente, W., Ericson, N., Stevens, E., Radich, J., Mikkelsen, T., Hindson, B., and Bielas, J. (2017). Massively parallel digital transcriptional profiling of single cells. Nature Communications.
Appendix A An illustration of the importance of the order of the absolute values
Section 2.1 argues intuitively that a good knockoff statistic should roughly achieve the following goals:
-
1.
For each , it should maximize .
-
2.
The order of should match the order of —i.e., should be an increasing function of .
Section 3 formalizes (and slightly modifies) these goals to develop an asymptotically optimal test statistic. However, to build intuition, we now give a concrete (if contrived) example showing the importance of the second goal.
Consider a setting with features where features have a large signal size with , features have a moderate signal size with , the last features are null with , and are independent. In this case, what absolute values should take to maximize power?
To make any discoveries and control the FDR at level , we must ensure that of the feature statistics with the largest absolute values have positive signs (for some due to the ceiling function in Step 2 in Section 2.1). Since only the features with large signal sizes have a chance of being positive, making any discoveries is extremely unlikely unless the features with large signal sizes generally have the highest absolute values. Figure 12 illustrates this argument—in the “random prioritization” setting, we sample , and in the “oracle prioritization” setting, we set . As expected, knockoffs makes zero discoveries with random prioritization and discoveries with oracle prioritization.
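For concreteness, the following Python sketch mimics this experiment using the standard knockoff (SeqStep) stopping rule. The group sizes, sign probabilities, and FDR level are hypothetical placeholders rather than the values used to produce Figure 12, and num_discoveries is our own illustrative implementation, not knockpy code.

# A minimal simulation sketch of the example above; all numbers are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
q = 0.2                                      # nominal FDR level (hypothetical)
n_large, n_moderate, n_null = 20, 80, 400    # hypothetical group sizes

# P(W_j > 0): high for strong signals, moderate for weak signals, 1/2 for nulls.
p_pos = np.concatenate([np.full(n_large, 0.95),
                        np.full(n_moderate, 0.65),
                        np.full(n_null, 0.50)])
signs = np.where(rng.random(p_pos.size) < p_pos, 1.0, -1.0)

def num_discoveries(abs_w, sgn, q):
    """Discoveries made by knockoffs for the statistics W = abs_w * sgn."""
    order = np.argsort(-abs_w)               # sort by decreasing |W|
    s = sgn[order]
    pos, neg = np.cumsum(s > 0), np.cumsum(s < 0)
    fdp_hat = (1.0 + neg) / np.maximum(pos, 1)
    ok = np.where(fdp_hat <= q)[0]
    return int(pos[ok.max()]) if ok.size else 0

random_abs = rng.random(p_pos.size)          # random prioritization
oracle_abs = p_pos                           # oracle prioritization: |W| tracks P(W_j > 0)
print(num_discoveries(random_abs, signs, q))  # usually zero
print(num_discoveries(oracle_abs, signs, q))  # usually substantial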

Appendix B Main proofs and interpretation
In this section, we prove the main results of the paper. We also offer additional discussion of these results.
B.1 Knockoffs as inference on masked data
In this section, we prove Propositions 3.1 and 3.2, Lemma 3.1, and one more related corollary which will be useful when proving Theorem 3.2. As notation, for any matrices , let denote the matrix but with the th column of and swapped; similarly, swaps all columns of and .
Proposition 3.1.
Let be model-X knockoffs such that a.s. for . Then is a valid feature statistic if and only if:
-
1.
is a function of the masked data .
-
2.
For all , there exists a -measurable random vector such that if and only if .
Proof.
Forward direction: Suppose is a valid feature statistic; we will now show conditions (i) and (ii). To show (i), note that observing is equivalent to observing for some unobserved chosen uniformly at random. Define and let . Then by the swap invariance property of knockoffs, we have that . Since is a function of , this implies is a function of as well, which proves (i).
To show (ii), we construct as follows. Let be any “other” random vector chosen such that . Then define
Intuitively, we set if and only if , since this will guarantee that whenever .
Note that is a function of and therefore is -measurable. To show it is well-defined (does not depend on ), note that can only take one of three values conditional on . Thus, it suffices to show that the events and do not depend on the random set .
To show that the event does not depend on , recall iff ; since , this event does not depend on .
To show that the event does not depend on , it suffices to show if and only if , which also shows (ii). There are two cases. In the first case, if , then by definition of and also by the “flip-sign” property of . Thus if and only if . The second case is analogous: if , then , so if and only if . In both cases, if and only if , proving (ii).
Backwards direction: To show is a valid feature statistic, it suffices to show the flip-sign property, namely that , where denotes elementwise multiplication and is the vector of all ones but with negative ones at the indices in . To do this, note that is invariant to swaps of and , so because by assumption are a function of . Furthermore, for any , we have that if and only if ; however, since is also a function of , we have that if and only if . This completes the proof. ∎
The proof of Proposition 3.2 is identical to the proof of Proposition 3.1, so we omit it for brevity.
We now prove Lemma 3.1.
Lemma 3.1.
Proof.
For the MX case, we will show that for any , , where denotes elementwise multiplication and is the vector of ones but with negative ones at the indices in . To show this, note that the masked data is invariant to swaps. Therefore, applying Eq. 3.3 yields
(B.1) |
Since is an antisymmetric function, this proves that
(B.2) |
which completes the proof. The proof for the FX case is analogous. ∎
Finally, the following corollary of Propositions 3.1 and 3.2 will be important when proving Theorem 3.2.
Corollary B.1.
Let be two knockoff feature statistics. Then in the same setting as Propositions 3.1 and 3.2, the event is a deterministic function of the masked data .
Proof.
We give the proof for the model-X case, and the fixed-X case is analogous. First, note that the events and are -measurable events since are -measurable by Proposition 3.1. Therefore, the only non-trivial case is the case where , which we now consider.
By Proposition (3.1), there exist -measurable vectors such that and . Since must take one of exactly two distinct values, this implies that
where the right-most expression is -measurable since are -measurable. This completes the proof. ∎
B.2 Proof of Proposition 3.3
Proposition 3.3.
Given data and knockoffs , let be the MLR statistics with respect to some Bayesian model . Let be any other valid knockoff feature statistic. Then,
(3.6) |
Furthermore, has the same order as . More precisely,
(3.7) |
Proof.
First, we prove Eq. (3.6). Let be the “best guess” of the value of based on , and observe that by definition if and only if . Similarly, Proposition 3.1 proves that there exists some alternative -measurable random variable such that if and only if . However, we note that by definition of ,
(B.3) |
which completes the proof of Eq. (3.6).
To prove Eq. (3.7), observe , since conditional on we observe the set . Therefore,
where the second step uses the fact that for as defined above. This completes the proof for model-X knockoffs; the proof in the fixed-X case is analogous and just replaces with . ∎
B.3 How far from optimality are MLR statistics in finite samples?
Our main result (Theorem 3.2) shows that MLR statistics asymptotically maximize the number of discoveries made by knockoffs under . However, before rigorously proving Theorem 3.2, we give intuition suggesting that MLR statistics are likely nearly optimal even in finite samples.
Recall from Section 3.2 that in finite samples, MLR statistics (i) maximize for each and (ii) ensure that the absolute values of the feature statistics have the same order as the probabilities . As per Proposition 3.4, this strategy is exactly optimal when the coordinates of the sign vector are conditionally independent given , but in general, it is possible to exploit conditional dependencies among the coordinates of to slightly improve power. However, we argue below that it is challenging to even slightly improve power without unrealistically strong dependencies.
To see why this is the case, consider a very simple setting with features and FDR level , so knockoffs will make discoveries exactly when the first five -statistics with the largest absolute values have positive signs. Suppose that are perfectly correlated and satisfy , and satisfies . Since has the highest chance of being positive, MLR statistics will assign it the highest absolute value, in which case knockoffs will make discoveries with probability . However, in this example, knockoffs will be more powerful if we ensure that have the five largest absolute values, since their signs are perfectly correlated and thus . (Note that there is nothing special about positive correlations in this example: one can also find similar examples where negative correlations among can be prioritized to slightly increase power.)
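The following short calculation illustrates this trade-off numerically; the probabilities p_block and p_single are hypothetical stand-ins and not the exact values used in the example above.

# A numeric sketch of the example above with hypothetical probabilities.
# Features 1-5 have perfectly correlated signs that are all positive with
# probability p_block; feature 6 has an independent positive sign with
# probability p_single; discoveries are made only if the five largest |W|
# all have positive signs.
p_block = 0.8    # hypothetical: P(features 1-5 all have positive signs)
p_single = 0.9   # hypothetical: P(feature 6 has a positive sign)

# Prioritization that puts feature 6 first: the top five are feature 6 plus
# four features from the correlated block.
p_feature6_first = p_single * p_block   # 0.72

# Prioritization that gives the correlated block the five largest |W|.
p_block_first = p_block                 # 0.80

print(p_feature6_first, p_block_first)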
This example has two properties which shed light on the more general situation. First, to even get a slight improvement in power required extremely strong dependencies among the coordinates of , which is not realistic. Indeed, empirically in Figure 6, the coordinates of appear to be almost completely conditionally uncorrelated even when is extremely highly correlated. Thus, although it may be possible to slightly improve power by exploiting dependencies among , the magnitude of the improvement in power is likely to be small. Second, the reason that it is possible to exploit dependencies to improve power in this case is because knockoffs has a “hard” threshold where one can only make any discoveries if one makes at least 5 discoveries, and exploiting conditional correlations among the vector can slightly improve the probability that we reach that initial threshold. However, this “threshold” phenomenon is less important in situations where knockoffs are guaranteed to make at least a few discoveries; thus, if the number of discoveries grows with , this effect should be insignificant asymptotically.
B.4 Proof of Theorem 3.2
Notation: For any vector and , we let be the sample mean of the first elements of . For , we let equal the sample mean of the vector plus additional zeros. Additionally, for and a permutation , denotes the coordinates of permuted according to , so that . Throughout this section, all probabilities and expectations are taken under .
Main idea: There are two main ideas behind Theorem 3.2. First, for any feature statistic , we will compare the power of to the power of a “soft” version of the SeqStep procedure, which depends only on the conditional expectation of instead of the realized values of . Roughly speaking, if the coordinates of obey a strong law of large numbers, the power of SeqStep and the power of the “soft” version of SeqStep will be the same asymptotically. Second, we will show that MLR statistics exactly maximize the power of the “soft” version of the SeqStep procedure. Taken together, these two results imply that MLR statistics are asymptotically optimal.
To make this precise, for a feature statistic , let denote sorted in decreasing order of its absolute values, and let be the vector indicating where has positive entries. The number of discoveries made by knockoffs only depends on . Indeed, for any vector and any desired FDR level , define
(B.4) |
where by convention we set for any and we remind the reader that for , . It turns out that knockoffs makes exactly discoveries. For brevity, we refer the reader to Lemma B.3 of Spector and Janson, (2022) for a formal proof of this; however, to see this intuitively, note that (resp. ) counts the number of negative (resp. positive) entries in the first coordinates of , so this definition lines up with the definition of the data-dependent threshold in Section 1.1.
Now, let be the conditional expectation of given the masked data (defined in Equation 3.1). The “soft” version of SeqStep simply applies the functions and to the conditional expectation instead of the realized indicators . Intuitively speaking, our goal will be to apply a law of large numbers to show the following asymptotic result:
Once we have shown this, it will be straightforward to show that MLR statistics are asymptotically optimal, since MLR statistics maximize in finite samples.
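For intuition, the following Python sketch shows how the same partial-sum formula can be applied either to realized sign indicators (the "hard" version) or to their conditional expectations given the masked data (the "soft" version). The exact offset and normalization in Eq. (B.4) may differ slightly from this illustration, which should be read as an assumption rather than the paper's precise definition.

import numpy as np

def seqstep_discoveries(s, q):
    """Discovery count as a function of the (soft) sign vector `s`, assumed
    sorted in decreasing order of the corresponding |W|. Entries of `s` may
    be 0/1 indicators of positivity or probabilities in [0, 1]."""
    s = np.asarray(s, dtype=float)
    pos = np.cumsum(s)           # (expected) positives among the top k
    neg = np.cumsum(1.0 - s)     # (expected) negatives among the top k
    fdp_hat = (1.0 + neg) / np.maximum(pos, 1.0)
    ok = np.where(fdp_hat <= q)[0]
    return float(pos[ok.max()]) if ok.size else 0.0

signs = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 0])                   # realized indicators
probs = np.array([0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.55, 0.5, 0.5, 0.5])
print(seqstep_discoveries(signs, q=0.5))   # "hard" count
print(seqstep_discoveries(probs, q=0.5))   # "soft" count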
We now begin to prove Theorem 3.2 in earnest. In particular, the following pair of lemmas tells us that if converges to uniformly in , then .
Lemma B.2.
Let be any feature statistic with as defined earlier. Fix any and sufficiently small such that . Define the event
Then on the event , we have that
(B.5) |
This implies that
(B.6) |
Proof.
Note the proof is entirely algebraic (there is no probabilistic content). We proceed in two steps, first showing Equation (B.5), then Equation (B.6).
Step 1: We now prove Equation (B.5). To start, define the sets
and recall that by definition , . To analyze the difference between these quantities, fix any . Then by definition of and , we know
However, Lemma B.3 (proved in a moment) tells us that this implies the following algebraic identity:
However, on the event this cannot occur for any . Therefore, on the event , . This implies that
(B.7) |
We can combine these conditions by writing that . Using the definition of , we conclude
by def. of ; by Eq. (B.7); by def. of .
This proves the upper bound, namely that . To prove the lower bound, note that we can swap the role of and and apply the upper bound to . Then if we take , applying the upper bound yields
Observe that is nondecreasing in . Furthermore, since , we have that . Therefore, . Applying this result, we conclude
This implies the lower bound .
Step 2: Now, we show Equation (B.6) follows from Equation (B.5). To see this, we consider the two cases where and vice versa and apply Equation (B.5). In particular, on the event , then:
by Eq. (B.5)
where the last line follows because is monotone in . This implies Equation (B.6), because trivially on the event because . ∎
The following lemma proves a very simple algebraic identity used in the proof of Lemma B.2.
Lemma B.3.
For any , and any , suppose that . Then
We are now ready to prove Theorem 3.2. As a reminder, we consider an asymptotic regime with data and knockoffs , where is the Bayesian model. We let denote the masked data for knockoffs as defined in Section 3.1 and let denote the expected number of non-nulls under . We will analyze the limiting normalized number of discoveries of feature statistics with rejection set , defined as the expected number of discoveries divided by the expected number of non-nulls:
(3.9) |
For convenience, we restate Theorem 3.2 and then prove it.
Theorem 3.2.
For each , let denote the MLR statistics with respect to and let denote any other sequence of feature statistics. Assume the following:
-
•
and exist for each .
-
•
The expected number of non-nulls grows faster than . Formally, assume that for some , .
-
•
Conditional on , the covariance between the signs of decays exponentially off the diagonal. That is, there exist constants such that
(3.8)
Then for all but countably many ,
(B.10) |
Proof.
Note that in this proof, all expectations and probabilities are taken over .
The proof proceeds in three main steps, but we begin by introducing some notation and outlining the overall strategy. Following Lemma B.2, let and , and let and be their conditional expectations. (Note that and all change with —however, we omit this dependency to lighten the notation.) As in Equation (B.4), we can write the number of discoveries made by and as a function of and , so:
We will show that the limit of this quantity is nonnegative, and the main idea is to make the approximations and . In particular, we can decompose
(B.11) |
In particular, Step 1 of the proof is to show that holds deterministically, for fixed . This implies that the first term of Equation (B.11) is nonnegative for fixed . In Step 2, we show that as , the second and third terms of Equation (B.11) vanish. In Step 3, we combine these results and take limits to yield the final result.
Step 1: In this step, we show that holds deterministically for fixed . To do this, it suffices to show that for each . To see this, recall that and are increasing functions of and , as defined below:
(B.12) |
where for we use the convention of “padding” and with extra zeros, so (e.g.) for .
Since the function is decreasing in , implies that for each , and therefore , which implies . Thus, it suffices to show that holds for each .
Intuitively, it makes sense that for each , since maximizes coordinate-wise and chooses so that is sorted in decreasing order. To prove this formally, we first argue that conditional on , is a deterministic function of . Recall that according to Corollary B.1, the event is completely determined by the masked data . Furthermore, since and are random permutations of the vectors and where the random permutations only depend on and , this implies there exists a random vector and a random permutation such that and are deterministic conditional on . Note that here, denotes the generalized parity function, so
(B.13) |
which guarantees that and , etc.
The intuition here is that following Proposition 3.1, fitting a feature statistic is equivalent to observing , assigning an ordering to the features, and then guessing which one of is the true feature and which is a knockoff, where if and only if this “guess” is correct. Since these decisions are made as deterministic functions of , can only be different from in that (i) it may make different guesses, flipping the sign of (as represented by ), and (ii) its absolute values may be sorted in a different order (as represented by ).
Now, since and are deterministic functions of , this implies that
However, by definition, , and Proposition 3.3 implies that for all : since the ordering of is deterministic conditional on , this also implies . Therefore, for each . Additionally, by construction ensures that . If are the order statistics of in decreasing order, this implies that for all . Therefore,
By the previous analysis, this proves that .
Step 2: In this step, we show that for all but countably many , as well as the analogous result for and . We first prove the result for and , and in particular, for any fixed , we will show that . Since we will show this for any arbitrary , this implies .
We begin by applying Lemma B.2. In particular, fix any , any , and define
Then by Lemma B.2,
where . Therefore,
(B.14) |
We now analyze these terms in order: while doing so, we will choose a sequence and constant which guarantee the desired result. Note that eventually, our choice of will depend on , so the convergence is not necessarily uniform, but that does not pose a problem for our proof.
First term: To start, we will first apply a finite-sample concentration result to bound . In particular, we show in Corollary C.1 that if are mean-zero, -valued random variables satisfying the exponential decay condition from Equation (3.8), then there exists a universal constant depending only on and such that
(B.15) |
Furthermore, Corollary C.1 shows that this result holds even if we permute according to some arbitrary fixed permutation . Now, observe that conditional on , is a zero-mean, -valued random variable which is a fixed permutation of minus its (conditional) expectation. Since obeys the conditional exponential decay condition in Equation (3.8), we can apply Corollary C.1 to :
(B.16) |
which implies by the tower property that . Now, suppose we take
Then observe that is fixed, so as , . Thus
Therefore, for this choice of , we have shown the stronger result that . Of course, this implies as well.
Second term: This term is easy, as we assume in the statement that .
Third term: We will now show that for all but countably many , for any sufficiently small (and thus for any sufficiently small ), for any fixed .
To do this, recall by assumption that for all , we have that exists and converges to some (extended) real number . Furthermore, we show in Lemma C.2 that is always finite—this is intuitively a consequence of the fact that knockoffs controls the false discovery rate, and thus the expected number of discoveries cannot exceed the number of non-nulls by more than a constant factor. Importantly, since is increasing in , the function is increasing in for all : therefore, it is continuous on except on a countable set.
Supposing that is a continuity point of , there exists some such that . Take to be any positive constant such that and thus . Then we conclude
by continuity
Fourth term: We now show that for all but countably many , for any sufficiently small , . However, this is simple, since Lemma C.2 tells us that is finite and continuous except at countably many points. Thus, we can take sufficiently small so that , and then also take so that .
Combining the results for all four terms, we see the following: for each , there exists a sequence and a constant guaranteeing that
Since this holds for all , we conclude as desired.
Lastly in this step, we need to show the same result for and in place of and . However, the proof for and is identical to the proof for and . The one subtlety worth mentioning is that we do not directly assume the exponential decay condition in Equation (3.8) for . However, as we argued in Step 1, we can write for some random vector which is a deterministic function of . As a result, we have that
Thus, we also conclude that .
Step 3: Finishing the proof. Recall Equation (B.11), which states that
(B.11) |
In Step 1, we showed that for fixed . Furthermore, in Step 2, we showed that the second two terms vanish asymptotically. As a result, we take limits and conclude
Furthermore, since we assume that the limits exist, this implies that
This concludes the proof. ∎
B.5 Relaxing the assumptions in Theorem 3.2
In this section, we discuss a few ways to relax the assumptions in Theorem 3.2.
First, we can easily relax the assumption that the limits and exist for each . Indeed, the proof of Theorem 3.2 only uses this assumption to argue that there exists a sequence such that (and similarly for ). Thus, we do not need the limits to exist for every ; rather, the result of Theorem 3.2 will hold (e.g.) for any such that are continuous at . Intuitively, this means that the result in Theorem 3.2 holds except at points that delineate a “phase transition,” where the power of knockoffs jumps in a discontinuous fashion as increases.
Second, it is important to note that the precise form of the local dependency condition (3.8) is not crucial. Indeed, the proof of Theorem 3.2 only uses this condition to show that the partial sums of converge to their conditional mean given . To be precise, fix any permutation and let where permutes according to . Let . Then the proof of Theorem 3.2 will go through exactly as written if we replace Equation (3.8) with the following condition:
(B.17) |
where is some sequence satisfying and .
The upshot is this: under any condition where each permutation of obeys a certain strong law of large numbers, we should expect Theorem 3.2 to hold. Although it is unusual to require that a strong law holds for any fixed permutation of a vector, in some cases there is a “worst-case” permutation where if Equation (B.17) holds for some choice of , then it holds for every choice of . For example, in Corollary C.1, we show that if the exponential decay condition holds, then it suffices to show Equation (B.17) in the case where is the identity permutation, since the identity permutation places the most correlated coordinates of next to each other.
B.6 Proof of Propositions 3.4-3.5
Proposition 3.4.
If are conditionally independent given under , then
for any valid feature statistic .
Proof.
Note that the proof here is essentially the same argument used in Proposition 2 of Li and Fithian, (2021), but for completeness we restate it here. Let denote any other feature statistic. Let and denote the discovery sets based on and .
It suffices to show that . The argument from the proof of Theorem 3.2 (the beginning of Appendix B.4) shows that the number of discoveries is a monotone function of , where denotes sorted in decreasing order of its absolute values. Therefore, it suffices to show that if and ,
(B.18) |
where as defined in Eq. (B.4), for any . To do this, we recall from the proof of Theorem 3.2 that there exists a -measurable vector and a -measurable permutation such that
where for any vector , and denotes the parity function (see Eq. B.13). We now make a few observations:
-
1.
We assume are conditionally independent. Since the magnitudes of are -measurable by Proposition 3.1, is equal to a -measurable permutation of . Therefore, the entries of are conditionally independent given .
-
2.
Since and are -measurable, the entries of are also conditionally independent given .
-
3.
Since maximizes among all feature statistics by Proposition 3.3, this implies that for any , . Thus, for all .
Thus, we can create a coupling such that has the same marginal law as , but a.s. (by the third observation above). This implies that for all , and therefore . Therefore
Therefore, to complete the proof, it suffices to show that —to simplify notation, take , i.e., assume without loss of generality. Note that by Proposition 3.3, the entries of are arranged in decreasing order. To show the desired result, let and fix any such that are “misordered” (i.e. not in decreasing order). It is sufficient to show that , since conditional on , is simply the result of iteratively swapping elements of to sort in decreasing order.
To show this, for any , let denote the vector which replaces the th and th entries of with and , respectively. Since the entries of are conditionally independent, after conditioning on , we can write out the relevant conditional expectations:
where the first equality uses conditional independence and the definition of expectation, the second equality cancels relevant terms, the third equality is simple algebra, and the final inequality uses the fact that by assumption but . In particular, to see this latter point, define and note that the partial averages obey for every , which implies by definition of .
Thus, by the tower property, , which completes the proof. ∎
Proposition 3.5.
Suppose that (i) are FX knockoffs or Gaussian conditional MX knockoffs (Huang and Janson,, 2020) and (ii) under , . Then under , are conditionally independent.
Proof.
This result is already proved for the fixed-X case by Li and Fithian, (2021), so we only prove it for the case where are conditional Gaussian MX knockoffs. In particular, define and recall that if and only if . Therefore, to show that are conditionally independent under , it suffices to show that are conditionally independent under .
Fix any value and let denote a possible value for the design matrix which is consistent with observing in the sense that for all . It suffices to show the factorization
for some functions which may depend on the value . To do this, observe that
where the last line uses the Gaussian linear model assumption that under , for some fixed as well as the pairwise exchangeability of . Continuing yields,
Here, the last step uses the key assumption that are conditional Gaussian MX knockoffs, in which case and for some diagonal matrix . In other words, the value of is -measurable, and thus conditional on , the value of is a constant. At this point, we conclude that
which completes the proof by the factorization argument above.
∎
B.7 Maximizing the expected number of true discoveries
Theorem 3.2 shows that MLR statistics maximize the (normalized) expected number of discoveries, but not necessarily the expected number of true discoveries. In this section, we give a sketch of the derivation of AMLR statistics and prove that they maximize the expected number of true discoveries.
This section uses the notation introduced in Section B.4. All probabilities and expectations are taken over . As a reminder, for any feature statistic , let , let , and let be as defined in Equation (B.4) so that knockoffs makes discoveries.
B.7.1 Proof of Corollary 3.1
Corollary 3.1.
AMLR statistics from Definition 3.3 are valid knockoff statistics.
Proof.
The signs of the AMLR statistics are identical to the signs of the MLR statistics. Therefore, by Propositions 3.1 and 3.2 (in the MX and FX case, respectively), it suffices to show that the absolute values of the AMLR statistics are functions of the masked data. However, the AMLR statistics magnitudes are purely a function of (i) the magnitudes of the MLR statistics and (ii) , which is the ratio of conditional probabilities given the masked data . These conditional probabilities by definition are functions of , and since MLR statistics are valid knockoff statistics by Lemma 3.1, the MLR magnitudes are also a function of the masked data by Proposition 3.1. Thus, the AMLR statistic magnitudes are a function of , which concludes the proof. ∎
B.7.2 Proof sketch and intuition
Proof sketch: The key idea behind the proof of Theorem 3.2 is to observe that:
-
1.
The number of discoveries only depends on cumulative averages of , denoted .
-
2.
As , under suitable assumptions. Thus, .
-
3.
If are MLR statistics with , then is asymptotically optimal because holds in finite samples for any choice of . Thus we conclude:
(B.19) In particular, this holds because MLR statistics ensure is in descending order.
To show a similar result for the number of true discoveries, we repeat the three steps used in the proof of Theorem 3.2. To do this, let be the indicator that the feature corresponding to the th coordinate of is non-null, and let be the indicator that and that the corresponding feature is non-null. Let . Then:
-
1.
Let denote the number of true discoveries. We claim that is a function of the successive partial means of and . To see this, recall from Section B.4 that knockoffs will make discoveries, and in particular it will make discoveries corresponding to any of the first coordinates of which are positive. Therefore,
(B.20) Since only depends on the successive averages of and the second term is itself a successive average of , this finishes the first step.
-
2.
The second step is to show that as , and therefore . This is done using the same techniques as Theorem 3.2, although it requires an extra assumption that also obeys the local dependency condition (3.8). Like the original condition, this condition also only depends on the posterior of , so it can be diagnosed using the data at hand.
-
3.
To complete the proof, we define the adjusted MLR statistic with corresponding such that holds in finite samples for any other feature statistic . It is easy to see that must have the same signs as the original MLR statistics , since the signs of maximize and coordinatewise. However, the absolute values of may differ from those of , since it is not always true that sorting in decreasing order maximizes .
It turns out that the absolute values of the AMLR statistics in Eq. (3.12) yield vectors which maximize up to an additive constant. Theorem B.6 formally proves this, but we now give some intuition.
Intuition: To determine the optimal absolute values for AMLR statistics, assume WLOG by relabelling the variables that . Let denote an optimization variable representing the set of variables with the largest absolute values for the AMLR statistics, for some . We will try to design such that we can make as many true discoveries within as possible.
As argued above, AMLR and MLR statistics should have the same signs. Thus, roughly speaking, we can discover all features with positive signs among whenever
Making our usual approximation , this is equivalent to the constraint
(B.21) |
Furthermore, if we can discover all of the features with positive signs in , we make exactly true discoveries, where is the indicator that the th MLR statistic is positive and the th feature is nonnull. Maximizing this approximate objective subject to the constraint in Eq. (B.21) yields the optimization problem:
(B.22) |
In other words, including has “benefit” and “cost” . This is a simple integer linear program with one constraint, often known as a “knapsack” problem. An approximately optimal solution to this problem is to do the following:
-
•
Include all variables with “negative” costs, meaning . This is accomplished by ensuring that these features have the largest absolute values.
-
•
Prioritize all other variables in descending order of the ratio of the benefit to the cost, .
This solution is indeed accomplished by the AMLR formula (Eq. (3.12)), which gives the highest absolute values to features with negative costs; then, all other absolute values have the same order as the benefit-to-cost ratios .
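The following Python sketch illustrates this greedy strategy. The arrays benefit and cost and the scalar budget are generic stand-ins for the quantities appearing in Eqs. (B.21)–(B.22); the names and values are illustrative only, and this is not the knockpy implementation of AMLR statistics.

import numpy as np

def amlr_style_priority(benefit, cost):
    """Order features from highest to lowest priority: non-positive-cost
    features first, then the rest by descending benefit-to-cost ratio."""
    benefit, cost = np.asarray(benefit, float), np.asarray(cost, float)
    obvious = np.where(cost <= 0)[0]              # these never tighten the constraint
    rest = np.where(cost > 0)[0]
    rest = rest[np.argsort(-benefit[rest] / cost[rest])]
    return np.concatenate([obvious, rest])

def greedy_value(benefit, cost, budget):
    """Greedy bound: add features in priority order, stopping one feature
    after the cumulative cost first exceeds `budget`."""
    order = amlr_style_priority(benefit, cost)
    value = total_cost = 0.0
    for j in order:
        value += benefit[j]
        total_cost += cost[j]
        if total_cost > budget:
            break
    return value

benefit = np.array([0.9, 0.7, 0.4, 0.2, 0.05])    # illustrative values
cost = np.array([-0.3, 0.1, 0.2, 0.4, 0.5])
print(amlr_style_priority(benefit, cost))         # [0 1 2 3 4]
print(greedy_value(benefit, cost, budget=0.25))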
B.7.3 Theorem statement and proof
We now show that AMLR statistics asymptotically maximize the true positive rate. To do this, we require two additional regularity conditions beyond those assumed in Theorem 3.2. First, we need a condition that the number of non-nulls under is not too heavy-tailed; namely, that its coefficient of variation is uniformly bounded.
Assumption B.1.
There exists a constant such that as ,
where is the expected number of non-nulls under .
Assumption B.1 is needed for a technical reason. As we will see in Step 3 of the proof, combining this assumption with Lemma C.2 ensures that the normalized number of discoveries is uniformly integrable, which is necessary to show that certain error terms converge in to zero. Nonetheless, this assumption is already quite mild, and it is satisfied in previously studied linear and polynomial sparsity regimes (Donoho and Jin,, 2004; Weinstein et al.,, 2017; Ke et al.,, 2020).
Next, we need an additional local dependence condition.
Assumption B.2 (Additional local dependence condition).
Let indicate the event that is non-null and is positive. Let indicate the event that is non-null and is negative. We assume that there exist constants such that for all :
(B.23) |
(B.24) |
(B.25) |
Assumption B.2 is needed because it implies that for any feature statistic , obey the same local dependence condition.
Lemma B.5.
Assume Assumption B.2. Then for any feature statistic and all ,
Theorem B.6.
Suppose the conditions of Theorem 3.2 plus Assumptions B.1 and B.2 hold. Let denote the AMLR statistics with respect to . Then for any sequence of feature statistics , the following holds for all but countably many :
(B.26) |
where is the expected number of non-nulls under as defined in Assumption 3.1, and is the expected number of true discoveries made by feature statistic under as defined in Section B.7.
Proof.
The proof is in three steps.
Step 1: Notation and setup. Throughout, we use the notation and ideas from Section B.7.2 and the proof of Theorem 3.2 (Section B.4), although to ease readability, we will try to give reminders about notation when needed. In particular:
-
•
Define and . For simplicity, we suppress the dependence on .
-
•
Define and .
-
•
Let denote the random permutations such that and are sorted in descending order of absolute values; with this notation, note that .
-
•
Let and be the indicators that the feature statistic with the th largest absolute value of (resp. ) represents a non-null feature.
-
•
Let be the indicator that the feature with the th largest absolute value among is non-null and that its feature statistic is positive. Similarly, is the indicator that the feature with the th largest AMLR statistic is non-null and its AMLR statistic is positive. Let denote the vectors of these indicators.
-
•
Define and to be the conditional expectations of these quantities given the masked data.
-
•
Throughout, we only consider feature statistics whose values are nonzero a.s., because one can provably increase the power of knockoffs by ensuring that each coordinate of is nonzero.
Equation (B.20) shows that we can write
(B.27) |
and similarly . Therefore it suffices to show that
(B.28) |
To do this, we make the following approximation using the triangle inequality:
(B.29) |
Step 2 of the proof shows that Term 1 is asymptotically positive, and Step 3 shows that Terms 2 and 3 are asymptotically negligible in expectation (i.e., of order ). This is sufficient to complete the proof.
Step 2: Analyzing Term 1. In this step, we show that
(B.30) |
To do this, Step 2a shows that we may assume and therefore and (with the usual notation that ). Step 2b then proves Eq. (B.30).
Step 2a: Define to have the absolute values of and the signs of the AMLR statistics. We claim that if are defined analogously for instead of , then .
To see this, we prove that (i) holds elementwise and (ii) holds elementwise. Results (i) and (ii) complete the proof of Step 2(a) since is nondecreasing in its inputs (namely because is nonnegative and is nondecreasing in its inputs).
To show (i), Proposition 3.3 shows that , where this argument also uses the facts that (a) MLR and AMLR statistics have the same signs and (b) the permutation depends only on and thus is -measurable.
To show (ii), we note that the law of total probability yields
where above we use the fact that —i.e., under the null, is conditionally symmetric. Thus, holds iff . Using this result, we conclude
This proves that . Yet in this step, we seek to show that ; thus, we may assume that and thus . This implies that and take the same values but in different orders; formally, and .
Step 2b: Now we show Eq. (B.30). Recall that is the partial sum of the first elements of , where is the maximum integer such that . It follows that is bounded by the following quantity:
where the notation defines the average value of averaged over the set . Indeed, this inequality follows because is precisely the solution to this optimization problem when is constrained to be a contiguous set of the form for some . Relaxing this constraint to allow to be an arbitrary subset of should only increase the value of the objective. Manipulating this optimization problem yields:
This is an integer linear program with integer decision variables and one constraint:
Such problems—commonly referred to as knapsack problems—are well studied. The maximum value is bounded by the following greedy strategy:
-
•
Let be the set of coordinates such that the constraint coefficient on is negative. This is an “obvious” set because for , setting never decreases the objective value (since ) and never increases the constraint value (since ). Thus, any optimal solution must have for all .
-
•
Let denote the number of obvious coordinates, and let denote the non-obvious coordinates.
-
•
After setting for all coordinates , we should sort the coordinates in in descending order of the ratio and include as many coordinates of as possible until we hit the constraint that . Then, if we include one additional coordinate, the value of this solution (which violates the constraint by a small amount) bounds the optimal value of the overall problem. Indeed, this solution has a higher objective value than the solution to the relaxed LP which only requires instead of (Martello and Toth,, 1990).
To make this strategy precise, let denote any permutation with the following properties:
-
•
. I.e., the first coordinates specified by are the set .
-
•
For , if and only if . I.e., orders the rest of the coordinates by the ratio .
Then, let denote the maximum value of such that setting yields a feasible solution to the integer LP. Then we have that
Remark 6.
is not uniquely specified by the construction above, but the bound holds for any such .
Step 2a shows that for some permutation , we can write and . We note that must satisfy the following:
-
•
. This is because the AMLR statistics with the top absolute values are constructed to be the AMLR statistics such that (see Definition 3.3), which exactly coincides with the definition of the set .
-
•
For , if and only if . This again follows from Definition 3.3, as the absolute values of AMLR statistics are explicitly chosen to guarantee this.
In other words, satisfies the same properties as above. Thus, we may take . This yields the bound . However, we also know that
where the last step follows because
by definition; by algebraic manipulation; by definition.
To summarize, this implies that
This completes the proof of Step 2, since as a consequence,
(B.31) |
Step 3: In this step, we show that holds for all but countably many . The same logic applies to the term involving .
To do this, we will essentially bound by the quantities and ; as we shall see, these are both . To begin, the triangle inequality yields:
by definition
triangle inequality
(B.32)
First, we bound Term (a). Applying the triangle inequality (again) plus a simple algebraic lemma yields:
The second line is not obvious, but it follows because Lemma B.7 shows that . This algebraic result follows intuitively because holds for all ; thus, cannot be large unless is also large.
Combining this bound on Term (a) with the initial result in Eq. (B.32) yields:
To show that Term (b) vanishes, recall that Step 2 of Theorem 3.2 shows that for all but countably many . However, and similarly . Therefore, for all but countably many .
Thus, it suffices to show that Term (c) vanishes. To do this, fix a sequence of integers such that . Separately considering the cases where and yields
where the second line follows because . Assumption 3.1 guarantees that , so it suffices to show that Term (d) vanishes. To do this, we observe:
- •
- •
-
•
The latter two observations imply that , since is uniformly integrable, and is also uniformly bounded and vanishes in probability.
Together, this proves that for all but countably many , . ∎
The following algebraic lemma is used at the end of Step 3 of the proof of Theorem B.6.
Lemma B.7.
For , fix . Then
Proof.
Define and . The lemma holds trivially when , so we may assume . Then we have that
Therefore we conclude that
which completes the proof. ∎
B.8 Verifying the local dependence assumption in a simple setting
We now verify the local dependency condition (3.8) in the setting where is block-diagonal and is known.
Proposition B.2.
Suppose is a block-diagonal matrix with maximum block size . Suppose is any Bayesian model such that (i) the model class is the class of all Gaussian models of the form for , (ii) the coordinates of are marginally independent under and (iii) is a constant under . Then if are either fixed-X knockoffs or conditional Gaussian model-X knockoffs (Huang and Janson,, 2020), the coordinates of are -dependent conditional on under , implying that Equation (3.8) holds, e.g., with and .
Proof.
Define . We will prove the stronger result that if are a partition of corresponding to the blocks of , then are jointly independent conditional on . As notation, suppose without loss of generality that are contiguous subsets and for . All probabilities and expectations are taken under .
We give the proof for model-X knockoffs; the proof for fixed-X knockoffs is quite similar. Recall by Proposition 3.1 that we can write where is a function of the masked data . Therefore, to show are independent conditional on , it suffices to show are conditionally independent given . To do this, it will first be useful to note that the likelihood is
where above, we only include terms depending on , since these are the only terms relevant to the later stages of the proof. A subtle but important observation in the calculation above is that we can verify that having only observed without observing . Indeed, this follows because for conditional Gaussian MX knockoffs, and only differs from on the main diagonal (just like in the fixed-X case). With this observation in mind, we now abuse notation slightly and let denote an arbitrary conditional density under . Observe that
At this point, we can iteratively pull out parts of the product. In particular, define the following function:
Since and are fixed, is a deterministic function of that does not depend on . Therefore, we can iteratively integrate as below:
This shows that are jointly (conditionally) independent since their density factors, thus completing the proof. For fixed-X knockoffs, the proof is very similar as one can show that the density of factors into blocks. ∎
B.9 Intuition for the local dependency condition and Figure 6
In Figure 6, we see that even when is very highly correlated, looks similar to a diagonal matrix, indicating that the local dependency condition (3.8) holds well empirically. This empirical result is striking and may be surprising at first; this section offers some explanation.
For the sake of intuition, suppose that we are fitting model-X knockoffs and using the Bayesian model from Example 1 with the original features. Suppose we observe that the masked data is equal to some fixed value . After observing , Appendix E shows how to sample from the posterior distribution via the following Gibbs sampling approach:
-
•
For each , initialize and to some value.
-
•
For :
-
1.
Set and .
-
2.
For :
-
(a)
Resample from the law of . It may be helpful to recall holds deterministically.
-
(b)
Resample from the law of .
-
(a)
-
1.
-
•
Return samples .
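The following Python sketch shows the structure of this sampler. The two update functions are hypothetical placeholders for the actual conditional updates derived in Appendix E (which depend on the masked data and the Bayesian model); only the loop structure mirrors the steps listed above, and this is not knockpy code.

import numpy as np

def sample_swap_indicator(j, xi, beta, masked_data, rng):
    # Placeholder: the real sampler draws from the conditional law of the
    # "which column is the original feature" indicator given everything else.
    return int(rng.integers(0, 2))

def sample_coefficient(j, xi, beta, masked_data, rng):
    # Placeholder: the real sampler draws beta_j from its conditional law.
    return rng.normal()

def masked_gibbs_sampler(masked_data, p, n_iter, seed=0):
    rng = np.random.default_rng(seed)
    xi = np.ones(p, dtype=int)     # current guess of feature vs. knockoff
    beta = np.zeros(p)             # current coefficient values
    samples = []
    for _ in range(n_iter):
        for j in range(p):
            # Step 2(a): resample which of the j-th pair is treated as the
            # original feature, conditional on the masked data and the rest.
            xi[j] = sample_swap_indicator(j, xi, beta, masked_data, rng)
            # Step 2(b): resample the coefficient for feature j.
            beta[j] = sample_coefficient(j, xi, beta, masked_data, rng)
        samples.append((xi.copy(), beta.copy()))
    return samples

samples = masked_gibbs_sampler(masked_data=None, p=5, n_iter=10)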
Now, recall that if and only if the Gibbs sampler consistently chooses instead of . Thus, to analyze for some fixed , we must ask the following question: does the value of strongly affect how we resample ?
To answer this question, the following key fact (reviewed in Appendix E) is useful. At iteration , step 2(a), for any , let be the residuals excluding feature for the current value of and in the Gibbs sampler. Then, standard calculations show that only depends on and through the following inner products:
Thus, the question we must answer is: how does the choice of affect the value of ? Heuristically, the answer is “not very much,” since only appears above through inner products of the form and , and by definition of the knockoffs we know that and . Indeed, for fixed-X knockoffs, we know that this actually holds exactly, and for model-X knockoffs, the law of large numbers should ensure that these approximations are very accurate.
The main way that the choice of can significantly influence the choice of is that the choice of may change the value of . In general, we expect this effect to be rather small, since in many highly-correlated settings, and are necessarily highly correlated and thus the choice of should not affect the choice of too much. That said, there are a few known pathological settings where the choice of does substantially change the estimated value of (Chen et al., (2019); Spector and Janson, (2022)), and in these settings, the coordinates of may be strongly conditionally dependent. The good news is that using MVR knockoffs instead of SDP knockoffs should ameliorate this problem (see Spector and Janson, (2022)).
Overall, we recognize that this explanation is purely heuristic and does not fully explain the results in Figure 6. However, it may provide some intuitive insight. A more rigorous theoretical analysis of would be interesting; however, we leave this to future work.
Appendix C Technical proofs
C.1 Key concentration results
The proof of Theorem 3.2 relies on the fact that the successive averages of the vector converge uniformly to their conditional expectation given the masked data . In this section, we give a brief proof of this result, which is essentially an application of Theorem 1 from Doukhan and Neumann, (2007). For convenience, we first restate a special case of this theorem (namely, the case where the random variables in question are bounded and we have bounds on pairwise correlations) before proving the corollary we use in Theorem 3.2.
Theorem C.1 (Doukhan and Neumann, (2007)).
Suppose that are mean-zero random variables taking values in such that for a constant . Let be constants such that for any ,
where is a nonincreasing sequence satisfying
Then for all , there exists a universal constant only depending on and such that
where is a universal constant only depending on .
If we take , this yields the following corollary.
Corollary C.1.
Suppose that are mean-zero random variables taking values in . Suppose that for some , the sequence satisfies
(C.1) |
Then there exists a universal constant depending only on and such that
(C.2) |
Furthermore, let be any permutation. For , define to be the sample mean of the first random variables after permuting according to . Then for any ,
(C.3) |
where is the symmetric group.
Proof.
The proof of Equation (C.2) follows an observation of Doukhan and Neumann, (2007), where we note for . Then
As a result, , so we take and . Lastly, we observe that another geometric series argument yields
Thus, we take and apply Theorem C.1, which yields the first result. To prove Equation (C.3), the main idea is that we can apply Equation (C.2) to each sample mean , at which point Equation (C.3) follows from a union bound.
To prove this, note that if we rearrange into their “original order,” then these variables satisfy the condition in Equation (C.1). Formally, let and let be the permutation such that if and only if , for . Then define for , and note that
where in the last step, follows by construction of . Since by construction, this means we may apply Equation (C.2) to for each .
C.2 Bounds on the expected number of false discoveries
The proof of Theorem 3.2 relied on the fact that is finite whenever it exists. This is a consequence of the lemma below, which also proves a second moment bound needed for the uniform integrability argument in Step 3 of the proof of Theorem B.6.
Lemma C.2.
Fix any . Then there exist universal constants such that for any Bayesian model and any valid knockoff statistic with discovery set :
-
1.
where is a finite constant depending only on .
-
2.
.
Note that above, denotes the random set of non-nulls under .
Proof.
Recall that denotes the joint law of . Throughout the proof, all expectations and probabilities are taken over . Our strategy is to condition on the nuisance parameters . In particular, let denote the number of non-nulls. To show the first result, it suffices to show
(C.4) |
Proving Equation (C.4) proves the first result because it implies by the tower property that , and therefore . For the second result, by the tower property it also suffices to show that
(C.5) |
The rest of the proof proceeds conditionally on , so we are essentially in the fully frequentist setting. Thus, for the rest of the proof, we will abbreviate as . We will also assume the “worst-case” values for the non-null coordinates of : in particular, let denote but with all of the non-null coordinates replaced with the value , and let be the discovery set made when applying SeqStep to . These are the “worst-case” values in the sense that deterministically (see Spector and Janson, (2022), Lemma B.4), so it suffices to show that and .
As notation, let denote the signs of when sorted in descending order of absolute value. Following the notation in Equation (B.4), let , where . This ensures that is the number of discoveries made by knockoffs (Spector and Janson, (2022), Lemma B.3). To prove the first result, it thus suffices to show . To do this, let and fix any integer (we will pick a specific value for later). Observe that
(C.6) | ||||
(C.7) |
where the second line follows because (i) the event implies that at least of the first coordinates of are positive and (ii) the knockoff flip-sign property guarantees that conditional on , the null coordinates of are i.i.d. random signs conditional on the values of the non-null coordinates of . (Without loss of generality, we may assume that the absolute values of are nonzero with probability one, since again, this only increases the number of discoveries made by knockoffs.) Thus, doing simple arithmetic, in the first coordinates of , there are null i.i.d. signs, of which at least must be positive, yielding the expression above with the Binomial probability.
We now apply Hoeffding’s inequality. To do so, we must ensure is larger than the mean of a random variable. It turns out that it suffices to pick the value of to satisfy . To see why, fix any , so we may write for some . Then for all such , we have
Thus, we may apply Hoeffding’s inequality for . Indeed, for any , the previous result yields that for :
As notation, set . Combining the previous equation with Eq. (C.7), we obtain
Note that the sums and are both convergent. As a result, is bounded by a constant multiple of , where the constant depends on but nothing else. Since as previously argued, this completes the proof.
For the second statement, we note that by the same argument as above, we have that
Once again, the three series above are convergent and the value they converge to depends only on . Since asymptotically, this implies that there exists some constant such that . This completes the proof since . ∎
Appendix D Additional comparison to prior work
D.1 Comparison to the unmasked likelihood ratio
In this section, we compare MLR statistics to the earlier unmasked likelihood statistic introduced by Katsevich and Ramdas, (2020), which this work builds upon. The upshot is that unmasked likelihood statistics give the most powerful “binary -values,” as shown by Katsevich and Ramdas, (2020), but do not yield jointly valid knockoff feature statistics in the sense required for the FDR control proof in Barber and Candès, (2015) and Candès et al., (2018).
In particular, we call a statistic a marginally symmetric knockoff statistic if satisfies . Under the null, is marginally symmetric, so the quantity is a valid “binary -value” which only takes values in . Theorem 5 of Katsevich and Ramdas, (2020) shows that for any marginally symmetric knockoff statistic, is maximized if , where denotes the density of under the true law of the data. As such, one might initially hope to use the unmasked likelihood ratio as a knockoff statistic:
However, a marginally symmetric knockoff statistic is not necessarily a valid knockoff feature statistic, which must satisfy the following stronger property (Barber and Candès,, 2015; Candès et al.,, 2018):
for any . This flip-sign property guarantees that the signs of the null coordinates of are jointly i.i.d. and symmetric. However, the unmasked likelihood statistic does not satisfy this property, as changing the observed value of for will typically change the value of the likelihood .
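The distinction above can be checked numerically. The following sketch (our own illustration, not code from the paper) simulates a null feature with an exchangeable knockoff and verifies that a simple marginally symmetric statistic has sign probability one half, so that the indicator of a negative sign acts as a binary p-value; it does not, however, certify the stronger joint flip-sign property.

```python
# Simulate a null feature and an exchangeable knockoff, and check that a
# marginally symmetric statistic W satisfies P(W > 0) = 1/2 under the null.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 500, 2000
positive = 0
for _ in range(reps):
    y = rng.standard_normal(n)       # response, independent of the null feature
    x = rng.standard_normal(n)       # null feature
    x_ko = rng.standard_normal(n)    # independent copy serves as a valid knockoff here
    # Difference of absolute correlations: a marginally symmetric statistic.
    w = abs(np.corrcoef(x, y)[0, 1]) - abs(np.corrcoef(x_ko, y)[0, 1])
    positive += (w > 0)
print("Empirical P(W > 0):", positive / reps)   # should be close to 0.5
```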
D.2 Comparison to the adaptive knockoff filter
In this section, we compare our methodological contribution, MLR statistics, to the adaptive knockoff filter described in Ren and Candès, (2020), namely their approach based on Bayesian modeling. The main point is that although MLR statistics and the procedure from Ren and Candès, (2020) have some intuitive similarities, the procedures are different and in fact complementary, since one could use the Bayesian adaptive knockoff filter from Ren and Candès, (2020) in combination with MLR statistics.
As review, recall from Section 3.2 that valid knockoff feature statistics as initially defined by Barber and Candès, (2015); Candès et al., (2018) must ensure that is a function of the masked data , and thus cannot explicitly depend on . (It is also important to remember that determines the order and “prioritization” of the SeqStep hypothesis testing procedure.) The key innovation of Ren and Candès, (2020) is to relax this restriction: in particular, they define a procedure where the analyst sequentially reveals the signs of in reverse order of their prioritization, and after each sign is revealed, the analyst may arbitrarily reorder the remaining hypotheses. The advantage of this approach is that revealing the sign of (e.g.) may reveal information that can be used to more accurately prioritize the hypotheses while still guaranteeing provable FDR control.
This raises the question: how should the analyst reorder the hypotheses after each coordinate of is revealed? One proposal from Ren and Candès, (2020) is to introduce an auxiliary Bayesian model for the relationship between and (the authors also discuss the use of additional side information, although for brevity we do not discuss this here). For example, Ren and Candès, (2020) suggest using a two-groups model where
(D.1)
Above, is the indicator of whether the th hypothesis is non-null, and and are (e.g.) unknown parametric distributions that the analyst fits as they observe . With this notation, the proposal from Ren and Candès, (2020) can be roughly summarized as follows:
1. Fit an initial feature statistic , such as an LCD statistic, and observe .
2. Fit an initial version of the model in Equation (D.1) and use it to compute .
3. Observe for .
4. Using , update the model in Equation (D.1), update , and return to Step 3.
5. Terminate when all of has been revealed, at which point is passed to SeqStep in the reverse of the order that the signs were revealed.
Note that in Step 3, the reason Ren and Candès, (2020) choose to be the index minimizing is that in Step 5, SeqStep observes in reverse order of the analyst. Thus, the analyst should observe the least important hypotheses first so that SeqStep can observe the most important hypotheses first.
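For concreteness, the loop below sketches this reveal-and-reorder procedure in schematic Python. It is our own paraphrase rather than the implementation of Ren and Candès, (2020); the callables refit_model and prob_nonnull are hypothetical stand-ins for the analyst's working two-groups model.

```python
import numpy as np

def adaptive_reveal_order(W, refit_model, prob_nonnull):
    """Return hypothesis indices in the order SeqStep should process them."""
    W = np.asarray(W, dtype=float)
    hidden = set(range(len(W)))          # hypotheses whose sign is still masked
    revealed = []                        # order in which the analyst reveals signs
    model = refit_model(np.abs(W), revealed=[])
    while hidden:
        # Reveal the hypothesis judged LEAST likely to be non-null, so that
        # SeqStep (which reads the order in reverse) sees the best ones first.
        j = min(hidden, key=lambda k: prob_nonnull(model, k))
        hidden.remove(j)
        revealed.append(j)
        model = refit_model(np.abs(W), revealed=revealed)  # may now use sign(W_j)
    return revealed[::-1]

# Toy usage: the "model" is just |W| and no real refitting occurs.
W = [2.0, -0.5, 1.2, 0.1]
order = adaptive_reveal_order(
    W,
    refit_model=lambda absW, revealed: absW,
    prob_nonnull=lambda model, k: model[k],
)
print(order)   # hypotheses sorted by decreasing |W|: [0, 2, 1, 3]
```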
The main similarity between this procedure and MLR statistics is that both procedures, roughly speaking, attempt to prioritize the hypotheses according to , although we condition on the full masked data to maximize power. That said, there are two important differences. First, we and Ren and Candès, (2020) both use an auxiliary Bayesian model—however, we take probabilities over a Bayesian model of the full dataset, whereas Ren and Candès, (2020) only fit a working model of the law of . Using the full data as opposed to only the statistics should lead to much higher power—for example, if are poor feature statistics which do not contain much relevant information about the dataset, the procedure from Ren and Candès, (2020) will have low power. Thus, despite their initial similarity, these procedures are quite different.
The second and more important difference is that the procedure above is not a feature statistic. Rather, it is an extension of SeqStep that wraps on top of any initial feature statistic. This “adaptive knockoffs” procedure augments the power of any feature statistic, although if the initial feature statistic has many negative signs to begin with or its absolute values are truly uninformative of its signs, the procedure may still be powerless. Since MLR statistics have provable optimality guarantees—namely, they maximize and make a monotone function of —one might expect that using MLR statistics in place of a lasso statistic could improve the power of the adaptive knockoff filter. Similarly, using the adaptive knockoff filter in combination with MLR statistics could be more powerful than using MLR statistics alone.
Appendix E Gibbs sampling for MLR statistics
E.1 Proof of Eq. (4.1)
Lemma E.1.
Fix any constants , and , and define . Then
as long as lies in the support of under .
Proof.
Throughout this proof, we abuse notation and let probabilities of the form (e.g.) denote the density of this event with respect to the base measure of . The definition of conditional probability yields
By definition of , the event in the numerator is equivalent to the event and the event in the denominator is equivalent to . Plugging this in yields
where the second line uses the chain rule of conditional probability and the fact that has marginal density under . Recall that, by definition of (see Section 1.2), the law of the data given is simply . Furthermore, for all , under , are assumed to be pairwise exchangeable because they are valid knockoffs (see footnote 3). Therefore, cancelling terms, we conclude
where the last line follows since , as are valid knockoffs by assumption under . ∎
E.2 Derivation of Gibbs sampling updates
In this section, we derive the Gibbs sampling updates for the class of MLR statistics defined in Section 4.2. First, for convenience, we restate the model and choice of .
E.2.1 Model and prior
First, we consider the model-X case. For each , let denote any vector of prespecified basis functions applied to . We assume the following additive model:
with the following prior on :
with the usual hyperpriors
This is effectively a group spike-and-slab prior on which ensures group sparsity of , meaning that either the whole vector equals zero or the whole vector is nonzero. We use this group spike-and-slab prior for two reasons. First, it reflects the intuition that is meant to represent only a single feature and thus will likely be entirely sparse (if is truly null) or entirely non-sparse. Second, and more importantly, the group sparsity will substantially improve computational efficiency in the Gibbs sampler.
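As a small illustration of what the group spike-and-slab prior encodes, the sketch below draws coefficients from a prior of this general form; the hyperparameter values are illustrative only and are not the defaults used in knockpy.

```python
# Draw from a group spike-and-slab prior: the whole coefficient block for a
# feature is either all zero (spike) or all nonzero (Gaussian slab).
import numpy as np

rng = np.random.default_rng(1)
p, k = 5, 3             # number of features, basis functions per feature
p0, tau2 = 0.5, 1.0     # sparsity level and slab variance (illustrative)
beta = np.zeros((p, k))
for j in range(p):
    if rng.random() > p0:                      # feature j enters the model
        beta[j] = rng.normal(0.0, np.sqrt(tau2), size=k)
print(beta)             # each row is entirely zero or entirely nonzero
```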
Lastly, for the fixed-X case, we assume exactly the same model but with the basis functions chosen to be the identity. Thus, this model is a typical spike-and-slab Gaussian linear model in the fixed-X case (George and McCulloch,, 1997). It is worth noting that our implementation for the fixed-X case actually uses a slightly more general Gaussian mixture model as the prior on , where the density for hyperpriors , and . However, for brevity, we only derive the Gibbs updates for the case of two mixture components.
E.2.2 Gibbs sampling updates
Following Section 4, we now review the details of the MLR Gibbs sampler, which samples from the posterior of given the masked data . (This is a standard derivation, but we review it here for the reader's convenience.) As notation, let denote the concatenation of , let denote all of the coordinates of except those of , let denote the indicator that , and let denote all of the basis functions concatenated together. Also note that although this section mostly uses the language of model-X knockoffs, when the basis functions are the identity, the Gibbs updates we are about to describe satisfy the sufficiency property required for fixed-X statistics, and the resulting Gibbs sampler is a valid implementation of the fixed-X MLR statistic.
To improve the convergence of the Gibbs sampler, we slightly modify the meta-algorithm in Algorithm 1 to marginalize over the value of when resampling . To be precise, this means that instead of sampling , we sample . We derive this update in three steps, and along the way we derive the update for .
Step 1: First, we derive the update for . Observe
Analyzing the numerator is easy, as the model specifies that if we let , then
For the denominator, observe that is jointly Gaussian: in particular,
(E.1)
To lighten notation, let . Using the above expression plus the Woodbury identity applied to the density of , we conclude
Since is a matrix, this quantity can be computed relatively efficiently.
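To illustrate the type of computation referenced above, the sketch below evaluates a Gaussian log-density whose covariance is a diagonal term plus a rank-k update, using the Woodbury identity and the matrix determinant lemma so that only a k-by-k system is solved. The variable names (Z, sigma2, tau2) are illustrative, and the covariance structure is a simplified stand-in for the one in the derivation.

```python
import numpy as np

def low_rank_gaussian_logpdf(y, Z, sigma2, tau2):
    """log N(y; 0, sigma2*I + tau2 * Z @ Z.T), computed via Woodbury."""
    n, k = Z.shape
    # Capacitance matrix (k x k): I/tau2 + Z^T Z / sigma2.
    M = np.eye(k) / tau2 + Z.T @ Z / sigma2
    Zt_y = Z.T @ y
    # Quadratic form y^T Sigma^{-1} y via the Woodbury identity.
    quad = y @ y / sigma2 - Zt_y @ np.linalg.solve(M, Zt_y) / sigma2 ** 2
    # log det Sigma via the matrix determinant lemma.
    logdet = n * np.log(sigma2) + k * np.log(tau2) + np.linalg.slogdet(M)[1]
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

# Sanity check against a dense O(n^3) computation.
rng = np.random.default_rng(0)
n, k, sigma2, tau2 = 200, 3, 1.0, 0.5
Z, y = rng.standard_normal((n, k)), rng.standard_normal(n)
Sigma = sigma2 * np.eye(n) + tau2 * Z @ Z.T
dense = -0.5 * (n * np.log(2 * np.pi) + np.linalg.slogdet(Sigma)[1]
                + y @ np.linalg.solve(Sigma, y))
assert np.isclose(dense, low_rank_gaussian_logpdf(y, Z, sigma2, tau2))
```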
Step 2: Next, we derive the distribution of . Of course, the case where is trivial, since then by definition; in the alternative case, note from Equation (E.1) that we have
where above, we use as shorthand for to lighten notation.
Step 3: Lastly, we derive the update for given . In particular, for any vector , let . Then by the law of total probability and the same Woodbury calculations as before,
where above as before.
The only other sampling steps required in the Gibbs sampler are to sample from the conditional distributions of and ; however, this is straightforward since we use conjugate hyperpriors for each of these parameters.
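Since these hyperparameter updates are only mentioned in passing, the sketch below spells out the standard conjugate updates one would use in this kind of spike-and-slab sampler: an inverse-gamma update for the noise variance and a Beta update for the prior inclusion probability. The hyperparameter names and default values are illustrative, not the ones used in knockpy.

```python
import numpy as np

def resample_sigma2(rng, residuals, a0=2.0, b0=1.0):
    """sigma2 | rest ~ InvGamma(a0 + n/2, b0 + ||residuals||^2 / 2)."""
    n = residuals.shape[0]
    shape = a0 + n / 2.0
    rate = b0 + 0.5 * residuals @ residuals
    return 1.0 / rng.gamma(shape, 1.0 / rate)   # inverse of a Gamma draw

def resample_inclusion_prob(rng, inclusion, alpha0=1.0, beta0=1.0):
    """p | rest ~ Beta(alpha0 + #included, beta0 + #excluded)."""
    k = int(np.sum(inclusion))
    return rng.beta(alpha0 + k, beta0 + (len(inclusion) - k))

rng = np.random.default_rng(0)
print(resample_sigma2(rng, rng.standard_normal(100)))
print(resample_inclusion_prob(rng, np.array([1, 0, 0, 1, 0])))
```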
E.2.3 Extension to binary regression
We can easily extend the Gibbs sampler from the preceding section to the case where the response is binary via a latent variable approach. Indeed, let us start by considering probit regression, which means we observe instead of the continuous outcome . Following Albert and Chib, (1993), we note that the distribution of is truncated normal, namely
(E.2)
where . Thus, when we observe a binary response instead of the continuous response , we can employ the same Gibbs sampler as in Section E.2.2, except that after updating , we resample the latent variables according to Equation (E.2), which takes computation per iteration (since we can incrementally update the value of whenever we update or in operations as well). As a result, the computational complexity of this algorithm is the same as that of the algorithm in Section E.2.2. A similar formulation based on Pólya-Gamma random variables is available for the case of logistic regression (see Polson et al., (2013)).
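A minimal sketch of the Albert and Chib, (1993) latent-variable update described above is given below, assuming a recent version of scipy: given the binary responses and the current linear predictor, each latent variable is a unit-variance normal truncated to be positive when the response is one and negative when it is zero.

```python
import numpy as np
from scipy.stats import truncnorm

def resample_latents(rng, y_binary, linear_predictor):
    """Albert-Chib update: z_i ~ N(mu_i, 1) truncated by the sign of y_i."""
    mu = np.asarray(linear_predictor, dtype=float)
    # truncnorm takes bounds in standardized units, i.e. (bound - loc) / scale.
    lower = np.where(y_binary == 1, -mu, -np.inf)
    upper = np.where(y_binary == 1, np.inf, -mu)
    return truncnorm.rvs(lower, upper, loc=mu, scale=1.0,
                         size=mu.shape, random_state=rng)

rng = np.random.default_rng(0)
y = np.array([1, 0, 1, 1])
eta = np.array([0.3, -0.2, 1.5, 0.0])
z = resample_latents(rng, y, eta)
print(np.sign(z))   # +1 where y = 1, -1 where y = 0
```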
E.3 Proof and discussion of Lemma 4.1
Lemma 4.1.
Suppose that under , (i) a.s. for and (ii) the support of equals the support of the marginal law of . Then as ,
Proof.
This proof follows because, by the derivations in Section 4, Algorithm 1 is a standard Gibbs sampler as defined in Algorithm A.40 of Robert and Casella, (2004); i.e., at each step it samples from the conditional law of one unknown variable given all the others (where the unknown variables are and ), and everything is done conditional on . Furthermore, condition (i) implies that the support of does not depend on . As a result, the result is a direct consequence of Corollary 10.12 of Robert and Casella, (2004), applied conditional on . In particular, Corollary 10.12 proves that
The result then follows from the continuous mapping theorem (note that the first assumption for the lemma ensures so the continuous mapping theorem applies). ∎
We note also that the two assumptions of Lemma 4.1 are satisfied in Example 1. In particular, to show that the support of does not depend on , observe that Eq. (4.1) tells us that conditional on for , and for any ,
In Example 1, the numerator and denominator of the above equation are Gaussian likelihoods, so the likelihood ratio is always finite and thus . Similarly, Example 1 is a conjugate Gaussian (additive) spike-and-slab linear model. Well-known results for these models establish that the support of the Gibbs distributions for the linear coefficients and hyperparameters is equal to the support of the marginal distribution for these parameters (George and McCulloch,, 1997)—see Appendix E.2 for examples of detailed derivations showing this result.
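As an aside, the ergodic-average convergence invoked in the proof is easy to visualize on a toy example. The sketch below (unrelated to the specific model of Lemma 4.1) runs a two-block Gibbs sampler for a bivariate normal and shows the running average of a bounded functional settling at its stationary expectation.

```python
import numpy as np

rng = np.random.default_rng(3)
r, n_iters = 0.8, 50_000          # correlation and number of Gibbs iterations
x = y = 0.0
count_positive = 0
for t in range(1, n_iters + 1):
    x = rng.normal(r * y, np.sqrt(1 - r ** 2))   # draw x | y
    y = rng.normal(r * x, np.sqrt(1 - r ** 2))   # draw y | x
    count_positive += (x > 0)
    if t % 10_000 == 0:
        print(t, count_positive / t)             # approaches P(X > 0) = 0.5
```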
E.4 Computing AMLR statistics
The AMLR statistics are a deterministic function of the MLR statistics and where
By Proposition 3.3, is a function of , so computing the denominator of is straightforward since we have already established how to compute . To compute the numerator, recall that Algorithm 1 samples from the joint posterior of . Therefore, we can use the empirical mean of the samples from Algorithm 1 to compute this quantity:
where is the MLR “guess” of the value of .
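Concretely, the averaging step above amounts to a posterior-mean estimate over the Gibbs output, as in the sketch below; gibbs_samples is a hypothetical array of sampled indicators with one row per iteration and one column per feature.

```python
import numpy as np

def posterior_inclusion_probabilities(gibbs_samples, burn_in=0):
    """Empirical mean of sampled indicators, one estimate per feature."""
    samples = np.asarray(gibbs_samples, dtype=float)[burn_in:]
    return samples.mean(axis=0)

# Toy usage: 1000 iterations, 3 features.
rng = np.random.default_rng(0)
draws = rng.random((1000, 3)) < np.array([0.9, 0.1, 0.5])
print(posterior_inclusion_probabilities(draws, burn_in=100))
```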
Appendix F MLR statistics for group knockoffs
In this section, we describe how MLR statistics extend to the setting of group knockoffs (Dai and Barber,, 2016). In particular, for a partition of the features, group knockoffs allow analysts to test the group null hypotheses , which can be useful in settings where is highly correlated and there is not enough data to discover individual variables. Formally, knockoffs are model-X group knockoffs if they satisfy the group pairwise-exchangeability condition for each . Similarly, are fixed-X group knockoffs if (i) and (ii) is block-diagonal, where the blocks correspond to the groups . Given group knockoffs, one computes a single knockoff feature statistic for each group.
MLR statistics extend naturally to the group knockoff setting because we can treat each group of features as a single compound feature. In particular, the masked data for group knockoffs is
(F.1)
and the corresponding MLR statistics are
where above denotes the law of under . For fixed-X knockoffs, we have
where denotes the law of under .
Throughout the paper, we have proved several optimality properties of MLR statistics, and if we treat as a single compound feature, all of these theoretical results (namely Proposition 3.3 and Theorem 3.2) immediately apply to group MLR statistics as well.
To compute group MLR statistics, we can use exactly the same Gibbs sampling strategy as in Section E.2—indeed, one can just treat as a basis representation of a single compound feature and use exactly the same equations as derived previously. This method is implemented in knockpy.
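To make the compound-feature viewpoint concrete, the toy sketch below simply stacks the columns of each group into a single basis matrix, which can then be fed to the same single-feature machinery described in Section E.2; the group definitions are illustrative.

```python
import numpy as np

def group_basis(X, group_indices):
    """Treat the columns in group_indices as one compound feature's basis."""
    return X[:, list(group_indices)]       # n x |group| matrix

X = np.random.default_rng(2).standard_normal((100, 6))
groups = [(0, 1), (2, 3, 4), (5,)]
compound = [group_basis(X, g) for g in groups]
print([Z.shape for Z in compound])         # one basis matrix per group
```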
Appendix G Additional details for the simulations
In this section, we describe the simulation settings from Section 5, and we also give the plot corresponding to Figure 7, which shows the results when . To start, we describe the simulation setting for each plot.
1. Sampling : We sample each row of as an i.i.d. random vector in all simulations, with two choices of . First, in the “AR(1)” setting, we take to correspond to a nonstationary AR(1) Gaussian Markov chain, so has i.i.d. rows satisfying with . Note that the AR(1) setting is the default used in any plot where the covariance matrix is not specified. Second, in the “ErdosRenyi” (ER) setting, we sample a random matrix such that of its off-diagonal entries (selected uniformly at random) are equal to zero; for the remaining entries, we sample . To ensure the final covariance matrix is positive definite, we set and then rescale to be a correlation matrix. (A code sketch of both constructions appears after this list.)
2. Sampling : Unless otherwise specified in the plot, we randomly choose of the entries of to be nonzero and sample the nonzero entries as i.i.d. random variables with by default. The exceptions are: (1) in Figure 2, we set and ; (2) in Figure 5, we set , vary between and as shown in the plot, and in some panels sample the non-null coefficients as random variables; (3) in Figure 7, we take and ; (4) in Figure 8, we take .
3.
4. Sampling knockoffs: We sample MVR and SDP Gaussian knockoffs using the default parameters from knockpy version 1.3, in both the fixed-X and model-X cases. Note that in the model-X case, we use the true covariance matrix to sample knockoffs, thus guaranteeing finite-sample FDR control.
5. Fitting feature statistics: We fit the following types of feature statistics throughout the simulations: LCD statistics, LSM statistics, a random forest with swap importances (Gimenez et al.,, 2019), DeepPINK (Lu et al.,, 2018), MLR statistics (linear variant), MLR statistics with splines, and the MLR oracle. In all cases we use the default hyperparameters from knockpy version 1.3, and we do not adjust the hyperparameters, so that the MLR statistics do not have well-specified priors. The exception is that the MLR oracle has access to the underlying data-generating process and the true coefficients , which is why it is an “oracle.”
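For reference, the sketch below reproduces the flavor of the two covariance constructions described in item 1; the lag-one correlation range, edge probability, and diagonal shift are illustrative placeholders rather than the values used in the paper.

```python
import numpy as np

def ar1_covariance(p, rng, rho_low=0.1, rho_high=0.9):
    """Nonstationary AR(1) correlation matrix with random lag-one correlations."""
    rho = rng.uniform(rho_low, rho_high, size=p - 1)
    Sigma = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            Sigma[i, j] = Sigma[j, i] = np.prod(rho[i:j])
    return Sigma

def erdos_renyi_covariance(p, rng, edge_prob=0.2, delta=0.1):
    """Sparse symmetric random matrix, shifted to be PD and rescaled to a correlation."""
    V = np.zeros((p, p))
    mask = rng.random((p, p)) < edge_prob
    V[mask] = rng.uniform(-1.0, 1.0, size=int(mask.sum()))
    V = (V + V.T) / 2.0
    V += (abs(np.linalg.eigvalsh(V).min()) + delta) * np.eye(p)   # ensure PD
    d = np.sqrt(np.diag(V))
    return V / np.outer(d, d)

rng = np.random.default_rng(0)
Sigma = ar1_covariance(10, rng)
X = rng.multivariate_normal(np.zeros(10), Sigma, size=50)   # i.i.d. rows of X
```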
Appendix H Additional results for the real data applications
H.1 HIV drug resistance
For the HIV drug resistance application, Figures 16-21 show the same results as in Figure 1 but for all drugs in the protease inhibitor (PI) class; broadly, they show that MLR statistics have higher power because they ensure that the feature statistics with high absolute values are consistently positive, as discussed in Section 2. Note that in these plots, for the lasso-based statistics, we plot the normalized statistics so that the absolute value of each statistic is less than one. Similarly, for the MLR statistics, instead of directly plotting the masked likelihood ratio as per Equation (1.3), we plot
because we find this quantity easier to interpret than a log likelihood ratio. In particular, if and only if is roughly equally likely to be positive or negative under , and when is always positive under .
Additionally, Figures 14 and 15 show the number of discoveries made by each feature statistic for SDP and MVR knockoffs, respectively, stratified by the drug in question. Note that the specific data analysis is identical to that of Barber and Candès, (2015) and Fithian and Lei, (2020) other than the choice of feature statistic—see either of those papers or https://github.com/amspector100/mlr_knockoff_paper for more details.
H.2 Financial factor selection
We now present a few additional details for the financial factor selection analysis from Section 6.2. First, we list the ten index funds we analyze, which are: XLB (materials), XLC (communication services), XLE (energy), XLF (financials), XLK (information technology), XLP (consumer staples), XLRE (real estate), XLU (utilities), XLV (health care), and XLY (consumer discretionary). Second, for each feature statistic, Table 1 shows the average realized FDP across all ten analyses—as desired, the average FDP for each method is lower than the nominal level of . All code is available at https://github.com/amspector100/mlr_knockoff_paper.
Knockoff Type | Feature Stat. | Average FDP
---|---|---
MVR | LCD | 0.013636
MVR | LSM | 0.004545
MVR | MLR | 0.038571
SDP | LCD | 0.000000
SDP | LSM | 0.035000
SDP | MLR | 0.039002