
Power of Knockoff: The Impact of Ranking Algorithm, Augmented Design, and Symmetric Statistic

Zheng Tracy Ke (zke@fas.harvard.edu)
Department of Statistics, Harvard University, Cambridge, MA 02138, USA

Jun S. Liu (jliu@stat.harvard.edu)
Department of Statistics, Harvard University, Cambridge, MA 02138, USA

Yucong Ma (yucongma@g.harvard.edu)
Department of Statistics, Harvard University, Cambridge, MA 02138, USA
Abstract

The knockoff filter is a recent false discovery rate (FDR) control method for high-dimensional linear models. We point out that knockoff has three key components: ranking algorithm, augmented design, and symmetric statistic, and each component admits multiple choices. By considering various combinations of the three components, we obtain a collection of variants of knockoff. All these variants guarantee finite-sample FDR control, and our goal is to compare their power. We assume a Rare and Weak signal model on the regression coefficients and compare the power of different variants of knockoff by deriving explicit formulas for the false positive rate and false negative rate. Our results provide new insights on how to improve power when controlling FDR at a targeted level. We also compare the power of knockoff with its prototype, a method that uses the same ranking algorithm but has access to an ideal threshold. The comparison reveals the additional price one pays by finding a data-driven threshold to control FDR.

Keywords: CI-knockoff, Hamming error, phase diagram, Rare/Weak signal model, SDP-knockoff, variable ranking, variable selection

1 Introduction

We consider a linear regression model, where $y\in\mathbb{R}^{n}$ is the vector of responses and $X\in\mathbb{R}^{n\times p}$ is the design matrix. We assume

$$y=X\beta+e,\qquad X=[X_{1},X_{2},\ldots,X_{n}]^{\prime}\in\mathbb{R}^{n\times p};\quad\beta\in\mathbb{R}^{p},\quad e\sim N(0,\sigma^{2}I_{n}).\tag{1.1}$$

Motivated by high-dimensional data analysis, we assume $p$ is large and $\beta$ is a sparse vector (i.e., many coordinates of $\beta$ are zero). Variable selection is the problem of estimating the true support of $\beta$. Let $\hat{S}\subset\{1,2,\ldots,p\}$ denote the set of selected variables. The false discovery rate (FDR) is defined to be

$$\mathbb{E}\biggl[\frac{\#\{j:\beta_{j}=0,\,j\in\hat{S}\}}{\#\{j:j\in\hat{S}\}\vee 1}\biggr].$$

Controlling FDR is a problem of great interest. When the design is orthogonal (i.e., $X^{\prime}X$ is a diagonal matrix), the BH-procedure (Benjamini and Hochberg, 1995) can be employed to control FDR at a targeted level. When the design is non-orthogonal, the BH-procedure faces challenges, and several recent FDR control methods were proposed, such as the knockoff filter (Barber and Candès, 2015), model-X knockoff (Candes et al., 2018), Gaussian mirror (Xing et al., 2023), and multiple data splits (Dai et al., 2022). All these methods are shown to control FDR at a targeted level, but their power is less studied. This paper aims to provide a theoretical understanding of the power of FDR control methods.

We introduce a unified framework that captures the key ideas behind recent FDR control methods. Starting from the seminal work of Barber and Candès (2015), this framework has been implicitly used in the literature, but this is the first time it is abstracted out explicitly:

  • (a)

    There is a ranking algorithm, which assigns an importance metric to each variable.

  • (b)

    An FDR control method creates an augmented design matrix by adding fake variables.

  • (c)

The augmented design and the response vector $y$ are supplied to the ranking algorithm as input, and the output is converted to a (signed) importance metric for each original variable through a symmetric statistic.

The three components, ranking algorithm, augmented design, and symmetric statistic, should coordinate so that the resulting importance metrics for null variables ($\beta_{j}=0$) have symmetric distributions and the importance metrics for non-null variables ($\beta_{j}\neq 0$) are positive with high probability. When these requirements are satisfied, one can mimic the BH procedure (Benjamini and Hochberg, 1995) to control FDR at a targeted level.

The choices of the three components are not unique. For example, we may use any linear regression method $\hat{\beta}$ as the ranking algorithm, where we assign $|\hat{\beta}_{j}|$ as the importance metric for variable $j$. Similarly, the other two components also admit multiple choices. This leads to many different combinations of the three components. The literature has revealed insights on how to choose these components to obtain valid FDR control, but there is little understanding of how to design them to boost power. The main contribution of this paper is dissecting and detailing the impact of each component on power.

1.1 Main results and discoveries

We start from the orthodox knockoff in Barber and Candès (2015), which uses Lasso as the ranking algorithm, a semi-definite programming (SDP) procedure to construct the augmented design, and the signed maximum function as the symmetric statistic. We then replace each component of the orthodox knockoff by a popular alternative choice in the literature. We compare the power of the resulting variant of knockoff with the power of the orthodox one. This serves to reveal the impact of each component on power.

Our results lead to some noteworthy discoveries: (i) For the choice of symmetric statistic, the signed maximum is better than a popular alternative - the difference statistic; (ii) For the choice of the augmented design, the SDP approach in orthodox knockoff is less favored than a recent alternative - the conditional independence approach (Liu and Rigollet, 2019); (iii) For the choice of ranking algorithm, we compare Lasso and least-squares and find that Lasso has an advantage when the signals are extremely sparse and least-squares has an advantage when the signals are moderately sparse.

For each variant of knockoff, we also consider its prototype, which applies the ranking algorithm to the original design $X$ and selects variables by applying an ideal threshold to the importance metrics output by the ranking algorithm. We note that the core idea of knockoff hinges on the other two components, augmented design and symmetric statistic, as these two components serve to find a data-driven threshold on the importance metrics. Therefore, the comparison of knockoff and its prototype reveals the key difference between FDR control and variable selection: we need to pay an extra price to find a data-driven threshold. If an FDR control method is designed effectively, it should have a negligible power loss compared with its prototype. In the knockoff framework, when the design is orthogonal or blockwise diagonal and when the ranking algorithm is Lasso, we can show that knockoff (with proper choices of augmented design and symmetric statistic) indeed yields a negligible power loss compared with its own prototype. On the other hand, this is not true for a general design or when the ranking algorithm is not Lasso.

1.2 The theoretical framework and criteria of power comparison

Let $G=X^{\prime}X\in\mathbb{R}^{p\times p}$ be the Gram matrix. Without loss of generality, we assume that each column of $X$ has been normalized so that the diagonal entries of $G$ are all equal to $1$. (Footnote 1: We use a conventional normalization of $X$ in the study of Rare/Weak signal models. It differs from the standard normalization, where the diagonal entries of $G$ are assumed to be $n$; our $\beta$ corresponds to $\sqrt{n}\beta$ under the standard normalization, which is why $n$ does not appear in the order of the nonzero $\beta_{j}$.) We study a challenging regime of “Rare and Weak signals” (Donoho and Jin, 2015; Jin and Ke, 2016), where for some constants $\vartheta\in(0,1)$ and $r>0$, we consider settings where

$$\#\{j:\beta_{j}\neq 0\}\;\sim\;p^{1-\vartheta},\qquad|\beta_{j}|\;\sim\;\sqrt{2r\log(p)}\ \text{ if }\beta_{j}\neq 0.$$

The two parameters, $\vartheta$ and $r$, characterize the signal rarity and signal weakness, respectively. Here, $\sqrt{\log(p)}$ is the minimax order for successful inference of the support of $\beta$ (Genovese et al., 2012), and the constant factor $r$ drives subtle phase transitions. This model is widely used in multiple testing (Donoho and Jin, 2004; Arias-Castro et al., 2011; Barnett et al., 2017) and variable selection (Ji and Jin, 2012; Jin et al., 2014; Ke et al., 2014).

The power of an FDR control method depends on the target FDR level $q$. Instead of fixing $q$, we derive a trade-off diagram between FDR and the true positive rate (TPR) as $q$ varies. This trade-off diagram provides a full characterization of power for any given model parameters $(\vartheta,r)$. We also derive a phase diagram (Jin and Ke, 2016) for each FDR control method. The phase diagram is a partition of the two-dimensional space $(\vartheta,r)$ into different regions, according to the asymptotic behavior of the Hamming error (i.e., the expected sum of false positives and false negatives). The phase diagram provides a visualization of power for all $(\vartheta,r)$ together. Both the FDR-TPR trade-off diagram and the phase diagram can be used as criteria for power comparison. We prefer the phase diagram, because a single phase diagram covers the whole parameter range (in contrast, the FDR-TPR trade-off diagram is tied to a specified $(\vartheta,r)$). Throughout the paper, we use the phase diagram to compare different variants of knockoff. At the same time, we also give explicit forms of the false positive rate and false negative rate, from which the FDR-TPR trade-off diagram can be deduced easily.

1.3 Related literature

The literature on power analysis of FDR control methods is small. Su et al. (2017) set up a framework for studying the trade-off between false positive rate and true positive rate along the Lasso solution path. Weinstein et al. (2017) and Weinstein et al. (2021) extended this framework to find a trade-off for the knockoff filter, when the ranking algorithm is the Lasso and the thresholded Lasso, respectively. These trade-off diagrams are for linear sparsity (the number of nonzero coefficients of $\beta$ is a constant fraction of $p$) and independent Gaussian designs (the $X(i,j)$ are iid $N(0,n^{-1/2})$ variables). However, their analysis and results do not apply to our setting: in our setting, $\beta$ is much sparser, and the overall signal strength as characterized by $\|\beta\|$ is much smaller. Furthermore, we are primarily interested in correlated designs, whereas their study is mostly focused on iid Gaussian designs.

For correlated designs, Liu and Rigollet (2019) gave sufficient and necessary conditions on $X$ such that knockoff has full power, but they did not provide an explicit trade-off diagram. Moreover, their analysis does not apply to the orthodox knockoff but only to a variant of knockoff that uses de-biased Lasso as the ranking algorithm. Beyond linear sparsity, Fan et al. (2019) studied the power of model-X knockoff for arbitrary sparsity, under a stronger signal strength: we assume $|\beta_{j}|\asymp\sqrt{\log(p)}$, while they assumed $|\beta_{j}|\gg\sqrt{\log(p)}$. In a similar setting, Javanmard and Javadi (2019) studied the power of using de-biased Lasso for FDR control. Our paper differs from these works because we study the regime of weaker signals and also derive explicit FDR-TPR trade-off diagrams and phase diagrams.

Wang and Janson (2022) and Spector and Janson (2022) studied the power of model-X knockoff and conditional randomization tests. They considered linear sparsity and iid Gaussian designs, and found a power disadvantage of constructing the augmented design as in the orthodox knockoff (with least-squares as the ranking algorithm). This qualitatively agrees with some of our conclusions in Section 5, but it is for a different setting with uncorrelated variables and linear sparsity. We also study more variants of knockoff than those considered in the aforementioned works. Recently, Li and Fithian (2021) recast the fixed-X knockoff as a conditional post-selection inference method and studied its power.

In a sequence of papers (Ji and Jin, 2012; Jin et al., 2014; Ke et al., 2014), the Rare/Weak signal model was used to study variable selection. These works focused on the class of Screen-and-Clean methods for variable selection and proved their optimality under various design classes. We borrow the notion of phase diagram from these works. However, they did not consider any FDR control method, and their methods do not apply to knockoff. Different from the proof techniques employed by the aforementioned works, our proof is based on a geometric approach, where the key is studying the geometric properties of the “rejection region” induced by knockoff (see Section 7 for details).

1.4 Organization

The remainder of this paper is organized as follows. Section 2 reviews the idea of knockoff. Section 3 introduces the Rare/Weak signal model and explains how to use it as a theoretical platform to study and compare the power of FDR control methods. Sections 4-6 contain the main results, where we study the impact of symmetric statistic, augmented design, and ranking algorithm, respectively. Section 7 sketches the proof and explains the geometrical insight behind the proof. Section 8 contains simulation results. Section 9 concludes with a short discussion. Detailed proofs are relegated to the Appendix.

2 The knockoff filter, its variants and prototypes

Let us first review the orthodox knockoff filter (Barber and Candès, 2015). Write $G=X^{\prime}X$ and let $\mathrm{diag}(s)$, with $s\in\mathbb{R}^{p}$, be a nonnegative diagonal matrix (to be chosen by the user) such that $\mathrm{diag}(s)\preceq 2G$. Knockoff first creates a design matrix $\tilde{X}\in\mathbb{R}^{n\times p}$ such that

$$\tilde{X}^{\prime}\tilde{X}=G,\qquad X^{\prime}\tilde{X}=G-\mathrm{diag}(s).\tag{2.1}$$

Let $x_{j}$ and $\tilde{x}_{j}$ be the $j$th column of $X$ and $\tilde{X}$, respectively, $1\leq j\leq p$. Here, $\tilde{x}_{j}$ is called a knockoff of variable $j$. For any $\lambda>0$, let $\hat{\beta}(\lambda)\in\mathbb{R}^{2p}$ be the solution of Lasso (Tibshirani, 1996) on the expanded design matrix $[X,\tilde{X}]$ with tuning parameter $\lambda$:

$$\hat{\beta}(\lambda)=\mathrm{argmin}_{b}\bigl\{\|y-[X,\tilde{X}]b\|^{2}/2+\lambda\|b\|_{1}\bigr\}.\tag{2.2}$$

For each $1\leq j\leq p$, let $Z_{j}=\sup\{\lambda>0:\hat{\beta}_{j}(\lambda)\neq 0\}$ and $\tilde{Z}_{j}=\sup\{\lambda>0:\hat{\beta}_{p+j}(\lambda)\neq 0\}$. The importance of variable $j$ is measured by a symmetric statistic

$$W_{j}=f(Z_{j},\tilde{Z}_{j}),\tag{2.3}$$

where $f(\cdot,\cdot)$ is a bivariate function satisfying $f(v,u)=-f(u,v)$. Here, $\{W_{j}\}_{j=1}^{p}$ are (signed) importance metrics for the variables. Under some regularity conditions, it can be shown that $W_{j}$ has a symmetric distribution when $\beta_{j}=0$ and that $W_{j}$ is positive with high probability when $\beta_{j}\neq 0$. Given a threshold $t>0$, the number of false discoveries is equal to

$$\#\{j:\beta_{j}=0,W_{j}>t\}\;\approx\;\#\{j:\beta_{j}=0,W_{j}<-t\}\;\approx\;\#\{j:W_{j}<-t\},$$

where the first approximation is based on the symmetry of the distribution of $W_{j}$ for null variables and the second approximation comes from the sparsity of $\beta$. The right hand side gives an estimate of the number of false discoveries. Hence, a data-driven threshold to control FDR at $q$ is

$$\hat{T}(q)=\min\biggl\{t>0:\frac{\#\{j:W_{j}<-t\}}{\#\{j:W_{j}>t\}\vee 1}\leq q\biggr\}.\tag{2.4}$$

The set of selected variables is $\hat{S}(q)=\{j:W_{j}>\hat{T}(q)\}$. As long as $\tilde{X}$ in (2.1) exists, it can be shown that the FDR associated with $\hat{S}(q)$ is guaranteed to be $\leq q$.
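To make the recipe concrete, here is a minimal numerical sketch of the full pipeline. It is hedged: we use the least-squares ranking of (2.8) below in place of the Lasso path (to keep the code short) and the equi-correlated choice $\mathrm{diag}(s)=s\,I_{p}$ with $s=\min\{1,2\lambda_{\min}(G)\}$ rather than the SDP choice; all function and variable names are ours, not taken from any knockoff package.

```python
import numpy as np

def knockoff_select(X, y, q=0.1, seed=0):
    """Sketch of the knockoff filter: build Xtil satisfying (2.1), rank
    variables by least-squares on [X, Xtil] as in (2.8), combine Z and
    Ztil with the signed maximum, and apply the threshold (2.4)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    G = X.T @ X                              # Gram matrix, unit diagonal assumed
    # Equi-correlated diag(s): largest s*I_p with s*I_p <= 2G and s <= 1.
    s = min(1.0, 2 * np.linalg.eigvalsh(G)[0])
    S = s * np.eye(p)
    # U: n x p orthonormal columns orthogonal to col(X); requires n >= 2p.
    M = rng.standard_normal((n, p))
    M -= X @ np.linalg.solve(G, X.T @ M)     # project out the column space of X
    U, _ = np.linalg.qr(M)
    # C with C'C = 2S - S G^{-1} S, via an eigendecomposition.
    w, V = np.linalg.eigh(2 * S - S @ np.linalg.solve(G, S))
    C = np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T
    Xtil = X @ (np.eye(p) - np.linalg.solve(G, S)) + U @ C   # satisfies (2.1)
    # Least-squares ranking (2.8) on the augmented design.
    bhat = np.linalg.lstsq(np.hstack([X, Xtil]), y, rcond=None)[0]
    Z, Ztil = np.abs(bhat[:p]), np.abs(bhat[p:])
    W = np.sign(Z - Ztil) * np.maximum(Z, Ztil)   # signed maximum, see (2.5)
    # Data-driven threshold (2.4), searched over the observed |W_j|.
    for t in np.sort(np.abs(W[W != 0])):
        if np.sum(W < -t) / max(np.sum(W > t), 1) <= q:
            return np.where(W > t)[0]
    return np.array([], dtype=int)
```

One can check numerically that the constructed $\tilde{X}$ satisfies $\tilde{X}^{\prime}\tilde{X}\approx G$ and $X^{\prime}\tilde{X}\approx G-\mathrm{diag}(s)$, the two identities in (2.1).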

2.1 Variants of knockoff

The idea of knockoff provides a general framework for FDR control. It consists of three key components, as summarized below:

  • (a)

A ranking algorithm, which takes $y$ and an arbitrary design and assigns an importance metric $Z_{j}$ to each variable in the design. In (2.2), a particular ranking algorithm based on the solution path of Lasso is used.

  • (b)

An augmented design, which is the $n\times(2p)$ matrix $[X,\tilde{X}]$, where a knockoff $\tilde{x}_{j}$ is created for each original $x_{j}$. We supply the augmented design to the ranking algorithm to get importance metrics $Z_{j}$ and $\tilde{Z}_{j}$ for each variable $j$ and its knockoff.

  • (c)

A symmetric statistic $f(\cdot,\cdot)$, which combines the two importance metrics $Z_{j}$ and $\tilde{Z}_{j}$ into an ultimate importance metric $W_{j}$ for variable $j$.

The choice of each component is non-unique. For (c), $f$ can be any anti-symmetric function. Two popular choices are the signed maximum statistic and the difference statistic:

$$f^{\mathrm{sgm}}(u,v)=\mathrm{sgn}(u-v)\cdot\max\{u,v\},\qquad\text{and}\qquad f^{\mathrm{dif}}(u,v)=u-v.\tag{2.5}$$

For (b), the freedom comes from choosing $\mathrm{diag}(s)$ and constructing $\tilde{X}$. In fact, once $\mathrm{diag}(s)$ is given, it can be shown that any $\tilde{X}$ satisfying (2.1) yields the same asymptotic performance for knockoff. Hence, the choice of the augmented design boils down to choosing $\mathrm{diag}(s)$. A popular option is the SDP-knockoff, which solves for $\mathrm{diag}(s)$ from a semi-definite program:

$$\min\sum_{j}(1-s_{j}),\qquad\text{subject to}\quad 0\leq s_{j}\leq 1,\;\;\mathrm{diag}(s)\preceq 2G.\tag{2.6}$$

Another option is the CI-knockoff (Liu and Rigollet, 2019):

$$\mathrm{diag}(s)=c\cdot[\mathrm{diag}(G^{-1})]^{-1},\qquad\text{where }c=\sup\bigl\{0<\tilde{c}\leq 1:\tilde{c}\,[\mathrm{diag}(G^{-1})]^{-1}\preceq 2G\bigr\}.\tag{2.7}$$
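Both choices of $\mathrm{diag}(s)$ are straightforward to compute. Below is a hedged sketch: the SDP in (2.6) is handed to a generic convex solver (we assume cvxpy with an SDP-capable backend; the paper does not prescribe a solver), and the supremum in (2.7) is obtained from a single eigenvalue computation.

```python
import numpy as np
import cvxpy as cp   # assumed available; any SDP solver would do

def s_sdp(G):
    """diag(s) for SDP-knockoff, solving (2.6)."""
    p = G.shape[0]
    s = cp.Variable(p)
    problem = cp.Problem(cp.Minimize(cp.sum(1 - s)),
                         [s >= 0, s <= 1, 2 * G - cp.diag(s) >> 0])
    problem.solve()
    return np.asarray(s.value)

def s_ci(G):
    """diag(s) for CI-knockoff as in (2.7): s = c * [diag(G^{-1})]^{-1}."""
    d = 1.0 / np.diag(np.linalg.inv(G))   # entries of [diag(G^{-1})]^{-1}
    Dih = np.diag(1.0 / np.sqrt(d))       # D^{-1/2} for D = diag(d)
    # c*D <= 2G  iff  c <= lambda_min(D^{-1/2} (2G) D^{-1/2}); cap c at 1.
    c = min(1.0, np.linalg.eigvalsh(Dih @ (2 * G) @ Dih)[0])
    return c * d
```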

For (a), Lasso is used as the ranking algorithm in the orthodox knockoff (see (2.2)), but it can be replaced by other linear regression methods. Take the least-squares estimator $\hat{\beta}^{\mathrm{ols}}=(X^{\prime}X)^{-1}X^{\prime}y$ for example. We can define a ranking algorithm that outputs $|\hat{\beta}_{j}^{\mathrm{ols}}|$ as the importance metric. If we supply the augmented design $[X,\tilde{X}]$ to this ranking algorithm, then we have

$$\hat{\beta}=([X,\tilde{X}]^{\prime}[X,\tilde{X}])^{-1}[X,\tilde{X}]^{\prime}y.\tag{2.8}$$

We can set $Z_{j}=|\hat{\beta}_{j}|$ and $\tilde{Z}_{j}=|\hat{\beta}_{j+p}|$ and plug them into (2.3).

Figure 1: An illustration of the knockoff and its prototype.

In summary, the flexibility of the three components gives rise to many different variants of knockoff. For example, we can use (2.2) or (2.8) as the ranking algorithm, (2.6) or (2.7) as the augmented design, and either statistic in (2.5) as the symmetric statistic; this already gives $2\times 2\times 2=8$ variants of knockoff. By the theory in Barber and Candès (2015), each variant guarantees finite-sample FDR control. When the FDR is under control, the user always wants to select as many true signals as possible, i.e., to maximize the power. In this paper, one of our goals is to understand and compare the power of different variants of knockoff.

To this end, we start from the default choices of the three components in the orthodox knockoff, where the ranking algorithm is Lasso as in (2.2), the augmented design is the SDP-knockoff as in (2.6), and the symmetric statistic is the signed maximum in (2.5). In Sections 4-6, we successively alter each component and study its impact on the power.

2.2 Prototypes of knockoff

Given a variant of knockoff (where a specific choice of the ranking algorithm is applied to the augmented design $[X,\tilde{X}]$), we define the corresponding prototype method as follows: it runs the same ranking algorithm on the original design matrix $X$ and outputs $W_{1}^{*},W_{2}^{*},\ldots,W_{p}^{*}$ as importance metrics. The method selects variables by thresholding $W_{j}^{*}$ at $T_{q}^{*}$, where

$$T_{q}^{*}=\min\biggl\{t>0:\mathbb{E}\biggl[\frac{\#\{j:\beta_{j}=0,\,W^{*}_{j}>t\}}{\#\{j:W^{*}_{j}>t\}\vee 1}\biggr]\leq q\biggr\}.\tag{2.9}$$

Compared with knockoff, the prototype ranks variables by $W_{j}^{*}$, whose induced ranking may be different from the one by $W_{j}$. Additionally, the prototype has access to an ideal threshold $T_{q}^{*}$ that guarantees exact FDR control but is practically infeasible. In contrast, knockoff has to find a data-driven threshold from (2.4). See Figure 1.

We look at two examples. Consider the orthodox knockoff, where the ranking algorithm is Lasso (see (2.2)). Its prototype runs Lasso on $X$ to get $\hat{\beta}^{\mathrm{lasso}}(\lambda)=\mathrm{argmin}_{b}\{\|y-Xb\|^{2}/2+\lambda\|b\|_{1}\}$ and assigns an importance metric to variable $j$ as

$$W_{j}^{*}=\sup\bigl\{\lambda>0:\hat{\beta}_{j}^{\mathrm{lasso}}(\lambda)\neq 0\bigr\}.\tag{2.10}$$

It then selects variables by thresholding $W_{j}^{*}$ at the ideal threshold in (2.9). We call this method Lasso-path. It is the prototype of all variants of knockoff that use Lasso as the ranking algorithm. All variants of knockoff that use least-squares (see (2.8)) as the ranking algorithm share the same prototype, which computes $\hat{\beta}^{\mathrm{ols}}=(X^{\prime}X)^{-1}X^{\prime}y$ and assigns an importance metric to variable $j$ as

$$W_{j}^{*}=|\hat{\beta}_{j}^{\mathrm{ols}}|=|e_{j}^{\prime}G^{-1}X^{\prime}y|.\tag{2.11}$$

It then selects variables by thresholding $W_{j}^{*}$ at the ideal threshold in (2.9). We call this method least-squares.

In this paper, besides comparing different variants of knockoff, we also aim to compare each variant with its prototype. Here is the motivation: FDR control splits into two tasks: (1) ranking the importance of variables and (2) finding a data-driven threshold. When the FDR is under control, the power depends on how well Task 1 is performed. Importantly, in knockoff, although the augmented design and the symmetric statistic are meant to carry out Task 2 only, they do affect the final ranking of variables, because the ranking by the $W_{j}$'s is usually different from the ranking by the $W_{j}^{*}$'s. This yields a potential power loss compared with the prototype, a price we pay for finding a data-driven threshold. Hence, a power comparison between knockoff and its prototype helps us understand how large this price is.

Remark 1. In this paper, we focus on two ranking algorithms, Lasso-path and least-squares. They are both tuning-free. When the ranking algorithm has tuning parameters, we should not set the tuning parameters in the prototype the same as in the original knockoff. For example, we can use $|\hat{\beta}_{j}^{\mathrm{lasso}}(\lambda)|$ to rank variables, treating $\lambda$ as a tuning parameter. The prototype runs Lasso on the original design with $p$ variables, while knockoff runs Lasso on the augmented design with $2p$ variables. The optimal $\lambda$ that minimizes the expected Hamming error is different in the two scenarios. A reasonable approach for comparing knockoff and its prototype is to use their respective optimal $\lambda$. This requires computing the expected Hamming error of Lasso for an arbitrary $\lambda$ (Ji and Jin, 2012; Ke and Wang, 2021).

3 Rare/Weak signal model and criteria of power comparison

We introduce our theoretical framework of power comparison. Recall that we consider a linear model $y=X\beta+e$, where $y\in\mathbb{R}^{n}$, $X=[X_{1},X_{2},\ldots,X_{n}]^{\prime}\in\mathbb{R}^{n\times p}$, and $e\sim N(0,\sigma^{2}I_{n})$. Without loss of generality, fix $\sigma=1$. Given $p$, we allow $n$ to be any integer such that $n\geq 2p$. This comes from the requirement of knockoff (it needs $n\geq 2p$ to guarantee the existence of $\tilde{X}$ in (2.1)) and should not be viewed as a limitation of our theory. Our results are extendable to $n<2p$, provided that knockoff is replaced by its extension for this case (see Section 9 for a discussion). The Gram matrix is

$$G:=X^{\prime}X\in\mathbb{R}^{p\times p},\qquad\text{where we assume }G_{jj}=1\text{ for all }1\leq j\leq p.\tag{3.1}$$

We adopt the Rare/Weak signal model (Donoho and Jin, 2004) and assume that $\beta$ satisfies

$$\beta_{j}\;\overset{iid}{\sim}\;(1-\epsilon_{p})\nu_{0}+\epsilon_{p}\nu_{\tau_{p}},\qquad 1\leq j\leq p,\tag{3.2}$$

where $\nu_{a}$ denotes a point mass at $a$. Here, $\epsilon_{p}\in(0,1)$ is the expected fraction of signals, and $\tau_{p}>0$ is the signal strength. We let $p$ be the driving asymptotic parameter and tie $(\epsilon_{p},\tau_{p})$ to $p$ through fixed constants $\vartheta\in(0,1)$ and $r>0$:

$$\epsilon_{p}=p^{-\vartheta},\qquad\tau_{p}=\sqrt{2r\log(p)}.\tag{3.3}$$

The parameters $\vartheta$ and $r$ characterize the signal rarity and the signal strength, respectively. Here, $n$ does not appear in the order of the nonzero $\beta_{j}$, because we have already re-parameterized $(X,\beta)$ such that the diagonals of $G$ are $1$ (see Footnote 1).

Under the Rare/Weak signal model (3.2)-(3.3), we define two diagrams for characterizing the power of knockoff. Let $W_{j}$ be the ultimate importance metric (2.3) assigned to variable $j$, and consider the set of variables selected at a threshold $\sqrt{2u\log(p)}$:

$$\hat{S}(u)=\bigl\{1\leq j\leq p:W_{j}>\sqrt{2u\log(p)}\bigr\}.$$

Let $S=\{1\leq j\leq p:\beta_{j}\neq 0\}$. Define $\mathrm{FP}_{p}(u)=\mathbb{E}(|\hat{S}(u)\backslash S|)$, $\mathrm{FN}_{p}(u)=\mathbb{E}(|S\backslash\hat{S}(u)|)$, and $\mathrm{TP}_{p}(u)=\mathbb{E}(|S\cap\hat{S}(u)|)$, where the expectation is taken with respect to the randomness of both $\beta$ and $y$. Write $s_{p}=p\epsilon_{p}$, and define

$$\mathrm{Hamm}_{p}(u)=\mathrm{FP}_{p}(u)+\mathrm{FN}_{p}(u),\quad\mathrm{FDR}_{p}(u)=\frac{\mathrm{FP}_{p}(u)}{\mathrm{FP}_{p}(u)+\mathrm{TP}_{p}(u)},\quad\mathrm{TPR}_{p}(u)=\frac{\mathrm{TP}_{p}(u)}{s_{p}}.$$

The first quantity is the expected Hamming error. The last two quantities are proxies of the false discovery rate and the true positive rate, respectively. (Footnote 2: The $\mathrm{FDR}_{p}(u)$ we consider here is the ratio of the expectations of false positives and total discoveries (called mFDR in some literature), not the original definition of FDR, which is the expectation of the ratio of false positives to total discoveries. According to Javanmard and Montanari (2018), mFDR and FDR can differ in situations with high variability, but that is not the case here: in our setting, the expected number of total discoveries grows to infinity as a power of $p$, hence mFDR and FDR have a negligible difference.)
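For a single realization of $(\beta,y)$, the empirical counterparts of these quantities are immediate to compute; a minimal sketch (the quantities above take expectations over $\beta$ and $y$, which one would approximate by averaging over Monte Carlo repetitions):

```python
import numpy as np

def error_metrics(beta, selected):
    """Empirical Hamming error, FDR proxy, and TPR for one draw."""
    S = set(np.flatnonzero(beta))        # true support
    S_hat = set(selected)                # selected variables
    FP, FN = len(S_hat - S), len(S - S_hat)
    TP = len(S & S_hat)
    return FP + FN, FP / max(FP + TP, 1), TP / max(len(S), 1)
```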

The following definition is conventional in the study of Rare/Weak signal models (Genovese et al., 2012; Ji and Jin, 2012) and will be used frequently in our theoretical results:

Definition 3.1 (Multi-$\log(p)$ term)

Consider a sequence $\{a_{p}\}_{p=1}^{\infty}$. If for any fixed $\delta>0$, $a_{p}p^{\delta}\to\infty$ and $a_{p}p^{-\delta}\to 0$, we call $a_{p}$ a multi-$\log(p)$ term and write $a_{p}=L_{p}$ ($L_{p}$ is a generic notation for all multi-$\log(p)$ terms). If there is a constant $b_{0}\in\mathbb{R}$ such that $a_{p}p^{-b_{0}}=L_{p}$, we write $a_{p}=L_{p}p^{b_{0}}$, which means that for any fixed $\delta>0$, $a_{p}p^{-b_{0}+\delta}\to\infty$ and $a_{p}p^{-b_{0}-\delta}\to 0$.

In the Rare/Weak signal model, for many classes of designs of interest, $\mathrm{FDR}_{p}(u)$ and $\mathrm{TPR}_{p}(u)$ satisfy the following property: there exist two fixed functions $g_{\mathrm{FDR}}(u;\vartheta,r)$ and $g_{\mathrm{TPR}}(u;\vartheta,r)$ such that, for any $(\vartheta,r,u)$, as $p\to\infty$,

$$\mathrm{FDR}_{p}(u)=L_{p}p^{-g_{\mathrm{FDR}}(u;\vartheta,r)},\qquad 1-\mathrm{TPR}_{p}(u)=L_{p}p^{-g_{\mathrm{TPR}}(u;\vartheta,r)}.\tag{3.4}$$

The two functions $g_{\mathrm{FDR}}$ and $g_{\mathrm{TPR}}$ depend on the choice of the three components in knockoff and on the design class. We define the FDR-TPR trade-off diagram as follows:

Definition 3.2 (FDR-TPR trade-off diagram)

Given a variant of knockoff and a sequence of designs indexed by $p$, if $\mathrm{FDR}_{p}(u)$ and $\mathrm{TPR}_{p}(u)$ satisfy (3.4), the FDR-TPR trade-off diagram associated with $(\vartheta,r)$ is the plot of $g_{\mathrm{FDR}}(u;\vartheta,r)$ on the y-axis against $g_{\mathrm{TPR}}(u;\vartheta,r)$ on the x-axis, as $u$ varies.

Figure 2: Left: the FDR-TPR trade-off diagram for a few values of $(\vartheta,r)$. Right: the phase diagram. The design is orthogonal, and the importance metric is as in (3.6). Each FDR-TPR trade-off diagram corresponds to one point in the phase diagram.

An FDR-TPR trade-off diagram depends on $(\vartheta,r)$. To compare the performance of two variants of knockoff, we would have to draw many curves for different values of $(\vartheta,r)$. Hence, we introduce another diagram, which characterizes the power simultaneously at all $(\vartheta,r)$. Define $\mathrm{Hamm}_{p}^{*}\equiv\min_{u}\{\mathrm{FP}_{p}(u)+\mathrm{FN}_{p}(u)\}$, the minimum expected Hamming selection error when the threshold $u$ is chosen optimally. For each variant of knockoff and each class of designs of interest, there exists a bivariate function $f^{*}_{\mathrm{Hamm}}(\vartheta,r)$ such that

$$\mathrm{Hamm}_{p}^{*}=L_{p}p^{f^{*}_{\mathrm{Hamm}}(\vartheta,r)}.\tag{3.5}$$

The phase diagram is defined as follows:

Definition 3.3 (Phase diagram)

When $\mathrm{Hamm}^{*}_{p}$ satisfies (3.5), the phase diagram is defined to be the partition of the two-dimensional space $(\vartheta,r)$ into three regions:

  • Region of Exact Recovery (ER): $\{(\vartheta,r):f^{*}_{\mathrm{Hamm}}(\vartheta,r)<0\}$.

  • Region of Almost Full Recovery (AFR): $\{(\vartheta,r):0<f^{*}_{\mathrm{Hamm}}(\vartheta,r)<1-\vartheta\}$.

  • Region of No Recovery (NR): $\{(\vartheta,r):f_{\mathrm{Hamm}}^{*}(\vartheta,r)\geq 1-\vartheta\}$.

The curves separating different regions are called phase curves. We use $h_{\mathrm{AFR}}(\vartheta)$ to denote the curve between NR and AFR, and $h_{\mathrm{ER}}(\vartheta)$ the curve between AFR and ER.

In the ER region, the expected Hamming error $\mathrm{Hamm}_{p}^{*}$ tends to zero. Therefore, with high probability, the support of $\beta$ is exactly recovered. In the AFR region, $\mathrm{Hamm}_{p}^{*}$ does not tend to zero but is much smaller than $p\epsilon_{p}$ (the expected number of signals). As a result, with high probability, the majority of signals are correctly recovered. In the NR region, $\mathrm{Hamm}_{p}^{*}$ is comparable to the number of signals, and variable selection fails. The phase diagram was introduced in the literature (Genovese et al., 2012; Ji and Jin, 2012) but has never been used to study FDR control methods.

We illustrate these definitions with an example. Both the FDR-TPR trade-off diagram and the phase diagram depend only on the importance metrics assigned to variables. Therefore, they are also well-defined for the prototypes in Section 2.2. We consider a special class of designs, where $X^{\prime}X=I_{p}$, and a prototype that assigns the importance metrics

$$W^{*}_{j}=|x_{j}^{\prime}y|,\qquad 1\leq j\leq p.\tag{3.6}$$

The next proposition is adapted from the literature (Donoho and Jin, 2004; Ji and Jin, 2012) and proved in the Appendix. We use $a_{+}$ to denote $\max\{a,0\}$, for any $a\in\mathbb{R}$.

Proposition 3.1

Suppose $X^{\prime}X=I_{p}$ and consider the importance metric in (3.6). When $r>\vartheta$, the FDR-TPR trade-off diagram is given by $g_{\mathrm{FDR}}(u;\vartheta,r)=(u-\vartheta)_{+}$ and $g_{\mathrm{TPR}}(u;\vartheta,r)=(\sqrt{r}-\sqrt{u})_{+}^{2}$. The phase diagram is given by $h_{\mathrm{AFR}}(\vartheta)=\vartheta$ and $h_{\mathrm{ER}}(\vartheta)=(1+\sqrt{1-\vartheta})^{2}$.

These diagrams are visualized in Figure 2.
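The curves in Proposition 3.1 are elementary, so Figure 2 is easy to reproduce; a minimal plotting sketch (matplotlib assumed; the $(\vartheta,r)$ values are chosen only for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

u = np.linspace(0.0, 4.0, 400)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

# Left: FDR-TPR trade-off diagram for a few (vartheta, r); Proposition 3.1.
for theta, r in [(0.3, 1.0), (0.3, 2.0), (0.5, 2.0)]:
    g_tpr = np.maximum(np.sqrt(r) - np.sqrt(u), 0) ** 2   # x-axis
    g_fdr = np.maximum(u - theta, 0)                      # y-axis
    ax1.plot(g_tpr, g_fdr, label=f"vartheta={theta}, r={r}")
ax1.set_xlabel("g_TPR"); ax1.set_ylabel("g_FDR"); ax1.legend()

# Right: phase curves h_AFR(vartheta) = vartheta, h_ER = (1 + sqrt(1-vartheta))^2.
theta = np.linspace(0.0, 1.0, 200)
ax2.plot(theta, theta, label="h_AFR")
ax2.plot(theta, (1 + np.sqrt(1 - theta)) ** 2, label="h_ER")
ax2.set_xlabel("vartheta"); ax2.set_ylabel("r"); ax2.legend()
plt.show()
```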

As we mentioned in Section 2.2, FDR control is composed of the task of ranking variables and the task of finding a data-driven threshold. It is the variable ranking that determines the power of an FDR control method simultaneously at all FDR levels $q$. The two diagrams in Definitions 3.2-3.3 depend only on the importance metrics assigned to variables; hence, they measure the quality of variable ranking, which is fundamental for the power comparison of different FDR control methods when they all control FDR at the same level.

4 Impact of the symmetric statistic

We fix the choice of ranking algorithm and augmented design as in the orthodox knockoff and compare the two symmetric statistics in (2.5). For simplicity, we consider the orthogonal design where $X^{\prime}X=I_{p}$. In this special case, the ranking algorithm in (2.2) reduces to calculating marginal regression coefficients, and the outputs $Z_{j}$ and $\tilde{Z}_{j}$ reduce to

$$Z_{j}=|x_{j}^{\prime}y|,\qquad\text{and}\qquad\tilde{Z}_{j}=|\tilde{x}_{j}^{\prime}y|,\qquad 1\leq j\leq p.\tag{4.1}$$

In the augmented design, the choice of $\mathrm{diag}(s)$ in (2.6) reduces to $\mathrm{diag}(s)=I_{p}$. We consider a slightly more general form:

$$\mathrm{diag}(s)=(1-a)I_{p},\qquad\text{where }-1<a<1\text{ is a fixed constant}.\tag{4.2}$$

Given $\mathrm{diag}(s)$, the construction of $\tilde{X}$ is not unique, but all constructions lead to exactly the same FDR-TPR trade-off diagram and the same phase diagram. For this reason, we only specify $\mathrm{diag}(s)$ as in (4.2), not the actual $\tilde{X}$. Fixing the above choices (4.1)-(4.2), we consider the two symmetric statistics in (2.5), which lead to the importance metrics

$$W_{j}^{\mathrm{sgm}}=(Z_{j}\vee\tilde{Z}_{j})\cdot\begin{cases}+1,&\text{if }Z_{j}>\tilde{Z}_{j},\\ -1,&\text{if }Z_{j}\leq\tilde{Z}_{j},\end{cases}\qquad\text{and}\qquad W_{j}^{\mathrm{dif}}=Z_{j}-\tilde{Z}_{j}.\tag{4.3}$$

We call these two variants knockoff-sgm and knockoff-diff, respectively. The next theorem gives the explicit forms of $\mathrm{FP}_{p}(u)$ and $\mathrm{FN}_{p}(u)$ associated with these two variants. Its proof can be found in the Appendix.

Theorem 4.1

Consider a linear regression model where (3.1)-(3.3) hold. Suppose $n\geq 2p$ and $G=I_{p}$. We construct $\tilde{X}$ in knockoff as in (4.2), for a constant $a\in(-1,1)$, and let $Z_{j}$ and $\tilde{Z}_{j}$ be as in (4.1). For any constant $u>0$, let $\mathrm{FP}_{p}(u)$ and $\mathrm{FN}_{p}(u)$ be the expected numbers of false positives and false negatives when selecting variables with $W_{j}>\sqrt{2u\log(p)}$. When $W_{j}$ is the signed maximum statistic in (4.3), as $p\to\infty$,

$$\mathrm{FP}_{p}(u)=L_{p}p^{1-u},\qquad\mathrm{FN}_{p}(u)=L_{p}p^{1-\vartheta-\min\bigl\{\frac{(1-|a|)r}{2},\,(\sqrt{r}-\sqrt{u})_{+}^{2}\bigr\}}.$$

When $W_{j}$ is the difference statistic in (4.3), as $p\to\infty$,

$$\mathrm{FP}_{p}(u)=L_{p}p^{1-u},\qquad\mathrm{FN}_{p}(u)=L_{p}p^{1-\vartheta-\frac{(1-|a|)}{2}(\sqrt{r}-\sqrt{u})_{+}^{2}}.$$

Here, $L_{p}$ is the generic multi-$\log(p)$ notation in Definition 3.1. For Theorem 4.1, using Mills' ratio, we actually know that $L_{p}\asymp 1/\sqrt{\log(p)}$. For other theorems, we do not always know the exact order of $L_{p}$, but we can intuitively regard it as a polynomial of $\log(p)$.

Given Theorem 4.1 and Definitions 3.2-3.3, we can derive the explicit FDR-TPR trade-off diagrams and phase diagrams:

Corollary 4.1

In the setting of Theorem 4.1, when $r>\vartheta$, the FDR-TPR trade-off diagram is given by

$$g_{\mathrm{FDR}}(u;\vartheta,r)=(u-\vartheta)_{+},\qquad g_{\mathrm{TPR}}(u)=\begin{cases}\min\bigl\{\frac{(1-|a|)r}{2},\;(\sqrt{r}-\sqrt{u})_{+}^{2}\bigr\},&\text{if }W_{j}=W_{j}^{\mathrm{sgm}},\\ \frac{(1-|a|)}{2}(\sqrt{r}-\sqrt{u})_{+}^{2},&\text{if }W_{j}=W_{j}^{\mathrm{dif}}.\end{cases}$$

The phase diagram is given by

$$h_{\mathrm{AFR}}(\vartheta)=\vartheta,\qquad h_{\mathrm{ER}}(\vartheta)=\begin{cases}\max\bigl\{\frac{2-2\vartheta}{1-|a|},\;(1+\sqrt{1-\vartheta})^{2}\bigr\},&\text{if }W_{j}=W_{j}^{\mathrm{sgm}},\\ \bigl(1+\sqrt{\frac{2-2\vartheta}{1-|a|}}\bigr)^{2},&\text{if }W_{j}=W_{j}^{\mathrm{dif}}.\end{cases}$$
Figure 3: Power comparison of knockoff with different symmetric statistics (orthogonal design; the ranking algorithm is Lasso, and the augmented design is such that $\mathrm{diag}(s)=I_{p}$). The left two panels are the phase diagrams of knockoff-diff (left) and knockoff-sgm (middle), where the dashed lines are the phase curves of their common prototype. The right panel is the FDR-TPR trade-off diagram of knockoff-sgm, where each trade-off curve corresponds to one point in the phase diagram.

For both knockoff-sgm and knockoff-diff, by Corollary 4.1, the best choice of $a$ is $a=0$. In the remainder of this section, we fix $a=0$. Figure 3 gives visualizations of the FDR-TPR trade-off diagrams and phase diagrams. The prototype is Lasso-path (see (2.10)). In the orthogonal design $X^{\prime}X=I_{p}$, Lasso-path reduces to the prototype in (3.6), whose FDR-TPR trade-off diagram and phase diagram are given in Figure 2.
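As a quick numerical check, one can tabulate $h_{\mathrm{ER}}(\vartheta)$ from Corollary 4.1 over a grid of $a$; a minimal sketch:

```python
import numpy as np

def h_ER_sgm(theta, a):
    # Corollary 4.1, signed maximum statistic
    return max((2 - 2 * theta) / (1 - abs(a)), (1 + np.sqrt(1 - theta)) ** 2)

def h_ER_dif(theta, a):
    # Corollary 4.1, difference statistic
    return (1 + np.sqrt((2 - 2 * theta) / (1 - abs(a)))) ** 2

theta = 0.5
for a in [-0.5, -0.25, 0.0, 0.25, 0.5]:
    print(a, h_ER_sgm(theta, a), h_ER_dif(theta, a))
# Both curves are minimized at a = 0, and h_ER_sgm <= h_ER_dif everywhere,
# previewing the comparison of the two symmetric statistics below.
```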

Comparison of two symmetric statistics:

First, we compare the phase diagrams in Figures 2-3 and find that (i) knockoff-sgm has a strictly better phase diagram than knockoff-diff, and (ii) knockoff-sgm has the same phase diagram as the prototype. It suggests that signed maximum is a better choice of symmetric statistic. It also suggests that knockoff-sgm yields a negligible power loss relative to its prototype.

We also point out that knockoff-sgm is already “optimal” among all symmetric statistics for this orthogonal design. The reason is that, when $X^{\prime}X=I_{p}$, the Hamming error $\mathrm{Hamm}_{p}^{*}$ has an information-theoretic lower bound (Genovese et al., 2012; Ji and Jin, 2012), whose induced phase diagram coincides with the phase diagram of the prototype, which is also the phase diagram of knockoff-sgm. This is the optimal phase diagram any method can achieve (including all variants of knockoff with other symmetric statistics).

Next, we compare the FDR-TPR trade-off diagrams of knockoff and the prototype. We focus on knockoff-sgm, whose trade-off diagram is in Figure 3 (right panel). The trade-off diagram of the prototype is in Figure 2 (left panel). We find that the trade-off diagram of knockoff-sgm is slightly different from that of the prototype. By Theorem 4.1, $(1-\mathrm{TPR}_{p})=\mathrm{FN}_{p}/s_{p}\geq L_{p}p^{-r/2}$; hence, the FDR-TPR trade-off curve is truncated at $r/2$ on the x-axis. For large $\vartheta$, the curve hits zero before the x-axis reaches $r/2$, and the truncation has no impact. However, for small $\vartheta$, the curve changes due to the truncation (Figure 3, right panel, all but the blue curve).

Some geometric insights, especially why signed maximum is “optimal”.

By (4.1) and (4.3), the importance metrics produced by knockoff can be written as $W_{j}=I(x_{j}^{\prime}y,\tilde{x}_{j}^{\prime}y)$, where $x_{j}$ and $\tilde{x}_{j}$ are the $j$th variable and its knockoff, and $I(\cdot,\cdot)$ is a bivariate function depending on the choice of symmetric statistic. Define the “rejection region” as

$${\cal R}=\Bigl\{(h_{1},h_{2})\in\mathbb{R}^{2}:\;I\bigl(h_{1}\sqrt{2\log(p)},\,h_{2}\sqrt{2\log(p)}\bigr)>\sqrt{2u\log(p)}\Bigr\}.$$

Figure 4 shows the rejection regions induced by knockoff-sgm, knockoff-diff, and their prototype. Write $\hat{h}_{1}=x_{j}^{\prime}y/\sqrt{2\log(p)}$ and $\hat{h}_{2}=\tilde{x}_{j}^{\prime}y/\sqrt{2\log(p)}$. The random vector $(\hat{h}_{1},\hat{h}_{2})^{\prime}$ follows a bivariate normal distribution with covariance matrix $\frac{1}{2\log(p)}I_{2}$. Its mean vector is $(0,0)^{\prime}$ when $\beta_{j}=0$ and $(\sqrt{r},0)^{\prime}$ when $\beta_{j}=\tau_{p}$. By Lemma 7.1 (to be introduced in Section 7), the exponent in $\mathrm{FP}_{p}$ is determined by the Euclidean distance from $(0,0)^{\prime}$ to ${\cal R}$, and the exponent in $\mathrm{FN}_{p}$ is determined by the Euclidean distance from $(\sqrt{r},0)^{\prime}$ to ${\cal R}^{c}$. From Figure 4, it is clear that the difference statistic is inferior to the signed maximum statistic because the distance from $(\sqrt{r},0)^{\prime}$ to ${\cal R}^{c}$ is strictly smaller for the former.
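For the signed maximum with $a=0$, the rejection region is ${\cal R}=\{|h_{1}|>\max(|h_{2}|,\sqrt{u})\}$, and both distances have closed forms; a small sketch recovering the exponents of Theorem 4.1 (with $a=0$) from this geometry:

```python
import numpy as np

def exponents_sgm(r, u):
    """Squared distances driving the exponents in Theorem 4.1 (a = 0),
    for R = {|h1| > max(|h2|, sqrt(u))} (signed maximum statistic)."""
    # Distance from (0, 0) to R is sqrt(u), so FP_p(u) = L_p p^{1 - u}.
    fp_exp = u
    # Distance from (sqrt(r), 0) to R^c: either to the wall |h1| = sqrt(u)
    # (length sqrt(r) - sqrt(u)) or to the line |h1| = |h2| (length sqrt(r/2)).
    fn_exp = min(max(np.sqrt(r) - np.sqrt(u), 0) ** 2, r / 2)
    # FN_p(u) = L_p p^{1 - vartheta - fn_exp}, matching Theorem 4.1.
    return fp_exp, fn_exp
```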

Figure 4: The rejection region of the symmetric statistics (orthogonal design, $a=0$ in the construction of knockoff variables). Left: the signed maximum statistic. Middle: the difference statistic. Right: the prototype.

As mentioned earlier, the signed maximum is “optimal” among all symmetric statistics, because its phase diagram already matches the information-theoretic lower bound. We now use Figure 4 to provide a geometric interpretation of why this is so. We call a subset ${\cal R}$ an eligible rejection region if there exists a symmetric statistic $f(\cdot,\cdot)$ whose induced rejection region is ${\cal R}$. It is not hard to see that any eligible ${\cal R}$ must be symmetric with respect to both the x-axis and the y-axis. In addition, by the anti-symmetry requirement $f(v,u)=-f(u,v)$, an eligible rejection region also needs to satisfy the following necessary condition: ${\cal R}\cap{\cal R}_{\pm}=\emptyset$, where ${\cal R}_{\pm}$ is the reflection of ${\cal R}$ with respect to the line $y=\pm x$. The prototype has the optimal phase diagram, but its rejection region ${\cal R}_{0}$ (Figure 4, right panel) does not satisfy this condition. We can see that the rejection region of knockoff-sgm (left panel) is a minimal modification of ${\cal R}_{0}$ that tailors it to this condition. This partially explains why the signed maximum is already the best choice in this setting.

Remark 2. We consider a non-stochastic threshold $\sqrt{2u\log(p)}$ in Theorem 4.1. For a data-driven threshold $\sqrt{2\hat{u}\log(p)}$, if there is $\epsilon_{p}=o(1)$ such that $|\hat{u}-u|\leq\epsilon_{p}$ with probability $1-o(p^{-2})$, then Theorem 4.1 continues to hold.

Remark 3. In this section, we conduct the power comparison only for the orthogonal design. Our rationale is as follows: a good method has to at least perform well in the simplest case. If a method is inferior to others for the orthogonal design, then we do not expect it to have good potential in real applications, where the designs can be much more complicated; i.e., the results on orthogonal designs help us filter out methods that have little potential in practice. Such insights are valuable to users.

Remark 4. Recently, Weinstein et al. (2021) showed a remarkable result: under linear sparsity and a random Gaussian design, knockoff with the difference symmetric statistic has great power. We note that the prototype of their knockoff is thresholded Lasso, not Lasso, so the gain of power is primarily from the prototype instead of the symmetric statistic. See also Ke and Wang (2021) for a comparison of thresholded Lasso vs. Lasso.

5 Impact of the augmented design

We fix the choice of ranking algorithm and symmetric statistic as in the orthodox knockoff and compare two augmented designs, the SDP-knockoff in (2.6) and the CI-knockoff in (2.7). We also call the two respective variants of knockoff SDP-knockoff and CI-knockoff, so that “SDP-knockoff” (say) has two meanings, an augmented design or a variant of knockoff, depending on the context.

When $X^{\prime}X=I_{p}$, SDP-knockoff and CI-knockoff coincide and reduce to $\mathrm{diag}(s)=I_{p}$, so it is impossible to tell their difference in power. We must consider non-orthogonal designs. However, since there is no explicit form of the Lasso solution path, results for a general $X$ are difficult to obtain; moreover, even setting aside the technical challenges, the resulting phase diagrams may be too messy to provide useful insight. We hope to find a class of non-orthogonal designs such that (i) it is mathematically tractable, (ii) it is considerably different from the orthogonal design and allows $G$ to have some “large” off-diagonal entries, and (iii) it captures some key features of real applications. We start from a class of row-wise sparse designs, which approximate the designs in many real applications (e.g., in bioinformatics and in compressed sensing). The next proposition is adapted from Lemma 1 of Jin et al. (2014), and its proof is omitted.

Proposition 5.1

Consider a linear model where (3.1)-(3.3) hold. Suppose each row of $G$ has at most $L_{p}$ nonzero entries, where $L_{p}$ is a multi-$\log(p)$ term as in Definition 3.1. Let $S$ be the support of $\beta$. There exists a constant integer $m_{0}=m_{0}(\vartheta)$ such that, with probability $1-o(1)$, $G_{SS}$ is a blockwise diagonal matrix after a permutation of indices, and the maximum block size is bounded by $m_{0}$.

Proposition 5.1 is a consequence of the interplay between design sparsity and signal sparsity: under the Rare/Weak signal model (3.2) and a sparse design, the true signals in $S$ appear in groups, where each group contains only a small number of variables and distinct groups are mutually uncorrelated. This motivates us to consider a simpler setting where $G$ is blockwise diagonal by itself. While these two settings look quite different from each other, the asymptotic behaviors of the Hamming error are closely related. For example, when $G$ is a tridiagonal matrix with equal values on the sub-diagonal, the optimal phase diagram is the same as in the case where $G$ is blockwise diagonal with $2\times 2$ blocks (Ji and Jin, 2012; Jin et al., 2014). Inspired by these observations, we study a class of blockwise diagonal designs (Jin and Ke, 2016): for some $\rho\in(-1,1)$ and a $p\times p$ permutation matrix ${\cal T}$,

$$G={\cal T}\,\mathrm{diag}(B,B,\ldots,B,B_{1})\,{\cal T},\qquad\text{where }B=\begin{bmatrix}1&\rho\\ \rho&1\end{bmatrix},\quad B_{1}=\begin{cases}B,&\text{if $p$ is even},\\ 1,&\text{if $p$ is odd}.\end{cases}\tag{5.1}$$

This design serves the aforementioned purposes (i)-(iii): it has only one parameter, $\rho$, so it is mathematically tractable. The nonzero off-diagonal entries of $G$ are of constant order, so this design is sufficiently different from the orthogonal design (in contrast, much of the literature considers the independent random Gaussian design, for which the maximum absolute off-diagonal entry of $G$ is only $o(1)$). Also, as argued above, studying this design helps us draw useful insights that will likely continue to hold for general sparse designs.
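For later reference, the Gram matrix (5.1) is trivial to build in code; a minimal sketch with the permutation ${\cal T}$ taken to be the identity:

```python
import numpy as np
from scipy.linalg import block_diag

def blockwise_G(p, rho):
    """Gram matrix (5.1) with T = identity: 2x2 blocks [[1, rho], [rho, 1]],
    plus a trailing 1x1 block when p is odd."""
    B = np.array([[1.0, rho], [rho, 1.0]])
    blocks = [B] * (p // 2)
    if p % 2 == 1:
        blocks.append(np.eye(1))
    return block_diag(*blocks)
```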

5.1 The prototype, Lasso-path

Before studying SDP-knockoff and CI-knockoff, we first study their prototype, Lasso-path. The next theorem characterizes $\mathrm{FP}_{p}(u)$ and $\mathrm{FN}_{p}(u)$ for Lasso-path; it is proved in the Appendix.

Theorem 5.1

Consider a linear regression model where (3.1)-(3.3) hold. Suppose $n\geq 2p$ and $G$ is as in (5.1) with a correlation parameter $\rho\in(-1,1)$. Let $W_{j}^{*}$ be as in (2.10). For any constant $u>0$, let $\mathrm{FP}_{p}(u)$ and $\mathrm{FN}_{p}(u)$ be the expected numbers of false positives and false negatives when selecting variables with $W^{*}_{j}>\sqrt{2u\log(p)}$. As $p\to\infty$,

$$\mathrm{FP}_{p}(u)=L_{p}p^{1-\min\left\{u,\;\vartheta+(\sqrt{u}-|\rho|\sqrt{r})^{2}+(\xi_{\rho}\sqrt{r}-\eta_{\rho}\sqrt{u})_{+}^{2}-(\sqrt{r}-\sqrt{u})_{+}^{2}\right\}},$$

and

$$\mathrm{FN}_{p}(u)=\begin{cases}L_{p}p^{1-\vartheta-\{(\sqrt{r}-\sqrt{u})_{+}-[(1-\xi_{\rho})\sqrt{r}-(1-\eta_{\rho})\sqrt{u}]_{+}\}^{2}},&\rho\geq 0,\\ L_{p}p^{1-\min\left\{\vartheta+\{(\sqrt{r}-\sqrt{u})_{+}-[(1-\xi_{\rho})\sqrt{r}-(1-\eta_{\rho})\sqrt{u}]_{+}\}^{2},\;2\vartheta+(\xi_{\rho}\sqrt{r}-\eta_{\rho}^{-1}\sqrt{u})_{+}^{2}\right\}},&\rho<0,\end{cases}$$

where $\xi_{\rho}=\sqrt{1-\rho^{2}}$ and $\eta_{\rho}=\sqrt{(1-|\rho|)/(1+|\rho|)}$.

Using Theorem 5.1, we can deduce the FDR-TPR trade-off diagram and the phase diagram. To save space, we only present the phase diagram:

Corollary 5.1 (Phase diagram of Lasso-path)

In the setting of Theorem 5.1, the phase diagram of Lasso-path is given by

$$h_{\mathrm{AFR}}(\vartheta)=\vartheta,\qquad h_{\mathrm{ER}}(\vartheta)=\begin{cases}\max\{h_{1}(\vartheta),h_{2}(\vartheta)\},&\text{when }\rho\geq 0,\\ \max\{h_{1}(\vartheta),h_{2}(\vartheta),h_{3}(\vartheta),h_{4}(\vartheta)\},&\text{when }\rho<0,\end{cases}$$

where $h_{1}(\vartheta)=(1+\sqrt{1-\vartheta})^{2}$, $h_{2}(\vartheta)=\bigl(1+\sqrt{\frac{1+|\rho|}{1-|\rho|}}\bigr)^{2}(1-\vartheta)$, $h_{3}(\vartheta)=\frac{1}{(1+\rho)^{2}}\bigl(\sqrt{\frac{1+\rho}{1-\rho}}\sqrt{1-2\vartheta}+\sqrt{\frac{1-\rho}{1+\rho}}\sqrt{1-\vartheta}\bigr)^{2}\cdot 1\{\vartheta<1/2\}$, and $h_{4}(\vartheta)=\frac{1}{(1+\rho)^{2}}\bigl(1+\sqrt{\frac{1+\rho}{1-\rho}}\sqrt{1-2\vartheta}\bigr)^{2}\cdot 1\{\vartheta<1/2\}$.

A visualization of the phase diagram for $\rho=\pm 0.5$ is in Figure 5.

Figure 5: The phase diagrams of Lasso-path (blockwise diagonal designs). Left: $\rho=0.5$. Middle: $\rho=-0.5$. Right: zoom-out of the middle panel. In all three panels, the dashed lines are the phase curves for orthogonal designs ($\rho=0$), as a reference.

When $\rho=0$, the blockwise diagonal design reduces to the orthogonal design, Lasso-path reduces to (3.6), and the phase diagram reduces to the one in Proposition 3.1. Comparing Figure 5 with Figure 2, the phase diagram of Lasso-path is inferior to that for orthogonal designs. This suggests that the strength of design correlations can have a significant impact on the performance of variable selection.

Another observation from Figure 5 is that the sign of $\rho$ plays a crucial role. This is related to the “signal cancellation” phenomenon (Ke et al., 2014; Ke and Yang, 2017). Suppose $\{j,j+1\}$ is a block and both $\beta_{j}$ and $\beta_{j+1}$ are signals. It is seen that $\mathbb{E}[x_{j}^{\prime}y|\beta]=(1+\rho)\tau_{p}$, whose absolute value is strictly smaller than $\tau_{p}$ for a negative $\rho$. Hence, when $\rho$ is negative, the signal at $j+1$ creates a “cancellation effect” and makes $x_{j}$ marginally less correlated with $y$. Lasso is known to be quite vulnerable to signal cancellation (Zhao and Yu, 2006). We provide a more rigorous explanation in Section 7 using geometric properties of the Lasso solution.

5.2 SDP-knockoff

We now study the SDP-knockoff, where $\mathrm{diag}(s)$ is as in (2.6). For the blockwise diagonal design parameterized by $\rho$, we have an explicit form of $\mathrm{diag}(s)$:

$$\mathrm{diag}(s)=(1-a)I_{p},\qquad\text{where}\quad a=\begin{cases}2|\rho|-1,&|\rho|\geq 1/2,\\ 0,&|\rho|<1/2.\end{cases}\tag{5.2}$$

The value of $a$ controls the correlation between $x_{j}$ and $\tilde{x}_{j}$. SDP-knockoff aims to minimize this correlation, subject to the eligibility constraints. We first study the case $|\rho|\geq 1/2$.

Theorem 5.2 (The case of $|\rho|\geq 1/2$)

Consider a linear model where (3.1)-(3.3) hold. Suppose $n\geq 2p$ and $G$ is as in (5.1), where $|\rho|\geq 1/2$. We construct $\tilde{X}$ in knockoff with $\mathrm{diag}(s)$ as in (5.2). Let $Z_{j}$, $\tilde{Z}_{j}$ and $W_{j}$ be as in (2.2)-(2.3), where $f$ is the signed maximum in (2.5). For any constant $u>0$, let $\mathrm{FP}_{p}(u)$ and $\mathrm{FN}_{p}(u)$ be the expected numbers of false positives and false negatives when selecting variables with $W_{j}>\sqrt{2u\log(p)}$. As $p\to\infty$,

$$\mathrm{FP}_{p}(u)=L_{p}p^{1-\min\left\{u,\;\vartheta+(\sqrt{u}-|\rho|\sqrt{r})^{2}+(\xi_{\rho}\sqrt{r}-\eta_{\rho}\sqrt{u})_{+}^{2}-(\sqrt{r}-\sqrt{u})_{+}^{2}\right\}},$$

and for $\rho\geq 1/2$,

$$\mathrm{FN}_{p}(u)=L_{p}p^{1-\vartheta-\left\{(\sqrt{r}-\sqrt{u})_{+}-[(1-\xi_{\rho})\sqrt{r}-(1-\eta_{\rho})\sqrt{u}]_{+}-(\lambda_{\rho}\sqrt{r}-\eta_{\rho}\sqrt{u})_{+}\right\}^{2}},$$

and for $\rho\leq-1/2$,

$$\mathrm{FN}_{p}(u)=L_{p}p^{1-\min\bigl\{\vartheta+\left\{(\sqrt{r}-\sqrt{u})_{+}-[(1-\xi_{\rho})\sqrt{r}-(1-\eta_{\rho})\sqrt{u}]_{+}-(\lambda_{\rho}\sqrt{r}-\eta_{\rho}\sqrt{u})_{+}\right\}^{2},\;2\vartheta\bigr\}},$$

where $\xi_{\rho}=\sqrt{1-\rho^{2}}$, $\eta_{\rho}=\sqrt{(1-|\rho|)/(1+|\rho|)}$, and $\lambda_{\rho}=\sqrt{1-\rho^{2}}-\sqrt{1-|\rho|}$.

When $|\rho|<1/2$, listing the separate forms of $\mathrm{FP}_{p}(u)$ and $\mathrm{FN}_{p}(u)$ is tedious. We instead present the form of $\mathrm{FP}_{p}(u)+\mathrm{FN}_{p}(u)$, which is sufficient for deriving the phase diagram.

Theorem 5.3 (The case of $|\rho|<1/2$)

Consider a linear model where (3.1)-(3.3) hold. Suppose $n\geq 2p$ and $G$ is as in (5.1), where $|\rho|<1/2$. We construct $\tilde{X}$ in knockoff with $\mathrm{diag}(s)$ as in (5.2). Let $Z_{j}$, $\tilde{Z}_{j}$ and $W_{j}$ be as in (2.2)-(2.3), where $f$ is the signed maximum in (2.5). For any constant $u>0$, let $\mathrm{FP}_{p}(u)$ and $\mathrm{FN}_{p}(u)$ be the expected numbers of false positives and false negatives when selecting variables with $W_{j}>\sqrt{2u\log(p)}$. As $p\to\infty$,

$$\mathrm{FP}_{p}(u)+\mathrm{FN}_{p}(u)=\begin{cases}L_{p}p^{1-f^{+}_{\mathrm{Hamm}}(u,r,\vartheta)},&0\leq\rho<1/2,\\ L_{p}p^{1-\min\bigl\{f^{+}_{\mathrm{Hamm}}(u,r,\vartheta),\;2\vartheta+(\xi_{\rho}\sqrt{r}-\eta_{\rho}^{-1}\sqrt{u})_{+}^{2},\;2\vartheta+\frac{(1+2\rho)^{2}(1-\rho)}{2(1+\rho)}r\bigr\}},&-1/2<\rho<0,\end{cases}$$

where

$$f^{+}_{\mathrm{Hamm}}(u,r,\vartheta)=\min\bigl\{u,\;\vartheta+(\sqrt{u}-|\rho|\sqrt{r})^{2}+(\xi_{\rho}\sqrt{r}-\eta_{\rho}\sqrt{u})_{+}^{2}-(\sqrt{r}-\sqrt{u})_{+}^{2},\;\vartheta+[(\sqrt{r}-\sqrt{u})_{+}-((1-\xi_{\rho})\sqrt{r}-(1-\eta_{\rho})\sqrt{u})_{+}]^{2}\bigr\},$$

and $\xi_{\rho}$, $\eta_{\rho}$ are the same as in Theorem 5.2.

Corollary 5.2 (Phase diagram of SDP-knockoff)

Suppose the conditions of Theorems 5.2-5.3 hold, where $\rho$ can be any value in $(-1,1)$. Let $h^{\mathrm{lasso}}_{\mathrm{AFR}}(\vartheta)$ and $h^{\mathrm{lasso}}_{\mathrm{ER}}(\vartheta)$ be the phase curves in Corollary 5.1. Define

ρ0=2122(note: ρ00.35).\rho_{0}=\sqrt{2}-1-\sqrt{2-\sqrt{2}}\qquad\mbox{(note: $\rho_{0}\approx-0.35$)}.

For SDP-knockoff, hAFR(ϑ)=hAFRlasso(ϑ)h_{AFR}(\vartheta)=h^{\mathrm{lasso}}_{AFR}(\vartheta), and hER(ϑ)h_{ER}(\vartheta) has three cases:

  • When ρ[ρ0,1)\rho\in[\rho_{0},1), hER(ϑ)=hERlasso(ϑ)h_{ER}(\vartheta)=h^{\mathrm{lasso}}_{ER}(\vartheta).

  • When ρ(0.5,ρ0)\rho\in(-0.5,\rho_{0}), hER(ϑ)=max{hERlasso(ϑ),h5(ϑ)}h_{ER}(\vartheta)=\max\{h^{\mathrm{lasso}}_{ER}(\vartheta),\,h_{5}(\vartheta)\}, where h5(ϑ)=2(12ϑ)(1+ρ)(1+2ρ)2(1ρ)h_{5}(\vartheta)=\frac{2(1-2\vartheta)(1+\rho)}{(1+2\rho)^{2}(1-\rho)}.

  • When ρ(1,0.5]\rho\in(-1,-0.5], hER(ϑ)=hERlasso(ϑ)h_{ER}(\vartheta)=h^{\mathrm{lasso}}_{ER}(\vartheta) if ϑ>1/2\vartheta>1/2, and hER(ϑ)=h_{ER}(\vartheta)=\infty otherwise.

A visualization of the phase diagram for three values of ρ\rho is in Figure 6.

Comparing Corollary 5.2 with Corollary 5.1, we have the following observations:

  • When ρ[ρ0,1)\rho\in[\rho_{0},1), SDP-knockoff shares the same phase diagram as Lasso-path, i.e., SDP-knockoff yields a negligible power loss compared with its prototype.

  • When ρ(1,ρ0)\rho\in(-1,\rho_{0}), the phase diagram of SDP-knockoff is inferior to that of Lasso-path. In particular, when ρ(1,0.5]\rho\in(-1,-0.5], the Almost Full Recovery region of SDP-knockoff is infinite: for any ϑ<1/2\vartheta<1/2, no matter how large rr is, SDP-knockoff never achieves Exact Recovery.

We now explain the discrepancy between the phase diagrams of SDP-knockoff and Lasso-path for ρ(1,ρ0)\rho\in(-1,\rho_{0}). First, consider ρ(0.5,ρ0)\rho\in(-0.5,\rho_{0}). By (5.2), a=0a=0 and diag(s)=Ip\mathrm{diag}(s)=I_{p}. It follows that the jjth knockoff is uncorrelated with the jjth original variable. However, this knockoff is still highly correlated with the (j+1)(j+1)th original variable. Suppose jj is a true signal variable. Then, a true signal at (j+1)(j+1) will increase the absolute correlation between yy and x~j\tilde{x}_{j} but decrease the absolute correlation between yy and xjx_{j} (since ρ<0\rho<0), making it more difficult for xjx_{j} to stand out. Next, consider ρ(1,0.5]\rho\in(-1,-0.5]. Suppose {j,j+1}\{j,j+1\} has two ‘nested’ signals, i.e., (βj,βj+1)=(τp,τp)(\beta_{j},\beta_{j+1})=(\tau_{p},\tau_{p}). By (5.2) and an elementary calculation,

𝔼[xjy|β]=(1+ρ)τp,𝔼[x~jy|β]={ρτp,when 0.5<ρ<0,(1+ρ)τp,when 1<ρ0.5.\mathbb{E}[x_{j}^{\prime}y|\beta]=(1+\rho)\tau_{p},\qquad\mathbb{E}[\tilde{x}_{j}^{\prime}y|\beta]=\begin{cases}\rho\tau_{p},&\mbox{when }-0.5<\rho<0,\\ -(1+\rho)\tau_{p},&\mbox{when }-1<\rho\leq-0.5.\end{cases}

When ρ0.5\rho\leq-0.5, variable jj and its knockoff have the same absolute correlation with yy. Consequently, there is a non-diminishing probability that the true signal variable fails to dominate its knockoff variable, making it impossible to select jj consistently. In the Rare/Weak signal model, ‘nested’ signals appear with a non-diminishing probability if ϑ<1/2\vartheta<1/2. This explains why hER(ϑ)=h_{ER}(\vartheta)=\infty when ρ0.5\rho\leq-0.5 and ϑ<1/2\vartheta<1/2.
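The displayed expectations are easy to verify numerically. Below is a minimal Python sketch, assuming only the knockoff Gram identities X~X~=G\tilde{X}^{\prime}\tilde{X}=G and XX~=Gdiag(s)X^{\prime}\tilde{X}=G-\mathrm{diag}(s) restricted to one 2×22\times 2 block; for ρ0.5\rho\leq-0.5 the two expectations indeed have equal magnitude.

import numpy as np

tau = 1.0                                         # tau_p; only the scale matters
for rho in [-0.3, -0.4, -0.6, -0.8]:
    a = max(0.0, 2 * abs(rho) - 1)                # SDP choice in (5.2): diag(s) = (1-a)I_p
    G = np.array([[1.0, rho], [rho, 1.0]])        # one 2x2 block of (5.1)
    beta = np.array([tau, tau])                   # two 'nested' signals
    Exy = (G @ beta)[0]                           # E[x_j' y | beta]
    Exty = ((G - (1 - a) * np.eye(2)) @ beta)[0]  # E[x~_j' y | beta] = (a + rho) tau
    print(f"rho = {rho:+.1f}:  E[xj'y] = {Exy:+.2f},  E[x~j'y] = {Exty:+.2f}")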

The rationale of SDP-knockoff is to minimize the correlation between a variable and its own knockoff, but this is not necessarily the best strategy for constructing knockoff variables when the original variables are highly correlated. In the next subsection, we will see that a proper increase of the correlation between a variable and its knockoff can boost power.

Figure 6: The phase diagrams of SDP-knockoff (blockwise diagonal designs; ranking algorithm is Lasso, and symmetric statistic is signed maximum). From left to right, the correlation parameter in the design is ρ=0.3\rho=-0.3, ρ=0.4\rho=-0.4, and ρ=0.5\rho=-0.5, respectively. They correspond to the three cases in Corollary 5.2. The shadowed area is the Almost Full Recovery region for SDP-knockoff but Exact Recovery region for the prototype Lasso-path. If SDP-knockoff is replaced by CI-knockoff, then in each of three cases the phase diagram is the same as that of Lasso-path.

5.3 CI-knockoff

We study the CI-knockoff, where diag(s)\mathrm{diag}(s) is as in (2.7). Liu and Rigollet (2019) showed that when c=1c=1 in (2.7), the resulting X~\tilde{X} satisfies xj(IPj)x~j=0x_{j}^{\prime}(I-P_{-j})\tilde{x}_{j}=0, where PjP_{-j} is the projection matrix to the linear span of {xk:kj}\{x_{k}:k\neq j\}. It means xjx_{j} and x~j\tilde{x}_{j} are conditionally uncorrelated, given the other (p1)(p-1) original variables. For the block-wise diagonal design (5.1), diag(s)\mathrm{diag}(s) has an explicit form:

diag(s)=(1ρ2)Ip,for all ρ(1,1).\mathrm{diag}(s)=(1-\rho^{2})I_{p},\qquad\mbox{for all }\rho\in(-1,1). (5.3)

Compared with (5.2), the value of aa has changed. We recall that aa controls the correlation between an original variable and its knockoff. In SDP-knockoff, aa is chosen as the minimum eligible value, but in CI-knockoff, aa is set to ρ2\rho^{2}.
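As a quick numerical check of (5.3), assuming (2.7) is diag(s)=c[diag(G1)]1\mathrm{diag}(s)=c[\mathrm{diag}(G^{-1})]^{-1} with c=1c=1, as described above (the design size is arbitrary):

import numpy as np

p, rho = 6, -0.4
G = np.kron(np.eye(p // 2), np.array([[1.0, rho], [rho, 1.0]]))  # blockwise design (5.1)
s_ci = 1.0 / np.diag(np.linalg.inv(G))                           # (2.7) with c = 1
print(s_ci)       # every entry equals 1 - rho^2 = 0.84, matching (5.3)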

Theorem 5.4

Consider a linear model where (3.1)-(3.3) hold. Suppose n2pn\geq 2p and GG is as in (5.1), with a correlation parameter ρ(1,1)\rho\in(-1,1). We construct X~\tilde{X} in knockoff with diag(s)\mathrm{diag}(s) as in (5.3). Let ZjZ_{j}, Z~j\tilde{Z}_{j} and WjW_{j} be as in (2.2)-(2.3), where ff is the signed maximum in (2.5). For any constant u>0u>0, let FPp(u)\mathrm{FP}_{p}(u) and FNp(u)\mathrm{FN}_{p}(u) be the expected numbers of false positives and false negatives, by selecting variables with Wj>2ulog(p)W_{j}>\sqrt{2u\log(p)}. As pp\to\infty,

FPp(u)+FNp(u)={Lpp1fHamm+(u,r,ϑ),ρ0,Lpp1min{fHamm+(u,r,ϑ),  2ϑ+(ξρrηρ1u)+2},ρ<0,\displaystyle\mathrm{FP}_{p}(u)+\mathrm{FN}_{p}(u)=\begin{cases}L_{p}p^{1-f^{+}_{\text{Hamm}}(u,r,\vartheta)},&\rho\geq 0,\\ L_{p}p^{1-\min\bigl{\{}f^{+}_{\text{Hamm}}(u,r,\vartheta),\;\;2\vartheta+(\xi_{\rho}\sqrt{r}-\eta_{\rho}^{-1}\sqrt{u})_{+}^{2}\bigr{\}}},&\rho<0,\end{cases}

where fHamm+(u,r,ϑ)f^{+}_{\text{Hamm}}(u,r,\vartheta) is the same as that in Theorem 5.3.

The exponent in Theorem 5.4 is in fact the same as that in Theorem 5.1. We immediately conclude that CI-knockoff yields the same phase diagram as its prototype, Lasso-path.

Corollary 5.3 (Phase diagram of CI-knockoff)

In the setting of Theorem 5.4, for any ρ(1,1)\rho\in(-1,1), the phase curves of CI-knockoff are the same as those in Corollary 5.1.

The result of CI-knockoff is very encouraging. We now explain how CI-knockoff improves SDP-knockoff for ρ(1,ρ0)\rho\in(-1,\rho_{0}). Comparing (5.3) with (5.2), we find that the correlation between xjx_{j} and x~j\tilde{x}_{j} increases from max{0,2|ρ|1}\max\{0,2|\rho|-1\} to ρ2\rho^{2}. We revisit the scenario of two ‘nested’ signals, i.e., (βj,βj+1)=(τp,τp)(\beta_{j},\beta_{j+1})=(\tau_{p},\tau_{p}). By direct calculations,

𝔼[xjy|β]=(1+ρ)τp,𝔼[x~jy|β]=ρ(1+ρ)τp.\mathbb{E}[x_{j}^{\prime}y|\beta]=(1+\rho)\tau_{p},\qquad\mathbb{E}[\tilde{x}_{j}^{\prime}y|\beta]=\rho(1+\rho)\tau_{p}.

It always holds that |𝔼[xjy|β]|>|𝔼[x~jy|β]||\mathbb{E}[x_{j}^{\prime}y|\beta]|>|\mathbb{E}[\tilde{x}_{j}^{\prime}y|\beta]|. As long as rr is sufficiently large, the original variable xjx_{j} can stand out. This resolves the previous issue of SDP-knockoff.
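A compact numerical comparison of the two constructions in the ‘nested’ scenario (same Gram-identity assumptions as the sketch in Section 5.2) shows that the margin |𝔼[xjy|β]||𝔼[x~jy|β]||\mathbb{E}[x_{j}^{\prime}y|\beta]|-|\mathbb{E}[\tilde{x}_{j}^{\prime}y|\beta]| vanishes under SDP-knockoff once ρ1/2\rho\leq-1/2, but stays strictly positive under CI-knockoff:

import numpy as np

tau = 1.0
for rho in [-0.3, -0.6, -0.8]:
    G = np.array([[1.0, rho], [rho, 1.0]])
    beta = np.array([tau, tau])                           # 'nested' signals
    for name, a in [("SDP", max(0.0, 2 * abs(rho) - 1)), ("CI", rho**2)]:
        GmS = G - (1 - a) * np.eye(2)                     # the X'X~ block
        margin = abs((G @ beta)[0]) - abs((GmS @ beta)[0])
        print(f"rho = {rho:+.1f}  {name:>3}: margin = {margin:+.3f}")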

Going beyond the block-wise design, it is an interesting question whether CI-knockoff still improves SDP-knockoff. We study it numerically in Section 8, where we consider designs such as Factor models, Exponential decay, and Normalized Wishart; see Experiment 4.

Remark 5. Our theory is focused on the 2×22\times 2 blockwise design in (5.1). Using similar techniques, we can study other blockwise designs, such as k×kk\times k blocks or varying-size blocks. Take k×kk\times k blocks for example. In knockoff, solving Lasso in (2.2) reduces to solving many 2k2k-dimensional problems separately. Let J={j,j+1,,j+k1}J=\{j,j+1,\ldots,j+k-1\} be a block. The sufficient statistic for WjW_{j} is y~=[XJ,X~J]y2k\tilde{y}=[X_{J},\tilde{X}_{J}]^{\prime}y\in\mathbb{R}^{2k}. By Lemma 7.1, the Hamming error at jj depends on the interplay between the probability contour of y~\tilde{y} and the geometry of the 2k2k-dimensional Lasso problem. We carry out this analysis for k=2k=2 in the proofs of Theorems 5.2-5.4, and it can be extended to a general kk.

6 Impact of the ranking algorithm

We consider two options for the ranking algorithm: Lasso and least-squares. As the ranking algorithm changes, so does the prototype. In Section 6.1, we first compare the two prototypes. In Section 6.2, we further compare the associated versions of knockoff.

In the orthodox knockoff, the ranking algorithm is Lasso, the augmented design is SDP-knockoff, and the symmetric statistic is the signed maximum. We re-name it SDP-knockoff-Lasso. If the ranking algorithm is changed to least-squares (with the other two components unchanged), we call it SDP-knockoff-OLS. In each method, if the augmented design is changed to CI-knockoff (with the other two components unchanged), we call the resulting methods CI-knockoff-Lasso and CI-knockoff-OLS, respectively. SDP-knockoff-Lasso, CI-knockoff-Lasso, and their prototype, Lasso-path, have been studied in Section 5. In this section, we study SDP-knockoff-OLS, CI-knockoff-OLS, and their prototype, least-squares, and compare the results with those in Section 5.

We first consider a general design, where G=XXG=X^{\prime}X can be any positive definite matrix, and then restrict ourselves to the special case of the 2×22\times 2 blockwise design in (5.1). The reason we can study general designs here is that the least-squares solution has a simple and explicit form (but the Lasso solution does not).

6.1 The prototype, least-squares

Before studying SDP-knockoff-OLS and CI-knockoff-OLS, we first study their common prototype, the least-squares (see (2.11)).

Theorem 6.1

Consider a linear regression model where (3.1)-(3.3) hold and n2pn\geq 2p. Let ωj\omega_{j} be the jj-th diagonal element of G1G^{-1}. Suppose max1jp{ωj}C0\max_{1\leq j\leq p}\{\omega_{j}\}\leq C_{0}, for a constant C0>0C_{0}>0. Let WjW_{j}^{*} be as in (2.11). For any constant u>0u>0, let FPp(u)\mathrm{FP}_{p}(u) and FNp(u)\mathrm{FN}_{p}(u) be the expected numbers of false positives and false negatives, by selecting variables with Wj>2ulog(p)W^{*}_{j}>\sqrt{2u\log(p)}. As pp\to\infty,

FPp(u)Lpj=1ppωj1u,FNp(u)Lppϑj=1ppωj1(ru)+2.\mathrm{FP}_{p}(u)\leq L_{p}\sum_{j=1}^{p}p^{-\omega_{j}^{-1}u},\qquad\mathrm{FN}_{p}(u)\leq L_{p}p^{-\vartheta}\sum_{j=1}^{p}p^{-\omega_{j}^{-1}(\sqrt{r}-\sqrt{u})_{+}^{2}}.
Corollary 6.1 (Phase diagram of OLS)

In the setting of Theorem 6.1, suppose GG is as in (5.1), with a correlation parameter ρ(1,1)\rho\in(-1,1). Then, ωj=(1ρ2)1\omega_{j}=(1-\rho^{2})^{-1} for 1jp1\leq j\leq p. As pp\to\infty, FPp(u)=Lpp1(1ρ2)u\mathrm{FP}_{p}(u)=L_{p}p^{1-(1-\rho^{2})u}, and FNp(u)=Lpp1ϑ(1ρ2)(ru)+2\mathrm{FN}_{p}(u)=L_{p}p^{1-\vartheta-(1-\rho^{2})(\sqrt{r}-\sqrt{u})_{+}^{2}}. The phase diagram of least-squares is given by hAFR(ϑ)=ϑ1ρ2h_{\mathrm{AFR}}(\vartheta)=\frac{\vartheta}{1-\rho^{2}} and hER(ϑ)=(1+1ϑ)21ρ2h_{\mathrm{ER}}(\vartheta)=\frac{(1+\sqrt{1-\vartheta})^{2}}{1-\rho^{2}}.
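The corollary is straightforward to check numerically; a minimal Python sketch (the dimension p=6p=6 is arbitrary):

import numpy as np

p, rho, theta = 6, 0.5, 0.4
G = np.kron(np.eye(p // 2), np.array([[1.0, rho], [rho, 1.0]]))
omega = np.diag(np.linalg.inv(G))            # omega_j in Theorem 6.1
assert np.allclose(omega, 1 / (1 - rho**2))
h_afr = theta / (1 - rho**2)                 # phase curves in Corollary 6.1
h_er = (1 + np.sqrt(1 - theta))**2 / (1 - rho**2)
print(h_afr, h_er)                           # 0.533..., 4.198...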

Figure 7 (left panel) shows the phase diagram of least-squares for |ρ|=0.5|\rho|=0.5; as a reference, in the right two panels, we plot again the phase diagrams of Lasso-path for ρ=±0.5\rho=\pm 0.5. For the comparison between least-squares and Lasso-path, we have the following observations:

  • In terms of hAFR(ϑ)h_{\mathrm{AFR}}(\vartheta), Lasso-path is always better than least-squares. To attain Almost Full Recovery, Lasso-path requires r>ϑr>\vartheta, but least-squares requires r>ϑ/(1ρ2)r>\vartheta/(1-\rho^{2}).

  • In terms of hER(ϑ)h_{ER}(\vartheta), Lasso-path is better than least-squares when ϑ\vartheta is relatively large (i.e., β\beta is comparatively sparser), and least-squares is better than Lasso-path when ϑ\vartheta is relatively small (i.e., β\beta is comparatively denser).

  • The sign of ρ\rho also matters. For small ϑ\vartheta, the advantage of least-squares over Lasso-path on hER(ϑ)h_{\mathrm{ER}}(\vartheta) is much more obvious when ρ\rho is negative.

We give an intuitive explanation of the above phenomena. We say a signal variable (i.e., βj0\beta_{j}\neq 0) is ‘isolated’ if it is the only signal variable in the 2×22\times 2 block, and we say two signals are ‘nested’ if they are in the same 2×22\times 2 block. In the sparser regime (i.e., ϑ\vartheta is large), least-squares has a disadvantage because it is inefficient in discovering an ‘isolated’ signal. In the less sparse regime (i.e., ϑ\vartheta is small), Lasso-path has a disadvantage because it suffers from signal cancellation when estimating a pair of ‘nested’ signals (‘signal cancellation’ means a signal variable has a weak marginal correlation with yy due to the effect of other signals correlated with this one). A more rigorous explanation is given in Section 7, using the geometry of solutions of least-squares and Lasso; see Lemma 7.2, Figure 8, and discussions therein.

Figure 7: The phase diagrams of least-squares (left) and CI-knockoff-OLS (middle left), for blockwise designs with |ρ|=0.5|\rho|=0.5. For reference, we also plot the phase diagrams of Lasso-path for ρ=0.5\rho=0.5 (middle right) and ρ=0.5\rho=-0.5 (right), which are also the phase diagrams of CI-knockoff-Lasso.

6.2 Knockoff-OLS

We now study SDP-knockoff-OLS and CI-knockoff-OLS. The next theorem provides a general result that applies to all augmented designs:

Theorem 6.2

Consider a linear model where (3.1)-(3.3) hold. Suppose n2pn\geq 2p. We construct X~\tilde{X} in knockoff as in (2.1), with some choice of diag(s)\mathrm{diag}(s). Write G=[X,X~][X,X~]2p×2pG^{*}=[X,\tilde{X}]^{\prime}[X,\tilde{X}]\in\mathbb{R}^{2p\times 2p}. Suppose diag(s)\mathrm{diag}(s) is chosen such that GG^{*} is non-singular. Let Aj2×2A_{j}\in\mathbb{R}^{2\times 2} be the submatrix of (G)1(G^{*})^{-1} restricted to the jjth and (j+p)(j+p)th rows and columns. Denote ω1j=Aj(1,1)\omega_{1j}=A_{j}(1,1) and ω2j=Aj(1,2)\omega_{2j}=A_{j}(1,2). Suppose max1jp{ω1j}C0\max_{1\leq j\leq p}\{\omega_{1j}\}\leq C_{0}, for a constant C0>0C_{0}>0. Let ZjZ_{j}, Z~j\tilde{Z}_{j} and WjW_{j} be as in (2.8) and (2.3), where ff is the signed maximum in (2.5). For any constant u>0u>0, let FPp(u)\mathrm{FP}_{p}(u) and FNp(u)\mathrm{FN}_{p}(u) be the expected numbers of false positives and false negatives, by selecting variables with Wj>2ulog(p)W_{j}>\sqrt{2u\log(p)}. As pp\to\infty,

FPp(u)Lpj=1ppω1j1u,FNp(u)Lppϑj=1ppω1j1min{(ru)+2,ω1jω1j+|ω2j|12r}.\mathrm{FP}_{p}(u)\leq L_{p}\sum_{j=1}^{p}p^{-\omega_{1j}^{-1}u},\qquad\mathrm{FN}_{p}(u)\leq L_{p}p^{-\vartheta}\sum_{j=1}^{p}p^{-\omega_{1j}^{-1}\min\bigl{\{}(\sqrt{r}-\sqrt{u})_{+}^{2},\;\;\frac{\omega_{1j}}{\omega_{1j}+|\omega_{2j}|}\cdot\frac{1}{2}r\bigr{\}}}.

The phase diagram of knockoff-OLS is governed by the quantities {ω1j}1jp\{\omega_{1j}\}_{1\leq j\leq p}. We now consider the special case of the 2×22\times 2 blockwise design in (5.1), where the augmented design is such that diag(s)=(1a)Ip\mathrm{diag}(s)=(1-a)I_{p}, with a=max{0,2|ρ|1}a=\max\{0,2|\rho|-1\} in SDP-knockoff and a=ρ2a=\rho^{2} in CI-knockoff. We note that in SDP-knockoff, the matrix GG^{*} is singular when |ρ|1/2|\rho|\geq 1/2. In other words, SDP-knockoff-OLS is well defined only for |ρ|<1/2|\rho|<1/2.

Corollary 6.2 (Phase diagram of knockoff-OLS)

In the same setting of Theorem 6.2, suppose GG is as in (5.1), with a correlation parameter ρ(1,1)\rho\in(-1,1).

  • SDP-knockoff-OLS (only defined for |ρ|<1/2|\rho|<1/2): ω1j=(14ρ2)1(12ρ2)\omega_{1j}=(1-4\rho^{2})^{-1}(1-2\rho^{2}) for 1jp1\leq j\leq p. The phase diagram is given by hAFR(ϑ)=ϑ(12ρ2)14ρ2h_{\mathrm{AFR}}(\vartheta)=\frac{\vartheta(1-2\rho^{2})}{1-4\rho^{2}} and hER(ϑ)=(1+1ϑ)2(12ρ2)14ρ2h_{\mathrm{ER}}(\vartheta)=\frac{(1+\sqrt{1-\vartheta})^{2}(1-2\rho^{2})}{1-4\rho^{2}}.

  • CI-knockoff-OLS: ω1j=(1ρ2)2\omega_{1j}=(1-\rho^{2})^{-2} for 1jp1\leq j\leq p. The phase diagram is given by hAFR(ϑ)=ϑ(1ρ2)2h_{\mathrm{AFR}}(\vartheta)=\frac{\vartheta}{(1-\rho^{2})^{2}} and hER(ϑ)=(1+1ϑ)2(1ρ2)2h_{\mathrm{ER}}(\vartheta)=\frac{(1+\sqrt{1-\vartheta})^{2}}{(1-\rho^{2})^{2}}.
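The values of ω1j\omega_{1j} in both bullets can be read off numerically from (G)1(G^{*})^{-1}; the minimal sketch below (dimensions are arbitrary) verifies the two formulas and also previews Remark 6 below, since the OLS quantity ωj\omega_{j} is smaller than either ω1j\omega_{1j}:

import numpy as np

def omega1(rho, a, p=6):
    # omega_{1j} = [(G*)^{-1}]_{jj} in Theorem 6.2, for the blockwise design (5.1)
    G = np.kron(np.eye(p // 2), np.array([[1.0, rho], [rho, 1.0]]))
    S = (1 - a) * np.eye(p)
    Gstar = np.block([[G, G - S], [G - S, G]])
    return np.linalg.inv(Gstar)[0, 0]

rho = 0.3
print(omega1(rho, max(0.0, 2 * abs(rho) - 1)))   # SDP: (1-2rho^2)/(1-4rho^2) = 1.281...
print(omega1(rho, rho**2))                       # CI:  (1-rho^2)^{-2}        = 1.207...
print(1 / (1 - rho**2))                          # OLS: omega_j               = 1.099...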

Figure 7 (second left panel) shows the phase diagram of CI-knockoff-OLS for |ρ|=0.5|\rho|=0.5. In this figure, the right two panels are the phase diagrams of CI-knockoff-Lasso for ρ=±0.5\rho=\pm 0.5.

From Corollary 6.2 and Figure 7, we draw two conclusions: First, for both SDP-knockoff-OLS and CI-knockoff-OLS, whenever ρ0\rho\neq 0, their phase diagrams are strictly inferior to the phase diagram of the least-squares (prototype). This is different from the case of using Lasso as the ranking algorithm, where the phase diagrams of CI-knockoff-Lasso and Lasso-path (prototype) are the same in the blockwise design for all ρ(1,1)\rho\in(-1,1). Second, the comparison of CI-knockoff-OLS and CI-knockoff-Lasso is largely similar to the comparison between the least-squares and Lasso-path (see Section 6.1).

Remark 6. When we use the least-squares as the ranking algorithm, such a gap between knockoff and its prototype always exists, for a general design. To see this, note that by Theorem 6.1 and Theorem 6.2, the phase diagrams of knockoff-OLS and its prototype are governed by the quantities {ω1j}1jp\{\omega_{1j}\}_{1\leq j\leq p} and {ωj}1jp\{\omega_{j}\}_{1\leq j\leq p}, respectively. Since ωj\omega_{j} and ω1j\omega_{1j} are the jjth diagonal elements of G1G^{-1} and (G)1(G^{*})^{-1}, respectively, and GG is a principal submatrix of GG^{*}, it follows by elementary linear algebra that ωjω1j\omega_{j}\leq\omega_{1j} is always true (and this inequality is often strict). Unfortunately, it is impossible to mitigate this gap by using the augmented design in (2.1), no matter how we choose diag(s)\mathrm{diag}(s). Xing et al. (2023) proposed a new idea of constructing an augmented design, called the Gaussian mirror, which is tailored to using the least-squares as the ranking algorithm. In a companion paper (Ke et al., 2022), we show that the Gaussian mirror attains the same phase diagram as the least-squares.

Remark 7. Besides Lasso and least-squares, we may consider other ranking algorithms, such as the thresholded Lasso, non-convex penalization methods, and the forward-backward selection. See Ke and Wang (2021) about the phase diagrams of these methods.

7 The proof ideas and some geometric insights

A key technical tool in the proof is the following lemma, which is proved in the Appendix. Recall that LpL_{p} is a generic notation of multi-log(p)\log(p) terms; see Definition 3.1. For a vector vv, v\|v\| denotes the 2\ell^{2}-norm; for a matrix MM, M\|M\| denotes the spectral norm.

Lemma 7.1

Fix an integer d1d\geq 1, a vector μd\mu\in\mathbb{R}^{d}, a covariance matrix Σd×d\Sigma\in\mathbb{R}^{d\times d}, and an open set SdS\subset\mathbb{R}^{d} such that μS\mu\notin S. The quantities (d,μ,Σ,S)(d,\mu,\Sigma,S) do not change with pp. Suppose binfxS{(xμ)Σ1(xμ)}<b\equiv\inf_{x\in S}\{(x-\mu)^{\prime}\Sigma^{-1}(x-\mu)\}<\infty. Consider a sequence of random vectors XpdX_{p}\in\mathbb{R}^{d}, indexed by pp, satisfying that

Xp|(μp,Σp)𝒩d(μp,12log(p)Σp),X_{p}|(\mu_{p},\Sigma_{p})\sim{\cal N}_{d}\Bigl{(}\mu_{p},\;\;\frac{1}{2\log(p)}\Sigma_{p}\Bigr{)},

where μpd\mu_{p}\in\mathbb{R}^{d} is a random vector and Σpd×d\Sigma_{p}\in\mathbb{R}^{d\times d} is a random covariance matrix. As pp\to\infty, suppose for any fixed γ>0\gamma>0 and L>0L>0, (μpμ>γ)pL\mathbb{P}(\|\mu_{p}-\mu\|>\gamma)\leq p^{-L} and (ΣpΣ>γ)pL\mathbb{P}(\|\Sigma_{p}-\Sigma\|>\gamma)\leq p^{-L}. Then, as pp\to\infty,

(XpS)=Lppb,\mathbb{P}(X_{p}\in S)=L_{p}p^{-b},

or equivalently, pb+δ(XpS)p^{b+\delta}\mathbb{P}(X_{p}\in S)\to\infty and pbδ(XpS)0p^{b-\delta}\mathbb{P}(X_{p}\in S)\to 0 for any constant δ>0\delta>0.

This lemma connects the rate of convergence of (XpS)\mathbb{P}(X_{p}\in S) with the geometric property of the set SS. The exponent bb is the squared “radius” of the largest ellipsoid that is centered at μ\mu and fully contained in the complement of SS.
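To see the lemma in action, consider the simplest case d=1d=1, μ=0\mu=0, Σ=1\Sigma=1, and S=(t,)S=(t,\infty), so that b=t2b=t^{2}. The following minimal sketch uses the exact Gaussian tail (in place of Monte Carlo) to track logp(XpS)-\log_{p}\mathbb{P}(X_{p}\in S):

import numpy as np
from scipy.stats import norm

t = 1.3
b = t**2
for p in [1e2, 1e4, 1e8, 1e16]:
    # X_p ~ N(0, 1/(2 log p)); Lemma 7.1 predicts P(X_p in S) = L_p p^{-b}
    prob = norm.sf(t * np.sqrt(2 * np.log(p)))
    print(f"p = {p:.0e}:  -log P / log p = {-np.log(prob) / np.log(p):.3f}  (b = {b:.2f})")

The ratio approaches bb only slowly, reflecting the multi-log(p)\log(p) factor LpL_{p}.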

Figure 8: Rejection regions and ‘most-likely’ cases in block-wise diagonal designs (x-axis: xjy/2log(p)x_{j}^{\prime}y/\sqrt{2\log(p)}; y-axis: xj+1y/2log(p)x_{j+1}^{\prime}y/\sqrt{2\log(p)}). From left to right: (i) positive ρ\rho and large ϑ\vartheta, (ii) positive ρ\rho and small ϑ\vartheta, (iii) negative ρ\rho and large ϑ\vartheta, (iv) negative ρ\rho and small ϑ\vartheta. In each plot, the blue solid lines define the rejection region of Lasso-path, and the red solid lines define the rejection region of least-squares. For each method, FPp\mathrm{FP}_{p} is determined by the largest FP-ellipsoid in c{\cal R}^{c}, and FNp\mathrm{FN}_{p} is determined by the largest FN-ellipsoid in {\cal R}, where the centers of these ellipsoids are determined by (βj,βj+1)(\beta_{j},\beta_{j+1}) in the ‘most-likely’ case. In each plot, the largest FP-ellipsoid is controlled to be the same for both Lasso-path and least-squares, and so the method with a larger FN-ellipsoid is better.

Proof sketch.

We illustrate how to use Lemma 7.1 to prove the theorems in Sections 4-6. Take the proof of Theorem 5.1 for example. Consider the block-wise design in (5.1). Under this design, the objective of Lasso is separable, and it reduces to solving many 2-dimensional Lasso problems separately. Fix jj and suppose {j,j+1}\{j,j+1\} is a block. Let WjW_{j}^{*} be as in (2.10). Write

h^=(xjy,xj+1y)/2log(p)2.\hat{h}=\bigl{(}x_{j}^{\prime}y,\;x_{j+1}^{\prime}y\bigr{)}^{\prime}/\sqrt{2\log(p)}\;\;\in\;\;\mathbb{R}^{2}. (7.1)

Since the Lasso objective is separable, (Wj,Wj+1)(W_{j}^{*},W_{j+1}^{*}) are purely determined by h^\hat{h}. In particular, there exists u2{\cal R}_{u}\subset\mathbb{R}^{2}, such that Wj>2ulog(p)W_{j}^{*}>\sqrt{2u\log(p)} if and only if h^u\hat{h}\in{\cal R}_{u}. We call u{\cal R}_{u} the “rejection region” of Lasso-path. The probabilities of a false positive and a false negative occurring at jj are respectively

(h^u,βj=0)and(h^uc,βj=τp).\mathbb{P}\bigl{(}\hat{h}\in{\cal R}_{u},\,\beta_{j}=0\bigr{)}\qquad\mbox{and}\qquad\mathbb{P}\bigl{(}\hat{h}\in{\cal R}^{c}_{u},\,\beta_{j}=\tau_{p}\bigr{)}.

Conditional on β\beta, the random vector h^\hat{h} has a bivariate normal distribution, whose mean is a constant vector and whose covariance matrix is 12log(p)B\frac{1}{2\log(p)}B, where BB is the same as in (5.1). Applying Lemma 7.1, we reduce the proof to two steps: In Step 1, we derive the rejection region u{\cal R}_{u}. In Step 2, for each possible realization of β\beta with βj=0\beta_{j}=0, we calculate b(β)infxu{(xμ(β))B1(xμ(β))}b(\beta)\equiv\inf_{x\in{\cal R}_{u}}\{(x-\mu(\beta))^{\prime}B^{-1}(x-\mu(\beta))\}, and for each possible realization of β\beta with βj0\beta_{j}\neq 0, we calculate b(β)infxuc{(xμ(β))B1(xμ(β))}b(\beta)\equiv\inf_{x\in{\cal R}^{c}_{u}}\{(x-\mu(\beta))^{\prime}B^{-1}(x-\mu(\beta))\}, where μ(β)𝔼[h^|β]\mu(\beta)\equiv\mathbb{E}[\hat{h}|\beta]. Both steps can be carried out by direct calculations. We use a similar strategy to prove the other theorems. The proof is sometimes complicated. For example, to analyze knockoff for block-wise diagonal designs, we have to consider the random vector h^=(xjy,xj+1y,x~jy,x~j+1y)/2log(p)4\hat{h}=(x_{j}^{\prime}y,x_{j+1}^{\prime}y,\tilde{x}_{j}^{\prime}y,\tilde{x}_{j+1}^{\prime}y)^{\prime}/\sqrt{2\log(p)}\in\mathbb{R}^{4}. The proof requires deriving a 4-dimensional rejection region and calculating b(β)b(\beta), for an arbitrary ρ(1,1)\rho\in(-1,1). The calculations are very tedious.

The geometric insight about two prototypes.

We use the geometric interpretation of our proofs to give more insights about Lasso-path versus least-squares (see Corollary 5.1 and Corollary 6.1). Under the blockwise design (5.1), for each method, the objective is separable, so that the event Wj>2ulog(p)W_{j}^{*}>\sqrt{2u\log(p)} can be described via a 2-dimensional rejection region. The next lemma gives the rejection regions of Lasso-path and least-squares:

Lemma 7.2

Consider a linear model, where the Gram matrix satisfies (5.1), with a correlation parameter ρ(1,1)\rho\in(-1,1). Let Wj,pathW_{j}^{*,path} and Wj,olsW_{j}^{*,ols} be as in (2.10) and (2.11), respectively. Suppose {j,j+1}\{j,j+1\} is a block. Write h^=(xjy,xj+1y)/2log(p)\hat{h}=(x_{j}^{\prime}y,x_{j+1}^{\prime}y)^{\prime}/\sqrt{2\log(p)}. Define

upath(ρ)\displaystyle{\cal R}^{\mathrm{path}}_{u}(\rho) ={(h1,h2):h1ρh2>(1ρ)u,h1>u}\displaystyle=\{(h_{1},h_{2}):h_{1}-\rho h_{2}>(1-\rho)\sqrt{u},\,h_{1}>\sqrt{u}\}
{(h1,h2):h1ρh2>(1+ρ)u}\displaystyle\qquad\cup\{(h_{1},h_{2}):h_{1}-\rho h_{2}>(1+\rho)\sqrt{u}\}
{(h1,h2):h1ρh2<(1ρ)u,h1<u}\displaystyle\qquad\cup\{(h_{1},h_{2}):h_{1}-\rho h_{2}<-(1-\rho)\sqrt{u},\,h_{1}<-\sqrt{u}\}
{(h1,h2):h1ρh2<(1+ρ)u},for ρ0,\qquad\cup\{(h_{1},h_{2}):h_{1}-\rho h_{2}<-(1+\rho)\sqrt{u}\},\quad\mbox{for }\rho\geq 0,
upath(ρ)\displaystyle{\cal R}^{\mathrm{path}}_{u}(\rho) ={(h1,h2):(h1,h2)upath(ρ)},for ρ<0,\displaystyle=\{(h_{1},h_{2}):(h_{1},-h_{2})\in{\cal R}^{\mathrm{path}}_{u}(-\rho)\},\qquad\mbox{for }\rho<0,
uols(ρ)\displaystyle{\cal R}^{\mathrm{ols}}_{u}(\rho) ={(h1,h2):h1ρh2>(1ρ2)u}\displaystyle=\{(h_{1},h_{2}):h_{1}-\rho h_{2}>(1-\rho^{2})\sqrt{u}\}
{(h1,h2):h1ρh2<(1ρ2)u}.\displaystyle\qquad\cup\{(h_{1},h_{2}):h_{1}-\rho h_{2}<-(1-\rho^{2})\sqrt{u}\}.

Then, for Lasso-path, Wj,path>2ulog(p)W_{j}^{*,path}>\sqrt{2u\log(p)} if and only if h^upath(ρ)\hat{h}\in{\cal R}^{\mathrm{path}}_{u}(\rho); for least-squares, Wj,ols>2ulog(p)W_{j}^{*,ols}>\sqrt{2u\log(p)} if and only if h^uols(ρ)\hat{h}\in{\cal R}^{\mathrm{ols}}_{u}(\rho).
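The Lasso-path part of the lemma can be verified directly. The following self-contained Python sketch computes Wj,pathW_{j}^{*,path} (in the normalized units of h^\hat{h}) by a grid scan over λ\lambda with coordinate descent on the bivariate program (B.2), and compares the result with membership in upath(ρ){\cal R}^{\mathrm{path}}_{u}(\rho); the grid resolution and tolerances are ad hoc:

import numpy as np

def soft(z, lam):
    return np.sign(z) * max(abs(z) - lam, 0.0)

def entry_value(h, rho, grid=np.linspace(3.0, 1e-3, 800)):
    # largest lambda at which b_1 is nonzero on the path of (B.2), i.e., W_j^{*,path}
    for lam in grid:
        b = np.zeros(2)
        for _ in range(100):                 # coordinate descent on (B.2)
            b[0] = soft(h[0] - rho * b[1], lam)
            b[1] = soft(h[1] - rho * b[0], lam)
        if abs(b[0]) > 1e-10:
            return lam
    return 0.0

def in_region(h1, h2, rho, su):              # su stands for sqrt(u)
    if rho < 0:                              # the reflection rule for rho < 0
        return in_region(h1, -h2, -rho, su)
    d = h1 - rho * h2
    return ((d > (1 - rho) * su and h1 > su) or (d > (1 + rho) * su)
            or (d < -(1 - rho) * su and h1 < -su) or (d < -(1 + rho) * su))

rng = np.random.default_rng(0)
for _ in range(20):
    rho, su = rng.uniform(-0.9, 0.9), rng.uniform(0.05, 2.0)
    h = rng.uniform(-2, 2, size=2)
    w = entry_value(h, rho)
    if abs(w - su) > 0.02:                   # skip grid-resolution boundary cases
        assert (w > su) == in_region(h[0], h[1], rho, su)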

These rejection regions are shown in Figure 8. Their geometric properties are different for positive and negative ρ\rho. Fix jj. Let h^\hat{h} be as in (7.1), and write μ(β)=𝔼[h^|β]\mu(\beta)=\mathbb{E}[\hat{h}|\beta].

  • The rate of convergence of FPp(u)\mathrm{FP}_{p}(u) is determined by the largest ellipsoid that is centered at μ(β)\mu(\beta) and contained in uc\mathcal{R}_{u}^{c}. We call this ellipsoid the FP-ellipsoid.

  • The rate of convergence of FNp(u)\mathrm{FN}_{p}(u) is determined by the largest ellipsoid that is centered at μ(β)\mu(\beta) and contained in u\mathcal{R}_{u}. We call this ellipsoid the FN-ellipsoid.

By direct calculations, μ(β)=(βj+ρβj+1,ρβj+βj+1)/2log(p)\mu(\beta)=\bigl{(}\beta_{j}+\rho\beta_{j+1},\;\rho\beta_{j}+\beta_{j+1}\bigr{)}^{\prime}/\sqrt{2\log(p)}. Under our model, (βj,βj+1)(\beta_{j},\beta_{j+1}) has 4 possible values {(0,0),(0,τp),(τp,0),(τp,τp)}\{(0,0),(0,\tau_{p}),(\tau_{p},0),(\tau_{p},\tau_{p})\}, where the first two correspond to a null at jj and the last two correspond to a non-null at jj. The probability of having a selection error at jj thus splits into 4 terms, and which term is dominating depends on the values of ϑ\vartheta and ρ\rho. The realization of (βj,βj+1)(\beta_{j},\beta_{j+1}) that plays a dominating role is called the ‘most-likely’ case. For example, when ϑ\vartheta is large (i.e., β\beta is sparser), the most-likely case of a false positive occurring at jj is when (βj,βj+1)=(0,0)(\beta_{j},\beta_{j+1})=(0,0); when ϑ\vartheta is small (i.e., β\beta is less sparse), the most-likely case of a false positive is when (βj,βj+1)=(0,τp)(\beta_{j},\beta_{j+1})=(0,\tau_{p}). Table 1 summarizes the ‘most-likely’ cases. We also visualize the ‘most-likely’ cases for different (ρ,ϑ)(\rho,\vartheta) in Figure 8. In each plot of Figure 8, we have coordinated the thresholds uu in the two methods so that the FP-ellipsoid is exactly the same. It suffices to compare the FN-ellipsoid: The method with a larger FN-ellipsoid has a faster rate of convergence on the Hamming error. It is clear that, when ϑ\vartheta is large, the FN-ellipsoid of Lasso-path is larger; when ϑ\vartheta is small, the FN-ellipsoid of least-squares is larger. This explains the different performances of the two methods. Moreover, when ϑ\vartheta is small, comparing the case of a positive ρ\rho with the case of a negative ρ\rho, we find that the difference between the FN-ellipsoids of the two methods is much more prominent in the case of a negative ρ\rho. This explains why the sign of ρ\rho matters.

Sparsity | Correlation | Error type | Most-likely case | Center of ellipsoid
large ϑ\vartheta | positive/negative ρ\rho | FP | βj=0\beta_{j}=0, βj+1=0\beta_{j+1}=0 | (0, 0)(0,\,0)
large ϑ\vartheta | positive/negative ρ\rho | FN | βj=τp\beta_{j}=\tau_{p}, βj+1=0\beta_{j+1}=0 | (r,ρr)(\sqrt{r},\,\rho\sqrt{r})
small ϑ\vartheta | positive ρ\rho | FP | βj=0\beta_{j}=0, βj+1=τp\beta_{j+1}=\tau_{p} | (ρr,r)(\rho\sqrt{r},\,\sqrt{r})
small ϑ\vartheta | positive ρ\rho | FN | βj=τp\beta_{j}=\tau_{p}, βj+1=0\beta_{j+1}=0 | (r,ρr)(\sqrt{r},\,\rho\sqrt{r})
small ϑ\vartheta | negative ρ\rho | FP | βj=0\beta_{j}=0, βj+1=τp\beta_{j+1}=\tau_{p} | (ρr,r)(\rho\sqrt{r},\,\sqrt{r})
small ϑ\vartheta | negative ρ\rho | FN | βj=τp\beta_{j}=\tau_{p}, βj+1=τp\beta_{j+1}=\tau_{p} | ((1+ρ)r,(1+ρ)r)\bigl{(}(1+\rho)\sqrt{r},\,(1+\rho)\sqrt{r}\bigr{)}
Table 1: The ‘most-likely’ cases and the corresponding ellipsoid center μ(β)\mu(\beta)

8 Simulations

We use numerical experiments to support and exemplify the theoretical results in Sections 4-6. In Experiments 1 and 2, we consider orthogonal designs and block-wise diagonal designs, respectively. In Experiments 3 and 4, we consider other design classes, including block-wise diagonal designs with larger blocks, factor models, exponentially decaying designs, and normalized Wishart designs. We consider four methods: Lasso-path (Lasso), least-squares (OLS), knockoff with Lasso-path ranking (KF.Lasso) and with least-squares ranking (KF.OLS). We use either the signed maximum or the difference as the symmetric statistic, and for KF we choose diag(s)=min{1, 2λmin(G)}Ip\mathrm{diag}(s)=\min\{1,\,2\lambda_{\min}(G)\}\cdot I_{p}, unless specified otherwise. It is called the equi-correlated knockoff (EC-KF), and is the same as the SDP-knockoff for orthogonal designs and the 2×22\times 2 block-wise diagonal designs. In Experiments 1-3, this is the only diag(s)\mathrm{diag}(s) we use, and so we write EC-KF as KF for short. In Experiment 4, we also consider the conditional independence knockoff (CI-KF). For most experiments, fixing a parameter setting, we generate 200200 data sets and record the average Hamming selection error over these 200 repetitions.
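For reference, below is a minimal sketch of the EC-KF construction, assuming the augmented design (2.1) takes the standard fixed-X form X~=X(IG1S)+U~C\tilde{X}=X(I-G^{-1}S)+\tilde{U}C with U~X=0\tilde{U}^{\prime}X=0 and CC=2SSG1SC^{\prime}C=2S-SG^{-1}S (Barber and Candès, 2015); the small ϵ\epsilon jitter mirrors the adjustment used in Experiment 2 below.

import numpy as np
from scipy.linalg import null_space, sqrtm

def ec_knockoff(X, eps=1e-5):
    n, p = X.shape
    G = X.T @ X
    s = min(1.0, 2 * np.linalg.eigvalsh(G).min()) - eps   # EC choice of diag(s)
    S, Ginv = s * np.eye(p), np.linalg.inv(G)
    U = null_space(X.T)[:, :p]        # orthonormal, orthogonal to col(X); needs n >= 2p
    C = np.real(sqrtm(2 * S - S @ Ginv @ S))
    return X @ (np.eye(p) - Ginv @ S) + U @ C

rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.standard_normal((200, 40)))       # an orthogonal design
Xt = ec_knockoff(X)
G, D = X.T @ X, X.T @ X - X.T @ Xt
print(np.allclose(Xt.T @ Xt, G, atol=1e-6),               # X~'X~ = G
      np.allclose(D, np.diag(np.diag(D)), atol=1e-6))     # X'X~ = G - diag(s)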

Experiment 1. We investigate the performance of different methods for orthogonal designs. Given (n,p)=(2000,1000)(n,p)=(2000,1000), ϑ{0.3,0.5}\vartheta\in\{0.3,0.5\} and rr ranging on a grid from 0 to 66 with step size 0.20.2, we generate data yy from N(Xβ,In)N(X\beta,I_{n}) where XX is an n×pn\times p matrix with unit length columns that are orthogonal to each other and β\beta is generated from (3.2). We implement Lasso and KF.Lasso using both the signed maximum and the difference as the symmetric statistic. Under the orthogonal design, Lasso and OLS yield the same importance metric, so OLS and KF.OLS are omitted from this experiment. Each method outputs pp importance statistics, and we threshold these importance statistics at 2ulog(p)\sqrt{2u^{*}\log(p)} where uu^{*} minimizes FPp(u)+FNp(u)\mathrm{FP}_{p}(u)+\mathrm{FN}_{p}(u) in theory. The results are in Figure 9, where the y-axis is logp(Hp/p)\log_{p}(H_{p}/p), and HpH_{p} is the average Hamming selection error over 200 repetitions.

The theory in Sections 4-5 suggests the following for orthogonal designs: (i) Regarding the choice of symmetric statistic for KF, the signed maximum outperforms the difference. (ii) With signed maximum as the symmetric statistic, KF.Lasso has a similar performance as Lasso. These theoretical results are perfectly validated by simulations (see Figure 9).

Figure 9: Experiment 1 (orthogonal designs). The y-axis is logp(Hp/p)\log_{p}(H_{p}/p), where HpH_{p} is the average Hamming error over 200 repetitions.

Experiment 2. We consider the block-wise diagonal design with 2×22\times 2 blocks, where we take ρ=0.5\rho=0.5 and ρ=0.7\rho=0.7. In the data generation, we fix an n×pn\times p matrix XX such that XXX^{\prime}X has the desired form. We then generate (β,y)(\beta,y) in the same way as before. For each ρ\rho, we fix (n,p,ϑ)=(2000,1000,0.2)(n,p,\vartheta)=(2000,1000,0.2), and let rr range on a grid from 0 to 88 with a step size 0.20.2. For KF.Lasso and KF.OLS, we now fix the symmetric statistic as the signed maximum, and the default choice of diag(s)\mathrm{diag}(s) yields diag(s)=(1a)Ip\mathrm{diag}(s)=(1-a)I_{p} with a=2ρ1a=2\rho-1. In this case, G=[X,X~][X,X~]G^{*}=[X,\tilde{X}]^{\prime}[X,\tilde{X}] is degenerate, so we subtract ϵ=105\epsilon=10^{-5} from each element of diag(s)\mathrm{diag}(s) to ensure that KF.OLS is applicable. The results are in the first two panels of Figure 10.

The theory in Section 5 suggests that, since the two values of ρ\rho considered here are in (ρ0,1)(\rho_{0},1), KF.Lasso has a similar performance as its prototype, Lasso. According to Section 6, KF.OLS has an inferior performance compared with its prototype, OLS. The simulation results are consistent with these theoretical predictions. Moreover, we can see that, for the current ϑ\vartheta value, OLS has a smaller Hamming error than that of Lasso when rr is large, and the opposite is true when rr is small. These also agree with our theory.

Figure 10: Experiments 2 and 3 (block-wise diagonal designs, dd: block size, ρ\rho: off-diagonal entries). The y-axis is logp(Hp/p)\log_{p}(H_{p}/p), where HpH_{p} is the average Hamming error. The parameter aa controls the construction of knockoff.

Experiment 3. We further consider blockwise diagonal designs with larger-size blocks. Given d2d\geq 2 and pp that is a multiple of dd, we generate Xn×pX\in\mathbb{R}^{n\times p} such that XXX^{\prime}X is block-wise diagonal with d×dd\times d diagonal blocks, where the off-diagonal elements of each block are all equal to ρ\rho. Other steps of the data generation are the same as in Experiment 2. We consider (d,ρ)=(4,0.4)(d,\rho)=(4,0.4) and (d,ρ)=(5,0.3)(d,\rho)=(5,0.3). For each choice of (d,ρ)(d,\rho), we set (n,p,ϑ)=(2000,1000,0.3)(n,p,\vartheta)=(2000,1000,0.3) and let rr range on a grid from 0 to 66 with a step size 0.20.2. We use the signed maximum as the symmetric statistic in KF and use the equi-correlated knockoff described above. The results are in the last two panels of Figure 10.

One noteworthy observation is that KF.Lasso still has a similar performance as its own prototype. Meanwhile, KF.OLS can get close to its prototype in the case where ρ\rho is close to 0. Another observation is that OLS outperforms Lasso when rr is large, and Lasso slightly outperforms OLS when rr is small. While our theory is only derived for d=2d=2, the simulations suggest that similar insights continue to apply when the block size gets larger.

Experiment 4. In Section 5, we studied variants of knockoff with different augmented designs. The theory for 2×22\times 2 block-wise designs suggests that using CI-knockoff to construct X~\tilde{X} yields a higher power than using EC-knockoff (for 2×22\times 2 block-wise design, EC-knockoff is the same as SDP-knockoff). In this experiment, we investigate whether using CI-knockoff still yields a power boost for other design classes. We consider 4 types of designs:

  • Factor models: XX=(BB+Ip)/2X^{\prime}X=(BB^{\prime}+I_{p})/2, where BB is a p×2p\times 2 matrix whose jj-th row is equal to [cos(αj),sin(αj)][\cos(\alpha_{j}),\sin(\alpha_{j})] with {αj}j=1,,p\{\alpha_{j}\}_{j=1,\cdots,p} iidiid drawn from Uniform[0,2π]\mathrm{Uniform}[0,2\pi];

  • Block diagonal: Same as in Experiment 2, where ρ=0.5\rho=0.5.

  • Exponential decay: The (i,j)(i,j)-th element of XXX^{\prime}X is 0.6|ij|0.6^{|i-j|}, for 1i,jp1\leq i,j\leq p.

  • Normalized Wishart: XXX^{\prime}X is the sample correlation matrix of nn iidiid samples of N(0,Ip)N(0,I_{p}).

In the normalized Wishart design, the CI-knockoff in (2.7) may not satisfy diag(s)2G\mathrm{diag}(s)\preceq 2G. We modify it to diag(s)=α[diag(G1)]1\text{diag}(s)=\alpha[\text{diag}(G^{-1})]^{-1}, where α\alpha is the maximum value in [0,1][0,1] such that diag(s)2G\mathrm{diag}(s)\preceq 2G. For each design, we fix (n,p)=(1000,300)(n,p)=(1000,300), let ϑ\vartheta take values in {0.2,0.4}\{0.2,0.4\} and let rr range on a grid from 0 to 66 with a step size 0.20.2. Different from previous experiments, we generate β\beta from βjiid(1ϵp)ν0+12ϵpντp+12ϵpντp\beta_{j}\overset{iid}{\sim}(1-\epsilon_{p})\nu_{0}+\frac{1}{2}\epsilon_{p}\nu_{\tau_{p}}+\frac{1}{2}\epsilon_{p}\nu_{-\tau_{p}}, for 1jp1\leq j\leq p. The motivation of using this model is to allow for negative entries in β\beta. Even when XXX^{\prime}X contains only nonnegative elements, this signal model can still reveal the effect of having negative correlations in the design. We compare two versions of knockoff, EC-knockoff and CI-knockoff, along with the prototype, Lasso. The results are in Figure 11.
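The modified choice of diag(s)\mathrm{diag}(s) can be computed by a simple bisection on α\alpha; a minimal sketch under the description above (the design size is arbitrary):

import numpy as np

def ci_s_modified(G, iters=50):
    # diag(s) = alpha * [diag(G^{-1})]^{-1} with the largest alpha in [0,1]
    # such that diag(s) <= 2G in the positive semidefinite order
    s0 = 1.0 / np.diag(np.linalg.inv(G))
    lo, hi = 0.0, 1.0
    for _ in range(iters):                   # bisection on alpha
        mid = (lo + hi) / 2
        if np.linalg.eigvalsh(2 * G - np.diag(mid * s0)).min() >= 0:
            lo = mid
        else:
            hi = mid
    return lo * s0

rng = np.random.default_rng(1)
Z = rng.standard_normal((1000, 50))
G = np.corrcoef(Z, rowvar=False)             # a normalized Wishart design
s = ci_s_modified(G)
print(np.linalg.eigvalsh(2 * G - np.diag(s)).min() >= -1e-8)   # feasibility check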

For the 2×22\times 2 block-wise diagonal design, the simulations suggest that CI-KF significantly outperforms EC-KF, and that CI-KF has a similar performance as the prototype, Lasso. This is consistent with the theory in Section 5.2 and Section 5.3. CI-KF also yields a significant improvement over EC-KF in the factor design, and the two methods perform similarly in the exponentially decaying design and the normalized Wishart design. We notice that the Gram matrix of the normalized Wishart design has uniformly small off-diagonal entries for the current (n,p)(n,p), which is similar to the orthogonal design and explains why EC-KF and CI-KF do not have much difference. Combining these simulation results, we recommend CI-KF for practical use. Additionally, in some settings (e.g., factor design, ϑ=0.4\vartheta=0.4; exponentially decaying design, ϑ=0.2\vartheta=0.2), CI-KF even outperforms its prototype Lasso. One possible reason is that the ideal threshold we use is derived by ignoring the multi-log(p)\log(p) term, but this term can have a non-negligible effect for a moderately large pp, so the Hamming error of Lasso presented here may be larger than the actual optimal one.

Figure 11: Experiment 4 (general designs). The y-axis is logp(Hp/p)\log_{p}(H_{p}/p), where HpH_{p} is the average Hamming error. We focus on comparing two constructions of knockoffs, EC-KF and CI-KF, and include Lasso as the benchmark.

9 Discussions

How to maximize the power when controlling FDR at a targeted level is a problem of great interest. We focus on the FDR control method, knockoff, and point out that it has three key components: ranking algorithm, augmented design, and symmetric statistic. Since each component admits multiple choices, knockoff has many different variants. All the variants guarantee finite-sample FDR control. Our goal is to understand which variants enjoy good power. In a Rare/Weak signal model, for each variant of knockoff under consideration, we derive explicit forms of false positive rate and false negative rate, and obtain the theoretical phase diagram. The results provide useful guidelines for choosing the version of knockoff to use in practice. We also define the prototype of knockoff, which uses only one component, the ranking algorithm, and has access to an ideal threshold. We compare the phase diagram of knockoff with the phase diagram of its prototype. The results help us understand the extra price we pay for finding a data-driven threshold to control FDR.

We have several notable discoveries: (i) For the choice of symmetric statistic, signed maximum is better than difference, because the latter has an inferior phase diagram in the orthogonal design. (ii) For the choice of augmented design, CI-knockoff is better than SDP-knockoff, because the latter has an inferior phase diagram in a simple blockwise diagonal design. (iii) For the choice of ranking algorithm, roughly, Lasso is better than least-squares when the signals are extremely sparse and the design correlations are moderate; and least-squares is better than Lasso when the signals are only moderately sparse and the design correlations are more severe. (iv) In a simple blockwise diagonal design, when knockoff uses Lasso as ranking algorithm, with proper choices of the two other components, knockoff has the same phase diagram as its prototype (i.e., we pay a negligible price for finding a data-driven threshold). This is however not true when knockoff uses least-squares as ranking algorithm.

There are several directions to extend our current results. First, we focus on the regime where FDR and TPR converge to either 0 or 1 and characterize the rates of convergence. The more subtle regime where FDR and TPR converge to constants between 0 and 1 is not studied. We leave it to future work. Second, the study of knockoff here is only for block-wise diagonal designs. For general designs, it is very tedious to derive the precise phase diagram, but some cruder results may be less tedious to derive, such as an upper bound for the Hamming error. This kind of result will help shed more light on how to construct the knockoff variables (e.g., how to choose diag(s)\mathrm{diag}(s)). Third, we only investigate Lasso-path or the least-squares as options of the ranking algorithm. It is interesting to study the power of FDR control methods based on other ranking algorithms, such as the marginal screening and iterative sure screening (Fan and Lv, 2008) and the covariance assisted screening (Ke et al., 2014; Ke and Yang, 2017). The covariance assisted screening was shown to yield optimal phase diagrams for a broad class of sparse designs; whether it can be developed into an FDR control method with “optimal” power remains unknown and is worth future study. Last, some FDR control methods may not fit exactly the unified framework here. For instance, the multiple data splits (Dai et al., 2022) is a method that controls FDR through data splitting. We can similarly assess its power using the Rare/Weak signal model and phase diagram, except that we need to assume the rows of XX are i.i.d.i.i.d. generated. We leave such a study to future work.


Acknowledgments and Disclosure of Funding

The research of Jun S. Liu was partially supported by the NSF grant DMS-201541 and the NIH R01 grant HG011485-01. The research of Zheng T. Ke was partially supported by the NSF CAREER grant DMS-1943902.

Appendix A Proof of Lemma 7.1

By definition of the multi-log(p)\log(p) term, it suffices to show that, for every ϵ>0\epsilon>0, as pp\to\infty,

pϵ+b(XpS)0,andpϵ+b(XpS).p^{-\epsilon+b}\ \mathbb{P}(X_{p}\in S)\to 0,\qquad\mbox{and}\qquad p^{\epsilon+b}\ \mathbb{P}(X_{p}\in S)\to\infty. (A.1)

We introduce two sets S¯\underline{S} and S¯\overline{S} such that

S¯SS¯.\underline{S}\subset S\subset\overline{S}.

Define m(x)=(xμ)Σ1(xμ)m(x)=(x-\mu)^{\prime}\Sigma^{-1}(x-\mu) for any xdx\in\mathbb{R}^{d}. By definition, b=infxSm(x)b=\inf_{x\in S}m(x). As a result, m(x)bm(x)\geq b for all xSx\in S. Define

S¯={xd:m(x)b}.\overline{S}=\{x\in\mathbb{R}^{d}:m(x)\geq b\}. (A.2)

Then, SS¯S\subset\overline{S}. Furthermore, since m(x)m(x) is a quadratic function and b=infxSm(x)b=\inf_{x\in S}m(x), given any ϵ>0\epsilon>0, there exists x0Sx_{0}\in S such that

m(x0)b+ϵ/8.m(x_{0})\leq b+\epsilon/8. (A.3)

Note that (A.3) guarantees that x0μ\|x_{0}-\mu\| is bounded. For any xSx\in S with xx01\|x-x_{0}\|\leq 1,

|m(x)m(x0)|\displaystyle|m(x)-m(x_{0})| 2|(xμ)Σ1(xx0)|+|(xx0)Σ1(xx0)|\displaystyle\leq 2|(x-\mu)^{\prime}\Sigma^{-1}(x-x_{0})|+|(x-x_{0})^{\prime}\Sigma^{-1}(x-x_{0})|
2xμΣ1xx0+Σ1xx02\displaystyle\leq 2\|x-\mu\|\|\Sigma^{-1}\|\cdot\|x-x_{0}\|+\|\Sigma^{-1}\|\|x-x_{0}\|^{2}
C1xx0+C2xx02,\displaystyle\leq C_{1}\|x-x_{0}\|+C_{2}\|x-x_{0}\|^{2},

where C1C_{1} and C2C_{2} are positive constants that only depend on (μ,Σ,b,ϵ)(\mu,\Sigma,b,\epsilon). It follows that there exists a constant δ1>0\delta_{1}>0 such that

xS,xx0δ1|m(x)m(x0)|ϵ/8.x\in S,\quad\|x-x_{0}\|\leq\delta_{1}\qquad\Longrightarrow\qquad|m(x)-m(x_{0})|\leq\epsilon/8. (A.4)

Additionally, since SS is an open set and x0Sx_{0}\in S, there exists δ2>0\delta_{2}>0, such that

{xd:xx0δ2}S.\{x\in\mathbb{R}^{d}:\|x-x_{0}\|\leq\delta_{2}\}\subset S.

Define

S¯={xd:xx0δ},whereδ=min{δ1,δ2}.\underline{S}=\{x\in\mathbb{R}^{d}:\|x-x_{0}\|\leq\delta\},\qquad\mbox{where}\quad\delta=\min\{\delta_{1},\delta_{2}\}. (A.5)

It is easy to see that S¯S\underline{S}\subset S. Additionally, in light of (A.3) and (A.4),

m(x)b+ϵ/4,for all xS¯.m(x)\leq b+\epsilon/4,\qquad\mbox{for all }x\in\underline{S}. (A.6)

Since S¯SS¯\underline{S}\subset S\subset\overline{S}, to show (A.1), it suffices to show that

pϵ+b(XpS¯)p^{\epsilon+b}\ \mathbb{P}\bigl{(}X_{p}\in\underline{S}\bigr{)}\to\infty (A.7)

and

pϵ+b(XpS¯)0.p^{-\epsilon+b}\ \mathbb{P}\bigl{(}X_{p}\in\overline{S}\bigr{)}\to 0. (A.8)

First, we show (A.7). Let fp(x)f_{p}(x) denote the density of 𝒩d(μp,12log(p)Σp){\cal N}_{d}\bigl{(}\mu_{p},\;\frac{1}{2\log(p)}\Sigma_{p}\bigr{)}. Write mp(x)=(xμp)Σp1(xμp)m_{p}(x)=(x-\mu_{p})^{\prime}\Sigma_{p}^{-1}(x-\mu_{p}). It is seen that

fp(x)=[2log(p)]d/2(2π)d/2|det(Σp)|1/2pmp(x).f_{p}(x)=\frac{[2\log(p)]^{d/2}}{(2\pi)^{d/2}|\det(\Sigma_{p})|^{1/2}}\cdot p^{-m_{p}(x)}. (A.9)

By direct calculations,

(XpS¯|μp,Σp)\displaystyle\mathbb{P}\bigl{(}X_{p}\in\underline{S}\;|\;\mu_{p},\Sigma_{p}\bigr{)} =[2log(p)]d/2(2π)d/2|det(Σp)|1/2xS¯pmp(x)𝑑x\displaystyle=\frac{[2\log(p)]^{d/2}}{(2\pi)^{d/2}|\det(\Sigma_{p})|^{1/2}}\int_{x\in\underline{S}}p^{-m_{p}(x)}dx (A.10)
[log(p)]d/2πd/2|det(Σp)|1/2Volume(S¯)psupxS¯{mp(x)}.\displaystyle\geq\frac{[\log(p)]^{d/2}}{\pi^{d/2}|\det(\Sigma_{p})|^{1/2}}\cdot\mathrm{Volume}(\underline{S})\cdot p^{-\sup_{x\in\underline{S}}\{m_{p}(x)\}}. (A.11)

The assumptions on (μp,Σp)(\mu_{p},\Sigma_{p}) imply that, for any constant γ>0\gamma>0,

limp(μpμ>γ or ΣpΣ>γ)=0.\lim_{p\to\infty}\mathbb{P}\bigl{(}\|\mu_{p}-\mu\|>\gamma\mbox{ or }\|\Sigma_{p}-\Sigma\|>\gamma\bigr{)}=0.

Let EE be the event that μpμγ\|\mu_{p}-\mu\|\leq\gamma_{*} and ΣpΣγ\|\Sigma_{p}-\Sigma\|\leq\gamma_{*}, for some γ\gamma_{*} to be decided. On this event, for any xS¯x\in\underline{S},

|m(x)mp(x)|\displaystyle|m(x)-m_{p}(x)| |(xμ)Σ1(xμ)(xμ)Σp1(xμ)|\displaystyle\leq|(x-\mu)^{\prime}\Sigma^{-1}(x-\mu)-(x-\mu)^{\prime}\Sigma_{p}^{-1}(x-\mu)|
+|(xμ)Σp1(xμ)(xμp)Σp1(xμp)|\displaystyle\qquad+|(x-\mu)^{\prime}\Sigma_{p}^{-1}(x-\mu)-(x-\mu_{p})^{\prime}\Sigma_{p}^{-1}(x-\mu_{p})|
|(xμ)(Σ1Σp1)(xμ)|+2|(xμ)Σp1(μμp)|\displaystyle\leq|(x-\mu)^{\prime}(\Sigma^{-1}-\Sigma_{p}^{-1})(x-\mu)|+2|(x-\mu)^{\prime}\Sigma_{p}^{-1}(\mu-\mu_{p})|
+(μμp)Σp1(μμp)\displaystyle\qquad+(\mu-\mu_{p})^{\prime}\Sigma_{p}^{-1}(\mu-\mu_{p})
xμ2Σ1Σp1ΣpΣ+2xμΣp1μμp\displaystyle\leq\|x-\mu\|^{2}\|\Sigma^{-1}\|\|\Sigma_{p}^{-1}\|\cdot\|\Sigma_{p}-\Sigma\|+2\|x-\mu\|\|\Sigma_{p}^{-1}\|\cdot\|\mu-\mu_{p}\|
+Σp1μμp2\displaystyle\qquad+\|\Sigma_{p}^{-1}\|\cdot\|\mu-\mu_{p}\|^{2}
C3γ+C4γ2,\displaystyle\leq C_{3}\gamma_{*}+C_{4}\gamma_{*}^{2},

where C3C_{3} and C4C_{4} are positive constants that do not depend on γ\gamma_{*}, and in the last line we have used the fact that S¯\underline{S} is a bounded set so that xμ\|x-\mu\| is bounded. It follows that we can choose an appropriately small γ\gamma_{*} such that

|m(x)mp(x)|ϵ/4,for all xS¯.\displaystyle|m(x)-m_{p}(x)|\leq\epsilon/4,\qquad\mbox{for all }x\in\underline{S}. (A.12)

Combining (A.12) with (A.6) gives

supxS¯mp(x)b+ϵ/2,on the event E.\sup_{x\in\underline{S}}m_{p}(x)\leq b+\epsilon/2,\qquad\mbox{on the event }E.

Moreover, since S¯\underline{S} is a ball with radius δ\delta,

Volume(S¯)=δdVolume(Bd),\mathrm{Volume}(\underline{S})=\delta^{d}\cdot\mathrm{Volume}(B_{d}),

where BdB_{d} is the unit ball in d\mathbb{R}^{d}, whose volume is a constant. We plug the above results into (A.10) and notice that |det(Σp)||det(Σ)|C5δ|\det(\Sigma_{p})|\geq|\det(\Sigma)|-C_{5}\delta on the event EE, for a constant C5>0C_{5}>0. It yields that, when (μp,Σp)(\mu_{p},\Sigma_{p}) satisfies the event EE,

(XpS¯|μp,Σp)c0[log(p)]d/2p(b+ϵ/2),\mathbb{P}\bigl{(}X_{p}\in\underline{S}\;|\;\mu_{p},\Sigma_{p}\bigr{)}\geq c_{0}[\log(p)]^{d/2}\cdot p^{-(b+\epsilon/2)}, (A.13)

for some constant c0>0c_{0}>0. It follows that

(XpS¯)(E)c0[log(p)]d/2p(b+ϵ/2).\mathbb{P}\bigl{(}X_{p}\in\underline{S}\bigr{)}\geq\mathbb{P}(E)\cdot c_{0}[\log(p)]^{d/2}p^{-(b+\epsilon/2)}.

We plug it into the left hand side of (A.7) and note that (E)1\mathbb{P}(E)\to 1 as pp\to\infty. This gives the desired claim in (A.7).

Next, we show (A.8). We define a counterpart of the set S¯\overline{S} by

S¯p={xd:mp(x)b}.\overline{S}_{p}=\{x\in\mathbb{R}^{d}:m_{p}(x)\geq b\}.

Define Yp=2log(p)Σp1/2(Xpμp)Y_{p}=\sqrt{2\log(p)}\cdot\Sigma_{p}^{-1/2}(X_{p}-\mu_{p}). Then, Yp𝒩d(0,Id)Y_{p}\sim{\cal N}_{d}(0,I_{d}) and

XpS¯pif and only ifYp22blog(p).X_{p}\in\overline{S}_{p}\qquad\mbox{if and only if}\qquad\|Y_{p}\|^{2}\geq 2b\log(p).

The distribution of Yp2\|Y_{p}\|^{2} is a χd2\chi^{2}_{d} distribution, which does not depend on (μp,Σp)(\mu_{p},\Sigma_{p}). We have

(XpS¯p)\displaystyle\mathbb{P}\bigl{(}X_{p}\in\overline{S}_{p}\bigr{)} =𝔼[(XpS¯p|μp,Σp)]\displaystyle=\mathbb{E}\bigl{[}\mathbb{P}\bigl{(}X_{p}\in\overline{S}_{p}\;|\;\mu_{p},\Sigma_{p}\bigr{)}\bigr{]} (A.14)
=𝔼[(Yp22blog(p))]\displaystyle=\mathbb{E}\bigl{[}\mathbb{P}\bigl{(}\|Y_{p}\|^{2}\geq 2b\log(p)\bigr{)}\bigr{]} (A.15)
=(χd22blog(p)).\displaystyle=\mathbb{P}\bigl{(}\chi^{2}_{d}\geq 2b\log(p)\bigr{)}. (A.16)

For the chi-square distribution, the tail probability has an explicit form:

(χd22blog(p))=Γ(d/2,blog(p))Γ(d/2),\mathbb{P}\bigl{(}\chi^{2}_{d}\geq 2b\log(p)\bigr{)}=\frac{\Gamma(d/2,\ b\log(p))}{\Gamma(d/2)},

where Γ(s,x)xts1exp(t)𝑑t\Gamma(s,x)\equiv\int_{x}^{\infty}t^{s-1}\exp(-t)dt is the upper incomplete gamma function and Γ(s)Γ(s,0)\Gamma(s)\equiv\Gamma(s,0) is the ordinary gamma function. By property of the upper incomplete gamma function, Γ(s,x)/(xs1exp(x))1\Gamma(s,x)/(x^{s-1}\exp(-x))\to 1 as xx\to\infty. It follows that

Γ(d/2,blog(p))[blog(p)]d/21pb1,asp.\frac{\Gamma(d/2,\ b\log(p))}{[b\log(p)]^{d/2-1}p^{-b}}\to 1,\qquad\mbox{as}\quad p\to\infty.

In particular, when pp is sufficiently large, the left hand side is 1/2\geq 1/2. We plug these results into (A.14) to get

(XpS¯p)[blog(p)]d/212Γ(d/2)pb.\mathbb{P}\bigl{(}X_{p}\in\overline{S}_{p}\bigr{)}\geq\frac{[b\log(p)]^{d/2-1}}{2\Gamma(d/2)}\cdot p^{-b}. (A.17)
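The incomplete-gamma asymptotic invoked above is easy to confirm numerically; in scipy, gammaincc is the regularized upper incomplete gamma function, so Γ(s,x)\Gamma(s,x) is gammaincc(s, x) times Γ(s)\Gamma(s):

import numpy as np
from scipy.special import gammaincc, gamma

s = 1.5                                     # d/2 with d = 3
for x in [5.0, 20.0, 80.0, 320.0]:
    upper = gammaincc(s, x) * gamma(s)      # Gamma(s, x)
    print(x, upper / (x**(s - 1) * np.exp(-x)))    # ratio tends to 1 as x grows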

It remains to study the difference caused by replacing S¯p\overline{S}_{p} by S¯\overline{S}. Let

Up=(S¯\S¯p)(S¯p\S¯).U_{p}=(\overline{S}\backslash\overline{S}_{p})\cup(\overline{S}_{p}\backslash\overline{S}).

Then,

|(XpS¯)(XpS¯p)|(XpUp).\bigl{|}\mathbb{P}\bigl{(}X_{p}\in\overline{S}\bigr{)}-\mathbb{P}\bigl{(}X_{p}\in\overline{S}_{p}\bigr{)}\bigr{|}\leq\mathbb{P}\bigl{(}X_{p}\in U_{p}\bigr{)}. (A.18)

Similar to (A.10), we have

(XpUp|μp,Σp)\displaystyle\mathbb{P}\bigl{(}X_{p}\in U_{p}\;|\;\mu_{p},\Sigma_{p}\bigr{)} =[2log(p)]d/2(2π)d/2|det(Σp)|1/2xUppmp(x)𝑑x\displaystyle=\frac{[2\log(p)]^{d/2}}{(2\pi)^{d/2}|\det(\Sigma_{p})|^{1/2}}\int_{x\in U_{p}}p^{-m_{p}(x)}dx (A.19)
[log(p)]d/2πd/2|det(Σp)|1/2Volume(Up)pinfxUp{mp(x)}.\displaystyle\leq\frac{[\log(p)]^{d/2}}{\pi^{d/2}|\det(\Sigma_{p})|^{1/2}}\cdot\mathrm{Volume}(U_{p})\cdot p^{-\inf_{x\in U_{p}}\{m_{p}(x)\}}. (A.20)

For a constant γ>0\gamma>0 to be decided, let FF be the event that

μpμγ,andΣpΣγ.\|\mu_{p}-\mu\|\leq\gamma,\qquad\mbox{and}\quad\|\Sigma_{p}-\Sigma\|\leq\gamma. (A.21)

On this event, we study both Volume(Up)\mathrm{Volume}(U_{p}) and infxUpmp(x)\inf_{x\in U_{p}}m_{p}(x). Re-write

Up=(S¯c\S¯pc)(S¯pc\S¯c).U_{p}=(\overline{S}^{c}\backslash\overline{S}^{c}_{p})\cup(\overline{S}^{c}_{p}\backslash\overline{S}^{c}).

By definition, S¯c={xd:m(x)b}={xd:Σ1/2(xμ)b}\overline{S}^{c}=\{x\in\mathbb{R}^{d}:m(x)\leq b\}=\{x\in\mathbb{R}^{d}:\|\Sigma^{-1/2}(x-\mu)\|\leq\sqrt{b}\}, and S¯pc={xd:Σp1/2(xμp)b}\overline{S}_{p}^{c}=\{x\in\mathbb{R}^{d}:\|\Sigma_{p}^{-1/2}(x-\mu_{p})\|\leq\sqrt{b}\}. On the event FF, for any xS¯pcx\in\overline{S}^{c}_{p},

Σ1/2(xμ)\displaystyle\|\Sigma^{-1/2}(x-\mu)\| b+Σ1/2(xμ)Σp1/2(xμp)\displaystyle\leq\sqrt{b}+\|\Sigma^{-1/2}(x-\mu)-\Sigma_{p}^{-1/2}(x-\mu_{p})\|
b+Σ1/2(μpμ)+(Σ1/2Σp1/2)(xμp)\displaystyle\leq\sqrt{b}+\|\Sigma^{-1/2}(\mu_{p}-\mu)\|+\|(\Sigma^{-1/2}-\Sigma_{p}^{-1/2})(x-\mu_{p})\|
b+Σ1/2μpμ+Σ1/2Σp1/2IdΣp1/2(xμp)\displaystyle\leq\sqrt{b}+\|\Sigma^{-1/2}\|\cdot\|\mu_{p}-\mu\|+\|\Sigma^{-1/2}\Sigma_{p}^{1/2}-I_{d}\|\cdot\|\Sigma_{p}^{-1/2}(x-\mu_{p})\|
b+Σ1/2μpμ+bΣ1/2Σp1/2Id\displaystyle\leq\sqrt{b}+\|\Sigma^{-1/2}\|\cdot\|\mu_{p}-\mu\|+\sqrt{b}\cdot\|\Sigma^{-1/2}\Sigma_{p}^{1/2}-I_{d}\|
b+C5γ,\displaystyle\leq\sqrt{b}+C_{5}\gamma,

for a constant C5>0C_{5}>0 that does not depend on γ\gamma. Choosing γ<C51b\gamma<C_{5}^{-1}\sqrt{b}, we have Σ1/2(xμ)2b\|\Sigma^{-1/2}(x-\mu)\|\leq 2\sqrt{b} for all xS¯pcx\in\overline{S}_{p}^{c}. Additionally, by definition, Σ1/2(xμ)b\|\Sigma^{-1/2}(x-\mu)\|\leq\sqrt{b} for all xS¯cx\in\overline{S}^{c}. Combining the above gives

Up(S¯cS¯pc){xd:Σ1/2(xμ)2b}.U_{p}\;\subset\;(\overline{S}^{c}\cup\overline{S}^{c}_{p})\;\subset\;\bigl{\{}x\in\mathbb{R}^{d}:\|\Sigma^{-1/2}(x-\mu)\|\leq 2\sqrt{b}\bigr{\}}.

Recall that BdB_{d} is the unit ball in d\mathbb{R}^{d}. It follows immediately that

Volume(Up)(2b)dVolume(Bd),on the event F.\mathrm{Volume}(U_{p})\leq(2\sqrt{b})^{d}\cdot\mathrm{Volume}(B_{d}),\qquad\mbox{on the event $F$}. (A.22)

At the same time, for any xS¯x\in\overline{S}, on the event FF,

Σp1/2(xμp)\displaystyle\|\Sigma_{p}^{-1/2}(x-\mu_{p})\| Σ1/2(xμ)Σp1/2(xμp)Σ1/2(xμ)\displaystyle\geq\|\Sigma^{-1/2}(x-\mu)\|-\|\Sigma_{p}^{-1/2}(x-\mu_{p})-\Sigma^{-1/2}(x-\mu)\|
Σ1/2(xμ)Σp1/2(μpμ)(Σ1/2Σp1/2)(xμ)\displaystyle\geq\|\Sigma^{-1/2}(x-\mu)\|-\|\Sigma_{p}^{-1/2}(\mu_{p}-\mu)\|-\|(\Sigma^{-1/2}-\Sigma_{p}^{-1/2})(x-\mu)\|
Σ1/2(xμ)Σp1/2μpμΣp1/2Σ1/2IdΣ1/2(xμ)\displaystyle\geq\|\Sigma^{-1/2}(x-\mu)\|-\|\Sigma_{p}^{-1/2}\|\cdot\|\mu_{p}-\mu\|-\|\Sigma_{p}^{-1/2}\Sigma^{1/2}-I_{d}\|\cdot\|\Sigma^{-1/2}(x-\mu)\|
=Σ1/2(xμ)(1Σp1/2Σ1/2Id)Σ1/2μpμ\displaystyle=\|\Sigma^{-1/2}(x-\mu)\|\bigl{(}1-\|\Sigma_{p}^{-1/2}\Sigma^{1/2}-I_{d}\|\bigr{)}-\|\Sigma^{-1/2}\|\cdot\|\mu_{p}-\mu\|
Σ1/2(xμ)(1C6γ)Σ1/2γ\displaystyle\geq\|\Sigma^{-1/2}(x-\mu)\|(1-C_{6}\gamma)-\|\Sigma^{-1/2}\|\gamma
b(1C6γ)Σ1/2γ,\displaystyle\geq\sqrt{b}(1-C_{6}\gamma)-\|\Sigma^{-1/2}\|\gamma,

where C6>0C_{6}>0 is a constant that does not depend on γ\gamma and in the last line we have used the fact that Σ1/2(xμ)b\|\Sigma^{-1/2}(x-\mu)\|\geq\sqrt{b} for xS¯x\in\overline{S}. We choose γ\gamma properly small so that b(1C6γ)Σ1/2γbϵ/2\sqrt{b}(1-C_{6}\gamma)-\|\Sigma^{-1/2}\|\gamma\geq\sqrt{b-\epsilon/2}. It follows that

mp(x)=Σp1/2(xμp)2bϵ/2,for all xS¯.m_{p}(x)=\|\Sigma_{p}^{-1/2}(x-\mu_{p})\|^{2}\geq b-\epsilon/2,\qquad\mbox{for all }x\in\overline{S}. (A.23)

Additionally, the definition of S¯p\overline{S}_{p} already guarantees that mp(x)bm_{p}(x)\geq b for all xS¯px\in\overline{S}_{p}. Consequently,

infxUpmp(x)infxS¯S¯p{mp(x)}bϵ/2,on the event F.\inf_{x\in U_{p}}m_{p}(x)\geq\inf_{x\in\overline{S}\cup\overline{S}_{p}}\{m_{p}(x)\}\geq b-\epsilon/2,\qquad\mbox{on the event $F$}. (A.24)

We plug (A.22) and (A.24) into (A.19). It yields that, on the event FF,

(XpUp|μp,Σp)C7[log(p)]d/2p(bϵ/2),\mathbb{P}\bigl{(}X_{p}\in U_{p}\;|\;\mu_{p},\Sigma_{p}\bigr{)}\leq C_{7}[\log(p)]^{d/2}\cdot p^{-(b-\epsilon/2)}, (A.25)

for a constant C7>0C_{7}>0. Then,

(XpUp)(F)C7[log(p)]d/2p(bϵ/2)+(Fc).\mathbb{P}\bigl{(}X_{p}\in U_{p}\bigr{)}\leq\mathbb{P}(F)\cdot C_{7}[\log(p)]^{d/2}\cdot p^{-(b-\epsilon/2)}+\mathbb{P}(F^{c}).

By our assumption, for any γ>0\gamma>0 and L>0L>0, (μpμ>γ)pL\mathbb{P}(\|\mu_{p}-\mu\|>\gamma)\leq p^{-L} and (ΣpΣ>γ)pL\mathbb{P}(\|\Sigma_{p}-\Sigma\|>\gamma)\leq p^{-L}. In particular, we can choose L=bL=b. It gives

(Fc)pb.\mathbb{P}(F^{c})\leq p^{-b}.

We combine the above results and plug them into (A.18). It follows that

|(XpS¯)(XpS¯p)|C7[log(p)]d/2p(bϵ/2)+pb.\bigl{|}\mathbb{P}\bigl{(}X_{p}\in\overline{S}\bigr{)}-\mathbb{P}\bigl{(}X_{p}\in\overline{S}_{p}\bigr{)}\bigr{|}\leq C_{7}[\log(p)]^{d/2}\cdot p^{-(b-\epsilon/2)}+p^{-b}. (A.26)

Combining (A.14)-(A.16), the asymptotics of the incomplete gamma function above, and (A.26) gives

(XpS¯)[1+o(1)]C7[log(p)]d/2p(bϵ/2).\mathbb{P}\bigl{(}X_{p}\in\overline{S}\bigr{)}\leq[1+o(1)]\cdot C_{7}[\log(p)]^{d/2}\cdot p^{-(b-\epsilon/2)}.

This gives the claim in (A.8). The proof of this lemma is complete.

Appendix B Proof of Lemma 7.2

First, we study the least-squares. Note that β^\hat{\beta} has an explicit solution: β^=G1XTy\hat{\beta}=G^{-1}X^{T}y. Since GG is a block-wise diagonal matrix, we immediately have

[β^jβ^j+1]=[1ρρ1]1[xjTyxj+1Ty]=11ρ2[xjTyρxj+1Tyxj+1TyρxjTy].\begin{bmatrix}\hat{\beta}_{j}\\ \hat{\beta}_{j+1}\end{bmatrix}=\begin{bmatrix}1\;\;\;&\rho\\ \rho\;\;\;&1\end{bmatrix}^{-1}\begin{bmatrix}x_{j}^{T}y\\ x_{j+1}^{T}y\end{bmatrix}=\frac{1}{1-\rho^{2}}\begin{bmatrix}x_{j}^{T}y-\rho x_{j+1}^{T}y\\ x_{j+1}^{T}y-\rho x_{j}^{T}y\end{bmatrix}.

Recall that y~=Xy/2log(p)\tilde{y}=X^{\prime}y/\sqrt{2\log(p)}. Then, |β^j|>2ulog(p)|\hat{\beta}_{j}|>\sqrt{2u\log(p)} if and only if

11ρ2|y~jρy~j+1|>u.\frac{1}{1-\rho^{2}}|\tilde{y}_{j}-\rho\tilde{y}_{j+1}|>\sqrt{u}.

It immediately gives the rejection region for least-squares.
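
As a quick numerical sanity check of this equivalence (not part of the proof), the following sketch verifies it by simulation; the values of p, \rho, u and the random scores are arbitrary choices of ours.

```python
# Minimal check: |beta_hat_j| > sqrt(2u log p)  iff  |y~_j - rho*y~_{j+1}|/(1-rho^2) > sqrt(u).
import numpy as np

rng = np.random.default_rng(0)
p, rho, u = 100, 0.4, 1.3                      # arbitrary illustrative values
G = np.array([[1.0, rho], [rho, 1.0]])         # one 2x2 block of the Gram matrix

for _ in range(1000):
    h = 2.0 * rng.normal(size=2)               # plays the role of (x_j'y, x_{j+1}'y)
    beta_hat = np.linalg.solve(G, h)           # least-squares within the block
    y_tilde = h / np.sqrt(2 * np.log(p))
    lhs = abs(beta_hat[0]) > np.sqrt(2 * u * np.log(p))
    rhs = abs(y_tilde[0] - rho * y_tilde[1]) / (1 - rho**2) > np.sqrt(u)
    assert lhs == rhs
```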

Next, we study the Lasso-path. We write W_{j}^{*,\text{path}} as W_{j}^{*} for notational simplicity. The lasso estimate \hat{\beta}(\lambda) minimizes the objective

Q(b)=12yXb2+λb1=12y2yTXb+12bTGb+λb1.Q(b)=\frac{1}{2}\|y-Xb\|^{2}+\lambda\|b\|_{1}=\frac{1}{2}\|y\|^{2}-y^{T}Xb+\frac{1}{2}b^{T}Gb+\lambda\|b\|_{1}.

When GG is a block-wise diagonal matrix, the objective Q(b)Q(b) is separable, and we can optimize over each pair of (bj,bj+1)(b_{j},b_{j+1}) separately. It reduces to solving many bi-variate problems:

(β^j(λ),β^j+1(λ))T=argminb{12||y[xj,xj+1]b||22+λb1}.(\hat{\beta}_{j}(\lambda),\hat{\beta}_{j+1}(\lambda))^{T}=\mathrm{argmin}_{b}\Big{\{}\frac{1}{2}||y-[x_{j},x_{j+1}]b||_{2}^{2}+\lambda||b||_{1}\Big{\}}. (B.1)

Write b^=(β^j(λ),β^j+1(λ))T\hat{b}=(\hat{\beta}_{j}(\lambda),\hat{\beta}_{j+1}(\lambda))^{T} and let

B=[1ρρ1]andh=[xjTyxj+1Ty].B=\begin{bmatrix}1\;\;\;&\rho\\ \rho\;\;\;&1\end{bmatrix}\qquad\mbox{and}\qquad h=\begin{bmatrix}x_{j}^{T}y\\ x_{j+1}^{T}y\end{bmatrix}.

Then, the optimization (B.1) can be written as

b^=argminb{hTb+bTBb/2+λb1}.\hat{b}=\mathrm{argmin}_{b}\bigl{\{}-h^{T}b+b^{T}Bb/2+\lambda\|b\|_{1}\bigr{\}}. (B.2)

Recall that WjW_{j}^{*} is the value of λ\lambda at which b^1\hat{b}_{1} becomes nonzero for the first time. Our goal is to find a region of (h1,h2)(h_{1},h_{2}) such that Wj>tp(u)2ulog(p)W_{j}^{*}>t_{p}(u)\equiv\sqrt{2u\log(p)}.

It suffices to consider the case of ρ0\rho\geq 0. To see this, we consider changing ρ\rho to ρ-\rho in the matrix BB. The objective remains unchanged if we also change b2b_{2} to b2-b_{2} and h2h_{2} to h2-h_{2}. Note that the change of b2b_{2} to b2-b_{2} has no impact on WjW_{j}^{*}; this means WjW_{j}^{*} is unchanged if we simultaneously flip the sign of ρ\rho and h2h_{2}. Consequently, once we know the rejection region for ρ>0\rho>0, we can immediately obtain that for ρ<0\rho<0 by a reflection of the region with respect to the x-axis.

Below, we fix ρ0\rho\geq 0. We first derive the explicit form of the whole solution path and then use it to decide the rejection region. Taking sub-gradients of (B.1), we find that b^\hat{b} has to satisfy

[1ρρ1][b^1b^2]+λ[sgn(b^1)sgn(b^2)]=[h1h2],\begin{bmatrix}1\;\;\;&\rho\\ \rho\;\;\;&1\end{bmatrix}\begin{bmatrix}\hat{b}_{1}\\ \hat{b}_{2}\end{bmatrix}+\lambda\begin{bmatrix}\mathrm{sgn}(\hat{b}_{1})\\ \mathrm{sgn}(\hat{b}_{2})\end{bmatrix}=\begin{bmatrix}h_{1}\\ h_{2}\end{bmatrix}, (B.3)

where sgn(x)=1\mathrm{sgn}(x)=1 if x>0x>0, sgn(x)=1\mathrm{sgn}(x)=-1 if x<0x<0, and sgn(x)\mathrm{sgn}(x) can be equal to any value in [1,1][-1,1] if x=0x=0. Let λ1>λ2>0\lambda_{1}>\lambda_{2}>0 be the values at which variables enter the solution path. When λ(λ1,)\lambda\in(\lambda_{1},\infty), b^1=0\hat{b}_{1}=0 and b^2=0\hat{b}_{2}=0. Plugging them into (B.3) gives sgn(b^1)=λ1h1\mathrm{sgn}(\hat{b}_{1})=\lambda^{-1}h_{1}. The definition of sgn(b^1)\mathrm{sgn}(\hat{b}_{1}) implies that |h1|λ|h_{1}|\leq\lambda, for any λ>λ1\lambda>\lambda_{1}. We then have |h1|λ1|h_{1}|\leq\lambda_{1}. Similarly, it is true that |h2|λ1|h_{2}|\leq\lambda_{1}. It gives

λ1=max{|h1|,|h2|}.\lambda_{1}=\max\{|h_{1}|,|h_{2}|\}. (B.4)

We first assume |h_{1}|>|h_{2}|. By (B.3) and the continuity of the solution path, there exists a sufficiently small constant \delta>0 such that, for \lambda\in(\lambda_{2}-\delta,\lambda_{2}), the following equation holds:

[1ρρ1][b^1(λ)b^2(λ)]+λ[sgn(b^1)sgn(b^2)]=[h1h2].\begin{bmatrix}1\;\;\;&\rho\\ \rho\;\;\;&1\end{bmatrix}\begin{bmatrix}\hat{b}_{1}(\lambda)\\ \hat{b}_{2}(\lambda)\end{bmatrix}+\lambda\begin{bmatrix}\mathrm{sgn}(\hat{b}_{1})\\ \mathrm{sgn}(\hat{b}_{2})\end{bmatrix}=\begin{bmatrix}h_{1}\\ h_{2}\end{bmatrix}. (B.5)

The sign vector of b^\hat{b} for λ(λ2δ,λ2)\lambda\in(\lambda_{2}-\delta,\lambda_{2}) has four different cases: (1,1)T(1,1)^{T}, (1,1)T(1,-1)^{T}, (1,1)T(-1,1)^{T}, (1,1)T(-1,-1)^{T}. For these four different cases, we can use (B.5) to solve b^\hat{b}. The solutions in four cases are respectively

11ρ2[(h1ρh2)(1ρ)λ(h2ρh1)(1ρ)λ],11ρ2[(h1ρh2)(1+ρ)λ(h2ρh1)+(1+ρ)λ],\displaystyle\frac{1}{1-\rho^{2}}\begin{bmatrix}(h_{1}-\rho h_{2})-(1-\rho)\lambda\\ (h_{2}-\rho h_{1})-(1-\rho)\lambda\end{bmatrix},\qquad\frac{1}{1-\rho^{2}}\begin{bmatrix}(h_{1}-\rho h_{2})-(1+\rho)\lambda\\ (h_{2}-\rho h_{1})+(1+\rho)\lambda\end{bmatrix},
11ρ2[(h1ρh2)+(1+ρ)λ(h2ρh1)(1+ρ)λ],11ρ2[(h1ρh2)+(1ρ)λ(h2ρh1)+(1ρ)λ].\displaystyle\frac{1}{1-\rho^{2}}\begin{bmatrix}(h_{1}-\rho h_{2})+(1+\rho)\lambda\\ (h_{2}-\rho h_{1})-(1+\rho)\lambda\end{bmatrix},\qquad\frac{1}{1-\rho^{2}}\begin{bmatrix}(h_{1}-\rho h_{2})+(1-\rho)\lambda\\ (h_{2}-\rho h_{1})+(1-\rho)\lambda\end{bmatrix}.

The solution b^\hat{b} has to match the sign assumption on b^\hat{b}. For each of the four cases, the requirement becomes

  • Case 1:  (h1ρh2)(1ρ)λ>0(h_{1}-\rho h_{2})-(1-\rho)\lambda>0,   (h2ρh1)(1ρ)λ>0(h_{2}-\rho h_{1})-(1-\rho)\lambda>0.

  • Case 2:  (h1ρh2)(1+ρ)λ>0(h_{1}-\rho h_{2})-(1+\rho)\lambda>0,   (h2ρh1)+(1+ρ)λ<0(h_{2}-\rho h_{1})+(1+\rho)\lambda<0.

  • Case 3:  (h1ρh2)+(1+ρ)λ<0(h_{1}-\rho h_{2})+(1+\rho)\lambda<0,   (h2ρh1)(1+ρ)λ>0(h_{2}-\rho h_{1})-(1+\rho)\lambda>0.

  • Case 4:  (h1ρh2)+(1ρ)λ<0(h_{1}-\rho h_{2})+(1-\rho)\lambda<0,   (h2ρh1)+(1ρ)λ<0(h_{2}-\rho h_{1})+(1-\rho)\lambda<0.

Note that we have assumed |h1|>|h2||h_{1}|>|h_{2}|. Then, Case kk is possible only in the region 𝒜k{\cal A}_{k}, where

𝒜1={(h1,h2):h1>0,ρh1<h2<h1},𝒜2={(h1,h2):h1>0,h1<h2<ρh1},\displaystyle{\cal A}_{1}=\{(h_{1},h_{2}):h_{1}>0,\;\;\rho h_{1}<h_{2}<h_{1}\},\quad{\cal A}_{2}=\{(h_{1},h_{2}):h_{1}>0,\;\;-h_{1}<h_{2}<\rho h_{1}\},
𝒜3={(h1,h2):h1<0,ρh1<h2<h1},𝒜4={(h1,h2):h1<0,h1<h2<ρh1}.\displaystyle{\cal A}_{3}=\{(h_{1},h_{2}):h_{1}<0,\;\;\rho h_{1}<h_{2}<-h_{1}\},\quad{\cal A}_{4}=\{(h_{1},h_{2}):h_{1}<0,\;\;h_{1}<h_{2}<\rho h_{1}\}.

In each case, λ1=|h1|\lambda_{1}=|h_{1}|. To get the value of λ2\lambda_{2}, we use the continuity of the solution path. It implies that b^2(λ)=0\hat{b}_{2}(\lambda)=0 at λ=λ2\lambda=\lambda_{2}. As a result, the value of λ2\lambda_{2} in Case kk is

λ2(1)=h2ρh11ρ,λ2(2)=ρh1h21+ρ,λ2(3)=h2ρh11+ρ,λ2(4)=ρh1h21ρ.\lambda_{2}^{(1)}=\frac{h_{2}-\rho h_{1}}{1-\rho},\qquad\lambda_{2}^{(2)}=\frac{\rho h_{1}-h_{2}}{1+\rho},\qquad\lambda_{2}^{(3)}=\frac{h_{2}-\rho h_{1}}{1+\rho},\qquad\lambda_{2}^{(4)}=\frac{\rho h_{1}-h_{2}}{1-\rho}. (B.6)

It is easy to verify that λ2<λ1\lambda_{2}<\lambda_{1} in each case. We also need to check that in the region 𝒜k{\cal A}_{k}, the KKT condition (B.3) can be satisfied with b^2=0\hat{b}_{2}=0 for all λ(λ2(k),λ1)\lambda\in(\lambda^{(k)}_{2},\lambda_{1}). For example, in Case 1, (B.3) becomes

[1ρρ1][b^10]+λ[1c]=[h1h2],for some |c|1.\begin{bmatrix}1\;\;\;&\rho\\ \rho\;\;\;&1\end{bmatrix}\begin{bmatrix}\hat{b}_{1}\\ 0\end{bmatrix}+\lambda\begin{bmatrix}1\\ c\end{bmatrix}=\begin{bmatrix}h_{1}\\ h_{2}\end{bmatrix},\qquad\mbox{for some }|c|\leq 1.

We can solve the equations to get \hat{b}_{1}=h_{1}-\lambda and \lambda c=h_{2}-\rho\hat{b}_{1}=(h_{2}-\rho h_{1})+\rho\lambda. It can be verified that |(h_{2}-\rho h_{1})+\rho\lambda|\leq\lambda for (h_{1},h_{2})\in{\cal A}_{1} and \lambda\in(\lambda^{(1)}_{2},\lambda_{1}). The verification for the other cases is similar and thus omitted. We then assume |h_{2}|>|h_{1}|. By symmetry, we have the same result, except that h_{1} and h_{2} are switched in the expressions of the regions {\cal A} and of (\lambda_{1},\lambda_{2}). This gives the other four cases:

𝒜5={(h1,h2):h2>0,ρh2<h1<h2},𝒜6={(h1,h2):h2>0,h2<h1<ρh2},\displaystyle{\cal A}_{5}=\{(h_{1},h_{2}):h_{2}>0,\;\;\rho h_{2}<h_{1}<h_{2}\},\quad{\cal A}_{6}=\{(h_{1},h_{2}):h_{2}>0,\;\;-h_{2}<h_{1}<\rho h_{2}\},
𝒜7={(h1,h2):h2<0,ρh2<h1<h2},𝒜8={(h1,h2):h2<0,h2<h1<ρh2}.\displaystyle{\cal A}_{7}=\{(h_{1},h_{2}):h_{2}<0,\;\;\rho h_{2}<h_{1}<-h_{2}\},\quad{\cal A}_{8}=\{(h_{1},h_{2}):h_{2}<0,\;\;h_{2}<h_{1}<\rho h_{2}\}.

In these four cases, we similarly have λ1=|h2|\lambda_{1}=|h_{2}| and

λ2(5)=h1ρh21ρ,λ2(6)=ρh2h11+ρ,λ2(7)=h1ρh21+ρ,λ2(8)=ρh2h11ρ.\lambda_{2}^{(5)}=\frac{h_{1}-\rho h_{2}}{1-\rho},\qquad\lambda_{2}^{(6)}=\frac{\rho h_{2}-h_{1}}{1+\rho},\qquad\lambda_{2}^{(7)}=\frac{h_{1}-\rho h_{2}}{1+\rho},\qquad\lambda_{2}^{(8)}=\frac{\rho h_{2}-h_{1}}{1-\rho}. (B.7)

These eight regions are shown in Figure 12.

Figure 12: The rejection region of least-squares (left) and Lasso-path (right). On the right panel, the regions {\cal A}_{1}-{\cal A}_{8} are the same as those defined in the proof. In the regions {\cal A}_{1}-{\cal A}_{4}, W_{j}^{*}=|h_{1}|, and the rejection region is colored yellow. In the regions {\cal A}_{5} and {\cal A}_{8}, W_{j}^{*}=|h_{1}-\rho h_{2}|/(1-\rho), and the rejection region is colored purple. In the regions {\cal A}_{6} and {\cal A}_{7}, W_{j}^{*}=|h_{1}-\rho h_{2}|/(1+\rho), and the rejection region is colored green.

We then compute WjW_{j}^{*} and the associated rejection region. Note that Wj=λ1W_{j}^{*}=\lambda_{1} in Case 1-Case 4, and Wj=λ2W_{j}^{*}=\lambda_{2} in Case 5-Case 8. It follows directly that

Wj={|h1|,if (h1,h2)𝒜1𝒜2𝒜3𝒜4,|h1ρh2|/(1ρ),if (h1,h2)𝒜5𝒜8,|h1ρh2|/(1+ρ),if (h1,h2)𝒜6𝒜7.W_{j}^{*}=\begin{cases}|h_{1}|,&\mbox{if }(h_{1},h_{2})\in{\cal A}_{1}\cup{\cal A}_{2}\cup{\cal A}_{3}\cup{\cal A}_{4},\\ |h_{1}-\rho h_{2}|/(1-\rho),&\mbox{if }(h_{1},h_{2})\in{\cal A}_{5}\cup{\cal A}_{8},\\ |h_{1}-\rho h_{2}|/(1+\rho),&\mbox{if }(h_{1},h_{2})\in{\cal A}_{6}\cup{\cal A}_{7}.\end{cases} (B.8)

As a result, W_{j}^{*}>\sqrt{2u\log(p)} if and only if the vector (x_{j}^{T}y,x_{j+1}^{T}y)/\sqrt{2\log(p)} is in the following set:

{\cal R} =\bigl[({\cal A}_{1}\cup{\cal A}_{2}\cup{\cal A}_{3}\cup{\cal A}_{4})\cap\{|h_{1}|>\sqrt{u}\}\bigr]
\qquad\cup\bigl[({\cal A}_{5}\cup{\cal A}_{8})\cap\{|h_{1}-\rho h_{2}|>(1-\rho)\sqrt{u}\}\bigr]
\qquad\cup\bigl[({\cal A}_{6}\cup{\cal A}_{7})\cap\{|h_{1}-\rho h_{2}|>(1+\rho)\sqrt{u}\}\bigr].

In Figure 12, these 3 subsets are colored yellow, purple, and green, respectively. This gives the rejection region for the Lasso-path.
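
As a sanity check of (B.8), one can compare the closed-form W_{j}^{*} with the entry value computed numerically from the Lasso path. The sketch below is our own verification code (with arbitrary test values, assuming \rho\geq 0): it finds W_{j}^{*}=\sup\{\lambda:\hat{b}_{1}(\lambda)\neq 0\} by bisection over a coordinate-descent solver of (B.2).

```python
import numpy as np

def bivariate_lasso(h, rho, lam, n_iter=500):
    """Coordinate descent for -h'b + b'Bb/2 + lam*||b||_1, with B = [[1,rho],[rho,1]]."""
    b = np.zeros(2)
    soft = lambda z, t: np.sign(z) * max(abs(z) - t, 0.0)
    for _ in range(n_iter):
        b[0] = soft(h[0] - rho * b[1], lam)
        b[1] = soft(h[1] - rho * b[0], lam)
    return b

def W_star_numeric(h, rho):
    # bisection for the largest lambda at which variable 1 is active
    lo, hi = 0.0, 3.0 * max(abs(h[0]), abs(h[1])) + 1.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if abs(bivariate_lasso(h, rho, mid)[0]) > 1e-10:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def W_star_formula(h, rho):                    # the piecewise formula (B.8), rho >= 0
    h1, h2 = h
    if abs(h1) >= abs(h2):                     # regions A1-A4
        return abs(h1)
    if (h1 - rho * h2) * h2 > 0:               # regions A5, A8
        return abs(h1 - rho * h2) / (1 - rho)
    return abs(h1 - rho * h2) / (1 + rho)      # regions A6, A7

rng = np.random.default_rng(1)
rho = 0.5
for _ in range(200):
    h = rng.normal(size=2)
    assert abs(W_star_numeric(h, rho) - W_star_formula(h, rho)) < 1e-4
```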

Appendix C Proof of Theorem 4.1

By definition of (FPp,FNp)(\mathrm{FP}_{p},\mathrm{FN}_{p}) and the Rare/Weak signal model (3.2)-(3.3), we have

FPp=j=1p(1ϵp)(Wj>tp(u)|βj=0),FNp=j=1pϵp(Wj<tp(u)|βj=τp),\mathrm{FP}_{p}=\sum_{j=1}^{p}(1-\epsilon_{p})\mathbb{P}(W_{j}>t_{p}(u)|\beta_{j}=0),\quad\mathrm{FN}_{p}=\sum_{j=1}^{p}\epsilon_{p}\ \mathbb{P}(W_{j}<t_{p}(u)|\beta_{j}=\tau_{p}), (C.1)

where ϵp=pϑ\epsilon_{p}=p^{-\vartheta}, τp=2rlog(p)\tau_{p}=\sqrt{2r\log(p)}, and tp(u)=2ulog(p)t_{p}(u)=\sqrt{2u\log(p)}. Therefore, it suffices to study (Wj>tp(u)|βj=0)\mathbb{P}(W_{j}>t_{p}(u)|\beta_{j}=0) and (Wj<tp(u)|βj=τp)\mathbb{P}(W_{j}<t_{p}(u)|\beta_{j}=\tau_{p}).

Fix 1\leq j\leq p. The knockoff filter applies Lasso to the design matrix [X,\tilde{X}]. This design belongs to the block-wise diagonal designs in (5.1), with dimension 2p and \rho=a; variable j and its own knockoff fall in the same block. Write

h1=xjy/2log(p),andh2=x~jy/2log(p).h_{1}=x_{j}^{\prime}y/\sqrt{2\log(p)},\qquad\mbox{and}\qquad h_{2}=\tilde{x}_{j}^{\prime}y/\sqrt{2\log(p)}. (C.2)

It is easy to see that (xjy,x~jy)(x_{j}^{\prime}y,\tilde{x}_{j}^{\prime}y)^{\prime} follows a distribution 𝒩2(𝟎2,Σ){\cal N}_{2}({\bf 0}_{2},\Sigma) when βj=0\beta_{j}=0, and it follows a distribution 𝒩2(μ2log(p),Σ){\cal N}_{2}(\mu\sqrt{2\log(p)},\;\Sigma), when βj=τp\beta_{j}=\tau_{p}, where

μ=[rar],Σ=[1aa1].\mu=\begin{bmatrix}\sqrt{r}\\ a\sqrt{r}\end{bmatrix},\qquad\Sigma=\begin{bmatrix}1\;\;\;&a\\ a\;\;\;&1\end{bmatrix}.

Let {\cal R} be the region of (h1,h2)(h_{1},h_{2}) corresponding to the event that {Wj>tp(u)}\{W_{j}>t_{p}(u)\}. It follows from Lemma 7.1 that

(Wj>tp(u)|βj=0)\displaystyle\mathbb{P}(W_{j}>t_{p}(u)|\beta_{j}=0) =Lppinfh{hΣ1h},\displaystyle=L_{p}p^{-\inf_{h\in{\cal R}}\{h^{\prime}\Sigma^{-1}h\}}, (C.3)
(Wj<tp(u)|βj=τp)\displaystyle\mathbb{P}(W_{j}<t_{p}(u)|\beta_{j}=\tau_{p}) =Lppinfhc{(hμ)Σ1(hμ)}.\displaystyle=L_{p}p^{-\inf_{h\in{\cal R}^{c}}\{(h-\mu)^{\prime}\Sigma^{-1}(h-\mu)\}}. (C.4)

Below, we first derive the rejection region {\cal R}, and then compute the exponents in (C.3).

Recall that Z_{j} and \tilde{Z}_{j} are the same as in (4.3). They are indeed the values of \lambda at which variable j and its knockoff enter the solution path of a bivariate Lasso as in (B.1). We can apply the solution path derived in the proof of Lemma 7.2, with \rho=a. Before we proceed, we argue that it suffices to consider the case of a\geq 0. If a<0, we can simultaneously flip the signs of a and h_{2}, so that the objective (B.1) remains unchanged; as a result, the values of (Z_{j},\tilde{Z}_{j}) remain unchanged, and so does the symmetric statistic W_{j}. It implies that, if we flip the sign of a, the rejection region is reflected with respect to the x-axis. At the same time, in light of the exponents in (C.3)-(C.4), we consider two ellipsoids

{\cal E}_{\mathrm{FP}}(t)=\{h\in\mathbb{R}^{2}:h^{\prime}\Sigma^{-1}h\leq t\},\qquad{\cal E}_{\mathrm{FN}}(t)=\{h\in\mathbb{R}^{2}:(h-\mu)^{\prime}\Sigma^{-1}(h-\mu)\leq t\}. (C.5)

Similarly, if we simultaneously flip the signs of aa and h2h_{2}, these ellipsoids remain unchanged. It implies that, if we flip the sign of aa, these ellipsoids are reflected with respect to the x-axis. Combining the above observations, we know that the exponents in (C.3) are unchanged with a sign flip of aa, i.e., they only depend on |a||a|. We assume a0a\geq 0 without loss of generality.

Fix a0a\geq 0. Write z=Zj/2log(p)z=Z_{j}/\sqrt{2\log(p)} and z~=Z~j/2log(p)\tilde{z}=\tilde{Z}_{j}/\sqrt{2\log(p)}. The symmetric statistics in (4.3) can be re-written as

Wjsgm=(zz~)2log(p){+1,if z>z~1,if zz~,Wjdif=(zz~)2log(p).W_{j}^{\mathrm{sgm}}=(z\vee\tilde{z})\sqrt{2\log(p)}\cdot\begin{cases}+1,&\text{if }z>\tilde{z}\\ -1,&\text{if }z\leq\tilde{z}\end{cases},\qquad W_{j}^{\mathrm{dif}}=(z-\tilde{z})\sqrt{2\log(p)}.
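
In code, the two symmetric statistics are a direct transcription of the display above (operating on the unnormalized entry values Z_{j} and \tilde{Z}_{j}):

```python
# Signed-maximum and difference statistics for a variable and its knockoff.
def w_signed_max(Z, Z_tilde):
    return max(Z, Z_tilde) * (1.0 if Z > Z_tilde else -1.0)

def w_difference(Z, Z_tilde):
    return Z - Z_tilde
```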

Recall that h1h_{1} and h2h_{2} are as in (C.2). Let λ1>λ2>0\lambda_{1}>\lambda_{2}>0 be the values of λ\lambda at which variables enter the solution path of a bivariate lasso. In the proof of Lemma 7.2, we have derived the formula of (λ1,λ2)(\lambda_{1},\lambda_{2}); see (B.6) and (B.7) (with ρ\rho replaced by aa). It follows that

(z,z~)={(λ1,λ2),in the regions 𝒜1-𝒜4,(λ2,λ1),in the regions 𝒜5-𝒜8,(z,\tilde{z})=\begin{cases}(\lambda_{1},\lambda_{2}),&\mbox{in the regions ${\cal A}_{1}$-${\cal A}_{4}$},\\ (\lambda_{2},\lambda_{1}),&\mbox{in the regions ${\cal A}_{5}$-${\cal A}_{8}$},\end{cases}

where regions 𝒜1{\cal A}_{1}-𝒜8{\cal A}_{8} are the same as those on the right panel of Figure 12 (with ρ\rho replaced by aa). Plugging in (B.6) and (B.7) gives the following results:

  • Region {\cal A}_{1}:  z=h_{1},   \tilde{z}=\frac{h_{2}-ah_{1}}{1-a},   W_{j}^{\mathrm{sgm}}=h_{1}\sqrt{2\log(p)},   W_{j}^{\mathrm{dif}}=\frac{h_{1}-h_{2}}{1-a}\sqrt{2\log(p)}.

  • Region {\cal A}_{2}:  z=h_{1},   \tilde{z}=\frac{ah_{1}-h_{2}}{1+a},   W_{j}^{\mathrm{sgm}}=h_{1}\sqrt{2\log(p)},   W_{j}^{\mathrm{dif}}=\frac{h_{1}+h_{2}}{1+a}\sqrt{2\log(p)}.

  • Region {\cal A}_{3}:  z=-h_{1},   \tilde{z}=\frac{h_{2}-ah_{1}}{1+a},   W_{j}^{\mathrm{sgm}}=-h_{1}\sqrt{2\log(p)},   W_{j}^{\mathrm{dif}}=-\frac{h_{1}+h_{2}}{1+a}\sqrt{2\log(p)}.

  • Region {\cal A}_{4}:  z=-h_{1},   \tilde{z}=\frac{ah_{1}-h_{2}}{1-a},   W_{j}^{\mathrm{sgm}}=-h_{1}\sqrt{2\log(p)},   W_{j}^{\mathrm{dif}}=\frac{h_{2}-h_{1}}{1-a}\sqrt{2\log(p)}.

  • Regions {\cal A}_{5}-{\cal A}_{8}:  |Z_{j}|<|\tilde{Z}_{j}|,   W_{j}^{\mathrm{sgm}}<0,   W_{j}^{\mathrm{dif}}<0.

The event W_{j}^{\mathrm{sgm}}>\sqrt{2u\log(p)} corresponds to (h_{1},h_{2}) falling in the region

usgm\displaystyle{\cal R}_{u}^{\mathrm{sgm}} =(𝒜1𝒜2𝒜3𝒜4){|h1|>u}\displaystyle=({\cal A}_{1}\cup{\cal A}_{2}\cup{\cal A}_{3}\cup{\cal A}_{4})\cap\{|h_{1}|>\sqrt{u}\} (C.6)
={|h1|>|h2|,|h1|>u}.\displaystyle=\{|h_{1}|>|h_{2}|,\;|h_{1}|>\sqrt{u}\}. (C.7)

The event W_{j}^{\mathrm{dif}}>\sqrt{2u\log(p)} corresponds to (h_{1},h_{2}) falling in the region

udif\displaystyle{\cal R}_{u}^{\mathrm{dif}} =(𝒜1{h1h2>(1a)u})(𝒜2{h1+h2>(1+a)u})\displaystyle=\bigl{(}{\cal A}_{1}\cap\{h_{1}-h_{2}>(1-a)\sqrt{u}\}\bigr{)}\cup\bigl{(}{\cal A}_{2}\cap\{h_{1}+h_{2}>(1+a)\sqrt{u}\}\bigr{)} (C.8)
(𝒜3{h1+h2<(1+a)u})(𝒜4{h1h2<(1a)u}).\displaystyle\qquad\cup\bigl{(}{\cal A}_{3}\cap\{h_{1}+h_{2}<-(1+a)\sqrt{u}\}\bigr{)}\cup\bigl{(}{\cal A}_{4}\cap\{h_{1}-h_{2}<-(1-a)\sqrt{u}\}\bigr{)}. (C.9)

These two regions are shown in Figure 13.

Figure 13: The rejection region of knockoff in the orthogonal design, where the symmetric statistic is signed maximum (left) and difference (right). The rate of convergence of FPp\mathrm{FP}_{p} is captured by an ellipsoid centered at (0,0)(0,0), and the rate of convergence of FNp\mathrm{FN}_{p} is captured by an ellipsoid centered at (r,ar)(\sqrt{r},a\sqrt{r}).

We are now ready to compute the exponents in (C.3). First, we compute infh{hΣ1h}\inf_{h\in{\cal R}}\{h^{\prime}\Sigma^{-1}h\}. Let FP(t){\cal E}_{\mathrm{FP}}(t) be the same as in (C.5). Then,

\inf_{h\in{\cal R}}\{h^{\prime}\Sigma^{-1}h\}=\sup\bigl\{t>0:{\cal E}_{\mathrm{FP}}(t)\cap{\cal R}=\emptyset\bigr\}.

When the rejection region is {\cal R}_{u}^{\mathrm{sgm}}, from Figure 13, we can increase t until {\cal E}_{\mathrm{FP}}(t) touches one of the lines h_{1}=\pm\sqrt{u}. For any h on the surface of this ellipsoid, the perpendicular vector of its tangent plane is proportional to \Sigma^{-1}h. When the ellipsoid touches the line h_{1}=\pm\sqrt{u}, this perpendicular vector should be proportional to (1,0)^{\prime}. Therefore, we need to find h such that

h1=±u,hΣ1h=t,andΣ1h(1,0).h_{1}=\pm\sqrt{u},\quad h^{\prime}\Sigma^{-1}h=t,\quad\mbox{and}\quad\Sigma^{-1}h\propto(1,0)^{\prime}.

The third equation requires that h_{2}=ah_{1}. Combining it with the first equation gives h=(\pm\sqrt{u},\pm a\sqrt{u}). We then plug it into the second equation to obtain t=u. This gives

infhusgm{hΣ1h}=u.\inf_{h\in{\cal R}_{u}^{\mathrm{sgm}}}\{h^{\prime}\Sigma^{-1}h\}=u. (C.10)

When the rejection region is {\cal R}_{u}^{\mathrm{dif}}, there are 3 possible cases:

  • (i) the ellipsoid intersects with the line h_{1}-h_{2}=(1-a)\sqrt{u};

  • (ii) the ellipsoid intersects with the line h_{1}+h_{2}=(1+a)\sqrt{u};

  • (iii) the ellipsoid intersects with the point h=(\sqrt{u},a\sqrt{u}).

In Case (i), we can compute the intersection point by solving h from h_{1}-h_{2}=(1-a)\sqrt{u} and \Sigma^{-1}h\propto(1,-1)^{\prime}. The second relationship gives h_{2}=-h_{1}. Together with the first relationship, we have h=(\frac{1-a}{2}\sqrt{u},\,-\frac{1-a}{2}\sqrt{u}). It is not in {\cal R}_{u}^{\mathrm{dif}}. Similarly, for Case (ii), we can show that the intersection point is h=(\frac{1+a}{2}\sqrt{u},\frac{1+a}{2}\sqrt{u}), which is not in {\cal R}_{u}^{\mathrm{dif}} either. The only possible case is Case (iii), where the intersection point is (\sqrt{u},a\sqrt{u}) and the associated t=h^{\prime}\Sigma^{-1}h=u. We have proved that

infhudif{hΣ1h}=u.\inf_{h\in{\cal R}_{u}^{\mathrm{dif}}}\{h^{\prime}\Sigma^{-1}h\}=u. (C.11)

Next, we compute infhc{(hμ)Σ1(hμ)}\inf_{h\in{\cal R}^{c}}\{(h-\mu)^{\prime}\Sigma^{-1}(h-\mu)\}. Let FN(t){\cal E}_{\mathrm{FN}}(t) be the same as in (C.5). Then,

\inf_{h\in{\cal R}^{c}}\{(h-\mu)^{\prime}\Sigma^{-1}(h-\mu)\}=\sup\bigl\{t>0:{\cal E}_{\mathrm{FN}}(t)\cap{\cal R}^{c}=\emptyset\bigr\}.

Note that the center of the ellipsoid is \mu=(\sqrt{r},a\sqrt{r}). When either {\cal R}={\cal R}_{u}^{\mathrm{sgm}} or {\cal R}={\cal R}_{u}^{\mathrm{dif}}, \mu\notin{\cal R}^{c} if and only if r>u; in other words, the above supremum is nontrivial only when r>u. We now fix r>u. When the rejection region is {\cal R}_{u}^{\mathrm{sgm}}, the ellipsoid intersects with either the line h_{1}=\sqrt{u} or the line h_{1}=h_{2}. Since the perpendicular vector of the tangent plane of the ellipsoid at h is proportional to \Sigma^{-1}(h-\mu), we can solve the intersection points from

{h1=u,Σ1(hμ)(1,0),and{h1=h2,Σ1(hμ)(1,1).\begin{cases}h_{1}=\sqrt{u},\\ \Sigma^{-1}(h-\mu)\propto(1,0)^{\prime},\end{cases}\qquad\mbox{and}\qquad\begin{cases}h_{1}=h_{2},\\ \Sigma^{-1}(h-\mu)\propto(1,-1)^{\prime}.\end{cases}

By direct calculations, the two intersection points are h=(\sqrt{u},\;a\sqrt{u}) and h=(\frac{1+a}{2}\sqrt{r},\;\frac{1+a}{2}\sqrt{r}). The associated values of (h-\mu)^{\prime}\Sigma^{-1}(h-\mu) are t=(\sqrt{r}-\sqrt{u})^{2} and t=(1-a)r/2, respectively. When we increase the ellipsoid until it intersects with ({\cal R}_{u}^{\mathrm{sgm}})^{c}, the corresponding t is the smaller of the above two values. This gives

infh(usgm)c{(hμ)Σ1(hμ)}=min{(ru)+2,1a2r}.\inf_{h\in({\cal R}_{u}^{\mathrm{sgm}})^{c}}\{(h-\mu)^{\prime}\Sigma^{-1}(h-\mu)\}=\min\Bigl{\{}(\sqrt{r}-\sqrt{u})_{+}^{2},\;\frac{1-a}{2}r\Bigr{\}}. (C.12)

When the rejection region is udif{\cal R}_{u}^{\mathrm{dif}}, the ellipsoid intersects with either the line of h1h2=(1a)uh_{1}-h_{2}=(1-a)\sqrt{u} or the line of h1+h2=(1+a)uh_{1}+h_{2}=(1+a)\sqrt{u}. We can solve the intersection points from

{h1h2=(1a)u,Σ1(hμ)(1,1),and{h1+h2=(1+a)u,Σ1(hμ)(1,1).\begin{cases}h_{1}-h_{2}=(1-a)\sqrt{u},\\ \Sigma^{-1}(h-\mu)\propto(1,-1)^{\prime},\end{cases}\qquad\mbox{and}\qquad\begin{cases}h_{1}+h_{2}=(1+a)\sqrt{u},\\ \Sigma^{-1}(h-\mu)\propto(1,1)^{\prime}.\end{cases}

Solving these equations gives the two intersection points: h=(\frac{1+a}{2}\sqrt{r}+\frac{1-a}{2}\sqrt{u},\;\frac{1+a}{2}\sqrt{r}-\frac{1-a}{2}\sqrt{u}) and h=(\frac{1-a}{2}\sqrt{r}+\frac{1+a}{2}\sqrt{u},\;-\frac{1-a}{2}\sqrt{r}+\frac{1+a}{2}\sqrt{u}). The corresponding values of (h-\mu)^{\prime}\Sigma^{-1}(h-\mu) are t=\frac{1-a}{2}(\sqrt{r}-\sqrt{u})^{2} and t=\frac{1+a}{2}(\sqrt{r}-\sqrt{u})^{2}, respectively. The smaller of these two values is \frac{1-a}{2}(\sqrt{r}-\sqrt{u})^{2}. We have proved that

infh(udif)c{(hμ)Σ1(hμ)}=1a2(ru)+2.\inf_{h\in({\cal R}_{u}^{\mathrm{dif}})^{c}}\{(h-\mu)^{\prime}\Sigma^{-1}(h-\mu)\}=\frac{1-a}{2}(\sqrt{r}-\sqrt{u})^{2}_{+}. (C.13)

We plug (C.10)-(C.13) into (C.3)-(C.4), and further plug these into (C.1). This gives the claim for a\geq 0. As we have argued, the results for a<0 only require replacing a by |a|.
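
As an independent check of (C.10)-(C.13), the geometric argument can be replaced by a brute-force minimization of the two quadratic forms over a fine grid. The sketch below is our own code; the values of a, u, r and the grid resolution are arbitrary, and the printed pairs should agree up to grid error.

```python
import numpy as np

a, u, r = 0.3, 0.8, 2.0                          # illustrative values with r > u
Sigma_inv = np.linalg.inv(np.array([[1.0, a], [a, 1.0]]))
mu = np.array([np.sqrt(r), a * np.sqrt(r)])

g = np.linspace(-4, 4, 801)
H1, H2 = np.meshgrid(g, g, indexing="ij")
A1 = (H1 > 0) & (a * H1 < H2) & (H2 < H1)
A2 = (H1 > 0) & (-H1 < H2) & (H2 < a * H1)
A3 = (H1 < 0) & (a * H1 < H2) & (H2 < -H1)
A4 = (H1 < 0) & (H1 < H2) & (H2 < a * H1)
R_sgm = (A1 | A2 | A3 | A4) & (np.abs(H1) > np.sqrt(u))           # (C.6)
R_dif = ((A1 & (H1 - H2 > (1 - a) * np.sqrt(u))) |                 # (C.8)
         (A2 & (H1 + H2 > (1 + a) * np.sqrt(u))) |
         (A3 & (H1 + H2 < -(1 + a) * np.sqrt(u))) |
         (A4 & (H1 - H2 < -(1 - a) * np.sqrt(u))))

def quad(D1, D2):                                 # (d1,d2)' Sigma^{-1} (d1,d2)
    return (Sigma_inv[0, 0] * D1**2 + 2 * Sigma_inv[0, 1] * D1 * D2
            + Sigma_inv[1, 1] * D2**2)

q0, q1 = quad(H1, H2), quad(H1 - mu[0], H2 - mu[1])
print(q0[R_sgm].min(), "~", u)                                              # (C.10)
print(q0[R_dif].min(), "~", u)                                              # (C.11)
print(q1[~R_sgm].min(), "~", min((np.sqrt(r)-np.sqrt(u))**2, (1-a)/2*r))    # (C.12)
print(q1[~R_dif].min(), "~", (1-a)/2*(np.sqrt(r)-np.sqrt(u))**2)            # (C.13)
```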

Appendix D Proof of Theorem 5.1

Without loss of generality, we assume p is even. Then, for block-wise diagonal designs as in (5.1), the Lasso objective is separable. Therefore, each W_{j}^{*} is not affected by any \beta_{k} outside its own block. Additionally, by symmetry, the distribution of W_{j}^{*} is the same for all 1\leq j\leq p. It follows that

FPp(u)\displaystyle\mathrm{FP}_{p}(u) =Lpp{Wj>tp(u)|(βj,βj+1)=(0,0)}\displaystyle=L_{p}p\cdot\mathbb{P}\bigl{\{}W^{*}_{j}>t_{p}(u)\,\big{|}\,(\beta_{j},\beta_{j+1})=(0,0)\bigr{\}} (D.1)
+Lpp1ϑ{Wj>tp(u)|(βj,βj+1)=(0,τp)},\displaystyle\qquad+L_{p}p^{1-\vartheta}\cdot\mathbb{P}\bigl{\{}W^{*}_{j}>t_{p}(u)\,\big{|}\,(\beta_{j},\beta_{j+1})=(0,\tau_{p})\bigr{\}}, (D.2)

where j is an arbitrary odd index. Similarly, we can derive that

FNp(u)\displaystyle\mathrm{FN}_{p}(u) =Lpp1ϑ{Wj<tp(u)|(βj,βj+1)=(τp,0)}\displaystyle=L_{p}p^{1-\vartheta}\cdot\mathbb{P}\bigl{\{}W^{*}_{j}<t_{p}(u)\,\big{|}\,(\beta_{j},\beta_{j+1})=(\tau_{p},0)\bigr{\}} (D.3)
+Lpp12ϑ{Wj<tp(u)|(βj,βj+1)=(τp,τp)}.\displaystyle\qquad+L_{p}p^{1-2\vartheta}\cdot\mathbb{P}\bigl{\{}W^{*}_{j}<t_{p}(u)\,\big{|}\,(\beta_{j},\beta_{j+1})=(\tau_{p},\tau_{p})\bigr{\}}. (D.4)

Fix variables \{j,j+1\}, and consider the random vector \hat{h}=(x_{j}^{\prime}y,x_{j+1}^{\prime}y)^{\prime}/\sqrt{2\log(p)}. Then,

\hat{h}\sim{\cal N}_{2}\Bigl(\mu,\;\frac{1}{2\log(p)}\Sigma\Bigr),\qquad\mbox{where}\quad\Sigma=\begin{bmatrix}1&\rho\\ \rho&1\end{bmatrix}.

The vector μ\mu is equal to

\mu^{(1)}\equiv\begin{bmatrix}0\\ 0\end{bmatrix},\quad\mu^{(2)}\equiv\begin{bmatrix}\rho\sqrt{r}\\ \sqrt{r}\end{bmatrix},\quad\mu^{(3)}\equiv\begin{bmatrix}\sqrt{r}\\ \rho\sqrt{r}\end{bmatrix},\quad\mu^{(4)}\equiv\begin{bmatrix}(1+\rho)\sqrt{r}\\ (1+\rho)\sqrt{r}\end{bmatrix}, (D.5)

in the four cases where (βj,βj+1)(\beta_{j},\beta_{j+1})^{\prime} is (0,0)(0,0)^{\prime}, (0,τp)(0,\tau_{p})^{\prime}, (τp,0)(\tau_{p},0)^{\prime}, and (τp,τp)(\tau_{p},\tau_{p})^{\prime}, respectively. Let u{\cal R}_{u} be the rejection region induced by Lasso-path, given explicitly in Lemma 7.2. By Lemma 7.1, the probabilities in (D.1) and (D.3) are related to the following quantities:

\alpha_{k}=\begin{cases}\inf_{h\in{\cal R}_{u}}\{(h-\mu^{(k)})^{\prime}\Sigma^{-1}(h-\mu^{(k)})\},&k=1,2,\cr\inf_{h\in{\cal R}^{c}_{u}}\{(h-\mu^{(k)})^{\prime}\Sigma^{-1}(h-\mu^{(k)})\},&k=3,4.\end{cases}

Plugging these quantities into (D.1) and (D.3) gives

FPp(u)=Lpp1min{α1,ϑ+α2},FNp(u)=Lpp1min{ϑ+α3, 2ϑ+α4}.\mathrm{FP}_{p}(u)=L_{p}p^{1-\min\{\alpha_{1},\;\vartheta+\alpha_{2}\}},\quad\mathrm{FN}_{p}(u)=L_{p}p^{1-\min\{\vartheta+\alpha_{3},\;2\vartheta+\alpha_{4}\}}. (D.6)

It remains to compute the exponents α1\alpha_{1}-α4\alpha_{4}.

First, we consider the case that ρ0\rho\geq 0. The rejection region in Figure 12 is defined by the following lines:

  • Line 1: h1ρh2=(1ρ)uh_{1}-\rho h_{2}=(1-\rho)\sqrt{u}.

  • Line 2: h1=uh_{1}=\sqrt{u}.

  • Line 3: h1ρh2=(1+ρ)uh_{1}-\rho h_{2}=(1+\rho)\sqrt{u}.

  • Line 4: h1ρh2=(1ρ)uh_{1}-\rho h_{2}=-(1-\rho)\sqrt{u}.

  • Line 5: h1=uh_{1}=-\sqrt{u}.

  • Line 6: h1ρh2=(1+ρ)uh_{1}-\rho h_{2}=-(1+\rho)\sqrt{u}.

Consider a general ellipsoid:

{\cal E}(t;\mu)=\{h\in\mathbb{R}^{2}:(h-\mu)^{\prime}\Sigma^{-1}(h-\mu)\leq t\}.

Given any line h1+bh2=ch_{1}+bh_{2}=c, as tt increases, this ellipsoid eventually intersects with this line. The intersection point is computed by the following equations:

h1+bh2=c,Σ1(hμ)(1,b).h_{1}+bh_{2}=c,\qquad\Sigma^{-1}(h-\mu)\propto(1,b)^{\prime}.

The second equation (which is indeed a linear equation in h) says that the perpendicular vector of the tangent plane is orthogonal to the line. Solving the above equations gives the intersection point and the value of t: as long as b^{2}\neq 1, we have

h=μ+c(μ1+bμ2)1+b2+2bρ[1+bρb+ρ],t=[c(μ1+bμ2)]21+b2+2bρ.h^{*}=\mu+\frac{c-(\mu_{1}+b\mu_{2})}{1+b^{2}+2b\rho}\begin{bmatrix}1+b\rho\\ b+\rho\end{bmatrix},\qquad t^{*}=\frac{[c-(\mu_{1}+b\mu_{2})]^{2}}{1+b^{2}+2b\rho}. (D.7)
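
Formula (D.7) is easy to verify numerically by minimizing the quadratic form along the line directly. The following is a short sketch of ours; the particular values of \rho, \mu, b, c are arbitrary.

```python
import numpy as np

rho = 0.4
Sigma = np.array([[1.0, rho], [rho, 1.0]])
mu = np.array([0.7, -0.2])                       # arbitrary center
b, c = -rho, 1.3                                 # the line h1 + b*h2 = c

k = (c - (mu[0] + b * mu[1])) / (1 + b**2 + 2 * b * rho)
h_star = mu + k * np.array([1 + b * rho, b + rho])             # (D.7)
t_star = (c - (mu[0] + b * mu[1]))**2 / (1 + b**2 + 2 * b * rho)

# direct minimization of (h-mu)' Sigma^{-1} (h-mu) along the line:
Sigma_inv = np.linalg.inv(Sigma)
h2 = np.linspace(-20, 20, 400001)
d1, d2 = (c - b * h2) - mu[0], h2 - mu[1]
vals = (Sigma_inv[0, 0] * d1**2 + 2 * Sigma_inv[0, 1] * d1 * d2
        + Sigma_inv[1, 1] * d2**2)
print(t_star, "~", vals.min())                   # should agree up to grid error
print(h_star, "~", np.array([c - b * h2[vals.argmin()], h2[vals.argmin()]]))
```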

Using the expressions of lines 1-6, we can obtain the corresponding tt^{*} for 6 lines:

t1=[(1ρ)u(μ1ρμ2)]21ρ2,t2=(uμ1)2,t3=[(1+ρ)u(μ1ρμ2)]21ρ2,\displaystyle t^{*}_{1}=\frac{[(1-\rho)\sqrt{u}-(\mu_{1}-\rho\mu_{2})]^{2}}{1-\rho^{2}},\qquad t^{*}_{2}=(\sqrt{u}-\mu_{1})^{2},\qquad t_{3}^{*}=\frac{[(1+\rho)\sqrt{u}-(\mu_{1}-\rho\mu_{2})]^{2}}{1-\rho^{2}},
t4=[(1ρ)u+(μ1ρμ2)]21ρ2,t5=(u+μ1)2,t6=[(1+ρ)u+(μ1ρμ2)]21ρ2.\displaystyle t_{4}^{*}=\frac{[(1-\rho)\sqrt{u}+(\mu_{1}-\rho\mu_{2})]^{2}}{1-\rho^{2}},\qquad t_{5}^{*}=(\sqrt{u}+\mu_{1})^{2},\qquad t_{6}^{*}=\frac{[(1+\rho)\sqrt{u}+(\mu_{1}-\rho\mu_{2})]^{2}}{1-\rho^{2}}.

We first look at the ellipsoid (t;μ(1)){\cal E}(t;\mu^{(1)}) and study when it intersects with u{\cal R}_{u}. Note that μ(1)=(0,0)\mu^{(1)}=(0,0)^{\prime}. The above tt^{*} values become

t_{2}^{*}=t_{5}^{*}=u,\qquad t_{1}^{*}=t_{4}^{*}=\frac{1-\rho}{1+\rho}u,\qquad t_{3}^{*}=t_{6}^{*}=\frac{1+\rho}{1-\rho}u.

Therefore, as we increase t, this ellipsoid first intersects with line 1 and line 4. For line 1, the intersection point is ((1-\rho)\sqrt{u},0)^{\prime}, but it is outside the rejection region (see Figure 12); the situation for line 4 is similar. We then further increase t, and the ellipsoid intersects with line 2 and line 5, where the intersection points are (\pm\sqrt{u},\pm\rho\sqrt{u})^{\prime}; these points are indeed on the boundary of the rejection region. We thus conclude that

infhu{(hμ(1))Σ1(hμ(1))}=u.\inf_{h\in{\cal R}_{u}}\{(h-\mu^{(1)})^{\prime}\Sigma^{-1}(h-\mu^{(1)})\}=u. (D.8)

We then look at the ellipsoid {\cal E}(t;\mu^{(2)}), with \mu^{(2)}=(\rho\sqrt{r},\sqrt{r})^{\prime}. The t^{*} values for the 6 lines are:

t1=t4=1ρ1+ρu,t2=(uρr)2,t3=t6=1+ρ1ρu,t5=(u+ρr)2.t_{1}^{*}=t_{4}^{*}=\frac{1-\rho}{1+\rho}u,\qquad t_{2}^{*}=(\sqrt{u}-\rho\sqrt{r})^{2},\qquad t_{3}^{*}=t_{6}^{*}=\frac{1+\rho}{1-\rho}u,\qquad t_{5}^{*}=(\sqrt{u}+\rho\sqrt{r})^{2}.

The smallest tt^{*} is among {t1,t2,t4}\{t_{1}^{*},t_{2}^{*},t_{4}^{*}\}. Since μ(2)\mu^{(2)} is in the positive orthant, the intersection point of the ellipsoid with line 4 must be outside the rejection region, so we further restrict to t1t_{1}^{*} and t2t_{2}^{*}. The ellipsoid intersects with line 1 at (ρr+(1ρ)u,r)(\rho\sqrt{r}+(1-\rho)\sqrt{u},\;\sqrt{r})^{\prime}. This point is on the boundary of u{\cal R}_{u} if and only if its second coordinate is u\geq\sqrt{u} (see Figure 12), i.e., uru\leq r. The ellipsoid intersects with line 2 at (u,ρu+(1ρ2)r)(\sqrt{u},\;\rho\sqrt{u}+(1-\rho^{2})\sqrt{r})^{\prime}. This point is on the boundary of u{\cal R}_{u} if and only if its second coordinate is u\leq\sqrt{u} (see Figure 12), i.e., u(1+ρ)2ru\geq(1+\rho)^{2}r. In the range of r<u<(1+ρ)2rr<u<(1+\rho)^{2}r, the ellipsoid intersects with u{\cal R}_{u} at the corner point (u,u)(\sqrt{u},\sqrt{u})^{\prime}, with the corresponding

t=r+21+ρu2ru={1ρ1+ρu+(ur)2,(uρr)2+1ρ1+ρ(u(1+ρ)r)2.t^{*}=r+\frac{2}{1+\rho}u-2\sqrt{ru}=\begin{cases}\frac{1-\rho}{1+\rho}u+(\sqrt{u}-\sqrt{r})^{2},\cr(\sqrt{u}-\rho\sqrt{r})^{2}+\frac{1-\rho}{1+\rho}\bigl{(}\sqrt{u}-(1+\rho)\sqrt{r}\bigr{)}^{2}.\end{cases}

This tt^{*} has two equivalent expressions. Comparing them with t1t_{1}^{*} and t2t_{2}^{*}, we can see that the smallest tt^{*} is a continuous function of uu, given (ρ,r)(\rho,r). It follows that

infhu{(hμ(2))Σ1(hμ(2))}\displaystyle\inf_{h\in{\cal R}_{u}}\{(h-\mu^{(2)})^{\prime}\Sigma^{-1}(h-\mu^{(2)})\} (D.9)
=\displaystyle=\;\; 1ρ1+ρu+(ur)+21ρ1+ρ(u(1+ρ)r)+2.\displaystyle\frac{1-\rho}{1+\rho}u+(\sqrt{u}-\sqrt{r})_{+}^{2}-\frac{1-\rho}{1+\rho}\bigl{(}\sqrt{u}-(1+\rho)\sqrt{r}\bigr{)}_{+}^{2}. (D.10)

We plug (D.8) and (D.9) into (D.6). It gives the expression of FPp(u)\mathrm{FP}_{p}(u) for ρ0\rho\geq 0.
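
For reference, the resulting FP exponent can be coded directly from (D.6), (D.8) and (D.9). The helper below is our own (with arbitrary parameter values); it also illustrates that the exponent varies continuously across u=r and u=(1+\rho)^{2}r.

```python
import numpy as np

def fp_exponent(u, r, rho, theta):
    """Exponent of p in FP_p(u) = L_p * p^{1 - min(alpha_1, theta + alpha_2)}, rho >= 0."""
    plus = lambda x: max(x, 0.0)
    alpha1 = u                                                         # (D.8)
    alpha2 = ((1 - rho) / (1 + rho) * u + plus(np.sqrt(u) - np.sqrt(r))**2
              - (1 - rho) / (1 + rho) * plus(np.sqrt(u) - (1 + rho) * np.sqrt(r))**2)  # (D.9)
    return 1 - min(alpha1, theta + alpha2)

# continuity near u = r and u = (1+rho)^2 r, here with r = 1 and rho = 0.4:
for u in (0.999, 1.001, 1.959, 1.961):
    print(u, fp_exponent(u, r=1.0, rho=0.4, theta=0.5))
```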

We then look at the ellipsoid {\cal E}(t;\mu^{(3)}), with \mu^{(3)}=(\sqrt{r},\rho\sqrt{r})^{\prime}. Note that we now investigate its distance to the complement of {\cal R}_{u}. In order for \mu^{(3)} to be outside {\cal R}_{u}^{c} (i.e., in the interior of {\cal R}_{u}), we require that u<r; furthermore, when u<r, the ellipsoid can only intersect with lines 1-2 (see Figure 12). Using the formula of t^{*} in the equation below (D.7), we have

t1=1ρ1+ρ((1+ρ)ru)2,t2=(ru)2.t_{1}^{*}=\frac{1-\rho}{1+\rho}\bigl{(}(1+\rho)\sqrt{r}-\sqrt{u}\bigr{)}^{2},\qquad t_{2}^{*}=(\sqrt{r}-\sqrt{u})^{2}.

By (D.7), the ellipsoid intersects with line 1 at (r(1ρ)[(1+ρ)ru],ρr)\bigl{(}\sqrt{r}-(1-\rho)[(1+\rho)\sqrt{r}-\sqrt{u}],\;\rho\sqrt{r}\bigr{)}^{\prime}. To guarantee that this point is on the boundary of u{\cal R}_{u}, we need its second coordinate to be u\geq\sqrt{u} (see Figure 12), i.e., uρ2ru\leq\rho^{2}r; furthermore, when u>ρ2ru>\rho^{2}r, it can be easily seen from Figure 12 that the ellipsoid must have already crossed line 2. By (D.7) again, the ellipsoid intersects with line 2 at (u,ρu)(\sqrt{u},\rho\sqrt{u})^{\prime}. This point is always on the boundary of u{\cal R}_{u}. It follows that

infhuc{(hμ(3))Σ1(hμ(3))}=min{1ρ1+ρ((1+ρ)ru)2,(ru)+2}.\inf_{h\in{\cal R}^{c}_{u}}\{(h-\mu^{(3)})^{\prime}\Sigma^{-1}(h-\mu^{(3)})\}=\min\Bigl{\{}\frac{1-\rho}{1+\rho}\bigl{(}(1+\rho)\sqrt{r}-\sqrt{u}\bigr{)}^{2},\;\,(\sqrt{r}-\sqrt{u})_{+}^{2}\Bigr{\}}. (D.11)

We then look at the ellipsoid {\cal E}(t;\mu^{(4)}), with \mu^{(4)}=\bigl((1+\rho)\sqrt{r},(1+\rho)\sqrt{r}\bigr)^{\prime}. It follows from Figure 12 that \mu^{(4)} is in the interior of {\cal R}_{u} if and only if (1+\rho)\sqrt{r}>\sqrt{u}. We restrict to (1+\rho)\sqrt{r}>\sqrt{u}. Then, this ellipsoid can only touch lines 1-2 first. The t^{*} values are

t1=1ρ1+ρ((1+ρ)ru)2,t2=((1+ρ)ru)2.t_{1}^{*}=\frac{1-\rho}{1+\rho}\bigl{(}(1+\rho)\sqrt{r}-\sqrt{u}\bigr{)}^{2},\qquad t_{2}^{*}=\bigl{(}(1+\rho)\sqrt{r}-\sqrt{u}\bigr{)}^{2}.

Since t1<t2t_{1}^{*}<t_{2}^{*}, the ellipsoid touches line 1 first, at the intersection point ((1ρ)u+ρ(1+ρ)r,(1+ρ)r)\bigl{(}(1-\rho)\sqrt{u}+\rho(1+\rho)\sqrt{r},\;(1+\rho)\sqrt{r}\bigr{)}^{\prime}. In order for this point to be on the boundary of u{\cal R}_{u}, we need that its second coordinate is u\geq\sqrt{u}, which translates to u(1+ρ)r\sqrt{u}\leq(1+\rho)\sqrt{r}. This is always true when r>ur>u and ρ>0\rho>0. It follows that

infhuc{(hμ(4))Σ1(hμ(4))}=1ρ1+ρ((1+ρ)ru)+2.\inf_{h\in{\cal R}^{c}_{u}}\{(h-\mu^{(4)})^{\prime}\Sigma^{-1}(h-\mu^{(4)})\}=\frac{1-\rho}{1+\rho}\bigl{(}(1+\rho)\sqrt{r}-\sqrt{u}\bigr{)}_{+}^{2}. (D.12)

We plug (D.11) and (D.12) into (D.6). It gives the expression of FNp(u)\mathrm{FN}_{p}(u) for ρ0\rho\geq 0.

Next, we consider the case that ρ<0\rho<0. By Lemma 7.2, u(ρ){\cal R}_{u}(\rho) is a reflection of u(|ρ|){\cal R}_{u}(|\rho|) with respect to the x-axis. As a result, if we re-define h^=(xjy,xj+1y)/2log(p)\hat{h}=(x_{j}^{\prime}y,\;-x_{j+1}^{\prime}y)/\sqrt{2\log(p)}, then the rejection region becomes u(|ρ|){\cal R}_{u}(|\rho|), which has the same shape as that in Figure 12. At the same time, the distribution of h^\hat{h} becomes

\hat{h}\sim{\cal N}_{2}\Bigl(\mu,\;\frac{1}{2\log(p)}\Sigma\Bigr),\qquad\mbox{where}\quad\Sigma=\begin{bmatrix}1&|\rho|\\ |\rho|&1\end{bmatrix}.

The vector μ\mu is equal to

\mu^{(1)}\equiv\begin{bmatrix}0\\ 0\end{bmatrix},\quad\mu^{(2)}\equiv\begin{bmatrix}-|\rho|\sqrt{r}\\ -\sqrt{r}\end{bmatrix},\quad\mu^{(3)}\equiv\begin{bmatrix}\sqrt{r}\\ |\rho|\sqrt{r}\end{bmatrix},\quad\mu^{(4)}\equiv\begin{bmatrix}(1-|\rho|)\sqrt{r}\\ -(1-|\rho|)\sqrt{r}\end{bmatrix}, (D.13)

when (βj,βj+1)(\beta_{j},\beta_{j+1})^{\prime} is (0,0)(0,0)^{\prime}, (0,τp)(0,\tau_{p})^{\prime}, (τp,0)(\tau_{p},0)^{\prime}, and (τp,τp)(\tau_{p},\tau_{p})^{\prime}, respectively. Therefore, the calculations are similar, except that the expressions of μ(1)\mu^{(1)} to μ(4)\mu^{(4)} have changed to (D.13).

Below, for a negative ρ\rho, we calculate the exponents in (D.6) as follows: We pretend that ρ>0\rho>0 and calculate the exponents using the same u{\cal R}_{u} and Σ\Sigma as before, with μ(1)\mu^{(1)} to μ(4)\mu^{(4)} replaced by those in (D.13). Finally, we replace ρ\rho by |ρ||\rho| in all four exponents.

We now pretend that \rho>0. Then, for each ellipsoid {\cal E}(t;\mu^{(k)}), its intersection point with a line h_{1}+bh_{2}=c still obeys the formula in (D.7), and the corresponding t^{*} values associated with lines 1-6 are still the same as those in the equation below (D.7) (but the vector \mu has changed). Comparing (D.13) with (D.5), we notice that \mu^{(1)} and \mu^{(3)} are unchanged. Therefore, the expressions of the exponents in (D.8) and (D.11) are still correct. The current \mu^{(2)} is a sign flip (on both coordinates) of the \mu^{(2)} in (D.5); also, it can be seen from Figure 12 that the rejection region is unchanged under such a sign flip. Therefore, the expression in (D.9) is also valid. We only need to re-calculate the exponent in (D.12). The current \mu^{(4)} is in the 4-th orthant. It is in the interior of {\cal R}_{u} only if (1-\rho)\sqrt{r}>\sqrt{u}, i.e., u<(1-\rho)^{2}r. As we increase t, the ellipsoid {\cal E}(t;\mu^{(4)}) will first intersect with either line 2 or line 3. Using the formula of t^{*} in the equation below (D.7), we have

t2=(u(1ρ)r)2,t3=1+ρ1ρ((1ρ)ru)2.t_{2}^{*}=\bigl{(}\sqrt{u}-(1-\rho)\sqrt{r}\bigr{)}^{2},\qquad t_{3}^{*}=\frac{1+\rho}{1-\rho}\bigl{(}(1-\rho)\sqrt{r}-\sqrt{u}\bigr{)}^{2}.

Although t_{2}^{*} is the smaller one, the intersection point of the ellipsoid with line 2 is (\sqrt{u},\;\rho\sqrt{u}-(1-\rho^{2})\sqrt{r})^{\prime}, which by Figure 12 is in the interior of {\cal R}_{u}. Hence, the ellipsoid hits line 3 first. We conclude that

infhuc{(hμ(4))Σ1(hμ(4))}=1+ρ1ρ((1ρ)ru)+2.\inf_{h\in{\cal R}^{c}_{u}}\{(h-\mu^{(4)})^{\prime}\Sigma^{-1}(h-\mu^{(4)})\}=\frac{1+\rho}{1-\rho}\bigl{(}(1-\rho)\sqrt{r}-\sqrt{u}\bigr{)}_{+}^{2}. (D.14)

Finally, we plug (D.8), (D.9), (D.11) and (D.14) into (D.6), and then change ρ\rho to |ρ||\rho|. This gives the expressions of FPp(u)\mathrm{FP}_{p}(u) and FNp(u)\mathrm{FN}_{p}(u) for a negative ρ\rho.

Appendix E Proof of Theorem 5.2

We assume \rho\geq 1/2 throughout the proof; the calculation for the case \rho\leq-1/2 is similar. By the design of the Gram matrix X^{T}X and the construction of the knockoff variables, the Lasso regression problem with 2p variables can be reduced to (p/2) independent four-variate Lasso regression problems:

(β^j,β^j+1,β^j+p,β^j+p+1)(λ)=argminb{12||y(xj,xj+1,x~j,x~j+1)b||22+λb1}(\hat{\beta}_{j},\hat{\beta}_{j+1},\hat{\beta}_{j+p},\hat{\beta}_{j+p+1})(\lambda)=\mathrm{argmin}_{b}\Big{\{}\frac{1}{2}||y-(x_{j},x_{j+1},\tilde{x}_{j},\tilde{x}_{j+1})b||_{2}^{2}+\lambda||b||_{1}\Big{\}} (E.1)

for j=1,3,,p1j=1,3,\cdots,p-1. By taking the sub-gradients of the objective function in (E.1), we know (β^j,β^j+1,β^j+p,β^j+p+1)(\hat{\beta}_{j},\hat{\beta}_{j+1},\hat{\beta}_{j+p},\hat{\beta}_{j+p+1}) should satisfy:

(β^j,β^j+1,β^j+p,β^j+p+1)G+λ(sgn(β^j),sgn(β^j+1),sgn(β^j+p),sgn(β^j+p+1))=(yTxj,yTxj+1,yTx~j,yTx~j+1)\begin{split}(\hat{\beta}_{j},\hat{\beta}_{j+1},\hat{\beta}_{j+p},\hat{\beta}_{j+p+1})G+\lambda(\text{sgn}(\hat{\beta}_{j}),\text{sgn}(\hat{\beta}_{j+1}),\text{sgn}(\hat{\beta}_{j+p}),\text{sgn}(\hat{\beta}_{j+p+1}))\\ =(y^{T}x_{j},y^{T}x_{j+1},y^{T}\tilde{x}_{j},y^{T}\tilde{x}_{j+1})\end{split} (E.2)

where G=((1,\rho,2\rho-1,\rho)^{T},(\rho,1,\rho,2\rho-1)^{T},(2\rho-1,\rho,1,\rho)^{T},(\rho,2\rho-1,\rho,1)^{T}) and \mathrm{sgn}(x)=1 if x>0, \mathrm{sgn}(x)=-1 if x<0, and \mathrm{sgn}(x) can be any value in [-1,1] if x=0. We have chosen the correlation between a true variable and its knockoff to be 2\rho-1, which is the smallest value such that (X,\tilde{X})^{T}(X,\tilde{X}) is positive semi-definite. In this case, G is degenerate and has rank 3. As \lambda decreases from infinity, the first two variables entering the model (assuming these two features are linearly independent) will not leave before the third variable enters, which follows from the closed-form solution of the bi-variate Lasso problem. We then show that the first two variables enter the Lasso path one at a time; furthermore, if the first two variables are a true variable and its own knockoff, then the third and fourth variables enter the Lasso path simultaneously.

Since (y^{T}x_{j},y^{T}x_{j+1},y^{T}\tilde{x}_{j},y^{T}\tilde{x}_{j+1})^{T}\sim\mathcal{N}(G(\beta_{j},\beta_{j+1},0,0)^{T},G) is a degenerate normal random vector, we reparametrize it as (m+d_{1},m+d_{2},m-d_{1},m-d_{2}) with (m,d_{1},d_{2})^{T}\sim\mathcal{N}((\rho\beta_{j}+\rho\beta_{j+1},(1-\rho)\beta_{j},(1-\rho)\beta_{j+1})^{T},\mathrm{diag}(\rho,1-\rho,1-\rho)). We intend to express the Lasso solution path (or Z_{j},\tilde{Z}_{j}) as a function of m,d_{1} and d_{2}. We only present the result in the case where d_{1}>d_{2}>0; results for the other cases follow immediately by permuting the rows of the equation set (E.2) and transforming to the d_{1}>d_{2}>0 case. The Lasso solution path is obtained from the KKT condition (E.2) and summarized in Table 2.

range of m | \lambda_{1} | \text{sign}_{1} | \lambda_{2} | \text{sign}_{2} | \lambda_{3} | \text{sign}_{3}
(-\infty,\ \frac{\rho}{1-\rho}(d_{2}-d_{1})) | -m+d_{1} | (0,0,0^{-},0) | -m-\frac{\rho}{1-\rho}d_{1}+\frac{1}{1-\rho}d_{2} | (0,0,-,0^{-}) | -- | --
(\frac{\rho}{1-\rho}(d_{2}-d_{1}),\ 0) | -m+d_{1} | (0,0,0^{-},0) | \frac{1-\rho}{\rho}m+d_{1} | (0^{+},0,-,0) | d_{2} | (+,0^{+},-,0^{-})
(0,\ \frac{\rho}{1-\rho}(d_{1}-d_{2})) | m+d_{1} | (0^{+},0,0,0) | \frac{\rho-1}{\rho}m+d_{1} | (+,0,0^{-},0) | d_{2} | (+,0^{+},-,0^{-})
(\frac{\rho}{1-\rho}(d_{1}-d_{2}),\ \infty) | m+d_{1} | (0^{+},0,0,0) | m-\frac{\rho}{1-\rho}d_{1}+\frac{1}{1-\rho}d_{2} | (+,0^{+},0,0) | -- | --

Table 2: Summary of the solution path of the Lasso problem (E.1). \lambda_{i} records the critical value of \lambda at which a new variable enters the model, and \text{sign}_{i} records the signs and the limiting behavior of (\hat{\beta}_{j},\hat{\beta}_{j+1},\hat{\beta}_{j+p},\hat{\beta}_{j+p+1}) as \lambda\to\lambda_{i}^{-}. The value of \lambda_{3} is omitted in rows 1 and 4 since it does not affect the values of W_{j} and W_{j+1}.

Here we explain the third row of the table as an example. b_{1}=(\epsilon,0,0,0)^{T} is a solution of the KKT condition (E.2) when \lambda=m+d_{1}-\epsilon for \epsilon\in(0,\frac{m}{\rho}], so \text{sign}_{1} is expressed as (0^{+},0,0,0). By a property of the Lasso solution, if b_{1} and b_{2} are both Lasso solutions, then G(b_{1}-b_{2})=0 and ||b_{1}||_{1}=||b_{2}||_{1}. G(b_{1}-b_{2})=0 implies b_{1}-b_{2}=\delta\times(1,-1,1,-1)^{T} for some \delta\neq 0. Therefore, b_{2}=(\epsilon-\delta,\delta,-\delta,\delta)^{T} and ||b_{2}||_{1}\geq||b_{1}||_{1}+2|\delta|. This means the Lasso solution is unique at \lambda=m+d_{1}-\epsilon, and variable 1 is the only one entering the model when \lambda drops below \lambda_{1}. When \lambda=\frac{\rho-1}{\rho}m+d_{1}-\epsilon for \epsilon\in(0,\frac{\rho-1}{\rho}m+d_{1}-d_{2}], b_{1}=(\frac{m}{\rho}+\frac{\epsilon}{2-2\rho},0,-\frac{\epsilon}{2-2\rho},0)^{T} is a solution of the KKT conditions. If there were another Lasso solution b_{2}, then b_{2}=(\frac{m}{\rho}+\frac{\epsilon}{2-2\rho}-\delta,\delta,-\frac{\epsilon}{2-2\rho}-\delta,\delta)^{T} and ||b_{2}||_{1}\geq||b_{1}||_{1}+2|\delta|. So b_{2} does not exist, and variable 3 is the only one entering the model when \lambda drops below \lambda_{2}. When \lambda=d_{2}-\epsilon for sufficiently small positive \epsilon, b_{1}=(\frac{m}{2\rho}+\frac{d_{1}}{2-2\rho},\frac{\epsilon}{2-2\rho},\frac{m}{2\rho}-\frac{d_{1}}{2-2\rho},-\frac{\epsilon}{2-2\rho})^{T} satisfies the KKT condition; thus variables 2 and 4 enter the model simultaneously. At this point, the Lasso solution is not unique, and all solutions can be expressed as b_{1}-\delta\times(1,-1,1,-1)^{T} with \delta\in[-\frac{\epsilon}{2-2\rho},\frac{\epsilon}{2-2\rho}]. The other rows of the table can be analyzed similarly.

Table 2 implicitly expresses Z_{j},Z_{j+1},\tilde{Z}_{j} and \tilde{Z}_{j+1} as functions of d_{1},d_{2} and m. By examining all possible ordinal relationships among d_{1},d_{2} and 0, we record the region in the space of (d_{1},d_{2},m) such that \hat{\beta}_{j}(u)>0, and denote it by R(u). R(u) is the union of 4 disjoint sub-regions \{R_{i}(u)\}_{i=1,\cdots,4}, defined as follows:

R1(u)={(x,y,z):x>0,y>0,x>y,z>0,x+z>T}12{(x,y,z):x>0,y>0,x<y,z<0,z>xy,x>T}12{(x,y,z):x>0,y>0,x<y,z>0,z<ρ1ρ(yx),x>T}{(x,y,z):x>0,y>0,x<y,z>0,z>max(ρ1ρ(yx),T+ρ1ρy11ρx)},\begin{split}R_{1}(u)=&\{(x,y,z):x>0,y>0,x>y,z>0,x+z>T\}\\ &\cup\frac{1}{2}\{(x,y,z):x>0,y>0,x<y,z<0,z>x-y,x>T\}\\ &\cup\frac{1}{2}\{(x,y,z):x>0,y>0,x<y,z>0,z<\frac{\rho}{1-\rho}(y-x),x>T\}\\ &\cup\{(x,y,z):x>0,y>0,x<y,z>0,z>\max\big{(}\frac{\rho}{1-\rho}(y-x),T+\frac{\rho}{1-\rho}y-\frac{1}{1-\rho}x\big{)}\},\end{split} (E.3)

R_{2}(u)=\{(x,y,z):(-x,y,-z)\in R_{1}(u)\}, R_{3}(u)=\{(x,y,z):(x,-y,z)\in R_{1}(u)\} and R_{4}(u)=\{(x,y,z):(-x,-y,-z)\in R_{1}(u)\}, where T=\sqrt{2u\log(p)} and the factor \frac{1}{2} in front of a region means that when (d_{1},d_{2},m) lies in this region, \hat{\beta}_{j}(u)>0 happens with probability 1/2. Let the four disjoint regions that compose R_{1}(u) in (E.3) be denoted by R_{1,j}(u) for j=1,\cdots,4. We similarly define R_{i,j}(u) for i=2,3,4. By Lemma 7.1, as p\to\infty,

(βj=0,β^j(u)0)=(β^j(u)0|βj=0,βj+1=0)×(βj=0,βj+1=0)+(β^j(u)0|βj=0,βj+1=τp)×(βj=0,βj+1=τp)=LppinfR(u)[(z2/ρ+x2/(1ρ)+y2/(1ρ))/(2log(p))]+LppϑinfR(u)[((zρτp)2/ρ+x2/(1ρ)+(y(1ρ)τp)2/(1ρ))/(2log(p))],\begin{split}\mathbb{P}(\beta_{j}=0,\hat{\beta}_{j}(u)\neq 0)=&\mathbb{P}(\hat{\beta}_{j}(u)\neq 0|\beta_{j}=0,\beta_{j+1}=0)\times\mathbb{P}(\beta_{j}=0,\beta_{j+1}=0)\\ &+\mathbb{P}(\hat{\beta}_{j}(u)\neq 0|\beta_{j}=0,\beta_{j+1}=\tau_{p})\times\mathbb{P}(\beta_{j}=0,\beta_{j+1}=\tau_{p})\\ =&L_{p}p^{-\inf_{R(u)}[(z^{2}/\rho+x^{2}/(1-\rho)+y^{2}/(1-\rho))/(2\log(p))]}\\ &+L_{p}p^{-\vartheta-\inf_{R(u)}[((z-\rho\tau_{p})^{2}/\rho+x^{2}/(1-\rho)+(y-(1-\rho)\tau_{p})^{2}/(1-\rho))/(2\log(p))]},\end{split} (E.4)
(βj0,β^j(u)=0)=(β^j(u)=0|βj=τp,βj+1=0)×(βj=τp,βj+1=0)+(β^j(u)=0|βj=τp,βj+1=τp)×(βj=τp,βj+1=τp)=LppϑinfR(u)C[((zρτp)2/ρ+(x(1ρ)τp)2/(1ρ)+y2/(1ρ))/(2log(p))]+Lpp2ϑinfR(u)C[((z2ρτp)2/ρ+(x(1ρ)τp)2/(1ρ)+(y(1ρ)τp)2/(1ρ))/(2log(p))].\begin{split}\mathbb{P}(\beta_{j}\neq 0,\hat{\beta}_{j}(u)=0)=&\mathbb{P}(\hat{\beta}_{j}(u)=0|\beta_{j}=\tau_{p},\beta_{j+1}=0)\times\mathbb{P}(\beta_{j}=\tau_{p},\beta_{j+1}=0)\\ &+\mathbb{P}(\hat{\beta}_{j}(u)=0|\beta_{j}=\tau_{p},\beta_{j+1}=\tau_{p})\times\mathbb{P}(\beta_{j}=\tau_{p},\beta_{j+1}=\tau_{p})\\ =&L_{p}p^{-\vartheta-\inf_{R(u)^{C}}[((z-\rho\tau_{p})^{2}/\rho+(x-(1-\rho)\tau_{p})^{2}/(1-\rho)+y^{2}/(1-\rho))/(2\log(p))]}\\ &+L_{p}p^{-2\vartheta-\inf_{R(u)^{C}}[((z-2\rho\tau_{p})^{2}/\rho+(x-(1-\rho)\tau_{p})^{2}/(1-\rho)+(y-(1-\rho)\tau_{p})^{2}/(1-\rho))/(2\log(p))]}.\end{split} (E.5)

Define the ρ\rho-distance function of two sets AA and BB in 3\mathbb{R}^{3} as

dρ(A,B)=infaA,bB[(a1b1)2/(1ρ)+(a2b2)2/(1ρ)+(a3b3)2/ρ]d_{\rho}(A,B)=\inf_{a\in A,b\in B}[(a_{1}-b_{1})^{2}/(1-\rho)+(a_{2}-b_{2})^{2}/(1-\rho)+(a_{3}-b_{3})^{2}/\rho]

where a_{k},b_{k} denote the k-th coordinates of the vectors a and b. An immediate property of the \rho-distance function is

dρ(i=1,,MAi,j=1,,NBj)=mini,jdρ(Ai,Bj).d_{\rho}(\cup_{i=1,\cdots,M}A_{i},\cup_{j=1,\cdots,N}B_{j})=\min_{i,j}d_{\rho}(A_{i},B_{j}).
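
For finite point sets, this property is immediate to check numerically; below is a tiny sketch of ours with illustrative point clouds.

```python
import numpy as np

def d_rho(A, B, rho):
    """rho-distance between two finite point sets in R^3 (rows are points)."""
    w = np.array([1 / (1 - rho), 1 / (1 - rho), 1 / rho])
    diff = A[:, None, :] - B[None, :, :]          # all pairwise differences
    return (w * diff**2).sum(axis=2).min()

rng = np.random.default_rng(2)
rho = 0.6
A1, A2 = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
B1, B2 = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
lhs = d_rho(np.vstack([A1, A2]), np.vstack([B1, B2]), rho)
rhs = min(d_rho(A, B, rho) for A in (A1, A2) for B in (B1, B2))
assert abs(lhs - rhs) < 1e-12
```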

Utilizing the symmetry of the regions, we can compute the region distances involved in (E.4) and (E.5) explicitly. Take the second exponent in (E.4) as an example; it can be simplified as

ϑdρ(R(u),{(0,(1ρ)τp,ρτp)})/(2log(p))=ϑdρ(R1(u)R2(u)R3(u)R4(u),{(0,(1ρ)τp,ρτp)})/(2log(p))=ϑdρ(R1,1(u)R1,3(u)R1,4(u)R2,2(u),{(0,(1ρ)τp,ρτp)}))/(2log(p)).\begin{split}-\vartheta-&d_{\rho}(R(u),\{(0,(1-\rho)\tau_{p},\rho\tau_{p})\})/(2\log(p))\\ =-\vartheta-&d_{\rho}(R_{1}(u)\cup R_{2}(u)\cup R_{3}(u)\cup R_{4}(u),\{(0,(1-\rho)\tau_{p},\rho\tau_{p})\})/(2\log(p))\\ =-\vartheta-&d_{\rho}(R_{1,1}(u)\cup R_{1,3}(u)\cup R_{1,4}(u)\cup R_{2,2}(u),\{(0,(1-\rho)\tau_{p},\rho\tau_{p})\}))/(2\log(p)).\end{split}

Define R~1,2(u)={(x,y,z):x>0,y>0,z>0,x<y,x>T,z<yx}\tilde{R}_{1,2}(u)=\{(x,y,z):x>0,y>0,z>0,x<y,x>T,z<y-x\}, R~1,3(u)={(x,y,z):x>0,y>0,z>0,x<y,x>T}\tilde{R}_{1,3}(u)=\{(x,y,z):x>0,y>0,z>0,x<y,x>T\} and R~1,4(u)={(x,y,z):x>0,y>0,z>0,x<y,x<T,z>T+ρ1ρy11ρx}\tilde{R}_{1,4}(u)=\{(x,y,z):x>0,y>0,z>0,x<y,x<T,z>T+\frac{\rho}{1-\rho}y-\frac{1}{1-\rho}x\}. Then R~1,2(u)R~1,3(u)\tilde{R}_{1,2}(u)\subset\tilde{R}_{1,3}(u) and R1,3(u)R1,4(u)=R~1,3(u)R~1,4(u)R_{1,3}(u)\cup R_{1,4}(u)=\tilde{R}_{1,3}(u)\cup\tilde{R}_{1,4}(u). Since R~1,2\tilde{R}_{1,2}(u) and R2,2(u)R_{2,2}(u) are symmetric about the plane x=0x=0, we know

dρ(R2,2(u),{(0,(1ρ)τp,ρτp)})=dρ(R~1,2(u),{(0,(1ρ)τp,ρτp)}).d_{\rho}(R_{2,2}(u),\{(0,(1-\rho)\tau_{p},\rho\tau_{p})\})=d_{\rho}(\tilde{R}_{1,2}(u),\{(0,(1-\rho)\tau_{p},\rho\tau_{p})\}).

Therefore,

dρ(R(u),{(0,(1ρ)τp,ρτp)})=min{dρ(R1,1(u),{(0,(1ρ)τp,ρτp)}),dρ(R~1,3(u),{(0,(1ρ)τp,ρτp)}),dρ(R~1,4(u),{(0,(1ρ)τp,ρτp)})}=min{1ρ2×τp2+21+ρ×[(T(1+ρ)τp/2)+]21ρ1+ρ×[(T(1+ρ)τp)+]2,11ρ×T2+11ρ×[(T(1ρ)τp)+]2,1ρ1+ρ×T2+((Tτp)+)2}=(Tρτp)2+(ξρτpηρT)+2(τpT)+2,\begin{split}d_{\rho}(R&(u),\{(0,(1-\rho)\tau_{p},\rho\tau_{p})\})\\ =\min\{&d_{\rho}(R_{1,1}(u),\{(0,(1-\rho)\tau_{p},\rho\tau_{p})\}),d_{\rho}(\tilde{R}_{1,3}(u),\{(0,(1-\rho)\tau_{p},\rho\tau_{p})\}),\\ &d_{\rho}(\tilde{R}_{1,4}(u),\{(0,(1-\rho)\tau_{p},\rho\tau_{p})\})\}\\ =\min\Big{\{}&\frac{1-\rho}{2}\times\tau_{p}^{2}+\frac{2}{1+\rho}\times[(T-(1+\rho)\tau_{p}/2)_{+}]^{2}-\frac{1-\rho}{1+\rho}\times[(T-(1+\rho)\tau_{p})_{+}]^{2},\\ &\frac{1}{1-\rho}\times T^{2}+\frac{1}{1-\rho}\times[(T-(1-\rho)\tau_{p})_{+}]^{2},\ \ \frac{1-\rho}{1+\rho}\times T^{2}+((T-\tau_{p})_{+})^{2}\Big{\}}\\ =(T-&\rho\tau_{p})^{2}+(\xi_{\rho}\tau_{p}-\eta_{\rho}T)_{+}^{2}-(\tau_{p}-T)_{+}^{2},\end{split}

where ξρ=1ρ2\xi_{\rho}=\sqrt{1-\rho^{2}} and ηρ=(1ρ)/(1+ρ)\eta_{\rho}=\sqrt{(1-\rho)/(1+\rho)}.

Setting \tau_{p}=0, we know d_{\rho}(R(u),\{(0,0,0)\})=T^{2}. By (E.4), we immediately have

(βj=0,β^j(u)0)=Lppmin{u,ϑ+(uρr)2+(ξρrηρu)+2(ru)+2}.\mathbb{P}(\beta_{j}=0,\hat{\beta}_{j}(u)\neq 0)=L_{p}p^{-\min\left\{u,\;\,\vartheta+(\sqrt{u}-\rho\sqrt{r})^{2}+(\xi_{\rho}\sqrt{r}-\eta_{\rho}\sqrt{u})_{+}^{2}-(\sqrt{r}-\sqrt{u})_{+}^{2}\right\}}. (E.6)

We can see that, when \rho>0, the false positive rate is exactly the same for the Lasso filter and the knockoff filter. For \rho\geq 1/2, we can similarly compute d_{\rho}(R(u)^{C},\{((1-\rho)\tau_{p},0,\rho\tau_{p})\}) to be

[(τpT)+((1ξρ)τp(1ηρ)T)+(λρτpηρT)+]2,[(\tau_{p}-T)_{+}-((1-\xi_{\rho})\tau_{p}-(1-\eta_{\rho})T)_{+}-(\lambda_{\rho}\tau_{p}-\eta_{\rho}\ T)_{+}]^{2},

and dρ(R(u)C,{((1ρ)τp,(1ρ)τp,2ρτp)})d_{\rho}(R(u)^{C},\{((1-\rho)\tau_{p},(1-\rho)\tau_{p},2\rho\tau_{p})\}) to be

[(ξρτpηρT)+(λρτpηρT)+]2,[(\xi_{\rho}\ \tau_{p}-\eta_{\rho}T)_{+}-(\lambda_{\rho}\tau_{p}-\eta_{\rho}T)_{+}]^{2},

where ξρ=1ρ2\xi_{\rho}=\sqrt{1-\rho^{2}}, ηρ=(1ρ)/(1+ρ)\eta_{\rho}=\sqrt{(1-\rho)/(1+\rho)}, and λρ=1ρ21ρ\lambda_{\rho}=\sqrt{1-\rho^{2}}-\sqrt{1-\rho}.

Plugging these results into (E.5), we have

(βj0,β^j(u)=0)=Lppϑ{(ru)+[(1ξρ)r(1ηρ)u]+(λρrηρu)+}2.\mathbb{P}(\beta_{j}\neq 0,\hat{\beta}_{j}(u)=0)=L_{p}p^{-\vartheta-\left\{(\sqrt{r}-\sqrt{u})_{+}-[(1-\xi_{\rho})\sqrt{r}-(1-\eta_{\rho})\sqrt{u}]_{+}-(\lambda_{\rho}\sqrt{r}-\eta_{\rho}\sqrt{u})_{+}\right\}^{2}}. (E.7)

This proves the result for the case \rho\geq 1/2.
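
For concreteness, the exponents in (E.6)-(E.7) can be evaluated as simple functions of (u,r,\rho,\vartheta). The helper below is our own sketch; the printed values are illustrative only.

```python
import numpy as np

def knockoff_fp_fn_exponents(u, r, rho, theta):
    """Exponents of p in the rates (E.6) and (E.7), for rho >= 1/2."""
    plus = lambda x: max(x, 0.0)
    xi = np.sqrt(1 - rho**2)
    eta = np.sqrt((1 - rho) / (1 + rho))
    lam = np.sqrt(1 - rho**2) - np.sqrt(1 - rho)
    su, sr = np.sqrt(u), np.sqrt(r)
    fp = min(u, theta + (su - rho * sr)**2 + plus(xi * sr - eta * su)**2
                  - plus(sr - su)**2)                                   # (E.6)
    fn = theta + (plus(sr - su) - plus((1 - xi) * sr - (1 - eta) * su)
                  - plus(lam * sr - eta * su))**2                       # (E.7)
    return fp, fn

print(knockoff_fp_fn_exponents(u=0.9, r=1.5, rho=0.6, theta=0.4))
```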

In the case where \rho\leq-1/2, the exponent of the false negative rate is additionally lower bounded by -2\vartheta. One can verify the rate given in the theorem through similar calculations. This case is somewhat more straightforward, since when \beta_{j}=\beta_{j+1}=\tau, (y^{T}x_{j},y^{T}x_{j+1},y^{T}\tilde{x}_{j},y^{T}\tilde{x}_{j+1})^{T}\sim\mathcal{N}((1+\rho)\tau\cdot(1,1,-1,-1)^{T},G), meaning there is no way to distinguish a true variable from its knockoff.

Appendix F Proof of Theorem 5.3

In the following proofs, we only consider the case \rho\geq 0, since the \rho<0 case can be transformed to the positive |\rho| case by flipping the sign of either \beta_{j} or \beta_{j+1} for j=1,3,\cdots,p-1. By the block diagonal structure of the Gram matrix, the Lasso problem with 2p features can be reduced to (p/2) independent four-variate Lasso regression problems:

b^(λ)=argminb{12||y(xj,xj+1,x~j,x~j+1)b||22+λb1}\hat{b}(\lambda)=\mathrm{argmin}_{b}\Big{\{}\frac{1}{2}||y-(x_{j},x_{j+1},\tilde{x}_{j},\tilde{x}_{j+1})b||_{2}^{2}+\lambda||b||_{1}\Big{\}} (F.1)

for j=1,3,\cdots,p-1. Before we turn to the proof of the theorem, we first analyze the solution path of the following four-variate Lasso problem:

b^=argminb{hTb+bTBb/2+λb1}.\hat{b}=\mathrm{argmin}_{b}\bigl{\{}-h^{T}b+b^{T}Bb/2+\lambda\|b\|_{1}\bigr{\}}. (F.2)

with B=((1,ρ,a,ρ)T,(ρ,1,ρ,a)T,(a,ρ,1,ρ)T,(ρ,a,ρ,1)T)B=((1,\rho,a,\rho)^{T},(\rho,1,\rho,a)^{T},(a,\rho,1,\rho)^{T},(\rho,a,\rho,1)^{T}) and a[2|ρ|1,1]a\in[2|\rho|-1,1]. By taking the sub-gradients, we know b^\hat{b} should satisfy

Bb^+λsgn(b^)=h.B\ \hat{b}+\lambda\ \text{sgn}(\hat{b})=h. (F.3)

Let \hat{b}_{i} and h_{i} denote the i-th coordinates of \hat{b} and h. Let \lambda_{1}>\lambda_{2}>\lambda_{3}>\lambda_{4} be the values at which variables enter the solution path. As discussed in the proof of Lemma 7.2, \lambda_{1}=\max\{|h_{1}|,|h_{2}|,|h_{3}|,|h_{4}|\}. Without loss of generality, assume \lambda_{1}=|h_{1}|, so that variable 1 is the first variable entering the solution path. For a univariate Lasso problem, the only feature does not leave the model after its entry as \lambda decreases. So in the four-variate Lasso (F.2), variable 1 will stay in the model until the second variable enters. Consider three bi-variate Lasso problems (k=2,3,4):

b^(k)=argminb(k){(h(k))Tb(k)+(b(k))TB(k)b(k)/2+λb(k)1}\hat{b}^{(k)}=\mathrm{argmin}_{b^{(k)}}\bigl{\{}-(h^{(k)})^{T}b^{(k)}+(b^{(k)})^{T}B^{(k)}b^{(k)}/2+\lambda\|b^{(k)}\|_{1}\bigr{\}} (F.4)

with

B(2)=B(4)=[1ρρ1]andB(3)=[1aa1],B^{(2)}=B^{(4)}=\begin{bmatrix}1\;\;\;&\rho\\ \rho\;\;\;&1\end{bmatrix}\qquad\mbox{and}\qquad B^{(3)}=\begin{bmatrix}1\;\;\;&a\\ a\;\;\;&1\end{bmatrix},

h^{(2)}=(h_{1},h_{2}), h^{(3)}=(h_{1},h_{3}) and h^{(4)}=(h_{1},h_{4}). Now, we claim \lambda_{2}=\max_{k}\{\lambda_{2}^{(k)}\}, where \lambda_{2}^{(k)} is the value at which the second variable enters the solution path in the k-th bi-variate Lasso problem. Suppose \lambda_{2}^{(i)}>\lambda_{2}^{(k)} for i\neq k\in\{2,3,4\}. When \lambda\in[\lambda_{2}^{(i)},\lambda_{1}], the KKT condition (F.3) is satisfied with \hat{b}_{2}=\hat{b}_{3}=\hat{b}_{4}=0, by looking at the KKT conditions of the bi-variate Lasso problems. When \lambda\in[\lambda_{2}^{(i)}-\epsilon,\lambda_{2}^{(i)}), a second variable i must have entered the four-variate Lasso path, since the objective function of (F.2) is smaller when including both variable 1 and variable i than when including variable 1 alone (this is because the second variable has entered the model in the i-th bi-variate Lasso path when \lambda\in[\lambda_{2}^{(i)}-\epsilon,\lambda_{2}^{(i)})). We are now ready to prove the theorem, using what we have shown regarding \lambda_{1} and \lambda_{2}. We next compute the false positive rate and the false negative rate given (\beta_{j},\beta_{j+1})=(0,0),(0,\tau_{p}),(\tau_{p},0),(\tau_{p},\tau_{p}),(-\tau_{p},\tau_{p}), by deriving upper and lower bounds for those rates.
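
The claim \lambda_{2}=\max_{k}\{\lambda_{2}^{(k)}\} can also be checked numerically. The sketch below is our own verification code (it assumes, as argued above, that no variable exits the path before the second entry): it compares the second entry value of the four-variate problem (F.2), found by coordinate descent plus bisection, with the maximum of the bivariate entry values.

```python
import numpy as np

def lasso_cd(h, B, lam, n_iter=2000):
    """Coordinate descent for -h'b + b'Bb/2 + lam*||b||_1 (diag(B) = 1)."""
    b = np.zeros(len(h))
    for _ in range(n_iter):
        for i in range(len(h)):
            z = h[i] - B[i] @ b + b[i]
            b[i] = np.sign(z) * max(abs(z) - lam, 0.0)
    return b

def biv_second_entry(h1, hk, g):
    """Entry value of the 2nd variable in a bivariate path with correlation g.
    f(lam) below is strictly decreasing in lam, so bisection finds its root."""
    s1 = np.sign(h1)
    f = lambda lam: abs(hk - g * (h1 - lam * s1)) - lam
    lo, hi = 0.0, abs(h1)
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

rho, a = 0.5, 0.2                                 # a plays the role of <x_j, x_j~>
B = np.array([[1, rho, a, rho], [rho, 1, rho, a],
              [a, rho, 1, rho], [rho, a, rho, 1]], dtype=float)
rng = np.random.default_rng(3)
h = rng.normal(size=4)
i1 = int(np.argmax(np.abs(h)))                    # first entry at lambda_1 = |h_{i1}|
lam2_pred = max(biv_second_entry(h[i1], h[k], B[i1, k]) for k in range(4) if k != i1)

lo, hi = 0.0, abs(h[i1])                          # bisection for the 2nd entry value
for _ in range(40):
    mid = 0.5 * (lo + hi)
    active = (np.abs(lasso_cd(h, B, mid)) > 1e-9).sum()
    lo, hi = (mid, hi) if active >= 2 else (lo, mid)
print(lam2_pred, "~", 0.5 * (lo + hi))            # the two values should agree
```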

We first establish some notation. For the four-variate Lasso problem (F.1), let A_{i} denote the event that variable i is the first one entering the model, A_{i_{1},i_{2}} denote the event that variables i_{1} and i_{2} are the first two entering the model (ignoring the order between i_{1} and i_{2}), and A_{i_{1}\to i_{2}} denote the event that variable i_{1} is the first one and variable i_{2} is the second one entering the model. Let L_{i_{1},i_{2}} denote the bi-variate Lasso problem with y as the response and x_{i_{1}}, x_{i_{2}} as the variables. Let h\equiv(y^{T}x_{j},y^{T}x_{j+1},y^{T}\tilde{x}_{j},y^{T}\tilde{x}_{j+1}); then h\sim\mathcal{N}(\mu,G) with \mu=G(\beta_{j},\beta_{j+1},0,0)^{T} and G=((1,\rho,0,\rho)^{T},(\rho,1,\rho,0)^{T},(0,\rho,1,\rho)^{T},(\rho,0,\rho,1)^{T}). When it causes no confusion, we write t_{p} in place of t_{p}(u) for simplicity.

  • When (βj,βj+1)=(0,0)(\beta_{j},\beta_{j+1})=(0,0),

    {Wj>tp|(βj,βj+1)=(0,0)}=Lppu.\mathbb{P}\bigl{\{}W_{j}>t_{p}\big{|}(\beta_{j},\beta_{j+1})=(0,0)\bigr{\}}=L_{p}p^{-u}. (F.5)

    To derive a lower bound for $\mathbb{P}\{W_{j}>t_{p}|(\beta_{j},\beta_{j+1})=(0,0)\}$, we look for a point in the region (or on the boundary of the region) that chooses variable $j$ as a signal and apply Lemma 7.1. The point we choose is $p_{1}=(t_{p},\rho t_{p},0,\rho t_{p})^{T}$, where $t_{p}=\sqrt{2u\log(p)}$. Clearly, when $h=p_{1}$, variable $j$ is the first one entering the Lasso path. Although $h=p_{1}$ is in the rejection region, it is also on the boundary of the region that chooses variable $j$ as a signal, because slightly increasing the first coordinate results in variable $j$ being selected. Since $h\sim\mathcal{N}(\mu_{1},G)$ with $\mu_{1}=\textbf{0}$, by Lemma 7.1,

    {Wj>tp|(βj,βj+1)=(0,0)}Lpp(p1μ1)TG1(p1μ1)/2log(p)=Lppu.\mathbb{P}\bigl{\{}W_{j}>t_{p}\big{|}(\beta_{j},\beta_{j+1})=(0,0)\bigr{\}}\geq L_{p}p^{-(p_{1}-\mu_{1})^{T}G^{-1}(p_{1}-\mu_{1})/2\log(p)}=L_{p}p^{-u}.

    The upper bound is straightforward: decomposing over which variable $i$ enters the model first and noting that $W_{i}\sim\mathcal{N}(0,1)$,

    {Wj>tp|(βj,βj+1)=(0,0)}=i{Wj>tp,Ai|(βj,βj+1)=(0,0)}i{Wi>tp|(βj,βj+1)=(0,0)}=Lppu.\begin{split}\mathbb{P}\bigl{\{}W_{j}>t_{p}\big{|}(\beta_{j},\beta_{j+1})=(0,0)\bigr{\}}=&\sum_{i}\mathbb{P}\bigl{\{}W_{j}>t_{p},A_{i}\big{|}(\beta_{j},\beta_{j+1})=(0,0)\bigr{\}}\\ \leq&\sum_{i}\mathbb{P}\bigl{\{}W_{i}>t_{p}\big{|}(\beta_{j},\beta_{j+1})=(0,0)\bigr{\}}=L_{p}p^{-u}.\end{split} (F.6)
  • When (βj,βj+1)=(0,τp)(\beta_{j},\beta_{j+1})=(0,\tau_{p}),

    {Wj>tp|(βj,βj+1)=(0,τp)}Lpp(uρr)2(ξρrηρu)+2+(ru)+2,\mathbb{P}\bigl{\{}W_{j}>t_{p}\big{|}(\beta_{j},\beta_{j+1})=(0,\tau_{p})\bigr{\}}\geq L_{p}p^{-(\sqrt{u}-\rho\sqrt{r})^{2}-(\xi_{\rho}\sqrt{r}-\eta_{\rho}\sqrt{u})_{+}^{2}+(\sqrt{r}-\sqrt{u})_{+}^{2}}, (F.7)
    {Wj>tp,A|(βj,βj+1)=(0,τp)}Lppu\mathbb{P}\bigl{\{}W_{j}>t_{p},A\big{|}(\beta_{j},\beta_{j+1})=(0,\tau_{p})\bigr{\}}\leq L_{p}p^{-u} (F.8)

    for A=Aj+p+1j,Aj+1,j+p+1A=A_{j+p+1\to j},A_{j+1,j+p+1} and

    {Wj>tp,A|(βj,βj+1)=(0,τp)}Lpp(uρr)2(ξρrηρu)+2+(ru)+2\mathbb{P}\bigl{\{}W_{j}>t_{p},A\big{|}(\beta_{j},\beta_{j+1})=(0,\tau_{p})\bigr{\}}\leq L_{p}p^{-(\sqrt{u}-\rho\sqrt{r})^{2}-(\xi_{\rho}\sqrt{r}-\eta_{\rho}\sqrt{u})_{+}^{2}+(\sqrt{r}-\sqrt{u})_{+}^{2}} (F.9)

    for A=Aj,j+1,Aj,j+p,Ajj+p+1A=A_{j,j+1},A_{j,j+p},A_{j\to j+p+1}.

    This time we choose

    p2T={(tp,ρtp+(1ρ2)τp,ρτp,ρtpρ2τp),(1+ρ)τptp,(tp,tp,ρ1+ρtp,ρ1+ρtp),τptp<(1+ρ)τp,(tp+ρ(τptp),τp,ρ(τptp)+ρ1+ρtp,ρ1+ρtp),tp<τp.p_{2}^{T}=\left\{\begin{array}[]{ll}(t_{p},\rho t_{p}+(1-\rho^{2})\tau_{p},\rho\tau_{p},\rho t_{p}-\rho^{2}\tau_{p}),&(1+\rho)\tau_{p}\leq t_{p},\\ (t_{p},t_{p},\frac{\rho}{1+\rho}t_{p},\frac{\rho}{1+\rho}t_{p}),&\tau_{p}\leq t_{p}<(1+\rho)\tau_{p},\\ (t_{p}+\rho(\tau_{p}-t_{p}),\tau_{p},\rho(\tau_{p}-t_{p})+\frac{\rho}{1+\rho}t_{p},\frac{\rho}{1+\rho}t_{p}),&t_{p}<\tau_{p}.\end{array}\right.

    When h=p2h=p_{2} and tpτpt_{p}\geq\tau_{p}, variable jj is the first variable entering the four-variate Lasso path with Wj=tpW_{j}=t_{p}; when h=p2h=p_{2} and tp<τpt_{p}<\tau_{p}, variable j+1j+1 is the first and jj is the second variable entering the Lasso path with Wj=tpW_{j}=t_{p} and Wj+1=τpW_{j+1}=\tau_{p}. h=p2h=p_{2} is on the boundary of the region that chooses variable jj as a signal. Since h𝒩(μ2,G)h\sim\mathcal{N}(\mu_{2},G) with μ2=(ρτp,τp,ρτp,0)T\mu_{2}=(\rho\tau_{p},\tau_{p},\rho\tau_{p},0)^{T}, by Lemma 7.1,

    {Wj>tp|(βj,βj+1)=(0,τp)}Lpp(p2μ2)TG1(p2μ2)/2log(p)=Lpp(uρr)2(ξρrηρu)+2+(ru)+2.\begin{split}\mathbb{P}\bigl{\{}W_{j}>t_{p}\big{|}(\beta_{j},\beta_{j+1})=(0,\tau_{p})\bigr{\}}&\geq L_{p}p^{-(p_{2}-\mu_{2})^{T}G^{-1}(p_{2}-\mu_{2})/2\log(p)}\\ &=L_{p}p^{-(\sqrt{u}-\rho\sqrt{r})^{2}-(\xi_{\rho}\sqrt{r}-\eta_{\rho}\sqrt{u})_{+}^{2}+(\sqrt{r}-\sqrt{u})_{+}^{2}}.\end{split}

    When $A_{j+1,j+p+1}$ occurs, by our argument on $\lambda_{1}$ and $\lambda_{2}$, $Z_{j+1}$ and $Z_{j+p+1}$ are the $\lambda$ values at which the two variables enter the solution path in the bi-variate Lasso problem $L_{j+1,j+p+1}$. Therefore, $Z_{j+1}=|y^{T}x_{j+1}|$ and $Z_{j+p+1}=|y^{T}\tilde{x}_{j+1}|$. We notice that $Z_{j+p+1}>Z_{j}>t_{p}$ and marginally $y^{T}\tilde{x}_{j+1}\sim\mathcal{N}(0,1)$, so

    {Wj>tp,Aj+1,j+p+1|(βj,βj+1)=(0,τp)}{|yTx~j+1|>tp|(βj,βj+1)=(0,τp)}=Lppu.\begin{split}&\mathbb{P}\bigl{\{}W_{j}>t_{p},A_{j+1,j+p+1}\big{|}(\beta_{j},\beta_{j+1})=(0,\tau_{p})\bigr{\}}\\ &\leq\mathbb{P}\bigl{\{}|y^{T}\tilde{x}_{j+1}|>t_{p}\big{|}(\beta_{j},\beta_{j+1})=(0,\tau_{p})\bigr{\}}=L_{p}p^{-u}.\end{split}

    The above inequality also holds for $A_{j+p+1\to j}$, since if variable $j+p+1$ is the first to enter the Lasso path, then we must have $|y^{T}\tilde{x}_{j+1}|=Z_{j+p+1}>Z_{j}>t_{p}$.

    When any one of $A_{j,j+1},A_{j,j+p},A_{j\to j+p+1}$ occurs, it implies that in the bi-variate Lasso problem $L_{j,j+1}$, the largest $\lambda$ at which variable $j$ enters the model for the first time equals $W_{j}$, hence exceeds $t_{p}$. In other words, if variable $j$ is a false positive when using knockoff for variable selection, then it is also a false positive when using the bi-variate Lasso $L_{j,j+1}$. This means $\mathbb{P}\{W_{j}>t_{p},A|(\beta_{j},\beta_{j+1})=(0,\tau_{p})\}$ is upper bounded by the corresponding false positive rate of Lasso, which is $L_{p}p^{-(\sqrt{u}-\rho\sqrt{r})^{2}-(\xi_{\rho}\sqrt{r}-\eta_{\rho}\sqrt{u})_{+}^{2}+(\sqrt{r}-\sqrt{u})_{+}^{2}}$, for $A=A_{j,j+1},A_{j,j+p},A_{j\to j+p+1}$.

    Since $A_{j+1,j+p}$ and $A_{j+p,j+p+1}$ can never occur when $W_{j}>0$, (F.8) and (F.9) imply

    {Wj>tp|(βj,βj+1)=(0,τp)}Lppmin{u,(uρr)2+(ξρrηρu)+2(ru)+2}.\mathbb{P}\bigl{\{}W_{j}>t_{p}\big{|}(\beta_{j},\beta_{j+1})=(0,\tau_{p})\bigr{\}}\leq L_{p}p^{-\min\{u,(\sqrt{u}-\rho\sqrt{r})^{2}+(\xi_{\rho}\sqrt{r}-\eta_{\rho}\sqrt{u})_{+}^{2}-(\sqrt{r}-\sqrt{u})_{+}^{2}\}}. (F.10)

    Further coupled with (F.5) and (F.7), we have

    {Wj>tp,βj=0}=Lppmin{u,ϑ+(uρr)2+(ξρrηρu)+2(ru)+2}.\mathbb{P}\bigl{\{}W_{j}>t_{p},\beta_{j}=0\bigr{\}}=L_{p}p^{-\min\{u,\vartheta+(\sqrt{u}-\rho\sqrt{r})^{2}+(\xi_{\rho}\sqrt{r}-\eta_{\rho}\sqrt{u})_{+}^{2}-(\sqrt{r}-\sqrt{u})_{+}^{2}\}}. (F.11)
  • When (βj,βj+1)=(τp,0)(\beta_{j},\beta_{j+1})=(\tau_{p},0),

    {Wjtp|(βj,βj+1)=(τp,0)}Lpp[(ru)+]2,\mathbb{P}\bigl{\{}W_{j}\leq t_{p}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},0)\bigr{\}}\geq L_{p}p^{-[(\sqrt{r}-\sqrt{u})_{+}]^{2}}, (F.12)

    and

    {Wjtp|(βj,βj+1)=(τp,0)}LppϑfHamm+(u,r,ϑ).\mathbb{P}\bigl{\{}W_{j}\leq t_{p}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},0)\bigr{\}}\leq L_{p}p^{\vartheta-f^{+}_{\text{Hamm}}(u,r,\vartheta)}. (F.13)

    Let $p_{3}=(t_{p},\rho t_{p},0,\rho t_{p})^{T}$. When $h=p_{3}$, variable $j$ is the first variable entering the Lasso path and $p_{3}$ is in the region of rejecting variable $j$ as a signal. Since $h\sim\mathcal{N}(\mu_{3},G)$ with $\mu_{3}=(\tau_{p},\rho\tau_{p},0,\rho\tau_{p})^{T}$, by Lemma 7.1,

    {Wjtp|(βj,βj+1)=(τp,0)}Lpp(p3μ3)TG1(p3μ3)/2log(p)=Lpp[(ru)+]2.\begin{split}\mathbb{P}\bigl{\{}W_{j}\leq t_{p}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},0)\bigr{\}}&\geq L_{p}p^{-(p_{3}-\mu_{3})^{T}G^{-1}(p_{3}-\mu_{3})/2\log(p)}\\ &=L_{p}p^{-[(\sqrt{r}-\sqrt{u})_{+}]^{2}}.\end{split}

    Before we prove (F.13), we first analyze $f^{+}_{\text{Hamm}}(u,r,\vartheta)$ (a numerical sketch of its middle branch appears at the end of this case). By simple calculation, we find that the value of $u$ that maximizes $f^{+}_{\text{Hamm}}(u,r,\vartheta)$ for given $r,\vartheta$ is

    u={1+ρ(1+ρ+1ρ)2r,ϑ2ρ(1+ρ+1ρ)2r,(r+ϑ)24r,2ρ(1+ρ+1ρ)2rϑ<r,ϑ,r<ϑ.u^{*}=\left\{\begin{array}[]{ll}\frac{1+\rho}{(\sqrt{1+\rho}+\sqrt{1-\rho})^{2}}r,&\vartheta\leq\frac{2\rho}{(\sqrt{1+\rho}+\sqrt{1-\rho})^{2}}r,\\ \frac{(r+\vartheta)^{2}}{4r},&\frac{2\rho}{(\sqrt{1+\rho}+\sqrt{1-\rho})^{2}}r\leq\vartheta<r,\\ \vartheta,&r<\vartheta.\end{array}\right.

    This implies $u^{*}\geq\frac{1+\rho}{(\sqrt{1+\rho}+\sqrt{1-\rho})^{2}}r$ regardless of the relationship between $\vartheta$ and $r$. Considering $r,\vartheta$ as fixed, $f^{+}_{\text{Hamm}}(r,u,\vartheta)$ as a function of $u$ is monotonically non-decreasing on $[0,u^{*}]$ and monotonically non-increasing on $[u^{*},\infty)$. Moreover, $f^{+}_{\text{Hamm}}(r,u,\vartheta)=\vartheta+[(\sqrt{r}-\sqrt{u})_{+}-((1-\xi_{\rho})\sqrt{r}-(1-\eta_{\rho})\sqrt{u})_{+}]^{2}$ if and only if $u>u^{*}$. Since $(1-\xi_{\rho})\sqrt{r}-(1-\eta_{\rho})\sqrt{u^{*}}<0$, we have $(1-\xi_{\rho})\sqrt{r}-(1-\eta_{\rho})\sqrt{u}<0$ for all $u>u^{*}$, which implies $f^{+}_{\text{Hamm}}(r,u,\vartheta)=\vartheta+[(\sqrt{r}-\sqrt{u})_{+}]^{2}$ when $u>u^{*}$. Therefore,

    fHamm+(r,u,ϑ)=min{u,ϑ+(u|ρ|r)2+((ξρrηρu)+)2((ru)+)2,ϑ+[(ru)+]2}.\begin{split}f^{+}_{\text{Hamm}}(r,u,\vartheta)=\min\{&u,\vartheta+(\sqrt{u}-|\rho|\sqrt{r})^{2}+((\xi_{\rho}\sqrt{r}-\eta_{\rho}\sqrt{u})_{+})^{2}-((\sqrt{r}-\sqrt{u})_{+})^{2},\\ &\vartheta+[(\sqrt{r}-\sqrt{u})_{+}]^{2}\}.\end{split}

    Now, we show that (F.13) holds for $u\geq u^{*}$. This implies (F.13) for all $u\geq 0$: the false negative rate $\mathbb{P}\{W_{j}\leq t_{p}(u)|(\beta_{j},\beta_{j+1})=(\tau_{p},0)\}$ is monotone non-decreasing in $u$, so for $u<u^{*}$, $\mathbb{P}\{W_{j}\leq t_{p}(u)|(\beta_{j},\beta_{j+1})=(\tau_{p},0)\}\leq\mathbb{P}\{W_{j}\leq t_{p}(u^{*})|(\beta_{j},\beta_{j+1})=(\tau_{p},0)\}\leq L_{p}p^{\vartheta-f^{+}_{\text{Hamm}}(r,u^{*},\vartheta)}\leq L_{p}p^{\vartheta-f^{+}_{\text{Hamm}}(r,u,\vartheta)}$.

    Assume uuu\geq u^{*}, so u1+ρ(1+ρ+1ρ)2ru\geq\frac{1+\rho}{(\sqrt{1+\rho}+\sqrt{1-\rho})^{2}}r and

    [(ru)+]2(1ρ1+ρ+1ρ)2r(23)(1ρ)r1ρ2r12r.-[(\sqrt{r}-\sqrt{u})_{+}]^{2}\geq-\Big{(}\frac{\sqrt{1-\rho}}{\sqrt{1+\rho}+\sqrt{1-\rho}}\Big{)}^{2}r\geq-(2-\sqrt{3})(1-\rho)r\geq-\frac{1-\rho}{2}r\geq-\frac{1}{2}r. (F.14)

    We next prove (F.13) by showing that

    {Wjtp,A|(βj,βj+1)=(τp,0)}Lpp[(ru)+]2\mathbb{P}\bigl{\{}W_{j}\leq t_{p},A\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},0)\bigr{\}}\leq L_{p}p^{-[(\sqrt{r}-\sqrt{u})_{+}]^{2}} (F.15)

    holds for A=Aj,Aj+1,Aj+p,Aj+p+1A=A_{j},A_{j+1},A_{j+p},A_{j+p+1} and uuu\geq u^{*}. Respectively,

    {Wjtp,Aj|(βj,βj+1)=(τp,0)}{|yTxj|tp|(βj,βj+1)=(τp,0)}=Lpp[(ru)+]2,\begin{split}\mathbb{P}\bigl{\{}W_{j}\leq t_{p},A_{j}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},0)\bigr{\}}&\leq\mathbb{P}\bigl{\{}|y^{T}x_{j}|\leq t_{p}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},0)\bigr{\}}\\ &=L_{p}p^{-[(\sqrt{r}-\sqrt{u})_{+}]^{2}},\end{split}

    and by symmetry and (F.14),

    {Wjtp,Aj+1|(βj,βj+1)=(τp,0)}={Wjtp,Aj+p+1|(βj,βj+1)=(τp,0)}{|yTxj||yTxj+p+1||(βj,βj+1)=(τp,0)}Lpp1ρ2rLpp[(ru)+]2,\begin{split}\mathbb{P}\bigl{\{}W_{j}\leq t_{p},A_{j+1}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},0)\bigr{\}}=\mathbb{P}\bigl{\{}W_{j}\leq t_{p},A_{j+p+1}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},0)\bigr{\}}\\ \leq\mathbb{P}\bigl{\{}|y^{T}x_{j}|\leq|y^{T}x_{j+p+1}|\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},0)\bigr{\}}\leq L_{p}p^{-\frac{1-\rho}{2}r}\leq L_{p}p^{-[(\sqrt{r}-\sqrt{u})_{+}]^{2}},\end{split}
    {Wjtp,Aj+p|(βj,βj+1)=(τp,0)}{|yTxj||yTxj+p||(βj,βj+1)=(τp,0)}Lpp12rLpp[(ru)+]2.\begin{split}\mathbb{P}\bigl{\{}W_{j}\leq t_{p},A_{j+p}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},0)\bigr{\}}&\leq\mathbb{P}\bigl{\{}|y^{T}x_{j}|\leq|y^{T}x_{j+p}|\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},0)\bigr{\}}\\ &\leq L_{p}p^{-\frac{1}{2}r}\leq L_{p}p^{-[(\sqrt{r}-\sqrt{u})_{+}]^{2}}.\end{split}

    (F.13) is immediate by [(ru)+]2fHamm+(r,u,ϑ)ϑ[(\sqrt{r}-\sqrt{u})_{+}]^{2}\geq f^{+}_{\text{Hamm}}(r,u,\vartheta)-\vartheta.
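    As promised, here is a minimal numerical sketch of the middle branch of $u^{*}$: on the branch where the $(\xi_{\rho}\sqrt{r}-\eta_{\rho}\sqrt{u})_{+}^{2}$ term vanishes, $f^{+}_{\text{Hamm}}$ reduces to $\min\{u,\vartheta+(\sqrt{r}-\sqrt{u})_{+}^{2}\}$, whose maximizer is the crossing point $(r+\vartheta)^{2}/(4r)$. The parameter values below are assumptions for illustration.

```python
# Sketch: maximize f(u) = min{u, theta + (sqrt(r)-sqrt(u))_+^2} on a grid
# and compare with the closed form u* = (r+theta)^2/(4r).
import numpy as np

r, theta = 2.0, 0.5                              # assumed values, theta < r
u = np.linspace(0.0, 4.0, 400001)
f = np.minimum(u, theta + np.maximum(np.sqrt(r) - np.sqrt(u), 0.0) ** 2)
u_star = (r + theta) ** 2 / (4 * r)
print(u[np.argmax(f)], u_star)                   # both approximately 0.78125
```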

  • When (βj,βj+1)=(τp,τp)(\beta_{j},\beta_{j+1})=(\tau_{p},\tau_{p}),

    {Wjtp|(βj,βj+1)=(τp,τp)}LppϑfHamm+(u,r,ϑ).\mathbb{P}\bigl{\{}W_{j}\leq t_{p}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},\tau_{p})\bigr{\}}\leq L_{p}p^{\vartheta-f^{+}_{\text{Hamm}}(u,r,\vartheta)}. (F.16)

    More precisely, we will prove that

    {Wjtp|(βj,βj+1)=(τp,τp)}Lpp[(ru)+]2\mathbb{P}\bigl{\{}W_{j}\leq t_{p}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},\tau_{p})\bigr{\}}\leq L_{p}p^{-[(\sqrt{r}-\sqrt{u})_{+}]^{2}} (F.17)

    holds for $u\geq u^{*}$, which implies (F.16). We prove (F.17) by showing

    {Wjtp,A|(βj,βj+1)=(τp,τp)}Lpp[(ru)+]2\mathbb{P}\bigl{\{}W_{j}\leq t_{p},A\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},\tau_{p})\bigr{\}}\leq L_{p}p^{-[(\sqrt{r}-\sqrt{u})_{+}]^{2}} (F.18)

    holds for A=Aj,Aj+1j,Aj+1j+p,Aj+1j+p+1,Aj+p,Aj+p+1A=A_{j},A_{j+1\to j},A_{j+1\to j+p},A_{j+1\to j+p+1},A_{j+p},A_{j+p+1} and uuu\geq u^{*}, which cover all possibilities. Respectively,

    {Wjtp,Aj|(βj,βj+1)=(τp,τp)}{|yTxj|tp|(βj,βj+1)=(τp,τp)}Lpp[((1+ρ)ru)+]2Lpp[(ru)+]2,\begin{split}\mathbb{P}\bigl{\{}W_{j}\leq t_{p},A_{j}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},\tau_{p})\bigr{\}}&\leq\mathbb{P}\bigl{\{}|y^{T}x_{j}|\leq t_{p}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},\tau_{p})\bigr{\}}\\ &\leq L_{p}p^{-[((1+\rho)\sqrt{r}-\sqrt{u})_{+}]^{2}}\leq L_{p}p^{-[(\sqrt{r}-\sqrt{u})_{+}]^{2}},\end{split}
    {Wjtp,Aj+p|(βj,βj+1)=(τp,τp)}{|yTxj||yTx~j||(βj,βj+1)=(τp,τp)}Lpp12rLpp[(ru)+]2,\begin{split}\mathbb{P}\bigl{\{}W_{j}\leq t_{p},A_{j+p}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},\tau_{p})\bigr{\}}&\leq\mathbb{P}\bigl{\{}|y^{T}x_{j}|\leq|y^{T}\tilde{x}_{j}|\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},\tau_{p})\bigr{\}}\\ &\leq L_{p}p^{-\frac{1}{2}r}\leq L_{p}p^{-[(\sqrt{r}-\sqrt{u})_{+}]^{2}},\end{split}
    {Wjtp,Aj+p+1|(βj,βj+1)=(τp,τp)}{|yTxj||yTx~j+1||(βj,βj+1)=(τp,τp)}Lpp12(1ρ)rLpp[(ru)+]2.\begin{split}\mathbb{P}\bigl{\{}W_{j}\leq t_{p},A_{j+p+1}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},\tau_{p})\bigr{\}}&\leq\mathbb{P}\bigl{\{}|y^{T}x_{j}|\leq|y^{T}\tilde{x}_{j+1}|\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},\tau_{p})\bigr{\}}\\ &\leq L_{p}p^{-\frac{1}{2(1-\rho)}r}\leq L_{p}p^{-[(\sqrt{r}-\sqrt{u})_{+}]^{2}}.\end{split}

    When Aj+1jA_{j+1\to j} occurs, the bi-variate Lasso problem Lj,j+1L_{j,j+1} shares the same λ1\lambda_{1} and λ2\lambda_{2} with the four-variate Lasso problem. So variable jj is a false negative when doing variable selection using the bi-variate Lasso Lj,j+1L_{j,j+1} given WjtpW_{j}\leq t_{p}, which implies {Wjtp,Aj+1j|(βj,βj+1)=(τp,τp)}\mathbb{P}\bigl{\{}W_{j}\leq t_{p},A_{j+1\to j}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},\tau_{p})\bigr{\}} is upper bounded by the corresponding false negative rate of Lasso, which is Lpp(ξρrηρu)+2Lpp[(ru)+]2L_{p}p^{-(\xi_{\rho}\sqrt{r}-\eta_{\rho}\sqrt{u})_{+}^{2}}\leq L_{p}p^{-[(\sqrt{r}-\sqrt{u})_{+}]^{2}}. The last inequality is equivalent to

    (11ρ2)r(11ρ1+ρ)u.(1-\sqrt{1-\rho^{2}})\sqrt{r}\leq\Big{(}1-\sqrt{\frac{1-\rho}{1+\rho}}\Big{)}\sqrt{u}.

    By (F.14), the right hand side is no smaller than r\sqrt{r}, thus no smaller than the left hand side.

    When $A_{j+1\to j+p}$ occurs, variable $j+p$ instead of variable $j$ is the second one entering the Lasso path. This means the $\lambda_{2}$ (the $\lambda$ value at which the second variable enters the Lasso path) of the bi-variate Lasso problem $L_{j+1,j+p}$ is larger than the $\lambda_{2}$ of the bi-variate Lasso problem $L_{j,j+1}$. Since we have derived the explicit expression of $\lambda_{2}$ in bi-variate Lasso problems, when $y^{T}x_{j+1}\geq 0$, we must have

    max{yTxjρyTxj+11ρ,yTxjρyTxj+11ρ}<max{yTx~jρyTxj+11ρ,yTx~jρyTxj+11ρ}.\max\{\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{1-\rho},\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{-1-\rho}\}<\max\{\frac{y^{T}\tilde{x}_{j}-\rho y^{T}x_{j+1}}{1-\rho},\frac{y^{T}\tilde{x}_{j}-\rho y^{T}x_{j+1}}{-1-\rho}\}.

    Therefore, $A_{j+1\to j+p}$ implies that one of the three following events must occur:

    yTxj+1<0,yTxjρyTxj+11ρ<yTx~jρyTxj+11ρ,yTxjρyTxj+11ρ<yTx~jρyTxj+11ρy^{T}x_{j+1}<0,\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{1-\rho}<\frac{y^{T}\tilde{x}_{j}-\rho y^{T}x_{j+1}}{1-\rho},\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{1-\rho}<\frac{y^{T}\tilde{x}_{j}-\rho y^{T}x_{j+1}}{-1-\rho}

    The probabilities of these three events given $(\beta_{j},\beta_{j+1})=(\tau_{p},\tau_{p})$ are $L_{p}p^{-(1+\rho)^{2}r}$, $L_{p}p^{-\frac{r}{2}}$ and $L_{p}p^{-\frac{(1+2\rho)^{2}(1-\rho)}{2(1+\rho)}r}$, all of which are upper bounded by $L_{p}p^{-[(\sqrt{r}-\sqrt{u})_{+}]^{2}}$ when $u\geq u^{*}$.

    When Aj+1j+p+1A_{j+1\to j+p+1} occurs, the λ2\lambda_{2} of the bi-variate Lasso problem Lj+1,j+p+1L_{j+1,j+p+1} is larger than the λ2\lambda_{2} of the bi-variate Lasso problem Lj,j+1L_{j,j+1}. When yTxj+10y^{T}x_{j+1}\geq 0, we must have

    max{yTxjρyTxj+11ρ,yTxjρyTxj+11ρ}<|yTx~j+1|.\max\{\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{1-\rho},\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{-1-\rho}\}<|y^{T}\tilde{x}_{j+1}|.

    Therefore, $A_{j+1\to j+p+1}$ implies that one of the three following events must occur:

    yTxj+1<0,yTxjρyTxj+11ρ<yTx~j+1,yTxjρyTxj+11ρ<yTx~j+1.y^{T}x_{j+1}<0,\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{1-\rho}<y^{T}\tilde{x}_{j+1},\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{1-\rho}<-y^{T}\tilde{x}_{j+1}.

    Respectively, the probabilities of these three events are $L_{p}p^{-(1+\rho)^{2}r}$, $L_{p}p^{-\frac{r}{2}}$ and $L_{p}p^{-\frac{(1+2\rho)^{2}(1-\rho)}{2(1+\rho)}r}$, all of which are upper bounded by $L_{p}p^{-[(\sqrt{r}-\sqrt{u})_{+}]^{2}}$ when $u\geq u^{*}$. We have thus verified (F.18), which implies (F.16).

    From (F.11) and (F.12), we have

    {Wj>tp,βj=0}+{Wjtp,βj=τp}LppfHamm+(r,u,ϑ).\mathbb{P}\bigl{\{}W_{j}>t_{p},\beta_{j}=0\bigr{\}}+\mathbb{P}\bigl{\{}W_{j}\leq t_{p},\beta_{j}=\tau_{p}\bigr{\}}\geq L_{p}p^{-f^{+}_{\text{Hamm}}(r,u,\vartheta)}. (F.19)
    {Wjtp,βj=τp}=pϑ×{Wjtp|(βj,βj+1)=(τp,0)}+p2ϑ×{Wjtp|(βj,βj+1)=(τp,τp)}LppfHamm+(r,u,ϑ)\begin{split}\mathbb{P}\bigl{\{}W_{j}\leq t_{p},\beta_{j}=\tau_{p}\bigr{\}}=&p^{-\vartheta}\times\mathbb{P}\bigl{\{}W_{j}\leq t_{p}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},0)\bigr{\}}\\ &+p^{-2\vartheta}\times\mathbb{P}\bigl{\{}W_{j}\leq t_{p}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},\tau_{p})\bigr{\}}\\ \leq&L_{p}p^{-f^{+}_{\text{Hamm}}(r,u,\vartheta)}\end{split} (F.20)

    Since (F.11) also implies {Wj>tp,βj=0}LppfHamm+(r,u,ϑ)\mathbb{P}\bigl{\{}W_{j}>t_{p},\beta_{j}=0\bigr{\}}\leq L_{p}p^{-f^{+}_{\text{Hamm}}(r,u,\vartheta)}, we know

    {Wj>tp,βj=0}+{Wjtp,βj=τp}=LppfHamm+(r,u,ϑ).\mathbb{P}\bigl{\{}W_{j}>t_{p},\beta_{j}=0\bigr{\}}+\mathbb{P}\bigl{\{}W_{j}\leq t_{p},\beta_{j}=\tau_{p}\bigr{\}}=L_{p}p^{-f^{+}_{\text{Hamm}}(r,u,\vartheta)}. (F.21)
  • When (βj,βj+1)=(τp,τp)(\beta_{j},\beta_{j+1})=(-\tau_{p},\tau_{p}),

    {Wjtp|(βj,βj+1)=(τp,τp)}Lpp((ξρrηρ1u)+)2,\mathbb{P}\bigl{\{}W_{j}\leq t_{p}\big{|}(\beta_{j},\beta_{j+1})=(-\tau_{p},\tau_{p})\bigr{\}}\geq L_{p}p^{-((\xi_{\rho}\sqrt{r}-\eta_{\rho}^{-1}\sqrt{u})_{+})^{2}}, (F.22)
    {Wjtp|(βj,βj+1)=(τp,τp)}Lpp(12ρ)2(1+ρ)2(1ρ)r,\mathbb{P}\bigl{\{}W_{j}\leq t_{p}\big{|}(\beta_{j},\beta_{j+1})=(-\tau_{p},\tau_{p})\bigr{\}}\geq L_{p}p^{-\frac{(1-2\rho)^{2}(1+\rho)}{2(1-\rho)}r}, (F.23)

    and

    {Wjtp|(βj,βj+1)=(τp,τp)}Lppmin{((ξρrηρ1u)+)2,(12ρ)2(1+ρ)2(1ρ)r,ϑ+fHamm+(u,r,ϑ)}.\mathbb{P}\bigl{\{}W_{j}\leq t_{p}\big{|}(\beta_{j},\beta_{j+1})=(-\tau_{p},\tau_{p})\bigr{\}}\leq L_{p}p^{-\min\{((\xi_{\rho}\sqrt{r}-\eta_{\rho}^{-1}\sqrt{u})_{+})^{2},\frac{(1-2\rho)^{2}(1+\rho)}{2(1-\rho)}r,-\vartheta+f^{+}_{\text{Hamm}}(u,r,\vartheta)\}}. (F.24)

    Let

    p4T={((1ρ)τp,(1ρ)τp,ρτp,ρτp),(1ρ)τptp,(ρ(1ρ)τp(1+ρ)tp,(1ρ)τp,ρ(1ρ)τp+ρ21ρtp,ρ1ρtp),(1ρ)τp>tp.p_{4}^{T}=\left\{\begin{array}[]{ll}(-(1-\rho)\tau_{p},(1-\rho)\tau_{p},\rho\tau_{p},-\rho\tau_{p}),&(1-\rho)\tau_{p}\leq t_{p},\\ (\rho(1-\rho)\tau_{p}-(1+\rho)t_{p},(1-\rho)\tau_{p},\rho(1-\rho)\tau_{p}+\frac{\rho^{2}}{1-\rho}t_{p},-\frac{\rho}{1-\rho}t_{p}),&(1-\rho)\tau_{p}>t_{p}.\\ \end{array}\right.

    When h=p4h=p_{4} and (1ρ)τptp(1-\rho)\tau_{p}\leq t_{p}, variable jj is the first variable entering the Lasso path with Wj=(1ρ)τptpW_{j}=(1-\rho)\tau_{p}\leq t_{p}; when h=p4h=p_{4} and (1ρ)τp>tp(1-\rho)\tau_{p}>t_{p}, j+1j+1 is the first and jj is the second variable entering the Lasso path with Wj=tpW_{j}=t_{p}. Regardless of the relationship between τp\tau_{p} and tpt_{p}, h=p4h=p_{4} is always in the region of rejecting jj as a signal. Since h𝒩(μ4,G)h\sim\mathcal{N}(\mu_{4},G) with μ4=((1ρ)τp,(1ρ)τp,ρτp,ρτp)T\mu_{4}=(-(1-\rho)\tau_{p},(1-\rho)\tau_{p},\rho\tau_{p},-\rho\tau_{p})^{T}, by Lemma 7.1,

    {Wjtp|(βj,βj+1)=(τp,τp)}Lpp(p4μ4)TG1(p4μ4)/2log(p)=Lpp((ξρrηρ1u)+)2.\begin{split}\mathbb{P}\bigl{\{}W_{j}\leq t_{p}\big{|}(\beta_{j},\beta_{j+1})=(-\tau_{p},\tau_{p})\bigr{\}}&\geq L_{p}p^{-(p_{4}-\mu_{4})^{T}G^{-1}(p_{4}-\mu_{4})/2\log(p)}\\ &=L_{p}p^{-((\xi_{\rho}\sqrt{r}-\eta_{\rho}^{-1}\sqrt{u})_{+})^{2}}.\end{split}

    Let $p_{5}^{T}=\Big{(}\frac{4\rho^{3}-2\rho^{2}+\rho-1}{2(1-\rho)}\tau_{p},(1-\rho)\tau_{p},\frac{1+2\rho-4\rho^{2}}{2}\tau_{p},-\frac{\rho^{2}}{1-\rho}\tau_{p}\Big{)}$. When $h=p_{5}$, variable $j+1$ is the first one entering the Lasso path with $W_{j+1}=(1-\rho)\tau_{p}$; if we slightly increase the value of the third coordinate of $p_{5}$, then it falls in the region of rejecting $j$ as a signal, since variable $j+p$ is then the second variable entering the Lasso path. This implies $h=p_{5}$ is on the boundary of the region that rejects $j$ as a signal. By Lemma 7.1,

    {Wjtp|(βj,βj+1)=(τp,τp)}Lpp(p5μ4)TG1(p5μ4)/2log(p)=Lpp(12ρ)2(1+ρ)2(1ρ)r.\begin{split}\mathbb{P}\bigl{\{}W_{j}\leq t_{p}\big{|}(\beta_{j},\beta_{j+1})=(-\tau_{p},\tau_{p})\bigr{\}}&\geq L_{p}p^{-(p_{5}-\mu_{4})^{T}G^{-1}(p_{5}-\mu_{4})/2\log(p)}\\ &=L_{p}p^{-\frac{(1-2\rho)^{2}(1+\rho)}{2(1-\rho)}r}.\end{split}

    Next, we show that

    {Wjtp,A|(βj,βj+1)=(τp,τp)}Lppmin{((ξρrηρ1u)+)2,(12ρ)2(1+ρ)2(1ρ)r,ϑ+fHamm+(u,r,ϑ)}.\mathbb{P}\bigl{\{}W_{j}\leq t_{p},A\big{|}(\beta_{j},\beta_{j+1})=(-\tau_{p},\tau_{p})\bigr{\}}\leq L_{p}p^{-\min\{((\xi_{\rho}\sqrt{r}-\eta_{\rho}^{-1}\sqrt{u})_{+})^{2},\frac{(1-2\rho)^{2}(1+\rho)}{2(1-\rho)}r,-\vartheta+f^{+}_{\text{Hamm}}(u,r,\vartheta)\}}. (F.25)

    holds for A=Aj,Aj+1j,Aj+1j+p,Aj+1j+p+1,Aj+p,Aj+p+1A=A_{j},A_{j+1\to j},A_{j+1\to j+p},A_{j+1\to j+p+1},A_{j+p},A_{j+p+1}, which cover all possibilities.

    When A=AjA=A_{j} or Aj+1jA_{j+1\to j} occurs, as previously discussed, variable jj is a false negative when doing variable selection using the bi-variate Lasso Lj,j+1L_{j,j+1} given WjtpW_{j}\leq t_{p}, which implies {Wjtp,A|(βj,βj+1)=(τp,τp)}\mathbb{P}\bigl{\{}W_{j}\leq t_{p},A\big{|}(\beta_{j},\beta_{j+1})=(-\tau_{p},\tau_{p})\bigr{\}} is upper bounded by the corresponding false negative rate of Lasso, which is Lpp((ξρrηρ1u)+)2L_{p}p^{-((\xi_{\rho}\sqrt{r}-\eta_{\rho}^{-1}\sqrt{u})_{+})^{2}}.

    When Aj+1j+pA_{j+1\to j+p} occurs, the λ2\lambda_{2} of the bi-variate Lasso problem Lj+1,j+pL_{j+1,j+p} is larger than the λ2\lambda_{2} of the bi-variate Lasso problem Lj,j+1L_{j,j+1}. When yTxj+10y^{T}x_{j+1}\geq 0, we must have

    max{yTxjρyTxj+11ρ,yTxjρyTxj+11ρ}<max{yTx~jρyTxj+11ρ,yTx~jρyTxj+11ρ}.\max\{\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{1-\rho},\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{-1-\rho}\}<\max\{\frac{y^{T}\tilde{x}_{j}-\rho y^{T}x_{j+1}}{1-\rho},\frac{y^{T}\tilde{x}_{j}-\rho y^{T}x_{j+1}}{-1-\rho}\}.

    Therefore, Aj+1j+pA_{j+1\to j+p} implies one of the three following events must occur:

    yTxj+1<0,yTxjρyTxj+11ρ<yTx~jρyTxj+11ρ,yTxjρyTxj+11ρ<yTx~jρyTxj+11ρy^{T}x_{j+1}<0,\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{-1-\rho}<\frac{y^{T}\tilde{x}_{j}-\rho y^{T}x_{j+1}}{1-\rho},\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{-1-\rho}<\frac{y^{T}\tilde{x}_{j}-\rho y^{T}x_{j+1}}{-1-\rho}

    The probabilities of these three events are $L_{p}p^{-(1-\rho)^{2}r}$, $L_{p}p^{-\frac{(1-2\rho)^{2}(1+\rho)}{2(1-\rho)}r}$ and $L_{p}p^{-\frac{r}{2}}$, all of which are upper bounded by $L_{p}p^{-\min\{\frac{(1-2\rho)^{2}(1+\rho)}{2(1-\rho)}r,-\vartheta+f^{+}_{\text{Hamm}}(u,r,\vartheta)\}}$.

    When Aj+1j+p+1A_{j+1\to j+p+1} occurs, the λ2\lambda_{2} of the bi-variate Lasso problem Lj+1,j+p+1L_{j+1,j+p+1} is larger than the λ2\lambda_{2} of the bi-variate Lasso problem Lj,j+1L_{j,j+1}. When yTxj+10y^{T}x_{j+1}\geq 0, we must have

    max{yTxjρyTxj+11ρ,yTxjρyTxj+11ρ}<|yTx~j+1|.\max\{\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{1-\rho},\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{-1-\rho}\}<|y^{T}\tilde{x}_{j+1}|.

    Therefore, Aj+1j+p+1A_{j+1\to j+p+1} implies one of the three following events must occur:

    yTxj+1<0,yTxjρyTxj+11ρ<yTx~j+1,yTxjρyTxj+11ρ<yTx~j+1.y^{T}x_{j+1}<0,\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{-1-\rho}<y^{T}\tilde{x}_{j+1},\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{-1-\rho}<-y^{T}\tilde{x}_{j+1}.

    The probabilities of these three events are $L_{p}p^{-(1-\rho)^{2}r}$, $L_{p}p^{-\frac{r}{2}}$ and $L_{p}p^{-\frac{(1-2\rho)^{2}(1+\rho)}{2(1-\rho)}r}$, all of which are upper bounded by $L_{p}p^{-\min\{\frac{(1-2\rho)^{2}(1+\rho)}{2(1-\rho)}r,-\vartheta+f^{+}_{\text{Hamm}}(u,r,\vartheta)\}}$.

    When $A_{j+p}$ occurs, we have $|y^{T}\tilde{x}_{j}|>|y^{T}x_{j}|$ and $|y^{T}\tilde{x}_{j}|>|y^{T}x_{j+1}|$. If $y^{T}\tilde{x}_{j}>0$, we further have $(y^{T}\tilde{x}_{j}-y^{T}x_{j+1})+\frac{1}{2\rho+1}(y^{T}\tilde{x}_{j}+y^{T}x_{j})>0$; if $y^{T}\tilde{x}_{j}\leq 0$, we further have $y^{T}\tilde{x}_{j}+y^{T}x_{j+1}<0$. Therefore,

    {Wjtp,Aj+p|(βj,βj+1)=(τp,τp)}{(yTx~jyTxj+1)+12ρ+1(yTx~j+yTxj)>0|(βj,βj+1)=(τp,τp)}+{yTx~j+yTxj+1<0|(βj,βj+1)=(τp,τp)}Lpp2(12ρ)2(1+ρ)34ρ2r+Lppr2Lppmin{(12ρ)2(1+ρ)2(1ρ)r,ϑ+fHamm+(u,r,ϑ)}.\begin{split}\mathbb{P}&\bigl{\{}W_{j}\leq t_{p},A_{j+p}\big{|}(\beta_{j},\beta_{j+1})=(-\tau_{p},\tau_{p})\bigr{\}}\\ \leq&\mathbb{P}\bigl{\{}(y^{T}\tilde{x}_{j}-y^{T}x_{j+1})+\frac{1}{2\rho+1}(y^{T}\tilde{x}_{j}+y^{T}x_{j})>0\big{|}(\beta_{j},\beta_{j+1})=(-\tau_{p},\tau_{p})\bigr{\}}\\ &+\mathbb{P}\bigl{\{}y^{T}\tilde{x}_{j}+y^{T}x_{j+1}<0\big{|}(\beta_{j},\beta_{j+1})=(-\tau_{p},\tau_{p})\bigr{\}}\\ \leq&L_{p}p^{-\frac{2(1-2\rho)^{2}(1+\rho)}{3-4\rho^{2}}r}+L_{p}p^{-\frac{r}{2}}\leq L_{p}p^{-\min\{\frac{(1-2\rho)^{2}(1+\rho)}{2(1-\rho)}r,-\vartheta+f^{+}_{\text{Hamm}}(u,r,\vartheta)\}}.\end{split}

    For A=Aj+p+1A=A_{j+p+1}, (F.25) is immediate due to the symmetry between variable j+pj+p and j+p+1j+p+1.

    Now consider the case where $\beta_{j}$ takes values in $\{0,-\tau_{p}\}$ and $\beta_{j+1}$ takes values in $\{0,\tau_{p}\}$; this corresponds to the $\rho<0$ case (we flip the signs of $\rho$ and $\beta_{j}$ simultaneously). By (F.5), (F.7), (F.12), (F.22) and (F.23), we know

    {Wj>tp,βj=0}+{Wjtp,βj=τp}Lppmin{fHamm+(u,r,ϑ),2ϑ+((ξρrηρ1u)+)2,2ϑ+(12|ρ|)2(1+|ρ|)2(1|ρ|)r}.\begin{split}\mathbb{P}\bigl{\{}W_{j}>t_{p},\beta_{j}=0\bigr{\}}+\mathbb{P}\bigl{\{}W_{j}\leq t_{p},\beta_{j}=-\tau_{p}\bigr{\}}\\ \geq L_{p}p^{-\min\{f^{+}_{\text{Hamm}}(u,r,\vartheta),2\vartheta+((\xi_{\rho}\sqrt{r}-\eta_{\rho}^{-1}\sqrt{u})_{+})^{2},2\vartheta+\frac{(1-2|\rho|)^{2}(1+|\rho|)}{2(1-|\rho|)}r\}}.\end{split} (F.26)

    Meanwhile, (F.5), (F.8), (F.9), (F.13) and (F.24) give

    {Wj>tp,βj=0}+{Wjtp,βj=τp}Lppmin{fHamm+(u,r,ϑ),2ϑ+((ξρrηρ1u)+)2,2ϑ+(12|ρ|)2(1+|ρ|)2(1|ρ|)r}.\begin{split}\mathbb{P}\bigl{\{}W_{j}>t_{p},\beta_{j}=0\bigr{\}}+\mathbb{P}\bigl{\{}W_{j}\leq t_{p},\beta_{j}=-\tau_{p}\bigr{\}}\\ \leq L_{p}p^{-\min\{f^{+}_{\text{Hamm}}(u,r,\vartheta),2\vartheta+((\xi_{\rho}\sqrt{r}-\eta_{\rho}^{-1}\sqrt{u})_{+})^{2},2\vartheta+\frac{(1-2|\rho|)^{2}(1+|\rho|)}{2(1-|\rho|)}r\}}.\end{split} (F.27)

    Therefore,

    {Wj>tp,βj=0}+{Wjtp,βj=τp}=Lppmin{fHamm+(u,r,ϑ),2ϑ+((ξρrηρ1u)+)2,2ϑ+(12|ρ|)2(1+|ρ|)2(1|ρ|)r}.\begin{split}\mathbb{P}\bigl{\{}W_{j}>t_{p},\beta_{j}=0\bigr{\}}+\mathbb{P}\bigl{\{}W_{j}\leq t_{p},\beta_{j}=-\tau_{p}\bigr{\}}\\ =L_{p}p^{-\min\{f^{+}_{\text{Hamm}}(u,r,\vartheta),2\vartheta+((\xi_{\rho}\sqrt{r}-\eta_{\rho}^{-1}\sqrt{u})_{+})^{2},2\vartheta+\frac{(1-2|\rho|)^{2}(1+|\rho|)}{2(1-|\rho|)}r\}}.\end{split} (F.28)

    (F.21) and (F.28) complete the proof of Theorem 5.3.

Appendix G Proof of Theorem 5.4

The only difference between the conditional knockoff and the equi-correlated knockoff construction is that $x_{j}^{T}\tilde{x}_{j}$ changes from 0 to $\rho^{2}$ for $j=1,\cdots,p$. Therefore, $G=((1,\rho,\rho^{2},\rho)^{T},(\rho,1,\rho,\rho^{2})^{T},(\rho^{2},\rho,1,\rho)^{T},(\rho,\rho^{2},\rho,1)^{T})$ is the new Gram matrix for the four-variate Lassos (F.1). We follow the same notation and workflow as in the previous proof.
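For concreteness, the sketch below writes out the two local $4\times 4$ Gram matrices and confirms that they differ only in the $x_{j}^{T}\tilde{x}_{j}$ entries; $\rho=0.4$ is an assumed value for illustration.

```python
# Sketch: local 4x4 Gram matrices for columns (x_j, x_{j+1}, xtilde_j, xtilde_{j+1});
# the CI construction only changes x_j^T xtilde_j from 0 to rho^2.
import numpy as np

rho = 0.4                                     # assumed value for illustration
G_ec = np.array([[1, rho, 0, rho],
                 [rho, 1, rho, 0],
                 [0, rho, 1, rho],
                 [rho, 0, rho, 1]])
G_ci = np.array([[1, rho, rho**2, rho],
                 [rho, 1, rho, rho**2],
                 [rho**2, rho, 1, rho],
                 [rho, rho**2, rho, 1]])
print(np.argwhere(G_ec != G_ci))              # only the (x_j, xtilde_j) pairs
print(np.all(np.linalg.eigvalsh(G_ci) > 0))   # still a valid Gram matrix here
```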

  • When (βj,βj+1)=(0,0)(\beta_{j},\beta_{j+1})=(0,0),

    {Wj>tp|(βj,βj+1)=(0,0)}=Lppu.\mathbb{P}\bigl{\{}W_{j}>t_{p}\big{|}(\beta_{j},\beta_{j+1})=(0,0)\bigr{\}}=L_{p}p^{-u}. (G.1)

    Let $p_{1}=(t_{p},\rho t_{p},\rho^{2}t_{p},\rho t_{p})^{T}$, where $t_{p}=\sqrt{2u\log(p)}$. When $h=p_{1}$, variable $j$ is the first one entering the Lasso path. Although $h=p_{1}$ is in the rejection region, it is also on the boundary of the region that chooses variable $j$ as a signal. Since $h\sim\mathcal{N}(\mu_{1},G)$ with $\mu_{1}=\textbf{0}$, by Lemma 7.1,

    {Wj>tp|(βj,βj+1)=(0,0)}Lpp(p1μ1)TG1(p1μ1)/2log(p)=Lppu.\mathbb{P}\bigl{\{}W_{j}>t_{p}\big{|}(\beta_{j},\beta_{j+1})=(0,0)\bigr{\}}\geq L_{p}p^{-(p_{1}-\mu_{1})^{T}G^{-1}(p_{1}-\mu_{1})/2\log(p)}=L_{p}p^{-u}.

    The upper bound is derived exactly as in (F.6).

  • When (βj,βj+1)=(0,τp)(\beta_{j},\beta_{j+1})=(0,\tau_{p}),

    {Wj>tp|(βj,βj+1)=(0,τp)}=Lpp(uρr)2(ξρrηρu)+2+(ru)+2.\mathbb{P}\bigl{\{}W_{j}>t_{p}\big{|}(\beta_{j},\beta_{j+1})=(0,\tau_{p})\bigr{\}}=L_{p}p^{-(\sqrt{u}-\rho\sqrt{r})^{2}-(\xi_{\rho}\sqrt{r}-\eta_{\rho}\sqrt{u})_{+}^{2}+(\sqrt{r}-\sqrt{u})_{+}^{2}}. (G.2)

    This time we choose

    p2T={(tp,ρtp+(1ρ2)τp,ρ2tp+ρ(1ρ2)τp,ρtp)T,(1+ρ)τptp,(tp,tp,ρtp,ρtp)T,τptp<(1+ρ)τp,((1ρ)tp+ρτp,τp,ρτp,ρ(1ρ)tp+ρ2τp)T,tp<τp.p_{2}^{T}=\left\{\begin{array}[]{ll}(t_{p},\rho t_{p}+(1-\rho^{2})\tau_{p},\rho^{2}t_{p}+\rho(1-\rho^{2})\tau_{p},\rho t_{p})^{T},&(1+\rho)\tau_{p}\leq t_{p},\\ (t_{p},t_{p},\rho t_{p},\rho t_{p})^{T},&\tau_{p}\leq t_{p}<(1+\rho)\tau_{p},\\ ((1-\rho)t_{p}+\rho\tau_{p},\tau_{p},\rho\tau_{p},\rho(1-\rho)t_{p}+\rho^{2}\tau_{p})^{T},&t_{p}<\tau_{p}.\end{array}\right.

    When h=p2h=p_{2} and tpτpt_{p}\geq\tau_{p}, variable jj is the first variable entering the four-variate Lasso path with Wj=tpW_{j}=t_{p}; when h=p2h=p_{2} and tp<τpt_{p}<\tau_{p}, variable j+1j+1 is the first and jj is the second variable entering the Lasso path with Wj=tpW_{j}=t_{p} and Wj+1=τpW_{j+1}=\tau_{p}. h=p2h=p_{2} is on the boundary of the region that chooses variable jj as a signal. Since h𝒩(μ2,G)h\sim\mathcal{N}(\mu_{2},G) with μ2=(ρτp,τp,ρτp,ρ2τp)T\mu_{2}=(\rho\tau_{p},\tau_{p},\rho\tau_{p},\rho^{2}\tau_{p})^{T}, by Lemma 7.1,

    {Wj>tp|(βj,βj+1)=(0,τp)}Lpp(p2μ2)TG1(p2μ2)/2log(p)=Lpp(uρr)2(ξρrηρu)+2+(ru)+2.\begin{split}\mathbb{P}\bigl{\{}W_{j}>t_{p}\big{|}(\beta_{j},\beta_{j+1})=(0,\tau_{p})\bigr{\}}&\geq L_{p}p^{-(p_{2}-\mu_{2})^{T}G^{-1}(p_{2}-\mu_{2})/2\log(p)}\\ &=L_{p}p^{-(\sqrt{u}-\rho\sqrt{r})^{2}-(\xi_{\rho}\sqrt{r}-\eta_{\rho}\sqrt{u})_{+}^{2}+(\sqrt{r}-\sqrt{u})_{+}^{2}}.\end{split} (G.3)

    Next we show that

    {Wj>tp,A|(βj,βj+1)=(0,τp)}Lpp(uρr)2(ξρrηρu)+2+(ru)+2\mathbb{P}\bigl{\{}W_{j}>t_{p},A\big{|}(\beta_{j},\beta_{j+1})=(0,\tau_{p})\bigr{\}}\leq L_{p}p^{-(\sqrt{u}-\rho\sqrt{r})^{2}-(\xi_{\rho}\sqrt{r}-\eta_{\rho}\sqrt{u})_{+}^{2}+(\sqrt{r}-\sqrt{u})_{+}^{2}} (G.4)

    holds for A=Aj,j+1,Aj,j+p,Ajj+p+1,Aj+p+1j,Aj+1,j+p+1A=A_{j,j+1},A_{j,j+p},A_{j\to j+p+1},A_{j+p+1\to j},A_{j+1,j+p+1}, which covers all possibilities.

    When any one of $A_{j,j+1},A_{j,j+p},A_{j\to j+p+1}$ occurs, the same argument as for the EC-knockoff shows that if variable $j$ is a false positive when using knockoff for variable selection, then it is also a false positive when using the bi-variate Lasso $L_{j,j+1}$. So $\mathbb{P}\{W_{j}>t_{p},A|(\beta_{j},\beta_{j+1})=(0,\tau_{p})\}$ is upper bounded by the corresponding false positive rate of Lasso, which is $L_{p}p^{-(\sqrt{u}-\rho\sqrt{r})^{2}-(\xi_{\rho}\sqrt{r}-\eta_{\rho}\sqrt{u})_{+}^{2}+(\sqrt{r}-\sqrt{u})_{+}^{2}}$, for $A=A_{j,j+1},A_{j,j+p},A_{j\to j+p+1}$.

    When $A=A_{j+p+1\to j}$, variable $j+p+1$ is the first variable entering the model in the four-variate Lasso problem, so it is also the first variable entering the model in the bi-variate Lasso problems $L_{j+1,j+p+1}$ and $L_{j,j+p+1}$. That variable $j+p+1$ gets picked up as a signal in $L_{j+1,j+p+1}$ implies

    {Wj>tp,Aj+p+1j|(βj,βj+1)=(0,τp)}Lpp(u|ρ2|r)2(ξρ2rηρ2u)+2+(ru)+2Lpp(u|ρ|r)2(ξρrηρu)+2+(ru)+2\begin{split}\mathbb{P}\bigl{\{}W_{j}>t_{p},A_{j+p+1\to j}\big{|}(\beta_{j},\beta_{j+1})=(0,\tau_{p})\bigr{\}}&\leq L_{p}p^{-(\sqrt{u}-|\rho^{2}|\sqrt{r})^{2}-(\xi_{\rho^{2}}\sqrt{r}-\eta_{\rho^{2}}\sqrt{u})_{+}^{2}+(\sqrt{r}-\sqrt{u})_{+}^{2}}\\ &\leq L_{p}p^{-(\sqrt{u}-|\rho|\sqrt{r})^{2}-(\xi_{\rho}\sqrt{r}-\eta_{\rho}\sqrt{u})_{+}^{2}+(\sqrt{r}-\sqrt{u})_{+}^{2}}\end{split}

    when u(1+ρ)2ru\geq(1+\rho)^{2}r or u(1+ρ2)2ru\leq(1+\rho^{2})^{2}r.

    Now consider the bi-variate Lasso problem $L_{j,j+p+1}$ given $(1+\rho^{2})^{2}r<u<(1+\rho)^{2}r$. Variables $j$ and $j+p+1$ both get picked up as signals, with $j+p+1$ entering the model first, given $W_{j}>t_{p}$. This implies $(y^{T}x_{j},y^{T}\tilde{x}_{j+1})$ falls in the purple or green region of the right panel of Figure 12. Marginally, $(y^{T}x_{j},y^{T}\tilde{x}_{j+1})\sim\mathcal{N}((\rho\tau_{p},\rho^{2}\tau_{p})^{T},[(1,\rho),(\rho,1)])$. The point in the purple or green region that has the smallest ellipsoid distance to $(\rho\tau_{p},\rho^{2}\tau_{p})^{T}$ is $(t_{p},t_{p})$ when $(1+\rho^{2})^{2}r<u<(1+\rho)^{2}r$; thus by Lemma 7.1,

    {Wj>tp,Aj+p+1j|(βj,βj+1)=(0,τp)}Lpp(uρr)21ρ1+ρuLppr+2ru21+ρu=Lpp(u|ρ|r)2(ξρrηρu)+2+(ru)+2\begin{split}\mathbb{P}\bigl{\{}W_{j}>t_{p},A_{j+p+1\to j}\big{|}(\beta_{j},\beta_{j+1})=(0,\tau_{p})\bigr{\}}&\leq L_{p}p^{-(\sqrt{u}-\rho\sqrt{r})^{2}-\frac{1-\rho}{1+\rho}u}\\ &\leq L_{p}p^{-r+2\sqrt{ru}-\frac{2}{1+\rho}u}\\ &=L_{p}p^{-(\sqrt{u}-|\rho|\sqrt{r})^{2}-(\xi_{\rho}\sqrt{r}-\eta_{\rho}\sqrt{u})_{+}^{2}+(\sqrt{r}-\sqrt{u})_{+}^{2}}\end{split}

    for u((1+ρ2)2r,(1+ρ)2r)u\in((1+\rho^{2})^{2}r,(1+\rho)^{2}r), which completes the proof of (G.4) for A=Aj+p+1jA=A_{j+p+1\to j}.

    When $A_{j+1,j+p+1}$ occurs, consider the bi-variate Lasso problem $L_{j+1,j+p+1}$. In this bi-variate Lasso problem, $\{\lambda_{1},\lambda_{2}\}=\{Z_{j+1},Z_{j+p+1}\}$, both of which are larger than $W_{j}$. Thus in this bi-variate Lasso problem, both variables will be picked up as signals given $W_{j}>t_{p}$. So $(y^{T}x_{j+1},y^{T}\tilde{x}_{j+1})/\sqrt{2\log(p)}$ falls in one of the four regions in the right panel of Figure 12 (with $x_{j+1}^{T}\tilde{x}_{j+1}=\rho^{2}$ instead of $\rho$): the purple region, the mirror of the purple region across $y=x$, the green region, and the mirror of the green region across $y=-x$. Note that $(y^{T}x_{j+1},y^{T}\tilde{x}_{j+1})\sim\mathcal{N}((\tau_{p},\rho^{2}\tau_{p})^{T},[(1,\rho^{2}),(\rho^{2},1)])$. By Lemma 7.1, we need to find the point in those regions that has the smallest ellipsoid distance to the center $(\tau_{p},\rho^{2}\tau_{p})^{T}$. When $\tau_{p}\leq t_{p}$, this critical point is $(y^{T}x_{j+1},y^{T}\tilde{x}_{j+1})=(t_{p},t_{p})$; when $\tau_{p}>t_{p}$, this critical point is $(y^{T}x_{j+1},y^{T}\tilde{x}_{j+1})=(\tau_{p},t_{p}+\rho(\tau_{p}-t_{p}))$. So Lemma 7.1 gives that the probability that $\lambda_{1}$ and $\lambda_{2}$ in $L_{j+1,j+p+1}$ are both larger than $t_{p}$ is

    Lpp(ur)+21ρ21+ρ2uLpp(u|ρ|r)2(ξρrηρu)+2+(ru)+2.L_{p}p^{-(\sqrt{u}-\sqrt{r})_{+}^{2}-\frac{1-\rho^{2}}{1+\rho^{2}}u}\leq L_{p}p^{-(\sqrt{u}-|\rho|\sqrt{r})^{2}-(\xi_{\rho}\sqrt{r}-\eta_{\rho}\sqrt{u})_{+}^{2}+(\sqrt{r}-\sqrt{u})_{+}^{2}}.

    Since Aj+1,j+p+1{Wj>tp}A_{j+1,j+p+1}\cap\{W_{j}>t_{p}\} implies {λ1>tp}{λ2>tp}\{\lambda_{1}>t_{p}\}\cap\{\lambda_{2}>t_{p}\} in Lj+1,j+p+1L_{j+1,j+p+1}, we know

    {Wj>tp,Aj+1,j+p+1|(βj,βj+1)=(0,τp)}Lpp(u|ρ|r)2(ξρrηρu)+2+(ru)+2.\mathbb{P}\bigl{\{}W_{j}>t_{p},A_{j+1,j+p+1}\big{|}(\beta_{j},\beta_{j+1})=(0,\tau_{p})\bigr{\}}\leq L_{p}p^{-(\sqrt{u}-|\rho|\sqrt{r})^{2}-(\xi_{\rho}\sqrt{r}-\eta_{\rho}\sqrt{u})_{+}^{2}+(\sqrt{r}-\sqrt{u})_{+}^{2}}.

    Now, we have verified (G.4). Further coupled with (G.3), we have (G.2).

  • When (βj,βj+1)=(τp,0)(\beta_{j},\beta_{j+1})=(\tau_{p},0),

    {Wjtp|(βj,βj+1)=(τp,0)}Lpp[(ru)+]2,\mathbb{P}\bigl{\{}W_{j}\leq t_{p}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},0)\bigr{\}}\geq L_{p}p^{-[(\sqrt{r}-\sqrt{u})_{+}]^{2}}, (G.5)

    and

    {Wjtp|(βj,βj+1)=(τp,0)}LppϑfHamm+(u,r,ϑ).\mathbb{P}\bigl{\{}W_{j}\leq t_{p}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},0)\bigr{\}}\leq L_{p}p^{\vartheta-f^{+}_{\text{Hamm}}(u,r,\vartheta)}. (G.6)

    Let $p_{3}=(t_{p},\rho t_{p},\rho^{2}t_{p},\rho t_{p})^{T}$. When $h=p_{3}$, variable $j$ is the first variable entering the Lasso path and $p_{3}$ is in the region of rejecting variable $j$ as a signal. Since $h\sim\mathcal{N}(\mu_{3},G)$ with $\mu_{3}=(\tau_{p},\rho\tau_{p},\rho^{2}\tau_{p},\rho\tau_{p})^{T}$, by Lemma 7.1,

    {Wjtp|(βj,βj+1)=(τp,0)}Lpp(p3μ3)TG1(p3μ3)/2log(p)=Lpp[(ru)+]2.\begin{split}\mathbb{P}\bigl{\{}W_{j}\leq t_{p}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},0)\bigr{\}}&\geq L_{p}p^{-(p_{3}-\mu_{3})^{T}G^{-1}(p_{3}-\mu_{3})/2\log(p)}\\ &=L_{p}p^{-[(\sqrt{r}-\sqrt{u})_{+}]^{2}}.\end{split}

    Now, we show that (G.6) holds for $u\geq u^{*}$, which implies (G.6) for all $u\geq 0$, as discussed in the proof for the EC-knockoff. We prove (G.6) by showing that

    {Wjtp,A|(βj,βj+1)=(τp,0)}Lpp[(ru)+]2\mathbb{P}\bigl{\{}W_{j}\leq t_{p},A\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},0)\bigr{\}}\leq L_{p}p^{-[(\sqrt{r}-\sqrt{u})_{+}]^{2}} (G.7)

    holds for A=Aj,Aj+1,Aj+p,Aj+p+1A=A_{j},A_{j+1},A_{j+p},A_{j+p+1} given uuu\geq u^{*}. Respectively,

    {Wjtp,Aj|(βj,βj+1)=(τp,0)}{|yTxj|tp|(βj,βj+1)=(τp,0)}=Lpp[(ru)+]2,\begin{split}\mathbb{P}\bigl{\{}W_{j}\leq t_{p},A_{j}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},0)\bigr{\}}&\leq\mathbb{P}\bigl{\{}|y^{T}x_{j}|\leq t_{p}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},0)\bigr{\}}\\ &=L_{p}p^{-[(\sqrt{r}-\sqrt{u})_{+}]^{2}},\end{split}

    and by symmetry and (F.14),

    {Wjtp,Aj+1|(βj,βj+1)=(τp,0)}={Wjtp,Aj+p+1|(βj,βj+1)=(τp,0)}{|yTxj||yTxj+p+1||(βj,βj+1)=(τp,0)}Lpp1ρ2rLpp[(ru)+]2,\begin{split}\mathbb{P}\bigl{\{}W_{j}\leq t_{p},A_{j+1}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},0)\bigr{\}}=\mathbb{P}\bigl{\{}W_{j}\leq t_{p},A_{j+p+1}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},0)\bigr{\}}\\ \leq\mathbb{P}\bigl{\{}|y^{T}x_{j}|\leq|y^{T}x_{j+p+1}|\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},0)\bigr{\}}\leq L_{p}p^{-\frac{1-\rho}{2}r}\leq L_{p}p^{-[(\sqrt{r}-\sqrt{u})_{+}]^{2}},\end{split}
    {Wjtp,Aj+p|(βj,βj+1)=(τp,0)}{|yTxj||yTxj+p||(βj,βj+1)=(τp,0)}Lpp1ρ22rLpp[(ru)+]2.\begin{split}\mathbb{P}\bigl{\{}W_{j}\leq t_{p},A_{j+p}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},0)\bigr{\}}&\leq\mathbb{P}\bigl{\{}|y^{T}x_{j}|\leq|y^{T}x_{j+p}|\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},0)\bigr{\}}\\ &\leq L_{p}p^{-\frac{1-\rho^{2}}{2}r}\leq L_{p}p^{-[(\sqrt{r}-\sqrt{u})_{+}]^{2}}.\end{split}

    (G.6) is immediate by [(ru)+]2fHamm+(r,u,ϑ)ϑ[(\sqrt{r}-\sqrt{u})_{+}]^{2}\geq f^{+}_{\text{Hamm}}(r,u,\vartheta)-\vartheta.

  • When (βj,βj+1)=(τp,τp)(\beta_{j},\beta_{j+1})=(\tau_{p},\tau_{p}),

    {Wjtp|(βj,βj+1)=(τp,τp)}LppϑfHamm+(u,r,ϑ).\mathbb{P}\bigl{\{}W_{j}\leq t_{p}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},\tau_{p})\bigr{\}}\leq L_{p}p^{\vartheta-f^{+}_{\text{Hamm}}(u,r,\vartheta)}. (G.8)

    We prove (G.8) by showing

    {Wjtp,A|(βj,βj+1)=(τp,τp)}Lpp[(ru)+]2\mathbb{P}\bigl{\{}W_{j}\leq t_{p},A\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},\tau_{p})\bigr{\}}\leq L_{p}p^{-[(\sqrt{r}-\sqrt{u})_{+}]^{2}} (G.9)

    holds for A=Aj,Aj+1j,Aj+1j+p,Aj+1j+p+1,Aj+p,Aj+p+1A=A_{j},A_{j+1\to j},A_{j+1\to j+p},A_{j+1\to j+p+1},A_{j+p},A_{j+p+1} given uuu\geq u^{*}, which cover all possibilities. Respectively,

    {Wjtp,Aj|(βj,βj+1)=(τp,τp)}{|yTxj|tp|(βj,βj+1)=(τp,τp)}Lpp[((1+ρ)ru)+]2Lpp[(ru)+]2,\begin{split}\mathbb{P}\bigl{\{}W_{j}\leq t_{p},A_{j}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},\tau_{p})\bigr{\}}&\leq\mathbb{P}\bigl{\{}|y^{T}x_{j}|\leq t_{p}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},\tau_{p})\bigr{\}}\\ &\leq L_{p}p^{-[((1+\rho)\sqrt{r}-\sqrt{u})_{+}]^{2}}\leq L_{p}p^{-[(\sqrt{r}-\sqrt{u})_{+}]^{2}},\end{split}
    {Wjtp,Aj+p|(βj,βj+1)=(τp,τp)}{|yTxj||yTx~j||(βj,βj+1)=(τp,τp)}Lpp1ρ22rLpp[(ru)+]2,\begin{split}\mathbb{P}\bigl{\{}W_{j}\leq t_{p},A_{j+p}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},\tau_{p})\bigr{\}}&\leq\mathbb{P}\bigl{\{}|y^{T}x_{j}|\leq|y^{T}\tilde{x}_{j}|\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},\tau_{p})\bigr{\}}\\ &\leq L_{p}p^{-\frac{1-\rho^{2}}{2}r}\leq L_{p}p^{-[(\sqrt{r}-\sqrt{u})_{+}]^{2}},\end{split}
    {Wjtp,Aj+p+1|(βj,βj+1)=(τp,τp)}{|yTxj||yTx~j+1||(βj,βj+1)=(τp,τp)}Lpp(1ρ)(1+ρ)22rLpp[(ru)+]2.\begin{split}\mathbb{P}\bigl{\{}W_{j}\leq t_{p},A_{j+p+1}\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},\tau_{p})\bigr{\}}&\leq\mathbb{P}\bigl{\{}|y^{T}x_{j}|\leq|y^{T}\tilde{x}_{j+1}|\big{|}(\beta_{j},\beta_{j+1})=(\tau_{p},\tau_{p})\bigr{\}}\\ &\leq L_{p}p^{-\frac{(1-\rho)(1+\rho)^{2}}{2}r}\leq L_{p}p^{-[(\sqrt{r}-\sqrt{u})_{+}]^{2}}.\end{split}

    When $A_{j+1\to j}$ occurs, in the bi-variate Lasso problem $L_{j,j+1}$, variable $j$ is a false negative given $W_{j}\leq t_{p}$, which implies $\mathbb{P}\{W_{j}\leq t_{p},A_{j+1\to j}|(\beta_{j},\beta_{j+1})=(\tau_{p},\tau_{p})\}$ is upper bounded by the corresponding false negative rate of Lasso, which is $L_{p}p^{-(\xi_{\rho}\sqrt{r}-\eta_{\rho}\sqrt{u})_{+}^{2}}\leq L_{p}p^{-[(\sqrt{r}-\sqrt{u})_{+}]^{2}}$ for $u\geq u^{*}$.

    When $A_{j+1\to j+p}$ occurs, variable $j+p$ instead of variable $j$ is the second one entering the Lasso path. This means the $\lambda_{2}$ (the $\lambda$ value at which the second variable enters the Lasso path) of the bi-variate Lasso problem $L_{j+1,j+p}$ is larger than the $\lambda_{2}$ of the bi-variate Lasso problem $L_{j,j+1}$. When $y^{T}x_{j+1}\geq 0$, we must have

    max{yTxjρyTxj+11ρ,yTxjρyTxj+11ρ}<max{yTx~jρyTxj+11ρ,yTx~jρyTxj+11ρ}.\max\{\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{1-\rho},\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{-1-\rho}\}<\max\{\frac{y^{T}\tilde{x}_{j}-\rho y^{T}x_{j+1}}{1-\rho},\frac{y^{T}\tilde{x}_{j}-\rho y^{T}x_{j+1}}{-1-\rho}\}.

    Therefore, $A_{j+1\to j+p}$ implies that one of the three following events must occur:

    yTxj+1<0,yTxjρyTxj+11ρ<yTx~jρyTxj+11ρ,yTxjρyTxj+11ρ<yTx~jρyTxj+11ρy^{T}x_{j+1}<0,\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{1-\rho}<\frac{y^{T}\tilde{x}_{j}-\rho y^{T}x_{j+1}}{1-\rho},\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{1-\rho}<\frac{y^{T}\tilde{x}_{j}-\rho y^{T}x_{j+1}}{-1-\rho}

    The probabilities of these three events given $(\beta_{j},\beta_{j+1})=(\tau_{p},\tau_{p})$ are $L_{p}p^{-(1+\rho)^{2}r}$, $L_{p}p^{-\frac{1-\rho^{2}}{2}r}$ and $L_{p}p^{-\frac{(1+\rho)^{3}(1-\rho)}{2(1+\rho^{2})}r}$, all of which are upper bounded by $L_{p}p^{-[(\sqrt{r}-\sqrt{u})_{+}]^{2}}$ when $u\geq u^{*}$.

    When Aj+1j+p+1A_{j+1\to j+p+1} occurs, the λ2\lambda_{2} of the bi-variate Lasso problem Lj+1,j+p+1L_{j+1,j+p+1} is larger than the λ2\lambda_{2} of the bi-variate Lasso problem Lj,j+1L_{j,j+1}. When yTxj+10y^{T}x_{j+1}\geq 0, we must have

    max{yTxjρyTxj+11ρ,yTxjρyTxj+11ρ}<max{yTx~j+1ρ2yTxj+11ρ2,yTx~j+1ρ2yTxj+11ρ2}.\max\{\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{1-\rho},\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{-1-\rho}\}<\max\{\frac{y^{T}\tilde{x}_{j+1}-\rho^{2}y^{T}x_{j+1}}{1-\rho^{2}},\frac{y^{T}\tilde{x}_{j+1}-\rho^{2}y^{T}x_{j+1}}{-1-\rho^{2}}\}.

    Therefore, $A_{j+1\to j+p+1}$ implies that one of the three following events must occur:

    yTxj+1<0,yTxjρyTxj+11ρ<yTx~j+1ρ2yTxj+11ρ2,yTxjρyTxj+11ρ<yTx~j+1ρ2yTxj+11ρ2.y^{T}x_{j+1}<0,\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{1-\rho}<\frac{y^{T}\tilde{x}_{j+1}-\rho^{2}y^{T}x_{j+1}}{1-\rho^{2}},\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{1-\rho}<-\frac{y^{T}\tilde{x}_{j+1}-\rho^{2}y^{T}x_{j+1}}{-1-\rho^{2}}.

    Respectively, the probabilities of these three events are $L_{p}p^{-(1+\rho)^{2}r}$, $L_{p}p^{-\frac{1-\rho^{2}}{2}r}$ and $L_{p}p^{-\frac{(1+\rho)^{3}(1-\rho)}{2(1+\rho^{2})}r}$, all of which are upper bounded by $L_{p}p^{-[(\sqrt{r}-\sqrt{u})_{+}]^{2}}$ when $u\geq u^{*}$. We have thus verified (G.9), which implies (G.8).

    From (G.1), (G.2), (G.5), (G.6) and (G.8), we have

    {Wj>tp,βj=0}+{Wjtp,βj=τp}=LppfHamm+(r,u,ϑ),\mathbb{P}\bigl{\{}W_{j}>t_{p},\beta_{j}=0\bigr{\}}+\mathbb{P}\bigl{\{}W_{j}\leq t_{p},\beta_{j}=\tau_{p}\bigr{\}}=L_{p}p^{-f^{+}_{\text{Hamm}}(r,u,\vartheta)}, (G.10)

    which completes the proof for positive ρ\rho.

  • When (βj,βj+1)=(τp,τp)(\beta_{j},\beta_{j+1})=(-\tau_{p},\tau_{p}),

    {Wjtp|(βj,βj+1)=(τp,τp)}Lpp((ξρrηρ1u)+)2\mathbb{P}\bigl{\{}W_{j}\leq t_{p}\big{|}(\beta_{j},\beta_{j+1})=(-\tau_{p},\tau_{p})\bigr{\}}\geq L_{p}p^{-((\xi_{\rho}\sqrt{r}-\eta_{\rho}^{-1}\sqrt{u})_{+})^{2}} (G.11)

    and

    {Wjtp|(βj,βj+1)=(τp,τp)}Lppmin{((ξρrηρ1u)+)2,(1ρ)3(1+ρ)2(1+ρ2)r}.\mathbb{P}\bigl{\{}W_{j}\leq t_{p}\big{|}(\beta_{j},\beta_{j+1})=(-\tau_{p},\tau_{p})\bigr{\}}\leq L_{p}p^{-\min\{((\xi_{\rho}\sqrt{r}-\eta_{\rho}^{-1}\sqrt{u})_{+})^{2},\frac{(1-\rho)^{3}(1+\rho)}{2(1+\rho^{2})}r\}}. (G.12)

    Let

    p4T={((1ρ)τp,(1ρ)τp,ρ(1ρ)τp,ρ(1ρ)τp),(1ρ)τptp,(ρ(1ρ)τp(1+ρ)tp,(1ρ)τp,ρ(1ρ)τp,ρ2(1ρ)τpρ(1+ρ)tp),(1ρ)τp>tp.p_{4}^{T}=\left\{\begin{array}[]{ll}(-(1-\rho)\tau_{p},(1-\rho)\tau_{p},\rho(1-\rho)\tau_{p},-\rho(1-\rho)\tau_{p}),&(1-\rho)\tau_{p}\leq t_{p},\\ (\rho(1-\rho)\tau_{p}-(1+\rho)t_{p},(1-\rho)\tau_{p},\rho(1-\rho)\tau_{p},\rho^{2}(1-\rho)\tau_{p}-\rho(1+\rho)t_{p}),&(1-\rho)\tau_{p}>t_{p}.\\ \end{array}\right.

    When h=p4h=p_{4} and (1ρ)τptp(1-\rho)\tau_{p}\leq t_{p}, variable jj is the first variable entering the Lasso path with Wj=(1ρ)τptpW_{j}=(1-\rho)\tau_{p}\leq t_{p}; when h=p4h=p_{4} and (1ρ)τp>tp(1-\rho)\tau_{p}>t_{p}, j+1j+1 is the first and jj is the second variable entering the Lasso path with Wj=tpW_{j}=t_{p}. Regardless of the relationship between τp\tau_{p} and tpt_{p}, h=p4h=p_{4} is always in the region of rejecting jj as a signal. Since h𝒩(μ4,G)h\sim\mathcal{N}(\mu_{4},G) with μ4=((1ρ)τp,(1ρ)τp,ρ(1ρ)τp,ρ(1ρ)τp)T\mu_{4}=(-(1-\rho)\tau_{p},(1-\rho)\tau_{p},\rho(1-\rho)\tau_{p},-\rho(1-\rho)\tau_{p})^{T}, by Lemma 7.1,

    {Wjtp|(βj,βj+1)=(τp,τp)}Lpp(p4μ4)TG1(p4μ4)/2log(p)=Lpp((ξρrηρ1u)+)2.\begin{split}\mathbb{P}\bigl{\{}W_{j}\leq t_{p}\big{|}(\beta_{j},\beta_{j+1})=(-\tau_{p},\tau_{p})\bigr{\}}&\geq L_{p}p^{-(p_{4}-\mu_{4})^{T}G^{-1}(p_{4}-\mu_{4})/2\log(p)}\\ &=L_{p}p^{-((\xi_{\rho}\sqrt{r}-\eta_{\rho}^{-1}\sqrt{u})_{+})^{2}}.\end{split}

    Next, we show that

    {Wjtp,A|(βj,βj+1)=(τp,τp)}Lppmin{((ξρrηρ1u)+)2,(1ρ)3(1+ρ)2(1+ρ2)r}.\mathbb{P}\bigl{\{}W_{j}\leq t_{p},A\big{|}(\beta_{j},\beta_{j+1})=(-\tau_{p},\tau_{p})\bigr{\}}\leq L_{p}p^{-\min\{((\xi_{\rho}\sqrt{r}-\eta_{\rho}^{-1}\sqrt{u})_{+})^{2},\frac{(1-\rho)^{3}(1+\rho)}{2(1+\rho^{2})}r\}}. (G.13)

    holds for A=Aj,Aj+1j,Aj+1j+p,Aj+1j+p+1,Aj+p,Aj+p+1A=A_{j},A_{j+1\to j},A_{j+1\to j+p},A_{j+1\to j+p+1},A_{j+p},A_{j+p+1}, which cover all possibilities.

    When A=AjA=A_{j} or Aj+1jA_{j+1\to j} occurs, as previously discussed, variable jj is a false negative in the bi-variate Lasso Lj,j+1L_{j,j+1} given WjtpW_{j}\leq t_{p}, which implies {Wjtp,A|(βj,βj+1)=(τp,τp)}\mathbb{P}\bigl{\{}W_{j}\leq t_{p},A\big{|}(\beta_{j},\beta_{j+1})=(-\tau_{p},\tau_{p})\bigr{\}} is upper bounded by the corresponding false negative rate of Lasso, which is Lpp((ξρrηρ1u)+)2L_{p}p^{-((\xi_{\rho}\sqrt{r}-\eta_{\rho}^{-1}\sqrt{u})_{+})^{2}}.

    When Aj+1j+pA_{j+1\to j+p} occurs, the λ2\lambda_{2} of the bi-variate Lasso problem Lj+1,j+pL_{j+1,j+p} is larger than the λ2\lambda_{2} of the bi-variate Lasso problem Lj,j+1L_{j,j+1}. When yTxj+10y^{T}x_{j+1}\geq 0, we must have

    max{yTxjρyTxj+11ρ,yTxjρyTxj+11ρ}<max{yTx~jρyTxj+11ρ,yTx~jρyTxj+11ρ}.\max\{\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{1-\rho},\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{-1-\rho}\}<\max\{\frac{y^{T}\tilde{x}_{j}-\rho y^{T}x_{j+1}}{1-\rho},\frac{y^{T}\tilde{x}_{j}-\rho y^{T}x_{j+1}}{-1-\rho}\}.

    Therefore, Aj+1j+pA_{j+1\to j+p} implies one of the three following events must occur:

    yTxj+1+yTx~j<0,yTxjρyTxj+11ρ<yTx~jρyTxj+11ρ,yTxjρyTxj+11ρ<yTx~jρyTxj+11ρy^{T}x_{j+1}+y^{T}\tilde{x}_{j}<0,\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{-1-\rho}<\frac{y^{T}\tilde{x}_{j}-\rho y^{T}x_{j+1}}{1-\rho},\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{-1-\rho}<\frac{y^{T}\tilde{x}_{j}-\rho y^{T}x_{j+1}}{-1-\rho}

    The probabilities of these three events are $L_{p}p^{-\frac{(1+\rho)(1-\rho)^{2}}{2}r}$, $L_{p}p^{-\frac{(1-\rho)^{3}(1+\rho)}{2(1+\rho^{2})}r}$ and $L_{p}p^{-\frac{1-\rho^{2}}{2}r}$, all of which are upper bounded by $L_{p}p^{-\frac{(1-\rho)^{3}(1+\rho)}{2(1+\rho^{2})}r}$.

    When Aj+1j+p+1A_{j+1\to j+p+1} occurs, the λ2\lambda_{2} of the bi-variate Lasso problem Lj+1,j+p+1L_{j+1,j+p+1} is larger than the λ2\lambda_{2} of the bi-variate Lasso problem Lj,j+1L_{j,j+1}. When yTxj+10y^{T}x_{j+1}\geq 0, we must have

    max{yTxjρyTxj+11ρ,yTxjρyTxj+11ρ}<max{yTx~j+1ρ2yTxj+11ρ2,yTx~j+1ρ2yTxj+11ρ2}.\max\{\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{1-\rho},\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{-1-\rho}\}<\max\{\frac{y^{T}\tilde{x}_{j+1}-\rho^{2}y^{T}x_{j+1}}{1-\rho^{2}},\frac{y^{T}\tilde{x}_{j+1}-\rho^{2}y^{T}x_{j+1}}{-1-\rho^{2}}\}.

    Therefore, Aj+1j+p+1A_{j+1\to j+p+1} implies one of the three following events must occur:

    yTxj+1+yTx~j<0,yTxjρyTxj+11ρ<yTx~j+1ρ2yTxj+11ρ2,yTxjρyTxj+11ρ<yTx~j+1ρ2yTxj+11ρ2.y^{T}x_{j+1}+y^{T}\tilde{x}_{j}<0,\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{-1-\rho}<\frac{y^{T}\tilde{x}_{j+1}-\rho^{2}y^{T}x_{j+1}}{1-\rho^{2}},\frac{y^{T}x_{j}-\rho y^{T}x_{j+1}}{-1-\rho}<\frac{y^{T}\tilde{x}_{j+1}-\rho^{2}y^{T}x_{j+1}}{-1-\rho^{2}}.

    The probabilities of these three events are $L_{p}p^{-\frac{(1+\rho)(1-\rho)^{2}}{2}r}$, $L_{p}p^{-\frac{1-\rho^{2}}{2}r}$ and $L_{p}p^{-\frac{(1-\rho)^{3}(1+\rho)}{2(1+\rho^{2})}r}$, all of which are upper bounded by $L_{p}p^{-\frac{(1-\rho)^{3}(1+\rho)}{2(1+\rho^{2})}r}$.

    When Aj+pA_{j+p} occurs, if yTx~j<0y^{T}\tilde{x}_{j}<0, then yTxj+1+yTx~j0y^{T}x_{j+1}+y^{T}\tilde{x}_{j}\leq 0, which happens with probability Lpp(1+ρ)(1ρ)22rLpp(1ρ)3(1+ρ)2(1+ρ2)rL_{p}p^{-\frac{(1+\rho)(1-\rho)^{2}}{2}r}\leq L_{p}p^{-\frac{(1-\rho)^{3}(1+\rho)}{2(1+\rho^{2})}r}. If yTx~j0y^{T}\tilde{x}_{j}\geq 0, then yTx~j+1ρ2yTxj1+ρ2yTxj+10y^{T}\tilde{x}_{j}+\frac{1-\rho}{2}y^{T}x_{j}-\frac{1+\rho}{2}y^{T}x_{j+1}\geq 0, which happens with probability Lpp2(1ρ)33+ρ2rLpp(1ρ)3(1+ρ)2(1+ρ2)rL_{p}p^{-\frac{2(1-\rho)^{3}}{3+\rho^{2}}r}\leq L_{p}p^{-\frac{(1-\rho)^{3}(1+\rho)}{2(1+\rho^{2})}r}. Therefore, (G.13) holds for Aj+pA_{j+p} and also for Aj+p+1A_{j+p+1} due to symmetry. We thus complete the proof for (G.13).

    Now consider the case where $\beta_{j}$ takes values in $\{0,-\tau_{p}\}$ and $\beta_{j+1}$ takes values in $\{0,\tau_{p}\}$; this corresponds to the $\rho<0$ case (we flip the signs of $\rho$ and $\beta_{j}$ simultaneously). By (G.1), (G.2), (G.5) and (G.11), we know

    {Wj>tp,βj=0}+{Wjtp,βj=τp}Lppmin{fHamm+(u,r,ϑ),2ϑ+((ξρrηρ1u)+)2}.\begin{split}\mathbb{P}\bigl{\{}W_{j}>t_{p},\beta_{j}=0\bigr{\}}+\mathbb{P}\bigl{\{}W_{j}\leq t_{p},\beta_{j}=-\tau_{p}\bigr{\}}\\ \geq L_{p}p^{-\min\{f^{+}_{\text{Hamm}}(u,r,\vartheta),2\vartheta+((\xi_{\rho}\sqrt{r}-\eta_{\rho}^{-1}\sqrt{u})_{+})^{2}\}}.\end{split} (G.14)

    Meanwhile, (G.1), (G.2), (G.6) and (G.12) give

    {Wj>tp,βj=0}+{Wjtp,βj=τp}Lppmin{fHamm+(u,r,ϑ),2ϑ+((ξρrηρ1u)+)2,2ϑ+(1|ρ|)3(1+|ρ|)2(1+|ρ|2)r}.\begin{split}\mathbb{P}\bigl{\{}W_{j}>t_{p},\beta_{j}=0\bigr{\}}+\mathbb{P}\bigl{\{}W_{j}\leq t_{p},\beta_{j}=-\tau_{p}\bigr{\}}\\ \leq L_{p}p^{-\min\{f^{+}_{\text{Hamm}}(u,r,\vartheta),2\vartheta+((\xi_{\rho}\sqrt{r}-\eta_{\rho}^{-1}\sqrt{u})_{+})^{2},2\vartheta+\frac{(1-|\rho|)^{3}(1+|\rho|)}{2(1+|\rho|^{2})}r\}}.\end{split} (G.15)

    The proof is complete once we show that

    min{fHamm+(u,r,ϑ),2ϑ+((ξρrηρ1u)+)2}2ϑ+(1|ρ|)3(1+|ρ|)2(1+ρ2)r.\min\{f^{+}_{\text{Hamm}}(u,r,\vartheta),2\vartheta+((\xi_{\rho}\sqrt{r}-\eta_{\rho}^{-1}\sqrt{u})_{+})^{2}\}\leq 2\vartheta+\frac{(1-|\rho|)^{3}(1+|\rho|)}{2(1+\rho^{2})}r. (G.16)

    Otherwise, there exists a tuple $(\vartheta,r,\rho,u)$ such that

    2ϑ+(1|ρ|)3(1+|ρ|)2(1+ρ2)r<2ϑ+((ξρrηρ1u)+)22\vartheta+\frac{(1-|\rho|)^{3}(1+|\rho|)}{2(1+\rho^{2})}r<2\vartheta+((\xi_{\rho}\sqrt{r}-\eta_{\rho}^{-1}\sqrt{u})_{+})^{2} (G.17)

    and

    2ϑ+(1|ρ|)3(1+|ρ|)2(1+ρ2)r<ϑ+(u|ρ|r)2+((ξρrηρu)+)2((ru)+)22\vartheta+\frac{(1-|\rho|)^{3}(1+|\rho|)}{2(1+\rho^{2})}r<\vartheta+(\sqrt{u}-|\rho|\sqrt{r})^{2}+((\xi_{\rho}\sqrt{r}-\eta_{\rho}\sqrt{u})_{+})^{2}-((\sqrt{r}-\sqrt{u})_{+})^{2} (G.18)

    are satisfied simultaneously.

    By (G.17), ξρrηρ1u>0\xi_{\rho}\sqrt{r}-\eta_{\rho}^{-1}\sqrt{u}>0, which implies (1|ρ|)r>u(1-|\rho|)\sqrt{r}>\sqrt{u}. Therefore, the right hand side of (G.18) simplifies to ϑ+1|ρ|1+|ρ|u\vartheta+\frac{1-|\rho|}{1+|\rho|}u. By (G.18), we know

    (1|ρ|)3(1+|ρ|)2(1+ρ2)rϑ+(1|ρ|)3(1+|ρ|)2(1+ρ2)r<1|ρ|1+|ρ|u.\frac{(1-|\rho|)^{3}(1+|\rho|)}{2(1+\rho^{2})}r\leq\vartheta+\frac{(1-|\rho|)^{3}(1+|\rho|)}{2(1+\rho^{2})}r<\frac{1-|\rho|}{1+|\rho|}u.

    Plugging this into the right hand side of (G.17), we have

    2ϑ+(1|ρ|)3(1+|ρ|)2(1+ρ2)r<2ϑ+((ξρrηρ1u)+)22ϑ+(1ρ2(1|ρ|)(1+|ρ|)32(1+ρ2))2r,\begin{split}2\vartheta+\frac{(1-|\rho|)^{3}(1+|\rho|)}{2(1+\rho^{2})}r<2\vartheta+((\xi_{\rho}\sqrt{r}-\eta_{\rho}^{-1}\sqrt{u})_{+})^{2}\\ \leq 2\vartheta+\Big{(}\sqrt{1-\rho^{2}}-\sqrt{\frac{(1-|\rho|)(1+|\rho|)^{3}}{2(1+\rho^{2})}}\Big{)}^{2}r,\end{split} (G.19)

    which can only be true when $\rho^{2}>1$, a contradiction. This proves (G.16).

Appendix H Proof of Theorem 6.1

The least-squares estimator satisfies $\hat{\beta}\sim{\cal N}_{p}(\beta,G^{-1})$. This gives $\hat{\beta}_{j}\sim{\cal N}(\beta_{j},\omega_{j})$. Applying Lemma 7.1 to $X_{p}=\hat{\beta}_{j}$ and $S=\{x\in\mathbb{R}:x\geq\sqrt{u}\}$, we have

(|β^j|>tp(u)|βj=0)=Lppωj1u,(|β^j|tp(u)|βj=τp)=Lppωj1(ru)+2.\mathbb{P}(|\hat{\beta}_{j}|>t_{p}(u)|\beta_{j}=0)=L_{p}p^{-\omega_{j}^{-1}u},\qquad\mathbb{P}(|\hat{\beta}_{j}|\leq t_{p}(u)|\beta_{j}=\tau_{p})=L_{p}p^{-\omega_{j}^{-1}(\sqrt{r}-\sqrt{u})_{+}^{2}}.

It follows that

FPp(u)\displaystyle\mathrm{FP}_{p}(u) =j=1p(1ϵp)(Wj>tp(u)|βj=0)=Lpj=1ppωj1u,\displaystyle=\sum_{j=1}^{p}(1-\epsilon_{p})\cdot\mathbb{P}(W_{j}^{*}>t_{p}(u)|\beta_{j}=0)=L_{p}\sum_{j=1}^{p}p^{-\omega_{j}^{-1}u},
FNp(u)\displaystyle\mathrm{FN}_{p}(u) =j=1pϵp(Wj<tp(u)|βj=τp)=Lppϑj=1ppωj1(ru)+2.\displaystyle=\sum_{j=1}^{p}\epsilon_{p}\cdot\mathbb{P}(W_{j}^{*}<t_{p}(u)|\beta_{j}=\tau_{p})=L_{p}p^{-\vartheta}\sum_{j=1}^{p}p^{-\omega_{j}^{-1}(\sqrt{r}-\sqrt{u})_{+}^{2}}.

For the block-wise diagonal design (5.1), ωj=(1ρ2)1\omega_{j}=(1-\rho^{2})^{-1} for all 1jp11\leq j\leq p-1.
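As a numerical illustration of the first display above, the exact Gaussian tail reproduces the exponent $\omega_{j}^{-1}u$ up to the $L_{p}$ correction; the values of $p$, $u$ and $\rho$ below are assumptions for illustration.

```python
# Sketch: with beta_j = 0 and hat{beta}_j ~ N(0, omega_j), the tail
# P(|hat{beta}_j| > t_p(u)) behaves like p^{-u/omega_j} up to L_p factors.
import numpy as np
from scipy.stats import norm

p, u, rho = 10**5, 0.6, 0.4
omega = 1 / (1 - rho**2)                      # omega_j for the block design
t_p = np.sqrt(2 * u * np.log(p))
tail = 2 * norm.sf(t_p / np.sqrt(omega))      # exact two-sided Gaussian tail
print(-np.log(tail) / np.log(p))              # about 0.64: u/omega plus L_p terms
print(u / omega)                              # the leading exponent, 0.504
```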

Appendix I Proof of Theorem 6.2

By the property of least-squares coefficients,

(β^1,,β^p,β~1,,β~p)𝒩2p((β1,,βp,0,,0),(G)1).(\hat{\beta}_{1},\cdots,\hat{\beta}_{p},\tilde{\beta}_{1},\cdots,\tilde{\beta}_{p})\sim\mathcal{N}_{2p}\big{(}(\beta_{1},\cdots,\beta_{p},0,\cdots,0),(G^{*})^{-1}\big{)}.

Considering the joint distribution of $\hat{\beta}_{j}$ and $\tilde{\beta}_{j}$, which are the regression coefficients of $x_{j}$ and $\tilde{x}_{j}$, we know that $(\hat{\beta}_{j},\tilde{\beta}_{j})\sim\mathcal{N}_{2}\big{(}(\beta_{j},0),A_{j}\big{)}$, where $A_{j}$ has $\omega_{1j}$ as its diagonal elements and $\omega_{2j}$ as its off-diagonal elements. Theorem 6.2 is then immediate from the following lemma:

Lemma I.1

If (Zj,Z~j)(Z_{j},\tilde{Z}_{j}) follows 𝒩2((βj,0)T,Σ){\cal N}_{2}\Bigl{(}(\beta_{j},0)^{T},\;\;\Sigma\Bigr{)} with Σ=((σ1,σ2),(σ2,σ1))\Sigma=((\sigma_{1},\sigma_{2}),(\sigma_{2},\sigma_{1})), then

\mathbb{P}\big(|Z_{j}|>\sqrt{2u\log(p)},\,|Z_{j}|\geq|\tilde{Z}_{j}|\,\big|\,\beta_{j}=0\big)=L_{p}p^{-u/\sigma_{1}} \qquad (I.1)

and

\begin{split}&\mathbb{P}\big(|Z_{j}|\leq\sqrt{2u\log(p)}\text{ or }|Z_{j}|<|\tilde{Z}_{j}|\,\big|\,\beta_{j}=\sqrt{2r\log(p)}\big)\\ =\;&L_{p}p^{-\min\{(\sqrt{r}-\sqrt{u})_{+}^{2}/\sigma_{1},\,r/(2\max\{\sigma_{1}+\sigma_{2},\sigma_{1}-\sigma_{2}\})\}}.\end{split} \qquad (I.2)

Next, we prove Lemma I.1. To compute the left-hand side of (I.1), we only need to find the $t$ such that the ellipsoid $(x,y)\Sigma^{-1}(x,y)^{T}=t^{2}$ is tangent to the lines $x=\pm\sqrt{2u\log(p)}$. This is because, as we increase the radius of the ellipsoid, it must first intersect $x=\pm\sqrt{2u\log(p)}$ among the boundaries of the region that picks variable $j$ as a signal. When they intersect,

t^{2}=\frac{1}{\sigma_{1}^{2}-\sigma_{2}^{2}}(\sigma_{1}x^{2}-2\sigma_{2}xy+\sigma_{1}y^{2})=\frac{1}{\sigma_{1}^{2}-\sigma_{2}^{2}}\Big(\sigma_{1}\Big(y-\frac{\sigma_{2}}{\sigma_{1}}x\Big)^{2}+\Big(\sigma_{1}-\frac{\sigma_{2}^{2}}{\sigma_{1}}\Big)x^{2}\Big)\geq\frac{2u\log(p)}{\sigma_{1}}.

When $t^{2}=\frac{2u\log(p)}{\sigma_{1}}$, the tangent points are $\big(\pm\sqrt{2u\log(p)},\,\pm\frac{\sigma_{2}}{\sigma_{1}}\sqrt{2u\log(p)}\big)$. By Lemma 7.1, this verifies (I.1).
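The tangency computation is easy to confirm numerically: minimizing the quadratic form $(x,y)\Sigma^{-1}(x,y)^{T}$ along the line $x=c$ yields $c^{2}/\sigma_{1}$, attained at $y=(\sigma_{2}/\sigma_{1})c$. A minimal sketch (the values of $\sigma_{1},\sigma_{2},c$ below are arbitrary illustrations):

```python
import numpy as np

# Minimize (x, y) Sigma^{-1} (x, y)^T over the line x = c and compare
# with the closed form c^2 / sigma1, attained at y = (sigma2/sigma1) c.
sigma1, sigma2, c = 1.0, 0.4, 2.5
Sigma_inv = np.linalg.inv(np.array([[sigma1, sigma2], [sigma2, sigma1]]))
ys = np.linspace(-10.0, 10.0, 200001)
q = Sigma_inv[0, 0] * c ** 2 + 2 * Sigma_inv[0, 1] * c * ys + Sigma_inv[1, 1] * ys ** 2
print(q.min(), c ** 2 / sigma1)                # nearly identical
print(ys[q.argmin()], sigma2 / sigma1 * c)     # minimizer ~ (sigma2/sigma1) c
```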

For (I.2), when $r<u$, the center of the bivariate normal lies in the region that rejects variable $j$ as a signal, so the probability on the left-hand side of (I.2) is $L_{p}$. When $r>u$, we need to find the $t$ such that the ellipsoid $(x-\beta_{j},y)\Sigma^{-1}(x-\beta_{j},y)^{T}=t^{2}$ is tangent to either $x=\pm\sqrt{2u\log(p)}$ or $y=\pm x$. When the ellipsoid intersects $x=\pm\sqrt{2u\log(p)}$,

t^{2}=\frac{1}{\sigma_{1}^{2}-\sigma_{2}^{2}}\Big(\sigma_{1}\Big(y-\frac{\sigma_{2}}{\sigma_{1}}(x-\beta_{j})\Big)^{2}+\Big(\sigma_{1}-\frac{\sigma_{2}^{2}}{\sigma_{1}}\Big)(x-\beta_{j})^{2}\Big)\geq\frac{2(\sqrt{u}-\sqrt{r})^{2}\log(p)}{\sigma_{1}},

therefore, they are tangent at $\big(\pm\sqrt{2u\log(p)},\,\frac{\sigma_{2}}{\sigma_{1}}(\pm\sqrt{2u\log(p)}-\beta_{j})\big)$ when $t^{2}=\frac{2(\sqrt{u}-\sqrt{r})^{2}\log(p)}{\sigma_{1}}$.

Meanwhile, since the major and minor axes of the ellipsoid are parallel to $y=\pm x$, the tangent points of the ellipsoid with the lines $y=\pm x$ must be $(\beta_{j}/2,-\beta_{j}/2)$ and $(\beta_{j}/2,\beta_{j}/2)$, which give $t^{2}=\frac{r\log(p)}{\sigma_{1}+\sigma_{2}}$ and $\frac{r\log(p)}{\sigma_{1}-\sigma_{2}}$, respectively. From here we can conclude that the "distance" between the center of the normal distribution and the region that rejects variable $j$ as a signal is

\min\Big\{\frac{2(\sqrt{r}-\sqrt{u})_{+}^{2}\log(p)}{\sigma_{1}},\,\frac{r\log(p)}{\sigma_{1}+\sigma_{2}},\,\frac{r\log(p)}{\sigma_{1}-\sigma_{2}}\Big\}.

By Lemma 7.1, we know

\mathbb{P}\big(|Z_{j}|\leq\sqrt{2u\log(p)}\text{ or }|Z_{j}|<|\tilde{Z}_{j}|\,\big|\,\beta_{j}=\sqrt{2r\log(p)}\big)=L_{p}p^{-\min\{(\sqrt{r}-\sqrt{u})_{+}^{2}/\sigma_{1},\,r/(2\max\{\sigma_{1}+\sigma_{2},\sigma_{1}-\sigma_{2}\})\}},

which verifies (I.2) and completes the proof of Lemma I.1.
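As a numerical illustration of (I.2) (again a sketch with arbitrary parameter values, not part of the proof), one can draw $(Z_{j},\tilde{Z}_{j})$ from the bivariate normal in Lemma I.1 and compare the empirical frequency of the event in (I.2) with $p$ raised to the claimed exponent; the two agree up to $L_{p}$ factors:

```python
import numpy as np

rng = np.random.default_rng(1)

# (Z_j, Z_j~) ~ N_2((beta_j, 0), Sigma) with beta_j = sqrt(2r log p).
p, u, r = 10 ** 4, 0.5, 1.2
sigma1, sigma2 = 1.0, 0.3
beta_j = np.sqrt(2 * r * np.log(p))
t_u = np.sqrt(2 * u * np.log(p))
Sigma = np.array([[sigma1, sigma2], [sigma2, sigma1]])

Z = rng.multivariate_normal([beta_j, 0.0], Sigma, size=10 ** 6)
miss = (np.abs(Z[:, 0]) <= t_u) | (np.abs(Z[:, 0]) < np.abs(Z[:, 1]))

expo = min((np.sqrt(r) - np.sqrt(u)) ** 2 / sigma1,
           r / (2 * max(sigma1 + sigma2, sigma1 - sigma2)))
print(miss.mean(), p ** (-expo))  # same order, up to L_p factors
```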

References

  • Arias-Castro et al. (2011) Ery Arias-Castro, Emmanuel J Candès, and Yaniv Plan. Global testing under sparse alternatives: ANOVA, multiple comparisons and the higher criticism. The Annals of Statistics, 39(5):2533–2556, 2011.
  • Barber and Candès (2015) Rina Foygel Barber and Emmanuel J Candès. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055–2085, 2015.
  • Barnett et al. (2017) Ian Barnett, Rajarshi Mukherjee, and Xihong Lin. The generalized higher criticism for testing SNP-set effects in genetic association studies. Journal of the American Statistical Association, 112(517):64–76, 2017.
  • Benjamini and Hochberg (1995) Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, 1995.
  • Candes et al. (2018) Emmanuel Candes, Yingying Fan, Lucas Janson, and Jinchi Lv. Panning for gold: 'model-X' knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3):551–577, 2018.
  • Dai et al. (2022) Chenguang Dai, Buyu Lin, Xin Xing, and Jun S Liu. False discovery rate control via data splitting. Journal of the American Statistical Association, pages 1–18, 2022.
  • Donoho and Jin (2004) David Donoho and Jiashun Jin. Higher criticism for detecting sparse heterogeneous mixtures. The Annals of Statistics, 32(3):962–994, 2004.
  • Donoho and Jin (2015) David Donoho and Jiashun Jin. Special invited paper: Higher criticism for large-scale inference, especially for rare and weak effects. Statistical Science, pages 1–25, 2015.
  • Fan and Lv (2008) Jianqing Fan and Jinchi Lv. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849–911, 2008.
  • Fan et al. (2019) Yingying Fan, Emre Demirkaya, Gaorong Li, and Jinchi Lv. RANK: large-scale inference with graphical nonlinear knockoffs. Journal of the American Statistical Association, pages 1–43, 2019.
  • Genovese et al. (2012) Christopher R Genovese, Jiashun Jin, Larry Wasserman, and Zhigang Yao. A comparison of the lasso and marginal regression. Journal of Machine Learning Research, 13(Jun):2107–2143, 2012.
  • Javanmard and Javadi (2019) Adel Javanmard and Hamid Javadi. False discovery rate control via debiased lasso. Electronic Journal of Statistics, 13(1):1212–1253, 2019.
  • Javanmard and Montanari (2018) Adel Javanmard and Andrea Montanari. Online rules for control of false discovery rate and false discovery exceedance. The Annals of Statistics, 46(2):526–554, 2018.
  • Ji and Jin (2012) Pengsheng Ji and Jiashun Jin. UPS delivers optimal phase diagram in high-dimensional variable selection. The Annals of Statistics, 40(1):73–103, 2012.
  • Jin and Ke (2016) Jiashun Jin and Zheng Tracy Ke. Rare and weak effects in large-scale inference: methods and phase diagrams. Statistica Sinica, pages 1–34, 2016.
  • Jin et al. (2014) Jiashun Jin, Cun-Hui Zhang, and Qi Zhang. Optimality of graphlet screening in high dimensional variable selection. The Journal of Machine Learning Research, 15(1):2723–2772, 2014.
  • Ke and Wang (2021) Zheng Tracy Ke and Longlin Wang. A comparison of Hamming errors of representative variable selection methods. In International Conference on Learning Representations, 2021.
  • Ke and Yang (2017) Zheng Tracy Ke and Fan Yang. Covariate assisted variable ranking. arXiv preprint arXiv:1705.10370, 2017.
  • Ke et al. (2014) Zheng Tracy Ke, Jiashun Jin, and Jianqing Fan. Covariance assisted screening and estimation. The Annals of Statistics, 42(6):2202, 2014.
  • Ke et al. (2022) Zheng Tracy Ke, Jun S. Liu, and Yucong Ma. The de-randomized Gaussian mirror and its power analysis. Manuscript, 2022.
  • Li and Fithian (2021) Xiao Li and William Fithian. Whiteout: when do fixed-x knockoffs fail? arXiv preprint arXiv:2107.06388, 2021.
  • Liu and Rigollet (2019) Jingbo Liu and Philippe Rigollet. Power analysis of knockoff filters for correlated designs. In Advances in Neural Information Processing Systems, pages 15420–15429, 2019.
  • Spector and Janson (2022) Asher Spector and Lucas Janson. Powerful knockoffs via minimizing reconstructability. The Annals of Statistics, 50(1):252–276, 2022.
  • Su et al. (2017) Weijie Su, Małgorzata Bogdan, and Emmanuel Candes. False discoveries occur early on the lasso path. The Annals of Statistics, 45(5):2133–2150, 2017.
  • Tibshirani (1996) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  • Wang and Janson (2022) Wenshuo Wang and Lucas Janson. A high-dimensional power analysis of the conditional randomization test and knockoffs. Biometrika, 109(3):631–645, 2022.
  • Weinstein et al. (2017) Asaf Weinstein, Rina Barber, and Emmanuel Candes. A power and prediction analysis for knockoffs with lasso statistics. arXiv preprint arXiv:1712.06465, 2017.
  • Weinstein et al. (2021) Asaf Weinstein, Weijie J Su, Małgorzata Bogdan, Rina F Barber, and Emmanuel J Candès. A power analysis for knockoffs with the lasso coefficient-difference statistic. arXiv preprint arXiv:2007.15346, 2021.
  • Xing et al. (2023) Xin Xing, Zhigen Zhao, and Jun S Liu. Controlling false discovery rate using Gaussian mirrors. Journal of the American Statistical Association, 118(541):222–241, 2023.
  • Zhao and Yu (2006) Peng Zhao and Bin Yu. On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541–2563, 2006.