
Feature Selection using e-values

Subhabrata Majumdar    Snigdhansu Chatterjee
Abstract

In the context of supervised parametric models, we introduce the concept of e-values. An e-value is a scalar quantity that represents the proximity of the sampling distribution of parameter estimates in a model trained on a subset of features to that of the model trained on all features (i.e. the full model). Under general conditions, a rank ordering of e-values separates models that contain all essential features from those that do not.

The e-values are applicable to a wide range of parametric models. We use data depths and a fast resampling-based algorithm to implement a feature selection procedure using e-values, providing consistency results. For a p-dimensional feature space, this procedure requires fitting only the full model and evaluating p+1 models, as opposed to the traditional requirement of fitting and evaluating 2^{p} models. Through experiments across several model settings and synthetic and real datasets, we establish the e-values method as a promising general alternative to existing model-specific methods of feature selection.

Feature selection, model selection, data depth, bootstrap

1 Introduction

In the era of big data, feature selection in supervised statistical and machine learning (ML) models helps cut through the noise of superfluous features, provides storage and computational benefits, and improves model interpretability. Model-based feature selection can be divided into two categories (Guyon & Elisseeff, 2003): wrapper methods that evaluate models trained on multiple feature sets (Schwarz, 1978; Shao, 1996), and embedded methods that combine feature selection and training, often through sparse regularization (Tibshirani, 1996; Zou, 2006; Zou & Hastie, 2005). Both categories are extremely well studied for independent data models, but each has its own challenges. Navigating an exponentially growing feature space using wrapper methods is NP-hard (Natarajan, 1995), and case-specific search strategies such as k-greedy, branch-and-bound, and simulated annealing are needed. Sparse penalized embedded methods can tackle high-dimensional data, but have inferential and algorithmic issues, such as biased Lasso estimates (Zhang & Zhang, 2014) and the use of convex relaxations to compute approximate local solutions (Wang et al., 2013; Zou & Li, 2008).

Feature selection in dependent data models has received comparatively less attention. Existing wrapper and embedded methods have been adapted to dependent data scenarios, such as mixed effect models (Meza & Lahiri, 2005; Nguyen & Jiang, 2014; Peng & Lu, 2012) and spatial models (Huang et al., 2010; Lee & Ghosh, 2009). However, these adaptations suffer from the same issues as their independent data counterparts. If anything, the higher computational overhead of model training makes wrapper methods even harder to implement in this setting.

In this paper, we propose the framework of e-values as a common principle for performing best subset feature selection in a range of parametric models covering independent and dependent data settings. In essence ours is a wrapper method, although it determines the important features affecting the response by training a single model: the one with all features, i.e. the full model. We achieve this by utilizing the information encoded in the distribution of model parameter estimates. Notwithstanding recent efforts (Lai et al., 2015; Singh et al., 2007), fully data-driven parameter distributions have generally been underutilized in data science. In our work, e-values score a candidate model by quantifying the similarity between the sampling distributions of parameters in that model and in the full model. The sampling distribution of a parameter estimate is its distribution across the random data samples from which the estimate is calculated. We access sampling distributions using the efficient Generalized Bootstrap technique (Chatterjee & Bose, 2005), and summarize them using data depths, which are essentially point-to-distribution inverse distance functions (Tukey, 1975; Zuo, 2003; Zuo & Serfling, 2000), to compute e-values.

How e-values perform feature selection

Data depth functions quantify the inlyingness of a point in multivariate space with respect to a probability distribution or a multivariate point cloud, and have seen diverse applications in the literature (Jornsten, 2004; Narisetty & Nair, 2016; Rousseeuw & Hubert, 1998). As an example, consider the Mahalanobis depth, defined as

D(x,F)=[1+(x-\mu_{F})^{\top}\Sigma_{F}^{-1}(x-\mu_{F})]^{-1},

for x\in\mathbb{R}^{p} and F a p-dimensional probability distribution with mean \mu_{F} and covariance matrix \Sigma_{F}. When x is close to \mu_{F}, D(x,F) takes a high value close to 1. On the other hand, as \|x\|\rightarrow\infty, the depth at x approaches 0. Thus, D(x,F) quantifies the proximity of x to F. Points with high depth are situated 'deep inside' the probability distribution F, while low-depth points sit at its periphery.
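As a quick illustration, here is a minimal NumPy sketch of the Mahalanobis depth above (the function name and example values are ours):

import numpy as np

def mahalanobis_depth(x, mu, Sigma):
    # D(x, F) = 1 / (1 + (x - mu)' Sigma^{-1} (x - mu))
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return 1.0 / (1.0 + diff @ np.linalg.solve(Sigma, diff))

# Depth is near 1 close to the mean and decays to 0 far away from it.
mu, Sigma = np.zeros(2), np.eye(2)
print(mahalanobis_depth([0.1, 0.0], mu, Sigma))    # close to 1
print(mahalanobis_depth([10.0, 10.0], mu, Sigma))  # close to 0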

Given a depth function D, we define the e-value as the mean depth of the finite-sample estimate of a parameter \beta under a candidate model \mathcal{M}, with respect to the sampling distribution of the full model estimate \hat{\beta}:

e(\mathcal{M})=\mathbb{E}D(\hat{\beta}_{\mathcal{M}},[\hat{\beta}]),

where \hat{\beta}_{\mathcal{M}} is the estimate of \beta assuming model \mathcal{M}, [\hat{\beta}] is the sampling distribution of \hat{\beta}, and \mathbb{E}(\cdot) denotes expectation. For a large sample size n, the index set \mathcal{S}_{select} obtained by Algorithm 1 below elicits all non-zero features in the true parameter. We use the bootstrap to obtain multiple copies of \hat{\beta} and \hat{\beta}_{-j} for calculating the e-values in steps 1 and 5.

Algorithm 1 Best subset selection using e-values
1. Obtain the full model e-value: e(\mathcal{M}_{full})=\mathbb{E}D(\hat{\beta},[\hat{\beta}]).

2. Set \mathcal{S}_{select}=\phi.

3. for j in 1:p

4.     Replace the j^{\text{th}} index of \hat{\beta} by 0; call the result \hat{\beta}_{-j}.

5.     Obtain e(\mathcal{M}_{-j})=\mathbb{E}D(\hat{\beta}_{-j},[\hat{\beta}]).

6.     if e(\mathcal{M}_{-j})<e(\mathcal{M}_{full})

7.         Set \mathcal{S}_{select}\leftarrow\{\mathcal{S}_{select},j\}.

Figure 1.1: The e-value method. Solid and dotted ellipses represent Mahalanobis depth contours at some \alpha>0 for sample sizes n_{1},n_{2}; n_{2}>n_{1}.

As an example, consider a linear regression with two features, Gaussian errors \epsilon\sim N(0,1), and the following candidate models (Figure 1.1). We denote by \Theta_{i} the domain of parameters considered in model \mathcal{M}_{i}, i=1,\ldots,4.

\mathcal{M}_{1}:\ Y=X_{1}\beta_{1}+X_{2}\beta_{2}+\epsilon;\quad\Theta_{1}=\mathbb{R}^{2},
\mathcal{M}_{2}:\ Y=X_{1}\beta_{1}+\epsilon;\quad\Theta_{2}=\mathbb{R}\times\{0\},
\mathcal{M}_{3}:\ Y=X_{2}\beta_{2}+\epsilon;\quad\Theta_{3}=\{0\}\times\mathbb{R},
\mathcal{M}_{4}:\ Y=\epsilon;\quad\Theta_{4}=\{(0,0)^{T}\}.

Let \beta_{0}=(5,0)^{T} be the true parameter. The full model sampling distribution, \hat{\beta}\sim\mathcal{N}(\beta_{0},(X^{T}X)^{-1}), becomes more concentrated around \beta_{0} as the sample size grows. Thus the depths at points along the (red) line \beta_{1}=0 and at the origin (red point) become smaller, and the mean depths for \mathcal{M}_{3} and \mathcal{M}_{4} approach 0. On the other hand, since depth functions are affine invariant (Zuo & Serfling, 2000), the mean depths calculated over the parameter spaces of \mathcal{M}_{1} (blue surface) and \mathcal{M}_{2} (blue line) remain stable and do not vanish as n\rightarrow\infty. Thus, the e-values of the 'good' models \mathcal{M}_{1},\mathcal{M}_{2}, i.e. the models whose parameter spaces contain \beta_{0}, separate from those of the 'bad' models \mathcal{M}_{3},\mathcal{M}_{4} more and more as n grows. Algorithm 1 is built on this separation, together with a rank ordering of the 'good' models by parsimony (Theorem 5.1).
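The separation described above can be checked numerically. The sketch below is ours and simplifies the machinery of later sections: it uses a plain row-resampling bootstrap in place of the generalized bootstrap of Section 4, and Mahalanobis depth with respect to the bootstrap cloud; the model labels follow \mathcal{M}_{1}-\mathcal{M}_{4} above.

import numpy as np

rng = np.random.default_rng(0)

def depth(x, cloud):
    # Mahalanobis depth of x w.r.t. the empirical distribution of `cloud`
    mu, S = cloud.mean(axis=0), np.cov(cloud, rowvar=False)
    d = x - mu
    return 1.0 / (1.0 + d @ np.linalg.solve(S, d))

def evalues(n, beta0=(5.0, 0.0), R=500):
    X = rng.normal(size=(n, 2))
    y = X @ np.array(beta0) + rng.normal(size=n)
    # bootstrap copies of the full-model least-squares estimate
    boots = []
    for _ in range(2 * R):
        idx = rng.integers(0, n, size=n)
        boots.append(np.linalg.lstsq(X[idx], y[idx], rcond=None)[0])
    T, T1 = np.array(boots[:R]), np.array(boots[R:])
    # zero-out masks encoding the parameter space of each candidate model
    masks = {"M1": (1, 1), "M2": (1, 0), "M3": (0, 1), "M4": (0, 0)}
    return {m: np.mean([depth(b * np.array(mask), T) for b in T1])
            for m, mask in masks.items()}

for n in (100, 1000):
    print(n, evalues(n))
# e-values of M1 and M2 remain comparable as n grows,
# while those of M3 and M4 shrink towards 0.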

2 Related work

Feature selection is a widely studied area in statistics and ML. The vast amount of literature on this topic includes classics like the Bayesian Information Criterion (Schwarz, 1978, BIC) or Lasso (Tibshirani, 1996), and recent advances such as Mixed Integer Optimization (Bertsimas et al., 2016, MIO). To deal with the increasing scale and complexity of data in recent applications, newer streams of work have also materialized—such as nonlinear and model-independent methods (Song et al., 2012), model averaging (Fragoso et al., 2018), and dimension reduction techniques (Ma & Zhu, 2013). We refer the reader to Guyon & Elisseeff (2003) and the literature review in Bertsimas et al. (2016) for a broader overview of feature selection methods, and focus on the three lines of work most related to our formulation.

Shao (1996) first proposed using the bootstrap for feature selection, with the selection criterion being the squared residual averaged over a large number of parameter estimates resampled via an m-out-of-n bootstrap. Leave-One-Covariate-Out (Lei et al., 2018, LOCO) is based on the idea of sample splitting: the full model and leave-one-covariate-out models are trained on one part of the data, and the rest of the data are used to calculate the importance of a feature as the median difference in predictive performance between the full model and the corresponding LOCO model. Finally, Barber & Candes (2015) introduced the powerful idea of Knockoff filters, in which a 'knockoff' version of the original input dataset imitating its correlation structure is constructed. This method is explicitly able to control the False Discovery Rate.

Our framework of e-values shares some similarities with the above methods, such as the use of the bootstrap (as in Shao (1996)), feature-specific statistics (as in LOCO and Knockoffs), and evaluation of p+1 models (as in LOCO). However, e-values are also significantly different. They do not suffer from the high computational burden of refitting the model over multiple bootstrap samples (unlike Shao (1996)), are not conditional on a data split (unlike LOCO), and have a very general theoretical formulation (unlike Knockoffs). Most importantly, unlike all three methods discussed above, e-values require fitting only a single model, and they work for dependent data models.

Previously, VanderWeele & Ding (2017) used the term 'E-value' in the context of sensitivity analysis. However, their definition is quite different from ours. We see our e-values as a means to evaluate a feature with reference to a model. There are some parallels to the well-known p-values used for hypothesis testing, but we see the e-value as a more general quantity that covers dependent data situations, as well as general estimation and hypothesis testing problems. While the current paper is an application of e-values to feature selection, we plan to build on the framework introduced here in other applications, including group feature selection, hypothesis testing, and multiple testing problems.

3 Preliminaries

For a positive integer n\geq 1, let \mathcal{Z}_{n}=\{Z_{1},\ldots,Z_{n}\} be an observable array of random variables that are not necessarily independent. We parametrize \mathcal{Z}_{n} using a parameter vector \theta\in\Theta\subseteq\mathbb{R}^{p} and energy functions \{\psi_{i}(\theta,Z_{i}):1\leq i\leq n\}. We assume that there is a true unknown parameter vector \theta_{0}, which is the unique minimizer of \Psi_{n}(\theta)=\mathbb{E}\sum_{i=1}^{n}\psi_{i}(\theta,Z_{i}).
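For instance, in linear regression with Z_{i}=(x_{i},y_{i}) and squared-error loss, one may take

\psi_{i}(\theta,Z_{i})=(y_{i}-x_{i}^{T}\theta)^{2},

so that \Psi_{n}(\theta)=\mathbb{E}\sum_{i=1}^{n}(y_{i}-x_{i}^{T}\theta)^{2} is minimized at the true coefficient vector \theta_{0}; likelihood-based models fit the same template with \psi_{i} taken as the negative log-likelihood contribution of Z_{i}.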

Models and their characterization

Let \mathcal{S}_{*}=\cup_{\theta\in\Theta}\operatorname*{supp}(\theta) be the common non-zero support of all estimable parameters \theta, where \operatorname*{supp}(\theta) denotes the support of \theta. In this setup, we associate a model \mathcal{M} with two quantities: (a) the set of indices \mathcal{S}\subseteq\mathcal{S}_{*} with unknown and estimable parameter values, and (b) an ordered vector of known constants C=(C_{j}:j\notin\mathcal{S}) at indices not in \mathcal{S}. Thus a generic parameter vector for the model \mathcal{M}\equiv(\mathcal{S},C), denoted by \theta_{m}\in\Theta_{m}\subseteq\Theta=\prod_{j}\Theta_{j}, consists of unknown and known entries:

\theta_{mj}=\begin{cases}\text{unknown }\theta_{mj}\in\Theta_{j}&\text{for }j\in\mathcal{S},\\ \text{known }C_{j}\in\Theta_{j}&\text{for }j\notin\mathcal{S}.\end{cases}\qquad(3.3)

Given the above formulation, we characterize a model.

Definition 3.1.

A model \mathcal{M} is called adequate if \sum_{j\notin\mathcal{S}}|C_{j}-\theta_{0j}|=0. A model that is not adequate is called an inadequate model.

By definition the full model is always adequate, as is the model corresponding to the singleton set containing the true parameter. Thus the set of adequate models is non-empty by construction.

Another important notion is the one of nested models.

Definition 3.2.

We consider a model \mathcal{M}_{1} to be nested in \mathcal{M}_{2}, written \mathcal{M}_{1}\prec\mathcal{M}_{2}, if \mathcal{S}_{1}\subset\mathcal{S}_{2} and C_{2} is a subvector of C_{1}.

If a model is adequate, then any model it is nested in is also adequate. In the context of feature selection this yields a rank ordering: the most parsimonious adequate model, with \mathcal{S}=\operatorname*{supp}(\theta_{0}) and C=0_{p-|\mathcal{S}|}, is nested in all other adequate models, and all models nested within it are inadequate.

Estimators

The estimator \hat{\theta}_{*} of \theta_{0} is obtained as a minimizer of the sample analog of \Psi_{n}(\theta):

\hat{\theta}_{*}=\operatorname*{arg\,min}_{\theta}\Psi_{n}(\theta)=\operatorname*{arg\,min}_{\theta}\sum_{i=1}^{n}\psi_{i}(\theta,Z_{i}).\qquad(3.4)

Under standard regularity conditions on the energy functions, a_{n}(\hat{\theta}_{*}-\theta_{0}) converges to a p-dimensional Gaussian distribution as n\rightarrow\infty, where a_{n}\uparrow\infty is a sequence of positive real numbers (Appendix A).

Remark 3.3.

For generic estimation methods based on likelihoods and estimating equations, the above holds with a_{n}\equiv n^{1/2}, resulting in the standard 'root-n' asymptotics.

The estimator in (3.4) corresponds to the full model \mathcal{M}_{*}, i.e. the model in which all indices in \mathcal{S}_{*} are estimated. For any other model \mathcal{M}, we simply augment the entries of \hat{\theta}_{*} at indices in \mathcal{S} with the elements of C elsewhere to obtain a model-specific coefficient estimate \hat{\theta}_{m}:

\hat{\theta}_{mj}=\begin{cases}\hat{\theta}_{*j}&\text{for }j\in\mathcal{S},\\ C_{j}&\text{for }j\notin\mathcal{S}.\end{cases}\qquad(3.7)

The logic behind this plug-in estimate is simple: for a candidate model \mathcal{M}, a joint distribution of its estimated parameters, i.e. [\hat{\theta}_{\mathcal{S}}], can be obtained from [\hat{\theta}_{*}] by taking its marginals at the indices in \mathcal{S}.
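A minimal sketch of the plug-in step (3.7), with illustrative names of our own:

import numpy as np

def plugin_estimate(theta_full, S, C):
    # Keep the full-model estimates at indices in S; fill the remaining
    # indices with the known constants C, in increasing index order.
    theta_m = np.array(theta_full, dtype=float)
    known = [j for j in range(len(theta_m)) if j not in set(S)]
    theta_m[known] = C
    return theta_m

theta_hat = np.array([1.2, -0.4, 0.3, 2.1])
print(plugin_estimate(theta_hat, S={0, 3}, C=[0.0, 0.0]))  # [1.2, 0., 0., 2.1]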

Depth functions

Data depth functions (Zuo & Serfling, 2000) quantify the closeness of a point in multivariate space to a probability distribution or data cloud. Formally, let \mathcal{G} denote the set of non-degenerate probability measures on \mathbb{R}^{p}. We consider D:\mathbb{R}^{p}\times\mathcal{G}\rightarrow[0,\infty) to be a data depth function if it satisfies the following properties:

(B1) Affine invariance: for any non-singular matrix A\in\mathbb{R}^{p\times p}, any b\in\mathbb{R}^{p}, and a random variable Y with distribution \mathbb{G}\in\mathcal{G}, D(x,\mathbb{G})=D(Ax+b,[AY+b]).

(B2) Lipschitz continuity: for any \mathbb{G}\in\mathcal{G}, there exist \delta>0 and \alpha\in(0,1), possibly depending on \mathbb{G}, such that |D(x,\mathbb{G})-D(y,\mathbb{G})|<\|x-y\|^{\alpha} whenever \|x-y\|<\delta.

(B3) Consider random variables Y_{n}\in\mathbb{R}^{p} such that [Y_{n}]\leadsto\mathbb{Y}\in\mathcal{G}. Then D(y,[Y_{n}]) converges uniformly to D(y,\mathbb{Y}), so that if Y\sim\mathbb{Y}, then \lim_{n\rightarrow\infty}\mathbb{E}D(Y_{n},[Y_{n}])=\mathbb{E}D(Y,\mathbb{Y})<\infty.

(B4) For any \mathbb{G}\in\mathcal{G}, \lim_{\|x\|\rightarrow\infty}D(x,\mathbb{G})=0.

(B5) For any \mathbb{G}\in\mathcal{G} with a point of symmetry \mu(\mathbb{G})\in\mathbb{R}^{p}, D(\cdot,\mathbb{G}) is maximized at \mu(\mathbb{G}):

D(\mu(\mathbb{G}),\mathbb{G})=\sup_{x\in\mathbb{R}^{p}}D(x,\mathbb{G})<\infty.

Moreover, depth decreases along any ray from \mu(\mathbb{G}) to x, i.e. for t\in(0,1) and x\in\mathbb{R}^{p},

D(x,\mathbb{G})<D(\mu(\mathbb{G})+t(x-\mu(\mathbb{G})),\mathbb{G})<D(\mu(\mathbb{G}),\mathbb{G}).

Conditions (B1), (B4) and (B5) are integral to the formal definition of data depth (Zuo & Serfling, 2000), while (B2) and (B3) implicitly arise for several depth functions (Mosler, 2013). We require only a subset of (B1)-(B5) for the theory in this paper, but use data depths throughout for simplicity.

4 The e-values framework

The e-value of a model \mathcal{M} is the mean depth of \hat{\theta}_{m} with respect to [\hat{\theta}_{*}]: e_{n}(\mathcal{M})=\mathbb{E}D(\hat{\theta}_{m},[\hat{\theta}_{*}]).

4.1 Resampling approximation of e-values

Typically the distributions of the random quantities \hat{\theta}_{m} and \hat{\theta}_{*} are unknown and have to be elicited from the data. Because of the plug-in construction in (3.7), only [\hat{\theta}_{*}] needs to be approximated. We do this using the Generalized Bootstrap (Chatterjee & Bose, 2005, GBS). Given an exchangeable array of non-negative resampling weights \mathcal{W}_{r}=(\mathbb{W}_{r1},\ldots,\mathbb{W}_{rn})^{T}\in\mathbb{R}^{n}, independent of the data, the GBS estimator \hat{\theta}_{r*} is the minimizer of

\Psi_{rn}(\theta)=\sum_{i=1}^{n}\mathbb{W}_{ri}\,\psi_{i}(\theta,Z_{i}).\qquad(4.1)

We assume the following conditions on the resampling weights and their interactions as n\rightarrow\infty:

\mathbb{E}\mathbb{W}_{r1}=1,\quad\mathbb{V}\mathbb{W}_{r1}=\tau_{n}^{2}=o(a_{n}^{2})\uparrow\infty,\quad\mathbb{E}\mathbb{W}_{r1}^{4}<\infty,\quad\mathbb{E}\mathbb{W}_{r1}\mathbb{W}_{r2}=O(n^{-1}),\quad\mathbb{E}\mathbb{W}_{r1}^{2}\mathbb{W}_{r2}^{2}\rightarrow 1.\qquad(4.2)

Many resampling schemes can be described as GBS, such as the m-out-of-n bootstrap (Bickel & Sakov, 2008) and the scale-enhanced wild bootstrap (Chatterjee, 2019). Under fairly weak regularity conditions on the first two derivatives \psi^{\prime}_{i} and \psi^{\prime\prime}_{i} of \psi_{i}, (a_{n}/\tau_{n})(\hat{\theta}_{r*}-\hat{\theta}_{*}) and a_{n}(\hat{\theta}_{*}-\theta_{0}) converge to the same weak limit in probability (see Appendix A).

Instead of repeatedly solving (4.1), we use model quantities computed while calculating \hat{\theta}_{*} to obtain a first-order approximation of \hat{\theta}_{r*}:

\hat{\theta}_{r*}=\hat{\theta}_{*}-\frac{\tau_{n}}{a_{n}}\left[\sum_{i=1}^{n}\psi_{i}^{\prime\prime}(\hat{\theta}_{*},Z_{i})\right]^{-1/2}\sum_{i=1}^{n}W_{ri}\,\psi_{i}^{\prime}(\hat{\theta}_{*},Z_{i})+R_{r},\qquad(4.3)

where \mathbb{E}_{r}\|R_{r}\|^{2}=o_{P}(1) and W_{ri}=(\mathbb{W}_{ri}-1)/\tau_{n}. Thus only Monte Carlo sampling of the weights is required to obtain the resamples. Being embarrassingly parallel, this results in significant computational speed gains.
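To make (4.1) concrete, below is a minimal GBS sketch of ours for least squares, where \psi_{i}(\theta,Z_{i}) is the squared error of observation i. For transparency it re-minimizes the weighted objective (4.1) for each weight draw instead of using the one-step approximation (4.3), and it uses Gamma weights with mean 1 and standard deviation \tau_{n} as one admissible choice satisfying (4.2); the paper's experiments (Section 6) use a different scaled-Gamma scheme.

import numpy as np

rng = np.random.default_rng(0)

def gbs_draws(X, y, R, tau_n):
    # Generalized bootstrap for least squares: each draw minimizes
    # sum_i W_ri * (y_i - x_i' theta)^2, with E[W_ri] = 1, SD[W_ri] = tau_n.
    n = len(y)
    draws = []
    for _ in range(R):
        w = rng.gamma(shape=1.0 / tau_n**2, scale=tau_n**2, size=n)
        Xw = X * w[:, None]
        draws.append(np.linalg.solve(X.T @ Xw, Xw.T @ y))  # weighted normal equations
    return np.array(draws)

# Example: approximate the sampling distribution of the full-model estimate.
n, p = 200, 5
X = rng.normal(size=(n, p))
theta0 = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
y = X @ theta0 + rng.normal(size=n)
T = gbs_draws(X, y, R=1000, tau_n=2.0)   # modest tau_n for a stable illustration
print(T.mean(axis=0))                    # centred near the least-squares fit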

To estimate e_{n}(\mathcal{M}), we obtain two independent sets of weights \{\mathcal{W}_{r};r=1,\ldots,R\} and \{\mathcal{W}_{r_{1}};r_{1}=1,\ldots,R_{1}\} for large integers R and R_{1}. We use the first set of resamples to obtain the distribution [\hat{\theta}_{r*}] that approximates [\hat{\theta}_{*}], and the second set to get the plug-in estimates \hat{\theta}_{r_{1}m}:

\hat{\theta}_{r_{1}mj}=\begin{cases}\hat{\theta}_{r_{1}*j}&\text{for }j\in\mathcal{S},\\ C_{j}&\text{for }j\notin\mathcal{S}.\end{cases}\qquad(4.6)

Finally, the resampling estimate of a model e-value is e_{rn}(\mathcal{M})=\mathbb{E}_{r_{1}}D(\hat{\theta}_{r_{1}m},[\hat{\theta}_{r*}]), where \mathbb{E}_{r_{1}} denotes expectation, conditional on the data, computed over the resamples indexed by r_{1}.

4.2 Fast algorithm for best subset selection

For best subset selection, we restrict attention to the model class \mathbb{M}_{0}=\{\mathcal{M}:C_{j}=0\ \forall\ j\notin\mathcal{S}\}, which only allows zero constant terms. In this setup our fast selection algorithm consists of only three stages: (a) fit the full model and estimate its e-value, (b) replace each covariate in turn by 0 and compute the e-values of all such reduced models, and (c) collect the covariates whose removal causes the e-value to decrease.

To fit the full model, we need to determine the estimable index set \mathcal{S}_{*}. When n>p, the choice is simple: \mathcal{S}_{*}=\{1,\ldots,p\}. When p>n, we need to ensure that p^{\prime}=|\mathcal{S}_{*}|<n, so that \hat{\theta}_{*} (properly centered and scaled) has a unique asymptotic distribution. Similar to past work on feature selection (Lai et al., 2015), we use a feature screening step before model fitting to achieve this (Section 5).

After obtaining \mathcal{S}_{*} and \hat{\theta}_{*}, for each of the p^{\prime}+1 models under consideration (the full model and all drop-one-feature models), we follow the recipe in Section 4.1 to obtain bootstrap e-values. This gives us all the components of a sample version of Algorithm 1, which we present as Algorithm 2.

Algorithm 2 Best subset feature selection using e-values and GBS

1. Fix the resampling standard deviation \tau_{n}.

2. Obtain two sets of GBS samples, \mathcal{T}=\{\hat{\theta}_{1*},\ldots,\hat{\theta}_{R*}\} and \mathcal{T}_{1}=\{\hat{\theta}_{1*},\ldots,\hat{\theta}_{R_{1}*}\}, using (4.1).

3. Calculate \hat{e}_{rn}(\mathcal{M}_{*})=\frac{1}{R_{1}}\sum_{r_{1}=1}^{R_{1}}D(\hat{\theta}_{r_{1}*},[\mathcal{T}]).

4. Set \hat{\mathcal{S}}_{0}=\phi.

5. for j in 1:p

6.     for r_{1} in 1:R_{1}

7.        Replace the j^{\text{th}} index of \hat{\theta}_{r_{1}*} by 0 to get \hat{\theta}_{r_{1},-j}.

8.     Calculate \hat{e}_{rn}(\mathcal{M}_{-j})=\frac{1}{R_{1}}\sum_{r_{1}=1}^{R_{1}}D(\hat{\theta}_{r_{1},-j},[\mathcal{T}]).

9.     if \hat{e}_{rn}(\mathcal{M}_{-j})<\hat{e}_{rn}(\mathcal{M}_{*})

10.        Set \hat{\mathcal{S}}_{0}\leftarrow\{\hat{\mathcal{S}}_{0},j\}.
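The following is a self-contained sketch of Algorithm 2 for linear regression, in our own notation. For numerical simplicity it uses the m-out-of-n bootstrap (a GBS special case noted in Section 4.1, with \tau_{n} of order \sqrt{n/m}) in place of weighted refits, together with Mahalanobis depth; the delta argument anticipates the shifted detection threshold discussed below. Function names and the floor on m are ours.

import numpy as np

rng = np.random.default_rng(0)

def boot_draws(X, y, R, m):
    # m-out-of-n bootstrap draws of the full-model least-squares estimate;
    # a special case of GBS with resampling s.d. tau_n of order sqrt(n/m)
    n = len(y)
    draws = []
    for _ in range(R):
        idx = rng.integers(0, n, size=m)
        draws.append(np.linalg.lstsq(X[idx], y[idx], rcond=None)[0])
    return np.array(draws)

def mean_depth(points, cloud):
    # average Mahalanobis depth of the rows of `points` w.r.t. `cloud`
    mu, S = cloud.mean(axis=0), np.cov(cloud, rowvar=False)
    diff = points - mu
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff)
    return float(np.mean(1.0 / (1.0 + d2)))

def evalue_select(X, y, tau_n, R=1000, R1=1000, delta=0.0):
    """Best subset selection with e-values (Algorithm 2), sketch version."""
    n, p = X.shape
    m = max(int(n / tau_n**2), 4 * p)   # match tau_n ~ sqrt(n/m); floor is a practical safeguard
    T = boot_draws(X, y, R, m)          # approximates [theta_hat_*]
    T1 = boot_draws(X, y, R1, m)        # second, independent set for plug-ins
    e_full = mean_depth(T1, T)
    selected = []
    for j in range(p):
        T1j = T1.copy()
        T1j[:, j] = 0.0                 # drop-feature-j plug-in estimates (C_j = 0)
        if mean_depth(T1j, T) < (1.0 + delta) * e_full:
            selected.append(j)
    return selected

# Synthetic check: only features 0 and 4 carry signal.
n, p = 600, 8
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0, 0, 0, -1.5, 0, 0, 0]) + rng.normal(size=n)
print(evalue_select(X, y, tau_n=np.log(n)))   # expected: [0, 4]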

Choice of \tau_{n}

The intermediate rate of divergence of the bootstrap standard deviation \tau_{n}, namely \tau_{n}\rightarrow\infty with \tau_{n}/a_{n}\rightarrow 0, is a necessary and sufficient condition for the consistency of GBS (Chatterjee & Bose, 2005), and for that of the bootstrap approximation of population e-values (Theorem 5.4). We select \tau_{n} using the following quantity, which we call the Generalized Bootstrap Information Criterion (GBIC):

\text{GBIC}(\tau_{n})=\sum_{i=1}^{n}\psi_{i}\left(\hat{\theta}(\hat{\mathcal{S}}_{0},\tau_{n}),Z_{i}\right)+\frac{\tau_{n}}{2}\left|\operatorname*{supp}\left(\hat{\theta}(\hat{\mathcal{S}}_{0},\tau_{n})\right)\right|,\qquad(4.7)

where \hat{\theta}(\hat{\mathcal{S}}_{0},\tau_{n}) is the refitted parameter vector on the index set \hat{\mathcal{S}}_{0} selected by running Algorithm 2 with \tau_{n}. We repeat this for a range of \tau_{n} values, and choose the index set corresponding to the \tau_{n} that gives the smallest \text{GBIC}(\tau_{n}). For our synthetic data experiments we take \tau_{n}=\log(n) and use GBIC, while for one real data example we use validation on a test set to select the optimal \tau_{n}, both with favorable results.
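A sketch of the GBIC-based choice of \tau_{n} for least squares, so that \psi_{i} is the squared error of observation i. Here select_fn stands for any implementation of Algorithm 2 called as select_fn(X, y, tau_n), for example the sketch given after Algorithm 2; the function names are ours.

import numpy as np

def gbic(X, y, S, tau_n):
    # GBIC (4.7) for least squares: residual sum of squares of the model
    # refitted on the selected index set S, plus (tau_n / 2) * |support|.
    S = sorted(S)
    if not S:
        return float(y @ y)
    theta_S = np.linalg.lstsq(X[:, S], y, rcond=None)[0]
    rss = float(np.sum((y - X[:, S] @ theta_S) ** 2))
    return rss + 0.5 * tau_n * int(np.sum(theta_S != 0))

def select_tau(X, y, select_fn, tau_grid):
    # Run the selection routine over a grid of tau_n values and keep the
    # feature set whose refit attains the smallest GBIC.
    best = None
    for tau in tau_grid:
        S = select_fn(X, y, tau)
        score = gbic(X, y, S, tau)
        if best is None or score < best[0]:
            best = (score, tau, S)
    return best   # (GBIC value, chosen tau_n, selected feature set)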

Detection threshold for finite samples

In practice, especially for smaller sample sizes, sampling uncertainty may make it difficult to ensure that the full model e-value exactly separates the null and non-null e-values, and small numbers of false positives or negatives may occur in the selected feature set \hat{\mathcal{S}}_{0}. One way to tackle this is to shift the detection threshold in Algorithm 2 by a small \delta:

\hat{e}^{\delta}_{rn}(\mathcal{M}_{*})=(1+\delta)\,\hat{e}_{rn}(\mathcal{M}_{*}).

To prioritize true positives or true negatives, \delta can be set to a positive or a negative value, respectively. In one of our experiments (Section 6.3), setting \delta=0 results in a small number of false positives due to a few null features having e-values close to the full model e-value. Setting \delta to small positive values gradually eliminates these false positives.

5 Theoretical results

We now investigate theoretical properties of e-values. Our first result separates inadequate models from adequate models at the population level, and gives a rank ordering of adequate models using their population e-values.

Theorem 5.1.

Under conditions (B1)-(B5), for a finite sequence of adequate models \mathcal{M}_{1}\prec\ldots\prec\mathcal{M}_{k} and any inadequate models \mathcal{M}_{k+1},\ldots,\mathcal{M}_{K}, we have for large n

e_{n}(\mathcal{M}_{1})>\ldots>e_{n}(\mathcal{M}_{k})>\max_{j\in\{k+1,\ldots,K\}}e_{n}(\mathcal{M}_{j}).

As n\rightarrow\infty, e_{n}(\mathcal{M}_{i})\rightarrow\mathbb{E}D(Y,[Y])<\infty, with Y having an elliptic distribution, if i\leq k; otherwise e_{n}(\mathcal{M}_{i})\rightarrow 0.

We define the data generating model as \mathcal{M}_{0}\prec\mathcal{M}_{*}, with estimable indices \mathcal{S}_{0}=\operatorname*{supp}(\theta_{0}) and constants C_{0}=0_{p-|\mathcal{S}_{0}|}. Then we have the following.

Corollary 5.2.

Assume the conditions of Theorem 5.1. Consider the restricted class of candidate models \mathbb{M}_{0} from Section 4.2, where all known coefficients are fixed at 0. Then, for large enough n, \mathcal{M}_{0}=\arg\max_{\mathcal{M}\in\mathbb{M}_{0}}e_{n}(\mathcal{M}).

Thus, when only models with known parameters set to 0 (the model class \mathbb{M}_{0}) are considered, the e-value is indeed maximized at the true model at the population level. However, there are still 2^{p} possible models. This is where the advantage of using e-values, namely their one-step nature, comes through.

Corollary 5.3.

Assume the conditions of Corollary 5.2. Consider the models \mathcal{M}_{-j}\in\mathbb{M}_{0} with \mathcal{S}_{-j}=\{1,\ldots,p\}\setminus\{j\} for j=1,\ldots,p. Then covariate j is a necessary component of \mathcal{M}_{0}, i.e. \mathcal{M}_{-j} is an inadequate model, if and only if for large n we have e_{n}(\mathcal{M}_{-j})<e_{n}(\mathcal{M}_{*}).

Dropping an essential feature from the full model makes the resulting model inadequate, so its e-value becomes very small for large enough n, whereas dropping a non-essential feature increases the e-value (Theorem 5.1). Thus, simply collecting the features whose removal decreases the e-value suffices for feature selection.

Following the above results, we establish model selection consistency of Algorithm 2 at the sample level: the probability that the one-step procedure selects the true feature set goes to 1, in the sense that if the procedure were repeated on a large number of datasets drawn from the data-generating distribution, it would recover the true feature set in all but a vanishing fraction of them.

Theorem 5.4.

Consider two sets of generalized bootstrap estimates of \hat{\theta}_{*}: \mathcal{T}=\{\hat{\theta}_{r*}:r=1,\ldots,R\} and \mathcal{T}_{1}=\{\hat{\theta}_{r_{1}*}:r_{1}=1,\ldots,R_{1}\}. Obtain sample e-values as

\hat{e}_{rn}(\mathcal{M})=\frac{1}{R_{1}}\sum_{r_{1}=1}^{R_{1}}D(\hat{\theta}_{r_{1}m},[\mathcal{T}]),\qquad(5.1)

where [\mathcal{T}] is the empirical distribution of the corresponding bootstrap samples, and the \hat{\theta}_{r_{1}m} are obtained as in Section 4.1. Consider the feature set \hat{\mathcal{S}}_{0}=\{j:\hat{e}_{rn}(\mathcal{M}_{-j})<\hat{e}_{rn}(\mathcal{M}_{*})\}. Then as n,R,R_{1}\rightarrow\infty, P_{2}(\hat{\mathcal{S}}_{0}=\mathcal{S}_{0})\rightarrow 1, where P_{2} denotes probability conditional on the data and the bootstrap samples.

Theorem 5.4 is contingent on the true model \mathcal{M}_{0} being a member of \mathbb{M}_{0}, that is, \mathcal{S}_{0}\subseteq\mathcal{S}_{*}. This is trivially ensured when n\geq p. If p>n, \mathbb{M}_{0} is the set of all possible models on the feature set selected by an appropriate screening procedure. In high-dimensional linear models, we use Sure Independence Screening (Fan & Lv, 2008, SIS) for this purpose. Given that \mathcal{S}_{*} is selected using SIS, Fan & Lv (2008) proved that, for constants C>0 and \kappa that depend on the minimum signal in \theta_{0},

P(\mathcal{M}_{0}\in\mathbb{M}_{0})\geq 1-O\left(\frac{\exp[-Cn^{1-2\kappa}]}{\log n}\right).\qquad(5.2)

For more complex models, model-free filter methods (Koller & Sahami, 1996; Zhu et al., 2011) can be used to obtain \mathcal{S}_{*}. For example, under mild conditions on the design matrix, the method of Zhu et al. (2011) is consistent:

P(|\mathcal{S}_{*}\cap\mathcal{S}_{0}^{c}|\geq r)\leq\left(1-\frac{r}{p+d}\right)^{d},\qquad(5.3)

where r and d are positive integers, with d=p being a good practical choice (Zhu et al., 2011, Theorem 3). Combining (5.2) or (5.3) with Theorem 5.4 as needed establishes asymptotically accurate recovery of \mathcal{S}_{0} through Algorithm 2.
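For reference, a minimal sketch of SIS-style screening by marginal correlation, which can supply \mathcal{S}_{*} before Algorithm 2 when p>n; the function name and the default screened-set size (a common n/\log n rule of thumb) are our choices.

import numpy as np

def sis_screen(X, y, d=None):
    # Rank features by absolute marginal correlation with the response
    # and keep the top d of them.
    n, p = X.shape
    if d is None:
        d = max(1, int(n / np.log(n)))
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.sort(np.argsort(corr)[::-1][:d])   # indices of the screened set S_*

# Example: p = 500 > n = 100; the two signal features should be retained.
rng = np.random.default_rng(0)
n, p = 100, 500
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 3] - 1.5 * X[:, 42] + rng.normal(size=n)
S_star = sis_screen(X, y)
print(len(S_star), 3 in S_star, 42 in S_star)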

6 Numerical experiments

We implement e-values using GBS with scaled resample weights W_{ri}\sim\text{Gamma}(1,1)-1 and resample sizes R=R_{1}=1000. We use Mahalanobis depth for all depth calculations. Mahalanobis depth is much less computation-intensive than other depth functions (Dyckerhoff & Mozharovskyi, 2016; Liu & Zuo, 2014), but is usually not preferred in applications because of its non-robustness. However, since we do not use any robustness properties of data depth, we are able to use it without concern. For each replication of each data setting and method, we compute performance metrics on test datasets of the same dimensions as the respective training dataset. All our results are based on 1000 such replications.

6.1 Feature selection in linear regression

Given a true coefficient vector \beta_{0}, we use the model Z\equiv(Y,X), Y=X\beta_{0}+\epsilon, with \epsilon\sim\mathcal{N}_{n}(0,\sigma^{2}I_{n}), n=500 and p=100. We generate the rows of X independently from \mathcal{N}_{p}(0,\Sigma_{X}), where \Sigma_{X} has the autoregressive correlation structure (\Sigma_{X})_{ij}=\rho^{|i-j|} with correlation coefficient \rho. Under this basic setup, we consider the following settings to evaluate the effects of different magnitudes of feature correlation in X, sparsity level in \beta_{0}, and noise level \sigma.

  • Setting 1 (effect of feature correlation): we repeat the above setup for \rho=0.1,\ldots,0.9, fixing \beta_{0}=(1.5,0.5,1,1.5,1,0_{p-5}) and \sigma=1;

  • Setting 2 (effect of sparsity level): we consider \beta_{0}=(1_{k},0_{p-k}) with varying sparsity levels k=5,10,15,20,25, fixing \rho=0.5 and \sigma=1;

  • Setting 3 (effect of noise level): we consider different noise levels (equivalent to different signal-to-noise ratios) by testing \sigma=0.3,0.5,\ldots,2.3, fixing \beta_{0}=(1_{5},0_{p-5}) and \rho=0.5.
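For reference, a short sketch of ours of the data-generating process used across the three settings: rows of X are drawn from \mathcal{N}_{p}(0,\Sigma_{X}) with (\Sigma_{X})_{ij}=\rho^{|i-j|}, and Y=X\beta_{0}+\epsilon.

import numpy as np

def simulate(n=500, p=100, rho=0.5, beta0=None, sigma=1.0, seed=0):
    # Design with correlation (Sigma_X)_{ij} = rho^{|i-j|}; Gaussian noise.
    rng = np.random.default_rng(seed)
    if beta0 is None:
        beta0 = np.r_[1.5, 0.5, 1.0, 1.5, 1.0, np.zeros(p - 5)]   # Setting 1 default
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    y = X @ beta0 + sigma * rng.normal(size=n)
    return X, y, beta0

X, y, beta0 = simulate(rho=0.5)
print(X.shape, y.shape)   # (500, 100) (500,)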

Figure 6.1: Performance metrics for n=500, p=100: top, middle and bottom rows correspond to simulation settings 1, 2 and 3, respectively. Competing methods include Lasso and SCAD-penalized regression, stepwise regression using BIC and backward deletion (Step), Knockoff filters (Barber & Candes, 2015), and Mixed Integer Optimization (Bertsimas et al., 2016, MIO).

To implement e-values, we repeat Algorithm 2 for \tau_{n}\in\{0.2,0.6,1,1.4,1.8\}\cdot\log(n)\sqrt{\log(p)}, and select the model having the lowest \text{GBIC}(\tau_{n}).

Figure 6.1 summarizes the comparison results. Across all three settings, our method consistently produces the most predictive model, i.e. the model with the lowest prediction error. It also almost always produces the sparsest model. Among the competitors, SCAD comes closest to e-values in Setting 2 on both metrics. However, SCAD appears more severely affected by a high noise level (high \sigma) or high feature correlation (high \rho). Lasso and Step tend to select models with many null variables and have higher prediction errors. The performance of the other two methods (MIO, Knockoffs) is middling.

Method e-value Lasso SCAD Step Knockoff MIO
Time 6.3 0.4 0.9 20.1 1.9 127.2
Table 6.1: Mean computation time (in seconds) for all methods.

We present computation times for Setting 1 averaged across all values of ρ\rho in Table 6.1. All computations were performed on a Windows desktop with an 8-core Intel Core-i7 6700K 4GHz CPU and 16GB RAM. For e-values, using a smaller number of bootstrap samples or running computations over bootstrap samples in parallel greatly reduces computation time with little to no loss of performance. However, we report its computation time over a single core similar to other methods. Sparse methods like Lasso and SCAD are much faster as expected. Our method has lower computation time than Step and MIO, and much better performance.

6.2 High-dimensional linear regression (p>n)

Figure 6.2: Performance metrics for n=100, p=500.

We generate data from the same setup as in Section 6.1, but with n=100 and p=500. We perform an initial screening of features using SIS, then apply e-values and the other methods on the SIS-selected predictor set. Figure 6.2 summarizes the results. In addition to the competing methods, we report metrics corresponding to the original SIS-selected predictor set as a baseline. Sparsity-wise, e-values produce the largest improvement over the SIS baseline among all methods, and track the true sparsity level closely. Both e-values and Knockoffs produce sparser estimates as \rho grows (Setting 1). However, unlike Knockoffs, our method maintains good prediction performance even at high feature correlation. In general e-values maintain good prediction performance, although the difference is less pronounced than in the low-dimensional case. Note that for k=25 in Setting 2, SIS screening produces overly sparse feature sets, affecting the prediction errors of all methods.

6.3 Mixed effect model

Method | Setting 1 (n_{i}=5, m=30): FPR FNR Acc MS | Setting 2 (n_{i}=10, m=60): FPR FNR Acc MS
e-value, \delta=0 | 9.4 0.0 76 2.33 | 0.0 0.0 100 2.00
e-value, \delta=0.01 | 6.7 0.0 82 2.22 | 0.0 0.0 100 2.00
e-value, \delta=0.05 | 1.0 0.0 97 2.03 | 0.0 0.0 100 2.00
e-value, \delta=0.1 | 0.3 0.0 99 2.01 | 0.0 0.0 100 2.00
e-value, \delta=0.15 | 0.0 0.0 100 2.00 | 0.0 0.0 100 2.00
SCAD (Peng & Lu, 2012), BIC | 21.5 9.9 49 2.26 | 1.5 1.9 86 2.10
SCAD (Peng & Lu, 2012), AIC | 17 11.0 46 2.43 | 1.5 3.3 77 2.20
SCAD (Peng & Lu, 2012), GCV | 20.5 6 49 2.30 | 1.5 3 79 2.18
SCAD (Peng & Lu, 2012), \sqrt{\log n/n} | 21 15.6 33 2.67 | 1.5 4.1 72 2.26
M-ALASSO (Bondell et al., 2010) | - - 73 - | - - 83 -
SCAD-P (Fan & Li, 2012) | - - 90 - | - - 100 -
rPQL (Hui et al., 2017) | - - 98 - | - - 99 -
Table 6.2: Performance comparison for mixed effect models. We compare e-values with a number of sparse penalized methods: (a) Peng & Lu (2012), which uses the SCAD penalty with different ways of selecting the regularization tuning parameter, (b) the adaptive lasso-based method of Bondell et al. (2010), (c) the SCAD-P method of Fan & Li (2012), and (d) regularized Penalized Quasi-Likelihood (Hui et al., 2017, rPQL). For comparison with Peng & Lu (2012), we present mean false positive rate (FPR), false negative rate (FNR), accuracy (Acc), and model size (MS), i.e. the number of non-zero fixed effects estimated. To compare with the other methods we use only Acc, since they did not report the remaining metrics.

We use the simulation setup of Peng & Lu (2012): Y=X\beta+RU+\epsilon, where the data Z\equiv(Y,X,R) consist of m independent groups of observations with multiple (n_{i}) dependent observations in each group, R being the within-group random effects design matrix. We consider 9 fixed effects and 4 random effects, with true fixed effect coefficient \beta_{0}=(1_{2},0_{7}) and random effect coefficient U drawn from \mathcal{N}_{4}(0,\Delta). The random effect covariance matrix \Delta has elements (\Delta)_{11}=9, (\Delta)_{21}=4.8, (\Delta)_{22}=4, (\Delta)_{31}=0.6, (\Delta)_{32}=(\Delta)_{33}=1, (\Delta)_{4j}=0 for j=1,\ldots,4, and the error variance of \epsilon is set at \sigma^{2}=1. The goal is to select the covariates of the fixed effect. We use two scenarios for our study: Setting 1, where the number of groups m is 30 and the number of observations in the i^{\text{th}} group is n_{i}=5, i=1,\ldots,m, and Setting 2, where m=60 and n_{i}=10. We generate 100 independent datasets for each setting. To implement e-values by minimizing \text{GBIC}(\tau_{n}), we consider \tau_{n}\in\{1,1.2,\ldots,4.8,5\}. To tackle small-sample issues in Setting 1 (Section 4.2), we repeat the model selection procedure using the shifted e-values \hat{e}^{\delta}_{rn}(\cdot) for \delta\in\{0,0.01,0.05,0.1,0.15\}.

Without shifted thresholds, e-values perform commendably in both settings. In Setting 2, they reach the best possible performance across all metrics. However, in a few replications of Setting 1, a small number of null input features had e-values only slightly lower than the full model e-value, resulting in increased FPR and model size. We experimented with shifted detection thresholds to mitigate this. As seen in Table 6.2, increasing \delta gradually eliminates the null features, and e-values reach perfect performance across all metrics at \delta=0.15.

7 Real data examples

Indian monsoon data

Figure 7.1: (a) Plot of t-statistics for the Indian monsoon data for features selected by at least one of the methods; dashed lines indicate standard Gaussian quantiles at tail probabilities 0.025 and 0.975. (b) Full vs. reduced model rolling forecast comparisons. (c) fMRI data: the smoothed surface obtained from p-values shows high spatial dependence in the right optic nerve, auditory nerves and auditory cortex (top left), left visual cortex (bottom right) and cerebellum (lower middle).

To identify the driving factors behind precipitation during the Indian monsoon season using e-values, we obtain data on 35 potential covariates (see Appendix D) from the National Climatic Data Center (NCDC) and National Oceanic and Atmospheric Administration (NOAA) repositories for 1978–2012. We consider annual medians of the covariates as fixed effects, log yearly rainfall at a weather station as the output feature, and include year-specific random intercepts. To implement e-values, we use projection depth (Zuo, 2003) and GBS resample sizes R=R_{1}=1000. We train our model on data from the years 1978–2002 and run e-values best subset selection for tuning parameters \tau_{n}\in\{0.05,0.1,\ldots,1\}. We consider two methods to select the best refitted model: (a) minimizing \text{GBIC}(\tau_{n}), and (b) minimizing forecasting errors on samples from 2003–2012.

Figure 7.1a plots the t-statistics for features from the best refitted models obtained by the above two methods. Minimizing GBIC and test error selects 32 and 26 covariates, respectively. The largest contributors are maximum temperature and elevation, which are related to precipitation through the Clausius-Clapeyron relation (Li et al., 2017; Singleton & Toumi, 2012). All other selected covariates have documented effects on the Indian monsoon (Krishnamurthy & Kinter III, 2003; Moon et al., 2013). Reduced model forecasts obtained from a rolling validation scheme (i.e. using data from 1978–2002 for 2003, 1979–2003 for 2004, and so on) have less bias across the testing years (Fig. 7.1b).

Spatio-temporal dependence analysis in fMRI data

This dataset is due to Wakeman & Henson (2015), where each of 19 subjects goes through 9 runs of consecutive visual tasks. Blood oxygen level readings are recorded across time as 3D images made of 64\times 64\times 33 (in total 135,168) voxels. Here we use the data from a single run and task on subject 1, and aim to estimate dependence patterns of the readings across 210 time points and areas of the brain.

We fit separate regressions at each voxel (Appendix E), with second-order autoregressive terms, neighboring voxel readings, and one-hot encoded visual task categories in the design matrix. After applying e-value feature selection, we compute the F-statistic at each voxel using the selected coefficients only, and obtain the corresponding p-values. Fig. 7.1c highlights voxels with p-values <0.05. The left and right visual cortex areas show high spatial dependence, with more dependence on the left side. Signals from the right visual field obtained by both eyes are processed by the left visual cortex; the lopsided dependence pattern suggests that visual signals from the right side led to a higher degree of processing in our subject's brain. We also see activity in the cerebellum, whose role in visual perception is well known (Calhoun et al., 2010; Kirschen et al., 2010).

8 Conclusion

In this work, we introduced a new paradigm for feature selection through e-values. The e-values can be particularly helpful in situations where model training is costly and potentially distributed across multiple servers (e.g. federated learning), so that a brute-force parallelized approach of training and evaluating multiple models is not practical or even possible.

There are three immediate extensions of the framework presented in this paper. Firstly, grouped e-values are of interest for leveraging prior structural information on the predictor set. There are no conceptual difficulties in evaluating overlapping or non-overlapping groups of predictors using e-values in place of individual predictors, although technical conditions may be required to ensure a rigorous implementation. Secondly, our current formulation of e-values relies on the number of samples exceeding the effective number of predictors, so that a unique covariance matrix of the parameter estimate exists asymptotically. When p>n, using a sure screening method to filter out unnecessary variables works well empirically, but this needs theoretical validation. Thirdly, instead of the mean depth, other functionals of the (empirical) depth distribution, such as quantiles, can be used as e-values. Similar to stability selection (Meinshausen & Buhlmann, 2010), it may be possible to use the intersection of the predictor sets obtained from a number of such functionals in Algorithm 2 as the final selected set of important predictors.

An effective implementation of e-values hinges on the choice of the bootstrap method and the tuning parameter \tau_{n}. To this end, we see an opportunity to enrich the research on empirical process methods in complex overparametrized models, such as deep neural networks (DNNs), which the e-values framework can build upon. Given the current push for interpretability and trustworthiness of DNN-based decision making systems, there is potential for tools to be developed within the general framework of e-values that provide local and global explanations of the outputs of large-scale deployed models in an efficient manner.

Acknowledgements

This work is part of the first author (SM)’s PhD thesis (Majumdar, 2017). He acknowledges the support of the University of Minnesota Interdisciplinary Doctoral Fellowship during his PhD. The research of SC is partially supported by the US National Science Foundation grants 1737918, 1939916 and 1939956 and a grant from Cisco Systems.

Reproducibility

Code and data for the experiments in this paper are available at https://github.com/shubhobm/e-values.

References

  • Barber & Candes (2015) Barber, R. F. and Candes, E. J. Controlling the false discovery rate via knockoffs. Ann. Statist., 43:2055–2085, 2015.
  • Bertsimas et al. (2016) Bertsimas, D., King, A., and Mazumder, R. Best Subset Selection via a Modern Optimization Lens. Ann. Statist., 44(2):813–852, 2016.
  • Bickel & Sakov (2008) Bickel, P. and Sakov, A. On the Choice of m in the m Out of n Bootstrap and its Application to Confidence Bounds for Extrema. Stat. Sinica, 18:967–985, 2008.
  • Bondell et al. (2010) Bondell, H. D., Krishna, A., and Ghosh, S. K. Joint variable selection for fixed and random effects in linear mixed-effects models. Biometrics, 66(4):1069–1077, 2010.
  • Calhoun et al. (2010) Calhoun, V. D., Adali, T., and Mcginty, V. B. fMRI activation in a visual-perception task: network of areas detected using the general linear model and independent components analysis. Neuroimage, 14:1080–1088, 2010.
  • Chatterjee (2019) Chatterjee, S. The scale enhanced wild bootstrap method for evaluating climate models using wavelets. Statist. Prob. Lett., 144:69–73, 2019.
  • Chatterjee & Bose (2005) Chatterjee, S. and Bose, A. Generalized bootstrap for estimating equations. Ann. Statist., 33(1):414–436, 2005.
  • Dietz & Chatterjee (2014) Dietz, L. R. and Chatterjee, S. Logit-normal mixed model for Indian monsoon precipitation. Nonlin. Processes Geophys., 21:939–953, sep 2014.
  • Dietz & Chatterjee (2015) Dietz, L. R. and Chatterjee, S. Investigation of Precipitation Thresholds in the Indian Monsoon Using Logit-Normal Mixed Models. In Lakshmanan, V., Gilleland, E., McGovern, A., and Tingley, M. (eds.), Machine Learning and Data Mining Approaches to Climate Science, pp.  239–246. Springer, 2015.
  • Dyckerhoff & Mozharovskyi (2016) Dyckerhoff, R. and Mozharovskyi, P. Exact computation of the halfspace depth. Comput. Stat. Data Anal., 98:19–30, 2016.
  • Fan & Lv (2008) Fan, J. and Lv, J. Sure Independence Screening for Ultrahigh Dimensional Feature Space. J. R. Statist. Soc. B, 70:849–911, 2008.
  • Fan & Li (2012) Fan, Y. and Li, R. Variable selection in linear mixed effects models. Ann. Statist., 40(4):2043–2068, 2012.
  • Fang et al. (1990) Fang, K. T., Kotz, S., and Ng, K. W. Symmetric multivariate and related distributions. Monographs on Statistics and Applied Probability, volume 36. Chapman and Hall Ltd., London, 1990.
  • Fragoso et al. (2018) Fragoso, T. M., Bertoli, W., and Louzada, F. Bayesian model averaging: A systematic review and conceptual classification. International Statistical Review, 86(1):1–28, 2018.
  • Ghosh et al. (2016) Ghosh, S., Vittal, H., Sharma, T., et al. Indian Summer Monsoon Rainfall: Implications of Contrasting Trends in the Spatial Variability of Means and Extremes. PLoS ONE, 11:e0158670, 2016.
  • Gosswami (2005) Gosswami, B. N. The Global Monsoon System: Research and Forecast, chapter The Asian Monsoon: Interdecadal Variability. World Scientific, 2005.
  • Guyon & Elisseeff (2003) Guyon, I. and Elisseeff, A. An Introduction to Variable and Feature Selection. J. Mach. Learn. Res., 3:1157–1182, 2003.
  • Huang et al. (2010) Huang, H.-C., Hsu, N.-J., Theobald, D. M., and Breidt, F. J. Spatial Lasso With Applications to GIS Model Selection. J. Comput. Graph. Stat., 19:963–983, 2010.
  • Hui et al. (2017) Hui, F. K. C., Muller, S., and Welsh, A. H. Joint Selection in Mixed Models using Regularized PQL. J. Amer. Statist. Assoc., 112:1323–1333, 2017.
  • Jornsten (2004) Jornsten, R. Clustering and classification based on the L_{1} data depth. J. Mult. Anal., 90:67–89, 2004.
  • Kirschen et al. (2010) Kirschen, M. P., Chen, S. H., and Desmond, J. E. Modality specific cerebro-cerebellar activations in verbal working memory: an fMRI study. Behav. Neurol., 23:51–63, 2010.
  • Knutti et al. (2010) Knutti, R., Furrer, R., Tebaldi, C., et al. Challenges in Combining Projections from Multiple Climate Models. J. Clim., 23:2739–2758, 2010.
  • Koller & Sahami (1996) Koller, D. and Sahami, M. Toward optimal feature selection. In 13th International Conference on Machine Learning, pp. 284–292, 1996.
  • Krishnamurthy et al. (2009) Krishnamurthy, C. K. B., Lall, U., and Kwon, H.-H. Changing Frequency and Intensity of Rainfall Extremes over India from 1951 to 2003. J. Clim., 22:4737–4746, 2009.
  • Krishnamurthy & Kinter III (2003) Krishnamurthy, V. and Kinter III, J. L. The Indian monsoon and its relation to global climate variability. In Rodo, X. and Comin, F. A. (eds.), Global Climate: Current Research and Uncertainties in the Climate System, pp.  186–236. Springer, 2003.
  • Lahiri (1992) Lahiri, S. N. Bootstrapping M-estimators of a multiple linear regression parameter. Ann. Statist., 20:1548–1570, 1992.
  • Lai et al. (2015) Lai, R. C. S., Hannig, J., and Lee, T. C. M. Generalized Fiducial Inference for Ultrahigh-Dimensional Regression. J. Amer. Statist. Assoc., 110(510):760–772, 2015.
  • Lee & Ghosh (2009) Lee, H. and Ghosh, S. Performance of Information Criteria for Spatial Models. J. Stat. Comput. Simul., 79:93–106, 2009.
  • Lei et al. (2018) Lei, J. et al. Distribution-Free Predictive Inference for Regression. J. Amer. Statist. Assoc., 113:1094–1111, 2018.
  • Li et al. (2017) Li, X., Wang, L., Guo, X., and Chen, D. Does summer precipitation trend over and around the Tibetan Plateau depend on elevation? Int. J. Climatol., 37:1278–1284, 2017.
  • Liu & Zuo (2014) Liu, X. and Zuo, Y. Computing halfspace depth and regression depth. Commun. Stat. Simul. Comput., 43(5):969–985, 2014.
  • Ma & Zhu (2013) Ma, Y. and Zhu, L. A review on dimension reduction. International Statistical Review, 81(1):134–150, 2013.
  • Majumdar (2017) Majumdar, S. An Inferential Perspective on Data Depth. PhD thesis, University of Minnesota, 2017.
  • Meinshausen & Buhlmann (2010) Meinshausen, N. and Buhlmann, P. Stability selection. J. R. Stat. Soc. B, 72:417–473, 2010.
  • Meza & Lahiri (2005) Meza, J. and Lahiri, P. A note on the C_{p} statistic under the nested error regression model. Surv. Methodol., 31:105–109, 2005.
  • Moon et al. (2013) Moon, J.-Y., Wang, B., Ha, K.-J., and Lee, J.-Y. Teleconnections associated with Northern Hemisphere summer monsoon intraseasonal oscillation. Clim. Dyn., 40(11-12):2761–2774, 2013.
  • Mosler (2013) Mosler, K. Depth statistics. In Becker, C., Fried, R., and Kuhnt, S. (eds.), Robustness and Complex Data Structures, pp.  17–34. Springer Berlin Heidelberg, 2013. ISBN 978-3-642-35493-9.
  • Narisetty & Nair (2016) Narisetty, N. N. and Nair, V. N. Extremal Depth for Functional Data and Applications. J. Amer. Statist. Assoc., 111:1705–1714, 2016.
  • Natarajan (1995) Natarajan, B. K. Sparse Approximate Solutions to Linear Systems. Siam. J. Comput., 24:227–234, 1995.
  • Nguyen & Jiang (2014) Nguyen, T. and Jiang, J. Restricted fence method for covariate selection in longitudinal data analysis. Biostatistics, 13(2):303–314, 2014.
  • Peng & Lu (2012) Peng, H. and Lu, Y. Model selection in linear mixed effect models. J. Multivar. Anal., 109:109–129, 2012.
  • Rousseeuw & Hubert (1998) Rousseeuw, P. J. and Hubert, M. Regression Depth. J. Amer. Statist. Assoc., 94:388–402, 1998.
  • Schwarz (1978) Schwarz, G. Estimating the dimension of a model. Ann. Statist., 6(2):461–464, 1978.
  • Shao (1996) Shao, J. Bootstrap model selection. J. Amer. Statist. Assoc., 91(434):655–665, 1996.
  • Singh et al. (2007) Singh, K., Xie, M., and Strawderman, W. E. Confidence distribution (CD) – distribution estimator of a parameter. In Complex Datasets and Inverse Problems: Tomography, Networks and Beyond, volume 54 of Lecture Notes-Monograph Series, pp. 132–150. IMS, 2007.
  • Singleton & Toumi (2012) Singleton, A. and Toumi, R. Super-Clausius-Clapeyron scaling of rainfall in a model squall line. Quat. J. R. Met. Soc., 139(671):334–339, 2012.
  • Song et al. (2012) Song, L., Smola, A., Gretton, A., Bedo, J., and Borgwardt, K. Feature selection via dependence maximization. Journal of Machine Learning Research, 13(47):1393–1434, 2012.
  • Tibshirani (1996) Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B, 58:267–288, 1996.
  • Trenberth (2011) Trenberth, K. E. Changes in precipitation with climate change. Clim. Res., 47:123–138, 2011.
  • Trenberth et al. (2003) Trenberth, K. E., Dai, A., Rasmussen, R. M., and Parsons, D. B. The changing character of precipitation. Bull. Am. Meteorol. Soc., 84:1205–1217, 2003.
  • Tukey (1975) Tukey, J. W. Mathematics and picturing data. In James, R. (ed.), Proceedings of the International Congress on Mathematics, volume 2, pp.  523–531, 1975.
  • VanderWeele & Ding (2017) VanderWeele, T. and Ding, P. Sensitivity Analysis in Observational Research: Introducing the E-Value. Ann Intern Med., 167(4):268–274, 2017.
  • Wakeman & Henson (2015) Wakeman, D. G. and Henson, R. N. A multi-subject, multi-modal human neuroimaging dataset. Scientif. Data, 2:article 15001, 2015.
  • Wang et al. (2005) Wang, B., Ding, Q., Fu, X., et al. Fundamental challenge in simulation and prediction of summer monsoon rainfall. Geophys. Res. Lett., 32:L15711, 2005.
  • Wang et al. (2013) Wang, L., Kim, Y., and Li, R. Calibrating Nonconvex Penalized Regression in Ultra-high Dimension. Ann. Statist., 41:2505–2536, 2013.
  • Zhang & Zhang (2014) Zhang, C.-H. and Zhang, S. S. Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Statist. Soc. B, 76(1):217–242, 2014.
  • Zhu et al. (2011) Zhu, L.-P., Li, L., Li, R., and Zhu, L.-X. Model-Free Feature Screening for Ultrahigh-Dimensional Data. J. Amer. Statist. Assoc., 106(496):1464–1475, 2011.
  • Zou (2006) Zou, H. The Adaptive Lasso and Its Oracle Properties. J. Amer. Statist. Assoc., 101:1418–1429, 2006.
  • Zou & Hastie (2005) Zou, H. and Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
  • Zou & Li (2008) Zou, H. and Li, R. One-step sparse estimates in nonconcave penalized likelihood models. Ann. Statist., 36:1509–1533, 2008.
  • Zuo (2003) Zuo, Y. Projection-based depth functions and associated medians. Ann. Statist., 31:1460–1490, 2003.
  • Zuo & Serfling (2000) Zuo, Y. and Serfling, R. General notions of statistical depth function. Ann. Statist., 28(2):461–482, 2000.

Appendix

Appendix A Consistency of Generalized Bootstrap

We first state a number of conditions on the energy functions \psi_{i}(\cdot,\cdot), under which we state and prove two results to ensure consistency of the estimation and bootstrap approximation procedures. In the context of the main paper, the conditions and results here ensure that the full model parameter estimate \hat{\theta}_{*} follows a Gaussian sampling distribution asymptotically, and that the Generalized Bootstrap (GBS) can be used to approximate this sampling distribution.

A.1 Technical conditions

Note that several alternative sets of conditions can be developed (Chatterjee & Bose, 2005; Lahiri, 1992), many of which are compatible with our results. However, for the sake of brevity and clarity, we only address the case where the energy function \psi_{i}(\cdot,\cdot) is smooth in its first argument; for example, the squared-error loss in linear regression and the negative log-likelihood in smooth parametric models are of this type. This case covers a vast number of models routinely considered in statistics and machine learning.

We often drop the second argument of the energy function, writing for example \psi_{i}(\theta)\equiv\psi_{i}(\theta,Z_{i}), and we use the notation \hat{\psi}_{ki} for \psi_{ki}(\hat{\theta}_{*}), k=0,1,2. Also, for any function h(\theta) evaluated at the true parameter value \theta_{*}, we write h\equiv h(\theta_{*}). When A and B are square matrices of identical dimensions, the notation B<A means that the matrix A-B is positive definite.

In a neighborhood of θ{\theta}_{*}, we assume the functions ψi\psi_{i} are thrice continuously differentiable in the first argument, with the successive derivatives denoted by ψki\psi_{ki}, k=0,1,2k=0,1,2. That is, there exists a δ>0\delta>0 such that for any θ=θ+t\theta={\theta}_{*}+t satisfying t<δ\|t\|<\delta we have

\frac{d}{d\theta}\psi_{i}(\theta):=\psi_{0i}(\theta)\in\mathbb{R}^{p},

and for the aa-th element of ψ0i(θ)\psi_{0i}(\theta), denoted by ψ0i(a)(θ)\psi_{0i(a)}(\theta), we have

\psi_{0i(a)}(\theta)=\psi_{0i(a)}(\theta_{*})+\psi_{1i(a)}(\theta_{*})t+2^{-1}t^{T}\psi_{2i(a)}(\theta_{*}+ct)t,

for a=1,pa=1,\ldots p, and some c(0,1)c\in(0,1) possibly depending on aa. We assume that for each nn, there is a sequence of σ\sigma-fields 12n\mathcal{F}_{1}\subset\mathcal{F}_{2}\ldots\subset\mathcal{F}_{n} such that {i=1jψ0i(θ),j}\{\sum_{i=1}^{j}\psi_{0i}({\theta}_{*}),\mathcal{F}_{j}\} is a martingale.

The spectral decomposition of \Gamma_{0n}:=\sum_{i=1}^{n}\mathbb{E}\psi_{0i}\psi_{0i}^{T} is given by \Gamma_{0n}=P_{0n}\Lambda_{0n}P_{0n}^{T}, where P_{0n}\in\mathbb{R}^{p\times p} is an orthogonal matrix whose columns contain the eigenvectors, and \Lambda_{0n} is a diagonal matrix containing the eigenvalues of \Gamma_{0n}. We assume that \Gamma_{0n} is positive definite, that is, all diagonal entries of \Lambda_{0n} are positive. We further assume that there is a constant \delta_{0}>0 such that \lambda_{min}(\Gamma_{0n})>\delta_{0} for all sufficiently large n.

Let \Gamma_{1i}(\theta_{*}) be the p\times p matrix whose a-th row is \mathbb{E}\psi_{1i(a)}; we assume this expectation exists. Define \Gamma_{1n}=\sum_{i=1}^{n}\Gamma_{1i}(\theta_{*}), and assume that \Gamma_{1n} is nonsingular for each n. The singular value decomposition of \Gamma_{1n} is given by \Gamma_{1n}=P_{1n}\Lambda_{1n}Q_{1n}^{T}, where P_{1n},Q_{1n}\in\mathbb{R}^{p\times p} are orthogonal matrices and \Lambda_{1n} is a diagonal matrix. We assume that the diagonal entries of \Lambda_{1n} are all positive, which implies that, in the population, the energy functional \Psi_{n}(\theta_{*}) actually attains a minimum at the true parameter value. For real numbers c we define \Lambda_{kn}^{c}, k=0,1, as the diagonal matrix whose j-th diagonal entry is that of \Lambda_{kn} raised to the power c, and correspondingly \Gamma_{1n}^{c}=P_{1n}\Lambda_{1n}^{c}Q_{1n}^{T}. We assume that there is a constant \delta_{1}>0 such that \lambda_{max}(\Gamma_{1n}^{T}\Gamma_{1n})<\delta_{1} for all sufficiently large n. Define the matrix A_{n}=\Gamma_{0n}^{-1/2}\Gamma_{1n}. We assume the following conditions:

  1. (C1)

    The minimum eigenvalue of AnTAnA^{T}_{n}A_{n} tends to infinity. That is, there is a sequence ana_{n}\uparrow\infty as nn\rightarrow\infty such that

    \lambda_{min}\bigl(\Gamma_{1n}\Gamma_{0n}^{-1}\Gamma_{1n}^{T}\bigr)\asymp a_{n}^{2}. (C.1)
  2. (C2)

    There exists a sequence of positive reals {γn}\{\gamma_{n}\} that is bounded away from zero, such that

    \lambda_{max}\bigl(\Gamma_{1n}^{-1}\Gamma_{0n}^{2}\Gamma_{1n}^{-T}\bigr)=o(\gamma_{n}^{-2})\ \text{ as } n\rightarrow\infty. (C.2)
  3. (C3)
    \mathbb{E}\left\|A_{n}^{-1}\bigl(\sum_{i=1}^{n}\psi_{1i}-\Gamma_{1n}\bigr)A_{n}^{-1}\right\|_{F}^{2}=o(p\gamma_{n}^{-2}), (C.3)

    where \|A\|_{F} denotes the Frobenius norm of the matrix A.

  4. (C4)

    For the symmetric matrix ψ2i(a)(θ)\psi_{2i(a)}(\theta) and for some δ2>0\delta_{2}>0, there exists a symmetric matrix M2i(a)M_{2i(a)} such that

    \sup_{\|\theta-\theta_{*}\|<\delta_{2}}\psi_{2i(a)}(\theta)<M_{2i(a)},

    satisfying

    \sum_{a=1}^{p}\sum_{i=1}^{n}\mathbb{E}\lambda_{max}^{2}\bigl(M_{2i(a)}\bigr)=o\bigl(a_{n}^{6}n^{-1}p\gamma_{n}^{-2}\bigr). (C.4)
  5. (C5)

    For any vector c\in\mathbb{R}^{p} with \|c\|=1, define the random variables Z_{ni}=-c^{T}\Gamma_{0n}^{-1/2}\psi_{0i} for i=1,\ldots,n. We assume that

    \sum_{i=1}^{n}Z_{ni}^{2}\stackrel{p}{\rightarrow}1,\quad\text{and}\quad\mathbb{E}\bigl[\max_{i}|Z_{ni}|\bigr]\rightarrow 0. (C.5)
  6. (C6)

    Assume that

    \lambda_{max}\bigl(\Gamma_{1n}\Gamma_{0n}^{-1}\Gamma_{1n}^{T}\bigr)\asymp a_{n}^{2}. (C.6)

The technical conditions (C1)-(C5) are extremely broad, and allow different parameter estimators to converge at different rates. The additional condition (C6) is a natural condition that, coupled with (C1), ensures an identical rate of convergence a_{n} for all the parameter estimators in a model.

Standard regularity conditions on likelihood and estimating functions that have been routinely assumed in the literature are special cases of the framework above. In such cases, (C1)-(C6) hold with ann1/2a_{n}\equiv{n}^{1/2}, resulting in the standard “root-nn” asymptotics.
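As a concrete illustration of this framework (an example we add here purely for exposition; it is not part of the original development), consider least squares in a fixed-design linear regression model, with energy functions

\psi_{i}(\theta)=2^{-1}\bigl(y_{i}-x_{i}^{T}\theta\bigr)^{2},\qquad\psi_{0i}(\theta)=-\bigl(y_{i}-x_{i}^{T}\theta\bigr)x_{i},\qquad\psi_{1i}(\theta)=x_{i}x_{i}^{T},\qquad\psi_{2i(a)}(\theta)=0.

With independent zero-mean errors of variance \sigma^{2},

\Gamma_{0n}=\sigma^{2}\sum_{i=1}^{n}x_{i}x_{i}^{T},\qquad\Gamma_{1n}=\sum_{i=1}^{n}x_{i}x_{i}^{T},\qquad A_{n}=\Gamma_{0n}^{-1/2}\Gamma_{1n}=\sigma^{-1}\Bigl(\sum_{i=1}^{n}x_{i}x_{i}^{T}\Bigr)^{1/2},

so that if n^{-1}\sum_{i=1}^{n}x_{i}x_{i}^{T} converges to a positive definite matrix, then \lambda_{min}(A_{n}^{T}A_{n})\asymp\lambda_{max}(A_{n}^{T}A_{n})\asymp n, and (C1) and (C6) hold with a_{n}\asymp n^{1/2}; the remaining conditions reduce to standard moment and design assumptions, recovering the root-n asymptotics just described.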

A.2 Results

We first present the consistency and asymptotic normality of the estimation process in Theorem A.1 below.

Theorem A.1.

Assume conditions (C1)-(C5). Then \hat{\theta}_{*} is a consistent estimator of \theta_{*}, and A_{n}(\hat{\theta}_{*}-\theta_{*}) converges weakly to the p-dimensional standard Gaussian distribution. Under the additional condition (C6), a_{n}(\hat{\theta}_{*}-\theta_{*}) converges weakly to a p-dimensional Gaussian distribution.

Proof of Theorem A.1.

We consider a generic point θ=θ+An1t{\theta}={\theta}_{*}+A_{n}^{-1}t. From the Taylor series expansion, we have

ψ0i(a)(θ)\displaystyle\psi_{0i(a)}({\theta}) =ψ0i(a)+ψ1i(a)An1t+21tTAnTψ2i(a)(θ~)An1t,\displaystyle=\psi_{0i(a)}+\psi_{1i(a)}A_{n}^{-1}t+2^{-1}t^{T}A_{n}^{-T}\psi_{2i(a)}(\tilde{{\theta}}_{*})A_{n}^{-1}t,

for a=1,pa=1,\ldots p, and θ~=θ+cAn1t\tilde{{\theta}}_{*}={{\theta}}_{*}+cA_{n}^{-1}t for some c(0,1)c\in(0,1).

Recall our convention that for any function h(θ)h({\theta}) evaluated at the true parameter value θ{\theta}_{*}, we use the notation hh(θ)h\equiv h({\theta}_{*}). Also define the pp-dimensional vector Rn(θ~,t)R_{n}(\tilde{{\theta}}_{*},t) whose aa-th element is given by

Rn(a)(θ~,t)=tTAnTi=1nψ2i(a)(θ~)An1t.\displaystyle R_{n(a)}(\tilde{{\theta}}_{*},t)=t^{T}A_{n}^{-T}\sum_{i=1}^{n}\psi_{2i(a)}(\tilde{{\theta}}_{*})A_{n}^{-1}t.

Thus we have

p1/2An1i=1nψ0i(θ+An1t)\displaystyle p^{-1/2}A_{n}^{-1}\sum_{i=1}^{n}\psi_{0i}({\theta}_{*}+A_{n}^{-1}t) =p1/2An1i=1nψ0i+p1/2An1i=1nψ1iAn1t+21p1/2An1Rn(θ~,t)\displaystyle=p^{-1/2}A_{n}^{-1}\sum_{i=1}^{n}\psi_{0i}+p^{-1/2}A_{n}^{-1}\sum_{i=1}^{n}\psi_{1i}A_{n}^{-1}t+2^{-1}p^{-1/2}A_{n}^{-1}R_{n}(\tilde{{\theta}}_{*},t)
=p1/2An1i=1nψ0i+p1/2An1Γ1nAn1t\displaystyle=p^{-1/2}A_{n}^{-1}\sum_{i=1}^{n}\psi_{0i}+p^{-1/2}A_{n}^{-1}\Gamma_{1n}A_{n}^{-1}t
+p1/2An1(i=1nψ1iΓ1n)An1t\displaystyle\quad+p^{-1/2}A_{n}^{-1}\bigl{(}\sum_{i=1}^{n}\psi_{1i}-\Gamma_{1n}\bigr{)}A_{n}^{-1}t
+21p1/2An1Rn(θ~,t).\displaystyle\quad+2^{-1}p^{-1/2}A_{n}^{-1}R_{n}(\tilde{{\theta}}_{*},t).

Fix ϵ>0\epsilon>0. We first show that there exists a C0>0C_{0}>0 such that

[p1/2An1i=1nψ0i>C0]<ϵ/2.\displaystyle\mathbb{P}\Bigl{[}\bigl{\|}p^{-1/2}A_{n}^{-1}\sum_{i=1}^{n}\psi_{0i}\bigr{\|}>C_{0}\Bigr{]}<\epsilon/2. (A.7)

For this, we compute

p1𝔼An1i=1nψ0i2\displaystyle p^{-1}\mathbb{E}\bigl{\|}A_{n}^{-1}\sum_{i=1}^{n}\psi_{0i}\bigr{\|}^{2} =p1𝔼i,j=1nψ0iTAnTAn1ψ0j\displaystyle=p^{-1}\mathbb{E}\sum_{i,j=1}^{n}\psi_{0i}^{T}A_{n}^{-T}A_{n}^{-1}\psi_{0j}
=p1tr(AnTAn1𝔼i=1nψ0iψ0iT)\displaystyle=p^{-1}{\mathrm{tr}}\left(A_{n}^{-T}A_{n}^{-1}\mathbb{E}\sum_{i=1}^{n}\psi_{0i}\psi_{0i}^{T}\right)
=p1tr(AnTAn1Γ0n)\displaystyle=p^{-1}{\mathrm{tr}}\left(A_{n}^{-T}A_{n}^{-1}\Gamma_{0n}\right)
=O(1)\displaystyle=O(1)

from assumption (C.2).

Define

Sn(t)=p1/2An1(i=1nψ0i(θ+An1t)i=1nψ0i)p1/2Γ1n1Γ0nt.\displaystyle S_{n}(t)=p^{-1/2}A_{n}^{-1}\bigl{(}\sum_{i=1}^{n}\psi_{0i}({\theta}_{*}+A_{n}^{-1}t)-\sum_{i=1}^{n}\psi_{0i}\bigr{)}-p^{-1/2}\Gamma_{1n}^{-1}\Gamma_{0n}t.

We next show that for any C>0C>0, for all sufficiently large nn, we have

𝔼[suptCSn(t)]2=o(1).\displaystyle\mathbb{E}\Bigl{[}\sup_{\|t\|\leq C}\bigl{\|}S_{n}(t)\bigr{\|}\Bigr{]}^{2}=o(1). (A.8)

This follows from (C.3) and (C.4). Note that

Sn(t)=p1/2An1(i=1nψ1iΓ1n)An1t+21p1/2An1Rn(θ~n,t).\displaystyle S_{n}(t)=p^{-1/2}A_{n}^{-1}\bigl{(}\sum_{i=1}^{n}\psi_{1i}-\Gamma_{1n}\bigr{)}A_{n}^{-1}t+2^{-1}p^{-1/2}A_{n}^{-1}R_{n}(\tilde{{\theta}}_{n},t).

Thus,

suptCSn(t)\displaystyle\sup_{\|t\|\leq C}\bigl{\|}S_{n}(t)\bigr{\|} p1/2suptCAn1(i=1nψ1iΓ1n)An1t+21p1/2suptCAn1Rn(θ~,t).\displaystyle\leq p^{-1/2}\sup_{\|t\|\leq C}\bigl{\|}A_{n}^{-1}\bigl{(}\sum_{i=1}^{n}\psi_{1i}-\Gamma_{1n}\bigr{)}A_{n}^{-1}t\bigr{\|}+2^{-1}p^{-1/2}\sup_{\|t\|\leq C}\bigl{\|}A_{n}^{-1}R_{n}(\tilde{{\theta}}_{*},t)\bigr{\|}.

We consider each of these terms separately.

For any matrix Mp×pM\in\mathbb{R}^{p}\times\mathbb{R}^{p}, we have

suptCMt=suptC[i=1p(j=1pMijtj)2]1/2suptC[i=1pj=1pMij2j=1ptj2]1/2=MFsuptCt=CMF.\displaystyle\sup_{\|t\|\leq C}\bigl{\|}Mt\bigr{\|}=\sup_{\|t\|\leq C}\Bigl{[}\sum_{i=1}^{p}\bigl{(}\sum_{j=1}^{p}M_{ij}t_{j}\bigr{)}^{2}\Bigr{]}^{1/2}\leq\sup_{\|t\|\leq C}\Bigl{[}\sum_{i=1}^{p}\sum_{j=1}^{p}M_{ij}^{2}\sum_{j=1}^{p}t_{j}^{2}\Bigr{]}^{1/2}=\bigl{\|}M\bigr{\|}_{F}\sup_{\|t\|\leq C}\|t\|=C\bigl{\|}M\bigr{\|}_{F}.

Using M=An1(i=1nψ1iΓ1n)An1M=A_{n}^{-1}\bigl{(}\sum_{i=1}^{n}\psi_{1i}-\Gamma_{1n}\bigr{)}A_{n}^{-1} and (C.3), we get one part of the result.

For the other term, we similarly have

[suptCp1/2An1Rn(θ~,t)]2\displaystyle\Bigl{[}\sup_{\|t\|\leq C}\bigl{\|}p^{-1/2}A_{n}^{-1}R_{n}(\tilde{{\theta}}_{*},t)\bigr{\|}\Bigr{]}^{2} =p1suptCAn1Rn(θ~,t)2\displaystyle=p^{-1}\sup_{\|t\|\leq C}\bigl{\|}A_{n}^{-1}R_{n}(\tilde{{\theta}}_{*},t)\bigr{\|}^{2}
p1λmax(AnTAn1)suptCRn(θ~,t)2\displaystyle\leq p^{-1}\lambda_{max}\bigl{(}A_{n}^{-T}A_{n}^{-1}\bigr{)}\sup_{\|t\|\leq C}\bigl{\|}R_{n}(\tilde{{\theta}}_{*},t)\bigr{\|}^{2}
p1λmax(An1AnT)suptCRn(θ~,t)2\displaystyle\leq p^{-1}\lambda_{max}\bigl{(}A_{n}^{-1}A_{n}^{-T}\bigr{)}\sup_{\|t\|\leq C}\bigl{\|}R_{n}(\tilde{{\theta}}_{*},t)\bigr{\|}^{2}
p1an2suptCRn(θ~,t)2.\displaystyle\leq p^{-1}a_{n}^{-2}\sup_{\|t\|\leq C}\bigl{\|}R_{n}(\tilde{{\theta}}_{*},t)\bigr{\|}^{2}.

Note that

(suptCRn(θ~,t))2=suptCRn(θ~,t)2.\displaystyle\bigl{(}\sup_{\|t\|\leq C}\bigl{\|}R_{n}(\tilde{\theta}_{*},t)\bigr{\|}\bigr{)}^{2}=\sup_{\|t\|\leq C}\bigl{\|}R_{n}(\tilde{\theta}_{*},t)\bigr{\|}^{2}.

Now

\bigl\|R_{n}(\tilde{\theta}_{*},t)\bigr\|^{2}=\sum_{a=1}^{p}\bigl(R_{n(a)}(\tilde{\theta}_{*},t)\bigr)^{2}
=a=1p(tTAnTi=1nψ2i(a)(θ~)An1t)2\displaystyle=\sum_{a=1}^{p}\bigl{(}t^{T}A_{n}^{-T}\sum_{i=1}^{n}\psi_{2i(a)}(\tilde{\theta}_{*})A_{n}^{-1}t\bigr{)}^{2}
=a=1pi,j=1ntTAnTψ2i(a)(θ~)An1ttTAnTψ2j(a)(θ~)An1t.\displaystyle=\sum_{a=1}^{p}\sum_{i,j=1}^{n}t^{T}A_{n}^{-T}\psi_{2i(a)}(\tilde{\theta}_{*})A_{n}^{-1}tt^{T}A_{n}^{-T}\psi_{2j(a)}(\tilde{\theta}_{*})A_{n}^{-1}t.

Based on this, we have

\sup_{\|t\|\leq C}\bigl\|R_{n}(\tilde{\theta}_{*},t)\bigr\|^{2}=\sup_{\|t\|\leq C}\sum_{a=1}^{p}\sum_{i,j=1}^{n}t^{T}A_{n}^{-T}\psi_{2i(a)}(\tilde{\theta}_{*})A_{n}^{-1}t\,t^{T}A_{n}^{-T}\psi_{2j(a)}(\tilde{\theta}_{*})A_{n}^{-1}t
suptCa=1pi,j=1ntTAnTM2i(a)An1ttTAnTM2j(a)An1t\displaystyle\leq\sup_{\|t\|\leq C}\sum_{a=1}^{p}\sum_{i,j=1}^{n}t^{T}A_{n}^{-T}M_{2i(a)}A_{n}^{-1}tt^{T}A_{n}^{-T}M_{2j(a)}A_{n}^{-1}t
suptCAn1t4a=1p(i=1nλmax(M2i(a)))2\displaystyle\leq\sup_{\|t\|\leq C}\bigl{\|}A_{n}^{-1}t\bigr{\|}^{4}\sum_{a=1}^{p}\Bigl{(}\sum_{i=1}^{n}\lambda_{max}\bigl{(}M_{2i(a)}\bigr{)}\Bigr{)}^{2}
C4nλmax2(AnTAn1)a=1pi=1nλmax2(M2i(a)).\displaystyle\leq C^{4}n\lambda_{max}^{2}\bigl{(}A_{n}^{-T}A_{n}^{-1}\bigr{)}\sum_{a=1}^{p}\sum_{i=1}^{n}\lambda_{max}^{2}\bigl{(}M_{2i(a)}\bigr{)}.

Putting all these together, we have

𝔼[suptCp1/2An1Rn(θ~,t)]2\displaystyle\mathbb{E}\Bigl{[}\sup_{\|t\|\leq C}\bigl{\|}p^{-1/2}A_{n}^{-1}R_{n}(\tilde{\theta}_{*},t)\bigr{\|}\Bigr{]}^{2} =p1𝔼[suptCAn1Rn(θ~,t)]2\displaystyle=p^{-1}\mathbb{E}\Bigl{[}\sup_{\|t\|\leq C}A_{n}^{-1}R_{n}(\tilde{\theta}_{*},t)\Bigr{]}^{2}
p1an2𝔼[suptCRn(θ~,t)]2\displaystyle\leq p^{-1}a_{n}^{-2}\mathbb{E}\Bigl{[}\sup_{\|t\|\leq C}\bigl{\|}R_{n}(\tilde{\theta}_{*},t)\bigr{\|}\Bigr{]}^{2}
=O(p1an2)𝔼[suptCRn(θ~,t)]2\displaystyle=O\bigl{(}p^{-1}a_{n}^{-2}\bigr{)}\mathbb{E}\Bigl{[}\sup_{\|t\|\leq C}\bigl{\|}R_{n}(\tilde{\theta}_{*},t)\bigr{\|}\Bigr{]}^{2}
=O\bigl(p^{-1}na_{n}^{-6}\bigr)\sum_{a=1}^{p}\sum_{i=1}^{n}\mathbb{E}\lambda_{max}^{2}\bigl(M_{2i(a)}\bigr)
=o(1),\displaystyle=o(1),

using (C.4).

Since we have defined

Sn(t)=p1/2An1(i=1nψ0i(θ+An1t)i=1nψ0i)p1/2Γ1n1Γ0nt,\displaystyle S_{n}(t)=p^{-1/2}A_{n}^{-1}\bigl{(}\sum_{i=1}^{n}\psi_{0i}({\theta}_{*}+A_{n}^{-1}t)-\sum_{i=1}^{n}\psi_{0i}\bigr{)}-p^{-1/2}\Gamma_{1n}^{-1}\Gamma_{0n}t,

we have

p1/2An1i=1nψ0i(θ+p1/2An1t)\displaystyle p^{-1/2}A_{n}^{-1}\sum_{i=1}^{n}\psi_{0i}({\theta}_{*}+p^{1/2}A_{n}^{-1}t) =Sn(t)+p1/2An1i=1nψ0i+An1Γ1nAn1t.\displaystyle=S_{n}(t)+p^{-1/2}A_{n}^{-1}\sum_{i=1}^{n}\psi_{0i}+A_{n}^{-1}\Gamma_{1n}A_{n}^{-1}t.

Hence

inft=C{p1/2tTΓ1nAn1i=1nψ0i(θ+p1/2An1t)}\displaystyle\inf_{\|t\|=C}\Bigl{\{}p^{-1/2}t^{T}\Gamma_{1n}A_{n}^{-1}\sum_{i=1}^{n}\psi_{0i}({\theta}_{*}+p^{1/2}A_{n}^{-1}t)\Bigr{\}}
=inft=C{tTΓ1nSn(t)+p1/2tTΓ1nAn1i=1nψ0i+tTΓ1nAn1Γ1nAn1t}\displaystyle=\inf_{\|t\|=C}\Bigl{\{}t^{T}\Gamma_{1n}S_{n}(t)+p^{-1/2}t^{T}\Gamma_{1n}A_{n}^{-1}\sum_{i=1}^{n}\psi_{0i}+t^{T}\Gamma_{1n}A_{n}^{-1}\Gamma_{1n}A_{n}^{-1}t\Bigr{\}}
inft=CtTΓ1nSn(t)+p1/2inft=CtTΓ1nAn1i=1nψ0i+inft=CtTΓ1nAn1Γ1nAn1t\displaystyle\geq\inf_{\|t\|=C}t^{T}\Gamma_{1n}S_{n}(t)+p^{-1/2}\inf_{\|t\|=C}t^{T}\Gamma_{1n}A_{n}^{-1}\sum_{i=1}^{n}\psi_{0i}+\inf_{\|t\|=C}t^{T}\Gamma_{1n}A_{n}^{-1}\Gamma_{1n}A_{n}^{-1}t
Cδ11/2sup|t|=C|Sn(t)|Cδ11/2p1/2An1i=1nψ0i+C2δ0.\displaystyle\geq-C\delta_{1}^{1/2}\sup_{|t|=C}|S_{n}(t)|-C\delta_{1}^{1/2}p^{-1/2}\|A_{n}^{-1}\sum_{i=1}^{n}\psi_{0i}\|+C^{2}\delta_{0}.

The last step above utilizes facts like aTbaba^{T}b\geq-\|a\|\|b\|. Consequently, defining C1=Cδ0/δ11/2C_{1}=C\delta_{0}/\delta_{1}^{1/2}, we have

[inft=C{tTΓ1nAn1i=1nψ0i(θ+p1/2An1t)}<0]\displaystyle\mathbb{P}\Bigl{[}\inf_{\|t\|=C}\Bigl{\{}t^{T}\Gamma_{1n}A_{n}^{-1}\sum_{i=1}^{n}\psi_{0i}({\theta}_{*}+p^{1/2}A_{n}^{-1}t)\Bigr{\}}<0\Bigr{]}
[supt=C|Sn(t)|+|An1i=1nψ0i|>C1]\displaystyle\leq\mathbb{P}\Bigl{[}\sup_{\|t\|=C}|S_{n}(t)|+|A_{n}^{-1}\sum_{i=1}^{n}\psi_{0i}|>C_{1}\Bigr{]}
[supt=CSn(t)>C1/2]+[An1i=1nψ0i>C1/2]<ϵ,\displaystyle\leq\mathbb{P}\Bigl{[}\sup_{\|t\|=C}\bigl{\|}S_{n}(t)\bigr{\|}>C_{1}/2\Bigr{]}+\mathbb{P}\Bigl{[}\bigl{\|}A_{n}^{-1}\sum_{i=1}^{n}\psi_{0i}\bigr{\|}>C_{1}/2\Bigr{]}<\epsilon,

for all sufficiently large n, using (A.7) and (A.8). This implies that, with probability greater than 1-\epsilon, there is a root T_{n} of the system of equations \sum_{i=1}^{n}\psi_{0i}(\theta_{*}+A_{n}^{-1}t)=0 in the ball \{\|t\|<C\}, for some C>0 and all sufficiently large n. Defining \hat{\theta}_{*}=\theta_{*}+A_{n}^{-1}T_{n}, we obtain the desired result. Issues such as the dependence on \epsilon and other technical details are handled using standard arguments; see Chatterjee & Bose (2005).

Since we have \sup_{\|t\|<C}\|S_{n}(t)\|=o_{P}(1), and T_{n} lies in the set \|t\|<C, define -R_{n}=S_{n}(T_{n})=o_{P}(1). Consequently,

-R_{n}=S_{n}(T_{n})
=p^{-1/2}A_{n}^{-1}\bigl(\sum_{i=1}^{n}\psi_{0i}(\theta_{*}+A_{n}^{-1}T_{n})-\sum_{i=1}^{n}\psi_{0i}\bigr)-p^{-1/2}\Gamma_{1n}^{-1}\Gamma_{0n}T_{n}
=-p^{-1/2}A_{n}^{-1}\sum_{i=1}^{n}\psi_{0i}-p^{-1/2}\Gamma_{1n}^{-1}\Gamma_{0n}T_{n},

where the last equality uses the fact that T_{n} is a root, so that \sum_{i=1}^{n}\psi_{0i}(\theta_{*}+A_{n}^{-1}T_{n})=0.

Thus,

T_{n}=-\Gamma_{0n}^{-1}\Gamma_{1n}A_{n}^{-1}\sum_{i=1}^{n}\psi_{0i}+p^{1/2}\Gamma_{0n}^{-1}\Gamma_{1n}R_{n}=-\Gamma_{0n}^{-1/2}\sum_{i=1}^{n}\psi_{0i}+p^{1/2}\Gamma_{0n}^{-1}\Gamma_{1n}R_{n}.

Note that our conditions imply that for any cc with c=1\|c\|=1, we have that cTTnc^{T}T_{n} has two terms, where 𝕍(cTΓ0n1/2i=1nψ0i)=1\mathbb{V}\bigl{(}-c^{T}\Gamma_{0n}^{-1/2}\sum_{i=1}^{n}\psi_{0i}\bigr{)}=1 and

𝔼[p1/2cTΓ0n1Γ1nRn]2=O(1),\displaystyle\mathbb{E}\bigl{[}p^{1/2}c^{T}\Gamma_{0n}^{-1}\Gamma_{1n}R_{n}\bigr{]}^{2}=O(1),

using (C.2). Using (C.5) we also have that for any cc with c=1\|c\|=1, cTTnc^{T}T_{n} converges in distribution to N(0,1)N(0,1). This completes the proof. ∎

We now state a parallel result on the consistency of the GBS resampling scheme. The essence of this theorem is that, under the same set of conditions, several resampling schemes consistently approximate the sampling distributions needed to implement the e-values framework.

Theorem A.2.

Assume conditions (C1)-(C5). Additionally, assume that the resampling weights 𝕎rni\mathbb{W}_{rni} are exchangeable random variables satisfying the conditions (4.2). Define B^n=μnτn1Γ^0n1/2Γ^1n1\hat{B}_{n}=\mu_{n}\tau_{n}^{-1}\hat{\Gamma}^{1/2}_{0n}\hat{\Gamma}^{-1}_{1n}, where Γ^0n\hat{\Gamma}_{0n} and Γ^1n\hat{\Gamma}_{1n} are sample equivalents of Γ0n\Gamma_{0n} and Γ1n\Gamma_{1n}, respectively. Conditional on the data, B^n(θ^rθ^)\hat{B}_{n}(\hat{{\theta}}_{r*}-\hat{{\theta}}_{*}) converges weakly to the pp-dimensional standard Gaussian distribution in probability.

Under the additional condition (C6), defining bn=μnτn1anb_{n}=\mu_{n}\tau_{n}^{-1}a_{n}, the distributions of an(θ^θ)a_{n}(\hat{{\theta}}_{*}-{{\theta}}_{*}) and bn(θ^rθ^)b_{n}(\hat{{\theta}}_{r*}-\hat{{\theta}}_{*}) converge to the same weak limit in probability.

Proof of Theorem A.2.

This proof follows steps similar to those in the proof of Theorem A.1, apart from several additional technicalities; we omit the details. ∎
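To make the resampling scheme concrete, the following is a minimal sketch of a GBS approximation to the sampling distribution of a least-squares estimate. The i.i.d. Gamma(1,1) weights (mean \mu_{n}=1, standard deviation \tau_{n}=1) and the least-squares energy function are illustrative choices of ours; whether a given weight distribution satisfies condition (4.2) of the main paper should be checked there.

```python
import numpy as np

def gbs_least_squares(X, y, R=1000, seed=None):
    """Sketch of Generalized Bootstrap for the least-squares energy function:
    each resample re-solves the weighted estimating equation
    sum_i w_ri * psi_0i(theta) = 0 with exchangeable random weights w_ri."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # full-data estimate
    boot = np.empty((R, p))
    for r in range(R):
        w = rng.gamma(shape=1.0, scale=1.0, size=n)    # i.i.d. weights with mean 1, sd 1
        Xw = X * w[:, None]
        boot[r] = np.linalg.solve(X.T @ Xw, Xw.T @ y)  # solves sum_i w_i x_i (y_i - x_i^T theta) = 0
    return theta_hat, boot
```

Conditional on the data, the spread of the rows of boot around theta_hat then approximates the sampling distribution of the full-model estimate, up to the \mu_{n}\tau_{n}^{-1} rescaling appearing in Theorem A.2.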

Appendix B Theoretical proofs

The results in Section 5 hold under conditions more general than those stated in the main paper: specifically, the asymptotic distribution of \hat{\theta}_{*} may come from a general class of elliptical distributions rather than being multivariate Gaussian.

Following Fang et al. (1990), the density function of an elliptically distributed random vector \mathcal{E}(\mu,\Sigma,g) takes the form h(x;\mu,\Sigma)=|\Sigma|^{-1/2}g\bigl((x-\mu)^{T}\Sigma^{-1}(x-\mu)\bigr), where \mu\in\mathbb{R}^{p}, \Sigma\in\mathbb{R}^{p\times p} is positive definite, and the density generator g is non-negative, scalar-valued, continuous and strictly decreasing. For the asymptotic distribution of \hat{\theta}_{*}, we assume the following conditions:

  1. (A1)

    There exist a sequence of positive reals a_{n}\uparrow\infty, a positive definite (PD) matrix V\in\mathbb{R}^{p\times p} and a density generator g such that a_{n}(\hat{\theta}_{*}-\theta_{0}) converges to \mathcal{E}(0_{p},V,g) in distribution, denoted by a_{n}(\hat{\theta}_{*}-\theta_{0})\leadsto\mathcal{E}(0_{p},V,g);

  2. (A2)

    For almost every dataset \mathcal{Z}_{n}, there exists a sequence of PD matrices V_{n}\in\mathbb{R}^{p\times p} such that \text{plim}_{n\rightarrow\infty}V_{n}=V.

These conditions are naturally satisfied for a Gaussian limiting distribution.
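For intuition, the following sketch shows one affine-invariant depth function of the kind required by conditions (B1)-(B5) in the main paper. Mahalanobis depth is our illustrative choice here; it is not necessarily the depth function used for the reported experiments.

```python
import numpy as np

def mahalanobis_depth(theta, sample):
    """Depth of the point `theta` with respect to the empirical distribution
    given by the rows of `sample`: an affine-invariant inverse distance that
    is largest at the center of the distribution and decays along rays."""
    mu = sample.mean(axis=0)
    Sigma = np.cov(sample, rowvar=False)   # assumed non-singular
    d = theta - mu
    return 1.0 / (1.0 + float(d @ np.linalg.solve(Sigma, d)))
```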

B.1 Proofs of main results

Proof of Theorem 5.1.

We divide the proof into four parts:

  1. 1.

    ordering of adequate model e-values,

  2. 2.

    convergence of all adequate model e-values to a common limit,

  3. 3.

    convergence of inadequate model e-values to 0,

  4. 4.

    comparison of adequate and inadequate model e-values,

Proof of part 1.

Since we are dealing with a finite sequence of nested models, it is enough to prove that en(1)>en(2)e_{n}(\mathcal{M}_{1})>e_{n}(\mathcal{M}_{2}) for large enough nn, when both 1\mathcal{M}_{1} and 2\mathcal{M}_{2} are adequate models and 12\mathcal{M}_{1}\prec\mathcal{M}_{2}.

Suppose \mathbb{T}_{0}=\mathcal{E}(0_{p},I_{p},g). Affine invariance implies invariance under rotational transformations, and since the evaluation functions we consider decrease along any ray from the origin because of (B5), E(\theta,\mathbb{T}_{0}) is a monotonically decreasing function of \|\theta\| for any \theta\in\mathbb{R}^{p}. Now consider the models \mathcal{M}^{0}_{1},\mathcal{M}^{0}_{2} that have 0 at all indices outside \mathcal{S}_{1} and \mathcal{S}_{2}, respectively. Take some \theta_{10}\in\Theta^{0}_{1}, the parameter space corresponding to \mathcal{M}^{0}_{1}, and replace its (zero) entries at indices j\in\mathcal{S}_{2}\setminus\mathcal{S}_{1} by some non-zero \delta\in\mathbb{R}^{|\mathcal{S}_{2}\setminus\mathcal{S}_{1}|}. Denote the result by \theta_{1\delta}. Then we have

θ1δTθ1δ>θ10Tθ10D(θ10,𝕋0)>D(θ1δ,𝕋0)𝔼s1D(θ10,𝕋0)>𝔼s1D(θ1δ,𝕋0),\displaystyle{\theta}_{1{\delta}}^{T}{\theta}_{1{\delta}}>{\theta}_{10}^{T}{\theta}_{10}\quad\Rightarrow\quad D({\theta}_{10},\mathbb{T}_{0})>D({\theta}_{1{\delta}},\mathbb{T}_{0})\quad\Rightarrow\quad\mathbb{E}_{s1}D({\theta}_{10},\mathbb{T}_{0})>\mathbb{E}_{s1}D({\theta}_{1{\delta}},\mathbb{T}_{0}),

where \mathbb{E}_{s1} denotes expectation over the marginal of the distributional argument \mathbb{T}_{0} at the indices \mathcal{S}_{1}. Notice now that by construction \theta_{1\delta}\in\Theta^{0}_{2}, the parameter space corresponding to \mathcal{M}^{0}_{2}, and since the above holds for all possible \delta, we can take expectations over the indices \mathcal{S}_{2}\setminus\mathcal{S}_{1} on both sides to obtain \mathbb{E}_{s1}D(\theta_{10},\mathbb{T}_{0})>\mathbb{E}_{s2}D(\theta_{20},\mathbb{T}_{0}), with \theta_{20} denoting a generic element of \Theta^{0}_{2}.

Combining (A1) and (A2) we get anVn1/2(θ^θ0)𝕋0a_{n}{V}_{n}^{-1/2}(\hat{\theta}_{*}-{\theta}_{0})\leadsto\mathbb{T}_{0}. Denote 𝕋n=[anVn1/2(θ^θ0)]\mathbb{T}_{n}=[a_{n}{V}_{n}^{-1/2}(\hat{\theta}_{*}-{\theta}_{0})], and choose a positive ϵ<(𝔼s1D(θ10,𝕋0)𝔼s2D(θ20,𝕋0))/2\epsilon<(\mathbb{E}_{s1}D({\theta}_{10},\mathbb{T}_{0})-\mathbb{E}_{s2}D({\theta}_{20},\mathbb{T}_{0}))/2. Then, for large enough nn we shall have

|D(θ10,𝕋n)D(θ10,𝕋0)|<ϵ|𝔼s1D(θ10,𝕋n)𝔼s1D(θ10,𝕋0)|<ϵ,\left|D({\theta}_{10},\mathbb{T}_{n})-D({\theta}_{10},\mathbb{T}_{0})\right|<\epsilon\quad\Rightarrow\quad|\mathbb{E}_{s1}D({\theta}_{10},\mathbb{T}_{n})-\mathbb{E}_{s1}D({\theta}_{10},\mathbb{T}_{0})|<\epsilon,

following condition (B4). Similarly we have |𝔼s2D(θ20,𝕋n)𝔼s2D(θ20,𝕋0)|<ϵ|\mathbb{E}_{s2}D({\theta}_{20},\mathbb{T}_{n})-\mathbb{E}_{s2}D({\theta}_{20},\mathbb{T}_{0})|<\epsilon for the same nn for which the above holds. This implies 𝔼s1D(θ10,𝕋n)>𝔼s2D(θ20,𝕋n)\mathbb{E}_{s1}D({\theta}_{10},\mathbb{T}_{n})>\mathbb{E}_{s2}D({\theta}_{20},\mathbb{T}_{n}).

Now apply the affine transformation t(\theta)=V_{n}^{1/2}\theta/a_{n}+\theta_{0} to both arguments of the depth function above. This keeps the depths unchanged, by affine invariance, i.e. D(t(\theta_{10}),[\hat{\theta}_{*}])=D(\theta_{10},\mathbb{T}_{n}) and D(t(\theta_{20}),[\hat{\theta}_{*}])=D(\theta_{20},\mathbb{T}_{n}). Since this transformation maps \Theta^{0}_{1} to \Theta_{1}, the parameter space corresponding to \mathcal{M}_{1}, we get \mathbb{E}_{s1}D(t(\theta_{10}),[\hat{\theta}_{*}])>\mathbb{E}_{s2}D(t(\theta_{20}),[\hat{\theta}_{*}]), i.e. e_{n}(\mathcal{M}_{1})>e_{n}(\mathcal{M}_{2}).

Proof of part 2.

For the full model \mathcal{M}_{*}, a_{n}(\hat{\theta}_{*}-\theta_{0})\leadsto\mathcal{E}(0_{p},V,g) by (A1). It now follows from a direct application of condition (B3) that e_{n}(\mathcal{M}_{*})\rightarrow\mathbb{E}D(Y,[Y]), where Y\sim\mathcal{E}(0_{p},V,g).

For any other adequate model \mathcal{M}, we use property (B1) of D:

D(\hat{\theta}_{m},[\hat{\theta}_{*}])=D\bigl(\hat{\theta}_{m}-\theta_{0},\bigl[\hat{\theta}_{*}-\theta_{0}\bigr]\bigr), (B.1)

and decompose the first argument

θ^mθ0=(θ^mθ^)+(θ^θ0).\displaystyle\hat{{\theta}}_{m}-{\theta}_{0}=(\hat{{\theta}}_{m}-\hat{{\theta}}_{*})+(\hat{{\theta}}_{*}-{\theta}_{0}). (B.2)

We now have

\hat{\theta}_{m}=\theta_{m}+a_{n}^{-1}T_{mn},
\hat{\theta}_{*}=\theta_{0}+a_{n}^{-1}T_{*n},

where TmnT_{mn} is non-degenerate at the indices 𝒮\mathcal{S}, and Tn(0p,V,g)T_{*n}\leadsto\mathcal{E}(0_{p},V,g). For the first summand of the right-hand side in (B.2) we then have

θ^mθ^=θmθ0+Rn,\displaystyle\hat{{\theta}}_{m}-\hat{{\theta}}_{*}={\theta}_{m}-{\theta}_{0}+R_{n}, (B.3)

where \mathbb{E}\|R_{n}\|^{2}=O(a_{n}^{-2}), and \theta_{mj} equals \theta_{0j} at indices j\in\mathcal{S} and C_{j} elsewhere. Since \mathcal{M} is adequate, \theta_{m}=\theta_{0}. Thus, substituting into the right-hand side of (B.2), we get

|D(θ^mθ0,[θ^θ0])D(θ^θ0,[θ^θ0])|Rnα,\displaystyle\left|D\left(\hat{{\theta}}_{m}-{\theta}_{0},\left[\hat{{\theta}}_{*}-{\theta}_{0}\right]\right)-D\left(\hat{{\theta}}_{*}-{\theta}_{0},\left[\hat{{\theta}}_{*}-{\theta}_{0}\right]\right)\right|\leq\|R_{n}\|^{\alpha}, (B.4)

from Lipschitz continuity of D()D(\cdot) given in (B2). Part 2 now follows.

Proof of Part 3.

Since the depth function DD is invariant under location and scale transformations, we have

D(θ^m,[θ^])=D(an(θ^mθ0),[an(θ^θ0)]).\displaystyle D(\hat{{\theta}}_{m},[\hat{{\theta}}_{*}])=D\Bigl{(}a_{n}(\hat{{\theta}}_{m}-{\theta}_{0}),\bigl{[}a_{n}(\hat{{\theta}}_{*}-{\theta}_{0})\bigr{]}\Bigr{)}. (B.5)

Decomposing the first argument,

an(θ^mθ0)=an(θ^mθm)+an(θmθ0).\displaystyle a_{n}(\hat{{\theta}}_{m}-{\theta}_{0})=a_{n}(\hat{{\theta}}_{m}-{\theta}_{m})+a_{n}({\theta}_{m}-{\theta}_{0}). (B.6)

Since \mathcal{M} is inadequate, j𝒮|Cjθ0j|>0\sum_{j\notin\mathcal{S}}|C_{j}-\theta_{0j}|>0, i.e. θm\theta_{m} and θ0\theta_{0} are not equal in at least one (fixed) index. Consequently as nn\rightarrow\infty, an(θmθ0)\|a_{n}({\theta}_{m}-{\theta}_{0})\|\rightarrow\infty, thus en()0e_{n}(\mathcal{M})\rightarrow 0 by condition (B4).

Proof of part 4.

For any inadequate model \mathcal{M}_{j}, k<j\leq K, suppose N_{j} is an integer such that e_{n_{1}}(\mathcal{M}_{j})<e_{n_{1}}(\mathcal{M}_{*}) for all n_{1}>N_{j}. Part 3 above ensures that such an integer exists for every inadequate model. Now define N=\max_{k<j\leq K}N_{j}. Then for all n_{1}>N, e_{n_{1}}(\mathcal{M}_{*}) is larger than the e-values of all inadequate models \mathcal{M}_{j}, k<j\leq K. ∎

Proof of Corollary 5.2.

By construction, \mathcal{M}_{0} is nested in all other adequate models in \mathbb{M}_{0}. Hence Theorem 5.1 implies e_{n}(\mathcal{M}_{0})>e_{n}(\mathcal{M}^{ad})>e_{n}(\mathcal{M}^{inad}) for any adequate model \mathcal{M}^{ad} and inadequate model \mathcal{M}^{inad} in \mathbb{M}_{0}, for large enough n. ∎

Proof of Corollary 5.3.

Consider j\in\mathcal{S}_{0}. Then \theta_{0}\notin\mathcal{M}_{-j}, hence \mathcal{M}_{-j} is inadequate. By the choice of n_{1} in Corollary 5.2, the e-values of all inadequate models are less than that of \mathcal{M}_{*}, hence e_{n_{1}}(\mathcal{M}_{-j})<e_{n_{1}}(\mathcal{M}_{*}).

On the other hand, suppose there exists a j such that e_{n_{1}}(\mathcal{M}_{-j})\leq e_{n_{1}}(\mathcal{M}_{*}) but j\notin\mathcal{S}_{0}. Now j\notin\mathcal{S}_{0} means that \mathcal{M}_{-j} is an adequate model. Since \mathcal{M}_{-j} is nested within \mathcal{M}_{*} for any j, and the full model is always adequate, we have e_{n_{1}}(\mathcal{M}_{-j})>e_{n_{1}}(\mathcal{M}_{*}) by Theorem 5.1, leading to a contradiction and completing the proof. ∎

Proof of Theorem 5.4.

Corollary 5.3 implies that

𝒮0={j:en(j)<en()}.\mathcal{S}_{0}=\{j:e_{n}(\mathcal{M}_{-j})<e_{n}(\mathcal{M}_{*})\}.

Now define 𝒮¯0={j:ern(j)<ern()}\bar{\mathcal{S}}_{0}=\{j:e_{rn}(\mathcal{M}_{-j})<e_{rn}(\mathcal{M}_{*})\}. We also use an approximation result.

Lemma B.1.

For any adequate model \mathcal{M}, the following holds for fixed n and an exchangeable array of GBS resampling weights \mathbb{W}_{r} as in the main paper:

e_{rn}(\mathcal{M})=e_{n}(\mathcal{M})+R_{r},\quad\mathbb{E}_{r}|R_{r}|^{2}=o_{P}(1). (B.7)

Using Lemma B.1 for j\mathcal{M}_{-j} and \mathcal{M}_{*} we now have

ern(j)\displaystyle e_{rn}(\mathcal{M}_{-j}) =en(j)+Rrj,\displaystyle=e_{n}(\mathcal{M}_{-j})+R_{rj},
ern()\displaystyle e_{rn}(\mathcal{M}_{*}) =en()+Rr,\displaystyle=e_{n}(\mathcal{M}_{*})+R_{r*},

such that \mathbb{E}_{r}|R_{r*}|^{2}=o_{P}(1) and \mathbb{E}_{r}|R_{rj}|^{2}=o_{P}(1) for all j. Hence P_{1}(\bar{\mathcal{S}}_{0}=\mathcal{S}_{0})\rightarrow 1 as n\rightarrow\infty, where P_{1} denotes probability conditional on the data. Similarly, one can prove that the probability, conditional on the bootstrap samples, that \bar{\mathcal{S}}_{0}=\hat{\mathcal{S}}_{0} goes to 1 as R,R_{1}\rightarrow\infty, completing the proof. ∎
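Putting these pieces together, the selection rule \mathcal{S}_{0}=\{j:e_{n}(\mathcal{M}_{-j})<e_{n}(\mathcal{M}_{*})\} can be sketched as follows. We assume here that bootstrap replicates of the full-model estimate are available (for example from the GBS sketch in Appendix A), that dropping feature j is represented by zeroing the j-th coordinate of every replicate, and that a depth function such as the mahalanobis_depth sketch above is supplied; these are illustrative simplifications, not the exact implementation used in the main paper.

```python
import numpy as np

def evalue_feature_selection(boot_full, depth):
    """Sketch of the rule S_0 = {j : e_n(M_{-j}) < e_n(M_*)}: keep feature j
    when dropping it lowers the mean depth of the modified replicates with
    respect to the full-model bootstrap distribution."""
    R, p = boot_full.shape
    e_full = np.mean([depth(b, boot_full) for b in boot_full])  # e-value of the full model
    keep = []
    for j in range(p):
        dropped = boot_full.copy()
        dropped[:, j] = 0.0                                     # candidate model M_{-j}
        e_j = np.mean([depth(b, boot_full) for b in dropped])
        if e_j < e_full:
            keep.append(j)
    return keep
```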

B.2 Proofs of auxiliary results

Proof of Lemma B.1.

We decompose the resampled depth function as

D(θ^r1m,[θ^r])\displaystyle D\Bigl{(}\hat{{\theta}}_{r_{1}m},[\hat{{\theta}}_{r*}]\Bigr{)} =D(anτn(θ^r1mθ^),[anτn(θ^rθ^)])\displaystyle=D\Bigl{(}\frac{a_{n}}{\tau_{n}}\bigl{(}\hat{{\theta}}_{r_{1}m}-\hat{{\theta}}_{*}\bigr{)},\left[\frac{a_{n}}{\tau_{n}}\bigl{(}\hat{{\theta}}_{r*}-\hat{{\theta}}_{*}\bigr{)}\right]\Bigr{)}
=D(anτn(θ^r1mθ^m)anτn(θ^mθ^),[anτn(θ^rθ^)]).\displaystyle=D\Bigl{(}\frac{a_{n}}{\tau_{n}}\bigl{(}\hat{{\theta}}_{r_{1}m}-\hat{{\theta}}_{m}\bigr{)}-\frac{a_{n}}{\tau_{n}}\bigl{(}\hat{{\theta}}_{m}-\hat{{\theta}}_{*}\bigr{)},\left[\frac{a_{n}}{\tau_{n}}\bigl{(}\hat{{\theta}}_{r*}-\hat{{\theta}}_{*}\bigr{)}\right]\Bigr{)}.

Conditional on the data, (an/τn)(θ^r1mθ^m)(a_{n}/\tau_{n})(\hat{\theta}_{r_{1}m}-\hat{\theta}_{m}) has the same weak limit as an(θ^mθm)a_{n}(\hat{\theta}_{m}-\theta_{m}), and the same holds for (an/τn)(θ^r1θ^)(a_{n}/\tau_{n})(\hat{\theta}_{r_{1}*}-\hat{\theta}_{*}) and an(θ^θ)a_{n}(\hat{\theta}_{*}-\theta_{*}). Now (B.3) and τn\tau_{n}\rightarrow\infty combine to give

anτn(θ^mθ^)P0,\frac{a_{n}}{\tau_{n}}\bigl{(}\hat{{\theta}}_{m}-\hat{{\theta}}_{*}\bigr{)}\stackrel{{\scriptstyle P}}{{\rightarrow}}0,

as nn\rightarrow\infty. Lemma B.1 follows directly now. ∎

Appendix C Details of experiments

Among the competing methods, stepwise regression requires no tuning. MIO requires specifying, at the outset, a range of desired sparsity levels and a time limit for the MIO solver at each sparsity level. We set the sparsity range to \{1,3,\ldots,29\} in all settings, to cover the sparsity levels across the different simulation settings, and the time limit to 10 seconds per sparsity level. We select the optimal MIO sparsity level as the one whose resulting estimate gives the lowest BIC, and we likewise use BIC to select the optimal regularization parameter for Lasso and SCAD. The Knockoff filter comes in two flavors, Knockoff and Knockoff+. We found that Knockoff+ hardly selects any features in our settings, so we use Knockoff for evaluation, setting its false discovery rate threshold at the default value of 0.1.
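For reference, the BIC-based tuning step for Lasso can be sketched as follows; the synthetic data and the use of scikit-learn's LassoLarsIC are illustrative stand-ins for the actual experimental code, and SCAD is tuned analogously over its own regularization path.

```python
import numpy as np
from sklearn.linear_model import LassoLarsIC

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))             # synthetic stand-in for one simulation setting
y = X[:, :5] @ np.ones(5) + rng.standard_normal(200)

# Fit the Lasso path and keep the penalty minimizing BIC, mirroring how the
# regularization parameter is tuned for Lasso (and, analogously, SCAD).
fit = LassoLarsIC(criterion="bic").fit(X, y)
selected_features = np.flatnonzero(fit.coef_)
print(selected_features)
```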

Appendix D Details of Indian Monsoon data

Various studies indicate that our knowledge about the physical drivers of precipitation in India is incomplete, in addition to the known difficulties in modeling precipitation itself (Knutti et al., 2010; Trenberth et al., 2003; Trenberth, 2011; Wang et al., 2005). For example, Goswami (2005) discovered an upward trend in the frequency and magnitude of extreme rain events using daily central Indian rainfall data on a 10^{\circ}\times 12^{\circ} grid, but a similar study on 1^{\circ}\times 1^{\circ} gridded data by Ghosh et al. (2016) suggested that there are both increasing and decreasing trends in extreme rainfall events, depending on the location. Additionally, Krishnamurthy et al. (2009) reported increasing trends in exceedances of the 99th percentile of daily rainfall, but also a decreasing trend in exceedances of the 90th percentile in many parts of India. Significant spatial and temporal variability at various scales has also been documented for the Indian monsoon (Dietz & Chatterjee, 2014, 2015).

We attempt to identify the driving factors behind precipitation during the Indian monsoon season using our e-value based feature selection technique. Data were obtained from the repositories of the National Climatic Data Center (NCDC) and the National Oceanic and Atmospheric Administration (NOAA) for the years 1978-2012. We collected data on 35 potential predictors of Indian summer precipitation:

(A) Station-specific (from 36 weather stations across India): latitude, longitude, elevation, maximum and minimum temperature, tropospheric temperature difference (\Delta TT), Indian Dipole Mode Index (DMI), Niño 3.4 anomaly;

(B) Global:

  • u-wind and v-wind at 200, 600 and 850 mb;

  • Ten indices of Madden-Julian Oscillations: 20E, 70E, 80E, 100E, 120E, 140E, 160E, 120W, 40W, 10W;

  • Teleconnections: North Atlantic Oscillation (NAO), East Atlantic (EA), West Pacific (WP), East Pacific/North Pacific (EPNP), Pacific/North American (PNA), East Atlantic/Western Russia (EAWR), Scandinavia (SCA), Tropical/Northern Hemisphere (TNH), Polar/Eurasia (POL);

  • Solar Flux;

  • Land-Ocean Temperature Anomaly (TA).

These covariates are all based on existing knowledge and conjectures about the physical processes driving Indian summer precipitation. The references provided earlier in this section, and the references therein, provide background on these physical processes; modeling Indian monsoon rainfall remains one of the most challenging problems in climate science after decades of study.

Appendix E Details of fMRI data implementation

Figure E.1: (Top) Plot of significant p-values at the 95% confidence level at the specified cross-sections; (bottom) a smoothed surface obtained from the p-values clearly shows high spatial dependence in the right optic nerve, auditory nerves, auditory cortex and left visual cortex areas.

Typically, the brain is divided by a grid into three-dimensional array elements called voxels, and activity is measured at each voxel. More specifically, a series of three-dimensional images is obtained by measuring Blood Oxygen Level Dependent (BOLD) signals over a time interval as the subject performs several tasks at specific time points. A single fMRI image typically consists of on the order of 10^{5} voxels, which makes even fitting the simplest statistical model computationally intensive when it is repeated over all voxels to generate inference, e.g. when investigating the differential activation of brain regions in response to a task.

The dataset we work with comes from a recent study involving 19 test subjects and two types of visual tasks (Wakeman & Henson, 2015). Each subject went through 9 runs, in which they were shown faces or scrambled faces at specific time points. In each run 210 images were recorded at 2-second intervals, and each 3D image had dimension 64\times 64\times 33, giving 135,168 voxels. Here we use the data from a single run on subject 1, and perform a voxelwise analysis to determine the effect of time lags and of BOLD responses at neighboring voxels on the BOLD response at a voxel. Formally, we consider separate models at voxels i\in\{1,2,\ldots,V\}, with observations across time points t\in\{1,2,\ldots,T\}.

Combining the stimulus, drift, neighbor and autoregressive terms into a design matrix \tilde{X}=(\tilde{x}(1)^{T},\ldots,\tilde{x}(T)^{T})^{T} and coefficient vector \theta_{i}, we can write y_{i}(t)=\tilde{x}(t)^{T}\theta_{i}+\epsilon_{i}(t). We estimate the set of non-zero coefficients in \theta_{i} using the e-value method. Suppose this set is R_{i}, and its subsets containing the coefficients corresponding to neighbor and non-neighbor (i.e. stimulus and drift) terms are S_{i} and T_{i}, respectively. To quantify the effect of neighbors we compute the corresponding F-statistic:

F_{i}=\frac{\sum_{t}\bigl(\sum_{n\in S_{i}}\tilde{x}_{i,n}(t)\hat{\theta}_{i,n}\bigr)^{2}}{\sum_{t}\bigl(y_{i}(t)-\sum_{n\in T_{i}}\tilde{x}_{i,n}(t)\hat{\theta}_{i,n}\bigr)^{2}}\cdot\frac{T-|T_{i}|}{|S_{i}|},

and obtain its p-value, i.e. P\bigl(F_{|S_{i}|,\,T-|T_{i}|}\geq F_{i}\bigr), where F_{|S_{i}|,T-|T_{i}|} denotes an F-distributed random variable with these degrees of freedom.
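The following sketch shows this computation for a single voxel. The argument names are ours: Xs and Xt denote the columns of the design matrix corresponding to the selected neighbor terms S_i and the non-neighbor terms T_i, and theta_s, theta_t the associated estimated coefficients.

```python
import numpy as np
from scipy import stats

def neighbor_f_pvalue(y, Xs, Xt, theta_s, theta_t):
    """F-statistic and p-value for the contribution of neighbor terms at one
    voxel, following the formula above."""
    T = len(y)
    num = np.sum((Xs @ theta_s) ** 2)      # fitted contribution of neighbor terms
    den = np.sum((y - Xt @ theta_t) ** 2)  # residual after stimulus and drift terms
    df1, df2 = len(theta_s), T - len(theta_t)
    F = (num / df1) / (den / df2)
    return F, stats.f.sf(F, df1, df2)      # p-value P(F_{df1, df2} >= F)
```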

Figure E.1 shows plots of the voxels with significant p-values from the above F-test, with darker colors corresponding to lower p-values, in contrast to the smoothed surface shown in the main paper. Most of the significant effects were due to coefficients corresponding to neighbor terms. A very small proportion of voxels (less than 1%) had any autoregressive effects selected, and most of these were in regions of the image outside the brain, indicating noise.

In future work, we aim to expand on these encouraging findings and repeat the procedure on the other individuals in the study. An interesting direction is to include subject-specific random effects and to correlate subjects' clinical outcomes (if any) with the observed spatial dependence patterns in their brains.