
Fractals with point impact in functional linear regression

Ian W. McKeague (im2131@columbia.edu) and Bodhisattva Sen (bs2528@columbia.edu)

Department of Biostatistics, Columbia University, 722 West 168th Street, 6th Floor, New York, New York 10032, USA

Department of Statistics, Columbia University, 1255 Amsterdam Avenue, 10th Floor, New York, New York 10027, USA
(2010; received July 2009; revised December 2009)
Abstract

This paper develops a point impact linear regression model in which the trajectory of a continuous stochastic process, when evaluated at a sensitive time point, is associated with a scalar response. The proposed model complements and is more interpretable than the functional linear regression approach that has become popular in recent years. The trajectories are assumed to have fractal (self-similar) properties in common with a fractional Brownian motion with an unknown Hurst exponent. Bootstrap confidence intervals based on the least-squares estimator of the sensitive time point are developed. Misspecification of the point impact model by a functional linear model is also investigated. Non-Gaussian limit distributions and rates of convergence determined by the Hurst exponent play an important role.

AMS subject classifications: 62G08, 62E20, 62M09, 60J65.

Keywords: Functional linear regression, fractional Brownian motion, M-estimation, misspecification, nonstandard asymptotics, empirical processes, bootstrap methods.

doi: 10.1214/10-AOS791

Volume 38, Issue 4


Supported by NSF Grants DMS-08-06088 and DMS-09-06597.

1 Introduction

This paper investigates a linear regression model involving a scalar response $Y$ and a predictor given by the value of the trajectory of a continuous stochastic process $X=\{X(t), t\in[0,1]\}$ at some unknown time point. Specifically, we consider the point impact linear regression model

Y = \alpha + \beta X(\theta) + \varepsilon \qquad (1)

and focus on the time point $\theta\in(0,1)$ as the target parameter of interest. The intercept $\alpha$ and the slope $\beta$ are scalars, and the error $\varepsilon$ is taken to be independent of $X$, having zero mean and finite variance $\sigma^{2}$. The complete trajectory of $X$ is assumed to be observed (at least on a fine enough grid that it makes no difference in terms of accuracy), even though the model itself only involves the value of $X$ at $\theta$, which represents a "sensitive" time point in terms of the relationship to the response. The main aim of the paper is to show that the precision of estimation of $\theta$ is driven by fractal behavior in $X$, and to develop valid inferential procedures that adapt to a broad range of such behavior. Our model could easily be extended in various ways, for example, to allow multiple sensitive time points or further covariates, but, for simplicity, we restrict attention to (1).

Our motivation for developing this type of model arises from genome-wide expression studies that measure the activity of numerous genes simultaneously. In these studies, it is of interest to locate genes showing activity that is associated with clinical outcomes. Emilsson et al. emilsson, for example, studied gene expression levels at over 24,000 loci in samples of adipose tissue to identify genes correlated with body mass index and other obesity-related outcomes. Gruvberger-Saal et al. gruvberger used gene expression profiles from the tumors of breast cancer patients to predict estrogen receptor protein concentration, an important prognostic marker for breast tumors; see also buness. In such studies, the gene expression profile across a chromosome can be regarded as a functional predictor, and a gene associated with the clinical outcome is identified by its base pair position $\theta$ along the chromosome; see Figure 1. Our aim here is to develop a method of constructing a confidence interval for $\theta$, leading to the identification of chromosomal regions that are potentially useful for diagnosis and therapy. Although there is an extensive statistical literature on gene expression data, it is almost exclusively concerned with multiple testing procedures for detecting differentially expressed genes; see, for example, dula, salas.

Figure 1: Log gene expression at 518 loci along chromosome 17 in tissue from a breast cancer patient.

Gene expression profiles (as in Figure 1) clearly display fractal behavior, that is, self-similarity over a range of scales. Indeed, fractals often arise when spatiotemporal patterns at higher levels emerge from localized interactions and selection processes acting at lower levels, as with gene expression activity. Moreover, the recent discovery l-a that chromosomes are folded as “fractal globules,” which can easily unfold during gene activation, also helps explain the fractal appearance of gene expression profiles.

A basic stochastic model for fractal phenomena is provided by fractional Brownian motion (fBm) (see manness), in which the so-called Hurst exponent $H\in(0,1]$ calibrates the scaling of the self-similarity and provides a natural measure of trajectory roughness. It featured prominently in the pioneering work of Benoît Mandelbrot, who stated (man, page 256) that fBm provides "the most manageable mathematical environment I can think of (for representing fractals)." For background on fBm from a statistical modeling point of view, see gn.

The key issue to be considered in this paper is how to construct a confidence interval for the true sensitive time point $\theta_{0}$ based on its least squares estimator $\hat{\theta}_{n}$, obtained by fitting model (1) from a sample of size $n$,

(\hat{\alpha}_{n},\hat{\beta}_{n},\hat{\theta}_{n})=\mathop{\arg\min}_{\alpha,\beta,\theta}\sum_{i=1}^{n}[Y_{i}-\alpha-\beta X_{i}(\theta)]^{2}. \qquad (2)
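In practice (2) is computed by profiling out $\theta$ over the observation grid: for each grid point, fit a simple linear regression and keep the point that minimizes the residual sum of squares. The following R code is a minimal sketch of this (the function name, and the storage of the trajectories as an $n\times m$ matrix X with columns indexed by the grid tgrid, are assumptions of the sketch):

    # Least squares fit (2) by grid search over theta.
    # X: n x m matrix of trajectories (row i = X_i on the grid tgrid); Y: responses.
    fit_point_impact <- function(Y, X, tgrid) {
      # residual sum of squares of the regression of Y on X(., t), for each grid point t
      rss <- apply(X, 2, function(x) sum(resid(lm(Y ~ x))^2))
      j <- which.min(rss)                  # grid point minimizing the RSS
      ab <- coef(lm(Y ~ X[, j]))
      list(alpha = unname(ab[1]), beta = unname(ab[2]), theta = tgrid[j])
    }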

We show that, when $X$ is fBm, both the rate of convergence $r_{n}$ and the limiting distribution of $\hat{\theta}_{n}$ depend on $H$. In addition, we construct bootstrap confidence intervals for $\theta_{0}$ that do not require knowledge of $H$. This facilitates applications (e.g., to gene expression data) in which the type of fractal behavior is not known in advance; the trajectory in Figure 1 has an estimated Hurst exponent of about 0.1, but it would be very difficult to estimate $H$ precisely using data in a small neighborhood of $\hat{\theta}_{n}$, so a bootstrap approach becomes crucial. We emphasize that nothing about the distribution of $X$ is used in the construction of the estimators or the bootstrap confidence intervals; the fBm assumption will only be utilized to study the large sample properties of these procedures. Moreover, our main results will make essential use of the fBm assumption only locally, that is, in a small neighborhood of $\theta_{0}$.

The point impact model (1) can be regarded as a simple working model that provides interpretable information about the influence of $X$ at a specific location (e.g., a genetic locus). Such information cannot be extracted using the standard functional linear regression model rs given by

Y=\alpha+\int_{0}^{1}f(t)X(t)\,dt+\varepsilon, \qquad (3)

where $f$ is a continuous function and $\alpha$ is an intercept, because the influence of $X(t)$ is spread continuously across $[0,1]$ and point-impact effects are excluded. In the gene expression context, if only a few genes are predictive of $Y$, then a model of the form (1) would be more suitable than (3), which does not allow $f$ to have infinite spikes. In general, however, a continuum of locations is likely to be involved (as well as point-impacts), so it is of interest to study the behavior of $\hat{\theta}_{n}$ in misspecified settings in which the data arise from combinations of (1) and (3).

Asymptotic results for the least squares estimator (2) in the correctly specified setting are presented in Section 2. In Section 3 it is shown that the residual bootstrap is consistent for the distribution of $\hat{\theta}_{n}$, leading to the construction of valid bootstrap confidence intervals without knowing $H$. The nonparametric bootstrap is shown to be inconsistent in the same setting. The effect of misspecification is discussed in Section 4. A two-sample version of the point impact model is discussed in Section 5. Some numerical examples are presented in Section 6, where we compare the proposed bootstrap confidence interval with Wald-type confidence intervals (in which $H$ is assumed to be known); an application to gene expression data is also discussed. Concluding remarks appear in Section 7. Proofs are given in Section 8.

2 Least squares estimation of the sensitive time point

Throughout we take $X$ to be a fBm with Hurst exponent $H$, which, as discussed earlier, controls the roughness of the trajectories. We shall see in this section that the rate of convergence of $\hat{\theta}_{n}$ can be expressed explicitly in terms of $H$.

First we recall some basic properties of fBm. A (standard) fBm with Hurst exponent $H\in(0,1]$ is a Gaussian process $B_{H}=\{B_{H}(t), t\in\mathbb{R}\}$ having continuous sample paths, mean zero and covariance function

\operatorname{Cov}\{B_{H}(t),B_{H}(s)\}=\tfrac{1}{2}(|t|^{2H}+|s|^{2H}-|t-s|^{2H}). \qquad (4)

By comparing their mean and covariance functions, $B_{H}(at)\stackrel{d}{=}a^{H}B_{H}(t)$ as processes, for all $a>0$ (self-similarity). Clearly, $B_{1/2}$ is a two-sided Brownian motion, and $B_{1}$ is a random straight line: $B_{1}(t)=tZ$, where $Z\sim N(0,1)$. The increments are negatively correlated if $H<1/2$, and positively correlated if $H>1/2$. Increasing $H$ results in smoother sample paths.
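For later use in the numerical examples (Section 6), fBm on a finite grid can be simulated by the Cholesky method: form the covariance matrix from (4) and multiply the transposed Cholesky factor by a vector of standard normals. A minimal R sketch, assuming the grid excludes $t=0$ (where $B_{H}$ vanishes, so including it would make the covariance matrix singular):

    # Simulate one fBm trajectory on tgrid (tgrid must not contain 0).
    sim_fbm <- function(tgrid, H) {
      G <- outer(tgrid, tgrid, function(t, s)
        0.5 * (abs(t)^(2 * H) + abs(s)^(2 * H) - abs(t - s)^(2 * H)))  # covariance (4)
      as.vector(t(chol(G)) %*% rnorm(length(tgrid)))
    }
    # e.g., x <- sim_fbm(seq(0.01, 1, by = 0.01), H = 0.3)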

Suppose $(X_{i},Y_{i})$, $i=1,\ldots,n$, are i.i.d. copies of $(X,Y)$ satisfying the model (1). The unknown parameter is $\eta=(\alpha,\beta,\theta)\in\Xi=\mathbb{R}^{2}\times[0,1]$, and its true value is denoted $\eta_{0}=(\alpha_{0},\beta_{0},\theta_{0})$. The following conditions are needed:

(A1) $X$ is a fBm with Hurst exponent $H\in(0,1)$.

(A2) $0<\theta_{0}<1$ and $\beta_{0}\neq 0$.

(A3) $E|\varepsilon|^{2+\delta}<\infty$ for some $\delta>0$.

The construction of the least squares estimator $\hat{\eta}_{n}=(\hat{\alpha}_{n},\hat{\beta}_{n},\hat{\theta}_{n})$, defined by (2), does not involve any assumptions about the distribution of the trajectories, whereas the asymptotic behavior does. Our first result gives the consistency and asymptotic distribution of $\hat{\eta}_{n}$ under the above assumptions.

Theorem 2.1

If (A1) and (A2) hold, then $\hat{\eta}_{n}$ is consistent, that is, $\hat{\eta}_{n}\stackrel{P}{\rightarrow}\eta_{0}$. If (A3) also holds, then

\zeta_{n} \equiv \bigl(\sqrt{n}(\hat{\alpha}_{n}-\alpha_{0}),\sqrt{n}(\hat{\beta}_{n}-\beta_{0}),n^{1/(2H)}(\hat{\theta}_{n}-\theta_{0})\bigr) \stackrel{d}{\rightarrow} \biggl(\sigma Z_{1},|\theta_{0}|^{-H}\sigma Z_{2},\mathop{\arg\min}_{t\in\mathbb{R}}\biggl\{2\frac{\sigma}{|\beta_{0}|}B_{H}(t)+|t|^{2H}\biggr\}\biggr) \equiv \zeta,

where $Z_{1}$ and $Z_{2}$ are i.i.d. $N(0,1)$, independent of the fBm $B_{H}$.

Remarks

1. It may come as a surprise that the convergence rate of $\hat{\theta}_{n}$ increases as $H$ decreases, and becomes arbitrarily fast as $H\to 0$. A heuristic explanation is that fBm "travels further" with a smaller $H$, so independent trajectories of $X$ are likely to "cover different ground," making it easier to estimate $\theta_{0}$. In a nutshell, the smaller the Hurst exponent, the better the design.

2. It follows from (a slight extension of) Lemmas 2.5 and 2.6 of Kim and Pollard kp that the third component of $\zeta$ is well defined.

3. Using the self-similarity of fBm, the asymptotic distribution of $\hat{\theta}_{n}$ can be expressed as the distribution of

\Delta\equiv\biggl(\frac{\sigma}{|\beta_{0}|}\biggr)^{1/H}\mathop{\arg\min}_{t\in\mathbb{R}}\bigl(B_{H}(t)+|t|^{2H}/2\bigr). \qquad (6)

This distribution does not appear to have been studied in the literature except for $H=1/2$ and $H=1$ (standard normal). When $H=1/2$, $X$ is a standard Brownian motion and the limiting distribution is given in terms of a two-sided Brownian motion with a triangular drift. Bhattacharya and Brockwell bb showed that this distribution has a density that can be expressed in terms of the standard normal distribution function. It arises frequently in change-point problems under contiguous asymptotics y, s, ms. (A simulation sketch for the limit law (6) is given after these remarks.)

4. From the proof, it can be seen that the essential assumptions on $X$ are the self-similarity and stationary increments properties in some neighborhood of $\theta_{0}$, along with the trajectories of $X$ being Lipschitz of all orders less than $H$. Note that any Gaussian self-similar process with stationary increments and zero mean is a fBm (see, e.g., Theorem 1.3.3 of em).

5. The trajectories of fBm are nondifferentiable when $H<1$, so the usual technique of Taylor expanding the criterion function about $\theta_{0}$ does not work and a more sophisticated approach is required to prove the result.

6. Note that $(\hat{\alpha}_{n},\hat{\beta}_{n})$ has the same limiting behavior as though $\theta_{0}$ were known, and $\hat{\theta}_{n}$ and $(\hat{\alpha}_{n},\hat{\beta}_{n})$ are asymptotically independent.

7. The result is readily extended to allow for additional covariates [cf. (11)], which is often important in applications. The limiting distribution of $\hat{\theta}_{n}$ remains the same, and the other regression coefficient estimates have the same limiting behavior as though $\theta_{0}$ were known.

8. Note that the assumption $\beta_{0}\neq 0$ is crucial for the theorem to hold. When $\beta_{0}=0$, the fBm does not influence the response at all and $\hat{\theta}_{n}$ contains no information about $\theta_{0}$.

3 Bootstrap confidence intervals

In general, a valid Wald-type confidence interval for $\theta_{0}$ would at least need a consistent estimator of the Hurst exponent $H$, which is a nuisance parameter in this problem. Unfortunately, however, accurate estimation of $H$ is difficult and quite often unstable. Bootstrap methods have been widely applied to avoid issues of nuisance parameter estimation, and they work well in problems with $\sqrt{n}$-rates; see bickel81, singh81, shao95 and the references therein. In this section we study the consistency properties of two bootstrap methods that arise naturally in our setting. One of these methods leads to a valid confidence interval for $\theta_{0}$ without requiring any knowledge of $H$.

3.1 Preliminaries

We start with a brief review of the bootstrap. Given a sample $\mathbf{Z}_{n}=\{Z_{1},Z_{2},\ldots,Z_{n}\}\stackrel{\mathrm{i.i.d.}}{\sim}L$ from an unknown distribution $L$, suppose that the distribution function $F_{n}$, say, of some random variable $R_{n}\equiv R_{n}(\mathbf{Z}_{n},L)$ is of interest; $R_{n}$ is usually called a root and it can in general be any measurable function of the data and the distribution $L$. The bootstrap method can be broken into three simple steps:

1. Construct an estimator $\hat{L}_{n}$ of $L$ from $\mathbf{Z}_{n}$.

2. Generate $\mathbf{Z}_{n}^{*}=\{Z_{1}^{*},\ldots,Z_{n}^{*}\}\stackrel{\mathrm{i.i.d.}}{\sim}\hat{L}_{n}$ given $\mathbf{Z}_{n}$.

3. Estimate $F_{n}$ by $F_{n}^{*}$, the conditional c.d.f. of $R_{n}(\mathbf{Z}_{n}^{*},\hat{L}_{n})$ given $\mathbf{Z}_{n}$.

Let $d$ denote the Lévy metric or any other metric metrizing weak convergence of distribution functions. We say that $F_{n}^{*}$ is weakly consistent if $d(F_{n},F_{n}^{*})\stackrel{P}{\rightarrow}0$; if $F_{n}$ has a weak limit $F$, this is equivalent to $F_{n}^{*}$ converging weakly to $F$ in probability.

The choice of $\hat{L}_{n}$ most often considered in the literature is the empirical distribution. Intuitively, an $\hat{L}_{n}$ that mimics the essential properties (e.g., smoothness) of the underlying distribution $L$ can be expected to perform well. In most situations, the empirical distribution of the data is a good estimator of $L$, but in some nonstandard situations it may fail to capture some of the important aspects of the problem, and the corresponding bootstrap method can be suspect. The following discussion illustrates this phenomenon (the inconsistency when bootstrapping from the empirical distribution of the data) when $\Delta_{n}\equiv n^{1/(2H)}(\hat{\theta}_{n}-\theta_{0})$ is the random variable of interest.

3.2 Inconsistency of bootstrapping pairs

In a regression setup there are two natural ways of bootstrapping: bootstrapping pairs (i.e., the nonparametric bootstrap) and bootstrapping residuals (while keeping the predictors fixed). We show that bootstrapping pairs (drawing $n$ data points with replacement from the original data set) is inconsistent for $\theta_{0}$.

Theorem 3.1

Under conditions (A1)–(A3), the nonparametric bootstrap is inconsistent for estimating the distribution of $\Delta_{n}$; that is, $\Delta_{n}^{*}\equiv n^{1/(2H)}(\hat{\theta}_{n}^{*}-\hat{\theta}_{n})$, conditional on the data, does not converge in distribution to $\Delta$ in probability, where $\Delta$ is defined by (6).

3.3 Consistency of bootstrapping residuals

Another bootstrap procedure is to use the form of the assumed model more explicitly to draw the bootstrap samples: condition on the predictor $X_{i}$ and generate its response as

Y_{i}^{*}=\hat{\alpha}_{n}+\hat{\beta}_{n}X_{i}(\hat{\theta}_{n})+\varepsilon_{i}^{*}, \qquad (7)

where the $\varepsilon_{i}^{*}$ are conditionally i.i.d. under the empirical distribution of the centered residuals $\hat{\varepsilon}_{i}-\bar{\varepsilon}_{n}$, with $\hat{\varepsilon}_{i}=Y_{i}-\hat{\alpha}_{n}-\hat{\beta}_{n}X_{i}(\hat{\theta}_{n})$ and $\bar{\varepsilon}_{n}=\sum_{i=1}^{n}\hat{\varepsilon}_{i}/n$. Let $\hat{\alpha}_{n}^{*},\hat{\beta}_{n}^{*}$ and $\hat{\theta}_{n}^{*}$ be the estimates of the unknown parameters obtained from the bootstrap sample. We approximate the distribution of $\zeta_{n}$ [defined in Theorem 2.1] by the conditional distribution of

\zeta_{n}^{*}\equiv\bigl[\sqrt{n}(\hat{\alpha}_{n}^{*}-\hat{\alpha}_{n}),\sqrt{n}(\hat{\beta}_{n}^{*}-\hat{\beta}_{n}),n^{1/(2H)}(\hat{\theta}_{n}^{*}-\hat{\theta}_{n})\bigr],

given the data.

Theorem 3.2

Under conditions (A1)–(A3), the above procedure of bootstrapping residuals is consistent for estimating the distribution of $\zeta_{n}$; that is, $\zeta_{n}^{*}\stackrel{d}{\rightarrow}\zeta$, in probability, conditional on the data.

We now use the above theorem to construct a valid confidence interval (CI) for $\theta_{0}$ that does not require any knowledge of $H$. Let $q^{*}_{\alpha}$ be the $\alpha$-quantile of the conditional distribution of $\hat{\theta}_{n}^{*}-\hat{\theta}_{n}$ given the data, which can be readily obtained via simulation and does not involve the knowledge of any distributional properties of $X$. The proposed approximate $(1-2\alpha)$-level bootstrap CI for $\theta_{0}$ is then given by

\mathcal{C}_{n}=[\hat{\theta}_{n}-q^{*}_{1-\alpha},\hat{\theta}_{n}-q^{*}_{\alpha}].

Under the assumptions of Theorem 3.2, the coverage probability of this CI is

P\{\theta_{0}\in\mathcal{C}_{n}\} = P\bigl\{n^{1/(2H)}q^{*}_{\alpha}\leq\Delta_{n}\leq n^{1/(2H)}q^{*}_{1-\alpha}\bigr\}
\approx P^{*}\bigl\{n^{1/(2H)}q^{*}_{\alpha}\leq\Delta_{n}^{*}\leq n^{1/(2H)}q^{*}_{1-\alpha}\bigr\}
= P^{*}\{q^{*}_{\alpha}\leq\hat{\theta}_{n}^{*}-\hat{\theta}_{n}\leq q^{*}_{1-\alpha}\}
= 1-2\alpha,

where $P^{*}$ denotes the bootstrap distribution given the data, and we have used the fact that the supremum distance between the relevant distribution functions of $\Delta_{n}$ and $\Delta_{n}^{*}$ is asymptotically negligible. The key point of this argument is that $\Delta_{n}$ and $\Delta_{n}^{*}$ have the same normalization factor $n^{1/(2H)}$ and, thus, it "cancels" out. CIs for $\alpha_{0}$ and $\beta_{0}$ can be constructed in a similar fashion.
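A minimal R sketch of this residual bootstrap CI, reusing the illustrative fit_point_impact function from Section 1 (the number of bootstrap replicates B is a user choice):

    # Residual bootstrap CI for theta_0; alpha = 0.025 gives a nominal 95% interval.
    residual_bootstrap_ci <- function(Y, X, tgrid, B = 1000, alpha = 0.025) {
      fit  <- fit_point_impact(Y, X, tgrid)
      xhat <- X[, which.min(abs(tgrid - fit$theta))]  # X_i evaluated at theta-hat
      ehat <- Y - fit$alpha - fit$beta * xhat
      ehat <- ehat - mean(ehat)                       # centered residuals
      d <- replicate(B, {
        Ystar <- fit$alpha + fit$beta * xhat + sample(ehat, replace = TRUE)  # model (7)
        fit_point_impact(Ystar, X, tgrid)$theta - fit$theta
      })
      q <- quantile(d, c(alpha, 1 - alpha))
      c(fit$theta - q[2], fit$theta - q[1])           # C_n above
    }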

3.4 Discussion

In nonparametric regression settings, dichotomies in the behavior of different bootstrap methods are well known, for example, when using the bootstrap to calibrate omnibus goodness-of-fit tests for parametric regression models; see hm, vkgms, neu and references therein. A dichotomy in the behavior of the two bootstrap methods, however, is surprising in a linear regression model. This illustrates that in problems with nonstandard asymptotics, the usual nonparametric bootstrap might fail, whereas a resampling procedure that uses some particular structure of the model can perform well. The improved performance of bootstrapping residuals will be confirmed by our simulation results in Section 6.

The difference in the behavior of the two bootstrap methods can be explained as follows. As in any M-estimation problem, the standard approach is to study the criterion (objective) function being optimized, in a neighborhood of the target parameter, by splitting it into an empirical process and a drift term. The drift term behaves differently under the two bootstrap methods: when bootstrapping pairs it does not converge, whereas the bootstrapped residuals are conditionally independent of the predictors and the drift term converges. This highlights the importance of designing the bootstrap to accurately replicate the structure of the assumed model. A more technical explanation is provided in a remark following the proof of Theorem 3.2.

Other types of resampling (e.g., the $m$-out-of-$n$ bootstrap, or subsampling) could be applicable, but such methods require knowledge of the rate of convergence, which depends on the unknown $H$; they also require the choice of a tuning parameter, which is problematic in practice. The residual bootstrap, by contrast, is consistent, easy to implement, and requires neither knowledge of $H$ nor the choice of a tuning parameter.

The inconsistency of the nonparametric bootstrap casts some doubt on its validity for checking the stability of variable selection results in high-dimensional regression problems (as is common practice). Indeed, it suggests that more care (in terms of more explicit use of the model) is needed in the choice of a bootstrap method in such settings.

4 Misspecification by a functional linear model

The point impact model cannot capture effects that are spread out over the domain of the trajectory, for example, gene expression profiles for which the effect on a clinical outcome involves complex interactions between numerous genes. Such effects, however, may be represented by a functional linear model, and we now examine how the limiting behavior of $\hat{\theta}_{n}$ changes when the data arise in this way.

4.1 Complete misspecification

In this case we treat (1) as the working model (for fitting the data), but view this model as being completely misspecified in the sense that the data are generated from the functional linear model (3). For simplicity, we set $\alpha=0$ and $\beta=1$ in the working model, and set $\alpha=0$ in the true functional linear model. The least squares estimator $\hat{\theta}_{n}$ now estimates the minimizer $\theta_{0}$ of

\mathbb{M}(\theta)\equiv E[Y-X(\theta)]^{2}=\sigma^{2}+E\biggl[\int_{0}^{1}f(t)X(t)\,dt-X(\theta)\biggr]^{2}

and the following result gives its asymptotic distribution.

Theorem 4.1

Suppose that (A1) and (A3) hold, and that $\mathbb{M}(\theta)$ has a unique minimizer $\theta_{0}$, $0<\theta_{0}<1$, at which it is twice-differentiable. Then, in the misspecified case,

n^{1/(4-2H)}(\hat{\theta}_{n}-\theta_{0})\stackrel{d}{\rightarrow}\mathop{\arg\min}_{t\in\mathbb{R}}\bigl(2aB_{H}(t)+bt^{2}\bigr),

where $a^{2}=\mathbb{M}(\theta_{0})$ and $b=\mathbb{M}''(\theta_{0})/2$.

Remarks

1. Here the rate of convergence reverses itself from the correctly specified case: the convergence rate now decreases as $H$ decreases, going from a parametric rate of $n^{1/2}$ as $H\to 1$, to as slow as $n^{1/4}$ as $H\to 0$. A heuristic explanation is that roughness in $X$ now amounts to measurement error (which results in a slower rate), as the fluctuations of $X$ are smoothed out in the true model.

2. In the case of Brownian motion trajectories ($H=1/2$), note that $\mathbb{M}(\theta)=\theta-2\int_{0}^{1}f(t)\min(t,\theta)\,dt+\mathrm{const}$, the normal equation is

\mathbb{M}'(\theta)=1-2\int_{\theta}^{1}f(t)\,dt=0 \qquad (8)

and $\mathbb{M}''(\theta)=2f(\theta)$.

3. Also in the case $H=1/2$, the limiting distribution is given in terms of two-sided Brownian motion with a parabolic drift; it was investigated originally by Chernoff c in connection with the estimation of the mode of a distribution, and shown to have a density (as the solution of a heat equation). The Chernoff distribution arises frequently in monotone function estimation settings; Groeneboom and Wellner gw introduced various algorithms for computation of its distribution function and quantiles.

4.2 Partial misspecification

The nonparametric functional linear model (3) can be combined with (1) to give the semiparametric model

Y=\alpha+\beta X(\theta)+\int_{0}^{1}f(t)X(t)\,dt+\varepsilon, \qquad (9)

which allows $X$ to have both a point impact and an influence that is spread out continuously in time. When $f=0$, this model reduces to the point impact model; when $\beta=0$, to the functional linear model. In this section we examine the behavior of $\hat{\theta}_{n}$ when the working model is (1), as before, but the data are now generated from (9).

For simplicity, suppose that $\alpha=0$ and $\beta=1$ in both the working point impact model and in the true model (9). Denote the true value of $\theta$ by $\theta_{0}\in(0,1)$. It can then be shown that $\hat{\theta}_{n}$ is robust to small levels of misspecification, that is, it consistently estimates $\theta_{0}$ with the same rate of convergence as in the correctly specified case. Indeed, $\hat{\theta}_{n}$ targets the minimizer of the criterion function

\mathbb{M}(\theta)=E[Y-X(\theta)]^{2}=|\theta-\theta_{0}|^{2H}-\int_{0}^{1}f(t)[t^{2H}+\theta^{2H}-|\theta-t|^{2H}]\,dt+\mathrm{const}.

Provided $\int|f|$ is sufficiently small, the derivative of $\mathbb{M}$ will be negative over the interval $(0,\theta_{0})$ and positive over $(\theta_{0},1)$, so $\mathbb{M}$ is minimized at $\theta_{0}$. It is then possible to extend Theorem 2.1 to give

n^{1/(2H)}(\hat{\theta}_{n}-\theta_{0})\stackrel{d}{\rightarrow}a^{1/H}\mathop{\arg\min}_{t\in\mathbb{R}}\bigl(B_{H}(t)+|t|^{2H}/2\bigr), \qquad (10)

where $a\geq\sigma$ is defined in the statement of Theorem 4.1. This shows that the effect of partial misspecification is a simple inflation of the variance [cf. (6)], without any change in the form of the limit distribution.

It is also of interest to estimate $\theta_{0}$ in a way that adapts to any (sufficiently smooth) function $f$ in this semiparametric setting. This could be done, for example, by approximating $f$ by a finite B-spline basis expansion of the form $f_{m}(t)=\sum_{j=1}^{m}\beta_{j}\phi_{j}(t)$, and fitting the working model

Y=\alpha+\beta X(\theta)+\sum_{j=1}^{m}\beta_{j}Z_{j}+\varepsilon, \qquad (11)

where $Z_{j}=\int_{0}^{1}\phi_{j}(t)X(t)\,dt$ are additional covariates with regression coefficients $\beta_{j}$; the resulting least squares estimator $\tilde{\theta}_{n}$ can then be used as an estimator of the true value $\theta_{0}$ of $\theta$. For the working model (11), the misspecification is $f-f_{m}$, which will be small if $m$ is sufficiently large. Therefore, based on our previous discussion, $\tilde{\theta}_{n}$ will satisfy a result of the form (10); in particular, $\tilde{\theta}_{n}$ will exhibit the fast $n^{1/(2H)}$-rate of convergence. Note that for this result to hold, $m$ can be fixed and does not need to tend to infinity with the sample size.
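A sketch of this augmented fit in R, using B-splines from the splines package; the basis dimension m and the Riemann quadrature weights used to form the $Z_{j}$ are illustrative choices:

    library(splines)
    # Fit the working model (11) by grid search over theta, with spline covariates Z.
    fit_augmented <- function(Y, X, tgrid, m = 5) {
      Phi <- bs(tgrid, df = m)               # m B-spline basis functions phi_j
      w   <- c(diff(tgrid)[1], diff(tgrid))  # Riemann weights on the grid
      Z   <- X %*% (Phi * w)                 # Z_j ~ int_0^1 phi_j(t) X_i(t) dt
      rss <- apply(X, 2, function(x) sum(resid(lm(Y ~ x + Z))^2))
      tgrid[which.min(rss)]                  # theta-tilde
    }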

5 Two-sample problem

In this section we discuss a variation of the point impact regression model in which the response takes just two values (say $\pm 1$). This is of interest, for example, in case-control studies in which gene-expression data are available for a sample of cancer patients and a sample of healthy controls, and the target parameter is the locus of a differentially expressed gene.

Suppose we have two independent samples of trajectories $X$, with $n_{1}$ trajectories from class $1$, and $n_{2}$ trajectories from class $-1$, for a total sample size of $n=n_{1}+n_{2}$. It is assumed that $\rho=n_{1}/n_{2}>0$ remains fixed, and the $j$th sample satisfies the model

X_{ij}(t)=\mu_{j}(t)+\varepsilon_{ij}(t),\qquad j=1,2,

where $\varepsilon_{ij}$, $i=1,\ldots,n_{j}$, are i.i.d. fBms with Hurst exponent $H\in(0,1)$, and $\mu_{j}(t)$ is an unknown mean function, assumed to be continuous. The treatment effect $\mathbb{M}(t)=\mu_{1}(t)-\mu_{2}(t)$ is taken to have a point impact in the sense of having a unique maximum at $\theta_{0}\in(0,1)$; minima can of course be treated in a similar fashion. The least squares estimator of the sensitive time point now becomes

\hat{\theta}_{n}=\mathop{\arg\max}_{\theta}\{\bar{X}_{1}(\theta)-\bar{X}_{2}(\theta)\}, \qquad (12)

where $\bar{X}_{j}(\theta)=\sum_{i=1}^{n_{j}}X_{ij}(\theta)/n_{j}$ is the sample mean for class $j$. Although a studentized version (normalizing the difference of the sample means by a standard error estimate) might be preferable in some cases, say with small or unbalanced samples, we restrict attention to $\hat{\theta}_{n}$ to keep the discussion simple. The empirical criterion function $\mathbb{M}_{n}(\theta)=\bar{X}_{1}(\theta)-\bar{X}_{2}(\theta)$ converges uniformly to $\mathbb{M}(\theta)$ a.s. (by the Glivenko–Cantelli theorem), so $\hat{\theta}_{n}$ is a consistent estimator of $\theta_{0}$.
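On a grid, (12) is a one-line computation; a minimal R sketch, with X1 and X2 the trajectory matrices of the two classes (rows = subjects, columns = points of tgrid):

    # Two-sample estimator (12): maximize the difference of the sample means.
    theta_hat_two_sample <- function(X1, X2, tgrid) {
      tgrid[which.max(colMeans(X1) - colMeans(X2))]
    }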

As before, our objective is to find a confidence interval for $\theta_{0}$ based on $\hat{\theta}_{n}$ under appropriate conditions on the treatment effect. Toward this end, we need an assumption on the degree of smoothness of the treatment effect at $\theta_{0}$ in terms of an exponent $0<S\leq 1$:

\mathbb{M}(\theta)=\mathbb{M}(\theta_{0})-c|\theta-\theta_{0}|^{2S}+o(|\theta-\theta_{0}|^{2S})

as $\theta\to\theta_{0}$, where $c>0$. If $\mathbb{M}$ is twice-differentiable at $\theta_{0}$, then this assumption holds only with $S=1$; for it to hold for some $S<1$, a cusp is needed. When the smoothness of the treatment effect and the fBm match, that is, $S=H$, the rate of convergence of $\hat{\theta}_{n}$ is $n^{1/(2H)}$, as before, and $\hat{\theta}_{n}$ has a nondegenerate limit distribution of the same form as in Theorem 2.1:

n_{1}^{1/(2H)}(\hat{\theta}_{n}-\theta_{0})\stackrel{d}{\rightarrow}\mathop{\arg\min}_{t\in\mathbb{R}}\bigl\{\bigl(1+\sqrt{\rho}\bigr)B_{H}(t)+c|t|^{2H}\bigr\}. \qquad (13)

The key step in the proof (which is simpler than in the regression case) is given at the end of Section 8.

6 Numerical examples

In this section we report some numerical results based on trajectories from fBm simulations and from gene expression data.

We first consider a correctly specified example as in Section 2 and study the behavior of CIs for the sensitive time point $\theta_{0}$ using the two bootstrap-based methods, comparing them with the $100(1-\alpha)\%$ Wald-type CI

\hat{\theta}_{n}\pm\biggl(\frac{\hat{\sigma}_{n}}{|\hat{\beta}_{n}|\sqrt{n}}\biggr)^{1/H}z_{H,\alpha/2} \qquad (14)

with $H$ assumed to be known. Here $\hat{\sigma}_{n}$ is the sample standard deviation of the residuals, and $z_{H,\alpha}$ is the upper $\alpha$-quantile of $\arg\min_{t\in\mathbb{R}}(B_{H}(t)+|t|^{2H}/2)$. In practice, $H$ needs to be estimated to apply (14). Numerous estimators of $H$ based on a single realization of $X$ have been proposed in the literature b, coeur, although observation at fine time scales is required for such estimators to work well, and it is not clear that direct plug-in would be satisfactory. The quantiles $z_{H,\alpha/2}$ needed to compute the Wald-type CIs were extracted from an extensive simulation of the limit distribution, as no closed form expression is available.
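For completeness, a sketch of the Wald-type CI (14) in R, with the quantile $z_{H,\alpha/2}$ approximated by simulating the limit law via the sim_argmin sketch of Section 2; treating $H$ as known is the assumption built into this interval:

    # Wald-type CI (14); fit is the output of fit_point_impact, sigma_hat the
    # sample standard deviation of the residuals.
    wald_ci <- function(fit, sigma_hat, n, H, alpha = 0.05, nsim = 5000) {
      z <- quantile(replicate(nsim, sim_argmin(H)), 1 - alpha / 2)
      half <- (sigma_hat / (abs(fit$beta) * sqrt(n)))^(1 / H) * z
      c(fit$theta - half, fit$theta + half)
    }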

Table 1 reports the estimated coverage probabilities and average lengths of nominal 95% confidence intervals for $\theta_{0}$ calculated using 500 independent samples. The data were generated from the model (1), with $\alpha_{0}=0$, $\beta_{0}=1$, $\theta_{0}=1/2$, $\varepsilon\sim N(0,\sigma^{2})$ for $\sigma=0.3$ and $0.5$, Hurst exponents $H=0.3,0.5,0.7$ and sample sizes $n=20$ and $40$. To calculate the least squares estimators (2), we restricted $\theta$ to a uniform grid of 101 points in $[0,1]$; the fBm trajectories were generated over the same grid. The fBm simulations were carried out in R, using the function fbmSim from the fArma package, via the Cholesky method of decomposing the covariance matrix of $X$. Histograms and scatterplots of $\hat{\theta}_{n}$ and $\hat{\beta}_{n}$ for $H=0.3,0.5,0.7$ when $\sigma=0.5$ are displayed in Figure 2.

Table 1: Monte Carlo results for coverage probabilities and average widths of nominal 95% confidence intervals for $\theta_{0}$; data simulated from the linear model with $\theta_{0}=1/2$, $\alpha_{0}=0$ and $\beta_{0}=1$

                          Wald-H           R bootstrap      NP bootstrap
  n    sigma    H      Cover   Width     Cover   Width     Cover   Width
  20   0.3     0.3     0.874   0.023     0.924   0.044     1.000   0.174
               0.5     0.880   0.088     0.946   0.119     0.992   0.220
               0.7     0.822   0.170     0.912   0.249     0.970   0.360
       0.5     0.3     0.806   0.129     0.912   0.211     0.998   0.410
               0.5     0.852   0.256     0.924   0.333     0.988   0.487
               0.7     0.834   0.352     0.938   0.510     0.962   0.591
  40   0.3     0.3     0.984   0.007     0.986   0.002     1.000   0.022
               0.5     0.892   0.048     0.942   0.053     0.992   0.087
               0.7     0.898   0.108     0.930   0.138     0.976   0.182
       0.5     0.3     0.900   0.039     0.928   0.054     0.998   0.149
               0.5     0.908   0.134     0.950   0.165     0.990   0.251
               0.7     0.856   0.229     0.946   0.332     0.962   0.386
Figure 2: Histograms and scatterplots of $\hat{\theta}_{n}$ and $\hat{\beta}_{n}$ in the correctly specified case for $H=0.3$ (top row), $H=0.5$ (middle row) and $H=0.7$ (bottom row), based on 500 samples of size $n=20$.

In practice, $X$ can only be observed at discrete time points, so restricting to a grid is the natural formulation for this example. Indeed, the resolution of the observation times in the neighborhood of $\theta_{0}$ is a limiting factor for the accuracy of $\hat{\theta}_{n}$, so the grid resolution needs to be fine enough for the statistical behavior of $\hat{\theta}_{n}$ to be apparent. For large sample sizes, a very fine grid would be needed in the case of a small Hurst exponent (cf. Theorem 2.1). Indeed, the histogram of $\hat{\theta}_{n}$ in the case $H=0.3$ (the first plot in Figure 2) shows that the resolution of the grid is almost too coarse to see the statistical variation, as the bin centered on $\theta_{0}=1/2$ contains almost 80% of the estimates. This phenomenon is also observed in Table 1 when $n=40$ and $\sigma=H=0.3$: the average length of the CIs is smaller than the resolution of the grid and, thus, we observe over-coverage. The two histograms of $\hat{\theta}_{n}$ for $H=0.5$ and $H=0.7$, however, show increasing dispersion and become closer to bell-shaped as $H$ increases.

Recall that the Wald-type CIs pretend that $H$ is known, which should be an advantage; yet, based on the results in Table 1, the residual bootstrap has better performance. We see that the Wald-type CIs usually have coverage less than the nominal 95%, whereas the inconsistent nonparametric bootstrap method over-covers, with observed coverage probability close to 1. Accordingly, the average lengths of the Wald-type CIs are the smallest, whereas those obtained from the nonparametric bootstrap method are the widest. The behavior of the CIs obtained from the nonparametric bootstrap also illustrates the inconsistency of this procedure. A similar phenomenon was observed in LM06 in connection with estimators that converge at $n^{1/3}$-rate.

Despite the asymptotic independence of $\hat{\theta}_{n}$ and $\hat{\beta}_{n}$, considerable correlation is apparent in the scatterplots in Figure 2, with increasing negative correlation as $H$ increases; note, however, that when $H=1$ there is a lack of identifiability of $\theta$ and $\beta$, so the trend in the correlation as $H$ approaches 1 is to be expected in small samples.

Figure 3: Same as Figure 2 except in the partially misspecified case.

Next we consider a partially misspecified example, in which the data are now generated from (9) by setting $f(t)=1/2$ and $\theta=\theta_{*}=1/2$, with the other ingredients unchanged from the correctly specified example. The plots in Figure 3 correspond to those in Figure 2. The effect of misspecification on $\hat{\theta}_{n}$ is a slight increase in dispersion but no change in mean; the effect on $\hat{\beta}_{n}$ is a substantial shift in mean along with a slight increase in dispersion.

6.1 Gene expression example

Next we consider the gene expression data mentioned in connection with Figure 1, to see how the residual bootstrap performs with such trajectories. The trajectories consist of log gene expression levels from the breast tissue of $n=40$ breast cancer patients, along a sequence of 518 loci from chromosome 17. The complete gene expression data set is described in Richardson et al. rich. Although a continuous response is not available for this particular data set, it is available in numerous other studies of this type; see the references mentioned in the Introduction.

Figure 4: Gene expression example: histograms of $\hat{\theta}_{n}^{*}$ based on 1000 residual bootstrap samples and simulated responses with $\sigma=0.01$ (left), $\sigma=0.03$ (middle) and $\sigma=0.1$ (right).

To construct a scalar response, we generated $Y_{i}$ using the point impact model (1) with $\alpha_{0}=0$, $\beta_{0}=1$, $\theta_{0}=0.5$ (corresponding to the 259th locus along the chromosome) and $\varepsilon\sim N(0,\sigma^{2})$ for various values of $\sigma$. As previously noted, the trajectories are very rough in this example (with $H$ estimated to be about 0.1), which implies a rapid rate of convergence for $\hat{\theta}_{n}$. We find that an abrupt transition in the behavior of the residual bootstrap occurs as $\sigma$ increases: for small $\sigma$, the residual bootstrap estimates become degenerate at $\theta_{0}$ due to the relatively coarse resolution; for moderately large $\sigma$, although a considerable portion of the estimates are concentrated at $\theta_{0}$, they become spread out over the 518 loci; for very large $\sigma$, the estimates are more or less uniformly scattered along the chromosome. Indeed, this is consistent with the behavior of the Wald-type CI (14), whose width is proportional to $\sigma^{1/H}$ and thus blows up dramatically as $\sigma$ increases when $H$ is small.

In Figure 4 we plot the bootstrap distribution of $\hat{\theta}_{n}$ (obtained from 1000 residual bootstrap samples in each case) for $\sigma=0.01$, 0.03 and 0.1. When $\sigma=0.01$, the bootstrap distribution is degenerate at $\theta_{0}$; the resolution of the grid is too coarse to see any statistical fluctuation in this case. When $\sigma$ is moderate, namely 0.03, although the bootstrap distribution has a peak at $\theta_{0}$, the mass is widely scattered and the resulting CI now covers almost the entire chromosome. Further increasing the noise level causes the bootstrap distribution to become even more dispersed and its mode moves away from $\theta_{0}$; the sample size of 40 is now too small for the method to locate the neighborhood of $\theta_{0}$.

7 Concluding remarks

In this paper we have developed a point impact functional linear regression model for use with trajectories as predictors of a continuous scalar response. It is expected that the proposed approach will be useful when there are sensitive time points at which the trajectory has an effect on the response. We have derived the rates of convergence and the explicit limiting distributions of the least squares estimator of such a parameter in terms of the Hurst exponent for fBm trajectories. We also established the validity of the residual bootstrap method for obtaining CIs for sensitive time points, avoiding the need to estimate the Hurst exponent. In addition, we have developed some results in the misspecified case in which the data are generated partially or completely from a standard functional linear model, and in the two-sample setting.

Although for simplicity of presentation we have assumed that the trajectories are fBm, it is clear from the proofs that it is only local properties of the trajectories in the neighborhood of the sensitive time point that drive the theory, and thus the validity of the confidence intervals. The consistency of the least squares estimator is of course needed, but this could be established under much weaker assumptions (namely, uniform convergence of the empirical criterion function and the existence of a well-separated minimum of the limiting criterion function; cf. vw, page 287).

Exploiting the fractal behavior of the trajectories plays a crucial role in developing confidence intervals based on the least squares estimator of the sensitive time point, in contrast to standard functional linear regression where it is customary to smooth the predictor trajectories prior to fitting the model (rs, Chapter 15). Our approach does not require any preprocessing of the trajectories involving a choice of smoothing parameters, nor any estimation of nuisance parameters (namely, the Hurst exponent). On the other hand, functional linear regression is designed with prediction in mind, rather than interpretability, so in a sense the two approaches are complementary. The tendency of functional linear regression to over-smooth a point impact (see lmck for detailed discussion) is also due to the use of a roughness penalty on the regression function; the smoothing parameter is usually chosen by cross-validation, a criterion that optimizes for predictive performance.

Our model naturally extends to allow multiple sensitive time points, and any model selection procedure having the oracle property (such as the adaptive lasso) could be used to estimate the number of those sensitive time points. The bootstrap procedure for the (unregularized) least squares estimator can then be adapted to provide individual CIs around each time point, although developing theoretical justification would be challenging. Other challenging problems would be to develop bootstrap procedures that are suitable for the two-sample problem and for the misspecified model settings.

It should be feasible to carry through much of our program for certain types of diffusion processes driven by fBm, and also for processes having jumps. In the case of piecewise constant trajectories that have a single jump, the theory specializes to an existing type of change-point analysis kqs. Other possibilities include Lévy processes (which have stationary independent increments) and multi-parameter fBm. It should also be possible to develop versions of our results in the setting of censored survival data (e.g., Cox regression). Lindquist and McKeague lmck recently studied point impact generalized linear regression models in the case that $X$ is standard Brownian motion, and we expect that our approach can be extended to such models as well.

8 Proofs

To avoid measurability problems and for simplicity of notation, we will always use outer expectation/probability, and denote them by $E$ and $P$; $E^{*}$ and $P^{*}$ will denote bootstrap conditional expectation/probability given the data.

We begin with the proof of Theorem 2.1. The strategy is to establish (a) consistency, (b) the rate of convergence, (c) the weak convergence of a suitably localized version of the criterion function, and (d) apply the argmax (or argmin) continuous mapping theorem.

8.1 Consistency

We start with some notation. Let $m_{\eta}(Y,X)\equiv[Y-\alpha-\beta X(\theta)]^{2}$ and let $\mathbb{M}_{n}(\eta)\equiv\mathbb{P}_{n}m_{\eta}=\frac{1}{n}\sum_{i=1}^{n}[Y_{i}-\alpha-\beta X_{i}(\theta)]^{2}$, where $\mathbb{P}_{n}$ denotes the expectation with respect to the empirical measure of the data. Let

\mathbb{M}(\eta) \equiv Pm_{\eta} = (\alpha_{0}-\alpha)^{2}+P[\{\beta_{0}X(\theta_{0})-\beta X(\theta)\}^{2}]+\sigma^{2}
= (\alpha_{0}-\alpha)^{2}+\sigma^{2}+(\beta_{0}-\beta)^{2}P[X^{2}(\theta_{0})]+\beta^{2}P[X(\theta_{0})-X(\theta)]^{2}
  + 2\beta(\beta_{0}-\beta)P[X(\theta_{0})\{X(\theta_{0})-X(\theta)\}].

First observe that $\mathbb{M}(\eta)$ has a unique minimizer at $\eta_{0}$, as $P[\beta X(\theta)\neq\beta_{0}X(\theta_{0})]>0$ for all $(\beta,\theta)\in\mathbb{R}\times(0,1)$ with $(\beta,\theta)\neq(\beta_{0},\theta_{0})$.

Using the fBm covariance formula (4),

\mathbb{M}(\eta)-\mathbb{M}(\eta_{0}) = (\alpha_{0}-\alpha)^{2}+(\beta_{0}-\beta)^{2}|\theta_{0}|^{2H}+\beta^{2}|\theta_{0}-\theta|^{2H}
  + \beta(\beta_{0}-\beta)\{|\theta_{0}|^{2H}+|\theta_{0}-\theta|^{2H}-|\theta|^{2H}\}
= (\alpha_{0}-\alpha)^{2}+(\beta_{0}-\beta)^{2}|\theta_{0}|^{2H}+\beta\beta_{0}|\theta_{0}-\theta|^{2H}
  + \beta(\beta_{0}-\beta)\{|\theta_{0}|^{2H}-|\theta|^{2H}\}.

To show that $\hat{\eta}_{n}$ is a consistent estimator of $\eta_{0}$, first note that $\hat{\eta}_{n}$ is uniformly tight. Also notice that $\mathbb{M}(\eta)$ is continuous and has a unique minimum at $\eta_{0}$, and, thus, by Theorem 3.2.3(i) of vw, it is enough to show that $\mathbb{M}_{n}\stackrel{P}{\rightarrow}\mathbb{M}$ uniformly on each compact subset $K$ of $\Xi=\mathbb{R}^{2}\times[0,1]$, and that $\mathbb{M}$ has a well-separated minimum in the sense that $\mathbb{M}(\eta_{0})<\inf_{\eta\notin G}\mathbb{M}(\eta)$ for every open set $G$ that contains $\eta_{0}$. That $\mathbb{M}$ has a well-separated minimum can be seen from the form of the expression for $\mathbb{M}(\eta)-\mathbb{M}(\eta_{0})$ displayed above. For the uniform convergence, we need to show that the class $\mathcal{F}=\{m_{\eta}\colon\eta\in K\}$ is $P$-Glivenko–Cantelli ($P$-GC). Using GC preservation properties (see Corollary 9.27 of kbook), it is enough to show that $\mathcal{G}=\{B_{H}(h)\equiv X(\theta_{0}+h)-X(\theta_{0})\colon h\in[-1,1]\}$ is $P$-GC. Note that almost all trajectories of $X$ are Lipschitz of any order strictly less than $H$, in the sense of (22) in Lemma 8.1 below. Thus, the bracketing number $N_{[\cdot]}(\varepsilon,\mathcal{G},L_{1}(Q))<\infty$ and $\mathcal{G}$ is $P$-GC, by Theorems 2.7.11 and 2.4.1 of vw.

8.2 Rate of convergence

We will apply a result of van der Vaart and Wellner (vw, Theorem 3.2.5) to obtain a lower bound on the rate of convergence of the M-estimator $\hat{\eta}_{n}$. Setting $\tilde{d}(\eta,\eta_{0})=\max\{|\alpha-\alpha_{0}|,|\beta-\beta_{0}|,|\theta-\theta_{0}|^{H}\}$, the first step is to show that

\mathbb{M}(\eta)-\mathbb{M}(\eta_{0})\gtrsim\tilde{d}^{2}(\eta,\eta_{0}) \qquad (17)

in a neighborhood of $\eta_{0}$. Here $\gtrsim$ means that the right-hand side is bounded above by a (positive) constant times the left-hand side. Note that $|\theta_{0}|^{2H}-|\theta|^{2H}$ has a bounded derivative in $\theta\in[\delta,1]$, where $\delta>0$ is fixed, so for such $\theta$ we have

\beta(\beta-\beta_{0})\{|\theta_{0}|^{2H}-|\theta|^{2H}\} \geq -|\beta||\beta_{0}-\beta|C|\theta_{0}-\theta|
= -[|\beta|C|\theta_{0}-\theta|^{1-H}]|\beta_{0}-\beta||\theta_{0}-\theta|^{H}
\geq -c(\theta,\beta)\{(\beta_{0}-\beta)^{2}+|\theta_{0}-\theta|^{2H}\},

where $C$ is the bound on the derivative, $c(\theta,\beta)=|\beta|C|\theta_{0}-\theta|^{1-H}/2$, and we used the inequality $|ab|\leq(a^{2}+b^{2})/2$. As $\beta_{0}\neq 0$ and $0<\theta_{0}<1$, by combining the expansion of $\mathbb{M}(\eta)-\mathbb{M}(\eta_{0})$ above with this bound, suitably grouping terms, and noting that $c(\theta,\beta)$ can be made arbitrarily small by restricting $\eta$ to a sufficiently small neighborhood of $\eta_{0}$, there exist $c_{1}>0$ and $c_{2}>0$ such that

\mathbb{M}(\eta)-\mathbb{M}(\eta_{0})\geq(\alpha_{0}-\alpha)^{2}+c_{1}(\beta_{0}-\beta)^{2}+c_{2}|\theta_{0}-\theta|^{2H},

which shows that (17) holds.

Let $\mathcal{M}_{\delta}\equiv\{m_{\eta}-m_{\eta_{0}}\colon\tilde{d}(\eta,\eta_{0})<\delta\}$, where $\delta\in(0,1]$. Note that

m_{\eta}-m_{\eta_{0}} = (\alpha^{2}-\alpha_{0}^{2})+\beta^{2}[X^{2}(\theta)-X^{2}(\theta_{0})]+(\beta^{2}-\beta_{0}^{2})X^{2}(\theta_{0})
  - 2Y(\alpha-\alpha_{0})-2\beta Y[X(\theta)-X(\theta_{0})]-2(\beta-\beta_{0})YX(\theta_{0})
  + 2\alpha\beta[X(\theta)-X(\theta_{0})]+2\alpha X(\theta_{0})(\beta-\beta_{0})
  + 2\beta_{0}X(\theta_{0})(\alpha-\alpha_{0}).

This shows that $\mathcal{M}_{\delta}$ has envelope

M_{\delta}(Y,X) \equiv (2|\alpha_{0}|+\delta)\delta+(|\beta_{0}|+\delta)^{2}\sup_{|\theta-\theta_{0}|^{H}<\delta}|X^{2}(\theta)-X^{2}(\theta_{0})|
  + X^{2}(\theta_{0})\delta(2|\beta_{0}|+\delta)+2|Y|\delta
  + 2|Y|(|\beta_{0}|+\delta)\sup_{|\theta-\theta_{0}|^{H}<\delta}|X(\theta)-X(\theta_{0})|
  + 2|X(\theta_{0})||Y|\delta+2(|\alpha_{0}|+\delta)(|\beta_{0}|+\delta)\sup_{|\theta-\theta_{0}|^{H}<\delta}|X(\theta)-X(\theta_{0})|
  + 2(|\alpha_{0}|+\delta)|X(\theta_{0})|\delta+2|\beta_{0}||X(\theta_{0})|\delta.

Using a maximal inequality for fBm (Theorem 1.1 of nov), we have

E\Bigl[\sup_{|\theta-\theta_{0}|^{H}<\delta}|X(\theta)-X(\theta_{0})|^{q}\Bigr]\lesssim\delta^{q} \qquad (21)

for any $q>0$. Then, using (A3) in conjunction with Hölder's inequality (cf. the proof of Lemma 8.1), all nine terms in the expansion of $m_{\eta}-m_{\eta_{0}}$ above can be shown to have second moments bounded by $\delta^{2}$ (up to a constant) and, thus, $EM_{\delta}^{2}\lesssim\delta^{2}$.

The following lemma shows that $m_{\eta}$ is "Lipschitz in parameter" and, consequently, that the bracketing entropy integral $J_{[\cdot]}(1,\mathcal{M}_{\delta},L_{2}(P))$ is uniformly bounded as a function of $\delta\in(0,1]$; see vw, page 294. Without loss of generality, to simplify notation, we assume that $\alpha=0$ and $\beta=1$, and state the lemma with $\theta$ as the only parameter.

Lemma 8.1

If (A1) and (A3) hold and $0<\alpha<H$, there is a random variable $L$ with finite second moment such that

|m_{\theta_{1}}-m_{\theta_{2}}|\leq L|\theta_{1}-\theta_{2}|^{\alpha} \qquad (22)

for all $\theta_{1},\theta_{2}\in[0,1]$ almost surely.

Proof.

The trajectories of fBm are Lipschitz of any order $\alpha<H$ in the sense that

|X(t)-X(s)|\leq\xi|t-s|^{\alpha}\qquad\forall t,s\in[0,1] \qquad (23)

almost surely, where $\xi$ has moments of all orders; this is a consequence of the proof of Kolmogorov's continuity theorem; see Theorem 2.2 of Revuz and Yor ry. Noting that $m_{\theta}(X,Y)=(Y-X(\theta))^{2}$, we then have

|m_{\theta_{1}}-m_{\theta_{2}}|\leq C|X(\theta_{1})-X(\theta_{2})|\leq L|\theta_{1}-\theta_{2}|^{\alpha},

where $C=2(\sup_{\theta}|X(\theta)|+|Y|)$ and $L=C\xi$. Here $L$ has a finite second moment:

EL^{2}\leq\{EC^{2p}\}^{1/p}\{E\xi^{2q}\}^{1/q}<\infty

by Hölder's inequality with $1/p+1/q=1$, where $p=1+\delta/2$ and $\delta>0$ comes from the moment condition (A3).

Using a maximal inequality from vw (see page 291), we then have

E_{P}\|\mathbb{G}_{n}\|_{\mathcal{M}_{\delta}}\lesssim J_{[\cdot]}(1,\mathcal{M}_{\delta},L_{2}(P))(EM_{\delta}^{2})^{1/2}\lesssim\delta

for all $\delta\in(0,1]$, where $\mathbb{G}_{n}=\sqrt{n}(\mathbb{P}_{n}-P)$, and it follows that $\tilde{d}(\hat{\eta}_{n},\eta_{0})=O_{P}(1/\sqrt{n})$ by Theorem 3.2.5 of vw.

8.3 Localizing the criterion function

To simplify notation, let $r_{n}^{-1}\mathbf{h}\equiv(h_{1}/\sqrt{n},h_{2}/\sqrt{n},n^{-1/(2H)}h_{3})$, for $\mathbf{h}=(h_{1},h_{2},h_{3})\in\mathbb{R}^{3}$. Then

\zeta_{n}=\mathop{\arg\min}_{\mathbf{h}}[\mathbb{M}_{n}(\eta_{0}+r_{n}^{-1}\mathbf{h})-\mathbb{M}_{n}(\eta_{0})] \qquad (24)

and we can write the expression in the square brackets, after multiplication by $n$, as the sum of an empirical process and a drift term:

\mathbb{G}_{n}\bigl[\sqrt{n}(m_{\eta_{0}+r_{n}^{-1}\mathbf{h}}-m_{\eta_{0}})\bigr]+n[\mathbb{M}(\eta_{0}+r_{n}^{-1}\mathbf{h})-\mathbb{M}(\eta_{0})]. \qquad (25)

First consider the empirical process term, and note that

m_{\eta_{0}+r_{n}^{-1}\mathbf{h}} = \bigl[Y-(\alpha_{0}+n^{-1/2}h_{1})-(\beta_{0}+n^{-1/2}h_{2})X\bigl(\theta_{0}+n^{-1/(2H)}h_{3}\bigr)\bigr]^{2}
= \biggl[\varepsilon-\biggl\{\frac{h_{1}}{\sqrt{n}}+\biggl(\beta_{0}+\frac{h_{2}}{\sqrt{n}}\biggr)X\bigl(\theta_{0}+n^{-1/(2H)}h_{3}\bigr)-\beta_{0}X(\theta_{0})\biggr\}\biggr]^{2},

so we obtain

\sqrt{n}[m_{\eta_{0}+r_{n}^{-1}\mathbf{h}}-m_{\eta_{0}}] = \sqrt{n}\biggl[\frac{h_{1}}{\sqrt{n}}+\biggl(\beta_{0}+\frac{h_{2}}{\sqrt{n}}\biggr)\frac{\mathbb{B}(h_{3})}{\sqrt{n}}+\frac{h_{2}}{\sqrt{n}}X(\theta_{0})\biggr]^{2}
  - 2\varepsilon\biggl[h_{1}+\biggl(\beta_{0}+\frac{h_{2}}{\sqrt{n}}\biggr)\mathbb{B}(h_{3})+h_{2}X(\theta_{0})\biggr],

where $\mathbb{B}(h_{3})\equiv\sqrt{n}\bigl[X\bigl(\theta_{0}+n^{-1/(2H)}h_{3}\bigr)-X(\theta_{0})\bigr]\stackrel{d}{=}B_{H}(h_{3})$ (as a process in $h_{3}$).

The result of applying $\mathbb{G}_{n}$ to the first term on the right-hand side of the above display is of order $o_{P}(1)$ uniformly in $\mathbf{h}\in[-K,K]^{3}$, for each $K>0$. This is seen by applying the maximal inequality from vw, page 291, as used above; here the class of functions $\mathcal{F}_{n}$ in question is bounded by the envelope function

Fn=3n{K2n+(β0+Kn)2sup|h3|K𝔹2(h3)n+K2nX2(θ0)},F_{n}=3\sqrt{n}\biggl{\{}{\frac{K^{2}}{n}+\biggl{(}\beta_{0}+\frac{K}{\sqrt{n}}\biggr{)}^{2}\sup_{|h_{3}|\leq K}}\frac{\mathbb{B}^{2}(h_{3})}{n}+\frac{K^{2}}{n}X^{2}(\theta_{0})\biggr{\}},

for which $PF_{n}^{2}=o(1)$ and $J_{[\cdot]}(1,\mathcal{F}_{n},L_{2}(P))<\infty$; cf. the proof of Lemma 8.1. To determine the limit distribution of the empirical process term in (25), it thus suffices to consider the second term and show that

\mathbb{G}_{n}\bigl[\bigl(\varepsilon,\varepsilon\mathbb{B}(h_{3}),\varepsilon X(\theta_{0})\bigr)\bigr] \stackrel{d}{\to} \bigl(\sigma Z_{1},\sigma B_{H}(h_{3}),\sigma Z_{2}\bigr) \qquad (27)

in $\mathbb{R}\times C[-K,K]\times\mathbb{R}$, where $Z_{1},Z_{2}$ are i.i.d. $N(0,1)$ and independent of the fBm $B_{H}$. For the second component above, notice that since $\varepsilon$ is independent of $\mathbb{B}$,

\mathbb{G}_{n}[\varepsilon\mathbb{B}(h_{3})] \stackrel{d}{=} B_{H}(h_{3})\Biggl(\frac{1}{n}\sum_{i=1}^{n}\varepsilon_{i}^{2}\Biggr)^{1/2} \stackrel{d}{\to} \sigma B_{H}(h_{3}) \qquad (28)

in $C[-K,K]$. The asymptotic independence of the three components of (27) is a consequence of

\operatorname{Cov}(\varepsilon,\varepsilon\mathbb{B}(h_{3})) = \sigma^{2}E[\mathbb{B}(h_{3})] = 0,
\operatorname{Cov}(\varepsilon,\varepsilon X(\theta_{0})) = \sigma^{2}E[X(\theta_{0})] = 0,
\operatorname{Cov}(\varepsilon\mathbb{B}(h_{3}),\varepsilon X(\theta_{0})) = \sigma^{2}\biggl[\frac{\sqrt{n}}{2}\bigl(|\theta_{0}+n^{-1/(2H)}h_{3}|^{2H}-|\theta_{0}|^{2H}\bigr)-\frac{|h_{3}|^{2H}}{2\sqrt{n}}\biggr],

which tends to zero uniformly in $h_{3}\in[-K,K]$, using the assumption $H<1$.
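The last covariance follows from the fBm covariance $E[X(s)X(t)]=\frac{1}{2}(|s|^{2H}+|t|^{2H}-|s-t|^{2H})$: writing $\delta_{n}=n^{-1/(2H)}h_{3}$,

E[\mathbb{B}(h_{3})X(\theta_{0})] = \frac{\sqrt{n}}{2}\bigl(|\theta_{0}+\delta_{n}|^{2H}-|\theta_{0}|^{2H}-|\delta_{n}|^{2H}\bigr), \qquad \frac{\sqrt{n}}{2}|\delta_{n}|^{2H} = \frac{|h_{3}|^{2H}}{2\sqrt{n}},

and a first-order Taylor expansion shows that the remaining term is $O(n^{1/2-1/(2H)})$, which is where the assumption $H<1$ enters.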

It remains to find the limit of the drift term in (25). Using (8.1), it is given by

h_{1}^{2}+h_{2}^{2}|\theta_{0}|^{2H}+(\beta_{0}+n^{-1/2}h_{2})\beta_{0}|h_{3}|^{2H}
{}+h_{2}(\beta_{0}+n^{-1/2}h_{2})\bigl[\sqrt{n}\bigl\{|\theta_{0}|^{2H}-|\theta_{0}+n^{-1/(2H)}h_{3}|^{2H}\bigr\}\bigr]
\rightarrow h_{1}^{2}+h_{2}^{2}|\theta_{0}|^{2H}+\beta_{0}^{2}|h_{3}|^{2H}

uniformly in $\mathbf{h}\in[-K,K]^{3}$. Combining this with the limit distribution of the first term in (25), we get from (24) and the argmin continuous mapping theorem that

\zeta_{n} \stackrel{d}{\rightarrow} \mathop{\arg\min}_{\mathbf{h}}\bigl[-2\sigma\bigl(Z_{1}h_{1}+\beta_{0}B_{H}(h_{3})+h_{2}|\theta_{0}|^{H}Z_{2}\bigr)+\bigl(h_{1}^{2}+h_{2}^{2}|\theta_{0}|^{2H}+\beta_{0}^{2}|h_{3}|^{2H}\bigr)\bigr]
\stackrel{d}{=} \biggl[\sigma Z_{1},\ |\theta_{0}|^{-H}\sigma Z_{2},\ \mathop{\arg\min}_{h_{3}}\biggl\{2\frac{\sigma}{|\beta_{0}|}B_{H}(h_{3})+|h_{3}|^{2H}\biggr\}\biggr].

This completes the proof of Theorem 2.1.
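The law of the third component has no closed form for general $H$, but it can be approximated by direct simulation. The sketch below is illustrative only: the helper names, grid width $K$ and resolution $m$ are our own choices, not part of the paper. It draws from $\arg\min_{h}\{2(\sigma/|\beta_{0}|)B_{H}(h)+|h|^{2H}\}$ by simulating two-sided fBm exactly on a grid, via a Cholesky factor of the covariance $C_{H}$ in (4), and minimizing over the grid.

```python
import numpy as np

def fbm_on_grid(grid, H, rng):
    """Exact draw of two-sided fBm on a grid of nonzero points via Cholesky of
    C_H(s,t) = (|s|^{2H} + |t|^{2H} - |s-t|^{2H}) / 2."""
    s, t = np.meshgrid(grid, grid, indexing="ij")
    cov = 0.5 * (np.abs(s) ** (2 * H) + np.abs(t) ** (2 * H)
                 - np.abs(s - t) ** (2 * H))
    # small jitter keeps the factorization numerically stable for H near 1
    L = np.linalg.cholesky(cov + 1e-8 * np.eye(len(grid)))
    return L @ rng.standard_normal(len(grid))

def draw_limit(H, sigma, beta0, K=20.0, m=400, rng=None):
    """One Monte Carlo draw from argmin_h {2(sigma/|beta0|) B_H(h) + |h|^{2H}}."""
    rng = np.random.default_rng() if rng is None else rng
    grid = np.concatenate([np.linspace(-K, -K / m, m),   # grid over [-K, K],
                           np.linspace(K / m, K, m)])    # excluding h = 0
    crit = (2 * (sigma / abs(beta0)) * fbm_on_grid(grid, H, rng)
            + np.abs(grid) ** (2 * H))
    # h = 0 is always a candidate minimizer, where the criterion equals 0
    return grid[np.argmin(crit)] if crit.min() < 0 else 0.0

draws = [draw_limit(H=0.5, sigma=1.0, beta0=1.0) for _ in range(1000)]
```

For $H=1/2$ the drift $|h|^{2H}=|h|$ is triangular, so the limit is the argmin of two-sided Brownian motion with triangular drift, the case studied by Stryhn (1996).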

8.4 Proof of Theorem 3.1

We prove the result by contradiction. Before giving the proof, we state a general lemma that is useful in studying bootstrap validity. The lemma can be proved easily using characteristic functions; see also Sethuraman sethu and Theorem 2.2 of Kosorok k08 .

Lemma 8.2

Let $W_{n}$ and $W_{n}^{*}$ be random vectors in $\mathbb{R}^{l}$ and $\mathbb{R}^{k}$, respectively; let $Q$ and $Q^{*}$ be distributions on the Borel sets of $\mathbb{R}^{l}$ and $\mathbb{R}^{k}$, and let $\mathcal{F}_{n}$ be $\sigma$-fields for which $W_{n}$ is $\mathcal{F}_{n}$-measurable. If $W_{n}$ converges in distribution to $Q$ and the conditional distribution of $W_{n}^{*}$ given $\mathcal{F}_{n}$ converges (in distribution) in probability to $Q^{*}$, then $(W_{n},W_{n}^{*})$ converges in distribution to $Q\times Q^{*}$.

The basic idea of the proof is to assume that $\Delta_{n}^{*}\stackrel{d}{\rightarrow}\Delta^{*}$ in probability, where $\Delta^{*}$ has the same distribution as $\Delta$; then $\Delta_{n}^{*}\stackrel{d}{\rightarrow}\Delta^{*}$ unconditionally as well. We already know that $\Delta_{n}\stackrel{d}{\rightarrow}\Delta$ from Theorem 2.1. By Lemma 8.2 applied with $W_{n}=\Delta_{n}$, $W_{n}^{*}=\Delta_{n}^{*}$ and $\mathcal{F}_{n}=\sigma((Y_{1},X_{1}),(Y_{2},X_{2}),\ldots,(Y_{n},X_{n}))$, it follows that $(\Delta_{n},\Delta_{n}^{*})$ converges unconditionally to a product measure and, thus, $\Delta_{n}+\Delta_{n}^{*}\stackrel{d}{\rightarrow}\Delta+\Delta^{*}$. Hence $n^{1/(2H)}(\hat{\theta}_{n}^{*}-\theta_{0})\equiv\Delta_{n}+\Delta_{n}^{*}$ converges unconditionally to a tight limit whose variance is twice the variance of $\Delta$.

Using arguments along the lines of those used in the proof of Theorem 2.1, we can show that

n^{1/(2H)}(\hat{\theta}_{n}^{*}-\theta_{0}) \stackrel{d}{\rightarrow} \mathop{\arg\min}_{t\in\mathbb{R}}\bigl\{2\sigma\beta_{0}\bigl(B_{H}(t)+B_{H}^{*}(t)\bigr)+\beta_{0}^{2}|t|^{2H}\bigr\} \equiv \Delta^{**},

where $B_{H}^{*}$ is an fBm with Hurst exponent $H$, independent of $B_{H}$. Using the self-similarity of fBm, we see that

\Delta^{**} \stackrel{d}{=} \biggl(\sqrt{2}\,\frac{\sigma}{|\beta_{0}|}\biggr)^{1/H}\mathop{\arg\min}_{t\in\mathbb{R}}\bigl\{B_{H}(t)+|t|^{2H}/2\bigr\} \stackrel{d}{=} 2^{1/(2H)}\Delta.

Thus, the variance of the limiting distribution of $n^{1/(2H)}(\hat{\theta}_{n}^{*}-\theta_{0})$ is $2^{1/H}>2$ times the variance of $\Delta$ (recall $H<1$), which is a contradiction.
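To verify the first identity, note that $B_{H}+B_{H}^{*}\stackrel{d}{=}\sqrt{2}B_{H}$ and substitute $t=\lambda s$ with $\lambda=(\sqrt{2}\sigma/|\beta_{0}|)^{1/H}$, so that $\lambda^{H}=\sqrt{2}\sigma/|\beta_{0}|$:

2\sigma\beta_{0}\bigl(B_{H}+B_{H}^{*}\bigr)(\lambda s)+\beta_{0}^{2}|\lambda s|^{2H} \stackrel{d}{=} 2\sqrt{2}\,\sigma\beta_{0}\lambda^{H}B_{H}(s)+\beta_{0}^{2}\lambda^{2H}|s|^{2H} = 4\sigma^{2}\bigl\{\pm B_{H}(s)+|s|^{2H}/2\bigr\}.

Multiplying the criterion by a positive constant does not change the argmin, and the sign of $B_{H}$ is immaterial by symmetry, so $\Delta^{**}\stackrel{d}{=}\lambda\arg\min_{s}\{B_{H}(s)+|s|^{2H}/2\}$; the same substitution applied to $\Delta$, with $\lambda_{0}=(\sigma/|\beta_{0}|)^{1/H}$, gives $\Delta^{**}\stackrel{d}{=}(\lambda/\lambda_{0})\Delta=2^{1/(2H)}\Delta$.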

8.5 Proof of Theorem 3.2

The bootstrap sample is $\{(Y_{i}^{*},X_{i}),\ i=1,\ldots,n\}$, where the $Y_{i}^{*}$ are defined in (7). Letting $\mathbb{M}^{*}_{n}(\eta) \equiv \mathbb{P}^{*}_{n}m_{\eta} = \frac{1}{n}\sum_{i=1}^{n}[Y_{i}^{*}-\alpha-\beta X_{i}(\theta)]^{2}$, the bootstrap estimates are

\hat{\eta}_{n}^{*} = (\hat{\alpha}_{n}^{*},\hat{\beta}_{n}^{*},\hat{\theta}_{n}^{*}) \equiv \mathop{\arg\min}_{\eta\in\Xi}\mathbb{M}_{n}^{*}(\eta). \qquad (29)

We omit the rate of convergence part of the proof and concentrate on establishing the limit distribution. Also, to keep the argument simple, we assume that $\hat{\eta}_{n}\to\eta_{0}$ a.s.; a subsequence argument can be used to bypass this assumption. Note that

\zeta_{n}^{*} = \mathop{\arg\min}_{\mathbf{h}\in\mathbb{R}^{3}}\bigl\{n(\mathbb{P}_{n}^{*}-P_{n})[m_{\hat{\eta}_{n}+r_{n}^{-1}\mathbf{h}}-m_{\hat{\eta}_{n}}]+nP_{n}[m_{\hat{\eta}_{n}+r_{n}^{-1}\mathbf{h}}-m_{\hat{\eta}_{n}}]\bigr\}, \qquad (30)

where $P_{n}$ is the probability measure generating the bootstrap sample. Consider the first term within the curly brackets. Using a calculation similar to that in (8.3),

\sqrt{n}(m_{\hat{\eta}_{n}+r_{n}^{-1}\mathbf{h}}-m_{\hat{\eta}_{n}}) = -2\varepsilon^{*}\bigl[h_{1}+\hat{\beta}_{n}\hat{\mathbb{B}}(\hat{\theta}_{n},h_{3})+h_{2}X(\hat{\theta}_{n})\bigr]+A_{n}, \qquad (31)

where $\hat{\mathbb{B}}(\theta,t) \equiv \sqrt{n}[X(\theta+n^{-1/(2H)}t)-X(\theta)]$, with the dependence on $n$ suppressed for notational convenience, and $a_{n} \equiv \sqrt{n}(\mathbb{P}_{n}^{*}-P_{n})A_{n} = o_{P}(1)$ uniformly in $\mathbf{h}\in[-K,K]^{3}$. Then, using (31),

\sqrt{n}(\mathbb{P}_{n}^{*}-P_{n})\bigl[\sqrt{n}(m_{\hat{\eta}_{n}+r_{n}^{-1}\mathbf{h}}-m_{\hat{\eta}_{n}})\bigr]
= -2\sqrt{n}(\mathbb{P}_{n}^{*}-P_{n})\bigl[\varepsilon^{*}\{h_{1}+\hat{\beta}_{n}\hat{\mathbb{B}}(\hat{\theta}_{n},h_{3})+h_{2}X(\hat{\theta}_{n})\}\bigr]+a_{n} \qquad (32)
\stackrel{d}{\rightarrow} -2\sigma\bigl(Z_{1}h_{1}+\beta_{0}B_{H}(h_{3})+h_{2}|\theta_{0}|^{H}Z_{2}\bigr)

in $C[-K,K]$, a.s., where $Z_{1},Z_{2}$ are i.i.d. $N(0,1)$ independent of $B_{H}$.

To prove (32), first note that $P_{n}[\varepsilon^{*}\{h_{1}+\hat{\beta}_{n}\hat{\mathbb{B}}(\hat{\theta}_{n},h_{3})+h_{2}X(\hat{\theta}_{n})\}]=0$, as the $X_{i}$ are fixed and the $\varepsilon_{i}^{*}$ have mean zero under $P_{n}$. We will need the following properties of $\hat{\mathbb{B}}(\hat{\theta}_{n},t)$, proved at the end:

\frac{1}{n}\sum_{i=1}^{n}\hat{\mathbb{B}}_{i}(\hat{\theta}_{n},t) \stackrel{P}{\rightarrow} 0, \qquad \frac{1}{n}\sum_{i=1}^{n}\hat{\mathbb{B}}_{i}(\hat{\theta}_{n},t)X_{i}(\hat{\theta}_{n}) \stackrel{P}{\rightarrow} 0, \qquad (33)

\frac{1}{n}\sum_{i=1}^{n}\hat{\mathbb{B}}_{i}(\hat{\theta}_{n},s)\hat{\mathbb{B}}_{i}(\hat{\theta}_{n},t) \stackrel{P}{\rightarrow} C_{H}(s,t), \qquad (34)

uniformly for $|s|,|t|<K$, where $C_{H}(s,t)$ is the covariance function (4) of fBm. Now considering (32), by a simple application of the Lindeberg–Feller theorem, it follows that

\sqrt{n}\,\mathbb{P}_{n}^{*}[\varepsilon^{*}h_{1}] \stackrel{d}{\rightarrow} h_{1}N(0,\sigma^{2}), \qquad \sqrt{n}\,\mathbb{P}_{n}^{*}[\varepsilon^{*}h_{2}X(\hat{\theta}_{n})] \stackrel{d}{\rightarrow} h_{2}N(0,|\theta_{0}|^{2H}\sigma^{2}),

a.s. in $C[-K,K]$. Next consider $\sqrt{n}\,\mathbb{P}_{n}^{*}[\varepsilon^{*}\hat{\mathbb{B}}(\hat{\theta}_{n},t)]$. The finite-dimensional convergence and tightness of this process follow from Theorems 1.5.4 and 1.5.7 in vw using the properties (33) and (34) of $\hat{\mathbb{B}}(\hat{\theta}_{n},t)$. The asymptotic independence of the terms under consideration also follows from (33) and (34), via a covariance calculation like the one in the proof of Theorem 2.1.

To study the drift term in (30), note that

P_{n}m_{\eta} = \frac{1}{n}\sum_{i=1}^{n}P_{n}[Y_{i}^{*}-\alpha-\beta X_{i}(\theta)]^{2}
= \frac{1}{n}\sum_{i=1}^{n}\frac{1}{n}\sum_{j=1}^{n}[\hat{\alpha}_{n}+\hat{\beta}_{n}X_{i}(\hat{\theta}_{n})+(\hat{\varepsilon}_{j}-\bar{\varepsilon}_{n})-\alpha-\beta X_{i}(\theta)]^{2}
= \frac{1}{n}\sum_{i=1}^{n}[(\hat{\alpha}_{n}-\alpha)+(\hat{\beta}_{n}-\beta)X_{i}(\hat{\theta}_{n})+\beta\{X_{i}(\hat{\theta}_{n})-X_{i}(\theta)\}]^{2}
{}+\frac{1}{n}\sum_{j=1}^{n}(\hat{\varepsilon}_{j}-\bar{\varepsilon}_{n})^{2}.

Simple algebra then reduces the drift term to

\sum_{i=1}^{n}\biggl\{\frac{h_{1}}{\sqrt{n}}+\frac{h_{2}}{\sqrt{n}}X_{i}(\hat{\theta}_{n})+\frac{\hat{\mathbb{B}}_{i}(\hat{\theta}_{n},h_{3})}{\sqrt{n}}\biggl(\hat{\beta}_{n}+\frac{h_{2}}{\sqrt{n}}\biggr)\biggr\}^{2}
= h_{1}^{2}+\frac{h_{2}^{2}}{n}\sum_{i=1}^{n}X_{i}^{2}(\hat{\theta}_{n})+\biggl(\hat{\beta}_{n}+\frac{h_{2}}{\sqrt{n}}\biggr)^{2}\frac{1}{n}\sum_{i=1}^{n}\hat{\mathbb{B}}_{i}(\hat{\theta}_{n},h_{3})^{2}
{}+2\frac{h_{1}h_{2}}{n}\sum_{i=1}^{n}X_{i}(\hat{\theta}_{n})+2h_{1}\biggl(\hat{\beta}_{n}+\frac{h_{2}}{\sqrt{n}}\biggr)\frac{1}{n}\sum_{i=1}^{n}\hat{\mathbb{B}}_{i}(\hat{\theta}_{n},h_{3}) \qquad (35)
{}+2h_{2}\biggl(\hat{\beta}_{n}+\frac{h_{2}}{\sqrt{n}}\biggr)\frac{1}{n}\sum_{i=1}^{n}\hat{\mathbb{B}}_{i}(\hat{\theta}_{n},h_{3})X_{i}(\hat{\theta}_{n})
\stackrel{P}{\rightarrow} h_{1}^{2}+h_{2}^{2}|\theta_{0}|^{2H}+\beta_{0}^{2}|h_{3}|^{2H}

uniformly on $[-K,K]$, where we have used the properties (33) and (34) of $\hat{\mathbb{B}}(\hat{\theta}_{n},h_{3})$ and

\Biggl|\frac{1}{n}\sum_{i=1}^{n}X_{i}(\hat{\theta}_{n})\Biggr| \leq \sup_{\theta}|(\mathbb{P}_{n}-P)X(\theta)| \stackrel{P}{\rightarrow} 0.

Thus, combining (30), (32) and (35), we get $\zeta_{n}^{*}\stackrel{d}{\rightarrow}\zeta$ in probability.

It remains to prove (33) and (34). We only prove the last part, (34), the other parts being similar. For fixed $K>0$, consider the function class

\mathcal{F}_{n} = \bigl\{\hat{\mathbb{B}}(\theta,s)\hat{\mathbb{B}}(\theta,t)\colon \theta\in[0,1],\ |s|<K,\ |t|<K\bigr\},

which has a uniformly bounded bracketing entropy integral, and envelope

F_{n} = \sup_{\theta,|s|<K,|t|<K}|\hat{\mathbb{B}}(\theta,s)\hat{\mathbb{B}}(\theta,t)| \leq n^{\alpha^{\prime}/H}K^{2(H-\alpha^{\prime})}\xi^{2}

from the Lipschitz property (23) of order $\alpha=H-\alpha^{\prime}$, where $0<\alpha^{\prime}<H/2$ and $\xi$ has finite moments of all orders. Then

P\Biggl\{\sup_{|s|,|t|<K}\Biggl|\frac{1}{n}\sum_{i=1}^{n}\hat{\mathbb{B}}_{i}(\hat{\theta}_{n},t)\hat{\mathbb{B}}_{i}(\hat{\theta}_{n},s)-C_{H}(s,t)\Biggr|>\varepsilon\Biggr\}
\leq P\Bigl\{\sup_{f\in\mathcal{F}_{n}}|(\mathbb{P}_{n}-P)f|>\varepsilon\Bigr\} \leq \frac{1}{\varepsilon}E\Bigl[\sup_{f\in\mathcal{F}_{n}}|(\mathbb{P}_{n}-P)f|\Bigr]
\lesssim \frac{1}{\varepsilon\sqrt{n}}J_{[\cdot]}(1,\mathcal{F}_{n},L_{2}(P))(EF_{n}^{2})^{1/2} \lesssim n^{\alpha^{\prime}/H-1/2}\rightarrow 0,

where we use a maximal inequality in Theorem 2.14.2 of vw .

Remark

The failure of the nonparametric bootstrap can be explained by the behavior of the drift term in (30). In the nonparametric bootstrap, we need to find the conditional limit of $n\mathbb{P}_{n}[m_{\hat{\eta}_{n}+r_{n}^{-1}\mathbf{h}}-m_{\hat{\eta}_{n}}]$ given the data, but $\sqrt{n}\,\mathbb{P}_{n}$ applied to the second term of (8.3) fails to converge in probability. When bootstrapping residuals, however, the drift term in (30) becomes $nP_{n}[m_{\hat{\eta}_{n}+r_{n}^{-1}\mathbf{h}}-m_{\hat{\eta}_{n}}]$, and $\sqrt{n}\,P_{n}$ applied to the second term in (8.3) vanishes, so the drift term now converges in probability, as seen in (35).
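To make the residual bootstrap concrete, here is a minimal sketch. It assumes the trajectories are stored as an $n\times m$ matrix X whose $j$th column holds $X_{i}$ evaluated at the candidate time point grid[j]; the function names, grid search and bootstrap size are illustrative choices of ours, not the paper's implementation.

```python
import numpy as np

def fit_point_impact(Y, X, grid):
    """Least-squares fit of Y = alpha + beta * X(theta) + error, minimizing
    the residual sum of squares over a grid of candidate time points."""
    best = None
    for j in range(len(grid)):
        A = np.column_stack([np.ones(len(Y)), X[:, j]])
        coef, rss, *_ = np.linalg.lstsq(A, Y, rcond=None)
        rss = rss[0] if len(rss) else np.sum((Y - A @ coef) ** 2)
        if best is None or rss < best[0]:
            best = (rss, coef[0], coef[1], j)
    _, alpha, beta, j = best
    return alpha, beta, grid[j], j

def residual_bootstrap_theta(Y, X, grid, B=500, rng=None):
    """Bootstrap draws of theta-hat* obtained by resampling centered residuals."""
    rng = np.random.default_rng() if rng is None else rng
    alpha, beta, theta, j = fit_point_impact(Y, X, grid)
    resid = Y - alpha - beta * X[:, j]
    resid = resid - resid.mean()          # centered residuals, as in (7)
    fitted = alpha + beta * X[:, j]
    draws = np.empty(B)
    for b in range(B):
        Ystar = fitted + rng.choice(resid, size=len(Y), replace=True)
        draws[b] = fit_point_impact(Ystar, X, grid)[2]
    return theta, draws
```

Empirical quantiles of the bootstrap draws then yield confidence intervals for $\theta_{0}$, in the spirit of Section 3.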

8.6 Proof of Theorem 4.1

The consistency of $\hat{\theta}_{n}$ follows using a Glivenko–Cantelli argument for the function class $\mathcal{F}\equiv\{m_{\theta}(X,Y)=[Y-X(\theta)]^{2}\colon\theta\in[0,1]\}$ and the existence of a well-separated minimum for $\mathbb{M}$; cf. the proof of Theorem 2.1. Note that $\theta_{0}$ is the unique solution of the normal equation $\mathbb{M}^{\prime}(\theta)=0$ and $\mathbb{M}^{\prime\prime}(\theta_{0})>0$, so

\mathbb{M}(\theta)-\mathbb{M}(\theta_{0}) \gtrsim d^{2}(\theta,\theta_{0}) \qquad (36)

for all $\theta$ in a neighborhood of $\theta_{0}$, where $d$ is the usual Euclidean distance. The envelope function $M_{\delta}=\sup_{|\theta-\theta_{0}|<\delta}|m_{\theta}-m_{\theta_{0}}|$ for $\mathcal{M}_{\delta}\equiv\{m_{\theta}-m_{\theta_{0}}\colon|\theta-\theta_{0}|<\delta\}$ has $L_{2}$-norm of order $\delta^{H}$, from (21), so Theorem 3.2.5 of vw applied with $\phi_{n}(\delta)=\delta^{H}$ gives the rate $r_{n}=n^{1/(4-2H)}$ with respect to Euclidean distance.
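As in Section 8.2, the rate is obtained by solving $r_{n}^{2}\phi_{n}(1/r_{n})\leq\sqrt{n}$; with $\phi_{n}(\delta)=\delta^{H}$ this reads

r_{n}^{2}(1/r_{n})^{H} = r_{n}^{2-H} \leq \sqrt{n} \quad\Longleftrightarrow\quad r_{n} \leq n^{1/(2(2-H))} = n^{1/(4-2H)}.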

Now write $\hat{h}_{n} \equiv r_{n}(\hat{\theta}_{n}-\theta_{0}) = \arg\min_{h\in\mathbb{R}}\tilde{\mathbb{M}}_{n}(h)$, where

\tilde{\mathbb{M}}_{n}(h) = r_{n}^{2}[\mathbb{M}_{n}(\theta_{0}+h/r_{n})-\mathbb{M}_{n}(\theta_{0})], \qquad h\in\mathbb{R}. \qquad (37)

This gives

\tilde{\mathbb{M}}_{n}(h) = n^{-H/(4-2H)}\mathbb{G}_{n}[Z_{n}(h)^{2}]-2\mathbb{G}_{n}[WZ_{n}(h)]+\tfrac{1}{2}\mathbb{M}^{\prime\prime}(\theta_{0})h^{2}+A_{n}, \qquad (38)

where $A_{n}=o(1)$ uniformly in $h\in[-K,K]$, for any $K>0$, and

W \equiv \int_{0}^{1}f(t)X(t)\,dt-X(\theta_{0})+\varepsilon, \qquad Z_{n}(h) \equiv n^{H/(4-2H)}[X(\theta_{0}+h/r_{n})-X(\theta_{0})].

Note that $Z_{n}(h)\stackrel{d}{=}B_{H}(h)$ as processes, so, by Donsker’s theorem, the first term in (38) converges to zero in probability uniformly over $[-K,K]$. For the second term, we claim that

\mathbb{G}_{n}[WZ_{n}(h)] \stackrel{d}{\rightarrow} aB_{H}(h) \qquad (39)

as processes in $C[-K,K]$, where $a^{2}=E(W^{2})$. Application of the argmin continuous mapping theorem will then complete the proof.

To prove (39), for simplicity we give the detailed argument only in the Brownian motion case, with $B=B_{1/2}$ denoting two-sided Brownian motion. Consider the decomposition

\mathbb{G}_{n}[WZ_{n}(h)] = \mathbb{G}_{n}[(W-W_{\eta})Z_{n}(h)]+\mathbb{G}_{n}[W_{\eta}Z_{n}(h)], \qquad (40)

where

W_{\eta} = \int_{\theta_{0}-\eta}^{\theta_{0}+\eta}f(t)X(t)\,dt+\bigl(X(\theta_{0}+\eta)-X(\theta_{0})\bigr)\bigl(F(1)-F(\theta_{0}+\eta)\bigr), \qquad (41)

$F(\theta)=\int_{0}^{\theta}f(t)\,dt$, and $\eta>0$ is sufficiently small so that $\theta_{0}\pm\eta\in(0,1)$. Splitting the range of integration for the first term in $W$ into three intervals, and using the integration by parts formula (for semimartingales) over the intervals $[0,\theta_{0}-\eta]$ and $[\theta_{0}+\eta,1]$, we get

W-W_{\eta} = \int_{0}^{\theta_{0}-\eta}\bigl(F(\theta_{0}-\eta)-F(t)\bigr)\,dX(t)+\int_{\theta_{0}+\eta}^{1}\bigl(F(1)-F(\theta_{0}+\eta)\bigr)\,dX(t)
{}+\varepsilon+X(\theta_{0})\bigl(F(1)-F(\theta_{0}+\eta)-1\bigr),

which implies, by the independent increments property, that $W-W_{\eta}$ is independent of $Z_{n}(h)$ for $|h|<\eta n^{1/3}$. Using the same argument as in proving (28), it follows that

\mathbb{G}_{n}[(W-W_{\eta})Z_{n}(h)] \stackrel{d}{\rightarrow} a_{\eta}B(h)

as processes in $C[-K,K]$, for each fixed $\eta>0$, where

a_{\eta}^{2} = E(W-W_{\eta})^{2} \to E(W^{2}) = E\biggl[\int_{0}^{1}f(t)X(t)\,dt-X(\theta_{0})\biggr]^{2}+\sigma^{2} \equiv a^{2}

as $\eta\to 0$. Clearly, $a_{\eta}B(h)\stackrel{d}{\rightarrow}aB(h)$ in $C[-K,K]$ as $\eta\to 0$. If we show that the last term in (40) is asymptotically negligible in the sense that, for every $M>0$ and $\delta>0$,

\lim_{\eta\to 0}\limsup_{n\to\infty}P\Bigl(\sup_{|h|<M}|\mathbb{G}_{n}[W_{\eta}Z_{n}(h)]|>\delta\Bigr) = 0, \qquad (42)

this will complete the proof in view of Theorem 4.2 in bill . Theorem 2.14.2 in vw gives

E\Bigl[\sup_{|h|<M}|\mathbb{G}_{n}[W_{\eta}Z_{n}(h)]|\Bigr] \lesssim J_{[\cdot]}(1,\mathcal{F},L_{2}(P))(EF^{2})^{1/2},

where $J_{[\cdot]}(1,\mathcal{F},L_{2}(P))$ is the bracketing entropy integral of the class of functions $\mathcal{F}=\mathcal{F}_{n,\eta}=\{W_{\eta}Z_{n}(h)\colon|h|<M\}$, and $F=F_{n,\eta}$ is an envelope function for $\mathcal{F}$. We can take $F=|W_{\eta}|\sup_{|h|<M}|Z_{n}(h)|$. By the Cauchy–Schwarz inequality,

E(F^{2}) \leq (EW_{\eta}^{4})^{1/2}\Bigl(E\sup_{|h|<M}|B(h)|^{4}\Bigr)^{1/2} \lesssim \eta M,

where we have used (21). The bracketing entropy integral can be shown to be uniformly bounded (over all $\eta>0$ and $n$) using the Lipschitz property (23). The previous two displays and Markov’s inequality then lead to

\limsup_{n\to\infty}P\Bigl(\sup_{|h|<M}|\mathbb{G}_{n}[W_{\eta}Z_{n}(h)]|>\delta\Bigr) \lesssim \sqrt{\eta M}/\delta,

which implies (42) and establishes (39).

To establish (39) for general fBm, we apply Theorem 2.11.23 of vw to the class of measurable functions $\mathcal{F}_{n}=\{f_{n,h}\colon|h|<M\}$, where $f_{n,h}(X,\varepsilon)=WZ_{n}(h)$ and $M>0$ is fixed. Direct computation using the covariance of fBm shows that the sequence of covariance functions of $f_{n,h}$ converges pointwise to the covariance function of $aB_{H}(h)$, and the remaining conditions can be verified using arguments similar to those given above.

8.7 Proof of (13)

The key step, the localization of the criterion function, again relies on the self-similarity of the fBm $B_{H}$:

n_{1}^{1/(2H)}(\hat{\theta}_{n}-\theta_{0}) = \mathop{\arg\max}_{h}(\mathbb{P}_{n}^{1}-\mathbb{P}_{n}^{2})\bigl[X\bigl(\theta_{0}+n_{1}^{-1/(2H)}h\bigr)-X(\theta_{0})\bigr]
\stackrel{d}{=} \mathop{\arg\max}_{h}\bigl\{\bigl(\mathbb{G}_{n}^{1}-\sqrt{\rho}\,\mathbb{G}_{n}^{2}\bigr)[B_{H}(h)]+n_{1}\bigl(\mathbb{M}\bigl(\theta_{0}+n_{1}^{-1/(2H)}h\bigr)-\mathbb{M}(\theta_{0})\bigr)\bigr\}
\stackrel{d}{\rightarrow} \mathop{\arg\max}_{h}\bigl\{\bigl(1+\sqrt{\rho}\bigr)B_{H}(h)-c|h|^{2H}\bigr\},

where $\mathbb{G}_{n}^{j}=\sqrt{n_{j}}(\mathbb{P}_{n}^{j}-P_{j})$ is the empirical process for the $j$th sample.

Acknowledgments

The authors thank Moulinath Banerjee and Davar Khoshnevisan for helpful discussions.

References

  • (1) Beran, J. (1994). Statistics for Long-Memory Processes. Chapman and Hall, New York. \MR1304490
  • (2) Bhattacharya, P. K. and Brockwell, P. J. (1976). The minimum of an additive process with applications to signal estimation and storage theory. Z. Wahrsch. Verw. Gebiete 37 51–75. \MR0426166
  • (3) Bickel, P. and Freedman, D. (1981). Some asymptotic theory for the bootstrap. Ann. Statist. 9 1196–1217. \MR0630103
  • (4) Billingsley, P. (1999). Convergence of Probability Measures, 2nd ed. Wiley, New York. \MR1700749
  • (5) Buness, A. et al. (2007). Identification of aberrant chromosomal regions from gene expression microarray studies applied to human breast cancer. Bioinformatics 23 2273–2280.
  • (6) Chernoff, H. (1964). Estimation of the mode. Ann. Inst. Statist. Math. 16 31–41. \MR0172382
  • (7) Coeurjolly, J.-F. (2008). Hurst exponent estimation of locally self-similar Gaussian processes using sample quantiles. Ann. Statist. 36 1404–1434. \MR2418662
  • (8) Dudoit, S. and van der Laan, M. J. (2008). Multiple Testing Procedures with Applications to Genomics. Springer, New York.
  • (9) Embrechts, P. and Maejima, M. (2002). Selfsimilar Processes. Princeton Univ. Press, Princeton. \MR1920153
  • (10) Emilsson, V. et al. (2008). Genetics of gene expression and its effect on disease. Nature 452 423–428.
  • (11) Gneiting, T. and Schlather, M. (2004). Stochastic models that separate fractal dimension and the Hurst effect. SIAM Rev. 46 269–282. \MR2114455
  • (12) Groeneboom, P. and Wellner, J. A. (2001). Computing Chernoff’s distribution. J. Comput. Graph. Statist. 10 388–400. \MR1939706
  • (13) Gruvberger-Saal, S. K. et al. (2004). Predicting continuous values of prognostic markers in breast cancer from microarray gene expression profiles. Molecular Cancer Therapeutics 3 161–168.
  • (14) Härdle, W. and Mammen, E. (1993). Comparing nonparametric versus parametric regression fits. Ann. Statist. 21 1926–1947. \MR1245774
  • (15) Kim, J. and Pollard, D. (1990). Cube root asymptotics. Ann. Statist. 18 191–219. \MR1041391
  • (16) Kosorok, M. (2008). Bootstrapping the Grenander estimator. In Beyond Parametrics in Interdisciplinary Research: Festschrift in Honour of Professor Pranab K. Sen (N. Balakrishnan, E. Pena and M. Silvapulle, eds.) 282–292. IMS, Beachwood, OH. \MR2464195
  • (17) Kosorok, M. (2008). Introduction to Empirical Processes and Semiparametric Inference. Springer, New York.
  • (18) Koul, H. L., Qian, L. and Surgailis, D. (2003). Asymptotics of M-estimators in two-phase linear regression models. Stochastic Process. Appl. 103 123–154. \MR1947962
  • (19) Lieberman-Aiden, E., van Berkum, N. L., Williams, L. et al. (2009). Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326 289–293.
  • (20) Léger, C. and MacGibbon, B. (2006). On the bootstrap in cube root asymptotics. Canad. J. Statist. 34 29–44. \MR2267708
  • (21) Lindquist, M. A. and McKeague, I. W. (2009). Logistic regression with Brownian-like predictors. J. Amer. Statist. Assoc. 104 1575–1585.
  • (22) Mandelbrot, B. B. and Van Ness, J. (1968). Fractional Brownian motions, fractional noises and applications. SIAM Rev. 10 422–437. \MR0242239
  • (23) Mandelbrot, B. B. (1982). The Fractal Geometry of Nature. Freeman, New York. \MR0665254
  • (24) Müller, H.-G. and Song, K.-S. (1997). Two-stage change-point estimators in smooth regression models. Statist. Probab. Lett. 34 323–335. \MR1467437
  • (25) Neumeyer, N. (2009). Smooth residual bootstrap for empirical processes of non-parametric regression residuals. Scand. J. Statist. 36 204–228. \MR2528982
  • (26) Novikov, A. and Valkeila, E. (1999). On some maximal inequalities for fractional Brownian motions. Statist. Probab. Lett. 44 47–54. \MR1706311
  • (27) Ramsay, J. O. and Silverman, B. W. (2006). Functional Data Analysis, 2nd ed. Springer, New York. \MR2168993
  • (28) Revuz, D. and Yor, M. (1999). Continuous Martingales and Brownian Motion, 3rd ed. Springer, New York.
  • (29) Richardson, A. L., Wang, Z. C., De Nicolo, A., Lu, X. et al. (2006). X chromosomal abnormalities in basal-like human breast cancer. Cancer Cell 9 121–132.
  • (30) Salas-Gonzalez, D., Kuruoglu, E. E. and Ruiz, D. P. (2009). Modelling and assessing differential gene expression using the alpha stable distribution. Int. J. Biostat. 5 23. \MR2504963
  • (31) Sethuraman, J. (1961). Some limit theorems for joint distributions. Sankhyā Ser. A 23 379–386. \MR0139190
  • (32) Shao, J. and Tu, D. (1995). The Jackknife and Bootstrap. Springer, New York. \MR1351010
  • (33) Singh, K. (1981). On asymptotic accuracy of Efron’s bootstrap. Ann. Statist. 9 1187–1195. \MR0630102
  • (34) Stryhn, H. (1996). The location of the maximum of asymmetric two-sided Brownian motion with triangular drift. Statist. Probab. Lett. 29 279–284. \MR1411427
  • (35) van der Vaart, A. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer, New York. \MR1385671
  • (36) Van Keilegom, I., González-Manteiga, W. and Sánchez Sellero, C. (2008). Goodness-of-fit tests in parametric regression based on the estimation of the error distribution. Test 17 401–415. \MR2434335
  • (37) Yao, Y.-C. (1987). Approximating the distribution of the maximum likelihood estimate of the change-point in a sequence of independent random variables. Ann. Statist. 15 1321–1328. \MR0902262