Nonparametric Denoising of Signals with Unknown Local Structure, II: Nonparametric Regression Estimation
Abstract
We consider the problem of recovering continuous multi-dimensional functions from noisy observations over the regular grid , . Our focus is on adaptive estimation in the case when the function can be well recovered using a linear filter, which can depend on the unknown function itself. In the companion paper [26] we have shown that when there exists an adapted time-invariant filter which locally recovers the unknown signal “well”, there is a numerically efficient construction of an adaptive filter which recovers the signal “almost as well”. In the current paper we study the application of the proposed estimation techniques in the nonparametric regression setting. Namely, we propose an adaptive estimation procedure for “locally well-filtered” signals (some typical examples being smooth signals, modulated smooth signals and harmonic functions) and show that the rate of recovery of such signals in the -norm on the grid is essentially the same as the corresponding rate for regular signals with nonhomogeneous smoothness.
keywords:
Nonparametric denoising, adaptive filtering, minimax estimation, nonparametric regression.
1 Introduction
Let be a probability space. We consider the problem of recovering an unknown complex-valued random field over from noisy observations
(1)
We assume that the field of observation noises is independent of and is of the form , where are mutually independent standard Gaussian complex-valued variables; the adjective “standard” means that , are independent random variables.
We suppose that the observations (1) come from a function (“signal”) of continuous argument (which we assume to vary in the -dimensional unit cube ); this function is observed in noise along an -point equidistant grid in , and the problem is to recover from these observations. This problem fits the framework of nonparametric regression estimation, whose “traditional setting” is as follows:
A. The objective is to recover an unknown smooth function , which is sampled on the observation grid with , so that . The error of recovery is measured with some functional norm (or a semi-norm) on , and the risk of recovery of is the expectation ;
B. The estimation routines are aimed at recovering smooth signals, and their quality is measured by their maximal risks, the maximum being taken over running through natural families of smooth signals, e.g., Hölder or Sobolev balls;
C. The focus is on the asymptotic, as the volume of observations goes to infinity, behavior of the estimation routines, with emphasis on asymptotically minimax (nearly) optimal estimates – those which attain (nearly) best possible rates of convergence of the risks to 0 as the observation sample size .
Initially, the research was focused on recovering smooth signals
with a priori known smoothness parameters and the estimation
routines were tuned to these parameters (see, e.g.,
[23, 34, 38, 24, 2, 31, 39, 22, 36, 21, 27]).
Later on, there was significant research on adaptive
estimation. Adaptive estimation methods are free of a priori assumptions on
the smoothness parameters of the signal to be recovered, and the
primary goal is to develop the routines which exhibit asymptotically
(nearly) optimal behavior on a wide variety of families of smooth functions
(cf. [35, 28, 29, 30, 6, 8, 9, 25, 3, 7, 19]).
For a more complete overview of results on smooth nonparametric
regression estimation see, for instance, [33].³

³Our “brief outline” of the adaptive approach to nonparametric regression would be severely incomplete without mentioning a novel approach aimed at recovering nonsmooth signals possessing sparse representations in properly constructed functional systems [5, 10, 4, 11, 12, 13, 14, 15, 16, 17, 37, 18]. This promising approach is completely beyond the scope of our paper.
The traditional focus on recovering smooth signals ultimately
comes from the fact that such a signal locally can be
well-approximated by a polynomial of a fixed order , and such
a polynomial is an “easy to estimate” entity. Specifically, for
every integer , the value of a polynomial at an
observation point can be recovered via
neighboring observations “at a parametric rate” – with an expected squared error
inversely proportional to the number
of observations used by the estimate. The
coefficient depends solely on the order and the dimensionality
of the polynomial. The corresponding estimate
of is pretty simple: it is given by
a “time-invariant filter”, that is, by convolution of
observations with an appropriate discrete kernel
vanishing outside the box
:
then the estimate of is taken as .
Note that the kernel is readily given by the degree of the approximating polynomial and the dimension . The “classical” adaptation routines take care of choosing “good” values of the approximation parameters (namely, and ). On the other hand, the polynomial approximation “mechanism” is supposed to be fixed once and for all. Thus, in those procedures the “form” of the kernel is considered given in advance.
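To make the classical construction concrete, here is a minimal one-dimensional sketch (our own illustration, not taken from the paper): the local least-squares fit of a polynomial of degree k over a window of 2m+1 grid points yields a fixed convolution kernel, determined solely by k and the window size, and filtering the observations with this kernel recovers the signal at the window centers. The function name polyfit_kernel and the test signal are assumptions made for the example.

```python
import numpy as np

def polyfit_kernel(m, k):
    """Kernel of the local least-squares fit of a degree-k polynomial over the
    2m+1 points {-m, ..., m}: the fitted value at the central point is a fixed
    linear functional of the observations, i.e. a convolution kernel."""
    t = np.arange(-m, m + 1)
    V = np.vander(t, k + 1, increasing=True)        # design matrix [1, t, ..., t^k]
    H = np.linalg.solve(V.T @ V, V.T)               # maps observations to fitted coefficients
    return H[0]                                     # weights producing the fitted value at t = 0

rng = np.random.default_rng(0)
n, m, k = 512, 16, 2
x = np.linspace(0.0, 1.0, n)
signal = np.sin(3 * np.pi * x) ** 2                 # a smooth test signal (our choice)
y = signal + 0.3 * rng.standard_normal(n)

kernel = polyfit_kernel(m, k)
estimate = np.convolve(y, kernel[::-1], mode="same")  # time-invariant filtering
print("RMS error:", np.sqrt(np.mean((estimate - signal) ** 2)))
```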
In the companion paper [26] (referred to hereafter as Part I) we have introduced the notion of a well-filtered signal. In brief, the signal is -well-filtered for some if there is a filter (kernel) which recovers in the box with the mean square error comparable with :
The universe of these signals is much wider than that of smooth signals. As we have seen in Part I, it contains, in particular, “modulated smooth signals” – sums of a fixed number of products of smooth functions and multivariate harmonic oscillations of unknown (and arbitrarily high) frequencies. We have shown in Part I that whenever a discrete time signal (that is, a signal defined on a regular discrete grid) is well-filtered, we can recover this signal at a “nearly parametric” rate without a priori knowledge of the associated filter. In other words, a well-filtered signal can be recovered on the observation grid basically as well as if it were an algebraic polynomial of a given order.
We are about to demonstrate that the results of Part I on recovering well-filtered signals of unknown structure can be applied to recovering nonparametric signals which admit well-filtered local approximations. Such an extension has an unavoidable price – we can no longer hope to recover the signal well outside of the observation grid (a highly oscillating signal may well vanish on the observation grid and be arbitrarily large outside it). As a result, in what follows we are interested in recovering the signals along the observation grid only and, consequently, replace the error measures based on functional norms on by their grid analogues.
The estimates to be developed will be “doubly adaptive”, that is, adaptive both with respect to the unknown in advance structure of the well-filtered approximations of our signals and with respect to the unknown in advance “approximation rate” – the dependence between the size of a neighborhood of a point where the signal in question is approximated and the quality of approximation in this neighborhood. Note that in the case of smooth signals, this approximation rate is exactly what is given by the smoothness parameters. The results to follow can be seen as extensions of the results of [32, 20] (see also [33]) dealing with the particular case of univariate signals satisfying differential inequalities with unknown differential operators.
2 Nonparametric regression problem
We start with the formal description of the components of the nonparametric regression problem.
Let for , , and let for some denote . Let be a positive integer, , and let .
Let be the linear space of complex-valued fields over . We associate with a signal its observations along :
(2)
where are independent standard Gaussian complex-valued random noises. Our goal is to recover from observations (2). In what follows, we write
Below we use the following notation. For a set , we denote by the set of all such that . We denote the standard -norm on :
and its discrete analogue, so that
We set
Let . We say that a nonempty open cube
centered at is admissible for , if . For such a cube, denotes the largest nonnegative integer such that
For a cube
stands for the edge of . For we denote
the -shrinkage of to the center of .
2.1 Classes of locally well-filtered signals
Recall that we say that a function on is smooth if it can be locally well-approximated by a polynomial. Informally, the definition below says that a continuous signal is locally well-filtered if it admits a good local approximation by a well-filtered discrete signal on (see Definition 1 of Section 2.1, Part I).
Definition 1
Let
be a cube, be a positive integer,
, be reals, and let .
The
collection , , , , specifies the family of locally well-filtered on signals
defined by the following requirements:
(1) ;
(2) There exists a nonnegative function
such that for every
and for every admissible for
cube contained in there exists a field such that (where the set of -well filtered signals is defined in Definition 1 of Part I)
and
(3)
In the sequel, we use for also the shortened notation , where stands for the collection of “parameters” .
Remark. The motivating example of locally well-filtered signals is that of modulated smooth signals, as follows. Let a cube , , positive integers and a real be given. Consider a collection of functions which are times continuously differentiable and satisfy the constraint
Let , and let
By the standard argument [1], whenever and is admissible for , the Taylor polynomial , of order , taken at , of satisfies the inequality
(here and in what follows, are positive constants depending solely on , and ). It follows that if then
(4)
Now observe that the exponential polynomial belongs to for any (Proposition 10 of Part I). Combining this fact with (4), we conclude that
2.2 Accuracy measures
Let us fix and . Given an estimate of the restriction of on the grid , based on observations (2) (i.e., a Borel real-valued function of and ) and , let us characterize the quality of the estimate on the set by the worst-case risks
3 Estimator construction
The recovering routine we are about to build is aimed at estimating functions from classes with unknown in advance parameters . The only design parameters of the routine are an a priori upper bound on the parameter and a .
3.1 Preliminaries
From now on, we denote by the deterministic function of observation noises defined as follows. For every cube with vertices in , we consider the discrete Fourier transform of the observation noises reduced to , and take the maximum of the moduli of the resulting Fourier coefficients; let it be denoted . By definition,
where the maximum is taken over all cubes of the indicated type. By the origin of , due to the classical results on maxima of Gaussian processes (cf. also Lemma 15 of Part I), we have
(5)
where depends solely on .
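For concreteness, a brute-force one-dimensional sketch of this noise statistic is given below; the unitary DFT normalization and the name noise_statistic are our assumptions for illustration and may differ from the exact normalization used in Part I.

```python
import numpy as np

def noise_statistic(xi):
    """1-D sketch: maximum, over all sub-windows of the grid, of the largest
    modulus of the (unitarily normalized) DFT coefficients of the noise
    restricted to that window."""
    n = xi.size
    theta = 0.0
    for start in range(n):
        for length in range(1, n - start + 1):
            coeffs = np.fft.fft(xi[start:start + length]) / np.sqrt(length)
            theta = max(theta, np.abs(coeffs).max())
    return theta

rng = np.random.default_rng(1)
n = 64
# standard complex Gaussian noise: real and imaginary parts independent N(0, 1/2)
xi = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)
print(noise_statistic(xi))   # typically of order sqrt(log n), in line with the bound (5)
```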
3.2 Building blocks: window estimates
To recover a signal via observations (2), we use point-wise window estimates of defined as follows.
Let us fix a point ; our goal is to build an estimate of . Let be an admissible window for . We associate with this window an estimate of defined as follows. If the window is “very small”, specifically, , so that is the only point from the observation grid in , we set and . For a larger window, we choose the largest nonnegative integer such that
and apply Algorithm
A of Part I to build the estimate of , the design parameters of the algorithm being . Let the resulting estimate be
denoted by .
To characterize the quality of the estimate
, let us set
Lemma 2
One has
(6)
3.3 The adaptive estimate
We are about to “aggregate” the window estimates
into an adaptive estimate, applying
Lepskii’s adaptation scheme in the
same fashion as in [30, 19, 20].
Let us fix a “safety factor” in such a way that the event is “highly improbable”, namely,
(8)
by (5), the required may be chosen as a function of only. We now describe the basic blocks of the construction of the adaptive estimate.
“Good” realizations of noise. Let us define the set of
“good realizations of noise” as
(9)
Now (7) implies the “conditional” error bound
(10)
Observe that as grows, the “deterministic term” does not decrease, while the “stochastic term” decreases.
The “ideal” window. Let us define the ideal window as the largest admissible window for which the stochastic term dominates the deterministic one:
(11)
Note that such a window does exist, since as . Besides this, since the cubes are open, the quantity is continuous from the left, so that
(12)
Thus, the ideal window is well-defined for every
possessing admissible windows, i.e., for every
.
Normal windows. Assume that . Then the
errors of all estimates associated with
admissible windows smaller than the ideal one are dominated by the
corresponding stochastic terms:
(13)
(by (10) and (12)). Let us fix an (and thus a realization of the observations) and let us call an admissible window normal if the associated estimate differs from every estimate associated with a smaller window by no more than times the stochastic term of the latter estimate, i.e.,
(14)
Note that if , then possesses a normal window, specifically, the window . Indeed, this window contains a single observation point, namely, itself, so that the corresponding estimate, as well as every estimate corresponding to a smaller window, coincides by construction with the observation at ; thus all the estimates , , are the same. Note also that (13) implies that
(!) If , then the ideal window is normal.
The adaptive estimate . The property of an admissible window to be normal is “observable” – given observations , we can say whether a given window is or is not normal. Besides this, it is clear that among all normal windows there exists the largest one . The adaptive estimate is exactly the window estimate associated with the window . Note that from (!) it follows that
(!!) If , then the largest normal window contains the ideal window .
By definition of a normal window, under the premise of (!!) we have
and we arrive at the following conclusion:
(*) If , then the error of the estimate is dominated
by the error bound (10) associated with the ideal
window:
(15)
Thus, the estimate – which is based
solely on observations and does not require any a priori knowledge
of the “parameters of well-filterability of ” – possesses
basically the same accuracy as the “ideal” estimate associated
with the ideal window (provided, of course, that the realization
of noises is not “pathological”: ).
Note that the adaptive estimate we have built
depends solely on “design parameters” , (recall that depends on ),
the volume of
observations and the dimension .
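The adaptation rule of this section can be summarized, under simplifying assumptions, by the following Python sketch: the admissible windows centered at the point of interest are assumed to be ordered by inclusion, each supplied with its window estimate and stochastic term, and the constant C is a stand-in for the factor implied by (14); lepskii_window is our name, not the paper's.

```python
import numpy as np

def lepskii_window(estimates, stoch_terms, C=2.0):
    """Sketch of the adaptation rule of Section 3.3.  estimates[k] is the window
    estimate for the k-th admissible window (windows ordered by inclusion,
    smallest first), stoch_terms[k] is its stochastic term, and C is a
    placeholder for the constant implied by (14).  A window is 'normal' if its
    estimate differs from the estimate of every smaller window by at most C
    times the stochastic term of that smaller window; the adaptive estimate is
    the one attached to the largest normal window."""
    best = estimates[0]                              # the smallest window is always normal
    for k in range(1, len(estimates)):
        if all(abs(estimates[k] - estimates[j]) <= C * stoch_terms[j] for j in range(k)):
            best = estimates[k]                      # k-th window is normal; keep the largest
    return best

# toy usage: the estimates start to drift once the window gets too large
est = np.array([1.00, 1.02, 0.98, 1.01, 1.60, 2.30])
sig = np.array([0.50, 0.35, 0.25, 0.18, 0.12, 0.09])
print(lepskii_window(est, sig))                      # picks the estimate of window index 3
```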
4 Main result
Our main result is as follows:
Theorem 3
Let , be an integer, let be a family of locally well-filtered signals associated with a cube with , and . For properly chosen depending solely on and nonincreasing in , the following statement holds true:
Suppose that the volume of observations (2) is large enough, namely,
(16)
where is the edge of the cube .
Then for every the worst case, with respect to , -risk of the adaptive estimate associated with the parameter can be bounded as follows:
where
(recall that here denotes the cube concentric with and times smaller).
Note that the rates of convergence to 0, as , of the risks of our adaptive estimate on the families are exactly the same as those stated by Theorem 3 from [31] (see also [30, 9, 19, 33]) in the case of recovering non-parametric smooth regression functions from Sobolev balls. It is well-known that in the smooth case the latter rates are optimal in order, up to logarithmic in factors. Since the families of locally well-filtered signals are much wider than local Sobolev balls (smooth signals are trivial examples of modulated smooth signals!), it follows that the rates of convergence stated by Theorem 3 also are nearly optimal.
5 Simulation examples
In this section we present the results of a small simulation study of the adaptive filtering algorithm applied to the 2-dimensional de-noising problem. The simulation setting is as follows: we consider real-valued signals
being independent standard Gaussian random variables. The problem is to estimate, given observations , the values of the signal on the grid , and . The value is common to all experiments.
We consider signals which are sums of three harmonic components:
the frequencies and the phase shifts , are drawn randomly from the uniform distribution over, respectively, and , and the coefficient is chosen to make the signal-to-noise ratio equal to one.
In the simulations presented here we compared the result of adaptive recovery (with ) to that of a “standard nonparametric recovery”, i.e., recovery by the locally linear estimator with a square window. We performed independent runs for each of eight values of .
In Table 1 we summarize the results for the mean integrated squared error (MISE) of the estimation,
The observed behavior is as expected: for slowly oscillating signals the quality of the adaptive recovery is slightly worse than that of the “standard recovery”, which is tuned for estimation of regular signals. As we raise the frequency of the signal components, the adaptive recovery consistently outperforms the standard one. Finally, the standard recovery is clearly unable to recover highly oscillating signals (cf. Figures 1-4).
Standard recovery | Adaptive recovery
---|---
0.12 | 0.1
0.20 | 0.12
0.36 | 0.18
0.54 | 0.27
0.79 | 0.25
0.75 | 0.29
0.27 | 0.98
0.24 | 1.00
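For the reader who wishes to reproduce a toy version of this experiment, here is a minimal sketch of the setup; the grid size, the frequency range, and the plain box average standing in for the local linear “standard recovery” are all our assumptions, since the exact values and the adaptive routine itself are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 128              # n x n grid; the paper's actual grid size is not reproduced here (assumption)
omega_max = 16.0     # assumed upper end of the frequency range for this run (illustration only)

# test signal: sum of three separable harmonic components with random frequencies and phases
t = np.arange(n) / n
X, Y = np.meshgrid(t, t, indexing="ij")
signal = np.zeros((n, n))
for _ in range(3):
    wx, wy = rng.uniform(0.0, omega_max, size=2)
    px, py = rng.uniform(0.0, 2 * np.pi, size=2)
    signal += np.cos(2 * np.pi * wx * X + px) * np.cos(2 * np.pi * wy * Y + py)
signal /= np.sqrt(np.mean(signal ** 2))          # normalize to signal-to-noise ratio one

obs = signal + rng.standard_normal((n, n))        # noisy observations on the grid

def box_average(z, m):
    """Stand-in for the 'standard recovery': plain averaging over a (2m+1) x (2m+1) window."""
    out = np.empty_like(z)
    for i in range(z.shape[0]):
        for j in range(z.shape[1]):
            out[i, j] = z[max(0, i - m):i + m + 1, max(0, j - m):j + m + 1].mean()
    return out

estimate = box_average(obs, m=4)
print("MISE of the box-average recovery:", np.mean((estimate - signal) ** 2))
```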
Appendix
We denote the linear space of complex-valued fields over . A field with finitely many nonzero entries is called a filter. We use the common notation , , for the “basic shift operators” on :
and denote the output of a filter , the input to the filter being a field , so that
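A small one-dimensional sketch of these notions is given below, under the assumption that fields are stored as finite arrays with zeros outside the stored range; shift and apply_filter are our names, and the standard convolution convention is assumed.

```python
import numpy as np

def shift(field, k):
    """Basic shift operator: (Delta^k x)_t = x_{t-k} on a finite 1-D field,
    with zeros standing in for entries outside the stored range."""
    n = field.size
    out = np.zeros_like(field)
    if 0 <= k < n:
        out[k:] = field[:n - k]
    elif -n < k < 0:
        out[:n + k] = field[-k:]
    return out

def apply_filter(q, x):
    """Output of the filter q on the input field x:
    (q * x)_t = sum_k q_k x_{t-k}, with q given as a dict {shift: coefficient}."""
    y = np.zeros(x.size, dtype=complex)
    for k, qk in q.items():
        y += qk * shift(x.astype(complex), k)
    return y

# toy usage: a symmetric 3-tap averaging filter
x = np.arange(8, dtype=float)
q = {-1: 1 / 3, 0: 1 / 3, 1: 1 / 3}
print(apply_filter(q, x))
```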
5.1 Proof of Lemma 2.
To save notation, let and . Let be such that and for all . Since , there exists a filter such that and whenever . Setting , we have for any , ,
[note that and implies ]
as required in (6). ∎
5.2 Proof of Theorem 3
In the main body of the proof, we focus on the case
; the case of infinite and/or will be
considered at the concluding step 5⁰.
Let us fix a family of well-filtered signals with the parameters satisfying the premise
of Theorem 3 and a function from this class.
Recall that by the definition of there exists a
function , , such that for all
and all :
(20)
From now on, (perhaps with sub- or superscripts) are
quantities depending on only and
nonincreasing in .
1⁰. We need the following auxiliary result:
Lemma 4
Assume that
(21)
Given a point , let us choose the largest such that
(22)
Then is well-defined and
(23)
Besides this, the error at of the adaptive estimate as applied to can be bounded as follows:
(24)
Proof: The quantity is well-defined, since for small positive the left hand side in (22) is close to 0, while the right hand side is large. From (21) it follows that satisfies (22), so that . Moreover, (21) implies that
the latter inequality, in view of , says that satisfies (22) as well. Thus, , as claimed in (23).
Consider the window . By (22) it is admissible for , while from (22) combined with (20) we get
It follows that the ideal window of is not smaller than .
Assume that . Then, according to (15), we have
(25)
Now, by the definition of the ideal window, , and the right hand side in (25) does not exceed (recall that, as we have seen, ), as required in (24).
Now let . Note that is a certain estimate associated with a cube, centered at and admissible for , which is normal and such that (the latter – since the window is always normal, and is the largest normal window centered at ). Applying (14) with (so that ), we get , whence
(recall that we are in the situation , whence ). We have arrived at (24). ∎
Now we are ready to complete the proof. Assume that
(21) takes place, and let us fix , .
2⁰. Let us denote . Note that for every either
or
which means that
(26)
Let be the sets of those for which the first or, respectively, the second of these possibilities takes place. If is nonempty, let us partition it
as follows.
1) We can choose ( is finite!) such that
After is chosen, we
set
2) If the set is nonempty, we apply the construction from 1) to this set, thus getting such that and set . If the set is still nonempty, we apply the same construction to this set, thus getting and , and so on.
The outlined process clearly terminates after a certain step (since is finite). On termination, we get a collection of points and a partition with the following properties:
(i) The cubes are mutually disjoint;
(ii) For every and every we have
We claim that also
(iii) For every and every one has
(27)
Indeed, by (ii), so that it suffices to verify (27) in the case when . Since intersects , we have
whence
which is what we need.
3⁰. Let us set .
Assume that . Substituting for , we have by (24):
[by (27)]
due to , see (23). Further, note that
in view of , and , and
[by (26)]
by definition of .
Now note that in view of , so that
(see (20) and take into account that the cubes , , are mutually disjoint by (i)). We conclude that for
(28)
4⁰. Now assume that . In this case, by (24),
Hence, taking into account that ,
(29)
5⁰. Combining (28) and (29), we get
where
(we have used (5) and (8)). Thus, when (21) holds, for all and all , we have
(30)
Now it is easily seen that if is a properly chosen function of , nonincreasing in , and (16) takes place, then
1. assumption (21) holds,
2.
This yields the bound (3) for the case of , . Passing to the limit as , we get the desired bound for as well.
References
- [1] O.V. Besov, V.P. Il’in, and S.M. Nikol’ski. Integral representations of functions and embedding theorems. Moscow: Nauka Publishers, 1975 (in Russian).
- [2] L. Birgé. Approximation dans les espaces métriques et théorie de l’estimation. Z. Wahrscheinlichkeitstheorie verw. Geb., 65:181–237, 1983.
- [3] L. Birgé, P. Massart. From model selection to adaptive estimation. In: D. Pollard, E. Torgersen and G. Yang, Eds., Festschrift for Lucien Le Cam, Springer 1999, 55–89.
- [4] E. Candès, D. Donoho. Ridgelets: a key to high-dimensional intermittency? Philos. Trans. Roy. Soc. London Ser. A 357:2495-2509, 1999.
- [5] S. Chen, D.L. Donoho, M.A. Saunders. Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20(1):33-61, 1998.
- [6] D. Donoho, I. Johnstone. Ideal spatial adaptation via wavelet shrinkage. Biometrika 81(3):425-455, 1994.
- [7] D. Donoho, I. Johnstone. Minimax risk over balls for losses. Probab. Theory Related Fields 99:277-303, 1994.
- [8] D. Donoho, I. Johnstone. Adapting to unknown smoothness via wavelet shrinkage. J. Amer. Statist. Assoc. 90(432):1200–1224, 1995.
- [9] D. Donoho, I. Johnstone, G. Kerkyacharian, D. Picard. Wavelet shrinkage: Asymptopia? (with discussion and reply by the authors). J. Royal Statist. Soc. Series B 57(2):301–369, 1995.
- [10] D. Donoho. Tight frames of -plane ridgelets and the problem of representing objects that are smooth away from -dimensional singularities in . Proc. Natl. Acad. Sci. USA 96(5):1828-1833, 1999.
- [11] D. Donoho. Wedgelets: nearly minimax estimation of edges. Ann. Statist. 27:859-897, 1999.
- [12] D. Donoho. Orthonormal ridgelets and linear singularities. SIAM J. Math. Anal. 31:1062-1099, 2000.
- [13] D. Donoho. Ridge functions and orthonormal ridgelets. J. Approx. Theory 111(2):143-179, 2001.
- [14] D. Donoho. Curvelets and curvilinear integrals. J. Approx. Theory 113(1):59-90, 2001.
- [15] D. Donoho. Sparse components of images and optimal atomic decompositions. Constr. Approx. 17:353-382, 2001.
- [16] D. Donoho, X. Huo. Uncertainty principle and ideal atomic decomposition. IEEE Trans. on Information Theory 47(7):2845-2862, 2001.
- [17] D. Donoho, X. Huo. Beamlets and multiscale image analysis. Lect. Comput. Sci. Eng. 20:149-196, Springer, 2002.
- [18] M. Elad, A. Bruckstein. A generalized uncertainty principle and sparse representation in pairs of bases. IEEE Trans. on Information Theory (to appear).
- [19] A. Goldenshluger, A. Nemirovski. On spatially adaptive estimation of nonparametric regression. Math. Methods of Statistics 6(2):135–170, 1997.
- [20] A. Goldenshluger, A. Nemirovski. Adaptive de-noising of signals satisfying differential inequalities. IEEE Trans. on Information Theory 43, 1997.
- [21] Yu. Golubev. Asymptotic minimax estimation of regression function in additive model. Problemy peredachi informatsii 28(2):3–15, 1992. (English transl. in Problems Inform. Transmission 28, 1992.)
- [22] W. Härdle. Applied Nonparametric Regression, ES Monograph Series 19, Cambridge, U.K., Cambridge University Press, 1990.
- [23] I. Ibragimov and R. Khasminskii. On nonparametric estimation of regression. Soviet Math. Dokl. 21:810–814, 1980.
- [24] I. Ibragimov and R. Khasminskii. Statistical Estimation. Springer-Verlag, New York, 1981.
- [25] A. Juditsky. Wavelet estimators: Adapting to unknown smoothness. Math. Methods of Statistics 6(1):1–25, 1997.
- [26] A. Juditsky and A. Nemirovski. Nonparametric Denoising of Signals with Unknown Local Structure, I: Oracle Inequalities. Accepted to Appl. Comput. Harmon. Anal.
- [27] A. Korostelev, A. Tsybakov. Minimax theory of image reconstruction. Lecture Notes in Statistics 82, Springer, New York, 1993.
- [28] O. Lepskii. On a problem of adaptive estimation in Gaussian white noise. Theory of Probability and Its Applications 35(3):454–466, 1990.
- [29] O. Lepskii. Asymptotically minimax adaptive estimation I: Upper bounds. Optimally adaptive estimates. Theory of Probability and Its Applications, 36(4):682–697, 1991.
- [30] O. Lepskii, E. Mammen, V. Spokoiny. Optimal spatial adaptation to inhomogeneous smoothness: an approach based on kernel estimates with variable bandwidth selectors. Ann. Statist. 25(3):929–947, 1997.
- [31] A. Nemirovskii. On nonparametric estimation of smooth regression functions. Sov. J. Comput. Syst. Sci., 23(6):1–11, 1985.
- [32] A. Nemirovski. On nonparametric estimation of functions satisfying differential inequalities. R. Khasminski, Ed. Advances in Soviet Mathematics 12:7–43, American Mathematical Society, 1992.
- [33] A. Nemirovski. Topics in Non-parametric Statistics. In: M. Emery, A. Nemirovski, D. Voiculescu, Lectures on Probability Theory and Statistics, École d'Été de Probabilités de Saint-Flour XXVII – 1998, Ed. P. Bernard. – Lecture Notes in Mathematics 1738:87–285.
- [34] M. Pinsker. Optimal filtration of square-integrable signals in Gaussian noise. Problemy peredachi informatsii, 16(2):120–133. 1980. (English transl. in Problems Inform. Transmission 16, 1980.)
- [35] M. Pinsker, S. Efroimovich. Learning algorithm for nonparametric filtering. Automation and Remote Control 45(11):1434–1440, 1984.
- [36] M. Rosenblatt. Stochastic curve estimation. Institute of Mathematical Statistics, Hayward, California, 1991.
- [37] J.-L. Starck, E. Candès, D. Donoho. The curvelet transform for image denoising. IEEE Trans. Image Process. 11(6):670-684, 2002.
- [38] Ch. Stone. Optimal rates of convergence for nonparametric estimators. Annals of Statistics, 8(3):1348–1360, 1980.
- [39] G. Wahba. Spline models for observational data. SIAM, Philadelphia, 1990.