Nonparametric Denoising of Signals with Unknown Local Structure, II: Nonparametric Regression Estimation
Abstract
We consider the problem of recovering continuous multi-dimensional functions from noisy observations over the regular grid , . Our focus is on adaptive estimation in the case when the function can be well recovered using a linear filter, which can depend on the unknown function itself. In the companion paper [26] we have shown that when there exists an adapted time-invariant filter which locally recovers the unknown signal “well”, there is a numerically efficient construction of an adaptive filter which recovers the signal “almost as well”. In the current paper we study the application of the proposed estimation techniques in the nonparametric regression setting. Namely, we propose an adaptive estimation procedure for “locally well-filtered” signals (some typical examples being smooth signals, modulated smooth signals and harmonic functions) and show that the rate of recovery of such signals in the -norm on the grid is essentially the same as the corresponding rate for regular signals with nonhomogeneous smoothness.
keywords:
Nonparametric denoising, adaptive filtering, minimax estimation, nonparametric regression.
1 Introduction
Let be a probability space. We consider the problem of recovering an unknown complex-valued random field over from noisy observations
(1)
We assume that the field of observation noises is independent of and is of the form , where are mutually independent standard Gaussian complex-valued variables; the adjective “standard” means that , are independent random variables.
We suppose that the observations (1) come from a function (“signal”) of continuous argument (which we assume to vary in the -dimensional unit cube ); this function is observed in noise along an -point equidistant grid in , and the problem is to recover from these observations. This problem fits the framework of nonparametric regression estimation, whose “traditional setting” is as follows:
A. The objective is to recover an unknown smooth function , which is sampled on the observation grid with , so that . The error of recovery is measured with some functional norm (or a semi-norm) on , and the risk of recovery of is the expectation ;
B. The estimation routines are aimed at recovering smooth signals, and their quality is measured by their maximal risks, the maximum being taken over running through natural families of smooth signals, e.g., Hölder or Sobolev balls;
C. The focus is on the asymptotic, as the volume of observations goes to infinity, behavior of the estimation routines, with emphasis on asymptotically minimax (nearly) optimal estimates – those which attain (nearly) best possible rates of convergence of the risks to 0 as the observation sample size .
Initially, the research was focused on recovering smooth signals
with a priori known smoothness parameters and the estimation
routines were tuned to these parameters (see, e.g.,
[23, 34, 38, 24, 2, 31, 39, 22, 36, 21, 27]).
Later on, there was significant research on adaptive
estimation. Adaptive estimation methods are free of a priori assumptions on
the smoothness parameters of the signal to be recovered, and the
primary goal is to develop the routines which exhibit asymptotically
(nearly) optimal behavior on a wide variety of families of smooth functions
(cf. [35, 28, 29, 30, 6, 8, 9, 25, 3, 7, 19]).
For a more complete overview of results on smooth nonparametric
regression estimation see, for instance, [33].³

³Our “brief outline” of the adaptive approach to nonparametric regression would be severely incomplete without mentioning a novel approach aimed at recovering nonsmooth signals possessing sparse representations in properly constructed functional systems [5, 10, 4, 11, 12, 13, 14, 15, 16, 17, 37, 18]. This promising approach is completely beyond the scope of our paper.
The traditional focus on recovering smooth signals ultimately
comes from the fact that such a signal locally can be
well-approximated by a polynomial of a fixed order , and such
a polynomial is an “easy to estimate” entity. Specifically, for
every integer , the value of a polynomial at an
observation point can be recovered via
neighboring observations “at a parametric rate” – with an expected squared error
inversely proportional to the number
of observations used by the estimate. The
coefficient depends solely on the order and the dimensionality
of the polynomial. The corresponding estimate
of is pretty simple: it is given by
a “time-invariant filter”, that is, by convolution of
observations with an appropriate discrete kernel
vanishing outside the box
:
then the estimate of is taken as .
Note that the kernel is readily given by the degree of the approximating polynomial and the dimension . The “classical” adaptation routines take care of choosing “good” values of the approximation parameters (namely, and ). On the other hand, the polynomial approximation “mechanism” is supposed to be fixed once and for all. Thus, in those procedures the “form” of the kernel is considered given in advance.
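To make the classical construction concrete, here is a minimal one-dimensional sketch (our own illustration, not taken from the paper): the local least-squares fit of a polynomial of degree k over a window of 2m+1 grid points yields a fixed convolution kernel, determined solely by k and the window size, and filtering the observations with this kernel recovers the signal at the window centers. The function name polyfit_kernel and the test signal are assumptions made for the example.

```python
import numpy as np

def polyfit_kernel(m, k):
    """Kernel of the local least-squares fit of a degree-k polynomial over the
    2m+1 points {-m, ..., m}: the fitted value at the central point is a fixed
    linear functional of the observations, i.e. a convolution kernel."""
    t = np.arange(-m, m + 1)
    V = np.vander(t, k + 1, increasing=True)        # design matrix [1, t, ..., t^k]
    H = np.linalg.solve(V.T @ V, V.T)               # maps observations to fitted coefficients
    return H[0]                                     # weights producing the fitted value at t = 0

rng = np.random.default_rng(0)
n, m, k = 512, 16, 2
x = np.linspace(0.0, 1.0, n)
signal = np.sin(3 * np.pi * x) ** 2                 # a smooth test signal (our choice)
y = signal + 0.3 * rng.standard_normal(n)

kernel = polyfit_kernel(m, k)
estimate = np.convolve(y, kernel[::-1], mode="same")  # time-invariant filtering
print("RMS error:", np.sqrt(np.mean((estimate - signal) ** 2)))
```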
In the companion paper [26] (referred to hereafter as Part I) we have introduced the notion of a well-filtered signal. In brief, the signal is -well-filtered for some if there is a filter (kernel) which recovers in the box with the mean square error comparable with :
The universe of these signals is much wider than that of smooth signals. As we have seen in Part I, it contains, in particular, “modulated smooth signals” – sums of a fixed number of products of smooth functions and multivariate harmonic oscillations of unknown (and arbitrarily high) frequencies. We have shown in Part I that whenever a discrete time signal (that is, a signal defined on a regular discrete grid) is well-filtered, we can recover this signal at a “nearly parametric” rate without a priori knowledge of the associated filter. In other words, a well-filtered signal can be recovered on the observation grid basically as well as if it were an algebraic polynomial of a given order.
We are about to demonstrate that the results of Part I on recovering well-filtered signals of unknown structure can be applied to recovering nonparametric signals which admit well-filtered local approximations. Such an extension has an unavoidable price – we can no longer hope to recover the signal well outside of the observation grid (a highly oscillating signal may well vanish on the observation grid and be arbitrarily large outside it). As a result, in what follows we are interested in recovering the signals along the observation grid only and, consequently, replace the error measures based on functional norms on by their grid analogues.
The estimates to be developed will be “doubly adaptive”, that is, adaptive both with respect to the unknown in advance structure of the well-filtered approximations of our signals and with respect to the unknown in advance “approximation rate” – the dependence between the size of a neighborhood of a point where the signal in question is approximated and the quality of approximation in this neighborhood. Note that in the case of smooth signals, this approximation rate is exactly what is given by the smoothness parameters. The results to follow can be seen as extensions of the results of [32, 20] (see also [33]) dealing with the particular case of univariate signals satisfying differential inequalities with unknown differential operators.
2 Nonparametric regression problem
We start with the formal description of the components of the nonparametric regression problem.
Let for , , and let for some denote . Let be a positive integer, , and let .
Let be the linear space of complex-valued fields over . We associate with a signal its observations along :
(2)
where are independent standard Gaussian complex-valued random noises. Our goal is to recover from observations (2). In what follows, we write
Below we use the following notation. For a set , we denote by the set of all such that . We denote the standard -norm on :
and its discrete analogue, so that
We set
Let . We say that a nonempty open cube
centered at is admissible for , if . For such a cube, denotes the largest nonnegative integer such that
For a cube
stands for the edge of . For we denote
the -shrinkage of to the center of .
2.1 Classes of locally well-filtered signals
Recall that we say that a function on is smooth if it can be locally well-approximated by a polynomial. Informally, the definition below says that a continuous signal is locally well-filtered if it admits a good local approximation by a well-filtered discrete signal on (see Definition 1 of Section 2.1, Part I).
Definition 1
Let
be a cube, be a positive integer,
, be reals, and let .
The
collection , , , , specifies the family of locally well-filtered on signals
defined by the following requirements:
(1) ;
(2) There exists a nonnegative function
such that for every
and for every admissible for
cube contained in there exists a field such that (where the set of -well filtered signals is defined in Definition 1 of Part I)
and
(3)
In the sequel, we use for also the shortened notation , where stands for the collection of “parameters” .
Remark. The motivating example of locally well-filtered signals is that of modulated smooth signals, as follows. Let a cube , , positive integers and a real be given. Consider a collection of functions which are times continuously differentiable and satisfy the constraint
Let , and let
By the standard argument [1], whenever and is admissible for , the Taylor polynomial , of order , taken at , of satisfies the inequality
(here and in what follows, are positive constants depending solely on , and ). It follows that if then
(4)
Now observe that the exponential polynomial belongs to for any (Proposition 10 of Part I). Combining this fact with (4), we conclude that
2.2 Accuracy measures
Let us fix and . Given an estimate of the restriction of on the grid , based on observations (2) (i.e., a Borel real-valued function of and ) and , let us characterize the quality of the estimate on the set by the worst-case risks
3 Estimator construction
The recovering routine we are about to build is aimed at estimating functions from classes with unknown in advance parameters . The only design parameters of the routine are an a priori upper bound on the parameter and a .
3.1 Preliminaries
From now on, we denote by the deterministic function of observation noises defined as follows. For every cube with vertices in , we consider the discrete Fourier transform of the observation noises reduced to , and take the maximum of the moduli of the resulting Fourier coefficients; let it be denoted . By definition,
where the maximum is taken over all cubes of the indicated type. By the origin of , due to the classical results on maxima of Gaussian processes (cf. also Lemma 15 of Part I), we have
(5)
where depends solely on .
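For concreteness, a brute-force one-dimensional sketch of this noise statistic is given below; the unitary DFT normalization and the name noise_statistic are our assumptions for illustration and may differ from the exact normalization used in Part I.

```python
import numpy as np

def noise_statistic(xi):
    """1-D sketch: maximum, over all sub-windows of the grid, of the largest
    modulus of the (unitarily normalized) DFT coefficients of the noise
    restricted to that window."""
    n = xi.size
    theta = 0.0
    for start in range(n):
        for length in range(1, n - start + 1):
            coeffs = np.fft.fft(xi[start:start + length]) / np.sqrt(length)
            theta = max(theta, np.abs(coeffs).max())
    return theta

rng = np.random.default_rng(1)
n = 64
# standard complex Gaussian noise: real and imaginary parts independent N(0, 1/2)
xi = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)
print(noise_statistic(xi))   # typically of order sqrt(log n), in line with the bound (5)
```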
3.2 Building blocks: window estimates
To recover a signal via observations (2), we use point-wise window estimates of defined as follows.
Let us fix a point ; our goal is to build an estimate of . Let be an admissible window for . We associate with this window an estimate of defined as follows. If the window is “very small”, specifically, , so that is the only point from the observation grid in , we set and . For a larger window, we choose the largest nonnegative integer such that
and apply Algorithm
A of Part I to build the estimate of , the design parameters of the algorithm being . Let the resulting estimate be
denoted by .
To characterize the quality of the estimate
, let us set
Lemma 2
One has
(6)
3.3 The adaptive estimate
We are about to “aggregate” the window estimates
into an adaptive estimate, applying
Lepskii’s adaptation scheme in the
same fashion as in [30, 19, 20].
Let us fix a “safety factor” in such a way that the event is “highly improbable”, namely,
(8)
by (5), the required may be chosen as a function of only. We now describe the basic blocks of the construction of the adaptive estimate.
“Good” realizations of noise. Let us define the set of
“good realizations of noise” as
(9)
Now (7) implies the “conditional” error bound
(10)
Observe that as grows, the “deterministic term” does not decrease, while the “stochastic term” decreases.
The “ideal” window. Let us define the ideal window as the largest admissible window for which the stochastic term dominates the deterministic one:
(11)
Note that such a window does exist, since as . Besides this, since the cubes are open, the quantity is continuous from the left, so that
(12)
Thus, the ideal window is well-defined for every
possessing admissible windows, i.e., for every
.
Normal windows. Assume that . Then the
errors of all estimates associated with
admissible windows smaller than the ideal one are dominated by the
corresponding stochastic terms:
(13)
(by (10) and (12)). Let us fix an (and thus a realization of the observations) and let us call an admissible window normal if the associated estimate differs from every estimate associated with a smaller window by no more than times the stochastic term of the latter estimate, i.e.,
(14)
Note that if , then possesses a normal window, specifically, the window . Indeed, this window contains a single observation point, namely, itself, so that the corresponding estimate, as well as every estimate corresponding to a smaller window, coincides by construction with the observation at ; thus all the estimates , , are the same. Note also that (13) implies that
(!) If , then the ideal window is normal.
The adaptive estimate . The property of an admissible window to be normal is “observable” – given observations , we can say whether a given window is or is not normal. Besides this, it is clear that among all normal windows there exists the largest one . The adaptive estimate is exactly the window estimate associated with the window . Note that from (!) it follows that
(!!) If , then the largest normal window contains the ideal window .
By definition of a normal window, under the premise of (!!) we have
and we arrive at the following conclusion:
(*) If , then the error of the estimate is dominated
by the error bound (10) associated with the ideal
window:
(15)
Thus, the estimate – which is based
solely on observations and does not require any a priori knowledge
of the “parameters of well-filterability of ” – possesses
basically the same accuracy as the “ideal” estimate associated
with the ideal window (provided, of course, that the realization
of noises is not “pathological”: ).
Note that the adaptive estimate we have built
depends solely on “design parameters” , (recall that depends on ),
the volume of
observations and the dimension .
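The adaptation rule of this section can be summarized, under simplifying assumptions, by the following Python sketch: the admissible windows centered at the point of interest are assumed to be ordered by inclusion, each supplied with its window estimate and stochastic term, and the constant C is a stand-in for the factor implied by (14); lepskii_window is our name, not the paper's.

```python
import numpy as np

def lepskii_window(estimates, stoch_terms, C=2.0):
    """Sketch of the adaptation rule of Section 3.3.  estimates[k] is the window
    estimate for the k-th admissible window (windows ordered by inclusion,
    smallest first), stoch_terms[k] is its stochastic term, and C is a
    placeholder for the constant implied by (14).  A window is 'normal' if its
    estimate differs from the estimate of every smaller window by at most C
    times the stochastic term of that smaller window; the adaptive estimate is
    the one attached to the largest normal window."""
    best = estimates[0]                              # the smallest window is always normal
    for k in range(1, len(estimates)):
        if all(abs(estimates[k] - estimates[j]) <= C * stoch_terms[j] for j in range(k)):
            best = estimates[k]                      # k-th window is normal; keep the largest
    return best

# toy usage: the estimates start to drift once the window gets too large
est = np.array([1.00, 1.02, 0.98, 1.01, 1.60, 2.30])
sig = np.array([0.50, 0.35, 0.25, 0.18, 0.12, 0.09])
print(lepskii_window(est, sig))                      # picks the estimate of window index 3
```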
4 Main result
Our main result is as follows:
Theorem 3
Let , be an integer, let be a family of locally well-filtered signals associated with a cube with , and . For properly chosen depending solely on and nonincreasing in , the following statement holds true:
Suppose that the volume of observations (2) is large enough, namely,
(16)
where is the edge of the cube .
Then for every the worst case, with respect to , -risk of the adaptive estimate associated with the parameter can be bounded as follows:
where
(recall that here denotes the cube concentric with and times smaller).
Note that the rates of convergence to 0, as , of the risks of our adaptive estimate on the families are exactly the same as those stated by Theorem 3 from [31] (see also [30, 9, 19, 33]) in the case of recovering non-parametric smooth regression functions from Sobolev balls. It is well-known that in the smooth case the latter rates are optimal in order, up to logarithmic in factors. Since the families of locally well-filtered signals are much wider than local Sobolev balls (smooth signals are trivial examples of modulated smooth signals!), it follows that the rates of convergence stated by Theorem 3 also are nearly optimal.
5 Simulation examples
In this section we present the results of a small simulation study of the adaptive filtering algorithm applied to the 2-dimensional de-noising problem. The simulation setting is as follows: we consider real-valued signals
being independent standard Gaussian random variables. The problem is to estimate, given observations , the values of the signal on the grid , and . The value is common to all experiments.
We consider signals which are sums of three harmonic components:
the frequencies and the phase shifts , are drawn randomly from the uniform distribution over, respectively, and , and the coefficient is chosen to make the signal-to-noise ratio equal to one.
In the simulations presented here we compared the result of adaptive recovery (with ) to that of a “standard nonparametric recovery”, i.e., recovery by the locally linear estimator with a square window. We performed independent runs for each of eight values of .
In Table 1 we summarize the results for the mean integrated squared error (MISE) of the estimation,
The observed behavior is as expected: for slowly oscillating signals the quality of the adaptive recovery is slightly worse than that of the “standard recovery”, which is tuned for estimation of regular signals. As we raise the frequency of the signal components, the adaptive recovery consistently outperforms the standard one. Finally, the standard recovery is clearly unable to recover highly oscillating signals (cf. Figures 1-4).
Standard recovery | Adaptive recovery
---|---
0.12 | 0.1
0.20 | 0.12
0.36 | 0.18
0.54 | 0.27
0.79 | 0.25
0.75 | 0.29
0.27 | 0.98
0.24 | 1.00
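For the reader who wishes to reproduce a toy version of this experiment, here is a minimal sketch of the setup; the grid size, the frequency range, and the plain box average standing in for the local linear “standard recovery” are all our assumptions, since the exact values and the adaptive routine itself are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 128              # n x n grid; the paper's actual grid size is not reproduced here (assumption)
omega_max = 16.0     # assumed upper end of the frequency range for this run (illustration only)

# test signal: sum of three separable harmonic components with random frequencies and phases
t = np.arange(n) / n
X, Y = np.meshgrid(t, t, indexing="ij")
signal = np.zeros((n, n))
for _ in range(3):
    wx, wy = rng.uniform(0.0, omega_max, size=2)
    px, py = rng.uniform(0.0, 2 * np.pi, size=2)
    signal += np.cos(2 * np.pi * wx * X + px) * np.cos(2 * np.pi * wy * Y + py)
signal /= np.sqrt(np.mean(signal ** 2))          # normalize to signal-to-noise ratio one

obs = signal + rng.standard_normal((n, n))        # noisy observations on the grid

def box_average(z, m):
    """Stand-in for the 'standard recovery': plain averaging over a (2m+1) x (2m+1) window."""
    out = np.empty_like(z)
    for i in range(z.shape[0]):
        for j in range(z.shape[1]):
            out[i, j] = z[max(0, i - m):i + m + 1, max(0, j - m):j + m + 1].mean()
    return out

estimate = box_average(obs, m=4)
print("MISE of the box-average recovery:", np.mean((estimate - signal) ** 2))
```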
Appendix
We denote the linear space of complex-valued fields over . A field with finitely many nonzero entries is called a filter. We use the common notation , , for the “basic shift operators” on :
and denote the output of a filter , the input to the filter being a field , so that
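A small one-dimensional sketch of these notions is given below, under the assumption that fields are stored as finite arrays with zeros outside the stored range; shift and apply_filter are our names, and the standard convolution convention is assumed.

```python
import numpy as np

def shift(field, k):
    """Basic shift operator: (Delta^k x)_t = x_{t-k} on a finite 1-D field,
    with zeros standing in for entries outside the stored range."""
    n = field.size
    out = np.zeros_like(field)
    if 0 <= k < n:
        out[k:] = field[:n - k]
    elif -n < k < 0:
        out[:n + k] = field[-k:]
    return out

def apply_filter(q, x):
    """Output of the filter q on the input field x:
    (q * x)_t = sum_k q_k x_{t-k}, with q given as a dict {shift: coefficient}."""
    y = np.zeros(x.size, dtype=complex)
    for k, qk in q.items():
        y += qk * shift(x.astype(complex), k)
    return y

# toy usage: a symmetric 3-tap averaging filter
x = np.arange(8, dtype=float)
q = {-1: 1 / 3, 0: 1 / 3, 1: 1 / 3}
print(apply_filter(q, x))
```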
5.1 Proof of Lemma 2.
To save notation, let and . Let be such that and for all . Since , there exists a filter such that and whenever . Setting , we have for any , ,
[note that and implies ]
as required in (6). ∎
5.2 Proof of Theorem 3
In the main body of the proof, we focus on the case
; the case of infinite and/or will be
considered at the concluding step 5⁰.
Let us fix a family of well-filtered signals with the parameters satisfying the premise
of Theorem 3 and a function from this class.
Recall that by the definition of there exists a
function , , such that for all
and all :
(20)
From now on, (perhaps with sub- or superscripts) are
quantities depending on only and
nonincreasing in .
1⁰. We need the following auxiliary result:
Lemma 4
Assume that
(21)
Given a point , let us choose the largest such that
(22)
Then is well-defined and
(23)
Besides this, the error at of the adaptive estimate as applied to can be bounded as follows:
(24)
Proof: The quantity is well-defined, since for small positive the left hand side in (22) is close to 0, while the right hand side is large. From (21) it follows that satisfies (22), so that . Moreover, (21) implies that
the latter inequality, in view of , says that satisfies (22) as well. Thus, , as claimed in (23).
Consider the window . By (22) it is admissible for , while from (22) combined with (20) we get
It follows that the ideal window of is not smaller than .
Assume that . Then, according to (15), we have
(25)
Now, by the definition of the ideal window, , and the right hand side in (25) does not exceed (recall that, as we have seen, ), as required in (24).
Now let . Note that is a certain estimate associated with a cube, centered at and admissible for , which is normal and such that (the latter – since the window is always normal, and is the largest normal window centered at ). Applying (14) with (so that ), we get , whence
(recall that we are in the situation , whence ). We have arrived at (24). ∎
Now we are ready to complete the proof. Assume that
(21) takes place, and let us fix , .
2⁰. Let us denote . Note that for every either
or
which means that
(26)
Let be the sets of those for which the first or, respectively, the second of these possibilities takes place. If is nonempty, let us partition it
as follows.
1) We can choose ( is finite!) such that
After is chosen, we
set
2) If the set is nonempty, we apply the construction from 1) to this set, thus getting such that and set . If the set is still nonempty, we apply the same construction to this set, thus getting and , and so on.
The outlined process clearly terminates after a certain step (since is finite). On termination, we get a collection of points and a partition with the following properties:
(i) The cubes are mutually disjoint;
(ii) For every and every we have
We claim that also
(iii) For every and every one has
(27)
Indeed, by (ii), so that it suffices to verify (27) in the case when . Since intersects , we have
whence
which is what we need.
3⁰. Let us set .
Assume that . Substituting for , we have by (24):
[by (27)]
due to , see (23). Further, note that
in view of , and , and
[by (26)]
by definition of .
Now note that in view of , so that
(see (20) and take into account that the cubes , , are mutually disjoint by (i)). We conclude that for
(28)
4⁰. Now assume that . In this case, by (24),
Hence, taking into account that ,
(29)
5⁰. Combining (28) and (29), we get
where
(we have used (5) and (8)). Thus, when (21) holds, for all and all , we have
(30)
Now it is easily seen that if is a properly chosen function of , nonincreasing in , and (16) takes place, then
1. assumption (21) holds,
2.
This yields the bound (3) for the case of , . Passing to the limit as , we get the desired bound for as well.
References
- [1] O.V. Besov, V.P. Il’in, and S.M. Nikol’ski. Integral representations of functions and embedding theorems. Moscow: Nauka Publishers, 1975 (in Russian).
- [2] L. Birgé. Approximation dans les espaces métriques et théorie de l’estimation. Z. Wahrscheinlichkeitstheorie verw. Geb., 65:181–237, 1983.
- [3] L. Birgé, P. Massart. From model selection to adaptive estimation. In: D. Pollard, E. Torgersen and G. Yang, Eds., Festschrift for Lucien Le Cam, Springer 1999, 55–89.
- [4] E. Candès, D. Donoho. Ridgelets: a key to high-dimensional intermittency? Philos. Trans. Roy. Soc. London Ser. A 357:2495-2509, 1999.
- [5] S. Chen, D.L. Donoho, M.A. Saunders. Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20(1):33-61, 1998.
- [6] D. Donoho, I. Johnstone. Ideal spatial adaptation via wavelet shrinkage. Biometrika 81(3):425-455, 1994.
- [7] D. Donoho, I. Johnstone. Minimax risk over balls for losses. Probab. Theory Related Fields 99:277-303, 1994.
- [8] D. Donoho, I. Johnstone. Adapting to unknown smoothness via wavelet shrinkage. J. Amer. Statist. Assoc. 90(432):1200–1224, 1995.
- [9] D. Donoho, I. Johnstone, G. Kerkyacharian, D. Picard. Wavelet shrinkage: Asymptopia? (with discussion and reply by the authors). J. Royal Statist. Soc. Series B 57(2):301–369, 1995.
- [10] D. Donoho. Tight frames of -plane ridgelets and the problem of representing objects that are smooth away from -dimensional singularities in . Proc. Natl. Acad. Sci. USA 96(5):1828-1833, 1999.
- [11] D. Donoho. Wedgelets: nearly minimax estimation of edges. Ann. Statist. 27:859-897, 1999.
- [12] D. Donoho. Orthonormal ridgelets and linear singularities. SIAM J. Math. Anal. 31:1062-1099, 2000.
- [13] D. Donoho. Ridge functions and orthonormal ridgelets. J. Approx. Theory 111(2):143-179, 2001.
- [14] D. Donoho. Curvelets and curvilinear integrals. J. Approx. Theory 113(1):59-90, 2001.
- [15] D. Donoho. Sparse components of images and optimal atomic decompositions. Constr. Approx. 17:353-382, 2001.
- [16] D. Donoho, X. Huo. Uncertainty principle and ideal atomic decomposition. IEEE Trans. on Information Theory 47(7):2845-2862, 2001.
- [17] D. Donoho, X. Huo. Beamlets and multiscale image analysis. Lect. Comput. Sci. Eng. 20:149-196, Springer, 2002.
- [18] M. Elad, A. Bruckstein. A generalized uncertainty principle and sparse representation in pairs of bases. IEEE Trans. on Information Theory (to appear).
- [19] A. Goldenshluger, A. Nemirovski. On spatially adaptive estimation of nonparametric regression. Math. Methods of Statistics 6(2):135–170, 1997.
- [20] A. Goldenshluger, A. Nemirovski. Adaptive de-noising of signals satisfying differential inequalities. IEEE Trans. on Information Theory 43, 1997.
- [21] Yu. Golubev. Asymptotic minimax estimation of regression function in additive model. Problemy peredachi informatsii 28(2):3–15, 1992. (English transl. in Problems Inform. Transmission 28, 1992.)
- [22] W. Härdle. Applied Nonparametric Regression, ES Monograph Series 19, Cambridge, U.K., Cambridge University Press, 1990.
- [23] I. Ibragimov and R. Khasminskii. On nonparametric estimation of regression. Soviet Math. Dokl. 21:810–814, 1980.
- [24] I. Ibragimov and R. Khasminskii. Statistical Estimation. Springer-Verlag, New York, 1981.
- [25] A. Juditsky. Wavelet estimators: Adapting to unknown smoothness. Math. Methods of Statistics 6(1):1–25, 1997.
- [26] A. Juditsky and A. Nemirovski. Nonparametric Denoising of Signals with Unknown Local Structure, I: Oracle Inequalities. Accepted to Appl. Comput. Harmon. Anal.
- [27] A. Korostelev, A. Tsybakov. Minimax theory of image reconstruction. Lecture Notes in Statistics 82, Springer, New York, 1993.
- [28] O. Lepskii. On a problem of adaptive estimation in Gaussian white noise. Theory of Probability and Its Applications 35(3):454–466, 1990.
- [29] O. Lepskii. Asymptotically minimax adaptive estimation I: Upper bounds. Optimally adaptive estimates. Theory of Probability and Its Applications, 36(4):682–697, 1991.
- [30] O. Lepskii, E. Mammen, V. Spokoiny. Optimal spatial adaptation to inhomogeneous smoothness: an approach based on kernel estimates with variable bandwidth selectors. Ann. Statist. 25(3):929–947, 1997.
- [31] A. Nemirovskii. On nonparametric estimation of smooth regression functions. Sov. J. Comput. Syst. Sci., 23(6):1–11, 1985.
- [32] A. Nemirovski. On nonparametric estimation of functions satisfying differential inequalities. R. Khasminski, Ed. Advances in Soviet Mathematics 12:7–43, American Mathematical Society, 1992.
- [33] A. Nemirovski. Topics in Non-parametric Statistics. In: M. Emery, A. Nemirovski, D. Voiculescu, Lectures on Probability Theory and Statistics, École d'Été de Probabilités de Saint-Flour XXVII – 1998, Ed. P. Bernard. – Lecture Notes in Mathematics 1738:87–285.
- [34] M. Pinsker. Optimal filtration of square-integrable signals in Gaussian noise. Problemy peredachi informatsii, 16(2):120–133. 1980. (English transl. in Problems Inform. Transmission 16, 1980.)
- [35] M. Pinsker, S. Efroimovich. Learning algorithm for nonparametric filtering. Automation and Remote Control 45(11):1434–1440, 1984.
- [36] M. Rosenblatt. Stochastic curve estimation. Institute of Mathematical Statistics, Hayward, California, 1991.
- [37] J.-L. Starck, E. Candès, D. Donoho. The curvelet transform for image denoising. IEEE Trans. Image Process. 11(6):670-684, 2002.
- [38] Ch. Stone. Optimal rates of convergence for nonparametric estimators. Annals of Statistics, 8(3):1348–1360, 1980.
- [39] G. Wahba. Spline models for observational data. SIAM, Philadelphia, 1990.