
Optimal Convergence Rates for the Orthogonal Greedy Algorithm

Jonathan W. Siegel
Department of Mathematics
Pennsylvania State University
University Park, PA 16802
jus1949@psu.edu
Jinchao Xu
Department of Mathematics
Pennsylvania State University
University Park, PA 16802
jxx1@psu.edu
Abstract

We analyze the orthogonal greedy algorithm when applied to dictionaries $\mathbb{D}$ whose convex hull has small entropy. We show that if the metric entropy of the convex hull of $\mathbb{D}$ decays at a rate of $O(n^{-\frac{1}{2}-\alpha})$ for $\alpha>0$, then the orthogonal greedy algorithm converges at the same rate on the variation space of $\mathbb{D}$. This improves upon the well-known $O(n^{-\frac{1}{2}})$ convergence rate of the orthogonal greedy algorithm in many cases, most notably for dictionaries corresponding to shallow neural networks. These results hold under no additional assumptions on the dictionary beyond the decay rate of the entropy of its convex hull. In addition, they are robust to noise in the target function and can be extended to convergence rates on the interpolation spaces of the variation norm. We show empirically that the predicted rates are obtained for the dictionary corresponding to shallow neural networks with Heaviside activation function in two dimensions. Finally, we show that these improved rates are sharp and prove a negative result showing that the iterates generated by the orthogonal greedy algorithm cannot in general be bounded in the variation norm of $\mathbb{D}$.

1 Introduction

Let $H$ be a Hilbert space and $\mathbb{D}\subset H$ a dictionary of basis functions. An important problem in machine learning, statistics, and signal processing is the non-linear approximation of a target function $f$ by a sparse linear combination of dictionary elements

f_n = \sum_{i=1}^{n} a_i g_i,    (1)

where $g_i\in\mathbb{D}$ depend upon the function $f$. Typical examples include non-linear approximation by redundant wavelet frames [25], shallow neural networks, which correspond to non-linear approximation by dictionaries of ridge functions [20], and gradient boosting [15], which corresponds to non-linear approximation by a dictionary of weak learners. Another important example is compressed sensing [13, 6, 8], where a function $f$ which is a sparse linear combination of dictionary elements is recovered from a small number of linear measurements.

In this work, we study the problem of algorithmically calculating an expansion of the form (1) to approximate a target function $f$. A common class of algorithms for this purpose are greedy algorithms, specifically the pure greedy algorithm [25],

f_0 = 0,\quad g_k = \arg\max_{g\in\mathbb{D}}|\langle g, f - f_{k-1}\rangle|,\quad f_k = f_{k-1} + \langle g_k, f - f_{k-1}\rangle g_k,    (2)

which is also known as matching pursuit, the relaxed greedy algorithm [17, 3, 2],

f_0 = 0,\quad (\alpha_k,\beta_k,g_k) = \arg\min_{\alpha_k,\beta_k\in\mathbb{R},\,g_k\in\mathbb{D}} \|f - \alpha_k f_{k-1} - \beta_k g_k\|_H,\quad f_k = \alpha_k f_{k-1} + \beta_k g_k,    (3)

and the orthogonal greedy algorithm (also known as orthogonal matching pursuit) [28]

f_0 = 0,\quad g_k = \arg\max_{g\in\mathbb{D}}|\langle g, r_{k-1}\rangle|,\quad f_k = P_k f,    (4)

where $r_k = f - f_k$ is the residual and $P_k$ is the orthogonal projection onto the span of $g_1,\dots,g_k$.
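For concreteness, the following is a minimal Python sketch of the orthogonal greedy algorithm (4) for a finite dictionary stored as the columns of a matrix; it is not the implementation used in this paper, and the names orthogonal_greedy, D, and f are illustrative choices rather than notation from the text.

import numpy as np

def orthogonal_greedy(f, D, n_iter):
    # f      : target vector (a discretization of the target function)
    # D      : matrix whose columns are the normalized dictionary elements
    # n_iter : number of greedy steps
    selected = []
    fn = np.zeros_like(f)
    for _ in range(n_iter):
        r = f - fn                            # residual r_{k-1} = f - f_{k-1}
        k = int(np.argmax(np.abs(D.T @ r)))   # g_k = argmax_g |<g, r_{k-1}>|
        selected.append(k)
        G = D[:, selected]                    # basis of span(g_1, ..., g_k)
        coeffs, *_ = np.linalg.lstsq(G, f, rcond=None)
        fn = G @ coeffs                       # f_k = P_k f (orthogonal projection)
    return selected, fn

# Example usage with a random normalized dictionary and random target.
rng = np.random.default_rng(0)
D = rng.standard_normal((200, 500))
D /= np.linalg.norm(D, axis=0)
f = rng.standard_normal(200)
idx, fn = orthogonal_greedy(f, D, 20)
print(np.linalg.norm(f - fn))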

When applied to general dictionaries $\mathbb{D}\subset H$, greedy algorithms are often analyzed under the assumption that the target function $f$ lies in the convex hull of the dictionary $\mathbb{D}$. Specifically, following [31, 10, 19] we write the closed convex hull of $\mathbb{D}$ as

B_1(\mathbb{D}) = \overline{\left\{\sum_{j=1}^{n} a_j h_j : n\in\mathbb{N},\ h_j\in\mathbb{D},\ \sum_{j=1}^{n}|a_j|\le 1\right\}},    (5)

and denote the gauge norm [30] of this set, which is often called the variation norm with respect to $\mathbb{D}$, as

\|f\|_{\mathcal{K}_1(\mathbb{D})} = \inf\{c>0 : f\in cB_1(\mathbb{D})\}.    (6)
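As a standard illustration (not taken from this paper), if the dictionary consists of the signed elements of an orthonormal basis $(e_k)$ of $H$, then the variation norm reduces to the $\ell^1$ norm of the coefficients:

\mathbb{D} = \{\pm e_k\}_{k\ge 1} \quad\Longrightarrow\quad \left\|\sum_k c_k e_k\right\|_{\mathcal{K}_1(\mathbb{D})} = \sum_k |c_k|,

so that $B_1(\mathbb{D})$ is the closed unit ball of $\ell^1$ viewed as a subset of $H$.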

A typical assumption then is that the target function $f$ satisfies $\|f\|_{\mathcal{K}_1(\mathbb{D})}<\infty$. One notable exception is compressed sensing, where it is typically assumed that $f$ is a linear combination of a small number of elements $g_i\in\mathbb{D}$. In this case, however, one needs an additional assumption on the dictionary $\mathbb{D}$, usually incoherence [14] or a restricted isometry property (RIP) [7]. The analysis of greedy algorithms under these assumptions is given, for instance, in [35, 26, 27, 34]. In this work we are interested in general dictionaries $\mathbb{D}\subset H$ which need not satisfy any incoherence or RIP conditions, and so we will consider the convex hull condition on $f$ given above.

Given a function $f\in B_1(\mathbb{D})$, a sampling argument due to Maurey [29] implies that there exists an $n$-term expansion which satisfies

\inf_{f_n\in\Sigma_n(\mathbb{D})}\|f - f_n\|_H \le |\mathbb{D}|\,n^{-\frac{1}{2}}.    (7)

Here $|\mathbb{D}| := \sup_{g\in\mathbb{D}}\|g\|_H$ is the maximum norm of the dictionary and

\Sigma_n(\mathbb{D}) = \left\{\sum_{i=1}^{n} a_i g_i : a_i\in\mathbb{R},\ g_i\in\mathbb{D}\right\}    (8)

is the set of $n$-term non-linear dictionary expansions. One can even take the expansion $f_n$ in (7) to satisfy $\sum_{i=1}^{n}|a_i|\le 1$. It is also known that for general dictionaries $\mathbb{D}$, the approximation rate (7) is the best possible up to a constant factor [19]. The key problem addressed by greedy algorithms is whether the rate in (7) can be achieved algorithmically.
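For orientation, we sketch the standard expectation calculation behind Maurey's argument (this sketch is our addition, included for the reader's convenience). Suppose $f=\sum_j a_j h_j$ with $h_j\in\mathbb{D}$ and $\sum_j|a_j|\le 1$ (the general case follows by approximation), and let $X$ be a random element equal to $\mathrm{sgn}(a_j)h_j$ with probability $|a_j|$ and equal to $0$ with the remaining probability, so that $\mathbb{E}X = f$. If $X_1,\dots,X_n$ are i.i.d. copies of $X$, then

\mathbb{E}\left\|f - \frac{1}{n}\sum_{i=1}^{n}X_i\right\|_H^2 = \frac{1}{n}\left(\mathbb{E}\|X\|_H^2 - \|f\|_H^2\right) \le \frac{|\mathbb{D}|^2}{n},

so some realization of $\frac{1}{n}\sum_{i=1}^{n}X_i$, which is an $n$-term expansion whose coefficients sum to at most $1$ in absolute value, attains the bound (7).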

For simplicity of notation, we assume in the following that $|\mathbb{D}|\le 1$, i.e. that $\|g\|_H\le 1$ for all $g\in\mathbb{D}$, which can always be achieved by scaling the dictionary appropriately. Relaxing this assumption changes the results in a straightforward manner. It has been shown that under these assumptions the orthogonal greedy algorithm [12] and a suitable version of the relaxed greedy algorithm [17, 3] satisfy

\|f_n - f\|_H \le K\|f\|_{\mathcal{K}_1(\mathbb{D})}\,n^{-\frac{1}{2}},    (9)

for a suitable constant $K$ (here and in the following, $K$, $M$, and $C$ denote unspecified constants). Thus the orthogonal and relaxed greedy algorithms attain the approximation rate in (7) algorithmically, up to a constant. The behavior of the pure greedy algorithm is much more subtle. A sequence of improved upper bounds on the convergence rate of the pure greedy algorithm has been obtained in [12, 21, 33], culminating in the bound

\|f_n - f\|_H \le K\|f\|_{\mathcal{K}_1(\mathbb{D})}\,n^{-\gamma},    (10)

where $\gamma\approx 0.182$ satisfies a particular non-linear equation. Conversely, in [22] a dictionary is constructed for which the pure greedy algorithm satisfies

\|f_n - f\|_H \ge K\|f\|_{\mathcal{K}_1(\mathbb{D})}\,n^{-0.1898}.    (11)

Thus the precise convergence rate of the pure greedy algorithm is still open, but it is known that it fails to achieve the rate (7) by a significant margin. A further interesting open problem concerns the pure greedy algorithm with shrinkage $s\in(0,1]$, which is given by

f_0 = 0,\quad g_k = \arg\max_{g\in\mathbb{D}}|\langle g, r_{k-1}\rangle|,\quad f_k = f_{k-1} + s\langle g_k, r_{k-1}\rangle g_k,    (12)

where again $r_k = f - f_k$ is the residual. This algorithm is important for understanding gradient boosting [15]. It is shown in [33] that the convergence order improves as $s$ decreases, up to a maximum of about $0.305$ as $s\rightarrow 0$, but it is an open problem whether the optimal rate (7) can be achieved in the limit as $s\rightarrow 0$.
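For illustration, a minimal Python sketch of the shrinkage iteration (12) for a finite dictionary matrix is given below; as with the earlier sketch, the names pga_shrinkage, D, and f are illustrative, and the columns of D are assumed to be normalized.

import numpy as np

def pga_shrinkage(f, D, n_iter, s=0.1):
    # Pure greedy algorithm with shrinkage s in (0, 1], cf. (12).
    # With s = 1 this reduces to matching pursuit (2); a small s mimics
    # gradient boosting with a small learning rate.
    fk = np.zeros_like(f)
    for _ in range(n_iter):
        r = f - fk                            # residual r_{k-1}
        k = int(np.argmax(np.abs(D.T @ r)))   # greedy selection
        fk = fk + s * (D[:, k] @ r) * D[:, k] # shrunken update
    return fk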

The preceding results hold under the assumption that $\|f\|_{\mathcal{K}_1(\mathbb{D})}<\infty$, i.e. that the target function is in the scaled convex hull of the dictionary. An important question is how robust the greedy algorithms are to noise. This is captured by the more general assumption that $f$ is not itself in the convex hull, but rather that $f$ is close to an element of the convex hull. Specifically, we assume that $h\in\mathcal{K}_1(\mathbb{D})$ and that $\|f-h\|_H$, which can be thought of as noise, is small. The convergence rates of the orthogonal and relaxed greedy algorithms are analyzed under these assumptions in [3]. Specifically, it is shown that for any $f\in H$ and any $h\in\mathcal{K}_1(\mathbb{D})$, the iterates of the orthogonal greedy algorithm satisfy

\|f_n - f\|_H^2 \le \|f - h\|_H^2 + K^2\|h\|_{\mathcal{K}_1(\mathbb{D})}^2\,n^{-1}.    (13)

This result is important for the statistical analysis of the orthogonal greedy algorithm, for instance in showing the universal consistency of the estimator obtained by applying this algorithm to fit a function $f$ based on samples $f(x_1),\dots,f(x_n)$ [3]. It also gives a convergence rate for the orthogonal greedy algorithm on the interpolation spaces between $\mathcal{K}_1(\mathbb{D})$ and $H$. The $K$-functional of $H$ and $\mathcal{K}_1(\mathbb{D})$ is defined by

K(f,t) := K(f,t,H,\mathcal{K}_1(\mathbb{D})) = \inf_{g\in H}\ \|f-g\|_H + t\|g\|_{\mathcal{K}_1(\mathbb{D})}.    (14)

The interpolation space $X_\theta := [H,\mathcal{K}_1(\mathbb{D})]_{\theta,\infty}$ is defined by the norm $\|f\|_{X_\theta} = \sup_{t>0} K(f,t)t^{-\theta}$. Intuitively, the interpolation spaces measure how efficiently a function can be approximated by elements with small $\mathcal{K}_1(\mathbb{D})$-norm. We refer to [3] and [11], Chapter 6, for more information on interpolation spaces.
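To make this intuition concrete, we note the following elementary consequence of (14) (this remark is our addition and follows directly by choosing a near-minimizer in the $K$-functional): if $f\in X_\theta$ with $0<\theta<1$, then for every $M>0$

\inf_{\|h\|_{\mathcal{K}_1(\mathbb{D})}\le M}\|f - h\|_H \le C\,M^{-\frac{\theta}{1-\theta}},

with $C$ depending only on $\|f\|_{X_\theta}$ and $\theta$. That is, membership in $X_\theta$ quantifies how well $f$ can be approximated by elements of bounded $\mathcal{K}_1(\mathbb{D})$-norm.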

Although the approximation rate (7) is sharp for general dictionaries, for compact dictionaries the rate can be improved using a stratified sampling argument [24, 18, 36]. Specifically, the dictionary

\mathbb{P}_0^d = \{\sigma_0(\omega\cdot x + b) : \omega\in\mathbb{R}^d,\ b\in\mathbb{R}\}\subset L^2(\Omega),    (15)

where the activation function $\sigma_0$ is the Heaviside function and $\Omega\subset\mathbb{R}^d$ is a bounded domain, was analyzed in [24]. There it was shown that for $f\in B_1(\mathbb{P}_0^d)$ the approximation rate (7) can be improved to

\inf_{f_n\in\Sigma_n(\mathbb{P}_0^d)}\|f - f_n\|_H \le K n^{-\frac{1}{2}-\frac{1}{2d}}.    (16)

This corresponds to an improved approximation rate for shallow neural networks with Heaviside (and also sigmoidal) activation function. For shallow neural networks with ReLU$^k$ activation function $\sigma_k(t) = \max(0,t)^k$, the relevant dictionary [32] is

\mathbb{P}_k^d = \{\sigma_k(\omega\cdot x + b) : \omega\in S^{d-1},\ b\in[-c,c]\}\subset L^2(\Omega),    (17)

where $S^{d-1}$ is the unit sphere in $\mathbb{R}^d$ and $c$ is the diameter of $\Omega$ (assuming without loss of generality that $0\in\Omega$). Using the smoothness of the dictionary $\mathbb{P}_k^d$, it has been shown that for $f\in B_1(\mathbb{P}_k^d)$ an approximation rate of

\inf_{f_n\in\Sigma_n(\mathbb{P}_k^d)}\|f - f_n\|_H \le K n^{-\frac{1}{2}-\frac{2k+1}{2d}}    (18)

can be achieved, and that this rate is optimal if the coefficients of $f_n$ are bounded [32]. Given these improved theoretical approximation rates for shallow neural networks, it has been an important open problem whether they can be achieved algorithmically by greedy algorithms.

In this work, we show that the improved approximation rates (18) can be achieved using the orthogonal greedy algorithm. More generally, we show that the orthogonal greedy algorithm improves upon the rate (7) whenever the convex hull of the dictionary $\mathbb{D}$ has small metric entropy. Specifically, we recall that the (dyadic) metric entropy of a set $A$ in a Banach space $X$ is defined by

\epsilon_n(A)_X = \inf\{\epsilon : A \text{ can be covered by } 2^n \text{ balls of radius } \epsilon\}.    (19)

The metric entropy gives a measure of the compactness of the set $A$; a detailed theory can be found in [23], Chapter 15. We show that if the dictionary $\mathbb{D}$ satisfies

\epsilon_n(B_1(\mathbb{D}))_H \le K n^{-\frac{1}{2}-\alpha},    (20)

then the orthogonal greedy algorithm (4) satisfies

\|f_n - f\|_H \le K'\|f\|_{\mathcal{K}_1(\mathbb{D})}\,n^{-\frac{1}{2}-\alpha},    (21)

where $K'$ only depends upon $K$ and $\alpha$. More generally, this analysis is also robust to noise in the sense considered in [3], i.e. for any $f\in H$ and $h\in\mathcal{K}_1(\mathbb{D})$ we have

\|f_n - f\|_H^2 \le \|f - h\|_H^2 + (K')^2\|h\|_{\mathcal{K}_1(\mathbb{D})}^2\,n^{-1-2\alpha}.    (22)

Utilizing the metric entropy bounds proven in [32], this implies that the orthogonal greedy algorithm achieves the rate (18) for shallow neural networks with ReLU$^k$ activation function. We provide numerical experiments using shallow neural networks with the Heaviside activation function which confirm that these optimal rates are indeed achieved. Additional numerical experiments confirming the theoretical approximation rates for shallow networks with ReLU$^k$ activation function can be found in [16], where the orthogonal greedy algorithm is used to solve elliptic PDEs.

We conclude the manuscript with a lower bound and a negative result concerning the orthogonal greedy algorithm. Consider approximating $f\in B_1(\mathbb{D})$ by dictionary expansions with $\ell^1$-bounded coefficients, i.e. from the set

\Sigma_{n,M}(\mathbb{D}) = \left\{\sum_{j=1}^{n} a_j h_j : h_j\in\mathbb{D},\ \sum_{j=1}^{n}|a_j|\le M\right\}.    (23)

It is known that the corresponding approximation rates are lower bounded by the metric entropy up to logarithmic factors. In particular, if for some $M>0$ and any $f\in B_1(\mathbb{D})$ we have

\inf_{f_n\in\Sigma_{n,M}(\mathbb{D})}\|f - f_n\| \le K n^{-\frac{1}{2}-\alpha},    (24)

then we must have $\epsilon_{n\log n}(B_1(\mathbb{D})) \le K n^{-\frac{1}{2}-\alpha}$. The result (21) shows that the orthogonal greedy algorithm achieves this rate. However, we note that these lower bounds do not a priori apply to the orthogonal greedy algorithm, since the expansions it generates will in general not have coefficients uniformly bounded in $\ell^1$. In fact, we give an example demonstrating that the iterates generated by the orthogonal greedy algorithm may have arbitrarily large $\mathcal{K}_1(\mathbb{D})$-norm. Nonetheless, we will show that the rate (21) is sharp and cannot be further improved for general dictionaries satisfying (20). We do not know whether there exists a greedy algorithm which can attain the improved approximation rate (21) while also guaranteeing a uniform $\ell^1$-bound on the coefficients.

The paper is organized as follows. In Section 2, we derive the improved convergence rate (21). Next, in Section 3, we provide numerical experiments which demonstrate the improved rates. Then, in Section 4, we give an example showing that the iterates generated by the orthogonal greedy algorithm cannot be bounded in $\mathcal{K}_1(\mathbb{D})$ and also show that the improved rates are tight. Finally, we give concluding remarks and further research directions.

2 Analysis of the Orthogonal Greedy Algorithm

We begin with the following key lemma.

Lemma 1.

Let $\delta>0$ and let $\mathbb{D}$ be a dictionary with

\epsilon_n(B_1(\mathbb{D})) \le C n^{-\frac{1}{2}-\delta}.    (25)

Then there exists a $c = c(\delta,C)>0$ such that for any sequence $d_1,\dots,d_n\in\mathbb{D}$, we have

\sum_{k=1}^{n}\|(I-P_{k-1})d_k\|^{-2} \ge c\,n^{1+2\delta},    (26)

where $P_k$ is the orthogonal projection onto the span of $d_1,\dots,d_k$.

The intuitive idea behind this lemma is that if (26) fails, then the convex hull $B_1(\mathbb{D})$ must contain a relatively large skewed simplex. A comparison of volumes then gives a lower bound on its entropy.

Proof.

By scaling the dictionary, we may assume without loss of generality that $C=1$. Fix a $c>0$ and consider a sequence $d_1,\dots,d_{2n}\in\mathbb{D}$. Assume that

\sum_{k=1}^{2n}\|(I-P_{k-1})d_k\|^{-2} \le c\,n^{1+2\delta}.    (27)

We will show that for sufficiently small $c$ this contradicts the entropy bound (25). Rescaling this value of $c$ by $2^{-1-2\delta}$ then gives the desired bound (26).

Choose the $n$ indices $k_1<\cdots<k_n$ for which $\|(I-P_{k-1})d_k\|^{-2}$ is smallest. For each $i=1,\dots,n$, there are at least $n$ indices $k$ for which $\|(I-P_{k-1})d_k\|^{-2} \ge \|(I-P_{k_i-1})d_{k_i}\|^{-2}$. This gives the inequality

\sum_{k=1}^{2n}\|(I-P_{k-1})d_k\|^{-2} \ge n\,\|(I-P_{k_i-1})d_{k_i}\|^{-2},    (28)

so that by (27) we must have $\|(I-P_{k_i-1})d_{k_i}\|^{-2} \le c\,n^{2\delta}$ and thus

\|(I-P_{k_i-1})d_{k_i}\| \ge c^{-\frac{1}{2}}n^{-\delta}    (29)

for each $i=1,\dots,n$. Further, if we replace the projection $P_{k_i-1}$ by the orthogonal projection onto the span of $d_{k_1},\dots,d_{k_{i-1}}$, the length in (29) can only increase (since we are removing dictionary elements from the projection). Thus, after relabelling we obtain a sequence $d_1,\dots,d_n\in\mathbb{D}$ such that

\|(I-P_{k-1})d_k\| \ge c^{-\frac{1}{2}}n^{-\delta}    (30)

for $k=1,\dots,n$.

Let $\tilde{d}_1,\dots,\tilde{d}_n$ be the Gram-Schmidt orthogonalization of the sequence $d_1,\dots,d_n$, i.e. $\tilde{d}_k = (I-P_{k-1})d_k$. From (30), we obtain $\|\tilde{d}_i\| \ge c^{-\frac{1}{2}}n^{-\delta}$ for each $i=1,\dots,n$, and thus, since the $\tilde{d}_i$ are orthogonal, we obtain the following bound

|\mathrm{co}(\pm\tilde{d}_1,\dots,\pm\tilde{d}_n)| \ge \left(c^{-\frac{1}{2}}n^{-\delta}\right)^n\frac{2^n}{n!}.    (31)

Here $\mathrm{co}(\pm\tilde{d}_1,\dots,\pm\tilde{d}_n)$ is the absolute convex hull of the sequence $\tilde{d}_1,\dots,\tilde{d}_n$.

Next we use the fact that the change of variables between the $\{d_i\}$ and the $\{\tilde{d}_i\}$ is upper triangular with ones on the diagonal and thus has determinant $1$ (since $P_{k-1}d_k\in\mathrm{span}(d_1,\dots,d_{k-1})$). This implies that

|\mathrm{co}(\pm d_1,\dots,\pm d_n)| = |\mathrm{co}(\pm\tilde{d}_1,\dots,\pm\tilde{d}_n)| \ge \left(c^{-\frac{1}{2}}n^{-\delta}\right)^n\frac{2^n}{n!}.    (32)

In other words, the convex hull of the $d_i$ is a skewed simplex with large volume.

We now use the covering definition of the entropy, setting $\epsilon := \epsilon_n(B_1(\mathbb{D}))_H$, and the fact that $\mathrm{co}(\pm d_1,\dots,\pm d_n)\subset B_1(\mathbb{D})$, to get

\left(c^{-\frac{1}{2}}n^{-\delta}\right)^n\frac{2^n}{n!} \le |\mathrm{co}(\pm d_1,\dots,\pm d_n)| \le |B_1(\mathbb{D})| \le (2\epsilon)^n\frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2}+1\right)},    (33)

where the right hand side is the total volume of $2^n$ balls of radius $\epsilon$.

Utilizing Stirling's formula (which gives $(n!)^{-1/n}\eqsim n^{-1}$ and $\Gamma(\frac{n}{2}+1)^{-1/n}\eqsim n^{-1/2}$) and taking $n$-th roots, we get

\epsilon \ge K c^{-\frac{1}{2}} n^{-\frac{1}{2}-\delta},    (34)

for an absolute constant $K$. For sufficiently small $c$, this contradicts the bound (25), which completes the proof. ∎

Finally, we come to the main result of this section, which shows that the orthogonal greedy algorithm (4) achieves a convergence rate matching the entropy for dictionaries $\mathbb{D}$ whose entropy decays faster than $O(n^{-\frac{1}{2}})$. In fact, we prove a bit more, generalizing the result from [3] to obtain improved approximation rates for the interpolation spaces between $\mathcal{K}_1(\mathbb{D})$ and $H$ as well.

Theorem 1.

Let $\delta>0$ and suppose that $\mathbb{D}\subset H$ is a dictionary which satisfies

\epsilon_n(B_1(\mathbb{D})) \le C n^{-\frac{1}{2}-\delta},    (35)

for some constant $C<\infty$. Let the iterates $f_n$ be given by algorithm (4) applied to $f\in H$. Further, let $h\in\mathcal{K}_1(\mathbb{D})$ be arbitrary. Then we have

\|f_n - f\|^2 \le \|f - h\|^2 + K\|h\|_{\mathcal{K}_1(\mathbb{D})}^2\,n^{-1-2\delta},    (36)

where $K$ depends only upon $\delta$ and $C$. In particular, if $f\in\mathcal{K}_1(\mathbb{D})$, then

\|f_n - f\| \le K\|f\|_{\mathcal{K}_1(\mathbb{D})}\,n^{-\frac{1}{2}-\delta}.    (37)

We note that although $f_n$ is a linear combination of $n$ dictionary elements $g_1,\dots,g_n$, we do not necessarily have that $f_n$ can be bounded in $\mathcal{K}_1(\mathbb{D})$, as we show in Proposition 1. This is due to the fact that the coefficients in the expansion generated by the orthogonal greedy algorithm (4) cannot in general be bounded in $\ell^1$.

Further, the bound (36) shows that the improved convergence behavior of the orthogonal greedy algorithm is robust to noise. This is critical for the statistical analysis of the algorithm, for instance for proving statistical consistency [3]. Further, it enables one to prove a convergence rate for functions $f$ in the interpolation space $X_\theta = [H,\mathcal{K}_1(\mathbb{D})]_{\theta,\infty}$, analogous to the results in [3]. We remark that in [3] a Fourier integrability condition is given which guarantees membership in $X_\theta$ when $\mathbb{D} = \mathbb{P}_0^d$ is the dictionary corresponding to shallow neural networks with sigmoidal activation function. This generalizes the Fourier condition introduced by Barron in [2].

Corollary 1.

Let π”»βŠ‚H\mathbb{D}\subset H be a dictionary satisfying the assumptions of Theorem 1. Then for f∈XΞΈf\in X_{\theta} we have

β€–fnβˆ’f‖≀K​‖fβ€–Xθ​nβˆ’(12+Ξ΄)​θ,\|f_{n}-f\|\leq K\|f\|_{X_{\theta}}n^{-(\frac{1}{2}+\delta)\theta}, (38)

where KK depends only upon Ξ΄\delta and CC.

Proof.

Taking the square root of (36) we see that

β€–fnβˆ’f‖≀‖fβˆ’hβ€–+M​‖h‖𝒦1​(𝔻)​nβˆ’12βˆ’Ξ΄\|f_{n}-f\|\leq\|f-h\|+M\|h\|_{\mathcal{K}_{1}(\mathbb{D})}n^{-\frac{1}{2}-\delta} (39)

for a new constant MM (we can take M=KM=\sqrt{K}). Taking the infemum over hβˆˆπ’¦1​(𝔻)h\in\mathcal{K}_{1}(\mathbb{D}), we get

β€–fnβˆ’f‖≀K​(M​nβˆ’12βˆ’Ξ΄,f),\|f_{n}-f\|\leq K(Mn^{-\frac{1}{2}-\delta},f), (40)

where KK is the KK-functional introduced in the introduction (see also [11], Chapter 6 for a more detailed theory). The definition of the interpolation space XΞΈX_{\theta} implies that

K​(t,f)≀‖fβ€–Xθ​tΞΈK(t,f)\leq\|f\|_{X_{\theta}}t^{\theta} (41)

for all t>0t>0. Setting t=M​nβˆ’12βˆ’Ξ΄t=Mn^{-\frac{1}{2}-\delta} in (40) gives the desired result. ∎

Proof of Theorem 1.

Throughout the proof, let $P_k$ denote the orthogonal projection onto $\mathrm{span}(g_1,\dots,g_k)$ and let $r_k = f - f_k$ denote the residual. Since $f_k$ is the best approximation to $f$ from the space $V_k = \mathrm{span}(g_1,\dots,g_k)$, we have

\|r_k\|^2 \le \left\|r_{k-1} - \frac{\langle r_{k-1},(I-P_{k-1})g_k\rangle}{\|(I-P_{k-1})g_k\|^2}(I-P_{k-1})g_k\right\|^2 = \|r_{k-1}\|^2 - \frac{|\langle r_{k-1},(I-P_{k-1})g_k\rangle|^2}{\|(I-P_{k-1})g_k\|^2}.    (42)

Next, we note that since $r_{k-1}$ is orthogonal to $g_1,\dots,g_{k-1}$, we have $\langle r_{k-1},(I-P_{k-1})g_k\rangle = \langle r_{k-1},g_k\rangle$. In addition, since $r_{k-1}$ is orthogonal to $f_{k-1} = f - r_{k-1}$, we see that

\begin{split}\|r_{k-1}\|^2 = \langle r_{k-1},f\rangle &= \langle r_{k-1},h + f - h\rangle \le \|h\|_{\mathcal{K}_1}\max_{g\in\mathbb{D}}|\langle r_{k-1},g\rangle| + \langle r_{k-1},f-h\rangle\\ &\le \|h\|_{\mathcal{K}_1}|\langle r_{k-1},g_k\rangle| + \frac{1}{2}\left(\|r_{k-1}\|^2 + \|f-h\|^2\right).\end{split}    (43)

Setting $b_k = \|r_k\|^2 - \|f-h\|^2$, we can rewrite this as

b_{k-1} \le 2\|h\|_{\mathcal{K}_1}|\langle r_{k-1},g_k\rangle|,    (44)

which gives the lower bound

|\langle r_{k-1},(I-P_{k-1})g_k\rangle| = |\langle r_{k-1},g_k\rangle| \ge \frac{1}{2}b_{k-1}\|h\|_{\mathcal{K}_1}^{-1}.    (45)

If $b_k$ is ever negative, then the desired result clearly holds for all $n\ge k$ since $b_k$ is decreasing. So we assume without loss of generality that $b_k>0$ in what follows.

Subtracting $\|f-h\|^2$ from both sides of (42) and using the lower bound above, we get the recursion

b_k \le b_{k-1}\left(1 - \frac{b_{k-1}}{4\|(I-P_{k-1})g_k\|^2\|h\|_{\mathcal{K}_1}^2}\right).    (46)

Defining the sequence $a_k = (2\|h\|_{\mathcal{K}_1})^{-2}b_k$, we obtain the recursion

a_k \le a_{k-1}\left(1 - \|(I-P_{k-1})g_k\|^{-2}a_{k-1}\right).    (47)

If $a_0>1$, then this recursion implies that $a_1<0$ (since $\|g_k\|\le 1$ and thus $\|(I-P_{k-1})g_k\|\le 1$), and as remarked before, the desired conclusion is easily seen to hold in this case. Hence we can assume without loss of generality that $a_0\le 1$. In addition, if $a_k\le 0$ ever holds, the result immediately follows for all $n\ge k$. So we also assume without loss of generality in the following that $a_k>0$.

Utilizing the inequality $\log(1+x)\le x$, we rewrite the recursion (47) as

\log(a_k) \le \log(a_{k-1}) - \|(I-P_{k-1})g_k\|^{-2}a_{k-1}.    (48)

At this point, we could expand the recursion and use that $a_0\le 1$ to get

\log(a_n) \le -\sum_{k=1}^{n}\|(I-P_{k-1})g_k\|^{-2}a_{k-1} \le -a_n\sum_{k=1}^{n}\|(I-P_{k-1})g_k\|^{-2},    (49)

where the last inequality is due to the fact that the sequence $a_k$ is decreasing. Applying Lemma 1, we obtain

\log(a_n) \le -c\,a_n n^{1+2\delta}.    (50)

Solving this gives the desired result up to logarithmic factors.
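For completeness, the elementary step hidden in "solving this" can be spelled out as follows (this computation is our addition, included only as a clarification). Setting $u_n = c\,a_n n^{1+2\delta}$, the bound (50) reads $u_n + \log(u_n) \le \log(c\,n^{1+2\delta})$, and considering the cases $u_n\ge 1$ and $u_n<1$ separately gives, for $n$ large enough that $c\,n^{1+2\delta}\ge 1$,

a_n \le \frac{1 + \log\left(c\,n^{1+2\delta}\right)}{c\,n^{1+2\delta}},

which is the rate $n^{-1-2\delta}$ up to a logarithmic factor.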

In order to remove the logarithmic factors, we use again that the sequence $a_k$ is decreasing and dyadically expand the recursion (48) to get

\begin{split}\log(a_{2^j}) &\le \log(a_{2^{j-1}}) - \sum_{k=2^{j-1}+1}^{2^j}\|(I-P_{k-1})g_k\|^{-2}a_{k-1}\\ &\le \log(a_{2^{j-1}}) - a_{2^j}\sum_{k=2^{j-1}+1}^{2^j}\|(I-P_{k-1})g_k\|^{-2},\end{split}    (51)

for each $j\ge 1$.

Note that $\|(I-P_{k-1})g_k\| \le \|(I-\tilde{P}_{k-1})g_k\|$, where $\tilde{P}_{k-1}$ is the orthogonal projection onto the space spanned by $g_{2^{j-1}+1},\dots,g_{k-1}$. Thus, applying Lemma 1 to the dictionary $\mathbb{D}$ and the sequence $g_{2^{j-1}+1},\dots,g_{2^j}$, which contains $2^{j-1}$ elements, we get

\log(a_{2^j}) \le \log(a_{2^{j-1}}) - c\,2^{(1+2\delta)(j-1)}a_{2^j}.    (52)

Here we may assume, by decreasing $c$ if necessary, that $c\le 1$. Next we will prove by induction that

a_{2^j} \le (1+2\delta)c^{-1}2^{-(1+2\delta)(j-1)}.

This completes the proof since $a_k$ is decreasing, and so it implies that

a_k \le C k^{-(1+2\delta)},    (53)

with $C = 4^{1+2\delta}(1+2\delta)c^{-1}$.

The proof by induction proceeds as follows. For $j=0$, we note that $a_1 \le a_0 \le 1 \le (1+2\delta)c^{-1}2^{1+2\delta}$ (here we use that $c\le 1$).

Next, assume that the result holds for $j-1$, so that (52) gives

\log(a_{2^j}) \le \log((1+2\delta)c^{-1}) - \log(2)(1+2\delta)(j-2) - c\,2^{(1+2\delta)(j-1)}a_{2^j}.    (54)

Observe that the left hand side of (54) is an increasing function and the right hand side is a decreasing function of $a_{2^j}$. Further, if we set $a_{2^j} = (1+2\delta)c^{-1}2^{-(1+2\delta)(j-1)}$ in (54), the left hand side is at least as large as the right hand side (the two sides then differ by $(1+2\delta)(1-\log 2)\ge 0$). This implies that

a_{2^j} \le (1+2\delta)c^{-1}2^{-(1+2\delta)(j-1)},

completing the inductive step. ∎

3 Numerical Experiments

In this section, we give numerical experiments which demonstrate the improved convergence rates derived in Section 2. The setting we consider is a special case of the situation considered in [20]. We consider the dictionary

\mathbb{P}_0^2 := \{\sigma_0(\omega\cdot x + b) : \omega\in\mathbb{R}^2,\ b\in\mathbb{R}\}\subset L^2([0,1]^2),    (55)

where $\sigma_0$ is the Heaviside activation function. Non-linear approximation from this dictionary corresponds to approximation by shallow neural networks with Heaviside activation function. The results proved in [32] imply that the metric entropy of the convex hull of $\mathbb{P}_0^2$ satisfies

\epsilon_n(B_1(\mathbb{P}_0^2)) \eqsim n^{-\frac{3}{4}}.    (56)

We use the orthogonal greedy algorithm to approximate the target function

f(x,y) = \sin(\pi(x+y))^2\sin(\pi(x-y^2))    (57)

by a non-linear expansion from the dictionary $\mathbb{P}_0^2$. Note that the target function $f$ is smooth, so that $f\in\mathcal{K}_1(\mathbb{P}_0^2)$ [2]. We approximate the $L^2$ norm on the domain $[0,1]^2$ by the empirical $L^2$-norm on a set of $N$ sample points $X_N = \{x_1,\dots,x_N\}$ drawn uniformly at random from the square $[0,1]^2$. The subproblem

g_k = \arg\max_{g\in\mathbb{P}_0^2}|\langle g, r_k\rangle|    (58)

becomes the combinatorial problem of determining the optimal splitting of the sample points $x_1,\dots,x_N$ by a hyperplane $\omega\cdot x + b = 0$ such that

|\langle\sigma_0(\omega\cdot x + b), r_k\rangle_{L^2(X_N)}| = \left|\sum_{\omega\cdot x_i + b\ge 0} r_k(x_i)\right|    (59)

is maximized. There are $O(N^2)$ such hyperplane splittings, and the optimal splitting can be determined using $O(N^2\log(N))$ operations via a simple modification of the algorithm in [20] which is specific to the case of $\mathbb{R}^2$.
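The following is a simplified Python sketch of this subproblem; it is not the exact $O(N^2\log N)$ algorithm from [20] used in our experiments, but a grid-scan variant we include for illustration (the function name and the grid of directions are assumptions of this sketch). For each candidate direction $\omega$ it sorts the projections $\omega\cdot x_i$ and evaluates all $N$ thresholds with a cumulative sum, which is exactly the quantity appearing in (59).

import numpy as np

def best_heaviside_split(points, residual, n_angles=720):
    # points   : (N, 2) array of sample points x_i
    # residual : (N,) array of residual values r_k(x_i)
    # Scans a grid of directions omega; for each one, all N half-space
    # thresholds are checked via a cumulative sum of the sorted residuals.
    best_val, best_split = -np.inf, None
    for theta in np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False):
        omega = np.array([np.cos(theta), np.sin(theta)])
        proj = points @ omega
        order = np.argsort(proj)
        csum = np.cumsum(residual[order])
        # tail[j] = sum of residuals over points with omega . x >= j-th smallest projection
        tail = csum[-1] - np.concatenate(([0.0], csum[:-1]))
        j = int(np.argmax(np.abs(tail)))
        if np.abs(tail[j]) > best_val:
            best_val = float(np.abs(tail[j]))
            best_split = (omega, -proj[order[j]])   # b such that omega . x + b >= 0
    return best_split, best_val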

We run this algorithm with $N=5000$ for a total of $n=100$ iterations.¹ In Figure 1 we plot the error as a function of the iteration $n$ on a log-log scale. Estimating the convergence order from this plot (here we have removed the first $10$ errors since the rate depends upon the tail of the error sequence) gives a convergence order of $O(n^{-0.717})$. Previous theoretical results for the orthogonal greedy algorithm [12, 3] only give a convergence rate of $O(n^{-\frac{1}{2}})$, while our bounds incorporating the compactness of the dictionary $\mathbb{P}_0^2$ imply a convergence rate of $O(n^{-\frac{3}{4}})$. The empirically estimated convergence order is significantly better than $O(n^{-\frac{1}{2}})$ and is very close to the rate predicted by Theorem 1.

¹All of the code used to run the experiments and generate the plots shown here can be found at https://github.com/jwsiegel2510/OrthogonalGreedyConvergence.
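For reference, the convergence order quoted above is obtained from a least-squares fit of $\log(\text{error})$ against $\log(n)$; a minimal version of this post-processing step is sketched below (the file name errors.txt is a placeholder for the recorded error sequence).

import numpy as np

# errors[k] = ||f - f_{k+1}|| in the empirical L^2 norm, k = 0, ..., 99
errors = np.loadtxt("errors.txt")
n = np.arange(1, len(errors) + 1)
mask = n > 10                      # discard the first 10 iterations (pre-asymptotic regime)
slope, intercept = np.polyfit(np.log(n[mask]), np.log(errors[mask]), 1)
print(f"estimated convergence order: O(n^{slope:.3f})")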

Figure 1: Estimated convergence rate of the orthogonal greedy algorithm for the dictionary $\mathbb{P}_0^2$. We see that the empirically observed convergence order of $O(n^{-0.717})$ is very close to the rate of $O(n^{-\frac{3}{4}})$ predicted by Theorem 1.

In Figure 2 we plot the target function and the approximants $f_n$ obtained at iterations $5$, $10$, and $100$. This illustrates how the approximation of the target function $f$ generated by the orthogonal greedy algorithm improves as we add more dictionary elements to our expansion.

Figure 2: (Top Left) Target function $f$, (Top Right) Approximation after $5$ iterations of the orthogonal greedy algorithm, (Bottom Left) Approximation after $10$ iterations, (Bottom Right) Approximation after $100$ iterations.

4 Lower Bounds

In this section, we show that the iterates generated by the orthogonal greedy algorithm cannot be bounded in $\mathcal{K}_1(\mathbb{D})$ and that the rate derived in Theorem 1 cannot be further improved.

Proposition 1.

For each $R<\infty$, there exists a (normalized) dictionary $\mathbb{D}\subset H$ and an $f\in B_1(\mathbb{D})$ such that if $f_3$ is the iterate generated by the orthogonal greedy algorithm applied to $f$ after $3$ steps, then

\|f_3\|_{\mathcal{K}_1(\mathbb{D})} > R.    (60)

This shows that in the worst case the iterates generated by the orthogonal greedy algorithm are not bounded in the variation space $\mathcal{K}_1(\mathbb{D})$.

Proof.

Let $H = \mathbb{R}^5$ with orthonormal basis $e_1,\dots,e_5$. Let $0<\delta<\epsilon<\frac{1}{2}$ and consider the dictionary $\mathbb{D} = \{x_1,\dots,x_5\}\subset H$ given by

\begin{split}x_1 &= \epsilon e_1 - \sqrt{1-\epsilon^2}\,e_2\\ x_2 &= \epsilon e_2 + \sqrt{1-\epsilon^2}\,e_3\\ x_3 &= e_3\\ x_4 &= \frac{\epsilon}{4}(e_1+e_2) + \frac{1}{2}e_3 + ce_4 + \delta e_5\\ x_5 &= \frac{\epsilon}{4}(e_1+e_2) + \frac{1}{2}e_3 - ce_4 + \delta e_5,\end{split}    (61)

where $c$ is chosen so that $\|x_i\|_H = 1$ for all $i$. Consider the element

f = \frac{\epsilon}{4}(e_1+e_2) + \frac{1}{2}e_3 + \delta e_5 = \frac{1}{2}x_4 + \frac{1}{2}x_5\in B_1(\mathbb{D}).    (62)

We claim that if $\delta<\frac{\epsilon}{\sqrt{8}}$, then the orthogonal greedy algorithm applied to $f$ and $\mathbb{D}$ will select $g_1 = x_3$, $g_2 = x_2$, and $g_3 = x_1$ in the first $3$ steps.

Indeed, we calculate

\begin{split}\langle f,x_1\rangle = \frac{\epsilon}{4}\left(\epsilon - \sqrt{1-\epsilon^2}\right),\quad \langle f,x_2\rangle &= \frac{1}{2}\left(\frac{\epsilon^2}{2} + \sqrt{1-\epsilon^2}\right),\quad \langle f,x_3\rangle = \frac{1}{2},\\ \langle f,x_4\rangle = \langle f,x_5\rangle &= \frac{\epsilon^2}{8} + \frac{1}{4} + \delta^2.\end{split}    (63)

We now verify, by differentiating for example, that for $\epsilon>0$ we have

\frac{\epsilon^2}{2} + \sqrt{1-\epsilon^2} < 1,    (64)

which implies that $|\langle f,x_2\rangle| < |\langle f,x_3\rangle|$. The other inequalities are obvious (recalling that $\delta<\frac{\epsilon}{\sqrt{8}}$), and so it holds that $g_1 = x_3$. Projecting $f$ orthogonally to $x_3 = e_3$, we get

r_1 := f - f_1 = \frac{\epsilon}{4}(e_1+e_2) + \delta e_5.

Calculating inner products, we see that

\begin{split}\langle r_1,x_1\rangle &= \frac{\epsilon}{4}\left(\epsilon - \sqrt{1-\epsilon^2}\right),\quad \langle r_1,x_2\rangle = \frac{\epsilon^2}{4},\\ \langle r_1,x_4\rangle &= \langle r_1,x_5\rangle = \frac{\epsilon^2}{8} + \delta^2.\end{split}    (65)

Since $\delta<\frac{\epsilon}{\sqrt{8}}$, these relations imply that $g_2 = x_2$. Projecting orthogonally to $\mathrm{span}(g_1,g_2) = \mathrm{span}(x_3,x_2) = \mathrm{span}(e_2,e_3)$, we get

r_2 := f - f_2 = \frac{\epsilon}{4}e_1 + \delta e_5.

Finally, computing inner products, we get

\langle r_2,x_1\rangle = \frac{\epsilon^2}{4},\quad \langle r_2,x_4\rangle = \langle r_2,x_5\rangle = \frac{\epsilon^2}{16} + \delta^2.    (66)

Again, since $\delta<\frac{\epsilon}{\sqrt{8}}$, this implies that $g_3 = x_1$. Projecting orthogonally to $\mathrm{span}(g_1,g_2,g_3) = \mathrm{span}(e_1,e_2,e_3)$, we obtain

r_3 = \delta e_5\quad\text{and}\quad f_3 = \frac{\epsilon}{4}(e_1+e_2) + \frac{1}{2}e_3.    (67)

Now, if $f_3 = \sum_{i=1}^{5}a_i x_i$, then by taking inner products with $e_4$ and $e_5$, we see that $a_4 - a_5 = 0$ and $a_4 + a_5 = 0$, which implies that $a_4 = a_5 = 0$. Thus $f_3 = \sum_{i=1}^{3}a_i x_i$, and since the $x_i$ are linearly independent, the $a_i$ are uniquely determined; a simple calculation shows them to be

a_1 = \frac{1}{4},\quad a_2 = \frac{1}{4} + \frac{\sqrt{1-\epsilon^2}}{4\epsilon},\quad a_3 = \frac{1}{2} - \frac{\sqrt{1-\epsilon^2}}{4} - \frac{1-\epsilon^2}{4\epsilon}.    (68)

We finally get

\|f_3\|_{\mathcal{K}_1(\mathbb{D})} = |a_1| + |a_2| + |a_3| \ge \frac{\sqrt{1-\epsilon^2}}{4\epsilon}.    (69)

Letting $\epsilon\rightarrow 0$, we obtain the desired result.

∎
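The following short computation is a numerical illustration of the final step of the proof (it is our addition, not part of the original argument; the helper name k1_norm_of_f3 is illustrative). It expands $f_3$, the projection of $f$ onto $\mathrm{span}(e_1,e_2,e_3)$, in the elements $x_1,x_2,x_3$ and prints the $\ell^1$ norm of the coefficients, which blows up as $\epsilon\to 0$.

import numpy as np

def k1_norm_of_f3(eps):
    # x_1, x_2, x_3 and f_3 restricted to the coordinates (e_1, e_2, e_3);
    # the expansion f_3 = a_1 x_1 + a_2 x_2 + a_3 x_3 is unique, and its
    # l1 norm equals the K_1(D) norm of f_3 since a_4 = a_5 = 0.
    s = np.sqrt(1.0 - eps**2)
    x1 = np.array([eps, -s, 0.0])
    x2 = np.array([0.0, eps, s])
    x3 = np.array([0.0, 0.0, 1.0])
    f3 = np.array([eps / 4, eps / 4, 0.5])
    coeffs = np.linalg.solve(np.column_stack([x1, x2, x3]), f3)
    return np.abs(coeffs).sum()

for eps in [0.4, 0.1, 0.01, 0.001]:
    print(f"eps = {eps:7.3f}   ||f_3||_K1 = {k1_norm_of_f3(eps):12.3f}")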

Proposition 2.

There exists a Hilbert space $H$ and a dictionary $\mathbb{D}\subset H$ such that $\epsilon_n(B_1(\mathbb{D}))\le Cn^{-\frac{1}{2}-\alpha}$ and, for each $n$,

\sup_{f\in B_1(\mathbb{D})}\|f - f_n\|_H \ge K n^{-\frac{1}{2}-\alpha},    (70)

where $f_n$ is the $n$-th iterate of the orthogonal greedy algorithm applied to $f$.

This implies the optimality of the rates in Theorem 1 under the given assumptions on the entropy of $B_1(\mathbb{D})$.

Proof.

Let $H = \ell^2$ and consider the dictionary

\mathbb{D} = \{k^{-\alpha}e_k : k\ge 1\}.    (71)

It is known that $\epsilon_n(B_1(\mathbb{D}))\le Cn^{-\frac{1}{2}-\alpha}$ for a constant $C$ [1]. Let $N$ be an integer, and consider the element

f_N = \frac{1}{N}\sum_{k=1}^{N}k^{-\alpha}e_k.    (72)

We obviously have that $\|f_N\|_{\mathcal{K}_1(\mathbb{D})} = 1$. Moreover, it is clear that after $n$ iterations of the orthogonal greedy algorithm, the residual $r_n$ will satisfy

r_n = \frac{1}{N}\sum_{k=n+1}^{N}k^{-\alpha}e_k,    (73)

since the dictionary element $g_k$ chosen at the $k$-th iteration will be $g_k = k^{-\alpha}e_k$. Choosing $N = 2n$, we get

\|r_n\| = \left(\frac{1}{4n^2}\sum_{k=n+1}^{2n}k^{-2\alpha}\right)^{\frac{1}{2}} \ge \left(\frac{1}{4n}(2n)^{-2\alpha}\right)^{\frac{1}{2}} \ge 2^{-(1+\alpha)}n^{-\frac{1}{2}-\alpha}.    (74)

∎

This shows that the rate derived in Theorem 1 is unimprovable. To conclude, we note that in the preceding argument we used a sequence of elements $f_N\in B_1(\mathbb{D})$. However, the same lower bound can be obtained for a single element $f$ by modifying the construction in [22], Section 8, in a straightforward manner.
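A quick numerical check of this construction (our addition, with illustrative names) is straightforward: for the diagonal dictionary (71) the orthogonal greedy algorithm selects $g_k = k^{-\alpha}e_k$ at step $k$, so the residual (73) can be evaluated in closed form and compared with the lower bound (74).

import numpy as np

def residual_norm(n, alpha):
    # ||r_n|| after n orthogonal greedy steps applied to f_N with N = 2n, cf. (73).
    N = 2 * n
    k = np.arange(n + 1, N + 1, dtype=float)
    return np.sqrt(np.sum(k ** (-2 * alpha)) / N**2)

alpha = 0.5
for n in [10, 100, 1000, 10000]:
    lower_bound = 2.0 ** (-(1 + alpha)) * n ** (-0.5 - alpha)
    print(n, residual_norm(n, alpha), lower_bound)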

5 Conclusion

We have shown that the orthogonal greedy algorithm achieves an improved convergence rate on dictionaries whose convex hull has small metric entropy, i.e. entropy decaying faster than $n^{-\frac{1}{2}}$. An important point, however, is that the expansions thus derived generally do not have their coefficients $a_i$ bounded in $\ell^1$. It is an important follow-up question whether the improved rates can be obtained by a greedy algorithm which satisfies this further restriction. Another interesting research direction is to extend our analysis to greedy algorithms for other problems such as reduced basis methods [9, 4] or sparse PCA [5].

6 Acknowledgements

We would like to thank Professors Russel Caflisch, Ronald DeVore, Weinan E, Albert Cohen, Stephan Wojtowytsch and Jason Klusowski for helpful discussions. We would also like to thank the anonymous referees for their helpful comments. This work was supported by the Verne M. Willaman Chair Fund at the Pennsylvania State University, and the National Science Foundation (Grant No. DMS-1819157 and DMS-2111387).

References

  • [1] Ball, K., Pajor, A.: The entropy of convex bodies with "few" extreme points. In: Proceedings of the 1989 Conference in Banach Spaces at Strobl, Austria. Cambridge Univ. Press (1990)
  • [2] Barron, A.R.: Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory 39(3), 930–945 (1993)
  • [3] Barron, A.R., Cohen, A., Dahmen, W., DeVore, R.A.: Approximation and learning by greedy algorithms. The annals of statistics 36(1), 64–94 (2008)
  • [4] Binev, P., Cohen, A., Dahmen, W., DeVore, R., Petrova, G., Wojtaszczyk, P.: Convergence rates for greedy algorithms in reduced basis methods. SIAM journal on mathematical analysis 43(3), 1457–1472 (2011)
  • [5] Cai, T.T., Ma, Z., Wu, Y.: Sparse pca: Optimal rates and adaptive estimation. The Annals of Statistics 41(6), 3074–3110 (2013)
  • [6] CandΓ¨s, E.J., Romberg, J., Tao, T.: Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on information theory 52(2), 489–509 (2006)
  • [7] Candes, E.J., Tao, T.: Decoding by linear programming. IEEE transactions on information theory 51(12), 4203–4215 (2005)
  • [8] Cohen, A., Dahmen, W., DeVore, R.: Compressed sensing and best $k$-term approximation. Journal of the American Mathematical Society 22(1), 211–231 (2009)
  • [9] DeVore, R., Petrova, G., Wojtaszczyk, P.: Greedy algorithms for reduced bases in banach spaces. Constructive Approximation 37(3), 455–466 (2013)
  • [10] DeVore, R.A.: Nonlinear approximation. Acta numerica 7, 51–150 (1998)
  • [11] DeVore, R.A., Lorentz, G.G.: Constructive approximation, vol. 303. Springer Science & Business Media (1993)
  • [12] DeVore, R.A., Temlyakov, V.N.: Some remarks on greedy algorithms. Advances in computational Mathematics 5(1), 173–187 (1996)
  • [13] Donoho, D.L.: Compressed sensing. IEEE Transactions on information theory 52(4), 1289–1306 (2006)
  • [14] Donoho, D.L., Elad, M., Temlyakov, V.N.: Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on information theory 52(1), 6–18 (2005)
  • [15] Friedman, J.H.: Greedy function approximation: a gradient boosting machine. Annals of statistics pp. 1189–1232 (2001)
  • [16] Hao, W., Jin, X., Siegel, J.W., Xu, J.: An efficient greedy training algorithm for neural networks and applications in pdes. arXiv preprint arXiv:2107.04466 (2021)
  • [17] Jones, L.K.: A simple lemma on greedy approximation in hilbert space and convergence rates for projection pursuit regression and neural network training. The annals of Statistics 20(1), 608–613 (1992)
  • [18] Klusowski, J.M., Barron, A.R.: Approximation by combinations of ReLU and squared ReLU ridge functions with $\ell^1$ and $\ell^0$ controls. IEEE Transactions on Information Theory 64(12), 7649–7656 (2018)
  • [19] KurkovΓ‘, V., Sanguineti, M.: Bounds on rates of variable-basis and neural-network approximation. IEEE Transactions on Information Theory 47(6), 2659–2665 (2001)
  • [20] Lee, W.S., Bartlett, P.L., Williamson, R.C.: Efficient agnostic learning of neural networks with bounded fan-in. IEEE Transactions on Information Theory 42(6), 2118–2132 (1996)
  • [21] Livshits, E.D.: Rate of convergence of pure greedy algorithms. Mathematical Notes 76(3), 497–510 (2004)
  • [22] Livshits, E.D.: Lower bounds for the rate of convergence of greedy algorithms. Izvestiya: Mathematics 73(6), 1197 (2009)
  • [23] Lorentz, G.G., Golitschek, M.v., Makovoz, Y.: Constructive approximation: advanced problems, vol. 304. Springer (1996)
  • [24] Makovoz, Y.: Random approximants and neural networks. Journal of Approximation Theory 85(1), 98–109 (1996)
  • [25] Mallat, S.G., Zhang, Z.: Matching pursuits with time-frequency dictionaries. IEEE Transactions on signal processing 41(12), 3397–3415 (1993)
  • [26] Needell, D., Tropp, J.A.: Cosamp: Iterative signal recovery from incomplete and inaccurate samples. Applied and computational harmonic analysis 26(3), 301–321 (2009)
  • [27] Needell, D., Vershynin, R.: Uniform uncertainty principle and signal recovery via regularized orthogonal matching pursuit. Foundations of computational mathematics 9(3), 317–334 (2009)
  • [28] Pati, Y.C., Rezaiifar, R., Krishnaprasad, P.S.: Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In: Proceedings of 27th Asilomar conference on signals, systems and computers, pp. 40–44. IEEE (1993)
  • [29] Pisier, G.: Remarques sur un rΓ©sultat non publiΓ© de b. maurey. SΓ©minaire Analyse fonctionnelle (dit β€œMaurey-Schwartz") pp. 1–12 (1981)
  • [30] Rockafellar, R.T.: Convex analysis. 28. Princeton university press (1970)
  • [31] Siegel, J.W., Xu, J.: Improved approximation properties of dictionaries and applications to neural networks. arXiv preprint arXiv:2101.12365 (2021)
  • [32] Siegel, J.W., Xu, J.: Sharp bounds on the approximation rates, metric entropy, and $n$-widths of shallow neural networks. arXiv preprint arXiv:2101.12365 (2021)
  • [33] Sil’nichenko, A.: Rate of convergence of greedy algorithms. Mathematical Notes 76(3), 582–586 (2004)
  • [34] Tropp, J.A.: Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information theory 50(10), 2231–2242 (2004)
  • [35] Tropp, J.A., Gilbert, A.C.: Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on information theory 53(12), 4655–4666 (2007)
  • [36] Xu, J.: Finite neuron method and convergence analysis. Communications in Computational Physics 28(5), 1707–1745 (2020). DOIΒ https://doi.org/10.4208/cicp.OA-2020-0191. URL http://global-sci.org/intro/article_detail/cicp/18394.html