
More efficient approximation of smoothing splines via space-filling basis selection

Cheng Meng, Xinlian Zhang, Jingyi Zhang, Wenxuan Zhong, and Ping Ma
(cheng.meng25@uga.edu, xinlian.zhang25@uga.edu, jingyi.zhang25@uga.edu, wenxuan@uga.edu, pingma@uga.edu)
Department of Statistics
University of Georgia
Abstract

We consider the problem of approximating smoothing spline estimators in a nonparametric regression model. When applied to a sample of size $n$, the smoothing spline estimator can be expressed as a linear combination of $n$ basis functions, requiring $O(n^{3})$ computational time when the number of predictors $d\geq 2$. Such a sizable computational cost hinders the broad applicability of smoothing splines. In practice, the full sample smoothing spline estimator can be approximated by an estimator based on $q$ randomly selected basis functions, resulting in a computational cost of $O(nq^{2})$. It is known that these two estimators converge at the same rate when $q$ is of the order $O\{n^{2/(pr+1)}\}$, where $p\in[1,2]$ depends on the true function $\eta$ and $r>1$ depends on the type of spline. Such a $q$ is called the essential number of basis functions. In this article, we develop a more efficient basis selection method. By selecting basis functions corresponding to roughly equal-spaced observations, the proposed method chooses a set of basis functions with large diversity. The asymptotic analysis shows that the proposed smoothing spline estimator decreases $q$ to roughly $O\{n^{1/(pr+1)}\}$ when $d\leq pr+1$. Applications to synthetic and real-world datasets show that the proposed method leads to a smaller prediction error than other basis selection methods.

Keywords: Space-filling design; Star discrepancy; Nonparametric regression; Penalized least squares; Subsampling.

1 Introduction

Consider the nonparametric regression model $y_{i}=\eta(x_{i})+\epsilon_{i}$ $(i=1,\ldots,n)$, where $y_{i}\in\mathbb{R}$ denotes the $i$th observation of the response, $\eta$ represents an unknown function to be estimated, $x_{i}\in\mathbb{R}^{d}$ is the $i$th observation of the predictor variable, and $\{\epsilon_{i}\}_{i=1}^{n}$ are independent and identically distributed random errors with zero mean and unknown variance $\sigma^{2}$. The function $\eta$ can be estimated by minimizing the penalized least squares criterion,

$$\frac{1}{n}\sum_{i=1}^{n}\{y_{i}-\eta(x_{i})\}^{2}+\lambda J(\eta), \qquad (1)$$

where $J(\eta)$ denotes a quadratic roughness penalty (Wahba, 1990; Wang et al., 2011; Gu, 2013). The smoothing parameter $\lambda$ governs the trade-off between the goodness of fit of the model and the roughness of the function $\eta$. In this paper, expression (1) is minimized in a reproducing kernel Hilbert space $\mathcal{H}$, which leads to a smoothing spline estimate of $\eta$.

Although univariate smoothing splines can be computed in $O(n)$ time (Reinsch, 1967), it takes $O(n^{3})$ time to find the minimizer of (1) when $d\geq 2$. Such a computational cost hinders the use of smoothing splines for large samples. To reduce the computational cost of smoothing splines, extensive efforts have been made to approximate the minimizer of (1) by restricting the estimator $\hat{\eta}$ to a subspace $\mathcal{H}_{E}\subset\mathcal{H}$. Let the dimension of the space $\mathcal{H}_{E}$ be $q$ and the restricted estimator be $\hat{\eta}_{E}$. Compared with the $O(n^{3})$ cost of calculating $\hat{\eta}$, the computational cost of $\hat{\eta}_{E}$ can be reduced to $O(nq^{2})$. Along this line of thinking, numerous studies have been developed in recent decades. Luo and Wahba (1997) and Zhang et al. (2004) approximated the minimizer of (1) using variable selection techniques. Pseudosplines (Hastie, 1996) and penalized splines (Ruppert et al., 2009) were also developed to approximate smoothing splines.

Although these methods have already yielded impressive algorithmic benefits, they are usually ad hoc in choosing the value of $q$. The value of $q$ regulates the trade-off between the computational time and the prediction accuracy. One fundamental question is how small $q$ can be while ensuring the restricted estimator $\hat{\eta}_{E}$ converges to the true function $\eta$ at the same rate as the full sample estimator $\hat{\eta}$. To answer this question, Gu and Kim (2002) and Ma et al. (2015) developed random sampling methods for selecting the basis functions and established the corresponding convergence theory for the restricted estimator $\hat{\eta}_{E}$. To ensure that $\hat{\eta}_{E}$ has the same convergence rate as $\hat{\eta}$, the methods in both Gu and Kim (2002) and Ma et al. (2015) require $q$ to be of the order $O\{n^{2/(pr+1)+\delta}\}$, where $\delta$ is an arbitrarily small positive number, $p\in[1,2]$ depends on the true function $\eta$, and $r$ depends on the fitted spline. Gao (1999) showed that fewer basis functions are needed to warrant the aforementioned convergence rate if one selects the basis functions $\{R(z_{j},\cdot)\}_{j=1}^{q}$, where $\{z_{j}\}_{j=1}^{q}$ are roughly equal-spaced. However, that work only provides theory for the univariate predictor case, and the method cannot be directly applied to multivariate cases.

In this paper, we develop a more efficient computational method to approximate smoothing splines. The distinguishing feature of the proposed method is that it considers the diversity of the selected basis functions. We propose the space-filling basis selection method, which chooses basis functions with large diversity by selecting the ones that correspond to roughly uniformly-distributed observations. When $d\leq pr+1$, we show that the smoothing spline estimator proposed here has the same convergence rate as the full sample estimator, and the order of the essential number of basis functions $q$ is reduced to $O\{n^{(1+\delta)/(pr+1)}\}$.

2 Smoothing splines and the basis selection method

Let $\mathcal{H}=\{\eta:J(\eta)<\infty\}$ be a reproducing kernel Hilbert space, where $J(\cdot)$ is a squared semi-norm. Let $\mathcal{N}_{J}=\{\eta:J(\eta)=0\}$ be the null space of $J(\eta)$ and assume that $\mathcal{N}_{J}$ is a finite-dimensional linear subspace of $\mathcal{H}$ with basis $\{\xi_{i}\}_{i=1}^{m}$, where $m$ is the dimension of $\mathcal{N}_{J}$. Let $\mathcal{H}_{J}$ be the orthogonal complement of $\mathcal{N}_{J}$ in $\mathcal{H}$, such that $\mathcal{H}=\mathcal{N}_{J}\oplus\mathcal{H}_{J}$. The space $\mathcal{H}_{J}$ is a reproducing kernel Hilbert space with $J(\cdot)$ as the squared norm. The reproducing kernel of $\mathcal{H}_{J}$ is denoted by $R_{J}(\cdot,\cdot)$. The well-known representer theorem (Wahba, 1990) states that there exist vectors $d=(d_{1},\ldots,d_{m})^{T}\in\mathbb{R}^{m}$ and $c=(c_{1},\ldots,c_{n})^{T}\in\mathbb{R}^{n}$ such that the minimizer of (1) in $\mathcal{H}$ is given by $\eta(x)=\sum_{k=1}^{m}d_{k}\xi_{k}(x)+\sum_{i=1}^{n}c_{i}R_{J}(x_{i},x)$. Let $Y=(y_{1},\ldots,y_{n})^{T}$ be the vector of response observations, $S$ be the $n\times m$ matrix with the $(i,j)$th entry $\xi_{j}(x_{i})$, and $R$ be the $n\times n$ matrix with the $(i,j)$th entry $R_{J}(x_{i},x_{j})$. Solving the minimization problem in (1) is thus equivalent to solving

$$(\hat{d},\hat{c})=\underset{d,c}{\mathrm{argmin}}\ \frac{1}{n}(Y-Sd-Rc)^{T}(Y-Sd-Rc)+\lambda c^{T}Rc, \qquad (2)$$

where the smoothing parameter $\lambda$ can be selected based on the generalized cross-validation criterion (Wahba and Craven, 1978). In the general case where $n\gg m$ and $d\geq 2$, the computational cost of calculating $(\hat{d},\hat{c})$ in equation (2) is of the order $O(n^{3})$, which is prohibitive when the sample size $n$ is large. To reduce this computational burden, one can restrict the full sample estimator $\hat{\eta}$ to a subspace $\mathcal{H}_{E}\subset\mathcal{H}$, where $\mathcal{H}_{E}=\mathcal{N}_{J}\oplus\mathrm{span}\{R_{J}(x_{i}^{*},\cdot),i=1,\ldots,q\}$. Here $\mathcal{H}_{E}$, termed the effective model space, can be constructed by selecting a subsample $\{x_{i}^{*}\}_{i=1}^{q}$ from $\{x_{i}\}_{i=1}^{n}$. Such an approach is thus called the basis selection method.

Denote by $R_{*}\in\mathbb{R}^{n\times q}$ the matrix with the $(i,j)$th entry $R_{J}(x_{i},x_{j}^{*})$ and by $R_{**}\in\mathbb{R}^{q\times q}$ the matrix with the $(i,j)$th entry $R_{J}(x_{i}^{*},x_{j}^{*})$. The minimizer of (1) in the effective model space $\mathcal{H}_{E}$ can thus be written as $\eta_{E}(x)=\sum_{k=1}^{m}d_{k}\xi_{k}(x)+\sum_{i=1}^{q}c_{i}R_{J}(x^{*}_{i},x)$, and the coefficients $d_{E}=(d_{1},\ldots,d_{m})^{T}$ and $c_{E}=(c_{1},\ldots,c_{q})^{T}$ can be obtained by solving

$$(\hat{d}_{E},\hat{c}_{E})=\underset{d_{E},c_{E}}{\mathrm{argmin}}\ \frac{1}{n}(Y-Sd_{E}-R_{*}c_{E})^{T}(Y-Sd_{E}-R_{*}c_{E})+\lambda c_{E}^{T}R_{**}c_{E}. \qquad (3)$$

The evaluation of the restricted estimator $\hat{\eta}_{E}$ at the sample points thus satisfies $\hat{\eta}_{E}=S\hat{d}_{E}+R_{*}\hat{c}_{E}$, where $\hat{\eta}_{E}=\{\hat{\eta}_{E}(x_{1}),\ldots,\hat{\eta}_{E}(x_{n})\}^{T}$. In the general case where $m\ll q\ll n$, the overall computational cost of the basis selection method is $O(nq^{2})$, a significant reduction compared with $O(n^{3})$. Recall that the value of $q$ controls the trade-off between the computational time and the prediction accuracy. To ensure that $\hat{\eta}_{E}$ converges to the true function $\eta$ at the same rate as $\hat{\eta}$, researchers showed that the essential number of basis functions $q$ needs to be of the order $O\{n^{2/(pr+1)+\delta}\}$, where $\delta$ is an arbitrarily small positive number (Kim and Gu, 2004; Ma et al., 2015). In the next section, we present the proposed space-filling basis selection method, which reduces such an order to $O\{n^{(1+\delta)/(pr+1)}\}$.
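To make the computation in (3) concrete, the following sketch solves the restricted penalized least squares problem with dense linear algebra. It is a minimal illustration under our own simplifying assumptions: the Gaussian kernel standing in for $R_J$, the intercept-only null space, and all function and variable names are ours, not the paper's spline construction.

```python
import numpy as np

def fit_restricted_spline(X, y, X_sub, lam, kernel):
    """Solve (3): min (1/n)||Y - S d - R_* c||^2 + lam * c^T R_** c.

    Minimal sketch: the null-space basis is only an intercept (S is a
    column of ones) and `kernel` is a stand-in for the reproducing
    kernel R_J; both are simplifying assumptions.
    """
    n = X.shape[0]
    S = np.ones((n, 1))                       # null-space design matrix (m = 1)
    R_star = kernel(X, X_sub)                 # n x q, entries R_J(x_i, x_j^*)
    R_ss = kernel(X_sub, X_sub)               # q x q, entries R_J(x_i^*, x_j^*)

    # Normal equations of (3):
    #   [ S^T S       S^T R_*                  ] [d]   [ S^T Y   ]
    #   [ R_*^T S     R_*^T R_* + n*lam*R_**   ] [c] = [ R_*^T Y ]
    A = np.block([[S.T @ S, S.T @ R_star],
                  [R_star.T @ S, R_star.T @ R_star + n * lam * R_ss]])
    b = np.concatenate([S.T @ y, R_star.T @ y])
    coef = np.linalg.solve(A, b)
    d_hat, c_hat = coef[:1], coef[1:]
    return d_hat, c_hat, S @ d_hat + R_star @ c_hat   # fitted values at X

def gaussian_kernel(A, B, bandwidth=0.3):
    """Illustrative stand-in kernel; not the cubic or thin-plate spline kernel."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * bandwidth ** 2))
```

Forming $R_*^T R_*$ dominates the cost at $O(nq^2)$, matching the complexity discussed above.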

3 Space-filling basis selection

3.1 Motivation and Notations

To motivate the development of the proposed method, we first re-examine ensemble learning methods, which are well known in statistics and machine learning (Dietterich, 2002; Rokach, 2010). To achieve better predictive performance than a single learner (a learner is either a model or a learning algorithm), ensemble learning methods first build a committee that consists of a number of different learners, then aggregate the predictions of the learners in the committee. The aggregation is usually achieved by employing a majority vote or by calculating a weighted average. The diversity among the learners in the committee holds the key to the success of ensemble learning methods: a large diversity in the committee yields better performance (Kuncheva and Whitaker, 2003).

The restricted smoothing spline estimator $\hat{\eta}_{E}$ can be considered as an ensemble learning method. In particular, the prediction of $\hat{\eta}_{E}$ is a weighted average of the predictions of the selected basis functions $R_{J}(x_{i}^{*},\cdot)$, $i\in\{1,\ldots,q\}$, in addition to the basis functions in the null space $\mathcal{N}_{J}$. Inspired by Kuncheva and Whitaker (2003), we propose to select a subsample $\{x_{i}^{*}\}_{i=1}^{q}$ such that the diversity among the basis functions $\{R_{J}(x_{i}^{*},\cdot)\}_{i=1}^{q}$ is as large as possible. One crucial question is how to measure the diversity among a set of basis functions. Notice that adjacent data points, i.e., $x_{i}^{*}\approx x_{j}^{*}$ ($i,j\in\{1,\ldots,q\}$), yield similar basis functions, i.e., $R_{J}(x_{i}^{*},\cdot)\approx R_{J}(x_{j}^{*},\cdot)$. On the other hand, if $x_{i}^{*}$ is far away from $x_{j}^{*}$, the basis function $R_{J}(x_{i}^{*},\cdot)$ tends to differ from $R_{J}(x_{j}^{*},\cdot)$. These observations inspire us to select a set of basis functions $\{R_{J}(x_{i}^{*},\cdot)\}_{i=1}^{q}$ where $\{x_{i}^{*}\}_{i=1}^{q}$ are as uniformly distributed as possible. The uniformly-distributed property, usually termed the space-filling property in the experimental design literature (Pukelsheim, 2006), can be systematically measured by the star discrepancy.

Since the star discrepancy is defined for data in $[0,1]^{d}$, in practice we need to map data with an arbitrary distribution to this domain. Suppose $\mathcal{X}_{n}=\{x_{i}\}_{i=1}^{n}$ are independent and identically distributed observations generated from the cumulative distribution function $F$ with bounded support $\mathcal{D}\subset\mathbb{R}^{d}$. Suppose $\tau$ is a transformation such that $\{\tau(x_{i})\}_{i=1}^{n}$ has the uniform distribution on $[0,1]^{d}$. In the simple case where $d=1$ and $F$ is known, we can find the transformation $\tau$ by setting $\tau=F$. In the more general case where $d>1$ and $F$ is unknown, the transformation $\tau$ can be calculated via optimal transport theory (Villani, 2008). However, finding the exact solution via optimal transport can be time-consuming. Instead, one may approximate the transformation $\tau$ using the iterative transformation approach (Pukelsheim, 2006) or the sliced optimal transport map approach (Rabin et al., 2011). The computational cost of these two approaches is of the order $O\{Kn\log(n)\}$, where $K$ denotes the number of iterations (Kolouri et al., 2018; Cuturi and Doucet, 2014; Bonneel et al., 2015). Such a cost is negligible compared with the computational cost of the proposed method. In practice, the data can always be preprocessed using the transformation $\tau$. Without loss of generality, $\mathcal{X}_{n}$ may therefore be assumed to be independent and identically distributed observations generated from the uniform distribution on $[0,1]^{d}$.
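As a minimal illustration of the $d=1$ case with known $F$, the sketch below maps exponential observations to $[0,1]$ by the probability integral transform $\tau=F$; the exponential choice and its rate parameter are our own assumptions for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rate = 2.0                                   # assumed rate of the exponential example
x = rng.exponential(scale=1.0 / rate, size=10_000)

# tau = F: the cumulative distribution function of the data-generating law
u = stats.expon(scale=1.0 / rate).cdf(x)     # now approximately Uniform[0, 1]

# sanity check: the transformed sample should look uniform on [0, 1]
print(u.min(), u.max(), abs(u.mean() - 0.5))
```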

3.2 Star discrepancy and space-filling design

Let $a=(a_{1},\ldots,a_{d})^{T}\in[0,1]^{d}$, let $[0,a)=\prod_{j=1}^{d}[0,a_{j})$ be a hyper-rectangle, and let $1\{\cdot\}$ be the indicator function. The local discrepancy and the star discrepancy are defined as follows (Fang et al., 2005; Pukelsheim, 2006).

Given $\mathcal{X}_{q}=\{x_{1},\ldots,x_{q}\}$ in $[0,1]^{d}$ and a hyper-rectangle $[0,a)$, the corresponding local discrepancy is defined as $D(\mathcal{X}_{q},a)=\left|\frac{1}{q}\sum^{q}_{i=1}1\{x_{i}\in[0,a)\}-\prod^{d}_{j=1}a_{j}\right|$. The star discrepancy corresponding to $\mathcal{X}_{q}$ is defined as $D^{*}(\mathcal{X}_{q})=\sup_{a\in[0,1]^{d}}D(\mathcal{X}_{q},a)$.

The local discrepancy $D(\mathcal{X}_{q},a)$ measures the difference between the volume of the hyper-rectangle $[0,a)$ and the fraction of the data points located in $[0,a)$. The local discrepancy is illustrated in the left panel of Fig. 1. The star discrepancy $D^{*}(\mathcal{X}_{q})$ is the supremum of the local discrepancy over all $a\in[0,1]^{d}$. In other words, the smaller $D^{*}(\mathcal{X}_{q})$ is, the more space-filling the data points $\mathcal{X}_{q}$ are (Fang et al., 2005).
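These definitions can be checked numerically. The short sketch below evaluates the local discrepancy for a given anchor $a$ and approximates the star discrepancy by maximizing over anchors built from the sample coordinates; the finite anchor grid is a computational convenience of ours (it gives a lower bound on the supremum), not part of the paper's method.

```python
import numpy as np

def local_discrepancy(X, a):
    """D(X_q, a): |fraction of points in [0, a) - volume of [0, a)|."""
    inside = np.all(X < a, axis=1)
    return abs(inside.mean() - np.prod(a))

def star_discrepancy_approx(X):
    """Crude lower bound on D*(X_q): maximize D(X_q, a) over anchors
    formed from the observed coordinates (plus the corner at 1)."""
    candidates = np.vstack([X, np.ones(X.shape[1])])
    grids = np.meshgrid(*[np.unique(candidates[:, j]) for j in range(X.shape[1])])
    anchors = np.stack([g.ravel() for g in grids], axis=1)
    return max(local_discrepancy(X, a) for a in anchors)

# anchor from the Fig. 1 toy example: a = (0.4, 0.5); for the ten points shown
# there, four fall in [0, 0.4) x [0, 0.5), giving D = |4/10 - 0.4*0.5| = 0.2
rng = np.random.default_rng(1)
X = rng.uniform(size=(10, 2))
print(local_discrepancy(X, np.array([0.4, 0.5])), star_discrepancy_approx(X))
```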

Figure 1: Left panel: a toy example of local discrepancy. Ten data points are generated in $[0,1]^{2}$, and four of them lie in the rectangle $[0,a)$, where $a=(0.4,0.5)^{T}$. The local discrepancy is $|4/10-0.4\times 0.5|=0.2$. Right panel: an illustration of the proposed basis selection method. The data points are labelled as gray dots. The nearest-neighbor data point of each design point (black triangle) is labelled as a circle.

Chung (1949) showed that when $\mathcal{X}_{q}$ is generated from the uniform distribution on $[0,1]^{d}$, $D^{*}(\mathcal{X}_{q})$ converges to zero at the rate $O[\{\log\log(q)/q\}^{1/2}]$. Faster convergence rates can be achieved using space-filling design methods (Pukelsheim, 2006) or low-discrepancy sequences (Halton, 1960; Sobol, 1967; Owen, 2003). The space-filling design methods, developed in the experimental design literature, aim to generate a set of design points that cover the space as uniformly as possible. For further details, please refer to Wu and Hamada (2011), Fang et al. (2005), and Pukelsheim (2006). Low-discrepancy sequences are frequently applied in the field of quasi-Monte Carlo and are extensively employed for numerical integration. For a Sobol sequence $\mathcal{S}_{q}$, one type of low-discrepancy sequence, it is known that $D^{*}(\mathcal{S}_{q})$ is of the order $O\{\log(q)^{d}/q\}$, which is roughly the square of the order of $D^{*}(\mathcal{X}_{q})$ for fixed $d$. For more in-depth discussions of quasi-Monte Carlo methods, see, e.g., Lemieux (2009), Leobacher and Pillichshammer (2014), Dick et al. (2013), or Chapter 5 of Glasserman (2013) and the references therein.

Existing studies suggest that the space-filling property can be exploited to achieve fast convergence rates for numerical integration and response surface estimation (Fang et al., 2005; Pukelsheim, 2006). These results inspire us to select space-filling basis functions in smoothing splines. Unfortunately, existing space-filling design techniques cannot be applied to our basis selection problem directly, because the design space in space-filling design methods is usually continuous, whereas our sample space $\{x_{i}\}_{i=1}^{n}$ is finite and discrete. We propose an algorithm to overcome this barrier.

3.3 Main algorithm

We now develop a space-filling basis selection method, which selects space-filling data points in a computationally attractive manner. First, a set of design points $\mathcal{S}_{q}=\{s_{i}\}_{i=1}^{q}\subset[0,1]^{d}$ is generated, using either a low-discrepancy sequence or a space-filling design method. Subsequently, the nearest neighbor $x^{*}_{i}$ of each $s_{i}$ is selected from the sample points $\{x_{i}\}_{i=1}^{n}$. Thus, $\{x_{i}^{*}\}_{i=1}^{q}$ approximates the design points $\mathcal{S}_{q}$ well if $x^{*}_{i}$ is sufficiently close to $s_{i}$, for $i=1,\ldots,q$. The proposed method is summarized as follows.

  • Step 1. Generate a set of design points $\{s_{i}\}_{i=1}^{q}$ from $[0,1]^{d}$.

  • Step 2. Select the nearest neighbor of each design point $s_{i}$ from $\{x_{i}\}_{i=1}^{n}$. Let the selected data points be $\{x^{*}_{i}\}_{i=1}^{q}$.

  • Step 3. Minimize the penalized least squares criterion (1) over the effective model space $\mathcal{H}_{E}=\mathcal{N}_{J}\oplus\mathrm{span}\{R_{J}(x_{i}^{*},\cdot),i=1,\ldots,q\}$.

The proposed algorithm is illustrated through a toy example in the right panel of Fig. 1. One hundred data points (gray dots) are generated from the uniform distribution on $[0,1]^{2}$, and a set of design points (black triangles) is generated through the maximum projection design (Joseph et al., 2015), a recently developed space-filling design method. The nearest neighbor of each design point is then selected (circles). It is observed that the selected subsample is space-filling, since it effectively approximates the design points.
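The three steps above translate directly into a few lines of code. The sketch below is one possible implementation under our own assumptions: a Sobol sequence replaces the maximum projection design in Step 1, a k-d tree performs the nearest-neighbor search of Step 2, and Step 3 would hand the selected points to a fitting routine such as the `fit_restricted_spline` sketch of Section 2.

```python
import numpy as np
from scipy.stats import qmc
from scipy.spatial import cKDTree

def space_filling_basis_selection(X, q, seed=0):
    """Steps 1-2: pick q space-filling rows of X (assumed rescaled to [0,1]^d)."""
    n, d = X.shape
    # Step 1: design points from a low-discrepancy (Sobol) sequence
    design = qmc.Sobol(d, scramble=True, seed=seed).random(q)
    # Step 2: nearest neighbor of each design point, via a k-d tree (O(n log n))
    _, idx = cKDTree(X).query(design, k=1)
    idx = np.unique(idx)      # drop duplicates if two design points share a neighbor
    return X[idx], idx

# toy usage mirroring the right panel of Fig. 1
rng = np.random.default_rng(2)
X = rng.uniform(size=(100, 2))
X_sub, idx = space_filling_basis_selection(X, q=10)
# Step 3 would now fit the restricted smoothing spline with X_sub as basis centers.
```

Deduplication can shrink $q$ slightly when two design points share a nearest neighbor; in practice one can draw a few extra design points to compensate.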

Figure 2: Comparison of different basis selection methods. The leftmost panel shows the heat map for the true function. The heat maps for the spline estimates based on the uniform basis selection method (UNIF), the adaptive basis selection method (ABS), and the proposed space-filling basis selection method (SBS) are presented in the other three panels, respectively. Black dots are the sampled basis functions. We observe that the proposed method outperforms the other methods in approximating the true function.

In Fig. 2, the proposed space-filling basis selection method is compared with the uniform basis selection method (Gu and Kim, 2002) and the adaptive basis selection method (Ma et al., 2015) using a two-dimensional toy example. We generate 5000 data points from the uniform distribution on $[0,1]^{2}$. The leftmost panel of Fig. 2 presents the heat map of the true response surface $y=\sin\{5(x_{1}+x_{2})\}$. The dimension of the effective model space $q$ is set to $5\times 5000^{2/9}\approx 33$ for all basis selection methods. The selected basis functions are labeled as solid dots in each panel. The right three panels of Fig. 2 plot the heat maps of the spline estimates of the three basis selection methods. In the uniform basis selection method, the default random number generator in R is employed to select the basis functions. It is observed that the selected points are not uniformly distributed. This is a common phenomenon for uniform basis selection, since randomly selected points do not necessarily look uniformly distributed, especially when the number of selected points is small. In contrast, the basis functions selected by the proposed method are space-filling. Using space-filling design techniques, the proposed method overcomes this pitfall of uniform basis selection and spreads the selected points evenly over the domain. As a result, the true response surface is better estimated by the proposed method than by the other methods.

We now consider the computational cost of the proposed method. In Step 1, the design points can be generated beforehand; thus, the cost of Step 1 can be ignored. For the nearest-neighbor search in Step 2, we employ the k-d tree method, which takes $O\{n\log(n)\}$ flops (Bentley, 1975; Wald and Havran, 2006). The cost of this step can be further reduced if we are willing to sacrifice some accuracy of the search results, e.g., by using approximate nearest-neighbor searching algorithms (Altman, 1992; Arya et al., 1994). For Step 3, computing the smoothing spline estimate in the restricted space $\mathcal{H}_{E}$ takes $O(nq^{2})$ operations, as discussed in Section 2. In summary, the overall computational cost of the space-filling basis selection method is of the order $O(nq^{2})$.

4 Convergence rates for function estimation

Recall that the data points are assumed to be generated from the uniform distribution on $[0,1]^{d}$; thus, for each coordinate $x$, the corresponding marginal density is $f_{X}(\cdot)=1$. Define $V(g)=\int_{[0,1]^{d}}g^{2}\,\mathrm{d}x$. The following four regularity conditions are required for the asymptotic analysis; the first three are the same conditions employed by Ma et al. (2015), where more technical discussion can be found.

Condition 1. $V$ is completely continuous with respect to $J$;

Condition 2. For some $\beta>0$ and $r>1$, $\rho_{\nu}>\beta\nu^{r}$ for sufficiently large $\nu$;

Condition 3. For all $\mu$ and $\nu$, $\mathrm{var}\{\phi_{\nu}(x)\phi_{\mu}(x)\}\leq C_{1}$, where $\phi_{\nu}$ and $\phi_{\mu}$ are the eigenfunctions associated with $V$ and $J$ in $\mathcal{H}$, and $C_{1}$ denotes a positive constant;

Condition 4. For all $\mu$ and $\nu$, $\mathcal{V}(g_{\nu,\mu})\leq C_{2}$, where $\mathcal{V}(\cdot)$ denotes the total variation, $g_{\nu,\mu}(x)=\phi_{\nu}(x)\phi_{\mu}(x)$, and $C_{2}$ is a positive constant. The total variation here is defined in the sense of Hardy and Krause (Owen, 2003). In the special case $d=1$, the total variation is $\mathcal{V}(g)=\int|g^{\prime}(x)|\,\mathrm{d}x$, so a smooth function has a small total variation. Intuitively, the total variation measures how wiggly the function $g$ is.

Condition 1 indicates that there exist a sequence of eigenfunctions $\phi_{\nu}\in\mathcal{H}$ and an associated sequence of eigenvalues $\rho_{\nu}\uparrow\infty$ satisfying $V(\phi_{\nu},\phi_{\mu})=\delta_{\nu\mu}$ and $J(\phi_{\nu},\phi_{\mu})=\rho_{\nu}\delta_{\nu\mu}$, where $\delta_{\nu\mu}$ is the Kronecker delta. The growth rate of the eigenvalues $\rho_{\nu}$ dictates how fast $\lambda$ should approach zero, and hence the convergence rate of smoothing spline estimates (Gu, 2013). Notice that the eigenfunctions $\phi_{\nu}$ are closely related to the Demmler-Reinsch basis, which is orthogonal with respect to the $l_{2}$ norm (Ruppert, 2002). The eigenfunctions $\phi_{\nu}$ can be calculated explicitly in several specific scenarios. For instance, the $\phi_{\nu}$ are the sine and cosine functions when $J(\eta)=\int_{0}^{1}(\eta^{\prime\prime})^{2}\,\mathrm{d}x$ and $\eta$ is a periodic function on $[0,1]$. More details on the construction of the $\phi_{\nu}$ can be found in Section 9.1 of Gu (2013).
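For the periodic example just mentioned, the eigen-system can be written down explicitly; the display below is our own worked illustration of Conditions 1 and 2 (it is standard, but the normalizing constants should be checked against Gu (2013) before use). With $J(\eta)=\int_{0}^{1}(\eta^{\prime\prime})^{2}\,\mathrm{d}x$ and $V(g)=\int_{0}^{1}g^{2}\,\mathrm{d}x$, the eigenfunctions are the trigonometric basis
$$\phi_{0}(x)=1,\quad \phi_{2\nu-1}(x)=\sqrt{2}\cos(2\pi\nu x),\quad \phi_{2\nu}(x)=\sqrt{2}\sin(2\pi\nu x),\quad \nu=1,2,\ldots,$$
with eigenvalues $\rho_{0}=0$ and $\rho_{2\nu-1}=\rho_{2\nu}=(2\pi\nu)^{4}$, so that $\rho_{\nu}$ grows like $\nu^{4}$ and Condition 2 holds with $r=4$.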

We now present our main theoretical results; all proofs are relegated to the Supplementary Material. For a set of design points $\mathcal{S}_{q}$ of size $q$, we assume the star discrepancy $D^{*}(\mathcal{S}_{q})$ converges to zero at the rate $O\{\log(q)^{d}/q\}$, or $O\{q^{-(1-\delta)}\}$ for an arbitrarily small positive number $\delta$. Such a convergence rate is warranted if $\mathcal{S}_{q}$ is generated from a low-discrepancy sequence or a space-filling design method, as discussed in Section 3.2. Recall that the proposed method aims to select a subsample that is space-filling, and the key to its success is whether the chosen subsample $\mathcal{X}_{q}^{*}$ can effectively approximate the design points $\mathcal{S}_{q}$. The following lemma bounds the difference between $\mathcal{X}_{q}^{*}$ and $\mathcal{S}_{q}$ in terms of the star discrepancy.

Lemma 1

Suppose $d$ is fixed and $D^{*}(\mathcal{S}_{q})=O\{q^{-(1-\delta)}\}$ for an arbitrarily small $\delta>0$. If $q=O(n^{1/d})$, then as $n\rightarrow\infty$ we have $D^{*}(\mathcal{X}^{*}_{q})=O_{p}\{q^{-(1-\delta)}\}$.

Lemma 1 states that the selected subsample $\mathcal{X}_{q}^{*}$ effectively approximates the design points $\mathcal{S}_{q}$, in the sense that the convergence rate of $D^{*}(\mathcal{X}^{*}_{q})$ is similar to that of $D^{*}(\mathcal{S}_{q})$. The following theorem is the Koksma-Hlawka inequality, which will be used in proving our main theorem; see Kuipers and Niederreiter (2012) for a proof.

Theorem 1 (Koksma–Hlawka inequality)

Let $\mathcal{T}_{q}=\{t_{1},\ldots,t_{q}\}$ be a set of data points in $[0,1]^{d}$, and let $h$ be a function defined on $[0,1]^{d}$ with bounded total variation $\mathcal{V}(h)$. We have $\left|\int_{[0,1]^{d}}h(x)\,\mathrm{d}x-\frac{1}{q}\sum_{i=1}^{q}h(t_{i})\right|\leq D^{*}(\mathcal{T}_{q})\,\mathcal{V}(h)$.

Combining Lemma 1 and Theorem 1 and setting $h=g_{\nu,\mu}$ and $\mathcal{T}_{q}=\mathcal{X}_{q}^{*}$ yields the following lemma.

Lemma 2

If $q=O(n^{1/d})$, then under Condition 4, for all $\mu$ and $\nu$, we have

$$\left|\int_{[0,1]^{d}}\phi_{\nu}\phi_{\mu}\,\mathrm{d}x-\frac{1}{q}\sum_{j=1}^{q}\phi_{\nu}(x_{j}^{*})\phi_{\mu}(x_{j}^{*})\right|=O_{p}\{q^{-(1-\delta)}\}.$$

Lemma 2 shows the advantage of $\{x_{i}^{*}\}_{i=1}^{q}$, the subsample selected by the proposed method, over a randomly selected subsample $\{x_{i}^{+}\}_{i=1}^{q}$. Specifically, as a direct consequence of Condition 3, we have $E[\int_{[0,1]^{d}}\phi_{\nu}\phi_{\mu}\,\mathrm{d}x-\frac{1}{q}\sum_{j=1}^{q}\phi_{\nu}(x_{j}^{+})\phi_{\mu}(x_{j}^{+})]^{2}=O(q^{-1})$ for all $\mu$ and $\nu$. Lemma 2 therefore implies that the subsample $\mathcal{X}_{q}^{*}$ approximates the integral $\int_{[0,1]^{d}}\phi_{\nu}\phi_{\mu}\,\mathrm{d}x$ more efficiently, for all $\mu$ and $\nu$. We now present our main theoretical result.
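The gap between the $O_{p}\{q^{-(1-\delta)}\}$ rate of Lemma 2 and the $O_{p}(q^{-1/2})$ rate of random subsampling is easy to see numerically. The sketch below compares the two quadrature errors for one illustrative product $g_{\nu,\mu}(x)=\phi_{\nu}(x)\phi_{\mu}(x)$ built from the periodic trigonometric eigenfunctions used as an example above; the specific $\nu$, $\mu$, sequence sizes, and the use of Sobol points in place of the selected subsample are our own choices for illustration.

```python
import numpy as np
from scipy.stats import qmc

# g_{nu,mu}(x) = phi_nu(x) * phi_mu(x) with trigonometric eigenfunctions (d = 1);
# its exact integral over [0, 1] is 0 by orthogonality.
nu, mu = 2, 3

def g(x):
    return (np.sqrt(2) * np.cos(2 * np.pi * nu * x)) * (np.sqrt(2) * np.sin(2 * np.pi * mu * x))

rng = np.random.default_rng(3)
for q in (2**6, 2**10, 2**14):
    x_random = rng.uniform(size=q)                                   # random subsample analogue
    x_sobol = qmc.Sobol(1, scramble=True, seed=3).random(q).ravel()  # low-discrepancy points
    err_mc = abs(g(x_random).mean())      # decays roughly like q^{-1/2}
    err_qmc = abs(g(x_sobol).mean())      # decays roughly like q^{-(1-delta)}
    print(q, err_mc, err_qmc)
```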

Theorem 2

Suppose $\sum_{i}\rho_{i}^{p}V(\eta_{0},\phi_{i})^{2}<\infty$ for some $p\in[1,2]$, and let $\delta$ be an arbitrarily small positive number. Under Conditions 1-4, if $q=O(n^{1/d})$, then as $\lambda\rightarrow 0$ and $q^{1-\delta}\lambda^{1/r}\rightarrow\infty$, we have $(V+\lambda J)(\hat{\eta}_{E}-\eta_{0})=O_{p}(n^{-1}\lambda^{-1/r}+\lambda^{p})$. In particular, if $\lambda\asymp n^{-r/(pr+1)}$, the estimator achieves the optimal convergence rate $(V+\lambda J)(\hat{\eta}_{E}-\eta_{0})=O_{p}\{n^{-pr/(pr+1)}\}$.

Theorem 9.17 of Gu (2013) shows that, under some regularity conditions, the full sample smoothing spline estimator $\hat{\eta}$ has the convergence rate $(V+\lambda J)(\hat{\eta}-\eta_{0})=O_{p}\{n^{-pr/(pr+1)}\}$. Theorem 2 thus states that the proposed estimator $\hat{\eta}_{E}$ achieves the same convergence rate as the full sample estimator $\hat{\eta}$ under two extra conditions on $q$: (a) $q=O(n^{1/d})$, and (b) $q^{1-\delta}\lambda^{1/r}\rightarrow\infty$ as $\lambda\rightarrow 0$. Moreover, Theorem 2 indicates that, when $\lambda\asymp n^{-r/(pr+1)}$, the proposed approach requires a much smaller number of basis functions to attain this rate: the condition $q^{1-\delta}\lambda^{1/r}\rightarrow\infty$ implies that an essential choice of $q$ for the proposed estimator is $q=O\{n^{(1+\delta)/(pr+1)}\}$. For comparison, both random basis selection (Gu and Kim, 2002) and the adaptive basis selection method (Ma et al., 2015) require the essential number of basis functions to be $q=O\{n^{2/(pr+1)+\delta}\}$. The proposed estimator is therefore more efficient, since it reduces the order of the essential number of basis functions.
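The order of the essential $q$ can be read off from condition (b); the following one-line derivation is our own restatement of the argument. With $\lambda\asymp n^{-r/(pr+1)}$, condition (b) becomes
$$q^{1-\delta}\lambda^{1/r}\asymp q^{1-\delta}n^{-1/(pr+1)}\rightarrow\infty,$$
so $q^{1-\delta}$ must grow faster than $n^{1/(pr+1)}$, which is satisfied by $q\asymp n^{(1+\delta^{\prime})/(pr+1)}$ for a slightly larger $\delta^{\prime}>\delta$; absorbing constants, this is the stated essential order $q=O\{n^{(1+\delta)/(pr+1)}\}$.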

Given $q=O(n^{1/d})$, the condition $q^{1-\delta}\lambda^{1/r}\rightarrow\infty$ is naturally satisfied when $d\leq pr+1$; when $d>pr+1$, it is no longer guaranteed. We thus stress that the essential number of basis functions for the proposed method, $q=O\{n^{(1+\delta)/(pr+1)}\}$, can only be achieved when $d\leq pr+1$. The parameter $p$ in Theorem 2 is closely associated with the true function $\eta_{0}$ and affects the convergence rate of the proposed estimator. Intuitively, the larger $p$ is, the smoother the function $\eta_{0}$ is. For $p\in[1,2]$, the optimal convergence rate of $(V+\lambda J)(\hat{\eta}_{E}-\eta_{0})$ falls in the interval $[O_{p}(n^{-r/(r+1)}),O_{p}(n^{-2r/(2r+1)})]$. To the best of our knowledge, the problem of selecting the optimal $p$ has rarely been studied; one exception is Serra and Krivobokova (2017), who studied this problem in the one-dimensional case. Serra and Krivobokova (2017) provided a Bayesian approach for selecting an optimal parameter, denoted $\beta$, which is known to be proportional to $p$. Nevertheless, since the constant $\beta/p$ is usually unknown, this approach still cannot be used to select the optimal $p$ in practice. Furthermore, whether the approach can be extended to higher-dimensional cases remains unclear.

For the dimension of the effective model space $q$, a suitable choice is $q=n^{(1+\delta)/(4p+1)+\delta}$ in the following two cases. Case 1: univariate cubic smoothing splines with the penalty $J(\eta)=\int_{0}^{1}(\eta^{\prime\prime})^{2}$, where $r=4$ and $\lambda\asymp n^{-4/(4p+1)}$. Case 2: tensor-product splines with $r=4-\delta^{*}$, where $\delta^{*}>0$. For $p\in[1,2]$, the dimension roughly lies in the interval $(O(n^{1/9}),O(n^{1/5}))$.

5 Simulation Results

To assess the performance of the proposed space-filling basis selection method, we carry out extensive analyses on simulated datasets; some of the results are reported in this section. We compare the proposed method with uniform basis selection and adaptive basis selection, and report both the prediction error and the running time.

The following four functions on $[0,1]$ (Lin and Zhang, 2006) are used as building blocks in our simulation study: $g_{1}(t)=t$, $g_{2}(t)=(2t-1)^{2}$, $g_{3}(t)=\sin(2\pi t)/\{2-\sin(2\pi t)\}$, and $g_{4}(t)=0.1\sin(2\pi t)+0.2\cos(2\pi t)+0.3\sin(2\pi t)^{2}+0.4\cos(2\pi t)^{3}+0.5\sin(2\pi t)^{3}$. In addition, we also use the following functions on $[0,1]^{2}$ (Wood, 2003) as building blocks,

$$h_{1}(t_{1},t_{2})=\{0.75/(\pi\sigma_{1}\sigma_{2})\}\times\exp\{-(t_{1}-0.2)^{2}/\sigma_{1}^{2}-(t_{2}-0.3)^{2}/\sigma_{2}^{2}\},$$
$$h_{2}(t_{1},t_{2})=\{0.45/(\pi\sigma_{1}\sigma_{2})\}\times\exp\{-(t_{1}-0.7)^{2}/\sigma_{1}^{2}-(t_{2}-0.8)^{2}/\sigma_{2}^{2}\},$$

where $\sigma_{1}=0.3$ and $\sigma_{2}=0.4$. The signal-to-noise ratio (SNR), defined as $\mathrm{var}\{\eta(X)\}/\sigma^{2}$, is set at two levels: 5 and 2. We generate replicated samples with sample sizes $n\in\{2^{10},2^{11},\ldots,2^{14}\}$ and dimensions $d\in\{2,4,6\}$ uniformly on $[0,1]^{d}$ from the following four regression settings,

(1) A 2-d function $g_{1}(x_{1}x_{2})+g_{2}(x_{2})+g_{3}(x_{1})+g_{4}(x_{2})+g_{3}\{(x_{1}+x_{2})/2\}$;

(2) A 2-d function $h_{1}(x_{1},x_{2})+h_{2}(x_{1},x_{2})$;

(3) A 4-d function $g_{1}(x_{1})+g_{2}(x_{2})+g_{3}(x_{3})+2g_{1}\{(x_{1}+x_{4})/2\}+2g_{2}\{(x_{2}+x_{3})/2\}+2g_{3}\{(x_{1}+x_{3})/2\}$;

(4) A 6-d function $h(x_{1},x_{2})+h(x_{1},x_{5})$.
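For concreteness, the building-block functions and one of the regression settings can be written down directly; the sketch below generates data from setting (2) under our own choices of seed and sample size, with the noise variance implied by the stated SNR definition.

```python
import numpy as np

sigma1, sigma2 = 0.3, 0.4

def h1(t1, t2):
    return 0.75 / (np.pi * sigma1 * sigma2) * \
           np.exp(-(t1 - 0.2)**2 / sigma1**2 - (t2 - 0.3)**2 / sigma2**2)

def h2(t1, t2):
    return 0.45 / (np.pi * sigma1 * sigma2) * \
           np.exp(-(t1 - 0.7)**2 / sigma1**2 - (t2 - 0.8)**2 / sigma2**2)

def eta_setting2(x):                       # regression setting (2), d = 2
    return h1(x[:, 0], x[:, 1]) + h2(x[:, 0], x[:, 1])

rng = np.random.default_rng(4)
n, snr = 2**12, 5
X = rng.uniform(size=(n, 2))
signal = eta_setting2(X)
sigma_eps = np.sqrt(signal.var() / snr)    # noise scale implied by SNR = var{eta(X)}/sigma^2
y = signal + rng.normal(scale=sigma_eps, size=n)
```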

In the simulation, $q$ is set to $5n^{2/9}$ and $10n^{1/9}$, based on the asymptotic results. To combat the curse of dimensionality, we fit smoothing spline analysis-of-variance models with all main effects and two-way interactions. The prediction error is measured by the mean squared error (MSE), defined as $\sum_{i=1}^{n_{0}}\{\hat{\eta}_{E}(t_{i})-\eta_{0}(t_{i})\}^{2}/n_{0}$, where $\{t_{i}\}_{i=1}^{n_{0}}$ denotes an independent testing dataset uniformly generated on $[0,1]^{d}$ with $n_{0}=5000$. The maximum projection design (Joseph et al., 2015) is used to generate the design points in Step 1 of the proposed method. Our empirical studies suggest that the Sobol sequence and other space-filling techniques, e.g., the Latin hypercube design (Pukelsheim, 2006) and the uniform design (Fang et al., 2000), yield similar performance.

Figure 3: Simulation under different settings (from left to right) with SNR equal to five (upper row) and two (lower row). The prediction errors are plotted versus the sample size $n$. The vertical bars are standard error bars obtained from 20 replicates. The solid lines, dotted lines, and dashed lines refer to the space-filling basis selection (SBS), the adaptive basis selection (ABS), and the uniform basis selection (UNIF), respectively. The lines with triangles and those with circles represent $q=5n^{2/9}$ and $q=10n^{1/9}$, respectively.

Figure 3 shows the MSE against the sample size on the log-log scale. Each column presents the results for one of the function settings described above. The signal-to-noise ratio is five in the upper row and two in the lower row. We use solid lines for the proposed method, dotted lines for the adaptive basis selection method, and dashed lines for the uniform basis selection method. The number of basis functions $q$ is $5n^{2/9}$ for the lines with triangles and $10n^{1/9}$ for the lines with circles. The vertical bars represent standard error bars obtained from 20 replicates. The full sample estimator is omitted here due to its high computational cost. It is observed that the space-filling basis selection method provides more accurate smoothing spline predictions than the other two methods in almost all settings. Notably, the lines with circles for the space-filling basis selection method display a linear trend similar to the lines with triangles for the other two methods. This observation suggests the proposed estimator yields a faster convergence rate than the other two methods.

More simulation results can be found in the Supplementary Material, where we consider regression functions that exhibit several sharp peaks. In those cases, the results suggest that both the space-filling basis selection method and the adaptive basis selection method outperform uniform basis selection, whereas neither of the two dominates the other. Moreover, the proposed space-filling basis selection method outperforms the adaptive basis selection method as the sample size $n$ gets larger.

Table 1: Means and standard errors (in parentheses) of the computational time, in CPU seconds, for multivariate cases, based on 20 replicates.
True function SNR UNIF ABS SBS
Function 1 5 0.97 (0.15) 0.90 (0.05) 0.90 (0.04)
Function 1 2 0.92 (0.10) 0.87 (0.04) 0.87 (0.06)
Function 2 5 0.88 (0.04) 0.87 (0.03) 0.90 (0.06)
Function 2 2 0.86 (0.05) 0.85 (0.02) 0.90 (0.06)
Function 3 5 3.92 (0.24) 3.95 (0.24) 4.04 (0.19)
Function 3 2 4.08 (0.30) 4.51 (0.66) 4.27 (0.39)
Function 4 5 12.95 (0.61) 15.10 (3.20) 15.45 (3.04)
Function 4 2 14.33 (1.44) 13.72 (1.02) 14.25 (1.09)

Table 1 summarizes the computing times of model fitting for all methods on a synthetic dataset with $n=2^{14}$ and $q=5n^{2/9}$. The simulation is replicated for 20 runs on a computer with an Intel 2.6 GHz processor. In Table 1, UNIF, ABS, and SBS represent the uniform basis selection method, the adaptive basis selection method, and the proposed space-filling basis selection method, respectively. The time for selecting the smoothing parameter is not included. The result for the full basis smoothing spline estimator is omitted due to its huge computational cost. The computational time for generating the design points, i.e., Step 1 of the proposed algorithm, is not included, since the design points can be generated beforehand. It is observed that the computing time of the proposed method is comparable with that of the other two basis selection methods under all settings. Combining this observation with the results in Fig. 3, we conclude that the proposed method achieves more accurate prediction without requiring much more computational time.

6 Real data example

The problem of measuring total column ozone has attracted significant attention for decades. Ozone depletion facilitates the transmission of ultraviolet radiation (290-400 nm wavelength) through the atmosphere and causes severe damage to DNA and cellular proteins that are involved in biochemical processes, affecting growth and reproduction. The total column ozone data are produced in three steps. In the first step, the raw satellite data (level 1) are retrieved by NASA. Subsequently, NASA calibrates and preprocesses the data to generate spatially and temporally irregular total column ozone measurements (level 2). Finally, the level 2 data are processed to yield the level 3 data, which are the daily and spatially regular data product released to the public.

We fit the nonparametric model $y_{ij}=\eta(x_{\langle 1\rangle i},x_{\langle 2\rangle j})+\epsilon_{ij}$ to a level 2 total column ozone dataset ($n$ = 173,405) compiled by Cressie and Johannesson (2008). Here, $y_{ij}$ is the level 2 total column ozone measurement at the $i$th longitude, $x_{\langle 1\rangle i}$, and the $j$th latitude, $x_{\langle 2\rangle j}$, and the $\epsilon_{ij}$ are independent and identically distributed random errors. The heat map of the raw data is presented in the Supplementary Material. A thin-plate smoothing spline is used for model fitting, and the proposed method is employed to facilitate the estimation. The number of basis functions is set to $q=20n^{2/9}\approx 292$. The design points employed in the proposed basis selection method are generated from a Sobol sequence (Dutang and Savicky, 2013). The heat map of the predicted values on a $1^{\circ}\times 1^{\circ}$ regular grid is presented in Fig. 4. It is seen that the total column ozone value decreases dramatically to form the ozone hole over the South Pole, around the $-55^{\circ}$ latitudinal zone.
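As an illustration of how the design points for this application might be produced, the sketch below computes the stated $q$ and scales a Sobol sequence to a longitude-latitude rectangle; the paper itself uses the R package randtoolbox and its own data preprocessing, which are not reproduced here, so the bounds and scaling below are assumptions of ours.

```python
import numpy as np
from scipy.stats import qmc

n = 173_405
q = int(round(20 * n ** (2 / 9)))          # 20 * n^{2/9} is approximately 292 basis functions

# Sobol design points scaled from [0,1]^2 to an assumed longitude-latitude domain
sobol = qmc.Sobol(2, scramble=True, seed=5).random(q)
design = qmc.scale(sobol, l_bounds=[-180.0, -90.0], u_bounds=[180.0, 90.0])
print(q, design.shape)
```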

Figure 4: Smoothing spline prediction of total column ozone value for 10/01/1988, in Dobson units

We now report the computing times of the model fitting, performed on the same computer used for the simulation studies. The computational times, in CPU seconds, are given in parentheses: basis selection (0.1s), model fitting (129s), and prediction (21s). Further comparisons between the proposed method and other basis selection methods on this dataset can be found in the Supplementary Material.

7 Acknowledgement

This study was partially supported by the National Science Foundation and the National Institutes of Health.

References

  • Altman (1992) Naomi S Altman. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175–185, 1992.
  • Arya et al. (1994) Sunil Arya, David M Mount, Nathan Netanyahu, Ruth Silverman, and Angela Y Wu. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. In Proc. 5th ACM-SIAM Sympos. Discrete Algorithms, pages 573–582, 1994.
  • Bentley (1975) Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.
  • Bonneel et al. (2015) Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.
  • Chung (1949) Kai-Lai Chung. An estimate concerning the Kolmogoroff limit distribution. Transactions of the American Mathematical Society, 67(1):36–50, 1949.
  • Cressie and Johannesson (2008) Noel Cressie and Gardar Johannesson. Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):209–226, 2008.
  • Cuturi and Doucet (2014) Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In International Conference on Machine Learning, pages 685–693, 2014.
  • Dick et al. (2013) Josef Dick, Frances Y Kuo, and Ian H Sloan. High-dimensional integration: the quasi-Monte Carlo way. Acta Numerica, 22:133–288, 2013.
  • Dietterich (2002) Thomas G Dietterich. Ensemble learning. The handbook of brain theory and neural networks, 2:110–125, 2002.
  • Dutang and Savicky (2013) C Dutang and P Savicky. randtoolbox: Generating and testing random numbers. R package, page 67, 2013.
  • Fang et al. (2000) Kai-Tai Fang, Dennis KJ Lin, Peter Winker, and Yong Zhang. Uniform design: Theory and application. Technometrics, 42(3):237–248, 2000.
  • Fang et al. (2005) Kai-Tai Fang, Runze Li, and Agus Sudjianto. Design and Modeling for Computer Experiments. CRC Press, 2005.
  • Gao (1999) Fangyu Gao. Penalized multivariate logistic regression with a large data set. Technical report, University of Wisconsin–Madison, 1999.
  • Glasserman (2013) Paul Glasserman. Monte Carlo methods in financial engineering, volume 53. Springer Science & Business Media, 2013.
  • Gu (2013) Chong Gu. Smoothing spline ANOVA models. Springer Science & Business Media, 2013.
  • Gu and Kim (2002) Chong Gu and Young-Ju Kim. Penalized likelihood regression: general formulation and efficient approximation. Canadian Journal of Statistics, 30(4):619–628, 2002.
  • Halton (1960) John H Halton. On the efficiency of certain quasi-random sequences of points in evaluating multi-dimensional integrals. Numerische Mathematik, 2(1):84–90, 1960.
  • Hastie (1996) Trevor Hastie. Pseudosplines. Journal of the Royal Statistical Society. Series B (Methodological), pages 379–396, 1996.
  • Joseph et al. (2015) V Roshan Joseph, Evren Gul, and Shan Ba. Maximum projection designs for computer experiments. Biometrika, 102(2):371–380, 2015.
  • Kim and Gu (2004) Young-Ju Kim and Chong Gu. Smoothing spline Gaussian regression: more scalable computation via efficient approximation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(2):337–356, 2004.
  • Kolouri et al. (2018) Soheil Kolouri, Charles E Martin, and Gustavo K Rohde. Sliced-Wasserstein autoencoder: an embarrassingly simple generative model. arXiv preprint arXiv:1804.01947, 2018.
  • Kuipers and Niederreiter (2012) Lauwerens Kuipers and Harald Niederreiter. Uniform distribution of sequences. Courier Corporation, 2012.
  • Kuncheva and Whitaker (2003) Ludmila I Kuncheva and Christopher J Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine learning, 51(2):181–207, 2003.
  • Lemieux (2009) Christiane Lemieux. Monte Carlo and quasi-Monte Carlo sampling. Springer, New York, 2009.
  • Leobacher and Pillichshammer (2014) Gunther Leobacher and Friedrich Pillichshammer. Introduction to quasi-Monte Carlo integration and applications. Springer, 2014.
  • Lin and Zhang (2006) Yi Lin and Hao Helen Zhang. Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics, 34(5):2272–2297, 2006.
  • Luo and Wahba (1997) Zhen Luo and Grace Wahba. Hybrid adaptive splines. Journal of the American Statistical Association, 92(437):107–116, 1997.
  • Ma et al. (2015) Ping Ma, Jianhua Z Huang, and Nan Zhang. Efficient computation of smoothing splines via adaptive basis sampling. Biometrika, 102(3):631–645, 2015.
  • Owen (2003) Art B Owen. Quasi-Monte Carlo sampling. Monte Carlo Ray Tracing: Siggraph, 1:69–88, 2003.
  • Pukelsheim (2006) Friedrich Pukelsheim. Optimal design of experiments. SIAM, 2006.
  • Rabin et al. (2011) Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 435–446. Springer, 2011.
  • Reinsch (1967) Christian H Reinsch. Smoothing by spline functions. Numerische mathematik, 10(3):177–183, 1967.
  • Rokach (2010) Lior Rokach. Ensemble-based classifiers. Artificial Intelligence Review, 33(1-2):1–39, 2010.
  • Ruppert (2002) David Ruppert. Selecting the number of knots for penalized splines. Journal of computational and graphical statistics, 11(4):735–757, 2002.
  • Ruppert et al. (2009) David Ruppert, Matt P Wand, and Raymond J Carroll. Semiparametric regression during 2003–2007. Electronic journal of statistics, 3:1193, 2009.
  • Serra and Krivobokova (2017) Paulo Serra and Tatyana Krivobokova. Adaptive empirical Bayesian smoothing splines. Bayesian Analysis, 12(1):219–238, 2017.
  • Sobol (1967) I.M. Sobol. The distribution of points in a cube and the approximate evaluation of integrals. USSR Computational Mathematics and Mathematical Physics, 7(4):86–112, 1967.
  • Villani (2008) Cédric Villani. Optimal transport: old and new. Springer Science & Business Media, 2008.
  • Wahba (1990) Grace Wahba. Spline models for observational data. SIAM, 1990.
  • Wahba and Craven (1978) Grace Wahba and Peter Craven. Smoothing noisy data with spline functions. estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31:377–404, 1978.
  • Wald and Havran (2006) Ingo Wald and Vlastimil Havran. On building fast kd-trees for ray tracing, and on doing that in O(N log N). In Interactive Ray Tracing 2006, IEEE Symposium on, pages 61–69. IEEE, 2006.
  • Wang et al. (2011) Xiao Wang, Jinglai Shen, and David Ruppert. On the asymptotics of penalized spline smoothing. Electronic Journal of Statistics, 5:1–17, 2011.
  • Wood (2003) Simon N Wood. Thin plate regression splines. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(1):95–114, 2003.
  • Wu and Hamada (2011) CF Jeff Wu and Michael S Hamada. Experiments: planning, analysis, and optimization. John Wiley & Sons, 2011.
  • Zhang et al. (2004) Hao Helen Zhang, Grace Wahba, Yi Lin, Meta Voelker, Michael Ferris, Ronald Klein, and Barbara Klein. Variable selection and model building via likelihood basis pursuit. Journal of the American Statistical Association, 99(467):659–672, 2004.