
Nonlinear Functional Principal Component Analysis Using Neural Networks

Rou Zhong, Center for Applied Statistics, School of Statistics, Renmin University of China; Chunming Zhang, Department of Statistics, University of Wisconsin-Madison; Jingxiao Zhang (zhjxiaoruc@163.com), Center for Applied Statistics, School of Statistics, Renmin University of China
Abstract

Functional principal component analysis (FPCA) is an important technique for dimension reduction in functional data analysis (FDA). The classical FPCA method is based on the Karhunen-Loève expansion, which assumes a linear structure of the observed functional data. However, this assumption may not always be satisfied, and the FPCA method can become inefficient when the data deviate from the linear assumption. In this paper, we propose a novel FPCA method that is suitable for data with a nonlinear structure, using a neural network approach. We construct networks that can be applied to functional data and explore the corresponding universal approximation property. The main use of our proposed nonlinear FPCA method is curve reconstruction. We conduct a simulation study to evaluate the performance of our method. The proposed method is also applied to two real-world data sets to further demonstrate its superiority.

Keywords: Functional principal component analysis; neural network; nonlinear dimension reduction; curve reconstruction.

1 Introduction

Functional data analysis (FDA) has attracted wide attention with the rapid development of data collection technology. Many monographs provide a detailed introduction to FDA, such as Ramsay and Silverman (2005), Ferraty and Vieu (2006) and Horváth and Kokoszka (2012). Functional principal component analysis (FPCA), as a dimension reduction technique, plays a crucial role in FDA, since functional data are intrinsically infinite dimensional. However, traditional FPCA is merely a linear approach, and the linear assumption can limit the effectiveness of dimension reduction. In this paper, we aim to develop a nonlinear FPCA method based on neural networks.

Recently, neural network approaches have drawn increasing attention in the field of FDA and shown strong potential. Thind et al. (2023) proposed a functional neural network to handle nonlinear regression models with functional covariates and a scalar response. Further, Rao and Reimherr (2023) introduced a continuous layer in the construction of functional neural networks, so that the functional nature of the data can be maintained as long as possible. The nonlinear function-on-scalar regression model has been considered by Wu et al. (2022) with the use of neural networks. Moreover, neural network methods have also been employed in classification problems for functional data, such as Thind et al. (2020) and Wang et al. (2022a). The above works all focus on supervised learning, as they consider regression or classification problems for functional data in which label variables are defined. Unsupervised learning problems for functional data have also been studied with neural networks. Wang et al. (2021) discussed mean function estimation for functional data using neural networks. Multi-dimensional functional data are taken into account by Wang and Cao (2022a), who propose a robust location function estimator via neural networks. Sarkar and Panaretos (2022) concentrated on covariance estimation for multi-dimensional functional data, and three types of covariance networks are defined correspondingly. Though neural networks have gained extensive interest in FDA, only very few studies work on nonlinear dimension reduction for functional data through neural networks. Wang and Cao (2022b) presented a functional nonlinear learning method, which is a representation learning approach for multivariate functional data and can be applied to curve reconstruction and classification. However, their method is built on a recurrent neural network (RNN), so only discrete values of the data enter the network. Therefore, a nonlinear dimension reduction method by neural networks that treats the continuously observed data from a functional perspective is needed.

FPCA is a crucial dimension reduction tool in FDA. A great many works have contributed to the development of FPCA in various respects. These include, but are not limited to, the study of principal component analysis for sparsely observed functional data (Yao et al., 2005; Hall et al., 2006; Li and Hsing, 2010). Robust FPCA approaches were introduced in Locantore et al. (1999), Gervini (2008) and Zhong et al. (2022). Moreover, Chiou et al. (2014) and Happ and Greven (2018) discussed principal component analysis methods for more complex functional data, such as multivariate functional data and multi-dimensional functional data. For nonlinear FPCA, Song and Li (2021) generalized kernel principal component analysis to accommodate functional data. Nevertheless, research on nonlinear FPCA remains limited, although accounting for the nonlinear structure of functional data can be beneficial, since a more parsimonious representation can be obtained.

To this end, we propose a new nonlinear functional principal component analysis method based on neural networks, which we denote by nFunNN. Specifically, we borrow the idea of the autoassociative neural networks of Kramer (1991) for the construction of our networks, to realize dimension reduction and curve reconstruction. Kramer (1991) achieved dimension reduction for multivariate data through an internal “bottleneck” layer. However, the extension to functional data is nontrivial due to the infinite dimensional nature of functional data, which adds to the complexity of the neural networks and increases the difficulty of optimization. For our proposed neural network, both the input and the output are functions. To the best of our knowledge, though neural networks with functional input have been studied in existing works, networks with both functional input and functional output have not been considered yet, and they are considerably more complicated. B-spline basis functions are employed in our computation and the backpropagation algorithm is applied. The simulation study and real data applications show the superiority of the nFunNN method under various nonlinear settings. Moreover, we also establish the universal approximation property of our proposed nonlinear model.

The contributions of this paper can be summarized as follows. First, our work is the first attempt to generalize autoassociative neural networks to the functional data setting, which is not straightforward. The use of neural networks provides a new framework for nonlinear dimension reduction of functional data. Second, the universal approximation property of the proposed model is discussed, which provides theoretical guarantees for our method. Third, we present an innovative algorithm for the computation in practice and develop a Python package, called nFunNN, for implementation.

The organization of this paper is as follows. In Section 2, we first give an explanation of nonlinear FPCA, and then introduce a functional autoassociative neural network to carry out nonlinear principal component analysis for functional data. We also discuss the practical implementation of our method. In Section 3, we present the results of our simulation study. The evaluation of our method on real-world data is provided in Section 4. In Section 5, we conclude the paper with some discussion.

2 Methodology

2.1 Nonlinear FPCA

Let $X(t)$ be a smooth random function in $L^{2}(\mathcal{T})$, where $\mathcal{T}$ is a bounded and closed interval. In this paper, $\mathcal{T}$ is set as $[0,1]$ unless otherwise specified. Let $\mu(t)$ and $\Sigma(s,t)$ denote the mean function and covariance function of $X(t)$, respectively. For linear FPCA, according to the Karhunen-Loève theorem, $X(t)$ admits the expansion

X(t)=\mu(t)+\sum_{k=1}^{\infty}\xi_{k}\phi_{k}(t),

where $\xi_{k}$ is the $k$-th functional principal component score, and $\phi_{k}(t)$ is the $k$-th eigenfunction of $\Sigma(s,t)$, which satisfies $\int_{\mathcal{T}}\phi_{k}(t)^{2}dt=1$ and $\int_{\mathcal{T}}\phi_{k}(t)\phi_{l}(t)dt=0$ for $l\neq k$. Moreover, the functional principal component scores are uncorrelated random variables with mean zero and $E\xi_{k}^{2}=\lambda_{k}$, where $\lambda_{k}$ is the $k$-th eigenvalue of $\Sigma(s,t)$. In practice, as only the first several functional principal component scores dominate the variation, the truncated expansion

X(t)=\mu(t)+\sum_{k=1}^{K}\xi_{k}\phi_{k}(t)   (1)

is often applied, where $K$ is the number of functional principal components that are used. Furthermore, we also have

\xi_{k}=\int_{\mathcal{T}}\{X(t)-\mu(t)\}\phi_{k}(t)dt.   (2)

It is obvious that $X(t)$ is mapped into a lower dimensional space through a linear transformation. However, under the constraint of a linear map, any nonlinear structure is ignored, which may lower the efficiency of dimension reduction.
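For concreteness, a minimal grid-based sketch of this linear FPCA reconstruction is given below; the function name, the simple Riemann quadrature, and the assumption of densely observed curves on a common grid are our own illustrative choices rather than part of any specific package.

```python
import numpy as np

def linear_fpca(X, t, K):
    """Grid-based linear FPCA sketch: estimate eigenpairs, scores, and the
    truncated reconstruction in (1) from curves observed on a common grid.

    X : (n, M) array of curve values on the grid t.
    t : (M,) equally spaced grid on [0, 1].
    K : number of principal components retained.
    """
    n, M = X.shape
    w = t[1] - t[0]                      # quadrature weight of the grid
    mu = X.mean(axis=0)                  # estimated mean function
    Xc = X - mu                          # centered curves
    Sigma = Xc.T @ Xc / n                # discretized covariance surface
    vals, vecs = np.linalg.eigh(Sigma * w)
    idx = np.argsort(vals)[::-1][:K]     # leading eigenvalues first
    lam = vals[idx]
    phi = vecs[:, idx] / np.sqrt(w)      # eigenfunctions with unit L2 norm
    xi = Xc @ phi * w                    # scores via the integral in (2)
    X_hat = mu + xi @ phi.T              # truncated reconstruction as in (1)
    return lam, phi, xi, X_hat
```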

For nonlinear FPCA, we extend the linear map in (2) to an arbitrary nonlinear map, that is,

\xi_{k}=G_{k}(X),   (3)

where $G_{k}$ is a nonlinear function that maps a function in the square integrable space $L^{2}(\mathcal{T})$ to a scalar in $\mathbb{R}$. Similarly, (1) can be generalized to the nonlinear version

X(t)=H(\bm{\xi})(t),

where $\bm{\xi}=(\xi_{1},\ldots,\xi_{K})^{\top}$ and $H$ is a nonlinear function that maps from a vector space to a square integrable space. The scores obtained from nonlinear FPCA also contain the main information of $X(t)$, but the dimension of the feature space can be lower, as the nonlinear structure is taken into account in the process of dimension reduction. To estimate the nonlinear functional principal component scores, the nonlinear functions $G_{k}$ and $H$ have to be learned, and a neural network method is employed.

2.2 Neural Networks for Nonlinear FPCA

In this section, we construct a functional autoassociative neural network for nonlinear FPCA. The structure of the proposed neural network is shown in Figure 1. The output is the reconstruction of the input data; thus both the input and the output are functions in our network, which brings challenges to the optimization of the neural network. Furthermore, dimension reduction is realized by the second hidden layer, which can also be called the “bottleneck” layer as in Kramer (1991). More details about the computation of the proposed functional autoassociative neural network are given below.

Figure 1: The proposed functional autoassociative neural network for nonlinear FPCA.

To be specific, the left two hidden layers are designed to learn the nonlinear functions $G_{k}$ that map the functional input to the scores, which can be viewed as a dimension reduction process. For the $j$th neuron in the first hidden layer, we define

H_{j}^{(1)}=\sigma\Big\{b_{j}+\int_{\mathcal{T}}\beta_{j}(t)X(t)dt\Big\},\ j=1,\ldots,J,

where $b_{j}\in\mathbb{R}$ is the intercept, $\beta_{j}(t)\in L^{2}(\mathcal{T})$ is the weight function, $\sigma(\cdot)$ is a nonlinear activation function, and $J$ is the number of neurons in the first hidden layer. This is a natural generalization of the first two layers of a multilayer perceptron to accommodate the functional input. As $H_{j}^{(1)}\in\mathbb{R}$, the computation of the score in the second hidden layer follows naturally. The $k$th neuron in the second hidden layer is

H_{k}^{(2)}=\sum_{j=1}^{J}w_{jk}H_{j}^{(1)},\ k=1,\ldots,K,

where $w_{jk}\in\mathbb{R}$ is the weight. Moreover, let $\mathcal{S}(\sigma,L^{2}(\mathcal{T}))$ be the set of functions from $L^{2}(\mathcal{T})$ to $\mathbb{R}$ of the form

X\mapsto\sum_{j=1}^{J}w_{jk}\sigma\Big\{b_{j}+\int_{\mathcal{T}}\beta_{j}(t)X(t)dt\Big\}.

According to Corollary 5.1.2 in Stinchcombe (1999) and the proof in Section 6.1.2 of Rossi and Conan-Guez (2006), $\mathcal{S}(\sigma,L^{2}(\mathcal{T}))$ has the universal approximation property for $L^{2}(\mathcal{T})$. That is, any nonlinear function from $L^{2}(\mathcal{T})$ to $\mathbb{R}$ can be approximated to an arbitrary degree of precision by functions in $\mathcal{S}(\sigma,L^{2}(\mathcal{T}))$.
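To fix ideas, the following sketch evaluates these first two layers when the curves and the weight functions are available on a common grid; the Riemann-sum quadrature, the array shapes, and the use of tanh for $\sigma$ are illustrative assumptions.

```python
import numpy as np

def encoder_scores(X, t, beta, b, w, sigma=np.tanh):
    """Sketch of the first two layers of the functional autoassociative network.

    X     : (n, M) curves evaluated on the grid t.
    t     : (M,) equally spaced grid on [0, 1].
    beta  : (J, M) weight functions beta_j evaluated on the same grid.
    b     : (J,) intercepts b_j.
    w     : (J, K) weights w_jk linking the first and second hidden layers.
    """
    dt = t[1] - t[0]
    # H^(1)_j = sigma( b_j + \int beta_j(t) X(t) dt ), integral by a Riemann sum
    H1 = sigma(b + X @ beta.T * dt)      # shape (n, J)
    # H^(2)_k = sum_j w_jk H^(1)_j : the K nonlinear scores
    return H1 @ w                        # shape (n, K)
```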

The right two layers in Figure 1, which correspond to the estimation of the nonlinear function $H$, are used to reconstruct $X(t)$ from the low-dimensional scores $H_{k}^{(2)}$ in the second hidden layer. This procedure can be more challenging, since we have to produce functional output from the scalars in the second hidden layer. To this end, the $r$th neuron in the third layer is defined as

H_{r}^{(3)}(t)=\sigma\Big\{a_{r}(t)+\sum_{k=1}^{K}\gamma_{kr}(t)H_{k}^{(2)}\Big\},\ t\in\mathcal{T},\ r=1,\ldots,R,

where $a_{r}(t)\in L^{2}(\mathcal{T})$ is the intercept function, $\gamma_{kr}(t)\in L^{2}(\mathcal{T})$ is the weight function, and $R$ is the number of neurons in the third hidden layer. It can be observed that each neuron in the third hidden layer is a function. Then, $X(t)$ is reconstructed by

\widehat{X}(t)=\sum_{r=1}^{R}u_{r}H_{r}^{(3)}(t),\ t\in\mathcal{T},   (4)

where $u_{r}\in\mathbb{R}$ is the weight. The whole network is trained by minimizing the reconstruction error

RE=\int_{\mathcal{T}}\{X(t)-\widehat{X}(t)\}^{2}dt.

As functional data are involved in the proposed network, some of the parameters to be estimated are functions, which makes the optimization of the network nontrivial. In Section 2.3, we introduce the optimization algorithm for practical implementation.

Note that $H_{k}^{(2)}$ can be viewed as an estimate of $\xi_{k}$ in (3). Therefore, the dimension of $X$ is reduced to $K$ through the functional autoassociative neural network. The resulting low-dimensional vector can be used for further inference, such as curve reconstruction, regression and clustering. In this paper, we mainly focus on curve reconstruction from the low-dimensional representation.

2.3 The Transformed Network for Practical Implementation

As discussed in Section 2.2, it can be hard to optimize the proposed functional autoassociative neural network in Figure 1, since many of the parameters are functions. Here, we employ B-spline basis functions to transform the estimation of these functions into the estimation of their basis coefficients. Let $\textbf{B}_{L}=\{B_{l}(t),l=1,\ldots,L\}$ be a set of B-spline basis functions of degree $d$, where $L$ is the number of basis functions. Then, we have

\beta_{j}(t)=\sum_{l=1}^{L}c_{jl}B_{l}(t),\ a_{r}(t)=\sum_{l=1}^{L}\alpha_{rl}B_{l}(t),\ \gamma_{kr}(t)=\sum_{l=1}^{L}v_{krl}B_{l}(t),

for $j=1,\ldots,J$, $k=1,\ldots,K$, and $r=1,\ldots,R$, where $c_{jl}$, $\alpha_{rl}$ and $v_{krl}$ are the basis expansion coefficients of $\beta_{j}(t)$, $a_{r}(t)$ and $\gamma_{kr}(t)$, respectively.

With the use of B-spline basis functions, the computation of the first two layers of the proposed functional autoassociative neural network becomes

H_{k}^{(2)}=\sum_{j=1}^{J}w_{jk}\sigma\Big\{b_{j}+\sum_{l=1}^{L}c_{jl}\int_{\mathcal{T}}X(t)B_{l}(t)dt\Big\}\triangleq\sum_{j=1}^{J}w_{jk}\sigma\Big(b_{j}+\sum_{l=1}^{L}c_{jl}\widetilde{X}_{l}\Big),   (5)

where $\widetilde{X}_{l}=\int_{\mathcal{T}}X(t)B_{l}(t)dt$. In practice, $\widetilde{X}_{l}$ is calculated through the B-spline expansion of $X$, that is, $\widetilde{X}_{l}=\sum_{h=1}^{L}x_{h}\{\int_{\mathcal{T}}B_{h}(t)B_{l}(t)dt\}$, where $x_{h}$ is the $h$th basis expansion coefficient of $X(t)$. Then, we have

H_{k}^{(2)}=\sum_{j=1}^{J}w_{jk}\sigma\Big[b_{j}+\sum_{h=1}^{L}\Big\{\sum_{l=1}^{L}c_{jl}\int_{\mathcal{T}}B_{h}(t)B_{l}(t)dt\Big\}x_{h}\Big]\triangleq\sum_{j=1}^{J}w_{jk}\sigma\Big(b_{j}+\sum_{h=1}^{L}d_{jh}x_{h}\Big),

where $d_{jh}=\sum_{l=1}^{L}c_{jl}\int_{\mathcal{T}}B_{h}(t)B_{l}(t)dt$. The following theorem establishes the universal approximation property of the first two layers based on B-spline basis functions. The proof is provided in the Appendix.

Theorem 1.

Let $\sigma$ be a continuous non-polynomial function from $\mathbb{R}$ to $\mathbb{R}$, and let $\mathcal{S}(\sigma,\textbf{B}_{L})$ be the set of functions from $L^{2}(\mathcal{T})$ to $\mathbb{R}$ of the form

X\mapsto\sum_{j=1}^{J}w_{j0}\sigma\Big(b_{j}+\sum_{h=1}^{L}d_{jh}x_{h}\Big),

where $x_{h}$ is the $h$th coordinate of $X$ on the basis $\textbf{B}_{L}$, $J\in\mathbb{N}^{\ast}$, $w_{j0}\in\mathbb{R}$, $b_{j}\in\mathbb{R}$ and $d_{jh}\in\mathbb{R}$. Then, $\mathcal{S}(\sigma,\textbf{B}_{L})$ has the universal approximation property. That is, for any compact subset $\mathbb{K}$ of $L^{2}(\mathcal{T})$, any continuous function $F$ from $\mathbb{K}$ to $\mathbb{R}$ and any $\epsilon>0$, there exists $G\in\mathcal{S}(\sigma,\textbf{B}_{L})$ such that for all $X\in\mathbb{K}$, $|G(X)-F(X)|<\epsilon$.
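In practice, the coefficients $d_{jh}$ only require the Gram matrix $\int_{\mathcal{T}}B_{h}(t)B_{l}(t)dt$ of the B-spline basis, which can be precomputed once. The sketch below shows one way to obtain it; the knot placement, cubic degree, and trapezoidal quadrature are illustrative choices, not prescribed by our method.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(L, degree=3, grid_size=1001):
    """Evaluate L B-spline basis functions of the given degree on a fine grid of [0, 1]."""
    n_interior = L - degree - 1
    interior = np.linspace(0, 1, n_interior + 2)[1:-1]
    knots = np.concatenate(([0.0] * (degree + 1), interior, [1.0] * (degree + 1)))
    grid = np.linspace(0, 1, grid_size)
    basis = np.empty((L, grid_size))
    for l in range(L):
        coef = np.zeros(L)
        coef[l] = 1.0                    # l-th basis function B_l
        basis[l] = BSpline(knots, coef, degree, extrapolate=False)(grid)
    return grid, np.nan_to_num(basis)

# Gram matrix W_{hl} = \int B_h(t) B_l(t) dt via the trapezoidal rule; with the
# coefficient matrix C = (c_{jl}) of shape (J, L), the product D = C @ W stores d_{jh}.
grid, B = bspline_basis(L=10)
W = np.trapz(B[:, None, :] * B[None, :, :], grid, axis=-1)
```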

For the computation of the third hidden layer of the proposed functional autoassociative neural network, we have

a_{r}(t)+\sum_{k=1}^{K}\gamma_{kr}(t)H_{k}^{(2)} =\sum_{l=1}^{L}\alpha_{rl}B_{l}(t)+\sum_{k=1}^{K}\sum_{l=1}^{L}v_{krl}B_{l}(t)H_{k}^{(2)}=A_{r}^{\top}B(t)+\sum_{k=1}^{K}V_{kr}^{\top}B(t)H_{k}^{(2)}
=\Big(A_{r}+\sum_{k=1}^{K}V_{kr}H_{k}^{(2)}\Big)^{\top}B(t),

where $A_{r}=(\alpha_{r1},\ldots,\alpha_{rL})^{\top}$, $V_{kr}=(v_{kr1},\ldots,v_{krL})^{\top}$, and $B(t)=(B_{1}(t),\ldots,B_{L}(t))^{\top}$. The term $A_{r}+\sum_{k=1}^{K}V_{kr}H_{k}^{(2)}$ above can be viewed as a weighted sum of $H_{1}^{(2)},\ldots,H_{K}^{(2)}$. Thus, the computation from the second hidden layer to the third hidden layer of the functional autoassociative neural network can be transformed accordingly, that is,

H_{r}^{(3)}(t)=\sigma\Big\{\Big(A_{r}+\sum_{k=1}^{K}V_{kr}H_{k}^{(2)}\Big)^{\top}B(t)\Big\},\ t\in\mathcal{T},   (6)

the structure of which is shown in Figure 2. The middle layer in Figure 2 represents the elements of the $L$-dimensional vector $A_{r}+\sum_{k=1}^{K}V_{kr}H_{k}^{(2)}$. Furthermore, the reconstruction of $X(t)$ can be obtained by (4) as discussed in Section 2.2.

Figure 2: The network structure corresponding to the computation of (6).
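The step in (6) amounts to an ordinary linear map from the $K$ scores to $R\times L$ basis coefficients, followed by multiplication with the evaluated basis matrix. A short torch sketch is given below; the tensor sizes and the random placeholder for the basis matrix are assumptions for illustration only.

```python
import torch

K, R, L, M = 2, 20, 10, 101
H2 = torch.randn(5, K)                    # scores H^(2) for a batch of 5 curves
Bmat = torch.randn(L, M)                  # placeholder for B-splines evaluated on the grid

coef_layer = torch.nn.Linear(K, R * L)    # bias plays the role of A_r, weights of V_kr
coefs = coef_layer(H2).reshape(-1, R, L)  # (batch, R, L) basis coefficients
H3 = torch.tanh(coefs @ Bmat)             # (batch, R, M): H^(3)_r evaluated on the grid
```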

In practice, suppose that $X_{1},\ldots,X_{n}$ are $n$ independent realizations of $X$, where $n$ is the sample size. Then, the loss function for the network can be represented as

\frac{1}{n}\sum_{i=1}^{n}\int_{\mathcal{T}}\{X_{i}(t)-\widehat{X}_{i}(t)\}^{2}dt.

However, the integral in the loss function can increase the difficulty of optimization. Hence, a right-hand Riemann sum is employed for the computation. Specifically, let $0=t_{1}<t_{2}<\ldots<t_{M}=1$ be equally spaced time points on $\mathcal{T}$. Moreover, let $s_{1},\ldots,s_{T}$ denote the observation time points of the random curves, where $T$ is the observation size. Then, we consider the following loss function

\widetilde{RE}=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{M-1}\sum_{m=2}^{M}\{\widetilde{X}_{i}(t_{m})-\widehat{X}_{i}(t_{m})\}^{2},   (7)

where $\widetilde{X}_{i}(t_{m})$ is an estimate of $X_{i}(t_{m})$ obtained from the observed data by smoothing. Note that only the values of the curves at $t_{1},\ldots,t_{M}$ are involved in the loss function. Therefore, we only need to consider discrete values of $X(t)$ in the last two layers of the network. Let $\textbf{B}_{l}=(B_{l}(t_{1}),\ldots,B_{l}(t_{M}))^{\top}$ for $l=1,\ldots,L$, and $\textbf{B}=(\textbf{B}_{1},\ldots,\textbf{B}_{L})^{\top}$. The $r$th neuron of the third hidden layer can be computed by

\widetilde{H}_{r}^{(3)}=\sigma\Big\{\Big(A_{r}+\sum_{k=1}^{K}V_{kr}H_{k}^{(2)}\Big)^{\top}\textbf{B}\Big\}.   (8)

Then the output is given by

\widehat{\textbf{X}}=\sum_{r=1}^{R}u_{r}\widetilde{H}_{r}^{(3)},   (9)

where $\widehat{\textbf{X}}=(\widehat{X}(t_{1}),\ldots,\widehat{X}(t_{M}))$. To be clear, the transformed functional autoassociative neural network is shown in Figure 3. The computations involved in the transformed autoassociative neural network correspond to (5), (8) and (9), respectively.

Figure 3: The transformed functional autoassociative neural network for nonlinear FPCA.

By turning the proposed functional autoassociative neural network in Section 2.2 into the transformed functional autoassociative neural network, the Adam algorithm (Kingma and Ba, 2014) can be employed for the optimization. This algorithm is widely used and is available in the Python package torch. Moreover, we also provide the Python package nFunNN for the implementation of our method.
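To make the construction concrete, the following sketch assembles the transformed network of Figure 3 as a torch module and trains it with Adam on the discretized loss (7). The class name, default layer sizes, tanh activation, and training hyperparameters are illustrative assumptions; the released nFunNN package should be consulted for the actual implementation.

```python
import torch
import torch.nn as nn

class TransformedFunAutoencoder(nn.Module):
    """Minimal sketch of the transformed functional autoassociative network.

    x_coef : (n, L) tensor of B-spline coefficients of the smoothed curves.
    B      : (L, M) tensor of basis functions evaluated at t_1, ..., t_M.
    """

    def __init__(self, L, J=20, K=2, R=20):
        super().__init__()
        self.encoder = nn.Linear(L, J)                   # b_j + sum_h d_jh x_h
        self.bottleneck = nn.Linear(J, K, bias=False)    # weights w_jk, gives H^(2)
        self.decoder = nn.Linear(K, R * L)               # A_r + sum_k V_kr H^(2)_k
        self.out = nn.Linear(R, 1, bias=False)           # output weights u_r
        self.R, self.L = R, L

    def forward(self, x_coef, B):
        H1 = torch.tanh(self.encoder(x_coef))                              # (n, J)
        H2 = self.bottleneck(H1)                                           # (n, K) scores
        H3 = torch.tanh(self.decoder(H2).reshape(-1, self.R, self.L) @ B)  # (n, R, M)
        X_hat = self.out(H3.transpose(1, 2)).squeeze(-1)                   # (n, M)
        return X_hat, H2

def train(model, x_coef, X_grid, B, epochs=500, lr=1e-3):
    """Fit the network by minimizing the right-hand Riemann sum loss in (7)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        X_hat, _ = model(x_coef, B)
        loss = ((X_hat[:, 1:] - X_grid[:, 1:]) ** 2).mean()
        loss.backward()
        opt.step()
    return model
```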

3 Simulation

3.1 Numerical performance

In this section, we conduct a simulation study to explore the performance of the proposed nFunNN method. Measurement error is taken into account when generating the simulated data, to make the setting more consistent with practical cases. Specifically, the observed data are generated by

Y_{ij}=X_{i}(s_{j})+\epsilon_{ij},\ i=1,\ldots,n,\ j=1,\ldots,T,

where $Y_{ij}$ is the $j$th observation for the $i$th subject, and the $\epsilon_{ij}$ are independent measurement errors drawn from the normal distribution $\mathcal{N}(0,\delta^{2})$. We set $\delta=0.1$, and the observation size is set as $T=51$. The observation grid is equally spaced on $\mathcal{T}=[0,1]$. For the setting of $X_{i}$, we consider the following cases.

  • Case 1: $X_{i}(t)=\xi_{i1}\sin(2\pi t)+\xi_{i2}\cos(2\pi t)$, $t\in\mathcal{T}$, where the $\xi_{i1}$ and $\xi_{i2}$ are simulated from $\mathcal{N}(0,3^{2})$ and $\mathcal{N}(0,2^{2})$, respectively.

  • Case 2: $X_{i}(t)=\xi_{i2}\sin(\xi_{i1}t)$, $t\in\mathcal{T}$, where both the $\xi_{i1}$ and the $\xi_{i2}$ are simulated from $\mathcal{N}(0,2^{2})$.

  • Case 3: $X_{i}(t)=\xi_{i2}\cos(\xi_{i1}t)$, $t\in\mathcal{T}$, where the $\xi_{i1}$ and $\xi_{i2}$ are simulated in the same way as in Case 2.

  • Case 4: $X_{i}(t)=\xi_{i1}\sin(2\pi t)+\xi_{i2}\cos(2\pi t)+\xi_{i2}\sin(\xi_{i1}t)$, $t\in\mathcal{T}$, where the $\xi_{i1}$ and $\xi_{i2}$ are simulated in the same way as in Case 2.

  • Case 5: $X_{i}(t)=\xi_{i1}\sin(2\pi t)+\xi_{i2}\cos(2\pi t)+\xi_{i2}\cos(\xi_{i1}t)$, $t\in\mathcal{T}$, where the $\xi_{i1}$ and $\xi_{i2}$ are simulated in the same way as in Case 2.

The above setups cover various structures of $X_{i}(t)$. In Case 1, $X_{i}(t)$ is generated directly through the Karhunen-Loève expansion, with zero mean, $\lambda_{1}=3^{2}$, $\lambda_{2}=2^{2}$, and $\lambda_{k}=0$ for $k\geq 3$. Thus, $X_{i}(t)$ in Case 1 has a linear structure, and a linear method may be adequate for this case. The other four cases consider nonlinear structures of $X_{i}(t)$: Case 2 and Case 3 impose a single nonlinear term, while Case 4 and Case 5 combine the linear terms in Case 1 with the nonlinear terms in Case 2 and Case 3, respectively. Further, whether a linear or a nonlinear structure is considered, all five cases contain two principal components.
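As an illustration, the observed data for Case 2 can be generated as in the following sketch; the random seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, delta = 1000, 51, 0.1
s = np.linspace(0, 1, T)                                # equally spaced observation grid

# Case 2: X_i(t) = xi_i2 * sin(xi_i1 * t), with both scores drawn from N(0, 2^2),
# observed with N(0, delta^2) measurement error.
xi1 = rng.normal(0, 2, size=n)
xi2 = rng.normal(0, 2, size=n)
X = xi2[:, None] * np.sin(xi1[:, None] * s[None, :])    # (n, T) true curves
Y = X + rng.normal(0, delta, size=(n, T))               # observed data Y_ij
```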

The proposed nFunNN method is compared with the classical linear FPCA method (Ramsay and Silverman, 2005). Specifically, the numbers of neurons in the different layers of our transformed functional autoassociative network are set as $L=10$, $J=20$, $K=2$, and $R=20$, and the number of principal components for the linear FPCA method is set to $2$. To evaluate the performance of curve reconstruction, we consider the following criteria:

\mbox{RMSE}=\sqrt{\frac{1}{nM}\sum_{i=1}^{n}\sum_{m=1}^{M}\{\widehat{X}_{i}(t_{m})-X_{i}(t_{m})\}^{2}},
\mbox{RRMSE}=\sqrt{\sum_{i=1}^{n}\sum_{m=1}^{M}\{\widehat{X}_{i}(t_{m})-X_{i}(t_{m})\}^{2}}\Bigg/\sqrt{\sum_{i=1}^{n}\sum_{m=1}^{M}X_{i}(t_{m})^{2}},

where RRMSE denotes the relative RMSE. Note that the prediction at $t_{1},\ldots,t_{M}$ is assessed, and these time points can be different from the observation time points. In our simulation study, $t_{1},\ldots,t_{M}$ are also equally spaced on $[0,1]$ with $M=101$. We evaluate the performance on both the training set and the test set, and the sizes of both sets are $1000$. Moreover, $100$ Monte Carlo runs are conducted for each case.
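Both criteria can be computed directly from the reconstructed and true curve values on the evaluation grid, as in the following short sketch.

```python
import numpy as np

def rmse_rrmse(X_hat, X_true):
    """RMSE and RRMSE for curves evaluated on a common grid; inputs are (n, M) arrays."""
    diff2 = (X_hat - X_true) ** 2
    rmse = np.sqrt(diff2.mean())
    rrmse = np.sqrt(diff2.sum()) / np.sqrt((X_true ** 2).sum())
    return rmse, rrmse
```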

Table 1 lists the simulation results of the proposed nFunNN method and the FPCA method for all five cases. In Case 1, though the FPCA method yields slightly better RMSE and RRMSE than our nFunNN method, both methods perform well on the training set and the test set. This result is not surprising, since $X_{i}(t)$ is generated by the Karhunen-Loève expansion in Case 1, and classical linear FPCA is well suited to such a case. For Case 2 and Case 3, where a nonlinear structure is considered, the proposed nFunNN method clearly outperforms the FPCA method and gives more accurate predictions, which indicates the advantage of our nFunNN method in nonlinear cases. Furthermore, the FPCA method is almost invalid in Case 4 and Case 5, while the proposed nFunNN method still provides encouraging curve reconstruction results. In Case 4 and Case 5, a nonlinear term is added to the two linear terms, so the linear FPCA method cannot accomplish the prediction with only two principal components. The prediction error of our nFunNN method is much lower than that of the FPCA method in Case 4 and Case 5, which indicates that our method achieves more effective dimension reduction in nonlinear settings.

To sum up, the proposed nFunNN method gives good prediction results in all five cases. Although its error can be slightly larger than that of the linear method in the linear case, the prediction results of our nFunNN method are still reasonable. Furthermore, the nFunNN method shows a clear advantage in the nonlinear cases. Therefore, our nFunNN method can be a good choice when it is unknown whether the data at hand have a linear or a nonlinear structure.

Table 1: The averaged RMSE and RRMSE of the nFunNN and FPCA methods across 100 Monte Carlo runs for all five cases, with standard deviations in parentheses.
Training set Test set
RMSE RRMSE RMSE RRMSE
Case 1 nFunNN 0.0285 (0.0088) 0.0112 (0.0034) 0.0324 (0.0187) 0.0127 (0.0072)
FPCA 0.0201 (0.0003) 0.0079 (0.0002) 0.0201 (0.0003) 0.0079 (0.0002)
Case 2 nFunNN 0.0453 (0.0146) 0.0386 (0.0121) 0.0607 (0.0182) 0.0519 (0.0154)
FPCA 0.0890 (0.0177) 0.0760 (0.0147) 0.0878 (0.0192) 0.0753 (0.0164)
Case 3 nFunNN 0.0790 (0.0250) 0.0488 (0.0156) 0.1036 (0.0300) 0.0639 (0.0182)
FPCA 0.1955 (0.0255) 0.1207 (0.0156) 0.2033 (0.0290) 0.1254 (0.0174)
Case 4 nFunNN 0.1828 (0.0576) 0.0791 (0.0252) 0.2181 (0.0746) 0.0943 (0.0320)
FPCA 0.9992 (0.0279) 0.4323 (0.0093) 1.0065 (0.0275) 0.4356 (0.0096)
Case 5 nFunNN 0.2248 (0.0514) 0.0877 (0.0199) 0.2781 (0.0637) 0.1080 (0.0244)
FPCA 0.6573 (0.0291) 0.2565 (0.0112) 0.6628 (0.0348) 0.2566 (0.0123)

3.2 Effect of the tuning parameters

In this section, we discuss the effect of $L$, $J$, $K$, and $R$ on the performance of the proposed nFunNN method. Though the values of $J$ and $R$ can differ in our method, we set $J=R$ for simplicity here. We consider the settings of Case 1, Case 2 and Case 4 in this section. For the discussion of the effect of $L$, we fix $J=20$ and $K=2$, and let $L$ take the values $10$, $15$, and $20$. Moreover, we set $J=10,15,20$, with $L=10$ and $K=2$, to explore the influence of $J$. When discussing the effect of $K$, we set $K=2,3,4$, $L=10$ and $J=20$. Note that the number of parameters in the network is $J(KL+K+2L+2)$ when $J=R$. As the sample size should be larger than the number of parameters, we increase the sample size of the training set to $2000$, while the size of the test set remains $1000$ as in Section 3.1.
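For instance, with the default setting $J=R=20$, $K=2$ and $L=10$, the parameter count is $20\times(2\times 10+2+2\times 10+2)=880$, which is well below the training size of $2000$; a one-line check is shown below.

```python
def n_params(J, K, L):
    # encoder J(L + 1) + bottleneck JK + decoder R(KL + L) + output weights R, with J = R
    return J * (K * L + K + 2 * L + 2)

print(n_params(J=20, K=2, L=10))   # 880
```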

Tables 2–4 present the simulation results of our nFunNN method for various values of $L$, $J$ and $K$. From Table 2, it can be observed that both RMSE and RRMSE increase slightly as $L$ grows. As shown in Table 3, there is no obvious difference in the prediction errors across different values of $J$, which implies that the effect of $J$ is not significant. For the results in Table 4, the prediction errors in Case 1 are similar for different values of $K$. However, in Case 2 and Case 4, which are both nonlinear cases, different values of $K$ lead to different performance of the network. In Case 2, the prediction error first decreases as $K$ increases and then shows a minor growth. Furthermore, the performance of the nFunNN method improves substantially when a larger $K$ is used in Case 4. We conjecture that this is related to the complex setting of Case 4.

To summarize, according to the simulation results, the effects of $L$ and $J$ are not very pronounced, while different values of $K$ can lead to large changes in the prediction for some complex cases. In practice, the tuning parameters of the neural network can be selected via a validation set.

Table 2: The averaged RMSE and RRMSE of the nFunNN method across 100 Monte Carlo runs with various values of $L$ for Case 1, Case 2 and Case 4, with standard deviations in parentheses.
Training set Test set
RMSE RRMSE RMSE RRMSE
Case 1 $L=10$ 0.0247 (0.0038) 0.0097 (0.0015) 0.0262 (0.0042) 0.0103 (0.0016)
$L=15$ 0.0301 (0.0060) 0.0119 (0.0024) 0.0312 (0.0062) 0.0122 (0.0024)
$L=20$ 0.0375 (0.0082) 0.0147 (0.0032) 0.0386 (0.0080) 0.0151 (0.0031)
Case 2 $L=10$ 0.0414 (0.0084) 0.0354 (0.0071) 0.0522 (0.0162) 0.0443 (0.0138)
$L=15$ 0.0547 (0.0209) 0.0469 (0.0179) 0.0628 (0.0240) 0.0533 (0.0204)
$L=20$ 0.0708 (0.0309) 0.0605 (0.0261) 0.0780 (0.0318) 0.0663 (0.0276)
Case 4 $L=10$ 0.1802 (0.0770) 0.0777 (0.0328) 0.1956 (0.0796) 0.0845 (0.0352)
$L=15$ 0.1881 (0.0678) 0.0812 (0.0293) 0.2018 (0.0763) 0.0870 (0.0327)
$L=20$ 0.2350 (0.1851) 0.1014 (0.0804) 0.2473 (0.1885) 0.1067 (0.0812)
Table 3: The averaged RMSE and RRMSE of the nFunNN method across 100 Monte Carlo runs with various values of $J$ for Case 1, Case 2 and Case 4, with standard deviations in parentheses.
Training set Test set
RMSE RRMSE RMSE RRMSE
Case 1 $J=10$ 0.0228 (0.0031) 0.0090 (0.0012) 0.0251 (0.0060) 0.0099 (0.0023)
$J=15$ 0.0231 (0.0024) 0.0091 (0.0009) 0.0250 (0.0041) 0.0098 (0.0016)
$J=20$ 0.0246 (0.0035) 0.0097 (0.0014) 0.0262 (0.0043) 0.0103 (0.0017)
Case 2 $J=10$ 0.0522 (0.0122) 0.0445 (0.0104) 0.0599 (0.0165) 0.0512 (0.0137)
$J=15$ 0.0457 (0.0112) 0.0390 (0.0094) 0.0541 (0.0147) 0.0463 (0.0125)
$J=20$ 0.0413 (0.0088) 0.0352 (0.0074) 0.0506 (0.0136) 0.0433 (0.0114)
Case 4 $J=10$ 0.2102 (0.0507) 0.0907 (0.0220) 0.2266 (0.0612) 0.0978 (0.0259)
$J=15$ 0.1953 (0.0794) 0.0843 (0.0340) 0.2145 (0.0898) 0.0926 (0.0386)
$J=20$ 0.1684 (0.0413) 0.0727 (0.0179) 0.1854 (0.0507) 0.0801 (0.0224)
Table 4: The averaged RMSE and RRMSE of the nFunNN method across 100 Monte Carlo runs with various values of $K$ for Case 1, Case 2 and Case 4, with standard deviations in parentheses.
Training set Test set
RMSE RRMSE RMSE RRMSE
Case 1 $K=2$ 0.0248 (0.0042) 0.0097 (0.0016) 0.0258 (0.0045) 0.0102 (0.0018)
$K=3$ 0.0255 (0.0038) 0.0100 (0.0015) 0.0266 (0.0040) 0.0105 (0.0016)
$K=4$ 0.0263 (0.0029) 0.0103 (0.0012) 0.0275 (0.0036) 0.0108 (0.0015)
Case 2 $K=2$ 0.0394 (0.0082) 0.0337 (0.0071) 0.0512 (0.0185) 0.0436 (0.0158)
$K=3$ 0.0271 (0.0016) 0.0232 (0.0015) 0.0349 (0.0133) 0.0297 (0.0114)
$K=4$ 0.0290 (0.0015) 0.0248 (0.0014) 0.0352 (0.0096) 0.0300 (0.0082)
Case 4 $K=2$ 0.1733 (0.0522) 0.0748 (0.0224) 0.1899 (0.0663) 0.0820 (0.0283)
$K=3$ 0.0610 (0.0084) 0.0263 (0.0036) 0.0754 (0.0158) 0.0326 (0.0067)
$K=4$ 0.0334 (0.0027) 0.0144 (0.0012) 0.0405 (0.0079) 0.0175 (0.0033)

4 Real Data Analysis

In this section, we evaluate the performance of the proposed nFunNN method in real data applications. The Yoga data set and the StarLightCurves data set from Chen et al. (2015) are considered. Specifically, we aim to assess the predictive ability of the nFunNN method on these two data sets.

The Yoga data set contains $3300$ samples, and the observation size is $426$ for each subject. We randomly divide the data set into a training set and a test set with sample sizes $3000$ and $300$, respectively. The values at the $426$ observation grid points are predicted by both our nFunNN method and the classical FPCA method with different values of $K$. The numbers of neurons in the other layers of our transformed functional autoassociative neural network are set as $L=20$, $J=20$, and $R=20$. The following criteria are considered:

\widetilde{\mbox{RMSE}}=\sqrt{\frac{1}{nM}\sum_{i=1}^{n}\sum_{m=1}^{M}\{\widehat{X}_{i}(t_{m})-Y_{im}\}^{2}},
\widetilde{\mbox{RRMSE}}=\sqrt{\sum_{i=1}^{n}\sum_{m=1}^{M}\{\widehat{X}_{i}(t_{m})-Y_{im}\}^{2}}\Bigg/\sqrt{\sum_{i=1}^{n}\sum_{m=1}^{M}Y_{im}^{2}},

where $M=426$ for the Yoga data set. The data set is randomly split $100$ times, and the prediction results for both the training set and the test set are presented in Table 5. As $K$ increases, the prediction results of both the nFunNN and FPCA methods improve. Moreover, the proposed nFunNN method provides more precise predictions than the FPCA method for the same $K$, which demonstrates the advantage of our method in real data applications. Figure 4 displays the prediction performance of both methods on the training set and the test set by boxplots. It can be observed that the nFunNN method always produces smaller prediction errors.

Table 5: The averaged $\widetilde{\mbox{RMSE}}$ and $\widetilde{\mbox{RRMSE}}$ of the nFunNN and FPCA methods across 100 runs for the Yoga data set, with standard deviations in parentheses.
Training set Test set
$\widetilde{\mbox{RMSE}}$ $\widetilde{\mbox{RRMSE}}$ $\widetilde{\mbox{RMSE}}$ $\widetilde{\mbox{RRMSE}}$
$K=2$ nFunNN 0.2709 (0.0061) 0.2712 (0.0061) 0.2777 (0.0127) 0.2780 (0.0127)
FPCA 0.4464 (0.0009) 0.4469 (0.0009) 0.4453 (0.0093) 0.4459 (0.0093)
$K=3$ nFunNN 0.2172 (0.0052) 0.2174 (0.0052) 0.2265 (0.0097) 0.2267 (0.0097)
FPCA 0.3745 (0.0009) 0.3750 (0.0009) 0.3769 (0.0085) 0.3773 (0.0085)
$K=4$ nFunNN 0.1747 (0.0032) 0.1749 (0.0032) 0.1821 (0.0089) 0.1823 (0.0089)
FPCA 0.2943 (0.0009) 0.2947 (0.0009) 0.2952 (0.0092) 0.2955 (0.0092)
$K=5$ nFunNN 0.1492 (0.0032) 0.1494 (0.0032) 0.1572 (0.0068) 0.1574 (0.0068)
FPCA 0.2554 (0.0007) 0.2557 (0.0007) 0.2565 (0.0068) 0.2568 (0.0068)
$K=6$ nFunNN 0.1294 (0.0029) 0.1295 (0.0029) 0.1366 (0.0059) 0.1367 (0.0059)
FPCA 0.2240 (0.0006) 0.2243 (0.0006) 0.2242 (0.0063) 0.2245 (0.0063)
$K=7$ nFunNN 0.1159 (0.0028) 0.1161 (0.0028) 0.1232 (0.0055) 0.1234 (0.0055)
FPCA 0.1947 (0.0005) 0.1949 (0.0005) 0.1955 (0.0055) 0.1958 (0.0055)
$K=8$ nFunNN 0.1043 (0.0036) 0.1044 (0.0036) 0.1110 (0.0063) 0.1111 (0.0063)
FPCA 0.1707 (0.0005) 0.1709 (0.0005) 0.1714 (0.0049) 0.1716 (0.0049)
Figure 4: The averaged $\widetilde{\mbox{RMSE}}$ of the nFunNN and FPCA methods across 100 runs for the Yoga data set. (a) The boxplot for the training set. (b) The boxplot for the test set.

Furthermore, we consider the analysis of the StarLightCurves data set. There are $9236$ subjects in this data set, and each subject has $1024$ observations. Similar to the analysis of the Yoga data set, we randomly divide the StarLightCurves data set into a training set and a test set, with sizes $8000$ and $1236$, respectively. We predict the values at the $1024$ observation grid points by both the nFunNN and FPCA methods using various values of $K$. The numbers of neurons in the transformed functional autoassociative neural network are set the same as in the analysis of the Yoga data set. We also conduct $100$ runs for the StarLightCurves data set, and the averaged $\widetilde{\mbox{RMSE}}$ and $\widetilde{\mbox{RRMSE}}$ are reported in Table 6. The trend of the prediction errors for both methods is analogous to that for the Yoga data set. Figure 5 provides a visual illustration of the comparison between the nFunNN and FPCA methods. Our nFunNN method consistently outperforms the linear FPCA method across different values of $K$, which further indicates the effectiveness of the proposed nFunNN method in real-world applications.

Table 6: The averaged $\widetilde{\mbox{RMSE}}$ and $\widetilde{\mbox{RRMSE}}$ of the nFunNN and FPCA methods across 100 runs for the StarLightCurves data set, with standard deviations in parentheses.
Training set Test set
$\widetilde{\mbox{RMSE}}$ $\widetilde{\mbox{RRMSE}}$ $\widetilde{\mbox{RMSE}}$ $\widetilde{\mbox{RRMSE}}$
$K=2$ nFunNN 0.2048 (0.0039) 0.2049 (0.0039) 0.2074 (0.0057) 0.2075 (0.0057)
FPCA 0.3052 (0.0007) 0.3053 (0.0007) 0.3047 (0.0046) 0.3048 (0.0046)
$K=3$ nFunNN 0.1636 (0.0032) 0.1637 (0.0032) 0.1662 (0.0045) 0.1663 (0.0045)
FPCA 0.2615 (0.0007) 0.2617 (0.0007) 0.2609 (0.0047) 0.2611 (0.0047)
$K=4$ nFunNN 0.1484 (0.0023) 0.1485 (0.0023) 0.1519 (0.0044) 0.1520 (0.0044)
FPCA 0.2344 (0.0008) 0.2345 (0.0008) 0.2346 (0.0050) 0.2347 (0.0050)
$K=5$ nFunNN 0.1386 (0.0019) 0.1387 (0.0019) 0.1415 (0.0033) 0.1416 (0.0033)
FPCA 0.2110 (0.0007) 0.2112 (0.0007) 0.2110 (0.0044) 0.2111 (0.0044)
$K=6$ nFunNN 0.1305 (0.0020) 0.1305 (0.0020) 0.1344 (0.0032) 0.1344 (0.0032)
FPCA 0.1908 (0.0007) 0.1909 (0.0007) 0.1919 (0.0048) 0.1920 (0.0048)
$K=7$ nFunNN 0.1228 (0.0017) 0.1229 (0.0017) 0.1267 (0.0032) 0.1267 (0.0032)
FPCA 0.1744 (0.0007) 0.1745 (0.0007) 0.1747 (0.0044) 0.1748 (0.0044)
$K=8$ nFunNN 0.1158 (0.0021) 0.1158 (0.0021) 0.1198 (0.0031) 0.1199 (0.0031)
FPCA 0.1568 (0.0006) 0.1569 (0.0006) 0.1580 (0.0037) 0.1581 (0.0037)
Figure 5: The averaged $\widetilde{\mbox{RMSE}}$ of the nFunNN and FPCA methods across 100 runs for the StarLightCurves data set. (a) The boxplot for the training set. (b) The boxplot for the test set.

5 Conclusions and Discussion

In this paper, we introduce a nonlinear FPCA method to realize effective dimension reduction and curve reconstruction. We generalize the autoassociative neural network to the functional data analysis framework and construct a transformed functional autoassociative neural network for practical implementation. The proposed method takes into account the nonlinear structure of the functional observations. A Python package is developed for the convenience of using the proposed nFunNN method. The theoretical properties of the proposed networks are also considered. Moreover, the results of the simulation study and real data applications further suggest the superiority of our nFunNN method.

There are several possible extensions of our work. First, we only consider ordinary functional data in the development of our method. However, complex functional data, such as multivariate functional data (Chiou et al., 2014) and multi-dimensional functional data (Wang et al., 2022b), are becoming more and more common nowadays. Thus, developing nonlinear FPCA methods for these types of data and generalizing our method to such settings can be of great significance. Second, only the curve reconstruction error is considered in the construction of the loss function (7), so our method is particularly suited to curve reconstruction. It would be beneficial to incorporate other objectives into the loss function, such as regression and clustering problems. To achieve this goal, some modifications of the proposed neural network are needed, which is worth further research.

Acknowledgments

This work was supported by Public Health & Disease Control and Prevention, Major Innovation & Planning Interdisciplinary Platform for the “Double-First Class” Initiative, Renmin University of China. This work was also supported by the Outstanding Innovative Talents Cultivation Funded Programs 2021 of Renmin University of China.

Appendix A Proof of Theorem 1

Proof.

Let $\mathbb{K}$ be a compact subset of $L^{2}(\mathcal{T})$. Recall that $\mathcal{S}(\sigma,L^{2}(\mathcal{T}))$ is the set of functions from $L^{2}(\mathcal{T})$ to $\mathbb{R}$ of the form

X\mapsto\sum_{j=1}^{J}w_{j0}\sigma\Big\{b_{j}+\int_{\mathcal{T}}X(t)\beta_{j}(t)dt\Big\},

where $J\in\mathbb{N}^{\ast}$, $w_{j0}\in\mathbb{R}$, $b_{j}\in\mathbb{R}$ and $\beta_{j}(t)\in L^{2}(\mathcal{T})$. According to Corollary 5.1.2 in Stinchcombe (1999) and Section 6.1.2 of Rossi and Conan-Guez (2006), $\mathcal{S}(\sigma,L^{2}(\mathcal{T}))$ has the universal approximation property. Hence, for any continuous function $F$ from $\mathbb{K}$ to $\mathbb{R}$ and any $\epsilon>0$, there exists $H\in\mathcal{S}(\sigma,L^{2}(\mathcal{T}))$ such that for all $X\in\mathbb{K}$,

|H(X)-F(X)|<\epsilon/2.   (10)

As $H$ is continuous on $L^{2}(\mathcal{T})$, for any $X\in\mathbb{K}$ there exists $\eta(X)>0$ such that for any $f\in B(X,\eta(X))$, $|H(X)-H(f)|<\epsilon/4$. By the approximation property of B-splines (Zhong et al., 2023) and the compactness of $\mathbb{K}$, we obtain

|H(\widetilde{X})-H(X)|<\epsilon/2,   (11)

similarly to the proof in Section 6.1.3 of Rossi and Conan-Guez (2006), where $\widetilde{X}(t)=\sum_{h=1}^{L}x_{h}B_{h}(t)$. Then, according to (10) and (11),

|H(\widetilde{X})-F(X)|\leq|H(\widetilde{X})-H(X)|+|H(X)-F(X)|<\epsilon.

Moreover,

H(\widetilde{X})=\sum_{j=1}^{J}w_{j0}\sigma\Big\{b_{j}+\int_{\mathcal{T}}\widetilde{X}(t)\beta_{j}(t)dt\Big\}
=\sum_{j=1}^{J}w_{j0}\sigma\Big[b_{j}+\int_{\mathcal{T}}\Big\{\sum_{h=1}^{L}x_{h}B_{h}(t)\Big\}\beta_{j}(t)dt\Big]
=\sum_{j=1}^{J}w_{j0}\sigma\Big\{b_{j}+\sum_{h=1}^{L}x_{h}\int_{\mathcal{T}}\beta_{j}(t)B_{h}(t)dt\Big\}
=\sum_{j=1}^{J}w_{j0}\sigma\Big(b_{j}+\sum_{h=1}^{L}d_{jh}x_{h}\Big).

Define $\widetilde{H}(X)=H(\widetilde{X})$; then $\widetilde{H}\in\mathcal{S}(\sigma,\textbf{B}_{L})$. The proof is complete.

References

  • Ramsay and Silverman [2005] James O. Ramsay and Bernard W. Silverman. Functional Data Analysis (2nd ed.). Springer Series in Statistics, New York: Springer, 2005.
  • Ferraty and Vieu [2006] Frédéric Ferraty and Philippe Vieu. Nonparametric Functional Data Analysis: Theory and Practice. Springer, New York, 2006.
  • Horváth and Kokoszka [2012] Lajos Horváth and Piotr Kokoszka. Inference for Functional Data with Applications. Springer, New York, 2012.
  • Thind et al. [2023] Barinder Thind, Kevin Multani, and Jiguo Cao. Deep learning with functional inputs. Journal of Computational and Graphical Statistics, 32(1):171–180, 2023.
  • Rao and Reimherr [2023] Aniruddha Rajendra Rao and Matthew Reimherr. Non-linear functional modeling using neural networks. Journal of Computational and Graphical Statistics, pages 1–20, 2023.
  • Wu et al. [2022] Sidi Wu, Cédric Beaulac, and Jiguo Cao. Neural networks for scalar input and functional output. arXiv preprint arXiv:2208.05776, 2022.
  • Thind et al. [2020] Barinder Thind, Kevin Multani, and Jiguo Cao. Neural networks as functional classifiers. arXiv preprint arXiv:2010.04305, 2020.
  • Wang et al. [2022a] Shuoyang Wang, Guanqun Cao, and Zuofeng Shang. Deep neural network classifier for multi-dimensional functional data. arXiv preprint arXiv:2205.08592, 2022a.
  • Wang et al. [2021] Shuoyang Wang, Guanqun Cao, Zuofeng Shang, and Alzheimer’s Disease Neuroimaging Initiative. Estimation of the mean function of functional data via deep neural networks. Stat, 10(1):e393, 2021.
  • Wang and Cao [2022a] Shuoyang Wang and Guanqun Cao. Robust deep neural network estimation for multi-dimensional functional data. Electronic Journal of Statistics, 16(2):6461–6488, 2022a.
  • Sarkar and Panaretos [2022] Soham Sarkar and Victor M Panaretos. Covnet: Covariance networks for functional data on multidimensional domains. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(5):1785–1820, 2022.
  • Wang and Cao [2022b] Haixu Wang and Jiguo Cao. Functional nonlinear learning. arXiv preprint arXiv:2206.11424, 2022b.
  • Yao et al. [2005] Fang Yao, Hans-Georg Müller, and Jane-Ling Wang. Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association, 100(470):577–590, 2005.
  • Hall et al. [2006] Peter Hall, Hans-Georg Müller, and Jane-Ling Wang. Properties of principal component methods for functional and longitudinal data analysis. The Annals of Statistics, 34(3):1493–1517, 2006.
  • Li and Hsing [2010] Yehua Li and Tailen Hsing. Uniform convergence rates for nonparametric regression and principal component analysis in functional/longitudinal data. The Annals of Statistics, 38(6):3321–3351, 2010.
  • Locantore et al. [1999] N Locantore, JS Marron, DG Simpson, N Tripoli, JT Zhang, KL Cohen, Graciela Boente, Ricardo Fraiman, Babette Brumback, Christophe Croux, et al. Robust principal component analysis for functional data. Test, 8:1–73, 1999.
  • Gervini [2008] Daniel Gervini. Robust functional estimation using the median and spherical principal components. Biometrika, 95(3):587–600, 2008.
  • Zhong et al. [2022] Rou Zhong, Shishi Liu, Haocheng Li, and Jingxiao Zhang. Robust functional principal component analysis for non-gaussian longitudinal data. Journal of Multivariate Analysis, 189:104864, 2022.
  • Chiou et al. [2014] Jeng-Min Chiou, Yu-Ting Chen, and Ya-Fang Yang. Multivariate functional principal component analysis: A normalization approach. Statistica Sinica, 24:1571–1596, 2014.
  • Happ and Greven [2018] Clara Happ and Sonja Greven. Multivariate functional principal component analysis for data observed on different (dimensional) domains. Journal of the American Statistical Association, 113(522):649–659, 2018.
  • Song and Li [2021] Jun Song and Bing Li. Nonlinear and additive principal component analysis for functional data. Journal of Multivariate Analysis, 181:104675, 2021.
  • Kramer [1991] Mark A Kramer. Nonlinear principal component analysis using autoassociative neural networks. AIChE journal, 37(2):233–243, 1991.
  • Stinchcombe [1999] Maxwell B Stinchcombe. Neural network approximation of continuous functionals and continuous functions on compactifications. Neural Networks, 12(3):467–477, 1999.
  • Rossi and Conan-Guez [2006] Fabrice Rossi and Brieuc Conan-Guez. Theoretical properties of projection based multilayer perceptrons with functional inputs. Neural Processing Letters, 23(1):55–70, 2006.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Chen et al. [2015] Yanping Chen, Eamonn Keogh, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, and Gustavo Batista. The UCR Time Series Classification Archive, July 2015. www.cs.ucr.edu/~eamonn/time_series_data/.
  • Wang et al. [2022b] Jiayi Wang, Raymond KW Wong, and Xiaoke Zhang. Low-rank covariance function estimation for multidimensional functional data. Journal of the American Statistical Association, 117(538):809–822, 2022b.
  • Zhong et al. [2023] Rou Zhong, Shishi Liu, Haocheng Li, and Jingxiao Zhang. Sparse logistic functional principal component analysis for binary data. Statistics and Computing, 33(1):15, 2023.