
Multi-view Kernel PCA for Time series Forecasting

Arun Pandey
Department of Electrical Engineering
ESAT-STADIUS, KU Leuven
Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
arun.pandey@esat.kuleuven.be
Hannes De Meulemeester
Department of Electrical Engineering
ESAT-STADIUS, KU Leuven
Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
hannes.demeulemeester@esat.kuleuven.be
Bart De Moor
Department of Electrical Engineering
ESAT-STADIUS, KU Leuven
Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
bart.demoor@kuleuven.be
Johan A.K. Suykens
Department of Electrical Engineering
ESAT-STADIUS, KU Leuven
Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
johan.suykens@kuleuven.be
Abstract

In this paper, we propose a kernel principal component analysis model for multi-variate time series forecasting, where the training and prediction schemes are derived from the multi-view formulation of Restricted Kernel Machines. The training problem is simply an eigenvalue decomposition of the sum of two kernel matrices corresponding to the views of the input and output data. When a linear kernel is used for the output view, the forecasting equation takes the form of kernel ridge regression. When that kernel is non-linear, a pre-image problem has to be solved to forecast a point in the input space. We evaluate the model on several standard time series datasets, perform ablation studies, benchmark with closely related models and discuss the results.

1 Introduction

Kernel methods have seen great success in many applications with very high-dimensional data but few samples, and are therefore among the most popular non-parametric models. In critical machine learning applications, kernel methods are preferred due to their strong theoretical foundation in learning theory [1, 2, 3, 4]. Kernel methods map the data into a high-dimensional (possibly infinite-dimensional) feature space by using the kernel trick. The kernel trick allows for natural, non-linear extensions of traditional linear methods in terms of a dual representation using a suitable kernel function. This has led to numerous popular methods such as kernel principal component analysis [5], kernel Fisher discriminant analysis [6] and the least-squares support vector machine [1].

However, when it comes to learning large-scale problems, kernel methods fall behind deep learning techniques due to their time and memory complexity. This also holds in the time series analysis and forecasting domain, which has recently been dominated by specialized deep neural network models [7, 8, 9, 10].

Attempts have been made to combine kernel and deep learning methods [11, 12], especially in specific settings such as deep Gaussian processes [13] and multi-layer support vector machines [14]. Recently, a new unifying framework, named Restricted Kernel Machines (RKM), was proposed [15] that attempts to bridge kernel methods with deep learning. The Lagrangian function of the Least-Squares Support Vector Machine (LS-SVM) is similar to the energy function of Restricted Boltzmann Machines (RBMs), thereby drawing a link between kernel methods and RBMs; hence the name Restricted Kernel Machines.

Contribution: In this work, we propose a novel kernel autoregressive time series forecasting model based on the RKM framework, where the training problem is the eigendecomposition of the sum of two kernel matrices. Additionally, we use the same objective function to derive a novel prediction scheme that recursively forecasts several steps ahead. Further, we experimentally compare the model with several closely related models on multiple publicly available time series datasets.

2 Related Works

Energy-based methods: Energy-based models encode dependencies between visible and latent variables using a so-called energy function. Such a function models the joint probability distribution over the visible and latent variables of the system, and the model is trained by finding a configuration with the lowest total energy. Energy-based models have the advantage over probabilistic methods that they do not require proper normalization, which in turn allows for more freedom when designing a model. One of the most well-known energy-based models is the Restricted Boltzmann Machine [16]. Among other applications, this model has been extended for time series modeling [17, 18]. For an excellent review of energy-based models and Restricted Boltzmann Machines for time series modeling, see [19, 20].

Kernel methods for time series: Kernel methods have long been a staple in the time series analysis toolkit. Thanks to Mercer’s theorem [21], these methods can implicitly work in very high dimensional spaces and can often allow for complicated non-linear problems to be turned into convex optimization problems.

In the case of time series forecasting, the traditional kernel ridge regression formulation can be changed into an online variant where only a number of data points are available at each time step. The solution to this problem can then be iteratively updated for each new time step. This technique is known as Kernel Recursive Least-Squares (KRLS) [22]. One disadvantage of this method is that the model grows in size with each new data point that is introduced. Various solutions have been proposed that attempt to limit the final model size [23, 24]. Alternatively, a Bayesian formulation of KRLS exists which also includes a forgetting mechanism [25].

Support vector machines [26, 3] are a particularly popular type of kernel machine since training an SVM corresponds to solving a convex optimization problem with sparse solutions. The standard, static regression variants of support vector machines can be used in a feedforward fashion for time series forecasting. Alternatively, explicitly recursive extensions have been formulated [27]. The disadvantage is that the resulting optimization problems become non-convex.

Gaussian processes [28] are the extension of kernel methods to the probabilistic setting. Unlike support vector machines, Gaussian processes are not sparse, which can be problematic for large-scale problems. Various solutions have been proposed to reduce the effective number of data points and train on large datasets, such as using inducing variables [29] or performing the computations on subsets of the data [30]. In [31] it was shown that a Gaussian process can be formulated as a latent variable model, which in turn allowed for hierarchical, deep Gaussian processes [13].

2.1 Restricted Kernel Machines

Restricted Kernel Machines (RKM) [15] is a general kernel methods framework characterized by visible and hidden units. Using the Fenchel-Young conjugate for quadratic functions, RKMs make the connection between Least-Squares Support Vector Machines [3] and Restricted Boltzmann Machines [16]. The workhorse of restricted kernel machines is kernel PCA, whose energy function can be defined in the RKM framework using visible ($\bm{v}_i$) and latent ($\bm{h}_i$) units as follows:

J(\bm{W},\bm{h}_i)=\sum_{i=1}^{n}\Big[-\langle\bm{v}_i,\bm{W}\bm{h}_i\rangle_{\mathscr{H}}+\dfrac{1}{2}\bm{h}_i^{\top}\bm{\Lambda}\bm{h}_i\Big]+\dfrac{1}{2}\operatorname{Tr}(\bm{W}^{\top}\bm{W}) \qquad (1)

The stationary points of the energy function are given by:

\frac{\partial J}{\partial\bm{h}_i}=0\implies\bm{W}^{\top}\bm{v}_i=\bm{\Lambda}\bm{h}_i,\quad\forall i,
\frac{\partial J}{\partial\bm{W}}=0\implies\bm{W}=\sum_{i=1}^{n}\bm{v}_i\bm{h}_i^{\top}.

Eliminating $\bm{W}$ from the above yields the following eigenvalue problem:

\bm{K}\bm{H}^{\top}=\bm{H}^{\top}\bm{\Lambda} \qquad (2)

with $\bm{H}=[\bm{h}_1\,\ldots\,\bm{h}_n]\in\mathbb{R}^{s\times n}$ and $\bm{\Lambda}=\operatorname{diag}(\lambda_i)\in\mathbb{R}^{s\times s}$, where $s\leq n$ denotes the number of components. Training an RKM model thus consists of solving an eigenvalue problem. This is in contrast with Restricted Boltzmann Machines, where the model is trained by approximating the gradient of the log-likelihood of $\bm{W}$ using methods such as contrastive divergence [32].
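As a brief illustration (ours, not from the original paper), the eigenvalue problem in Eq. 2 can be solved with a standard symmetric eigendecomposition; a minimal NumPy sketch, assuming a precomputed (centered) kernel matrix K:

```python
import numpy as np

def kpca_rkm(K: np.ndarray, s: int):
    """Solve Eq. 2: K H^T = H^T Lambda, keeping the s largest components."""
    w, V = np.linalg.eigh(K)              # K is symmetric positive semi-definite
    idx = np.argsort(w)[::-1][:s]         # indices of the s largest eigenvalues
    H = V[:, idx].T                       # latent variables, shape (s, n)
    Lam = np.diag(w[idx])                 # diagonal matrix of eigenvalues
    return H, Lam
```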

The objective function $J(\bm{W},\bm{h}_i)$ in Eq. 1 forms the basis for more specific RKM models which can be used in downstream tasks. Additionally, a key advantage of characterizing kernel methods by visible and hidden units is that deeper, multi-layer models can be developed, similarly to deep Boltzmann machines [33].

Classification and regression can be performed by introducing the primal formulation of LS-SVM into the objective function [15]. This is motivated by the fact that kernel PCA corresponds to a one-class LS-SVM problem with zero target value around which the variance is maximized [34]. Classification using RKMs has been extended to multi-view classification using a tensor-based representation of the methods [35]. RKMs have been shown to be powerful feature extractors and can be used for disentangled representation learning [36, 37]. This, in turn, allows them to be used in outlier and out-of-distribution detection tasks [38]. Finally, the latent space of an RKM can be sampled from to generate new samples from the learned distribution [39, 40, 41, 42].

One aspect that was missing, until now, was a general time series model situated in the RKM setting. This work attempts to fill this gap by introducing a framework with autoregressive RKMs. This idea was briefly introduced in [43] where an RKM model with autoregression in the latent space was discussed. In the next section, we generalize this method and introduce a complete framework for time series analysis using restricted kernel machines.

3 Proposed Model

Our objective is to capture the dynamics of a multi-variate time series $\{\bm{x}_i\}_{i=1}^{n}$, $\bm{x}_i\in\mathbb{R}^{d}$, containing $n$ time steps. Given the lag $p\in\mathbb{Z}^{+}$, we transform the given time series such that each input vector is a vector of lagged values and the corresponding output vector is the one-step-ahead data point. Let the input data vectors be $\{[\bm{x}_i^{(1)},\ldots,\bm{x}_{i-p}^{(1)},\ldots,\bm{x}_i^{(d)},\ldots,\bm{x}_{i-p}^{(d)}]\}_{i=p+1}^{n-1}\subset\mathcal{X}$ and the corresponding target values be $\{\bm{x}_{i+1}\}_{i=p+1}^{n-1}\subset\mathcal{Y}$. This structure is employed to model the dynamics of the system, which enables us to use static regression models for training. For prediction, however, we employ a recursive scheme in which the predictions are used as inputs for further predictions: $\hat{\bm{y}}_i=f(\hat{\bm{y}}_{i-1},\ldots,\hat{\bm{y}}_{i-p})$.
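As a minimal sketch of this lagged-input construction (our illustration, not the authors' code), assuming a NumPy array of shape (n, d); the helper name make_ar_dataset is hypothetical, and the exact ordering of the stacked lags is immaterial for a radial kernel:

```python
import numpy as np

def make_ar_dataset(series: np.ndarray, p: int):
    """Build lagged inputs and one-step-ahead targets from an (n, d) series.

    Each input row stacks the current and p past values of every dimension
    (most recent lag first); the target is the next time step.
    Shapes: X is (n-p-1, (p+1)*d), Y is (n-p-1, d).
    """
    n, d = series.shape
    X, Y = [], []
    for i in range(p, n - 1):
        X.append(series[i - p:i + 1][::-1].reshape(-1))  # [x_i, x_{i-1}, ..., x_{i-p}]
        Y.append(series[i + 1])                          # one-step-ahead target
    return np.asarray(X), np.asarray(Y)
```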

We define feature maps $\phi:\mathcal{X}\rightarrow\mathscr{H}_x$ and $\psi:\mathcal{Y}\rightarrow\mathscr{H}_y$, where $\mathscr{H}_x,\mathscr{H}_y$ denote (possibly infinite-dimensional) Reproducing Kernel Hilbert Spaces (RKHS; see [44] for more details). Throughout our discussion, we assume that the feature vectors are centered in the feature space, i.e. $\bar{\phi}(\bm{x})=\phi(\bm{x})-\mu_{\phi}$ with $\mu_{\phi}=\mathbb{E}_{\bm{\xi}\sim\mathcal{X}}[\phi(\bm{\xi})]$. Using an implicit formulation, it suffices to notice that $\langle\bar{\phi}(\bm{x}),\bar{\phi}(\bm{y})\rangle_{\mathscr{H}}=\langle\phi(\bm{x}),\phi(\bm{y})\rangle_{\mathscr{H}}-\langle\mu_{\phi},\phi(\bm{y})\rangle_{\mathscr{H}}-\langle\phi(\bm{x}),\mu_{\phi}\rangle_{\mathscr{H}}+\langle\mu_{\phi},\mu_{\phi}\rangle_{\mathscr{H}}=k(\bm{x},\bm{y})-\mathbb{E}_{\bm{\xi}\sim\mathcal{X}}[k(\bm{\xi},\bm{y})]-\mathbb{E}_{\bm{\zeta}\sim\mathcal{X}}[k(\bm{x},\bm{\zeta})]+\mathbb{E}_{\bm{\xi},\bm{\zeta}\sim\mathcal{X}}[k(\bm{\xi},\bm{\zeta})]$; in practice, these statistics are computed on the training set (see Appendix A). Such a feature map can be constructed explicitly or implicitly via a kernel function $k(\bm{x}_i,\bm{x}_j):\mathcal{X}^2\rightarrow\mathbb{R}:(\bm{x}_i,\bm{x}_j)\mapsto\langle\phi(\bm{x}_i),\phi(\bm{x}_j)\rangle_{\mathscr{H}_x}$. Further, we define two linear operators $\bm{U}_x:\mathbb{R}^s\rightarrow\mathscr{H}_x$ and $\bm{U}_y:\mathbb{R}^s\rightarrow\mathscr{H}_y$, with $s\leq n$, together with their adjoints $\bm{U}_x^{\top}:\mathscr{H}_x\rightarrow\mathbb{R}^s$ and $\bm{U}_y^{\top}:\mathscr{H}_y\rightarrow\mathbb{R}^s$ (the linear operator $\bm{U}_x$ is often referred to as a matrix, as it exists explicitly only for finite-dimensional $\mathscr{H}_x$, in which case $\bm{U}_x\in\mathbb{R}^{\dim(\mathscr{H}_x)\times s}$). Data points $\bm{x}_i$ and $\bm{y}_i$ are associated to a latent variable $\bm{h}_i\in\mathbb{R}^s$ through the pairing terms $\langle\phi(\bm{x}_i),\bm{U}_x\bm{h}_i\rangle_{\mathscr{H}_x}$ and $\langle\psi(\bm{y}_i),\bm{U}_y\bm{h}_i\rangle_{\mathscr{H}_y}$.

3.1 Training

Given the lagged inputs and targets defined above, the training objective is

J_n(\bm{U}_x,\bm{U}_y,\mathcal{H}_n)=\sum_{i=1}^{n}\Big[-\overbrace{\langle\phi(\bm{x}_i),\bm{U}_x\bm{h}_i\rangle_{\mathscr{H}_x}}^{\text{1st view}}-\overbrace{\langle\psi(\bm{y}_i),\bm{U}_y\bm{h}_i\rangle_{\mathscr{H}_y}}^{\text{2nd view}}+\dfrac{1}{2}\bm{h}_i^{\top}\bm{\Lambda}\bm{h}_i+\lVert\phi(\bm{x}_i)\rVert_{\mathscr{H}_x}^{2}+\lVert\psi(\bm{y}_i)\rVert_{\mathscr{H}_y}^{2}\Big]+\dfrac{1}{2}\Big[\lVert\bm{U}_x\rVert_{\mathrm{F}}^{2}+\lVert\bm{U}_y\rVert_{\mathrm{F}}^{2}\Big] \qquad (3)

The first two terms in the objective maximize the pairing between the visible variables in the feature spaces $\{\phi(\bm{x}):\bm{x}\in\mathcal{X}\}$, $\{\psi(\bm{y}):\bm{y}\in\mathcal{Y}\}$ and the latent variables $\{\bm{h}\in\mathbb{R}^{s}\}$. In this way, we map the points from the two data sources to a single point in the latent space. The remaining norm terms act as regularization and are meant to bound the objective.

Solving the objective

Given the visible variables, characterizing the stationary points of $J_n(\bm{U}_x,\bm{U}_y,\mathcal{H}_n)$ in the pairing linear operators and the latent variables leads to the following equations for $1\leq i\leq n$:

\dfrac{\partial J}{\partial\bm{h}_i}=0\Rightarrow\bm{\Lambda}\bm{h}_i=\bm{U}_x^{\top}\phi(\bm{x}_i)+\bm{U}_y^{\top}\psi(\bm{y}_i) \qquad (4)
\dfrac{\partial J}{\partial\bm{U}_x}=0\Rightarrow\bm{U}_x=\sum_{i=1}^{n}\phi(\bm{x}_i)\bm{h}_i^{\top} \qquad (5)
\dfrac{\partial J}{\partial\bm{U}_y}=0\Rightarrow\bm{U}_y=\sum_{i=1}^{n}\psi(\bm{y}_i)\bm{h}_i^{\top}. \qquad (6)

Eliminating $\bm{U}_x,\bm{U}_y$ from the above equations and denoting $\bm{H}=[\bm{h}_1\,\ldots\,\bm{h}_n]\in\mathbb{R}^{s\times n}$ and $\bm{\Lambda}=\operatorname{diag}(\lambda_i)\in\mathbb{R}^{s\times s}$, we obtain the following eigenvalue problem:

\Big[\bm{K}(\mathcal{X})+\bm{K}(\mathcal{Y})\Big]\bm{H}^{\top}=\bm{H}^{\top}\bm{\Lambda}, \qquad (7)

where the elements of the kernel matrices are $\bm{K}(\mathcal{X})_{i,j}=k_x(\bm{x}_i,\bm{x}_j)$ and $\bm{K}(\mathcal{Y})_{i,j}=k_y(\bm{y}_i,\bm{y}_j)$. Both $\bm{K}(\mathcal{X})$ and $\bm{K}(\mathcal{Y})$ are Hermitian matrices, and so is their sum. The spectral theorem therefore guarantees that the sum can be diagonalized by a unitary matrix with real eigenvalues. Moreover, these eigenvalues are non-negative, since both matrices are positive semi-definite.
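A minimal NumPy sketch of this training step (Eq. 7), assuming Gaussian kernels for both views; rbf_kernel and train_mvrkm are hypothetical helper names, and kernel centering (Appendix A) is omitted for brevity:

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def train_mvrkm(X, Y, s, sigma_x, sigma_y):
    """Eq. 7: eigendecomposition of K(X) + K(Y); returns H (s x n) and Lambda (s x s)."""
    Kx = rbf_kernel(X, X, sigma_x)              # first view, K(X)
    Ky = rbf_kernel(Y, Y, sigma_y)              # second view, K(Y)
    w, V = np.linalg.eigh(Kx + Ky)              # Hermitian eigendecomposition
    idx = np.argsort(w)[::-1][:s]               # keep the s largest eigenvalues
    return V[:, idx].T, np.diag(w[idx]), Kx, Ky
```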

Remarks

The proposed model is inspired by [43]; however, there are three important differences. (i) The data in the proposed model is in the autoregressive format. This enables us to formulate the problem in the well-studied Non-linear Auto-Regressive (NAR) setting. (ii) The second view $\bm{K}(\mathcal{Y})$ is now a kernel matrix, i.e. a positive semi-definite matrix. This was not a constraint with the time-kernel in [43]. Moreover, $\bm{K}(\mathcal{Y})$ implicitly captures the weights on the past latent variables via a kernel function instead of hand-crafting them. (iii) The prediction scheme in [43] used past latent variables to predict the next latent variable. This was sufficient to simulate the dynamics forward after the training set, but there was no way to start from an arbitrary initial state vector. In contrast, here we require an initial state vector from the input space to predict the dynamics forward.

3.2 Prediction

The main goal is to forecast beyond the training set, i.e. find $\{\bm{x}_{n+1},\ldots,\bm{x}_{n'}\}$ with $n'>n$. To do so, we now work with $\mathcal{X}_{n'}=\mathcal{X}_n\cup\{\bm{x}_{n+1},\ldots,\bm{x}_{n'}\}$ and $\mathcal{H}_{n'}=\mathcal{H}_n\cup\{\bm{h}_{n+1},\ldots,\bm{h}_{n'}\}$. This gives the following objective

J_{n'}(\bm{U}_x,\bm{U}_y,\mathcal{H}_{n'})=\sum_{i=1}^{n'}\Big[-\langle\phi(\bm{x}_i),\bm{U}_x\bm{h}_i\rangle_{\mathscr{H}_x}-\langle\psi(\bm{y}_i),\bm{U}_y\bm{h}_i\rangle_{\mathscr{H}_y}+\dfrac{1}{2}\bm{h}_i^{\top}\bm{\Lambda}\bm{h}_i+\lVert\phi(\bm{x}_i)\rVert_{\mathscr{H}_x}^{2}+\lVert\psi(\bm{y}_i)\rVert_{\mathscr{H}_y}^{2}\Big]+\dfrac{1}{2}\Big[\lVert\bm{U}_x\rVert_{\mathrm{F}}^{2}+\lVert\bm{U}_y\rVert_{\mathrm{F}}^{2}\Big] \qquad (8)

Given $\bm{U}_x,\bm{U}_y$ learned during training, characterizing the stationary points of $J_{n'}(\mathcal{X}_{n'},\mathcal{H}_{n'}\,|\,\bm{U}_x,\bm{U}_y)$ in the visible and latent variables gives, for $1\leq i\leq n'$:

\dfrac{\partial J}{\partial\phi(\bm{x}_i)}=0\Rightarrow\phi(\bm{x}_i)=\bm{U}_x\bm{h}_i \qquad (9)
\dfrac{\partial J}{\partial\psi(\bm{y}_i)}=0\Rightarrow\psi(\bm{y}_i)=\bm{U}_y\bm{h}_i \qquad (10)
\dfrac{\partial J}{\partial\bm{h}_i}=0\Rightarrow\bm{\Lambda}\bm{h}_i-\bm{U}_y^{\top}\psi(\bm{y}_i)=\bm{U}_x^{\top}\phi(\bm{x}_i). \qquad (11)

Using the above equations, we can express $\bm{h}_i$ in terms of $\phi(\bm{x}_i)$:

\bm{h}_i=\big(\bm{\Lambda}-\bm{U}_y^{\top}\bm{U}_y\big)^{-1}\bm{U}_x^{\top}\phi(\bm{x}_i). \qquad (12)

Substituting it back in Eq. 10, we obtain an expression for the output in the feature space:

\psi(\bm{y}_i)=\bm{U}_y\big(\bm{\Lambda}-\bm{U}_y^{\top}\bm{U}_y\big)^{-1}\bm{U}_x^{\top}\phi(\bm{x}_i). \qquad (13)

Case I: Non-linear $k_x$, linear $k_y$: In this case, Eq. 13 reduces to the following variant of kernel ridge regression: $\bm{y}_i=\sum_{j=1}^{n}\tilde{\bm{h}}_j\,k_x(\bm{x}_j,\bm{x}_i)$, where $\tilde{\bm{h}}_j=\bm{U}_y\big(\bm{\Lambda}-\bm{U}_y^{\top}\bm{U}_y\big)^{-1}\bm{h}_j$. The way in which these coefficients are obtained differs between our method and kernel ridge regression.
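A sketch of this linear-$k_y$ forecasting step (Eqs. 12-13), assuming the quantities returned by the hypothetical train_mvrkm sketch above and reusing its rbf_kernel helper:

```python
import numpy as np

def predict_linear_ky(x_star, X_train, Y_train, H, Lam, sigma_x):
    """Forecast directly in the input space when k_y is linear (Eq. 13)."""
    Uy = Y_train.T @ H.T                                       # Eq. 6: U_y, shape (d, s)
    kx = rbf_kernel(X_train, x_star[None, :], sigma_x)[:, 0]   # k_x(x_j, x*), shape (n,)
    h_star = np.linalg.solve(Lam - Uy.T @ Uy, H @ kx)          # Eq. 12, latent vector
    return Uy @ h_star                                         # Eq. 13: forecast in R^d
```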

Case II: Non-linear $k_x$, non-linear $k_y$: If the feature map $\psi(\cdot)$ is explicitly known and invertible, then applying its inverse yields the desired output $\bm{y}_i$. However, in kernel methods $\psi(\cdot)$ is typically unknown, for instance when using a Gaussian kernel. As a consequence, the closed-form expression in Eq. 13 cannot be computed directly. In such situations, we leverage the kernel trick to obtain the similarities between the implicit feature vector and the feature vectors of the training set.

Kernel trick: Left-multiplying Eq. 13 with $\psi(\bm{y}_j)^{\top}$ and substituting for $\bm{U}_y$ from Eq. 6, we obtain, $\forall j\in\{1,\ldots,n\}$:

k_y(\bm{y}_j,\bm{y}_i)=\Big(\sum_{k=1}^{n}k_y(\bm{y}_j,\bm{y}_k)\bm{h}_k^{\top}\Big)\Big(\bm{\Lambda}-\sum_{l,m=1}^{n}\bm{h}_l\,k_y(\bm{y}_l,\bm{y}_m)\bm{h}_m^{\top}\Big)^{-1}\Big(\sum_{k=1}^{n}\bm{h}_k\,k_x(\bm{x}_k,\bm{x}_i)\Big).

The above can be written in matrix form as follows:

\bm{k}_{\bm{y}}(\bm{y}_i)=\bm{K}(\mathcal{Y})\bm{H}^{\top}\Big(\bm{\Lambda}-\bm{H}\bm{K}(\mathcal{Y})\bm{H}^{\top}\Big)^{-1}\Big(\bm{H}\bm{k}_{\bm{x}}(\bm{x}_i)\Big), \qquad (14)

where $\bm{k}_{\bm{y}}(\bm{y}_i)=[k_y(\bm{y}_1,\bm{y}_i),\ldots,k_y(\bm{y}_n,\bm{y}_i)]^{\top}$. These feature-space similarities can be used in various pre-image methods to obtain the output $\bm{y}_i$. In the next section, we briefly discuss the pre-image problem and some common methods.
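A sketch of Eq. 14, assuming the quantities H, Lambda and the kernel matrices from the training sketch above; given the similarity vector $\bm{k}_{\bm{x}}(\bm{x}_i)$ of a new lagged input, it returns the similarities between the (implicit) forecast and the training outputs:

```python
import numpy as np

def ky_similarity_vector(kx_star, H, Ky, Lam):
    """Eq. 14: similarities k_y(y_j, y_i) of the implicit forecast to the training outputs."""
    M = np.linalg.inv(Lam - H @ Ky @ H.T)   # (s, s) matrix to invert
    return Ky @ H.T @ (M @ (H @ kx_star))   # shape (n,)
```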

3.2.1 Pre-image Problem

The advantage of using a kernel, $k_y(\bm{y}_i,\bm{y}_j)=\langle\psi(\bm{y}_i),\psi(\bm{y}_j)\rangle_{\mathscr{H}_y}$, is that all computations can be performed implicitly in the feature space and the exact mapping $\psi:\mathcal{Y}\rightarrow\mathscr{H}_y$ is not required. However, this gives rise to the pre-image problem: given a point $\Psi\in\mathscr{H}_y$, find $\bm{y}\in\mathcal{Y}$ such that $\Psi=\psi(\bm{y})$. This pre-image problem is known to be ill-posed [45], as the pre-image might not exist [2] or, if it exists, it may not be unique. Instead, an optimization problem is formulated to find an approximate pre-image $\bm{y}^{\star}$:

\operatorname*{argmin}_{\bm{y}^{\star}}\|\Psi-\psi(\bm{y}^{\star})\|^{2} \qquad (15)

While various techniques exist to approximately solve this problem [46, 47, 48, 49], we discuss two particular methods that can conveniently be incorporated into the RKM framework.

Kernel Smoother: A kernel smoother [39] is essentially a weighted average of nearby data points in the input space, where both the weights and the nearby points are given by the kernel values in Eq. 14. Let $n_r<n$ be the number of nearest neighbours; then the approximate pre-image for time step $i$ is given by:

\tilde{\bm{y}}_i=\frac{\sum_{j=1}^{n_r}k_y(\bm{y}_j,\bm{y}_i)\,\bm{y}_j}{\sum_{l=1}^{n_r}k_y(\bm{y}_l,\bm{y}_i)}.

With this method, the implicit assumption is that the forecasted point lies within the simplex of the training data points. Further, in the limit case $n_r=1$, only the nearest training point is chosen as the prediction. In our experiments, we treat $n_r$ as a hyperparameter.
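A sketch of the kernel smoother pre-image, assuming the similarity vector produced by Eq. 14 above; the helper name is ours and $n_r$ is treated as a hyperparameter:

```python
import numpy as np

def kernel_smoother_preimage(ky_star, Y_train, n_r):
    """Weighted average of the n_r training outputs most similar to the forecast."""
    idx = np.argsort(ky_star)[::-1][:n_r]     # n_r nearest neighbours in feature space
    w = ky_star[idx]                          # kernel values act as weights
    return (w[:, None] * Y_train[idx]).sum(axis=0) / w.sum()
```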

Kernel Ridge Regression: Alternatively, one can learn the pre-image map using kernel ridge regression [49]. Consider the specific case of Eq. 15 where the denoised point is $\Psi=\bm{U}_y\bm{h}^{\star}$ (see Eq. 10), i.e. $\operatorname*{argmin}_{\bm{y}^{\star}}\|\bm{U}_y\bm{h}^{\star}-\psi(\bm{y}^{\star})\|^{2}$. One can then formulate the following problem to learn the pre-image map $\Gamma:\mathscr{H}_y\mapsto\mathcal{Y}$:

\operatorname*{argmin}_{\Gamma}\sum_{i=1}^{n}\ell\big(\bm{y}_i,\Gamma(\bm{U}_y\bm{h}_i)\big)+\lambda\Omega(\Gamma).

Since $\mathcal{Y}$ is the vector space $\mathbb{R}^d$, we can treat the above as a kernel ridge regression problem with kernel $k$. This gives the mapping $\Gamma_j(\bm{U}_y\bm{h}_i)=\sum_{r=1}^{n}\alpha_r^{j}\,k(\bm{U}_y\bm{h}_r,\bm{U}_y\bm{h}_i)=\sum_{r=1}^{n}\alpha_r^{j}\,k(\bm{h}_r,\bm{h}_i)$ for $j\in\{1,\ldots,d\}$, where the coefficients are obtained as $\alpha=\big(\bm{K}^{\top}\bm{K}+\lambda\mathbb{I}\big)^{-1}\bm{K}\bm{Y}$.
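A sketch of this kernel ridge regression pre-image map, assuming a Gaussian kernel on the latent codes and reusing the rbf_kernel helper from the training sketch; sigma_h and lam are hyperparameters and the function names are ours:

```python
import numpy as np

def fit_preimage_map(H, Y_train, sigma_h, lam):
    """Learn alpha = (K^T K + lam I)^{-1} K Y on the latent codes h_i (columns of H)."""
    K = rbf_kernel(H.T, H.T, sigma_h)                     # k(h_r, h_i), shape (n, n)
    n = K.shape[0]
    return np.linalg.solve(K.T @ K + lam * np.eye(n), K @ Y_train)   # shape (n, d)

def preimage(h_star, H, alpha, sigma_h):
    """Map a latent code back to the input space: y_j = sum_r alpha_r^j k(h_r, h*)."""
    k = rbf_kernel(H.T, h_star[None, :], sigma_h)[:, 0]   # shape (n,)
    return alpha.T @ k                                    # shape (d,)
```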

Algorithm 1: MV-RKM hyperparameter tuning with grid-search
Input: time series $\{\bm{x}_i\}_{i=1}^{n}$, $\bm{x}_i\in\mathbb{R}^{d}$
Output: optimal kernel parameters
while validation error $\geq$ 0 do
    Initialize kernels $k_x,k_y$ and lag $p$    // sample from the grid
    $\bm{X}\leftarrow\{[\bm{x}_i^{(1)},\ldots,\bm{x}_{i-p}^{(1)},\ldots,\bm{x}_i^{(d)},\ldots,\bm{x}_{i-p}^{(d)}]\}_{i=p+1}^{n-1}$    // AR format
    $\bm{Y}\leftarrow\{\bm{x}_{i+1}\}_{i=p+1}^{n-1}$    // one-step-ahead targets
    $\bm{K}(\mathcal{X})\leftarrow\bm{X}$, $\bm{K}(\mathcal{Y})\leftarrow\bm{Y}$    // kernel matrices
    $\bm{H},\bm{\Lambda}\leftarrow$ eigendecomposition of $\bm{K}(\mathcal{X})+\bm{K}(\mathcal{Y})$    // Eq. 7
Algorithm 2: MV-RKM recursive prediction
Input: initial lagged vector $\bm{x}_0\in\mathbb{R}^{(p+1)d}$, number of time steps $n'$
Output: recursive forecasts $\{\bm{y}_0,\ldots,\bm{y}_{n'}\}$
while $i\leq n'$ do
    if $i\neq 0$ then
        $\bm{x}_i\leftarrow\{\bm{y}_{i-1},\bm{x}_{i-1},\ldots,\bm{x}_{i-p}\}$
    Compute $\bm{k}_{\bm{x}}(\bm{x}_i)$    // similarity vector, $\mathcal{O}(n)$
    Get $\bm{h}_i$ (Eq. 12)    // latent vector, $\mathcal{O}(1)$
    if $\psi(\cdot)$ is known and invertible then
        $\bm{y}_i\leftarrow$ solve Eq. 13    // $\mathcal{O}(1)$
    else
        Compute $\bm{k}_y(\bm{y}_i)$ (Eq. 14)    // similarity vector, $\mathcal{O}(n)$
        $\bm{y}_i\leftarrow$ solve Eq. 15    // pre-image problem
    $i\leftarrow i+1$
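A compact sketch of Algorithm 2 for the linear-$k_y$ case, reusing the hypothetical helpers above (rbf_kernel and the lag ordering of make_ar_dataset); each forecast is shifted back into the lagged input vector:

```python
import numpy as np

def recursive_forecast(x0, n_steps, X_train, Y_train, H, Lam, sigma_x, d):
    """Roll the forecast forward by feeding each prediction back into the lagged input."""
    Uy = Y_train.T @ H.T                       # Eq. 6 with linear k_y, shape (d, s)
    x, preds = x0.copy(), []
    for _ in range(n_steps):
        kx = rbf_kernel(X_train, x[None, :], sigma_x)[:, 0]
        h = np.linalg.solve(Lam - Uy.T @ Uy, H @ kx)   # Eq. 12, latent vector
        y = Uy @ h                                     # Eq. 13, next forecast
        preds.append(y)
        x = np.concatenate([y, x[:-d]])                # drop the oldest lag block
    return np.asarray(preds)
```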

3.3 Computational complexity

The eigendecomposition for training (Eq. 7) requires slightly fewer than $\mathcal{O}(n^3)$ operations, since the matrices are Hermitian. The complexity of the prediction in the latent space is $\mathcal{O}(n)$ (Eq. 12), since it requires the kernel vector $\bm{k}_x(\bm{x}_i)$. Further, if $k_y$ is a linear kernel, the additional cost of the prediction is constant, i.e. $\mathcal{O}(1)$ (Eq. 13). Otherwise, one has to solve a pre-image problem, and the complexity depends on the chosen pre-image method: for instance, with the kernel smoother it is $\mathcal{O}(n_r)$, and with kernel ridge regression it is $\mathcal{O}(n^3)$, since it requires a matrix inversion.

4 Experiments

In this section, we perform ablation studies on our proposed model on synthetic and real-world datasets to study the influence of different hyperparameters on the embeddings 𝑯\bm{H} and the predictions. Additionally, we show how visualizing the latent embeddings can help with explaining the data and model predictions. Finally, the model is quantitatively compared against other standard time series models.

4.1 Ablation study

Sine wave. A simple sine wave is used to illustrate how the model learns the dynamics present in the input space as latent representations. After hyperparameter tuning, we obtain a closed-loop attractor in the latent space (see Fig. 1). The forecasted embeddings of the unseen test set, indicated by the green curve, follow the same manifold as the embeddings of the training set, indicated by the blue curve. When the forecasted embeddings are transformed back to the input space, it becomes apparent that the model has learned the data correctly, as reflected by the high accuracy of the forecasted data points.

Figure 1: MV-RKM on a sine wave. The embedded data points are temporally de-correlated in the embedding space.

Figure 2: Predictions (left) and latent components (right) for the sum-of-sines and SantaFe experiments (rows: sine with lag 83, sine with lag 10, SantaFe with lag 70). More components correspond to a better prediction. A lower lag for the same parameters generally produces higher-frequency components, and more components are required for the same performance.

Figure 3: Prediction error for different numbers of components on the sum-of-sines (a) and SantaFe (b) datasets. The blue curve indicates the prediction error of the model for the best lag. The gray curve indicates the mean prediction error over multiple lag values with the corresponding standard deviation. More components give a better reconstruction until a plateau is reached.

Sum of sines. A synthetic dataset was created as the sum of two sine waves, $\sum_{i=1}^{2}\alpha_i\sin(2\pi\omega_i t+a_i)$, with $\alpha_i\in\{1,0.2\}$, $\omega_i\in\{1,20\}$ and $a_i\in\{0,0\}$. Using this dataset, we illustrate how the predictions of the model can be explained using the orthogonal latent components. In Fig. 2, the latent components are visualised on the right. On the left, the predictions of the model are shown where iteratively more of the plotted components are used. Visually, more components correspond to a more accurate forecast. This is further shown by plotting the mean squared error in Fig. 3(a). When visualised, the orthogonal components offer a measure of explainability and show how the model decomposes the signal into uncorrelated components. Generally, we observe that higher-frequency components are extracted for lower lag values and lower-frequency components for higher lag values (see the first and second rows in Fig. 2). Empirically, we found that a lower lag means that a higher number of components is required to fit the model sufficiently well.
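For reference, the series can be generated as below (our sketch; the sampling grid is an assumption, as the paper does not specify it), after which the lagged dataset follows from the make_ar_dataset helper sketched earlier:

```python
import numpy as np

t = np.linspace(0.0, 4.0, 2000)                    # hypothetical sampling grid
series = (1.0 * np.sin(2 * np.pi * 1 * t)          # alpha_1 = 1,   omega_1 = 1
          + 0.2 * np.sin(2 * np.pi * 20 * t))      # alpha_2 = 0.2, omega_2 = 20
series = series[:, None]                           # shape (n, 1), univariate
```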

SantaFe. The same experiment was performed on the SantaFe data, a real-world chaotic univariate time series dataset. We observe the same effects as in the sine experiments in Fig. 2 (last row): the predictions improve when increasingly more of the visualised components are used. Further, given a suitable lag, the components are able to capture the jumps in the chaotic time series. Given a sufficiently large lag, more components correlate with a lower prediction error (see Fig. 3(b)). This effect is similar to principal component analysis, where only the top components are needed for a good reconstruction of the data.

4.2 Real-world experiments

Baselines: To provide a benchmark study, we consider six baseline models: Recurrent Neural Networks (RNN), Long Short-Term Memory networks (LSTM), Autoregressive Moving Average (ARMA), Kernel Recursive Least-Squares (KRLS) [25], the non-linear autoregressive formulation of the Least-Squares Support Vector Machine (LS-SVM) [3] and the Recurrent Restricted Kernel Machine (RRKM) [43]. Further, we consider two variants of the proposed model, i.e. $k_y(\bm{y}_i,\bm{y}_j)=\langle\bm{y}_i,\bm{y}_j\rangle$ and $k_y(\bm{y}_i,\bm{y}_j)=\exp(-\lVert\bm{y}_i-\bm{y}_j\rVert^2/2\sigma_y^2)$, with $k_x(\bm{x}_i,\bm{x}_j)=\exp(-\lVert\bm{x}_i-\bm{x}_j\rVert^2/2\sigma_x^2)$ in both cases.

Table 1: Mean squared error on the forecasted data. Standard deviation over 10 iterations in parentheses for the stochastic models; the best result per dataset is marked with an asterisk.

Data | RNN | ARMA | KRLS | LSTM | LS-SVM | RRKM | RKM-MV (k_y rbf) | RKM-MV (k_y lin)
SantaFe | 3037.17 (±625.36) | 2224.55 | 416.17 | 2272.13 (±328.87) | 113.78 | 119.06 | 90.23* | 127.83
Lorenz | 386.01 (±51.53) | 412.28 | 273.00 | 331.49 (±61.30) | 131.87* | 247.19 | 182.65 | 163.659
Chickenpox | 34329.95 (±9513.07) | 23571.35 | 19505.99 | 22296.24 (±3097.18) | 15541.28 | 20716.91 | 15479.93* | 16807.56
Energy | 16002.11 (±1809.89) | 24797.40 | 11994.98 | 14781.65 (±3126.80) | 10635.00 | 12764.097 | 10385.07* | 10474.06
Turbine | 2401.12 (±644.53) | 1317.67 | 1193.82 | 1817.38 (±552.21) | 1226.87 | 1299.915 | 1126.55* | 1291.72

Datasets: We consider both univariate and multi-variate time series: two chaotic time series (SantaFe laser [50], https://rdrr.io/cran/TSPred/man/SantaFe.A.html; Lorenz attractor, https://www.openml.org/search?type=data&sort=runs&id=42182&status=active) and three real-world datasets (Chickenpox [51], https://archive.ics.uci.edu/ml/datasets/Hungarian+Chickenpox+Cases; Belgian appliances energy consumption [52], https://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction; Gas turbine [53], https://archive.ics.uci.edu/ml/datasets/Gas+Turbine+CO+and+NOx+Emission+Data+Set). See Appendix B for further details on these datasets.

On each dataset and for each method, hyperparameter tuning was performed with grid-search, and the result of the best set of parameters, quantified as the mean squared error, is shown in Table 1. For all methods, the entire validation set is forecasted in a recursive fashion, with the initial vector given by the last point of the training set. The scores for RNN, ARMA and RRKM are taken from [43], since the experimental setting remains the same. From the table, we see that the proposed RKM-MV model outperforms the baselines on most datasets, except Lorenz, where LS-SVM has a lower error than RKM-MV. Also note that these two models are closest in performance compared to the rest, since both are in the NAR setting. However, the LS-SVM solution requires solving a linear system (see Appendix C), in contrast with the eigendecomposition of the proposed model. Another important difference is in the prediction schemes.

Moreover, LS-SVM outperforms RKM-MV with linear $k_y$ on most datasets, except the Belgian appliances energy dataset. However, the ability to use a non-linear $k_y$ is an added advantage of the RKM-MV model, which allows it to outperform the other methods.

5 Conclusion

In this work, we introduced a time series forecasting model based on multi-view kernel PCA. This framework provides new insights into time series modeling, such as latent space dynamics and novel relations between kernel PCA and time series forecasting. For future work, we believe this model can be further extended to a multi-source setting, for example when the data comes from multiple weather stations and a forecast has to be made for all stations simultaneously. Custom kernels can be used to better capture the relationship between data points. Moreover, existing techniques from the kernel methods literature, such as the Nyström approximation or random Fourier features, can be leveraged to improve the computational complexity. Besides the topics mentioned in this work, the framework can be further explored for other time series tasks such as denoising, handling missing values and clustering.

Acknowledgements

European Research Council under the European Union’s Horizon 2020 research and innovation programme: ERC Advanced Grants agreements E-DUALITY(No 787960) and Back to the Roots (No 885682). This paper reflects only the authors’ views and the Union is not liable for any use that may be made of the contained information. Research Council KUL: Optimization frameworks for deep kernel machines C14/18/068; Research Fund (projects C16/15/059, C3/19/053, C24/18/022, C3/20/117, C3I-21-00316); Industrial Research Fund (Fellowships 13-0260, IOFm/16/004, IOFm/20/002) and several Leuven Research and Development bilateral industrial projects. Flemish Government Agencies: FWO: projects: GOA4917N (Deep Restricted Kernel Machines: Methods and Foundations), PhD/Postdoc grant, EOS Project no G0F6718N (SeLMA), SBO project S005319N, Infrastructure project I013218N, TBM Project T001919N, PhD Grant (SB/1SA1319N); EWI: the Flanders AI Research Program; VLAIO: CSBO (HBC.2021.0076) Baekeland PhD (HBC.20192204). This research received funding from the Flemish Government (AI Research Program). Other funding: Foundation ‘Kom op tegen Kanker’, CM (Christelijke Mutualiteit).

Appendix A Note on kernel centering

Centering Kernel Matrix

Let $\Phi=[\phi(\bm{x}_1),\ldots,\phi(\bm{x}_n)]$ and let the sample mean be $\bm{\mu}_\phi=\frac{1}{n}\sum_{i=1}^{n}\phi(\bm{x}_i)=\frac{1}{n}\Phi\mathbbm{1}_n$, where $\mathbbm{1}_n=[1,\ldots,1]^{\top}$. Let $\tilde{\phi}(\bm{x}_i)=\phi(\bm{x}_i)-\bm{\mu}_\phi$ denote the centered feature vector. Then

\tilde{k}(\bm{x}_i,\bm{x}_j)=k(\bm{x}_i,\bm{x}_j)-\langle\phi(\bm{x}_i),\bm{\mu}_\phi\rangle_{\mathscr{H}}-\langle\bm{\mu}_\phi,\phi(\bm{x}_j)\rangle_{\mathscr{H}}+\langle\bm{\mu}_\phi,\bm{\mu}_\phi\rangle_{\mathscr{H}}
=k(\bm{x}_i,\bm{x}_j)-\frac{1}{n}\langle\phi(\bm{x}_i),\Phi\mathbbm{1}_n\rangle_{\mathscr{H}}-\frac{1}{n}\langle\Phi\mathbbm{1}_n,\phi(\bm{x}_j)\rangle_{\mathscr{H}}+\frac{1}{n^2}\langle\Phi\mathbbm{1}_n,\Phi\mathbbm{1}_n\rangle_{\mathscr{H}}.

Composing the above in matrix form for all $i,j$ gives

\bm{\tilde{K}}=\bm{K}-\frac{1}{n}\bm{K}\mathbbm{1}\mathbbm{1}^{\top}-\frac{1}{n}\mathbbm{1}\mathbbm{1}^{\top}\bm{K}+\frac{1}{n^2}\mathbbm{1}\mathbbm{1}^{\top}\bm{K}\mathbbm{1}\mathbbm{1}^{\top}=\big(\mathbb{I}-\tfrac{1}{n}\mathbbm{1}\mathbbm{1}^{\top}\big)\bm{K}\big(\mathbb{I}-\tfrac{1}{n}\mathbbm{1}\mathbbm{1}^{\top}\big) \qquad (16)
Centering Kernel Vector

Using the same sample statistic $\bm{\mu}_\phi$, a kernel vector of similarities $\bm{k}_{\bm{x}}(\bm{x}^{\star})=[k(\bm{x}_1,\bm{x}^{\star}),\ldots,k(\bm{x}_n,\bm{x}^{\star})]^{\top}$ can be centered as follows:

\bm{\tilde{k}}_{\bm{x}}(\bm{x}^{\star})=\bm{k}_{\bm{x}}(\bm{x}^{\star})-\frac{1}{n}\bm{K}\mathbbm{1}-\frac{1}{n}\mathbbm{1}\mathbbm{1}^{\top}\bm{k}_{\bm{x}}(\bm{x}^{\star})+\frac{1}{n^2}\mathbbm{1}\mathbbm{1}^{\top}\bm{K}\mathbbm{1} \qquad (17)
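A sketch of Eqs. 16-17 in NumPy (hypothetical helper names), centering the training kernel matrix and the similarity vector of a new point:

```python
import numpy as np

def center_kernel_matrix(K):
    """Eq. 16: K_tilde = (I - 11^T/n) K (I - 11^T/n)."""
    n = K.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n
    return C @ K @ C

def center_kernel_vector(k_star, K):
    """Eq. 17: center the similarity vector of a new point w.r.t. the training statistics."""
    n = K.shape[0]
    ones = np.ones(n)
    return (k_star
            - K @ ones / n                        # subtract row means of K
            - ones * (ones @ k_star) / n          # subtract the mean of k_star
            + ones * (ones @ K @ ones) / n ** 2)  # add back the grand mean of K
```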

Appendix B Datasets and Hyperparameters

We refer to Table 2 for more details on model architectures, datasets and hyperparameters used in this paper. The PyTorch library (double precision) in Python was used. See Algorithm 1 for training the MV-RKM model.

Table 2: Datasets used for the experiments. $N_{\text{train}}$ is the number of training samples, $N_{\text{test}}$ the number of test samples and $d$ the input dimension.

Dataset | N_train | N_test | d
SantaFe | 1000 | 100 | 1
Lorenz | 2801 | 1200 | 3
Chickenpox | 418 | 104 | 20
Energy | 4000 | 1000 | 28
Gas Turbine | 5929 | 1482 | 11

The Lorenz attractor is defined by the following dynamical system:

\dot{x}(t)=-ax+ay, \qquad \dot{y}(t)=-xz+rx-y, \qquad \dot{z}(t)=xy-bz.

For our simulations, we set $a=10$, $r=28$ and $b=2.667$, with initial condition $x(0)=1.0$, $y(0)=-1.0$ and $z(0)=1.05$; the above equations are solved using a first-order finite-difference scheme.
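A sketch of this simulation (forward Euler; the step size and number of steps are assumptions not stated in the paper):

```python
import numpy as np

def lorenz_series(n_steps=4001, dt=0.01, a=10.0, r=28.0, b=2.667,
                  x0=1.0, y0=-1.0, z0=1.05):
    """Simulate the Lorenz system with a first-order (forward Euler) scheme."""
    xyz = np.empty((n_steps, 3))
    xyz[0] = (x0, y0, z0)
    for i in range(n_steps - 1):
        x, y, z = xyz[i]
        xyz[i + 1] = (x + dt * (-a * x + a * y),
                      y + dt * (-x * z + r * x - y),
                      z + dt * (x * y - b * z))
    return xyz
```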

Figure 4: MV-RKM (RBF $k_x$ with $\sigma=2.1856$, linear $k_y$, $s=144$) on the SantaFe chaotic laser dataset, showing the latent space (left) and predictions (right) for short-term (first row) and long-term (second row) forecasts.

Appendix C Least-Squares Support Vector Machine

We consider the non-linear autoregressive setting for the LS-SVM regression model: $\bm{y}=\sum_{i=1}^{n}\alpha_i\,k_x(\bm{x}_i,\bm{x})+b$, where the parameters $\bm{\alpha},b$ are found by solving the following linear system:

\begin{bmatrix}0 & \bm{1}_n^{\top}\\ \bm{1}_n & \bm{K}(\mathcal{X})+\gamma^{-1}\mathbb{I}\end{bmatrix}\begin{bmatrix}b\\ \bm{\alpha}\end{bmatrix}=\begin{bmatrix}0\\ \bm{y}\end{bmatrix}.

For details on the derivation from a constrained optimization problem, we refer to [3, p. 98].
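A sketch of this baseline for a single output dimension, assuming a Gaussian $k_x$ and reusing the rbf_kernel helper from the training sketch (function names are ours):

```python
import numpy as np

def lssvm_fit(X, y, sigma, gamma):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y] for b and alpha."""
    n = X.shape[0]
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]                     # bias b, dual coefficients alpha

def lssvm_predict(x_star, X, alpha, b, sigma):
    """y(x*) = sum_i alpha_i k_x(x_i, x*) + b."""
    return alpha @ rbf_kernel(X, x_star[None, :], sigma)[:, 0] + b
```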

References

  • [1] Vladimir N. Vapnik. The nature of statistical learning theory. Springer-Verlag New York, Inc., 1995.
  • [2] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.
  • [3] Johan A. K. Suykens, Tony Van Gestel, Jos De Brabanter, Bart De Moor, and Joos Vandewalle. Least Squares Support Vector Machines. World Scientific, River Edge, NJ, January 2002.
  • [4] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. Adaptive computation and machine learning. MIT Press, 2006.
  • [5] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
  • [6] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K.R. Mullers. Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop (Cat. No.98TH8468), pages 41–48, 1999.
  • [7] Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. N-beats: Neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations, 2020.
  • [8] Bryan Lim and Sercan Ö. Arik and Nicolas Loeff and Tomas Pfister. Temporal fusion transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting, 37(4):1748–1764, 2021.
  • [9] Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 27268–27286. PMLR, 17–23 Jul 2022.
  • [10] Cristian Challu, Kin G. Olivares, Boris N. Oreshkin, Federico Garza, Max Mergenthaler-Canseco, and Artur Dubrawski. N-hits: Neural hierarchical interpolation for time series forecasting, 2022.
  • [11] Youngmin Cho and Lawrence Saul. Kernel methods for deep learning. In Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems, volume 22. Curran Associates, Inc., 2009.
  • [12] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. Deep kernel learning. In Arthur Gretton and Christian C. Robert, editors, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 370–378, Cadiz, Spain, 09–11 May 2016. PMLR.
  • [13] Andreas Damianou and Neil D. Lawrence. Deep Gaussian processes. In Carlos M. Carvalho and Pradeep Ravikumar, editors, Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, volume 31 of Proceedings of Machine Learning Research, pages 207–215, Scottsdale, Arizona, USA, 29 Apr–01 May 2013. PMLR.
  • [14] Marco A. Wiering and Lambert R.B. Schomaker. Multi-Layer support vector machines. In J. A. K. Suykens, M. Signoretto, and A. Argyriou, editors, Regularization, optimization, kernels, and support vector machines, pages 457–476. Chapman and Hall/CRC, 2014.
  • [15] Johan A. K. Suykens. Deep restricted kernel machines using conjugate feature duality. Neural Computation, 29(8):2123–2163, August 2017.
  • [16] Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. Restricted Boltzmann Machines for collaborative filtering. In ICML ’07, pages 791–798, Corvalis, Oregon, 2007. ACM Press.
  • [17] Ilya Sutskever and Geoffrey Hinton. Learning multilevel distributed representations for high-dimensional sequences. In Marina Meila and Xiaotong Shen, editors, Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, volume 2 of Proceedings of Machine Learning Research, pages 548–555, San Juan, Puerto Rico, 21–24 Mar 2007. PMLR.
  • [18] Ilya Sutskever, Geoffrey E Hinton, and Graham W Taylor. The recurrent temporal restricted boltzmann machine. In Advances in Neural Information Processing Systems, volume 21, 2008.
  • [19] Takayuki Osogami. Boltzmann machines and energy-based models. arXiv:1708.06008 [cs], January 2019.
  • [20] Takayuki Osogami. Boltzmann machines for time-series. arXiv:1708.06004 [cs], January 2019.
  • [21] James Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 209(441-458):415–446, January 1909.
  • [22] Y. Engel, S. Mannor, and R. Meir. The kernel recursive least-squares algorithm. IEEE Transactions on Signal Processing, 52(8):2275–2285, 2004.
  • [23] S. Van Vaerenbergh, J. Via, and I. Santamaria. A sliding-window kernel rls algorithm and its application to nonlinear channel identification. In 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, volume 5, pages V–V, 2006.
  • [24] Steven Van Vaerenbergh, Ignacio Santamaría, Weifeng Liu, and José C. Príncipe. Fixed-budget kernel recursive least-squares. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1882–1885, 2010.
  • [25] Steven Van Vaerenbergh, Miguel Lázaro-Gredilla, and Ignacio Santamaria. Kernel recursive least-squares tracker for time-varying regression. IEEE Transactions on Neural Networks and Learning Systems, 23:1313–1326, 2012.
  • [26] K. R. Müller, A. J. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen, and V. Vapnik. Predicting time series with support vector machines. In Wulfram Gerstner, Alain Germond, Martin Hasler, and Jean-Daniel Nicoud, editors, Artificial Neural Networks — ICANN’97, pages 999–1004, Berlin, Heidelberg, 1997. Springer Berlin Heidelberg.
  • [27] Johan A. K. Suykens and J. Vandewalle. Recurrent least squares support vector machines. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 47(7):1109–1114, 2000.
  • [28] Christopher Williams and Carl Rasmussen. Gaussian processes for regression. In D. Touretzky, M.C. Mozer, and M. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8. MIT Press, 1995.
  • [29] Michalis Titsias. Variational learning of inducing variables in sparse gaussian processes. In David van Dyk and Max Welling, editors, Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pages 567–574, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA, 16–18 Apr 2009. PMLR.
  • [30] Marc Deisenroth and Jun Wei Ng. Distributed gaussian processes. In International Conference on Machine Learning, pages 1481–1490. PMLR, 2015.
  • [31] Neil Lawrence. Probabilistic non-linear principal component analysis with gaussian process latent variable models. Journal of Machine Learning Research, 6(60):1783–1816, 2005.
  • [32] Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Comput., 14(8):1771–1800, aug 2002.
  • [33] Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann Machines. In proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, volume 5 of JMLR, pages 448–455. PMLR, April 2009.
  • [34] Johan A. K. Suykens, Tony Van Gestel, Joos Vandewalle, and Bart De Moor. A support vector machine formulation to PCA analysis and its kernel version. IEEE Transactions on Neural Networks, 14(2):447–450, 2003.
  • [35] Lynn Houthuys and Johan A.K. Suykens. Tensor-based restricted kernel machines for multi-view classification. Information Fusion, 68:54–66, 2021.
  • [36] Francesco Tonin, Panagiotis Patrinos, and Johan A.K. Suykens. Unsupervised learning of disentangled representations in deep restricted kernel machines with orthogonality constraints. Neural Networks, 142:661–679, 2021.
  • [37] David Winant, Joachim Schreurs, and Johan A. K. Suykens. Latent space exploration using generative kernel PCA. In Bart Bogaerts, Gianluca Bontempi, Pierre Geurts, Nick Harley, Bertrand Lebichot, Tom Lenaerts, and Gilles Louppe, editors, Artificial Intelligence and Machine Learning, pages 70–82, Cham, 2020. Springer International Publishing.
  • [38] Francesco Tonin, Arun Pandey, Panagiotis Patrinos, and Johan A. K. Suykens. Unsupervised energy-based out-of-distribution detection using stiefel-restricted kernel machine. In 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2021.
  • [39] Joachim Schreurs and Johan A. K. Suykens. Generative Kernel PCA. In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning , pages 129–134, 2018.
  • [40] Arun Pandey, Joachim Schreurs, and Johan A. K. Suykens. Robust generative restricted kernel machines using weighted conjugate feature duality. In proceedings of the Sixth International Conference on Machine Learning, Optimization, and Data Science (LOD), 2020.
  • [41] Arun Pandey, Joachim Schreurs, and Johan A.K. Suykens. Generative restricted kernel machines: A framework for multi-view generation and disentangled feature learning. Neural Networks, 135:177–191, 2021.
  • [42] Arun Pandey, Michaël Fanuel, Joachim Schreurs, and Johan A. K. Suykens. Disentangled representation learning and generation with manifold optimization. Neural Computation, 34(10):2009–2036, 09 2022.
  • [43] Arun Pandey, Hannes De Meulemeester, Henri De Plaen, Bart De Moor, and Johan Suykens. Recurrent Restricted Kernel Machines for Time-series Forecasting. In ESANN 2022 proceedings, pages 399–404, Bruges (Belgium) and online event, 2022. Ciaco - i6doc.com.
  • [44] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Kernel principal component analysis. In International conference on artificial neural networks, pages 583–588. Springer, 1997.
  • [45] Sebastian Mika, Bernhard Schölkopf, Alex Smola, Klaus-Robert Müller, Matthias Scholz, and Gunnar Rätsch. Kernel pca and de-noising in feature spaces. In Advances in Neural Information Processing Systems, pages 536–542. MIT Press, 1999.
  • [46] Anh Tuan Bui, Joon-Ku Im, Daniel W. Apley, and George C. Runger. Projection-free kernel principal component analysis for denoising. Neurocomputing, 357:163–176, 2019.
  • [47] James T. Kwok and Ivor Wai-Hung Tsang. The pre-image problem in kernel methods. IEEE Transactions on Neural Networks, 15:1517–1525, 2003.
  • [48] Paul Honeine and Cedric Richard. Preimage problem in kernel-based machine learning. IEEE Signal Processing Magazine, 28(2):77–88, March 2011.
  • [49] Jason Weston, Bernhard Schölkopf, and Gökhan Bakir. Learning to find pre-images. In Advances in Neural Information Processing Systems, volume 16, 2003.
  • [50] A.S. Weigend and N.A. Gershenfeld. Time Series Prediction: Forecasting the Future and Understanding the Past. Addison-Wesley, 1993.
  • [51] Benedek Rozemberczki, Paul Scherer, Oliver Kiss, Rik Sarkar, and Tamas Ferenci. Chickenpox cases in hungary: a benchmark dataset for spatiotemporal signal processing with graph neural networks, 2021.
  • [52] Luis M. Candanedo, Véronique Feldheim, and Dominique Deramaix. Data driven prediction models of energy use of appliances in a low-energy house. Energy and Buildings, 140:81–97, 2017.
  • [53] Heysem Kaya, Pınar Tüfekci, and Erdinç Uzun. Predicting co and nox emissions from gas turbines: Novel data and a benchmark pems. Turkish Journal of Electrical Engineering and Computer Sciences, 27(6):4783–4796, 11 2019.