
Recursive variational Gaussian approximation with the Whittle likelihood for linear non-Gaussian state space models

Bao Anh Vu (corresponding author; e-mail: bavu@uow.edu.au), School of Mathematics and Applied Statistics, University of Wollongong, Wollongong, New South Wales, Australia; Securing Antarctica's Environmental Future, University of Wollongong, Wollongong, New South Wales, Australia

David Gunawan, School of Mathematics and Applied Statistics, University of Wollongong, Wollongong, New South Wales, Australia

Andrew Zammit-Mangion, School of Mathematics and Applied Statistics, University of Wollongong, Wollongong, New South Wales, Australia; Securing Antarctica's Environmental Future, University of Wollongong, Wollongong, New South Wales, Australia
Abstract

Parameter inference for linear and non-Gaussian state space models is challenging because the likelihood function contains an intractable integral over the latent state variables. While Markov chain Monte Carlo (MCMC) methods provide exact samples from the posterior distribution as the number of samples goes to infinity, they tend to have a high computational cost, particularly for long time series. Variational Bayes (VB) methods are a useful alternative when inference with MCMC methods is computationally expensive. VB methods approximate the posterior density of the parameters by a simple and tractable distribution found through optimisation. In this paper, we propose a novel sequential variational Bayes approach that makes use of the Whittle likelihood for computationally efficient parameter inference in this class of state space models. Our algorithm, which we call Recursive Variational Gaussian Approximation with the Whittle Likelihood (R-VGA-Whittle), updates the variational parameters by processing data in the frequency domain. At each iteration, R-VGA-Whittle requires the gradient and Hessian of the Whittle log-likelihood, which are available in closed form for a wide class of models. Through several examples using a linear Gaussian state space model and univariate and bivariate non-Gaussian stochastic volatility models, we show that R-VGA-Whittle provides good approximations to posterior distributions of the parameters and is very computationally efficient when compared to asymptotically exact methods such as Hamiltonian Monte Carlo.

Keywords: Autoregressive process; Fast Fourier transform; Hamiltonian Monte Carlo; Stationary time series; Stochastic volatility models.

1 Introduction

State space models (SSMs) are a widely used class of hierarchical time series models employed in a diverse range of applications in fields such as epidemiology (e.g., Dukic et al., 2012), economics (e.g., Chan and Strachan, 2023), finance (e.g., Kim et al., 1998), and engineering (e.g., Gordon et al., 1993). In an SSM, the observed data depend on unobserved, time-evolving latent state variables and unknown parameters. The observations are assumed to be conditionally independent given the latent states, the dynamics of the latent states are described through a Markov process, and often a simplifying assumption of linearity of the state variables is made. When the state variables are non-Gaussian, Bayesian inference for the parameters of these SSMs is challenging since the likelihood function involves an intractable integral over the latent states.

A popular class of inference methods for non-Gaussian SSMs is Markov chain Monte Carlo (MCMC). While MCMC methods can provide samples from the posterior distribution that are asymptotically exact, they tend to be computationally expensive. An early application of MCMC to nonlinear, non-Gaussian SSMs is the work of Meyer et al. (2003), which combines the Laplace approximation (Laplace, 1986) and the extended Kalman filter (Harvey, 1990) to approximate the likelihood function. This approximation is then used within a Metropolis-Hastings algorithm to estimate parameters in a nonlinear, non-Gaussian stochastic volatility model. While this likelihood approximation is analytical, it is only exact for linear, Gaussian SSMs. An alternative approach is the pseudo-marginal Metropolis-Hastings algorithm of Andrieu and Roberts (2009), which employs an unbiased likelihood estimate obtained through a particle filter. However, particle filters are computationally intensive and prone to particle degeneracy, especially when dealing with large datasets (see Kantas et al. (2009) for a detailed discussion of this issue). Both approaches typically require many iterations, each involving a full data pass with complexity $O(T)$, where $T$ is the number of observations, making them costly for long data sequences. Other alternatives, such as the Hamiltonian Monte Carlo (HMC) method (Duane et al., 1987; Neal, 2011; Hoffman and Gelman, 2014), target the joint posterior distribution of states and parameters and typically require many iterations to achieve parameter convergence when $T$ is large and the state dimension is high.

Variational Bayes (VB) methods are becoming increasingly popular for Bayesian inference for state space models (Zammit-Mangion et al., 2012; Gunawan et al., 2021; Frazier et al., 2023). VB approximates the posterior distribution of model parameters with a simpler and more tractable class of distributions constructed using so-called variational parameters, which are unknown and estimated via optimisation techniques. While VB methods tend to be more computationally efficient than MCMC methods, they provide approximate posterior distributions. For a review of VB methods, see Blei et al. (2017).

There are two main classes of VB methods: batch and sequential. A batch VB framework (e.g., Tran et al., 2017; Tan and Nott, 2018; Loaiza-Maya et al., 2022; Gunawan et al., 2024) requires an optimisation algorithm to repeatedly process the entire dataset until the variational parameters converge to a fixed point. A sequential VB framework, on the other hand, updates the variational parameters using one observation at a time. Sequential methods pass through the entire dataset only once, and are thus ideal for settings with large datasets. Sequential VB methods are available for statistical models for independent data (Lambert et al., 2022), for classical autoregressive models for time-series data (Tomasetti et al., 2022), for generalised linear mixed models (Vu et al., 2024), and for SSMs where the state evolution model is linear and Gaussian (Beal, 2003, Chapter 5; Zammit Mangion et al., 2011). Campbell et al. (2021) proposed a sequential VB approach that is applicable to an SSM with non-linear state dynamics and non-Gaussian measurement errors, but this approach can be computationally expensive as it requires solving an optimisation problem and a regression task at each time step.

Our article proposes a sequential VB approach to estimate parameters in linear non-Gaussian state space models. This approach builds on the Recursive Variational Gaussian Approximation (R-VGA) algorithm of Lambert et al. (2022). We first transform time-domain data into frequency-domain data and then use the Whittle likelihood (Whittle, 1953) in a sequential VB scheme. The logarithm of the Whittle likelihood function is a sum of individual log-likelihood components in Fourier space, and is thus well suited to sequential updating. We call our algorithm the Recursive Variational Gaussian Approximation with the Whittle Likelihood (R-VGA-Whittle) algorithm. At each iteration, R-VGA-Whittle updates the variational parameters by processing one frequency or a block of frequencies at a time. Although sequential in nature, and requiring only one pass through the frequencies, the R-VGA-Whittle algorithm requires the full dataset to be available in order to construct the periodogram, from which we obtain the frequency-domain data.

We demonstrate the performance of R-VGA-Whittle in simulation studies involving a linear Gaussian state space model and univariate and bivariate non-Gaussian stochastic volatility models. We show that R-VGA-Whittle is accurate and is much faster than HMC with the exact likelihood as well as HMC based on the Whittle likelihood; hereafter, we refer to these two methods as HMC-exact and HMC-Whittle, respectively. We subsequently apply R-VGA-Whittle to analyse exchange rate data.

The rest of the article is organised as follows. Section 2 gives an overview of SSMs and the univariate and multivariate Whittle likelihood functions, and then introduces the R-VGA-Whittle and HMC-Whittle algorithms. Section 3 compares the performance of R-VGA-Whittle, HMC-Whittle and HMC with the exact likelihood in several examples involving linear Gaussian and non-Gaussian SSMs. Section 4 summarises and discusses our main findings. The article also has an online supplement with additional technical details and empirical results.

2 Methodology

We give an overview of SSMs in Section 2.1, and the univariate and multivariate Whittle likelihoods in Section 2.2. Section 2.3 introduces R-VGA-Whittle, while Section 2.4 proposes a blocking approach to improve computational efficiency. Section 2.5 gives details of HMC-Whittle.

2.1 State space models

An SSM consists of two sub-models: a data model that describes the relationship between the observations and the latent state variables, and a state evolution model that describes the (temporal) evolution of the latent states.

In SSMs, the $d$-dimensional latent state variables $\{\mathbf{x}_{t}\}_{t=1}^{T}$ are characterised by an initial density $\mathbf{X}_{1}\sim p(\mathbf{x}_{1}\mid\bm{\theta})$ and state transition densities
$$\mathbf{X}_{t}\mid(\mathbf{X}_{t-1}=\mathbf{x}_{t-1})\sim p(\mathbf{x}_{t}\mid\mathbf{x}_{t-1},\bm{\theta}),\quad t=2,\dots,T,\qquad(1)$$
where $\bm{\theta}$ is a vector of model parameters. The observations $\{\mathbf{y}_{t}\}_{t=1}^{T}$ are related to the latent states via
$$\mathbf{Y}_{t}\mid(\mathbf{X}_{t}=\mathbf{x}_{t})\sim p(\mathbf{y}_{t}\mid\mathbf{x}_{t},\bm{\theta}),\quad t=1,\dots,T,\qquad(2)$$
where each $\mathbf{Y}_{t}$ is assumed to be conditionally independent of the other observations given the latent state $\mathbf{x}_{t}$. A linear SSM is one where $\mathbf{X}_{t}$ and $\mathbf{Y}_{t}$ in (1) and (2) can each be expressed as the sum of two components: a linear function of their conditioning variables and an exogenous random component.

Let $\mathbf{y}_{1:T}\equiv(\mathbf{y}_{1}^{\top},\dots,\mathbf{y}_{T}^{\top})^{\top}$ and $\mathbf{x}_{1:T}\equiv(\mathbf{x}_{1}^{\top},\dots,\mathbf{x}_{T}^{\top})^{\top}$ be vectors of observations and latent states, respectively. The posterior density of the model parameters is $p(\bm{\theta}\mid\mathbf{y}_{1:T})\propto p(\mathbf{y}_{1:T}\mid\bm{\theta})p(\bm{\theta})$, where the likelihood function is
$$p(\mathbf{y}_{1:T}\mid\bm{\theta})=\int p(\mathbf{y}_{1}\mid\mathbf{x}_{1},\bm{\theta})p(\mathbf{x}_{1}\mid\bm{\theta})\prod_{t=2}^{T}p(\mathbf{y}_{t}\mid\mathbf{x}_{t},\bm{\theta})p(\mathbf{x}_{t}\mid\mathbf{x}_{t-1},\bm{\theta})\,\mathrm{d}\mathbf{x}_{1:T}.$$

This likelihood function is intractable, except for special cases such as the linear Gaussian SSM (e.g., Gibson and Ninness, 2005). In non-Gaussian SSMs, the likelihood is typically estimated using a particle filter (Gordon et al., 1993), such as in the pseudo-marginal Metropolis-Hastings (PMMH) method of Andrieu and Roberts (2009), which can be computationally expensive for large $T$.

The Whittle likelihood (Whittle,, 1953) is a frequency-domain approximation of the exact likelihood, and offers a computationally efficient alternative for parameter inference in SSMs. Transforming the data to the frequency domain “bypasses” the need for integrating out the states when computing the likelihood, as the Whittle likelihood only depends on the model parameters and not the latent variables. In the following sections, we review the Whittle likelihood and its properties, and then propose a sequential VB approach that uses the Whittle likelihood for parameter inference.

2.2 The Whittle likelihood

The Whittle likelihood has been widely used to estimate the parameters of complex time series or spatial models. It is more computationally efficient to evaluate than the exact likelihood, although it is approximate (e.g., Sykulski et al., 2019). Some applications of the Whittle likelihood include parameter estimation for long-memory processes (e.g., Robinson, 1995; Hurvich and Chen, 2000; Giraitis and Robinson, 2001; Abadir et al., 2007; Shao and Wu, 2007b; Giraitis et al., 2012), and for spatial models with irregularly spaced data (e.g., Fuentes, 2007; Matsuda and Yajima, 2009). The Whittle likelihood has also been used for estimating the parameters of vector autoregressive moving average (VARMA) models with heavy-tailed errors (She et al., 2022), and of vector autoregressive tempered fractionally integrated moving average (VARTFIMA) models (Villani et al., 2022).

2.2.1 Univariate Whittle likelihood

Consider a real-valued, zero-mean, second-order stationary, univariate time series $\{y_{t}\}_{t=1}^{T}$ with autocovariance function $C(h;\bm{\theta})\equiv\mathbb{E}(y_{t}y_{t-h})$, where $0\leq h<t$ and $\mathbb{E}(\cdot)$ is the expectation operator. The power spectral density of the time series is the Fourier transform of the autocovariance function (Brillinger, 2001):
$$f(\omega;\bm{\theta})=\sum_{h=-\infty}^{\infty}C(h;\bm{\theta})\exp(-i\omega h),\qquad(3)$$
where $\omega\in(-\pi,\pi]$ is the angular frequency and $\bm{\theta}$ is a vector of model parameters. The periodogram, an asymptotically unbiased estimate of the spectral density, is given by
$$I(\omega_{k})=\frac{1}{T}\left\lvert J(\omega_{k})\right\rvert^{2},\quad k=-\left\lceil\frac{T}{2}\right\rceil+1,\dots,\left\lfloor\frac{T}{2}\right\rfloor,\qquad(4)$$
where $\omega_{k}=\frac{2\pi k}{T}$, and $J(\omega_{k})$ is the discrete Fourier transform (DFT) of $\{y_{t}\}_{t=1}^{T}$:
$$J(\omega_{k})=\sum_{t=1}^{T}y_{t}\exp(-i\omega_{k}t).\qquad(5)$$

For an observed time series, the Whittle log-likelihood, denoted by $l_{W}(\bm{\theta})$, is (e.g., Salomone et al., 2020)
$$l_{W}(\bm{\theta})=-\sum_{k=1}^{\left\lfloor\frac{T-1}{2}\right\rfloor}\left(\log f(\omega_{k};\bm{\theta})+\frac{I(\omega_{k})}{f(\omega_{k};\bm{\theta})}\right).\qquad(6)$$

Here, the Whittle likelihood is evaluated by summing over the positive frequencies only, since $f(\omega_{k};\bm{\theta})$ and $I(\omega_{k})$ are both symmetric about the origin for real-valued time series. The term $\omega_{k}=0$ is not included since $I(0)=0$ when the time series is de-meaned. The term $\omega_{k}=\pi$ is also omitted since it has a different distribution from the other frequencies and its influence is asymptotically negligible (Salomone et al., 2020; Villani et al., 2022).

Asymptotically, the parameter estimates obtained by maximising the exact likelihood and those obtained by maximising the Whittle likelihood are equivalent (Dzhaparidze and Yaglom, 1983). Calculating the Whittle likelihood involves evaluating the periodogram, which does not depend on the parameters $\bm{\theta}$ and can therefore be pre-computed at a cost of $O(T\log T)$ via the fast Fourier transform (FFT; Cooley and Tukey, 1965). Since the periodogram ordinates are asymptotically independent (Whittle, 1953), the logarithm of the Whittle likelihood is a sum of individual log-likelihood terms computed over different frequencies.
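As a concrete illustration of (4)–(6), the following Python sketch computes the periodogram via the FFT and evaluates the univariate Whittle log-likelihood for an AR(1) process, whose spectral density under convention (3) is $f(\omega;\phi,\sigma^{2})=\sigma^{2}/\lvert 1-\phi\exp(-i\omega)\rvert^{2}$. The AR(1) choice and all numerical settings are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def whittle_loglik_ar1(y, phi, sigma2):
    """Whittle log-likelihood (6) for a zero-mean AR(1) series.

    Under convention (3), the AR(1) spectral density is
    f(w) = sigma2 / |1 - phi * exp(-i w)|^2 (an assumption of this
    example, not a model from the paper's experiments).
    """
    T = len(y)
    J = np.fft.fft(y)                     # DFT (5) at w_k = 2*pi*k/T
    I = np.abs(J) ** 2 / T                # periodogram (4)
    k = np.arange(1, (T - 1) // 2 + 1)    # k = 1, ..., floor((T-1)/2)
    w = 2 * np.pi * k / T
    f = sigma2 / np.abs(1 - phi * np.exp(-1j * w)) ** 2
    return -np.sum(np.log(f) + I[k] / f)

# Simulate an AR(1) series, de-mean it, and compare the Whittle
# log-likelihood at the true parameters against a poor guess.
rng = np.random.default_rng(0)
T, phi, sigma2 = 2000, 0.9, 1.0
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi * y[t - 1] + rng.normal(scale=np.sqrt(sigma2))
y = y - y.mean()                          # de-mean so that I(0) = 0
ll_true = whittle_loglik_ar1(y, 0.9, 1.0)
ll_bad = whittle_loglik_ar1(y, 0.2, 1.0)
```

Since the periodogram is computed once up front, each likelihood evaluation afterwards costs only $O(T)$.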

2.2.2 Multivariate Whittle likelihood

Consider a $d$-variate, real-valued, second-order stationary, zero-mean time series $\{\mathbf{y}_{t}\}_{t=1}^{T}$ with autocovariance function $\mathbf{C}(h;\bm{\theta})\equiv\mathbb{E}(\mathbf{y}_{t}\mathbf{y}_{t-h}^{\top})$, $0\leq h<t$. At a given angular frequency $\omega$, the power spectral density matrix of this series is
$$\mathbf{f}(\omega;\bm{\theta})=\sum_{h=-\infty}^{\infty}\mathbf{C}(h;\bm{\theta})\exp(-i\omega h),\quad\omega\in(-\pi,\pi].\qquad(7)$$

The matrix-valued periodogram is
$$\mathbf{I}(\omega_{k})=\frac{1}{T}\mathbf{J}(\omega_{k})\mathbf{J}(\omega_{k})^{H},\quad k=-\left\lceil\frac{T}{2}\right\rceil+1,\dots,\left\lfloor\frac{T}{2}\right\rfloor,\qquad(8)$$
where $\omega_{k}=\frac{2\pi k}{T}$, $\mathbf{A}^{H}$ denotes the conjugate transpose of a complex matrix $\mathbf{A}$, and $\mathbf{J}(\omega_{k})$, the DFT of the series, is given by
$$\mathbf{J}(\omega_{k})=\sum_{t=1}^{T}\mathbf{y}_{t}\exp(-i\omega_{k}t).\qquad(9)$$

The Whittle log-likelihood for the multivariate series $\{\mathbf{y}_{t}\}_{t=1}^{T}$ is (Villani et al., 2022):
$$l_{W}(\bm{\theta})=-\sum_{k=1}^{\lfloor\frac{T-1}{2}\rfloor}\left(\log\lvert\mathbf{f}(\omega_{k};\bm{\theta})\rvert+\operatorname{Tr}\left(\mathbf{f}(\omega_{k};\bm{\theta})^{-1}\mathbf{I}(\omega_{k})\right)\right).\qquad(10)$$
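To make (8)–(10) concrete, the sketch below evaluates the multivariate Whittle log-likelihood in the simplest possible case: $d$-variate iid Gaussian noise, for which the spectral density matrix is constant, $\mathbf{f}(\omega)=\mathbf{\Sigma}$. This model and the numerical settings are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def whittle_loglik_iid(Y, Sigma):
    """Multivariate Whittle log-likelihood (10) in the simplest case:
    d-variate iid noise, whose spectral density matrix (7) is constant,
    f(w) = Sigma (an illustrative assumption). Y is a T x d array of
    zero-mean observations.
    """
    T, d = Y.shape
    J = np.fft.fft(Y, axis=0)                 # DFT (9), one column per series
    Sinv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    ll = 0.0
    for k in range(1, (T - 1) // 2 + 1):
        Ik = np.outer(J[k], J[k].conj()) / T  # matrix periodogram (8)
        ll -= logdet + np.trace(Sinv @ Ik).real
    return ll

# Bivariate example: the log-likelihood should prefer the true covariance
# to a badly scaled one.
rng = np.random.default_rng(1)
Sigma_true = np.array([[1.0, 0.5], [0.5, 1.0]])
Y = rng.multivariate_normal(np.zeros(2), Sigma_true, size=2000)
Y = Y - Y.mean(axis=0)
ll_true = whittle_loglik_iid(Y, Sigma_true)
ll_scaled = whittle_loglik_iid(Y, 4.0 * Sigma_true)
```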

2.3 R-VGA-Whittle

The original R-VGA algorithm of Lambert et al. (2022) is a sequential VB algorithm for parameter inference in statistical models where the observations are independent, such as linear or logistic regression models. Given independent and exchangeable observations $\{\mathbf{y}_{i}:i=1,\dots,N\}$, this algorithm sequentially approximates the "pseudo-posterior" distribution of the parameters given data up to the $i$th observation, $p(\bm{\theta}\mid\mathbf{y}_{1:i})$, with a Gaussian distribution $q_{i}(\bm{\theta})=\mathrm{Gau}(\bm{\mu}_{i},\mathbf{\Sigma}_{i})$. The updates for $\bm{\mu}_{i}$ and $\mathbf{\Sigma}_{i}$ are available in closed form, as shown in Algorithm 1. We describe this algorithm in more detail in Section S1 of the online supplement.

At each iteration $i=1,\dots,N$, the R-VGA algorithm requires the gradient and Hessian of the log-likelihood contribution from each observation, $\log p(\mathbf{y}_{i}\mid\bm{\theta})$. In a setting with independent observations, these likelihood contributions are easy to obtain; however, in SSMs, where the latent states follow a Markov process and are hence temporally dependent, the log-likelihood function is intractable, as shown in Section 2.1. The gradient and Hessian of $\log p(\mathbf{y}_{t}\mid\bm{\theta})$, $t=1,\dots,T$, are hence also intractable. While these quantities can be estimated via particle filtering (e.g., Doucet and Shephard, 2012), one would need to run a particle filter from the first time period to the current time period for every step in the sequential algorithm, which quickly becomes infeasible for long time series.

Algorithm 1 R-VGA (Lambert et al., 2022)
Input: independent observations $\mathbf{y}_{1},\dots,\mathbf{y}_{N}$, initial values $\bm{\mu}_{0}$ and $\mathbf{\Sigma}_{0}$.
Set $q_{0}(\bm{\theta})=\mathrm{Gau}(\bm{\mu}_{0},\mathbf{\Sigma}_{0})$.
for $i=1,\dots,N$ do
     $\bm{\mu}_{i}=\bm{\mu}_{i-1}+\mathbf{\Sigma}_{i}\,\mathbb{E}_{q_{i-1}}(\nabla_{\bm{\theta}}\log p(\mathbf{y}_{i}\mid\bm{\theta}))$
     $\mathbf{\Sigma}_{i}^{-1}=\mathbf{\Sigma}_{i-1}^{-1}-\mathbb{E}_{q_{i-1}}(\nabla_{\bm{\theta}}^{2}\log p(\mathbf{y}_{i}\mid\bm{\theta}))$
     Set $q_{i}(\bm{\theta})=\mathrm{Gau}(\bm{\mu}_{i},\mathbf{\Sigma}_{i})$.
end for
Output: variational parameters $\bm{\mu}_{i}$ and $\mathbf{\Sigma}_{i}$, for $i=1,\dots,N$.

We now extend the R-VGA algorithm of Lambert et al. (2022) to state space models by replacing the time-domain likelihood with a frequency-domain alternative: the Whittle likelihood. From (6) and (10) we see that the log of the Whittle likelihood is a sum over individual log-likelihood components, a consequence of the asymptotic independence of the periodogram frequency components (Shao and Wu, 2007a). We replace the quantities $\nabla_{\bm{\theta}}\log p(\mathbf{y}_{i}\mid\bm{\theta})$ and $\nabla_{\bm{\theta}}^{2}\log p(\mathbf{y}_{i}\mid\bm{\theta})$ in Algorithm 1 with the gradient and Hessian of the Whittle log-likelihood evaluated at each frequency. In the univariate case, the individual log-likelihood contribution at frequency $\omega_{k}$ is
$$l_{W_{k}}(\bm{\theta})=-\left(\log f(\omega_{k};\bm{\theta})+\frac{I(\omega_{k})}{f(\omega_{k};\bm{\theta})}\right),\quad k=1,\dots,\left\lfloor\frac{T-1}{2}\right\rfloor,\qquad(11)$$
while in the multivariate case,
$$l_{W_{k}}(\bm{\theta})=-\left(\log\lvert\mathbf{f}(\omega_{k};\bm{\theta})\rvert+\operatorname{Tr}\left(\mathbf{f}(\omega_{k};\bm{\theta})^{-1}\mathbf{I}(\omega_{k})\right)\right),\quad k=1,\dots,\left\lfloor\frac{T-1}{2}\right\rfloor,\qquad(12)$$
for $\omega_{k}=\frac{2\pi k}{T}$, so that $l_{W}(\bm{\theta})=\sum_{k}l_{W_{k}}(\bm{\theta})$. The log-likelihood contributions $l_{W_{k}}(\bm{\theta})$ are available for models where the power spectral density is available analytically, such as models where the latent states follow a VARMA (She et al., 2022) or a VARTFIMA (Villani et al., 2022) process. The gradient $\nabla_{\bm{\theta}}l_{W_{k}}(\bm{\theta})$ and Hessian $\nabla_{\bm{\theta}}^{2}l_{W_{k}}(\bm{\theta})$ are therefore also available in closed form for these models. At each VB update, $\nabla_{\bm{\theta}}l_{W_{k}}(\bm{\theta})$ and $\nabla_{\bm{\theta}}^{2}l_{W_{k}}(\bm{\theta})$ are used to update the variational parameters, for $k=1,\dots,\left\lfloor\frac{T-1}{2}\right\rfloor$. Their expectations with respect to $q_{k-1}(\bm{\theta})$ are approximated using Monte Carlo samples $\bm{\theta}^{(s)}\sim q_{k-1}(\bm{\theta})$, $s=1,\dots,S$:
$$\mathbb{E}_{q_{k-1}}(\nabla_{\bm{\theta}}l_{W_{k}}(\bm{\theta}))\approx\frac{1}{S}\sum_{s=1}^{S}\nabla_{\bm{\theta}}l_{W_{k}}(\bm{\theta}^{(s)}),\quad\text{and}\quad\mathbb{E}_{q_{k-1}}(\nabla_{\bm{\theta}}^{2}l_{W_{k}}(\bm{\theta}))\approx\frac{1}{S}\sum_{s=1}^{S}\nabla_{\bm{\theta}}^{2}l_{W_{k}}(\bm{\theta}^{(s)}),$$
for $k=1,\dots,\left\lfloor\frac{T-1}{2}\right\rfloor$. We set $S$ following an experiment on the sensitivity of the algorithm to the choice of $S$; see Section S4 of the online supplement for more details.

We also implement the damping approach of Vu et al. (2024) to mitigate instability that may arise when traversing the first few frequencies. Damping splits an update into $D$ steps, where $D$ is user-specified. In each step, the gradient and Hessian of the Whittle log-likelihood are multiplied by a "step size" $a=\frac{1}{D}$, and the variational parameters are updated $D$ times during one VB iteration. This has the effect of splitting a log-likelihood contribution into $D$ "parts", and using them to update the variational parameters one part at a time. In practice, we set $D$ through experimentation, choosing it so that initial algorithmic instability is reduced with only a small increase in computational cost. We summarise R-VGA-Whittle in Algorithm 2.
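The effect of a damped update can be seen on a toy problem. In the sketch below, a single Gaussian "log-likelihood contribution" $l(\theta)=-(\theta-m)^{2}/(2v)$ is split into $D$ parts with step size $a=1/D$; for this linear-Gaussian toy case the damped pass recovers the exact conjugate posterior up to Monte Carlo error. All settings ($m$, $v$, $D$, $S$) are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy log-likelihood contribution l(theta) = -(theta - m)^2 / (2 v):
# gradient -(theta - m)/v, Hessian -1/v. Illustrative values only.
m, v = 2.0, 0.5
D, S = 4, 200                 # damping steps and Monte Carlo samples
a = 1.0 / D                   # damped "step size"

mu, prec = 0.0, 1.0           # q_0 = Gau(0, 1), scalar case
for _ in range(D):
    theta = rng.normal(mu, 1.0 / np.sqrt(prec), size=S)
    grad = -(theta - m) / v
    hess = np.full(S, -1.0 / v)
    prec = prec - a * hess.mean()      # precision update with step size a
    mu = mu + a * grad.mean() / prec   # mean update with step size a

# With prior Gau(0, 1) and likelihood Gau(m, v), the exact posterior is
# Gau(4/3, 1/3); the damped updates reach it up to Monte Carlo noise.
```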

Algorithm 2 Damped R-VGA-Whittle
Input: time series observations $\mathbf{y}_{1},\dots,\mathbf{y}_{T}$, initial values $\bm{\mu}_{0}$ and $\mathbf{\Sigma}_{0}$, number of frequencies to damp $n_{damp}$, number of damping steps $D$.
Compute the periodogram $\mathbf{I}(\omega_{k})$, $k=-\left\lceil\frac{T}{2}\right\rceil+1,\dots,\left\lfloor\frac{T}{2}\right\rfloor$, via the FFT algorithm.
Set $q_{0}(\bm{\theta})=\mathrm{Gau}(\bm{\mu}_{0},\mathbf{\Sigma}_{0})$.
for $k=1,\dots,\lfloor\frac{T-1}{2}\rfloor$ do
     if $k\leq n_{damp}$ then
          Set $a=1/D$, $\bm{\mu}_{k,0}=\bm{\mu}_{k-1}$, $\mathbf{\Sigma}_{k,0}=\mathbf{\Sigma}_{k-1}$
          for $l=1,\dots,D$ do
               Set $q_{k,l-1}(\bm{\theta})=\mathrm{Gau}(\bm{\mu}_{k,l-1},\mathbf{\Sigma}_{k,l-1})$.
               $\bm{\mu}_{k,l}=\bm{\mu}_{k,l-1}+a\mathbf{\Sigma}_{k,l}\,\mathbb{E}_{q_{k,l-1}}(\nabla_{\bm{\theta}}l_{W_{k}}(\bm{\theta}))$
               $\mathbf{\Sigma}_{k,l}^{-1}=\mathbf{\Sigma}_{k,l-1}^{-1}-a\,\mathbb{E}_{q_{k,l-1}}(\nabla_{\bm{\theta}}^{2}l_{W_{k}}(\bm{\theta}))$
          end for
          Set $\bm{\mu}_{k}=\bm{\mu}_{k,D}$, $\mathbf{\Sigma}_{k}=\mathbf{\Sigma}_{k,D}$, $q_{k}(\bm{\theta})=\mathrm{Gau}(\bm{\mu}_{k},\mathbf{\Sigma}_{k})$.
     else
          $\bm{\mu}_{k}=\bm{\mu}_{k-1}+\mathbf{\Sigma}_{k}\,\mathbb{E}_{q_{k-1}}(\nabla_{\bm{\theta}}l_{W_{k}}(\bm{\theta}))$
          $\mathbf{\Sigma}_{k}^{-1}=\mathbf{\Sigma}_{k-1}^{-1}-\mathbb{E}_{q_{k-1}}(\nabla_{\bm{\theta}}^{2}l_{W_{k}}(\bm{\theta}))$
          Set $q_{k}(\bm{\theta})=\mathrm{Gau}(\bm{\mu}_{k},\mathbf{\Sigma}_{k})$.
     end if
end for
Output: variational parameters $\bm{\mu}_{k}$ and $\mathbf{\Sigma}_{k}$, for $k=1,\dots,\lfloor\frac{T-1}{2}\rfloor$.
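The undamped branch of Algorithm 2 can be sketched in a few lines for a scalar parameter. Assuming, purely for illustration, an AR(1) model with known $\phi$ and unknown $\theta=\log\sigma^{2}$ (not one of the paper's experimental settings), the per-frequency contribution $-(\log f(\omega_{k};\theta)+I(\omega_{k})/f(\omega_{k};\theta))$ has an analytic gradient and Hessian, and one pass over the frequencies recovers the innovation variance:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate AR(1) data with known phi; the single unknown is theta = log(sigma^2).
T, phi, sigma2_true = 5000, 0.8, 0.5
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi * y[t - 1] + rng.normal(scale=np.sqrt(sigma2_true))
y = y - y.mean()

# Frequency-domain quantities, precomputed once via the FFT.
J = np.fft.fft(y)
I = np.abs(J) ** 2 / T
k = np.arange(1, (T - 1) // 2 + 1)
w = 2 * np.pi * k / T
g = 1.0 / np.abs(1 - phi * np.exp(-1j * w)) ** 2  # f(w; theta) = exp(theta) * g(w)
r = I[k] / g                                      # "whitened" periodogram ordinates

# One pass over the frequencies. The contribution at w_k is, up to a constant,
# -(theta + log g_k + r_k * exp(-theta)), so the gradient in theta is
# r_k * exp(-theta) - 1 and the Hessian is -r_k * exp(-theta).
mu, prec = 0.0, 1.0   # variational mean and precision (Sigma_0 = 1)
S = 100               # Monte Carlo samples for the expectations
for rk in r:
    theta = rng.normal(mu, 1.0 / np.sqrt(prec), size=S)
    grad = rk * np.exp(-theta) - 1.0
    hess = -rk * np.exp(-theta)
    prec = prec - hess.mean()      # precision update
    mu = mu + grad.mean() / prec   # mean update

sigma2_hat = np.exp(mu)            # should be close to sigma2_true = 0.5
```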

2.4 R-VGA-Whittle with block updates

The spectral density of a process captures the distribution of signal power (variance) across frequency (Percival and Walden, 1993). The periodogram is an estimate of the spectral density. In Figure 1(a), we show the periodogram for data generated from the linear Gaussian SSM that we consider in Section 3.1. The variance (power) of the signal gradually decreases as the frequency increases, which means that lower frequencies contribute more to the total power of the process. A similar pattern is seen in Figure 1(b), which shows the periodogram for data generated from a univariate stochastic volatility model that we consider in Section 3.2. In this case, there is an even larger concentration of power at the lower frequencies.

Figure 1: Raw periodograms of the data (black) and smoothed periodograms (light red) for the two univariate models specified in Section 3: (a) the linear Gaussian model, and (b) the univariate SV model. Vertical red dashed lines show the frequency cutoffs at half-power (3dB). Frequencies beyond this cutoff are processed in blocks.

Figure 2: Raw periodograms (black) and smoothed periodograms (light red) for (a) Variable 1 and (b) Variable 2 of the bivariate stochastic volatility model specified in Section 3. Vertical dashed lines show the 3dB frequency cutoff points for each series. In this bivariate case, the cutoff point we use for R-VGA-Whittle is the higher of the two.

Frequencies that carry less power have less effect on the R-VGA-Whittle updates. To speed up the algorithm and reduce the number of gradient/Hessian calculations, we therefore process angular frequency contributions with low power, namely those beyond the 3dB cutoff point, together in "blocks". We define the 3dB cutoff point as the angular frequency at which the power drops to half the maximum power in the (smoothed) periodogram, which we obtain using Welch's method (Welch, 1967). Section S3 of the online supplement studies several cutoff points and empirically justifies our choice of a 3dB cutoff.

Assume the 3dB cutoff point occurs at $\omega_{\tilde{n}}$; we split the remaining $\lfloor\frac{T-1}{2}\rfloor-\tilde{n}$ angular frequencies into blocks of approximately equal length $B$. Suppose that there are $n_{b}$ blocks, with $\Omega_{b}$, $b=1,\dots,n_{b}$, denoting the set of angular frequencies in the $b$th block. The variational mean and precision matrix are then updated based on each block's Whittle log-likelihood contribution. For $b=1,\dots,n_{b}$, this is given by
$$l_{W}^{b}(\bm{\theta})=\sum_{k\in\Omega_{b}}l_{W_{k}}(\bm{\theta}),\qquad(13)$$
that is, by the sum of the Whittle log-likelihood contributions of the frequencies in the block.

For a multivariate time series, we compute the periodogram for each individual time series along with its corresponding 3dB cutoff point, and then use the highest cutoff point in R-VGA-Whittle. An example of bivariate time series periodograms along with their 3dB cutoff points is shown in Figure 2.

We summarise R-VGA-Whittle with block updates in Algorithm 3.

Algorithm 3 Block updates for R-VGA with the Whittle likelihood
Input: observations $\mathbf{y}_{1},\dots,\mathbf{y}_{T}$, initial values $\bm{\mu}_{0}$ and $\mathbf{\Sigma}_{0}$, number of angular frequencies to damp $n_{damp}$, number of damping steps $D$, number of angular frequencies to update individually $\tilde{n}>n_{damp}$, block length $B$.
Step 1: Compute the periodogram $\mathbf{I}(\omega_{k})$, $k=-\left\lceil\frac{T}{2}\right\rceil+1,\dots,\left\lfloor\frac{T}{2}\right\rfloor$, via the FFT algorithm.
Step 2: Divide the last $\lfloor\frac{T-1}{2}\rfloor-\tilde{n}$ frequencies into $n_{b}$ blocks of length (approximately) $B$.
Step 3: Set $q_{0}(\bm{\theta})=\mathrm{Gau}(\bm{\mu}_{0},\mathbf{\Sigma}_{0})$.
for $\tilde{k}=1,\dots,\tilde{n}$ do
     Update $\bm{\mu}_{\tilde{k}},\mathbf{\Sigma}_{\tilde{k}}$ as in Algorithm 2
end for
for $b=1,\dots,n_{b}$ do
     Set $\tilde{k}=\tilde{n}+b$
     $\bm{\mu}_{\tilde{k}}=\bm{\mu}_{\tilde{k}-1}+\mathbf{\Sigma}_{\tilde{k}}\,\mathbb{E}_{q_{\tilde{k}-1}}(\nabla_{\bm{\theta}}l_{W}^{b}(\bm{\theta}))$, where $l_{W}^{b}(\bm{\theta})$ is given in (13)
     $\mathbf{\Sigma}_{\tilde{k}}^{-1}=\mathbf{\Sigma}_{\tilde{k}-1}^{-1}-\mathbb{E}_{q_{\tilde{k}-1}}(\nabla_{\bm{\theta}}^{2}l_{W}^{b}(\bm{\theta}))$
end for
Output: variational parameters $\bm{\mu}_{\tilde{k}}$ and $\mathbf{\Sigma}_{\tilde{k}}$, for $\tilde{k}=1,\dots,\tilde{n}+n_{b}$.
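Steps 1–2 of Algorithm 3 can be sketched as follows. Here a simple moving average stands in for Welch's method when smoothing the periodogram, and the block length is an illustrative choice; both are assumptions of this sketch rather than the paper's settings.

```python
import numpy as np

def block_partition(I_pos, B=50, smooth=51):
    """Locate the half-power (3dB) cutoff on a smoothed periodogram and
    split the remaining frequencies into blocks of length ~B.

    I_pos holds the periodogram at the positive frequencies
    k = 1, ..., floor((T-1)/2). A moving average stands in for Welch's
    method (an assumption of this sketch).
    """
    kern = np.ones(smooth) / smooth
    sm = np.convolve(I_pos, kern, mode="same")   # smoothed periodogram
    imax = int(np.argmax(sm))
    below = np.nonzero(sm[imax:] < sm.max() / 2.0)[0]
    n_tilde = (imax + int(below[0])) if below.size else len(I_pos)
    rest = np.arange(n_tilde, len(I_pos))        # frequencies beyond the cutoff
    n_blocks = max(1, len(rest) // B)
    return n_tilde, np.array_split(rest, n_blocks)

# Example with an AR(1) series, whose power is concentrated at low frequencies.
rng = np.random.default_rng(3)
T, phi = 2000, 0.9
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi * y[t - 1] + rng.normal()
y = y - y.mean()
I_pos = (np.abs(np.fft.fft(y)) ** 2 / T)[1:(T - 1) // 2 + 1]
n_tilde, blocks = block_partition(I_pos)
```

The first `n_tilde` frequencies would then be processed one at a time (with damping for the first few), and each block in `blocks` would contribute a single update via (13).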

2.5 HMC-Whittle

We now discuss the use of HMC with the Whittle likelihood for estimating parameters in SSMs. Suppose we want to sample from the posterior distribution $p(\bm{\theta}\mid\mathbf{y}_{1:T})$. We introduce an auxiliary momentum variable $\mathbf{r}$ of the same length as $\bm{\theta}$ and assume $\mathbf{r}$ comes from a density $\mathrm{Gau}(\mathbf{0},\mathbf{M})$, where $\mathbf{M}$ is a mass matrix that is often a multiple of the identity matrix. The joint density of $\bm{\theta}$ and $\mathbf{r}$ is $p(\bm{\theta},\mathbf{r}\mid\mathbf{y}_{1:T})\propto\exp(-H(\bm{\theta},\mathbf{r}))$, where
$$H(\bm{\theta},\mathbf{r})=-\mathcal{L}(\bm{\theta})+\frac{1}{2}\mathbf{r}^{\top}\mathbf{M}^{-1}\mathbf{r}\qquad(14)$$
is called the Hamiltonian, and $\mathcal{L}(\bm{\theta})=\log p(\bm{\theta}\mid\mathbf{y}_{1:T})$. HMC proceeds by moving $\bm{\theta}$ and $\mathbf{r}$ according to the Hamiltonian equations
$$\frac{\mathrm{d}\bm{\theta}}{\mathrm{d}t}=\frac{\partial H}{\partial\mathbf{r}}=\mathbf{M}^{-1}\mathbf{r},\quad\frac{\mathrm{d}\mathbf{r}}{\mathrm{d}t}=-\frac{\partial H}{\partial\bm{\theta}}=\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta}),\qquad(15)$$
to obtain new proposed values $\bm{\theta}^{*}$ and $\mathbf{r}^{*}$. By Bayes' theorem, $p(\bm{\theta}\mid\mathbf{y}_{1:T})\propto p(\mathbf{y}_{1:T}\mid\bm{\theta})p(\bm{\theta})$, so the gradient $\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta})$ can be written as
$$\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta})=\nabla_{\bm{\theta}}\log p(\mathbf{y}_{1:T}\mid\bm{\theta})+\nabla_{\bm{\theta}}\log p(\bm{\theta}).\qquad(16)$$

The gradient $\nabla_{\bm{\theta}}\log p(\mathbf{y}_{1:T}\mid\bm{\theta})$ on the right-hand side of (16) is typically intractable in SSMs and needs to be estimated using a particle filter (e.g., Nemeth et al., 2016). However, if the exact likelihood is replaced with the Whittle likelihood, an approximation to $\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta})$ can be obtained without the need to run an expensive particle filter repeatedly. Rather than targeting the true posterior distribution $p(\bm{\theta}\mid\mathbf{y}_{1:T})$, HMC-Whittle targets an approximation $\tilde{p}(\bm{\theta}\mid\mathbf{y}_{1:T})\propto p_{W}(\mathbf{y}_{1:T}\mid\bm{\theta})p(\bm{\theta})$, where $p_{W}(\mathbf{y}_{1:T}\mid\bm{\theta})\equiv\exp(l_{W}(\bm{\theta}))$ is the Whittle likelihood as defined in (6) for a univariate time series, or (10) for a multivariate time series. Then $\mathcal{L}(\bm{\theta})$ in (14) is replaced by $\mathcal{L}_{W}(\bm{\theta})=\log\tilde{p}(\bm{\theta}\mid\mathbf{y}_{1:T})$, and the term $\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta})$ in (15) is replaced with
$$\nabla_{\bm{\theta}}\mathcal{L}_{W}(\bm{\theta})=\nabla_{\bm{\theta}}l_{W}(\bm{\theta})+\nabla_{\bm{\theta}}\log p(\bm{\theta}).\qquad(17)$$

For models where the spectral density is available in closed form, the Whittle likelihood is also available in closed form, and the gradient $\nabla_{\bm{\theta}}\mathcal{L}_{W}(\bm{\theta})$ can be obtained analytically.

In practice, the continuous-time Hamiltonian dynamics in (15) are simulated over LL discrete time steps of size ϵ\epsilon using a leapfrog integrator (see Algorithm 4). The resulting 𝜽\bm{\theta}^{*} and 𝐫\mathbf{r}^{*} from this integrator are then used as proposals and are accepted with probability

min(1,exp(−H(𝜽,𝐫))exp(−H(𝜽,𝐫))).\min\left(1,\frac{\exp(-H(\bm{\theta}^{*},\mathbf{r}^{*}))}{\exp(-H(\bm{\theta},\mathbf{r}))}\right).

This process of proposing and accepting/rejecting is repeated until the desired number of posterior samples has been obtained. We summarise HMC-Whittle in Algorithm 5.

Algorithm 4 One step of the Leapfrog algorithm, Leapfrog(𝜽,𝐫,ϵ\bm{\theta},\mathbf{r},\epsilon)
Input: parameters 𝜽\bm{\theta}, 𝐫\mathbf{r}, leapfrog step size ϵ\epsilon.
Set 𝐫=𝐫+ϵ𝜽(𝜽)/2\mathbf{r}^{*}=\mathbf{r}+\epsilon\nabla_{\bm{\theta}}\mathcal{L}\left({\bm{\theta}}\right)/2,
Set 𝜽=𝜽+ϵ𝑴1𝐫\bm{\theta}^{*}={\bm{\theta}}+\epsilon\bm{{M}}^{-1}\mathbf{r}^{*},
Set 𝐫=𝐫+ϵ𝜽(𝜽)/2\mathbf{r}^{*}=\mathbf{r}^{*}+\epsilon\nabla_{\bm{\theta}}\mathcal{L}\left(\bm{\theta}^{*}\right)/2.
return 𝜽\bm{\theta}^{*}, 𝐫\mathbf{r}^{*}.
Algorithm 5 HMC-Whittle
Input: observations 𝐲1,,𝐲T\mathbf{y}_{1},\dots,\mathbf{y}_{T}, prior distribution p(𝜽)p(\bm{\theta}), mass matrix 𝐌\mathbf{M}, number of HMC iterations JJ, leapfrog step size ϵ\epsilon, number of leapfrog steps LL.
Sample 𝜽(0)p(𝜽)\bm{\theta}^{(0)}\sim p(\bm{\theta}).
for j=1,,Jj=1,\dots,J do
     Sample 𝐫(0)Gau(𝟎,𝐌)\mathbf{r}^{(0)}\sim\mathrm{Gau}(\mathbf{0},\mathbf{M}).
     Set 𝜽=𝜽(j1)\bm{\theta}^{*}=\bm{\theta}^{(j-1)}, 𝐫=𝐫(0)\mathbf{r}^{*}=\mathbf{r}^{(0)}.
     for l=1,,Ll=1,...,L do
         (𝜽,𝐫)Leapfrog(𝜽,𝐫,ϵ)(\bm{\theta}^{*},\mathbf{r}^{*})\leftarrow\text{Leapfrog}(\bm{\theta}^{*},\mathbf{r}^{*},\epsilon). (see Algorithm 4)
     end for
     With probability
α=min(1,exp(W(𝜽)12𝐫𝐌1𝐫)exp(W(𝜽(j1))12𝐫(0)𝐌1𝐫(0))),\alpha=\text{min}\left(1,\frac{\exp\left(\mathcal{L}_{W}(\bm{\theta}^{*})-\frac{1}{2}\mathbf{r}^{*\top}\mathbf{M}^{-1}\mathbf{r}^{*}\right)}{\exp\left(\mathcal{L}_{W}(\bm{\theta}^{(j-1)})-\frac{1}{2}\mathbf{r}^{(0)\top}\mathbf{M}^{-1}\mathbf{r}^{(0)}\right)}\right),
     set 𝜽(j)=𝜽\bm{\theta}^{(j)}=\bm{\theta}^{*}, else 𝜽(j)=𝜽(j1)\bm{\theta}^{(j)}=\bm{\theta}^{(j-1)}.
end for
Output: posterior samples 𝜽(1),,𝜽(J)\bm{\theta}^{(1)},\dots,\bm{\theta}^{(J)}.
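For concreteness, Algorithms 4 and 5 can be sketched as follows. This is an illustrative Python translation, not the paper's implementation (which uses Stan via CmdStanR); here log_post stands for the log of the Whittle pseudo-posterior W(𝜽)\mathcal{L}_{W}(\bm{\theta}) and grad_log_post for its gradient (17), and the function names are our own.

```python
import numpy as np

def leapfrog(theta, r, eps, grad_log_post, M_inv):
    """One leapfrog step (Algorithm 4): half momentum, full position, half momentum."""
    r = r + 0.5 * eps * grad_log_post(theta)
    theta = theta + eps * M_inv @ r
    r = r + 0.5 * eps * grad_log_post(theta)
    return theta, r

def hmc_whittle(theta0, log_post, grad_log_post, M, eps, L, J, rng):
    """HMC targeting log_post (Algorithm 5). Returns J samples, one per iteration."""
    d = len(theta0)
    M_inv = np.linalg.inv(M)
    samples = np.empty((J, d))
    theta = np.asarray(theta0, dtype=float)
    for j in range(J):
        r0 = rng.multivariate_normal(np.zeros(d), M)   # refresh momentum
        theta_star, r_star = theta.copy(), r0.copy()
        for _ in range(L):                              # simulate Hamiltonian dynamics
            theta_star, r_star = leapfrog(theta_star, r_star, eps, grad_log_post, M_inv)
        # Metropolis accept/reject with H(theta, r) = -log_post(theta) + r' M^{-1} r / 2
        log_alpha = (log_post(theta_star) - 0.5 * r_star @ M_inv @ r_star) \
                  - (log_post(theta) - 0.5 * r0 @ M_inv @ r0)
        if np.log(rng.uniform()) < log_alpha:
            theta = theta_star
        samples[j] = theta
    return samples
```

The sketch can be exercised on any closed-form log density, e.g. a standard normal, before plugging in the Whittle log-likelihood and prior.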

3 Applications

This section demonstrates the use of R-VGA-Whittle for estimating parameters of a univariate linear Gaussian SSM and those of a univariate/bivariate stochastic volatility (SV) model using simulated and real data. We compare the posterior densities obtained from R-VGA-Whittle to those from HMC-exact and HMC-Whittle. All examples are implemented in R, and the reproducible code is available from https://github.com/bao-anh-vu/R-VGA-Whittle. HMC-exact and HMC-Whittle are implemented using the CmdStanR interface to the Stan programming language (Gabry et al.,, 2024).

For all examples, we implement R-VGA-Whittle with damping and blocking as described in Algorithm 3. We apply damping to the first ndamp=5n_{damp}=5 angular frequencies with D=100D=100 damping steps each. These values were chosen to reduce the initial instability at the expense of a small additional computational cost. We select n~\tilde{n} for each model by finding the 3dB frequency cutoff and process the remaining frequencies in blocks of length B=100B=100, following a study in Section S3 of the online supplement. At each iteration of R-VGA-Whittle, we use S=1000S=1000 Monte Carlo samples to approximate the expectations of the gradient and Hessian of the Whittle log-likelihood. This value of SS was chosen because it results in very little variation between repeated runs of R-VGA-Whittle; see Section S4 of the online supplement for an experiment exploring the algorithm’s sensitivity to SS. HMC-Whittle and HMC-exact were run with two chains with 15000 samples each. The first 5000 samples of each chain were discarded as burn-in. All experiments were carried out on a high-end server, with an NVIDIA Tesla V100 32GB graphics processing unit (GPU) used to parallelise over the S=1000S=1000 Monte Carlo samples needed for estimating the expectations of the gradient and Hessian of the Whittle log-likelihood at each iteration of Algorithm 2.

3.1 Univariate linear Gaussian SSM

We generate T=10000T=10000 observations from the linear Gaussian SSM,

yt\displaystyle y_{t} =xt+ϵt,ϵtGau(0,σϵ2),t=1,,T,\displaystyle=x_{t}+\epsilon_{t},\quad\epsilon_{t}\sim\mathrm{Gau}(0,\sigma_{\epsilon}^{2}),\quad t=1,\dots,T, (18)
xt\displaystyle x_{t} =ϕxt1+ηt,ηtGau(0,ση2),t=2,,T,\displaystyle=\phi x_{t-1}+\eta_{t},\quad\eta_{t}\sim\mathrm{Gau}(0,\sigma_{\eta}^{2}),\quad t=2,\dots,T, (19)
x1\displaystyle x_{1} Gau(0,ση21ϕ2).\displaystyle\sim\mathrm{Gau}\left(0,\frac{\sigma_{\eta}^{2}}{1-\phi^{2}}\right). (20)

We set the true auto-correlation parameter to ϕ=0.9\phi=0.9, the true state disturbance standard deviation to ση=0.7\sigma_{\eta}=0.7 and the true measurement error standard deviation to σϵ=0.5\sigma_{\epsilon}=0.5.

The parameters that need to be estimated, ϕ,ση\phi,\sigma_{\eta} and σϵ\sigma_{\epsilon}, take values in subsets of \mathbb{R}. We therefore instead estimate the transformed parameters 𝜽=(θϕ,θη,θϵ){\bm{\theta}=(\theta_{\phi},\theta_{\eta},\theta_{\epsilon})^{\top}}, where θϕ=tanh1(ϕ)\theta_{\phi}=\tanh^{-1}(\phi), θη=log(ση2)\theta_{\eta}=\log(\sigma_{\eta}^{2}), and θϵ=log(σϵ2)\theta_{\epsilon}=\log(\sigma_{\epsilon}^{2}), which are unconstrained.

The spectral density of the observed time series {yt}t=1T\{y_{t}\}_{t=1}^{T} is

fy(ωk;𝜽)\displaystyle f_{y}(\omega_{k};\bm{\theta}) =fx(ωk;𝜽)+fϵ(ωk;𝜽),ωk=2πkT,k=T2+1,,T2,\displaystyle=f_{x}(\omega_{k};\bm{\theta})+f_{\epsilon}(\omega_{k};\bm{\theta}),\quad\omega_{k}=\frac{2\pi k}{T},\quad k=-\left\lceil\frac{T}{2}\right\rceil+1,\dots,\left\lfloor\frac{T}{2}\right\rfloor, (21)

where

fx(ωk;𝜽)=ση21+ϕ22ϕcos(ωk)f_{x}(\omega_{k};\bm{\theta})=\frac{\sigma_{\eta}^{2}}{1+\phi^{2}-2\phi\cos(\omega_{k})}

is the spectral density of the AR(1) process {xt}t=1T\{x_{t}\}_{t=1}^{T}, and

fϵ(ωk;𝜽)=σϵ2f_{\epsilon}(\omega_{k};\bm{\theta})=\sigma_{\epsilon}^{2}

is the spectral density of the white noise process {ϵt}t=1T\{\epsilon_{t}\}_{t=1}^{T}. The Whittle log-likelihood for this model is given by (6), where the periodogram {I(ωk):k=T2+1,,T2}\{I(\omega_{k}):k=-\left\lceil\frac{T}{2}\right\rceil+1,\dots,\left\lfloor\frac{T}{2}\right\rfloor\} is the discrete Fourier transform (DFT) of the observations {yt}t=1T\{y_{t}\}_{t=1}^{T}, which we compute using the fft function in R. We evaluate the gradient and Hessian of the Whittle log-likelihood at each frequency, θlWk(𝜽)\nabla_{\theta}l_{W_{k}}(\bm{\theta}) and θ2lWk(𝜽)\nabla_{\theta}^{2}l_{W_{k}}(\bm{\theta}), using automatic differentiation via the tensorflow library (Abadi et al.,, 2015) in R. The initial/prior distribution we use is

p(𝜽)=q0(𝜽)=Gau([011],[100010001]).p(\bm{\theta})=q_{0}(\bm{\theta})=\mathrm{Gau}\left(\begin{bmatrix}0\\ -1\\ -1\end{bmatrix},\begin{bmatrix}1&0&0\\ 0&1&0\\ 0&0&1\\ \end{bmatrix}\right). (22)

This leads to 95% probability intervals of (0.96,0.96)(-0.96,0.96) for ϕ\phi, and (0.23,1.59)(0.23,1.59) for ση\sigma_{\eta} and σϵ\sigma_{\epsilon}.
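For concreteness, the Whittle log-likelihood of this model can be evaluated directly from the periodogram. The sketch below is in Python rather than the paper's R/tensorflow implementation; it assumes the periodogram convention I(ωk)=|DFT(𝐲)k|2/TI(\omega_k)=|\mathrm{DFT}(\mathbf{y})_k|^2/T, matched to the spectral density (21) (which omits a 2π2\pi factor), with the sum taken over the positive Fourier frequencies, and gives the Whittle log-likelihood up to an additive constant.

```python
import numpy as np

def whittle_loglik(y, phi, sigma_eta, sigma_eps):
    """Whittle log-likelihood for the AR(1)-plus-noise model (18)-(20),
    up to an additive constant, summed over positive Fourier frequencies."""
    T = len(y)
    I = np.abs(np.fft.fft(y)) ** 2 / T          # periodogram (assumed convention)
    k = np.arange(1, T // 2 + 1)                # positive Fourier frequencies
    omega = 2 * np.pi * k / T
    f_x = sigma_eta**2 / (1 + phi**2 - 2 * phi * np.cos(omega))   # AR(1) spectrum
    f_y = f_x + sigma_eps**2                    # spectral density (21)
    return -np.sum(np.log(f_y) + I[k] / f_y)
```

Evaluated on data simulated from (18)–(20), the log-likelihood is higher at the true parameters than at badly misspecified ones, as expected.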

Figure 3 shows the trajectory of the R-VGA-Whittle variational mean for each parameter. There are only 226 R-VGA-Whittle updates, as the 3dB cutoff point is the 176th angular frequency, and there are 50 blocks thereafter. The figure shows that the trajectory of ϕ\phi moves towards the true value within the first 10 updates. This is expected, since most of the information on ϕ\phi, which is reasonably large in our example, is in the low-frequency components of the signal. The trajectories of ση\sigma_{\eta} and σϵ\sigma_{\epsilon} are more erratic before settling close to the true values over the last 25 (block) updates. This later convergence is not unexpected, since high-frequency log-likelihood components contain non-negligible information on these parameters, which describe fine components of variation (the spectral density of white noise is a constant function of frequency). This behaviour is especially apparent for the measurement error standard deviation σϵ\sigma_{\epsilon}.

Figures 8(a) and 8(b) in Section S5 of the online supplement show the trace plots for each parameter obtained from HMC-exact and HMC-Whittle. Trace plots from both methods show good mixing for all three parameters but, for ση\sigma_{\eta} and σϵ\sigma_{\epsilon} in particular, HMC-Whittle has slightly better mixing than HMC-exact.

Figure 4 shows the marginal posterior distributions of the parameters, along with bivariate posterior distributions. The three methods yield very similar posterior distributions for all parameters. We also show in Section S3 of the online supplement that the posterior densities obtained from R-VGA-Whittle using block size B=100B=100 are highly similar to those obtained without block updates. This suggests that block updates are useful for speeding up R-VGA-Whittle without compromising the accuracy of the approximate posterior densities.

Refer to caption
Figure 3: Trajectories of the variational means for parameters in the linear Gaussian SSM with simulated data. True parameter values are marked with dashed lines. The vertical line is at n~\tilde{n}, the point beyond which updates are done in blocks.
Refer to caption
Figure 4: Posterior distributions from HMC-exact (blue), and approximate posterior distributions from R-VGA-Whittle (red) and HMC-Whittle (yellow) for data simulated from the univariate linear Gaussian SSM with T=10000T=10000. Diagonal panels: Marginal posterior distributions with true parameters denoted using dotted lines. Off-diagonal panels: Bivariate posterior distributions with true parameters denoted using the symbol ×\times.
Model | T | R-VGA-Whittle | HMC-Whittle | HMC-exact
Linear Gaussian | 10000 | 14 | 125 | 447
SV (sim. data) | 10000 | 13 | 104 | 17005
SV (real data) | 2649 | 13 | 25 | 3449
2D SV (sim. data) | 5000 | 121 | 14521 | 72170
2D SV (real data) | 2649 | 107 | 8890 | 74462
Table 1: Computing time (in seconds) for R-VGA-Whittle, HMC-Whittle and HMC-exact for each model we considered. The number of observations TT for each example is also shown.

A comparison of the computing time between R-VGA-Whittle, HMC-Whittle and HMC-exact for this model can be found in Table 1. The table shows that R-VGA-Whittle is nearly 10 times faster than HMC-Whittle and more than 30 times faster than HMC-exact.

3.2 Univariate stochastic volatility model

SV models are a popular class of non-Gaussian financial time series models used for modelling the volatility of stock returns. An overview of SV models and traditional likelihood-based methods for parameter inference for use with these models can be found in Kim et al., (1998); an overview of MCMC-based inference for a univariate SV model can be found in Fearnhead, (2011).

In this section, we follow Fearnhead, (2011) and simulate T=10000T=10000 observations from the following univariate SV model:

yt\displaystyle y_{t} =σtϵt,ϵtGau(0,1),t=1,,T,\displaystyle=\sigma_{t}\epsilon_{t},\quad\epsilon_{t}\sim\mathrm{Gau}(0,1),\quad t=1,\dots,T, (23)
σt\displaystyle\sigma_{t} =κexp(xt2),t=1,,T,\displaystyle=\kappa\exp\left(\frac{x_{t}}{2}\right),\quad t=1,\dots,T, (24)
xt\displaystyle x_{t} =ϕxt1+ηt,ηtGau(0,ση2),t=2,,T,\displaystyle=\phi x_{t-1}+\eta_{t},\quad\eta_{t}\sim\mathrm{Gau}(0,\sigma_{\eta}^{2}),\quad t=2,\dots,T, (25)
x1\displaystyle x_{1} Gau(0,ση21ϕ2).\displaystyle\sim\mathrm{Gau}\left(0,\frac{\sigma_{\eta}^{2}}{1-\phi^{2}}\right). (26)

We set the true parameters to ϕ=0.99\phi=0.99, ση=0.1\sigma_{\eta}=0.1, and κ=2\kappa=2.

Following Ruiz, (1994), we apply a log-squared transformation,

log(yt2)\displaystyle\log(y_{t}^{2}) =log(κ2)+xt+log(ϵt2)\displaystyle=\log(\kappa^{2})+x_{t}+\log(\epsilon_{t}^{2})
=log(κ2)+𝔼(log(ϵt2))+xt+log(ϵt2)𝔼(log(ϵt2))\displaystyle=\log(\kappa^{2})+\mathbb{E}(\log(\epsilon_{t}^{2}))+x_{t}+\log(\epsilon_{t}^{2})-\mathbb{E}(\log(\epsilon_{t}^{2}))
=μ+xt+ξt,\displaystyle=\mu+x_{t}+\xi_{t}, (27)

where μlog(κ2)+𝔼(log(ϵt2))\mu\equiv\log(\kappa^{2})+\mathbb{E}(\log(\epsilon_{t}^{2})) and ξtlog(ϵt2)𝔼(log(ϵt2))\xi_{t}\equiv\log(\epsilon_{t}^{2})-\mathbb{E}(\log(\epsilon_{t}^{2})), for t=1,,Tt=1,\dots,T. The terms {ξt}t=1T\{\xi_{t}\}_{t=1}^{T} are independent and identically distributed (iid) non-Gaussian terms with mean zero and variance π22\frac{\pi^{2}}{2} (see Ruiz, (1994) for a derivation, which is reproduced in Section S2 of the online supplement for completeness).

Rearranging (27) yields the de-meaned series

ztlog(yt2)μ=xt+ξt,t=1,,T.z_{t}\equiv\log(y_{t}^{2})-\mu=x_{t}+\xi_{t},\quad t=1,\dots,T. (28)

Since 𝔼(log(yt2))=μ\mathbb{E}(\log(y_{t}^{2}))=\mu, the parameter μ\mu can be estimated using the sample mean:

μ^=1Tt=1Tlog(yt2).\hat{\mu}=\frac{1}{T}\sum_{t=1}^{T}\log(y_{t}^{2}).

As μ=log(κ2)+𝔼(log(ϵt2))\mu=\log(\kappa^{2})+\mathbb{E}(\log(\epsilon_{t}^{2})), once an estimate μ^\hat{\mu} is obtained, κ\kappa can be estimated as

κ^=exp(μ^𝔼(log(ϵt2))),\hat{\kappa}=\sqrt{\exp(\hat{\mu}-\mathbb{E}(\log(\epsilon_{t}^{2})))},

where 𝔼(log(ϵt2))=ψ(1/2)+log(2)\mathbb{E}(\log(\epsilon_{t}^{2}))=\psi(1/2)+\log(2), and ψ()\psi(\cdot) is the digamma function (Abramowitz and Stegun,, 1968). As we show below, the estimate κ^\hat{\kappa} is not required for R-VGA-Whittle and HMC-Whittle as the Whittle likelihood does not depend on κ\kappa. However, κ\kappa appears in the likelihood used by HMC-exact. Since κ\kappa is not estimated by the Whittle-based methods, for comparison purposes we fix it to its estimate κ^\hat{\kappa} when using HMC-exact.
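A sketch of these moment-based estimators in Python (illustrative; it uses the identity ψ(1/2)+log2=γlog2\psi(1/2)+\log 2=-\gamma-\log 2, where γ0.5772\gamma\approx 0.5772 is the Euler–Mascheroni constant):

```python
import numpy as np

def estimate_mu_kappa(y):
    """Estimate mu and kappa from SV observations via the log-squared transform.

    Uses E[log(eps_t^2)] = psi(1/2) + log(2) = -euler_gamma - log(2)
    for eps_t ~ N(0,1), which is approximately -1.2704.
    """
    mu_hat = np.mean(np.log(y**2))
    e_log_eps2 = -np.euler_gamma - np.log(2.0)
    kappa_hat = np.sqrt(np.exp(mu_hat - e_log_eps2))
    return mu_hat, kappa_hat
```

On data simulated from the SV model with κ=2\kappa=2, the estimate κ^\hat{\kappa} is close to the true value for long series.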

The remaining parameters to estimate are ϕ\phi and ση\sigma_{\eta}. As with the linear Gaussian state space model, we work with the transformed parameters 𝜽=(θϕ,θη)\bm{\theta}=(\theta_{\phi},\theta_{\eta})^{\top}, where θϕ=tanh1(ϕ)\theta_{\phi}=\tanh^{-1}(\phi) and θη=log(ση2)\theta_{\eta}=\log(\sigma_{\eta}^{2}), which are unconstrained. We fit the spectral density

fz(ωk;𝜽)\displaystyle f_{z}(\omega_{k};\bm{\theta}) =fx(ωk;𝜽)+fξ(ωk;𝜽),\displaystyle=f_{x}(\omega_{k};\bm{\theta})+f_{\xi}(\omega_{k};\bm{\theta}), (29)

to the log-squared, demeaned series {zt}t=1T\{z_{t}\}_{t=1}^{T}, where

fx(ωk;𝜽)=ση21+ϕ22ϕcos(ωk),f_{x}(\omega_{k};\bm{\theta})=\frac{\sigma_{\eta}^{2}}{1+\phi^{2}-2\phi\cos(\omega_{k})},

and fξ(ωk;𝜽)f_{\xi}(\omega_{k};\bm{\theta}) is the Fourier transform of the covariance function Cξ(h;𝜽)C_{\xi}(h;\bm{\theta}) of {ξt}t=1T\{\xi_{t}\}_{t=1}^{T}:

Cξ(h;𝜽)=𝔼(ξtξth)={π22,h=0,0,h0,C_{\xi}(h;\bm{\theta})=\mathbb{E}(\xi_{t}\xi_{t-h})=\begin{cases}\frac{\pi^{2}}{2},&h=0,\\ 0,&h\neq 0,\end{cases}

and is thus given by

fξ(ωk;𝜽)=h=Cξ(h;𝜽)exp(iωkh)=Cξ(0)=π22,k=T2+1,,T2.f_{\xi}(\omega_{k};\bm{\theta})=\sum_{h=-\infty}^{\infty}C_{\xi}(h;\bm{\theta})\exp(-i\omega_{k}h)=C_{\xi}(0)=\frac{\pi^{2}}{2},\quad k=-\left\lceil\frac{T}{2}\right\rceil+1,\dots,\left\lfloor\frac{T}{2}\right\rfloor. (30)
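With the flat noise floor π2/2\pi^{2}/2 from (30), the Whittle log-likelihood of the de-meaned log-squared series can be sketched in Python as follows (illustrative, not the paper's R implementation; we assume the periodogram convention I(ωk)=|DFT(𝐳)k|2/TI(\omega_k)=|\mathrm{DFT}(\mathbf{z})_k|^2/T and sum over the positive Fourier frequencies, up to an additive constant):

```python
import numpy as np

def whittle_loglik_sv(z, phi, sigma_eta):
    """Whittle log-likelihood for the de-meaned log-squared SV series (28),
    with spectral density (29): AR(1) signal plus flat noise floor pi^2/2."""
    T = len(z)
    I = np.abs(np.fft.fft(z)) ** 2 / T          # periodogram (assumed convention)
    k = np.arange(1, T // 2 + 1)                # positive Fourier frequencies
    omega = 2 * np.pi * k / T
    f_z = sigma_eta**2 / (1 + phi**2 - 2 * phi * np.cos(omega)) + np.pi**2 / 2
    return -np.sum(np.log(f_z) + I[k] / f_z)
```

Note that κ\kappa does not appear: it was absorbed into μ\mu by the de-meaning step, which is why the Whittle-based methods do not estimate it.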

The prior/initial variational distribution we use is

p(𝜽)=q0(𝜽)=Gau([23],[0.5000.5]),p(\bm{\theta})=q_{0}(\bm{\theta})=\mathrm{Gau}\left(\begin{bmatrix}2\\ -3\end{bmatrix},\begin{bmatrix}0.5&0\\ 0&0.5\\ \end{bmatrix}\right), (31)

which leads to 95% probability intervals of (0.57,0.99)(0.57,0.99) for ϕ\phi, and (0.11,0.44)(0.11,0.44) for ση\sigma_{\eta}. The interval for ϕ\phi aligns closely with those used in similar studies of SV models. For example, Kim et al., (1998) employed the transformation ϕ=2ϕ1\phi=2\phi^{*}-1 with ϕBeta(20,1.5)\phi^{*}\sim\textrm{Beta}(20,1.5), resulting in a 95% probability interval for ϕ\phi of (0.59,0.99)(0.59,0.99). Additionally, Kim et al., (1998) used an inverse gamma prior for ση2\sigma_{\eta}^{2} with shape and rate parameters of 2.52.5 and 0.0250.025, respectively, corresponding to a 95% probability interval for ση\sigma_{\eta} of (0.06,0.24)(0.06,0.24), which is comparable to the interval implied by our chosen prior.

Figure 5 shows the trajectories of the R-VGA-Whittle variational means for all parameters. The 3dB cutoff point for this example is the 75th frequency, after which there are 50 blocks, for a total of 125 updates. Similar to the case of the LGSS model in Section 3.1, the variational mean of ϕ\phi is close to the true value from the early stages of the algorithm, while that of ση\sigma_{\eta} requires more iterations to achieve convergence.

Figures 9(a)–9(c) in Section S5 of the online supplement show the HMC-exact and HMC-Whittle trace plots. For this example, HMC-exact produces posterior samples with a much higher correlation than HMC-Whittle, particularly for the standard deviation parameter ση\sigma_{\eta}. This is because HMC-Whittle only targets the posterior density of the parameters, while HMC-exact targets the joint posterior density of the states and parameters, which is high-dimensional. We thin the posterior samples from HMC-exact by a factor of 100 so that the remaining samples can be treated as uncorrelated draws from the posterior distribution.

We compare the posterior densities from R-VGA-Whittle, HMC-Whittle, and HMC-exact in Figure 6. The posterior densities from R-VGA-Whittle and HMC-Whittle are very similar, while the posterior densities from HMC-exact are slightly narrower. Table 1 shows that R-VGA-Whittle is 8 times faster than HMC-Whittle, and more than 1000 times faster than HMC-exact in this example.

Refer to caption
Figure 5: Trajectories of the variational means for parameters in the univariate SV model with simulated data. True parameter values are marked with dashed lines. The vertical line is at n~\tilde{n}, the point beyond which updates are done in blocks.
Refer to caption
Figure 6: Posterior distributions from HMC-exact (blue), and approximate posterior distributions from R-VGA-Whittle (red) and HMC-Whittle (yellow) for data simulated from the univariate SV model with T=10000T=10000. Diagonal panels: Marginal posterior distributions with true parameters denoted using dotted lines. Off-diagonal panels: Bivariate posterior distributions with true parameters denoted using the symbol ×\times.

3.3 Bivariate SV model

We now consider the following bivariate extension of the SV model in Section 3.2 and simulate T=5000T=5000 observations as follows:

𝐲t\displaystyle\mathbf{y}_{t} =𝐕t𝜺t,𝜺tGau(𝟎,𝐈2),t=1,,T,\displaystyle=\mathbf{V}_{t}\bm{\varepsilon}_{t},\quad\bm{\varepsilon}_{t}\sim\mathrm{Gau}(\mathbf{0},\mathbf{I}_{2}),\quad t=1,\dots,T, (32)
𝐱t\displaystyle\mathbf{x}_{t} =𝚽𝐱t1+𝜼t,𝜼tGau(𝟎,𝚺η),t=2,,T,\displaystyle=\mathbf{\Phi}\mathbf{x}_{t-1}+\bm{\eta}_{t},\quad\bm{\eta}_{t}\sim\mathrm{Gau}(\mathbf{0},\mathbf{\Sigma}_{\eta}),\quad t=2,\dots,T, (33)
𝐱1\displaystyle\mathbf{x}_{1} Gau(𝟎,𝚺η1),\displaystyle\sim\mathrm{Gau}(\mathbf{0},\mathbf{\Sigma}_{\eta_{1}}), (34)

where 𝐱t(x1,t,x2,t)\mathbf{x}_{t}\equiv(x_{1,t},x_{2,t})^{\top}, 𝐲t(y1,t,y2,t)\mathbf{y}_{t}\equiv(y_{1,t},y_{2,t})^{\top}, 𝐕tdiag(exp(x1,t2),exp(x2,t2)){\mathbf{V}_{t}\equiv\operatorname{diag}(\exp(\tfrac{x_{1,t}}{2}),\exp(\tfrac{x_{2,t}}{2}))}, and 𝐈m\mathbf{I}_{m} is the m×mm\times m identity matrix. The covariance matrix 𝚺η1\mathbf{\Sigma}_{\eta_{1}} is the solution to the equation vec(𝚺η1)=(𝐈4𝚽𝚽)1vec(𝚺η)\operatorname{vec}(\mathbf{\Sigma}_{\eta_{1}})=(\mathbf{I}_{4}-\mathbf{\Phi}\otimes\mathbf{\Phi})^{-1}\operatorname{vec}(\mathbf{\Sigma}_{\eta}), where vec()\textrm{vec}(\cdot) is the vectorisation operator and \otimes denotes the Kronecker product of two matrices. We let 𝚽\mathbf{\Phi} be a diagonal matrix but let 𝚺η\mathbf{\Sigma}_{\eta} have off-diagonal entries so that there is dependence between the two series. We set these matrices to

𝚽=[0.99000.98],𝚺η=[0.020.0050.0050.01].\mathbf{\Phi}=\begin{bmatrix}0.99&0\\ 0&0.98\end{bmatrix},\quad\mathbf{\Sigma}_{\eta}=\begin{bmatrix}0.02&0.005\\ 0.005&0.01\end{bmatrix}. (35)
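The stationary covariance 𝚺η1\mathbf{\Sigma}_{\eta_{1}} can be computed directly from the vec/Kronecker identity above; a minimal Python sketch using the matrices in (35) (recall that vec()\textrm{vec}(\cdot) stacks columns, hence the column-major flattening below):

```python
import numpy as np

Phi = np.array([[0.99, 0.0], [0.0, 0.98]])
Sigma_eta = np.array([[0.02, 0.005], [0.005, 0.01]])

# vec(Sigma_eta1) = (I_4 - Phi kron Phi)^{-1} vec(Sigma_eta);
# order="F" flattens/reshapes column-by-column, matching the vec operator.
vec_S = np.linalg.solve(np.eye(4) - np.kron(Phi, Phi),
                        Sigma_eta.flatten(order="F"))
Sigma_eta1 = vec_S.reshape(2, 2, order="F")
```

The result satisfies the discrete Lyapunov equation 𝚺η1=𝚽𝚺η1𝚽+𝚺η\mathbf{\Sigma}_{\eta_{1}}=\mathbf{\Phi}\mathbf{\Sigma}_{\eta_{1}}\mathbf{\Phi}^{\top}+\mathbf{\Sigma}_{\eta}, as required for stationarity of the VAR(1) state.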

As with the univariate model, we take a log-squared transformation of (32) to find

𝐲~t=𝝁+𝐱t+𝝃t,\tilde{\mathbf{y}}_{t}=\bm{\mu}+\mathbf{x}_{t}+\bm{\xi}_{t}, (36)

where

𝐲~t=[log(y1,t2)log(y2,t2)],𝝁=[μ1μ2]=[𝔼(log(ϵ1,t2))𝔼(log(ϵ2,t2))],𝝃t=[ξ1,tξ2,t]=[log(ϵ1,t2)𝔼(log(ϵ1,t2))log(ϵ2,t2)𝔼(log(ϵ2,t2))].\tilde{\mathbf{y}}_{t}=\begin{bmatrix}\log(y_{1,t}^{2})\\ \log(y_{2,t}^{2})\end{bmatrix},\quad\bm{\mu}=\begin{bmatrix}\mu_{1}\\ \mu_{2}\end{bmatrix}=\begin{bmatrix}\mathbb{E}(\log(\epsilon_{1,t}^{2}))\\ \mathbb{E}(\log(\epsilon_{2,t}^{2}))\end{bmatrix},\quad\bm{\xi}_{t}=\begin{bmatrix}\xi_{1,t}\\ \xi_{2,t}\end{bmatrix}=\begin{bmatrix}\log(\epsilon_{1,t}^{2})-\mathbb{E}(\log(\epsilon_{1,t}^{2}))\\ \log(\epsilon_{2,t}^{2})-\mathbb{E}(\log(\epsilon_{2,t}^{2}))\\ \end{bmatrix}.

The terms ξ1,t\xi_{1,t} and ξ2,t\xi_{2,t} are iid non-Gaussian uncorrelated terms, each with mean zero and variance σξ2=π22\sigma_{\xi}^{2}=\frac{\pi^{2}}{2}. We express (36) as

𝐳t𝐲~t𝝁=𝐱t+𝝃t,\mathbf{z}_{t}\equiv\tilde{\mathbf{y}}_{t}-\bm{\mu}=\mathbf{x}_{t}+\bm{\xi}_{t},

and use the sample means

μ^1=1Tt=1Tlog(y1,t2),μ^2=1Tt=1Tlog(y2,t2)\hat{\mu}_{1}=\frac{1}{T}\sum_{t=1}^{T}\log(y_{1,t}^{2}),\quad\hat{\mu}_{2}=\frac{1}{T}\sum_{t=1}^{T}\log(y_{2,t}^{2})

as estimates of μ1\mu_{1} and μ2\mu_{2}, respectively. The parameters in this model are 𝚽\mathbf{\Phi} and 𝚺η\mathbf{\Sigma}_{\eta}. To estimate these parameters, we fit the following spectral density to {𝐳t}t=1T\{\mathbf{z}_{t}\}_{t=1}^{T}:

𝐟𝐳(ωk;𝜽)=𝐟𝐱(ωk;𝜽)+𝐟𝝃(ωk;𝜽),\mathbf{f}_{\mathbf{z}}(\omega_{k};\bm{\theta})=\mathbf{f}_{\mathbf{x}}(\omega_{k};\bm{\theta})+\mathbf{f}_{\bm{\xi}}(\omega_{k};\bm{\theta}),

where 𝐟𝐱(ωk;𝜽)\mathbf{f}_{\mathbf{x}}(\omega_{k};\bm{\theta}) is the spectral density matrix of the VAR(1) process {𝐱t}t=1T\{\mathbf{x}_{t}\}_{t=1}^{T},

𝐟𝐱(ωk;𝜽)=(𝐈2𝚽eiωk)1𝚺η(𝐈2𝚽eiωk)H,k=T2+1,,T2,\mathbf{f}_{\mathbf{x}}(\omega_{k};\bm{\theta})=(\mathbf{I}_{2}-\mathbf{\Phi}e^{-i\omega_{k}})^{-1}\mathbf{\Sigma}_{\eta}(\mathbf{I}_{2}-\mathbf{\Phi}e^{-i\omega_{k}})^{-H},\quad k=-\left\lceil\frac{T}{2}\right\rceil+1,\dots,\left\lfloor\frac{T}{2}\right\rfloor, (37)

with 𝐀H\mathbf{A}^{-H} denoting the inverse of the conjugate transpose of the matrix 𝐀\mathbf{A}, and 𝐟𝝃(ωk;𝜽)=π22𝐈2\mathbf{f}_{\bm{\xi}}(\omega_{k};\bm{\theta})=\tfrac{\pi^{2}}{2}\mathbf{I}_{2} is the spectral density of {𝝃t}t=1T\{\bm{\xi}_{t}\}_{t=1}^{T} (following a similar derivation as that for the univariate SV model in Equation (30)).
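Evaluating (37) numerically is straightforward with complex linear algebra; a Python sketch (function name illustrative):

```python
import numpy as np

def var1_spectral_density(omega, Phi, Sigma_eta):
    """Spectral density matrix (37) of a VAR(1) process at angular frequency omega:
    (I - Phi e^{-i omega})^{-1} Sigma_eta (I - Phi e^{-i omega})^{-H}."""
    m = Phi.shape[0]
    A = np.linalg.inv(np.eye(m) - Phi * np.exp(-1j * omega))
    return A @ Sigma_eta @ A.conj().T
```

The output is a Hermitian matrix at every frequency; for the diagonal 𝚽\mathbf{\Phi} used here, its diagonal entries reduce to the familiar univariate AR(1) spectra.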

We parameterise θii=tanh1(Φii),i=1,2\theta_{ii}=\tanh^{-1}(\Phi_{ii}),i=1,2, and decompose 𝚺η=𝐋𝐋{\mathbf{\Sigma}_{\eta}=\mathbf{L}\mathbf{L}^{\top}}, where 𝐋\mathbf{L} is the lower Cholesky factor of 𝚺η\mathbf{\Sigma}_{\eta}, with its diagonal elements parameterised as γii=log(lii),i=1,2\gamma_{ii}=\log(l_{ii}),i=1,2. The vector of transformed parameters is 𝜽=(θ11,θ22,γ11,γ22,l21){\bm{\theta}=(\theta_{11},\theta_{22},\gamma_{11},\gamma_{22},l_{21})^{\top}}.
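This mapping from the unconstrained vector 𝜽\bm{\theta} back to (𝚽,𝚺η)(\mathbf{\Phi},\mathbf{\Sigma}_{\eta}) can be sketched in Python as follows (function name illustrative):

```python
import numpy as np

def theta_to_params(theta):
    """Map theta = (theta11, theta22, gamma11, gamma22, l21) back to (Phi, Sigma_eta).

    tanh keeps |Phi_ii| < 1; exponentiating the log-diagonal of the Cholesky
    factor L guarantees Sigma_eta = L L^T is symmetric positive definite.
    """
    th11, th22, g11, g22, l21 = theta
    Phi = np.diag([np.tanh(th11), np.tanh(th22)])
    L = np.array([[np.exp(g11), 0.0],
                  [l21, np.exp(g22)]])
    return Phi, L @ L.T
```

The transformation guarantees a stationary VAR(1) transition matrix and a valid covariance matrix for any real-valued 𝜽\bm{\theta}, which is what allows the Gaussian variational distribution to be placed on the unconstrained scale.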

The initial/prior distribution we use is

p(𝜽)=q0(𝜽)=Gau([22230],[0.5000000.5000000.5000000.05000000.05]).p(\bm{\theta})=q_{0}(\bm{\theta})=\mathrm{Gau}\left(\begin{bmatrix}2\\ 2\\ -2\\ -3\\ 0\end{bmatrix},\begin{bmatrix}0.5&0&0&0&0\\ 0&0.5&0&0&0\\ 0&0&0.5&0&0\\ 0&0&0&0.05&0\\ 0&0&0&0&0.05\\ \end{bmatrix}\right). (38)

For Φ11\Phi_{11} and Φ22\Phi_{22}, this prior leads to 95% probability intervals of (0.52,0.99)(0.52,0.99). For Ση11\Sigma_{\eta_{11}}, Ση21\Sigma_{\eta_{21}}, and Ση22\Sigma_{\eta_{22}}, the 95% probability intervals are (0.001,0.286)(0.001,0.286), (0.101,0.101)(-0.101,0.101), and (0.002,0.261)(0.002,0.261), respectively.

Refer to caption
Figure 7: Trajectories of the variational means for parameters in the bivariate SV model with simulated data. True parameter values are marked with dashed lines. The vertical line is at n~\tilde{n}, the point beyond which updates are done in blocks.

Figure 7 shows the trajectories of the R-VGA-Whittle variational means for all parameters in this model. For most parameters, the trajectories quickly move towards the true values within the first 10 to 20 updates, with only small adjustments afterwards. The only exception is Ση11\Sigma_{\eta_{11}}, which exhibits more gradual changes before settling close to the true value in the last 25 iterations; this is somewhat expected since this parameter corresponds to a high-frequency signal component. R-VGA-Whittle slightly overestimates Φ22\Phi_{22} and underestimates Ση22\Sigma_{\eta_{22}}.

Figures 10(a)–10(c) in Section S5 of the online supplement show the HMC-exact and HMC-Whittle trace plots for this model. HMC-exact samples are highly correlated, especially those of Ση22\Sigma_{\eta_{22}}. We find it necessary to thin the HMC-exact samples by a factor of 50 so that the remaining samples can be considered uncorrelated draws from the posterior distribution. In contrast, HMC-Whittle produces well-mixed posterior samples. Section S5 of the online supplement also shows that the effective sample sizes from HMC-Whittle are much higher than those from HMC-exact for all parameters.

We compare the posterior densities from R-VGA-Whittle, HMC-Whittle, and HMC-exact in Figure 8. In this example, we observe broad agreement between the three methods, though the posterior densities from R-VGA-Whittle and HMC-Whittle are wider than those from HMC-exact. The posterior density of Φ22\Phi_{22} from R-VGA-Whittle puts more probability on values close to one than those from the other two methods, while the posterior densities of Ση21\Sigma_{\eta_{21}} from both R-VGA-Whittle and HMC-Whittle appear slightly shifted towards zero compared to that from HMC-exact. As shown in Table 1, R-VGA-Whittle is nearly 600 times faster than HMC-exact, and over 100 times faster than HMC-Whittle in this example.

Refer to caption
Figure 8: Posterior distributions from HMC-exact (blue), and approximate posterior distributions from R-VGA-Whittle (red) and HMC-Whittle (yellow) for data simulated from the bivariate SV model with T=5000T=5000. Diagonal panels: Marginal posterior distributions with true parameters denoted using dotted lines. Off-diagonal panels: Bivariate posterior distributions with true parameters denoted using the symbol ×\times.

3.4 Real univariate and bivariate examples

We apply R-VGA-Whittle to estimate parameters in univariate and bivariate stochastic volatility models from real data. We first apply it to daily JPY-EUR exchange rate data (Kastner et al.,, 2017) from April 1st, 2005 to August 6th, 2015, which consists of 26502650 observations. Typically, the data used in SV models are de-meaned log-returns, which are computed from the raw exchange rate values as:

yt=yt1Tt=1Tyt,yt=log(rt+1rt),t=1,,2649,y_{t}=y_{t}^{*}-\frac{1}{T}\sum_{t^{\prime}=1}^{T}y_{t^{\prime}}^{*},\quad y_{t}^{*}=\log\left(\frac{r_{t+1}}{r_{t}}\right),\quad t=1,\dots,2649,

where rtr_{t} is the exchange rate on day tt, and yty_{t}^{*} is the log-return on day tt. We fit the model (23)–(26) to this time series of de-meaned log-returns with T=2649T=2649, and use the initial variational distribution specified in (31).
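The construction of the de-meaned log-returns can be sketched as follows (a Python illustration with made-up rates; the paper's analysis uses the JPY-EUR series of Kastner et al., (2017)):

```python
import numpy as np

def demeaned_log_returns(rates):
    """De-meaned log-returns y_t from raw exchange rates r_t (Section 3.4):
    y*_t = log(r_{t+1} / r_t), then subtract the sample mean of y*."""
    y_star = np.log(rates[1:] / rates[:-1])   # log-returns, length T = len(rates) - 1
    return y_star - y_star.mean()             # de-mean
```

A series of 2650 daily rates thus yields T = 2649 de-meaned log-returns, matching the sample size reported for the real-data examples.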

The HMC-exact trace plots in Figures 11(a) of the online supplement show that posterior samples of ϕ\phi are relatively well-mixed, but that the posterior samples of ση\sigma_{\eta} are highly correlated. We find it necessary to thin these samples by a factor of 100 in order for them to be considered uncorrelated. In contrast, HMC-Whittle produces well-mixed posterior samples, so that no thinning is needed.

We compare the posterior densities from R-VGA-Whittle, HMC-Whittle and HMC-exact in Figure 9. The posterior densities obtained from R-VGA-Whittle are similar, but slightly wider than those obtained from HMC-Whittle and HMC-exact.

Refer to caption
Figure 9: Posterior distributions from HMC-exact (blue), and approximate posterior distributions from R-VGA-Whittle (red) and HMC-Whittle (yellow) for the univariate SV model fitted using JPY-EUR exchange rate data with T=2649T=2649. Diagonal panels: Marginal posterior distributions of the parameters. Off-diagonal panels: Bivariate posterior distributions of the parameters.

We also apply R-VGA-Whittle to a bivariate time series of daily GBP-EUR and USD-EUR exchange rates (Kastner et al.,, 2017) from April 1st, 2005, to August 6th, 2015, which consists of 26502650 observations per exchange rate. We compute the de-meaned log-returns for both series and then fit the model (32)–(34) to these log-returns. The prior distribution we use is specified in (38).

Figures 12(a)–12(c) in Section S5 of the online supplement show the HMC-exact and HMC-Whittle trace plots for the bivariate SV model with exchange rate data. Similar to the simulation study, we find that HMC-exact samples are highly correlated; samples of parameters in 𝚽\mathbf{\Phi} tend to be less correlated than samples of parameters in 𝚺η\mathbf{\Sigma}_{\eta}. We thin the HMC-exact samples by a factor of 50 so that the remaining samples can be treated as uncorrelated. Posterior samples from HMC-Whittle have low correlation and mix well for all parameters, so no thinning is necessary.

We compare the posterior densities from R-VGA-Whittle, HMC-Whittle, and HMC-exact in Figure 10. The marginal posterior densities from R-VGA-Whittle and HMC-Whittle are similar, but noticeably wider than the posterior densities from HMC-exact. Nevertheless, there is a substantial overlap between the posterior densities from all three methods.

Table 1 shows that R-VGA-Whittle is twice as fast as HMC-Whittle in the univariate case, and more than 80 times faster than HMC-Whittle in the bivariate case. R-VGA-Whittle is significantly faster than HMC-exact, with approximately a 260-fold speedup in the univariate case and a 700-fold speedup in the bivariate case.

Refer to caption
Figure 10: Posterior distributions from HMC-exact (blue), and approximate posterior distributions from R-VGA-Whittle (red) and HMC-Whittle (yellow) for the bivariate SV model fitted using GBP-EUR and USD-EUR exchange rate data with T=2649T=2649. Diagonal panels: Marginal posterior distributions of the parameters. Off-diagonal panels: Bivariate posterior distributions of the parameters.

4 Conclusion

We develop a novel sequential variational Bayes algorithm that uses the Whittle likelihood, called R-VGA-Whittle, for parameter inference in linear non-Gaussian state space models. The Whittle likelihood is a frequency-domain approximation of the exact likelihood. It is available in closed form and its log can be expressed as a sum of individual log-likelihood frequency components. At each iteration, R-VGA-Whittle sequentially updates the variational parameters by processing one frequency or a block of frequencies at a time. We also demonstrate the use of the Whittle likelihood in HMC (HMC-Whittle), and compare the parameter estimates from R-VGA-Whittle and HMC-Whittle to those of HMC with the exact likelihood (HMC-exact). We show that the posterior densities from the three methods are generally similar, though those from R-VGA-Whittle and HMC-Whittle tend to be wider than those from HMC-exact. As HMC-Whittle only targets the parameters and not the states, it tends to produce posterior samples of the parameters that are less correlated than those from HMC-exact. We also show that R-VGA-Whittle and HMC-Whittle are much more computationally efficient than HMC-exact. R-VGA-Whittle is particularly efficient when dealing with bivariate time series, where it can be up to two orders of magnitude faster than HMC-Whittle and three orders of magnitude faster than HMC-exact.

There are several avenues for future research. First, R-VGA-Whittle assumes a Gaussian variational distribution, so future work could attempt to extend this framework to more flexible variational distributions for approximating skewed or multimodal posterior distributions. Second, some tuning parameters of R-VGA-Whittle (such as the number of damping observations and number of damping steps) are currently user-defined, so an adaptive tuning scheme could be developed to select these parameters. Third, as the Whittle likelihood is known to produce biased parameter estimates for finite sample sizes, using R-VGA-Whittle with the debiased Whittle likelihood of Sykulski et al., (2019) may improve parameter estimates.

Acknowledgements

This work was supported by ARC SRIEAS Grant SR200100005 Securing Antarctica’s Environmental Future. Andrew Zammit-Mangion’s research was also supported by ARC Discovery Early Career Research Award DE180100203.

References

  • Abadi et al., (2015) Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
  • Abadir et al., (2007) Abadir, K. M., Distaso, W., and Giraitis, L. (2007). Nonstationarity-extended local Whittle estimation. Journal of Econometrics, 141(2):1353–1384.
  • Abramowitz and Stegun, (1968) Abramowitz, M. and Stegun, I. (1968). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. U.S. Government Printing Office, Washington, D. C.
  • Andrieu and Roberts, (2009) Andrieu, C. and Roberts, G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37(2):697–725.
  • Beal, (2003) Beal, M. J. (2003). Variational algorithms for approximate Bayesian inference. PhD thesis, University College London, UK.
  • Blei et al., (2017) Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877.
  • Bonnet, (1964) Bonnet, G. (1964). Transformations des signaux aléatoires a travers les systemes non linéaires sans mémoire. Annales des Télécommunications, 19:203–220.
  • Brillinger, (2001) Brillinger, D. R. (2001). Time Series: Data Analysis and Theory. Society for Industrial and Applied Mathematics, Philadelphia, PA.
  • Campbell et al., (2021) Campbell, A., Shi, Y., Rainforth, T., and Doucet, A. (2021). Online variational filtering and parameter learning. Advances in Neural Information Processing Systems, 34:18633–18645.
  • Chan and Strachan, (2023) Chan, J. C. and Strachan, R. W. (2023). Bayesian state space models in macroeconometrics. Journal of Economic Surveys, 37(1):58–75.
  • Cooley and Tukey, (1965) Cooley, J. W. and Tukey, J. W. (1965). An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297–301.
  • Doucet and Shephard, (2012) Doucet, A. and Shephard, N. G. (2012). Robust inference on parameters via particle filters and sandwich covariance matrices. Discussion paper series, Department of Economics, University of Oxford, UK.
  • Duane et al., (1987) Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195(2):216–222.
  • Dukic et al., (2012) Dukic, V., Lopes, H. F., and Polson, N. G. (2012). Tracking epidemics with Google flu trends data and a state-space SEIR model. Journal of the American Statistical Association, 107(500):1410–1426.
  • Dzhaparidze and Yaglom, (1983) Dzhaparidze, K. and Yaglom, A. (1983). Spectrum parameter estimation in time series analysis. In Krishnaiah, P. R., editor, Developments in Statistics, pages 1–96. Academic Press, New York, NY.
  • Fearnhead, (2011) Fearnhead, P. (2011). MCMC for state-space models. In Brooks, S., Gelman, A., Jones, G. J., and Meng, X.-L., editors, Handbook of Markov Chain Monte Carlo, pages 513–530. CRC Press, Boca Raton, FL.
  • Frazier et al., (2023) Frazier, D. T., Loaiza-Maya, R., and Martin, G. M. (2023). Variational Bayes in state space models: Inferential and predictive accuracy. Journal of Computational and Graphical Statistics, 32(3):793–804.
  • Fuentes, (2007) Fuentes, M. (2007). Approximate likelihood for large irregularly spaced spatial data. Journal of the American Statistical Association, 102(477):321–331.
  • Gabry et al., (2024) Gabry, J., Češnovar, R., and Johnson, A. (2024). Getting started with CmdStanR.
  • Gibson and Ninness, (2005) Gibson, S. and Ninness, B. (2005). Robust maximum-likelihood estimation of multivariable dynamic systems. Automatica, 41(10):1667–1682.
  • Giraitis et al., (2012) Giraitis, L., Koul, H. L., and Surgailis, D. (2012). Large Sample Inference for Long Memory Processes. Imperial College Press, London, UK.
  • Giraitis and Robinson, (2001) Giraitis, L. and Robinson, P. M. (2001). Whittle estimation of ARCH models. Econometric Theory, 17(3):608–631.
  • Gordon et al., (1993) Gordon, N. J., Salmond, D. J., and Smith, A. F. (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F, 140(2):107–113.
  • Gunawan et al., (2021) Gunawan, D., Kohn, R., and Nott, D. (2021). Variational Bayes approximation of factor stochastic volatility models. International Journal of Forecasting, 37(4):1355–1375.
  • Gunawan et al., (2024) Gunawan, D., Kohn, R., and Nott, D. (2024). Flexible variational Bayes based on a copula of a mixture. Journal of Computational and Graphical Statistics, 33(2):665–680.
  • Harvey, (1990) Harvey, A. C. (1990). Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, Cambridge, UK.
  • Hoffman and Gelman, (2014) Hoffman, M. D. and Gelman, A. (2014). The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623.
  • Hurvich and Chen, (2000) Hurvich, C. M. and Chen, W. W. (2000). An efficient taper for potentially overdifferenced long-memory time series. Journal of Time Series Analysis, 21(2):155–180.
  • Kantas et al., (2009) Kantas, N., Doucet, A., Singh, S. S., and Maciejowski, J. M. (2009). An overview of sequential Monte Carlo methods for parameter estimation in general state-space models. IFAC Proceedings Volumes, 42(10):774–785.
  • Kastner et al., (2017) Kastner, G., Frühwirth-Schnatter, S., and Lopes, H. F. (2017). Efficient Bayesian inference for multivariate factor stochastic volatility models. Journal of Computational and Graphical Statistics, 26(4):905–917.
  • Kim et al., (1998) Kim, S., Shephard, N., and Chib, S. (1998). Stochastic volatility: likelihood inference and comparison with ARCH models. The Review of Economic Studies, 65(3):361–393.
  • Lambert et al., (2022) Lambert, M., Bonnabel, S., and Bach, F. (2022). The recursive variational Gaussian approximation (R-VGA). Statistics and Computing, 32(10).
  • Laplace, (1986) Laplace, P. S. (1986). Memoir on the probability of the causes of events. Statistical Science, 1(3):364–378.
  • Loaiza-Maya et al., (2022) Loaiza-Maya, R., Smith, M. S., Nott, D. J., and Danaher, P. J. (2022). Fast and accurate variational inference for models with many latent variables. Journal of Econometrics, 230(2):339–362.
  • Mangion et al., (2011) Mangion, A. Z., Yuan, K., Kadirkamanathan, V., Niranjan, M., and Sanguinetti, G. (2011). Online variational inference for state-space models with point-process observations. Neural Computation, 23(8):1967–1999.
  • Matsuda and Yajima, (2009) Matsuda, Y. and Yajima, Y. (2009). Fourier analysis of irregularly spaced data on $\mathbb{R}^{d}$. Journal of the Royal Statistical Society Series B, 71(1):191–217.
  • Meyer et al., (2003) Meyer, R., Fournier, D. A., and Berg, A. (2003). Stochastic volatility: Bayesian computation using automatic differentiation and the extended Kalman filter. The Econometrics Journal, 6(2):408–420.
  • Neal, (2011) Neal, R. (2011). MCMC using Hamiltonian dynamics. In Brooks, S., Gelman, A., Jones, G. J., and Meng, X.-L., editors, Handbook of Markov Chain Monte Carlo. CRC Press, Boca Raton, FL.
  • Nemeth et al., (2016) Nemeth, C., Fearnhead, P., and Mihaylova, L. (2016). Particle approximations of the score and observed information matrix for parameter estimation in state–space models with linear computational cost. Journal of Computational and Graphical Statistics, 25(4):1138–1157.
  • Percival and Walden, (1993) Percival, D. B. and Walden, A. T. (1993). Spectral Analysis for Physical Applications. Cambridge University Press, Cambridge, UK.
  • Price, (1958) Price, R. (1958). A useful theorem for nonlinear devices having Gaussian inputs. IRE Transactions on Information Theory, 4(2):69–72.
  • Robinson, (1995) Robinson, P. M. (1995). Gaussian semiparametric estimation of long range dependence. The Annals of Statistics, 23(5):1630–1661.
  • Ruiz, (1994) Ruiz, E. (1994). Quasi-maximum likelihood estimation of stochastic volatility models. Journal of Econometrics, 63(1):289–306.
  • Salomone et al., (2020) Salomone, R., Quiroz, M., Kohn, R., Villani, M., and Tran, M.-N. (2020). Spectral subsampling MCMC for stationary time series. In Proceedings of the 37th International Conference on Machine Learning, pages 8449–8458. PMLR.
  • Shao and Wu, (2007a) Shao, X. and Wu, W. B. (2007a). Asymptotic spectral theory for nonlinear time series. The Annals of Statistics, 35(4):1773–1801.
  • Shao and Wu, (2007b) Shao, X. and Wu, W. B. (2007b). Local Whittle estimation of fractional integration for nonlinear processes. Econometric Theory, 23(5):899–929.
  • She et al., (2022) She, R., Mi, Z., and Ling, S. (2022). Whittle parameter estimation for vector ARMA models with heavy-tailed noises. Journal of Statistical Planning and Inference, 219:216–230.
  • Sykulski et al., (2019) Sykulski, A. M., Olhede, S. C., Guillaumin, A. P., Lilly, J. M., and Early, J. J. (2019). The debiased Whittle likelihood. Biometrika, 106(2):251–266.
  • Tan and Nott, (2018) Tan, L. S. and Nott, D. J. (2018). Gaussian variational approximation with sparse precision matrices. Statistics and Computing, 28:259–275.
  • Tomasetti et al., (2022) Tomasetti, N., Forbes, C., and Panagiotelis, A. (2022). Updating variational Bayes: Fast sequential posterior inference. Statistics and Computing, 32(4).
  • Tran et al., (2017) Tran, M.-N., Nott, D. J., and Kohn, R. (2017). Variational Bayes with intractable likelihood. Journal of Computational and Graphical Statistics, 26(4):873–882.
  • Villani et al., (2022) Villani, M., Quiroz, M., Kohn, R., and Salomone, R. (2022). Spectral subsampling MCMC for stationary multivariate time series with applications to vector ARTFIMA processes. Econometrics and Statistics.
  • Vu et al., (2024) Vu, B. A., Gunawan, D., and Zammit-Mangion, A. (2024). R-VGAL: a sequential variational Bayes algorithm for generalised linear mixed models. Statistics and Computing, 34(110).
  • Welch, (1967) Welch, P. (1967). The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms. IEEE Transactions on Audio and Electroacoustics, 15(2):70–73.
  • Whittle, (1953) Whittle, P. (1953). Estimation and information in stationary time series. Arkiv för Matematik, 2(5):423–434.
  • Zammit Mangion et al., (2011) Zammit Mangion, A., Sanguinetti, G., and Kadirkamanathan, V. (2011). A variational approach for the online dual estimation of spatiotemporal systems governed by the IDE. IFAC Proceedings Volumes, 44(1):3204–3209. 18th IFAC World Congress.
  • Zammit-Mangion et al., (2012) Zammit-Mangion, A., Sanguinetti, G., and Kadirkamanathan, V. (2012). Variational estimation in spatiotemporal systems from continuous and point-process observations. IEEE Transactions on Signal Processing, 60(7):3449–3459.

S1 The R-VGA algorithm

In this section, we provide a sketch of the derivations for the R-VGA algorithm of Lambert et al., (2022), which serves as the basis for our R-VGA-Whittle algorithm.

The R-VGA algorithm of Lambert et al., (2022) requires iid observations, denoted by $\mathbf{y}_{1:N}\equiv(\mathbf{y}_{1}^{\top},\dots,\mathbf{y}_{N}^{\top})^{\top}$, and approximates the posterior distribution $p(\bm{\theta}\mid\mathbf{y}_{1:N})$ with a Gaussian distribution $N(\bm{\mu},\mathbf{\Sigma})$. By assumption of conditional independence between the observations $\mathbf{y}_{1:N}$ given the parameters $\bm{\theta}$, the KL divergence between the variational distribution $q_{i}(\bm{\theta})$ and the posterior distribution $p(\bm{\theta}\mid\mathbf{y}_{1:i})$ can be expressed as

\begin{align*}
\text{KL}(q_{i}(\bm{\theta})\;\|\;p(\bm{\theta}\mid\mathbf{y}_{1:i})) &\equiv \int q_{i}(\bm{\theta})\log\frac{q_{i}(\bm{\theta})}{p(\bm{\theta}\mid\mathbf{y}_{1:i})}\,\mathrm{d}\bm{\theta}\\
&= \mathbb{E}_{q_{i}}\left(\log q_{i}(\bm{\theta})-\log p(\bm{\theta}\mid\mathbf{y}_{1:i-1})-\log p(\mathbf{y}_{i}\mid\bm{\theta})\right)+\log p(\mathbf{y}_{1:i})-\log p(\mathbf{y}_{1:i-1}).
\end{align*}

In a sequential VB framework, the posterior distribution after incorporating the first $i-1$ observations, $p(\bm{\theta}\mid\mathbf{y}_{1:i-1})$, is approximated by the variational distribution $q_{i-1}(\bm{\theta})$ to give

\[
\text{KL}(q_{i}(\bm{\theta})\;\|\;p(\bm{\theta}\mid\mathbf{y}_{1:i}))\approx\mathbb{E}_{q_{i}}(\log q_{i}(\bm{\theta})-\log q_{i-1}(\bm{\theta})-\log p(\mathbf{y}_{i}\mid\bm{\theta}))+\log p(\mathbf{y}_{1:i})-\log p(\mathbf{y}_{1:i-1}). \tag{S1}
\]

As the last two terms on the right-hand side of (S1) do not depend on $\bm{\theta}$, the KL-minimisation problem is equivalent to solving

\[
\operatorname*{arg\,min}_{\bm{\mu}_{i},\mathbf{\Sigma}_{i}}\,\mathbb{E}_{q_{i}}(\log q_{i}(\bm{\theta})-\log q_{i-1}(\bm{\theta})-\log p(\mathbf{y}_{i}\mid\bm{\theta})). \tag{S2}
\]

Differentiating the expectation in (S2) with respect to $\bm{\mu}_{i}$ and $\mathbf{\Sigma}_{i}$, setting the derivatives to zero, and rearranging the resulting equations yields the following recursive updates for the variational mean $\bm{\mu}_{i}$ and precision matrix $\mathbf{\Sigma}_{i}^{-1}$:

\begin{align}
\bm{\mu}_{i} &= \bm{\mu}_{i-1}+\mathbf{\Sigma}_{i-1}\nabla_{\bm{\mu}_{i}}\mathbb{E}_{q_{i}}(\log p(\mathbf{y}_{i}\mid\bm{\theta})), \tag{S3}\\
\mathbf{\Sigma}_{i}^{-1} &= \mathbf{\Sigma}_{i-1}^{-1}-2\nabla_{\mathbf{\Sigma}_{i}}\mathbb{E}_{q_{i}}(\log p(\mathbf{y}_{i}\mid\bm{\theta})). \tag{S4}
\end{align}

Then, using Bonnet’s Theorem (Bonnet,, 1964) on (S3) and Price’s Theorem (Price,, 1958) on (S4), we rewrite the gradient terms as

\begin{align}
\nabla_{\bm{\mu}_{i}}\mathbb{E}_{q_{i}}(\log p(\mathbf{y}_{i}\mid\bm{\theta})) &= \mathbb{E}_{q_{i}}(\nabla_{\bm{\theta}}\log p(\mathbf{y}_{i}\mid\bm{\theta})), \tag{S5}\\
\nabla_{\mathbf{\Sigma}_{i}}\mathbb{E}_{q_{i}}(\log p(\mathbf{y}_{i}\mid\bm{\theta})) &= \frac{1}{2}\mathbb{E}_{q_{i}}(\nabla_{\bm{\theta}}^{2}\log p(\mathbf{y}_{i}\mid\bm{\theta})). \tag{S6}
\end{align}

Thus the updates (S3) and (S4) become

\begin{align}
\bm{\mu}_{i} &= \bm{\mu}_{i-1}+\mathbf{\Sigma}_{i-1}\mathbb{E}_{q_{i}}(\nabla_{\bm{\theta}}\log p(\mathbf{y}_{i}\mid\bm{\theta})), \tag{S7}\\
\mathbf{\Sigma}_{i}^{-1} &= \mathbf{\Sigma}_{i-1}^{-1}-\mathbb{E}_{q_{i}}(\nabla_{\bm{\theta}}^{2}\log p(\mathbf{y}_{i}\mid\bm{\theta})). \tag{S8}
\end{align}

These updates are implicit, as they require the evaluation of expectations with respect to $q_{i}(\bm{\theta})$. Under the assumption that $q_{i}(\bm{\theta})$ is close to $q_{i-1}(\bm{\theta})$, Lambert et al., (2022) propose replacing $q_{i}(\bm{\theta})$ with $q_{i-1}(\bm{\theta})$ in (S7) and (S8), and replacing $\mathbf{\Sigma}_{i-1}$ with $\mathbf{\Sigma}_{i}$ on the right-hand side of (S7), to yield the explicit scheme

\begin{align}
\bm{\mu}_{i} &= \bm{\mu}_{i-1}+\mathbf{\Sigma}_{i}\mathbb{E}_{q_{i-1}}(\nabla_{\bm{\theta}}\log p(\mathbf{y}_{i}\mid\bm{\theta})), \tag{S9}\\
\mathbf{\Sigma}_{i}^{-1} &= \mathbf{\Sigma}_{i-1}^{-1}-\mathbb{E}_{q_{i-1}}(\nabla_{\bm{\theta}}^{2}\log p(\mathbf{y}_{i}\mid\bm{\theta})). \tag{S10}
\end{align}

Equations (S9) and (S10) form the so-called R-VGA algorithm of Lambert et al., (2022).
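As an illustration of the explicit scheme, the sketch below implements (S9) and (S10) with Monte Carlo estimates of the expectations under $q_{i-1}$, for a toy model $y_i \sim N(\theta, 1)$ with a scalar parameter. The toy likelihood, sample sizes, and function names are our own assumptions for illustration, not part of the algorithm's specification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy likelihood (an assumption for illustration): y_i ~ N(theta, 1), so
# grad_theta log p(y_i | theta) = y_i - theta and the Hessian is -1.
def grad_log_lik(y, theta):
    return y - theta

def hess_log_lik(y, theta):
    return -np.ones_like(theta)

def rvga_step(mu, prec, y, S=1000):
    """One explicit R-VGA update, following (S9) and (S10)."""
    # Draw S Monte Carlo samples from q_{i-1} = N(mu, 1/prec)
    theta = mu + rng.standard_normal(S) / np.sqrt(prec)
    # (S10): precision update with the expected negative Hessian
    prec_new = prec - np.mean(hess_log_lik(y, theta))
    # (S9): mean update uses the *new* covariance Sigma_i = 1/prec_new
    mu_new = mu + np.mean(grad_log_lik(y, theta)) / prec_new
    return mu_new, prec_new

# Sequentially process 500 observations simulated with theta = 2,
# starting from a N(0, 1) prior on theta
y_all = 2.0 + rng.standard_normal(500)
mu, prec = 0.0, 1.0
for y in y_all:
    mu, prec = rvga_step(mu, prec, y)
```

After one pass over the data, the variational mean should settle near the data-generating value of $\theta$, with the precision growing by one per observation in this toy case.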

S2 Additional details on the SV model

In this section, we follow Ruiz, (1994) and show that the terms $\{\xi_{t}\}_{t=1}^{T}$ in the log-squared transformation (27) in the main paper have mean zero and variance $\pi^{2}/2$.

Since $\epsilon_{t}\sim N(0,1)$, we have $\epsilon_{t}^{2}\sim\chi^{2}_{1}$; that is, $\epsilon^{2}_{t}$ is gamma distributed with shape and scale parameters equal to $1/2$ and $2$, respectively. Therefore, $\log(\epsilon_{t}^{2})$ has mean $\psi(1/2)+\log(2)$, where $\psi(\cdot)$ denotes the digamma function (the first derivative of the logarithm of the gamma function), and variance $\psi^{(1)}\left(\frac{1}{2}\right)=\frac{\pi^{2}}{2}$, where $\psi^{(1)}(\cdot)$ is the trigamma function (the second derivative of the logarithm of the gamma function; see, e.g., Abramowitz and Stegun, (1968)). Thus

\[
\mathbb{E}(\xi_{t})=\mathbb{E}\left(\log(\epsilon_{t}^{2})-\mathbb{E}(\log(\epsilon_{t}^{2}))\right)=0,
\]

and

\[
\operatorname{Var}(\xi_{t})=\operatorname{Var}\left(\log(\epsilon_{t}^{2})-\mathbb{E}(\log(\epsilon_{t}^{2}))\right)=\operatorname{Var}\left(\log(\epsilon_{t}^{2})\right)=\frac{\pi^{2}}{2}.
\]
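These moments are easily checked by simulation. The sketch below draws standard normal variates and compares the empirical mean and variance of $\xi_t$ to the theoretical values; the sample size and seed are arbitrary choices, and we use $\psi(1/2) = -\gamma_E - 2\log 2$ (with $\gamma_E$ the Euler–Mascheroni constant) to evaluate the theoretical mean numerically.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate epsilon_t ~ N(0, 1) and form xi_t = log(eps_t^2) - E[log(eps_t^2)]
eps = rng.standard_normal(2_000_000)

# E[log(eps_t^2)] = psi(1/2) + log 2 = -gamma_E - log 2, using
# psi(1/2) = -gamma_E - 2 log 2 (gamma_E is the Euler-Mascheroni constant)
gamma_E = 0.5772156649015329
mean_log_eps_sq = -gamma_E - np.log(2.0)   # approximately -1.2704

xi = np.log(eps**2) - mean_log_eps_sq

sample_mean = xi.mean()   # should be close to 0
sample_var = xi.var()     # should be close to pi^2 / 2, approximately 4.9348
```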

S3 Experiments involving different blocking strategies

In this section, we discuss the tuning parameters in Algorithm 3: the choice of block length $B$, and the number of frequencies to be processed individually before blocking, $\tilde{n}$. We base our discussion on inspection of the periodogram.

Recall that the periodogram is an estimate of the spectral density and can be viewed as a measure of the relative contributions of the frequencies to the total variance of the process (Percival and Walden,, 1993). For example, Figure S1 shows the periodograms for the linear Gaussian SSM in Section 3.1 and the univariate SV model in Section 3.2. This figure shows that, in general, power decreases as frequency increases. Frequencies that carry less power tend to have less effect on the trajectory of the variational mean; see, for example, the trajectories of the parameters of the linear Gaussian model in Figure S2. The trajectory of $\phi$ moves towards the true parameter value within the first few frequencies, with only minor adjustments afterwards. The trajectories for $\sigma_{\eta}$ and $\sigma_{\epsilon}$ behave in a similar manner, but with more gradual changes. This behaviour can also be observed in Figure S3, in which the trajectories of both parameters in the univariate SV model exhibit large changes in the first few iterations and then stabilise. Higher frequencies that have little effect on the variational mean can thus be processed together in “blocks”. The R-VGA-Whittle updates can then be made based on the Whittle log-likelihood for a “block” of frequencies, which we compute as the sum of the individual log-likelihood contributions evaluated at each frequency within that block.
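The block log-likelihood described above can be sketched as follows, under an AR(1)-plus-observation-noise spectral density for the linear Gaussian SSM. The parameterisation, variable names, and simulation settings are our own illustrative assumptions, not those used in the main paper.

```python
import numpy as np

def spectral_density(omega, phi, sigma_eta, sigma_eps):
    # Spectral density of an AR(1) state observed with white noise
    # (an assumed parameterisation of the linear Gaussian SSM)
    return (sigma_eta**2 / (1 - 2 * phi * np.cos(omega) + phi**2)
            + sigma_eps**2) / (2 * np.pi)

def whittle_block_loglik(I, omegas, phi, sigma_eta, sigma_eps):
    # Whittle log-likelihood of a block of frequencies: the sum of the
    # per-frequency terms -log f(omega_k) - I(omega_k) / f(omega_k)
    f = spectral_density(omegas, phi, sigma_eta, sigma_eps)
    return np.sum(-np.log(f) - I / f)

# Simulate from the assumed model and evaluate the likelihood over one block
rng = np.random.default_rng(3)
n = 5000
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.9 * x[t - 1] + rng.standard_normal()
y = x + 0.5 * rng.standard_normal(n)

k = np.arange(1, n // 2)                            # Fourier frequency indices
omegas = 2 * np.pi * k / n
I = np.abs(np.fft.fft(y)[k])**2 / (2 * np.pi * n)   # periodogram ordinates

ll_true = whittle_block_loglik(I, omegas, 0.9, 1.0, 0.5)
ll_wrong = whittle_block_loglik(I, omegas, 0.2, 1.0, 0.5)
```

Because the block log-likelihood is a plain sum over frequencies, processing a block in one R-VGA-Whittle update only changes how many per-frequency terms enter the gradient and Hessian at once.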

(a) Linear Gaussian model.
(b) Univariate SV model.
Figure S1: Plot of raw periodograms of the data (black) and smoothed periodograms (light red) for two models. Vertical lines show the frequency cutoffs at half-power (3dB, red dashed line), one-fifth-power (7dB, blue dotted line) and one-tenth-power (10dB, purple dash-dotted line).
Figure S2: Trajectories of the variational mean for the parameters in the LGSS model estimated using damped R-VGA-Whittle (Algorithm 2) with no blocking. The true parameter values are marked with dashed lines.
Figure S3: Trajectories of the variational mean for the parameters in the univariate SV model estimated using damped R-VGA-Whittle (Algorithm 2) with no blocking. The true parameter values are marked with dashed lines.

It is important to choose an appropriate point at which to start blocking, as well as an appropriate length for each block of frequencies. To determine the frequency at which to begin blocking, we borrow a concept from physics and electrical engineering called the 3dB frequency cutoff, which is the frequency on the power spectrum at which the power drops to approximately half the maximum power of the spectrum. The value 3 comes from the fact that, on the decibel scale, a drop from a power level $P_{1}$ to another level $P_{2}=\frac{1}{2}P_{1}$ results in a drop of 3 decibels. The change in decibels is calculated as

\[
\Delta\text{dB}=10\log_{10}\frac{P_{2}}{P_{1}}=10\log_{10}\left(\frac{1}{2}\right)\approx-3.0103.
\]

We employ a similar idea here, where we define the “cutoff” frequency as the frequency at which the power drops to half of the maximum power in a smoothed version of the periodogram. Specifically, we seek the frequency $\omega^{*}$ such that $\tilde{I}(\omega^{*})=\frac{1}{2}\max_{\omega}\tilde{I}(\omega)$, where $\tilde{I}(\omega)$ is the periodogram smoothed using Welch’s method (Welch,, 1967). Briefly, Welch’s method works by dividing the time series into equal segments, computing the periodogram for each segment, and then averaging over these segments to obtain a smoother periodogram with lower variance. We also experiment with other cutoffs, such as the frequencies at which the power is one-fifth and one-tenth of the maximum power in the smoothed periodogram, which correspond to drops of approximately 7dB and 10dB, respectively. We mark these cutoff frequencies in Figure S1.
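The smoothing and cutoff selection described above can be sketched as follows. This is a minimal, non-overlapping-segment version of Welch's method; the segment length, the AR(1) example, and all names below are our own assumptions, since the paper does not specify these implementation details.

```python
import numpy as np

def welch_periodogram(y, seg_len=256):
    """Smoothed periodogram via a basic Welch scheme: split the series into
    non-overlapping segments, compute each segment's periodogram, average."""
    n_seg = len(y) // seg_len
    segs = y[: n_seg * seg_len].reshape(n_seg, seg_len)
    ft = np.fft.rfft(segs - segs.mean(axis=1, keepdims=True), axis=1)
    return (np.abs(ft)**2 / seg_len).mean(axis=0)

def db_cutoff_index(smoothed, drop_db=3.0):
    """First frequency index (after the spectral peak) where the smoothed
    power falls drop_db decibels below its maximum (3 dB = half power)."""
    threshold = smoothed.max() * 10 ** (-drop_db / 10)
    start = int(np.argmax(smoothed))
    below = np.where(smoothed[start:] < threshold)[0]
    return start + int(below[0]) if below.size else len(smoothed)

# Example: an AR(1) process, whose spectrum decays with frequency
rng = np.random.default_rng(2)
y = np.zeros(10_000)
for t in range(1, len(y)):
    y[t] = 0.9 * y[t - 1] + rng.standard_normal()

I_tilde = welch_periodogram(y)
n_tilde = db_cutoff_index(I_tilde, drop_db=3.0)  # candidate point to start blocking
```

The returned index plays the role of $\tilde{n}$: frequencies below it are processed one at a time, and the remainder are grouped into blocks.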

Using each of the 3dB, 7dB and 10dB cutoffs, we run R-VGA-Whittle with no blocking, and then with varying block sizes ($B=10,50,100,300,500$, and $1000$ frequencies), and examine the posterior densities of the parameters under these different settings. For all runs, we keep the settings for the other tuning parameters the same as in Section 3 and use $S=1000$ Monte Carlo samples, $n_{damp}=5$ damping observations, and $D=100$ damping steps. For the linear Gaussian SSM, we plot the R-VGA-Whittle posteriors for these different cutoffs and block sizes in Figures S4(a), S4(b) and S4(c). The corresponding plots for the univariate SV model are shown in Figures S5(a), S5(b), and S5(c). The plots for the bivariate SV model are similar to those from the univariate SV model and are not shown.

We find that for the linear Gaussian SSM, block sizes of $100$ and below result in R-VGA-Whittle posterior densities that are very similar to each other, similar to the posterior densities obtained without blocking, and also similar to the posterior densities from HMC-Whittle and HMC-exact. For block sizes of $300$ and $500$, the posterior densities of the parameters $\sigma_{\eta}$ and $\sigma_{\epsilon}$ begin to differ slightly from the posterior densities obtained without blocking. When the block size is $1000$, the posterior densities for all three parameters become noticeably biased, although this bias reduces when the frequency cutoff point increases, as expected. However, for the SV model, the R-VGA-Whittle posterior densities are highly similar for all block sizes. To achieve the best reduction in computing time while retaining estimation accuracy, we choose to begin blocking at the earliest cutoff point we consider in this experiment (3dB), with the block size set to $100$.

(a) Blocking begins at the 3dB cutoff point.
(b) Blocking begins at the 7dB cutoff point.
(c) Blocking begins at the 10dB cutoff point.
Figure S4: Posterior densities of the parameters of the linear Gaussian SSM estimated using R-VGA-Whittle without blocking (denoted as “None”) and with blocking starting at different cutoff points for several block sizes (coloured lines), HMC-Whittle (dashed lines), and HMC-exact (dotted lines). True parameter values are denoted using vertical dashed lines.
(a) Blocking begins at the 3dB cutoff point.
(b) Blocking begins at the 7dB cutoff point.
(c) Blocking begins at the 10dB cutoff point.
Figure S5: Posterior densities of the parameters of the univariate SV model estimated using R-VGA-Whittle without blocking (denoted as “None”) and with blocking starting at different cutoff points for several block sizes (coloured lines), HMC-Whittle (dashed lines), and HMC-exact (dotted lines). True parameter values are denoted using vertical dashed lines.

S4 Variance of the R-VGA-Whittle posterior densities for various Monte Carlo sample sizes

In this section, we study the robustness of R-VGA-Whittle posterior distributions to different values of $S$, the Monte Carlo sample size used to approximate the expectations of the gradient and Hessian in Algorithm 2. We consider sample sizes $S=100$, $S=500$ and $S=1000$. We apply blocking as in Algorithm 3, and fix the block length to $B=100$. We set $\tilde{n}=176$ for the linear Gaussian SSM and $\tilde{n}=75$ for the univariate SV model, which correspond to the 3dB cutoff points discussed in Section S3.

Figures S6(a)–S6(c) show the posterior densities of the parameters of the linear Gaussian SSM from 10 runs of R-VGA-Whittle with $S=100$, $S=500$, and $S=1000$, respectively, while Figures S7(a)–S7(c) show the corresponding plots for the univariate SV model. As the Monte Carlo sample size increases, the variance in the posterior densities across different runs decreases. When $S=1000$, the variance across different runs is substantially reduced, so we choose this as our Monte Carlo sample size for all examples in Section 3 of the main paper.
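The shrinking run-to-run variance has a simple explanation: the standard deviation of a Monte Carlo average scales as $1/\sqrt{S}$. The sketch below demonstrates this for the gradient estimate, using a toy likelihood of our own choosing (the function names and settings are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(4)

def mc_grad_estimate(mu, sigma, y, S):
    # Monte Carlo estimate of E_q[grad log p(y | theta)] under q = N(mu, sigma^2),
    # for the toy likelihood y ~ N(theta, 1) (an illustrative assumption)
    theta = mu + sigma * rng.standard_normal(S)
    return np.mean(y - theta)

def estimator_std(S, n_rep=2000):
    # Standard deviation of the estimator across repeated draws; theory
    # predicts sigma / sqrt(S), so a tenfold increase in S shrinks the
    # Monte Carlo noise by a factor of about sqrt(10)
    return np.std([mc_grad_estimate(0.0, 1.0, 1.0, S) for _ in range(n_rep)])

std_100 = estimator_std(100)     # roughly 1 / sqrt(100)  = 0.1
std_1000 = estimator_std(1000)   # roughly 1 / sqrt(1000), about 0.032
```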

(a) $S=100$.
(b) $S=500$.
(c) $S=1000$.
Figure S6: Posterior densities of the parameters of the linear Gaussian SSM from 10 runs of R-VGA-Whittle (coloured solid lines) with different Monte Carlo sample sizes $S$, from a single run of HMC-Whittle (dashed lines), and from a single run of HMC-exact (dotted lines). True parameter values are marked with dashed lines.
(a) $S=100$.
(b) $S=500$.
(c) $S=1000$.
Figure S7: Posterior densities of the parameters of the univariate SV model from 10 runs of R-VGA-Whittle (coloured solid lines) with different Monte Carlo sample sizes $S$, from a single run of HMC-Whittle (dashed lines), and from a single run of HMC-exact (dotted lines). True parameter values are marked with dashed lines.

S5 Trace plots

This section shows the HMC-exact and HMC-Whittle trace plots and effective sample size (ESS) for the parameters in all the models we consider. Figure S8 shows trace plots for the linear Gaussian SSM, Figures S9 and S10 show trace plots for the univariate and bivariate SV models with simulated data, and Figures S11 and S12 show trace plots for the univariate and bivariate SV models with real data.

In general, we find that HMC-Whittle tends to produce posterior samples with much lower correlation than HMC-exact. For the same number of posterior samples, HMC-Whittle has a higher ESS than HMC-exact; the ESS of HMC-Whittle is high for all parameters, while HMC-exact tends to produce a higher ESS for the autoregressive parameters ($\phi$ for univariate examples and $\mathbf{\Phi}$ for bivariate examples) than for the variance parameters ($\sigma_{\eta}$ and $\sigma_{\epsilon}$ for the linear Gaussian model, $\sigma_{\eta}$ for the univariate SV model, and $\mathbf{\Sigma}_{\eta}$ for the bivariate SV model). This is because HMC-Whittle only targets the posterior density of the parameters, whereas HMC-exact targets the joint posterior density of the parameters and the states, which is high dimensional. See Table S1 for the ESS corresponding to each parameter in our examples.
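To make the connection between autocorrelation and ESS concrete, the sketch below implements a simple autocorrelation-based ESS estimate, $n / (1 + 2\sum_k \rho_k)$, with the sum truncated at the first non-positive sample autocorrelation. This basic cutoff rule is our own minimal implementation for illustration, not the (more sophisticated) estimator used to produce Table S1.

```python
import numpy as np

def effective_sample_size(x, max_lag=1000):
    """ESS estimate n / (1 + 2 * sum of autocorrelations), truncating the
    sum at the first non-positive sample autocorrelation."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    L = min(max_lag, n - 1)
    acov = np.array([xc[: n - k] @ xc[k:] for k in range(L + 1)]) / n
    rho = acov / acov[0]
    nonpos = np.where(rho[1:] <= 0)[0]
    cut = int(nonpos[0]) + 1 if nonpos.size else L + 1
    tau = 1.0 + 2.0 * rho[1:cut].sum()   # integrated autocorrelation time
    return n / max(tau, 1.0)

# Nearly independent samples vs. highly autocorrelated AR(1) samples
rng = np.random.default_rng(5)
n = 20_000
iid = rng.standard_normal(n)
ar = np.zeros(n)
for t in range(1, n):
    ar[t] = 0.9 * ar[t - 1] + rng.standard_normal()

ess_iid = effective_sample_size(iid)   # close to n
ess_ar = effective_sample_size(ar)     # roughly n * (1 - 0.9) / (1 + 0.9)
```

The AR(1) chain with lag-one correlation 0.9 mimics the sticky variance-parameter chains from HMC-exact, while the iid draws mimic the near-independent HMC-Whittle samples.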

Example | Parameter | ESS (HMC-exact) | ESS (HMC-Whittle)
Linear Gaussian | $\phi$ | 2452.8033 | 8884.775
Linear Gaussian | $\sigma_{\eta}$ | 846.3039 | 7933.656
Linear Gaussian | $\sigma_{\epsilon}$ | 697.3547 | 8163.235
Univariate SV (sim. data) | $\phi$ | 230.00751 | 6854.622
Univariate SV (sim. data) | $\sigma_{\eta}$ | 87.94789 | 6949.066
Bivariate SV (sim. data) | $\Phi_{11}$ | 814.3441 | 13064.54
Bivariate SV (sim. data) | $\Phi_{22}$ | 282.3187 | 12967.59
Bivariate SV (sim. data) | $\Sigma_{\eta_{11}}$ | 346.6532 | 13382.68
Bivariate SV (sim. data) | $\Sigma_{\eta_{21}}$ | 440.2904 | 16141.23
Bivariate SV (sim. data) | $\Sigma_{\eta_{22}}$ | 138.6995 | 12034.28
Univariate SV (real data) | $\phi$ | 304.6549 | 5920.983
Univariate SV (real data) | $\sigma_{\eta}$ | 118.8147 | 5980.517
Bivariate SV (real data) | $\Phi_{11}$ | 730.7380 | 9058.774
Bivariate SV (real data) | $\Phi_{22}$ | 796.0991 | 9242.738
Bivariate SV (real data) | $\Sigma_{\eta_{11}}$ | 206.4234 | 11170.016
Bivariate SV (real data) | $\Sigma_{\eta_{21}}$ | 161.9922 | 9989.325
Bivariate SV (real data) | $\Sigma_{\eta_{22}}$ | 245.5852 | 11127.356
Table S1: The ESS obtained for all parameters considered in the examples described in the main text. The ESS is shown for both HMC-exact and HMC-Whittle, each from 20000 posterior samples.
(a) HMC-exact trace plots (samples not thinned).
(b) HMC-Whittle trace plots (samples not thinned).
Figure S8: Trace plots for the linear Gaussian SSM with simulated data. Red dashed lines denote the true parameter values.
(a) HMC-exact trace plots (samples not thinned).
(b) HMC-exact trace plots (samples thinned by a factor of 100).
(c) HMC-Whittle trace plots (samples not thinned).
Figure S9: Trace plots for the univariate SV model with simulated data. Red dashed lines denote the true parameter values.
(a) HMC-exact trace plots (samples not thinned).
(b) HMC-exact trace plots (samples thinned by a factor of 50).
(c) HMC-Whittle trace plots (samples not thinned).
Figure S10: Trace plots for the bivariate SV model with simulated data. Red dashed lines denote the true parameter values.
(a) HMC-exact trace plots (samples not thinned).
(b) HMC-exact trace plots (samples thinned by a factor of 100).
(c) HMC-Whittle trace plots (samples not thinned).
Figure S11: Trace plots for the univariate SV model applied to exchange rate data.
(a) HMC-exact trace plots (samples not thinned).
(b) HMC-exact trace plots (samples thinned by a factor of 50).
(c) HMC-Whittle trace plots (samples not thinned).
Figure S12: Trace plots for the bivariate SV model applied to exchange rate data.