
The Neural Moving Average Model for Scalable Variational Inference of State Space Models

Thomas Ryder
School of Mathematics, Statistics and Physics, Newcastle University, Newcastle, United Kingdom
Huawei Noah's Ark Lab

Dennis Prangle
School of Mathematics, Statistics and Physics, Newcastle University, Newcastle, United Kingdom

Andrew Golightly
School of Mathematics, Statistics and Physics, Newcastle University, Newcastle, United Kingdom

Isaac Matthews
School of Mathematics, Statistics and Physics, Newcastle University, Newcastle, United Kingdom
Abstract

Variational inference has had great success in scaling approximate Bayesian inference to big data by exploiting mini-batch training. To date, however, this strategy has been most applicable to models of independent data. We propose an extension to state space models of time series data based on a novel generative model for latent temporal states: the neural moving average model. This permits a subsequence to be sampled without drawing from the entire distribution, enabling training iterations to use mini-batches of the time series at low computational cost. We illustrate our method on autoregressive, Lotka-Volterra, FitzHugh-Nagumo and stochastic volatility models, achieving accurate parameter estimation in a short time.

1 Introduction

State space models (SSMs) are a flexible and interpretable model class for sequential data, popular in areas including engineering [Elliott et al., 2008], economics [Zeng and Wu, 2013], epidemiology [Fasiolo et al., 2016] and neuroscience [Paninski et al., 2010]. SSMs assume a latent Markov chain $x$ with states $x_1, x_2, \ldots, x_T$, with data as noisy observations of some or all of these.

Standard inference methods for the parameters, $\theta$, of an SSM require evaluating or estimating the likelihood under various choices of $\theta$, e.g. using a Kalman or particle filter [Särkkä, 2013]. Each such evaluation has $O(T)$ cost at best, and even larger costs may be required to control the variance of likelihood estimates. These methods can thus be impractically expensive for long time series (e.g. $T \gg 10^6$), which are increasingly common in applications such as genomics [Foti et al., 2014] and geoscience [Foreman-Mackey et al., 2017].

In contrast, for models of independent data, one can estimate the log-likelihood using a short mini-batch, at an $O(1)$ cost only. This allows scalable inference methods based on stochastic gradient optimisation, e.g. maximum likelihood or variational inference. The latter introduces a family of approximate densities for the latent variables indexed by $\phi$. One then selects $\phi$ to minimise the Kullback-Leibler divergence from the approximate density to the posterior.

We propose a mini-batch variational inference method for SSMs, for the case of continuous states, i.e. $x_i \in \mathbb{R}^d$. This requires a family of variational approximations $q(\theta, x; \phi)$ with a crucial locality property. It must be possible to sample a subsequence $(x_i)_{a \leq i \leq b}$, to be used as a mini-batch, from the middle of the $x$ sequence at an $O(1)$ cost. Much existing work on flexibly modelling sequence data (e.g. van den Oord et al. 2016, Ryder et al. 2018, Radev et al. 2020) uses an autoregressive model for $x$. Here $x_i$ is generated from some or all $x_j$s with $j < i$, so sampling $x_a$ requires sampling $(x_i)_{i<a}$, and the locality property is not met.

To achieve the locality property we introduce the neural moving average (nMA) model. This is a generative model for sequence data in which a learnable convolutional neural network (CNN) processes (1) an underlying sequence of base $N(0,1)$ variables and (2) the sequence of observed data. The CNN's receptive field has a limited size, rather than encompassing the entire input sequences. Therefore a sample from the nMA model is a type of moving average of the input sequences (1) and (2): $x_i$ is produced from a window of values in the input sequences close to position $i$. This achieves the locality property. Also, by viewing the nMA model as a type of normalising flow [Rezende and Mohamed, 2015, Papamakarios et al., 2019], we show later that the mini-batch samples can be used to unbiasedly estimate the log-density of the whole $x$ chain, which is crucial to implement variational inference.

A trade-off for producing the locality property is that, under a nMA model, $x_i$ and $x_j$ are independent for $|i-j|$ sufficiently large, i.e. if they are far enough apart that their CNN receptive fields do not overlap. Hence using a nMA as the variational approximation assumes no long-range dependence in the posterior for $x$. Despite this, we demonstrate that our approach works well in several examples. These include various challenging observation regimes: sparse observation times, partial observation of $x_i$, low observation variance and a large number of observations. Our flexible variational family produces good posterior estimates in these examples: at best our variational output is indistinguishable from the true posterior.

The remainder of our paper is as follows. Section 2 describes state space models. Section 3 reviews relevant material on normalising flows and presents the nMA model. Section 4 sets out our variational inference method. The nMA model and the resulting inference algorithm are our novel methodological contribution. Section 5 presents our experiments, and Section 6 gives conclusions and opportunities for future work. Code for the paper is available at https://github.com/Tom-Ryder/VIforSSMs.

Related Work

Bayesian inference for SSMs commonly uses sampling-based Markov chain Monte Carlo (MCMC) methods, involving repeated use of Kalman or particle filters [Doucet et al., 2001, Cappé et al., 2010, Särkkä, 2013]. As discussed above, these methods typically become expensive for long time series, with each likelihood estimate requiring an $O(T)$ pass through the data. A recent $O(1)$ sampling-based scheme using a related strategy to ours for scalable SSM inference is Aicher et al. [2019]. This approach uses stochastic gradient MCMC with buffered gradient estimates, which are based on running a particle filter on a short subsequence of data. Like our contribution, this approach neglects long-range dependence.

Aicher et al. [2019], in common with several other papers discussed here, requires an observation for each $x_i$. However many applications involve missing or sparsely observed data. Our generative model can be applied to such settings as it learns to impute $x_i$ values between observations.

Several stochastic optimisation variational inference methods for SSMs have previously been proposed, with different variational families for $x$, including: a multivariate normal distribution with tridiagonal covariance structure [Archer et al., 2016], a recurrent neural network [Krishnan et al., 2017], an autoregressive distribution [Karl et al., 2014, Ryder et al., 2018], and a particle filter [Hirt and Dellaportas, 2019]. However, all of these methods have an $O(T)$ cost for each iteration of training and/or require storing $O(T)$ parameters.

Foti et al. [2014] also propose an $O(1)$ variational inference method based on mini-batch updates. They consider hidden Markov models – SSMs with discrete states – which allows the $x$ posterior to be derived using a forward-backward algorithm [Rabiner, 1989]. This would usually require forward and backward passes over the full dataset, at cost $O(T)$, but they show that approximating these on short subsequences suffices to perform variational inference. In contrast our paper explores SSMs with continuous states, where a forward-backward algorithm is not available in general [Briers et al., 2010]. Another difference is that Foti et al. [2014] use a variational approximation with independence between $\theta$ and $x$, while our approach avoids this strong assumption.

Parallel WaveNet [van den Oord et al., 2018] similarly uses a normalising-flow-based generative model for sequence data. This incorporates long-range dependence using dilated convolutions, while we use only short-range dependence to allow mini-batch inference. Our local normalising flow is also similar at a high level to a masked convolutional generative flow (MACOW) [Ma et al., 2019]. The novelty of our approach is that we develop this idea to allow fast variational inference for time series, while Ma et al. [2019] focus on density estimation and sampling for image data.

Finally, Ward et al. [2020] successfully apply our method to mechanistic models with Gaussian process priors placed on unobserved forcing terms, including a multi-output system using real-world data, and Gaussian process regression using a Poisson observation model.

2 State Space Models

Notation

Throughout we use $x_i$ to denote an individual state, $x$ to denote the whole sequence of states and $x_{a:b}$ to denote a subsequence $(x_i)_{a \leq i \leq b}$. We use similar notation for sequences represented by other letters. More generally we use $a{:}b$ to represent the sequence $(a, a+1, \ldots, b)$.

2.1 Definition

An SSM is based on a latent Markov chain $x = x_{1:T}$. We focus on the case of continuous states $x_i \in \mathbb{R}^d$. States evolve through a transition density $p(x_i | x_{i-1}, \theta)$, with parameters $\theta \in \mathbb{R}^p$. We assume the initial state is $x_0(\theta)$, a deterministic function of $\theta$. (This allows examples with the initial state known – $x_0$ is a constant – or unknown – $x_0$ depends on unknown parameters.) Observations $y_i \in \mathbb{R}^{d_y}$ are available for $i \in \mathcal{S} \subseteq 0{:}T$ following an observation density $p(y_i | x_i, \theta)$.

In the Bayesian framework, after specifying a prior density $p(\theta)$, interest lies in the posterior density

p(\theta,x|y) \propto p(\theta,x,y) = p(\theta)\prod_{i=1}^{T}p(x_{i}|x_{i-1},\theta)\prod_{i\in\mathcal{S}}p(y_{i}|x_{i},\theta). (1)

2.2 Discretised Stochastic Differential Equations

One application of SSMs, which we use in our examples, is as discrete approximations to stochastic differential equations (SDEs), as follows:

x_{i+1} = x_{i} + \alpha(x_{i},\theta)\Delta t + \sqrt{\beta(x_{i},\theta)\Delta t}\,\epsilon_{i}, (2)

where the $\epsilon_i \sim N(0, I_d)$ are independent random vectors. Here $\alpha$ is a $d$-dimensional drift vector, $\beta$ is a $d \times d$ positive-definite diffusion matrix and $\sqrt{\beta}$ denotes its Cholesky factor. The state $x_i$ approximates the state of the SDE process at time $i\Delta t$. Taking the limit $\Delta t \to 0$ in an appropriate way recovers the exact SDE [Øksendal, 2003, Särkkä and Solin, 2019].
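To make the discretisation concrete, here is a minimal NumPy sketch of the scheme in (2). The drift, diffusion and parameter values used in the example call are illustrative placeholders, not any of the models considered later in the paper.

```python
import numpy as np

def simulate_discretised_sde(alpha, beta, x0, theta, T, dt, rng):
    """Simulate x_1,...,x_T from (2): x_{i+1} = x_i + alpha*dt + sqrt(beta*dt) eps_i."""
    d = len(x0)
    x = np.empty((T + 1, d))
    x[0] = x0
    for i in range(T):
        drift = alpha(x[i], theta)                          # d-vector
        chol = np.linalg.cholesky(beta(x[i], theta) * dt)   # Cholesky factor of beta * dt
        x[i + 1] = x[i] + drift * dt + chol @ rng.standard_normal(d)
    return x[1:]

# Example: a 1-d Ornstein-Uhlenbeck-style process (illustrative choice only).
alpha = lambda x, th: -th[0] * x
beta = lambda x, th: np.array([[th[1] ** 2]])
path = simulate_discretised_sde(alpha, beta, np.array([1.0]), (0.5, 0.3),
                                T=100, dt=0.1, rng=np.random.default_rng(0))
```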

3 The Neural Moving Average Model

Section 1 gave an intuitive description of the nMA model. In this section we present a formal description. First, Section 3.1 presents background material on inverse autoregressive flows (IAFs). Then Sections 3.2–3.3 describe the nMA model as a special case of an IAF. Section 3.2 describes the case where $x_i$ (a state of the SSM) is scalar, and Section 3.3 extends this to the multivariate case.

3.1 Inverse Autoregressive Flows

A normalising flow represents a random object $x$ as $g_m \circ \ldots \circ g_2 \circ g_1(z)$: a composition of learnable bijections of a base random object $z$. Here we suppose $x = x_{1:T}$ and $x_i \in \mathbb{R}$. (Later we consider $x_i$ as a vector.) We take $z = z_{1:T}$ as independent $N(0,1)$ variables. By the standard change of variable result, the log-density of $x$ is

\log q(x) = \sum_{i=1}^{T}\varphi(z_{i}) - \sum_{j=1}^{m}\log|\det J_{j}| (3)

where $\varphi$ is the $N(0,1)$ log-density function and $J_j$ is the Jacobian matrix of transformation $g_j$ given input $g_{j-1} \circ \ldots \circ g_2 \circ g_1(z)$.

The bijections in an IAF are mainly affine layers, which transform input $z^{\text{in}}$ to output $z^{\text{out}}$ by

z^{\text{out}}_{i} = \mu_{i}(z^{\text{in}}_{1:i-1}) + \sigma_{i}(z^{\text{in}}_{1:i-1})\,z^{\text{in}}_{i}, (4)

with $\sigma_i > 0$. This transformation scales and shifts each $z^{\text{in}}_i$. The shift and scale values, $\mu_i$ and $\sigma_i$, are typically neural network outputs. An efficient approach is to use a single neural network to output all the $\mu_i, \sigma_i$ values for a particular affine layer. This network uses masked dense layers so that $(\mu_i, \sigma_i)$ depends only on $z^{\text{in}}_{1:i-1}$ as required [Germain et al., 2015, Kingma et al., 2016, Papamakarios et al., 2017]. In the resulting IAF each affine layer is based on a different neural network of this form. We'll refer to this as a masked IAF.

The shift and scale functions for $z^{\text{out}}_i$ in (4) have an autoregressive property: they depend on $z^{\text{in}}$ only through $z^{\text{in}}_j$ with $j < i$. Hence the Jacobian matrix of the transformation is diagonal with non-zero entries $\sigma_{1:T}$. The log-density of an IAF made of $m$ affine layers is

\log q(x) = \sum_{i=1}^{T}\varphi(z_{i}) - \sum_{j=1}^{m}\sum_{i=1}^{T}\log\sigma^{j}_{i} (5)

where $\sigma^j_i$ is the scale value for the $i$th input to the $j$th affine layer.

IAFs typically alternate between affine layers and permutation layers, using order-reversing or random permutations. Such layers have Jacobians with absolute determinant 1. Thus the log-density calculation is unchanged (interpreting $j$ in (5) to index the $j$th affine layer, not the $j$th layer of any type). The supplement (Section G.1) details methods to restrict the output of an IAF, e.g. to ensure all $x_i$s are positive.

IAFs are flexible and, for small $T$, allow fast sampling and calculation of a sample's log-density. However they are expensive for large $T$, as large neural networks are needed to map between length-$T$ sequences.
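To make the IAF construction concrete, the following minimal NumPy/SciPy sketch applies a single affine layer (4) and evaluates the corresponding log-density (5) for one layer. Trivially simple shift and scale rules stand in for the masked neural networks described above; all function and variable names are ours.

```python
import numpy as np
from scipy.stats import norm

def affine_layer(z_in, mu_fn, sigma_fn):
    """One IAF affine layer, eq. (4): out_i = mu_i(z_{1:i-1}) + sigma_i(z_{1:i-1}) * z_i."""
    T = len(z_in)
    z_out = np.empty(T)
    log_sigma = np.empty(T)
    for i in range(T):
        prefix = z_in[:i]                        # autoregressive: condition on earlier inputs only
        mu, sigma = mu_fn(prefix), sigma_fn(prefix)
        z_out[i] = mu + sigma * z_in[i]
        log_sigma[i] = np.log(sigma)
    return z_out, log_sigma

# Toy shift/scale rules standing in for the masked networks (illustrative only).
mu_fn = lambda prefix: 0.1 * prefix.sum()
sigma_fn = lambda prefix: np.exp(0.05 * prefix.sum())   # positive scale

rng = np.random.default_rng(1)
z = rng.standard_normal(20)                      # base N(0,1) sequence
x, log_sigma1 = affine_layer(z, mu_fn, sigma_fn)

# Log-density of x under the flow, eq. (5), for a single affine layer (m = 1).
log_q = norm.logpdf(z).sum() - log_sigma1.sum()
```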

3.2 The Neural Moving Average Model

Our neural moving average (nMA) model reduces the number of weights that IAFs require by using a CNN to calculate the $\mu_i$ and $\sigma_i$ values in an affine layer. Thus it can be thought of as a kind of local IAF. Here we explain the main idea by presenting a version for scalar $x_i$. Section 3.3 extends this to the vector $x_i$ case.

To define the nMA model we describe how a single affine layer produces its shift and scale values. The affine layer uses a CNN with input $z^{\text{in}}$, a vector of length $T$. Let $h^k$ represent the $k$th hidden layer of the CNN, a matrix of dimension $(T, n_k)$ where $n_k$ is a tuning choice. The first layer applies a convolution with receptive field length $\ell$. This is an off-centre convolution, so that row $i$ of $h^1$ is a transformation of $z^{\text{in}}_{i-\ell:i-1}$. We use zero-padding by taking $z^{\text{in}}_i = 0$ for $i \leq 0$. The following hidden layers are length-1 convolutions, so row $i$ of $h^{k+1}$ is a transformation of row $i$ of $h^k$. The output, $h^n$, is a matrix of dimension $(T, 2)$ whose $i$th row contains $\mu_i$ and $\sigma_i$. The final layer applies a softplus activation to produce the $\sigma_i$ values, ensuring they are positive. An identity activation is used to produce the $\mu_i$ values. The $\mu_i$ and $\sigma_i$ values are used in (4) to produce the output of the affine layer.
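The essential structure – an off-centre window of length $\ell$ with zero-padding, a softplus to keep $\sigma_i$ positive – can be illustrated with the following NumPy sketch, in which a single random linear map stands in for the trained CNN; names and weights here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def nma_affine_layer(z_in, w_mu, w_sigma, ell):
    """One nMA affine layer: (mu_i, sigma_i) depend only on z_{i-ell:i-1} (off-centre window)."""
    T = len(z_in)
    padded = np.concatenate([np.zeros(ell), z_in])   # zero-padding for positions i <= 0
    z_out = np.empty(T)
    for i in range(T):
        window = padded[i:i + ell]                   # the ell inputs preceding z_in[i]
        mu = w_mu @ window                           # identity activation for the shift
        sigma = np.logaddexp(0.0, w_sigma @ window)  # softplus keeps the scale positive
        z_out[i] = mu + sigma * z_in[i]
    return z_out

ell = 5
rng = np.random.default_rng(2)
w_mu, w_sigma = rng.standard_normal(ell), rng.standard_normal(ell)
x = nma_affine_layer(rng.standard_normal(200), w_mu, w_sigma, ell)
```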

A nMA model composes several affine layers of the form just described. Some properties of the distribution of the output sequence $x$ are:

  1. No long-range dependence: $x_i$ and $x_j$ are independent if $|i-j| > m\ell$, where $m$ is the number of affine layers.

  2. Stationary local dependence: the distributions of $x_{i:j}$ and $x_{i+a:j+a}$ are the same for most choices of $a$. (Subsequences near to the start of $x$ can differ due to zero-padding.)

To improve the flexibility of the nMA model, affine layers can be alternated with order-reversing permutations. (Random permutations would not be suitable, as they would disrupt our ability to sample subsequences quickly, as described in Section 3.4.) Throughout the paper we consider nMA models without order-reversing permutation layers, as we found these models already sufficiently flexible for our examples. (The supplement, Section C, details how to incorporate these layers.)

We relax stationary local dependence by injecting local side information into the CNN, i.e. giving an extra feature vector $s_i$ as input for each position $i$ in the first CNN layer. We also use global side information to allow $x$ to depend on the parameter values $\theta$, i.e. giving $\theta$ as extra input for every position $i$. See the supplement (Sections D, E) for details of the side information we use in practice.

3.3 Multivariate Case

Here we generalise the nMA model to the case where $x_i \in \mathbb{R}^d$. We now let $z$ be a sequence $z_1, z_2, \ldots, z_T$ of independent random $N(0, I_d)$ vectors. A nMA affine layer makes the transformation

z^{\text{out}}_{i} = \mu_{i} + \sigma_{i}\odot z^{\text{in}}_{i}, (6)

scaling the vector $z^{\text{in}}_i$ (elementwise multiplication by the vector $\sigma_i$) then shifting it (adding the vector $\mu_i$).

In the scalar case it was important to allow complex dependencies between $z^{\text{out}}_i$ values. Now we must also allow dependencies within each $z^{\text{out}}_i$ vector. To do so we use coupling layers as in Dinh et al. [2016].

We use an extra $k$ subscript to denote the $k$th component of a vector, e.g. $z^{\text{in}}_{ik}$. We select some $a \approx d/2$. For $k \leq a$, we take $\mu_{ik} = 0$ and $\sigma_{ik} = 1$, so that $z^{\text{out}}_{ik} = z^{\text{in}}_{ik}$. For $k > a$, we compute $\mu_{ik}$ and $\sigma_{ik}$ using a CNN, modifying the scalar case as follows. Now row $i$ of $h^1$ is a transformation of $z^{\text{in}}_{i-\ell:i-1}$ (the $\ell$ vectors preceding $z^{\text{in}}_i$), and also $z^{\text{in}}_{ik}$ for $k \leq a$ (the part of $z^{\text{in}}_i$ not being modified). The output $h^n$ is now a tensor of dimension $(T, d-a, 2)$ containing the $\mu_{ik}$ and $\sigma_{ik}$ values for $k > a$.

This affine layer does not transform the first $a$ components of $z^{\text{in}}_i$. To allow different components to be transformed in each layer, we permute components between affine layers. For example, a $d = 2$ permutation layer transforms $z^{\text{in}}$ to $z^{\text{out}}$ by $z^{\text{out}}_{i1} = z^{\text{in}}_{i2}$, $z^{\text{out}}_{i2} = z^{\text{in}}_{i1}$. The log-density is now

\log q(x) = \sum_{i=1}^{T}\lambda_{i}, \quad \lambda_{i} = \varphi(z_{i}) - \sum_{k=1}^{d}\sum_{j=1}^{m}\log\sigma^{j}_{ik}, (7)

where $\varphi$ is the $N(0, I_d)$ log-density function and $\sigma^j_{ik}$ is the $k$th entry of the scale vector for position $i$ output by the $j$th affine layer. Decomposing $\log q(x)$ into $\lambda_i$ contributions will be useful in Section 4.

3.4 Sampling

Sampling from a nMA model is straightforward. First sample the base random object $z$. This is a sequence of length $T$ (of scalars or vectors – the sampling process is similar in either case). Now apply the IAF's layers to this in turn. To apply an affine layer, pass the input (and any side information) through the layer's CNN to calculate shift and scale values, then apply the affine transformation. The final output is the sampled sequence $x$. The cost of sampling in this way is $O(T)$.

In the next section, we will often wish to sample a short subsequence $x_{u:v}$. It is possible to do this at $O(1)$ cost with respect to $T$. Algorithm A in the supplement gives the details. In brief, the key insight is that $x_{u:v}$ only depends on $z$ through $z_{u-m\ell:v}$. Therefore we sample $z_{u-m\ell:v}$ and apply the layers to this subsequence. The output will contain the correct values of $x_{u:v}$.
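The following NumPy sketch illustrates this locality property with a toy stack of affine layers whose shift and scale are simple linear functions of the previous $\ell$ inputs (standing in for the trained CNNs, and assuming no order-reversing permutations): feeding in only $z_{u-m\ell:v}$ reproduces exactly the same $x_{u:v}$ as processing the full sequence, provided $u$ is far enough from the start of the sequence.

```python
import numpy as np

def nma_flow(z, weights, ell):
    """Stack of m nMA affine layers; each position depends on the previous ell inputs."""
    for (w_mu, w_sigma) in weights:                 # one (shift, scale) weight pair per layer
        padded = np.concatenate([np.zeros(ell), z])
        z = np.array([w_mu @ padded[i:i + ell]
                      + np.logaddexp(0.0, w_sigma @ padded[i:i + ell]) * z[i]
                      for i in range(len(z))])
    return z

rng = np.random.default_rng(3)
T, ell, m = 1000, 5, 3
weights = [(rng.standard_normal(ell), rng.standard_normal(ell)) for _ in range(m)]

# Full sample: O(T).
z_full = rng.standard_normal(T)
x_full = nma_flow(z_full, weights, ell)

# Subsequence sample x_{u:v}: only z_{u - m*ell : v} is needed, so the cost is O(v - u + m*ell).
u, v = 500, 520
x_sub = nma_flow(z_full[u - m * ell: v], weights, ell)
assert np.allclose(x_sub[-(v - u):], x_full[u:v])
```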

4 Variational Inference for SSMs

This section describes how we use nMA models to perform variational inference (VI) efficiently for SSMs. Section 4.1 reviews standard details of VI. See e.g. Blei et al. [2017] for more details. We then present our novel VI derivation involving nMA models in Section 4.2 and the resulting algorithm in Section 4.3.

4.1 Variational Inference Background

We wish to infer the joint posterior density $p(\theta, x | y)$. We introduce a family of approximations indexed by $\phi$, $q(\theta, x; \phi)$. Optimisation is used to find $\phi$ minimising the Kullback-Leibler divergence $KL[q(\theta, x; \phi) \,||\, p(\theta, x | y)]$. This is equivalent to maximising the ELBO (evidence lower bound) [Jordan et al., 1999],

\mathcal{L}(\phi) = E_{\theta,x\sim q}[r(\theta,x,y,\phi)], (8)
\text{for}\quad r(\theta,x,y,\phi) = \log p(\theta,x,y) - \log q(\theta,x;\phi). (9)

Here $r$ is a log-density ratio. The optimal $q(\theta, x; \phi)$ approximates the posterior density. It is typically overconcentrated, unless the approximating family is expressive enough to allow particularly close matches to the posterior.

Optimisation for VI can be performed efficiently using the reparameterisation trick [Kingma and Welling, 2014, Rezende et al., 2014, Titsias and Lázaro-Gredilla, 2014]. That is, $(\theta, x)$ is expressed as the output of an invertible deterministic function $g(\varepsilon, \phi)$ for some random variable $\varepsilon$ with a fixed distribution. Then the ELBO gradient and an unbiased Monte Carlo estimate of it are

\nabla\mathcal{L}(\phi) = E_{\varepsilon}[\nabla r(\theta,x,y,\phi)], (10)
\widehat{\nabla\mathcal{L}}(\phi) = \frac{1}{n}\sum_{j=1}^{n}\nabla r(\theta^{(j)},x^{(j)},y,\phi), (11)

where $(\theta^{(j)}, x^{(j)}) = g(\varepsilon^{(j)}, \phi)$ and $\varepsilon^{(1)}, \ldots, \varepsilon^{(n)}$ are independent $\varepsilon$ samples. This gradient estimate can be used in stochastic gradient optimisation algorithms.
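As a concrete illustration of (10)–(11), the following PyTorch sketch estimates the ELBO gradient for a single Gaussian variational distribution and a toy unnormalised log target; the Gaussian $q$ and the target are illustrative stand-ins for the flows and the SSM joint density used in the paper.

```python
import torch

# Variational parameters phi = (mu, log_sigma) for a Gaussian q(theta; phi).
mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)

# Toy unnormalised log target, standing in for log p(theta, x, y).
log_p = lambda theta: -0.5 * (theta - 2.0) ** 2

n = 50
eps = torch.randn(n, 1)                                   # noise with a fixed distribution
theta = mu + torch.exp(log_sigma) * eps                   # reparameterisation: theta = g(eps, phi)
log_q = torch.distributions.Normal(mu, torch.exp(log_sigma)).log_prob(theta)

r = log_p(theta) - log_q                                  # log-density ratio, eq. (9)
loss = -r.mean()                                          # negative Monte Carlo ELBO estimate
loss.backward()                                           # eq. (11): gradients in mu.grad, log_sigma.grad
```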

4.2 ELBO Derivation

Our variational family for the SSM posterior (1) is

q(\theta,x;\phi) = q(\theta;\phi_{\theta})\,q(x|\theta;\phi_{x}), (12)

where $\phi = (\phi_\theta, \phi_x)$. We use a masked IAF for $q(\theta; \phi_\theta)$ and a nMA model for $q(x | \theta; \phi_x)$. For the latter we inject $\theta$ as side information. See the supplement (Sections D and E) for more details.

The masked IAF maps a base random vector $z_\theta$ to $\theta$ using parameters $\phi_\theta$, as described in Section 3.1. The nMA model maps $\theta$ and a sequence of vectors $z_x$ to $x$ using parameters $\phi_x$, as described in Section 3.3. Hence we have a mapping from $\varepsilon = (z_\theta, z_x)$ to $g(\varepsilon, \phi) = (\theta, x)$, allowing us to use the reparameterisation trick below.

This section derives a mini-batch optimisation algorithm to train $\phi$ based on sampling short $x$ subsequences, so that the cost per training iteration is $O(1)$. The algorithm is applicable for scalar or multivariate $x_i$. In this presentation we assume that $\mathcal{S} = 0{:}T$, i.e. there are observations for all $i$ values. To relax this assumption, remove any terms involving $y_i$ for $i \not\in \mathcal{S}$.

For our variational family (12), the ELBO is (8) with

r = \log p(\theta,x,y) - \log q(\theta;\phi_{\theta}) - \log q(x|\theta;\phi_{x}). (13)

Substituting (1) and (7) into (13) gives

r = \log p(\theta) - \log q(\theta;\phi_{\theta}) + \log p(y_{0}|x_{0},\theta) + \sum_{i=1}^{T}\big\{\log p(x_{i}|x_{i-1},\theta) + \log p(y_{i}|x_{i},\theta) - \lambda_{i}\big\}. (14)

Now introduce batches $B_1, B_2, \ldots, B_b$: length-$M$ sequences of consecutive integers partitioning $1{:}T$. Draw $\kappa$ uniformly from $1{:}b$. Then an unbiased estimate of $r$ is

r_{\kappa} = \log p(\theta) - \log q(\theta;\phi_{\theta}) + \log p(y_{0}|x_{0},\theta) + \frac{T}{M}\sum_{i\in B_{\kappa}}\big\{\log p(x_{i}|x_{i-1},\theta) + \log p(y_{i}|x_{i},\theta) - \lambda_{i}\big\}. (15)

Hence an unbiased estimate of the ELBO gradient is

\widehat{\nabla\mathcal{L}}(\phi) = \frac{1}{n}\sum_{j=1}^{n}\nabla r_{\kappa}(\theta^{(j)},x^{(j)},y,\phi), (16)

where $(\theta^{(j)}, x^{(j)}) = g(\varepsilon^{(j)}, \phi)$ and $\varepsilon^{(1)}, \ldots, \varepsilon^{(n)}$ are independent $\varepsilon$ samples.
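To see why the subsampled sum in (15) is unbiased, the following NumPy sketch checks that the rescaled batch sum, averaged over a uniformly chosen batch, recovers the full sum over $1{:}T$; the per-index terms are random stand-ins for the summands in (14).

```python
import numpy as np

rng = np.random.default_rng(4)
T, M = 1000, 50
terms = rng.standard_normal(T)      # stand-ins for log p(x_i|x_{i-1},theta) + log p(y_i|x_i,theta) - lambda_i
full_sum = terms.sum()              # the sum over i = 1:T appearing in r, eq. (14)

batches = np.arange(T).reshape(T // M, M)            # B_1, ..., B_b partitioning 1:T
estimates = (T / M) * terms[batches].sum(axis=1)     # rescaled batch sums used in r_kappa, eq. (15)

# Averaging over a uniformly chosen kappa recovers the full sum, so r_kappa is unbiased for r.
assert np.isclose(estimates.mean(), full_sum)
```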

4.3 Optimisation Algorithm

Algorithm 1 presents our mini-batch training procedure. Each iteration of Algorithm 1 involves sampling a subsequence of $x$ values of length $M+1$. The cost is $O(1)$ with respect to the total length of the sequence $T$. This compares favourably to the $O(T)$ cost of sampling the entire $x$ sequence. Discussion of implementation details is given in the supplement (Section E).

Algorithm 1 Mini-batch variational inference for state space models
1: Initialise $\phi_\theta, \phi_x$.
2: loop
3:   Sample a batch $\kappa$ uniformly from $1{:}b$. Let $u$ and $v$ denote the endpoints of $B_\kappa$.
4:   Calculate $\widehat{\nabla\mathcal{L}}(\phi)$ from (16), generating the terms in the sum as follows.
5:   for $1 \leq j \leq n$ do
6:     Sample $\theta^{(j)} \sim q(\theta; \phi_\theta)$.
7:     Sample $x^{(j)}_{u-1:v}$ from $q(x | \theta; \phi_x)$ (unless $u = 1$, in which case sample $x_{u:v}$), calculating the corresponding $\lambda_{u:v}$ values. See Algorithm A in the supplement for details. (Note this samples $x_{u-1}$, the state immediately before the current batch of interest, which is needed for the $p(x_i | x_{i-1}, \theta)$ term in (15) when $i = u$.)
8:     Calculate $\nabla r_\kappa(\theta^{(j)}, x^{(j)}, y, \phi)$ using automatic differentiation of (15).
9:   end for
10:  Update $\phi_\theta, \phi_x$ using stochastic gradient optimisation.
11: end loop

5 Experiments

Below we apply our method to several examples. All results were obtained using an NVIDIA Titan XP GPU and an 8-core CPU. For tuning choices and experimental specifics see the supplement (Sections E, F). Sections 5.1–5.3 use simulated data so the results can be compared to true parameter values, while Section 5.4 uses real data.

5.1 AR(1) Model

First we consider the AR(1) model $x_{i+1} = \theta_1 + \theta_2 x_i + \theta_3\epsilon$, with $\epsilon \sim N(0,1)$ and $x_0 = 10$. We assume observations $y_i \sim N(x_i, 1)$ for $i \in 0{:}T$, and independent $N(0, 10^2)$ priors on $\theta_1, \theta_2, \log\theta_3$. We use this model to investigate how our method scales with larger $T$, and the effect of receptive field length $\ell$. To judge the accuracy of our results we compare to near-exact posterior inference using MCMC, as described in the supplement (Section A).
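A minimal NumPy sketch of how synthetic data of this form can be generated (the function name is ours, not the paper's code):

```python
import numpy as np

def simulate_ar1(theta, x0, T, obs_sd, rng):
    """Simulate the AR(1) state x_{i+1} = th1 + th2*x_i + th3*eps and observations y_i ~ N(x_i, obs_sd^2)."""
    th1, th2, th3 = theta
    x = np.empty(T + 1)
    x[0] = x0
    for i in range(T):
        x[i + 1] = th1 + th2 * x[i] + th3 * rng.standard_normal()
    y = x + obs_sd * rng.standard_normal(T + 1)   # observations for i = 0, ..., T
    return x, y

x, y = simulate_ar1((5.0, 0.5, 3.0), x0=10.0, T=5000, obs_sd=1.0,
                    rng=np.random.default_rng(5))
```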

Effect of Observation Sequence Length

We simulated a synthetic dataset for each of four $T$ values: 5000, 10000, 50000, 100000, under true parameter values $\theta = (5.0, 0.5, 3.0)$. We then inferred $\theta$, fixing the hyperparameters so that the cost per iteration for each setting is constant.

Figure 1(a) plots the accuracy of our results against the number of iterations performed. Accuracy is measured as Maximum Mean Discrepancy (MMD) [Gretton et al., 2012] between the variational approximation and MCMC output. (We use MMD with a Gaussian kernel.) In all cases, variational inference approximates the posterior well. Also, the number of training iterations required remains similar as $T$ increases. As a further check on the quality of the posterior approximation, Figure 1(b) shows a good match between marginal posteriors for MCMC and variational output for the case $T = 5000$. Here, as for other $T$ values, the 10,000th iteration is achieved after $\sim 3$ minutes of computation. In comparison, the cost per iteration of MCMC is roughly proportional to $T$.

Effect of Receptive Field Length

Figure 1: AR(1) results. (a) MMD between variational and MCMC output for $\theta$. (b) Marginal posterior density plots of MCMC output (blue) and variational output after 10,000 iterations (black). (c,d) MMD between variational and MCMC output for $\theta$ for a range of receptive field lengths $\ell$. The horizontal axis shows (c) number of training iterations, (d) wall-clock training time.

We consider again the $T = 5000$ dataset, and investigate the effect of $\ell$. Figure 1(c,d) shows MMD against iteration and wall-clock time for $\ell \in \{5, 10, 50, 100, 200\}$. In all cases the variational output converges to a good approximation of the posterior. Convergence takes a similar number of iterations for all choices of $\ell$, but wall-clock time per iteration increases with $\ell$.

5.2 Lotka-Volterra

Figure 2: Lotka-Volterra results. Left: marginal density plots for setting (a) (dense observations), comparing MCMC (blue) to variational (black) output. Right: variational output for setting (b) (sparse observations). Right top: 100 $x$ samples, with observations displayed as crosses. The horizontal axis shows $t = 0.1i$. Right bottom: marginal density plots for $\theta$ with true values displayed with dashed black lines. All variational results used 30,000 training iterations.

Next we test our method on short time series with complex dynamics. We use a version of the Lotka-Volterra model for predator-prey population dynamics under three events: prey reproduction, predation (in which prey are consumed and predators get resources to reproduce) and predator death. An SDE Lotka-Volterra model (for a derivation see e.g. Golightly and Wilkinson [2011]) is defined by drift and diffusion

\alpha(x,\theta) = \begin{pmatrix}\theta_{1}u-\theta_{2}uv\\ \theta_{2}uv-\theta_{3}v\end{pmatrix}, (17)
\beta(x,\theta) = \begin{pmatrix}\theta_{1}u+\theta_{2}uv&-\theta_{2}uv\\ -\theta_{2}uv&\theta_{2}uv+\theta_{3}v\end{pmatrix}, (18)

where $x = (u, v)$ represents the population sizes of prey and predators. The parameters $\theta = (\theta_1, \theta_2, \theta_3)$ control the rates of the three events described above.

We consider a discretised version of this SDE, as described in Section 2.2, with $\Delta t = 0.1$ and $u_0 = v_0 = 100$. We simulated realisations of $x_i$ for $i \in 1{:}500$ under parameters $\theta = (0.5, 0.0025, 0.3)$, and constructed two datasets with observations at: (a) $i = 0, 10, 20, \ldots, 500$ (dense); (b) $i = 0, 100, 200, \ldots, 500$ (sparse). We assume noisy observations $y_i \sim N(x_i, I_2)$ and independent $N(0, 10^2)$ priors for $\log\theta_1, \log\theta_2, \log\theta_3$.
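For concreteness, the NumPy sketch below simulates the discretised Lotka-Volterra SDE and the dense observation set (a). It is an illustrative forward simulation, not the paper's code, and the crude clamp on the populations is only there to keep the Cholesky factor well defined in this sketch (the paper instead handles positivity through the flow; see Section G of the supplement).

```python
import numpy as np

def lv_alpha(x, theta):
    """Lotka-Volterra drift, eq. (17); x = (u, v) = (prey, predators)."""
    u, v = x
    th1, th2, th3 = theta
    return np.array([th1 * u - th2 * u * v, th2 * u * v - th3 * v])

def lv_beta(x, theta):
    """Lotka-Volterra diffusion matrix, eq. (18)."""
    u, v = x
    th1, th2, th3 = theta
    return np.array([[th1 * u + th2 * u * v, -th2 * u * v],
                     [-th2 * u * v, th2 * u * v + th3 * v]])

rng = np.random.default_rng(6)
dt, theta = 0.1, (0.5, 0.0025, 0.3)
x = np.array([100.0, 100.0])                       # u_0 = v_0 = 100
path = [x]
for i in range(500):
    chol = np.linalg.cholesky(lv_beta(x, theta) * dt)
    x = x + lv_alpha(x, theta) * dt + chol @ rng.standard_normal(2)
    x = np.maximum(x, 1e-3)                        # crude positivity guard for this sketch only
    path.append(x)

dense_obs = np.array(path)[::10] + rng.standard_normal((51, 2))   # y_i ~ N(x_i, I_2) at i = 0, 10, ..., 500
```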

Unlike previous examples, we needed to restrict $u_i$ and $v_i$ to be positive. Also, we found multiple posterior modes, and needed to pretrain carefully to control which we converged to. See Section G of the supplement for more details of both these issues.

For observation setting (a), we compared our results to near-exact posterior samples from MCMC [Golightly and Wilkinson, 2008, Fuchs, 2013]. These papers use a Metropolis-within-Gibbs MCMC scheme with carefully chosen proposal constructs. Designing suitable proposals can be challenging, particularly in sparse observation regimes [Whitaker et al., 2017]. Consequently we were unable to use MCMC in setting (b).

Figure 2 (left plot) displays the visual similarity between marginal density estimates from variational and MCMC output in setting (a). The VI output is taken from the 30,000th iteration, after $\approx 10$ minutes of computation. Figure 2 (right plots) shows variational output for $\theta$ and $x$ in setting (b) after $\approx 20$ minutes of computation. These results are consistent with the ground truth parameter values and $x$ path. VI using an autoregressive distribution for $x$ has also performed well in a similar scenario [Ryder et al., 2018], but required more training time (roughly 2 hours).

5.3 FitzHugh-Nagumo

Figure 3: FitzHugh-Nagumo results. Left: 100 variational posterior samples (light green) and true values (dark green) for the unobserved coordinate $w$ over a short time range. The $x$-axis shows $t = 0.1i$. Right: marginal density plots of variational output for $\theta$ and true values (dashed black lines).
Figure 4: Stochastic volatility results. Left: the observed returns process (blue line) and 50 samples from the variational posterior for the latent volatility path (green lines). Right: marginal density plots of the variational posterior for $\theta$.

Here we test our method on a long time series with an unobserved component, using the FitzHugh-Nagumo model. An SDE version, following Jensen et al. [2012] and van der Meulen and Schauer [2017], is defined by drift and diffusion

\alpha(x,\theta) = \begin{pmatrix}\theta_{1}\left(-v^{3}+v-w+\theta_{2}\right)\\ \theta_{3}v-w+1.4\end{pmatrix}, (19)
\beta(x,\theta) = \begin{pmatrix}\theta_{4}&0\\ 0&\theta_{5}\end{pmatrix}, (20)

where $x = (v, w)$ represents the current membrane potential and a latent recovery variable.

We consider a discretised version of this SDE, as in Section 2.2, with $\Delta t = 0.1$, $v_0 = 2$, $w_0 = 3$. We simulate synthetic data under parameter values $\theta = (2.0, 1.0, 1.5, 0.5, 0.3)$ up to $T = 1,000,000$, recording observations at every $i$ to mimic a high-frequency observation scenario. We assume independent observations $y_i \sim N(v_i, 0.1^2)$ and independent $N(0, 10^2)$ priors for $\log\theta_1, \theta_2, \theta_3, \log\theta_4, \log\theta_5$.

Figure 3 displays estimates of the unobserved component $w$, and marginal density estimates for $\theta$. The results are consistent with the ground truth parameter values and $w$ path. The approximate posterior is sampled after roughly 180 minutes of training.

5.4 Log-Gaussian Stochastic Volatility

We analyse real data under a log-Gaussian stochastic volatility model, presented as a discretised SDE with drift and diffusion

\alpha(x,\theta) = \begin{pmatrix}\theta_{1}r\\ \theta_{2}-\theta_{3}z\end{pmatrix}, \qquad \beta(x,\theta) = \begin{pmatrix}re^{z}&0\\ 0&\theta_{4}^{2}\end{pmatrix}, (21)

where $x = (r, z)$ is the returns process and latent volatility factor, respectively. Similar discrete-time models have been analysed by Andersen and Lund [1997], Eraker [2001], Durham and Gallant [2002], but we use the form presented in Golightly and Wilkinson [2006].

Similarly to Golightly and Wilkinson [2006], we use 1508 weekly observations on the three-month U.S. Treasury bill rate for August 13, 1967 – August 30, 1996, and perform inference under independent $N(0, 10^2)$ priors for $\theta_1, \theta_2, \log\theta_3, \log\theta_4$ and $z_0$. We assume the returns process is fully observed without error and set $\Delta t = 1.0$. Our analysis took around 20 minutes on a single GPU. Figure 4 shows the results, which are consistent with those obtained from the MCMC analysis in Golightly and Wilkinson [2006].

6 Conclusion

We present a variational inference method for state space models based on a neural moving average model. This is designed to model complex dependence in the conditional posterior $p(x | \theta, y)$ and to be scalable to long time series. In particular, it allows mini-batch inference where each training iteration has $O(1)$ cost. We show that our method works well in several applications with challenging features including: an unobserved state component, sparse observation times, and a large number of observations. Further applications can be found in Ward et al. [2020].

Future work could investigate changing several aspects of the flow: alternating affine transformations with order reversing permutations; allowing some long-range dependence using a multi-scale architecture; incorporating recently proposed ideas from the literature [Durkan et al., 2019].

Contributions

T. Ryder and D. Prangle conceived the idea and wrote the paper. T. Ryder and I. Matthews wrote the code and performed the experiments. A. Golightly advised on SSM models and methods, and wrote Appendix A.

Acknowledgements.
Thomas Ryder and Isaac Matthews are supported by the Engineering and Physical Sciences Research Council, Centre for Doctoral Training in Cloud Computing for Big Data (grant number EP/L015358/1). We acknowledge with thanks an NVIDIA academic GPU grant for this project. We thank Wil Ward and Stephen McGough for helpful discussions, and anonymous reviewers for useful comments.

References

  • Aicher et al. [2019] Christopher Aicher, Yi-An Ma, Nicholas J. Foti, and Emily B Fox. Stochastic gradient MCMC for state space models. SIAM Journal on Mathematics of Data Science, 1(3):555–587, 2019.
  • Andersen and Lund [1997] Torben G. Andersen and Jesper Lund. Estimating continuous-time stochastic volatility models of the short-term interest rate. Journal of Econometrics, 77(2), 1997.
  • Archer et al. [2016] Evan. Archer, Il M. Park, Lars Buesing, John Cunningham, and Liam Paninski. Black box variational inference for state space models. In International Conference on Learning Representations, Workshops, 2016.
  • Blei et al. [2017] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112:859–877, 2017.
  • Briers et al. [2010] Mark Briers, Arnaud Doucet, and Simon Maskell. Smoothing algorithms for state–space models. Annals of the Institute of Statistical Mathematics, 62(1):61, 2010.
  • Cappé et al. [2010] Olivier Cappé, Eric Moulines, and Tobias Ryden. Inference in Hidden Markov Models. Springer, 2010.
  • Dinh et al. [2016] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. arXiv preprint arXiv:1605.08803, 2016.
  • Doucet et al. [2001] Arnaud Doucet, Nando de Freitas, and Neil Gordon. An Introduction to Sequential Monte Carlo Methods. Springer New York, 2001.
  • Durham and Gallant [2002] Garland B Durham and A. Ronald Gallant. Numerical techniques for maximum likelihood estimation of continuous-time diffusion processes. Journal of Business & Economic Statistics, 20(3):297–338, 2002.
  • Durkan et al. [2019] Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. In Advances in Neural Information Processing Systems, 2019.
  • Elliott et al. [2008] Robert J Elliott, Lakhdar Aggoun, and John B Moore. Hidden Markov models: estimation and control. Springer, 2008.
  • Eraker [2001] Bjørn Eraker. MCMC analysis of diffusion models with application to finance. Journal of Business & Economic Statistics, 19(2):177–191, 2001.
  • Fasiolo et al. [2016] Matteo Fasiolo, Natalya Pya, and Simon N. Wood. A comparison of inferential methods for highly nonlinear state space models in ecology and epidemiology. Statistical Science, 31:96–118, 2016.
  • Foreman-Mackey et al. [2017] Daniel Foreman-Mackey, Eric Agol, Sivaram Ambikasaran, and Ruth Angus. Fast and scalable Gaussian process modeling with applications to astronomical time series. The Astronomical Journal, 154(6):220, 2017.
  • Foti et al. [2014] Nick Foti, Jason Xu, Dillon Laird, and Emily Fox. Stochastic variational inference for hidden Markov models. In Advances in Neural Information Processing Systems, 2014.
  • Fuchs [2013] Christiane Fuchs. Inference for Diffusion Processes: With Applications in Life Sciences. Springer, 2013.
  • Germain et al. [2015] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, 2015.
  • Golightly and Wilkinson [2006] Andrew Golightly and Darren J. Wilkinson. Bayesian sequential inference for nonlinear multivariate diffusions. Statistics and Computing, 16(4), 2006.
  • Golightly and Wilkinson [2008] Andrew Golightly and Darren J. Wilkinson. Bayesian inference for nonlinear multivariate diffusion models observed with error. Computational Statistics & Data Analysis, 52:1674–1693, 2008.
  • Golightly and Wilkinson [2011] Andrew Golightly and Darren J. Wilkinson. Bayesian parameter inference for stochastic biochemical network models using particle Markov chain Monte Carlo. Interface focus, 1:807–820, 2011.
  • Green [1995] Peter J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711–732, 1995.
  • Gretton et al. [2012] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.
  • Hirt and Dellaportas [2019] Marcel Hirt and Petros Dellaportas. Scalable Bayesian learning for state space models using variational inference with SMC samplers. In Artificial Intelligence and Statistics, 2019.
  • Jensen et al. [2012] Anders Chr. Jensen, Susanne Ditlevsen, Mathieu Kessler, and Omiros Papaspiliopoulos. Markov chain Monte Carlo approach to parameter estimation in the FitzHugh-Nagumo model. Physical Review E, 86, 2012.
  • Jordan et al. [1999] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37:183–233, 1999.
  • Karl et al. [2014] Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick van der Smagt. Deep variational Bayes filters: unsupervised learning of state space models from raw data. In International Conference on Learning Representations, 2014.
  • Kingma and Ba [2015] Durk P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • Kingma and Welling [2014] Durk P. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
  • Kingma et al. [2016] Durk P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, 2016.
  • Krishnan et al. [2017] Rahul Krishnan, Uri Shalit, and David Sontag. Structured inference networks for nonlinear state space models. In Proceedings of the AAAI Conference on Artificial Intelligence, 2017.
  • Ma et al. [2019] Xuezhe Ma, Xiang Kong, Shanghang Zhang, and Eduard Hovy. Macow: Masked convolutional generative flow. In Advances in Neural Information Processing Systems, 2019.
  • Øksendal [2003] Bernt Øksendal. Stochastic Differential Equations: An Introduction with Applications. Hochschultext Universitext. Springer, 2003.
  • Paninski et al. [2010] Liam Paninski, Yashar Ahmadian, Daniel Gil Ferreira, Shinsuke Koyama, Kamiar Rahnama Rad, Michael Vidne, Joshua Vogelstein, and Wei Wu. A new look at state-space models for neural data. Journal of computational neuroscience, 29:107–126, 2010.
  • Papamakarios et al. [2017] George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. Advances in Neural Information Processing Systems, 2017.
  • Papamakarios et al. [2019] George Papamakarios, Eric Nalisnick, Danilo J. Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762, 2019.
  • Pascanu et al. [2013] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, 2013.
  • Rabiner [1989] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
  • Radev et al. [2020] Stefan T. Radev, Ulf K. Mertens, Andreas Voss, Lynton Ardizzone, and Ullrich Köthe. BayesFlow: Learning complex stochastic models with invertible neural networks. arXiv preprint arXiv:2003.06281, 2020.
  • Rezende and Mohamed [2015] Danilo J. Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, 2015.
  • Rezende et al. [2014] Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
  • Ryder et al. [2018] Tom Ryder, Andrew Golightly, A. Stephen McGough, and Dennis Prangle. Black-box variational inference for stochastic differential equations. In International Conference on Machine Learning, 2018.
  • Särkkä [2013] Simo Särkkä. Bayesian filtering and smoothing. Cambridge University Press, 2013.
  • Särkkä and Solin [2019] Simo Särkkä and Arno Solin. Applied Stochastic Differential Equations. Cambridge University Press, 2019.
  • Sherlock et al. [2015] Chris Sherlock, Alexandre H. Thiery, Gareth O. Roberts, and Jeffrey S. Rosenthal. On the efficiency of pseudo-marginal random walk Metropolis algorithms. The Annals of Statistics, 43(1):238–275, 2015.
  • Titsias and Lázaro-Gredilla [2014] Michalis Titsias and Michalis K. Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning, 2014.
  • van den Oord et al. [2016] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, 2016.
  • van den Oord et al. [2018] Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, and Demis Hassabis. Parallel WaveNet: Fast high-fidelity speech synthesis. In International Conference on Machine Learning, 2018.
  • van der Meulen and Schauer [2017] Frank van der Meulen and Moritz Schauer. Bayesian estimation of discretely observed multi-dimensional diffusion processes using guided proposals. Electronic Journal of Statistics, 11:2358–2396, 2017.
  • Ward et al. [2020] Wil Ward, Tom Ryder, Dennis Prangle, and Mauricio Alvarez. Black-box inference for non-linear latent force models. In International Conference on Artificial Intelligence and Statistics. PMLR, 2020.
  • West and Harrison [2006] Mike West and Jeff Harrison. Bayesian forecasting and dynamic models. Springer, 2006.
  • Whitaker et al. [2017] Gavin A. Whitaker, Andrew Golightly, Richard J. Boys, and Chris Sherlock. Improved bridge constructs for stochastic differential equations. Statistics and Computing, 27:885–900, 2017.
  • Zeng and Wu [2013] Yong Zeng and Shu Wu. State-space models: Applications in economics and finance. Springer, 2013.

  Supplementary Material

Appendix A MCMC Algorithm for AR(1) via Forward-Filter Recursion

Here we present an MCMC method for the AR(1) model in the main paper,

x_{i+1} = \theta_{1} + \theta_{2}x_{i} + \theta_{3}\epsilon. (22)

Recall that we take $\epsilon \sim N(0,1)$ and $x_0 = 10$. We assume observations $y_i \sim N(x_i, 1)$ for $i \in 0{:}T$, and independent $N(0, 10^2)$ priors on $\theta_1, \theta_2, \log\theta_3$.

Assuming $T$ observations, the marginal parameter posterior is given by

p(\theta|y_{0:T}) \propto p(\theta)\,p(y_{0:T}|\theta), (23)

where $p(y_{0:T} | \theta)$ is the marginal likelihood obtained from integrating out the latent variables from $p(\theta, x_{0:T} | y_{0:T})$. We sample the marginal parameter posterior $p(\theta | y_{0:T})$ using a random walk Metropolis-Hastings scheme.

This appendix describes the key step of evaluating the marginal likelihood given $\theta$, which is achieved using a forward filter. See West and Harrison [2006] for a general introduction to forward-filtering algorithms for linear state space models. We adapt this as follows.

As can be seen from (22), the AR(1) model is linear and Gaussian. Hence, for a Gaussian observation model with variance $\sigma^2$, the marginal likelihood is tractable and can be efficiently computed via a forward-filter recursion. This utilises the factorisation

p(y_{0:T}|\theta) = p(y_{0}|\theta)\prod_{i=1}^{T}p(y_{i}|y_{0:i-1},\theta), (24)

by recursively evaluating each term.

Suppose that $x_i | y_{0:i} \sim N(a_i, c_i)$. Since $x_0 = 10$ we can take $a_0 = 10$, $c_0 = 0$. It follows that

x_{i+1}\,|\,y_{0:i} \sim N\left(\theta_{1}+\theta_{2}a_{i},\; \theta_{2}^{2}c_{i}+\theta_{3}^{2}\right), (25)

which, from the observation model, gives us the one-step-ahead forecast

y_{i+1}\,|\,y_{0:i} \sim N\left(\theta_{1}+\theta_{2}a_{i},\; \theta_{2}^{2}c_{i}+\theta_{3}^{2}+\sigma^{2}\right). (26)

Hence the marginal likelihood can be recursively updated using

p(y_{0:i+1}|\theta) = p(y_{0:i}|\theta)\,p(y_{i+1}|y_{0:i},\theta), (27)

where $p(y_{i+1} | y_{0:i}, \theta)$ is the density of (26).

The next filtering distribution is obtained as $x_{i+1} | y_{0:i+1} \sim N(a_{i+1}, c_{i+1})$, where

a_{i+1} = \theta_{1}+\theta_{2}a_{i}+\frac{\left(\theta_{2}^{2}c_{i}+\theta_{3}^{2}\right)\left(y_{i+1}-\theta_{1}-\theta_{2}a_{i}\right)}{\theta_{2}^{2}c_{i}+\theta_{3}^{2}+\sigma^{2}}, (28)
c_{i+1} = \theta_{2}^{2}c_{i}+\theta_{3}^{2}-\frac{\left(\theta_{2}^{2}c_{i}+\theta_{3}^{2}\right)^{2}}{\theta_{2}^{2}c_{i}+\theta_{3}^{2}+\sigma^{2}}. (29)

Evaluation of (25)-(29) for $i = 0, 1, \ldots, T-1$ gives the marginal likelihood $p(y_{0:T} | \theta)$.
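A direct NumPy/SciPy translation of the recursion (24)-(29) is sketched below; the function name and interface are ours, and the random walk Metropolis-Hastings wrapper around it is not shown.

```python
import numpy as np
from scipy.stats import norm

def ar1_log_marginal_likelihood(theta, y, x0=10.0, obs_var=1.0):
    """Forward-filter recursion (25)-(29) for the AR(1) model, returning log p(y_{0:T} | theta)."""
    th1, th2, th3 = theta
    a, c = x0, 0.0                                   # x_0 is known, so the filter starts at N(x0, 0)
    log_ml = norm.logpdf(y[0], loc=a, scale=np.sqrt(c + obs_var))      # p(y_0 | theta)
    for i in range(len(y) - 1):
        pred_mean = th1 + th2 * a                    # mean of x_{i+1} | y_{0:i}, eq. (25)
        pred_var = th2 ** 2 * c + th3 ** 2           # variance of x_{i+1} | y_{0:i}
        log_ml += norm.logpdf(y[i + 1], loc=pred_mean,
                              scale=np.sqrt(pred_var + obs_var))       # one-step forecast, eq. (26)
        gain = pred_var / (pred_var + obs_var)
        a = pred_mean + gain * (y[i + 1] - pred_mean)                  # filtering mean, eq. (28)
        c = pred_var - gain * pred_var                                 # filtering variance, eq. (29)
    return log_ml
```

This log marginal likelihood, added to the log prior, gives the unnormalised log posterior (23) targeted by the random walk Metropolis-Hastings sampler.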

Appendix B Mini-Batch Sampling

Algorithm A describes how to sample a subsequence $x_{a:b}$ from $q(x | \theta; \phi_x)$ without needing to sample the entire $x$ sequence.

Let $z^0$ be the base random sequence and $z^j$ be the sequence after $j$ affine layers. We assume no layers permuting the sequence order (but do allow layers permuting the components within each vector in the sequence). Suppose there are $m$ affine layers, so the output is $x = z^m$.

Algorithm A presents the multivariate case where $x_i, z^j_i, \mu^j_i, \sigma^j_i$ are all vectors in $\mathbb{R}^d$. This includes $d = 1$ as a special case. We denote the $k$th entry of $\sigma^j_i$ as $\sigma^j_{ik}$. Recall that $\varphi$ is the $N(0, I_d)$ log-density function.

Algorithm A Sampling a subsequence from a nMA model
1: Sample $z^0_i$ for $a - c_0 \leq i \leq b$. These are sampled from independent $N(0, I_d)$ distributions, except when $i \not\in 1{:}T$; in the latter case $z^0_i$ is a vector of zeros.
2: for $1 \leq j \leq m$ do
3:   Apply the CNN with input $z^{j-1}_{a-c_{j-1}:b}$. This outputs $\mu^j_{a-c_j:b}$ and $\sigma^j_{a-c_j:b}$.
4:   Calculate $z^j_{a-c_j:b}$ using the affine transformation $z^j_i = \mu^j_i + \sigma^j_i \odot z^{j-1}_i$.
5:   Permute components in $z^j$ if necessary.
6: end for
7: Return the sampled subsequence $x_{a:b} = z^m_{a:b}$, and the log-density contributions $\lambda_{a:b}$, where
$\lambda_i = \varphi(z^0_i) - \sum_{k=1}^{d}\sum_{j=1}^{m}\log\sigma^j_{ik}.$

Each iteration of the algorithm (except the last) must sample $z^j_i$ vectors over an interval of $i$ which is wider than simply $a{:}b$. The number of extra $z^j_i$s required at the lower end of this interval is

c_{j} = (m-j)\ell. (30)

In other words, at each iteration the required interval shrinks by $\ell$, the length of the receptive field for $z$.

Appendix C Order-Reversing Permutations

The main paper mentions that affine flow layers could be alternated with layers which reverse the order of the $x_{1:T}$ sequence. This can be accomplished by replacing Algorithm A above with Algorithm B. We state this algorithm using only the original sequence ordering. To do so we introduce an operation $\mathfrak{R}$ which reverses the order of a sequence.

As before, each iteration (except the last) samples $z^j_i$ vectors over an interval of $i$ wider than $a{:}b$. However, now we need extra entries at both ends of this interval, $c^-_j$ and $c^+_j$ at the lower and upper ends respectively. These can be defined recursively by $c^-_m = c^+_m = 0$ and

(c^{-}_{j},c^{+}_{j}) = (c^{-}_{j+1},c^{+}_{j+1}) + \begin{cases}(\ell,0)&j\text{ odd}\\ (0,\ell)&j\text{ even}\end{cases} (31)
Algorithm B Sampling a subsequence from a nMA model, with order-reversing permutations
1: Sample $z^0_i$ for $a - c^-_0 \leq i \leq b + c^+_0$ as in Algorithm A.
2: for $1 \leq j \leq m$ do
3:   if $j$ odd then
4:     Apply the CNN with input $z^{j-1}_{a-c^-_{j-1}:b+c^+_{j-1}}$. This outputs $\mu^j_{a-c^-_j:b+c^+_j}$ and $\sigma^j_{a-c^-_j:b+c^+_j}$.
5:   else
6:     Apply the CNN with input $\mathfrak{R}(z^{j-1}_{a-c^-_{j-1}:b+c^+_{j-1}})$. This outputs $\mathfrak{R}(\mu^j_{a-c^-_j:b+c^+_j})$ and $\mathfrak{R}(\sigma^j_{a-c^-_j:b+c^+_j})$.
7:   end if
8:   Calculate $z^j_{a-c^-_j:b+c^+_j}$ using the affine transformation $z^j_i = \mu^j_i + \sigma^j_i \odot z^{j-1}_i$.
9:   Permute components in $z^j$ if necessary.
10: end for
11: Return the sampled subsequence $x_{a:b} = z^m_{a:b}$, and the log-density contributions $\lambda_{a:b}$, where
$\lambda_i = \varphi(z^0_i) - \sum_{k=1}^{d}\sum_{j=1}^{m}\log\sigma^j_{ik}.$

Appendix D Side Information

Here we give more details of the side information we inject into our nMA model for $x$. We inject this information into the first layer of the CNN for each of our affine transformations. Recall that, first, we include the parameters $\theta$ as global side information. We also provide local side information, encoding information in $y$ local to $i$ which is useful for inferring the state $x_i$.

In more detail, first we define $s_i$ to be a vector of data features relevant to $x_i$. We pick these so that $s_i$ exists for all $i$, even if (1) no $y_i$ observations exist for $x_i$ or (2) $i$ is outside the range $0{:}T$. The data features we use in our examples are listed in the next section.

The side information corresponding to the $i$th position in the sequence processed by the CNN is $\theta$ and the vector $s_{i-\ell':i+\ell'}$. The tuning parameter $\ell'$ is a receptive field length (like $\ell$ earlier). This receptive field extends in both directions from the sequence position $i$, so it can take account of both recent and upcoming observations. The side information is encoded using a feed-forward network, and this vector is then used as part of the input to the first layer of the CNN.

Appendix E Implementation Details for Algorithm 1

Optimisation

We use the AdaMax optimiser [Kingma and Ba, 2015], due to its robustness to occasional large gradient estimates. These sometimes occurred in our training procedure when different batches of the time series had significantly different properties. See Section F for its tuning choices. To stabilise optimisation, we also follow Pascanu et al. [2013] and clip gradients using the global $L_1$ norm.

Variational Approximation for \theta

For q(\theta;\phi_{\theta}) we use a masked IAF as described in Section 3.1 of the main paper. In all our examples, this alternates 5 affine layers with random permutations. Each affine transformation is based on a masked feed-forward network of 3 layers with 10 hidden units.

Unequal Batch Sizes

The main paper assumes the time series is split into batches B_{1},B_{2},\ldots,B_{b} of equal length. Recall that in this case a batch B_{\kappa} is sampled for use in a training iteration, where \kappa is drawn uniformly from 1{:}b.

Often the length of the data will require batches of unequal lengths. To allow this, simply take \Pr(\kappa)=|B_{\kappa}|/T, and replace T/M in (15) (in the main paper) with T/|B_{\kappa}|.
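A minimal sketch of this sampling scheme, assuming the batches are held as a list of index arrays; the factor returned alongside the batch is the quantity that replaces T/M.

```python
import numpy as np

rng = np.random.default_rng()

def sample_batch(batches, T):
    """Draw batch B_kappa with probability |B_kappa| / T and return it with its weight."""
    lengths = np.array([len(B) for B in batches])
    kappa = rng.choice(len(batches), p=lengths / T)   # probabilities sum to 1 since sum |B_kappa| = T
    return batches[kappa], T / lengths[kappa]         # weight replacing T/M in (15)
```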

Pre-Training

We found that pre-training our variational approximation to sensible initial values reduced the training time. A general framework for this is to train q(\theta;\phi_{\theta}) to be close to the prior, and q(x|\theta;\phi_{x}) to be close to the observations or some other reasonable initial value. See Section F for details of how we implemented this in our examples.
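The following sketch illustrates this framework under some assumed interfaces: q_theta.rsample() and q_x.rsample(theta) draw reparameterised samples from the two variational distributions, and prior_density evaluates p(\theta). It is a schematic of the idea only, not our code.

```python
import torch

def pretrain_theta(q_theta, prior_density, opt, iters=500):
    """Push q(theta; phi_theta) towards the prior by maximising E_q[p(theta)]."""
    for _ in range(iters):
        opt.zero_grad()
        theta = q_theta.rsample()               # reparameterised draw
        (-prior_density(theta)).backward()
        opt.step()

def pretrain_x(q_x, q_theta, x_init, opt, iters=500):
    """Push q(x | theta; phi_x) towards an initial path x_init (e.g. the observations)."""
    for _ in range(iters):
        opt.zero_grad()
        theta = q_theta.rsample().detach()
        x = q_x.rsample(theta)
        torch.linalg.vector_norm(x - x_init).backward()
        opt.step()
```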

One of our examples required more complex pre-training, described below in Section G.2. Although our method does sometimes require such non-trivial tuning choices, the same is true of most other competing methods for Bayesian inference of SSMs (see e.g. Sherlock et al., 2015).

Local Side Information

Our local side information vector s_{i} is made up of the following (see the sketch after this list):

  • Time i.

  • A binary variable indicating whether or not i\in\mathcal{S} (i.e. whether there is an observation of x_{i}).

  • The vector of observations y_{i} if i\in\mathcal{S}, replaced by the next recorded observation vector if i\not\in\mathcal{S}, or by a vector of zeros if there is no next observation.

  • The time until the next observation (omitted in settings where every i has an observation).

  • A binary variable indicating whether i\in 0{:}T (as the s_{i} receptive field can stretch beyond this range).
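A sketch of how such a feature vector could be assembled is below; the argument names and the encoding of missing observations are illustrative choices, not a specification of our code.

```python
import numpy as np

def make_features(i, T, obs_times, obs_values, dim_y):
    """Assemble s_i from the items listed above. obs_times is the sorted set S of
    observation indices; obs_values maps each of them to its observation vector."""
    observed = i in obs_times
    upcoming = [t for t in obs_times if t >= i]
    if observed:
        y_part = obs_values[i]
    elif upcoming:
        y_part = obs_values[upcoming[0]]          # next recorded observation
    else:
        y_part = np.zeros(dim_y)                  # no further observations
    time_to_next = float(upcoming[0] - i) if upcoming else 0.0
    in_range = float(0 <= i <= T)
    return np.concatenate([[float(i), float(observed)], y_part,
                           [time_to_next, in_range]])
```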

Choice of \ell^{\prime}

Throughout we use \ell^{\prime}=10. We found that this relatively short receptive field length for local side information was sufficient to give good results in our examples.

Appendix F Experimental Details

This section lists tuning choices for our examples. In all of our examples we set both n (the number of samples used in the ELBO gradient estimate) and M (the batch length) equal to 50, and use m=3 affine layers in our flow for x.

Each affine layer has a CNN of 4 one-dimensional convolutional layers. Each intermediate layer has 50 filters and uses ELU activation and batch normalisation; the output layer does not. Before being injected into the first CNN layer, side information vectors (see Section D) are processed through a feed-forward network to produce an encoded vector of length 50. For this we use a vanilla feed-forward network of 3 layers with 50 hidden units each and ELU activation.
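A PyTorch sketch of a CNN with this shape is given below; the kernel size is a placeholder, and the causal/dilated structure of the nMA convolutions, as well as the injection of encoded side information, is omitted for brevity.

```python
import torch.nn as nn

def make_cnn(in_channels, out_channels, kernel_size=3):
    """Four 1D convolutional layers: three intermediate layers with 50 filters,
    batch normalisation and ELU, followed by a plain output layer."""
    layers, channels = [], [in_channels, 50, 50, 50]
    for c_in, c_out in zip(channels[:-1], channels[1:]):
        layers += [nn.Conv1d(c_in, c_out, kernel_size),
                   nn.BatchNorm1d(c_out), nn.ELU()]
    layers.append(nn.Conv1d(50, out_channels, kernel_size))
    return nn.Sequential(*layers)
```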

We use the AdaMax optimiser with tuning parameters \beta_{1}=0.95 (a non-default choice) and \beta_{2}=0.999 (the default choice). See the tables below for the learning rates used.

Each experiment uses a small number of pre-training SGD iterations for \phi_{\theta} optimising E_{\theta\sim q}[p(\theta)], the expected prior density. We separately pre-train \phi_{x} to optimise an objective detailed in the tables below. As discussed above (Section E), where possible we aim to initialise x to be close to the observations, or some other reasonable initial value.

Choices specific to each experiment are listed below.

AR(1)

  • Learning rate: 10^{-3}

  • Pre-training for x: 500 iterations minimising E_{\theta,x\sim q}[||x-\hat{y}||_{2}], where \hat{y} is the observed data.

  • \ell: 10

Lotka-Volterra: Data Setting (a)

  • Learning rate: 10^{-3}

  • Pre-training for x: 500 iterations minimising E_{\theta,x\sim q}[||x-\hat{y}||_{2}], where \hat{y} is a linear interpolation of the data.

  • \ell: 20

Lotka-Volterra: Data Setting (b)

  • Learning rate: 5\times 10^{-4}

  • Pre-training for x: 500 iterations maximising E_{x\sim q}[p(x|\theta^{*})], where \theta^{*}=(0.5,0.0025,0.3). See Section G.2 for more details.

  • \ell: 20

FitzHugh-Nagumo

  • Learning rate: 5\times 10^{-4}

  • Pre-training for x: 500 iterations minimising E_{\theta,x\sim q}[||x||_{2}]. Here the model has some unobserved components, so we cannot initialise x close to the observations; instead we simply encourage x to take small initial values.

  • \ell: 20

Appendix G Lotka-Volterra Details

Here we discuss some methodology specific to the Lotka-Volterra example in more detail.

G.1 Restricting x to Positive Values

For our Lotka-Volterra model, x_{i}=(u_{i},v_{i}) represents two population sizes. Negative values do not have a natural interpretation, and they also cause numerical errors in the model: the matrix \beta in (17) may no longer be positive definite, so that a Cholesky factor, required in (2), is not available. (Note that all equation references in Section G.1 are to the main paper.)

Therefore we wish to restrict the support of q(x|\theta;\phi_{x}) to positive values. We do so by the following method, which can be applied more generally, beyond this specific model. We add a final elementwise softplus bijection to our nMA model. Let \tilde{x} be the output before this final bijection. The log density (7) gains an extra term to become

\log q(x)=\sum_{i=1}^{T}\varphi(z_{i})-\sum_{k=1}^{d}\sum_{j=1}^{m}\sum_{i=1}^{T}\log\sigma^{j}_{ik}-\sum_{k=1}^{d}\sum_{i=1}^{T}\gamma(\tilde{x}_{ik}), (32)

where \gamma denotes the logarithm of the derivative of the softplus function (i.e. the log of the logistic function). The ELBO calculations remain unchanged except for taking

\lambda_{i}=\varphi(z_{i})-\sum_{k=1}^{d}\sum_{j=1}^{m}\log\sigma^{j}_{ik}-\sum_{k=1}^{d}\gamma(\tilde{x}_{ik}). (33)

We implement our method as before with this modification to \lambda_{i}.
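A PyTorch-style sketch of this final bijection and its log-density correction is given below; here the correction is written directly as the log of the softplus derivative (the log-logistic function), applied elementwise to \tilde{x}.

```python
import torch
import torch.nn.functional as F

def softplus_bijection(x_tilde):
    """Map the nMA output x_tilde to positive values and return the per-index
    correction term (gamma summed over components), as used in (32) and (33)."""
    x = F.softplus(x_tilde)                        # strictly positive states
    correction = F.logsigmoid(x_tilde).sum(dim=-1) # log of the Jacobian diagonal
    return x, correction
```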

G.2 Multiple Modes and Pre-Training

Observation setting (b) of our Lotka-Volterra example has multiple posterior modes. Without careful initialisation of q(x|\theta), the variational approach typically finds a mode with high-frequency oscillations in x. An example is displayed in Figure 5. The corresponding estimated maximum a-posteriori parameter values are \hat{\theta}=(4.428,0.029,2.957).

Figure 5: Lotka-Volterra results finding a high-frequency mode. This shows the latent path (dashed line), the available observations (crosses) and 50 samples of the variational posterior for x. For ease of presentation, we show results for u only. The horizontal axis shows t=0.1i.

Ideally we would aim to find the most likely modes and evaluate their posterior probabilities, but this is infeasible for our method. (It could be feasible to design a reversible jump MCMC algorithm, following Green, 1995, to do this, but we are unaware of such a method for this application.) Instead we attempt to constrain our analysis to find the mode we expect to be most plausible: that giving a single oscillation between each pair of data points. It is difficult to encode this belief in our prior distribution, so instead we use pre-training so that VI concentrates on this mode. This is comparable to the common MCMC tuning strategy of choosing a plausible initial value.

We use 500 pre-training iterations maximising E_{x\sim q}[p(x|\theta^{*})], where \theta^{*}=0.1\hat{\theta}. The basis for this choice is that periodic Lotka-Volterra dynamics roughly correspond to cycles in (u,v) space. Multiplying \theta, the rate constants of the dynamics, by \xi should give similar dynamics but increase the frequency by a factor of \xi. Based on Figure 5 we wish to reduce the frequency by a factor of 10, so we choose \xi=0.1. Using this pre-training approach, we obtain the results shown in the main paper (Figure 3), corresponding to a more plausible mode.
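To illustrate the frequency-scaling argument, the sketch below integrates deterministic Lotka-Volterra rate equations under \hat{\theta} and 0.1\hat{\theta}. The rate form, initial values and integration settings are illustrative assumptions and may differ in detail from (17) in the main paper.

```python
import numpy as np

def lv_path(theta, u0=100.0, v0=100.0, dt=0.01, t_max=40.0):
    """Euler integration of du/dt = th1*u - th2*u*v, dv/dt = th2*u*v - th3*v."""
    th1, th2, th3 = theta
    n = int(t_max / dt)
    u, v = np.empty(n + 1), np.empty(n + 1)
    u[0], v[0] = u0, v0
    for i in range(n):
        du = th1 * u[i] - th2 * u[i] * v[i]
        dv = th2 * u[i] * v[i] - th3 * v[i]
        u[i + 1], v[i + 1] = u[i] + dt * du, v[i] + dt * dv
    return u, v

theta_hat = np.array([4.428, 0.029, 2.957])
u_fast, _ = lv_path(theta_hat)         # rapid oscillations, cf. Figure 5
u_slow, _ = lv_path(0.1 * theta_hat)   # roughly one tenth of the frequency
```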