
Deep Stochastic Volatility Model

Xiuqin Xu,1 Ying Chen 2
Abstract

The volatility of financial asset returns can be used to gauge the risk in the financial market. We propose a deep stochastic volatility model (DSVM) based on the framework of deep latent variable models. It uses flexible deep learning models to automatically detect the dependence of the future volatility on past returns, past volatilities and the stochastic noise, and thus provides a flexible volatility model without the need to manually select features. We develop a scalable inference and learning algorithm based on variational inference. In real data analysis, the DSVM outperforms several popular alternative volatility models. In addition, the predicted volatility of the DSVM provides a more reliable risk measure that better reflects the risk in the financial market, rising more quickly to a higher level when the market becomes more risky and falling more quickly to a lower level when the market becomes more stable, compared with the commonly used GARCH-type models, based on a large data set of U.S. stocks.

1 Introduction

The volatility of financial asset returns, as a risk measure for the financial market, is crucial for investment decisions (Markowitz 1952). It is also the most important factor in derivatives pricing (Duan 1995; Goard and Mazur 2013), since the uncertainty of asset returns, reflected as volatility, drives the values of the derivatives. The volatility of asset returns is not observable, and historical volatility is usually measured by the standard deviation of the historical returns over a fixed time interval.

The returns of financial assets often exhibit heteroscedasticity, in the sense that the volatility of the returns is time-dependent. In particular, higher volatility tends to be followed by higher volatility, while lower volatility tends to be followed by lower volatility, a phenomenon called volatility clustering. In addition, past positive and negative returns have asymmetric effects on volatility: a negative past return increases volatility more significantly than a positive past return of the same size.

To support investment decisions and derivatives pricing, volatility models are needed to forecast future volatility. As the volatility is not observable, the biggest challenge in constructing a volatility model is to identify the dependence between the unobserved volatility and the observed data (asset returns), so as to capture the two empirical characteristics above: volatility clustering and the asymmetric effects of positive and negative returns.

The Generalized Autoregressive Conditional Heteroscedasticity (GARCH) model (Bollerslev 1986) is the most popular volatility model; it assumes the variance of the returns (squared volatility) is a linear function of the past squared returns and the past squared volatilities. A number of GARCH variants have been developed, most of which attempt to address two limitations of the GARCH model: the assumption of a linear relationship between current and past squared volatility, and the assumption of symmetric effects of positive and negative past returns. Maximum likelihood methods can be used to learn the parameters of the GARCH family models. A detailed review of the GARCH family can be found in (Hentschel 1995). The GARCH family models are deterministic volatility models, in the sense that given the past returns and past volatility, the future volatility is a deterministic value. An alternative type of volatility model is the stochastic volatility model, in which the volatility is modeled as a latent stochastic process defined through stochastic differential equations. In practice, stochastic volatility models discretized for evenly spaced data are usually used instead for forecasting (Kim, Shephard, and Chib 1997; Omori et al. 2007), and are estimated using Markov chain Monte Carlo methods. Although theoretically more sound, stochastic volatility models usually do not outperform the GARCH model in practice (Kim, Shephard, and Chib 1997; Omori et al. 2007).

Both the deterministic and stochastic volatility models developed in the statistics community impose restrictive functional forms on the volatility dynamics. Recently, volatility modeling has received attention from the machine learning community, where researchers are interested in proposing volatility models with fewer assumptions on the functional form. For example, (Wu, Lobato, and Ghahramani 2014) propose to model the dependence of future volatility on past volatility and past returns with a flexible non-linear Gaussian process model, which only assumes the volatility function to be smooth. Recent developments in deep latent variable models (Krishnan, Shalit, and Sontag 2017; Fraccaro et al. 2016; Chung et al. 2015; Bayer and Osendorfer 2014; Liu et al. 2018) provide a flexible framework to model the dependence between observed variables and unobservable latent variables. (Luo et al. 2018) propose a volatility model in which a sequence of latent variables is used to extract a compact representation of the original multivariate stock return time series. However, the model does not consider the effects of volatility clustering explicitly, only implicitly through the latent variables. In addition, modeling multivariate stock time series (using more information) requires more effort in data pre-processing, for example, choosing which group of stocks to model together, which may affect the performance.

In this work, we propose a Deep Stochastic Volatility Model (DSVM) based on the framework of deep latent variable models, using only the information in the univariate return time series. The DSVM imposes no restriction on the functional form of the volatility dynamics, and explicitly considers the effects of stochastic noise, past volatilities (autoregression) and past returns (exogenous variables) on volatility, thus providing a general framework for volatility modeling. A scalable learning and inference method based on variational inference is developed to estimate the DSVM. Real data analysis shows that the DSVM outperforms several popular volatility models in terms of testing negative log-likelihood, based only on the information in the univariate return time series. In addition, the predicted volatility of the DSVM provides a more reliable risk measure for financial assets: higher volatility when the market is more risky and lower volatility when the market is more stable.

In Section 2, we review several volatility models in detail. The formulation of the DSVM is given in Section 3, and a scalable inference and learning algorithm is presented in Section 4. Section 5 presents the real data analysis, and Section 6 concludes.

2 Volatility models

In this section, we detail several volatility models. Consider a sequence of $T$ observations of stock returns, denoted $r_{1:T}=\{r_{1},r_{2},\cdots,r_{T}\}$. We are interested in modeling the distribution $p(r_{1:T})$ as well as the latent volatility process $\sigma_{t}, t=1,\cdots,T$. One common approach is to assume the return $r_{t}$ follows a normal distribution with mean 0 and variance $\sigma_{t}^{2}$. Mathematically, we can formulate this as

\begin{aligned}
r_{t} &= \sigma_{t}\epsilon_{t} \\
\epsilon_{t} &\sim \text{i.i.d. } \mathcal{N}(0,1)
\end{aligned}

Volatility models aim to find a functional form for the evolution of the volatility $\sigma_{t}$ or of some transformation of it, such as the variance $\sigma_{t}^{2}$ or the log variance $\log(\sigma_{t}^{2})$.

Deterministic volatility models

The GARCH model (Bollerslev 1986) assumes a linear dependence of the future variance $\sigma_{t}^{2}$ on the previous $p$ squared returns and the previous $q$ variances, computed as follows:

\sigma_{t}^{2}=\omega_{0}+\sum_{i=1}^{p}\alpha_{i}r_{t-i}^{2}+\sum_{j=1}^{q}\beta_{j}\sigma_{t-j}^{2}

Volatility clustering is captured by the parameters $\alpha_{i}, i=1,\cdots,p$. However, the GARCH model assumes symmetric effects of negative and positive returns. (Glosten, Jagannathan, and Runkle 1993) propose the GJR-GARCH model, which extends the GARCH model to account for the asymmetric effects of positive and negative returns as follows:

\sigma_{t}^{2}=\omega_{0}+\sum_{i=1}^{p}\alpha_{i}\left(r_{t-i}^{2}+\gamma_{i}I\left\{r_{t-i}<0\right\}r_{t-i}^{2}\right)+\sum_{j=1}^{q}\beta_{j}\sigma_{t-j}^{2}

where the asymmetric effect is captured by the indicator function $I\{r_{t-i}<0\}$ and controlled by the leverage parameters $\gamma_{i}$. Similarly, the Threshold GARCH (TGARCH) proposed by (Zakoian 1994) also uses leverage parameters $\gamma_{i}$ to reflect the asymmetric effects. The difference is that the TGARCH works directly with the volatility $\sigma_{t}$ instead of the variance $\sigma_{t}^{2}$:

\sigma_{t}=\omega_{0}+\sum_{i=1}^{p}\alpha_{i}\left(\left|r_{t-i}\right|-\gamma_{i}r_{t-i}\right)+\sum_{j=1}^{q}\beta_{j}\sigma_{t-j}

Another popular extension of the GARCH model is the Exponential GARCH (EGARCH, (Nelson 1991)), which models the evolution of the log variance $\log(\sigma_{t}^{2})$:

\log\left(\sigma_{t}^{2}\right)=\omega_{0}+\sum_{i=1}^{p}\left(\alpha_{i}\epsilon_{t-i}+\gamma_{i}\left(\left|\epsilon_{t-i}\right|-\mathbb{E}\left|\epsilon_{t-i}\right|\right)\right)+\sum_{j=1}^{q}\beta_{j}\log\left(\sigma_{t-j}^{2}\right)

where the leverage parameters $\gamma_{i}$ are used to account for the asymmetric effect.
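
To make these recursions concrete, the following is a minimal NumPy sketch of the GARCH(1,1) and GJR-GARCH(1,1) variance filters above; the coefficient values and the variance initialization are illustrative assumptions, not fitted estimates.

import numpy as np

def garch_variance(r, omega=1e-6, alpha=0.08, beta=0.90):
    """GARCH(1,1) recursion: sigma_t^2 = omega + alpha*r_{t-1}^2 + beta*sigma_{t-1}^2."""
    sigma2 = np.empty(len(r))
    sigma2[0] = np.var(r)                       # a common initialization choice
    for t in range(1, len(r)):
        sigma2[t] = omega + alpha * r[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2

def gjr_garch_variance(r, omega=1e-6, alpha=0.05, gamma=0.06, beta=0.90):
    """GJR-GARCH(1,1): negative past returns contribute an extra gamma*r_{t-1}^2."""
    sigma2 = np.empty(len(r))
    sigma2[0] = np.var(r)
    for t in range(1, len(r)):
        asym = gamma * (r[t - 1] < 0) * r[t - 1] ** 2
        sigma2[t] = omega + alpha * r[t - 1] ** 2 + asym + beta * sigma2[t - 1]
    return sigma2

In both filters the predicted variance is a deterministic function of the observed returns, which is exactly the sense in which the GARCH family is deterministic.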

Stochastic volatility models

For analytical convenience, stochastic volatility models are usually formulated with stochastic differential equations in continuous time (Hull and White 1987; Heston 1993). A typical continuous-time model for a univariate process is

\begin{aligned}
\mathrm{d}\sigma(t) &= -\beta\sigma(t)\,\mathrm{d}t+\delta\,\mathrm{d}W^{\sigma}(t) \\
\mathrm{d}s(t) &= \left(\mu-0.5\sigma^{2}(t)\right)\mathrm{d}t+\sigma(t)\,\mathrm{d}W^{s}(t)
\end{aligned}

where $s(t)$ is the logarithm of the stock price at time $t$, and $W^{\sigma}(t)$ and $W^{s}(t)$ are two correlated Wiener processes with correlation $E[\mathrm{d}W^{\sigma}(t)\,\mathrm{d}W^{s}(t)]=\rho\,\mathrm{d}t$. For practical use, stochastic volatility can be formulated in discrete time for regularly spaced data. The discrete stochastic volatility model without leverage (Kim, Shephard, and Chib 1997) is defined as

\begin{aligned}
r_{t} &= \sigma_{t}\epsilon_{t} \\
\log(\sigma_{t}^{2}) &= \mu+\phi\left(\log(\sigma_{t-1}^{2})-\mu\right)+z_{t} \\
\epsilon_{t} &\sim \text{i.i.d. } \mathcal{N}(0,1), \qquad z_{t}\sim \text{i.i.d. } \mathcal{N}(0,\sigma_{z}^{2})
\end{aligned}

To accommodate the asymmetric effect of positive and negative returns, (Omori et al. 2007) propose the discrete stochastic volatility model with leverage as follows:

\begin{aligned}
r_{t} &= \sigma_{t}\epsilon_{t} \\
\log\left(\sigma_{t}^{2}\right) &= \mu+\phi\left(\log\left(\sigma_{t-1}^{2}\right)-\mu\right)+z_{t} \\
\begin{pmatrix}\epsilon_{t-1}\\ z_{t}\end{pmatrix} &\sim \text{i.i.d. } \mathcal{N}_{2}(\mathbf{0},\Sigma), \qquad \Sigma=\begin{pmatrix}1 & \rho\sigma_{z}\\ \rho\sigma_{z} & \sigma_{z}^{2}\end{pmatrix}
\end{aligned}

where the leverage parameter $\rho<0$ captures the increase in volatility that follows a drop in the return.
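
As an illustration, a short NumPy simulation of this discrete SV model with leverage is sketched below; the parameter values are illustrative assumptions. Pairing $\epsilon_{t-1}$ with $z_{t}$ means that, when $\rho<0$, a negative return shock tends to raise the next period's log variance.

import numpy as np

def simulate_sv_leverage(T=1000, mu=-9.0, phi=0.97, sigma_z=0.15, rho=-0.4, seed=0):
    """Simulate the discrete SV model with leverage: (epsilon_t, z_{t+1}) are
    jointly Gaussian with Cov = [[1, rho*sigma_z], [rho*sigma_z, sigma_z**2]]."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho * sigma_z], [rho * sigma_z, sigma_z ** 2]])
    pairs = rng.multivariate_normal(np.zeros(2), cov, size=T)
    eps = pairs[:, 0]                     # eps[t] plays the role of epsilon_t
    z = np.empty(T)
    z[0] = rng.normal(0.0, sigma_z)       # initial noise, unpaired
    z[1:] = pairs[:-1, 1]                 # z[t] is paired with eps[t-1]
    log_var = np.empty(T)
    log_var[0] = mu
    for t in range(1, T):
        log_var[t] = mu + phi * (log_var[t - 1] - mu) + z[t]
    sigma = np.exp(log_var / 2)
    return sigma * eps, sigma             # returns r_t and volatility sigma_t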

Machine learning stochastic volatility model

(Wu, Lobato, and Ghahramani 2014) propose the Gaussian process volatility model (GP-Vol):

\begin{aligned}
r_{t} &= \sigma_{t}\epsilon_{t} \\
\log(\sigma_{t}^{2}) &= f(\log(\sigma_{t-1}^{2}),r_{t-1})+z_{t} \\
\epsilon_{t} &\sim \text{i.i.d. } \mathcal{N}(0,1), \qquad z_{t}\sim \text{i.i.d. } \mathcal{N}(0,\sigma_{z}^{2})
\end{aligned}

where $f$ is modeled as a sample path of a Gaussian process and $z_{t}$ is the stochastic noise. The model can be trained with sequential Monte Carlo methods.

The neural stochastic volatility model (NSVM) proposed by (Luo et al. 2018) can be formulated as follows:

\sigma_{t}=f(r_{1:t-1},z_{1:t})

where $f$ is a deep neural network. As will be detailed in the following, one of the main differences between our DSVM and the NSVM is that we use a different generative probabilistic model that accounts for the effect of past volatility, modeling the volatility as $\sigma_{t}=f_{\sigma}(\sigma_{1:t-1},r_{1:t-1},z_{1:t})$. Another difference is that (Luo et al. 2018) model the volatility using information from multiple stock return time series that are correlated with each other, and thus utilize more information. However, how to properly group multiple stocks is not an easy task in the first place. In this paper, we aim to propose a volatility model using less information, i.e., based only on the knowledge of the univariate stock return series.

In addition, there exist other machine learning volatility models that use further exogenous variables, such as sentiment data (Xing, Cambria, and Zhang 2019), or that use other advanced deep learning architectures; see (Zhang et al. 2018).

3 Deep stochastic volatility model

In this section, we give the formulation for the Deep Stochastic Volatility Model (DSVM).

Recall that $r_{t}$ is the asset return and $\sigma_{t}$ is the volatility of the asset at time step $t$. We introduce $z_{t}$ as the continuous stochastic noise to the volatility at time step $t$. The stochastic noise is assumed to be normally distributed, with parameters that depend on the past stochastic noise as follows:

\begin{aligned}
z_{t} &\sim \mathcal{N}\left(m_{t}^{(p)},(v_{t}^{(p)})^{2}\right) \\
m_{t}^{(p)} &= f_{1}\left(z_{t-1}\right), \qquad v_{t}^{(p)}=f_{2}\left(z_{t-1}\right)
\end{aligned}

where $z_{t}$ is a Gaussian variable with parameters $m_{t}^{(p)}, v_{t}^{(p)}$ given by MLP models $f_{1}$, $f_{2}$ that depend on $z_{t-1}$.

In the DSVM, the volatility is assumed to depend on the past returns, the past volatilities and the stochastic noise through a Recurrent Neural Network (RNN), computed as

\begin{aligned}
h_{t} &= f_{h}(\sigma_{t-1},r_{t-1},z_{t},h_{t-1}) \\
\sigma_{t} &= f_{3}(h_{t})
\end{aligned}

where $f_{h}$ is an RNN model and $f_{3}$ is an MLP model. Therefore, the current volatility depends on the previous volatility $\sigma_{t-1}$, the previous return $r_{t-1}$ and the current stochastic noise $z_{t}$, as well as on $\sigma_{1:t-2},r_{1:t-2},z_{1:t-1}$ encoded in $h_{t-1}$. This parameterization makes $\sigma_{t}$ depend on $\sigma_{1:t-1},r_{1:t-1},z_{1:t}$, and the volatility model can then be denoted as $\sigma_{t}=f_{\sigma}(\sigma_{1:t-1},r_{1:t-1},z_{1:t})$.

The return is assumed to follow a Gaussian distribution with mean 0 and standard deviation σt\sigma_{t}:

r_{t}\sim\mathcal{N}(0,\sigma_{t}^{2})

The graphical representation of the DSVM is given in Figure 1; it is termed the generative network, which specifies the generating process of the data. The parameters of the generative network comprise the parameters of the following functions: $\theta=\{f_{1},f_{2},f_{3},f_{h}\}$.

Figure 1: The generative network of the DSVM
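
To make the generative network concrete, the following is a minimal PyTorch sketch: MLPs f1 and f2 parameterize the prior $p(z_{t}|z_{t-1})$, and a GRU cell $f_{h}$ together with an MLP $f_{3}$ maps $(\sigma_{t-1},r_{t-1},z_{t},h_{t-1})$ to $\sigma_{t}$. The output activations (linear for means, Softplus for standard deviations) and the one-layer GRU with 10 hidden states follow the settings reported in Section 5; the inner Tanh activations, the hidden width, and the class and method names are our illustrative assumptions.

import torch
import torch.nn as nn

class DSVMGenerative(nn.Module):
    """Sketch of the DSVM generative network."""
    def __init__(self, z_dim=1, h_dim=10, mlp_dim=16):
        super().__init__()
        def mlp(d_in, d_out, out_act):
            # two-hidden-layer MLP, as described in Section 5
            return nn.Sequential(nn.Linear(d_in, mlp_dim), nn.Tanh(),
                                 nn.Linear(mlp_dim, mlp_dim), nn.Tanh(),
                                 nn.Linear(mlp_dim, d_out), out_act)
        self.f1 = mlp(z_dim, z_dim, nn.Identity())   # prior mean m_t^(p)
        self.f2 = mlp(z_dim, z_dim, nn.Softplus())   # prior std  v_t^(p)
        self.f3 = mlp(h_dim, 1, nn.Softplus())       # volatility sigma_t
        self.fh = nn.GRUCell(2 + z_dim, h_dim)       # recurrence over (sigma, r, z)

    def prior(self, z_prev):
        """Parameters of p(z_t | z_{t-1})."""
        return self.f1(z_prev), self.f2(z_prev)

    def step(self, sigma_prev, r_prev, z_t, h_prev):
        """One transition: h_t = f_h(sigma_{t-1}, r_{t-1}, z_t, h_{t-1}), sigma_t = f3(h_t)."""
        h_t = self.fh(torch.cat([sigma_prev, r_prev, z_t], dim=-1), h_prev)
        return self.f3(h_t), h_t

Given sigma_t, a return can then be generated as sigma_t * torch.randn_like(sigma_t), matching $r_{t}\sim\mathcal{N}(0,\sigma_{t}^{2})$.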

The joint probability of the returns $r_{1:T}$ and the stochastic noise $z_{1:T}$ can be factorized as

\begin{aligned}
&\quad p_{\theta}\left(r_{1:T},z_{1:T}|\sigma_{0}=0,r_{0}=0,z_{0}=0\right) \\
&= p_{\theta}\left(r_{1:T}|z_{1:T}\right)p_{\theta}\left(z_{1:T}\right) \\
&= \prod_{t=1}^{T}p_{\theta}\left(r_{t}|r_{1:t-1},z_{1:t}\right)p_{\theta}\left(z_{t}|z_{1:t-1}\right) \\
&= \prod_{t=1}^{T}p_{\theta}\left(r_{t}|\sigma_{t}\right)p_{\theta}\left(z_{t}|z_{t-1}\right). \qquad (1)
\end{aligned}

The marginal log-likelihood of the observed returns $r_{1:T}$, denoted $\mathcal{L}(\theta)=\log p_{\theta}(r_{1:T})$, can be obtained by integrating out $z_{1:T}$ in the joint distribution in Equation (1). However, integrating out $z_{1:T}$ is challenging due to the non-linearity introduced by the deep neural networks, which renders $\mathcal{L}(\theta)$ intractable. Therefore, maximum likelihood methods cannot be used to estimate the parameters of the DSVM. In the next section, we develop a scalable learning and inference algorithm for the DSVM using variational inference.

4 Scalable learning and inference

Instead of maximizing the likelihood $\mathcal{L}(\theta)$ with respect to the parameter $\theta$, we maximize a variational evidence lower bound $ELBO(\theta,\phi)$ with respect to both $\theta$ and $\phi$, where $\phi$ denotes the parameters of the approximate posterior.

Variational inference for the DSVM

Let $q_{\phi}(z_{1:T}|r_{1:T})$ be any approximate posterior for the true posterior $p_{\theta}(z_{1:T}|r_{1:T})$. It can be derived that

\begin{aligned}
\mathcal{L}(\theta) &\geq ELBO(\theta,\phi) \\
&= \int q_{\phi}\left(z_{1:T}|r_{1:T}\right)\log\frac{p_{\theta}\left(r_{1:T},z_{1:T}\right)}{q_{\phi}\left(z_{1:T}|r_{1:T}\right)}\,\mathrm{d}z_{1:T}
\end{aligned}

When $q_{\phi}(z_{1:T}|r_{1:T})$ equals the true posterior $p_{\theta}(z_{1:T}|r_{1:T})$, the lower bound is tight and we have $\mathcal{L}(\theta)=ELBO(\theta,\phi)$. However, the true posterior is intractable. In order to make the lower bound tight, we design an inference network representing $q_{\phi}(z_{1:T}|r_{1:T})$ to best approximate the true posterior $p_{\theta}(z_{1:T}|r_{1:T})$. In fact, the true posterior can be factorized as

\begin{aligned}
p_{\theta}\left(z_{1:T}|r_{1:T}\right) &= \prod_{t=1}^{T}p_{\theta}\left(z_{t}|z_{1:t-1},r_{1:T}\right) \\
&= \prod_{t=1}^{T}p_{\theta}\left(z_{t}|z_{t-1},r_{t:T}\right)
\end{aligned}
Figure 2: The inference network of the DSVM

This factorization can be obtained from Figure 1 using d-separation (Geiger, Verma, and Pearl 1990). It shows that the posterior of $z_{t}$ depends on the past information encoded in $z_{t-1}$ as well as the future information in $r_{t:T}$. We design our inference network according to this factorization of the true posterior. Specifically,

q_{\phi}\left(z_{1:T}|r_{1:T}\right)=\prod_{t=1}^{T}q_{\phi}\left(z_{t}|z_{t-1},r_{t:T}\right) \qquad (2)

In addition, we have the following two factorizations:

\begin{aligned}
p_{\theta}\left(z_{1:T}\right) &= \prod_{t=1}^{T}p_{\theta}\left(z_{t}|z_{t-1}\right) \\
p_{\theta}\left(r_{1:T}|z_{1:T}\right) &= \prod_{t=1}^{T}p_{\theta}\left(r_{t}|r_{1:t-1},z_{1:t}\right)=\prod_{t=1}^{T}p_{\theta}\left(r_{t}|\sigma_{t}\right)
\end{aligned}

The past information $r_{1:t-1},z_{1:t}$ is reflected in $\sigma_{t}$. Therefore, the conditional density of the return at time $t$ given the past returns and the stochastic noises can be replaced by the conditional density of the return given its contemporaneous volatility.

The ELBO can then be derived as

\begin{aligned}
\mathcal{L}(\theta) &\geq ELBO(\theta,\phi) \\
&= \int q_{\phi}\left(z_{1:T}|r_{1:T}\right)\log\frac{p_{\theta}\left(r_{1:T}|z_{1:T}\right)p_{\theta}\left(z_{1:T}\right)}{q_{\phi}\left(z_{1:T}|r_{1:T}\right)}\,\mathrm{d}z_{1:T} \\
&= \int q_{\phi}\left(z_{1:T}|r_{1:T}\right)\log p_{\theta}\left(r_{1:T}|z_{1:T}\right)\mathrm{d}z_{1:T}+\int q_{\phi}\left(z_{1:T}|r_{1:T}\right)\log\frac{p_{\theta}\left(z_{1:T}\right)}{q_{\phi}\left(z_{1:T}|r_{1:T}\right)}\,\mathrm{d}z_{1:T} \\
&= \mathbb{E}_{q_{\phi}\left(z_{1:T}|r_{1:T}\right)}\left[\log\prod_{t=1}^{T}p_{\theta}\left(r_{t}|\sigma_{t}\right)\right]-\int q_{\phi}\left(z_{1:T}|r_{1:T}\right)\log\frac{\prod_{t=1}^{T}q_{\phi}\left(z_{t}|z_{t-1},r_{t:T}\right)}{\prod_{t=1}^{T}p_{\theta}\left(z_{t}|z_{t-1}\right)}\,\mathrm{d}z_{1:T} \\
&= \sum_{t=1}^{T}\mathbb{E}_{q_{\phi}^{*}(z_{t})}\left[\log p_{\theta}\left(r_{t}|\sigma_{t}\right)\right]-\sum_{t=1}^{T}\mathbb{E}_{q_{\phi}^{*}(z_{t-1})}\left[KL\left[q_{\phi}\left(z_{t}|z_{t-1},r_{t:T}\right)\,\|\,p_{\theta}\left(z_{t}|z_{t-1}\right)\right]\right] \qquad (3)
\end{aligned}

where $q_{\phi}^{*}(z_{t})$ is the marginal probability density of $z_{t}$:

\begin{aligned}
q_{\phi}^{*}\left(z_{t}\right) &= \int q_{\phi}\left(z_{1:T}|r_{1:T}\right)\mathrm{d}z_{1:t-1} \\
&= E_{q_{\phi}^{*}\left(z_{t-1}\right)}\left[q_{\phi}\left(z_{t}|z_{t-1},r_{t:T}\right)\right]
\end{aligned}

We approximate the ELBO in Equation (3) using Monte Carlo estimation. Specifically, we sample from $q_{\phi}^{*}(z_{t})$ using ancestral sampling according to Equation (2). Given a sample $z_{t-1}^{(s)}$ from $q_{\phi}^{*}(z_{t-1})$, a sample $z_{t}^{(s)}$ from $q_{\phi}(z_{t}|z_{t-1}^{(s)},r_{t:T})$ follows the distribution $q_{\phi}^{*}(z_{t})$. After we obtain the samples $z_{t}^{(s)}$ for $t=1,\cdots,T$ from $q_{\phi}(z_{1:T}|r_{1:T})$, the ELBO in (3) can be approximated with the generated samples as:

\begin{aligned}
ELBO(\theta,\phi) &\approx \sum_{t=1}^{T}\left[\log p_{\theta}\left(r_{t}\,|\,\sigma_{t}=f_{\sigma}(\sigma_{1:t-1},r_{1:t-1},z_{1:t}^{(s)})\right)\right] \\
&\quad-\sum_{t=1}^{T}\left[KL\left[q_{\phi}\left(z_{t}|z_{t-1}^{(s)},r_{t:T}\right)\,\|\,p_{\theta}\left(z_{t}|z_{t-1}^{(s)}\right)\right]\right] \qquad (4)
\end{aligned}

Parameterization of the inference network

Recall that instead of factorizing $q_{\phi}$ as a mean-field approximation across time steps, i.e., assuming $z_{t}, t=1,\cdots,T$ to be independent, which would ignore the dependence structure of the generative network, we choose the variational posterior to have the same factorization as the true posterior in Equation (2), where the posterior of $z_{t}$ depends on $z_{t-1}$ and $r_{t:T}$.

We additionally assume that the information in $r_{t:T}$ can be encoded with a backward RNN, denoted $g_{A}$, with hidden state $A_{t}$, as follows:

\begin{aligned}
A_{t} &= g_{A}\left(A_{t+1},r_{t}\right) \\
q_{\phi}\left(z_{1:T}|r_{1:T}\right) &= \prod_{t=1}^{T}q_{\phi}\left(z_{t}|z_{t-1},r_{t:T}\right)=\prod_{t=1}^{T}q_{\phi}\left(z_{t}|z_{t-1},A_{t}\right)
\end{aligned}

where $g_{A}$ is an RNN model. The graphical model of the inference network is shown in Figure 2. Here $z_{t-1}$ represents the information coming from the past, whereas $A_{t}$ encodes the information from the present and the future; thus information from all time steps is used to approximate the posterior at each $t$. Further, $q_{\phi}\left(z_{t}|z_{t-1},A_{t}\right)$ is parameterized as a Gaussian density:

\begin{aligned}
z_{t} &\sim \mathcal{N}\left(m^{(q)}_{t},(v^{(q)}_{t})^{2}\right) \\
m^{(q)}_{t} &= g_{1}\left(z_{t-1},A_{t}\right), \qquad v^{(q)}_{t}=g_{2}\left(z_{t-1},A_{t}\right)
\end{aligned}

Here $z_{t}$ is a Gaussian variable with parameters $m^{(q)}_{t}, v^{(q)}_{t}$ given by MLP models $g_{1}$ and $g_{2}$ that depend on $z_{t-1}$ and $A_{t}$. The parameters of the inference network comprise the parameters of the following functions: $\phi=\{g_{1},g_{2},g_{A}\}$.
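
A matching PyTorch sketch of the inference network, under the same illustrative assumptions as the generative sketch above (the class and method names are hypothetical): a GRU cell run backward in time plays the role of $g_{A}$, and MLPs g1 and g2 output the posterior mean and standard deviation.

import torch
import torch.nn as nn

class DSVMInference(nn.Module):
    """Sketch of the DSVM inference network."""
    def __init__(self, z_dim=1, a_dim=10, mlp_dim=16):
        super().__init__()
        self.gA = nn.GRUCell(1, a_dim)       # backward RNN g_A over returns
        def mlp(out_act):
            return nn.Sequential(nn.Linear(z_dim + a_dim, mlp_dim), nn.Tanh(),
                                 nn.Linear(mlp_dim, mlp_dim), nn.Tanh(),
                                 nn.Linear(mlp_dim, z_dim), out_act)
        self.g1 = mlp(nn.Identity())         # posterior mean m_t^(q)
        self.g2 = mlp(nn.Softplus())         # posterior std  v_t^(q)

    def encode(self, r):
        """Run g_A backward over returns r of shape (T, batch, 1):
        A_t = g_A(A_{t+1}, r_t), so A_t summarizes r_{t:T}."""
        T, B, _ = r.shape
        A = r.new_zeros(B, self.gA.hidden_size)
        states = [None] * T
        for t in reversed(range(T)):
            A = self.gA(r[t], A)
            states[t] = A
        return torch.stack(states)           # shape (T, batch, a_dim)

    def posterior(self, z_prev, A_t):
        """Gaussian parameters of q(z_t | z_{t-1}, A_t)."""
        x = torch.cat([z_prev, A_t], dim=-1)
        return self.g1(x), self.g2(x)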

The gradients

The derivative of $ELBO(\theta,\phi)$ with respect to $\theta$ can be calculated as:

\begin{aligned}
&\quad \nabla_{\theta}ELBO(\theta,\phi) \\
&= E_{q_{\phi}\left(z_{1:T}|r_{1:T}\right)}\nabla_{\theta}\log p_{\theta}\left(r_{1:T},z_{1:T}\right) \\
&= \sum_{t=1}^{T}E_{q_{\phi}\left(z_{1:T}|r_{1:T}\right)}\nabla_{\theta}\left[\log p_{\theta}\left(r_{t}|\sigma_{t}\right)p_{\theta}\left(z_{t}|z_{t-1}\right)\right]
\end{aligned}

The derivative of $ELBO(\theta,\phi)$ with respect to $\phi$ is more delicate, as $\phi$ appears in the expectation in $ELBO(\theta,\phi)$. The score function gradient estimator (Glynn 1987; Williams 1992) can be used to approximate the gradient, but the resulting gradients suffer from high variance. The reparameterization trick (Rezende, Mohamed, and Wierstra 2014; Kingma and Welling 2013) is often used instead to obtain a lower-variance gradient estimator. Specifically, in order to obtain a sample $z_{t}$ from $\mathcal{N}(m_{t}^{(q)},(v_{t}^{(q)})^{2})$, we first generate a sample $\eta_{t}\sim \mathcal{N}(0,1)$ and then use $m_{t}^{(q)}+\eta_{t}v_{t}^{(q)}$ as a sample of $z_{t}$, denoted $z_{t}=k_{t}(\eta_{t})$. The gradients can then be backpropagated through the Gaussian random variables, and $\nabla_{\phi}ELBO(\theta,\phi)$ can be calculated as

\begin{aligned}
&\quad \nabla_{\phi}ELBO(\theta,\phi) \\
&= \sum_{t=1}^{T}\nabla_{\phi}E_{q_{\phi}\left(z_{1:T}|r_{1:T}\right)}\log\frac{p_{\theta}\left(r_{t}|\sigma_{t}\right)p_{\theta}\left(z_{t}|z_{t-1}\right)}{q_{\phi}\left(z_{t}|z_{t-1},r_{t:T}\right)} \\
&= \sum_{t=1}^{T}\nabla_{\phi}E_{\eta_{1:t}\sim \mathcal{N}(0,I)}\log\frac{p_{\theta}\left(r_{t}|\sigma_{t}\right)p_{\theta}\left(k_{t}(\eta_{t})|k_{t-1}(\eta_{t-1})\right)}{q_{\phi}\left(k_{t}(\eta_{t})|k_{t-1}(\eta_{t-1}),r_{t:T}\right)} \\
&= \sum_{t=1}^{T}E_{\eta_{1:t}\sim \mathcal{N}(0,I)}\nabla_{\phi}\log\frac{p_{\theta}\left(r_{t}|\sigma_{t}\right)p_{\theta}\left(k_{t}(\eta_{t})|k_{t-1}(\eta_{t-1})\right)}{q_{\phi}\left(k_{t}(\eta_{t})|k_{t-1}(\eta_{t-1}),r_{t:T}\right)}
\end{aligned}

The above inference and learning algorithm for training the DSVM is summarized in Algorithm 1.

Algorithm 1 Structured Inference Algorithm for DSVM
  Inputs: sequences $\{r_{1:T}\}_{i=1}^{N}$; randomly initialized $\phi^{(0)}$ and $\theta^{(0)}$; generative network model $p_{\theta}\left(r_{1:T},z_{1:T}\right)$; inference network model $q_{\phi}\left(z_{1:T}|r_{1:T}\right)$
  Outputs: the parameters $\theta$, $\phi$
  while $Iter$ < the maximum number of iterations do
     1. Sample a mini-batch of sequences $\{r_{1:T}\}_{i=1}^{B}$ uniformly from the dataset
     2. Generate samples $\eta_{t}^{(s)}, t=1,\cdots,T$ to obtain samples $z_{t}^{(s)}, t=1,\cdots,T$ sequentially according to Equation (2), approximating the ELBO in Equation (4)
     3. Compute the gradient $\nabla_{\theta}ELBO(\theta,\phi)$, approximated with one Monte Carlo sample
     4. Compute the gradient $\nabla_{\phi}ELBO(\theta,\phi)$, approximated with one Monte Carlo sample
     5. Update $\theta^{(Iter)},\phi^{(Iter)}$ using ADAM
     Set $Iter=Iter+1$
  end while
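
The following sketch puts the pieces together as one iteration of Algorithm 1, assuming the DSVMGenerative, DSVMInference and elbo_term sketches above; the initialization of $\sigma_{0}$ and the batch layout are illustrative assumptions.

import torch

def train_step(gen, inf, r_batch, opt):
    """One iteration of Algorithm 1 (sketch): a single-sample Monte Carlo
    ELBO over a mini-batch r_batch of shape (T, B), maximized with Adam."""
    T, B = r_batch.shape
    A = inf.encode(r_batch.unsqueeze(-1))       # backward pass: A_t encodes r_{t:T}
    z = torch.zeros(B, 1)                       # z_0 = 0, as in Equation (1)
    h = torch.zeros(B, gen.fh.hidden_size)
    sigma = torch.full((B, 1), r_batch.std().item())   # assumed sigma_0
    r_prev = torch.zeros(B, 1)                  # r_0 = 0
    elbo = 0.0
    for t in range(T):
        m_p, v_p = gen.prior(z)                 # prior p(z_t | z_{t-1})
        m_q, v_q = inf.posterior(z, A[t])       # posterior q(z_t | z_{t-1}, A_t)
        z = m_q + torch.randn_like(m_q) * v_q   # reparameterized z_t
        sigma, h = gen.step(sigma, r_prev, z, h)
        r_t = r_batch[t].unsqueeze(-1)
        elbo = elbo + elbo_term(r_t, sigma, m_q, v_q, m_p, v_p)
        r_prev = r_t
    loss = -elbo / B
    opt.zero_grad()
    loss.backward()                             # joint gradients for theta and phi
    opt.step()
    return loss.item()

Since $\theta$ and $\phi$ are trained jointly, the optimizer should cover both networks, e.g. torch.optim.Adam(list(gen.parameters()) + list(inf.parameters())).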

One-step-ahead forecasting

After training the model, we can use the estimated model for one-step-ahead forecasting. Specifically, the probability of $r_{T+1}$ conditional on the information in $r_{1:T}$ can be derived as:

\begin{aligned}
&\quad p(r_{T+1}|r_{1:T}) \\
&= \int p(r_{T+1},z_{1:T+1}|r_{1:T})\,\mathrm{d}z_{1:T+1} \\
&= \int p(r_{T+1}|z_{1:T+1},r_{1:T})\,p(z_{1:T+1}|r_{1:T})\,\mathrm{d}z_{1:T+1} \\
&= \int p\left(r_{T+1}|\sigma_{T+1}=f_{\sigma}(\sigma_{1:T},r_{1:T},z_{1:T+1})\right)p(z_{T+1}|z_{T})\,p(z_{1:T}|r_{1:T})\,\mathrm{d}z_{1:T+1} \\
&\approx \int p\left(r_{T+1}|\sigma_{T+1}=f_{\sigma}(\sigma_{1:T},r_{1:T},z_{1:T+1})\right)p(z_{T+1}|z_{T})\,q(z_{1:T}|r_{1:T})\,\mathrm{d}z_{1:T+1} \\
&\approx \frac{1}{S}\sum_{s=1}^{S}p\left(r_{T+1}|\sigma_{T+1}=f_{\sigma}(\sigma_{1:T},r_{1:T},z_{1:T+1}^{(s)})\right) \qquad (5)
\end{aligned}

We approximate $p(z_{1:T}|r_{1:T})$ with $q(z_{1:T}|r_{1:T})$ and use ancestral sampling to generate samples from the approximated $p(r_{T+1}|r_{1:T})$. Specifically, we generate $\{z_{1:T}^{(s)}\}_{s=1}^{S}$ from $q(z_{1:T}|r_{1:T})$ and then generate samples $\{z_{T+1}^{(s)}\}_{s=1}^{S}$ from $p(z_{T+1}|z_{T})$ based on $\{z_{T}^{(s)}\}_{s=1}^{S}$. The approximate predictive distribution for $p(r_{T+1}|r_{1:T})$ is then a mixture of $S$ Gaussians. We can then generate $\{r_{T+1}^{(s)}\}_{s=1}^{S}$ from the predictive distribution (5) and use the sample standard deviation of $\{r_{T+1}^{(s)}\}_{s=1}^{S}$ as the predicted volatility for $r_{T+1}$.
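
A hypothetical sketch of this forecasting procedure, again assuming the generative and inference sketches above; the $\sigma_{0}$ initialization is an assumption, and in practice the filtered states from training could be carried over.

import torch

@torch.no_grad()
def forecast_volatility(gen, inf, r, S=1000):
    """One-step-ahead forecast following Equation (5): sample S posterior
    paths z_{1:T} by ancestral sampling, extend each with z_{T+1} ~ p(.|z_T),
    roll the generative recurrence forward, and return the sample standard
    deviation of the S draws of r_{T+1} as the predicted volatility."""
    T = r.shape[0]                                    # r: 1-D tensor of returns
    A = inf.encode(r.view(T, 1, 1).repeat(1, S, 1))   # A_t for each sample path
    z = torch.zeros(S, 1)
    h = torch.zeros(S, gen.fh.hidden_size)
    sigma = torch.full((S, 1), r.std().item())        # sigma_0: an assumed init
    r_prev = torch.zeros(S, 1)
    for t in range(T):                                # filter through t = 1..T
        m_q, v_q = inf.posterior(z, A[t])
        z = m_q + torch.randn_like(m_q) * v_q         # z_t^(s) ~ q(.|z_{t-1}, A_t)
        sigma, h = gen.step(sigma, r_prev, z, h)
        r_prev = r[t].view(1, 1).expand(S, 1)
    m_p, v_p = gen.prior(z)                           # p(z_{T+1} | z_T)
    z_next = m_p + torch.randn_like(m_p) * v_p
    sigma_next, _ = gen.step(sigma, r_prev, z_next, h)
    r_next = sigma_next * torch.randn(S, 1)           # draws from the Gaussian mixture
    return r_next.std()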

5 Real data analysis

In this section, we present the experimental results of the proposed DSVM on the U.S. stock market.

Data and model setting

We choose the daily adjusted returns of the 39 largest stocks by market value in the S&P 500 from 2001-01-02 to 2018-12-31. For each stock, the training and validation data run from 2001-01-02 to 2015-10-18 (3,721 days in total) and the testing data run from 2015-10-19 to 2018-12-31 (806 days in total). We extract sequences of 10 time steps (corresponding to half a trading month) from the original univariate time series; the training, validation and testing data sizes are 105,924, 31,434 and 31,473 respectively (a proportion of roughly 6:2:2).

We train one model for all 39 stocks. The dimension of the latent variable $z$ is chosen to be 1. $f_{1},f_{2},f_{3},g_{1},g_{2}$ are all parameterized by two-hidden-layer MLPs with different activation functions at the last layer (linear for the mean functions $f_{1}$, $g_{1}$; Softplus for the standard deviation functions $f_{2},f_{3},g_{2}$). $f_{h},g_{A}$ are both parameterized by one-layer GRUs with 10 hidden states. The Monte Carlo sample size used for forecasting is 1,000. The model is trained on a Tesla V100 GPU for one hour (300 epochs) until it converges. The final model is chosen by minimizing the validation loss over the 300 epochs. The DSVM is trained only once on the training and validation data, and is then used to perform recursive one-day-ahead forecasting without retraining.

Alternative models

We compare the performance of the proposed model with seven popular alternative volatility models described in Section 2.

  1. Deterministic volatility models: We consider four models: GARCH, GJR-GARCH, TGARCH and EGARCH. Parameter estimation is done with the widely used R package "rugarch" (https://cran.r-project.org/web/packages/rugarch).

  2. Stochastic volatility models: We choose two models, the discrete stochastic volatility model without leverage and the discrete stochastic volatility model with leverage, and use the R package "stochvol" (https://cran.r-project.org/web/packages/stochvol) to estimate the parameters.

  3. Gaussian process volatility model (GP-Vol): The implementation is obtained from the authors' website (https://bitbucket.org/jmh233/code-gpvol-nips2014/src/master) and the hyperparameters are chosen as suggested in (Wu, Lobato, and Ghahramani 2014).

For each type of alternative, we train one model per stock, as these models are designed for individual series. We perform recursive one-day-ahead forecasting, and all the alternative models are retrained at every forecasting time step with a rolling window of 1,000 observations.

Result and discussion

We first evaluate each model in terms of negative log-likelihood (NLL) on the testing samples for each stock. The smaller the NLL value, the better the model explains the distribution of the observed testing samples. The NLL results are listed in Table 1. We first apply the Friedman rank-sum test, which indicates significant differences in the NLL values among the models (the p-value is less than $e^{-35}$). Furthermore, a pairwise post hoc Nemenyi-Friedman test (Demšar 2006) is performed to compare the models, and it confirms that the DSVM performs best among all methods. Our DSVM is the only stochastic volatility model that is able to beat the deterministic volatility models, and it is the best for 36 out of the 39 stocks.

Table 1: Negative log-likelihood (NLL) evaluated on the test samples of 39 stock return time series for different volatility models. The best performing model for each stock is highlighted in bold. The last line shows the mean NLL across the 39 stocks. NA values for TGARCH are due to non-convergence of the parameter estimation (MLE).
Stock DSVM GARCH GJRGARCH TGARCH EGARCH GPVol StoVol StoVolL
ORCL -3.03 -2.87 -2.87 -2.87 -2.88 -2.82 -2.25 -2.23
MSFT -2.96 -2.84 -2.86 -2.86 -2.87 -2.87 -2.44 -2.45
KO -3.35 -3.36 -3.36 -3.38 -3.38 -3.27 -3.14 -3.15
XOM -3.11 -3.10 -3.10 -3.11 -3.11 -2.99 -2.96 -2.97
GE -2.87 -2.81 -2.83 -2.83 -2.80 -2.72 -2.46 -2.46
IBM -3.06 -2.92 -2.92 NA -2.93 -2.96 -2.49 -2.51
PEP -3.29 -3.30 -3.30 NA -3.30 -3.21 -3.13 -3.14
MO -3.08 -3.00 -2.99 -3.01 -3.03 -2.97 -2.71 -2.73
COP -2.52 -2.51 -2.52 -2.52 -2.52 -2.32 -2.40 -2.42
AMGN -2.92 -2.85 -2.86 -2.86 -2.86 -2.81 -2.67 -2.70
SLB -2.82 -2.82 -2.82 -2.82 -2.82 -2.73 -2.73 -2.77
CVX -2.95 -2.93 -2.94 -2.93 -2.93 -2.80 -2.80 -2.84
AAPL -2.89 -2.82 -2.85 -2.85 -2.85 -2.77 -2.52 -2.59
UTX -3.16 -3.12 -3.13 -3.13 -3.12 -3.05 -2.85 -2.89
PG -3.27 -3.25 -3.25 -3.26 -3.26 -3.18 -2.94 -2.98
BMY -2.85 -2.67 -2.66 NA -2.67 -2.71 -2.07 -2.08
BA -2.86 -2.77 -2.77 -2.76 -2.76 -2.71 -2.50 -2.50
ABT -3.00 -2.91 -2.91 -2.92 -2.90 -2.82 -2.51 -2.57
PFE -3.18 -3.15 -3.16 -3.15 -3.15 -3.07 -2.96 -2.97
JNJ -3.27 -3.20 -3.21 -3.19 -3.19 -3.16 -2.83 -2.83
MMM -3.17 -3.09 -3.08 -3.11 -3.09 -3.06 -2.67 -2.71
MRK -3.08 -2.99 -3.01 NA -2.99 -2.92 -2.71 -2.75
DIS -3.12 -3.07 -3.07 -3.08 -3.08 -3.04 -2.86 -2.88
WFC -2.93 -2.88 -2.87 -2.88 -2.88 -2.82 -2.67 -2.74
MCD -3.20 -3.07 -3.04 -3.06 -3.13 -3.03 -2.71 -2.75
JPM -2.98 -2.93 -2.93 -2.93 -2.93 -2.83 -2.66 -2.74
WMT -3.09 -2.90 -2.87 -2.94 -2.94 -2.88 -2.34 -2.34
INTC -2.85 -2.73 -2.76 -2.76 -2.76 -2.65 -2.33 -2.33
BAC -2.74 -2.71 -2.71 -2.71 -2.71 -2.61 -2.53 -2.58
VZ -3.09 -3.05 -3.04 -3.04 -3.05 -2.96 -2.81 -2.80
T -3.09 -3.03 -3.03 -3.04 -3.04 -2.91 -2.66 -2.65
HD -3.12 -3.10 -3.11 -3.11 -3.11 -3.02 -2.94 -2.98
AIG -3.02 -2.91 -2.91 -2.90 -2.89 -2.76 -2.60 -2.63
C -2.82 -2.78 -2.79 -2.80 -2.80 -2.71 -2.58 -2.65
CSCO -3.01 -2.91 -2.93 -2.95 -2.95 -2.86 -2.50 -2.56
QCOM -2.72 -2.52 -2.52 -2.50 -2.53 -2.45 -1.51 -1.50
BRK -3.20 -3.16 -3.16 -3.15 -3.15 -3.09 -2.93 -2.91
AMZN -2.77 -2.58 -2.60 NA -2.65 -2.69 -2.24 -2.29
UNH -3.06 -3.00 -3.01 -3.01 -3.02 -2.91 -2.81 -2.86
Mean -3.01 -2.94 -2.94 -2.95 -2.95 -2.88 -2.63 -2.65

We further plot the predicted volatility of the DSVM for the training and testing data and compare it with the volatility produced by the EGARCH model (the best performing alternative in terms of mean NLL). We use the two largest stocks by market value, MSFT and AAPL, for illustration. Other stocks behave similarly and are omitted from the manuscript due to the page limit. The predicted volatilities for the training and testing data are displayed in Figure 3 (with a randomly selected zoom-in period) and Figure 4 respectively. In general, the DSVM adapts to the market more quickly than the EGARCH model, as reflected by several spikes in the predicted volatility of the DSVM. In addition, the DSVM produces higher predicted volatility when the market is more volatile (see the circled areas) and lower volatility when the market is more stable (see the rectangled areas) for both training and testing data. In contrast, the predicted volatility of EGARCH is smoother, and changes in market conditions are captured less quickly. Take the AAPL stock for example (see Figure 4): the volatility of the EGARCH decays more slowly around June 2016 when the market was calm, and increases more slowly in September 2018 when the market became more volatile. In summary, the DSVM produces predicted volatility that is more responsive to the market status, which is more desirable in a risk measure for the financial market.

Figure 3: Training data. Panels: (a) Stock MSFT; (b) Stock MSFT, zoom-in on a subsample; (c) Stock AAPL; (d) Stock AAPL, zoom-in on a subsample.

Figure 4: Testing data with recursive one-day-ahead forecasting. Panels: Stock MSFT; Stock AAPL.

6 Conclusion

In this work, we propose a deep stochastic volatility model based on the framework of deep latent variable models. The proposed model imposes no restrictions on the functional form of the volatility dynamics; instead, it uses flexible deep learning models to automatically detect the dependence of future volatility on past returns, past volatilities and the stochastic noise. We develop a scalable inference and learning algorithm based on variational inference, in which the parameters of the generative network and the inference network are jointly trained. Real data analysis on the U.S. stock market shows that the DSVM, using only the information from the univariate return time series, outperforms several alternative volatility models. Furthermore, the predicted volatility of the DSVM is more adaptive to the market status, and thus provides a more reliable risk measure that better reflects the risk in the financial market.

References

  • Bayer and Osendorfer (2014) Bayer, J.; and Osendorfer, C. 2014. Learning Stochastic Recurrent Networks. In NIPS 2014 Workshop on Advances in Variational Inference.
  • Bollerslev (1986) Bollerslev, T. 1986. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics 31(3): 307–327.
  • Chung et al. (2015) Chung, J.; Kastner, K.; Dinh, L.; Goel, K.; Courville, A. C.; and Bengio, Y. 2015. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, 2980–2988.
  • Demšar (2006) Demšar, J. 2006. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7: 1–30. ISSN 15337928.
  • Duan (1995) Duan, J. C. 1995. The GARCH option pricing model. Mathematical Finance 5(1): 13–32. ISSN 14679965. doi:10.1111/j.1467-9965.1995.tb00099.x.
  • Fraccaro et al. (2016) Fraccaro, M.; Sønderby, S. K.; Paquet, U.; and Winther, O. 2016. Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems, 2199–2207.
  • Geiger, Verma, and Pearl (1990) Geiger, D.; Verma, T.; and Pearl, J. 1990. Identifying independence in Bayesian networks. Networks 20(5): 507–534.
  • Glosten, Jagannathan, and Runkle (1993) Glosten, L. R.; Jagannathan, R.; and Runkle, D. E. 1993. On the relation between the expected value and the volatility of the nominal excess return on stocks. The Journal of Finance 48(5): 1779–1801.
  • Glynn (1987) Glynn, P. W. 1987. Likelihood ratio gradient estimation: an overview. In Proceedings of the 19th Conference on Winter Simulation, 366–375. ACM.
  • Goard and Mazur (2013) Goard, J.; and Mazur, M. 2013. Stochastic volatility models and the pricing of vix options. Mathematical Finance 23(3): 439–458. ISSN 09601627. doi:10.1111/j.1467-9965.2011.00506.x.
  • Hentschel (1995) Hentschel, L. 1995. All in the family nesting symmetric and asymmetric garch models. Journal of Financial Economics 39(1): 71–104.
  • Heston (1993) Heston, S. L. 1993. A closed-form solution for options with stochastic volatility with applications to bond and currency options. The Review of Financial Studies 6(2): 327–343.
  • Hull and White (1987) Hull, J.; and White, A. 1987. The pricing of options on assets with stochastic volatilities. The Journal of Finance 42(2): 281–300.
  • Kingma and Welling (2013) Kingma, D. P.; and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 .
  • Krishnan, Shalit, and Sontag (2017) Krishnan, R. G.; Shalit, U.; and Sontag, D. 2017. Structured Inference Networks for Nonlinear State Space Models. In AAAI, 2101–2109.
  • Liu et al. (2018) Liu, H.; He, L.; Bai, H.; Dai, B.; Bai, K.; and Xu, Z. 2018. Structured inference for recurrent hidden semi-Markov model. IJCAI International Joint Conference on Artificial Intelligence 2018-July: 2447–2453. ISSN 10450823.
  • Luo et al. (2018) Luo, R.; Zhang, W.; Xu, X.; and Wang, J. 2018. A neural stochastic volatility model. 32nd AAAI Conference on Artificial Intelligence, AAAI 2018 6401–6408.
  • Markowitz (1952) Markowitz, H. 1952. Portfolio selection. The Journal of Finance 7(1): 77–91.
  • Nelson (1991) Nelson, D. B. 1991. Conditional heteroskedasticity in asset returns: A new approach. Econometrica: Journal of the Econometric Society 347–370.
  • Omori et al. (2007) Omori, Y.; Chib, S.; Shephard, N.; and Nakajima, J. 2007. Stochastic volatility with leverage: Fast and efficient likelihood inference. Journal of Econometrics 140(2): 425–449. ISSN 03044076. doi:10.1016/j.jeconom.2006.07.008.
  • Rezende, Mohamed, and Wierstra (2014) Rezende, D. J.; Mohamed, S.; and Wierstra, D. 2014. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In International Conference on Machine Learning, 1278–1286.
  • Kim, Shephard, and Chib (1997) Kim, S.; Shephard, N.; and Chib, S. 1997. Stochastic volatility: likelihood inference and comparison with ARCH models. Review of Economic Studies 81: 159–192.
  • Williams (1992) Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4): 229–256.
  • Wu, Lobato, and Ghahramani (2014) Wu, Y.; Lobato, J. M. H.; and Ghahramani, Z. 2014. Gaussian process volatility model. Advances in Neural Information Processing Systems 1044–1052. ISSN 10495258.
  • Xing, Cambria, and Zhang (2019) Xing, F. Z.; Cambria, E.; and Zhang, Y. 2019. Sentiment-aware volatility forecasting. Knowledge-Based Systems 176: 68–76.
  • Zakoian (1994) Zakoian, J.-M. 1994. Threshold heteroskedastic models. Journal of Economic Dynamics and Control 18(5): 931–955.
  • Zhang et al. (2018) Zhang, Q.; Luo, R.; Yang, Y.; and Liu, Y. 2018. Benchmarking deep sequential models on volatility predictions for financial time series. arXiv preprint arXiv:1811.03711 .