
Continuous-Time Path-Dependent Exploratory Mean-Variance Portfolio Construction

Zhou Fang Department of Mathematics, The University of Texas at Austin
(Feb 2023)
Abstract

In this paper, we present an extended exploratory continuous-time mean-variance framework for portfolio management. Our strategy involves a new clustering method based on simulated annealing, which allows for more practical asset selection. Additionally, we consider past wealth evolution when constructing the mean-variance portfolio. We found that our strategy effectively learns from the past and performs well in practice.

1 Introduction

The pioneering work of Markowitz, Portfolio Selection (1952), is considered the beginning of modern portfolio management; it proposes a framework for constructing portfolios over a single period. Much research in portfolio management followed. Li and Ng (2000) consider the portfolio selection problem in a multi-period setting, and Zhou and Li (2000) further study the problem in a continuous-time setting, which makes high-frequency portfolio management possible. Wang et al. (2018) first introduce an innovative exploratory continuous-time mean-variance framework. It replaces the deterministic policy with a stochastic policy for adjusting asset holdings, balancing exploration against exploitation, and is considered more robust than the deterministic policy. This exploratory framework is further studied in a series of subsequent papers. Wang (2019) extends the framework to a multi-asset setting. Jia and Zhou (2022a, 2022b, 2022c) develop a more general framework for policy evaluation and policy improvement.

In this paper, we further extend the exploratory continuous-time mean-variance framework by first clustering assets into several groups and then constructing a mean-variance portfolio that takes past wealth processes into consideration. In other words, there are two phases in our framework: the clustering phase and the portfolio construction phase. The motivation for adding a clustering phase is to exclude similar assets and thus reduce complexity. For small retail and institutional investors, it is neither possible nor necessary to hold thousands of stocks at one time. In the portfolio construction phase, the motivation for considering the past wealth process, or past performance, is that past experience should matter when making a current decision. A current decision should be based on past lessons together with the current wealth level, rather than on the current wealth level alone. As an intuitive example, a person who entered the stock market in mid-2020 would probably think that making money in the stock market is easy and might start to use high leverage in the expectation of higher gains; this person probably suffered a large loss in 2022. However, a person who entered the stock market in 2008 would probably be more cautious, and therefore more aware of the market environment, such as the Fed's interest rate policy.

Many asset clustering methods have been developed in the past decades. As pointed out in Tang et al. (2022), one necessary criterion that was missing from the previous literature is that two assets from the same cluster should have similar correlations with any other asset. Following the spirit of that paper, we use the similarity metric proposed in Dolphin et al. (2021) instead of correlation. The reason is that correlation sometimes cannot capture similarities between two assets: as pointed out in Dolphin et al. (2021), one asset can have negative returns while the other has high returns, and the two assets can still be highly correlated. The similarity metric we use takes cumulative returns into consideration and can therefore capture pattern similarities between assets. The clustering method we use in this paper is based on simulated annealing and is inspired by Ludkovski et al. (2022). We define an energy function that depends on asset similarities within a group. To the best of our knowledge, this is the first work that uses this similarity metric to cluster assets with a simulated annealing clustering method. In our empirical results this method performs well, and it may even help investors identify assets that offer statistical arbitrage opportunities, although this requires further study.

There is a huge literature on portfolio management and portfolio optimization, but very few works consider portfolio management problems in a path-dependent setting. The path-dependent setting in portfolio management means that the positions in assets depend on the performance of past investments. Learning from the past is an important philosophy that should not be ignored. The path-dependent case has technical difficulties, and for a long time there were no suitable tools. The groundbreaking paper Dupire (2019) introduces functional Ito calculus, which defines a new type of calculus on path space and thus provides path-dependent versions of the Ito formula and the Feynman-Kac formula. However, the path-dependent Hamilton-Jacobi-Bellman equation is very hard to solve numerically. Recent developments in deep learning make solving path-dependent Hamilton-Jacobi-Bellman equations possible. Several works use neural networks to solve path-dependent PDEs, such as Saporito and Zhang (2020) and Sabate-Vidales et al. (2020). In this paper, we use PDGM to solve the PDE numerically; PDGM is a neural network architecture whose main components are LSTM and feedforward networks. LSTM, one of the most classical neural networks, can process sequential information and is naturally suited to summarizing past history. The feedforward network can then model a given functional using the summaries produced by the LSTM. It is worth pointing out that LSTM is designed to forget some information from the past. In the deep learning community, the transformer has largely replaced the LSTM for processing sequential information. We believe that replacing the LSTM component with a transformer component may give better results in solving path-dependent PDEs, which requires further study.

2 Simulated Annealing Clustering

Subsection 2.1 introduces a new similarity metric, and subsection 2.2 introduces the clustering method based on the proposed similarity metric via simulated annealing.

2.1 Similarity Metric of Financial Time Series

As pointed out in Dolphin et al. (2021), two assets can be perfectly correlated while their performances are very different. This is because conventional correlation measures similarity relative to mean returns, so it cannot be the only criterion for clustering assets with similar patterns. In this paper, we use the metric proposed in Dolphin et al. (2021) to measure the similarity of two assets.

Assume there are two time series $X=[x_1,x_2,\ldots,x_n]$ and $Y=[y_1,y_2,\ldots,y_n]$, which represent the returns of two assets. The similarity between these two time series is defined to be

$$\mathbf{sim}(X,Y)=\frac{w}{1+e(X,Y)}+(1-w)\,\tau(X,Y), \tag{1}$$
$$e(X,Y)=\sqrt{\Big(\prod_{1\leq i\leq n}(1+x_i)-\prod_{1\leq i\leq n}(1+y_i)\Big)^{2}}, \tag{2}$$
$$\tau(X,Y)=\frac{\sum_{1\leq i\leq n}x_i y_i}{\sqrt{\sum_{1\leq i\leq n}x_i^{2}}\,\sqrt{\sum_{1\leq i\leq n}y_i^{2}}}. \tag{3}$$

In the above definition of similarity, $e(X,Y)$ measures the distance between the cumulative returns of two assets, and $\tau(X,Y)$ is a modified version of correlation that measures the similarity of absolute performance instead of performance relative to mean returns. $w$ is a hyper-parameter to be determined. Empirically, the metric performs best when $w$ is between 0.4 and 0.6.
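Equations (1) to (3) can be sketched in a few lines of NumPy. This is only an illustration: the function names, the default `w=0.5`, and the return series are assumptions made here, not part of the original method description.

```python
import numpy as np


def cumulative_return_gap(x, y):
    """e(X, Y) of equation (2): absolute gap between cumulative gross returns."""
    return abs(np.prod(1.0 + x) - np.prod(1.0 + y))


def uncentered_correlation(x, y):
    """tau(X, Y) of equation (3): correlation computed without subtracting means."""
    return np.dot(x, y) / (np.sqrt(np.sum(x**2)) * np.sqrt(np.sum(y**2)))


def sim(x, y, w=0.5):
    """Similarity of equation (1); w=0.5 is an illustrative default."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    gap = cumulative_return_gap(x, y)
    return w / (1.0 + gap) + (1.0 - w) * uncentered_correlation(x, y)


# Hypothetical return series: asset_b moves exactly like asset_a, but twice as much.
asset_a = np.array([0.01, -0.02, 0.03, 0.01])
asset_b = 2.0 * asset_a

pearson = np.corrcoef(asset_a, asset_b)[0, 1]  # the two series are perfectly correlated
print(pearson, sim(asset_a, asset_a), sim(asset_a, asset_b))
```

In this toy case the two series are perfectly correlated, yet their cumulative returns differ, so the first term of (1) pulls the similarity of `asset_a` and `asset_b` below the similarity of `asset_a` with itself.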

2.2 Simulated Annealing Clustering

The clustering procedure is based on simulated annealing. The key to the simulated annealing clustering method is the energy function. The ideal energy function should control the size of each cluster and place similar assets in the same group. Let $C=\{C_1,C_2,\ldots,C_k\}$ be a partition of assets. Inspired by Ludkovski et al. (2022), we propose a similar energy function

$$E(C)=\sum_{C_i\in C}\Big(1-\frac{\kappa}{|C|-1}\sum_{X,Y\in C_i}\mathbf{sim}(X,Y)\Big). \tag{4}$$

In the above definition, $|C|$ is the number of clusters, which should never be 1; otherwise, it is not a clustering method. $\kappa$ is a hyper-parameter to be determined, which puts a soft constraint on the number of clusters.

The simulated annealing clustering method works as follows: initially, all assets are in the same group, and at each iteration a perturbation operation moves one asset from one cluster to another, subject to an acceptance criterion. The criterion is based on whether the energy function decreases. If the energy decreases, the perturbation is accepted. If the energy increases, the perturbation is accepted with a probability that depends on the size of the increase and on the current temperature.

The exact simulated annealing clustering method is as follows.
Initialization: Choose an initial temperature $T_0$, a final temperature $T_f$, and a cooling rate $\alpha<1$, and place all assets in the same cluster.
Step 1: Assume the current step is the $l^{\text{th}}$ step, with current temperature $T_l$. Denote the current partition as $C^l$, and apply the perturbation operation to it, which results in a new partition denoted as $C^{l+1}_{new}$.
Step 2: Compute $\Delta E^l=E(C^{l+1}_{new})-E(C^l)$.
If $\Delta E^l<0$: accept the partition $C^{l+1}_{new}$ as the partition $C^{l+1}$ for the next iteration.
Else:

$C^{l+1}_{new}$ is accepted as $C^{l+1}$ with probability $\exp(-\frac{\Delta E^l}{T_l})$;

$C^l$ is kept as $C^{l+1}$ with probability $1-\exp(-\frac{\Delta E^l}{T_l})$.

Step 3: Lower the temperature based on the cooling rate: $T_{l+1}=\alpha T_l$.
Step 4: If $T_{l+1}<T_f$, end the clustering procedure. Otherwise, repeat steps 1 - 3.
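The steps above can be sketched in plain Python as follows. Several choices here are assumptions made for illustration and are not stated in the text: the inner sum of (4) is taken over unordered pairs of distinct assets, a perturbation may move an asset into a new singleton cluster, and the energy of a one-cluster partition (where (4) is undefined) is treated as infinite so that the first move away from the initial partition is always accepted. The similarity matrix `S` and all parameter values are hypothetical.

```python
import math
import random


def energy(partition, sim_matrix, kappa):
    """Energy of equation (4); infinite for a single cluster (assumption)."""
    k = len(partition)
    if k < 2:
        return math.inf
    total = 0.0
    for cluster in partition:
        # Sum of similarities over unordered pairs inside the cluster (assumption).
        pair_sum = sum(sim_matrix[i][j]
                       for pos, i in enumerate(cluster)
                       for j in cluster[pos + 1:])
        total += 1.0 - kappa / (k - 1) * pair_sum
    return total


def perturb(partition, rng):
    """Move one random asset to another cluster or to a new singleton cluster."""
    clusters = [list(c) for c in partition]
    src = rng.randrange(len(clusters))
    asset = clusters[src].pop(rng.randrange(len(clusters[src])))
    targets = [i for i in range(len(clusters)) if i != src] + [len(clusters)]
    dst = rng.choice(targets)
    if dst == len(clusters):
        clusters.append([asset])
    else:
        clusters[dst].append(asset)
    return [c for c in clusters if c]  # drop a cluster that became empty


def anneal(sim_matrix, kappa, t0=1.0, tf=1e-3, alpha=0.95, seed=0):
    rng = random.Random(seed)
    current = [list(range(len(sim_matrix)))]  # all assets in one cluster
    e_cur = energy(current, sim_matrix, kappa)
    temp = t0
    while temp >= tf:
        candidate = perturb(current, rng)               # Step 1
        e_new = energy(candidate, sim_matrix, kappa)
        delta = e_new - e_cur                           # Step 2
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            current, e_cur = candidate, e_new
        temp *= alpha                                   # Step 3; loop test is Step 4
    return current, e_cur


# Hypothetical similarity matrix: assets {0, 1} and {2, 3} form two similar pairs.
S = [[1.0, 0.9, 0.1, 0.2],
     [0.9, 1.0, 0.2, 0.1],
     [0.1, 0.2, 1.0, 0.8],
     [0.2, 0.1, 0.8, 1.0]]
partition, final_energy = anneal(S, kappa=1.0)
print(partition, final_energy)
```

With a geometric cooling schedule the number of iterations is fixed by $T_0$, $T_f$, and $\alpha$, so a slower cooling rate (closer to 1) is the simplest way to give the search more moves.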

3 Exploratory Mean-Variance Framework

The exploratory mean-variance framework was first proposed in Wang et al. (2018) for one-dimensional asset dynamics and extended to multi-dimensional asset dynamics in Wang (2019). In this section, we follow Wang (2019) to introduce the exploratory mean-variance framework.

The portfolio management problem is to form a portfolio that minimizes the variance of wealth subject to a target expected return $z$:

$$\min\ \mathbf{Var}(x_t) \tag{5}$$
$$\text{s.t. }\ \mathbb{E}[x_t]=z \tag{6}$$

Assume the clustering phase is already completed, and there are $n$ clusters. We randomly select $n$ assets by choosing one asset from each cluster, and denote their price processes by $\{S_t^1,S_t^2,\ldots,S_t^n\}$. For each asset $i$, assume that the underlying dynamics are

$$dS_t^i=S_t^i\Big(\mu_t^i\,dt+\sum_{1\leq j\leq n}\sigma_t^{ij}\,dW_t^j\Big).$$

Let $\boldsymbol{\mu}_t=(\mu_t^1,\mu_t^2,\ldots,\mu_t^n)$, $\boldsymbol{\sigma}_t=(\sigma_t^{ij})$, and $d\boldsymbol{W}_t=(dW_t^1,dW_t^2,\ldots,dW_t^n)$ denote the vector of drift rates, the volatility matrix, and the vector of $n$ independent Brownian motions, respectively. Let $\boldsymbol{a}_t=(a_t^1,a_t^2,\ldots,a_t^n)$ denote the holdings (in dollars) of those $n$ assets at time $t$. Therefore, the wealth process is

$$dx_t^{\boldsymbol{a}_t}=(\boldsymbol{\mu}_t-r\boldsymbol{e}_d)^T\boldsymbol{a}_t\,dt+\boldsymbol{a}_t^T\boldsymbol{\sigma}_t\,d\boldsymbol{W}_t. \tag{7}$$
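As a rough illustration of the wealth dynamics (7), the following sketch simulates one path with an Euler-Maruyama scheme. It assumes constant drift, constant volatility, and constant dollar holdings, and it reads $r$ as a risk-free rate so that $r\boldsymbol{e}_d$ is subtracted component-wise; these simplifications, the function name `simulate_wealth`, and all numerical values are assumptions made here.

```python
import numpy as np


def simulate_wealth(x0, a, mu, sigma, r, T, n_steps, rng):
    """Euler-Maruyama discretization of equation (7) with constant coefficients."""
    dt = T / n_steps
    excess = mu - r  # (mu - r e_d), component-wise
    path = np.empty(n_steps + 1)
    path[0] = x0
    for k in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(dt), size=len(a))  # Brownian increments
        path[k + 1] = path[k] + (excess @ a) * dt + a @ sigma @ dW
    return path


# Hypothetical two-asset example.
mu = np.array([0.08, 0.05])
sigma = np.array([[0.20, 0.00],
                  [0.05, 0.10]])
holdings = np.array([100.0, 50.0])  # dollars held in each asset
rng = np.random.default_rng(0)
wealth_path = simulate_wealth(100.0, holdings, mu, sigma, r=0.02, T=1.0,
                              n_steps=252, rng=rng)
print(wealth_path[-1])
```

Setting `sigma` to zero removes the noise term, in which case terminal wealth is simply the initial wealth plus the accumulated excess drift.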

In this paper, the holdings/control is path-dependent, which means that the holdings/control $\boldsymbol{a}_t$ depends on the past history of the wealth process. To differentiate the current wealth level from the path of the wealth process, we use lowercase letters for the current wealth level and uppercase letters for the path of the wealth process. For example, $x_t^{\boldsymbol{a}_t}$ means the wealth level at time $t$ under holdings/control $\boldsymbol{a}_t$, while $X_t$ means a path of the wealth process from time 0 to time $t$. Now, the policy $\boldsymbol{\pi}(\boldsymbol{a}_t|X_t)$ is a probability density: given that the realized wealth path is $X_t$, the holdings/control at time $t$ takes the value $\boldsymbol{a}_t$ with probability density $\boldsymbol{\pi}(\boldsymbol{a}_t|X_t)$.

Under the exploratory framework proposed in Wang et al. (2018), the average performance of the wealth process is described by a new stochastic process:

$$d\tilde{x}_t^{\boldsymbol{\pi}}=(\boldsymbol{\mu}_t-r\boldsymbol{e}_d)^T\boldsymbol{m}_t\,dt+\boldsymbol{m}_t^T\boldsymbol{\sigma}_t\,d\boldsymbol{W}_t+\sqrt{\mathbf{Tr}(\boldsymbol{\Sigma}_t^T\boldsymbol{C}_t)}\,d\widetilde{W}_t. \tag{8}$$

Here, we assume that policy $\boldsymbol{\pi}$ is applied at time $t$. $\widetilde{W}_t$ is an independent Brownian motion, and $\boldsymbol{\Sigma}_t=\boldsymbol{\sigma}_t\boldsymbol{\sigma}_t^T$ is a positive definite matrix. $\boldsymbol{m}_t$ and $\boldsymbol{C}_t$ are as follows:

$$\boldsymbol{m}_t=\int_{\mathbb{R}^n}\boldsymbol{a}_t\,\boldsymbol{\pi}(\boldsymbol{a}_t|\widetilde{X}_t)\,d\boldsymbol{a}_t, \tag{9}$$
$$\boldsymbol{C}_t=\int_{\mathbb{R}^n}[\boldsymbol{a}_t-\boldsymbol{m}_t][\boldsymbol{a}_t-\boldsymbol{m}_t]^T\,\boldsymbol{\pi}(\boldsymbol{a}_t|\widetilde{X}_t)\,d\boldsymbol{a}_t. \tag{10}$$
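To make the roles of $\boldsymbol{m}_t$ and $\boldsymbol{C}_t$ concrete, the sketch below estimates them from samples of a policy and evaluates the coefficients of (8). The Gaussian policy, the sample size, and every numerical value are assumptions chosen for illustration; the integrals (9) and (10) are replaced by sample averages. The final lines check numerically that the policy average of the instantaneous variance $\boldsymbol{a}^T\boldsymbol{\Sigma}\boldsymbol{a}$ splits into $\boldsymbol{m}^T\boldsymbol{\Sigma}\boldsymbol{m}$ plus $\mathbf{Tr}(\boldsymbol{\Sigma}^T\boldsymbol{C})$, which is one way to read the extra exploration term.

```python
import numpy as np


def policy_moments(actions):
    """Sample versions of (9) and (10): mean and covariance of sampled holdings."""
    m = actions.mean(axis=0)
    centered = actions - m
    C = centered.T @ centered / len(actions)
    return m, C


def exploratory_coefficients(mu, sigma, r, m, C):
    """Drift and the two variance contributions of the exploratory dynamics (8)."""
    Sigma = sigma @ sigma.T
    drift = (mu - r) @ m                  # (mu - r e_d)^T m
    var_mean = m @ Sigma @ m              # squared norm of m^T sigma
    var_explore = np.trace(Sigma.T @ C)   # Tr(Sigma^T C)
    return drift, var_mean, var_explore


# Hypothetical market parameters and a hypothetical Gaussian policy over holdings.
rng = np.random.default_rng(0)
mu = np.array([0.08, 0.05])
r = 0.02
sigma = np.array([[0.20, 0.00],
                  [0.05, 0.10]])
actions = rng.multivariate_normal(mean=[100.0, 50.0],
                                  cov=[[25.0, 5.0], [5.0, 16.0]], size=5000)

m, C = policy_moments(actions)
drift, var_mean, var_explore = exploratory_coefficients(mu, sigma, r, m, C)

# Average of a^T Sigma a over the sampled holdings.
Sigma = sigma @ sigma.T
avg_quadratic = np.mean(np.einsum("ki,ij,kj->k", actions, Sigma, actions))
print(drift, var_mean + var_explore, avg_quadratic)
```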

The goal is to identify a policy $\boldsymbol{\pi}$ that minimizes the following objective function, where $w$ is the Lagrange multiplier:

$$\mathbb{E}\Big[(\tilde{x}_T^{\boldsymbol{\pi}}-w)^2+\gamma\int_0^T\int_{\mathbb{R}^n}\boldsymbol{\pi}(\boldsymbol{a}_t|\widetilde{X}_t^{\boldsymbol{\pi}})\log\boldsymbol{\pi}(\boldsymbol{a}_t|\widetilde{X}_t^{\boldsymbol{\pi}})\,d\boldsymbol{a}_t\,dt\Big]-(w-z)^2. \tag{11}$$

This cost function is very similar to the cost function in Wang (2019), except that here it is path-dependent. The superscript $\boldsymbol{\pi}$ indicates that wealth processes are generated under the policy $\boldsymbol{\pi}$. Since both the control and the cost function are now path-dependent, classical stochastic control results cannot simply be applied; a functional version of stochastic control is needed.

4 Path Dependent Stochastic Control

4.1 Functional Ito Calculus

Functional Ito calculus is a crucial and powerful tool developed in Dupire (2019) to study path-dependent stochastic calculus problems. The following is a quick introduction to functional Ito calculus and some important results from Dupire (2019).

Let $\Lambda_t$ denote the set of cadlag paths up to time $t$; specifically, $\Lambda_t^n$ is the space of cadlag paths that take values in $\mathbb{R}^n$. $\Lambda_t^{n\times k}$ denotes the space of pairs of cadlag paths, one taking values in $\mathbb{R}^n$ and the other in $\mathbb{R}^k$. In general, $\Lambda=\bigcup_{t\in[0,T]}\Lambda_t$, and $f:\Lambda\to\mathbb{R}^d$ is a functional.

Let $X_t$ be a path that takes values in $\mathbb{R}$, and let $X_t(u)$ denote its value at time $u$. We define the vertical and flat extensions as follows (for a path that takes values in $\mathbb{R}^n$, the vertical and flat extensions are applied to each dimension individually):

$$X_t^h(u)=\begin{cases}X_t(u)&\text{if }0\leq u<t\\ X_t(t)+h&\text{if }u=t\end{cases}$$
$$X_{t,\delta t}(u)=\begin{cases}X_t(u)&\text{if }0\leq u<t\\ X_t(t)&\text{if }t\leq u\leq t+\delta t\end{cases}$$

Let $f:\Lambda\to\mathbb{R}$ be a functional. Its partial derivatives are defined as follows (for $f:\Lambda\to\mathbb{R}^d$, the definition applies to each dimension individually):

$$\Delta_t f(X_t)=\lim_{\delta t\to 0^+}\frac{f(X_{t,\delta t})-f(X_t)}{\delta t}$$
$$\Delta_x f(X_t)=\lim_{h\to 0}\frac{f(X_t^h)-f(X_t)}{h}$$

The second-order derivative $\Delta_{xx}f$ is obtained by applying $\Delta_x$ twice.
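The two extensions and the two functional derivatives can be illustrated on a discretized path. In the sketch below, a path is stored as an array of times and an array of values and is treated as piecewise constant between grid points; the example functional, $f(X_t)=X_t(t)^2+\int_0^t X_t(u)\,du$, the finite-difference step sizes, and all function names are assumptions chosen for illustration. For this functional one expects $\Delta_x f(X_t)=2X_t(t)$ and $\Delta_t f(X_t)=X_t(t)$.

```python
import numpy as np


def vertical_bump(times, values, h):
    """Vertical extension: bump only the terminal value of the path by h."""
    bumped = values.copy()
    bumped[-1] += h
    return times, bumped


def flat_extension(times, values, dt):
    """Flat extension: extend the path by dt, holding the terminal value fixed."""
    return np.append(times, times[-1] + dt), np.append(values, values[-1])


def example_functional(times, values):
    """f(X_t) = X_t(t)^2 + integral of the path (left Riemann sum)."""
    integral = np.sum(values[:-1] * np.diff(times))
    return values[-1] ** 2 + integral


def time_derivative(f, times, values, dt=1e-6):
    """Finite-difference approximation of Delta_t f."""
    return (f(*flat_extension(times, values, dt)) - f(times, values)) / dt


def space_derivative(f, times, values, h=1e-6):
    """Finite-difference approximation of Delta_x f."""
    return (f(*vertical_bump(times, values, h)) - f(times, values)) / h


# Hypothetical path on [0, 1] with terminal value 1.
times = np.linspace(0.0, 1.0, 11)
values = times ** 2
print(time_derivative(example_functional, times, values),
      space_derivative(example_functional, times, values))
```

The bump changes only the endpoint, so it leaves the running integral untouched, while the flat extension leaves the endpoint untouched and only lengthens the integral; this separation is what distinguishes the two derivatives.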

To understand the functional Ito formula, we need to introduce a metric on the path space $\Lambda$. Let $X_t$ and $Y_s$ be two paths in $\Lambda$, and without loss of generality, assume $t\leq s$. Their distance is defined to be

$$d_\Lambda(X_t,Y_s)=\|X_{t,s-t}-Y_s\|_\infty+(s-t). \tag{12}$$

After defining the metric, we can define the continuity of a functional.

A functional $f:\Lambda\to\mathbb{R}$ is $\Lambda$-continuous at $X_t\in\Lambda$ if for all $\epsilon>0$ there exists $\delta>0$ such that, for all $Y_s$ with $d_\Lambda(X_t,Y_s)<\delta$, we have $|f(X_t)-f(Y_s)|<\epsilon$. A functional is $\Lambda$-continuous if it is $\Lambda$-continuous at all paths.

A functional $f:\Lambda\to\mathbb{R}$ is in $\mathbb{C}^{1,2}$ if it is $\Lambda$-continuous, $C^2$ in $x$, and $C^1$ in $t$, with its partial derivatives also being $\Lambda$-continuous.

Theorem 4.1 (Functional Ito Formula)

Let $x_t$ be a continuous semimartingale, let $f\in\mathbb{C}^{1,2}$, and let $X_t$ be the path of the process $x_t$ up to time $t$. Then, for any $t\in[0,T]$,

$$f(X_t)=f(X_0)+\int_0^t\Delta_t f(X_s)\,ds+\int_0^t\Delta_x f(X_s)\,dx_s+\frac{1}{2}\int_0^t\Delta_{xx}f(X_s)\,d\langle x\rangle_s. \tag{13}$$

4.2 Functional Feynman-Kac Formula and Functional HJB equation

Now, assume that $X_t$ is a path of the wealth process. If we use policy $\boldsymbol{\pi}$ for the remaining time, the cost functional is as follows:

$$J(X_t,\boldsymbol{\pi})=\mathbb{E}\Big[(\tilde{x}_T^{\boldsymbol{\pi}}-w)^2+\gamma\int_t^T\int_{\mathbb{R}^n}\boldsymbol{\pi}(\boldsymbol{a}_s|\widetilde{X}_s^{\boldsymbol{\pi}})\log\boldsymbol{\pi}(\boldsymbol{a}_s|\widetilde{X}_s^{\boldsymbol{\pi}})\,d\boldsymbol{a}_s\,ds\ \Big|\ \widetilde{X}_t^{\boldsymbol{\pi}}=X_t\Big]-(w-z)^2. \tag{14}$$

To simplify our computation, define the following functionals:

$$\widetilde{\boldsymbol{b}}(s,\widetilde{X}_s,\boldsymbol{\pi})=(\boldsymbol{\mu}_s-r\boldsymbol{e}_d)^T\boldsymbol{m}_s, \tag{15}$$
$$\widetilde{\boldsymbol{\sigma}}(s,\widetilde{X}_s,\boldsymbol{\pi})=\Big(\boldsymbol{m}_s^T\boldsymbol{\sigma}_s\ \Big|\ \sqrt{\mathbf{Tr}(\boldsymbol{\Sigma}_s^T\boldsymbol{C}_s)}\Big), \tag{16}$$
$$f(s,\widetilde{X}_s,\boldsymbol{\pi})=\int_{\mathbb{R}^n}\boldsymbol{\pi}(\boldsymbol{a}_s|\widetilde{X}_s^{\boldsymbol{\pi}})\log\boldsymbol{\pi}(\boldsymbol{a}_s|\widetilde{X}_s^{\boldsymbol{\pi}})\,d\boldsymbol{a}_s. \tag{17}$$

Thus, the dynamics and the cost functional are

$$d\tilde{x}_s^{\boldsymbol{\pi}}=\widetilde{\boldsymbol{b}}(s,\widetilde{X}_s,\boldsymbol{\pi})\,ds+\widetilde{\boldsymbol{\sigma}}(s,\widetilde{X}_s,\boldsymbol{\pi})\,(d\boldsymbol{W}_s\ |\ d\widetilde{W}_s)^T, \tag{18}$$
$$J(X_t,\boldsymbol{\pi})=\mathbb{E}\Big[(\tilde{x}_T^{\boldsymbol{\pi}}-w)^2+\gamma\int_t^T f(s,\widetilde{X}_s,\boldsymbol{\pi})\,ds\ \Big|\ \widetilde{X}_t^{\boldsymbol{\pi}}=X_t\Big]-(w-z)^2. \tag{19}$$

Let $\tau>t$, and define

$$Y_\tau=J(\widetilde{X}_\tau,\boldsymbol{\pi})+\gamma\int_t^\tau f(s,\widetilde{X}_s,\boldsymbol{\pi})\,ds. \tag{20}$$

Taking the conditional expectation given $X_t$ and applying the tower property, we obtain

\begin{align*}
\mathbb{E}[Y_\tau|X_t]
&=\mathbb{E}\Big[\mathbb{E}\big[(\tilde{x}_T^{\boldsymbol{\pi}}-w)^2+\gamma\int_\tau^T f(s,\widetilde{X}_s,\boldsymbol{\pi})\,ds\ \big|\ \widetilde{X}_\tau\big]+\gamma\int_t^\tau f(s,\widetilde{X}_s,\boldsymbol{\pi})\,ds\ \Big|\ X_t\Big]-(w-z)^2\\
&=\mathbb{E}\Big[\mathbb{E}\big[(\tilde{x}_T^{\boldsymbol{\pi}}-w)^2+\gamma\int_\tau^T f(s,\widetilde{X}_s,\boldsymbol{\pi})\,ds\ \big|\ \widetilde{X}_\tau\big]\ \Big|\ X_t\Big]+\mathbb{E}\Big[\gamma\int_t^\tau f(s,\widetilde{X}_s,\boldsymbol{\pi})\,ds\ \Big|\ X_t\Big]-(w-z)^2\\
&=\mathbb{E}\Big[(\tilde{x}_T^{\boldsymbol{\pi}}-w)^2+\gamma\int_\tau^T f(s,\widetilde{X}_s,\boldsymbol{\pi})\,ds\ \Big|\ X_t\Big]+\mathbb{E}\Big[\gamma\int_t^\tau f(s,\widetilde{X}_s,\boldsymbol{\pi})\,ds\ \Big|\ X_t\Big]-(w-z)^2\\
&=\mathbb{E}\Big[(\tilde{x}_T^{\boldsymbol{\pi}}-w)^2+\gamma\int_t^T f(s,\widetilde{X}_s,\boldsymbol{\pi})\,ds\ \Big|\ X_t\Big]-(w-z)^2\\
&=J(X_t,\boldsymbol{\pi})\\
&=Y_t.
\end{align*}

Therefore, the process $Y_\tau$ is a martingale. Applying the functional Ito formula to $Y_\tau$ and requiring its drift to vanish gives

$$\widetilde{\boldsymbol{b}}(\tau,\widetilde{X}_\tau,\boldsymbol{\pi})\Delta_x J(\widetilde{X}_\tau,\boldsymbol{\pi})+\Delta_t J(\widetilde{X}_\tau,\boldsymbol{\pi})+\frac{1}{2}\widetilde{\boldsymbol{\sigma}}(\tau,\widetilde{X}_\tau,\boldsymbol{\pi})\widetilde{\boldsymbol{\sigma}}(\tau,\widetilde{X}_\tau,\boldsymbol{\pi})^T\Delta_{xx}J(\widetilde{X}_\tau,\boldsymbol{\pi})+\gamma f(\tau,\widetilde{X}_\tau,\boldsymbol{\pi})=0. \tag{21}$$

To simplify the notation, assume the policy $\boldsymbol{\pi}$ is applied from time 0, so that at time $t$ there is a path $X_t=\widetilde{X}_t^{\boldsymbol{\pi}}$. We then have the functional Feynman-Kac formula,

$$\widetilde{\boldsymbol{b}}(t,X_t,\boldsymbol{\pi})\Delta_x J(X_t,\boldsymbol{\pi})+\Delta_t J(X_t,\boldsymbol{\pi})+\frac{1}{2}\widetilde{\boldsymbol{\sigma}}(t,X_t,\boldsymbol{\pi})\widetilde{\boldsymbol{\sigma}}(t,X_t,\boldsymbol{\pi})^T\Delta_{xx}J(X_t,\boldsymbol{\pi})+\gamma f(t,X_t,\boldsymbol{\pi})=0. \tag{22}$$

Now, let $V(X_t)=\inf_{\boldsymbol{\pi}}J(X_t,\boldsymbol{\pi})$. The following is an informal derivation of the path-dependent HJB equation. A rigorous derivation of the path-dependent HJB equation for portfolio management problems faces many technical challenges and is beyond the scope of this paper.

\begin{align}
V(X_t)&=\inf_{\boldsymbol{\pi}}\ J(X_t,\boldsymbol{\pi}) \tag{23}\\
&=\inf_{\boldsymbol{\pi}}\ \mathbb{E}\Big[(\tilde{x}_T^{\boldsymbol{\pi}}-w)^2+\gamma\int_t^T f(s,\widetilde{X}_s,\boldsymbol{\pi})\,ds\ \Big|\ X_t\Big]-(w-z)^2 \tag{24}\\
&=\inf_{\boldsymbol{\pi}}\ \mathbb{E}\Big[\mathbb{E}\big[(\tilde{x}_T^{\boldsymbol{\pi}}-w)^2+\gamma\int_{t+dt}^T f(s,\widetilde{X}_s,\boldsymbol{\pi})\,ds\ \big|\ \widetilde{X}_{t+dt}\big]\ \Big|\ X_t\Big]-(w-z)^2+\mathbb{E}\Big[\gamma\int_t^{t+dt}f(s,\widetilde{X}_s,\boldsymbol{\pi})\,ds\ \Big|\ X_t\Big] \tag{25}\\
&=\inf_{\boldsymbol{\pi}}\ \mathbb{E}\Big[(\tilde{x}_T^{\boldsymbol{\pi}}-w)^2+\gamma\int_t^T f(s,\widetilde{X}_s,\boldsymbol{\pi})\,ds\ \Big|\ X_t\Big]-(w-z)^2 \tag{26}\\
&=\inf_{\boldsymbol{\pi}}\ \mathbb{E}\Big[\mathbb{E}\big[(\tilde{x}_T^{\boldsymbol{\pi}}-w)^2+\gamma\int_{t+dt}^T f(s,\widetilde{X}_s,\boldsymbol{\pi})\,ds\ \big|\ \widetilde{X}_{t+dt}\big]-(w-z)^2\ \Big|\ X_t\Big]+\mathbb{E}\Big[\gamma\int_t^{t+dt}f(s,\widetilde{X}_s,\boldsymbol{\pi})\,ds\ \Big|\ X_t\Big] \tag{27}
\end{align}

The first term in the above equation depends on the policy applied from time $t+dt$ onward, which is independent of the policy applied during the interval $[t,t+dt]$. In other words, the policy $\boldsymbol{\pi}$ appearing in the above equation is a combination of the policy applied in the interval $[t,t+dt]$ and the policy applied after time $t+dt$. Therefore, the above equation becomes

$$V(X_t)=\inf_{\boldsymbol{\pi}}\ \mathbb{E}\big[V(X_{t+dt})\ \big|\ X_t\big]+\mathbb{E}\Big[\gamma\int_t^{t+dt}f(s,\widetilde{X}_s,\boldsymbol{\pi})\,ds\ \Big|\ X_t\Big]. \tag{28}$$

Moving the left-hand side to the right and letting $dt\to 0$, the above equation becomes

\begin{align}
0&=\inf_{\boldsymbol{\pi}}\ \mathbb{E}\big[V(X_{t+dt})-V(X_t)\ \big|\ X_t\big]+\mathbb{E}\Big[\gamma\int_t^{t+dt}f(s,\widetilde{X}_s,\boldsymbol{\pi})\,ds\ \Big|\ X_t\Big] \tag{29}\\
&=\inf_{\boldsymbol{\pi}}\ \mathbb{E}\Big[\Delta_t V(X_t)\,dt+\Delta_x V(X_t)\,d\tilde{x}_t^{\boldsymbol{\pi}}+\frac{1}{2}\Delta_{xx}V(X_t)\,d\langle\tilde{x}^{\boldsymbol{\pi}}\rangle_t\ \Big|\ X_t\Big]+\gamma f(t,X_t,\boldsymbol{\pi})\,dt \tag{30}\\
&=\inf_{\boldsymbol{\pi}}\ \Big[\Delta_t V(X_t)+\widetilde{\boldsymbol{b}}(t,X_t,\boldsymbol{\pi})\Delta_x V(X_t)+\frac{1}{2}\widetilde{\boldsymbol{\sigma}}(t,X_t,\boldsymbol{\pi})\widetilde{\boldsymbol{\sigma}}(t,X_t,\boldsymbol{\pi})^T\Delta_{xx}V(X_t)+\gamma f(t,X_t,\boldsymbol{\pi})\Big]dt \tag{31}
\end{align}

Dividing both sides by $dt$, we have the following path-dependent HJB equation:

$$\Delta_t V(X_t)+\inf_{\boldsymbol{\pi}}\Big\{\widetilde{\boldsymbol{b}}(t,X_t,\boldsymbol{\pi})\Delta_x V(X_t)+\frac{1}{2}\widetilde{\boldsymbol{\sigma}}(t,X_t,\boldsymbol{\pi})\widetilde{\boldsymbol{\sigma}}(t,X_t,\boldsymbol{\pi})^T\Delta_{xx}V(X_t)+\gamma f(t,X_t,\boldsymbol{\pi})\Big\}=0. \tag{32}$$

Consider the expression inside the infimum. Plugging in $\widetilde{\boldsymbol{b}}(t,X_t,\boldsymbol{\pi})$, $\widetilde{\boldsymbol{\sigma}}(t,X_t,\boldsymbol{\pi})$, and $f(t,X_t,\boldsymbol{\pi})$, we obtain

$$\int_{\mathbb{R}^n}\Big[(\boldsymbol{\mu}_t-r\boldsymbol{e}_d)^T\boldsymbol{a}_t\,\Delta_x V(X_t)+\frac{1}{2}\boldsymbol{a}_t^T\boldsymbol{\sigma}_t\boldsymbol{\sigma}_t^T\boldsymbol{a}_t\,\Delta_{xx}V(X_t)+\gamma\log\boldsymbol{\pi}(\boldsymbol{a}_t|X_t)\Big]\boldsymbol{\pi}(\boldsymbol{a}_t|X_t)\,d\boldsymbol{a}_t. \tag{33}$$

To find a candidate for the optimal policy, the following expression $L(\boldsymbol{a}_t,X_t,\boldsymbol{\pi})$ has to be independent of $\boldsymbol{a}_t$ and depend only on $X_t$:

$$L(\boldsymbol{a}_t,X_t,\boldsymbol{\pi})=(\boldsymbol{\mu}_t-r\boldsymbol{e}_d)^T\boldsymbol{a}_t\,\Delta_x V(X_t)+\frac{1}{2}\boldsymbol{a}_t^T\boldsymbol{\sigma}_t\boldsymbol{\sigma}_t^T\boldsymbol{a}_t\,\Delta_{xx}V(X_t)+\gamma\log\boldsymbol{\pi}(\boldsymbol{a}_t|X_t), \tag{34}$$
$$U(\boldsymbol{a}_t,X_t)=(\boldsymbol{\mu}_t-r\boldsymbol{e}_d)^T\boldsymbol{a}_t\,\Delta_x V(X_t)+\frac{1}{2}\boldsymbol{a}_t^T\boldsymbol{\sigma}_t\boldsymbol{\sigma}_t^T\boldsymbol{a}_t\,\Delta_{xx}V(X_t). \tag{35}$$

Assume $L(\boldsymbol{a}_t,X_t,\boldsymbol{\pi})$ depends on $\boldsymbol{a}_t$. Since $L(\boldsymbol{a}_t,X_t,\boldsymbol{\pi})$ is continuous at every point $\boldsymbol{a}_t$, then, without loss of generality, there exist two regions $V_1$ and $V_2$ and a constant $h>0$ such that $L(\boldsymbol{a}_t,X_t,\boldsymbol{\pi})-L(\boldsymbol{a}'_t,X_t,\boldsymbol{\pi})>h$ for all points $\boldsymbol{a}_t\in V_1$ and $\boldsymbol{a}'_t\in V_2$. Without loss of generality, assume both regions have the same volume $V$. Now, let $\epsilon$ be sufficiently small, and consider the perturbed policy $\boldsymbol{\pi}^*$ with $\boldsymbol{\pi}^*(\boldsymbol{a}_t|X_t)=\boldsymbol{\pi}(\boldsymbol{a}_t|X_t)+\epsilon$ for $\boldsymbol{a}_t\in V_1$, $\boldsymbol{\pi}^*(\boldsymbol{a}_t|X_t)=\boldsymbol{\pi}(\boldsymbol{a}_t|X_t)-\epsilon$ for $\boldsymbol{a}_t\in V_2$, and $\boldsymbol{\pi}^*=\boldsymbol{\pi}$ on the rest of the space. Therefore, the change of $L(\boldsymbol{a}_t,X_t,\boldsymbol{\pi})\boldsymbol{\pi}(\boldsymbol{a}_t|X_t)$ on region $V_1$ is

\begin{align}
&L(\boldsymbol{a}_t,X_t,\boldsymbol{\pi}^*)\boldsymbol{\pi}^*(\boldsymbol{a}_t|X_t)-L(\boldsymbol{a}_t,X_t,\boldsymbol{\pi})\boldsymbol{\pi}(\boldsymbol{a}_t|X_t) \tag{36}\\
={}&\epsilon U(\boldsymbol{a}_t,X_t)+\gamma\big(\boldsymbol{\pi}(\boldsymbol{a}_t|X_t)+\epsilon\big)\log\big(\boldsymbol{\pi}(\boldsymbol{a}_t|X_t)+\epsilon\big)-\gamma\boldsymbol{\pi}(\boldsymbol{a}_t|X_t)\log\boldsymbol{\pi}(\boldsymbol{a}_t|X_t) \tag{37}\\
={}&\epsilon U(\boldsymbol{a}_t,X_t)+\gamma\big(\boldsymbol{\pi}(\boldsymbol{a}_t|X_t)+\epsilon\big)\Big[\log\boldsymbol{\pi}(\boldsymbol{a}_t|X_t)+\log\Big(1+\frac{\epsilon}{\boldsymbol{\pi}(\boldsymbol{a}_t|X_t)}\Big)\Big]-\gamma\boldsymbol{\pi}(\boldsymbol{a}_t|X_t)\log\boldsymbol{\pi}(\boldsymbol{a}_t|X_t) \tag{38}\\
={}&\epsilon U(\boldsymbol{a}_t,X_t)+\epsilon\gamma\log\boldsymbol{\pi}(\boldsymbol{a}_t|X_t)+\gamma\epsilon+\mathcal{O}(\epsilon^2) \tag{39}
\end{align}
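The step from (38) to (39) can be spelled out as follows. The shorthand $\pi$ for $\boldsymbol{\pi}(\boldsymbol{a}_t|X_t)$ and the requirement $\pi>0$ with $\epsilon$ small relative to $\pi$ are assumptions introduced here for readability.

```latex
% Shorthand (assumption): \pi := \boldsymbol{\pi}(\boldsymbol{a}_t|X_t) > 0 and \epsilon \ll \pi.
\begin{align*}
\log\Big(1+\frac{\epsilon}{\pi}\Big)
  &= \frac{\epsilon}{\pi}-\frac{\epsilon^{2}}{2\pi^{2}}+\mathcal{O}(\epsilon^{3}),\\
(\pi+\epsilon)\log\Big(1+\frac{\epsilon}{\pi}\Big)
  &= \epsilon+\frac{\epsilon^{2}}{2\pi}+\mathcal{O}(\epsilon^{3}),\\
\gamma(\pi+\epsilon)\Big[\log\pi+\log\Big(1+\frac{\epsilon}{\pi}\Big)\Big]-\gamma\pi\log\pi
  &= \epsilon\gamma\log\pi+\gamma\epsilon+\mathcal{O}(\epsilon^{2}).
\end{align*}
```

The $\gamma\pi\log\pi$ terms cancel exactly, so only the first-order terms $\epsilon\gamma\log\pi$ and $\gamma\epsilon$ survive alongside $\epsilon U$.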

Here, since the ϵ\epsilon is sufficiently small, we use Taylor expansion for the log(1+ϵ𝝅(𝒂t|Xt))\log(1+\frac{\epsilon}{\boldsymbol{\pi}(\boldsymbol{a}_{t}|X_{t})}) to the (37). Similarly, for region V2V_{2}, the change of the L(𝒂t,Xt,𝝅)𝝅(𝒂t|Xt)L(\boldsymbol{a}_{t},X_{t},\boldsymbol{\pi})\boldsymbol{\pi}(\boldsymbol{a}_{t}|X_{t}) is

\begin{align}
& L(\boldsymbol{a}_{t},X_{t},\boldsymbol{\pi}^{*})\boldsymbol{\pi}^{*}(\boldsymbol{a}_{t}|X_{t})-L(\boldsymbol{a}_{t},X_{t},\boldsymbol{\pi})\boldsymbol{\pi}(\boldsymbol{a}_{t}|X_{t}) \tag{40}\\
={}& -\epsilon U(\boldsymbol{a}_{t},X_{t})+\gamma(\boldsymbol{\pi}(\boldsymbol{a}_{t}|X_{t})-\epsilon)\log(\boldsymbol{\pi}(\boldsymbol{a}_{t}|X_{t})-\epsilon)-\gamma\boldsymbol{\pi}(\boldsymbol{a}_{t}|X_{t})\log\boldsymbol{\pi}(\boldsymbol{a}_{t}|X_{t}) \tag{41}\\
={}& -\epsilon U(\boldsymbol{a}_{t},X_{t})+\gamma(\boldsymbol{\pi}(\boldsymbol{a}_{t}|X_{t})-\epsilon)\Big[\log\boldsymbol{\pi}(\boldsymbol{a}_{t}|X_{t})+\log\Big(1-\frac{\epsilon}{\boldsymbol{\pi}(\boldsymbol{a}_{t}|X_{t})}\Big)\Big]-\gamma\boldsymbol{\pi}(\boldsymbol{a}_{t}|X_{t})\log\boldsymbol{\pi}(\boldsymbol{a}_{t}|X_{t}) \tag{42}\\
={}& -\epsilon U(\boldsymbol{a}_{t},X_{t})-\epsilon\gamma\log\boldsymbol{\pi}(\boldsymbol{a}_{t}|X_{t})-\gamma\epsilon+\mathcal{O}(\epsilon^{2}) \tag{43}
\end{align}
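The first-order expansions in (39) and (43) can be checked numerically. The short sketch below compares the exact change of the entropy term $\gamma\boldsymbol{\pi}\log\boldsymbol{\pi}$ with its linearisation $\pm\epsilon\gamma(\log\boldsymbol{\pi}+1)$; the values of `pi_val`, `gamma` and `eps` are illustrative assumptions and do not come from the paper.

```python
import math

def entropy_term_change(pi_val, eps, gamma):
    """Exact change of gamma * pi * log(pi) when the density moves from pi to pi + eps."""
    return gamma * ((pi_val + eps) * math.log(pi_val + eps)
                    - pi_val * math.log(pi_val))

def first_order_change(pi_val, eps, gamma):
    """Linearisation used in (39) and (43): eps * gamma * (log(pi) + 1)."""
    return eps * gamma * (math.log(pi_val) + 1.0)

# Illustrative values (assumptions for this check only).
pi_val, gamma = 0.3, 0.01
for eps in (1e-2, 1e-3, 1e-4):
    for signed_eps in (eps, -eps):  # +eps on V_1, -eps on V_2
        remainder = abs(entropy_term_change(pi_val, signed_eps, gamma)
                        - first_order_change(pi_val, signed_eps, gamma))
        # The remainder shrinks like eps^2, roughly gamma * eps^2 / (2 * pi_val).
        print(f"eps={signed_eps:+.0e}  remainder={remainder:.3e}")
```

Each tenfold reduction of `eps` should reduce the printed remainder by about a factor of one hundred, which is the $\mathcal{O}(\epsilon^{2})$ behaviour the derivation relies on.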

Therefore, the change of (31) is

\begin{align}
& \int_{V_{1}}\big[\epsilon U(\boldsymbol{a}_{t},X_{t})+\epsilon\gamma\log\boldsymbol{\pi}(\boldsymbol{a}_{t}|X_{t})+\gamma\epsilon\big]d\boldsymbol{a}_{t}-\int_{V_{2}}\big[\epsilon U(\boldsymbol{a}_{t},X_{t})+\epsilon\gamma\log\boldsymbol{\pi}(\boldsymbol{a}_{t}|X_{t})+\gamma\epsilon\big]d\boldsymbol{a}_{t} \tag{44}\\
={}& \epsilon\int_{V_{1}}\big(U(\boldsymbol{a}_{t},X_{t})+\gamma\log\boldsymbol{\pi}(\boldsymbol{a}_{t}|X_{t})\big)d\boldsymbol{a}_{t}-\epsilon\int_{V_{2}}\big(U(\boldsymbol{a}_{t},X_{t})+\gamma\log\boldsymbol{\pi}(\boldsymbol{a}_{t}|X_{t})\big)d\boldsymbol{a}_{t}+\epsilon\gamma\int_{V_{1}}d\boldsymbol{a}_{t}-\epsilon\gamma\int_{V_{2}}d\boldsymbol{a}_{t} \tag{45}\\
={}& \epsilon\int_{V_{1}}\big(U(\boldsymbol{a}_{t},X_{t})+\gamma\log\boldsymbol{\pi}(\boldsymbol{a}_{t}|X_{t})\big)d\boldsymbol{a}_{t}-\epsilon\int_{V_{2}}\big(U(\boldsymbol{a}_{t},X_{t})+\gamma\log\boldsymbol{\pi}(\boldsymbol{a}_{t}|X_{t})\big)d\boldsymbol{a}_{t}+\epsilon\gamma V-\epsilon\gamma V \tag{46}\\
={}& \epsilon\int_{V_{1}}L(\boldsymbol{a}_{t},X_{t},\boldsymbol{\pi})d\boldsymbol{a}_{t}-\epsilon\int_{V_{2}}L(\boldsymbol{a}_{t},X_{t},\boldsymbol{\pi})d\boldsymbol{a}_{t}>\epsilon hV \tag{47}
\end{align}

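From the above derivation, one can see that whenever $L(\boldsymbol{a}_{t},X_{t},\boldsymbol{\pi})$ depends on $\boldsymbol{a}_{t}$, there is always an improved policy, obtained by slightly adjusting the probability density. Hence, under the optimal policy, $L$ cannot depend on $\boldsymbol{a}_{t}$ and must equal some function $C(X_{t})$ of the wealth path alone: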

$$(\boldsymbol{\mu}_{t}-r\boldsymbol{e}_{d})^{T}\boldsymbol{a}_{t}\Delta_{x}V(X_{t})+\frac{1}{2}\boldsymbol{a}_{t}^{T}\boldsymbol{\sigma}_{t}\boldsymbol{\sigma}_{t}^{T}\boldsymbol{a}_{t}\Delta_{xx}V(X_{t})+\gamma\log\boldsymbol{\pi}(\boldsymbol{a}_{t}|X_{t})=C(X_{t}) \tag{48}$$

Thus, the optimal policy is of the following form, where $A(X_{t})$ does not depend on $\boldsymbol{a}_{t}$:

$$\boldsymbol{\pi}(\boldsymbol{a}_{t}|X_{t})=A(X_{t})\exp\Big(-\frac{1}{\gamma}\big((\boldsymbol{\mu}_{t}-r\boldsymbol{e}_{d})^{T}\boldsymbol{a}_{t}\Delta_{x}V(X_{t})+\frac{1}{2}\boldsymbol{a}_{t}^{T}\boldsymbol{\sigma}_{t}\boldsymbol{\sigma}_{t}^{T}\boldsymbol{a}_{t}\Delta_{xx}V(X_{t})\big)\Big) \tag{49}$$

Since $\boldsymbol{\pi}(\boldsymbol{a}_{t}|X_{t})$ is a probability density and therefore integrates to one, we have

$$\boldsymbol{\pi}(\boldsymbol{a}_{t}|X_{t})=\frac{\exp\Big(-\frac{1}{\gamma}\big((\boldsymbol{\mu}_{t}-r\boldsymbol{e}_{d})^{T}\boldsymbol{a}_{t}\Delta_{x}V(X_{t})+\frac{1}{2}\boldsymbol{a}_{t}^{T}\boldsymbol{\sigma}_{t}\boldsymbol{\sigma}_{t}^{T}\boldsymbol{a}_{t}\Delta_{xx}V(X_{t})\big)\Big)}{\int_{\mathbb{R}^{n}}\exp\Big(-\frac{1}{\gamma}\big((\boldsymbol{\mu}_{t}-r\boldsymbol{e}_{d})^{T}\boldsymbol{a}_{t}\Delta_{x}V(X_{t})+\frac{1}{2}\boldsymbol{a}_{t}^{T}\boldsymbol{\sigma}_{t}\boldsymbol{\sigma}_{t}^{T}\boldsymbol{a}_{t}\Delta_{xx}V(X_{t})\big)\Big)d\boldsymbol{a}_{t}} \tag{50}$$

Thus, $\boldsymbol{\pi}(\boldsymbol{a}_{t}|X_{t})$ is a Gaussian distribution; more specifically,

$$\boldsymbol{\pi}(\,\cdot\,|X_{t})\sim\mathcal{N}\Big(-(\boldsymbol{\sigma}_{t}\boldsymbol{\sigma}_{t}^{T})^{-1}(\boldsymbol{\mu}_{t}-r\boldsymbol{e}_{d})\frac{\Delta_{x}V(X_{t})}{\Delta_{xx}V(X_{t})}\,,\,(\boldsymbol{\sigma}_{t}\boldsymbol{\sigma}_{t}^{T})^{-1}\frac{\gamma}{\Delta_{xx}V(X_{t})}\Big) \tag{51}$$
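To make (51) concrete, the following sketch computes the mean and covariance of the Gaussian policy and draws one allocation from it. It is only an illustration: the function name `gaussian_policy`, the two-asset inputs and the derivative values `v_x`, `v_xx` are hypothetical, and the sketch assumes $\Delta_{xx}V(X_{t})>0$ so that the covariance is positive definite.

```python
import numpy as np

def gaussian_policy(mu, sigma, r, gamma, v_x, v_xx):
    """Mean and covariance of the exploratory policy in (51).

    mu        : (d,) drift vector
    sigma     : (d, d) volatility matrix
    v_x, v_xx : first and second space derivatives of V at the current path
    """
    if v_xx <= 0:
        raise ValueError("v_xx must be positive for a valid Gaussian policy")
    cov_assets = sigma @ sigma.T
    excess = mu - r * np.ones_like(mu)
    # mean = -(sigma sigma^T)^{-1} (mu - r e_d) * v_x / v_xx
    mean = -np.linalg.solve(cov_assets, excess) * v_x / v_xx
    # covariance = (sigma sigma^T)^{-1} * gamma / v_xx
    cov = np.linalg.inv(cov_assets) * gamma / v_xx
    return mean, cov

# Hypothetical two-asset example.
mu = np.array([0.08, 0.05])
sigma = np.array([[0.20, 0.00],
                  [0.05, 0.15]])
mean, cov = gaussian_policy(mu, sigma, r=0.0, gamma=0.01, v_x=-0.5, v_xx=2.0)

rng = np.random.default_rng(0)
allocation = rng.multivariate_normal(mean, cov)  # one exploratory allocation
print(mean, allocation)
```

A larger $\gamma$ widens the covariance and therefore the exploration around the mean, while the mean itself does not depend on $\gamma$.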

Plugging this policy back into the HJB equation, as shown in Wang (2019), gives

$$\Delta_{t}V(X_{t})-\frac{\|\boldsymbol{\sigma}_{t}^{-1}(\boldsymbol{\mu}_{t}-r\boldsymbol{e}_{d})\|^{2}}{2}\frac{[\Delta_{x}V(X_{t})]^{2}}{\Delta_{xx}V(X_{t})}+\frac{\gamma}{2}\bigg[d-d\ln\frac{2\pi e\gamma}{\Delta_{xx}V(X_{t})}+\ln\det(\boldsymbol{\sigma}_{t}^{T}\boldsymbol{\sigma}_{t})\bigg]=0 \tag{52}$$

In the path-dependent case, it is not possible to write down an analytic solution of equation (28). Solving equation (28) numerically is the only option and should be sufficient for practitioners. It is natural to solve a path-dependent PDE numerically with neural networks, because the deep learning community has developed ways to handle sequential information.

5 Deep Learning PDE Solver

To solve (28) numerically, we need to approximate the value function $V(X_{t})$, where $X_{t}$ is the path of the wealth process up to time $t$. However, there is no magic in deep learning: no deep learning algorithm can directly take a continuous sample path as input and output a value. Therefore, we have to discretize time and the sample path, feed the discretized version into a neural network, and read off the output value.

Perhaps the best-known neural network for capturing long-term dependence is the Long Short-Term Memory (LSTM) network, a special type of recurrent neural network designed to mitigate the vanishing-gradient problem of plain recurrent networks (exploding gradients are a separate failure mode and are usually handled by other means, such as gradient clipping). Feeding our discretized sample path into an LSTM yields outputs that represent the path information compactly. After capturing the path information, one step remains: modeling the value function based on the outputs that contain the path information. This step is easier, because by the universal approximation theorem a feed-forward neural network suffices. In the rest of this section, we follow the framework proposed in Saporito and Zhang (2020) to solve our equation.

5.1 Long Short-Term Memory

First, LSTM is not a specific neural network structure, but more like a building block instead. The structure of an LSTM block is as follows,

[Diagram: a single LSTM block with inputs $a_{i-1}$, $c_{i-1}$, and $x_{i}$ and outputs $a_{i}$ and $c_{i}$.]

In the above block diagram, there are three inputs, $a_{i-1}$, $c_{i-1}$, and $x_{i}$, and two outputs, $a_{i}$ and $c_{i}$. Intuitively, $a_{i-1}$, a single number, compactly represents all the information from the first $i-1$ steps, so it is called the "long-term memory". The short-term memory $c_{i-1}$, as the name indicates, is influenced more by the most recent steps. Finally, $x_{i}$ represents the information obtained at the current step. These three inputs interact to generate the new long-term and short-term memories, $a_{i}$ and $c_{i}$.

Back to our equation, assume that we have a sample path of the wealth process $X_{t}$, where $t\in[0,T]$. We discretize the time interval into $N$ equal periods, $0=t_{0}<t_{1}<\dots<t_{N}=T$, with step size $\delta t=t_{i+1}-t_{i}$, which will be useful later. Let $x_{i}$, where $i=0,1,2,\dots,N$, be the wealth at time $t_{i}$. We then have the following network, which consists of $N+1$ LSTM blocks. Here, $a_{-1}=0$ and $c_{-1}=0$ indicate that there is no memory initially.

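[Diagram: a chain of $N+1$ LSTM blocks; block $i$ receives $x_{i}$ together with $a_{i-1}$ and $c_{i-1}$ and passes $a_{i}$ and $c_{i}$ on to the next block, starting from $a_{-1}$, $c_{-1}$ and ending with $a_{N}$, $c_{N}$.]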

In summary, we discretize a sample path $X_{t}$ into $N$ pieces and feed it into a neural network that consists of $N+1$ LSTM blocks, which yields an output vector $\boldsymbol{a}=(a_{0},a_{1},\dots,a_{N})$. The next step is to use the output vector to model the value function $V(X_{t})$.
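The chained recursion can be sketched with a scalar LSTM cell in plain Python. This is a minimal illustration under several assumptions: the gate equations are the standard LSTM formulation (the text does not spell them out), the weights are random rather than trained, and the names `lstm_step` and `encode_path` are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(a_prev, c_prev, x, w):
    """One scalar LSTM block: (a_{i-1}, c_{i-1}, x_i) -> (a_i, c_i).

    w maps each gate name to (weight on a_prev, weight on x, bias).
    """
    def pre(name):
        wa, wx, b = w[name]
        return wa * a_prev + wx * x + b

    f = sigmoid(pre("forget"))             # how much of c_{i-1} to keep
    i = sigmoid(pre("input"))              # how much new information to admit
    o = sigmoid(pre("output"))             # how much of the cell to expose
    c_tilde = math.tanh(pre("candidate"))  # candidate cell update
    c = f * c_prev + i * c_tilde
    a = o * math.tanh(c)
    return a, c

def encode_path(xs, w):
    """Run N+1 chained blocks over x_0..x_N, starting from a_{-1} = c_{-1} = 0."""
    a, c = 0.0, 0.0
    outputs = []
    for x in xs:
        a, c = lstm_step(a, c, x, w)  # the same weights are shared by all blocks
        outputs.append(a)
    return outputs  # (a_0, ..., a_N)

random.seed(0)
weights = {gate: tuple(random.uniform(-1, 1) for _ in range(3))
           for gate in ("forget", "input", "output", "candidate")}
N = 4
path = [1.0 + 0.01 * i for i in range(N + 1)]  # wealth at t_0, ..., t_N
a_vec = encode_path(path, weights)
print(len(a_vec))  # one output per block, N + 1 in total
```

In practice the weights would be trained jointly with the value network, and the states would typically be vectors rather than single numbers.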

5.2 Value Function Approximation

To approximate the value function $V(X_{t})$, where $t_{i}\leq t<t_{i+1}$, we use the most well-known neural network, the feed-forward (fully connected) neural network, with three inputs, $t_{i}$, $x_{i}$, and $a_{i-1}$, and one output, denoted by $\phi(t_{i},x_{i},a_{i-1};\theta^{f})$, where $\theta^{f}$ denotes the parameters of the feed-forward network. The exact structure of the feed-forward network (the number of hidden layers, the width of each layer, and the activation function) will be discussed in the empirical studies and is not important for the rest of this section.

To summarize the above procedure, denote $u(X_{t};\theta^{f},\theta^{lstm})=\phi(t_{i},x_{i},a_{i-1};\theta^{f})$, where $\theta^{lstm}$ denotes the parameters of the LSTM network. The partial derivatives of the value function can thus be approximated as follows (here, we adapt the notation of Saporito and Zhang (2020)):

\begin{align}
u(X_{t_{i}}^{h};\theta^{f},\theta^{lstm}) &= \phi(t_{i},x_{i}+h,a_{i-1};\theta^{f}) \tag{53}\\
u(X_{t_{i},\delta t};\theta^{f},\theta^{lstm}) &= \phi(t_{i+1},x_{i},a_{i};\theta^{f}) \tag{54}\\
\Delta^{[\delta t]}_{t}u(X_{t_{i}};\theta^{f},\theta^{lstm}) &= \frac{u(X_{t_{i},\delta t};\theta^{f},\theta^{lstm})-u(X_{t_{i}};\theta^{f},\theta^{lstm})}{\delta t} \tag{55}\\
\Delta^{[h]}_{x}u(X_{t_{i}};\theta^{f},\theta^{lstm}) &= \frac{u(X_{t_{i}}^{h};\theta^{f},\theta^{lstm})-u(X_{t_{i}};\theta^{f},\theta^{lstm})}{h} \tag{56}\\
\Delta^{[h]}_{xx}u(X_{t_{i}};\theta^{f},\theta^{lstm}) &= \frac{u(X_{t_{i}}^{h};\theta^{f},\theta^{lstm})-2u(X_{t_{i}};\theta^{f},\theta^{lstm})+u(X_{t_{i}}^{-h};\theta^{f},\theta^{lstm})}{h^{2}} \tag{57}
\end{align}

In this way, with the "right" parameters $\theta^{f}$ and $\theta^{lstm}$, we can approximate $V(X_{t})$ by $u(X_{t};\theta^{f},\theta^{lstm})$. To find the "right" parameters, we need to train the above model on simulated sample paths.
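The difference quotients (55)-(57) only need four evaluations of the network per grid point. The sketch below shows the arithmetic with a smooth stand-in function `phi` in place of the trained network; `phi`, `finite_differences` and the numerical values are hypothetical, and the LSTM memory input is ignored here to keep the example short.

```python
def finite_differences(u0, u_up, u_down, u_shift, h, dt):
    """Difference quotients in the spirit of (55)-(57).

    u0      : network value at the current path X_{t_i}
    u_up    : value with the last point bumped by +h
    u_down  : value with the last point bumped by -h
    u_shift : value on the path extended by dt
    """
    d_t = (u_shift - u0) / dt                    # time derivative, as in (55)
    d_x = (u_up - u0) / h                        # first space derivative, as in (56)
    d_xx = (u_up - 2.0 * u0 + u_down) / h ** 2   # second space derivative, as in (57)
    return d_t, d_x, d_xx

# Hypothetical smooth stand-in for the trained network (memory input ignored).
def phi(t, x):
    return t + x ** 2

t, x, h, dt = 0.5, 1.2, 1e-3, 1e-2
d_t, d_x, d_xx = finite_differences(
    phi(t, x), phi(t, x + h), phi(t, x - h), phi(t + dt, x), h, dt)
print(d_t, d_x, d_xx)  # close to 1, 2*x + h, and 2 respectively
```

The forward difference in the space direction carries an error of order $h$, while the central second difference is exact for this quadratic stand-in up to rounding.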

To simplify the notation in equation (28), let

\begin{align}
A &= \frac{\|\boldsymbol{\sigma}_{t}^{-1}(\boldsymbol{\mu}_{t}-r\boldsymbol{e}_{d})\|^{2}}{2} \tag{58}\\
B &= \frac{\gamma}{2}\big[d-d\ln(2\pi e\gamma)+\ln\det(\boldsymbol{\sigma}_{t}^{T}\boldsymbol{\sigma}_{t})\big] \tag{59}\\
C &= \frac{\gamma d}{2} \tag{60}
\end{align}

Thus, equation (28) becomes

$$\Delta_{t}V(X_{t})-A\frac{[\Delta_{x}V(X_{t})]^{2}}{\Delta_{xx}V(X_{t})}+C\ln\Delta_{xx}V(X_{t})+B=0 \tag{61}$$
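As a small sketch of how the constants in (58)-(60) and the left-hand side of (61) could be evaluated, consider the following. The function names and the two-asset inputs are hypothetical, and the code assumes an invertible $\boldsymbol{\sigma}_{t}$ and $\Delta_{xx}V(X_{t})>0$.

```python
import numpy as np

def hjb_constants(mu, sigma, r, gamma):
    """Constants A, B, C of (58)-(60) for drift mu (d,) and volatility sigma (d, d)."""
    d = mu.shape[0]
    excess = mu - r * np.ones(d)
    # A = ||sigma^{-1} (mu - r e_d)||^2 / 2
    A = 0.5 * np.sum(np.linalg.solve(sigma, excess) ** 2)
    # slogdet returns (sign, log|det|); log det(sigma^T sigma) is the second entry
    _, logdet = np.linalg.slogdet(sigma.T @ sigma)
    B = 0.5 * gamma * (d - d * np.log(2.0 * np.pi * np.e * gamma) + logdet)
    C = 0.5 * gamma * d
    return A, B, C

def hjb_residual(v_t, v_x, v_xx, A, B, C):
    """Left-hand side of (61); it vanishes when V solves the equation."""
    return v_t - A * v_x ** 2 / v_xx + C * np.log(v_xx) + B

# Hypothetical two-asset example with made-up derivative values.
mu = np.array([0.08, 0.05])
sigma = np.array([[0.20, 0.00],
                  [0.05, 0.15]])
A, B, C = hjb_constants(mu, sigma, r=0.0, gamma=0.01)
print(A, B, C, hjb_residual(0.1, -0.5, 2.0, A, B, C))
```

Splitting the logarithm into $C\ln\Delta_{xx}V(X_{t})+B$ keeps the path-dependent part separate from the constants, which only need to be recomputed when the drift and volatility estimates change.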

Suppose there are $M$ simulated paths, each discretized into $N$ pieces, and let $X_{t}^{(j)}$ denote the $j^{th}$ simulated path. To save space, let $\theta=[\theta^{f},\theta^{lstm}]$. Then the loss function to minimize is

$$J_{N,M}(\theta)=\frac{1}{M}\frac{1}{N}\sum_{j=1}^{M}\sum_{i=0}^{N}\bigg(\Delta_{t}^{[\delta t]}u(X_{t_{i}}^{(j)};\theta)-A\frac{[\Delta_{x}^{[h]}u(X_{t_{i}}^{(j)};\theta)]^{2}}{\Delta_{xx}^{[h]}u(X_{t_{i}}^{(j)};\theta)}+C\ln\Delta_{xx}^{[h]}u(X_{t_{i}}^{(j)};\theta)+B\bigg)^{2} \tag{62}$$

Here, $\alpha$ is a hyper-parameter, chosen depending on the situation, that weights the last term of the loss; the purpose of that term is to keep the average of the final values of the simulated paths as close as possible to the expected return.

6 Empirical Studies

There are two phases of empirical studies, the first is the clustering phase, and the second is the portfolio construction phase.

In the clustering phase, the simulated annealing clustering method with $\kappa=0.0001$, $w=0.5$, $\alpha=0.99$, $T_{0}=100$, and $T_{f}=0.1$ is used to cluster the stocks in the S&P 500 index into 25 groups, based on 2020-2022 historic data. The procedure is repeated 100 times to minimize the energy function. Figures 1-4 show the cumulative returns of stocks from four different groups. The results indicate that most stocks in a group perform similarly, although some exceptional stocks perform differently. This finding suggests that the clustering method could be used for pair trading to identify statistical arbitrage opportunities if the size of each cluster is limited to 2-5 stocks. However, this research direction is not explored in this paper and is left for future studies.
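To give a feel for the annealing parameters, the sketch below lists the temperature levels visited between $T_{0}=100$ and $T_{f}=0.1$, assuming that $\alpha=0.99$ acts as a geometric cooling factor ($T\leftarrow\alpha T$) and that moves are accepted with the standard Metropolis rule. Both are assumptions made for illustration, and the energy function and the roles of $\kappa$ and $w$ are not modelled here.

```python
import math

def cooling_schedule(t0, t_f, alpha):
    """Yield temperatures T_k = t0 * alpha**k while they stay above t_f."""
    t = t0
    while t > t_f:
        yield t
        t *= alpha

def acceptance_probability(delta_energy, temperature):
    """Metropolis rule: always accept improvements, otherwise accept with exp(-dE/T)."""
    if delta_energy <= 0:
        return 1.0
    return math.exp(-delta_energy / temperature)

temps = list(cooling_schedule(t0=100.0, t_f=0.1, alpha=0.99))
print(len(temps), temps[0], temps[-1])
# The same energy increase is accepted often when hot and rarely when cold.
print(acceptance_probability(1.0, temps[0]), acceptance_probability(1.0, temps[-1]))
```

Under these assumptions a single annealing run passes through several hundred temperature levels, so repeating the procedure 100 times is a sizeable but manageable computation.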

In the second phase, 25 stocks are randomly selected, one from each group, and a mean-variance portfolio is constructed based on their historic data. Before constructing the portfolio, the drift rate and covariance matrix for each day must be estimated. In this paper, the estimation is based on the past 75 days of stock returns, which may not be the most accurate method. More accurate estimates could be obtained with high-frequency data and sufficient computational resources.
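A rolling estimate of this kind can be sketched as follows. The estimator choice (sample mean and unbiased sample covariance of the last 75 daily returns), the Cholesky factor used as a volatility matrix, and the synthetic data are all assumptions made for illustration; the paper does not specify these details.

```python
import numpy as np

def rolling_estimates(returns, window=75):
    """Sample drift vector and covariance matrix from the last `window` rows.

    returns : (T, d) array of daily returns, most recent row last.
    """
    if returns.shape[0] < window:
        raise ValueError("not enough history for the requested window")
    recent = returns[-window:]
    mu_hat = recent.mean(axis=0)
    cov_hat = np.cov(recent, rowvar=False)  # rows are observations, ddof=1
    return mu_hat, cov_hat

rng = np.random.default_rng(0)
fake_returns = rng.normal(0.0005, 0.01, size=(200, 25))  # synthetic placeholder data
mu_hat, cov_hat = rolling_estimates(fake_returns)

# One possible volatility matrix with sigma_hat @ sigma_hat.T == cov_hat.
sigma_hat = np.linalg.cholesky(cov_hat)
print(mu_hat.shape, cov_hat.shape, sigma_hat.shape)
```

With 75 observations and 25 assets the sample covariance is generally invertible, but it is noisy, which is consistent with the remark that this may not be the most accurate method.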

The parameters used in the trading strategy were set as follows: $\gamma=0.01$, $d=25$, and $r=0$, since the interest rate has been zero for most of the past two years. The initial wealth path was $[1.0,1.01]$ and the trading strategy began with 1 dollar. At the start of each period, the current wealth path $X_{t}$ was used to generate hundreds of simulated paths with the same drift rate and volatility. These simulated paths were then used to train the neural network, which produced the policy parameters for generating holdings for the next period. A new wealth point was added to the current wealth path after each iteration, so a new neural network had to be trained each time. Due to limited computational resources, the trading strategy was only tested for 30 consecutive trading days (30 iterations) in three different periods covering bull, bear, and volatile markets. Figures 5-7 show that this trading strategy outperformed the S&P 500 index over these three 30-day windows.

7 Conclusion

This paper introduces a novel asset clustering method and extends the exploratory mean-variance framework to the path-dependent case. Possible enhancements of this framework include replacing the LSTM with a transformer to improve the network architecture, using high-frequency data to estimate model parameters, and adding constraints on leverage or portfolio asset percentages to improve the framework's robustness.

References

  • Dolphin, R., Smyth, B., Xu, Y., & Dong, R. (2021). Measuring financial time series similarity with a view to identifying profitable stock market opportunities. In International Conference on Case-Based Reasoning (pp. 64–78).
  • Dupire, B. (2019). Functional Itô calculus. Quantitative Finance, 19(5), 721–729.
  • Jia, Y., & Zhou, X. Y. (2022a). Policy evaluation and temporal-difference learning in continuous time and space: A martingale approach. Journal of Machine Learning Research, 23(154), 1–55.
  • Jia, Y., & Zhou, X. Y. (2022b). Policy gradient and actor-critic learning in continuous time and space: Theory and algorithms. Journal of Machine Learning Research, 23(154), 1–55.
  • Jia, Y., & Zhou, X. Y. (2022c). q-Learning in continuous time. arXiv preprint arXiv:2207.00713.
  • Li, D., & Ng, W.-L. (2000). Optimal dynamic portfolio selection: Multiperiod mean-variance formulation. Mathematical Finance, 10(3), 387–406.
  • Ludkovski, M., Swindle, G., & Grannan, E. (2022). Large scale probabilistic simulation of renewables production. arXiv preprint arXiv:2205.04736.
  • Markowitz, H. (1952). Portfolio selection. The Journal of Finance, 7(1), 77–91.
  • Sabate-Vidales, M., Šiška, D., & Szpruch, L. (2020). Solving path dependent PDEs with LSTM networks and path signatures. arXiv preprint arXiv:2011.10630.
  • Saporito, Y. F., & Zhang, Z. (2020). PDGM: a neural network approach to solve path-dependent partial differential equations. arXiv preprint arXiv:2003.02035.
  • Tang, W., Xu, X., & Zhou, X. Y. (2022). Asset selection via correlation blockmodel clustering. Expert Systems with Applications, 195, 116558.
  • Wang, H. (2019). Large scale continuous-time mean-variance portfolio allocation via reinforcement learning. arXiv preprint arXiv:1907.11718.
  • Wang, H., Zariphopoulou, T., & Zhou, X. (2018). Exploration versus exploitation in reinforcement learning: a stochastic control approach. arXiv preprint arXiv:1812.01552.
  • Zhou, X. Y., & Li, D. (2000). Continuous-time mean-variance portfolio selection: A stochastic LQ framework. Applied Mathematics and Optimization, 42, 19–33.
Figure 1: cumulative returns of group 1 stocks
Figure 2: cumulative returns of group 6 stocks
Figure 3: cumulative returns of group 16 stocks
Figure 4: cumulative returns of group 21 stocks
Figure 5: trading days 61-91 since 2020-01-01
Figure 6: trading days 361-391 since 2020-01-01
Figure 7: trading days 521-551 since 2020-01-01