Can Interpretable Reinforcement Learning Manage Prosperity Your Way?

Charl Maree and Christian Omlin
Center for Artificial Intelligence Research
University of Agder
Grimstad, Norway
{charl.maree,christian.omlin}@uia.no
Second affiliation: Chief Technology Office, Sparebank 1 SR-Bank, Stavanger, Norway.
Abstract

Personalisation of products and services is fast becoming the driver of success in banking and commerce. Machine learning holds the promise of gaining a deeper understanding of and tailoring to customers’ needs and preferences. Whereas traditional solutions to financial decision problems frequently rely on model assumptions, reinforcement learning is able to exploit large amounts of data to improve customer modelling and decision-making in complex financial environments with fewer assumptions. Model explainability and interpretability present challenges from a regulatory perspective which demands transparency for acceptance; they also offer the opportunity for improved insight into and understanding of customers. Post-hoc approaches are typically used for explaining pretrained reinforcement learning models. Based on our previous modeling of customer spending behaviour, we adapt our recent reinforcement learning algorithm that intrinsically characterizes desirable behaviours and we transition to the problem of asset management. We train inherently interpretable reinforcement learning agents to give investment advice that is aligned with prototype financial personality traits which are combined to make a final recommendation. We observe that the trained agents’ advice adheres to their intended characteristics, they learn the value of compound growth, and, without any explicit reference, the notion of risk as well as improved policy convergence.

Keywords AI in banking; personalized services; asset management; explainable AI; reinforcement learning; policy regularisation

1 Introduction

Financial service providers are employing ever-advancing methods to improve the level of personalisation of their services [1, 2]. Artificial intelligence (AI) is a promising tool in this pursuit in areas such as anti-money laundering, trading and investment, and customer relationship management [3]. Examples of personalised services are recommender systems for product sales [4], risk evaluation for credit scoring [5], and segmentation for customer-centric marketing [6]. More commonly, AI has been applied to stock trading via ensemble learning [7], currency recognition using deep learning [8], stock index performance through time-series modelling with feature engineering [9], and investment portfolio management using reinforcement learning (RL) [10, 11]. These applications generally lack the personalisation needed to enhance customer relations and support service delivery for growing customer bases. The lack of explainability and interpretability has thus far hindered the wider adoption of machine learning, mainly due to model opacity; model understanding is essential in financial services [12, 13, 14]. We distinguish between explainability and interpretability: explainability refers to a symbolic representation of the knowledge a model has learned, while interpretability is necessary for reasoning about a model’s predictions.

We have previously investigated the interpretability of systems of multiple RL agents [15]. A regularisation term in the objective function imposed a desired agent behaviour during training. For our current purpose of asset management, our agents learn distinct optimal policies for continuously distributing a fixed monthly amount across five assets: a savings account, property, stocks, luxury items, and mortgage repayments. It is our intention to align the agents’ characteristics and behaviour with personality traits - openness, conscientiousness, extraversion, agreeableness, and neuroticism - as proposed for modelling spending behaviour in [16]. A linear combination of the resulting policies can provide investment advice that matches each customer’s unique personality profile. This intrinsic interpretability may fulfill the promise of a digital private assistant for personal wealth management.

2 Related Work

Recent evidence has revealed a causal relationship between spending patterns and individual happiness [17]: we are happiest when our spending matches our personality. For instance, extraverted individuals typically prefer spending at a bar rather than at a bookshop, while the opposite may apply to introverts. Our premise is that spending personality traits can be carried over to asset management: we are happiest when our investment matches our personality. For instance, conscientious investors may prefer the predictability of property over the volatility of stocks. This is consistent with the high affinity of conscientious spenders towards residential mortgages [17]. It is compelling to expand the notion of personality traits from spending to wealth creation, i.e., to base personal investment advice on historical spending behaviour [18, 19].

RL has been extensively applied to stock portfolio management [20, 21, 22, 23, 24, 25], but not yet to holistic asset management; the lack of model transparency may be a contributing factor. Interpretation of RL agents typically follows model training [26, 27, 28]; our ambition is to impose a desired characteristic behaviour during training, thus making it an intrinsic property of the agent. Based on a prior that defines a desired behaviour, we extend the deep deterministic policy gradient (DDPG [29]) objective function with a regularisation term [15]. Formally, for each agent $i$, this objective function is given by:

$$J(\theta_{i}) = \mathbb{E}_{o_{i},a_{i}\sim\mathcal{D}}\left[R_{i}(o_{i},a_{i})\right] - \lambda L_{i} \qquad (1)$$
$$L_{i} = \frac{1}{M_{i}}\sum_{j=0}^{M_{i}}\left[\mathbb{E}_{a\sim\pi_{\theta_{i}}}(a_{j}) - (a_{j} \mid \pi_{0_{i}}(a))\right]^{2}$$

where $\theta_{i}$ is a set of parameters governing the policy, $\mathcal{D}$ is the replay buffer, $R_{i}(o_{i},a_{i})$ is the reward for action $a_{i}$ given the partial state observation $o_{i}$, $\lambda\in\mathbb{R}_{\geq 0}$ is a scaling parameter, $M_{i}$ is the number of actions, and $\pi_{0_{i}}$ is the prior that defines the desired behaviour of the agent. Note that the prior is independent of the state, which simplifies it and thus makes it interpretable; this is a departure from traditional policy regularisation methods such as KL-regularisation and entropy regularisation, which aim to improve learning convergence instead [30, 31]. Traditional regularisation encourages state space exploration by increasing the entropy of the policy, whereas our method guides agents’ learning towards the prior and thus imposes a desired characteristic behaviour.
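As an illustration of Equation (1), the following is a minimal sketch of how the regularisation term and the regularised objective could be estimated from one batch of transitions. It is not the authors' implementation: the function names, the batch contents, and the use of a batch mean as the estimate of the expected action are assumptions made for this example, and the squared differences are simply averaged over the action components.

```python
import numpy as np

def regularisation_term(actions, prior):
    """Mean squared gap between the policy's expected action and the prior.

    actions: array of shape (batch, M), actions sampled from the current policy.
    prior:   array of shape (M,), the state-independent prior pi_0.
    """
    expected_action = actions.mean(axis=0)   # batch estimate of E_{a ~ pi_theta}[a_j]
    return float(np.mean((expected_action - prior) ** 2))

def regularised_objective(rewards, actions, prior, lam=2.0):
    """J = E[R] - lambda * L, estimated from one batch."""
    return float(np.mean(rewards)) - lam * regularisation_term(actions, prior)

# Hypothetical batch: five assets, 256 transitions (numbers are illustrative only).
rng = np.random.default_rng(0)
prior = np.array([0.0, 0.0, 0.84, 0.16, 0.0])   # openness prior from Table 2
actions = rng.dirichlet(np.ones(5), size=256)   # random allocations that sum to one
rewards = rng.normal(size=256)

print(regularisation_term(actions, prior))
print(regularised_objective(rewards, actions, prior))
```

Because the prior does not depend on the state, the penalty vanishes exactly when the average allocation matches the prior, regardless of which states were visited.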

3 Empirical Methodology

The aim of this work was to create an interpretable AI for personal investment management. We selected five assets in which a customer could invest a monthly amount over a duration of 30 years: a savings account, property, a portfolio of stocks, luxury expenditures, and additional mortgage payments. We include luxury expenditure in the portfolio under the premise that it may increase customers’ satisfaction with their portfolios [17]. We define luxury items as any expenditure that may appeal to a person’s personality profile: people scoring high on openness might derive joy from spending money on travelling, and people scoring high on extraversion may prefer to spend money on festivities with other people [17], while other luxury items such as cars or artwork are also possible. We modelled the growth rates of assets according to historical index data, which we describe below.

3.1 Modelling Assumptions

We continuously distribute funds into assets based on the indices of the S&P 500 [32], Norwegian property [33], and the Norwegian interest rate [34]. In addition, we invest in mortgages and luxury items. We show this data for a 30-year period in Figure 1.

Figure 1: Three asset value indices for a period of 30 years: The S&P 500 stock index, the Norwegian property index, and the Norwegian interest rate index. All indices are relative to their respective values on 1 January 1992. While the stock index performs the best overall, it has the highest volatility and therefore the highest risk. Conversely, the interest rate index has the lowest risk but also the lowest growth.

We make a number of assumptions which limit the scope of the portfolio and simplify investment choices to make the characterization of agent behaviour and interpretation of investment strategies tractable.

Assumption 1.

Asset growth rates can be modelled by their respective asset indices, i.e., a stock portfolio may be modelled by a major stock index (e.g., the S&P 500) and an investment in property by its corresponding index.

The outright investment in indices such as S&P 500 is very common; it will return the growth rates according to these indices. This is a conservative assumption as stock portfolio optimization frequently outperforms indices, which may serve as a performance measure of the investment strategy [25].

To give personalised advice, we go beyond the premise that spending behaviour and happiness are merely correlated. We expand the notion of a causal relationship between spending patterns and customer satisfaction to chart an investment strategy and provide advice that is aligned with customer personality [17]. We rated assets in terms of their risk and expected return from the historical values of the indices, and obtained ratings of their liquidity, capital requirements, and novelty from domain experts. The domain experts also scored the different personality types’ affinity for different assets on the interval $[-1,+1]$. Individuals with a high degree of openness may prefer the liquidity of stocks over mortgage payments, whereas neurotic individuals may prefer savings accounts over stocks in their portfolio. The resulting coefficients, shown in Table 1, are the weighted sums of the asset ratings and affinities.

Table 1: Coefficients relating asset risk, expected return, liquidity, capital requirement, and novelty to prototypical personality traits: openness (O), conscientiousness (C), extraversion (E), agreeableness (A), and neuroticism (N). The values are in the range $[-1,1]$.
Investment O C E A N
Savings -0.11 0.08 -0.15 0.51 0.68
Property -0.15 0.32 -0.22 -0.36 -0.24
Stocks 0.82 -0.61 0.95 0.42 0.12
Luxury 0.16 -0.51 -0.07 -0.80 -0.81
Mortgage -0.72 0.72 -0.52 0.23 0.25

We define a Markov decision process (MDP) for a multi-agent RL setting. The states consist of customer age (normalized to the range $[0,1]$), the values of the assets (scaled by $1:10^{6}$), and two market indicators for each of the three indices: the moving average convergence divergence (MACD), which predicts trend reversals, and the relative strength index (RSI), which corrects for potential false predictions by MACD. Here, MACD is the difference between the 26-month and the 12-month exponential moving averages of a trend, and $RSI = 100 - 100/(1+\frac{P_{x}}{N_{x}})$, where $P_{x}$ and $N_{x}$ are the average positive and negative changes to the index values, respectively, over $x$ periods. Rewards are the changes in portfolio value between time steps, and actions are the continuous distribution of funds across the portfolio of assets. We start with a mortgage loan of 2 million NOK and make monthly investments over 30 years totalling 3.34 million NOK:
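The two market indicators can be sketched as follows. This is an illustrative version only: the smoothing factor $2/(\text{span}+1)$ for the exponential moving average, the sign convention (12-month average minus 26-month average), the default of 14 periods for the RSI, and the use of simple averages for $P_{x}$ and $N_{x}$ are common conventions assumed here, not details stated in the text.

```python
import numpy as np

def ema(values, span):
    """Exponential moving average with smoothing factor 2 / (span + 1) (assumed)."""
    alpha = 2.0 / (span + 1.0)
    out = np.empty(len(values), dtype=float)
    out[0] = values[0]
    for t in range(1, len(values)):
        out[t] = alpha * values[t] + (1.0 - alpha) * out[t - 1]
    return out

def macd(values, fast=12, slow=26):
    """Difference between the fast and the slow EMA (sign convention assumed)."""
    values = np.asarray(values, dtype=float)
    return ema(values, fast) - ema(values, slow)

def rsi(values, periods=14):
    """RSI = 100 - 100 / (1 + P/N) over the last `periods` changes.

    Needs at least two values so that one change can be computed.
    """
    changes = np.diff(np.asarray(values, dtype=float))[-periods:]
    gain = np.clip(changes, 0.0, None).mean()    # P: average positive change
    loss = np.clip(-changes, 0.0, None).mean()   # N: average negative change (magnitude)
    if loss == 0.0:
        return 100.0                              # no losses in the window: RSI saturates
    return 100.0 - 100.0 / (1.0 + gain / loss)

# Hypothetical monthly index values, rising steadily.
index = np.linspace(1.0, 3.0, 60)
print(macd(index)[-1], rsi(index))
```

In a rising market the fast average sits above the slow one, so this MACD is positive, while the RSI approaches 100 when negative changes are absent.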

Assumption 2.

The initial values for a portfolio consist of a mortgage of NOK 2 million and a property valued at NOK 2 million. All other assets have zero initial value.

It is easy to adjust these initial portfolio assignments for different individuals.

Assumption 3.

We make consistent monthly investments of 10 000 Norwegian kroner (NOK).

This can be easily modified for individual customers’ contributions.

There is a priori no lower limit on the investment amounts:

Assumption 4.

Property investment does not require bulk payments, i.e., smaller investments can be made through property funds, trusts, or crowdfunding.

While investment in physical real estate normally requires larger deposits, we allow our agents to invest smaller amounts into the property market, i.e., a fraction of the monthly investment contribution specified in Assumption 3. This is not a strong assumption as it is possible to invest smaller amounts in property indices, trusts, funds, etc.

We assign interest rates for savings accounts at 5-10% below, and those of mortgage accounts at 5-10% over the interest index. Individuals younger than 35 years receive the more beneficial interest rate, as is common in Norwegian banks. Luxury items experience a depreciation of 20% per year; the depreciation of luxury items is highly variable and depends on the item, e.g., while artwork may appreciate, cars typically depreciate rapidly:

Assumption 5.

Luxury items depreciate at 20% per year.
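Since investments are made monthly, an annual rate such as the 20% depreciation in Assumption 5 has to be applied in monthly steps. The sketch below shows one way to do this, assuming geometric (compound) conversion of the annual rate to a monthly multiplier; the function names are hypothetical and the text does not specify which conversion was used.

```python
def monthly_factor(annual_rate):
    """Convert an annual growth rate into the equivalent monthly multiplier."""
    return (1.0 + annual_rate) ** (1.0 / 12.0)

def depreciate(value, months, annual_rate=-0.20):
    """Value of a luxury holding after `months` of compounding depreciation."""
    return value * monthly_factor(annual_rate) ** months

print(round(depreciate(10_000, 12)))   # 8000 after one year
print(round(depreciate(10_000, 60)))   # 3277 after five years
```

Under this convention, a single monthly contribution of NOK 10 000 spent on luxury items retains about a third of its value after five years.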

Dividends are normally included in the calculation of indices and monthly transactions are relatively infrequent compared to high frequency trading:

Assumption 6.

Any additional income from investments - such as dividend payouts or rental income - as well as costs such as transaction costs and fund management costs are ignored.

3.2 Agents

We train five DDPG agents, one for each of the five personality traits. Using Equation (1) we regularise their objective functions with a prior derived from their respective personality traits in Table 1, e.g., the openness prior $\pi_{0}^{O}$ places the most weight on stocks and avoids mortgage repayments, property investment, and savings, while the conscientiousness prior $\pi_{0}^{C}$ places the most weight on mortgage repayments and avoids stocks and luxury expenditure. These priors, shown in Table 2, are probability distributions across the investment channels and therefore add up to one.

Table 2: Regularisation priors $\pi_{0}^{a}$ for each agent $a\in$ {openness (O), conscientiousness (C), extraversion (E), agreeableness (A), and neuroticism (N)}.
Investment $\pi_{0}^{O}$ $\pi_{0}^{C}$ $\pi_{0}^{E}$ $\pi_{0}^{A}$ $\pi_{0}^{N}$
Savings 0.00 0.07 0.00 0.44 0.64
Property 0.00 0.28 0.00 0.00 0.00
Stocks 0.84 0.00 1.00 0.36 0.12
Luxury 0.16 0.00 0.00 0.00 0.00
Mortgage 0.00 0.65 0.00 0.20 0.24

Agents’ actor and critic neural networks each consist of two fully connected feed-forward layers with 2000 nodes in each layer. The actor networks each have a final soft-max activation layer while the critic networks have no final activations. We tuned the hyperparameters using a one-at-a-time parameter sweep, resulting in learning rates of 0.004 and 0.001 for the actors and critics, respectively, target network update parameters of $\tau=0.05$, and regularisation coefficients of $\lambda=2$. Training batch sizes were 256 time steps and we sized the replay buffer to hold 2048 transitions. Each iteration collected 256 time steps and completed two training batches.
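To make the architecture concrete, the following sketch implements the forward pass of one actor and the soft target-network update in plain NumPy. Several details are assumptions rather than facts from the text: the state dimension of 12 (age, five asset values, and MACD and RSI for each of three indices), the ReLU hidden activation, and the weight initialisation. The update rule is the standard DDPG soft update with the stated $\tau$.

```python
import numpy as np

STATE_DIM = 12    # assumed: age (1) + asset values (5) + MACD and RSI for 3 indices (6)
ACTION_DIM = 5    # one allocation fraction per asset
HIDDEN = 2000     # nodes per hidden layer, as stated in the text
TAU = 0.05        # target network update parameter

def init_actor(rng):
    """Random weights for a two-hidden-layer actor (initialisation scheme is assumed)."""
    sizes = [STATE_DIM, HIDDEN, HIDDEN, ACTION_DIM]
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def actor_forward(params, state):
    """Map a state to allocation fractions that are non-negative and sum to one."""
    h = state
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)     # ReLU hidden activation (assumed)
    w, b = params[-1]
    logits = h @ w + b
    z = np.exp(logits - logits.max())      # numerically stable soft-max
    return z / z.sum()

def soft_update(target, online, tau=TAU):
    """Soft update: target <- tau * online + (1 - tau) * target."""
    return [(tau * w + (1 - tau) * wt, tau * b + (1 - tau) * bt)
            for (w, b), (wt, bt) in zip(online, target)]

rng = np.random.default_rng(0)
actor = init_actor(rng)
target_actor = init_actor(rng)
action = actor_forward(actor, rng.uniform(size=STATE_DIM))
target_actor = soft_update(target_actor, actor)
print(action, action.sum())
```

The soft-max output guarantees that every action is a valid distribution of the monthly amount across the five assets, which is also what allows it to be compared directly with a prior from Table 2.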

4 Results

Each of our investment agents learns an optimal investment strategy for its respective prototypical personality trait, for instance, openness. The final portfolio values after 334 months of investing according to these policies are shown in Table 3. Given the common total investment of 3.34 million NOK, the compound annual growth rate varies between 5.8% and 7.8%; the latter is the maximum return possible, obtained by investing in stocks only.

Table 3: Portfolio values of the five optimal policies for each of the prototypical personality traits.
Policy Final portfolio value (million NOK)
Openness 22.4
Conscientiousness 18.8
Extraversion* 27.7
Agreeableness 20.5
Neuroticism 16.4
Personal agent 20.3

*This agent’s regularisation prior was coincidentally the same as the optimal monetary policy $\pi^{M}$ and it achieved the maximum possible final portfolio value.
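The growth rates quoted alongside Table 3 can be approximated with a simple calculation. The text does not state how the compound annual growth rate was computed; the sketch below assumes the textbook form (final value / amount invested)^(1/years) - 1, which treats all contributions as if they were invested at the start. Under that assumption the results come out close to, though not necessarily identical to, the reported range.

```python
def cagr(final_value, invested, years):
    """Compound annual growth rate of `invested` growing to `final_value`."""
    return (final_value / invested) ** (1.0 / years) - 1.0

YEARS = 334 / 12        # investment horizon of 334 months
INVESTED = 3.34         # total contributions, million NOK

final_values = {        # Table 3, million NOK
    "Openness": 22.4, "Conscientiousness": 18.8, "Extraversion": 27.7,
    "Agreeableness": 20.5, "Neuroticism": 16.4, "Personal agent": 20.3,
}
for name, value in final_values.items():
    print(f"{name}: {cagr(value, INVESTED, YEARS):.1%}")
```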

Note that these personalised policies did not achieve the same final portfolio value. In fact, the optimum policy in monetary terms $\pi^{M}$ in this case would have been to always buy stocks as shown in Figure 2; this is the default policy an agent will converge towards when personality traits are ignored. However, we postulate that this is not the ideal personal financial advice to give to all individuals; some customers may be more averse to risk and will thus prefer to avoid volatility in their portfolio. Our personalized agent takes such preferences into account and, e.g., recommends property investments rather than stock investments.

Figure 2: The return on investment at every time step, calculated as the index value at the final time step divided by the index value at the current time step. It is clear that the S&P 500 has the greatest return on investment at every time step, except for a brief period in ca. 2000 where it was marginally below the property index. Therefore, the optimum monetary policy $\pi^{M}$ is to always invest the maximum amount into stocks.
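The return on investment described in the caption of Figure 2 is a one-line computation per index. The sketch below applies it to hypothetical index values (not the data behind Figure 2) and picks the asset with the highest return at each step, which is how a purely monetary policy would choose.

```python
import numpy as np

def return_on_investment(index):
    """Final index value divided by the index value at each time step."""
    index = np.asarray(index, dtype=float)
    return index[-1] / index

# Hypothetical index values, for illustration only.
stocks = np.array([1.0, 1.5, 1.2, 3.0, 6.0])
property_ = np.array([1.0, 1.2, 1.4, 1.8, 2.5])
savings = np.array([1.0, 1.03, 1.06, 1.09, 1.12])

roi = np.vstack([return_on_investment(x) for x in (stocks, property_, savings)])
best_asset = roi.argmax(axis=0)   # row index of the highest-return asset at each step
print(best_asset)
```

Note that this quantity uses the final index value and is therefore only available in hindsight; it characterises the optimum but is not an input to the agents.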

Thus far, our agents have each separately learned an optimal investment strategy for each prototypical personality trait. The aggregate policy is the weighted sum of these individually learned policies: a customer has a blend of personality traits which can be represented as a vector with five entries with values within the range $[-1,+1]$. We calculate the inner product of the normalized personality vector and the prototypical policies to arrive at the aggregate investment policy. We show a representative aggregate investment policy for a customer with a random personality profile in Figure 3. We observe that the openness agent is the only agent to recommend spending on luxury items; this is to be expected because its regularisation prior $\pi_{0}^{O}$ is the only one with a non-zero coefficient for luxury purchases. We also observe that the conscientiousness agent recommends investing in property in early stages, followed by rigorous loan repayments in the second half of the investment period. This suggests that our agent has learned the concept of compound growth and its utility for portfolio optimization. By contrast, the extraversion agent was steadfast in purchasing stocks only, which is consistent with its regularisation prior $\pi_{0}^{E}$. Unlike the conscientiousness agent, the agreeableness and neuroticism agents consistently recommend investing in savings towards the end of the investment period. In the early stages of the investment period, the agreeableness and neuroticism agents utilize compound growth to increase the portfolio value; in the latter phases, their regimen changes and they prefer the safety of savings accounts. This is noteworthy: although risk is not explicitly part of either the reward or the regularisation function, this behaviour is consistent with traditional financial advice, which decreases the risk level with age. Repeated training produces consistent results. We intend to elucidate this observation in future work.
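The aggregation step can be sketched as a single weighted sum. The example below uses the personality profile from the caption of Figure 3; the per-agent allocations are hypothetical placeholders for the agents' outputs at one time step, and the normalisation (dividing by the sum of non-negative trait scores) is an assumption, since the text does not specify how the personality vector is normalized.

```python
import numpy as np

TRAITS = ["openness", "conscientiousness", "agreeableness", "extraversion", "neuroticism"]
ASSETS = ["savings", "property", "stocks", "luxury", "mortgage"]

def aggregate_policy(profile, agent_actions):
    """Weighted sum of per-agent allocations; weights are the normalised profile."""
    profile = np.asarray(profile, dtype=float)
    weights = profile / profile.sum()   # assumed normalisation, non-negative scores only
    return weights @ np.asarray(agent_actions, dtype=float)

# Personality profile from the Figure 3 caption, in TRAITS order.
profile = [0.22, 0.87, 0.21, 0.92, 0.49]

# Hypothetical allocations recommended by each agent at one time step
# (rows follow TRAITS, columns follow ASSETS; each row sums to one).
agent_actions = [
    [0.00, 0.00, 0.80, 0.20, 0.00],
    [0.10, 0.30, 0.00, 0.00, 0.60],
    [0.40, 0.00, 0.40, 0.00, 0.20],
    [0.00, 0.00, 1.00, 0.00, 0.00],
    [0.60, 0.00, 0.10, 0.00, 0.30],
]
advice = aggregate_policy(profile, agent_actions)
print(dict(zip(ASSETS, advice.round(3))))
```

Because the weights sum to one and each agent's allocation sums to one, the aggregate is again a valid distribution of the monthly amount, and each component can be traced back to the agents that contributed to it.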

Figure 3: Investment strategies for different prototypical personality traits: Figures (a) through (e) show the fractions of monthly investments for different assets. They reveal the distinct investment strategies with changing asset preferences for the five prototypical personality traits. In Figure (f) we illustrate the investment strategy for a fictitious customer with a random personality profile [openness, conscientiousness, agreeableness, extraversion, and neuroticism] = [0.22, 0.87, 0.21, 0.92, 0.49]. The customer invests in a mixture of assets throughout the investment period.

We observe that training converges quickly to the desired behaviour (see Figure 4); the contribution of the regularisation term decreases rapidly, which implies that the agent is learning the intended behaviour. In Figure 4(a), we show the regularisation term for the extraversion agent, whose regularisation prior $\pi_{0}^{E}$ matches the optimum monetary policy $\pi^{M}$. Further training causes none of the instability that is often observed in the DDPG algorithm [31]. We hypothesize that this may be due to the agent characteristics imposed by our regularisation, whose effect may be similar to entropy regularisation [31].

The actions of any linear combination of these agents, i.e., any personal agent, are interpretable through the intrinsic characterizations, i.e., priors, of each of the regularized agents.

(a) Learning curve for the regularisation term of the extraversion agent.
(b) A typical learning curve for the regularisation term of other agents.
(c) Learning curve where the regularisation term temporarily falls into a local minimum but subsequently converges to zero.
Figure 4: The regularisation term $L$ for three different runs. In (a) the regularisation prior $\pi_{0}^{E}$ of the extraverted agent coincides with the optimum monetary policy $\pi^{M}$ and the policy converges within 5 time steps. (b) shows a typical training run for the other agents, which converges within 100-200 training steps. (c) shows a training run where the regularisation term appears to fall into a local minimum for a time, but eventually finds the optimum after about 200 training steps.

5 Conclusions and Directions for Future Work

We have presented a novel application of training RL agents to exhibit desired characteristics and behaviours in asset management. The method is based on the regularisation of the policy during training. Here, we use prototypical personality traits - openness, conscientiousness, agreeableness, extraversion, and neuroticism - to define a set of priors which express their affinity towards different assets and thus impose different investment strategies. This makes the agents’ behaviour explicit and thus offers an explanation for their recommendations. Our agents learn distinct optimal strategies for the continuous distribution of monthly investments across a portfolio of investment assets. We have shown that the agents learned to optimize total rewards while adhering to their distinct priors. This makes it possible to interpret the agents’ investment strategies.

Unlike traditional DDPG algorithms which may diverge with continuous training, our regularisation results in quick and robust convergence. This could become relevant if RL agents undergo continuous training to give personalized investment advice to customers. The justification of this observation will be subject to future research.

Our agents have learned the concept and utility of compound growth rates and risk avoidance, which form part of the interpretation of their investment strategies. These are solely based on the regularisation priors which express their personality traits; the reward function makes no reference to the personality traits. While the notion of compound growth may emerge from the reward function, we do not yet know whether the notion of risk avoidance is connected to the reward function or regularisation.

Here, we have chosen a linear combination of different, separately trained agents aligned with the prototypical personality traits to arrive at an aggregate investment advice. In the future, we will investigate whether the orchestration of these agents can be learned to approach the optimum monetary policy. This aggregation will need an explanation as well as interpretation to understand its impact on the investment strategy. The hierarchical orchestration of prototypical agents will be learned from real customers’ personality profiles. This will result in an explainable and interpretable personalized financial investment advisor.

Funding

This study was partially funded by The Norwegian Research Council; project number 311465.

References

  • [1] Matteo Stefanel and Udayan Goyal. Artificial intelligence & financial services: Cutting through the noise. Technical report, APIS partners, London, England, 2019.
  • [2] Sunanda Vincent Jaiwant. Artificial intelligence and personalized banking. In Garg Vikas and Richa Goel, editors, Handbook of Research on Innovative Management Using AI in Industry 5.0, pages 74–87. IGI Global, Bengaluru, India, 2022.
  • [3] Joost van der Burgt. General principles for the use of artificial intelligence in the financial sector. Technical report, De Nederlandsche Bank, Amsterdam, The Netherlands, 2019.
  • [4] Oladapo Oyebode and Rita Orji. A hybrid recommender system for product sales in a banking environment. Journal of Banking and Financial Technology, 4:15–25, 2020.
  • [5] Siddharth Bhatore, Lalit Mohan, and Raghu Reddy. Machine learning techniques for credit risk evaluation: a systematic literature review. Journal of Banking and Financial Technology, 4:111–138, 06 2020.
  • [6] Darshana Desai. Hyper-personalization: An AI-enabled personalization for customer-centric marketing. In Surabhi Singh, editor, Adoption and Implementation of AI in Customer Relationship Management, pages 40–53. IGI Global, Maharashtra, India, 2022.
  • [7] Dhanya Jothimani and Surendra Yadav. Stock trading decisions using ensemble-based forecasting models: a study of the Indian stock market. Journal of Banking and Financial Technology, 3:113–129, 2019.
  • [8] Qian Zhang, Weiqi Yan, and Mohan Kankanhalli. Overview of currency recognition using deep learning. Journal of Banking and Financial Technology, 3:59–69, 2019.
  • [9] Tien-Yu Hsu. Machine learning applied to stock index performance enhancement. Journal of Banking and Financial Technology, 5:1–13, 01 2021.
  • [10] Petter Kolm and Gordon Ritter. Modern perspectives on reinforcement learning in finance. SSRN Electronic Journal, pages 1–28, 2019.
  • [11] Thomas G. Fischer. Reinforcement learning in financial markets - a survey. Technical report, Friedrich-Alexander University Erlangen-Nuremberg, Institute for Economics, 2018.
  • [12] Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador Garcia, Sergio Gil-Lopez, Daniel Molina, Richard Benjamins, Raja Chatila, and Francisco Herrera. Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58:82–115, 2020.
  • [13] Longbing Cao. AI in finance: Challenges, techniques and opportunities. Banking & Insurance eJournal, 2021.
  • [14] Charl Maree, Jan Erik Modal, and Christian W. Omlin. Towards responsible AI for financial transactions. In 2020 IEEE Symposium Series on Computational Intelligence (SSCI), pages 16–21, 2020.
  • [15] Charl Maree and Christian Omlin. Reinforcement learning your way: Agent characterization through policy regularization. arXiv, 2201.10003, 2022.
  • [16] Joe Gladstone, Sandra Matz, and Alain Lemaire. Can psychological traits be inferred from spending? Evidence from transaction data. Psychological Science, 30:1–10, 2019.
  • [17] Sandra C. Matz, Joe J. Gladstone, and David Stillwell. Money buys happiness when spending fits our personality. Psychological Science, 27:715–725, 2016.
  • [18] Charl Maree and Christian W. Omlin. Clustering in recurrent neural networks for micro-segmentation using spending personality. In 2021 IEEE Symposium Series on Computational Intelligence (SSCI), pages 1–5, 2021.
  • [19] Charl Maree and Christian W. Omlin. Understanding spending behavior: Recurrent neural network explanation and interpretation (in print). In IEEE Computational Intelligence for Financial Engineering and Economics, pages 1–7, 2022.
  • [20] S. M. Bartram, J. Branke, G. D. Rossi, and M. Motahari. Machine learning for active portfolio management. The Journal of Financial Data Science, 3(3):9–30, 2021.
  • [21] Emmanuel Jurczenko. Machine Learning for Asset Management: New Developments and Financial Applications. Wiley - ISTE, London, United Kingdom, 2020.
  • [22] Q.Y.E. Lim, Q. Cao, and C. Quek. Dynamic portfolio rebalancing through reinforcement learning. Neural Computing and Applications, 33(24):1–15, 2021.
  • [23] Michael Pinelis and David Ruppert. Machine learning portfolio allocation. The Journal of Finance and Data Science, 8:35–54, 2022.
  • [24] Adrian Millea. Deep reinforcement learning for trading—a critical survey. Data, 6(11):119, 2021.
  • [25] Charl Maree and Christian W. Omlin. Balancing profit, risk, and sustainability for portfolio management (in print). In IEEE Computational Intelligence for Financial Engineering and Economics, pages 1–8, 2022.
  • [26] A. Heuillet, F. Couthouis, and N. Díaz-Rodríguez. Explainability in deep reinforcement learning. Knowledge-Based Systems, 214(106685):1–24, 2021.
  • [27] Lindsay Wells and Tomasz Bednarz. Explainable AI and reinforcement learning: A systematic review of current approaches and trends. Frontiers in Artificial Intelligence, 4:1–48, 2021.
  • [28] S. Gupta, G. Singal, and D. Garg. Deep reinforcement learning techniques in diversified domains: A survey. Archives of Computational Methods in Engineering, 28:4715–4754, 2021.
  • [29] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv, 1509.02971, 2019.
  • [30] Brian D. Ziebart. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, Machine Learning Department, Carnegie Mellon University, December 2010.
  • [31] Tuomas Haarnoja, Haoran Tang, P. Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning (ICML), 2017.
  • [32] Yahoo Finance. Historical data for S&P500 stock index, 2022. https://finance.yahoo.com/quote/%5EGSPC/history?p=%5EGSPC, Accessed on 30/01/2022.
  • [33] Statistics Norway. Table 07221 - Price index for existing dwellings, 2022. https://www.ssb.no/en/statbank/table/07221/, Accessed on 30/01/2022.
  • [34] Norges Bank. Interest rates, 2022. https://app.norges-bank.no/query/#/en/interest, Accessed on 30/01/2022.