
Learning Optimal Behavior Through Reasoning and Experiences
We thank Sam Gershman for extensive and helpful discussions. Email addresses: Ilut, cosmin.ilut@duke.edu; Valchev, valchev@bc.edu.

Cosmin Ilut
Duke University & NBER
   Rosen Valchev
Boston College & NBER
(March 20, 2024)
Abstract

We develop a novel framework of bounded rationality under cognitive frictions that studies learning over optimal behavior through both deliberative reasoning and accumulated experiences. Using both types of information, agents engage in Bayesian non-parametric estimation of the unknown action value function. Reasoning signals are produced internally through mental deliberation, subject to a cognitive cost. Experience signals are the observed utility outcomes at previous actions. Agents’ subjective estimation uncertainty, which evolves through information accumulation, modulates the two modes of learning in a state- and history-dependent way. We discuss how the model draws on and bridges conceptual, methodological and empirical insights from both economics and the cognitive sciences literature on reinforcement learning.


JEL Codes: D83, D91, E21, E71, C11

1 Introduction

There is a deep, interdisciplinary interest in understanding and modeling cognitive limitations in decision making. Across both economics and the cognitive sciences, the literature recognizes two broad ways in which people learn about and deduce the best course of action in a given situation. The first is cognition and reasoning: through introspective, abstract deliberation, humans can get a better grasp of their optimal action in the situation at hand. The second is accumulated experience: by observing realized outcomes of past decisions, agents update their views on the respective benefits of taking those actions in those circumstances. These two sources of information are conceptually distinct and are both limited: experiences are observed only along the realized path of situations the agent actually faced in the past, and while abstract thinking could help the agent deduce outcomes of counterfactual actions and situations, such deliberation is cognitively costly.

This paper develops a novel framework of constrained-optimal behavior under cognitive frictions that jointly studies learning through reasoning and through accumulated experiences. To do so, the paper draws on conceptual, methodological and empirical insights from both economics and the cognitive sciences literature on reinforcement learning (RL) (Kaelbling et al. (1996), Sutton and Barto (2018)).

In our framework, agents are uncertain about optimal behavior in the sense of facing subjective uncertainty over the optimal policy and value functions that characterize their decision problem. In particular, we follow the RL approach of letting agents learn about the action value function Q_{\pi}(a,s), which gives the expected discounted sum of utility when taking action a in state s and following a given policy \pi(a|s) thereafter. (In the language of RL, this “action” value function is closely related to, and derived from, the “state” value function V(s), which is perhaps more familiar to economists.) While the standard approach in economics is to endow agents with perfect knowledge of the optimal policy function \pi^{*}(a|s) and therefore the action value function Q_{\pi^{*}}(a,s), in our model agents perceive Q_{\pi^{*}}(a,s) as uncertain ex-ante, and gradually learn about it over time.

Learning can occur through cognition, which is costly but beneficial in reducing agents’ uncertainty about the best course of action. Agents trade off this benefit and cost of engaging cognitive resources, giving rise to a state- and history-dependent choice of reasoning, and thus exhibit constrained-optimal, or “resource-rational,” behavior. Agents also update beliefs about optimal behavior based on the experienced flow utility each period. Critically, the effective precision of both reasoning and experiences in informing behavior is endogenous, as it is a function of the agent’s beginning-of-period prior beliefs and uncertainty, which evolve dynamically and endogenously.

Elements of framework. There are four key specific features of our learning framework.

First, we model agents’ beliefs over the unknown function Q_{\pi^{\ast}}(a,s) as a Gaussian Process (GP) distribution. (The GP distribution can be derived from first principles as the limiting case of Bayesian Gaussian kernel regression; see e.g. Rasmussen and Williams (2006), Liu et al. (2011).) On the one hand, the GP distribution is methodologically appealing as a prior over the space of functions, as it is very flexible and also tractable in recursively characterizing the conditional moments of the unknown function. (In economics, these properties have also recently drawn attention to GPs, including work like Callander (2011), Bardhi (2022), Dew-Becker and Nathanson (2019), Ilut et al. (2020) and Ilut and Valchev (2023).) On the other hand, the cognitive sciences increasingly emphasize a Bayesian approach and in particular the appeal of GP distributions, both for conceptual and descriptive reasons (see for example Griffiths et al. (2008), Griffiths et al. (2010) and Gershman et al. (2015)).

Second, given an estimate of the action value function \widehat{Q}_{t}(a,s), agents take constrained-optimal actions, trading off exploitation and experimentation incentives. On the one hand, the agent has incentives to choose the action with the highest estimated value \widehat{Q}_{t}(a,s_{t}) at the current state s_{t} (exploitation). On the other hand, the agent recognizes that \widehat{Q}_{t}(a,s_{t}) is just an uncertain estimate, so there is a benefit to exploration and learning more about the unknown Q_{\pi^{*}}(a,s). We model this desire for experimentation following the concept of maximum entropy reinforcement learning, which essentially requires the entropy of the action distribution to be bigger than some minimum threshold, thus ensuring randomization (see for example Mnih et al. (2016), Haarnoja et al. (2017), Eysenbach and Levine (2019)). In contrast to the standard approach that sets this minimum threshold as an exogenous time-invariant parameter, we let the entropy lower bound be proportional to the remaining subjective uncertainty over Q_{\pi^{*}}(a,s_{t}). This captures the intuition that experimentation is valuable only to the extent that the Q-function is uncertain. Putting it all together, the resulting optimal action policy function takes the form of the softmax function widely used in statistics and machine learning (see e.g. Sutton and Barto (2018)).

Turning to the learning modes and the dynamic evolution of the estimates of the action value function, our third modeling feature is learning from experienced utility. Because it is derived from actual observed outcomes, this learning follows the so-called “model-free” or “Q-learning” dynamic programming solution techniques used in machine learning and cognitive sciences (e.g. Dearden et al. (1998)). In particular, at the beginning of the period, the agent observes the realization of the new state s_{t} and also the realized utility u(a_{t-1},s_{t-1}) that she experienced based on last period’s choice of action a_{t-1}. Knowing the flow utility yesterday and the realized state today implies an experience-based informative signal that updates beliefs about Q_{\pi^{*}}(a,s). The key intuition behind this update is that the flow utility u(a_{t-1},s_{t-1}) reveals the “temporal difference” in the Q-function, i.e. u(a_{t-1},s_{t-1})=Q_{\pi^{*}}(a_{t-1},s_{t-1})-\beta\mathbb{E}_{t-1}(Q_{\pi^{*}}(a_{t},s_{t})).

Finally, but importantly, within the same objective function that determines optimal actions, we model the benefit and cost of abstract reasoning. By reasoning we mean the internal deliberation process through which the agent produces information and learns about Q_{\pi^{\ast}}(a,s) generally. Intuitively, this is akin to an economist trying to solve for the value function globally. As such, our terminology of reasoning is similar to the notion of model-based learning in the RL literature (Sutton and Barto (2018)). The benefit of reasoning works through the resulting reduction of uncertainty over Q_{\pi^{*}}(a,s_{t}). We allow this reduction to impact the objective function in two ways. One is a direct utility cost of higher uncertainty (like a cognitive dissonance cost, e.g. Aronson (1969)).

The second emerges indirectly and endogenously, through an interaction between actions and reasoning. In particular, since reasoning decreases current conditional uncertainty, it also weakens the incentive to experiment, by formally lowering the threshold on the action distribution entropy. Intuitively, the agent values the reduction in uncertainty from reasoning because it allows her to select an action closer to the one with the currently highest value estimate, not having to worry about experimentation. In turn, the cost of reasoning is also in utility terms, quantified as the reduction in entropy achieved by acquiring these deliberation signals, as in information theory (e.g. Sims (2003a)), capturing effort as in the cognitive control literature (e.g. Botvinick et al. (2004), Kool et al. (2017)).

We show how both reasoning and experience signals can be incorporated in revising beliefs over the unknown Q-function using formal non-parametric Bayesian updating formulas. (These updates resemble temporal difference solution techniques, used extensively in RL (Sutton (1988)), and in particular their Bayesian, GP-based version, connecting to the approach in Engel et al. (2003, 2005).) Put together, these two sources of information update agents’ conditional mean and uncertainty over the whole function Q_{\pi^{\ast}}(a,s), providing a recursive structure to the GP distribution entering next period.

Key contributions. Our framework connects to several literature strands.

In terms of the economics literature, we use insights from the RL literature to integrate learning about optimal behavior from experiences into models of resource-rationality that emphasize a cost-benefit tradeoff of reasoning. This tradeoff is shared with various such approaches in economics (including Ilut and Valchev (2023), Sims (2003b), Gabaix (2014), Woodford (2020), Alaoui and Penta (2022) and others) and cognitive sciences (e.g. Gershman et al. (2015), Griffiths et al. (2015), Shenhav et al. (2017)). In turn, we connect to growing interest in, and evidence on, experience-based learning in economics, as in Malmendier and Nagel (2016) and Malmendier (2021). We differ from this latter approach by studying experiences jointly with reasoning, as well as by modeling experiences as informative not about the state law of motion, but instead about the perceived value of taking specific actions, as in RL.

Our framework also innovates within RL modeling itself, by building a dynamic cognitive model in which several key features emerge endogenously, as follows. (In this sense, we also differ from recent work by Barberis and Jin (2023), whose main objective is to show how to import typical RL frameworks into a specific economic environment, in their case asset pricing.)

First, we obtain and characterize an endogenous arbitration between learning from reasoning and from experiences. We use the terminology of “arbitration” as in RL, meaning the characterization of the different weights put on the two modes of learning (“model-based” vs. “model-free” learning, in the language of RL). Our framework produces an internally consistent, endogenous arbitration in the form of the state- and history-dependent Bayesian updating weights our agents put on their reasoning and experience signals. The property that arbitration is based on uncertainty is shared with cognitive and neuroscience work arguing that the brain puts more weight on the learning mode that it deems more reliable (as in Daw et al. (2005), Lee et al. (2014)). Our key contribution here is to incorporate everything in a unified Bayesian framework with GP priors, which (i) allows prior uncertainty to account for correlation between Q-values at different state/action pairs, (ii) accommodates non-parametric cognitive learning, and (iii) delivers a tractable and parsimonious unified conditional belief process, as opposed to having to keep track of, and arbitrarily tie together, two separate model-based and model-free estimates.

Second, the framework is characterized by an endogenous decision to engage reasoning. The model lets agents treat cognition as an accessible, but costly resource. Thus, it is not just that, given some model-based and model-free estimates, there is endogenous arbitration, but the intensity of the model-based learning and the associated precision of its reasoning signal is time-varying, conditional on states and choices. This implication of adjusting cognition in a state-dependent way is desirable for both empirical and conceptual reasons. Empirically, it is consistent with evidence and a general desired approach (see eg. Kool et al. (2017)) in the neuroscience and RL literatures emphasizing the cognitive cost of mental simulations versus their benefit. The latter is here modulated through reduction of uncertainty, where the value of reducing that uncertainty is (partly) state-dependent. In addition, when the agent values the reduction in uncertainty due to a direct, cognitive dissonance utility cost, this latter mechanism can capture evidence that active deliberation is indeed engaged only when there is sufficient “conflict” or uncertainty in prior beliefs (see eg. Thompson et al. (2011)).

Third, the framework describes jointly the evolution of beliefs and the constrained-optimal policy function which agents follow in selecting their actions. We endogenously derive a constrained-optimal action policy that takes the form of a softmax function, which is otherwise exogenously postulated in RL. A key input in that function is the ‘temperature’ parameter, which the existing literature treats as an exogenous parameter: the lower the temperature, the more the action policy leans away from “exploration” and exploits the action with the current highest estimated payoff. In our model, the temperature parameter is endogenous and is state- and history-dependent, since it depends on the conditional subjective uncertainty the agent still perceives in his estimate of Q_{\pi^{*}}(a,s). As such, this “temperature” of the softmax action policy endogenously evolves as the agent learns. The study of exogenous adjustments of this parameter along the sample path has been of large interest in the RL, machine learning and constrained optimization literatures. (Typical approaches use pre-designed strategies to tune temperature manually (e.g. Kirkpatrick et al. (1983), Ackley et al. (1985)) or adaptively, in deep learning models (e.g. Lin et al. (2018), Wang and Ni (2020)).) In that literature, a robustly successful time-varying temperature (i.e. one maximizing reward along simulated paths) is found to be decreasing through time, since later along the path there is typically less reason to explore (Sutton and Barto (2018)). Our framework delivers that result endogenously, as uncertainty over Q_{\pi^{*}}(a,s_{t}) estimates naturally declines over time, and it also adds novel state-dependencies to that logic.

Overall, the framework thus results in a dual-learning system, with two components akin to the “model-based” and the “model-free” modes of learning discussed in the RL literature. Crucially, the paper also extends this literature in providing a unified, formal way of modulating these two modes of learning via subjective uncertainty, which evolves through information accumulation. Thus, the paper extends both (i) the economics literature, by providing a conceptually new bounded rationality model deeply rooted in cognitive science insights and empirical results, and (ii) the RL literature, by bringing in constrained-optimal maximization approaches and tools typical to information economics.

2 Framework

We aim to model the process of economic agents figuring out optimal behavior using both their abstract thinking abilities, i.e. cognition, and their accumulated experience with past actions.

2.1 A generic recursive problem

To fix ideas, consider a generic recursive problem with discrete time indexed by tt, where an agent chooses an action ata_{t} (potentially a vector of actions) and is perfectly aware of all payoff-relevant details of the environment. The problem can be expressed as a Bellman equation

V^{\ast}(s_{t})=\max_{a_{t}\in B(s_{t})} u(a_{t},s_{t})+\beta\mathbb{E}_{t}V^{\ast}(s_{t+1}),   (1)

where s_{t} collects all the relevant state variables, both exogenous and endogenous. The state follows the known law of motion F(s_{t+1}|s_{t},a_{t}), giving the conditional expectation \mathbb{E}_{t} in (1).

The primitives of the environment are the per-period utility function u(a_{t},s_{t}), the time discount factor \beta, the budget B(s_{t}) defining the set of currently feasible actions a_{t}, and the law of motion of the state F. The value function V^{\ast}(s) encodes the continuation value attached to starting in any particular realization of s, when following the optimal policy function \pi^{\ast}(a|s). The latter is the optimally selected mapping from any given state s to probabilities over feasible actions a from equation (1). (Stating policies in terms of probabilities of taking actions is useful in developing the framework later. Under full information, the optimal \pi^{*}(a|s) is naturally degenerate, with probability \pi^{\ast}(a_{t}^{*}|s_{t})=1 for a_{t}^{*}=\arg\max_{a_{t}\in B(s_{t})}u(a_{t},s_{t})+\beta\mathbb{E}_{t}V^{\ast}(s_{t+1}), and zero for all other actions a_{t}\neq a_{t}^{*}.)

Such dynamic problems are at the core of modern economics. And as economists we understand that, since dynamic problems require evaluating whole infinite paths of actions, or in recursive terms, an action plan with all possible future contingencies, characterizing V^{\ast}(s) is generally a challenging functional problem (see e.g. Judd et al. (1998) or Bertsekas (2019)). This is often not fully tractable even to highly trained economists, and state-of-the-art approximate solution techniques require a lot of sophistication and effort to implement.

At the same time, in standard economic models it is taken as given that the agents themselves always know the optimal policy \pi^{*}(a|s) and value V^{*}(s) functions. The difficulty economists face in actually computing the optimal objects \pi^{*}(a|s) and V^{*}(s) is simply abstracted away. We aim to address this apparent paradox by developing a framework which puts economic agents on a similar footing as economists, requiring agents to invest cognitive effort in figuring out V^{\ast}(s), and thus \pi^{\ast}(a|s), and modeling their gradual learning process.

In particular, our starting point is that in real life people would typically have two sources of information about optimal behavior. The first source is experiences: by observing the per-period utility outcomes of different actions taken at various states in the past, the agent learns about V^{\ast}(s). The second source is cognition and reasoning: a unique human characteristic is the ability to think abstractly about the problem at hand. Through such internal deliberations, agents can learn about the implied value of taking different courses of action. For example, such deliberation could take the form of mentally simulating possible paths of behavior forward and comparing expected utility. These two sources of information are conceptually distinct and are both limited: experiences are realized only along the actual path taken by the agent, while abstract thinking is a scarce cognitive resource.

To model learning from both mental simulation and actual experiences, we connect to the large reinforcement learning (RL) literature (Kaelbling et al. (1996), Sutton and Barto (2018)). There an agent interacting with a dynamic and stochastic environment learns an optimal control policy for a sequential decision problem, typically a Markov Decision Process.

Using the notation of equation (1), the key object we focus on, as in RL, is the action-value function Q_{\pi}(a,s). This function gives the expected discounted sum of utility when taking action a in state s and following a given policy \pi(a|s) thereafter (not necessarily the optimal \pi^{\ast}(a|s)). The agent wants to know the optimal Q_{\pi^{\ast}}(a,s), defined as

Q_{\pi^{\ast}}(a_{t},s_{t})=u(a_{t},s_{t})+\beta\mathbb{E}_{t}Q_{\pi^{\ast}}(\pi^{\ast}(a_{t+1}|s_{t+1}),s_{t+1})   (2)

where \pi^{\ast}(a|s) satisfies the Bellman equation (1). The functions V^{\ast}(s) and Q_{\pi^{\ast}}(a,s) are implicitly related, as V^{\ast}(s_{t})=\max_{a_{t}\in B(s_{t})}Q_{\pi^{\ast}}(a_{t},s_{t}), but as we detail later, it will be useful to work with Q_{\pi^{\ast}}(a_{t},s_{t}).
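
To make equations (1) and (2) concrete, the following minimal sketch (ours, not part of the paper) computes Q_{\pi^{\ast}} by value iteration on a small hypothetical MDP and verifies that V^{\ast}(s)=\max_{a}Q_{\pi^{\ast}}(a,s); the utilities, transition matrix and discount factor are purely illustrative assumptions.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, used purely for illustration.
beta = 0.95                                  # discount factor
u = np.array([[1.0, 0.2],                    # u[s, a]: flow utility
              [0.0, 0.8]])
F = np.array([[[0.9, 0.1], [0.5, 0.5]],      # F[s, a, s']: transition probabilities
              [[0.3, 0.7], [0.6, 0.4]]])

# Q-value iteration: Q(a,s) = u(a,s) + beta * E_t[ max_a' Q(a', s') ]
Q = np.zeros((2, 2))                         # Q[s, a]
for _ in range(10_000):
    V = Q.max(axis=1)                        # V*(s) = max_a Q(a, s)
    Q_new = u + beta * F @ V                 # Bellman operator applied to Q
    if np.max(np.abs(Q_new - Q)) < 1e-12:
        break
    Q = Q_new

print("Q*:", Q)
print("V*:", Q.max(axis=1))                  # equals max_a Q*(a, s)
print("greedy policy:", Q.argmax(axis=1))    # degenerate optimal pi*(a|s)
```

Under full information this fixed point is what the agent is assumed to know; the framework below instead treats it as the object of learning.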

2.2 Human learning with Gaussian Processes

We model agents’ beliefs over the unknown function Q_{\pi^{\ast}}(a,s) as Gaussian Process (GP) distributions. In particular, at the beginning of time agents have the initial prior

Q_{\pi^{\ast}}(a,s)\sim GP(\widehat{Q}_{0}(a,s),\widehat{\Sigma}_{0}(a,s,a^{\prime},s^{\prime}))   (3)

where the mean function is \widehat{Q}_{0}(a,s)=\mathbb{E}(Q_{\pi^{\ast}}(a,s)) and the variance-covariance function is \widehat{\Sigma}_{0}(a,s,a^{\prime},s^{\prime})=Cov(Q_{\pi^{\ast}}(a,s),Q_{\pi^{\ast}}(a^{\prime},s^{\prime})). The defining feature of a Gaussian Process distribution is that for any two action-state pairs (a,s) and (a^{\prime},s^{\prime}), the values Q_{\pi^{*}}(a,s) and Q_{\pi^{*}}(a^{\prime},s^{\prime}) have a joint-Normal distribution with mean and variance-covariance given by \widehat{Q}_{0} and \widehat{\Sigma}_{0}.
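
As a concrete illustration of the prior in equation (3), the sketch below places a mean-zero GP over a discretized grid of (a,s) pairs; the squared-exponential covariance function and its hyperparameters are our own illustrative assumptions, not objects specified in the paper.

```python
import numpy as np

def se_kernel(x, y, ell=1.0, tau=1.0):
    """Squared-exponential covariance between points x=(a,s) and y=(a',s')."""
    return tau**2 * np.exp(-np.sum((np.array(x) - np.array(y))**2) / (2 * ell**2))

# Discretized action and state grids (illustrative).
actions = np.linspace(0.0, 1.0, 5)
states = np.linspace(-1.0, 1.0, 5)
points = [(a, s) for a in actions for s in states]            # all (a, s) pairs

# Prior mean function Q_hat_0 and covariance function Sigma_hat_0 on the grid.
Q_hat_0 = np.zeros(len(points))
Sigma_hat_0 = np.array([[se_kernel(p, q) for q in points] for p in points])

# GP property: any finite collection of Q-values is jointly Normal, so one
# draw of the whole (discretized) Q-function is a single multivariate draw.
rng = np.random.default_rng(0)
jitter = 1e-9 * np.eye(len(points))                           # numerical stability
Q_draw = rng.multivariate_normal(Q_hat_0, Sigma_hat_0 + jitter)
print(Q_draw.reshape(len(actions), len(states)))
```

The covariance function plays the role of the similarity metric discussed below: (a,s) pairs that are close under the kernel have highly correlated Q-values.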

Why learning with GP? There is a variety of motivating insights and arguments for using GP distributions, and in particular for doing so in the context of modeling human cognition.

On the one hand, GP distributions are methodologically appealing due to their flexibility and tractability. In particular, GP distributions extend the familiar Kalman filter to learning about functions, and the law of motion for the conditional mean and variance can be easily characterized recursively. As a result, the conditional beliefs in each period also follow a GP distribution, with properly updated mean and variance functions, as detailed below. In this formulation, the variance-covariance function \widehat{\Sigma}_{0}(a,s,a^{\prime},s^{\prime}) encodes the agent’s prior view of how correlated the function’s values are at different points (here pairs (a,s) and (a^{\prime},s^{\prime})), and therefore induces a measure of proximity, or similarity, between those points.

On the other hand, the cognitive sciences literature increasingly emphasizes a Bayesian approach and in particular the appeal of GP distributions. Indeed, at a broad level, cognitive sciences emphasize that cognition is fundamentally related to forming uncertain conjectures from partial or noisy information, and thus a probabilistic framework is particularly well suited both conceptually and in terms of accounting for data on observed behavior (eg. Chater et al. (2006), Griffiths et al. (2008), Griffiths et al. (2010)).

The GP distribution in particular has been increasingly used in the cognitive sciences literature, building on connections to statistics and machine learning (on the latter see e.g. Barber (2012)). The reason lies in recent experimental and neuroscience evidence that the human brain’s learning process is well described by GPs (see for example Gershman et al. (2015) for a survey and Schulz et al. (2018), Wu et al. (2018), Wu et al. (2021) for evidence using bandit-like tasks).

Even more specific to RL, the Bayesian approach and the GP distribution have also found applications in the RL literature (see e.g. Engel et al. (2005)). As surveyed in Ghavamzadeh et al. (2015), this approach can provide a tractable and coherent way to model the exploitation-exploration tradeoff, fundamental to learning from experiences, as a function of subjective uncertainty over Q_{\pi^{\ast}}(a,s) estimates, as well as offering a formal way to incorporate prior beliefs into the action-selection algorithms.

In building our framework, we nevertheless emphasize that these motivating insights from cognitive science have not previously been connected in a single framework that integrates learning from abstract thinking with learning from experiences, nor have they been applied to economic models.

2.3 Learning from experienced outcomes

We start by detailing how we formalize the learning through experienced outcomes and its connection to the concept of “model-free” learning in machine learning. We will consider this type of learning as free of cognitive costs, as it accumulates by the agent simply experiencing utility flows of the specific actions she has taken in the past, and not through abstractly thinking about contingencies and counterfactuals.

Consider a typical period t. The agent enters the period with some prior beliefs, conditional on the prior information set \mathcal{I}_{t-1}. As anticipated earlier, given the time-zero prior in equation (3), the conditional t-1 beliefs also have a GP distribution:

Q_{\pi^{\ast}}(a,s)|\mathcal{I}_{t-1}\sim GP(\widehat{Q}_{t-1}(a,s),\widehat{\Sigma}_{t-1}(a,s,a^{\prime},s^{\prime}))   (4)

where \widehat{Q}_{t-1}(a,s)=\mathbb{E}(Q_{\pi^{*}}(a,s)|\mathcal{I}_{t-1}) and \widehat{\Sigma}_{t-1}(a,s,a^{\prime},s^{\prime})=Cov(Q_{\pi^{*}}(a,s),Q_{\pi^{*}}(a^{\prime},s^{\prime})|\mathcal{I}_{t-1}) follow a recursive formulation detailed below.

At the beginning of the period, the agent observes the realized utility u(a_{t-1},s_{t-1}) that she experienced based on last period’s choice of action a_{t-1}. In addition, the time-t shock realizes and the agent observes the new state variable s_{t}. Knowing the flow utility yesterday and the realized state today implies an experience-based informative signal that updates beliefs about Q_{\pi^{*}}(a,s). This update, along the realized path of state-action pairs (a_{t},s_{t}), follows the so-called “model-free” or “Q-learning” dynamic programming solution techniques used in machine learning and cognitive sciences (e.g. Dearden et al. (1998)).

The key intuition behind this update is that the flow utility u(a_{t-1},s_{t-1}) reveals the “temporal difference” in the Q-function; that is, from the Bellman equation we can express

u(a_{t-1},s_{t-1})=Q_{\pi^{*}}(a_{t-1},s_{t-1})-\beta\mathbb{E}_{t-1}\left(Q_{\pi^{*}}(\pi^{*}(a_{t}|s_{t}),s_{t})\right)

Given beliefs \widehat{Q}_{t-1}(a,s), one can then compute the deviation from the above equation that those specific beliefs imply, and adjust them accordingly. Doing this directly from the above equation requires computing the expectation \mathbb{E}_{t-1}(Q_{\pi^{*}}(\pi^{*}(a_{t}|s_{t}),s_{t})), which integrates over all possible states s_{t} that could have realized at time t. This integration is computationally and conceptually a complex step, and thus machine learning applications typically use the more robust approach which relies instead on the approximation

u(a_{t-1},s_{t-1})=Q_{\pi^{*}}(a_{t-1},s_{t-1})-\beta Q_{\pi^{*}}(\pi^{*}(a_{t}|s_{t}),s_{t})   (5)

where the temporal difference on the right-hand side is not between the Q-function at (a_{t-1},s_{t-1}) and the average across all possible states s_{t}, but only between it and the value of the Q-function in the actually realized state s_{t}.

This approach is “robust” in the sense that it does not need to compute the expectation \mathbb{E}_{t-1} and also does not need to assume full knowledge of the transition probabilities F(s_{t+1}|s_{t},a_{t}). The agent simply sits at time t, observes the actual realization of s_{t}, and then only looks back at the old state-action pair (a_{t-1},s_{t-1}) and the realized utility u(a_{t-1},s_{t-1}). Still, using this approximation to update estimates of the Q-function is asymptotically consistent: if agents visit all possible states in the support of s_{t} with positive probability, then the update based on the approximation will eventually converge to the true Q_{\pi^{*}}(a,s) as t\rightarrow\infty.

Also worth stressing is that this “model-free” update is very simple to do, as it just compares the agent’s realized utility yesterday with her current expectation of the Q-function at the realized state today. As such, it is straightforward to model this experiential updating as “cognitively free”, something that just comes by default to the agent.

Formally, the agent perceives the following experience signal

\eta_{t}^{E}\equiv u(a_{t-1},s_{t-1})\approx Q_{\pi^{*}}(a_{t-1},s_{t-1})-\beta Q_{\pi^{*}}(\pi^{*}(a_{t}|s_{t}),s_{t})

And lastly, since our framework keeps track of beliefs about Q_{\pi^{\ast}}(a,s), but not explicitly of beliefs over \pi^{\ast}(a|s_{t}), for tractability it is convenient to approximate the latter by setting \pi^{\ast}(a|s_{t})=\pi_{t-1}^{greedy}(a|s_{t}). (This approximation can in principle be relaxed by adding a further layer of noise.) Thus, we assume that the agent perceives the following structure of the experience signal

\eta_{t}^{E}\equiv u(a_{t-1},s_{t-1})\approx Q_{\pi^{*}}(a_{t-1},s_{t-1})-\beta Q_{\pi^{*}}(\pi_{t-1}^{greedy}(a_{t}|s_{t}),s_{t})

This expression for \eta_{t}^{E} can be readily used to compute a formal Bayesian update based on this experiential information. Specifically,

\widehat{Q}_{t}^{E}(a,s)=\widehat{Q}_{t-1}(a,s)+\alpha_{t}^{E}(a,s)\left[\eta_{t}^{E}-\left(\widehat{Q}_{t-1}(a_{t-1},s_{t-1})-\beta\widehat{Q}_{t-1}(\pi_{t-1}^{greedy}(a_{t}|s_{t}),s_{t})\right)\right]   (6)

where \widehat{Q}_{t}^{E}(a,s)\equiv\mathbb{E}(Q_{\pi^{*}}(a,s)|\mathcal{I}_{t-1},\eta_{t}^{E}) is not quite the end-of-period t beliefs, as beliefs will potentially be further updated via the abstract reasoning we describe below. Moreover, the signal-to-noise ratio \alpha_{t}^{E}(a,s) can be derived in a straightforward way by evaluating

\alpha_{t}^{E}(a,s)=\frac{Cov(\eta_{t}^{E},Q_{\pi^{\ast}}(a,s)|\mathcal{I}_{t-1})}{Var(\eta_{t}^{E}|\mathcal{I}_{t-1})}   (7)

The signal also reduces the conditional uncertainty facing the agent, as

\Sigma_{t}^{E}(a,a^{\prime};s_{t}) \equiv Cov(Q_{\pi^{*}}(a,s_{t}),Q_{\pi^{*}}(a^{\prime},s_{t})|\mathcal{I}_{t-1},\eta_{t}^{E}) = \widehat{\Sigma}_{t-1}(a,a^{\prime};s_{t})-\alpha_{t}^{E}(a,s_{t})\,Cov(\eta_{t}^{E},Q_{\pi^{\ast}}(a^{\prime},s_{t})|\mathcal{I}_{t-1})
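
For readers who want to see the mechanics of equations (6)-(7) and the covariance update in one place, here is a minimal sketch (our own, on a discretized (a,s) grid, with the experience signal treated as an exact linear combination of two entries of the stacked Q-vector); the grid representation and indexing are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def experience_update(Q_hat, Sigma, idx_prev, idx_greedy, u_realized, beta=0.95):
    """
    Bayesian update of beliefs over the stacked Q-vector given the experience signal
    eta^E = u(a_{t-1}, s_{t-1}) ~= Q(a_{t-1}, s_{t-1}) - beta * Q(greedy action, s_t).

    Q_hat      : prior mean vector over all (a, s) grid points
    Sigma      : prior covariance matrix over the same grid
    idx_prev   : index of (a_{t-1}, s_{t-1}) in the grid
    idx_greedy : index of (greedy action at s_t, s_t) in the grid
    u_realized : observed flow utility u(a_{t-1}, s_{t-1})
    """
    n = len(Q_hat)
    g = np.zeros(n)                                 # signal loading: eta^E = g' Q
    g[idx_prev], g[idx_greedy] = 1.0, -beta

    var_eta = g @ Sigma @ g                         # Var(eta^E | I_{t-1})
    cov_eta = Sigma @ g                             # Cov(Q(a,s), eta^E | I_{t-1}), all (a,s)
    alpha = cov_eta / var_eta                       # signal-to-noise ratios, eq. (7)

    surprise = u_realized - g @ Q_hat               # eta^E minus its predicted value
    Q_post = Q_hat + alpha * surprise               # eq. (6)
    Sigma_post = Sigma - np.outer(alpha, cov_eta)   # reduced conditional uncertainty
    return Q_post, Sigma_post
```

Because the signal loads on only two grid points, a single experience nonetheless moves beliefs at every (a,s) pair in proportion to its prior covariance with those two points.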

2.4 Learning through abstract reasoning

After updating beliefs for free, each period t the agent decides how intensely to engage in cognitively costly abstract reasoning, which produces further information about Q_{\pi^{*}}(a,s). Then, after updating beliefs based on all sources of time-t information, the agent optimally chooses an action policy \pi_{t}(a|s_{t}). To understand these choices, consider first the reasoning process.

2.4.1 Reasoning

By abstract reasoning we mean the internal deliberation process through which the agent thinks abstractly about his decision problem and learns about Q_{\pi^{\ast}}(a,s) generally. (For ease of exposition, we assume that the action space is discrete with cardinality N_{a}=|a|, but this can be relaxed and the framework can be defined on a continuous space of actions as well.) We remain agnostic about the specific mode of deliberation the agent engages in; for example, he could be attempting value function iteration or thinking forward through the decision tree of his problem. This kind of reasoning is deliberate and abstract, as it tries to deduce new information about Q_{\pi^{*}}(a,s) over and above the simple experience of the flow utility given past choices. It is a potentially powerful source of information for the agent, but it is mentally costly.

Formally, each period the agent can generate a vector of reasoning signals \eta_{t}^{R} as unrestricted linear functions of Q_{\pi^{\ast}}(a,s_{t}):

\eta_{t}^{R}=\Omega_{t}^{\prime}Q_{\pi^{\ast}}(a,s_{t})+\varepsilon_{\eta,t}   (8)

where \varepsilon_{\eta,t}\sim N(0,\Sigma_{\eta,t}) and \Omega_{t} is an N_{a}\times N_{a} matrix. The agent optimally chooses the structure of the matrix \Omega_{t} and the noise variance matrix \Sigma_{\eta,t}, subject to a cost-benefit tradeoff introduced below in subsection 2.4.2.

The vector of signals \eta_{t}^{R} then updates beliefs over Q_{\pi^{\ast}}(a,s) at all pairs (a,s):

\widehat{Q}_{t}(a,s)=\widehat{Q}_{t}^{E}(a,s)+\alpha_{t}^{R}(a,s)\left(\eta_{t}^{R}-\Omega_{t}^{\prime}\widehat{Q}_{t}^{E}(a,s_{t})\right)   (9)

where the signal-to-noise ratio \alpha_{t}^{R}(a,s) is given by

\alpha_{t}^{R}(a,s)=Cov\left(Q_{\pi^{\ast}}(a,s),\Omega_{t}^{\prime}Q_{\pi^{\ast}}(a,s_{t})|\mathcal{I}_{t-1},\eta_{t}^{E}\right)\left(Var(\Omega_{t}^{\prime}Q_{\pi^{\ast}}(a,s_{t})|\mathcal{I}_{t-1},\eta_{t}^{E})+\Sigma_{\eta,t}\right)^{-1}   (10)

which can be tractably computed from \widehat{\Sigma}_{t-1}(a,a^{\prime},s,s^{\prime}) and the update based on \eta_{t}^{E} described above.

Importantly, the reasoning signals further reduce uncertainty. The end-of-period posterior variance function over the vector of values Q_{\pi^{*}}(a,s_{t}) across actions is

\widehat{\Sigma}_{t}(a,a^{\prime};s_{t})\equiv Cov(Q_{\pi^{*}}(a,s_{t}),Q_{\pi^{*}}(a^{\prime},s_{t})|\mathcal{I}_{t})=\Sigma_{t}^{E}(a,a^{\prime};s_{t})-\alpha_{t}^{R}(a,s)\,\Sigma_{t}^{E}(a,\Omega_{t}a;s_{t})^{\prime}   (11)
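
A compact sketch of the reasoning update in equations (8)-(11), again on a discretized grid and using standard multivariate-normal conditioning, is given below; the grid indexing and the explicit matrix inversion are our simplifications, not the paper's implementation.

```python
import numpy as np

def reasoning_update(Q_hat_E, Sigma_E, state_idx, Omega, Sigma_eta, eta_R):
    """
    Update beliefs with reasoning signals eta^R = Omega' Q(., s_t) + eps,
    eps ~ N(0, Sigma_eta), as in equations (8)-(11).

    Q_hat_E   : mean vector over all grid points, after the experience update
    Sigma_E   : covariance matrix over all grid points, after the experience update
    state_idx : indices of the grid points (a, s_t) at the current state s_t
    Omega     : (N_a x N_a) loading matrix on Q(., s_t)
    Sigma_eta : (N_a x N_a) noise covariance of the reasoning signals
    eta_R     : realized (N_a,) vector of reasoning signals
    """
    C = Sigma_E[:, state_idx] @ Omega                     # Cov(Q(a,s), Omega' Q(., s_t))
    S = Omega.T @ Sigma_E[np.ix_(state_idx, state_idx)] @ Omega + Sigma_eta
    alpha_R = C @ np.linalg.inv(S)                        # signal-to-noise ratio, eq. (10)

    surprise = eta_R - Omega.T @ Q_hat_E[state_idx]       # eta^R minus its predicted value
    Q_post = Q_hat_E + alpha_R @ surprise                 # eq. (9)
    Sigma_post = Sigma_E - alpha_R @ C.T                  # eq. (11)
    return Q_post, Sigma_post
```

Signals with very large entries of Sigma_eta are effectively not acquired, which is how the optimal signal choice in subsection 2.4.4 switches individual signals off.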

Interpretation. Our terminology of reasoning is similar to the notion of planning or model-based learning in the computational and RL literature (Sutton and Barto (2018)). Reasoning can in practice work through various specific processes. Consider, for example, a specific tool used in model-based RL: the simulation of a limited number of future paths of actions and the resulting states and utility flows (see for example Moerland et al. (2023)). In our framework, this simulation is captured by the noisy signals \eta_{t}^{R} that update the prior beliefs of the agent. In a related way, by producing new information internally to the agent, without being observed by an outsider, the notion of reasoning here also resembles that of ‘fact-free learning’ in economics (Aragones et al. (2005), Alaoui and Penta (2022)).

However, our proposed framework purposefully abstracts from the specifics of the mental method behind \eta_{t}^{R}. Instead, it aims to capture a key tradeoff that many different solution methods share, namely that taking more computational steps in any given method (e.g. simulating more and longer paths forward) leads to (i) higher accuracy of the resulting solution, but (ii) higher cognitive cost, as we describe next.

Overall, by incorporating both signals \eta_{t}^{E} and \eta_{t}^{R} in the conditional beliefs of our agent, we integrate, in an endogenous way, both “model-free” and “model-based” learning, which are the two main types of learning discussed in RL.

This joint integration of both types of learning is present both in some recent treatments of RL (see Sutton and Barto (2018)) and in classical RL algorithms such as Dyna-Q (Sutton (1990, 1991)). (These algorithms include many extensions, such as Dyna-Q+ (Sutton (1990)), which, as in the current framework, further stimulates exploration.) Nevertheless, the RL literature usually resorts to ad-hoc assumptions on how the two types of learning are mixed together, while our framework proposes an endogenous arbitration between the two types of signals based on formal Bayesian updating notions.

Furthermore, the specific formulation of belief updating for both signals in equations (9) and (6) resembles temporal difference solution techniques, used extensively in RL (Sutton (1988)), and in particular their Bayesian, GP-based version, connecting to the approach in Engel et al. (2003, 2005).

2.4.2 Joint choice over reasoning and actions at time t

Equations (9) and (11) describe the updated beliefs, taking as given the structure of the signal in equation (8). Critically, given the current state s_{t} and the prior beliefs in equation (4), we let the agent jointly choose her optimal reasoning structure (i.e. \Omega_{t} and \Sigma_{\eta,t}) and the action policy \pi_{t}(a|s_{t}) that she will follow this period.


Objective function. We let the agent maximize the following joint objective function

\max_{\pi_{t}(a|s_{t}),\Sigma_{\eta,t},\Omega_{t}}\ \underbrace{\sum_{a}\widehat{Q}_{t}(a,s_{t})\pi_{t}(a|s_{t})}_{\text{exploitation benefit}}-\underbrace{w\sum_{a}\sigma_{t}^{2}(a,s_{t})}_{\text{cognitive dissonance cost}}-\underbrace{\frac{\kappa}{2}\ln\left[\frac{|\Sigma_{t}^{E}(a,s_{t},a^{\prime},s_{t})|}{|\Sigma_{t}(a,a^{\prime};s_{t})|}\right]}_{\text{reasoning cost}}   (12)

where, recall, \Sigma_{t}^{E}(a,s_{t},a^{\prime},s_{t}) is the variance conditional on both the beginning-of-period information set \mathcal{I}_{t-1} and the experience signal \eta_{t}^{E}, while \Sigma_{t}(a,a^{\prime};s_{t}) is the end-of-period posterior variance defined in equation (11), after also updating with the chosen reasoning signals. Here

\sigma_{t}^{2}(a,s_{t})\equiv\Sigma_{t}(a,a;s_{t})

denotes the diagonal entries of the posterior variance \Sigma_{t}(a,a^{\prime};s_{t}).

The objective function in (12) is subject to two types of constraints on the action distribution. The first is feasibility: \pi_{t}(a|s_{t})=0 for actions that do not satisfy the budget constraint, a\notin B(s_{t}), and the action distribution probabilities sum to one:

\sum_{a}\pi_{t}(a|s_{t})=1

Second, the action distribution \pi_{t}(a|s_{t}) is subject to an entropy constraint:

\underbrace{-\sum_{a}\ln(\pi_{t}(a|s_{t}))\,\pi_{t}(a|s_{t})}_{\text{entropy of action distribution}}\geq h\sum_{a}\sigma_{t}^{2}(a,s_{t})   (13)

Interpretation. The objective function in (12) and the entropy constraint in (13) allow our framework to capture jointly a variety of forces of interest emphasized in the RL and cognitive science literature, as follows.

The first term in the objective of equation (12) reflects the agent’s benefit of exploitation, or “greediness” in the RL language. This force incentivizes the agent to choose the action a_{t} with the current highest estimated value \widehat{Q}_{t}(a_{t},s_{t}), where the latter is given in equation (9). The second term captures a possible cognitive dissonance cost (Aronson (1969), Akerlof and Dickens (1982)). Essentially, when the primitive disutility parameter w>0, there is a disutility cost of uncertainty over the values associated with each action. This cost generates a benefit of reasoning that goes purely through the reduction of that dissonance and the disutility of facing uncertainty about the Q-function. This mechanism is qualitatively similar to the one used in Ilut and Valchev (2023), and is distinct from the novel exploitation-experimentation tradeoff described below.

The third term in the objective function measures the cognitive cost of reasoning as proportional to the information content of the signals \eta_{t}^{R}, with the information flow quantified as the reduction in entropy achieved by acquiring these signals, following a standard information theory approach, e.g. Sims (2003a). The reasoning cost captures the fact that increasing the reasoning intensity (e.g. mentally simulating more paths forward) requires higher cognitive effort. As in the cognitive control literature (e.g. Kool et al. (2017), Botvinick et al. (2004), Botvinick and Cohen (2014)), the constant marginal cost \kappa>0 can be interpreted as the opportunity cost of cognitive capacity, which can otherwise be employed on other, outside-the-model tasks. (Thus, \kappa will be higher if individuals have a higher opportunity cost of cognitive capacity or if their particular deliberation process behind \eta_{t}^{R} takes longer to produce a given amount of information. In addition, \kappa will also be higher if the environment is more complex and it is thus objectively harder to come up with insights on the unknown optimal Q-function.)

In turn, the interpretation and aim of the constraint in (13) is to capture a desire for experimentation. As in typical bandit problems, as long as the value function Q_{\pi^{*}}(a,s) is uncertain, there is a desire to explore the function in other parts of the state space. Following the RL literature, we parsimoniously model this desire to experiment using the idea of “entropy regularization”, or maximum entropy RL (e.g. Mnih et al. (2016), Haarnoja et al. (2017), Eysenbach and Levine (2019)). This regularization requires the entropy of the action distribution (the LHS of equation (13)) to be bigger than some minimum threshold, thus ensuring randomization. With this constraint, the chosen action distribution \pi_{t}(a|s_{t}) is not degenerate, but always puts some probability on exploring any feasible action, thus capturing the exploration incentive inherent to dynamic learning problems. Importantly, in our framework this entropy constraint builds in that experimentation is valuable only to the extent to which the Q-function is uncertain; indeed, if Q_{\pi^{*}}(a,s_{t}) were known, there would be no point in experimenting as there is nothing further to learn. Hence, when the experimentation parameter h>0, the RHS of equation (13) sets the entropy lower bound proportional to the remaining subjective uncertainty over Q_{\pi^{*}}(a,s_{t}), as measured by \sum_{a}\sigma_{t}^{2}(a,s_{t}).
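
To make the cost-benefit structure concrete, here is a small helper (purely illustrative; the parameter values w, \kappa and h, the grid representation, and the function names are our own assumptions) that evaluates the three terms of objective (12) and checks constraint (13) for a candidate action distribution and a candidate posterior covariance at the current state.

```python
import numpy as np

def objective_and_constraint(pi, Q_hat_state, Sigma_E_state, Sigma_post_state, w, kappa, h):
    """
    Evaluate objective (12) and constraint (13) at the current state s_t for a
    candidate action distribution pi and candidate posterior covariance Sigma_post_state.
    """
    exploitation = pi @ Q_hat_state                              # sum_a Q_hat_t(a,s_t) pi_t(a|s_t)
    sigma2 = np.diag(Sigma_post_state)                           # sigma_t^2(a, s_t)
    dissonance_cost = w * sigma2.sum()                           # cognitive dissonance cost

    # reasoning cost: (kappa/2) * ln(|Sigma_t^E| / |Sigma_t|), the entropy reduction
    _, logdet_E = np.linalg.slogdet(Sigma_E_state)
    _, logdet_post = np.linalg.slogdet(Sigma_post_state)
    reasoning_cost = 0.5 * kappa * (logdet_E - logdet_post)

    entropy = -np.sum(pi * np.log(np.clip(pi, 1e-300, None)))    # LHS of (13)
    constraint_satisfied = entropy >= h * sigma2.sum()           # RHS of (13)
    return exploitation - dissonance_cost - reasoning_cost, constraint_satisfied
```

Reasoning enters twice: it lowers the dissonance term directly and it relaxes the entropy constraint, at the price of the reasoning-cost term.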

2.4.3 Optimal action policy

Let \delta_{t} denote the Lagrange multiplier on the constraint in equation (13). Given the set of feasible actions in B(s_{t}), the optimal action policy is

\widehat{\pi}_{t}(a|s_{t})=\begin{cases}\dfrac{\exp\left(\widehat{Q}_{t}(a,s_{t})/\delta_{t}\right)}{\sum_{a^{\prime}}\exp\left(\widehat{Q}_{t}(a^{\prime},s_{t})/\delta_{t}\right)}&\text{ if }\delta_{t}>0\\ \widehat{\pi}_{t}^{greedy}(a|s_{t})&\text{ if }\delta_{t}=0\end{cases}   (14)

where \widehat{\pi}_{t}^{greedy}(a|s_{t}) is the degenerate greedy policy

\widehat{\pi}_{t}^{greedy}(a|s_{t})=\begin{cases}1,&\text{for }a=\widetilde{a}_{t}\equiv\arg\max_{a_{t}\in B(s_{t})}\widehat{Q}_{t}(a_{t},s_{t})\\ 0,&\forall\,a\neq\widetilde{a}_{t}\end{cases}   (15)

Softmax. The optimal policy in (14) takes the form of the softmax function widely used in statistics and machine learning. This function takes the expected rewards of following any given action and transforms them into action probabilities (e.g. Sutton and Barto (2018)). However, in that literature \delta_{t}, known as the ‘temperature’ parameter, is typically exogenous. The lower is \delta_{t}, the more the expected rewards affect the probability of taking actions. In our model, \delta_{t} is endogenous and state- and time-dependent. Specifically, it will be high for states s_{t} where the agent faces higher uncertainty (and thus \sum_{a}\sigma_{t}^{2}(a,s_{t}) is high).
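
A minimal sketch of how the endogenous temperature can be computed in practice (our own construction, not the paper's algorithm): when the entropy floor in (13) is positive, pick the temperature \delta_{t} at which the softmax entropy just meets the floor, and fall back to the greedy policy when the floor is zero. The bisection tolerance and parameter values are illustrative assumptions.

```python
import numpy as np

def softmax_policy(Q_hat_state, delta):
    """Softmax over actions with temperature delta, as in eq. (14)."""
    z = (Q_hat_state - Q_hat_state.max()) / delta        # stabilized exponentiation
    p = np.exp(z)
    return p / p.sum()

def entropy(p):
    return -np.sum(p * np.log(np.clip(p, 1e-300, None)))

def optimal_action_policy(Q_hat_state, sigma2_state, h):
    """Action distribution of eq. (14) with the entropy floor of eq. (13)."""
    floor = h * sigma2_state.sum()                       # RHS of constraint (13)
    if floor <= 0.0:                                     # no residual uncertainty: greedy policy
        p = np.zeros_like(Q_hat_state)
        p[np.argmax(Q_hat_state)] = 1.0
        return p, 0.0
    floor = min(floor, np.log(len(Q_hat_state)) - 1e-9)  # entropy cannot exceed ln(N_a)
    lo, hi = 1e-8, 1e8                                   # softmax entropy is increasing in delta
    for _ in range(200):                                 # bisect on a log scale
        mid = np.sqrt(lo * hi)
        if entropy(softmax_policy(Q_hat_state, mid)) < floor:
            lo = mid
        else:
            hi = mid
    delta = np.sqrt(lo * hi)
    return softmax_policy(Q_hat_state, delta), delta

# Higher remaining uncertainty raises the entropy floor and hence the temperature.
pi, delta = optimal_action_policy(np.array([1.0, 0.6, 0.2]), np.array([0.3, 0.3, 0.3]), h=1.0)
print(pi, delta)
```

This makes visible the property stressed above: as subjective uncertainty shrinks, the floor and hence \delta_{t} fall, and the policy converges toward the greedy one.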

In economics, a similar softmax function describes the multinomial logit model of choice (e.g. Luce (1959)), typically micro-founded from a random utility model where valuations of actions are subject to additively separable and independently distributed shocks drawn from the extreme-value distribution (e.g. McFadden (1974)). Instead, the optimal policy \widehat{\pi}_{t}(a|s_{t}) proposed here does not rely on such random utility shocks.

A complementary micro-foundation for logit choice in the literature is based on rational inattention (Sims (2003b)). In that work (e.g. Matějka and McKay (2014) in economics, or the policy compression work of Lai and Gershman (2021) in cognitive sciences), the entropy of the action distribution, i.e. the LHS of equation (13), speaks to an information flow constraint across actions, given signals about the rewards associated with those actions. Here, instead, the entropy of the action distribution speaks to the entropy regularization logic in RL, per above.

To see that comparison, consider setting h=0 in constraint (13). In that case, the constraint would not be binding and the optimal action policy becomes \widehat{\pi}_{t}^{greedy}(a|s_{t}). Critically, even without a role for the entropy of the action distribution, when \kappa>0 there is still an entropy-based informational cost on reasoning signals in objective (12). So, when h=0, reasoning is still imperfect here, but given imperfect posterior Bayesian estimates \widehat{Q}_{t}(a,s_{t}), the agent would take the perceived greedy action with probability one. The extreme case of h=0 would thus have no role for exploration, a fundamental feature of dynamic learning problems. (This discussion also implies that the framework could extend the RHS of constraint (13) to be c+h\sum_{a}\sigma_{t}^{2}(a,s_{t}). Even if uncertainty over \widehat{Q}_{t}(a,s_{t}) were to disappear, when the new parameter c>0 it would still be difficult for the agent to compare all the \widehat{Q}_{t}(a,s_{t}) at the current state s_{t} for all possible actions, generating stochastic choice as in the policy compression logic. That integrated softmax could then nest both policy compression and experimentation sources.)

At another extreme, consider the limit of costless reasoning, i.e. \kappa=0. In that case, \sigma_{t}^{2}(a,s_{t})=0 as cognitive effort is chosen to be infinitely high, and agents learn Q_{\pi^{*}}(a,s) perfectly. As there is no remaining uncertainty, the entropy regularization constraint (13) would not bind, \delta_{t}=0, and the optimal action follows the greedy policy. The objective in equation (12) would then recover exactly the problem solved under full rationality (see equations (1) and (2)), since \widehat{Q}_{t}(a,s)=Q_{\pi^{*}}(a,s).

2.4.4 Optimal reasoning signal structure

The optimal reasoning choice consists of the agent deciding on the matrix \Omega_{t} and the signal noise variance matrix \Sigma_{\eta,t}. To solve this problem, we use insights from rate-distortion theory.

Let \lambda_{t}^{(i)} denote the eigenvalues of the posterior variance \widehat{\Sigma}_{t}(a,a^{\prime};s_{t}). Using the standard property that a matrix determinant equals the product of its eigenvalues, \ln(|\widehat{\Sigma}_{t}(a,a^{\prime};s_{t})|)=\sum_{i}\ln(\lambda_{t}^{(i)}), it follows that optimal reasoning can be expressed as choosing the resulting eigenvalues \lambda_{t}^{(i)} optimally.

For notational convenience, it is useful to sort the eigenvalues in descending order, so that

\lambda_{t}^{(1)}\geq\lambda_{t}^{(2)}\geq\dots\geq\lambda_{t}^{(N)}

By standard properties of variance matrices, \Sigma_{\eta,t} is positive definite, and thus the (sorted) eigenvalues of the posterior variance \widehat{\Sigma}_{t}(a,a^{\prime};s_{t}) must be weakly smaller than those of the “prior” variance \widehat{\Sigma}_{t}^{E}(a,a^{\prime};s_{t}); hence \lambda_{t}^{(i)}\leq\lambda_{E,t}^{(i)}, where \lambda_{E,t}^{(i)} denote the sorted eigenvalues of \widehat{\Sigma}_{t}^{E}(a,a^{\prime};s_{t}). Taking this constraint into account and optimizing the objective function over \lambda_{t}^{(i)}, the optimal reasoning choice implies

\lambda_{t}^{(i)}=\min\left\{\frac{\kappa}{w+\delta_{t}h},\ \lambda_{E,t}^{(i)}\right\}   (16)

Hence, the agent has an optimal target level for uncertainty and wants to reduce only eigenvalues that are currently bigger than the threshold \frac{\kappa}{w+\delta_{t}h}, a property known as reverse ‘water filling’ in rate-distortion theory (see Cover and Thomas (1999), chapter 13). Thus, we can show that the optimal signal structure is for \Omega_{t} to be the matrix of eigenvectors of \widehat{\Sigma}_{t}^{E}(a,a^{\prime};s_{t}), and for \Sigma_{\eta,t} to be a diagonal matrix with entries given below. (In practice, the eigenvalues are likely to decline in value quickly, so only a few of these signals will typically have non-infinite noise variance, making the solution very tractable and easy to implement.)

\sigma_{i,t}^{2}=\begin{cases}\dfrac{\kappa\lambda_{E,t}^{(i)}}{\lambda_{E,t}^{(i)}(w+\delta_{t}h)-\kappa}&\text{when}\quad\lambda_{E,t}^{(i)}>\dfrac{\kappa}{w+\delta_{t}h}\\ \infty&\text{when}\quad\lambda_{E,t}^{(i)}\leq\dfrac{\kappa}{w+\delta_{t}h}\end{cases}   (17)
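
The reverse water-filling solution in equations (16)-(17) is easy to implement numerically; the sketch below (illustrative, with assumed parameter values and function names) eigendecomposes the post-experience covariance at the current state, caps the eigenvalues at the target \kappa/(w+\delta_{t}h), and backs out the implied signal noise variances.

```python
import numpy as np

def optimal_reasoning(Sigma_E_state, w, h, kappa, delta):
    """
    Reverse water-filling choice of reasoning signals, eqs. (16)-(17): signals load
    on the eigenvectors of Sigma^E at the current state, and only eigenvalues above
    the target kappa / (w + delta * h) are reduced down to it.
    """
    lam_E, Omega = np.linalg.eigh(Sigma_E_state)          # eigenvalues (ascending) and eigenvectors
    lam_E, Omega = lam_E[::-1], Omega[:, ::-1]            # sort in descending order

    target = kappa / (w + delta * h)
    lam_post = np.minimum(lam_E, target)                  # eq. (16)

    # eq. (17): implied noise variance of each signal (infinite = signal not acquired)
    sigma2_noise = np.full_like(lam_E, np.inf)
    acquire = lam_E > target
    sigma2_noise[acquire] = kappa * lam_E[acquire] / (lam_E[acquire] * (w + delta * h) - kappa)

    Sigma_post = Omega @ np.diag(lam_post) @ Omega.T      # end-of-period posterior variance
    return Omega, sigma2_noise, Sigma_post
```

Feeding these loadings and noise variances into the reasoning update above reproduces the targeted posterior eigenvalues, which is a useful internal consistency check when implementing the model.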

Put together, equations (14) and (16) uncover a tight, intuitive interaction between actions and reasoning intensity. For example, take states s_{t} where uncertainty is high and thus the endogenous \delta_{t} is larger. By (14), a tighter constraint induces the agent to explore more and thus deviate from the action with the currently highest perceived \widehat{Q}_{t}(a,s_{t}). On the other hand, reasoning lowers uncertainty and thus relaxes this constraint, lowering \delta_{t} and leading to a policy function \pi_{t}(a|s_{t}) that selects actions a with a higher expected payoff \widehat{Q}_{t}(a,s_{t}).

Looping back to the choice of how much cognitive effort to invest in reasoning, from (16) we see that when \delta_{t} is larger, the agent employs more cognition to achieve a greater reduction in uncertainty (the optimal target for \lambda_{t}^{(i)} is lower). Intuitively, she values that reduction in uncertainty precisely because it allows her to select an action closer to the currently perceived greedy policy, not having to worry about experimentation.

Overall, the framework is well suited to apply to a range of concrete economic environments, some of which we actively investigate in our own work in progress. For example, a standard consumption-savings problem in a simple Aiyagari (1994) framework is a transparent and widely studied setting in economics. Indeed, the consumption-saving decision is a fundamental mechanism in a number of different economic settings, and has been used as a laboratory for other recent bounded rationality and behavioral papers, such as Ilut and Valchev (2023) and Lian (2023). Due to its core elements, the bounded rationality framework proposed here appears promising for endogenously delivering empirically documented properties of consumption that are puzzling for standard fully-rational models: (a) experience effects of past income shocks on future consumption choices; (b) high sensitivity of consumption to income shocks; and (c) large heterogeneity in consumption responses, through stochastic signals and choice.

References

  • Ackley et al. (1985) Ackley, David H, Geoffrey E Hinton, and Terrence J Sejnowski, “A learning algorithm for Boltzmann machines,” Cognitive science, 1985, 9 (1), 147–169.
  • Aiyagari (1994) Aiyagari, S Rao, “Uninsured idiosyncratic risk and aggregate saving,” The Quarterly Journal of Economics, 1994, 109 (3), 659–684.
  • Akerlof and Dickens (1982) Akerlof, George A and William T Dickens, “The economic consequences of cognitive dissonance,” The American economic review, 1982, 72 (3), 307–319.
  • Alaoui and Penta (2022) Alaoui, Larbi and Antonio Penta, “Cost-benefit analysis in reasoning,” Journal of Political Economy, 2022, 130 (4), 881–925.
  • Aragones et al. (2005) Aragones, Enriqueta, Itzhak Gilboa, Andrew Postlewaite, and David Schmeidler, “Fact-Free Learning,” The American Economic Review, 2005, 95, pp–1355.
  • Aronson (1969) Aronson, Elliot, “The theory of cognitive dissonance: A current perspective,” in “Advances in experimental social psychology,” Vol. 4, Elsevier, 1969, pp. 1–34.
  • Barber (2012) Barber, David, Bayesian reasoning and machine learning, Cambridge University Press, 2012.
  • Barberis and Jin (2023) Barberis, Nicholas C and Lawrence J Jin, “Model-free and model-based learning as joint drivers of investor behavior,” 2023. NBER Working Paper 31081.
  • Bardhi (2022) Bardhi, Arjada, “Optimal discovery and influence through selective sampling,” 2022. Duke University, mimeo.
  • Bertsekas (2019) Bertsekas, Dimitri P, Reinforcement learning and optimal control, Athena Scientific Belmont, MA, 2019.
  • Botvinick and Cohen (2014) Botvinick, Matthew M and Jonathan D Cohen, “The computational and neural basis of cognitive control: charted territory and new frontiers,” Cognitive science, 2014, 38 (6), 1249–1285.
  • Botvinick et al. (2004) Botvinick, Matthew M, Jonathan D Cohen, and Cameron S Carter, “Conflict monitoring and anterior cingulate cortex: an update,” Trends in cognitive sciences, 2004, 8 (12), 539–546.
  • Callander (2011) Callander, Steven, “Searching and learning by trial and error,” American Economic Review, 2011, 101 (6), 2277–2308.
  • Chater et al. (2006) Chater, Nick, Joshua B Tenenbaum, and Alan Yuille, “Probabilistic models of cognition: Conceptual foundations,” Trends in cognitive sciences, 2006, 10 (7), 287–291.
  • Cover and Thomas (1999) Cover, Thomas M and Joy Thomas, Elements of information theory, John Wiley & Sons, 1999.
  • Daw et al. (2005) Daw, Nathaniel D, Yael Niv, and Peter Dayan, “Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control,” Nature neuroscience, 2005, 8 (12), 1704–1711.
  • Dearden et al. (1998) Dearden, Richard, Nir Friedman, and Stuart Russell, “Bayesian Q-learning,” Aaai/iaai, 1998, 1998, 761–768.
  • Dew-Becker and Nathanson (2019) Dew-Becker, Ian and Charles G Nathanson, “Directed attention and nonparametric learning,” Journal of Economic Theory, 2019, 181, 461–496.
  • Engel et al. (2003) Engel, Yaakov, Shie Mannor, and Ron Meir, “Bayes meets Bellman: The Gaussian process approach to temporal difference learning,” in “Proceedings of the 20th International Conference on Machine Learning (ICML-03)” 2003, pp. 154–161.
  • Engel et al. (2005) Engel, Yaakov, Shie Mannor, and Ron Meir, “Reinforcement learning with Gaussian processes,” in “Proceedings of the 22nd International Conference on Machine Learning” 2005, pp. 201–208.
  • Eysenbach and Levine (2019) Eysenbach, Benjamin and Sergey Levine, “If MaxEnt RL is the answer, what is the question?,” arXiv preprint arXiv:1910.01913, 2019.
  • Gabaix (2014) Gabaix, Xavier, “A sparsity-based model of bounded rationality,” The Quarterly Journal of Economics, 2014, 129 (4), 1661–1710.
  • Gershman et al. (2015) Gershman, Samuel J, Eric J Horvitz, and Joshua B Tenenbaum, “Computational rationality: A converging paradigm for intelligence in brains, minds, and machines,” Science, 2015, 349 (6245), 273–278.
  • Ghavamzadeh et al. (2015) Ghavamzadeh, Mohammad, Shie Mannor, Joelle Pineau, Aviv Tamar et al., “Bayesian reinforcement learning: A survey,” Foundations and Trends® in Machine Learning, 2015, 8 (5-6), 359–483.
  • Griffiths et al. (2008) Griffiths, Thomas L, Charles Kemp, and Joshua B Tenenbaum, “Bayesian models of cognition,” in “The Cambridge Handbook of Computational Psychology,” Cambridge University Press, 2008.
  • Griffiths et al. (2015)   , Falk Lieder, and Noah D Goodman, “Rational use of cognitive resources: Levels of analysis between the computational and the algorithmic,” Topics in Cognitive Science, 2015, 7 (2), 217–229.
  • Griffiths et al. (2010)   , Nick Chater, Charles Kemp, Amy Perfors, and Joshua B Tenenbaum, “Probabilistic models of cognition: Exploring representations and inductive biases,” Trends in Cognitive Sciences, 2010, 14 (8), 357–364.
  • Haarnoja et al. (2017) Haarnoja, Tuomas, Haoran Tang, Pieter Abbeel, and Sergey Levine, “Reinforcement learning with deep energy-based policies,” in “International Conference on Machine Learning” PMLR 2017, pp. 1352–1361.
  • Ilut and Valchev (2023) Ilut, Cosmin and Rosen Valchev, “Economic agents as imperfect problem solvers,” The Quarterly Journal of Economics, 2023, 138 (1), 313–362.
  • Ilut et al. (2020)   ,   , and Nicolas Vincent, “Paralyzed by Fear: Rigid and Discrete Pricing under Demand Uncertainty,” Econometrica, 2020, 88 (5), 1899–1938.
  • Judd (1998) Judd, Kenneth L, Numerical Methods in Economics, MIT Press, 1998.
  • Kaelbling et al. (1996) Kaelbling, Leslie Pack, Michael L Littman, and Andrew W Moore, “Reinforcement learning: A survey,” Journal of Artificial Intelligence Research, 1996, 4, 237–285.
  • Kirkpatrick et al. (1983) Kirkpatrick, Scott, C Daniel Gelatt Jr, and Mario P Vecchi, “Optimization by simulated annealing,” Science, 1983, 220 (4598), 671–680.
  • Kool et al. (2017) Kool, Wouter, Samuel J Gershman, and Fiery A Cushman, “Cost-benefit arbitration between multiple reinforcement-learning systems,” Psychological Science, 2017, 28 (9), 1321–1333.
  • Lai and Gershman (2021) Lai, Lucy and Samuel J Gershman, “Policy compression: An information bottleneck in action selection,” in “Psychology of Learning and Motivation,” Vol. 74, Elsevier, 2021, pp. 195–232.
  • Lee et al. (2014) Lee, Sang Wan, Shinsuke Shimojo, and John P O’Doherty, “Neural computations underlying arbitration between model-based and model-free learning,” Neuron, 2014, 81 (3), 687–699.
  • Lian (2023) Lian, Chen, “Mistakes in future consumption, high MPCs now,” American Economic Review: Insights, 2023, 5 (4), 563–581.
  • Lin et al. (2018) Lin, Junyang, Xu Sun, Xuancheng Ren, Muyu Li, and Qi Su, “Learning when to concentrate or divert attention: Self-adaptive attention temperature for neural machine translation,” arXiv preprint arXiv:1808.07374, 2018.
  • Liu et al. (2011) Liu, Weifeng, Jose C Principe, and Simon Haykin, Kernel adaptive filtering: a comprehensive introduction, Vol. 57, John Wiley & Sons, 2011.
  • Luce (1959) Luce, R Duncan, Individual choice behavior: A theoretical analysis, John Wiley, 1959.
  • Malmendier (2021) Malmendier, Ulrike, “Experience effects in finance: Foundations, applications, and future directions,” Review of Finance, 2021, 25 (5), 1339–1363.
  • Malmendier and Nagel (2016)    and Stefan Nagel, “Learning from inflation experiences,” The Quarterly Journal of Economics, 2016, 131 (1), 53–87.
  • Matějka and McKay (2014) Matějka, Filip and Alisdair McKay, “Rational inattention to discrete choices: A new foundation for the multinomial logit model,” The American Economic Review, 2014, 105 (1), 272–298.
  • McFadden (1974) McFadden, Daniel, “Conditional Logit Analysis of Qualitative Choice Behavior,” in “Frontiers in Econometrics,” Academic Press, 1974.
  • Mnih et al. (2016) Mnih, Volodymyr, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in “International Conference on Machine Learning” PMLR 2016, pp. 1928–1937.
  • Moerland et al. (2023) Moerland, Thomas M, Joost Broekens, Aske Plaat, Catholijn M Jonker et al., “Model-based reinforcement learning: A survey,” Foundations and Trends® in Machine Learning, 2023, 16 (1), 1–118.
  • Rasmussen and Williams (2006) Rasmussen, Carl Edward and Christopher KI Williams, Gaussian processes for machine learning, MIT Press, 2006.
  • Schulz et al. (2018) Schulz, Eric, Emmanouil Konstantinidis, and Maarten Speekenbrink, “Putting bandits into context: How function learning supports decision making,” Journal of Experimental Psychology: Learning, Memory, and Cognition, 2018, 44 (6), 927.
  • Shenhav et al. (2017) Shenhav, Amitai, Sebastian Musslick, Falk Lieder, Wouter Kool, Thomas L Griffiths, Jonathan D Cohen, and Matthew M Botvinick, “Toward a rational and mechanistic account of mental effort,” Annual Review of Neuroscience, 2017, 40, 99–124.
  • Sims (2003) Sims, Christopher A, “Implications of rational inattention,” Journal of Monetary Economics, 2003, 50 (3), 665–690.
  • Sutton (1988) Sutton, Richard S, “Learning to predict by the methods of temporal differences,” Machine Learning, 1988, 3, 9–44.
  • Sutton (1990)   , “Integrated architectures for learning, planning, and reacting based on approximating dynamic programming,” in “Machine Learning Proceedings 1990,” Elsevier, 1990, pp. 216–224.
  • Sutton (1991)   , “Dyna, an integrated architecture for learning, planning, and reacting,” ACM SIGART Bulletin, 1991, 2 (4), 160–163.
  • Sutton and Barto (2018)    and Andrew G Barto, Reinforcement learning: An introduction, MIT Press, 2018.
  • Thompson et al. (2011) Thompson, Valerie A, Jamie A Prowse Turner, and Gordon Pennycook, “Intuition, reason, and metacognition,” Cognitive Psychology, 2011, 63 (3), 107–140.
  • Wang and Ni (2020) Wang, Yufei and Tianwei Ni, “Meta-sac: Auto-tune the entropy temperature of soft actor-critic via metagradient,” arXiv preprint arXiv:2007.01932, 2020.
  • Woodford (2020) Woodford, Michael, “Modeling imprecision in perception, valuation, and choice,” Annual Review of Economics, 2020, 12, 579–601.
  • Wu et al. (2021) Wu, Charley M, Eric Schulz, and Samuel J Gershman, “Inference and search on graph-structured spaces,” Computational Brain & Behavior, 2021, 4, 125–147.
  • Wu et al. (2018)   ,   , Maarten Speekenbrink, Jonathan D Nelson, and Björn Meder, “Generalization guides human exploration in vast decision spaces,” Nature Human Behaviour, 2018, 2 (12), 915–924.