
Approximate Inference for Stochastic Planning in Factored Spaces

Zhennan Wu and Roni Khardon
Abstract

Stochastic planning can be reduced to probabilistic inference in large discrete graphical models, but hardness of inference requires approximation schemes to be used. In this paper we argue that such applications can be disentangled along two dimensions. The first is the direction of information flow in the idealized exact optimization objective, i.e., forward vs. backward inference. The second is the type of approximation used to compute this objective, e.g., Belief Propagation (BP) vs. mean field variational inference (MFVI). This new categorization allows us to unify a large body of isolated efforts in prior work, explaining their connections and differences as well as potential improvements. An extensive experimental evaluation over large stochastic planning problems shows the advantage of forward BP over several algorithms based on MFVI. An analysis of the practical limitations of MFVI motivates a novel algorithm, collapsed state variational inference (CSVI), which provides a tighter approximation and achieves planning performance comparable to forward BP.

1 Introduction

The connection between planning and probabilistic inference is well known, and multiple reductions exist showing how inference algorithms can be used to solve stochastic planning problems. Such reductions are equivalent when one can perform exact inference, but this is not typically the case for challenging planning problems that have many state variables, a.k.a. factored spaces, where approximate inference schemes are introduced. The planning and reinforcement learning literatures include multiple such efforts where different algorithmic frameworks are combined with different approximation schemes. For example, constructions exist through weighted model counting [Domshlak and Hoffmann, 2006], several forms of variational inference (e.g., [Toussaint and Storkey, 2006, Levine, 2018]), and several forms of belief propagation (e.g., [Liu and Ihler, 2012, Cui et al., 2019]). However, it is not clear how different algorithmic approaches are related to one another and how the choice of approach interacts with the choice of approximation scheme.

The paper makes three contributions. First, we provide a unified scheme that connects previous approaches along two dimensions, using either forward or backward reasoning, and choosing what approximation to use, where we address Belief Propagation (BP) and mean field variational inference (MFVI). This allows us to put prior work in a unified framework that explains the choices made by the corresponding algorithms. In particular, our analysis shows that Forward MFVI, which is used in some papers, can be understood to run multiple iterations of Backward MFVI and thus provides a tighter approximation. Second, through extensive experiments over large planning problems, we show that forward reasoning with Belief Propagation provides the best performance among these algorithms, that MFVI provides poor performance in some domains, and that modifying MFVI using exponentiated rewards helps in some cases but not sufficiently. We also analyze the failures of MFVI experimentally, pointing to sensitivity in updates. Third, based on this analysis, we propose a novel algorithm, Collapsed State Variational Inference (CSVI), which uses mean field with collapsed variational inference where state variables are integrated out. CSVI is motivated theoretically by its tighter variational approximation, and we show empirically that it matches the performance of Forward Belief Propagation. This shows that while a naive application of mean field for planning fails, other variational approximations like CSVI can yield strong planning performance. Due to space constraints some technical details and experiment results are omitted from the main paper and are provided in [Wu and Khardon, 2022], which we refer to below as the appendix.

2 Problem Formulation

We consider finite horizon MDPs as specifications of planning problems. Such specifications are often compiled from a high level description language, but this is orthogonal to the discussion in the paper. Specifically, consider Markov Decision Processes $\langle\mathcal{S},p(s_{0}),\mathcal{A},\mathcal{P},\mathcal{R},T,\gamma\rangle$, where $\mathcal{S}$ denotes the state space, $p(s_{0})$ is a distribution over start states, $\mathcal{A}$ denotes the action space, $\mathcal{P}$ denotes the transition probability $p(s_{t+1}|s_{t},a_{t})$, $\mathcal{R}$ denotes the reward function $R(s_{t},a_{t})$, $T$ denotes the horizon, and $\gamma$ is the discount factor. In this paper we set $\gamma=1$, but this does not significantly affect any of the formulations. A solution is given by a policy, $\pi$, that specifies $p_{\theta_{t}}(a_{t}|s_{t})$ with policy parameters $\theta=\{\theta_{t}\}$, allowing for non-stationary policies. The task in planning is to find a policy that maximizes the expected cumulative reward $\mathbb{E}[\sum_{t=0}^{T-1}R(s_{t},a_{t})]$, where the expectation is taken w.r.t. trajectories generated from the MDP with policy $\pi$, that is, $s_{0}\sim p(s_{0})$, $a_{t}\sim p_{\theta_{t}}(a_{t}|s_{t})$, and $s_{t}\sim p(s_{t}|s_{t-1},a_{t-1})$.

In this paper we follow recent practice in stochastic planning and use the online planning framework, where in a state $s$, the algorithm computes for a limited time to pick an action $a$, uses $a$ to control the MDP to get to the next state, and repeats this process. Online planning is often used with receding horizon control, where the planner uses a $T$ step lookahead in its search and then extracts the first action $a$ to be applied in $s$.
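As a concrete illustration, the following minimal sketch shows the shape of this online planning loop; `env_step` and `plan_action` are hypothetical placeholders standing in for the MDP simulator and for any of the approximate inference planners discussed below.

```python
# Minimal sketch of online planning with receding horizon control.
# `env_step` and `plan_action` are hypothetical placeholders: the former
# simulates one MDP transition, the latter stands for any approximate
# inference planner discussed in this paper.

def run_episode(s0, env_step, plan_action, total_steps=40, lookahead=9):
    """Plan, apply the first action, observe the next state, and repeat."""
    s, total_reward = s0, 0.0
    for t in range(total_steps):
        T = min(lookahead, total_steps - t)  # T-step lookahead, clipped near the end
        a = plan_action(s, T)                # compute for a limited time, pick an action
        s, r = env_step(s, a)                # use a to control the MDP
        total_reward += r
    return total_reward
```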

Solving a finite horizon MDP is equivalent to solving an inference problem in the corresponding Dynamic Bayesian Network (DBN), or more precisely in the dynamic decision network. We assume a factored form of states consisting of binary state variables, i.e., $s_{t}=(s_{t}^{1},\cdots,s_{t}^{M})$. We also assume a factored action representation $a_{t}=(a_{t}^{1},\cdots,a_{t}^{N})$.

The MDP formulation above requires real-valued reward nodes in the DBN. To facilitate inference one can replace these nodes with constructions that use only binary variables, and various such constructions appear in the literature. In the following we develop one such construction and use it in our experiments. We introduce binary reward random variables $r_{t}$ to capture the reward after taking the action in the previous time step $t-1$, the distribution of which is defined as

p(r_{t}=1|s_{t-1}=s,a_{t-1}=a)=\frac{R(s_{t-1}=s,a_{t-1}=a)}{\max_{s,a}R(s,a)}. (1)

We can then define $\tilde{R}$ where $p(\tilde{R}=1|r_{1},\cdots,r_{T})=\frac{\sum_{t=1}^{T}r_{t}}{T}$ to capture the cumulative reward. We call the resulting DBN the intermediate representation. However, $\tilde{R}$ has $T$ parents, which hinders efficient inference. To avoid the use of $\tilde{R}$, we introduce cumulative reward binary random variables $c_{t}$. To keep the consistency of the graphical structure, we create an auxiliary node $c_{0}\equiv 1$, and for $t>0$ the distribution of $c_{t}$ is defined recursively depending on the previous cumulative reward $c_{t-1}$ and the current reward $r_{t}$:

p(c_{t}=1|c_{t-1},r_{t})=\frac{(t-1)c_{t-1}+r_{t}}{t}. (2)

In many planning problems the reward is given as an additive function over a set of small factors. For such problems we introduce another chain of binary reward variables within each time step using a similar construction. This yields a DBN that only includes binary variables with a small number of parents. As the following proposition shows, the three constructions, using the original cumulative reward, using $\tilde{R}$, and using $c_{T}$, are equivalent. Further details and proofs are given in the appendix.

Proposition 1

The construction satisfies $\mathbb{E}[\sum_{t=0}^{T-1}R(s_{t},a_{t})]\propto\mathbb{E}(\tilde{R})\propto\mathbb{E}(c_{T})$ where expectations are w.r.t. trajectories as above.
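The construction can be checked numerically. The following sketch is a Monte Carlo check of Proposition 1 on a small hypothetical MDP; the transition matrix, rewards, horizon, and uniform open-loop policy are arbitrary choices made only for illustration.

```python
import numpy as np

# Monte Carlo check of Proposition 1 on a toy 2-state / 2-action MDP.
rng = np.random.default_rng(0)
T = 5
P = rng.dirichlet(np.ones(2), size=(2, 2))   # P[s, a] = next-state distribution
R = rng.uniform(0.0, 3.0, size=(2, 2))       # nonnegative rewards R[s, a]
R_max = R.max()

def sample_chain():
    s, c, cum_reward = 0, 1, 0.0             # c_0 = 1 by construction
    for t in range(1, T + 1):
        a = rng.integers(2)                  # uniform (open loop) policy
        cum_reward += R[s, a]
        r = rng.binomial(1, R[s, a] / R_max)           # Eq. (1)
        c = rng.binomial(1, ((t - 1) * c + r) / t)     # Eq. (2)
        s = rng.choice(2, p=P[s, a])
    return cum_reward, c

rewards, cs = map(np.array, zip(*[sample_chain() for _ in range(100000)]))
# E[c_T] matches E[sum_t R(s_t, a_t)] / (T * R_max) up to sampling noise.
print(rewards.mean() / (T * R_max), cs.mean())
```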

3 Planning Through Inference

In the following we restrict our discussion to open loop policies, that is, $p_{\theta_{t}}(a_{t}|s_{t})=p_{\theta_{t}}(a_{t})$, where the policy is time dependent but does not depend on the state (other than $s_{0}$ if it is fixed). Thus for an action sequence $A=\{a_{0},\ldots,a_{T-1}\}$, we have $p_{\theta}(A)=\prod_{t}p_{\theta_{t}}(a_{t})$. This covers most of the previous work on planning as inference in the literature. The extension to standard policies is straightforward but requires more complex algorithms for optimization.

3.1 Forward Backward Framework

We now present a simple framework that captures many algorithms in the literature. For the discussion below note that some algorithms optimize policy parameters $\theta$ and then choose the actions, whereas others optimize the action sequence $A$ directly.

The Backward Framework: Observe that if $\theta$ is the uniform distribution, $u$, then

\mathop{\mathrm{arg\,max}}_{A}p(c_{T}=1|A)=\mathop{\mathrm{arg\,max}}_{A}\frac{p_{u}(A|c_{T}=1)p_{u}(c_{T}=1)}{p_{u}(A)}=\mathop{\mathrm{arg\,max}}_{A}p_{u}(A|c_{T}=1) (3)

where the second equality is true because $p_{u}(A)$ is a fixed constant for all $A$ and $p_{u}(c_{T}=1)$ does not depend on $A$. This suggests that we can optimize $p(c_{T}=1|A)$ by optimizing $p_{u}(A|c_{T}=1)$. Since calculating $p_{u}(A|c_{T}=1)$ is hard, the backward framework optimizes an approximation of $p_{u}(A|c_{T}=1)$. The choice of different approximations $q_{\phi}(A)$ gives different concrete algorithms. This is captured in Algorithm 1.

Algorithm 1 Backward Inference
1. Calculate $q_{\phi}(A)\approx p_{u}(A|c_{T}=1)$
2. Pick $A=\mathop{\mathrm{arg\,max}}_{A}q_{\phi}(A)$

The Forward Framework: in contrast, the forward approach aims to directly optimize $p_{\theta}(c_{T}=1)$ w.r.t. the policy parameters (or alternatively, $p(c_{T}=1|A)$, but we focus on the more general case). Approximating $p_{\theta}(c_{T}=1)$ with a score function $sc(\theta)$ defined on policy parameters yields the forward framework. In the ideal case, maximizing $sc(\theta)$ will give us a delta function, directly selecting a concrete sequence $A$. If not, we can use $\mathop{\mathrm{arg\,max}}$ or sample from the corresponding distribution. This is captured in Algorithm 2.

Algorithm 2 Forward Inference
1. Define a score function $sc(\theta)\triangleq sc(c_{T}=1|\theta)\approx p_{\theta}(c_{T}=1)$
2. Optimize $\theta$ to maximize the score function.
3. Pick $A$ using $p_{\theta}(A)$

3.2 Forward and Backward Loopy Belief Propagation

The forward and backward algorithms can be combined with any approximation scheme. We start by considering loopy BP (LBP) algorithms [Pearl, 1988, Kschischang et al., 2001]. For this construction we translate the DBN into a factor graph using standard constructions. For backward LBP, we instantiate $c_{T}=1$ as evidence, fix the factors corresponding to $\theta$ to be the uniform distribution, and run LBP to calculate the marginal probabilities on action variables. That is, $q_{\phi}(A)$ is given by the output of LBP. Note that this is algorithmically simple because we do not need a separate optimization step aside from Belief Propagation. However, LBP may need many iterations to converge or may not converge at all.

For the forward algorithm, we define $sc(\theta)$ to be the approximate marginal of $p_{\theta}(c_{T})$ computed by LBP. However, LBP does not optimize $\theta$. As discussed below, multiple techniques for optimizing $\theta$ for LBP exist in the literature. In the experiments we use the SOGBOFA system [Cui et al., 2019] that combines LBP with gradient based search.

3.3 Forward and Backward Mean Field Variational Inference

The idea in variational inference is to minimize the KL divergence between the approximate posterior and the true posterior over latent variables, i.e., in our case

d_{KL}(q_{\phi}(S,A,R,C_{\backslash T})||p_{\theta}(S,A,R,C_{\backslash T}|c_{T}=1))

where the latent variables are $S$, $A$, $R$, $C$, that is, the sequences of state, action, reward, and cumulative reward variables, where $C_{\backslash T}$ excludes $c_{T}$. This is equivalent to maximizing the evidence lower bound (ELBO). In our case the ELBO is given in the next equation, where in the mean field approximation $q_{\phi}$ is a product of independent factors

\log p_{\theta}(c_{T}=1)\geq\mathbb{E}_{q_{\phi}}[\log\frac{p_{\theta}(S,A,R,C_{\backslash T},c_{T}=1)}{q_{\phi}(S,A,R,C_{\backslash T})}]=:ELBO_{\theta,\phi}. (4)

For backward MFVI, note that $p_{u}(A|c_{T}=1)$ is the marginal distribution of the true posterior $p_{u}(S,A,R,C_{\backslash T}|c_{T}=1)$. Therefore we first maximize $ELBO_{\phi,\theta=u}$ to obtain $q_{\phi}(S,A,R,C_{\backslash T})$ and then set $q_{\phi}(A)$ to be the corresponding marginal. Detailed update equations for MFVI are given in the appendix.

For forward MFVI, we can pick $sc(\theta)=ELBO_{\phi,\theta}\approx\log p_{\theta}(c_{T}=1)$, where we need to optimize both $\phi$ and $\theta$. For this, the standard approach is the variational Expectation Maximization algorithm, which optimizes $\phi$ in the E step and $\theta$ in the M step. To elaborate on the algorithm, note that the ELBO can be reformulated as follows:

ELBO_{\theta,\phi}=\mathbb{E}_{q_{\phi}}[\log\frac{p(S,R,C_{\backslash T},c_{T}=1|A)}{q_{\phi}(S,R,C_{\backslash T}|A)}]-d_{KL}(q_{\phi}(A)||p_{\theta}(A)) (5)

where the first term does not depend on $\theta$. Therefore:

  • In the E step, we maximize $ELBO_{\theta,\phi}$ w.r.t. $\phi$. Note that this is exactly as in the Backward Algorithm but under a general $\theta$.

  • In the M step, we keep $q_{\phi}$ fixed and optimize $ELBO_{\theta,\phi}$ w.r.t. $\theta$. From Eq. (5) we see that this is equivalent to minimizing $d_{KL}(q_{\phi}(A)||p_{\theta}(A))$. If $q_{\phi}(A)$ and $p_{\theta}(A)$ are from the same class of distributions, this step assigns $\theta\leftarrow\phi$.

From the procedure, we have the following observation.

Remark 2

For the mean field approximation, the forward algorithm is an iterative process that alternates the backward algorithm with policy updates.

This connection was not observed in prior work, where the forward and backward algorithms are not clearly distinguished. Finally, as pointed out by Toussaint and Storkey [2006], the E step is analogous to policy evaluation (except that we calculate marginals for many variables besides the reward) and the M step is analogous to policy improvement, so forward MFVI can be seen as an approximate version of Policy Iteration.
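The following sketch spells out this variational EM view (Remark 2); `backward_mfvi` is a hypothetical placeholder that runs the backward mean field updates of the appendix under the current policy parameters and returns the resulting action marginals.

```python
# Sketch of forward MFVI as variational EM: it alternates the backward
# algorithm (E step) with a policy update (M step). `backward_mfvi(theta)`
# is a hypothetical placeholder returning the converged action marginals
# q_phi(A) obtained by conditioning on c_T = 1 under policy theta.

def forward_mfvi(theta_init, backward_mfvi, num_outer_iters=3):
    theta = theta_init
    for _ in range(num_outer_iters):
        q_phi_A = backward_mfvi(theta)  # E step: backward MFVI under current theta
        theta = q_phi_A                 # M step: minimizes KL(q_phi(A) || p_theta(A))
    return theta
```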

4 Related Work

The idea of using inference for stochastic planning has a long history and has attracted many different approaches. For example, Cooper [1988] showed how inference can be used for decision making in influence diagrams, Domshlak and Hoffmann [2006] use an approach based on weighted model counting, Nitti et al. [2015] use a probabilistic programming formulation, and Lee et al. [2021] use anytime marginal MAP solvers for planning problems.

Several groups have developed approaches that follow the forward variational framework, going back to Dayan and Hinton [1997]. This idea is often developed by defining a reward weighted path distribution, which is similar to conditioning on $c_{T}=1$ in our framework, and developing algorithms from this formulation [Furmston and Barber, 2010, 2011, Toussaint and Storkey, 2006, Kumar et al., 2015]. We note, however, that these works did not explicitly address factoring over state and action variables.

On the other hand, some papers in robotics and reinforcement learning (RL) [Toussaint, 2009, Kappen et al., 2012, Levine, 2018] follow the backward variational framework. In contrast with the discussion above, they use a formulation where the reward over trajectories is exponentiated. As shown by Levine [2018], this modifies the original optimization objective by adding a term with the expected entropy of the policy, and hence solves a slightly different problem, but the entropy term may be beneficial for exploration in RL. In addition, the work of Neumann [2011] uses the forward variational algorithm, but with an exponentiated reward and additional sampling-based approximations. We can see that the forward and backward variational approaches have been widely used but have not been differentiated before. Our analysis above clarifies the relationship between these approaches.

For the case of BP approximation, Murphy and Weiss [2001] proposed the Factored Frontier Algorithm, which is a forward BP method for marginal inference, and Boyen and Koller [1998] developed approximation bounds for forward inference. The works of Liu and Ihler [2012] and Kiselev and Poupart [2014] follow the forward BP framework, but develop a generalized belief propagation algorithm that solves both the optimization and expectation steps using message passing. The work of Cui et al. [2019] also follows the forward BP framework but decouples the expectation, which is done through BP, from the optimization, which uses an approximation based on gradient search.

Several works have made additional assumptions on the structure of the DBN in their discussion of graph-based MDPs. Cheng et al. [2013] extend the algorithm of Liu and Ihler [2012] to this case. Peyrard and Sabbadin [2006] and Sabbadin et al. [2012] use the Mean Field approximation method but only use it to approximate the distribution over state variables. They then use the approximate distribution to approximate steps of the Policy Iteration algorithm. Hence their algorithm is different from MFVI in that reward variables are not included in the variational approximation. Finally, our work can be seen to extend the comparison of Mean Field and Loopy BP for general inference tasks [Weiss, 2001]. As in this early work, our experiments show that optimization of variational objectives can lead to local optima and that BP can provide some advantage.

5 Experiments and Analysis of MFVI & Belief Propagation Algorithms

Figure 1: First and second columns: algorithm comparisons on 60 problem instances, averaged over 12 simulations on each instance. Third column: comparing MFVI variants with and without state updates and the contributions of variable groups to the increase in the ELBO (in Skill teaching, instance 1, step 2 of execution).

This section presents an experimental evaluation of the algorithms. The code for regenerating all the results is available on GitHub (https://github.com/Zhennan-Wu/AISPFS). Our goal in this paper is to understand the quality of decisions provided by different approximate inference schemes, ignoring implementation details. Therefore, during the experiments we do not limit run time but instead allow the algorithms to converge, within bounds given below, before proposing a decision. We chose 6 problem domains from the ICAPS 2011 International Probabilistic Planning Competition to conduct our experiments. Each domain has 10 instances with factorized structure, a horizon of 40, and a discount factor of 1, and instances differ by the number of state and action variables. For our experiments we use the SPUDD [Hoey et al., 1999] translation of the original RDDL [Sanner et al., 2010] specification, which compiles away action factoring. This simplifies the implementation because it removes the need to reconcile action constraints with factoring. To control our overall experimental time we use online planning with receding horizon control, where we set the search horizon to be the minimum of 9 and the number of remaining time steps.

Algorithmic parameters for MFVI: we perform at most 100 variational updates and stop early if the infinity norm of the difference between consecutive approximate distributions is less than 0.1. We perform 3 outer iterations, i.e., policy updates, for the forward version.

Algorithmic parameters for BP variants: We use SOGBOFA [Cui et al., 2019] (https://github.com/hcui01/SOGBOFA) as forward Loopy BP, fixing the search depth to 9 and limiting the number of gradient updates to 500. We note that SOGBOFA has outperformed other planners, including search-based planners, on IPPC 2018 problems and is a state-of-the-art baseline for the evaluation. For the backward algorithm, our implementation is based on Zhou et al. [2022] with parallel message updates and a bound of 100 iterations with no damping ($\beta=0$).

Normalized mean $\pm$ one standard deviation of the cumulative reward over 12 simulations is shown in all the plots. Denote the mean value and standard deviation of the cumulative reward of algorithm $a$ on instance $i$ by $\bar{r}_{i}(a)$ and $\sigma_{i}(a)$, respectively. To facilitate comparisons across domains we report scores normalized relative to the random policy. Specifically, for algorithm $a$ on instance $i$, $\mbox{score-mean}_{i}(a)=\frac{|\bar{r}_{i}(a)-\bar{r}_{i}(RANDOM)|}{|\bar{r}_{i}(RANDOM)|}$ and $\mbox{score-std}_{i}(a)=\frac{\sigma_{i}(a)}{|\bar{r}_{i}(RANDOM)|}$, so that the random algorithm has score 0 and higher scores indicate better performance. For reference, the raw results are given in the appendix.
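For concreteness, the normalization amounts to the following computation (a direct transcription of the formulas above; the function name is ours).

```python
# Normalized score of an algorithm on one instance, relative to the random policy.
def normalized_score(mean_r, std_r, mean_random):
    score_mean = abs(mean_r - mean_random) / abs(mean_random)
    score_std = std_r / abs(mean_random)
    return score_mean, score_std
```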

Comparison of Algorithms

Results are shown in the left column of Figure 1. The top plot shows that the search direction is important for BP: the forward algorithm (SOGBOFA) outperforms the backward algorithm. (While our focus is on the quality of approximation, it is worth noting that Cui et al. [2018] have shown that with a directed model, equivalent to the Forward Framework with no downstream evidence as in our case, LBP converges in one iteration; thus the forward algorithm is also faster.) In contrast, the second plot shows that for MFVI there is no significant difference between the forward and backward variants. This is an interesting result because, as shown above, the forward algorithms mimic Policy Iteration and provide a tighter approximation. The third plot compares MFVI to BP, showing that MFVI has poor performance in some problems and forward BP dominates in all problems. Finally, we can show (see the appendix) that the exponentiated variant of MFVI can be captured in our framework by conditioning on all reward and cumulative reward variables. The bottom plot compares this variant to standard MFVI. We see that the performance improves in two domains, but the exponentiated variant is still dominated by forward BP.

Exploring the performance of MFVI

We believe that the main reason for the failure of MFVI is the interaction between the flexibility that the mean field approximation allows over the many state variables and the sensitivity to the ordering of updates due to local optima. To explore this we performed several additional experiments. In the first, we introduce a new variant algorithm, MFVI-NoS, which does not update the marginal distribution over state variables, i.e., keeps them at the initialized value of $0.5$. Results for two domains are shown in the top half of the third column of Figure 1. We see that while the NoS variant restricts the algorithm, it improves the performance in these domains (this does not happen in all domains). Another view of this phenomenon is given by the relative contribution of each group of variables to the increase in the ELBO during updates of the variational parameters. The bottom half of the third column of Figure 1 visualizes this for the MFVI and MFVI-NoS variants in one problem. We see that for MFVI the largest increase in the ELBO is contributed by adjusting state variables, and the NoS variant increases the share of the other variables. We further explore this in the full paper using an artificial problem, showing that in this case limiting the flexibility of MFVI can lead to a better posterior, and that MFVI is sensitive to the choice of which subset of state variables is updated as well as to the order of updates.

6 CSVI

Motivated by the analysis above, we propose a new algorithm for variational inference in planning. Instead of treating all the latent nodes in the DBN in the same manner and computing approximate distributions over all these variables, the algorithm focuses on the action variables and effectively marginalizes out the other variables to achieve a tighter ELBO. This type of approach is known as collapsed variational inference, which has been shown to be effective in models where the marginalization can be done analytically (e.g., Teh et al. [2006]), but for planning one has to resolve additional computational challenges, as we show below. Specifically, we propose to use the following provably tighter ELBO

\log p_{\theta}(c_{T}=1)=\log\mathbb{E}_{p_{\theta}(A)}[p(c_{T}=1|A)]\geq\mathbb{E}_{q_{\phi}}[\log\frac{p_{\theta}(c_{T}=1,A)}{q_{\phi}(A)}]. (6)

Here we have the same factorized transitions and policy distribution. However, we do not compute approximate distributions over state, reward, and cumulative reward variables. With mean field, the standard solution [Bishop, 2006] yields the update equation

\log q_{\phi}(a_{t}^{l})\propto\mathbb{E}_{q_{\phi\backslash a_{t}^{l}}(A)}\log p_{\theta}(c_{T}=1,A)=\mathbb{E}_{q_{\phi\backslash a_{t}^{l}}(A)}\log g_{\theta}(A) (7)
where
g_{\theta}(A)=\mathbb{E}_{S,R,C_{\setminus T}}[p_{\theta}(A,S,R,C,c_{T}=1)]. (9)

The tighter approximation appears to yield an infeasible update, because $A$ is entangled in $g(\cdot)$ and we must perform an explicit marginalization in $g(\cdot)$ for each update.

We next show how the update equation can be approximated via sampling. The key is to first extract $p_{\theta}(A)$ from the expectation. We therefore have:

\log g_{\theta}(A)=\log p_{\theta}(A)+\log\mathbb{E}_{S,R,C_{\setminus T}}[p_{\theta}(S,R,C,c_{T}=1|A)]. (10)

Recall that $p_{\theta}(A)$ is a product of independent terms. This implies that the first part can be substituted with $\log p_{\theta}(a_{t}^{l})$, since all other terms are constants w.r.t. the variable of interest in (7) and they will vanish in the normalized update of $q_{\phi}(a_{t}^{l})$. The second part is conditioned on $A$ and does not include $p(A)$ terms. Its expectation can be estimated through sampling. In particular, sampling can be done intuitively as follows: keeping $a_{t}^{l}$ fixed, sample the action sequence from the approximate distribution $q_{\phi\backslash a_{t}^{l}}(A)$. Then complement this by sampling values for the $s_{t}$, $r_{t}$, $c_{t}$ nodes, including $c_{T}$. The resulting values for $c_{T}$ are generated from the correct distribution, and the average over $c_{T}$ gives an estimate of the expectation. Since we are sampling and averaging inside the logarithm this yields biased estimates for the updates, but this type of biased estimate has been shown to work in other cases in machine learning (e.g., [Wei et al., 2021]) and the bias can be mitigated by taking sufficient samples. It is interesting to note from the above update that the policy distribution serves as a weighting bias in the action update procedure. Algorithm 3 summarizes the update procedure.

Algorithm 3 Collapsed State Variational Inference
1:for $t=1,2,\ldots,T$ do
2:     for $l=1,2,\ldots,N$ do
3:         for value of action variable $l$ at time $t$ fixed to be $0,1$ do
4:              for action sequence sample index $i=1,\ldots,M_{1}$ do
5:                  Sample action sequence $A=a_{1},\ldots,a_{T}$ from $q_{\phi}$
6:                  for trajectory sample $=1,\ldots,M_{2}$ do
7:                       Sample and record cumulative reward variable $c_{T}$ from $g_{\theta}(A)$
8:                  end for
9:                  Estimate $\hat{p}_{i}=\#(c_{T}=1)/M_{2}$
10:              end for
11:              Calculate $\log q_{\phi}(a_{t}^{l})\propto\log p_{\theta}(a_{t}^{l})+\sum_{i}(\log\hat{p}_{i})/M_{1}$
12:         end for
13:         Update $q_{\phi}(a_{t}^{l})$ by calculating the normalizing factor
14:     end for
15:end for
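A minimal sketch of one sweep of Algorithm 3 is given below. The trajectory simulator `sample_cT` and the policy prior `log_p_theta` are hypothetical placeholders: `sample_cT(A)` is assumed to sample the state, reward, and cumulative reward chain of the DBN for the action sequence `A` and return the sampled value of $c_{T}$, and the epsilon clipping of $\log\hat{p}_{i}$ is an implementation detail added here to guard against zero estimates.

```python
import numpy as np

def csvi_sweep(q, log_p_theta, sample_cT, M1=20, M2=50, eps=1e-6, rng=None):
    """One sweep of CSVI updates over q[t, l, v] = q_phi(a_t^l = v), v in {0, 1}."""
    rng = rng or np.random.default_rng()
    T, N = q.shape[:2]
    for t in range(T):
        for l in range(N):
            log_q = np.zeros(2)
            for v in (0, 1):                       # clamp a_t^l to each value in turn
                est = 0.0
                for _ in range(M1):
                    # sample the other action variables from q_phi \ a_t^l
                    A = (rng.random((T, N)) < q[:, :, 1]).astype(int)
                    A[t, l] = v
                    # estimate p(c_T = 1 | A) from M2 sampled trajectories
                    p_hat = np.mean([sample_cT(A) for _ in range(M2)])
                    est += np.log(max(p_hat, eps)) / M1
                log_q[v] = log_p_theta(t, l, v) + est
            q[t, l] = np.exp(log_q - log_q.max())
            q[t, l] /= q[t, l].sum()               # normalize q_phi(a_t^l)
    return q
```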

Performance of CSVI

For CSVI our implementation uses the same parameters as in MFVI, except that we make at most 10 variational updates. The sample sizes are set to $M_{1}=20$ and $M_{2}=50$. Results are shown in the middle column of Figure 1. Considering the plots from top to bottom, we observe that there is no significant difference between forward and backward variants of CSVI and that CSVI is significantly better than MFVI. The third plot shows that the exponentiated reward variant does not improve the performance of CSVI. This suggests that the improvement from the exponentiated variant of MFVI is due to stabilizing the optimization rather than presenting a better objective. The fourth plot shows that the performance of CSVI is competitive with forward BP, and therefore CSVI provides state-of-the-art performance in stochastic planning.

7 Conclusion

In this paper we provide a unified scheme that categorizes many previous approaches along two dimensions, using either forward or backward reasoning and choosing an approximation scheme. Specifically, we focus on belief propagation and mean field variational inference as the approximation choices. In this context, we illustrate the advantage of Forward Loopy BP as providing the best performance. Algorithms based on MFVI perform poorly in some domains. They are improved by exponential reward weighting, but not sufficiently so. An experimental analysis points to sensitivity of the optimization as a source of this failure. Motivated by this analysis we propose a novel algorithm, Collapsed State Variational Inference, which provides a tighter variational approximation and, while being computationally demanding, performs competitively with Forward Loopy BP. The results highlight that while BP has been less in focus in recent years, it provides a strong baseline for stochastic planning. They also show the importance of focusing variational approximations on the variables of interest, as done in CSVI, and the potential for developing strong variational algorithms for planning. These observations suggest interesting directions for future work, including developing efficient variants of CSVI, using amortized variational inference in planning to improve CSVI, alternative schemes to capture the posterior distributions in VI, and developing tighter approximations and optimization algorithms through BP methods.

Acknowledgements

This work was supported by NSF under grant IIS-1906694 and grant IIS-2002393. Some of the experiments in this paper were run on the Big Red 3 computing system at Indiana University, supported in part by Lilly Endowment, Inc., through its support for the Indiana University Pervasive Technology Institute.

References

  • Bishop [2006] Christopher M. Bishop. Pattern recognition and machine learning. Springer, 2006.
  • Boyen and Koller [1998] X. Boyen and D. Koller. Tractable inference for complex stochastic processes. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 33–42, 1998.
  • Cheng et al. [2013] Qiang Cheng, Qiang Liu, Feng Chen, and Alexander T Ihler. Variational planning for graph-based MDPs. Advances in Neural Information Processing Systems, 26, 2013.
  • Cooper [1988] Gregory F. Cooper. A method for using belief networks as influence diagrams. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 1988.
  • Cui et al. [2015] Hao Cui, Roni Khardon, Alan Fern, and Prasad Tadepalli. Factored MCTS for large scale stochastic planning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
  • Cui et al. [2018] Hao Cui, Radu Marinescu, and Roni Khardon. From stochastic planning to marginal MAP. Advances in Neural Information Processing Systems, 31, 2018.
  • Cui et al. [2019] Hao Cui, Thomas Keller, and Roni Khardon. Stochastic planning with lifted symbolic trajectory optimization. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 29, pages 119–127, 2019.
  • Dayan and Hinton [1997] Peter Dayan and Geoffrey E Hinton. Using expectation-maximization for reinforcement learning. Neural Computation, 9(2):271–278, 1997.
  • Domshlak and Hoffmann [2006] Carmel Domshlak and Jörg Hoffmann. Fast probabilistic planning through weighted model counting. In International Conference on Automated Planning and Scheduling, pages 243–252, 2006.
  • Furmston and Barber [2010] Thomas Furmston and David Barber. Variational methods for reinforcement learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 241–248, 2010.
  • Furmston and Barber [2011] Thomas Furmston and David Barber. Efficient inference in Markov control problems. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 221–229, 2011.
  • Hoey et al. [1999] Jesse Hoey, Robert St-Aubin, Alan Hu, and Craig Boutilier. SPUDD: stochastic planning using decision diagrams. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, pages 279–288, 1999.
  • Kappen et al. [2012] Hilbert J Kappen, Vicenç Gómez, and Manfred Opper. Optimal control as a graphical model inference problem. Machine learning, 87(2):159–182, 2012.
  • Kiselev and Poupart [2014] Igor Kiselev and Pascal Poupart. POMDP planning by marginal-MAP probabilistic inference in generative models. In Proceedings of the 2014 AAMAS Workshop on Adaptive Learning Agents, 2014.
  • Kschischang et al. [2001] F.R. Kschischang, B.J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, 2001.
  • Kumar et al. [2015] Akshat Kumar, Shlomo Zilberstein, and Marc Toussaint. Probabilistic inference techniques for scalable multiagent decision making. Journal of Artificial Intelligence Research, 53:223–270, 2015.
  • Lee et al. [2021] Junkyu Lee, Radu Marinescu, and Rina Dechter. Submodel decomposition bounds for influence diagrams. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
  • Levine [2018] Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
  • Liu and Ihler [2012] Qiang Liu and Alexander Ihler. Belief propagation for structured decision making. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 523–532, 2012.
  • Murphy and Weiss [2001] Kevin Murphy and Yair Weiss. The factored frontier algorithm for approximate inference in DBNs. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 378–385, 2001.
  • Neumann [2011] Gerhard Neumann. Variational inference for policy search in changing situations. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pages 817–824, 2011.
  • Nitti et al. [2015] Davide Nitti, Vaishak Belle, and Luc De Raedt. Planning in discrete and continuous Markov decision processes by probabilistic programming. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 327–342. Springer, 2015.
  • Pearl [1988] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, 1988.
  • Peyrard and Sabbadin [2006] Nathalie Peyrard and Régis Sabbadin. Mean field approximation of the policy iteration algorithm for graph-based Markov decision processes. In Proceedings of the 17th European Conference on Artificial Intelligence (ECAI), Riva del Garda, Italy, pages 595–599, 2006.
  • Sabbadin et al. [2012] Régis Sabbadin, Nathalie Peyrard, and Nicklas Forsell. A framework and a mean-field algorithm for the local control of spatial processes. International Journal of Approximate Reasoning, 53(1):66–86, 2012.
  • Sanner et al. [2010] Scott Sanner et al. Relational dynamic influence diagram language (RDDL): Language description. Unpublished ms., Australian National University, 32:27, 2010.
  • Teh et al. [2006] Yee Teh, David Newman, and Max Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. Advances in neural information processing systems, 19, 2006.
  • Toussaint [2009] Marc Toussaint. Robot trajectory optimization using approximate inference. In Proceedings of the 26th annual international conference on machine learning, pages 1049–1056, 2009.
  • Toussaint and Storkey [2006] Marc Toussaint and Amos J. Storkey. Probabilistic inference for solving discrete and continuous state Markov decision processes. In Machine Learning, Proceedings of the Twenty-Third International Conference, volume 148, pages 945–952, 2006.
  • Wei et al. [2021] Yadi Wei, Rishit Sheth, and Roni Khardon. Direct loss minimization for Bayesian predictors. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2021.
  • Weiss [2001] Yair Weiss. Comparing the mean field method and belief propagation for approximate inference in MRFs. In Advanced Mean Field Methods: Theory and Practice, pages 229–239. MIT Press, 2001.
  • Wu and Khardon [2022] Zhennan Wu and Roni Khardon. Approximate inference for stochastic planning in factored spaces. arXiv, 2203.12139, 2022.
  • Zhou et al. [2022] Guangyao Zhou, Nishanth Kumar, Antoine Dedieu, Miguel Lázaro-Gredilla, Shrinu Kushagra, and Dileep George. PGMax: Factor Graphs for Discrete Probabilistic Graphical Models and Loopy Belief Propagation in JAX. arXiv preprint arXiv:2202.04110, 2022.

Appendix A Equivalence of Different Reward Formulations

In the MDP framework, we are trying to maximize the cumulative reward. We first show that this is captured by the sum of binary reward variables.

U(\pi_{\theta}) = \sum_{\begin{subarray}{c}a_{0}\\ s_{t}\in\mathcal{S}\\ a_{t}\in\mathcal{A}\\ t\in 1,\cdots,T\end{subarray}}\Big[\prod_{t=1}^{T}p(s_{t}|s_{t-1},a_{t-1})p_{\theta}(a_{t-1}|s_{t-1})\Big]\big(\sum_{h=1}^{T}R(s_{h},a_{h})\big)
= \sum_{h=1}^{T}\sum_{\begin{subarray}{c}a_{0}\\ s_{t}\in\mathcal{S}\\ a_{t}\in\mathcal{A}\\ t\in 1,\cdots,h\end{subarray}}\prod_{t=1}^{h}p(s_{t}|s_{t-1},a_{t-1})p_{\theta}(a_{t-1}|s_{t-1})R(s_{h},a_{h})
= \sum_{h=1}^{T}\mathbb{E}_{\theta}(R(s_{h},a_{h})) (11)
\propto \sum_{h=1}^{T}\mathbb{E}_{\theta}(r_{h}). (12)

A.1 Cumulative Binary Reward Over Time

Under the intermediate DBN setting, we need to calculate the expectation of the total reward $\tilde{R}$. We have:

\mathbb{E}_{\theta}(\tilde{R}) = p_{\theta}(\tilde{R}=1)
= \sum_{\begin{subarray}{c}a_{0}\\ s_{t}\in\mathcal{S}\\ a_{t}\in\mathcal{A}\\ r_{t}\in\{0,1\}\\ t=1,\cdots T\end{subarray}}\Big[\big[\prod_{t=1}^{T}p(s_{t}|s_{t-1},a_{t-1})p_{\theta}(a_{t-1}|s_{t-1})p(r_{t}|s_{t-1},a_{t-1})\big]p(\tilde{R}=1|r_{1},\cdots,r_{T})\Big]
\propto \sum_{\begin{subarray}{c}a_{0}\\ s_{t}\in\mathcal{S}\\ a_{t}\in\mathcal{A}\\ r_{t}\in\{0,1\}\\ t=1,\cdots T\end{subarray}}\Big[\big[\prod_{t=1}^{T}p(s_{t}|s_{t-1},a_{t-1})p_{\theta}(a_{t-1}|s_{t-1})p(r_{t}|s_{t-1},a_{t-1})\big]\frac{\sum_{j=1}^{T}r_{j}}{T}\Big]
= \sum_{j=1}^{T}\sum_{\begin{subarray}{c}a_{0}\\ s_{t}\in\mathcal{S}\\ a_{t}\in\mathcal{A}\\ r_{t}\in\{0,1\}\\ t=1,\cdots T\end{subarray}}\Big[\big[\prod_{t=1}^{T}p(s_{t}|s_{t-1},a_{t-1})p_{\theta}(a_{t-1}|s_{t-1})p(r_{t}|s_{t-1},a_{t-1})\big]\frac{r_{j}}{T}\Big]
= \frac{1}{T}\sum_{j=1}^{T}\sum_{\begin{subarray}{c}a_{0}\\ s_{t}\in\mathcal{S}\\ a_{t}\in\mathcal{A}\\ r_{t}\in\{0,1\}\\ t=1,\cdots T\end{subarray}}\Big[\big[\prod_{\begin{subarray}{c}t=1\\ t\neq j\end{subarray}}^{T}p(s_{t}|s_{t-1},a_{t-1})p_{\theta}(a_{t-1}|s_{t-1})p(r_{t}|s_{t-1},a_{t-1})\big]
\qquad p(s_{j}|s_{j-1},a_{j-1})p_{\theta}(a_{j-1}|s_{j-1})p(r_{j}=1|s_{j-1},a_{j-1})\Big] (13)
Marginalize out all terms with index $i>j$:
= \frac{1}{T}\sum_{j=1}^{T}\sum_{\begin{subarray}{c}a_{0}\\ s_{t}\in\mathcal{S}\\ a_{t}\in\mathcal{A}\\ r_{t}\in\{0,1\}\\ t=1,\cdots T\end{subarray}}\Big[\big[\prod_{t=1}^{j-1}p(s_{t}|s_{t-1},a_{t-1})p_{\theta}(a_{t-1}|s_{t-1})p(r_{t}|s_{t-1},a_{t-1})\big]
\qquad p(s_{j}|s_{j-1},a_{j-1})p_{\theta}(a_{j-1}|s_{j-1})p(r_{j}=1|s_{j-1},a_{j-1})\Big] (14)
Marginalize out all the reward terms with index $i\neq j$:
= \frac{1}{T}\sum_{j=1}^{T}\sum_{\begin{subarray}{c}a_{0}\\ s_{t}\in\mathcal{S}\\ a_{t}\in\mathcal{A}\\ t=1,\cdots j\end{subarray}}\Big[\big[\prod_{t=1}^{j}p(s_{t}|s_{t-1},a_{t-1})p_{\theta}(a_{t-1}|s_{t-1})\big]p(r_{j}=1|s_{j-1},a_{j-1})\Big]
= \frac{1}{T}\sum_{j=1}^{T}\mathbb{E}_{\theta}(r_{j}) (15)
\propto \sum_{h=1}^{T}\mathbb{E}_{\theta}(r_{h}). (16)

Therefore the intermediate DBN with $\tilde{R}$ is also captured by the sum of binary reward variables.

For our final proposed DBN, we have the expectation:

\mathbb{E}_{\theta}(c_{T}=1) = p_{\theta}(c_{T}=1)
= \sum_{\begin{subarray}{c}a_{0}\\ s_{t}\in\mathcal{S}\\ a_{t}\in\mathcal{A}\\ r_{t},c_{t}\in\{0,1\}\\ t=1,\cdots T-1\end{subarray}}\Big[\prod_{t=1}^{T-1}p(s_{t}|s_{t-1},a_{t-1})p_{\theta}(a_{t-1}|s_{t-1})p(r_{t}|s_{t-1},a_{t-1})p(c_{t}|c_{t-1},r_{t})
\qquad p(s_{T}|s_{T-1},a_{T-1})p_{\theta}(a_{T-1}|s_{T-1})p(r_{T}|s_{T-1},a_{T-1})p(c_{T}=1|c_{T-1},r_{T})\Big]
= \sum_{\begin{subarray}{c}a_{0}\\ s_{t}\in\mathcal{S}\\ a_{t}\in\mathcal{A}\\ r_{t},c_{t}\in\{0,1\}\\ t=1,\cdots T-1\end{subarray}}\Big[\prod_{t=1}^{T-1}p(s_{t}|s_{t-1},a_{t-1})p_{\theta}(a_{t-1}|s_{t-1})p(r_{t}|s_{t-1},a_{t-1})p(c_{t}|c_{t-1},r_{t})
\qquad p(s_{T}|s_{T-1},a_{T-1})p_{\theta}(a_{T-1}|s_{T-1})p(r_{T}|s_{T-1},a_{T-1})\frac{(T-1)c_{T-1}+r_{T}}{T}\Big]
= \sum_{\begin{subarray}{c}a_{0}\\ s_{t}\in\mathcal{S}\\ a_{t}\in\mathcal{A}\\ r_{t},c_{t}\in\{0,1\}\\ t=1,\cdots T-1\end{subarray}}\Big[\prod_{t=1}^{T-1}p(s_{t}|s_{t-1},a_{t-1})p_{\theta}(a_{t-1}|s_{t-1})p(r_{t}|s_{t-1},a_{t-1})p(c_{t}|c_{t-1},r_{t})
\qquad p(s_{T}|s_{T-1},a_{T-1})p_{\theta}(a_{T-1}|s_{T-1})p(r_{T}|s_{T-1},a_{T-1})\frac{T-1}{T}c_{T-1}\Big]
\quad+ \sum_{\begin{subarray}{c}a_{0}\\ s_{t}\in\mathcal{S}\\ a_{t}\in\mathcal{A}\\ r_{t},c_{t}\in\{0,1\}\\ t=1,\cdots T-1\end{subarray}}\Big[\prod_{t=1}^{T-1}p(s_{t}|s_{t-1},a_{t-1})p_{\theta}(a_{t-1}|s_{t-1})p(r_{t}|s_{t-1},a_{t-1})p(c_{t}|c_{t-1},r_{t})
\qquad p(s_{T}|s_{T-1},a_{T-1})p_{\theta}(a_{T-1}|s_{T-1})p(r_{T}|s_{T-1},a_{T-1})\frac{r_{T}}{T}\Big] (17)
Marginalizing out $s_{T}$, $a_{T-1}$ and $r_{T}$ in the first part of the equation:
\triangleq \frac{T-1}{T}A_{T-1}+\frac{1}{T}B_{T} (18)

where

A_{T-1} = \sum_{\begin{subarray}{c}a_{0}\\ s_{t}\in\mathcal{S}\\ a_{t}\in\mathcal{A}\\ r_{t},c_{t}\in\{0,1\}\\ t=1,\cdots T-2\end{subarray}}\Big[\prod_{t=1}^{T-2}p(s_{t}|s_{t-1},a_{t-1})p_{\theta}(a_{t-1}|s_{t-1})p(r_{t}|s_{t-1},a_{t-1})p(c_{t}|c_{t-1},r_{t})
\qquad p(s_{T-1}|s_{T-2},a_{T-2})p(r_{T-1}|s_{T-2},a_{T-2})p(c_{T-1}=1|c_{T-2},r_{T-1})\Big] (19)
B_{T} = \sum_{\begin{subarray}{c}a_{0}\\ s_{t}\in\mathcal{S}\\ a_{t}\in\mathcal{A}\\ r_{t},c_{t}\in\{0,1\}\\ t=1,\cdots T-1\end{subarray}}\Big[\prod_{t=1}^{T-1}p(s_{t}|s_{t-1},a_{t-1})p_{\theta}(a_{t-1}|s_{t-1})p(r_{t}|s_{t-1},a_{t-1})p(c_{t}|c_{t-1},r_{t})
\qquad p(s_{T}|s_{T-1},a_{T-1})p(a_{T-1}|s_{T-1})p(r_{T}|s_{T-1},a_{T-1})r_{T}\Big]. (20)

Notice that

p_{\theta}(c_{T}=1)=A_{T}. (21)

For part $B_{T}$, since $r_{T}$ is binary,

B_{T} = \sum_{\begin{subarray}{c}a_{0}\\ s_{t}\in\mathcal{S}\\ a_{t}\in\mathcal{A}\\ r_{t},c_{t}\in\{0,1\}\\ t=1,\cdots T-1\end{subarray}}\Big[\prod_{t=1}^{T-1}p(s_{t}|s_{t-1},a_{t-1})p_{\theta}(a_{t-1}|s_{t-1})p(r_{t}|s_{t-1},a_{t-1})p(c_{t}|c_{t-1},r_{t})
\qquad p(s_{T}|s_{T-1},a_{T-1})p(a_{T-1}|s_{T-1})p(r_{T}=1|s_{T-1},a_{T-1})\Big] (22)
Marginalize out $c_{t}$, $r_{t}$ for $t=1,\cdots,T-1$:
= \sum_{\begin{subarray}{c}a_{0}\\ s_{t}\in\mathcal{S}\\ a_{t}\in\mathcal{A}\\ t=1,\cdots T\end{subarray}}\Big[\prod_{t=1}^{T}p(s_{t}|s_{t-1},a_{t-1})p(a_{T-1}|s_{T-1})p(r_{T}=1|s_{T-1},a_{T-1})\Big]
= \mathbb{E}_{\theta}(r_{T}). (23)

Given these observations we have the recursive equation

A_{T} = \frac{T-1}{T}A_{T-1}+\frac{1}{T}B_{T}
= \frac{T-1}{T}A_{T-1}+\frac{1}{T}\mathbb{E}_{\theta}(r_{T})
= \frac{T-1}{T}\big(\frac{T-2}{T-1}A_{T-2}+\frac{1}{T-1}\mathbb{E}_{\theta}(r_{T-1})\big)+\frac{1}{T}\mathbb{E}_{\theta}(r_{T})
= \frac{T-2}{T}A_{T-2}+\frac{1}{T}\big(\mathbb{E}_{\theta}(r_{T})+\mathbb{E}_{\theta}(r_{T-1})\big)
= \cdots
= \frac{2}{T}A_{2}+\frac{1}{T}\sum_{t=3}^{T}\mathbb{E}_{\theta}(r_{t})
\propto \sum_{h=1}^{T}\mathbb{E}_{\theta}(r_{h}). (24)

That is, the expectations of $\tilde{R}$ and $c_{T}$ in the two DBNs are equivalent and they are both proportional to the expected cumulative reward of the original MDP problem.

Appendix B Accumulating reward from multiple nodes in the same time step

In large factored state and action spaces, the rewards are typically specified as an additive function over small factors that only depend on a small number of state and action variables, given by some decision rules. The sum variable might have many parents and therefore we require an additional construction for the DBN. Since this construction is done for each time step separately, in this section we simplify the notation and omit the subscript for the time step $t$.

The construction is similar to the accumulation of reward over time. Assume there are $K$ decision rules that determine the reward at a particular time step for some state and action. Given some order over the decision rules, we expand the DBN so that each decision rule corresponds to a binary partial reward node $pr_{i}$ ($i=1,\cdots,K$), with edges between the partial reward node and the dependent state and action nodes according to the decision rule. Then for each partial reward node $pr_{i}$, we create a binary collecting reward node $cr_{i}$ that connects to the partial reward node $pr_{i}$ and the collecting reward node $cr_{i-1}$ of the previous partial reward node. We also create an additional collecting reward node $cr_{0}$ which is set to $1$.

We then define the conditional distribution of $cr_{i}$ given $cr_{i-1}$, $pr_{i}$ to be

p(cr_{i}=1|cr_{i-1},pr_{i})=\frac{(i-1)cr_{i-1}+pr_{i}}{i} (25)

and the partial reward distribution $pr_{i}$ to be

p(pr_{i}|s,a)\propto\mbox{$i$th reward decision rule}. (26)

We want to show that $p(cr_{K}=1|s,a)\propto r(s,a)$ for every time step.

p(cr_{K}=1|s,a) = \sum_{\begin{subarray}{c}cr_{i},pr_{i},pr_{K}\\ i=1,\cdots,K-1\end{subarray}}p(cr_{K}=1|cr_{K-1},pr_{K})p(pr_{K}|s,a)\prod_{i=1}^{K-1}p(cr_{i}|cr_{i-1},pr_{i})p(pr_{i}|s,a)
= \sum_{\begin{subarray}{c}cr_{i},pr_{i},pr_{K}\\ i=1,\cdots,K-1\end{subarray}}\frac{(K-1)cr_{K-1}+pr_{K}}{K}p(pr_{K}|s,a)\prod_{i=1}^{K-1}p(cr_{i}|cr_{i-1},pr_{i})p(pr_{i}|s,a)
Separate the formula above w.r.t. $cr_{K-1}$ and $pr_{K}$:
= \frac{K-1}{K}\mbox{part1}+\frac{1}{K}\mbox{part2} (27)

where

\mbox{part1} = \sum_{\begin{subarray}{c}cr_{i},pr_{i},pr_{K-1}\\ i=1,\cdots,K-2\end{subarray}}\big[p(cr_{K-1}=1|cr_{K-2},pr_{K-1})p(pr_{K-1}|s,a)\prod_{i=1}^{K-2}p(cr_{i}|cr_{i-1},pr_{i})p(pr_{i}|s,a)\big]=p(cr_{K-1}=1|s,a) (28)

because the whole expression vanishes when $cr_{K-1}=0$, and $pr_{K}$ is marginalized out. In addition,

\mbox{part2} = p(pr_{K}=1|s,a) (29)

because all other variables are marginalized out.

Thus we have

p(cr_{K}=1|s,a) = \frac{K-1}{K}p(cr_{K-1}=1|s,a)+\frac{1}{K}p(pr_{K}=1|s,a)
= \cdots
= \frac{1}{K}\sum_{i=1}^{K}p(pr_{i}=1|s,a)
\propto \sum_{i}(\mbox{$i$th reward decision rule})
\propto r(s,a). (30)
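The chain above can also be checked directly: marginalizing the binary collecting reward chain exactly reproduces the average of the partial reward probabilities. The sketch below does this computation for arbitrary illustrative probabilities (not taken from any domain).

```python
import numpy as np

rng = np.random.default_rng(1)
K = 6
p_pr = rng.uniform(0.0, 1.0, size=K)   # p(pr_i = 1 | s, a) for i = 1..K

p_cr = 1.0                             # p(cr_0 = 1) = 1
for i in range(1, K + 1):
    # marginalize cr_{i-1} and pr_i under Eq. (25); the recursion is linear
    p_cr = ((i - 1) * p_cr + p_pr[i - 1]) / i

print(p_cr, p_pr.mean())               # the two values agree exactly
```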

Appendix C Exponentially Weighted Reward

As discussed in the main paper, some prior work uses backward variational inference but does so with an exponential reward weighting. Here we show how this setting can be captured within our framework. Recall that Levine [2018] formulates the objective function as

D_{KL}(\hat{p}(\tau)||p(\tau|O_{1\cdots T})) (31)

where

p(\tau|O_{1\cdots T+1}) = \Big[\prod_{t=1}^{T}p(s_{t}|s_{t-1},a_{t-1})\Big]\exp\Big(\sum_{t=0}^{T}r(s_{t},a_{t})\Big) (32)
\hat{p}(\tau) = \prod_{t=1}^{T}p(s_{t}|s_{t-1},a_{t-1})\pi(a_{t-1}|s_{t-1}). (33)

Here $O_{1\cdots T+1}$ are indicator random variables denoting "optimality" at times $t=1,\cdots,T+1$, and the trajectory distribution uses an implicit uninformative policy.

In our formulation, $c_{t}$ represents the cumulative reward up to time $t$, and we have established that $\mathbb{E}(c_{T})\propto\mathbb{E}(R)$. Recall that in our graphical model a complete trajectory distribution is

p(\tau) = \prod_{t=1}^{T}p(s_{t}|s_{t-1},a_{t-1})u(a_{t-1})p(r_{t}|s_{t-1},a_{t-1})p(c_{t}|c_{t-1},r_{t})
\quad u(a_{T})p(r_{T+1}|s_{T},a_{T})p(c_{T+1}|c_{T},r_{T+1}). (34)

To recover the joint probability of the formulation from Levine [2018], we need the following steps:

  1. Change the reward distribution to $p(r_{t}|s_{t-1},a_{t-1})\propto\exp(R(s_{t-1},a_{t-1}))$.

  2. Condition the trajectory distribution on $r_{1\cdots T+1}=1,c_{1\cdots T+1}=1$.

Then in our formulation we have

p^{*}(\tau|r_{1\cdots T+1}=1,c_{1\cdots T+1}=1) \propto\Big[\prod_{t=1}^{T}p(s_{t}|s_{t-1},a_{t-1})\Big]\exp\Big(\sum_{t=0}^{T}r(s_{t},a_{t})\Big) (35)

which is the same as the optimal trajectory distribution above. Then the objective of Levine [2018] can be seen to minimize

d_{KL}(q_{\phi}(\tau)||p^{*}(\tau|r_{1\cdots T+1}=1,c_{1\cdots T+1}=1)) (36)

which is captured in our framework by backward VI with additional observation variables. Using the same methodology as above, both forward and backward versions of this MFVI variant can be implemented in our framework.

Appendix D Full closed form update formulas for MFVI

For completeness we list the full update formulas of MFVI:

\log q_{\phi}(s_{h}^{j}) \propto\mathbb{E}_{q_{\phi}}^{\backslash s_{h}^{j}}\Big[\log p_{\theta}(S,A,R,C_{\backslash T},c_{T}=1)\Big]
= \mathbb{E}_{q_{\phi}}^{\backslash s_{h}^{j}}\Big[\sum_{m=1}^{M}\log p(s_{h+1}^{m}|s_{h},a_{h},s_{h+1}^{G_{m}})+\log p(s_{h}^{j}|s_{h-1},a_{h-1},s_{h}^{G_{j}})
\qquad+\sum_{i:j\in G_{i}}\log p(s_{h}^{i}|s_{h-1},a_{h-1},s_{h}^{G_{i}})+\log p(r_{h+1}|s_{h},a_{h})\Big] (37)
\log q_{\phi}(a_{h}^{l}) \propto\mathbb{E}_{q_{\phi}}^{\backslash a_{h}^{l}}\Big[\log p_{\theta}(S,A,R,C_{\backslash T},c_{T}=1)\Big]
= \mathbb{E}_{q_{\phi}}^{\backslash a_{h}^{l}}\Big[\log p_{\theta}(a_{h}^{l})+\log p(r_{h+1}|s_{h},a_{h})+\sum_{m=1}^{M}\log p(s_{h+1}^{m}|s_{h},a_{h},s_{h+1}^{G_{m}})\Big] (38)
\log q_{\phi}(r_{h}) \propto\mathbb{E}_{q_{\phi}}^{\backslash r_{h}}\Big[\log p_{\theta}(S,A,R,C_{\backslash T},c_{T}=1)\Big]
= \mathbb{E}_{q_{\phi}}^{\backslash r_{h}}\Big[\log p(r_{h}|s_{h-1},a_{h-1})+\log p(c_{h}|c_{h-1},r_{h})\Big] (39)
\log q_{\phi}(c_{h}) \propto\mathbb{E}_{q_{\phi}}^{\backslash c_{h}}\Big[\log p_{\theta}(S,A,R,C_{\backslash T},c_{T}=1)\Big]
= \mathbb{E}_{q_{\phi}}^{\backslash c_{h}}\Big[\log p(c_{h}|c_{h-1},r_{h})+\log p(c_{h+1}|c_{h},r_{h+1})\Big]. (40)
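As an illustration of these updates, the sketch below implements the coordinate update for a single cumulative reward variable $c_{h}$ with $h>1$, i.e., Eq. (40), using the chain distribution of Eq. (2). The inputs are the current Bernoulli marginals of the neighboring variables, and the epsilon clipping of $\log 0$ is our own implementation detail rather than part of the formulas.

```python
import numpy as np

def p_c(h, c_prev, r):
    """Eq. (2): p(c_h = 1 | c_{h-1} = c_prev, r_h = r)."""
    return ((h - 1) * c_prev + r) / h

def update_q_c(h, q_c_prev, q_r, q_c_next, q_r_next, eps=1e-12):
    """Mean field update of q(c_h = 1) given neighboring Bernoulli marginals."""
    log_q = np.zeros(2)
    for v in (0, 1):                              # candidate value of c_h
        # E_q[ log p(c_h = v | c_{h-1}, r_h) ]
        for c_prev in (0, 1):
            for r in (0, 1):
                w = (q_c_prev if c_prev else 1 - q_c_prev) * (q_r if r else 1 - q_r)
                p = p_c(h, c_prev, r)
                log_q[v] += w * np.log(max(p if v else 1 - p, eps))
        # E_q[ log p(c_{h+1} | c_h = v, r_{h+1}) ]
        for c_next in (0, 1):
            for r_next in (0, 1):
                w = (q_c_next if c_next else 1 - q_c_next) * (q_r_next if r_next else 1 - q_r_next)
                p = p_c(h + 1, v, r_next)
                log_q[v] += w * np.log(max(p if c_next else 1 - p, eps))
    q = np.exp(log_q - log_q.max())
    return (q / q.sum())[1]                       # new q(c_h = 1)
```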

Appendix E Analysis of MFVI on a Demo Problem

Figure 2: Left: Performance of MFVI variants on the Cooking Problem. Other Plots: The increase in the value of the ELBO after updates of variables, shown for each group of variables separately.

In this section we further investigate the sensitivity of MFVI to updates of state variables. To achieve this, we design a new "cooking" domain, partly inspired by the structure of the Skill Teaching domain, in the spirit of making inference intuitively clear. We have two dishes and three actions, "cook dish1", "cook dish2", "do nothing". By applying the "cook" action repeatedly, each dish goes (with some probability) through 5 potential stages, starting at "not cooked" and moving to "not cooked and cooking", "cookMed", "cookMed and cooking", "cookWell", "Burned". These stages are encoded by the binary variables "cookMed", "cookWell", "cooking". An additional state variable "watching" per dish, which is equal to "cooking", adds flexibility while not affecting the dynamics. The reward is given when "cookMed" and "cookWell" are true but not when "Burned". Hence, when conditioned on high cumulative reward, we should expect to see "cookMed" initially set to true.

We next explore algorithmic variants. The NoS variant does not update state variables at all, as in the main paper. Since "cookMed" is important, we create additional update schemes around it. "MFVI-Med" fixes the distribution of the other state variables to be uniformly random and only updates the "cookMed" state variables together with the action, reward, and cumulative reward variables. "MFVI-Med-All" performs an asynchronous update, starting by running "MFVI-Med" to convergence and then continuing by running MFVI until it converges again.

Results are shown in Fig. 2. Consider first the plots that show the increase in ELBO as a function of updates. We see that for MFVI the relative effect of state variables on the increase in ELBO is larger than that of all other variables. Comparing this to the NoS variant, we see that in that case the reward and cumulative reward variables are more important, and the improvement it provides is potentially due to removing the large changes in ELBO due to state variables. On the other hand, as shown in the performance plot on the left, the two new variants based on the "cookMed" variable still improve the performance but also have a large gap in the effect on the ELBO, so this does not provide a full explanation. The success of "MFVI-Med-All" shows that the flexibility is not the whole story, but that the algorithm is sensitive to the order of updates.

As a final diagnostic, we print out the approximate posterior distributions of the "cookMed" and "cooking" variables under the different schemes, starting from the state where all the variables are set to 0. Ideally, these four variables should all be biased towards 1. We see that for VI-Med and VI-Med-All the approximate posterior, though not fully accurate, provides useful information for the action distribution update, while MFVI pushes the approximate state posterior in the wrong direction. This shows that MFVI can converge to uninformative local optima, which causes its poor performance. Overall we believe that the large number of state variables, their relative effect on the ELBO, and the sensitivity of the variational algorithm to the order of updates are the cause of failure in some domains.

Table 1: Approximate posterior distribution of different MFVI variants at the initial state on the cooking problem.
Variables VI-noS VI-Med VI-Med-All MFVI
t1-CookMed [d1, d2] [0.5, 0.5] [1.48e-3, 4.09e-6] [3.37e-7, 5.47e-5] [3.59e-5, 5.38e-5]
t1-Cooking [d1, d2] [4.95e-1, 4.95e-1] [4.95e-1, 4.95e-1] [0.99, 1.00e-12] [5.00e-5, 5.00e-5]
t2-CookMed [d1, d2] [0.5, 0.5] [0.99, 8.49e-4] [0.94, 5.86e-5] [7.57e-5, 5.90e-5]
t2-Cooking [d1, d2] [4.85e-1, 4.85e-1] [4.67e-1, 4.67e-1] [2.49e-1, 2.51e-1] [5.00e-5, 5.00e-5]

Appendix F Detailed Experimental Results in All Domains

In this section we show the raw, unnormalized results separately for all domains (represented by the first five letters of the domain names in the main paper). We include results for a second implementation of forward Loopy BP, which is discussed in the following subsection.

Figure 3: Algorithm Performance Details

F.1 Different Optimization Strategies for Forward Loopy BP

Recall that in the forward Loopy BP algorithm, we define $sc(\theta)$ to be the approximate marginal distribution of $p_{\theta}(c_{T})$ computed by LBP. For the same construction, but using BP with a directed model, Cui et al. [2018] showed that LBP does converge and that it does so in one iteration. This holds because the factors are conditional probability tables and we do not have downstream evidence. The same holds in our case, for the corresponding message order. However, this does not solve the optimization problem, i.e., selecting $\theta$ or $A$.

In this context we experimented with two methods. The first one is discussed in the main paper, namely the SOGBOFA algorithm [Cui et al., 2019] that fully optimizes $\theta$ by combining the one pass inference of the marginal problem, which is done symbolically, with a gradient search. The second one (labeled "LoopyBP-Fwd" in the plot) is a computationally cheap compromise, introduced by Cui et al. [2015], which uses a uniform distribution for $a_{1},\ldots,a_{T-1}$ and performs the optimization by enumerating values for $a_{0}$. That is, the second variant only optimizes the current action and uses a random rollout for subsequent actions. To make this as close as possible to SOGBOFA we used an implementation of BP with sequential updates where we can perform just one pass of forward messages (due to convergence). (We have also experimented with the fast implementation using parallel updates, as used by the backward algorithm, which gives comparable results with 100 iterations of message propagation.)

From the domain-by-domain experimental results we see that in all domains except Elevators the two algorithms have comparable and consistent performance. The need for optimizing the rollout policy for Elevators was discussed in Cui et al. [2015] and Cui et al. [2019]. Briefly, a combination of positive and negative rewards in this domain means that random rollouts are not informative and, due to the large penalty, all actions look risky and the simpler planning algorithm chooses to do nothing. More importantly, for our experiments, the consistency in performance shows that the experimental advantage of SOGBOFA across all domains is not due to differences in implementation details but rather due to the inference strategy.