
Also at: Nano Life Science Institute, Kanazawa University, Kanazawa 920-1192, Japan; Institute of Industrial Science, The University of Tokyo, Tokyo 153-8505, Japan; Universal Biology Institute, The University of Tokyo, Tokyo 113-8654, Japan.

Optimal control of stochastic reaction networks with entropic control cost
and emergence of mode-switching strategies

Shuhei A. Horiguchi and Tetsuya J. Kobayashi (tetsuya@mail.crmind.net)
Department of Mathematical Informatics, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan
Abstract

Controlling the stochastic dynamics of biological populations is a challenge that arises across various biological contexts. However, these dynamics are inherently nonlinear and involve a discrete state space, i.e., the number of molecules, cells, or organisms. Additionally, the possibility of extinction has a significant impact on both the dynamics and control strategies, particularly when the population size is small. These factors hamper the direct application of conventional control theories to biological systems. To address these challenges, we formulate the optimal control problem for stochastic population dynamics by utilizing a control cost function based on the Kullback–Leibler divergence. This approach naturally accounts for population-specific factors and simplifies the complex nonlinear Hamilton–Jacobi–Bellman equation into a linear form, facilitating efficient computation of optimal solutions. We demonstrate the effectiveness of our approach by applying it to the control of interacting random walkers, Moran processes, and SIR models, and observe the mode-switching phenomena in the control strategies. Our approach provides new opportunities for applying control theory to a wide range of biological problems.


I Introduction

Optimal control problems for a population of stochastically interacting particles arise in diverse fields of biology [1, 2]. In intracellular chemical reactions, molecules interact stochastically and nonlinearly to generate complex dynamics, whose control is essential for medical and bioengineering applications [3]. In cellular or animal populations, cells or organisms with different phenotypic and genetic traits interact and compete for survival and reproduction. Strategic control of such populations into either extinction, survival, or amplification leads to cancer therapy [4, 5, 6], stem cell culturing [7], manipulation of gut microbiota [8], maintenance of immunological memory [9, 10], biodiversity conservation [11, 12, 13], and control of evolving populations [14, 15]. In human populations, the control of epidemic outbreaks, spurred by individual interactions, poses a significant public health challenge [16, 17, 18].

All these phenomena can be effectively formulated using the theoretical framework of reaction networks (RN), making stochastic RN theory a firm foundation for devising optimal control strategies across these areas (Hereafter, we use RN to designate this class of dynamics including chemical reactions, population dynamics, and epidemic dynamics.).

Despite its broad applicability, the optimal control of RNs remains underexplored. This oversight is largely due to the unique characteristics of stochastic RNs, which deviate from the setup of conventional optimal control for diffusion processes. The major deviations include the inherent nonlinearity of RNs, the discrete nature of the state variables (representing counts of particles such as molecules or organisms), and their stochastic dynamics, which are better modeled by Markov jump processes with Poissonian randomness than by Gaussian diffusion [19, 1, 20]. Additionally, the non-negativity constraint on the control parameters, i.e., the kinetic rate constants, requires a natural cost measure other than the conventional quadratic ones, which presume a constraint-free Euclidean parameter space. Finally, the zero-count states act as intrinsic absorbing boundaries; reaching them can dramatically alter the system dynamics through the extinction of particles (Fig. 1).

These distinct properties of RNs necessitate (1) a departure from the traditional control theories based on diffusion processes, such as the linear-quadratic-Gaussian (LQG) model, (2) the establishment of an alternative theoretical framework tailored to the unique requirements of RN optimal control, and (3) its application to biologically relevant control problems.

In this work, we establish that all these issues can be resolved by integrating the optimal control of jump processes and stochastic RNs with a cost function based on relative entropy, or Kullback–Leibler (KL) divergence. Optimal control with the KL cost was originally proposed for diffusion processes in relation to the duality of control and inference [21, 22, 23]. We demonstrate that the KL cost, as it is designed for non-negative probabilities and densities, can naturally accommodate the non-negativity constraints on both the state space (particle counts) and the control parameters (kinetic rate constants) of stochastic RNs. Moreover, the KL cost allows for the linearization of the nonlinear Hamilton–Jacobi–Bellman (HJB) equations through the Cole–Hopf transformation. This linearization facilitates the efficient derivation of optimal solutions in a manner similar to previously identified solvable control problems [24, 25, 26].

By leveraging the sound properties of optimal control for stochastic RN with KL cost, we demonstrate the effectiveness of our theory for different biological phenomena: an optimal transport by mutually excluding molecular motors, maintenance of cellular heterogeneity in populations, and epidemic outbreak control (Fig. 1). For molecular motors, we derive analytic solutions owing to the linearization. In the maintenance of heterogeneity and epidemic outbreak control, we identify mode-switching phenomena of population control that hinge on the transition in controllability of exponentially growing populations.

All these results are obtained without directly solving nonlinear HJB equations, which are typically intractable, even numerically. Thus, our new results can substantially broaden the scope of optimal control applications in stochastic chemical reactions, population dynamics, and epidemics.

Figure 1: Examples of biological phenomena described as reaction networks. (a) Movement of molecular motors on a microtubule with mutual interference. (b) Competing dynamics of populations in population genetics or ecology. (c) Spread of infectious diseases in a population. Purple thick lines represent the absorbing states in each phenomenon.

II Optimal control of stochastic reaction systems

II.1 Stochastic reaction systems

Let us consider a population of particles that evolve through stochastic events, i.e., the occurrence of reactions. Each particle is assigned one of a discrete set of types $X$. The number of particles of type $x \in X$ at time $t$ is denoted by $n_x(t) \in \mathbb{Z}_{\geq 0}$. The number distribution $n(t) \in \mathbb{Z}_{\geq 0}^{|X|} = \mathcal{N}$ completely characterizes the state of the system at time $t$. The number distribution changes when a reaction occurs. There are $|R|$ kinds of reactions, where $R$ is the set of reactions. Once a reaction $r \in R$ occurs, $s^{-}_{r,x} \in \mathbb{Z}_{\geq 0}$ particles of type $x$ are consumed and $s^{+}_{r,x} \in \mathbb{Z}_{\geq 0}$ particles are produced. The net change in the number of particles, $s_{r,x} := s^{+}_{r,x} - s^{-}_{r,x} \in \mathbb{Z}$, is called the stoichiometric coefficient, and $s_r = (s_{r,x})_{x \in X} \in \mathbb{Z}^{|X|}$ is the stoichiometric vector of reaction $r$. Thus, when reaction $r$ occurs at time $t$, the number distribution changes from $n(t^{-})$ to

n(t) = n(t^{-}) + s_r. (1)

The timing of reaction events is random and follows an inhomogeneous Poisson process with rate $\lambda_r(t) \geq 0$ for reaction $r$ at time $t$. The rate can vary over time and depends on the current number distribution $n(t)$. We write the rate of reaction $r \in R$ as the product of the reaction rate coefficient $k_r(t) \geq 0$ and the propensity function $h_r(n) \geq 0$:

\lambda_r(t) = k_r(t)\, h_r(n(t)). (2)

When we assume the law of mass action kinetics, $k_r(t)$ is a kinetic constant, and the function $h_r(n)$ is given by $h_r(n) = N^{1 - \sum_{x \in X} s^{-}_{r,x}} \prod_{x \in X} \frac{n_x!}{(n_x - s^{-}_{r,x})!}$, where $N$ is a constant parameter that describes the system size, e.g., the volume or the total number of particles in the system. We do not assume mass action kinetics in the following, so that our results are applicable to any propensity function. Instead, we presume only situations that make the process well-defined up to time $T$ or up to the exit time $T_{\mathrm{exit}}$. For example, $h_r(n)$ should be $0$ when the reaction pushes the state out of the domain, i.e., $n + s_r \not\in \mathcal{N}$.
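As a quick illustration of this formula, the following minimal Python sketch (our own example; the function name and interface are not from the paper) evaluates the mass-action propensity from the consumption vector $(s^{-}_{r,x})_{x \in X}$ and the system size $N$:

```python
import math

# Mass-action propensity h_r(n): a direct transcription of the formula in
# the text. s_minus is the consumption vector (s^-_{r,x})_x and Nsys is the
# system-size parameter N. Returns 0 when too few reactant particles exist.
def propensity(n, s_minus, Nsys):
    if any(nx < sx for nx, sx in zip(n, s_minus)):
        return 0.0
    falling = math.prod(math.perm(nx, sx)      # n_x! / (n_x - s^-_{r,x})!
                        for nx, sx in zip(n, s_minus))
    return Nsys ** (1 - sum(s_minus)) * falling

# Example: a bimolecular reaction consuming one A and one B, h(n) = n_A n_B / N.
print(propensity((30, 70), (1, 1), Nsys=100))  # 21.0
```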

The stochastic reaction system defined in this way can be characterized by either the counting process representation or the Markov process representation. The counting process representation [20] is useful for characterizing the stochastic process $(n(t))_{0 \leq t \leq T}$ and simplifies the derivation of the KL control cost. Let $\xi_r(t) \in \mathbb{Z}_{\geq 0}$ denote the number of occurrences of reaction $r$ from time $0$ to $t$. The number distribution $n(t)$ is a linear transformation of the reaction counts $\xi(t)$, i.e., for any $0 \leq t \leq T$,

n(t) = n(0) + \sum_{r \in R} s_r \xi_r(t). (3)

The probability law of $\xi(\cdot)$ is given by Poisson processes. Using the time-change of independent unit-rate Poisson processes $Y_r(t)$, the reaction count can be written as $\xi_r(t) = Y_r\left(\int_0^t k_r(\tau)\, h_r(n(\tau))\, d\tau\right)$.
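For intuition, such trajectories can be sampled directly. The following sketch (an illustrative example of ours, not code from the paper) simulates the birth–death network $A \to 2A$, $A \to \emptyset$ of Eq. (18) with the standard Gillespie algorithm, whose law coincides with the time-changed Poisson representation above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Gillespie simulation of A -> 2A (rate k[0]*n) and A -> 0 (rate k[1]*n).
def gillespie(n0, k, T):
    s = (+1, -1)                     # net change of n_A for each reaction
    t, n, path = 0.0, n0, [(0.0, n0)]
    while n > 0:                     # n = 0 is the absorbing extinction state
        lam = k * n                  # lambda_r = k_r * h_r(n) with h_r(n) = n
        t += rng.exponential(1.0 / lam.sum())
        if t > T:
            break                    # no further reaction before the horizon
        n += s[rng.choice(2, p=lam / lam.sum())]
        path.append((t, n))
    return path

path = gillespie(n0=10, k=np.array([1.0, 1.1]), T=50.0)
print(path[-1])                      # (time of last event, final copy number)
```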

The stochastic process $n(\cdot)$ is a Markov process on the nonnegative integer lattice $\mathcal{N}$. The transition rate from $n \in \mathcal{N}$ to $n' \in \mathcal{N}$ is given by

\omega_{n,n'}(t) = \sum_{r \in R} k_r(t)\, h_r(n)\, \delta_{n + s_r, n'}, (4)

where $\delta_{n,n'}$ is the Kronecker delta. Let $P_t(n') := \mathbb{E}\left[\delta_{n(t),n'} \,\middle|\, n(0) = n_0\right]$ denote the probability of being in state $n' \in \mathcal{N}$ at time $t$ given the initial state $n_0 \in \mathcal{N}$. Then, $P_t(n)$ satisfies the Kolmogorov forward equation (chemical master equation) for any $0 < t < T$ and $n \in \mathcal{N}$:

\frac{\partial}{\partial t} P_t(n) = \sum_{n' \in \mathcal{N} \setminus \{n\}} \left[ P_t(n')\, \omega_{n',n}(t) - P_t(n)\, \omega_{n,n'}(t) \right]
= \sum_{r \in R} k_r(t) \left[ h_r(n - s_r)\, P_t(n - s_r) - h_r(n)\, P_t(n) \right]
=: \mathcal{L}_k^{\dagger} P_t(n). (5)
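On a truncated state space, Eq. (5) is a finite linear ODE and can be integrated by matrix exponentiation. A minimal sketch for the same birth–death network (the truncation and parameter values are our own illustrative choices):

```python
import numpy as np
from scipy.linalg import expm

# Forward operator of Eq. (5) for A -> 2A, A -> 0 on the truncated lattice
# {0, ..., M-1}; transitions leaving the domain are simply dropped.
M, k1, k2 = 50, 1.0, 1.1
L = np.zeros((M, M))                 # L[m, n] = rate of n -> m; columns sum to 0
for n in range(M):
    if n + 1 < M:
        L[n + 1, n] += k1 * n        # birth, h_1(n) = n
        L[n, n]     -= k1 * n
    if n >= 1:
        L[n - 1, n] += k2 * n        # death, h_2(n) = n
        L[n, n]     -= k2 * n

P0 = np.zeros(M); P0[10] = 1.0       # start from n_A = 10
Pt = expm(L * 5.0) @ P0              # P_t = exp(t L^dagger) P_0 at t = 5
print("extinction probability at t = 5:", Pt[0])
```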

II.2 General formulation of optimal control problems

We assume that the controller can observe the current state $n(t)$ and adjust the reaction rate coefficients $k_r(t)$ to any desired value at any time, while the function $h_r(n)$ remains unchanged, i.e., the controller can modulate only the speeds of reactions. We would like to find the optimal reaction rate coefficients $(k(t))_{0 \leq t < T}$ that drive the population along a desirable trajectory $(n(t))_{0 \leq t < T}$ while minimizing the cost of modulating the coefficients. Let us define the utility function $\mathcal{U}(n(\cdot)) := U_T(n(T)) + \int_0^T U_\tau(n(\tau))\, d\tau$ as the sum of the terminal utility function $U_T(n) \in \mathbb{R}$ and the time integral of the instantaneous utility function $U_\tau(n) \in \mathbb{R}$. The control cost function is defined as $\mathcal{C}(n(\cdot), k(\cdot)) := \int_0^T C(n(\tau), k(\tau))\, d\tau$. Then, we consider the following optimization problem:

\mathcal{E}(n_0, u) := \min_{k(\cdot)} \mathbb{E}_{k(\cdot)}\left[\mathcal{C}(n(\cdot), k(\cdot))\right] \quad \text{s.t.} \quad \mathbb{E}_{k(\cdot)}\left[\mathcal{U}(n(\cdot))\right] \geq u, (6)

or equivalently, the following unconstrained optimization:

V(n_0, \beta) := \max_{k(\cdot)} \mathbb{E}_{k(\cdot)}\left[\beta\, \mathcal{U}(n(\cdot)) - \mathcal{C}(n(\cdot), k(\cdot))\right], (7)

where the expectation $\mathbb{E}_{k(\cdot)}$ is taken over the trajectories $n(\cdot)$ generated under the designated reaction rate coefficients $k(\cdot)$ and the initial condition $n(0) = n_0$. The scalar $\beta$ is the Lagrange multiplier, or a parameter adjusting the importance of the utility relative to the control cost.

We will calculate the optimal reaction rate coefficients $(k^{\dagger}(t))_{0 \leq t < T}$ that attain the optimum of $\mathcal{E}(n_0, u)$ and $V(n_0, \beta)$. The standard protocol for solving the problem is to consider the value function $V_t(n,\beta)$, defined as the maximum of the expectation in Eq. (7) from $t$ to $T$ under the condition $n(t) = n$:

V_t(n,\beta) := \max_{k(\cdot)} \mathbb{E}_{k(\cdot)}\left[\beta\, \mathcal{U}_t(n(\cdot)) - \mathcal{C}_t(n(\cdot), k(\cdot)) \,\middle|\, n(t) = n\right], (8)

where $\mathcal{U}_t(n(\cdot))$ and $\mathcal{C}_t(n(\cdot), k(\cdot))$ are the utility and control cost functions from time $t$ to $T$, e.g., $\mathcal{U}_t(n(\cdot)) := U_T(n(T)) + \int_t^T U_\tau(n(\tau))\, d\tau$. The value function equals the terminal utility function at time $T$, $V_T(n,\beta) = \beta U_T(n)$, and provides the optimum of the original problem at time $0$, $V_0(n,\beta) = V(n,\beta)$. Thus, the rate coefficients attaining Eq. (8) are identical to the optimal coefficients $k^{\dagger}$ of the original control problem.

If $V(n,\beta)$ is obtained analytically or numerically for all $n$ and $\beta$, then $\mathcal{E}(n,u)$ in Eq. (6) is derived as the Legendre transformation of $V(n,\beta)$ (see Sec. A of the Supplementary Material [27] for the derivation):

\mathcal{E}(n,u) = \max_{\beta} \left[ u\beta - V(n,\beta) \right]. (9)

Furthermore, the expected utility under the optimal control can also be derived from the value function (see Sec. A of the Supplementary Material [27] for the derivation):

\mathcal{U}^{\dagger}(n,\beta) := \mathbb{E}_{k^{\dagger}(\cdot)}\left[\mathcal{U}(n(\cdot))\right] = \frac{\partial}{\partial \beta} V(n,\beta). (10)

II.3 Optimal control with KL cost

To obtain the value function, we usually leverage the Hamilton–Jacobi–Bellman (HJB) equation, a differential equation that $V_t(n,\beta)$ satisfies. However, solving the HJB equation is generally intractable, both analytically and even numerically, because it is a nonlinear differential equation on a possibly infinite domain. This difficulty is the major obstacle to optimal control in applications. To address it, one is conventionally forced to restrict or approximate the original problem to a linear-quadratic-Gaussian (LQG) problem, which considers only linear dynamics on a continuous Euclidean space with Gaussian noise and a quadratic control cost function. However, all these restrictions of the LQG problem conflict with the essential properties of stochastic RN control: the nonlinearity of $h_r(n)$, the discreteness and nonnegativity of the state $n \in \mathbb{Z}_{\geq 0}^{|X|}$, the nonnegativity of the control parameters $k_r \geq 0$, and the Poissonian nature of the stochasticity. Several studies have attempted to overcome parts of these difficulties [28, 29, 30]. Nonetheless, optimal control has not been practical for biological problems described by RNs.

In this work, we clarify that the difficulty can be resolved by adopting the following cost function

C(n,k) = \sum_{r \in R} h_r(n)\, c_{KL}(k_r, k^0_r), (11)

where $c_{KL}(a, a^0)$ is the generalized Kullback–Leibler (KL) divergence for positive scalars $a, a^0 > 0$:

c_{KL}(a, a^0) := a \log \frac{a}{a^0} - a + a^0. (12)

The value at $a = 0$ is defined by the limit $c_{KL}(0, a^0) := \lim_{a \to 0} c_{KL}(a, a^0) = a^0$.

The KL control cost function in Eq. (11) measures the deviation of the controlled reaction rate coefficients $k$ from the uncontrolled ones $k^0 \in \mathbb{R}^{|R|}_{>0}$. Because the control cost function $C(n,k)$ is minimized at $k = k^0$, the optimal reaction rate coefficient $k^{\dagger}$ equals $k^0$ when $\beta = 0$, i.e., when there is no preference on the state trajectory. If $\beta \neq 0$, the optimal coefficient of $r \in R$ at time $t$ is given by

k^{\dagger}_r(t, n, \beta) = k^0_r \exp\left(\overline{\nabla}_{s_r} V_t(n,\beta)\right), (13)

where $\overline{\nabla}_{s_r} f(n) := f(n + s_r) - f(n)$ is the discrete gradient (see Sec. A of the Supplementary Material [27] for the derivation). Because the slope of $c_{KL}$ diverges, $\frac{\partial}{\partial k_r} c_{KL}(k_r, k^0_r) \to -\infty$ as $k_r \to 0$, the nonnegativity constraint on $k$ is automatically satisfied. Indeed, the exponential function in Eq. (13) ensures that the optimal coefficient $k^{\dagger}_r$ is nonnegative.

Note that the KL control cost in Eq. (11) is weighted by the propensity function $h_r(n)$ of each reaction $r \in R$, which makes the control cost function depend on the current state $n$: even for the same reaction rate coefficient, the control cost of a reaction $r$ is higher when the reaction occurs more frequently. A similar form of control cost function has been used in quantifying the biological cost of cellular reactions [31].

For the KL cost function, the HJB equation becomes

-\frac{\partial}{\partial t} V_t(n,\beta) = \beta U_t(n) + \sum_{r \in R} k^0_r h_r(n) \left( \exp\left(\overline{\nabla}_{s_r} V_t(n,\beta)\right) - 1 \right), (14)

which is still a nonlinear differential equation. However, by the Cole–Hopf transformation (logarithmic transformation), $Z_t(n,\beta) := \exp(V_t(n,\beta))$, the HJB equation is linearized as

-\frac{\partial}{\partial t} Z_t(n,\beta) = \beta U_t(n)\, Z_t(n,\beta) + \sum_{r \in R} k^0_r h_r(n)\, \overline{\nabla}_{s_r} Z_t(n,\beta)
=: \beta U_t(n)\, Z_t(n,\beta) + \mathcal{L}_{k^0} Z_t(n,\beta), (15)

with the terminal condition $Z_T(n,\beta) = \exp(\beta U_T(n))$.

Although it is still an ordinary differential equation on a possibly infinite domain, linearity allows us to calculate the value function efficiently. Note that the second term on the right-hand side involves the backward operator (generator) $\mathcal{L}_{k}$, which is the adjoint of the forward operator $\mathcal{L}^{\dagger}_{k}$ in the chemical master equation (5). Moreover, the special form of Eq. (15) gives the following probabilistic representation of $Z_t(n,\beta)$:

Z_t(n,\beta) = \mathbb{E}_{k^0}\left[\exp\left(\beta\, \mathcal{U}_t(n(\cdot))\right) \,\middle|\, n(t) = n\right], (16)

which is the Feynman–Kac formula; cf. Appendix 1, Prop. 7.1 of [32]. Thus, the value function is given by

V_t(n,\beta) = \log \mathbb{E}_{k^0}\left[\exp\left(\beta\, \mathcal{U}_t(n(\cdot))\right) \,\middle|\, n(t) = n\right]. (17)

This representation enables us to compute the value function by evaluating the expectation of the utility function under the uncontrolled reaction rate coefficients $k^0$. For simple reaction systems, we can obtain analytical expressions of the value function, as demonstrated in the following sections. Even when this is not possible, efficient Monte Carlo sampling techniques can be used to estimate the expectation. It is worth noting that this representation of the value function can be evaluated in a time-forward manner, whereas the usual optimal control problem requires a time-backward calculation of the HJB equation due to the dynamic programming principle.
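As a concrete illustration of this forward evaluation, the following sketch estimates $V_0(n,\beta)$ by plain Monte Carlo for the birth–death network with terminal utility $U_T(n) = -n_A$ and $U_\tau \equiv 0$; the model, utility, and parameter values are illustrative choices of ours, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Sample n(T) of the uncontrolled birth-death process A -> 2A, A -> 0.
def sample_nT(n, k1, k2, T):
    t = 0.0
    while n > 0:
        t += rng.exponential(1.0 / ((k1 + k2) * n))
        if t > T:
            break                          # no more reactions before T
        n += 1 if rng.random() < k1 / (k1 + k2) else -1
    return n

# Eq. (17) with U_T(n) = -n_A: V_0 = log E_{k0}[exp(-beta * n(T))].
beta, T, k1, k2, n0 = 0.5, 2.0, 1.0, 1.1, 10
samples = [np.exp(-beta * sample_nT(n0, k1, k2, T)) for _ in range(20000)]
print(np.log(np.mean(samples)))
```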

The linearization of the HJB equation and the efficient computation of optimal control problems for certain control cost functions were reported by Kappen [24] for diffusion processes and by Todorov [26] for discrete-time Markov chains. As Theodorou and Todorov [33] discussed, the linearization of the HJB equation is possible if the control cost function is given by the KL divergence between path measures. In this case, the optimal control problem is related to an estimation (filtering) problem. Similar mathematical properties had been found earlier in relation to the duality of control and inference [21, 22, 23]. Similar properties for Markov jump processes have been identified in recent studies [34, 35, 36], as well as in one of the earliest studies by Fleming [21]. In this paper, we develop the optimal control framework for stochastic RN inspired by these previous studies. In Sec. B of the Supplementary Material [27], we elaborate on the path measure perspective and explain why the KL cost function in Eq. (11) works.

II.4 First exit optimal control with KL cost

We have hitherto focused on the problem with a finite and fixed terminal time $T$. One can extend our approach to the problem with an infinite time horizon $T \to \infty$. Since a stochastic reaction system often has absorbing states $\mathcal{N}_{\mathrm{abs}} \subset \mathcal{N}$, the stochastic process inevitably reaches one of these absorbing states after some time $T_{\mathrm{exit}} := \min\{t \geq 0 \,|\, n(t) \in \mathcal{N}_{\mathrm{abs}}\}$. For example, in the birth–death process

A \overset{k_1}{\longrightarrow} 2A, \quad A \overset{k_2}{\longrightarrow} \emptyset, (18)

the extinction state $n_A = 0$ is absorbing. Extinction events are particularly important in biological problems because the extinction of some species in a chemical, organismal, or human population can qualitatively alter the dynamics of the system. Thus, controlling species into either survival or extinction has many applications, as mentioned in the Introduction. Moreover, biological control problems may not always have a prescribed end time $T$, because the goal of control is usually to achieve something rather than to act until time $T$. Thus, first exit control is more essential than fixed-time control. Nevertheless, it has been practiced less because first exit control is technically more involved.

We can consider the optimal control problem with such a random terminal time $T_{\mathrm{exit}}$ and the KL cost. The utility function is replaced by the sum of the terminal utility function $U_{\mathrm{exit}}(n)$, defined on the absorbing states $\mathcal{N}_{\mathrm{abs}}$, and the instantaneous utility function $U(n)$, defined on the non-absorbing states $\mathcal{N}_{\mathrm{non}} := \mathcal{N} \setminus \mathcal{N}_{\mathrm{abs}}$, i.e.,

\mathcal{U}(n(\cdot)) = U_{\mathrm{exit}}(n(T_{\mathrm{exit}})) + \int_0^{T_{\mathrm{exit}}} U(n(\tau))\, d\tau. (19)

Then, the value function becomes time-independent, $V(n,\beta) = V_t(n,\beta)$. If we assume the KL control cost, the Cole–Hopf transformation $Z(n,\beta) := \exp(V(n,\beta))$ yields

0 = \beta U(n)\, Z(n,\beta) + \mathcal{L}_{k^0} Z(n,\beta), (20)

for $n \in \mathcal{N}_{\mathrm{non}}$, and $Z(n,\beta) = \exp\left(\beta U_{\mathrm{exit}}(n)\right)$ for $n \in \mathcal{N}_{\mathrm{abs}}$. Thus, the first exit optimal control problem is reduced to a linear algebraic problem. Similar results have been obtained for discrete-time Markov chains [26]. The probabilistic representations in Eqs. (16) and (17) are also applicable.
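In matrix form, Eq. (20) together with its boundary condition is a single linear solve. A generic sketch (the function and its encoding conventions are our own, not the paper's):

```python
import numpy as np

# First-exit solver for Eq. (20) on a finite (truncated) state space.
# Assumed encoding: L0[n, m] is the uncontrolled rate n -> m for n != m and
# rows of L0 sum to zero (backward operator); betaU is beta*U(n) on all
# states; Z_abs holds exp(beta * U_exit) on the absorbing states.
def solve_first_exit(L0, betaU, absorbing, Z_abs):
    non = ~absorbing
    A = L0[np.ix_(non, non)] + np.diag(betaU[non])  # (beta*U + L0) restricted
    b = -L0[np.ix_(non, absorbing)] @ Z_abs         # boundary terms to the RHS
    Z = np.empty(L0.shape[0])
    Z[absorbing] = Z_abs
    Z[non] = np.linalg.solve(A, b)
    return Z   # V = log Z, and Eq. (13) gives k_r(n) = k0_r * Z[n+s_r] / Z[n]
```

After truncating the state space, the first exit examples below (Figs. 2–4) all reduce to a solver of this form.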

If there are no absorbing states, the terminal time and the time-accumulated objective function could diverge. For such cases, the time-averaged formulation is useful:

\underset{k(\cdot)}{\text{maximize}} \quad \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{k(\cdot)}\left[\beta\, \mathcal{U}(n(\cdot)) - \mathcal{C}(n(\cdot), k(\cdot))\right]. (21)

Via the Cole–Hopf transformation, the optimal solution can be cast into an eigenvalue problem (see Sec. C of the Supplementary Material [27]).
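Assuming, in line with the Cole–Hopf logic of Eq. (15), that the time-averaged problem reduces to the principal eigenvalue of the tilted generator $\beta U + \mathcal{L}_{k^0}$, with the relative value function $V = \log Z$ given by the positive eigenvector (the derivation is in Sec. C of the Supplementary Material [27] and is not reproduced here), a minimal numerical sketch reads:

```python
import numpy as np

# Dominant eigenvalue of the tilted generator diag(beta*U) + L0: the gain
# (optimal long-run average of beta*U - C) and the relative value function.
def ergodic_gain(L0, betaU):
    T = np.diag(betaU) + L0                # tilted backward operator
    vals, vecs = np.linalg.eig(T)
    i = np.argmax(vals.real)               # Perron (largest real) eigenvalue
    Z = np.abs(vecs[:, i].real)            # positive eigenvector
    return vals[i].real, np.log(Z)

# Toy usage: a 3-state cycle with uniform unit rates and one rewarded state.
L0 = np.array([[-1.0, 1.0, 0.0],
               [0.0, -1.0, 1.0],
               [1.0, 0.0, -1.0]])
gain, V = ergodic_gain(L0, betaU=np.array([0.5, 0.0, 0.0]))
print(gain, V - V.min())
```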

II.5 Controlling a random walker

Figure 2: The results of the minimum exit time problem for interacting random walkers on a one-dimensional space. (a) The value function $V(x_1, x_2, \beta)$ plotted by color in the state space $(x_1, x_2)$ for different $\beta$. The dashed curves are its contours. The blue zigzag lines are the trajectories of $100$ independent simulations starting from $(10, 11)$. (b) The value function $V(x_1, x_2, \beta)$ for fixed $x_1 = 0$ (upper panel) and $x_1 = 10$ (lower panel) with different $\beta$ indicated by the dots. The dashed lines are calculated with Eq. (25). The color code represents the value of $\beta$. (c) The gradient $\overline{\nabla}_2 V(x_1, x_2, \beta) = V(x_1, x_2 + 1, \beta) - V(x_1, x_2, \beta)$ of the value function in the $x_2$ direction for fixed $x_1 = 0$ and $x_1 = 10$. The dashed lines correspond to Eq. (25). The format of the panels is the same as in (b). The parameters are $M = 31$, $N = 2$, $k^0 = 0.1$.

We demonstrate the effectiveness of our method by applying it to several control problems. The first example is a jump process on discrete states $X$. We consider $N$ particles walking on a directed graph $(X, R)$, where the set of directed edges $R$ represents the allowed transitions between types in $X$. For any $x, y \in X$, $(x,y) \in R$ implies that a particle can jump from $x$ to $y$. The stoichiometric coefficients are given by $s_{(x,y),z} = \delta_{y,z} - \delta_{x,z}$, and the kinetics is of mass action type, $h_{(x,y)}(n) = n_x$, for $x, y, z \in X$.

For $X = \mathbb{Z}$ and $N = 1$, the process reduces to a simple one-dimensional random walk of a single walker. Such a process has been used as a simple model of the intracellular transport of macromolecules along intracellular filaments [37]. Molecular motors such as dynein and kinesin consume energy and move in one direction. Since the step length is fixed, the position of a molecular motor on a filament can be modeled on a one-dimensional discrete grid. For this case, we can obtain analytical solutions of some optimal control problems owing to the sound properties of KL control.

Consider the situation where a particle is required to reach a goal $x^* \in \mathbb{Z}$ as soon as possible. The position of the particle at time $t$ is denoted by $\hat{x}(n(t)) \in X$. We can formulate the situation as the following minimum exit time problem:

\underset{k(\cdot)}{\text{minimize}} \quad \beta\, \mathbb{E}_{k(\cdot)}\left[T_{\mathrm{exit}}\right] + \mathcal{D}_{KL}\left[\mathbb{P}_{k(\cdot)} \| \mathbb{P}_{k^0}\right], (22)

where the absorbing states are given by $\mathcal{N}_{\mathrm{abs}} = \{n \in \mathcal{N} \,|\, \hat{x}(n) = x^*\}$. Setting $U_{\mathrm{exit}}(n) = 0$ and $U(n) = -1$, we have the equivalent maximization problem

\underset{k(\cdot)}{\text{maximize}} \quad \beta\, \mathbb{E}_{k(\cdot)}\left[-T_{\mathrm{exit}}\right] - \mathcal{D}_{KL}\left[\mathbb{P}_{k(\cdot)} \| \mathbb{P}_{k^0}\right]. (23)

Equation (17) shows that the value function $V(n,\beta) = V(\hat{x}(n), \beta)$ is equal to the cumulant-generating function of the exit time with parameter $-\beta$ and is given by

V(x,\beta) = \log \mathbb{E}_{k^0}\left[\exp(-\beta T_{\mathrm{exit}}) \,\middle|\, x(0) = x\right]. (24)

Assuming symmetric uncontrolled transition rates, $k^0_r = \kappa$ for all $r \in R$, and using the analytical expression of the cumulant-generating function [38], we obtain the value function analytically:

V(x,\beta) = -\gamma(\beta)\, |x^* - x|, (25)

where the scalar $\gamma(\beta) \geq 0$ is defined as

\gamma(\beta) := -\log\left(1 + \frac{\beta}{2\kappa} - \sqrt{\left(1 + \frac{\beta}{2\kappa}\right)^2 - 1}\right). (26)

The conjugate value function can be calculated for $x^* \neq x$ and $u < 0$ as follows:

\mathcal{E}(x,u) = 2\kappa|u| - \sqrt{\Delta^2 + 4\kappa^2 u^2} + \Delta \sinh^{-1}\left(\frac{\Delta}{2\kappa|u|}\right), (27)

where $\Delta := x^* - x$. Since the value function is linear in $|x^* - x|$, we have the piecewise constant optimal transition rates:

k^{\dagger}_{(x,x+1)}(n,\beta) = \begin{cases} \kappa e^{\gamma(\beta)} & \text{if } x < x^*, \\ \kappa e^{-\gamma(\beta)} & \text{if } x > x^*, \end{cases} \qquad k^{\dagger}_{(x,x-1)}(n,\beta) = \begin{cases} \kappa e^{-\gamma(\beta)} & \text{if } x < x^*, \\ \kappa e^{\gamma(\beta)} & \text{if } x > x^*. \end{cases} (28)

When $x < x^*$, the transition rate $k^{\dagger}_{(x,x+1)}$ in the positive direction is higher than the uncontrolled rate $\kappa$, while the transition rate $k^{\dagger}_{(x,x-1)}$ in the negative direction is lower than $\kappa$. We can also obtain the analytic solution of the standard control problem of maximizing the average speed of the walker (see Sec. D of the Supplementary Material [27]).
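The closed forms of Eqs. (25)–(27) can be cross-checked against each other numerically: computing $\gamma(\beta)$ from Eq. (26) and taking a grid-based Legendre transform (Eq. (9)) of Eq. (25) should reproduce Eq. (27). A sketch with illustrative parameter values:

```python
import numpy as np

kappa, Delta = 0.1, 10.0                 # Delta = x* - x
u = -60.0                                # i.e., expected exit time E[T_exit] = 60

# Eq. (26) and Eq. (25) on a beta grid.
betas = np.linspace(1e-8, 2.0, 200001)
gamma = -np.log(1 + betas / (2 * kappa)
                - np.sqrt((1 + betas / (2 * kappa))**2 - 1))
V = -gamma * abs(Delta)

E_legendre = np.max(u * betas - V)       # Eq. (9), numerically
E_closed = (2 * kappa * abs(u) - np.sqrt(Delta**2 + 4 * kappa**2 * u**2)
            + Delta * np.arcsinh(Delta / (2 * kappa * abs(u))))   # Eq. (27)
print(E_legendre, E_closed)              # both approximately 3.97
```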

II.6 Controlling interacting random walkers

Next, we consider the case where $N = 2$ and $X = \mathbb{Z}$. Let us consider the minimum time problem with an exclusion interaction, i.e., all particles should reach the goal $x^* \in X$ as soon as possible while they are not allowed to occupy the same site. This is a model of two molecular motors moving on the same filament, on which they cannot overtake each other (Fig. 1 (a)). Non-colliding random walks and diffusion processes have been studied extensively [39, 40, 41].

Let us denote the position of the left particle by $\hat{x}_1(n)$ and that of the right particle by $\hat{x}_2(n)$, i.e., $\hat{x}_1(n) < \hat{x}_2(n)$. Due to the exclusion interaction, the first particle cannot overtake the second. Thus, the particles are identifiable for all $t \geq 0$. The absorbing states are given by

\mathcal{N}_{\mathrm{abs}} = \left\{ n \in \mathcal{N} \,\middle|\, \hat{x}_1(n) = \hat{x}_2(n) \right\}. (29)

Among the absorbing states above, we are interested in the single state $n^* \in \mathcal{N}_{\mathrm{abs}}$ at which the particles reach the goal, $\hat{x}_1(n^*) = \hat{x}_2(n^*) = x^*$. We can formulate the problem as a first exit problem with the instantaneous utility $U(n) = -1$ and the terminal utility

U_{\mathrm{exit}}(n) = \begin{cases} 0 & \text{if } n = n^*, \\ -\infty & \text{otherwise}. \end{cases} (30)

Assuming that each particle stops when it reaches the goal, the exit time is given by

T_{\mathrm{exit}} = \max\{T^1_{\mathrm{exit}}, T^2_{\mathrm{exit}}\}, (31)

where $T^i_{\mathrm{exit}} := \inf\{t \geq 0 \,|\, \hat{x}_i(n(t)) = x^*\}$ is the first exit time of the $i$-th particle ($i = 1, 2$). Then, the value function $V(n,\beta) =: V(\hat{x}_1(n), \hat{x}_2(n), \beta)$ can be calculated as

V(x_1, x_2, \beta) = \log \mathbb{E}_{k^0}\left[\exp(-\beta T_{\mathrm{exit}}) \,\middle|\, n(0) = n,\ n(T_{\mathrm{exit}}) = n^*\right] (32)
= \log \mathbb{E}_{k^0}\left[\exp\left(-\beta \max\{T^1_{\mathrm{exit}}, T^2_{\mathrm{exit}}\}\right) \,\middle|\, x_i(0) = x_i,\ \hat{x}_i(T^i_{\mathrm{exit}}) = x^*\ (i = 1, 2)\right]
\leq \min\{V(x_1,\beta),\, V(x_2,\beta)\}
= -\gamma(\beta)\, \max\{|x^* - x_1|,\, |x^* - x_2|\},

where $V(x_i, \beta)$ is the value function in Eq. (25) for a single random walker. When the first particle is close to the goal, $x_1 \approx x^*$, while the second particle is still far from it, $|x_2 - x^*| \gg 1$, the upper bound becomes tight and we have

V(x_1, x_2, \beta) \approx -\gamma(\beta)\, |x^* - x_2|. (33)

Using the upper bound, the conjugate value function $\mathcal{E}(x_1, x_2, u)$ satisfies

\mathcal{E}(x_1, x_2, u) \geq \max\{\mathcal{E}(x_1, u),\, \mathcal{E}(x_2, u)\}, (34)

where $\mathcal{E}(x_i, u)$ is the conjugate value function in Eq. (27).

Analytical estimates are compared with numerical solutions in Fig. 2. The numerical solutions are obtained by solving the linear equation (20), where the infinite space $\mathbb{Z}$ is truncated to the finite interval $X = \{0, 1, \ldots, M-1\}$. The goal position is set to $x^* = 0$.
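A self-contained sketch of this computation (with the parameters of Fig. 2; the state encoding, the value of $\beta$, and the probed states are our own choices) reads:

```python
import numpy as np

# Two excluding walkers on {0, ..., M-1}, goal x* = 0: solve Eq. (20) with
# U(n) = -1, Z = 1 at the goal state (0, 0), and Z = 0 (U_exit = -inf) at
# collisions; a particle that has reached the goal stops moving.
M, kappa, beta = 31, 0.1, 0.05
states = [(x1, x2) for x1 in range(M) for x2 in range(x1, M)]
idx = {s: i for i, s in enumerate(states)}

A, b = np.zeros((len(states), len(states))), np.zeros(len(states))
for (x1, x2) in states:
    i = idx[(x1, x2)]
    if x1 == x2:                            # absorbing: goal or collision
        A[i, i], b[i] = 1.0, 1.0 if (x1, x2) == (0, 0) else 0.0
        continue
    A[i, i] -= beta                         # beta * U(n) with U(n) = -1
    for p, dx in [(0, 1), (0, -1), (1, 1), (1, -1)]:
        x = [x1, x2]
        if x[p] == 0:                       # particle at the goal: stopped
            continue
        x[p] += dx
        if not 0 <= x[p] < M:
            continue
        A[i, idx[(min(x), max(x))]] += kappa
        A[i, i] -= kappa
Z = np.linalg.solve(A, b)
with np.errstate(divide="ignore"):
    V = np.log(Z.clip(min=0.0))             # collisions give V = -inf
print(V[idx[(0, 10)]], V[idx[(10, 11)]])    # compare with Fig. 2 (b)
```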

When the first particle has already reached the goal, $x_1 = 0 = x^*$, the problem is equivalent to the minimum time problem for the second particle alone. The value function then satisfies $V(x^*, x_2, \beta) = -\gamma(\beta)|x^* - x_2|$, which is consistent with Eq. (33) (Fig. 2 (b), upper panel), and the gradient of $V(x^*, x_2, \beta)$ is constant (Fig. 2 (c), upper panel), which results in a constant optimal control.

When the first particle is at $x_1 = 10$, the value $V(10, x_2, \beta)$ as a function of $x_2$ has a gentler slope than $\gamma(\beta)$ to avoid collision with the first particle (Fig. 2 (b), lower panel). Especially when $\beta$ is small, collision avoidance is more important than early arrival. In this case, the value gradient can be positive, and the second particle moves away from the first, as seen in Fig. 2 (c), lower panel. On the other hand, when $\beta$ is large, the exit time becomes more important than the collision. Thus, the value gradient is always negative except at the collision point, i.e., at $x_2 = 10$.

II.7 Controlling survival in birth and death processes

Figure 3: The results of the maximum exit time problem for Moran processes with $N = 100$. (a) Dependence of the value function $V(n,\beta)$ on $\beta$ at $n_A = N/2$ with varying $\gamma$, the selective advantage of $A$ over $B$; $\gamma = 1$ is the neutral case. Both the linear algebraic solution (colored points) and the analytical solution (colored curves) show divergence at $\beta = \beta_c$ (black dashed lines). (b) The relationship $\mathcal{E}(n,u)$ between the expected first exit time $u$ and the control cost, starting from the state $n_A = N/2$. The format of the plot is the same as in (a). (c) The relationship $\mathcal{E}(n,u)/u$ between the expected first exit time and the control cost rate, starting from the state $n_A = N/2$. The black dashed lines represent $\beta_c$ for each $\gamma$. The format of the plot is the same as in (a). (d, e) Dependence of the value function $V(n,\beta)$ (d) and its gradient $\overline{\nabla}_{s_1} V(n,\beta)$ (e) on the initial state $n_A$ for different values of $\beta$ and $\gamma$. In each panel, the values of $\beta$ are sampled at equal intervals between $0$ and $\beta_c$ and color-coded. (f) The optimally controlled stochastic trajectories from time $0$ to $T = 100$ for different $\beta$ and $\gamma$. In each panel, there are $100$ trajectories with initial condition $n_A(0) = 1$ for different values of $\beta$, which are color-coded. The parameters are $k^0_1 = \gamma$, $k^0_2 = 1.0$.

In the control of birth and death processes of cells or organisms without immigration, the extinction state $n = 0$ is a natural absorbing state. In the context of biodiversity conservation, one has to avoid extinction. Furthermore, a single species should not dominate the ecosystem. To address this population control problem, let us consider the following birth and death reactions involving two species $A$ and $B$ (Fig. 1 (b)):

A \rightarrow 2A, \quad A \rightarrow \emptyset, \qquad B \rightarrow 2B, \quad B \rightarrow \emptyset. (35)

Assume that every dying individual is replaced by a duplicated individual of either $A$ or $B$, and that the size $N$ of the population is constant. This is the continuous-time Moran process studied in population genetics [42, 43, 44]. Since birth and death reactions occur simultaneously, we can summarize the four reactions into two:

A + B \overset{k_1}{\rightarrow} 2A, \quad A + B \overset{k_2}{\rightarrow} 2B, (36)

where the propensity functions are given by $h_1(n) = h_2(n) = n_A n_B / N$. We denote the ratio of the rate coefficients by $\gamma := k^0_1 / k^0_2$; type $A$ has a selective advantage over type $B$ when $\gamma > 1$, and vice versa. The case $\gamma = 1$ is the neutral situation. In any case, the system eventually reaches one of the two extinction boundaries, $(n_A, n_B) = (N, 0)$ or $(0, N)$, due to random drift. We formulate the maintenance of population coexistence as the maximization of the exit time from the coexisting states:

\underset{k(\cdot)}{\text{maximize}} \quad \beta\, \mathbb{E}_{k(\cdot)}\left[T_{\mathrm{exit}}\right] - \mathcal{D}_{KL}\left[\mathbb{P}_{k(\cdot)} \| \mathbb{P}_{k^0}\right]. (37)

The value function, i.e., the cumulant-generating function of the extinction time, can be decomposed as

V(n,\beta) = \log \mathbb{E}_{k^0}\left[\exp(\beta T_{\mathrm{exit}}) \,\middle|\, n(0) = n\right] = \log\left(e^{K_0(n,\beta)} + e^{K_N(n,\beta)}\right), (38)

where $K_0(n,\beta)$ and $K_N(n,\beta)$ are the cumulant-generating functions of the first hitting times at $0$ and $N$, respectively. They have explicit expressions [45] in terms of the eigenvalues of submatrices of the transition rate matrix $M := (-\omega_{n,n'})_{n_A, n'_A \neq 0, N} \in \mathbb{R}^{(N-1) \times (N-1)}$ as follows:

K_0(n,\beta) := \log \mathbb{E}_{k^0}\left[\exp(\beta T_{\mathrm{exit}})\, \delta_{n_A(T_{\mathrm{exit}}), 0} \,\middle|\, n(0) = n\right] = \sum_{a=1}^{n_A} \log d^0(a) + \sum_{i=1}^{N - n_A - 1} \log(x_i(n_A) - \beta) - \sum_{k=1}^{N-1} \log(\lambda_k - \beta), (39)

and

K_N(n,\beta) := \log \mathbb{E}_{k^0}\left[\exp(\beta T_{\mathrm{exit}})\, \delta_{n_A(T_{\mathrm{exit}}), N} \,\middle|\, n(0) = n\right] = \sum_{a=n_A}^{N-1} \log b^0(a) + \sum_{j=1}^{n_A - 1} \log(y_j(n_A) - \beta) - \sum_{k=1}^{N-1} \log(\lambda_k - \beta), (40)

where $b^0(a) := k^0_1 h_1(a, N-a)$ and $d^0(a) := k^0_2 h_2(a, N-a)$. In these equations, $x_i(n_A) > 0$, $y_j(n_A) > 0$, and $\lambda_k > 0$ are the eigenvalues of the bottom-right submatrix of size $(N - n_A - 1)$, the top-left submatrix of size $(n_A - 1)$, and the full matrix $M$, respectively [45]. Due to the interlacing property [46], the smallest eigenvalue $\lambda_1$ of the full matrix is smaller than the smallest eigenvalues, $x_1(n)$ and $y_1(n)$, of the submatrices. Therefore, the value function diverges, $V(n,\beta) \to +\infty$, as $\beta \to \lambda_1$. The expected extinction time $\mathcal{U}^{\dagger}(n,\beta)$ under the optimal control also diverges as $\beta \to \lambda_1$ because it is the derivative of $V(n,\beta)$ with respect to $\beta$ (Eq. (10)). This critical value $\beta_c := \lambda_1$ is the same for all initial conditions $n$.

We numerically calculated the value function in two ways: by solving the linear algebraic Eq. (20) and by using the analytical formulas in Eqs. (38)–(40). The two solutions match exactly, and the value function $V(n,\beta)$ diverges as $\beta$ approaches the smallest eigenvalue, $\beta \to \beta_c = \lambda_1$ (Fig. 3 (a)). The cost–exit time tradeoff curves $\mathcal{E}(n,u)$ in Fig. 3 (b) approach lines with slope $\beta_c$ in the long exit time regime. Thus, the cost rate per unit time, $\mathcal{E}(n,u)/u$, is bounded above by $\beta_c$, as shown in Fig. 3 (c). This result indicates that a finite control cost per unit time is sufficient, on average, to prevent ultimate extinction. As $\gamma = k^0_1/k^0_2$ increases, extinction tends to happen earlier, resulting in an increased control cost rate $\beta_c$ for preventing extinction.
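The critical value $\beta_c = \lambda_1$ itself requires only the smallest eigenvalue of a tridiagonal matrix. A minimal sketch (the value of $\gamma$ and the indexing convention are our own choices):

```python
import numpy as np

# beta_c = lambda_1: smallest eigenvalue of M = (-omega_{n,n'}) restricted
# to the coexisting states n_A = 1, ..., N-1 of the Moran model, Eq. (36).
N, gamma = 100, 1.2
k1, k2 = gamma, 1.0                    # k1^0 = gamma, k2^0 = 1 as in Fig. 3
nA = np.arange(1, N)
h = nA * (N - nA) / N                  # h_1 = h_2 = n_A n_B / N
Mmat = np.diag((k1 + k2) * h)          # diagonal: total outflow rates
Mmat -= np.diag(k1 * h[:-1], k=+1)     # n_A -> n_A + 1 at rate k1 h(n_A)
Mmat -= np.diag(k2 * h[1:], k=-1)      # n_A -> n_A - 1 at rate k2 h(n_A)
beta_c = np.min(np.linalg.eigvals(Mmat).real)
print(beta_c)
```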

As a function of $n_A$, the value function $V(n,\beta)$ for each $\beta$ and $\gamma$ has a single peak, which we denote by $n_A^*$ (Fig. 3 (d)). The optimal control steers the system to stay around $n_A^*$. As $\beta$ increases, the value gradient becomes steeper around $n_A^*$ (Fig. 3 (e)), so that the controlled trajectories do not leave it even after a long time (Fig. 3 (f)).

When $\beta$ is close to $\beta_c$, the width of the peak around $n_A^*$ decreases as $\gamma$ increases, and a region with a shallow gradient, $\overline{\nabla}_{s_1} V(n,\beta) \approx 0$, emerges. In this region, the optimal control $k^{\dagger}_r(n) = k^0_r \exp(\overline{\nabla}_{s_r} V(n,\beta))$ becomes close to the uncontrolled rate $k^0_r$. The emergence of the zero-gradient region indicates that the optimal control strategy switches between ON and OFF modes depending on the current state.

The OFF-mode region appears when $\gamma$ is high and $N$ is moderately large (see Fig. S1 in Sec. E of the Supplementary Material [27]), implying that the OFF mode is attributable to the difficulty of controlling an exponentially growing population. If both $\gamma$ and $N$ are large, the uncontrolled system has a stronger and less stochastic driving force towards $n_A = N$ as $n_A$ increases. Thus, the optimal control keeps the population away from $n_A = N$ by forcing it close to $0$, which is the ON mode around the peak $n_A^*$. Once a large fluctuation drives the population away from the peak, the additional cost of bringing it back does not pay off in terms of the success rate, leading to the OFF mode, i.e., doing nothing, in the intermediate region. Finally, the control turns ON again near $n_A = N$ to hold out there. Such a spontaneous emergence of hierarchical control may not be identified within the LQG approximation, highlighting the importance of accounting for the unique properties of RNs.

II.8 Controlling epidemic outbreak

Figure 4: The solution of the minimum total infection problem for the stochastic SIR model for different values of $R_0 \in \{2, 4, 8\}$. (a) Heat maps of the value function $V(n,1)$ plotted on the $n_S$–$n_I$ plane with $\beta = 1$ for varying $R_0$. Light color represents a large value. (b) The gradient $\overline{\nabla}_{s_1} V(n,1)$ of the value function in the infection direction $s_1$, which determines the optimally controlled infection rate as $k^{\dagger}_1(n) = k^0_1 \exp(\overline{\nabla}_{s_1} V(n,1))$. (c) The value gradient $\overline{\nabla}_{s_2} V(n,1)$ in the recovery direction $s_2$, which determines the optimally controlled recovery rate as $k^{\dagger}_2(n) = k^0_2 \exp(\overline{\nabla}_{s_2} V(n,1))$. (d) The value function $V(n,1)$ on the line $n_S = 20$. (e) The gradient of the value function $\overline{\nabla}_{s_2} V(n,1)$. Parameters are $N = 50$, $\beta = 1$, $k^0_2 = 0.01$, $k^0_1 = R_0 k^0_2$.

Lastly, we apply our framework to epidemic problems. We use the stochastic SIR model (Fig. 1 (c)), in which the population is divided into three classes: susceptible ($S$), infected ($I$), and recovered ($R$), i.e., $X = \{S, I, R\}$. The uncontrolled process is described by the following infection and recovery reactions:

S + I \overset{k_1}{\longrightarrow} 2I, \quad I \overset{k_2}{\longrightarrow} R, (41)

and we assume mass-action-type kinetics, i.e., $h_1(n) = N^{-1} n_S n_I$ and $h_2(n) = n_I$. The total population size $N := n_S(t) + n_I(t) + n_R(t)$ is a conserved quantity of the system. This model has absorbing states $\mathcal{N}_{\mathrm{abs}} = \{n \in \mathcal{N} \,|\, n_I = 0\}$, and any stochastic solution $n(t)$ eventually reaches one of these states as $t \to \infty$. When the ratio $R_0 := k^0_1 / k^0_2$ between the reaction rate coefficients is high, the infection spreads rapidly in the population, and the state $n(T_{\mathrm{exit}})$ at the terminal time tends to have a small number $n_S(T_{\mathrm{exit}})$ of susceptible and a large number $n_R(T_{\mathrm{exit}})$ of recovered people. The number of recovered people at the end equals the total number of infections during the epidemic, which is known as the size of the epidemic [47].

The goal of control is then to minimize the size of the epidemic or, equivalently, to maximize the number of susceptible people at the end of the epidemic. Let us formulate it as the first exit problem with $U_{\mathrm{exit}}(n) = n_S$ and $U(n) \equiv 0$, i.e.,

\underset{k(\cdot)}{\text{maximize}} \quad \beta\, \mathbb{E}_{k(\cdot)}\left[n_S(T_{\mathrm{exit}})\right] - \mathcal{D}_{KL}\left[\mathbb{P}_{k(\cdot)} \| \mathbb{P}_{k^0}\right]. (42)

We numerically calculated the value function and the optimal reaction rates using Eqs. (20) and (13) for $N = 50$. The results are shown in Fig. 4. The value decreases as the number of susceptible people decreases and as the number of infected people increases (Fig. 4 (a)). The optimal infection and recovery rates plotted in Fig. 4 (b) and (c) indicate that strong control, reducing the infection rate and boosting the recovery rate, is encouraged when the number of susceptibles is large and the number of infected is small. However, if many people are already infected, strong control is no longer encouraged. Instead, almost no control of the infection and recovery rates becomes optimal, as indicated by the white regions in Fig. 4 (b) and (c). This means that almost the entire population will eventually become infected in this situation, no matter how optimally the rates are controlled, and that further investment of control cost is not worth the potential gains.

In particular, we observe a sharp transition between strong control and no control when $R_0$ is high. Figures 4 (d) and (e) show the value function $V(n,1)$ and its gradient $\overline{\nabla}_{s_2} V(n,1)$ on the line $n_S = 20$, which determines the optimal recovery rate $k^{\dagger}_2(n) = k^0_2 \exp(\overline{\nabla}_{s_2} V(n,1))$. The value function is approximately piecewise linear, $V(n,1) \approx \max\{0,\, a_0 n_S + a_1 n_I + a_2\}$, so the gradients and the optimal rates are approximately piecewise constant. This transition leads to the mode switching of the optimal control, similar to that in the problem of maintaining diversity.
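A compact sketch reproducing this computation (parameters as in Fig. 4 with $R_0 = 4$; the state encoding and the probed state are our own choices):

```python
import numpy as np

# SIR first-exit problem, Eq. (20): U(n) = 0 in the interior and
# Z = exp(beta * n_S) on the absorbing states {n_I = 0}.
N, beta, k2 = 50, 1.0, 0.01
k1 = 4 * k2                                       # R0 = k1 / k2 = 4
states = [(S, I) for S in range(N + 1) for I in range(N + 1 - S)]
idx = {s: i for i, s in enumerate(states)}

A, b = np.zeros((len(states), len(states))), np.zeros(len(states))
for (S, I) in states:
    i = idx[(S, I)]
    if I == 0:                                    # absorbing: epidemic is over
        A[i, i], b[i] = 1.0, np.exp(beta * S)
        continue
    for rate, dst in [(k1 * S * I / N, (S - 1, I + 1)),   # infection
                      (k2 * I, (S, I - 1))]:              # recovery
        if rate > 0:
            A[i, idx[dst]] += rate
            A[i, i] -= rate
Z = np.linalg.solve(A, b)

S, I = 20, 5                                      # optimal rates via Eq. (13)
k1_opt = k1 * Z[idx[(S - 1, I + 1)]] / Z[idx[(S, I)]]
k2_opt = k2 * Z[idx[(S, I - 1)]] / Z[idx[(S, I)]]
print(k1_opt / k1, k2_opt / k2)                   # compare with Fig. 4 (b, c)
```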

III Discussion

In this paper, we proposed a new framework for formulating optimal control problems of stochastic reaction networks. We showed that the Kullback–Leibler divergence has sound properties as the control cost of RNs, taming the HJB equation via its linearization through the Cole–Hopf transformation. It also yields a computationally efficient expression of the optimal solutions. We demonstrated the effectiveness of our framework for three classes of biological control problems with absorbing states.

There are several potential directions worth investigating. The first is risk-sensitive problems [48, 49], where not only the expectation of performance but also the variance and higher-order moments matter. For instance, as the concentration of intracellular molecules inevitably fluctuates, it is vital to suppress the fluctuation and variability when robust homeostasis is required [50]. As studied for diffusion processes [51], the current optimal control framework can be extended to incorporate risk sensitivity.

Second, more realistic reaction network models can have many types or large population sizes, such as complex ecological systems and stage-structured epidemic models [52, 53, 54, 55]. The optimal control problem for large models is accompanied by high-dimensional equations. Despite the linearity of the equation, solving it is numerically challenging. The development of fast and scalable numerical algorithms is essential for addressing large-scale problems. Sampling-based techniques [24, 56] would be efficient for these high-dimensional settings.

Finally, it would be desirable if we could modify the control cost function more flexibly depending on the setting of the actual control problem. For instance, if some components of the reaction rate coefficients are constrained to lie in a certain range, or if some reactions incur a much higher control cost than others, the control cost function has to deviate from the Kullback–Leibler divergence, resulting in the loss of the efficient computation of optimal solutions via the Cole–Hopf transformation. This limitation is analogous to the inverse proportionality condition between the weight of the control cost and the noise strength that is required to solve optimal control problems for diffusion processes efficiently [24, 26]. For stochastic reaction networks, the noise strength is related to the propensity function $h_r(n)$, which should be used as the weight of the control cost. An iterative method with local approximation, as in [57, 58], might provide a way to overcome this limitation.

Acknowledgements.
We thank Simon Schnyder, Louis-Pierre Chaintron, and Kenji Kashima for their helpful discussions. This research is supported by JST CREST JPMJCR2011 and JPMJCR1927, and JSPS KAKENHI Grant Numbers 21J21415, 24KJ0090, 24H02148.

References