This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

\affils

Department of Applied Mathematics, University of California, Merced, 5200 N. Lake Rd, Merced, CA 95343 USA
1E-mail: mreeves3@ucmerced.edu
2E-mail: hbhat@ucmerced.edu

Neural Continuous-Time Markov Models

Majerle Reeves∗1 and Harish S. Bhat2{}^{\ast 2}\dagger
Abstract

Continuous-time Markov chains are used to model stochastic systems where transitions can occur at irregular times, e.g., birth-death processes, chemical reaction networks, population dynamics, and gene regulatory networks. We develop a method to learn a continuous-time Markov chain’s transition rate functions from fully observed time series. In contrast with existing methods, our method allows for transition rates to depend nonlinearly on both state variables and external covariates. The Gillespie algorithm is used to generate trajectories of stochastic systems where propensity functions (reaction rates) are known. Our method can be viewed as the inverse: given trajectories of a stochastic reaction network, we generate estimates of the propensity functions. While previous methods used linear or log-linear methods to link transition rates to covariates, we use neural networks, increasing the capacity and potential accuracy of learned models. In the chemical context, this enables the method to learn propensity functions from non-mass-action kinetics. We test our method with synthetic data generated from a variety of systems with known transition rates. We show that our method learns these transition rates with considerably more accuracy than log-linear methods, in terms of mean absolute error between ground truth and predicted transition rates. We also demonstrate an application of our methods to open-loop control of a continuous-time Markov chain.

keywords:
System Identification, Continuous-Time Markov Chains, Neural Networks, Parameter Estimation

1 Introduction

Continuous-time Markov chains (CTMCs) are stochastic processes that model a variety of different systems, e.g., chemical reaction networks, population dynamics, gene regulatory networks, and infectious disease dynamics. CTMC sample paths are piecewise-constant functions that take values in the state space; for a given sample path, each jump discontinuity corresponds to a time at which the system transitions from one state to the next. On a discrete state space Ω\Omega, a CTMC is completely characterized by its transition rates, an array Λ={λi,j}i,jΩ\Lambda=\{\lambda_{i,j}\}_{i,j\in\Omega} that describes the rate at which the system transitions from one state to another. Given an initial condition, which can be expressed as a distribution over the possible states, one can use the transition rates to generate sample paths of the system. However, the inverse problem is not as straightforward. Learning the transition rates from sample path data is an active field of research.

Our focus in this work is on a particular class of CTMCs that model the stochastic dynamics of reaction networks [1]. For this class of CTMCs, there has been a wealth of prior work on estimation and identification. In the structure identification problem, the goal is to estimate the reaction network itself. Here we see approaches that focus on graphical modeling [2], sparse identification [3, 4, 5], and subnetwork reconstruction [6].

Our work has more in common with the parameter estimation problem, where all reactions and corresponding transition rates are known up to a finite set of parameters, which one seeks to estimate from data. Various prior approaches can then be distinguished based on methodology and assumptions regarding the data. In situations where one assumes access only to steady-state observations, we find both Bayesian [7] and moment-based methods [8]. In the case where we assume access to inexact measurements of all chemical species, we see moment-based methods [8], variational Bayes methods [9], Gaussian adaptation methods [10], and expectation maximization methods, either for general systems [11], or for the particular case of birth-death processes [12]. There also exist filtering and Bayesian methods to estimate states and parameters from exact measurements of a subset of all chemical species in a given system [13], and adjoint methods for reaction-diffusion systems [14].

In this paper, we model CTMC transition rates Λ\Lambda with functions that we parameterize and estimate using neural networks. We assume that all reactions are known, and that trajectories are fully observed. Unlike prior parameter estimation studies we have seen, we assume nothing regarding the functional form of the transition rates (or, equivalently, propensity functions). The neural CTMC (N-CTMC) method we describe allows for transition rates with arbitrary kinetics and dependence on external covariates, such as temperature.

Recent work has introduced covariate dependence via generalized linear models (GLM): for Markov chains in discrete/continuous-time, transition probabilities/rates are modeled as a softmax/exponential function of a linear function of the covariates [15, 16]. In this paper, on tests with synthetic data, we show that the N-CTMC method can model kinetics that lie outside mass-action or Michaelis-Menten models. This goes beyond the capabilities of GLM models.

The likelihood function is often used to learn the transition rate matrix of a CTMC. The likelihood measures how probable a sample path is to occur given a transition rate matrix. To find CTMC transition rates that best fit one or more given sample paths, we solve for the transition rates that maximize the likelihood function. For stochastic reaction networks with even a small number of reactions, the state space Ω\Omega can be large or infinite, rendering the likelihood intractable. Strategies to deal with the intractable likelihood include truncating the state space [17], or using likelihood-free approaches such as approximate Bayesian computation [18, 19]. Truncating the state space introduces error, and does not allow for the full model to be learned.

Instead of attempting to estimate a large or infinite-dimensional matrix Λ\Lambda, N-CTMC uses the current state ii as one of the inputs to the neural network. The neural network then outputs a vector of transition rates, one for each reaction. In a setting with ss species of interest, every unique integer ss-tuple (A1,,As)(A_{1},\ldots,A_{s}) denotes a state. If each Ai[0,N]A_{i}\in[0,N], the state space has maximal dimension NsN^{s}. If we were to estimate each entry of Λ\Lambda individually, we would have to estimate O(N2s)O(N^{2s}) constant parameters. In contrast, the number of parameters in the neural network model is independent of NN.

After deriving and explaining the method, we demonstrate its capabilities using tests with synthetic data. In each test, we start with sample paths generated from a known reaction network. We demonstrate that after training on these sample paths, N-CTMC succeeds in learning transition rates in close agreeement with the ground truth. We explain and show how N-CTMC significantly reduces the computational cost of learning transition rates. As we increase the sizes of our training sets, we show that the error of N-CTMC’s estimates decreases. Finally, we show that N-CTMC significantly outperforms a GLM of a similar construction.

2 Background

2.1 Stochastic Reaction Networks

We first review an established framework for stochastic reaction networks. In general, a reaction network with ss species, rr reactions, reaction rates kk, and stoichiometric coefficients ϕ\phi and ψ\psi can be written as:

ϕ11S1++ϕs1Ssk1ψ11S1++ψs1Ssϕ12S1++ϕs2Ssk2ψ12S1++ψs2Ssϕ1rS1++ϕsrSskrψ1rS1++ψsrSs\begin{split}\phi_{11}S_{1}+\ldots+\phi_{s1}S_{s}&\xrightarrow{k_{1}}\psi_{11}S_{1}+\ldots+\psi_{s1}S_{s}\\ \phi_{12}S_{1}+\ldots+\phi_{s2}S_{s}&\xrightarrow{k_{2}}\psi_{12}S_{1}+\ldots+\psi_{s2}S_{s}\\ &\!\quad\vdots\\ \phi_{1r}S_{1}+\ldots+\phi_{sr}S_{s}&\xrightarrow{k_{r}}\psi_{1r}S_{1}+\ldots+\psi_{sr}S_{s}\end{split} (RNRN)

The state of the system at time tt is described by the state vector 𝐒(t){\mathbf{S}}(t), where Si(t)S_{i}(t) is a non-negative integer describing the count of species ii and 𝚫𝐒ρ{\boldsymbol{\Delta}\mathbf{S}}_{\rho} denotes the change in the state after reaction ρ\rho:

𝐒(t)=[S1(t)S2(t)Ss(t)],𝚫𝐒ρ=[ψ1ρϕ1ρψ2ρϕ2ρψsρϕsρ]\mathbf{S}(t)=\begin{bmatrix}S_{1}(t)\\ S_{2}(t)\\ \vdots\\ S_{s}(t)\end{bmatrix},\quad{\boldsymbol{\Delta}\mathbf{S}}_{\rho}=\begin{bmatrix}\psi_{1\rho}-\phi_{1\rho}\\ \psi_{2\rho}-\phi_{2\rho}\\ \vdots\\ \psi_{s\rho}-\phi_{s\rho}\end{bmatrix}

Propensity functions describe how often reactions occur and are functions of the reaction rate and the current species count: αi=g(ki,𝐒(t))\alpha_{i}=g(k_{i},{\mathbf{S}}(t)). In mass action kinetics, αi\alpha_{i} is often a polynomial function of 𝐒(t){\mathbf{S}}(t). The time until the next reaction τ\tau is calculated using α=i=1rαi\alpha=\sum_{i=1}^{r}\alpha_{i}, τ=(1/α)log(1/v1)\tau=(1/\alpha)\log(1/v_{1}) where v1Uv_{1}\sim U, i.e., v1v_{1} is sampled uniformly on the interval [0,1][0,1]. The next reaction is decided by sampling again, v2Uv_{2}\sim U, and then determining for which ρ{1,2,,r}\rho\in\{1,2,\ldots,r\} it is true that

1αi=1ρ1αiv2<1αi=1ραi.\frac{1}{\alpha}\sum_{i=1}^{\rho-1}\alpha_{i}\leq v_{2}<\frac{1}{\alpha}\sum_{i=1}^{\rho}\alpha_{i}.

When ρ=1\rho=1, we define the summation on the left to be 0. Once we have determined both τ\tau and ρ\rho, we update the state vector via 𝐒(t+τ)=𝐒(t)+𝚫𝐒ρ{\mathbf{S}}(t+\tau)={\mathbf{S}}(t)+{\boldsymbol{\Delta}\mathbf{S}}_{\rho}.

2.2 Maximum Likelihood Estimation

Before proceeding, we review the maximum likelihood estimate (MLE) of CTMC transition rates. Let Ω\Omega denote the state space for the CTMC. Assume we have a sequence of state observations OiΩO_{i}\in\Omega at times tit_{i}:

(O0,t0),(O1,t1),(O2,t2),(ON,tN).(O_{0},t_{0}),(O_{1},t_{1}),(O_{2},t_{2}),...(O_{N},t_{N}).

For states l,mΩl,m\in\Omega such that lml\neq m, let λlm\lambda_{lm} denote the rate at which the CTMC transitions from state ll to state mm. The λlm\lambda_{lm} are the parameters to be estimated here. Let λll=mΩ,mlλlm\lambda_{ll}=\sum_{m\in\Omega,m\neq l}\lambda_{lm}.

Given transition rates λ\lambda, the likelihood of observing the short sequence (Oi,ti),(Oi+1,ti+1)(O_{i},t_{i}),(O_{i+1},t_{i+1}) can be decomposed into two pieces. The first piece stems from the time Ti=ti+1tiT_{i}=t_{i+1}-t_{i} spent in state OiO_{i} before transitioning; this sojourn time has an Exp(λOiOi)\mathrm{Exp}(\lambda_{O_{i}O_{i}}) distribution. The second piece is the probability of transitioning from state OiO_{i} to another state Oi+1O_{i+1}: this equals λOiOi+1/λOiOi\lambda_{O_{i}O_{i+1}}/\lambda_{O_{i}O_{i}}. Carrying this out across the entire observation sequence, we obtain the likelihood

L(λ)=i=0N1λOiOieλOiOiTiλOiOi+1λOiOi.L(\lambda)=\prod_{i=0}^{N-1}\lambda_{O_{i}O_{i}}e^{-\lambda_{O_{i}O_{i}}T_{i}}\frac{\lambda_{O_{i}O_{i+1}}}{\lambda_{O_{i}O_{i}}}. (1)

This likelihood can be rewritten in state space:

L(λ)=pΩeλppWpqΩ,qpλpqNpq\displaystyle{L}(\lambda)=\prod_{p\in\Omega}e^{-\lambda_{pp}W_{p}}\prod_{q\in\Omega,q\neq p}\lambda_{pq}^{N_{pq}}
Wp=i=0N1TiIOi=p,Npq=i=0N1IOi=pIOi+1=q.\displaystyle W_{p}=\sum_{i=0}^{N-1}T_{i}I_{O_{i}=p},\quad N_{pq}=\sum_{i=0}^{N-1}I_{O_{i}=p}I_{O_{i+1}=q}.

Here WpW_{p} is the total time spent in state pp, and NpqN_{pq} is the total number of transitions pqp\rightarrow q. Because log\log is monotonic, we can maximize L(λ)L(\lambda) by maximizing

logL(λ)=pΩWpqΩ,qpλpq+p,qΩ,qpNpqlogλpq.\log{L}(\lambda)=-\sum_{p\in\Omega}W_{p}\!\!\sum_{q\in\Omega,q\neq p}\!\!\lambda_{pq}\,+\!\!\!\!\!\!\sum_{p,q\in\Omega,q\neq p}\!\!\!\!\!\!N_{pq}\log\lambda_{pq}.

Differentiating with respect to λab\lambda_{ab} and setting the derivative equal to zero, we obtain the MLE

λ^ab=Nab/Wa,λ^aa=bΩ,baλ^ab=(Wa/bΩ,baNab)1.\begin{split}\widehat{\lambda}_{ab}&=N_{ab}/W_{a},\\ \widehat{\lambda}_{aa}&=\sum_{b\in\Omega,b\neq a}\widehat{\lambda}_{ab}=\left({W_{a}}/{\sum_{b\in\Omega,b\neq a}N_{ab}}\right)^{-1}.\end{split} (2)

We see that (λ^aa)1(\widehat{\lambda}_{aa})^{-1} is the mean time spent in state aa.

3 Methods and Metrics

We now detail the neural CTMC (N-CTMC) method. For a stochastic reaction network (RNRN) as described in Section 2.1, assume we have complete observations: 𝒪=(𝐒(t0),𝐂0,t0),(𝐒(tN),𝐂N,tN){\mathcal{O}}=({\mathbf{S}}(t_{0}),{\mathbf{C}}_{0},t_{0}),\ldots({\mathbf{S}}(t_{N}),{\mathbf{C}}_{N},t_{N}). Specifically, at time tit_{i}, we have 𝐒(ti)s{\mathbf{S}}(t_{i})\in{\mathcal{R}}^{s}, a vector of counts (of all species), and 𝐂ic{\mathbf{C}}_{i}\in{\mathcal{R}}^{c}, the associated covariates. Let 𝐱i=[𝐒(ti)T,𝐂iT]s+c{\mathbf{x}}_{i}=[{\mathbf{S}}(t_{i})^{T},{\mathbf{C}}_{i}^{T}]\in{\mathcal{R}}^{s+c}.

We look to model the propensity functions 𝜶(𝐱)r{\boldsymbol{\alpha}}({\mathbf{x}})\in{\mathcal{R}}^{r}, where rr is the number of reactions in the reaction network (RNRN). In this formulation, we assume no knowledge of the reaction rates kk.

We now formulate the likelihood function for the sequence of 3-tuples 𝒪{\mathcal{O}}. In this likelihood function, we replace the constant transition rate matrix Λ\Lambda with a vector of propensity function 𝜶(𝐱){\boldsymbol{\alpha}}({\mathbf{x}}). To model 𝜶(𝐱){\boldsymbol{\alpha}}({\mathbf{x}}) we use a neural network, a universal function approximator, with parameters θ\theta. The network takes as input 𝐱{\mathbf{x}} and outputs the propensity 𝜶(𝐱;θ){\boldsymbol{\alpha}}({\mathbf{x}};\theta). In contrast with Section 2.1, 𝜶(𝐱;θ){\boldsymbol{\alpha}}({\mathbf{x}};\theta) depends not only on the state 𝐒{\mathbf{S}} but also on the associated covariates 𝐂{\mathbf{C}}. Taking the negative log\log of (1) gives

logL(λ)=i=0N1λOiOiTilogλOiOi+1.-\log L(\lambda)=\sum_{i=0}^{N-1}\lambda_{O_{i}O_{i}}T_{i}-\log\lambda_{O_{i}O_{i+1}}. (3)

In the data 𝒪{\mathcal{O}}, the transition OiOi+1O_{i}\to O_{i+1} equates to the change in count vector 𝐒(ti)𝐒(ti+1){\mathbf{S}}(t_{i})\rightarrow{\mathbf{S}}(t_{i+1}); let ρiRN\rho_{i}\in RN denote the reaction to which this change corresponds. Then we can replace λOiOi+1\lambda_{O_{i}O_{i+1}} by the ρi\rho_{i}-th element of the neural network’s output, which we denote by α(𝐱i;θ)ρi\alpha({\mathbf{x}}_{i};\theta)_{\rho_{i}}. Similarly, we can replace λOiOi\lambda_{O_{i}O_{i}}, the rate of leaving state OiO_{i}, by j=1rα(𝐱i;θ)j\sum_{j=1}^{r}\alpha({\mathbf{x}}_{i};\theta)_{j}, the sum of the propensity functions for each reaction in (RNRN). Rewritten in terms of 𝐱{\mathbf{x}} and 𝜶(𝐱;θ){\boldsymbol{\alpha}}({\mathbf{x}};\theta), the negative log likelihood (3) becomes

logL(θ)=i=0N1(Tij=1rα(𝐱i;θ)j)logα(𝐱i;θ)ρi.-\log L(\theta)=\sum_{i=0}^{N-1}\Big{(}T_{i}\sum_{j=1}^{r}\alpha({\mathbf{x}}_{i};\theta)_{j}\Big{)}-\log\alpha({\mathbf{x}}_{i};\theta)_{\rho_{i}}.

Let LρL^{\rho} be the number of times reaction ρ\rho occurs. Suppose we collect all row vectors 𝐱i{\mathbf{x}}_{i} that correspond to states just before reaction ρ\rho occurs. We stack these vertically into an Lρ×(s+c)L^{\rho}\times(s+c) matrix XρX^{\rho}. Similarly, let 𝐓ρLρ{\mathbf{T}}^{\rho}\in{\mathcal{R}}^{\,L^{\rho}} be a vector of time spent in state before reaction ρ\rho occurs. With this, we rewrite the above negative log likelihood as

logL(θ)=ρ=1rl=1LρTlρ(j=1rα(Xlρ;θ)j)logα(Xlρ;θ)ρ.-\log L(\theta)=\sum_{\rho=1}^{r}\sum_{l=1}^{L^{\rho}}{T}^{\rho}_{l}\Big{(}\sum_{j=1}^{r}\alpha({X}^{\rho}_{l};\theta)_{j}\Big{)}\\ -\log\alpha({X}^{\rho}_{l};\theta)_{\rho}. (4)

To train the N-CTMC model, we apply gradient descent to minimize this negative log likelihood. Note that the internal sum over LρL^{\rho} does not require a for loop in our implementation. With neural network input vectorization, we can pass all of XρX^{\rho} into 𝜶{\boldsymbol{\alpha}} at once. The ll-th row of the output will be identical to the output we would have obtained had we passed in only XlρX^{\rho}_{l}, i.e., α(Xρ;θ)lj=α(Xlρ;θ)j\alpha({X}^{\rho};\theta)_{lj}=\alpha({X}^{\rho}_{l};\theta)_{j}. Using this, the internal sum can be accomplished entirely with matrix algebra, which is highly optimized in modern neural network frameworks. The only remaining loop corresponds to the sum over ρ\rho from 11 to rr; this should be contrasted with sums over O(|Ω|2)O(|\Omega|^{2}) elements that would have resulted from naïve fusion of neural networks with the methods from Section 2.2. Together with automatic differentiation, our formulation leads to more efficient computation of θlogL(θ)\nabla_{\theta}\log L(\theta) and resulting minimization of the loss (4).

Note that the above N-CTMC approach learns a model 𝜶(𝐱;θ){\boldsymbol{\alpha}}({\mathbf{x}};\theta) without direct observations of ground truth propensity functions. We observe the propensities only indirectly through their effect on the system, as evidenced in the data 𝒪{\mathcal{O}}. This varies from standard neural network training approaches where one starts with ground truth observations of a response variable and then minimizes the mean squared error beween predicted and true values of the response.

Here we briefly sketch how one might prove that neural networks approximations can converge to true propensity functions. For brevity, let 𝜶=𝜶(𝐱){\boldsymbol{\alpha}}={\boldsymbol{\alpha}}({\mathbf{x}}). Let 𝜶^θ\widehat{\boldsymbol{\alpha}}^{\theta} and 𝜶^N\widehat{\boldsymbol{\alpha}}^{N} denote propensity functions estimated by, respectively, a neural network and maximum likelihood estimation as in (2). More specifically, we begin by binning the covariate space c{\mathcal{R}}^{c}. For a given bin {\mathcal{B}}, using transitions 𝐒(ti)𝐒(ti+1){\mathbf{S}}(t_{i})\rightarrow{\mathbf{S}}(t_{i+1}) from 𝒪{\mathcal{O}} such that 𝐂i{\mathbf{C}}_{i}\in{\mathcal{B}}, we form MLE estimates as in (2). Now given a state 𝐒{\mathbf{S}} and covariates 𝐂{\mathbf{C}}, we find the bin corresponding to 𝐂{\mathbf{C}} and evaluate the MLE estimates (2) in that bin associated with all transitions of the form 𝐒𝐒{\mathbf{S}}\rightarrow{\mathbf{S}}^{\prime}. Organizing these estimates by reaction number, we form 𝜶^N\widehat{\boldsymbol{\alpha}}^{N}. Let us assume that MLE consistency results for covariate-free CTMCs [20] can be extended to include covariates, assuming conditions on the distribution of covariates 𝐂i{\mathbf{C}}_{i} in the data 𝒪{\mathcal{O}} and on the ergodicity of the CTMC, such that as NN\to\infty, we will have 𝜶^N𝜶0\|\widehat{\boldsymbol{\alpha}}^{N}-{\boldsymbol{\alpha}}\|_{\infty}\to 0 almost surely (a.s.). Next, treating 𝜶^N\widehat{\boldsymbol{\alpha}}^{N} as ground truth, we train a neural network with suitable activation functions, width WW and depth DD; as DD\to\infty, we have 𝜶^θ𝜶^N0\|\widehat{{\boldsymbol{\alpha}}}^{\theta}-\widehat{{\boldsymbol{\alpha}}}^{N}\|_{\infty}\to 0 [21]. Putting everything together, we obtain 𝜶θ𝜶0\|{\boldsymbol{\alpha}}^{\theta}-{\boldsymbol{\alpha}}\|_{\infty}\to 0 a.s. as both N,DN,D\to\infty.

To evaluate the N-CTMC method, we will use several metrics: the mean absolute error (MAE) and mean squared error (MSE) across each unique state in 𝒪{\mathcal{O}}, together with the weighted mean absolute error (W-MAE) and weighted mean squared error (W-MSE). The W-MAE and W-MSE examines the prevalence of each unique state in 𝒪{\mathcal{O}} and weights the contribution to the overall absolute error accordingly. This means that a state that is rarely visited in 𝒪{\mathcal{O}} will have less effect on the W-MAE and W-MSE than on the MAE and MSE.

4 Numerical Experiments

We examine three different applications of the N-CTMC: a birth-death process, Lotka-Volterra population dynamics, and a temperature-dependent chemical reaction network. Each application is chosen to test the N-CTMC model in a specific way.

In the birth-death process, we use a single covariate and let propensity functions depend solely on this covariate and not on the current population count. While this model lacks realism, it allows us to compare the N-CTMC to the counting-based MLE (2) for calculating rate parameters described in Section 2.2. To test the N-CTMC’s ability to capture arbitrary kinetics, we turn to a Lotka-Volterra population model. Here there are no covariates, but the propensity functions do not follow the law of mass action. The final reaction network we consider tests the N-CTMC’s ability to model arbitrary dependence on covariates. Here the system of reactions follows the law of mass action, but the propensities depend on temperature nonlinearly.

For each system, we generate trajectories using the Gillespie algorithm [22] and then use N-CTMC to learn propensity functions (equivalently, transition rates). We model all systems with an increasing amount of training data to show empirical model convergence.

Refer to caption
Figure 1: Sample trajectory from the birth-death process.

4.1 Birth-Death Process

We analyze a birth-death process where the birth and death rates do not depend on population size:

Reactions:

  • A\emptyset\rightarrow A

  • AA\rightarrow\emptyset

Propensity Functions:

  • α1(s)=2.1cos(2πs)\alpha_{1}(s)=2.1\textnormal{cos}(2\pi s)

  • α2(s)=2sin(2πs)\alpha_{2}(s)=2\textnormal{sin}(2\pi s)

Let tt denote time in days. Then s=(t/365.24) mod 1s=(t/365.24)\textnormal{ mod }1, rounded to the tenths decimal place to discretize the covariate space to s{0.0,0.1,0.2,,1.0}s\in\{0.0,0.1,0.2,\ldots,1.0\}. The birth and death rates α1\alpha_{1} and α2\alpha_{2} fluctuate annually, leading to sample trajectories such as that displayed in Figure 1. We generate training trajectories of increasing length, N=5×10N=5\times 10^{\ell} for =3,4,5\ell=3,4,5, all with an initial population of 5×1045\times 10^{4}, and train the N-CTMC model, yielding estimates 𝜶^θ(s)\widehat{{\boldsymbol{\alpha}}}^{\theta}(s).

While N-CTMC can handle continuous covariates, having a finite, discrete covariate space helps us compute transition rates using the counting-based MLE (2). Specifically, for each distinct value ss^{\prime} of the predictor, we find all transitions in the training set with predictor equal to ss^{\prime}. Using those transitions, we compute the counting-based MLEs (2), resulting in estimates 𝜶^jN(s)\widehat{{\boldsymbol{\alpha}}}^{N}_{j}(s^{\prime}).

In Table 1, we compute in turn the MAE between the true 𝜶{\boldsymbol{\alpha}} and predicted transition rates 𝜶^θ\widehat{{\boldsymbol{\alpha}}}^{\theta} and 𝜶^N\widehat{{\boldsymbol{\alpha}}}^{N} from the N-CTMC and counting approaches. This yields, respectively, the errors labeled as N-MAE and C-MAE. Carrying out the same comparison with MSE, we obtain the N-MSE and C-MSE errors.

Table 1 shows convergence of both the N-CTMC and counting-based estimated propensity functions to the true propensity functions, as the number of transitions increases. Table 1 also shows that the error rates of the N-CTMC and counting-based methods are similar. This demonstrates that N-CTMC converges to the ground truth at a comparable rate to the counting methods derived by analytically maximizing the likelihood. As the number of predictors increases, this calculation becomes more arduous. Finally, Table 1 gives evidence that, under appropriate hypotheses, a consistency result for covariate-dependent CTMCs should be possible.

Table 1: Model performance for the birth-death process as the amount of data increases. Here we compare the N-CTMC (N) error with the counting-based MLE (C) calculation detailed in the main text.
Transitions N-MAE C-MAE N-MSE C-MSE
5000 1.051011.05\cdot 10^{-1} 1.071011.07\cdot 10^{-1} 1.981021.98\cdot 10^{-2} 1.891021.89\cdot 10^{-2}
50000 4.021024.02\cdot 10^{-2} 3.971023.97\cdot 10^{-2} 2.941032.94\cdot 10^{-3} 2.951032.95\cdot 10^{-3}
500000 9.901039.90\cdot 10^{-3} 9.761039.76\cdot 10^{-3} 1.501041.50\cdot 10^{-4} 1.601041.60\cdot 10^{-4}
Refer to caption
Figure 2: We plot simulated population time series from the original birth-death process in black. We adjust the black curve, switching 1.5% of the births to deaths (uniformly at random), resulting in the blue curve. After training the N-CTMC on the blue curve, we generate a predicted time series, the red curve. The agreement between predicted and adjusted data is discussed in the main text.

We now illustrate a potential open-loop control application of the N-CTMC method. Assume we have a population of, for instance, a contagion/pest, that evolves stochastically according to the birth-death process described previously. Suppose we simulate the system forward in time from an initial population of 100100 and find that it leads to too large of a population, the black curve in Figure 2. What changes to the propensities are required to reduce population growth by a particular amount? To answer: we process the data, switching 1.5% of the births (uniformly at random) to deaths, resulting in the blue curve in Figure 2. We then train our N-CTMC model on the blue curve. With the estimated propensities, we then generate predictions starting from the same initial population of 100100—see the red curve in Figure 2. The close agreement indicates that N-CTMC succeeds in finding new propensity functions that, up to stochastic fluctuations, yield prescribed population dynamics.

4.2 Population Dynamics Modeling

Since neural networks are universal function approximators, N-CTMC should be able to model arbitrary transition rates. To test this, we consider a system that does not follow the law of mass action. We apply our method to a predator-prey system [23], which models dynamics with prey (A) and two predators (B, C). This system includes contingencies for A, B, and C entering and leaving the system, breeding between species A, species B hunting species A, and species C hunting species B.

Reactions:

  • BB\rightarrow\emptyset

  • B\emptyset\rightarrow B

  • AA\rightarrow\emptyset

  • A\emptyset\rightarrow A

  • CC\rightarrow\emptyset

  • C\emptyset\rightarrow C

  • A2AA\rightarrow 2A

  • A+B2BA+B\rightarrow 2B

  • B+C2CB+C\rightarrow 2C

Propensity Functions:

  • α1=Bk1\alpha_{1}=Bk_{1}

  • α2=k2\alpha_{2}=k_{2}

  • α3=Ak3/N\alpha_{3}=Ak_{3}/N

  • α4=k4\alpha_{4}=k_{4}

  • α5=Ck5\alpha_{5}=Ck_{5}

  • α6=k6\alpha_{6}=k_{6}

  • α7=Ak7/N\alpha_{7}=Ak_{7}/N

  • α8=ABk8\alpha_{8}=A\sqrt{B}k_{8}

  • α9=log(BC+1)k9\alpha_{9}=\textnormal{log}(BC+1)k_{9}

Note that reaction 4 and 7 have the same outcome on the state of the system, so the neural network cannot distinguish between the two reactions. Hence these reactions will be modeled jointly: α4,7=k4+Ak7/N\alpha_{4,7}=k_{4}+Ak_{7}/N. Additionally, reactions 8 and 9 model hunting and do not follow the law of mass action.

Refer to caption
Figure 3: One trajectory from the training data for the population dynamics system. We plot Species A on a separate axis from Species B and C due to the difference in scale.

We generate three training sets with, respectively, 100, 1000, and 10000 trajectories. Each trajectory has length N=150N=150, initial conditions (A(0),B(0),C(0))=(105,10,10)(A(0),B(0),C(0))=(10^{5},10,10) and parameters 𝐤=[0.5,1.7,3.9,4.6,2.7,1.9,6.1,2.4,1.5]{\mathbf{k}}=[0.5,1.7,3.9,4.6,2.7,1.9,6.1,2.4,1.5]. An example trajectory can be seen in Figure 3. Using this data, we train both the N-CTMC model and, for the largest training set, a CTMC plus GLM model in which transition rates depend linearly on inputs. The N-CTMC models propensities via the following neural network:

z1=σ1(xW1+b1),W13×128,b1128\displaystyle z_{1}=\sigma_{1}(xW_{1}+b_{1}),W_{1}\in{{\mathcal{R}}}^{3\times 128},b_{1}\in{{\mathcal{R}}}^{128}
zi+1=σ1(ziWi+1+bi+1),1i4\displaystyle z_{i+1}=\sigma_{1}(z_{i}W_{i+1}+b_{i+1}),\quad 1\leq i\leq 4
Wi+1128×128,bi+1128,1i4\displaystyle W_{i+1}\in{{\mathcal{R}}}^{128\times 128},b_{i+1}\in{{\mathcal{R}}}^{128},\quad 1\leq i\leq 4
𝜶^=σ2(z5W6+b6),W6128×9,b69.\displaystyle\widehat{{\boldsymbol{\alpha}}}=\sigma_{2}(z_{5}W_{6}+b_{6}),W_{6}\in{{\mathcal{R}}}^{128\times 9},b_{6}\in{{\mathcal{R}}}^{9}.

For σ1\sigma_{1}, we use the scaled exponential linear unit (selu) activation function [24]. We set σ2(x)=log(1+ex)\sigma_{2}(x)=\log(1+e^{x}), the softplus function. We chose softplus for σ2\sigma_{2} because the logα\log\alpha in the loss (4) forces α>0\alpha>0. During model formulation, we tested several activation functions that guarantee positive output, and softplus performed the best.

For each of the three training sets, we use the four error metrics described at the end of Section 3 to compare N-CTMC and GLM propensities against the true propensities. As we see in Table 2, for the N-CTMC model, as the training set size increases, all error metrics decrease. This implies empirical convergence to the true model. From the weighted MAE and MSE metrics, we see that N-CTMC learns the true propensity very well for states that the trajectories visit frequently. The unweighted MAE and MSE metrics show that it is more difficult to learn propensity functions for rarely visited states. This can be easily visualized in Figure 4 where predicted versus true propensity functions are plotted with intensity based on the prevalence of the state in the training data. When comparing the N-CTMC with the GLM, we see that the N-CTMC dramatically outperforms the GLM regardless of metric. Additionally, because the unweighted MAE and MSE are much larger for the GLM than for the N-CTMC, we conclude that the N-CTMC learns propensities for rarely visited states much better than the GLM.

Table 2: Model error for the population dynamics example. The first three rows show the error of the N-CTMC model as we increase the number of trajectories in the training set. For the largest training set, we also show GLM model errors. N-CTMC training is halted when Δloss<1010\Delta\text{loss}<10^{-10}. W denotes that the error metric is weighted. For the GLM model, training is halted when θ<0.01\|\nabla\theta\|<0.01.
Trajectories MAE W-MAE MSE W-MSE Epochs
100 0.304 2.451012.45\cdot 10^{-1} 0.244 1.531011.53\cdot 10^{-1} 12,900
1000 0.157 1.011011.01\cdot 10^{-1} 0.093 2.841022.84\cdot 10^{-2} 33,500
10000 0.099 3.821023.82\cdot 10^{-2} 0.051 5.131035.13\cdot 10^{-3} 89,800
10000 (GLM) 1.028 7.061017.06\cdot 10^{-1} 3.208 1.391001.39\cdot 10^{0} 2,300
Refer to caption
(a) Predicted vs known reaction rates for reactions 1,2,3,4-7 where opacity denotes prevalence in the training set.
Refer to caption
(b) Predicted vs known reaction rates for reactions 5,6,8,9 where opacity denotes prevalence in the training set.
Figure 4: Comparison of reaction rates for the population dynamics modeling example. When plotting by prevalence in the training set in (a) and (b), the predicted values that are more prevalent in the training set achieve near-perfect accuracy (black line).

For each of the three training sets, the number of unique states explored is 3767, 8380, and 13435, respectively. We see an increase in the state space size due to the stochastic nature of the trajectory generation: the more data generated, the more the state space is explored. Using counting-based likelihood-based approaches, we would need to estimate a transition rate for each distinct pair (i,j)(i,j) of states where both i,jΩi,j\in\Omega and |Ω|=13435|\Omega|=13435. Estimating each rate accurately would be highly inefficient and require a gargantuan amount of data. The N-CTMC model has 67,721 parameters, as compared with the 107,480 non-zero elements of the transition rate matrix for the system with 13,435 states. By sharing propensity function parameters across Ω\Omega, N-CTMC requires less data and less computational expense.

4.3 Chemical Reaction Network Modeling

In the third system, a modification of a system from [22], we model the effect of temperature on a chemical reaction network. Reaction rates increase as a function of temperature according to kj=𝒜jeEj/(RT)\displaystyle k_{j}={\mathcal{A}}_{j}e^{-E_{j}/(RT)}, where 𝒜j{\mathcal{A}}_{j} is the frequency factor, EjE_{j} is the activation energy, RR is the ideal gas constant, and TT is the temperature in Kelvin. This network has two species, A and B:

Reactions:

  • A+AA+A\rightarrow\emptyset

  • A+BA+B\rightarrow\emptyset

  • A\emptyset\rightarrow A

  • B\emptyset\rightarrow B

Propensity Functions:

  • α1=A(A1)k1\alpha_{1}=A(A-1)k_{1}

  • α2=ABk2\alpha_{2}=ABk_{2}

  • α3=k3\alpha_{3}=k_{3}

  • α4=k4\alpha_{4}=k_{4}

Note that 𝓐=[630000,770000,5380000,2240000]{\boldsymbol{\mathcal{A}}}=[630000,770000,5380000,2240000] and 𝐄=[39000,36000,40000,40000]{\mathbf{E}}=[39000,36000,40000,40000]. For each fixed choice of TT, we generate training sets with 2020, 6060, and 100100 trajectories, each with integer initial conditions A[0],B[0]A[0],B[0] drawn uniformly [0,4]\in[0,4]. This leads to [100,300,500][100,300,500] total trajectories, as trajectories are generated at 5 different temperatures: T=[271,272,273,274,275]T=[271,272,273,274,275]. A sample trajectory at 273 K can be seen in Figure 5. Using this synthetic data, we train both a N-CTMC model and (using the largest training set only) a CTMC plus GLM model in which propensities depend linearly on inputs.

Refer to caption
Figure 5: One trajectory (at the temperature 273 K) from the chemical reaction network training data.

In the N-CTMC, we use a convolutional neural network. We expand the inputs with a 96-unit dense layer, reshape into a 3×323\times 32 matrix, and perform 1D convolutions across the rows with 10 output channels and a 1×41\times 4 kernel according to G[m]=(fh)[m]=jh[j]f[mj]G[m]=(f*h)[m]=\sum_{j}h[j]f[m-j] where ff is the input and hh is the kernel. We then flatten the results and apply two linear layers with 32 nodes each, followed by a linear layer with 4 nodes. We use selu activation functions for all layers except the final layer where we use softplus to ensure positive output. Including a convolutional layer improved model performance over a dense neural network. This model has 3,854 parameters, as compared to 5,820 parameters required in a naïve counting-based approach with 291 unique states and 4 reactions at 5 different temperatures. In this model, we implemented batch training, updating model parameters on each trajectory in turn, instead of on the whole data set at once. This improved model performance, possibly by allowing less frequently visited state spaces to have more impact on the overall likelihood gradients.

In Table 3, we quantify N-CTMC and GLM errors using the metrics described at the end of Section 3. As we increase the number of trajectories, N-CTMC errors decrease, implying empirical convergence to ground truth propensity functions. In Figure 6, we plot N-CTMC predicted versus true transition rates, with varying intensities based on their prevalence in the training data. Based on the W-MAE and W-MSE errors in Table 3 and the results in Figure 6, our conclusions regarding the strong performance of N-CTMC on frequently visited states are the same as in the discussion of results for the population dynamics system above. Across all metrics, the N-CTMC outperforms the GLM, and especially in learning rare state transition rates.

Table 3: Model performance for the temperature-dependent chemical reaction network. The first three rows show the error of the N-CTMC model as we increase the number of trajectories in the training set. For the largest training set, we also show GLM model errors. Training is halted when Δloss\Delta\text{loss} reaches a minimum. W denotes that the error metric is weighted. For the GLM model, training is halted when θ<0.01\|\nabla\theta\|<0.01.
Trajectories MAE W-MAE MSE W-MSE Epochs
100 0.024 3.131033.13\cdot 10^{-3} 0.004 7.651057.65\cdot 10^{-5} 7,700
300 0.020 1.881031.88\cdot 10^{-3} 0.003 2.611052.61\cdot 10^{-5} 6,900
500 0.015 1.451031.45\cdot 10^{-3} 0.002 1.621051.62\cdot 10^{-5} 4,000
500 (GLM) 0.250 2.881022.88\cdot 10^{-2} 1.345 5.801035.80\cdot 10^{-3} 12,000
Refer to caption
Figure 6: Comparison of reaction rates for chemical reaction network modeling example. Opacity denotes prevalence of the corresponding states in the training set. For more prevalent states, N-CTMC predicted propensity functions achieve near-perfect accuracy (black line).

5 Discussion

We have assumed access to complete observations of trajectories together with knowledge of the number of reactions and associated stoichiometric coefficients. Subject to these assumptions, we have demonstrated that the N-CTMC method learns consistent transition rates for stochastic reaction networks with both mass action and non-mass action kinetics, including for one system with more than 13,000 states.

Unlike other methods where the likelihood is truncated, N-CTMC uses the entire state space in the likelihood calculation. The N-CTMC estimated propensity function can be evaluated on any state in Ω\Omega, including but not limited to states visited in the training set. We can only guarantee accurate estimates of propensity functions on states that have been visited sufficiently often in the training data. On rarely visited states, however, N-CTMC outperformed a GLM approach in which propensity functions depend linearly on inputs (count vectors and/or covariates); such GLM approaches have been explored in prior work [15, 16]. The N-CTMC also models transition rates which vary as arbitrary functions of covariates, allowing for study of systems where propensity functions depend on external stimuli.

In the current work, we focused on systems with exponentially distributed sojourn times, an unreasonable assumption for some systems [25]. Using the framework developed here, the negative log likelihood (NLL) can be easily modified to accommodate a semi-Markov process [26]. For the birth-death example, we showed that N-CTMC can also be used to control—in an open-loop fashion—covariate-dependent CTMC systems. We hope to explore both semi-Markov versions and control applications of N-CTMC in future work. This will expand the class of problems to which we can apply our methods.

Acknowledgments. M. Reeves and H. S. Bhat acknowledge support from, respectively, NSF DGE-1633722 and NSF DMS-1723272. This research also benefited from computational resources that include the Pinnacles cluster at UC Merced (supported by NSF ACI-2019144) and Nautilus, supported by the Pacific Research Platform (NSF ACI-1541349), CHASE-CI (NSF CNS-1730158), Towards a National Research Platform (NSF OAC-1826967), and UCOP.

References

  • [1] D. F. Anderson and T. G. Kurtz, “Continuous time Markov chain models for chemical reaction networks,” in Design and Analysis of Biomolecular Circuits: Engineering Approaches to Systems and Synthetic Biology (H. Koeppl, G. Setti, M. di Bernardo, and D. Densmore, eds.), pp. 3–42, New York, NY: Springer, 2011.
  • [2] Y. J. Cho, N. Ramakrishnan, and Y. Cao, “Reconstructing chemical reaction networks: Data mining meets system identification,” in Proceedings of the 14th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (Y. Li, B. Liu, and S. Sarawagi, eds.), pp. 142–150, 2008.
  • [3] K. K. K. Kim, H. Jang, R. B. Gopaluni, J. H. Lee, and R. D. Braatz, “Sparse identification in chemical master equations for monomolecular reaction networks,” in American Control Conference, ACC 2014, Portland, OR, USA, June 4-6, 2014, pp. 3698–3703, 2014.
  • [4] A. Klimovskaia, S. Ganscha, and M. Claassen, “Sparse regression based structure learning of stochastic reaction networks from single cell snapshot time series,” PLoS Comput. Biol., vol. 12, no. 12, p. e1005234, 2016.
  • [5] R. M. Jiang, P. Singh, F. Wrede, A. Hellander, and L. R. Petzold, “Identification of dynamic mass-action biochemical reaction networks using sparse Bayesian methods,” PLoS Comput. Biol., vol. 18, no. 1, 2022.
  • [6] E. Yeung, J. Kim, J. M. Gonçalves, and R. M. Murray, “Global network identification from reconstructed dynamical structure subnetworks: Applications to biochemical reaction networks,” in 54th IEEE Conference on Decision and Control, CDC 2015, pp. 881–888, 2015.
  • [7] A. Gupta, M. Khammash, and G. Sanguinetti, “Bayesian parameter estimation for stochastic reaction networks from steady-state observations,” in Int. Conf. on Computational Methods in Systems Biology, pp. 342–346, Springer, 2019.
  • [8] M. Backenköhler, L. Bortolussi, and V. Wolf, “Moment-based parameter estimation for stochastic reaction networks in equilibrium,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 15, no. 4, pp. 1180–1192, 2017.
  • [9] A. Ruttor, G. Sanguinetti, and M. Opper, “Approximate inference for stochastic reaction processes,” in Learning and Inference in Computational Systems Biology (N. D. Lawrence, M. A. Girolami, M. Rattray, and G. Sanguinetti, eds.), Computational Molecular Biology, pp. 277–296, MIT Press, 2010.
  • [10] C. L. Müller, R. Ramaswamy, and I. F. Sbalzarini, “Global parameter identification of stochastic reaction networks from single trajectories,” in Advances in Systems Biology (I. I. Goryanin and A. B. Goryachev, eds.), (New York, NY), pp. 477–498, Springer, 2012.
  • [11] A. Horváth and D. Manini, “Parameter estimation of kinetic rates in stochastic reaction networks by the EM method,” in 2008 International Conference on BioMedical Engineering and Informatics, vol. 1, pp. 713–717, 2008.
  • [12] F. W. Crawford, V. N. Minin, and M. A. Suchard, “Estimation for general birth-death processes,” J. Am. Stat. Assoc., vol. 109, no. 506, pp. 730–747, 2014.
  • [13] M. Rathinam and M. Yu, “State and parameter estimation from exact partial state observation in stochastic reaction networks,” J. Chem. Phys., vol. 154, no. 3, p. 034103, 2021.
  • [14] I. Yang and C. J. Tomlin, “Reaction-diffusion systems in protein networks: Global existence and identification,” Syst. Control. Lett., vol. 74, pp. 50–57, 2014.
  • [15] S. Barratt and S. Boyd, “Fitting feature-dependent Markov chains,” J. Glob. Optim., 2022.
  • [16] A. Awasthi, V. Minin, J. Huang, D. Chow, and J. Xu, “Fitting a stochastic model of intensive care occupancy to noisy hospitalization time series,” arXiv:2203.00229, 2023.
  • [17] A. Andreychenko, L. Mikeev, D. Spieler, and V. Wolf, “Approximate maximum likelihood estimation for stochastic chemical kinetics,” EURASIP J. Bioinform. Syst. Biol., vol. 2012, no. 1, pp. 1–14, 2012.
  • [18] J. Owen, D. J. Wilkinson, and C. S. Gillespie, “Scalable inference for Markov processes with intractable likelihoods,” Stat. Comput., vol. 25, no. 1, pp. 145–156, 2015.
  • [19] T. Toni, D. Welch, N. Strelkowa, A. Ipsen, and M. P. Stumpf, “Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems,” J. R. Soc. Interface, vol. 6, no. 31, pp. 187–202, 2009.
  • [20] B. Ranneby, “On necessary and sufficient conditions for consistency of MLE’s in Markov chain models,” Scand. J. Stat., pp. 99–105, 1978.
  • [21] G. Gripenberg, “Approximation by neural networks with a bounded number of nodes at each level,” J. Approx. Theory, vol. 122, no. 2, pp. 260–266, 2003.
  • [22] R. Erban, J. Chapman, and P. Maini, “A practical guide to stochastic simulations of reaction-diffusion processes,” arXiv:0704.1908, 2007.
  • [23] G. Enciso and J. Kim, “Accuracy of multiscale reduction for stochastic reaction systems,” Multiscale Model. Simul., vol. 19, no. 4, pp. 1633–1658, 2021.
  • [24] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-normalizing neural networks,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017 (I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, eds.), pp. 971–980, 2017.
  • [25] D. Chiarugi, M. Falaschi, D. Hermith, C. Olarte, and L. Torella, “Modelling non-Markovian dynamics in biochemical reactions,” BMC Systems Biology, vol. 9, no. S8, pp. 1–13, 2015.
  • [26] A. Asanjarani, B. Liquet, and Y. Nazarathy, “Estimation of semi-Markov multi-state models: a comparison of the sojourn times and transition intensities approaches,” Int. J. Biostat., vol. 18, no. 1, pp. 243–262, 2022.