
Dynamic inference in probabilistic graphical models

Weiming Feng, Kun He, Xiaoming Sun, and Yitong Yin. Weiming Feng and Yitong Yin: State Key Laboratory for Novel Software Technology, Nanjing University. E-mail: fengwm@smail.nju.edu.cn, yinyt@nju.edu.cn. Kun He: Shenzhen University; Shenzhen Institute of Computing Sciences. E-mail: hekun.threebody@foxmail.com. Xiaoming Sun: CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences. E-mail: sunxiaoming@ict.ac.cn.
Abstract.

Probabilistic graphical models, such as Markov random fields (MRFs), are useful for describing high-dimensional distributions in terms of local dependence structures. Probabilistic inference is a fundamental problem related to graphical models, and sampling is a main approach to it. In this paper, we study probabilistic inference problems when the graphical model itself is changing dynamically with time. Such dynamic inference problems arise naturally in today's applications, e.g. multivariate time-series data analysis and practical learning procedures.

We give a dynamic algorithm for sampling-based probabilistic inference in MRFs, where each dynamic update can change the underlying graph and all parameters of the MRF simultaneously, as long as the total amount of changes is bounded. More precisely, suppose that the MRF has $n$ variables and polylogarithmically bounded maximum degree, and $N(n)$ independent samples are sufficient for the inference for a polynomial function $N(\cdot)$. Our algorithm dynamically maintains an answer to the inference problem using $\widetilde{O}(nN(n))$ space cost and $\widetilde{O}(N(n)+n)$ incremental time cost upon each update to the MRF, as long as the Dobrushin-Shlosman condition is satisfied by the MRFs. This well-known condition has long been used to guarantee the efficiency of Markov chain Monte Carlo (MCMC) sampling in the traditional static setting. Compared to the static case, which requires $\Omega(nN(n))$ time cost for redrawing all $N(n)$ samples whenever the MRF changes, our dynamic algorithm gives an $\widetilde{\Omega}(\min\{n,N(n)\})$-factor speedup. Our approach relies on a novel dynamic sampling technique, which transforms local Markov chains (a.k.a. single-site dynamics) into dynamic sampling algorithms, and an "algorithmic Lipschitz" condition that we establish for sampling from graphical models: when the MRF changes by a small difference, the samples can be modified to reflect the new distribution, with cost proportional to the difference between the MRFs.

Weiming Feng and Yitong Yin are supported by the National Key R&D Program of China 2018YFB1003202 and the National Natural Science Foundation of China under Grant Nos. 61722207 and 61672275. Kun He and Xiaoming Sun are supported by the National Natural Science Foundation of China under Grant Nos. 61832003 and 61433014, and the K.C. Wong Education Foundation.

1. Introduction

Probabilistic graphical models provide a rich language for describing high-dimensional distributions in terms of the dependence structures between random variables. The Markov random field (MRF) is a basic graphical model that encodes pairwise interactions of complex systems. Given a graph $G=(V,E)$, each vertex $v\in V$ is associated with a function $\phi_{v}:Q\to\mathbb{R}$, called the vertex potential, on a finite domain $Q=[q]$ of $q$ spin states, and each edge $e\in E$ is associated with a symmetric function $\phi_{e}:Q^{2}\to\mathbb{R}$, called the edge potential, which describes a pairwise interaction. Together, these induce a probability distribution $\mu$ over all configurations $\sigma\in Q^{V}$:

\mu(\sigma)\propto\exp(H(\sigma))=\exp\Big(\sum_{v\in V}\phi_{v}(\sigma_{v})+\sum_{e=\{u,v\}\in E}\phi_{e}(\sigma_{u},\sigma_{v})\Big).

This distribution $\mu$ is known as the Gibbs distribution and $H(\sigma)$ is the Hamiltonian. It arises naturally from various physical models, statistics or learning problems, and combinatorial problems in computer science [30, 26].

Probabilistic inference is one of the most fundamental computational problems in graphical models. Some basic inference problems ask to calculate the marginal distribution, conditional distribution, or maximum-a-posteriori probabilities of one or several random variables [38]. Sampling is perhaps the most widely used approach for probabilistic inference. Given a graphical model, independent samples are drawn from the Gibbs distribution and certain statistics are computed from the samples to give estimates for the inferred quantity. For most typical inference problems, such statistics are easy to compute once the samples are given; for instance, for estimating the marginal distribution on a variable subset $S$, the statistic is the frequency of each configuration in $Q^{S}$ among the samples. Thus the cost of inference is dominated by the cost of generating random samples [25, 35].

Classic probabilistic inference assumes a static setting, where the input graphical model is fixed. In today's applications, dynamically changing graphical models naturally arise in many scenarios. In various practical algorithms for learning graphical models, e.g. the contrastive divergence algorithm for learning the restricted Boltzmann machine [21] and the iterative proportional fitting algorithm for maximum likelihood estimation of graphical models [38], the optimal model $\mathcal{I}^{*}$ is obtained by updating the parameters of the graphical model iteratively (usually by gradient descent), which generates a sequence of graphical models $\mathcal{I}_{1},\mathcal{I}_{2},\ldots,\mathcal{I}_{M}$, with the goal that $\mathcal{I}_{M}$ is a good approximation of $\mathcal{I}^{*}$. Also, in the study of multivariate time-series data, the dynamic Gaussian graphical model [6], the multiregression dynamic model [33], the dynamic graphical model [16], and dynamic chain graph models [2] are all dynamically changing graphical models and have been used in a variety of applications. Meanwhile, with the advent of Big Data, scalable machine learning systems need to deal with continuously evolving graphical models (see e.g. [34] and [36]).

Theoretical studies of probabilistic inference in dynamically changing graphical models are lacking. In the aforementioned practical scenarios, it is common that a sequence of graphical models is presented over time, where any two consecutive graphical models can differ from each other in all potentials, but by a small total amount. Recomputing the inference from scratch every time the graphical model changes gives the correct solution, but is very wasteful. A fundamental question is whether probabilistic inference can be solved dynamically and efficiently.

In this paper, we study the problem of probabilistic inference in an MRF when the MRF itself is changing dynamically with time. At each step, the whole graphical model, including all vertices and edges as well as their potentials, is subject to changes. Such non-local updates are very general and cover all the applications mentioned above. The dynamic inference problem then asks to maintain a correct answer to the inference in a dynamically changing MRF, with low incremental cost proportional to the amount of changes made to the graphical model at each step.

1.1. Our results

We give a dynamic algorithm for sampling-based probabilistic inference. Given an MRF instance with $n$ vertices, suppose that $N(n)$ independent samples are sufficient to give an approximate solution to the inference problem, where $N:\mathbb{N}^{+}\to\mathbb{N}^{+}$ is a polynomial function. We give dynamic algorithms for general inference problems on dynamically changing MRFs.

Suppose that the current MRF has $n$ vertices and polylogarithmically bounded maximum degree, and each update to the MRF may change the underlying graph and/or all vertex/edge potentials, as long as the total amount of changes is bounded. Our algorithm maintains an approximate solution to the inference with $\widetilde{O}(nN(n))$ space cost, and with $\widetilde{O}(N(n)+n)$ incremental time cost upon each update, assuming that the MRFs satisfy the Dobrushin-Shlosman condition [11, 12, 13]. The condition has been widely used to imply the efficiency of Markov chain Monte Carlo (MCMC) sampling (e.g. see [20, 10]). Compared to the static algorithm, which requires $\Omega(nN(n))$ time for redrawing all $N(n)$ samples each time, our dynamic algorithm significantly improves the time cost with an $\widetilde{\Omega}(\min\{n,N(n)\})$-factor speedup.

On specific models, the Dobrushin-Shlosman condition has been established in the literature, which directly gives us the following efficient dynamic inference algorithms, with $\widetilde{O}\left(nN(n)\right)$ space cost and $\widetilde{O}\left(N(n)+n\right)$ time cost per update, on graphs with $n$ vertices and maximum degree $\Delta=O(1)$:

  • for the Ising model with temperature $\beta$ satisfying $\mathrm{e}^{-2|\beta|}>1-\frac{2}{\Delta+1}$, which is close to the uniqueness threshold $\mathrm{e}^{-2|\beta_{c}|}=1-\frac{2}{\Delta}$, beyond which the static versions of the sampling and marginal inference problems for the anti-ferromagnetic Ising model are intractable [19, 18];

  • for the hardcore model with fugacity $\lambda<\frac{2}{\Delta-2}$, which matches the best known bound for sampling algorithms with near-linear running time on general graphs with bounded maximum degree [37, 29, 7];

  • for proper $q$-coloring with $q>2\Delta$, which matches the best known bound for sampling algorithms with near-linear running time on general graphs with bounded maximum degree [24].

Our dynamic inference algorithm is based on a dynamic sampling algorithm, which efficiently maintains $N(n)$ independent samples for the current MRF while the MRF is subject to changes. More specifically, we give a dynamic version of the Gibbs sampling algorithm, a local Markov chain for sampling from the Gibbs distribution that has been studied extensively. Our techniques are based on: (1) couplings for dynamic instances of graphical models; and (2) dynamic data structures for representing single-site Markov chains so that the couplings can be realized algorithmically in sub-linear time. Both these techniques are of independent interest, and can be naturally extended to more general settings with multi-body interactions.

Our results show that on dynamically changing graphical models, sampling-based probabilistic inference can be solved significantly faster than rerunning the static algorithm every time the model changes. This has practical significance in speeding up the iterative procedures for learning graphical models.

1.2. Related work

The problem of dynamic sampling from graphical models was introduced very recently in [16]. There, a dynamic sampling algorithm was given for graphical models with soft constraints, which can only deal with local updates that change a single vertex or edge at a time. The regimes in which such dynamic sampling algorithms are efficient are much more restrictive than the conditions for the rapid mixing of Markov chains. Our algorithm greatly improves the regimes for efficient dynamic sampling for the Ising and hardcore models in [16], and, for the first time, can handle non-local updates that change all vertex/edge potentials simultaneously. Besides, dynamic/online sampling from log-concave distributions was also studied in [31, 27].

Another related topic is dynamic graph problems, which ask to maintain a solution (e.g. spanners [15, 32, 39] or shortest paths [3, 23, 22]) while the input graph is dynamically changing. More recently, important progress has been made on dynamically maintaining structures related to graph random walks, such as spectral sparsifiers [9, 1] or effective resistances [8, 17]. Instead of one particular solution, dynamic inference problems ask to maintain an estimate of a statistic, and such a statistic comes from an exponential-sized probability space described by a dynamically changing graphical model.

1.3. Organization of the paper.

In Section 2, we formally introduce the dynamic inference problem. In Section 3, we formally state the main results. Preliminaries are given in Section 4. In Section 5, we outline our dynamic inference algorithm. In Section 6, we present the algorithms for dynamic Gibbs sampling. The analyses of these dynamic sampling algorithms are given in Section 7. The proof of the main theorem on dynamic inference is given in Section 8. The conclusion is given in Section 9.

2. Dynamic inference problem

2.1. Markov random fields.

An instance of a Markov random field (MRF) is specified by a tuple $\mathcal{I}=(V,E,Q,\Phi)$, where $G=(V,E)$ is an undirected simple graph; $Q$ is a domain of $q=|Q|$ spin states, for some finite $q>1$; and $\Phi=(\phi_{a})_{a\in V\cup E}$ associates with each $v\in V$ a vertex potential $\phi_{v}:Q\to\mathbb{R}$ and with each $e\in E$ an edge potential $\phi_{e}:Q^{2}\to\mathbb{R}$, where $\phi_{e}$ is symmetric.

A configuration $\sigma\in Q^{V}$ maps each vertex $v\in V$ to a spin state in $Q$, so that each vertex can be interpreted as a variable. The Hamiltonian of a configuration $\sigma\in Q^{V}$ is defined as:

H(\sigma)\triangleq\sum_{v\in V}\phi_{v}(\sigma_{v})+\sum_{e=\{u,v\}\in E}\phi_{e}(\sigma_{u},\sigma_{v}).

This defines the Gibbs distribution $\mu_{\mathcal{I}}$, which is a probability distribution over $Q^{V}$ such that

\forall\sigma\in Q^{V},\quad\mu_{\mathcal{I}}(\sigma)=\frac{1}{Z}\exp(H(\sigma)),

where the normalizing factor $Z\triangleq\sum_{\sigma\in Q^{V}}\exp(H(\sigma))$ is called the partition function.
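For concreteness, here is a minimal Python sketch of how the Hamiltonian $H(\sigma)$ and the unnormalized Gibbs weight $\exp(H(\sigma))$ could be evaluated; the dictionary representation of potentials is our own illustrative choice, not part of the paper's algorithms.

```python
import math

def hamiltonian(V, E, phi_v, phi_e, sigma):
    """Evaluate H(sigma) = sum_v phi_v(sigma_v) + sum_{e={u,v}} phi_e(sigma_u, sigma_v).

    V: iterable of vertices; E: iterable of frozenset({u, v}) edges;
    phi_v[v][c] is the vertex potential and phi_e[e][(a, b)] the symmetric
    edge potential (so the (a, b) and (b, a) entries agree).
    """
    h = sum(phi_v[v][sigma[v]] for v in V)
    for e in E:
        u, v = tuple(e)
        h += phi_e[e][(sigma[u], sigma[v])]
    return h

def gibbs_weight(V, E, phi_v, phi_e, sigma):
    """Unnormalized weight exp(H(sigma)); mu_I(sigma) is this divided by Z."""
    return math.exp(hamiltonian(V, E, phi_v, phi_e, sigma))
```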

The Gibbs measure $\mu(\sigma)$ can be $0$ since the functions $\phi_{v},\phi_{e}$ may take the value $-\infty$. A configuration $\sigma$ is called feasible if $\mu(\sigma)>0$. To make the problem of constructing a feasible configuration trivial, we further assume the following natural condition for the MRF instances considered in this paper:\footnote{This condition guarantees that the marginal probabilities are always well-defined, and that the problem of constructing a feasible configuration $\sigma$, where $\mu_{\mathcal{I}}(\sigma)>0$, is trivial. The condition holds for all MRFs with soft constraints, or with hard constraints where there is a permissive spin, e.g. the hardcore model. For MRFs with truly repulsive hard constraints such as proper $q$-coloring, the condition may translate to the requirement $q\geq\Delta+1$, where $\Delta$ is the maximum degree of the graph $G$, which is necessary for the irreducibility of local Markov chains for $q$-colorings.}

(1) \forall\,v\in V,\,\,\forall\sigma\in Q^{\Gamma_{G}(v)}:\quad\sum_{c\in Q}\exp\left(\phi_{v}(c)+\sum_{u\in\Gamma_{G}(v)}\phi_{uv}(\sigma_{u},c)\right)>0,

where $\Gamma_{G}(v)\triangleq\{u\in V\mid\{u,v\}\in E\}$ denotes the neighborhood of $v$ in the graph $G=(V,E)$.

Some well-studied MRFs include:

  • Ising model: The domain of each spin is $Q=\{-1,+1\}$. Each edge $e\in E$ is associated with a temperature $\beta_{e}\in\mathbb{R}$, and each vertex $v\in V$ is associated with a local field $h_{v}\in\mathbb{R}$. For each configuration $\sigma\in\{-1,+1\}^{V}$, $\mu_{\mathcal{I}}(\sigma)\propto\exp\left(\sum_{e=\{u,v\}\in E}\beta_{e}\sigma_{u}\sigma_{v}+\sum_{v\in V}h_{v}\sigma_{v}\right)$.

  • Hardcore model: The domain is $Q=\{0,1\}$. Each feasible configuration $\sigma\in Q^{V}$ indicates an independent set in $G=(V,E)$, and $\mu_{\mathcal{I}}(\sigma)\propto\lambda^{\left\|\sigma\right\|}$, where $\lambda>0$ is the fugacity parameter.

  • Proper $q$-coloring: the uniform distribution over all proper $q$-colorings of $G=(V,E)$.

2.2. Probabilistic inference and sampling

In graphical models, the task of probabilistic inference is to derive probabilities regarding one or more random variables of the model. Abstractly, this is described by a function $\bm{\theta}:\mathfrak{M}\rightarrow\mathbb{R}^{K}$ that maps each graphical model $\mathcal{I}\in\mathfrak{M}$ to a target $K$-dimensional probability vector, where $\mathfrak{M}$ is the class of graphical models containing the random variables we are interested in and the $K$-dimensional vector describes the probabilities we want to derive. Given $\bm{\theta}(\cdot)$ and an MRF instance $\mathcal{I}\in\mathfrak{M}$, the inference problem asks to estimate the probability vector $\bm{\theta}(\mathcal{I})$.

Here are some fundamental inference problems [38] for MRF instances. Let $\mathcal{I}=(V,E,Q,\Phi)$ be an MRF instance and $A,B\subseteq V$ two disjoint sets of vertices (so that $A\uplus B\subseteq V$).

  • Marginal inference: estimate the marginal distribution $\mu_{A,\mathcal{I}}(\cdot)$ of the variables in $A$, where

    \forall\sigma_{A}\in Q^{A},\quad\mu_{A,\mathcal{I}}(\sigma_{A})\triangleq\sum_{\tau\in Q^{V\setminus A}}\mu_{\mathcal{I}}(\sigma_{A},\tau).

  • Posterior inference: given any $\tau_{B}\in Q^{B}$, estimate the posterior distribution $\mu_{A,\mathcal{I}}(\cdot\mid\tau_{B})$ for the variables in $A$, where

    \forall\sigma_{A}\in Q^{A},\quad\mu_{A,\mathcal{I}}(\sigma_{A}\mid\tau_{B})\triangleq\frac{\mu_{A\cup B,\mathcal{I}}(\sigma_{A},\tau_{B})}{\mu_{B,\mathcal{I}}(\tau_{B})}.

  • Maximum-a-posteriori (MAP) inference: find the maximum-a-posteriori (MAP) probabilities $P_{A,\mathcal{I}}^{\ast}(\cdot)$ for the configurations over $A$, where

    \forall\sigma_{A}\in Q^{A},\quad P^{\ast}_{A,\mathcal{I}}(\sigma_{A})\triangleq\max_{\tau_{B}\in Q^{B}}\mu_{A\cup B,\mathcal{I}}(\sigma_{A},\tau_{B}).

All these fundamental inference problems can be described abstractly by a function $\bm{\theta}:\mathfrak{M}\rightarrow\mathbb{R}^{K}$. For instance, for marginal inference, $\mathfrak{M}$ contains all MRF instances where $A$ is a subset of the vertices, $K=\left|Q\right|^{|A|}$, and $\bm{\theta}(\mathcal{I})=(\mu_{A,\mathcal{I}}(\sigma_{A}))_{\sigma_{A}\in Q^{A}}$; and for posterior or MAP inference, $\mathfrak{M}$ contains all MRF instances where $A\uplus B$ is a subset of the vertices, $K=\left|Q\right|^{|A|}$, and $\bm{\theta}(\mathcal{I})=(\mu_{A,\mathcal{I}}(\sigma_{A}\mid\tau_{B}))_{\sigma_{A}\in Q^{A}}$ (for posterior inference) or $\bm{\theta}(\mathcal{I})=(P^{\ast}_{A,\mathcal{I}}(\sigma_{A}))_{\sigma_{A}\in Q^{A}}$ (for MAP inference).

One canonical approach for probabilistic inference is by sampling: sufficiently many independent samples are drawn (approximately) from the Gibbs distribution of the MRF instance, and an estimate of the target probabilities is calculated from these samples. Given a probabilistic inference problem $\bm{\theta}(\cdot)$, we use $\mathcal{E}_{\bm{\theta}}(\cdot)$ to denote an estimating function that approximates $\bm{\theta}(\mathcal{I})$ using independent samples drawn approximately from $\mu_{\mathcal{I}}$. For the aforementioned problems of marginal, posterior and MAP inference, such an estimating function $\mathcal{E}_{\bm{\theta}}(\cdot)$ simply counts the frequency of the samples that satisfy certain properties.
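For example, a minimal sketch of such a frequency-counting estimating function for marginal inference on a small set $A$ might look as follows (Python; representing samples as dictionaries from vertices to spins is an assumption made for illustration only):

```python
from collections import Counter
from itertools import product

def estimate_marginal(samples, A, Q):
    """Empirical estimate of (mu_A(sigma_A))_{sigma_A in Q^A}: the frequency of
    each configuration on A among the samples.

    samples: list of dicts mapping each vertex to a spin in Q.
    A: list of vertices; Q: list of spins.
    Returns a dict from tuples sigma_A (ordered as A) to estimated probabilities.
    """
    counts = Counter(tuple(X[v] for v in A) for X in samples)
    N = len(samples)
    return {sigma_A: counts[sigma_A] / N for sigma_A in product(Q, repeat=len(A))}
```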

The sampling cost of an estimator is captured in two aspects: the number of samples it uses and the accuracy of each individual sample it requires.

Definition 2.1 ($(N,\epsilon)$-estimator for $\bm{\theta}$).

Let $\bm{\theta}:\mathfrak{M}\to\mathbb{R}^{K}$ be a probabilistic inference problem and $\mathcal{E}_{\bm{\theta}}(\cdot)$ an estimating function for $\bm{\theta}(\cdot)$ that, for each instance $\mathcal{I}=(V,E,Q,\Phi)\in\mathfrak{M}$, maps samples in $Q^{V}$ to an estimate of $\bm{\theta}(\mathcal{I})$. Let $N:\mathbb{N}^{+}\to\mathbb{N}^{+}$ and $\epsilon:\mathbb{N}^{+}\to(0,1)$. For any instance $\mathcal{I}=(V,E,Q,\Phi)\in\mathfrak{M}$ where $n=|V|$, the random variable $\mathcal{E}_{\bm{\theta}}(\bm{X}^{(1)},\ldots,\bm{X}^{(N(n))})$ is said to be an $(N,\epsilon)$-estimator for $\bm{\theta}(\mathcal{I})$ if $\bm{X}^{(1)},\ldots,\bm{X}^{(N(n))}\in Q^{V}$ are $N(n)$ independent samples drawn approximately from $\mu_{\mathcal{I}}$ such that $d_{\mathrm{TV}}\left({\bm{X}^{(j)}},{\mu_{\mathcal{I}}}\right)\leq\epsilon(n)$ for all $1\leq j\leq N(n)$.

In Definition 2.1, an estimator is viewed as a black-box algorithm specified by two functions $N$ and $\epsilon$. Usually, the estimator is more accurate if more independent samples are drawn and each sample provides a higher level of accuracy. Thus, one can choose a large $N$ and a small $\epsilon$ to achieve a desired quality of estimate.

2.3. Dynamic inference problem

We consider the inference problem where the input graphical model is changed dynamically: at each step, the current MRF instance $\mathcal{I}=(V,E,Q,\Phi)$ is updated to a new instance $\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime})$. We consider general update operations for MRFs that can change both the underlying graph and all edge/vertex potentials simultaneously, where the update request is made by a non-adaptive adversary independently of the randomness used by the inference algorithm. Such updates are general enough to cover many applications, e.g. analyses of time-series network data [6, 33, 16, 2] and learning algorithms for graphical models [21, 38].

The difference between the original and the updated instances is measured as follows.

Definition 2.2 (difference between MRF instances).

The difference between two MRF instances $\mathcal{I}=(V,E,Q,\Phi)$ and $\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime})$, where $\Phi=(\phi_{a})_{a\in V\cup E}$ and $\Phi^{\prime}=(\phi^{\prime}_{a})_{a\in V^{\prime}\cup E^{\prime}}$, is defined as

(2) d(\mathcal{I},\mathcal{I}^{\prime})\triangleq\sum_{v\in V\cap V^{\prime}}\left\|\phi_{v}-\phi^{\prime}_{v}\right\|_{1}+\sum_{e\in E\cap E^{\prime}}\left\|\phi_{e}-\phi^{\prime}_{e}\right\|_{1}+|V\oplus V^{\prime}|+|E\oplus E^{\prime}|,

where $A\oplus B=(A\setminus B)\cup(B\setminus A)$ stands for the symmetric difference between two sets $A$ and $B$, $\left\|\phi_{v}-\phi^{\prime}_{v}\right\|_{1}\triangleq\sum_{c\in Q}\left|\phi_{v}(c)-\phi^{\prime}_{v}(c)\right|$, and $\left\|\phi_{e}-\phi^{\prime}_{e}\right\|_{1}\triangleq\sum_{c,c^{\prime}\in Q}\left|\phi_{e}(c,c^{\prime})-\phi^{\prime}_{e}(c,c^{\prime})\right|$.
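The definition (2) translates directly into code; the following sketch assumes the same dictionary representation of potentials as in the earlier sketches and is purely illustrative.

```python
def instance_difference(V1, E1, phi1, V2, E2, phi2, Q):
    """Compute d(I, I') as in (2): L1-differences of the shared potentials plus
    the sizes of the symmetric differences of the vertex and edge sets.

    phi1, phi2 map each vertex v to a dict {c: phi_v(c)} and each edge
    frozenset({u, v}) to a dict {(a, b): phi_e(a, b)}.
    """
    V1, V2, E1, E2 = set(V1), set(V2), set(E1), set(E2)
    d = len(V1 ^ V2) + len(E1 ^ E2)               # |V (+) V'| + |E (+) E'|
    for v in V1 & V2:                             # shared vertices
        d += sum(abs(phi1[v][c] - phi2[v][c]) for c in Q)
    for e in E1 & E2:                             # shared edges
        d += sum(abs(phi1[e][(a, b)] - phi2[e][(a, b)]) for a in Q for b in Q)
    return d
```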

Given a probability vector specified by the function $\bm{\theta}:\mathfrak{M}\to\mathbb{R}^{K}$, the dynamic inference problem asks to maintain an estimator $\hat{\bm{\theta}}(\mathcal{I})$ of $\bm{\theta}(\mathcal{I})$ for the current MRF instance $\mathcal{I}=(V,E,Q,\Phi)\in\mathfrak{M}$, with a data structure, such that when $\mathcal{I}$ is updated to $\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime})\in\mathfrak{M}$, the algorithm updates $\hat{\bm{\theta}}(\mathcal{I})$ to an estimator $\hat{\bm{\theta}}(\mathcal{I}^{\prime})$ of the new vector $\bm{\theta}(\mathcal{I}^{\prime})$, or equivalently, outputs the difference between the estimators $\hat{\bm{\theta}}(\mathcal{I})$ and $\hat{\bm{\theta}}(\mathcal{I}^{\prime})$.

It is desirable to have a dynamic inference algorithm that maintains an $(N,\epsilon)$-estimator for $\bm{\theta}(\mathcal{I})$ for the current instance $\mathcal{I}$. However, the dynamic algorithm cannot be efficient if $N(n)$ and $\epsilon(n)$ change drastically with $n$ (so that significantly more samples or substantially more accurate samples may be needed when a new vertex is added), or if recalculating the estimating function $\mathcal{E}_{\bm{\theta}}(\cdot)$ itself is expensive. We introduce a notion of dynamical efficiency for the estimators that are suitable for dynamic inference.

Definition 2.3 (dynamical efficiency).

Let $N:\mathbb{N}^{+}\to\mathbb{N}^{+}$ and $\epsilon:\mathbb{N}^{+}\to(0,1)$. Let $\mathcal{E}(\cdot)$ be an estimating function for some $K$-dimensional probability vector of MRF instances. A tuple $(N,\epsilon,\mathcal{E})$ is said to be dynamically efficient if it satisfies:

  • (bounded difference) there exist constants $C_{1},C_{2}>0$ such that for any $n\in\mathbb{N}^{+}$,

    \left|N(n+1)-N(n)\right|\leq\frac{C_{1}\cdot N(n)}{n}\quad\text{ and }\quad\left|\epsilon(n+1)-\epsilon(n)\right|\leq\frac{C_{2}\cdot\epsilon(n)}{n};

  • (small incremental cost) there is a deterministic algorithm that maintains $\mathcal{E}(\bm{X}^{(1)},\ldots,\bm{X}^{(m)})$ using $(mn+K)\cdot\mathrm{polylog}(mn)$ bits, where $\bm{X}^{(1)},\ldots,\bm{X}^{(m)}\in Q^{V}$ and $n=|V|$, such that when $\bm{X}^{(1)},\ldots,\bm{X}^{(m)}\in Q^{V}$ are updated to $\bm{Y}^{(1)},\ldots,\bm{Y}^{(m^{\prime})}\in Q^{V^{\prime}}$, where $n^{\prime}=|V^{\prime}|$, the algorithm updates $\mathcal{E}(\bm{X}^{(1)},\ldots,\bm{X}^{(m)})$ to $\mathcal{E}(\bm{Y}^{(1)},\ldots,\bm{Y}^{(m^{\prime})})$ within time cost $\mathcal{D}\cdot\mathrm{polylog}(mm^{\prime}nn^{\prime})+O(m+m^{\prime})$, where $\mathcal{D}$ is the size of the difference between the two sample sequences, defined as:

    (3) \mathcal{D}\triangleq\sum_{i\leq\max\{m,m^{\prime}\}}\sum_{v\in V\cup V^{\prime}}\mathbf{1}\left[\bm{X}^{(i)}(v)\neq\bm{Y}^{(i)}(v)\right],

    where an unassigned $\bm{X}^{(i)}(v)$ or $\bm{Y}^{(i)}(v)$ is not equal to any assigned spin.

Dynamical efficiency basically asks $N(\cdot)$, $\epsilon(\cdot)$, and $\mathcal{E}(\cdot)$ to have some sort of "Lipschitz" property. To satisfy the bounded difference condition, $N(n)$ and $1/\epsilon(n)$ are necessarily polynomially bounded, and they can be any constant, polylogarithmic, or polynomial functions, or products of such functions. The small incremental cost condition also holds very commonly. In particular, it is satisfied by the estimating functions for all the aforementioned marginal, posterior and MAP inference problems as long as the sets of variables have sizes $\left|A\right|,\left|B\right|=O(\log n)$, as illustrated by the sketch below. We remark that the $O(\log n)$ upper bound is essentially necessary for the efficiency of inference, because otherwise the dimension of $\bm{\theta}(\mathcal{I})$ itself (which is at least $q^{|A|}$) becomes super-polynomial in $n$.
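As an illustration of the small incremental cost condition, the sketch below (our own simplified stand-in, not the paper's data structure) maintains the frequency-counting marginal estimator by storing only the projections of the samples onto $A$; replacing one sample touches a single projection, so the total cost is proportional to the number of changed samples.

```python
from collections import Counter

class MarginalEstimator:
    """Maintain the empirical marginal on a small vertex set A over a list of samples.

    Only the projections of the samples onto A are stored, so replacing sample i
    costs O(|A|) time regardless of the total number of vertices.
    """
    def __init__(self, samples, A):
        self.A = list(A)
        self.proj = [tuple(X[v] for v in self.A) for X in samples]
        self.counts = Counter(self.proj)

    def estimate(self):
        """Current estimate: a dict from configurations on A to their frequencies."""
        N = len(self.proj)
        return {sigma_A: c / N for sigma_A, c in self.counts.items()}

    def update_sample(self, i, new_sample):
        """Replace the i-th sample (or append it if i equals the current length)."""
        new_proj = tuple(new_sample[v] for v in self.A)
        if i < len(self.proj):
            old_proj = self.proj[i]
            if old_proj == new_proj:
                return
            self.counts[old_proj] -= 1
            if self.counts[old_proj] == 0:
                del self.counts[old_proj]
            self.proj[i] = new_proj
        else:
            self.proj.append(new_proj)
        self.counts[new_proj] += 1
```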

3. Main results

Let $\mathcal{I}=(V,E,Q,\Phi)$ be an MRF instance, where $G=(V,E)$. Let $\Gamma_{G}(v)$ denote the neighborhood of $v$ in $G$. For any vertex $v\in V$ and any configuration $\sigma\in Q^{\Gamma_{G}(v)}$, we use $\mu_{v,\mathcal{I}}^{\sigma}(\cdot)=\mu_{v,\mathcal{I}}(\cdot\mid\sigma)$ to denote the marginal distribution on $v$ conditional on $\sigma$:

\forall c\in Q:\quad\mu_{v,\mathcal{I}}^{\sigma}(c)=\mu_{v,\mathcal{I}}(c\mid\sigma)\triangleq\frac{\exp\left(\phi_{v}(c)+\sum_{u\in\Gamma_{G}(v)}\phi_{uv}(\sigma_{u},c)\right)}{\sum_{a\in Q}\exp\left(\phi_{v}(a)+\sum_{u\in\Gamma_{G}(v)}\phi_{uv}(\sigma_{u},a)\right)}.

Due to the assumption in (1), the marginal distribution is always well-defined. The following condition is the Dobrushin-Shlosman condition [11, 12, 13, 20, 10].

Condition 3.1 (Dobrushin-Shlosman condition).

Let $\mathcal{I}=(V,E,Q,\Phi)$ be an MRF instance with Gibbs distribution $\mu=\mu_{\mathcal{I}}$. Let $A_{\mathcal{I}}\in\mathbb{R}_{\geq 0}^{V\times V}$ be the influence matrix, which is defined as

A_{\mathcal{I}}(u,v)\triangleq\begin{cases}\max_{(\sigma,\tau)\in B_{u,v}}d_{\mathrm{TV}}\left({\mu^{\sigma}_{v}},{\mu_{v}^{\tau}}\right),&\{u,v\}\in E,\\ 0,&\{u,v\}\not\in E,\end{cases}

where the maximum is taken over the set $B_{u,v}$ of all $(\sigma,\tau)\in Q^{\Gamma_{G}(v)}\times Q^{\Gamma_{G}(v)}$ that differ only at $u$, and $d_{\mathrm{TV}}\left({\mu^{\sigma}_{v}},{\mu_{v}^{\tau}}\right)\triangleq\frac{1}{2}\sum_{c\in Q}\left|\mu^{\sigma}_{v}(c)-\mu^{\tau}_{v}(c)\right|$ is the total variation distance between $\mu^{\sigma}_{v}$ and $\mu^{\tau}_{v}$. An MRF instance $\mathcal{I}$ is said to satisfy the Dobrushin-Shlosman condition if there is a constant $\delta>0$ such that

\max_{u\in V}\sum_{v\in V}A_{\mathcal{I}}(u,v)\leq 1-\delta.
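On small instances the condition can be verified by brute force; the following sketch (exponential in the maximum degree, for illustration only, using the dictionary representation of potentials assumed earlier) computes the influence matrix entries from the conditional marginals defined above and checks the row-sum bound.

```python
import math
from itertools import product

def conditional_marginal(v, sigma, Q, phi_v, phi_e, neighbors):
    """The marginal mu_v^sigma(.) at v given the boundary configuration sigma on Gamma(v)."""
    weights = [math.exp(phi_v[v][c] + sum(phi_e[frozenset({u, v})][(sigma[u], c)]
                                          for u in neighbors[v]))
               for c in Q]
    Z = sum(weights)
    return [w / Z for w in weights]

def dobrushin_shlosman_holds(V, Q, phi_v, phi_e, neighbors, delta):
    """Check max_u sum_v A_I(u, v) <= 1 - delta by enumerating all pairs in B_{u,v}."""
    row_sum = {u: 0.0 for u in V}
    for v in V:
        Gamma = list(neighbors[v])
        for u in Gamma:                      # A_I(u, v) = 0 unless {u, v} is an edge
            others = [w for w in Gamma if w != u]
            A_uv = 0.0
            for rest in product(Q, repeat=len(others)):
                base = dict(zip(others, rest))
                for a in Q:
                    for b in Q:
                        if a == b:
                            continue
                        sigma = dict(base); sigma[u] = a
                        tau = dict(base); tau[u] = b
                        p = conditional_marginal(v, sigma, Q, phi_v, phi_e, neighbors)
                        q = conditional_marginal(v, tau, Q, phi_v, phi_e, neighbors)
                        tv = 0.5 * sum(abs(x - y) for x, y in zip(p, q))
                        A_uv = max(A_uv, tv)
            row_sum[u] += A_uv
    return max(row_sum.values()) <= 1 - delta
```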

Our main theorem assumes the following setup: Let $\bm{\theta}:\mathfrak{M}\to\mathbb{R}^{K}$ be a probabilistic inference problem that maps each MRF instance in $\mathfrak{M}$ to a $K$-dimensional probability vector, and let $\mathcal{E}_{\bm{\theta}}$ be its estimating function. Let $N:\mathbb{N}^{+}\to\mathbb{N}^{+}$ and $\epsilon:\mathbb{N}^{+}\to(0,1)$. We use $\mathcal{I}=(V,E,Q,\Phi)\in\mathfrak{M}$, where $n=|V|$, to denote the current instance and $\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime})\in\mathfrak{M}$, where $n^{\prime}=|V^{\prime}|$, to denote the updated instance.

Theorem 3.2 (dynamic inference algorithm).

Assume that $(N,\epsilon,\mathcal{E}_{\bm{\theta}})$ is dynamically efficient, both $\mathcal{I}$ and $\mathcal{I}^{\prime}$ satisfy the Dobrushin-Shlosman condition, and $d(\mathcal{I},\mathcal{I}^{\prime})\leq L=o(n)$.

There is an algorithm that maintains an $(N,\epsilon)$-estimator $\hat{\bm{\theta}}(\mathcal{I})$ of the probability vector $\bm{\theta}(\mathcal{I})$ for the current MRF instance $\mathcal{I}$, using $\widetilde{O}\left(nN(n)+K\right)$ bits, such that when $\mathcal{I}$ is updated to $\mathcal{I}^{\prime}$, the algorithm updates $\hat{\bm{\theta}}(\mathcal{I})$ to an $(N,\epsilon)$-estimator $\hat{\bm{\theta}}(\mathcal{I}^{\prime})$ of $\bm{\theta}(\mathcal{I}^{\prime})$ for the new instance $\mathcal{I}^{\prime}$, within expected time cost

\widetilde{O}\left(\Delta^{2}LN(n)+\Delta n\right),

where $\widetilde{O}(\cdot)$ hides a $\mathrm{polylog}(n)$ factor, $\Delta=\max\{\Delta_{G},\Delta_{G^{\prime}}\}$, and $\Delta_{G}$ and $\Delta_{G^{\prime}}$ denote the maximum degrees of $G=(V,E)$ and $G^{\prime}=(V^{\prime},E^{\prime})$, respectively.

Note that the extra $O(\Delta n)$ cost is necessary for editing the current MRF instance $\mathcal{I}$ to $\mathcal{I}^{\prime}$.

Typically, the difference between two MRF instances $\mathcal{I},\mathcal{I}^{\prime}$ is small\footnote{In multivariate time-series data analysis, the MRF instances at two consecutive times are similar. In the iterative algorithms for learning graphical models, the difference between two consecutive MRF instances generated by gradient descent is bounded to prevent oscillations; in particular, the difference is very small when the iterative algorithm approaches convergence [21, 38].}, and the underlying graphs are sparse [14], that is, $L,\Delta\leq\mathrm{polylog}(n)$. In such cases, our algorithm updates the estimator within time cost $\widetilde{O}(N(n)+n)$, which significantly outperforms static sampling-based inference algorithms that require time cost $\Omega(n^{\prime}N(n^{\prime}))=\Omega(nN(n))$ for redrawing all $N(n^{\prime})$ independent samples.


Dynamic sampling

The core of our dynamic inference algorithm is a dynamic sampling algorithm: assuming the Dobrushin-Shlosman condition, the algorithm can maintain a sequence of $N(n)$ independent samples $\bm{X}^{(1)},\ldots,\bm{X}^{(N(n))}\in Q^{V}$ that are $\epsilon(n)$-close to $\mu_{\mathcal{I}}$ in total variation distance, and when $\mathcal{I}$ is updated to $\mathcal{I}^{\prime}$ with difference $d(\mathcal{I},\mathcal{I}^{\prime})\leq L=o(n)$, the algorithm can update the maintained samples to $N(n^{\prime})$ independent samples $\bm{Y}^{(1)},\ldots,\bm{Y}^{(N(n^{\prime}))}\in Q^{V^{\prime}}$ that are $\epsilon(n^{\prime})$-close to $\mu_{\mathcal{I}^{\prime}}$ in total variation distance, using a time cost of $\widetilde{O}\left(\Delta^{2}LN(n)+\Delta n\right)$ in expectation. This shows that an "algorithmic Lipschitz" condition holds for sampling from Gibbs distributions: when the MRF changes insignificantly, a population of samples can be modified to reflect the new distribution, with cost proportional to the difference between the MRFs. We show that such a property is guaranteed by the Dobrushin-Shlosman condition. This dynamic sampling algorithm is formally described in Theorem 6.1 and is of independent interest [16].


Applications on specific models

On specific models, we have the following results, where $\delta>0$ is an arbitrary constant.

model | regime | space cost | time cost per update
Ising | $\mathrm{e}^{-2|\beta|}\geq 1-\frac{2-\delta}{\Delta+1}$ | $\widetilde{O}\left(nN(n)+K\right)$ | $\widetilde{O}\left(\Delta^{2}LN(n)+\Delta n\right)$
hardcore | $\lambda\leq\frac{2-\delta}{\Delta-2}$ | $\widetilde{O}\left(nN(n)+K\right)$ | $\widetilde{O}\left(\Delta^{3}LN(n)+\Delta n\right)$
$q$-coloring | $q\geq(2+\delta)\Delta$ | $\widetilde{O}\left(nN(n)+K\right)$ | $\widetilde{O}\left(\Delta^{2}LN(n)+\Delta n\right)$
Table 1. Dynamic inference for specific models.

The results for the Ising model and $q$-coloring are corollaries of Theorem 3.2. The regime for the hardcore model is better than the Dobrushin-Shlosman condition (which is $\lambda\leq\frac{1-\delta}{\Delta-1}$), because we use the coupling introduced by Vigoda [37] to analyze the algorithm.

4. Preliminaries

Total variation distance and coupling

Let $\mu$ and $\nu$ be two distributions over $\Omega$. The total variation distance between $\mu$ and $\nu$ is defined as

d_{\mathrm{TV}}\left({\mu},{\nu}\right)\triangleq\frac{1}{2}\sum_{x\in\Omega}\left|\mu(x)-\nu(x)\right|.

A coupling of $\mu$ and $\nu$ is a joint distribution $(X,Y)\in\Omega\times\Omega$ such that the marginal distribution of $X$ is $\mu$ and the marginal distribution of $Y$ is $\nu$. The following coupling lemma is well known.

Proposition 4.1 (coupling lemma).

For any coupling $(X,Y)$ of $\mu$ and $\nu$, it holds that

\Pr[X\neq Y]\geq d_{\mathrm{TV}}\left({\mu},{\nu}\right).

Furthermore, there is an optimal coupling that achieves equality.
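For finite distributions, an optimal coupling can be sampled explicitly: place as much probability mass as possible on the diagonal, and draw from the residual distributions otherwise. A minimal sketch, assuming $\mu$ and $\nu$ are given as dictionaries over a common finite support:

```python
import random

def _draw(dist, rng):
    """Draw one outcome from a dict {outcome: probability}."""
    r, acc = rng.random(), 0.0
    for x, p in dist.items():
        acc += p
        if r < acc:
            return x
    return x  # guard against floating-point round-off

def sample_optimal_coupling(mu, nu, rng=random):
    """Sample (X, Y) with X ~ mu, Y ~ nu and Pr[X != Y] = d_TV(mu, nu).

    mu, nu: dicts over the same finite set of outcomes.
    """
    overlap = {x: min(mu[x], nu[x]) for x in mu}
    p_same = sum(overlap.values())            # equals 1 - d_TV(mu, nu)
    if rng.random() < p_same:
        x = _draw({k: p / p_same for k, p in overlap.items()}, rng)
        return x, x
    # residual distributions; their supports are disjoint, so X != Y on this branch
    res_mu = {x: (mu[x] - overlap[x]) / (1 - p_same) for x in mu}
    res_nu = {x: (nu[x] - overlap[x]) / (1 - p_same) for x in nu}
    return _draw(res_mu, rng), _draw(res_nu, rng)
```

Since the two residual distributions have disjoint supports, $\Pr[X\neq Y]$ equals exactly $1-\sum_{x}\min\{\mu(x),\nu(x)\}=d_{\mathrm{TV}}(\mu,\nu)$, matching the equality case of the coupling lemma.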

Local neighborhood

Let $G=(V,E)$ be a graph. For any vertex $v\in V$, let $\Gamma_{G}(v)\triangleq\{u\in V\mid\{u,v\}\in E\}$ denote the neighborhood of $v$, and $\Gamma^{+}_{G}(v)\triangleq\Gamma_{G}(v)\cup\{v\}$ the inclusive neighborhood of $v$. We simply write $\Gamma_{v}=\Gamma(v)=\Gamma_{G}(v)$ and $\Gamma_{v}^{+}=\Gamma^{+}(v)=\Gamma_{G}^{+}(v)$ for short when $G$ is clear from the context. We use $\Delta=\Delta_{G}\triangleq\max_{v\in V}|\Gamma_{v}|$ to denote the maximum degree of the graph $G$.

A notion of local neighborhood for MRFs is frequently used. Let $\mathcal{I}=(V,E,Q,\Phi)$ be an MRF instance. For $v\in V$, we denote by $\mathcal{I}_{v}\triangleq\mathcal{I}[\Gamma_{v}^{+}]$ the restriction of $\mathcal{I}$ to the inclusive neighborhood $\Gamma_{v}^{+}$ of $v$, i.e. $\mathcal{I}_{v}=(\Gamma^{+}_{v},E_{v},Q,\Phi_{v})$, where $E_{v}=\{\{u,v\}\in E\}$ and $\Phi_{v}=(\phi_{a})_{a\in\Gamma_{v}^{+}\cup E_{v}}$.

Gibbs sampling

Gibbs sampling (a.k.a. heat-bath, Glauber dynamics) is a classic Markov chain for sampling from Gibbs distributions. Let $\mathcal{I}=(V,E,Q,\Phi)$ be an MRF instance and $\mu=\mu_{\mathcal{I}}$ its Gibbs distribution. The chain of Gibbs sampling (Algorithm 1) is on the space $\Omega\triangleq Q^{V}$, and has stationary distribution $\mu_{\mathcal{I}}$ [28, Chapter 3].

Initialization: an initial state $\bm{X}_{0}\in\Omega$ (not necessarily feasible);
for $t=1,2,\ldots,T$ do
  pick $v_{t}\in V$ uniformly at random;
  draw a random value $c\in Q$ from the marginal distribution $\mu_{v_{t}}(\cdot\mid X_{t-1}(\Gamma_{v_{t}}))$;
  $X_{t}(v_{t})\leftarrow c$ and $X_{t}(u)\leftarrow X_{t-1}(u)$ for all $u\in V\setminus\{v_{t}\}$;
Algorithm 1: Gibbs sampling
Marginal distributions

Here $\mu_{v}(\cdot\mid\sigma(\Gamma_{v}))=\mu_{v,\mathcal{I}}(\cdot\mid\sigma(\Gamma_{v}))$ denotes the marginal distribution at $v\in V$ conditioning on $\sigma(\Gamma_{v})\in Q^{\Gamma_{v}}$, which is computed as:

(4) \forall c\in Q:\quad\mu_{v}(c\mid\sigma(\Gamma_{v}))=\frac{\exp\left(\phi_{v}(c)+\sum_{u\in\Gamma_{v}}\phi_{uv}(\sigma_{u},c)\right)}{\sum_{c^{\prime}\in Q}\exp\left(\phi_{v}(c^{\prime})+\sum_{u\in\Gamma_{v}}\phi_{uv}(\sigma_{u},c^{\prime})\right)}.

Due to assumption (1), this marginal distribution is always well-defined, and its computation uses only the information in $\mathcal{I}_{v}$.
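Putting Algorithm 1 and (4) together, one run of the Gibbs sampling chain can be sketched as follows; this is an illustrative transcription using the dictionary representation of potentials assumed earlier, not the data structure used by the dynamic algorithm.

```python
import math
import random

def conditional_marginal(v, X, Q, phi_v, phi_e, neighbors):
    """The marginal mu_v(. | X(Gamma_v)) of (4); well defined under assumption (1)."""
    weights = [math.exp(phi_v[v][c] + sum(phi_e[frozenset({u, v})][(X[u], c)]
                                          for u in neighbors[v]))
               for c in Q]
    Z = sum(weights)
    return [w / Z for w in weights]

def gibbs_sampling(V, Q, phi_v, phi_e, neighbors, T, X0, rng=random):
    """Run T steps of Algorithm 1 from the initial state X0 and return X_T."""
    V = list(V)
    X = dict(X0)
    for _ in range(T):
        v = rng.choice(V)                        # pick v_t uniformly at random
        probs = conditional_marginal(v, X, Q, phi_v, phi_e, neighbors)
        X[v] = rng.choices(Q, weights=probs)[0]  # resample X(v_t) from the marginal
        # all other coordinates stay unchanged
    return X
```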

Coupling for mixing time

Consider a chain $(\bm{X}_{t})_{t=0}^{\infty}$ on space $\Omega$ with stationary distribution $\mu_{\mathcal{I}}$ for an MRF instance $\mathcal{I}$. The mixing rate is defined as: for $\epsilon>0$,

\tau_{\mathsf{mix}}(\mathcal{I},\epsilon)\triangleq\max_{\bm{X}_{0}}\min\left\{t\mid d_{\mathrm{TV}}\left({\bm{X}_{t}},{\mu_{\mathcal{I}}}\right)\leq\epsilon\right\},

where $d_{\mathrm{TV}}\left({\bm{X}_{t}},{\mu_{\mathcal{I}}}\right)$ denotes the total variation distance between $\mu_{\mathcal{I}}$ and the distribution of $\bm{X}_{t}$.

A coupling of a Markov chain is a joint process $(\bm{X}_{t},\bm{Y}_{t})_{t\geq 0}$ such that $(\bm{X}_{t})_{t\geq 0}$ and $(\bm{Y}_{t})_{t\geq 0}$ marginally follow the same transition rule as the original chain. Consider the following type of coupling.

Definition 4.2 (one-step optimal coupling for Gibbs sampling).

A coupling $(\bm{X}_{t},\bm{Y}_{t})_{t\geq 0}$ of Gibbs sampling on an MRF instance $\mathcal{I}=(V,E,Q,\Phi)$ is a one-step optimal coupling if it is constructed as follows: for $t=1,2,\ldots$,

  1. pick the same random $v_{t}\in V$, and let $(X_{t}(u),Y_{t}(u))\leftarrow(X_{t-1}(u),Y_{t-1}(u))$ for all $u\neq v_{t}$;

  2. sample $(X_{t}(v_{t}),Y_{t}(v_{t}))$ from an optimal coupling $D_{\mathsf{opt},\mathcal{I}_{v_{t}}}^{\sigma,\tau}(\cdot,\cdot)$ of the marginal distributions $\mu_{v_{t}}(\cdot\mid\sigma)$ and $\mu_{v_{t}}(\cdot\mid\tau)$, where $\sigma=X_{t-1}(\Gamma_{v_{t}})$ and $\tau=Y_{t-1}(\Gamma_{v_{t}})$.

The coupling $D_{\mathsf{opt},\mathcal{I}_{v_{t}}}^{\sigma,\tau}(\cdot,\cdot)$ is an optimal coupling of $\mu_{v_{t}}(\cdot\mid\sigma)$ and $\mu_{v_{t}}(\cdot\mid\tau)$ that attains the maximum $\Pr[\bm{x}=\bm{y}]$ over all couplings $(\bm{x},\bm{y})$ of $\bm{x}\sim\mu_{v_{t}}(\cdot\mid\sigma)$ and $\bm{y}\sim\mu_{v_{t}}(\cdot\mid\tau)$. The coupling $D_{\mathsf{opt},\mathcal{I}_{v_{t}}}^{\sigma,\tau}(\cdot,\cdot)$ is determined by the local information $\mathcal{I}_{v_{t}}$ and $\sigma,\tau\in Q^{\mathrm{deg}(v_{t})}$.
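A single coupled transition of Definition 4.2 can be sketched as follows, assuming the helper routines conditional_marginal and sample_optimal_coupling from the earlier sketches in this section are in scope; it illustrates the coupling only, not the dynamic update procedure.

```python
import random

def one_step_optimal_coupling(X, Y, V, Q, phi_v, phi_e, neighbors, rng=random):
    """One coupled transition (Definition 4.2) of two Gibbs sampling chains on the
    same instance; assumes conditional_marginal and sample_optimal_coupling from
    the sketches above are in scope."""
    v = rng.choice(list(V))                      # the same random vertex v_t for both chains
    mu_sigma = conditional_marginal(v, X, Q, phi_v, phi_e, neighbors)
    mu_tau = conditional_marginal(v, Y, Q, phi_v, phi_e, neighbors)
    x, y = sample_optimal_coupling(dict(zip(Q, mu_sigma)),
                                   dict(zip(Q, mu_tau)), rng)
    X, Y = dict(X), dict(Y)                      # all other coordinates are copied
    X[v], Y[v] = x, y
    return X, Y
```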

With such a coupling, we can establish the following relation between the Dobrushin-Shlosman condition and the rapid mixing of the Gibbs sampling [11, 12, 13, 4, 20, 10].

Proposition 4.3 ([4, 20]).

Let $\mathcal{I}=(V,E,Q,\Phi)$ be an MRF instance with $n=|V|$, and $\Omega=Q^{V}$ the state space. Let $H(\sigma,\tau)\triangleq\left|\{v\in V\mid\sigma_{v}\neq\tau_{v}\}\right|$ denote the Hamming distance between $\sigma\in\Omega$ and $\tau\in\Omega$. If $\mathcal{I}$ satisfies the Dobrushin-Shlosman condition (Condition 3.1) with constant $\delta>0$, then the one-step optimal coupling $(\bm{X}_{t},\bm{Y}_{t})_{t\geq 0}$ for Gibbs sampling (Definition 4.2) satisfies

\forall\,\sigma,\tau\in\Omega:\quad\mathbb{E}\left[{\,H(\bm{X}_{t},\bm{Y}_{t})\mid\bm{X}_{t-1}=\sigma\land\bm{Y}_{t-1}=\tau\,}\right]\leq\left(1-\frac{\delta}{n}\right)\cdot H(\sigma,\tau),

and hence the mixing rate of Gibbs sampling on $\mathcal{I}$ is bounded as $\tau_{\mathsf{mix}}(\mathcal{I},\epsilon)\leq\left\lceil\frac{n}{\delta}\log\frac{n}{\epsilon}\right\rceil$.

5. Outline of the algorithm

Let $\bm{\theta}:\mathfrak{M}\to\mathbb{R}^{K}$ be a probabilistic inference problem that maps each MRF instance in $\mathfrak{M}$ to a $K$-dimensional probability vector, and let $\mathcal{E}_{\bm{\theta}}$ be its estimating function. Let $\mathcal{I}=(V,E,Q,\Phi)\in\mathfrak{M}$ be the current instance, where $n=|V|$. Our dynamic inference algorithm maintains a sequence of $N(n)$ independent samples $\bm{X}^{(1)},\ldots,\bm{X}^{(N(n))}\in Q^{V}$ which are $\epsilon(n)$-close to the Gibbs distribution $\mu_{\mathcal{I}}$ in total variation distance, and an $(N,\epsilon)$-estimator $\hat{\bm{\theta}}(\mathcal{I})$ of $\bm{\theta}(\mathcal{I})$ such that

\hat{\bm{\theta}}(\mathcal{I})=\mathcal{E}_{\bm{\theta}}(\bm{X}^{(1)},\bm{X}^{(2)},\ldots,\bm{X}^{(N(n))}).

Upon an update request that modifies $\mathcal{I}$ to a new instance $\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime})\in\mathfrak{M}$, where $n^{\prime}=|V^{\prime}|$, our algorithm does the following:

  • Update the sample sequence. Update $\bm{X}^{(1)},\ldots,\bm{X}^{(N(n))}$ to a new sequence of $N(n^{\prime})$ independent samples $\bm{Y}^{(1)},\ldots,\bm{Y}^{(N(n^{\prime}))}\in Q^{V^{\prime}}$ that are $\epsilon(n^{\prime})$-close to $\mu_{\mathcal{I}^{\prime}}$ in total variation distance, and output the difference between the two sample sequences.

  • Update the estimator. Given the difference between the two sample sequences, update $\hat{\bm{\theta}}(\mathcal{I})$ to $\hat{\bm{\theta}}(\mathcal{I}^{\prime})=\mathcal{E}_{\bm{\theta}}(\bm{Y}^{(1)},\ldots,\bm{Y}^{(N(n^{\prime}))})$ by accessing the oracle in Definition 2.3.

Obviously, the updated estimator $\hat{\bm{\theta}}(\mathcal{I}^{\prime})$ is an $(N,\epsilon)$-estimator for $\bm{\theta}(\mathcal{I}^{\prime})$.

Our main technical contribution is an algorithm that dynamically maintains a sequence of $N(n)$ independent samples for $\mu_{\mathcal{I}}$ while $\mathcal{I}$ itself is dynamically changing. The dynamic sampling problem was recently introduced in [16]. The dynamic sampling algorithm given there only handles updates of a single vertex or edge and works only for graphical models with soft constraints.

In contrast, our dynamic sampling algorithm maintains a sequence of $N(n)$ independent samples for $\mu_{\mathcal{I}}$ within total variation distance $\epsilon(n)$, while the entire specification of the graphical model $\mathcal{I}$ is subject to dynamic update (to a new $\mathcal{I}^{\prime}$ with difference $d(\mathcal{I},\mathcal{I}^{\prime})\leq L=o(n)$). Specifically, the algorithm updates the sample sequence within expected time $O(\Delta^{2}N(n)L\log^{3}n+\Delta n)$. Note that the extra $O(\Delta n)$ cost is necessary just for editing the current MRF instance $\mathcal{I}$ to $\mathcal{I}^{\prime}$, because a single update may change all the vertex and edge potentials simultaneously. This incremental time cost dominates the time cost of the dynamic inference algorithm, and is efficient for maintaining $N(n)$ independent samples, especially when $N(n)$ is sufficiently large, e.g. $N(n)=\Omega(n/L)$, in which case the average incremental cost for updating each sample is $O(\Delta^{2}L\log^{3}n+{\Delta n}/{N(n)})=O(\Delta^{2}L\log^{3}n)$.

We illustrate the main idea by explaining how to maintain one sample. The idea is to represent the trace of the Markov chain for generating the sample by a dynamic data structure, and when the MRF instance is changed, this trace is modified to that of the new chain for generating the sample for the updated instance. This is achieved by both a set of efficient dynamic data structures and the coupling between the two Markov chains.

Specifically, let $(\bm{X}_{t})_{t=0}^{T}$ be the Gibbs sampler chain for the distribution $\mu_{\mathcal{I}}$. When the chain is rapidly mixing, starting from an arbitrary initial configuration $\bm{X}_{0}\in Q^{V}$, after suitably many steps $\bm{X}=\bm{X}_{T}$ is an accurate enough sample for $\mu_{\mathcal{I}}$. At each step, $\bm{X}_{t-1}$ and $\bm{X}_{t}$ may differ only at a vertex $v_{t}$ which is picked from $V$ uniformly and independently at random. The evolution of the chain is fully captured by the initial state $\bm{X}_{0}$ and the sequence of pairs $\left\langle\,{v_{t},X_{t}(v_{t})}\,\right\rangle$, from $t=1$ to $t=T$, which is called the execution log of the chain in this paper.
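Conceptually, the execution log can be stored as a map from steps to transitions together with, for each vertex, the sorted list of steps at which it was resampled. The sketch below is a simplified stand-in (with our own naming) for the dynamic data structures developed in Section 6, supporting the two operations used informally here: reading $X_{t}(v)$ for any $t$, and locating the steps at which a given vertex was resampled.

```python
import bisect
from collections import defaultdict

class ExecutionLog:
    """Execution log <v_t, X_t(v_t)>_{t=1..T} of a Gibbs sampling chain.

    Stores the initial state, the transition of every step, and, per vertex,
    the increasing list of steps at which that vertex was resampled, so that
    X_t(v) can be read without replaying the chain.
    """
    def __init__(self, X0):
        self.X0 = dict(X0)
        self.transitions = {}               # t -> (v_t, X_t(v_t))
        self.times_of = defaultdict(list)   # v -> increasing list of steps t with v_t = v

    def append(self, t, v, value):
        """Record the transition <v_t, X_t(v_t)> of step t (steps arrive in order)."""
        self.transitions[t] = (v, value)
        self.times_of[v].append(t)

    def value_at(self, v, t):
        """Return X_t(v): the value set by the last resampling of v at a step <= t."""
        times = self.times_of[v]
        i = bisect.bisect_right(times, t)
        return self.X0[v] if i == 0 else self.transitions[times[i - 1]][1]

    def steps_updating(self, v):
        """All steps t with v_t = v (the candidate steps to edit after an update)."""
        return list(self.times_of[v])
```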

Now suppose that the current instance $\mathcal{I}$ is updated to $\mathcal{I}^{\prime}$. We construct a coupling between the original chain $(\bm{X}_{t})_{t=0}^{T}$ and the new chain $(\bm{Y}_{t})_{t=0}^{T}$ such that $(\bm{Y}_{t})_{t=0}^{T}$ is a faithful Gibbs sampling chain for the updated instance $\mathcal{I}^{\prime}$ given that $(\bm{X}_{t})_{t=0}^{T}$ is a faithful chain for $\mathcal{I}$, and the difference between the two chains is small, in the sense that they have almost the same execution logs except for about $O(TL/n)$ steps, where $L$ is the difference between $\mathcal{I}$ and $\mathcal{I}^{\prime}$.

To simplify the exposition of this coupling, for now we restrict ourselves to the case where the update to the instance $\mathcal{I}$ does not change the set of variables. Without loss of generality, we only consider the following two basic update operations that modify $\mathcal{I}$ to $\mathcal{I}^{\prime}$.

  • Graph update. The update only adds or deletes some edges, while all vertex potentials and the potentials of unaffected edges are not changed.

  • Hamiltonian update. The update changes (possibly all) potentials of vertices and edges, while the underlying graph remains unchanged.

A general update of the graphical model can be obtained by combining these two basic operations.

Then the new chain $(\bm{Y}_{t})_{t=0}^{T}$ can be coupled with $(\bm{X}_{t})_{t=0}^{T}$ by using the same initial configuration $\bm{Y}_{0}=\bm{X}_{0}$ and the same sequence $v_{1},v_{2},\ldots,v_{T}\in V$ of randomly picked vertices. For $t=1,2,\ldots,T$, the transition $\left\langle\,{v_{t},Y_{t}(v_{t})}\,\right\rangle$ of the new chain can be generated using the same vertex $v_{t}$ as in the original chain $(\bm{X}_{t})_{t=0}^{T}$, and a random $Y_{t}(v_{t})$ generated according to a coupling of the marginal distributions of $X_{t}(v_{t})$ and $Y_{t}(v_{t})$, conditioning respectively on the current states of the neighborhood of $v_{t}$ in $(\bm{X}_{t})_{t=0}^{T}$ and $(\bm{Y}_{t})_{t=0}^{T}$. Note that these two marginal distributions must be identical unless (I) $\bm{X}_{t-1}$ and $\bm{Y}_{t-1}$ differ from each other over the neighborhood of $v_{t}$, or (II) $v_{t}$ itself is incident to where the models $\mathcal{I}$ and $\mathcal{I}^{\prime}$ differ. The event (II) occurs rarely, for the following reasons.

  • For a graph update, event (II) occurs only if $v_{t}$ is incident to an updated edge. Since only $L$ edges are updated, the event occurs in at most $O(TL/n)$ steps in expectation.

  • For a Hamiltonian update, all the potentials of vertices and edges can be changed, thus $\mathcal{I},\mathcal{I}^{\prime}$ may differ everywhere. The key observation is that, as the total difference between the current and updated potentials is bounded by $L$, we can apply a filter to first select all candidate steps where the coupling may actually fail due to the difference between $\mathcal{I}$ and $\mathcal{I}^{\prime}$, whose number can be as small as $O(TL/n)$, and the actual coupling between $(\bm{X}_{t})_{t=0}^{T}$ and $(\bm{Y}_{t})_{t=0}^{T}$ is constructed with such a prior.

Finally, when $\mathcal{I}$ and $\mathcal{I}^{\prime}$ both satisfy the Dobrushin-Shlosman condition, the percolation of disagreements between $(\bm{X}_{t})_{t=0}^{T}$ and $(\bm{Y}_{t})_{t=0}^{T}$ is bounded, and we show that the two chains are almost always identically coupled, i.e. $\left\langle\,{v_{t},X_{t}(v_{t})}\,\right\rangle=\left\langle\,{v_{t},Y_{t}(v_{t})}\,\right\rangle$, with exceptions at only $O(TL/n)$ steps. The original chain $(\bm{X}_{t})_{t=0}^{T}$ can then be updated to the new chain $(\bm{Y}_{t})_{t=0}^{T}$ by only editing these $O(TL/n)$ local transitions $\left\langle\,{v_{t},Y_{t}(v_{t})}\,\right\rangle$ which differ from $\left\langle\,{v_{t},X_{t}(v_{t})}\,\right\rangle$. This is aided by the dynamic data structure for the execution log of the chain, which is of independent interest.

6. Dynamic Gibbs sampling

In this section, we give the dynamic sampling algorithm that updates the sample sequences.

In the following theorem, we use $\mathcal{I}=(V,E,Q,\Phi)$, where $n=|V|$, to denote the current MRF instance and $\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime})$, where $n^{\prime}=|V^{\prime}|$, to denote the updated MRF instance. Define

d_{\textsf{graph}}(\mathcal{I},\mathcal{I}^{\prime})\triangleq|V\oplus V^{\prime}|+|E\oplus E^{\prime}|,
d_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}^{\prime})\triangleq\sum_{v\in V\cap V^{\prime}}\left\|\phi_{v}-\phi^{\prime}_{v}\right\|_{1}+\sum_{e\in E\cap E^{\prime}}\left\|\phi_{e}-\phi^{\prime}_{e}\right\|_{1}.

Note that $d(\mathcal{I},\mathcal{I}^{\prime})=d_{\textsf{graph}}(\mathcal{I},\mathcal{I}^{\prime})+d_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}^{\prime})$, where $d(\mathcal{I},\mathcal{I}^{\prime})$ is defined in (2).

Theorem 6.1 (dynamic sampling algorithm).

Let $N:\mathbb{N}^{+}\to\mathbb{N}^{+}$ and $\epsilon:\mathbb{N}^{+}\to(0,1)$ be two functions satisfying the bounded difference condition in Definition 2.3. Assume that $\mathcal{I}$ and $\mathcal{I}^{\prime}$ both satisfy the Dobrushin-Shlosman condition, $d_{\textsf{graph}}(\mathcal{I},\mathcal{I}^{\prime})\leq L_{\mathsf{graph}}=o(n)$, and $d_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}^{\prime})\leq L_{\mathsf{Hamil}}$.

There is an algorithm that maintains a sequence of $N(n)$ independent samples $\bm{X}^{(1)},\ldots,\bm{X}^{(N(n))}\in Q^{V}$, where $d_{\mathrm{TV}}\left({\mu_{\mathcal{I}}},{\bm{X}^{(i)}}\right)\leq\epsilon(n)$ for all $1\leq i\leq N(n)$, using $O\left(nN(n)\log n\right)$ memory words, each of $O(\log n)$ bits, such that when $\mathcal{I}$ is updated to $\mathcal{I}^{\prime}$, the algorithm updates the sequence to $N(n^{\prime})$ independent samples $\bm{Y}^{(1)},\ldots,\bm{Y}^{(N(n^{\prime}))}\in Q^{V^{\prime}}$, where $d_{\mathrm{TV}}\left({\mu_{\mathcal{I}^{\prime}}},{\bm{Y}^{(i)}}\right)\leq\epsilon(n^{\prime})$ for all $1\leq i\leq N(n^{\prime})$, within expected time cost

(5) O\left(\Delta^{2}(L_{\mathsf{graph}}+L_{\mathsf{Hamil}})N(n)\log^{3}n+\Delta n\right),

where $\Delta=\max\{\Delta_{G},\Delta_{G^{\prime}}\}$, and $\Delta_{G},\Delta_{G^{\prime}}$ denote the maximum degrees of $G=(V,E)$ and $G^{\prime}=(V^{\prime},E^{\prime})$.

Our algorithm is based on the Gibbs sampling algorithm. Let $N:\mathbb{N}^{+}\to\mathbb{N}^{+}$ and $\epsilon:\mathbb{N}^{+}\to(0,1)$ be the two functions in Theorem 6.1. We first give the single-sample dynamic Gibbs sampling algorithm (Algorithm 2), which maintains a single sample $\bm{X}\in Q^{V}$ for the current MRF instance $\mathcal{I}=(V,E,Q,\Phi)$, where $n=|V|$, such that $d_{\mathrm{TV}}\left({\bm{X}},{\mu_{\mathcal{I}}}\right)\leq\epsilon(n)$. We then use this algorithm to obtain the multi-sample dynamic Gibbs sampling algorithm that maintains $N(n)$ independent samples for the current instance.

Given the error function $\epsilon:\mathbb{N}^{+}\to(0,1)$, suppose that $T(\mathcal{I})$ is an easy-to-compute integer-valued function that upper bounds the mixing time on instance $\mathcal{I}$, such that

(6) T(\mathcal{I})\geq\tau_{\textsf{mix}}(\mathcal{I},\epsilon(n)),

where $\tau_{\textsf{mix}}(\mathcal{I},\epsilon(n))$ denotes the mixing rate of the Gibbs sampling chain $(\bm{X}_{t})_{t\geq 0}$ on instance $\mathcal{I}$. By Proposition 4.3, if the Dobrushin-Shlosman condition is satisfied, we can set

(7) T(\mathcal{I})=\left\lceil\frac{n}{\delta}\log\frac{n}{\epsilon(n)}\right\rceil.

Our algorithm for single-sample dynamic Gibbs sampling maintains a random process $(\bm{X}_{t})_{t=0}^{T}$, which is a Gibbs sampling chain on instance $\mathcal{I}$ of length $T=T(\mathcal{I})$, where $T(\mathcal{I})$ satisfies (6). Clearly $\bm{X}_{T}$ is a sample for $\mu_{\mathcal{I}}$ with $d_{\mathrm{TV}}\left({\bm{X}_{T}},{\mu_{\mathcal{I}}}\right)\leq\epsilon(n)$.

When the current instance $\mathcal{I}$ is updated to a new instance $\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime})$, where $n^{\prime}=|V^{\prime}|$, the original process $(\bm{X}_{t})_{t=0}^{T}$ is transformed to a new process $(\bm{Y}_{t})_{t=0}^{T^{\prime}}$ such that the following holds as an invariant: $(\bm{Y}_{t})_{t=0}^{T^{\prime}}$ is a Gibbs sampling chain on $\mathcal{I}^{\prime}$ with $T^{\prime}=T(\mathcal{I}^{\prime})$. Hence $\bm{Y}_{T^{\prime}}$ is a sample for the new instance $\mathcal{I}^{\prime}$ with $d_{\mathrm{TV}}\left({\bm{Y}_{T^{\prime}}},{\mu_{\mathcal{I}^{\prime}}}\right)\leq\epsilon(n^{\prime})$. This is achieved through the following two steps:

  1. We construct couplings between $(\bm{X}_{t})_{t=0}^{T}$ and $(\bm{Y}_{t})_{t=0}^{T^{\prime}}$, so that the new process $(\bm{Y}_{t})_{t=0}^{T^{\prime}}$ for $\mathcal{I}^{\prime}$ can be obtained by making small changes to the original process $(\bm{X}_{t})_{t=0}^{T}$ for $\mathcal{I}$.

  2. We give a data structure which represents $(\bm{X}_{t})_{t=0}^{T}$ incrementally and supports various updates and queries to $(\bm{X}_{t})_{t=0}^{T}$ so that the above coupling can be generated efficiently.

6.1. Coupling for dynamic instances

The Gibbs sampling chain $(\bm{X}_{t})_{t=0}^{T}$ can be uniquely and fully recovered from: the initial state $\bm{X}_{0}\in Q^{V}$, and the pairs $\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{T}$ that record the transitions. We call $\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{T}$ the execution-log for the chain $(\bm{X}_{t})_{t=0}^{T}$, and denote it by

\mathsf{Exe\text{-}Log}(\mathcal{I},T)\triangleq\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{T}.

The following invariants are assumed for the random execution-log with an initial state.

Condition 6.2 (invariants for Exe-Log).

Fix an initial state $\bm{X}_{0}\in Q^{V}$. The following invariants hold for the random execution-log $\mathsf{Exe\text{-}Log}(\mathcal{I},T)=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{T}$ of the Gibbs sampling chain $(\bm{X}_{t})_{t=0}^{T}$ on instance $\mathcal{I}=(V,E,Q,\Phi)$:

  • $T=T(\mathcal{I})$, where $T(\mathcal{I})$ satisfies (6);

  • the random process $(\bm{X}_{t})_{t=0}^{T}$, uniquely recovered from the transitions $\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{T}$ and the initial state $\bm{X}_{0}$, is identically distributed as the Gibbs sampling chain (Algorithm 1) on instance $\mathcal{I}$ starting from initial state $\bm{X}_{0}$ with $v_{t}$ as the vertex picked at the $t$-th step.

These invariants guarantee that $\bm{X}_{T}$ provides a sample for $\mu_{\mathcal{I}}$ with $d_{\mathrm{TV}}\left({\bm{X}_{T}},{\mu_{\mathcal{I}}}\right)\leq\epsilon(|V|)$.

Suppose the current instance \mathcal{I} is updated to a new instance \mathcal{I}^{\prime}. We construct couplings between the execution-log 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Xt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I},T)=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} with initial state 𝑿0QV\bm{X}_{0}\in Q^{V} for \mathcal{I} and the execution-log 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Yt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T^{\prime})=\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T^{\prime}}} with initial state 𝒀0QV\bm{Y}_{0}\in Q^{V^{\prime}} for \mathcal{I}^{\prime}. Our goal is as follows: assuming Condition 6.2 for 𝑿0\bm{X}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)\mathsf{Exe\text{-}Log}(\mathcal{I},T), the same condition should hold invariantly for 𝒀0\bm{Y}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T^{\prime}).

Unlike traditional coupling of Markov chains for the analysis of mixing time, where the two chains start from arbitrarily distinct initial states but proceed by the same transition rule, here the two chains (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} and (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} start from similar states but have to obey different transition rules due to differences between instances \mathcal{I} and \mathcal{I}^{\prime}.

For technical reasons, we divide the update from =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) to =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime}) into two steps: we first update =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) to

(8) 𝗆𝗂𝖽=(V,E,Q,Φ𝗆𝗂𝖽),\displaystyle\mathcal{I}_{\mathsf{mid}}=(V,E,Q,\Phi^{\mathsf{mid}}),

where the potentials Φ𝗆𝗂𝖽=(ϕa𝗆𝗂𝖽)aVE\Phi^{\mathsf{mid}}=(\phi^{\mathsf{mid}}_{a})_{a\in V\cup E} in the middle instance 𝗆𝗂𝖽\mathcal{I}_{\mathsf{mid}} are defined as

aVE,ϕa𝗆𝗂𝖽{ϕaif aVEϕaif aVE;\displaystyle\forall a\in V\cup E,\quad\phi^{\mathsf{mid}}_{a}\triangleq\begin{cases}\phi^{\prime}_{a}&\text{if }a\in V^{\prime}\cup E^{\prime}\\ \phi_{a}&\text{if }a\not\in V^{\prime}\cup E^{\prime};\end{cases}

then we update 𝗆𝗂𝖽=(V,E,Q,Φ𝗆𝗂𝖽)\mathcal{I}_{\mathsf{mid}}=(V,E,Q,\Phi^{\mathsf{mid}}) to =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime}). In other words, the update from \mathcal{I} to 𝗆𝗂𝖽\mathcal{I}_{\mathsf{mid}} is only caused by updating the potentials of vertices and edges, while the underlying graph remains unchanged; and the update from 𝗆𝗂𝖽\mathcal{I}_{\mathsf{mid}} to \mathcal{I}^{\prime} is only caused by updating the underlying graph, i.e. adding vertices, deleting vertices, adding edges and deleting edges.
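As a small illustration of this decomposition (a sketch under our own representation, not the paper's pseudocode; the names phi, phi_new and build_phi_mid are hypothetical), the middle potentials Φ𝗆𝗂𝖽\Phi^{\mathsf{mid}} of (8) can be assembled by a dictionary merge over the old vertices and edges: any aVEa\in V\cup E that survives into \mathcal{I}^{\prime} takes its new potential, and the rest keep their old ones.

# Minimal sketch of (8): build Phi^mid on the OLD graph (V, E).
# phi has the keys V ∪ E (old instance), phi_new has the keys V' ∪ E' (new
# instance); values are the potential tables.
def build_phi_mid(phi, phi_new):
    phi_mid = {}
    for a, table in phi.items():           # a ranges over V ∪ E
        # take phi'_a if a ∈ V' ∪ E', otherwise keep phi_a
        phi_mid[a] = phi_new.get(a, table)
    return phi_mid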

The dynamic Gibbs sampling algorithm can be outlined as follows.

  • UpdateHamiltonian: update 𝑿0\bm{X}_{0} and vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} to a new initial state 𝒁0\bm{Z}_{0} and a new execution log 𝖤𝗑𝖾-𝖫𝗈𝗀(𝗆𝗂𝖽,T)=ut,Zt(ut)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}_{\mathsf{mid}},T)=\left\langle{u_{t}},Z_{t}({u_{t}})\right\rangle_{t=1}^{{T}} such that the random process (𝒁t)t=0T(\bm{Z}_{t})_{t=0}^{T} is the Gibbs sampling on instance 𝗆𝗂𝖽\mathcal{I}_{\mathsf{mid}}.

  • UpdateGraph: update 𝒁0\bm{Z}_{0} and ut,Zt(ut)t=1T\left\langle{u_{t}},Z_{t}({u_{t}})\right\rangle_{t=1}^{{T}} to a new initial state 𝒀0\bm{Y}_{0} and a new execution log 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Yt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T)=\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T}} such that the random process (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} is the Gibbs sampling on instance \mathcal{I}^{\prime}.

  • LengthFix: change the length of the execution log vt,Yt(vt)t=1T\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T}} from TT to TT^{\prime}, where T=T()T^{\prime}=T(\mathcal{I}^{\prime}) and T()T(\mathcal{I}^{\prime}) satisfies (6).

The dynamic Gibbs sampling algorithm is given in Algorithm 2.

Data : 𝑿0QV\bm{X}_{0}\in Q^{V} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Xt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I},T)=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} for current =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi).
Update : an update that modifies \mathcal{I} to =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime}).
1 compute T=T()T^{\prime}=T(\mathcal{I}^{\prime}) satisfying (6) and construct 𝗆𝗂𝖽=(V,E,Q,Φ𝗆𝗂𝖽)\mathcal{I}_{\mathsf{mid}}=(V,E,Q,\Phi^{\mathsf{mid}}) as in (8);
2 (𝒁0,ut,Zt(ut)t=1T)UpdateHamiltonian(,𝗆𝗂𝖽,𝑿0,vt,Xt(vt)t=1T)\left(\bm{Z}_{0},\left\langle{u_{t}},Z_{t}({u_{t}})\right\rangle_{t=1}^{{T}}\right)\leftarrow\textsf{UpdateHamiltonian}\left(\mathcal{I},\mathcal{I}_{\mathsf{mid}},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}\right);
// update the potentials: 𝗆𝗂𝖽\mathcal{I}\to\mathcal{I}_{\mathsf{mid}}
3 (𝒀0,vt,Yt(vt)t=1T)UpdateGraph(𝗆𝗂𝖽,,𝒁0,ut,Zt(ut)t=1T)\left(\bm{Y}_{0},\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T}}\right)\leftarrow\textsf{UpdateGraph}\left(\mathcal{I}_{\mathsf{mid}},\mathcal{I}^{\prime},\bm{Z}_{0},\left\langle{u_{t}},Z_{t}({u_{t}})\right\rangle_{t=1}^{{T}}\right);
// update the underlying graph: 𝗆𝗂𝖽\mathcal{I}_{\mathsf{mid}}\to\mathcal{I}^{\prime}
4 (𝒀0,vt,Yt(vt)t=1T)LengthFix(,𝒀0,vt,Yt(vt)t=1T,T)\left(\bm{Y}_{0},\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T^{\prime}}}\right)\leftarrow\textsf{LengthFix}\left(\mathcal{I}^{\prime},\bm{Y}_{0},\left\langle{v^{\prime}_{t}},Y_{t}({v^{\prime}_{t}})\right\rangle_{t=1}^{{T}},T^{\prime}\right), where T=T()T^{\prime}=T(\mathcal{I}^{\prime}) ;
// change the length of the execution log from TT to T=T()T^{\prime}=T(\mathcal{I}^{\prime})
5 update the data to 𝒀0\bm{Y}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Yt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T^{\prime})=\left\langle{v^{\prime}_{t}},Y_{t}({v^{\prime}_{t}})\right\rangle_{t=1}^{{T^{\prime}}};
Algorithm 2 Dynamic Gibbs sampling
Data : 𝑿0QV\bm{X}_{0}\in Q^{V} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Xt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I},T)=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} for current =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi).
Input : the new length T>0T^{\prime}>0.
1 if T<TT^{\prime}<T then
2  truncate vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} to vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T^{\prime}}};
3else
4  extend vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} to vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T^{\prime}}} by simulating the Gibbs sampling chain on \mathcal{I} for TTT^{\prime}-T more steps;
update the data to 𝑿0\bm{X}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Xt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I},T^{\prime})=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T^{\prime}}}
Algorithm 3 LengthFix(,𝑿0,vt,Xt(vt)t=1T,T)\textsf{LengthFix}\left(\mathcal{I},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}},T^{\prime}\right)

The subroutine LengthFix is given in Algorithm 3. We then describe UpdateHamiltonian (Section 6.1.1) and UpdateGraph (Section 6.1.2).

6.1.1. Coupling for Hamiltonian update

We consider the update that changes the potentials of vertices and edges. The update does not change the underlying graph. Let =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) be the current MRF instance. Let 𝑿0\bm{X}_{0} and vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} be the current initial state and execution log such that the random process (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} is the Gibbs sampling on instance \mathcal{I}. Upon such an update, the new instance becomes =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V,E,Q,\Phi^{\prime}). The algorithm UpdateHamiltonian(,,𝑿0,vt,Xt(vt)t=1T)\textsf{UpdateHamiltonian}(\mathcal{I},\mathcal{I}^{\prime},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}) updates the data to 𝒀0\bm{Y}_{0} and vt,Yt(vt)t=1T\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T}} such that the random process (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} is the Gibbs sampling on instance \mathcal{I}^{\prime}.

We transform the pair of 𝑿0QV\bm{X}_{0}\in Q^{V} and vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} to a new pair of 𝒀0QV\bm{Y}_{0}\in Q^{V} and vt,Yt(vt)t=1T\left\langle{v_{t}},Y_{t}({v_{t}})\right\rangle_{t=1}^{{T}} for \mathcal{I}^{\prime}. This is achieved as follows: the vertex sequence (vt)t=1T(v_{t})_{t=1}^{T} is identically coupled and the chain (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} is transformed to (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} by the following one-step local coupling between 𝑿\bm{X} and 𝒀\bm{Y}.

Definition 6.3 (one-step local coupling for Hamiltonian update).

The two chains (𝑿t)t=0(\bm{X}_{t})_{t=0}^{\infty} on instance =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) and (𝒀t)t=0(\bm{Y}_{t})_{t=0}^{\infty} on instance =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V,E,Q,\Phi^{\prime}) are coupled as:

  • Initially 𝑿0=𝒀0QV\bm{X}_{0}=\bm{Y}_{0}\in Q^{V};

  • for t=1,2,t=1,2,\ldots, the two chains 𝑿\bm{X} and 𝒀\bm{Y} jointly do:

    1. (1)

      pick the same vtVv_{t}\in V, and let (Xt(u),Yt(u))(Xt1(u),Yt1(u))(X_{t}(u),Y_{t}(u))\leftarrow(X_{t-1}(u),Y_{t-1}(u)) for all uV{vt}u\in V\setminus\{v_{t}\};

    2. (2)

      sample (Xt(vt),Yt(vt))(X_{t}(v_{t}),Y_{t}(v_{t})) from a coupling Dvt,vtσ,τ(,)D_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}^{\sigma,\tau}(\cdot,\cdot) of the marginal distributions μvt,(σ)\mu_{{v_{t}},{\mathcal{I}}}(\cdot\mid\sigma) and μvt,(τ)\mu_{{v_{t}},{\mathcal{I}^{\prime}}}(\cdot\mid\tau) with σ=Xt1(ΓG(vt))\sigma=X_{t-1}(\Gamma_{G}({v_{t}})) and τ=Yt1(ΓG(vt))\tau=Y_{t-1}(\Gamma_{G}({v_{t}})), where G=(V,E)G=(V,E).

The local coupling Dv,vσ,τ(,)D_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau}(\cdot,\cdot) for Hamiltonian update is specified as follows.

Definition 6.4 (local coupling Dv,vσ,τ(,)D_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau}(\cdot,\cdot) for Hamiltonian update).

Let vVv\in V be a vertex and let σ,τQΓG(v)\sigma,\tau\in Q^{\Gamma_{G}(v)} be two configurations, where G=(V,E)G=(V,E). We say a random pair (c,c)Q2(c,c^{\prime})\in Q^{2} is drawn from the coupling Dv,vσ,τ(,)D_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau}(\cdot,\cdot) if (c,c)(c,c^{\prime}) is generated by the following two steps:

  • sampling step: sample (c,c)Q2(c,c^{\prime})\in Q^{2} jointly from an optimal coupling D𝗈𝗉𝗍,vσ,τD^{\sigma,\tau}_{\mathsf{opt},\mathcal{I}_{v}} of the marginal distributions μv,(σ)\mu_{v,\mathcal{I}}(\cdot\mid\sigma) and μv,(τ)\mu_{v,\mathcal{I}}(\cdot\mid\tau), such that cμv,(σ)c\sim\mu_{v,\mathcal{I}}(\cdot\mid\sigma) and cμv,(τ)c^{\prime}\sim\mu_{v,\mathcal{I}}(\cdot\mid\tau);

  • resampling step: flip a coin independently with the probability of HEADS being

    (9) pv,vτ(c){0if μv,(cτ)μv,(cτ),μv,(cτ)μv,(cτ)μv,(cτ)otherwise ;\displaystyle p_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\tau}(c^{\prime})\triangleq\begin{cases}0&\text{if }\mu_{v,\mathcal{I}}(c^{\prime}\mid\tau)\leq\mu_{v,\mathcal{I}^{\prime}}(c^{\prime}\mid\tau),\\ \frac{\mu_{v,\mathcal{I}}(c^{\prime}\mid\tau)-\mu_{v,\mathcal{I}^{\prime}}(c^{\prime}\mid\tau)}{\mu_{v,\mathcal{I}}(c^{\prime}\mid\tau)}&\text{otherwise };\end{cases}

    if the outcome of coin flipping is HEADS, resample cc^{\prime} from the distribution νv,vτ\nu_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\tau} independently, where the distribution νv,vτ\nu_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\tau} is defined as

    (10) bQ:νv,vτ(b)max{0,μv,(bτ)μv,(bτ)}xQmax{0,μv,(xτ)μv,(xτ)}.\displaystyle\forall b\in Q:\quad\nu_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\tau}(b)\triangleq\frac{\max\left\{0,\mu_{v,\mathcal{I}^{\prime}}(b\mid\tau)-\mu_{v,\mathcal{I}}(b\mid\tau)\right\}}{\sum_{x\in Q}\max\left\{0,\mu_{v,\mathcal{I}}(x\mid\tau)-\mu_{v,\mathcal{I}^{\prime}}(x\mid\tau)\right\}}.
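The two steps of Definition 6.4 can be read off directly from a short sketch. The following Python fragment is our illustration only (the helpers draw, normalize and optimal_coupling are hypothetical, and the marginal distributions are passed in as plain dictionaries over QQ): it first couples the two old marginals optimally, then applies the correction of (9) and (10) so that the second coordinate follows the new marginal.

import random

# Our sketch of Definition 6.4. mu_old_sigma = mu_{v,I}(.|sigma),
# mu_old_tau = mu_{v,I}(.|tau), mu_new_tau = mu_{v,I'}(.|tau); all are dicts
# over the same spin set Q with values summing to 1.

def draw(mu):
    r, acc = random.random(), 0.0
    for x, p in mu.items():
        acc += p
        if r < acc:
            return x
    return x                                   # guard against floating-point slack

def normalize(w):
    s = sum(w.values())
    return {x: p / s for x, p in w.items()}

def optimal_coupling(mu1, mu2):
    """Draw (c, c') with c ~ mu1, c' ~ mu2, maximizing Pr[c = c']."""
    overlap = {x: min(mu1[x], mu2[x]) for x in mu1}
    same = sum(overlap.values())               # equals 1 - d_TV(mu1, mu2)
    if random.random() < same:
        c = draw(normalize(overlap))
        return c, c
    c = draw(normalize({x: max(0.0, mu1[x] - mu2[x]) for x in mu1}))
    cp = draw(normalize({x: max(0.0, mu2[x] - mu1[x]) for x in mu2}))
    return c, cp

def local_coupling_hamiltonian(mu_old_sigma, mu_old_tau, mu_new_tau):
    # sampling step: couple the two marginals of the OLD instance optimally
    c, cp = optimal_coupling(mu_old_sigma, mu_old_tau)
    # resampling step: correct c' so that it follows the NEW marginal, as in (9)
    p = max(0.0, mu_old_tau[cp] - mu_new_tau[cp]) / mu_old_tau[cp]
    if random.random() < p:
        nu = normalize({b: max(0.0, mu_new_tau[b] - mu_old_tau[b]) for b in mu_new_tau})
        cp = draw(nu)                          # the distribution of (10)
    return c, cp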
Lemma 6.5.

Dv,vσ,τ(,)D_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau}(\cdot,\cdot) in Definition 6.4 is a valid coupling between μv,(σ)\mu_{v,\mathcal{I}}(\cdot\mid\sigma) and μv,(τ)\mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\tau).

By Lemma 6.5, the resulting (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} is a faithful copy of the Gibbs sampling on instance \mathcal{I}^{\prime}, assuming that (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} is such a chain on instance \mathcal{I}.

Next we give an upper bound for the probability pv,vτ()p_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\tau}(\cdot) defined in (9).

Lemma 6.6.

For any two instances =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) and =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V,E,Q,\Phi^{\prime}) of the MRF model, and any vVv\in V, cQc\in Q and τQΓG(v)\tau\in Q^{\Gamma_{G}(v)}, it holds that

(11) pv,vτ(c)2(ϕvϕv1+e={u,v}Eϕeϕe1),\displaystyle p_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\tau}(c)\leq 2\left(\|\phi_{v}-\phi^{\prime}_{v}\|_{1}+\sum_{e=\{u,v\}\in E}\|\phi_{e}-\phi^{\prime}_{e}\|_{1}\right),

where ϕvϕv1=cQ|ϕv(c)ϕv(c)|\|\phi_{v}-\phi^{\prime}_{v}\|_{1}=\sum_{c\in Q}|\phi_{v}(c)-\phi^{\prime}_{v}(c)| and ϕeϕe1=c,cQ|ϕe(c,c)ϕe(c,c)|\|\phi_{e}-\phi^{\prime}_{e}\|_{1}=\sum_{c,c^{\prime}\in Q}|\phi_{e}(c,c^{\prime})-\phi^{\prime}_{e}(c,c^{\prime})|.

By Lemma 6.6, for each vertex vVv\in V, we define an upper bound of the probability pv,v()p^{\cdot}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(\cdot) as

(12) pv𝗎𝗉min{2(ϕvϕv1+e={u,v}Eϕeϕe1),1}.\displaystyle p^{\mathsf{up}}_{v}\triangleq\min\left\{2\left(\|\phi_{v}-\phi^{\prime}_{v}\|_{1}+\sum_{e=\{u,v\}\in E}\|\phi_{e}-\phi^{\prime}_{e}\|_{1}\right),1\right\}.

With pv𝗎𝗉p^{\mathsf{up}}_{v}, we can implement the one-step local coupling in Definition 6.3 as follows. We first sample each viVv_{i}\in V for 1iT1\leq i\leq T uniformly and independently. For each vertex vVv\in V, let Tv{1tTvt=v}T_{v}\triangleq\{1\leq t\leq T\mid v_{t}=v\} be the set of all the steps that pick the vertex vv. We select each tTvt\in T_{v} independently with probability pv𝗎𝗉p^{\mathsf{up}}_{v} to construct a random subset 𝒫vTv\mathcal{P}_{v}\subseteq T_{v}, and let

(13) 𝒫vV𝒫v.\displaystyle\mathcal{P}\triangleq\bigcup_{v\in V}\mathcal{P}_{v}.

We then couple the two chains (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} and (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T}. First set 𝑿0=𝒀0\bm{X}_{0}=\bm{Y}_{0}. For each 1tT1\leq t\leq T, we set (Xt(u),Yt(u))(Xt1(u),Yt1(u))(X_{t}(u),Y_{t}(u))\leftarrow(X_{t-1}(u),Y_{t-1}(u)) for all uV{vt}u\in V\setminus\{v_{t}\}; then generate the random pair (Xt(vt),Yt(vt))(X_{t}(v_{t}),Y_{t}(v_{t})) by the following procedure.

  • sampling step: Let σ=Xt1(ΓG(vt))\sigma=X_{t-1}(\Gamma_{G}(v_{t})) and τ=Yt1(ΓG(vt))\tau=Y_{t-1}(\Gamma_{G}(v_{t})). We draw a random pair (c,c)Q2(c,c^{\prime})\in Q^{2} from the optimal coupling D𝗈𝗉𝗍,vσ,τD^{\sigma,\tau}_{\mathsf{opt},\mathcal{I}_{v}} of the marginal distributions μv,(σ)\mu_{v,\mathcal{I}}(\cdot\mid\sigma) and μv,(τ)\mu_{v,\mathcal{I}}(\cdot\mid\tau) such that cμv,(σ)c\sim\mu_{v,\mathcal{I}}(\cdot\mid\sigma) and cμv,(τ)c^{\prime}\sim\mu_{v,\mathcal{I}}(\cdot\mid\tau);

  • resampling step: If t𝒫t\notin\mathcal{P}, set Xt(vt)=cX_{t}(v_{t})=c and Yt(vt)=cY_{t}(v_{t})=c^{\prime}. Otherwise, set Xt(vt)=cX_{t}(v_{t})=c and

    (14) Yt(vt)={bνvt,vtτwith probability pvt,vtτ(c)/pvt𝗎𝗉cwith probability 1pvt,vtτ(c)/pvt𝗎𝗉.\displaystyle Y_{t}(v_{t})=\begin{cases}b\sim\nu_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}^{\tau}&\text{with probability }p^{\tau}_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}(c^{\prime})/p^{\mathsf{up}}_{v_{t}}\\ c^{\prime}&\text{with probability }1-p^{\tau}_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}(c^{\prime})/p^{\mathsf{up}}_{v_{t}}.\end{cases}

Note that pvt𝗎𝗉>0p^{\mathsf{up}}_{v_{t}}>0 if t𝒫t\in\mathcal{P}. By Lemma 6.6, it must hold that pvt,vtτ(c)pvt𝗎𝗉p^{\tau}_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}(c^{\prime})\leq p^{\mathsf{up}}_{v_{t}}. Hence, the ratio pvt,vtτ(c)/pvt𝗎𝗉p^{\tau}_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}(c^{\prime})/p^{\mathsf{up}}_{v_{t}} is a valid probability. Since 𝒫\mathcal{P} is generated independently of the chains, the overall probability (conditioned on cc^{\prime}) that Yt(vt)Y_{t}(v_{t}) is resampled from νvt,vtτ\nu_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}^{\tau} is

Pr[Yt(vt) is resampled]=Pr[t𝒫]pvt,vtτ(c)pvt𝗎𝗉=pvt𝗎𝗉pvt,vtτ(c)pvt𝗎𝗉=pvt,vtτ(c).\displaystyle\Pr[Y_{t}(v_{t})\text{ is resampled}]=\Pr\left[t\in\mathcal{P}\right]\cdot\frac{p^{\tau}_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}(c^{\prime})}{p^{\mathsf{up}}_{v_{t}}}=p^{\mathsf{up}}_{v_{t}}\cdot\frac{p^{\tau}_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}(c^{\prime})}{p^{\mathsf{up}}_{v_{t}}}=p^{\tau}_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}(c^{\prime}).

Hence, our implementation perfectly simulates the coupling in Definition 6.3.
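The decomposition behind this calculation is worth isolating. The following sketch (our illustration; the names are hypothetical, and p, p_up stand for pvt,vtτ(c)p^{\tau}_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}(c^{\prime}) and pvt𝗎𝗉p^{\mathsf{up}}_{v_{t}}) shows the two-stage decision: a preselection coin with bias pvt𝗎𝗉p^{\mathsf{up}}_{v_{t}} decides whether step tt enters 𝒫\mathcal{P}, and only then is the finer acceptance with probability p/p𝗎𝗉p/p^{\mathsf{up}} evaluated, so the overall resampling probability is exactly pp while the finer computation is carried out only at the preselected steps.

import random

# Our sketch of the two-stage resampling decision: preselect with bias p_up,
# then accept with p / p_up; the overall probability of a resample is p.
def resample_decision(p, p_up, in_P):
    """in_P is the outcome of the preselection coin with bias p_up; p <= p_up."""
    return in_P and (random.random() < p / p_up)

# empirical check of Pr[resample] = p_up * (p / p_up) = p
p, p_up, trials = 0.03, 0.1, 200_000
hits = sum(resample_decision(p, p_up, random.random() < p_up) for _ in range(trials))
print(hits / trials)   # should be close to 0.03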

Let 𝒟t\mathcal{D}_{t} denote the set of disagreements between 𝑿t\bm{X}_{t} and 𝒀t\bm{Y}_{t}. Formally,

(15) 𝒟t{vVXt(v)Yt(v)}.\displaystyle\mathcal{D}_{t}\triangleq\{v\in V\mid X_{t}(v)\neq Y_{t}(v)\}.

Note that if vtΓG(𝒟t1)v_{t}\notin\Gamma_{G}(\mathcal{D}_{t-1}), the random pair (c,c)(c,c^{\prime}) drawn from the coupling D𝗈𝗉𝗍,vσ,τD^{\sigma,\tau}_{\mathsf{opt},\mathcal{I}_{v}} must satisfy c=cc=c^{\prime}. Thus it is easy to make the following observation for the (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} and (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} coupled as above.

Observation 6.7.

For any integer t[1,T]t\in[1,T], if vtΓG+(𝒟t1)v_{t}\notin\Gamma_{G}^{+}(\mathcal{D}_{t-1}) and t𝒫t\notin\mathcal{P}, then Xt(vt)=Yt(vt)X_{t}(v_{t})=Y_{t}(v_{t}) and 𝒟t=𝒟t1\mathcal{D}_{t}=\mathcal{D}_{t-1}.

With this observation, the new 𝒀0\bm{Y}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Yt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T)=\left\langle{v_{t}},Y_{t}({v_{t}})\right\rangle_{t=1}^{{T}} can be generated from 𝑿0\bm{X}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Xt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I},T)=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} as in Algorithm 4.

Data : 𝑿0QV\bm{X}_{0}\in Q^{V} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Xt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I},T)=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} for =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi).
Update : an update that modifies \mathcal{I} to =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V,E,Q,\Phi^{\prime}).
1 t00t_{0}\leftarrow 0, 𝒟\mathcal{D}\leftarrow\varnothing, and set 𝒀0𝑿0\bm{Y}_{0}\leftarrow\bm{X}_{0};
2 for each vVv\in V, construct a random subset 𝒫vTv{1tTvt=v}\mathcal{P}_{v}\subseteq T_{v}\triangleq\{1\leq t\leq T\mid v_{t}=v\} such that each element in TvT_{v} is selected independently with probability pv𝗎𝗉p^{\mathsf{up}}_{v} defined in (12);
3 construct the set 𝒫vV𝒫v\mathcal{P}\leftarrow\bigcup_{v\in V}\mathcal{P}_{v};
4 while t0<tT\exists\,t_{0}<t\leq T such that vtΓG+(𝒟)v_{t}\in\Gamma_{G}^{+}(\mathcal{D}) or t𝒫t\in\mathcal{P} do
5   find the smallest t>t0t>t_{0} such that vtΓG+(𝒟)v_{t}\in\Gamma_{G}^{+}(\mathcal{D}) or t𝒫t\in\mathcal{P};
6   for all t0<i<tt_{0}<i<t, let Yi(vi)=Xi(vi)Y_{i}(v_{i})=X_{i}(v_{i});
7   sample Yt(vt)QY_{t}(v_{t})\in Q conditioning on Xt(vt)X_{t}(v_{t}) according to the optimal coupling between μvt,(Xt1(ΓG(vt)))\mu_{v_{t},\mathcal{I}}(\cdot\mid X_{t-1}(\Gamma_{G}(v_{t}))) and μvt,(Yt1(ΓG(vt)))\mu_{v_{t},\mathcal{I}}(\cdot\mid Y_{t-1}(\Gamma_{G}(v_{t})));
8 if t𝒫t\in\mathcal{P} then
9    with probability pvt,vtτ(Yt(vt))/pvt𝗎𝗉p^{\tau}_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}(Y_{t}(v_{t}))/p^{\mathsf{up}}_{v_{t}} where τ=Yt1(ΓG(vt))\tau=Y_{t-1}(\Gamma_{G}(v_{t}))  do
10         resample Yt(vt)νvt,vtτY_{t}(v_{t})\sim\nu_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}^{\tau}, where νvt,vtτ\nu_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}^{\tau} is defined in (10) ;
11       
12    
13 if Xt(vt)Yt(vt)X_{t}(v_{t})\neq Y_{t}(v_{t}) then 𝒟𝒟{vt}\mathcal{D}\leftarrow\mathcal{D}\cup\{v_{t}\} else 𝒟𝒟{vt}\mathcal{D}\leftarrow\mathcal{D}\setminus\{v_{t}\};
14 t0tt_{0}\leftarrow t;
15 
16for all remaining t0<iTt_{0}<i\leq T: let Yi(vi)=Xi(vi)Y_{i}(v_{i})=X_{i}(v_{i});
17 update the data to 𝒀0\bm{Y}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Yt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T)=\left\langle{v_{t}},Y_{t}({v_{t}})\right\rangle_{t=1}^{{T}};
Algorithm 4 UpdateHamiltonian(,,𝑿0,vt,Xt(vt)t=1T)\textsf{UpdateHamiltonian}\left(\mathcal{I},\mathcal{I}^{\prime},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}\right)

Observation 6.7 says that the nontrivial coupling between Xt(vt)X_{t}(v_{t}) and Yt(vt)Y_{t}(v_{t}) is only needed when vtΓG+(𝒟t1)v_{t}\in\Gamma_{G}^{+}(\mathcal{D}_{t-1}) or t𝒫t\in\mathcal{P}, which occurs rarely as long as 𝒟t1\mathcal{D}_{t-1} and 𝒫\mathcal{P} are small. This is key to ensuring the small incremental time cost of Algorithm 4. For the chains (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} and (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} coupled as above and any 1tT1\leq t\leq T, let γt\gamma_{t} indicate whether the event t𝒫vtΓG+(𝒟t1)t\in\mathcal{P}\lor v_{t}\in\Gamma_{G}^{+}(\mathcal{D}_{t-1}) occurs:

(16) γt𝟏[t𝒫vtΓG+(𝒟t1)],\displaystyle\gamma_{t}\triangleq\mathbf{1}\left[t\in\mathcal{P}\lor v_{t}\in\Gamma_{G}^{+}(\mathcal{D}_{t-1})\right],

and R𝖧𝖺𝗆𝗂𝗅R_{\mathsf{Hamil}} denote the number of occurrences of such bad events:

(17) R𝖧𝖺𝗆𝗂𝗅t=1Tγt.\displaystyle R_{\mathsf{Hamil}}\triangleq\sum_{t=1}^{T}\gamma_{t}.

The following lemma bounds the expectation of R𝖧𝖺𝗆𝗂𝗅R_{\mathsf{Hamil}}.

Lemma 6.8 (cost of the coupling for UpdateHamiltonian).

Let =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) be the current MRF instance and =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V,E,Q,\Phi^{\prime}) the updated instance. Assume that \mathcal{I} satisfies Dobrushin-Shlosman condition (3.1) with constant δ>0\delta>0, and dHamil(,)=vVϕvϕv1+eEϕeϕe1Ld_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}^{\prime})=\sum_{v\in V}\left\|\phi_{v}-\phi^{\prime}_{v}\right\|_{1}+\sum_{e\in E}\left\|\phi_{e}-\phi^{\prime}_{e}\right\|_{1}\leq L. It holds that 𝔼[R𝖧𝖺𝗆𝗂𝗅]=O(ΔTLnδ)\mathbb{E}\left[{R_{\mathsf{Hamil}}}\right]=O\left(\frac{\Delta TL}{n\delta}\right), where n=|V|n=|V|, Δ\Delta is the maximum degree of graph G=(V,E)G=(V,E).

6.1.2. Coupling for graph update

Let =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) be an MRF instance, where Φ=(ϕa)aVE\Phi=(\phi_{a})_{a\in V\cup E}. Let 𝑿0\bm{X}_{0} and vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} be the current initial state and execution log such that the random process (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} is the Gibbs sampling on instance \mathcal{I}. Let =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime}) be the new instance obtained by updating the underlying graph, where Φ=(ϕa)aVE\Phi^{\prime}=(\phi_{a})_{a\in V^{\prime}\cup E^{\prime}} satisfies

a(VV)(EE),ϕa=ϕa.\displaystyle\forall a\in(V\cap V^{\prime})\cup(E\cap E^{\prime}),\quad\phi_{a}=\phi^{\prime}_{a}.

Given the update from \mathcal{I} to \mathcal{I}^{\prime}, the subroutine UpdateGraph(,,𝑿0,vt,Xt(vt)t=1T)\textsf{UpdateGraph}\left(\mathcal{I},\mathcal{I}^{\prime},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}\right) updates the data to a new initial state 𝒀0\bm{Y}_{0} and a new execution-log vt,Yt(vt)t=1T\left\langle{v^{\prime}_{t}},Y_{t}({v^{\prime}_{t}})\right\rangle_{t=1}^{{T}} such that the random process (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} is the Gibbs sampling on instance \mathcal{I}^{\prime}.

The subroutine UpdateGraph proceeds in the following three steps.

  • AddVertex: add isolated vertices in VVV^{\prime}\setminus V with potentials (ϕv)vVV(\phi_{v})_{v\in V^{\prime}\setminus V}, and update the instance =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) to a new instance

    (18) 1=1(,)(VV,E,Q,Φ(ϕv)vVV);\displaystyle\mathcal{I}_{1}=\mathcal{I}_{1}(\mathcal{I},\mathcal{I}^{\prime})\triangleq\left(V\cup V^{\prime},E,Q,\Phi\cup(\phi_{v})_{v\in V^{\prime}\setminus V}\right);

    then update 𝑿0\bm{X}_{0} and vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} to 𝒁0\bm{Z}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(1,T)=ut,Zt(ut)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}_{1},T)=\left\langle{u_{t}},Z_{t}({u_{t}})\right\rangle_{t=1}^{{T}} such that the random process (𝒁t)t=0T(\bm{Z}_{t})_{t=0}^{T} is the Gibbs sampling on instance 1\mathcal{I}_{1}.

  • UpdateEdge: add new edges in EEE^{\prime}\setminus E with potentials (ϕe)eEE(\phi_{e})_{e\in E^{\prime}\setminus E}, delete edges in EEE\setminus E^{\prime} , and update the instance 1\mathcal{I}_{1} to a new instance

    2=2(,)\displaystyle\mathcal{I}_{2}=\mathcal{I}_{2}(\mathcal{I},\mathcal{I}^{\prime}) (VV,E,Q,Φ(ϕv)vVV(ϕe)eEE(ϕe)eEE)\displaystyle\triangleq\left(V\cup V^{\prime},E^{\prime},Q,\Phi\cup(\phi_{v})_{v\in V^{\prime}\setminus V}\cup(\phi_{e})_{e\in E^{\prime}\setminus E}\setminus(\phi_{e})_{e\in E\setminus E^{\prime}}\right)
    (19) =(VV,E,Q,Φ(ϕv)vVV);\displaystyle=\left(V\cup V^{\prime},E^{\prime},Q,\Phi^{\prime}\cup(\phi_{v})_{v\in V\setminus V^{\prime}}\right);

    then update 𝒁0\bm{Z}_{0} and ut,Zt(ut)t=1T\left\langle{u_{t}},Z_{t}({u_{t}})\right\rangle_{t=1}^{{T}} to 𝒁0\bm{Z}^{{}^{\prime}}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(2,T)=wt,Zt(wt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}_{2},T)=\left\langle{w_{t}},Z^{{}^{\prime}}_{t}({w_{t}})\right\rangle_{t=1}^{{T}} such that the random process (𝒁t)t=0T(\bm{Z}^{{}^{\prime}}_{t})_{t=0}^{T} is the Gibbs sampling on instance 2\mathcal{I}_{2}.

  • DeleteVertex: delete isolated vertices in VVV\setminus V^{\prime}, and update the instance 2\mathcal{I}_{2} to =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime}); then update 𝒁0\bm{Z}^{{}^{\prime}}_{0} and wt,Zt(wt)t=1T\left\langle{w_{t}},Z^{{}^{\prime}}_{t}({w_{t}})\right\rangle_{t=1}^{{T}} to 𝒀0\bm{Y}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Yt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T)=\left\langle{v^{\prime}_{t}},Y_{t}({v^{\prime}_{t}})\right\rangle_{t=1}^{{T}} such that the random process (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} is the Gibbs sampling on instance \mathcal{I}^{\prime}.

The algorithm UpdateGraph is given in Algorithm 5.

Data : 𝑿0QV\bm{X}_{0}\in Q^{V} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Xt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I},T)=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} for current =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi).
Update : an update of the underlying graph that modifies \mathcal{I} to =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime}).
1 construct instances 1\mathcal{I}_{1} and 2\mathcal{I}_{2} as in (18) and (19);
2 (𝒁0,ut,Zt(ut)t=1T)AddVertex(,1,𝑿0,vt,Xt(vt)t=1T)\left(\bm{Z}_{0},\left\langle{u_{t}},Z_{t}({u_{t}})\right\rangle_{t=1}^{{T}}\right)\leftarrow\textsf{AddVertex}\left(\mathcal{I},\mathcal{I}_{1},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}\right);
// add isolated vertices to update \mathcal{I} to 1\mathcal{I}_{1}
3 (𝒁0,wt,Zt(wt)t=1T)UpdateEdge(1,2,𝒁0,ut,Zt(ut)t=1T)\left(\bm{Z}^{\prime}_{0},\left\langle{w_{t}},Z^{\prime}_{t}({w_{t}})\right\rangle_{t=1}^{{T}}\right)\leftarrow\textsf{UpdateEdge}\left(\mathcal{I}_{1},\mathcal{I}_{2},\bm{Z}_{0},\left\langle{u_{t}},Z_{t}({u_{t}})\right\rangle_{t=1}^{{T}}\right);
// add and delete edges to update 1\mathcal{I}_{1} to 2\mathcal{I}_{2}
4 (𝒀0,vt,Yt(vt)t=1T)DeleteVertex(2,,𝒁0,wt,Zt(wt)t=1T)\left(\bm{Y}_{0},\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T}}\right)\leftarrow\textsf{DeleteVertex}\left(\mathcal{I}_{2},\mathcal{I}^{\prime},\bm{Z}^{\prime}_{0},\left\langle{w_{t}},Z^{\prime}_{t}({w_{t}})\right\rangle_{t=1}^{{T}}\right);
// delete isolated vertices to update 2\mathcal{I}_{2} to \mathcal{I}^{\prime}
5 update the data to 𝒀0\bm{Y}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Yt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T)=\left\langle{v^{\prime}_{t}},Y_{t}({v^{\prime}_{t}})\right\rangle_{t=1}^{{T}};
Algorithm 5 UpdateGraph(,,𝑿0,vt,Xt(vt)t=1T)\textsf{UpdateGraph}\left(\mathcal{I},\mathcal{I}^{\prime},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}\right)

The subroutines AddVertex and DeleteVertex are simple, because they only deal with isolated variables. We first describe the main subroutine UpdateEdge, then describe AddVertex and DeleteVertex.

The coupling for UpdateEdge

We first consider the update of adding and deleting edges. The update does not change the set of variables. Let =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) be the current MRF instance. Let 𝑿0\bm{X}_{0} and vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} be the current initial state and execution log such that the random process (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} is the Gibbs sampling on instance \mathcal{I}. Upon such an update, the new instance becomes =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V,E^{\prime},Q,\Phi^{\prime}), where ϕa=ϕa\phi^{\prime}_{a}=\phi_{a} for all aV(EE)a\in V\cup(E\cap E^{\prime}). The subroutine UpdateEdge(,,𝑿0,vt,Xt(vt)t=1T)\textsf{UpdateEdge}(\mathcal{I},\mathcal{I}^{\prime},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}) updates the data to 𝒀0\bm{Y}_{0} and vt,Yt(vt)t=1T\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T}} such that the random process (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} is the Gibbs sampling on instance \mathcal{I}^{\prime}.

We use 𝒮V\mathcal{S}\subseteq V to denote the set of vertices affected by the update from \mathcal{I} to \mathcal{I}^{\prime}:

(20) 𝒮(u,v)EE{u,v},\displaystyle\mathcal{S}\triangleq\bigcup_{(u,v)\in E\oplus E^{\prime}}\{u,v\},

where EEE\oplus E^{\prime} is the symmetric difference between EE and EE^{\prime}.

We transform this pair of 𝑿0QV\bm{X}_{0}\in Q^{V} and vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} to a new pair of 𝒀0QV\bm{Y}_{0}\in Q^{V} and vt,Yt(vt)t=1T\left\langle{v_{t}},Y_{t}({v_{t}})\right\rangle_{t=1}^{{T}} for \mathcal{I}^{\prime}. This is achieved as follows: the vertex sequence (vt)t=1T(v_{t})_{t=1}^{T} is identically coupled and the chain (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} is transformed to (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} by the following one-step local coupling between 𝑿\bm{X} and 𝒀\bm{Y}.

Definition 6.9 (one-step local coupling for UpdateEdge).

The two chains (𝑿t)t=0(\bm{X}_{t})_{t=0}^{\infty} on instance =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) and (𝒀t)t=0(\bm{Y}_{t})_{t=0}^{\infty} on instance =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V,E^{\prime},Q,\Phi^{\prime}) are coupled as:

  • Initially 𝑿0=𝒀0QV\bm{X}_{0}=\bm{Y}_{0}\in Q^{V};

  • for t=1,2,t=1,2,\ldots, the two chains 𝑿\bm{X} and 𝒀\bm{Y} jointly do:

    1. (1)

      pick the same vtVv_{t}\in V, and let (Xt(u),Yt(u))(Xt1(u),Yt1(u))(X_{t}(u),Y_{t}(u))\leftarrow(X_{t-1}(u),Y_{t-1}(u)) for all uV{vt}u\in V\setminus\{v_{t}\};

    2. (2)

      sample (Xt(vt),Yt(vt))(X_{t}(v_{t}),Y_{t}(v_{t})) from a coupling Dvt,vtσ,τ(,)D_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}^{\sigma,\tau}(\cdot,\cdot) of the marginal distributions μvt,(σ)\mu_{{v_{t}},{\mathcal{I}}}(\cdot\mid\sigma) and μvt,(τ)\mu_{{v_{t}},{\mathcal{I}^{\prime}}}(\cdot\mid\tau) with σ=Xt1(ΓG(vt))\sigma=X_{t-1}(\Gamma_{G}({v_{t}})) and τ=Yt1(ΓG(vt))\tau=Y_{t-1}(\Gamma_{G^{\prime}}({v_{t}})), where G=(V,E)G=(V,E) and G=(V,E)G^{\prime}=(V,E^{\prime}).

The local coupling Dv,vσ,τ(,)D_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau}(\cdot,\cdot) for UpdateEdge is specified as follows.

(21) σQΓG(v),τQΓG(v):Dv,vσ,τ(,)={D𝗈𝗉𝗍,vσ,τ(,)if v𝒮,μv,(σ)×μv,(τ)if v𝒮,\displaystyle\forall\sigma\in Q^{\Gamma_{G}(v)},\tau\in Q^{\Gamma_{G^{\prime}}(v)}:\quad D_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau}(\cdot,\cdot)=\begin{cases}D_{\mathsf{opt},\mathcal{I}_{v}}^{\sigma,\tau}(\cdot,\cdot)&\text{if }v\not\in\mathcal{S},\\ \mu_{v,\mathcal{I}}(\cdot\mid\sigma)\times\mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\tau)&\text{if }v\in\mathcal{S},\end{cases}

where D𝗈𝗉𝗍,vσ,τD^{\sigma,\tau}_{\mathsf{opt},\mathcal{I}_{v}} is an optimal coupling of the marginal distributions μv,(σ)\mu_{v,\mathcal{I}}(\cdot\mid\sigma) and μv,(τ)\mu_{v,\mathcal{I}}(\cdot\mid\tau). Recall v=(Γv+,Ev,Q,Φv)\mathcal{I}_{v}=(\Gamma^{+}_{v},E_{v},Q,\Phi_{v}) where Ev={{u,v}E}E_{v}=\{\{u,v\}\in E\} and Φv=(ϕa)aΓv+Ev\Phi_{v}=(\phi_{a})_{a\in\Gamma_{v}^{+}\cup E_{v}}. Obviously, Dv,vσ,τD_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau} is a valid coupling of μv,(σ)\mu_{v,\mathcal{I}}(\cdot\mid\sigma) and μv,(τ)\mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\tau): for any v𝒮v\not\in\mathcal{S}, we have v=v\mathcal{I}_{v}=\mathcal{I}^{\prime}_{v}, hence μv,(σ)\mu_{v,\mathcal{I}}(\cdot\mid\sigma) and μv,(τ)\mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\tau) are marginal distributions of the same local instance, both defined by (4) on v\mathcal{I}_{v}, and thus they can be coupled by D𝗈𝗉𝗍,vσ,τD_{\mathsf{opt},\mathcal{I}_{v}}^{\sigma,\tau}; for v𝒮v\in\mathcal{S}, the product coupling is trivially valid.

Obviously the resulting (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} is a faithful copy of the Gibbs sampling on instance \mathcal{I}^{\prime}, assuming that (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} is such a chain on instance \mathcal{I}.

Recall 𝒟t{vVXt(v)Yt(v)}\mathcal{D}_{t}\triangleq\{v\in V\mid X_{t}(v)\neq Y_{t}(v)\} is the set of disagreements between 𝑿t\bm{X}_{t} and 𝒀t\bm{Y}_{t}. The following observation is easy to make for the (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} and (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} coupled as above.

Observation 6.10.

For any t[1,T]t\in[1,T], if vt𝒮ΓG+(𝒟t1)v_{t}\not\in\mathcal{S}\cup\Gamma_{G}^{+}(\mathcal{D}_{t-1}) then 𝐗t(vt)=𝐘t(vt)\bm{X}_{t}(v_{t})=\bm{Y}_{t}(v_{t}) and 𝒟t=𝒟t1\mathcal{D}_{t}=\mathcal{D}_{t-1}.

With this observation, the new 𝒀0\bm{Y}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Yt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T)=\left\langle{v_{t}},Y_{t}({v_{t}})\right\rangle_{t=1}^{{T}} can be generated from 𝑿0\bm{X}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Xt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I},T)=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} as in Algorithm 6.

Data : 𝑿0QV\bm{X}_{0}\in Q^{V} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Xt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I},T)=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} for current =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi).
Update : an update of adding and deleting edges that modifies \mathcal{I} to =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V,E^{\prime},Q,\Phi^{\prime}).
1 t00t_{0}\leftarrow 0, 𝒟\mathcal{D}\leftarrow\varnothing, 𝒀0𝑿0\bm{Y}_{0}\leftarrow\bm{X}_{0} and construct 𝒮(u,v)EE{u,v}\mathcal{S}\leftarrow\bigcup_{(u,v)\in E\oplus E^{\prime}}\{u,v\} ;
2 while t0<tT\exists\,t_{0}<t\leq T such that vt𝒮ΓG+(𝒟)v_{t}\in\mathcal{S}\cup\Gamma_{G}^{+}(\mathcal{D}) do
3   find the smallest t>t0t>t_{0} such that vt𝒮ΓG+(𝒟)v_{t}\in\mathcal{S}\cup\Gamma_{G}^{+}(\mathcal{D});
4   for all t0<i<tt_{0}<i<t, let Yi(vi)=Xi(vi)Y_{i}(v_{i})=X_{i}(v_{i});
5   sample Yt(vt)Y_{t}(v_{t}) conditioning on Xt(vt)X_{t}(v_{t}) according to the coupling Dvtσ,τ(,)D_{v_{t}}^{\sigma,\tau}(\cdot,\cdot) (constructed in (21)), where σ=Xt1(ΓG(vt))\sigma=X_{t-1}(\Gamma_{G}({v_{t}})) and τ=Yt1(ΓG(vt))\tau=Y_{t-1}(\Gamma_{G^{\prime}}({v_{t}}));
6 if Xt(vt)Yt(vt)X_{t}(v_{t})\neq Y_{t}(v_{t}) then 𝒟𝒟{vt}\mathcal{D}\leftarrow\mathcal{D}\cup\{v_{t}\} else 𝒟𝒟{vt}\mathcal{D}\leftarrow\mathcal{D}\setminus\{v_{t}\};
7 t0tt_{0}\leftarrow t;
8 
9for all remaining t0<iTt_{0}<i\leq T: let 𝒀i(vi)=Xi(vi)\bm{Y}_{i}(v_{i})=X_{i}(v_{i});
10 update the data to 𝒀0\bm{Y}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Yt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T)=\left\langle{v_{t}},Y_{t}({v_{t}})\right\rangle_{t=1}^{{T}};
Algorithm 6 UpdateEdge(,,𝑿0,vt,Xt(vt)t=1T)\textsf{UpdateEdge}(\mathcal{I},\mathcal{I}^{\prime},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}})

Observation 6.10 says that the nontrivial coupling between Xt(vt)X_{t}(v_{t}) and Yt(vt)Y_{t}(v_{t}) is only needed when vt𝒮ΓG+(𝒟t1)v_{t}\in\mathcal{S}\cup\Gamma_{G}^{+}(\mathcal{D}_{t-1}), which occurs rarely as long as 𝒟t1\mathcal{D}_{t-1} remains small. This is key to ensuring the small incremental time cost of Algorithm 6. Formally, for the chains (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} and (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} coupled as above and any 1tT1\leq t\leq T, let γt\gamma_{t} indicate whether this bad event occurs:

(22) γt𝟏[vt𝒮ΓG+(𝒟t1)],\displaystyle\gamma_{t}\triangleq\mathbf{1}\left[v_{t}\in\mathcal{S}\cup\Gamma_{G}^{+}(\mathcal{D}_{t-1})\right],

and let R𝗀𝗋𝖺𝗉𝗁R_{\mathsf{graph}} denote the number of occurrences of such bad events:

(23) R𝗀𝗋𝖺𝗉𝗁t=1Tγt.\displaystyle R_{\mathsf{graph}}\triangleq\sum_{t=1}^{T}\gamma_{t}.

We will see that R𝗀𝗋𝖺𝗉𝗁R_{\mathsf{graph}} dominates the cost of Algorithm 6, once a data structure is given that encodes the execution-log and efficiently resolves the updates and the various queries to the data made in Algorithm 6.

Lemma 6.11 (cost of the coupling for UpdateEdge).

Let =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) be the current MRF instance and =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V,E^{\prime},Q,\Phi^{\prime}) the updated instance. Assume that \mathcal{I}^{\prime} satisfies Dobrushin-Shlosman condition (3.1) with constant δ>0\delta>0, and |EE|L|E\oplus E^{\prime}|\leq L. It holds that 𝔼[R𝗀𝗋𝖺𝗉𝗁]=O(ΔTLnδ)\mathbb{E}\left[{R_{\mathsf{graph}}}\right]=O\left(\frac{\Delta TL}{n\delta}\right), where n=|V|n=|V|, Δ=max{ΔG,ΔG}\Delta=\max\{\Delta_{G},\Delta_{G^{\prime}}\}, and ΔG,ΔG\Delta_{G},\Delta_{G^{\prime}} denote the maximum degree of G=(V,E)G=(V,E) and G=(V,E)G^{\prime}=(V,E^{\prime}).

Coupling for AddVertex

Let =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) be the current MRF instance. Let 𝑿0\bm{X}_{0} and vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} be the current initial state and execution log such that the random process (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} is the Gibbs sampling on instance \mathcal{I}. The update adds a set of isolated vertices SS with potentials (ϕa)aS(\phi_{a})_{a\in S}. Upon such an update, the new instance becomes

=(V,E,Q,Φ)=(VS,E,Q,Φ(ϕa)aS).\displaystyle\mathcal{I}^{\prime}=(V^{\prime},E,Q,\Phi^{\prime})=(V\cup S,E,Q,\Phi\cup(\phi_{a})_{a\in S}).

The subroutine AddVertex(,,𝑿0,vt,Xt(vt)t=1T)\textsf{AddVertex}(\mathcal{I},\mathcal{I}^{\prime},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}) updates the data to 𝒀0\bm{Y}_{0} and vt,Yt(vt)t=1T\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T}} such that the random process (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} is the Gibbs sampling on instance \mathcal{I}^{\prime}.

Since the new instance \mathcal{I}^{\prime} is the same as \mathcal{I} except for the isolated vertices in SS, we can construct 𝒀0\bm{Y}_{0} by setting 𝒀0(V)=𝑿0\bm{Y}_{0}(V)=\bm{X}_{0} and choosing 𝒀0(S)QS\bm{Y}_{0}(S)\in Q^{S} arbitrarily; the execution-log 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Yt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T)=\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T}} can be constructed by inserting random appearances of vertices in SS into (vt)t=1T(v_{t})_{t=1}^{T}, while for any vSv\in S, the Yt(v)Y_{t}(v) at the inserted steps tt are sampled i.i.d. from the marginal distribution μv,()\mu_{{v},{\mathcal{I}^{\prime}}}(\cdot), which is just a distribution over QQ proportional to exp(ϕv())\exp(\phi_{v}(\cdot)) in the case of Gibbs sampling, since vv is an isolated vertex. Let [T]{1,2,,T}[T]\triangleq\{1,2,\ldots,T\}. Formally:

  1. (1)

    Let P[T]P\subseteq[T] be a random subset such that each t[T]t\in[T] is selected into PP independently with probability |S||SV|\frac{|S|}{|S\cup V|}. Let h=|P|h=|P| and enumerate all elements in PP as r1<r2<<rhr_{1}<r_{2}<\ldots<r_{h}. Let m=Thm=T-h and enumerate all elements in [T]P[T]\setminus P as 1<2<<m\ell_{1}<\ell_{2}<\cdots<\ell_{m}.

  2. (2)

    For each 1ih1\leq i\leq h, sample uiSu_{i}\in S uniformly and independently.

  3. (3)

    Let vt,Xt(vt)t=1mLengthFix(,𝑿0,vt,Xt(vt)t=1T,m)\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{m}}\leftarrow\textsf{LengthFix}\left(\mathcal{I},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}},m\right).

  4. (4)

    Construct vt,Yt(vt)t=1T\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T}} as follows:

    t=rkP\displaystyle\forall\,t=r_{k}\in P :vt=ukand Yt(vt)μuk,(), where μuk,(c)exp(ϕuk(c));\displaystyle:\quad v^{\prime}_{t}=u_{k}\quad\text{and }\quad Y_{t}(v_{t}^{\prime})\sim\mu_{{u_{k}},{\mathcal{I}^{\prime}}}(\cdot),\text{ where }\mu_{{u_{k}},{\mathcal{I}^{\prime}}}(c)\propto\exp(\phi_{u_{k}}(c));
    t=k[T]P\displaystyle\forall\,t=\ell_{k}\in[T]\setminus P :vt=vkand Yt(vt)=Xk(vt)=Xk(vk).\displaystyle:\quad v^{\prime}_{t}=v_{k}\quad\text{and }\quad Y_{t}(v_{t}^{\prime})=X_{k}(v_{t}^{\prime})=X_{k}(v_{k}).

It is easy to see that (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} is a faithful copy of the Gibbs sampling on instance \mathcal{I}^{\prime}.
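The interleaving in steps (1)-(4) can be summarized by a short sketch (our illustration; the representation of the execution-log as a list of (vertex, spin) pairs and the helper names are assumptions): the positions P[T]P\subseteq[T] are selected by independent coins, the old log is shortened to length m=T|P|m=T-|P| as LengthFix would do, and each inserted step picks a uniform new vertex and samples its spin from the distribution proportional to exp(ϕv())\exp(\phi_{v}(\cdot)).

import math, random

# Our sketch of AddVertex. old_log_full is the old execution-log of length T;
# n_old = |V|; phi_S maps each new isolated vertex to its potential over Q.
def draw_from_potential(phi_v):
    """Sample a spin c with probability proportional to exp(phi_v(c))."""
    weights = {c: math.exp(w) for c, w in phi_v.items()}
    r, acc = random.uniform(0, sum(weights.values())), 0.0
    for c, w in weights.items():
        acc += w
        if r < acc:
            return c
    return c

def add_vertex_log(T, n_old, phi_S, old_log_full):
    S = list(phi_S)
    P = {t for t in range(T) if random.random() < len(S) / (len(S) + n_old)}
    old_log = iter(old_log_full[: T - len(P)])   # role of LengthFix: keep m = T - |P| steps
    new_log = []
    for t in range(T):
        if t in P:                               # inserted step: a uniform new isolated vertex
            u = random.choice(S)
            new_log.append((u, draw_from_potential(phi_S[u])))
        else:                                    # copied step from the (shortened) old chain
            new_log.append(next(old_log))
    return new_log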

Coupling for DeleteVertex

Let =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) be the current MRF instance. The update deletes a set of isolated variables SVS\subseteq V. Let 𝑿0\bm{X}_{0} and vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} be the current initial state and execution log such that the random process (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} is the Gibbs sampling on instance \mathcal{I}. Upon such update, the instance is updated to =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V^{\prime},E,Q,\Phi^{\prime}), where V=VSV^{\prime}=V\setminus S and Φ=Φ(ϕv)vS\Phi^{\prime}=\Phi\setminus(\phi_{v})_{v\in S}. The subroutine DeleteVertex(,,𝑿0,vt,Xt(vt)t=1T)\textsf{DeleteVertex}(\mathcal{I},\mathcal{I}^{\prime},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}) updates the data to 𝒀0\bm{Y}_{0} and vt,Yt(vt)t=1T\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T}} such that the random process (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} is the Gibbs sampling on instance \mathcal{I}^{\prime}.

We can simply construct 𝒀0=𝑿0(V)\bm{Y}_{0}=\bm{X}_{0}(V^{\prime}). The new execution-log 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Yt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T)=\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T}} can be constructed from the original 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Xt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I},T)=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} by simply deleting all appearances of vertices vSv\in S in (vt)t=1T(v_{t})_{t=1}^{T} and the corresponding trivial transitions Xt(v)X_{t}(v), followed by calling LengthFix on instance \mathcal{I}^{\prime} to properly extend the chain back to length TT.

It is easy to see that (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} is a faithful copy of the Gibbs sampling on instance \mathcal{I}^{\prime}.
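In the same spirit, a minimal sketch of DeleteVertex (our illustration; the names are hypothetical) just filters out the deleted isolated vertices from the initial state and from the execution-log; LengthFix then extends the shortened log back to the target length on \mathcal{I}^{\prime}.

# Our sketch of DeleteVertex: drop every transition of a deleted isolated vertex
# and its entry in the initial state; the resulting log is shorter than T and is
# then re-extended by LengthFix on the new instance.
def delete_vertex_log(X0, exe_log, S):
    S = set(S)
    Y0 = {v: c for v, c in X0.items() if v not in S}
    return Y0, [(v, c) for v, c in exe_log if v not in S]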

6.2. Data structure for Gibbs sampling

We now describe an efficient data structure for Gibbs sampling (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T}. Let =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) be an MRF instance. The data structure should provide the following functionalities.

  • Data: an initial state 𝑿0QV\bm{X}_{0}\in Q^{V} and an execution-log vt,Xt(vt)t=1T(V×Q)T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}\in(V\times Q)^{T} that records the TT transitions of the Gibbs sampling (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T};

  • updates:

    • Insert(t,v,c)\textsf{Insert}(t,v,c), which inserts a transition v,c\left\langle\,{v,c}\,\right\rangle after the (t1)(t-1)-th transition vt1,Xt1(vt1)\left\langle\,{v_{t-1},X_{t-1}(v_{t-1})}\,\right\rangle;

    • Remove(t)\textsf{Remove}(t), which deletes the tt-th transition vt,Xt(vt)\left\langle\,{v_{t},X_{t}(v_{t})}\,\right\rangle;

    • Change(t,c)\textsf{Change}(t,c), which changes the tt-th transition vt,Xt(vt)\left\langle\,{v_{t},X_{t}(v_{t})}\,\right\rangle to vt,c\left\langle\,{v_{t},c}\,\right\rangle;

    Note that the updates Insert(t,v,c)\textsf{Insert}(t,v,c) and Remove(t)\textsf{Remove}(t) change the length TT of the chain, as well as the order-numbers of all transitions after the inserted/deleted transition.

  • queries:

    • Eval(t,v)\textsf{Eval}(t,v), which returns the value of Xt(v)X_{t}(v) for arbitrary tt and vv (not necessarily =vt=v_{t});

    • Succ(t,v)\textsf{Succ}(t,v), which returns ii for the smallest i>ti>t such that vi=vv_{i}=v if such ii exists, or returns \perp if otherwise.

It is not difficult to realize that the query Eval(t,v)\textsf{Eval}(t,v) can actually be solved by a predecessor search defined symmetrically to Succ(t,v)\textsf{Succ}(t,v). This data structure problem for Gibbs sampling is quite natural and is of independent interest.
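To make the interface concrete before stating the theorem, the following is a simplified reference implementation (our illustration only): it stores the execution-log as a plain Python list, so Insert and Remove take linear time and Eval/Succ use linear scans, whereas Theorem 6.12 below achieves polylogarithmic time per operation with balanced search trees and an order-statistic tree. Steps are 1-indexed, matching vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}.

# Simplified (not asymptotically optimal) reference implementation of the interface.
class ExecutionLog:
    def __init__(self, X0, log):
        self.X0 = dict(X0)                 # initial state X_0
        self.log = list(log)               # [(v_1, X_1(v_1)), ..., (v_T, X_T(v_T))]

    def insert(self, t, v, c):             # Insert(t, v, c): new t-th transition <v, c>
        self.log.insert(t - 1, (v, c))

    def remove(self, t):                   # Remove(t): delete the t-th transition
        del self.log[t - 1]

    def change(self, t, c):                # Change(t, c): keep v_t, replace its value by c
        v, _ = self.log[t - 1]
        self.log[t - 1] = (v, c)

    def eval(self, t, v):                  # Eval(t, v): the value of X_t(v)
        for i in range(t, 0, -1):          # predecessor search: largest i <= t with v_i = v
            if self.log[i - 1][0] == v:
                return self.log[i - 1][1]
        return self.X0[v]                  # no update up to step t: keep the initial value

    def succ(self, t, v):                  # Succ(t, v): smallest i > t with v_i = v, else None
        for i in range(t + 1, len(self.log) + 1):
            if self.log[i - 1][0] == v:
                return i
        return None

# toy usage
log = ExecutionLog({"a": 0, "b": 1}, [("a", 1), ("b", 0), ("a", 0)])
print(log.eval(2, "a"), log.succ(1, "a"))  # 1 3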

Theorem 6.12 (data structure for Gibbs sampling).

There exists a deterministic dynamic data structure which stores an arbitrary initial state 𝐗0QV\bm{X}_{0}\in Q^{V} and an execution-log vt,Xt(vt)t=1T(V×Q)T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}\in(V\times Q)^{T} for Gibbs sampling using O(T+|V|)O(T+|V|) memory words, each of O(logT+log|V|+log|Q|)O(\log T+\log|V|+\log|Q|) bits, such that each operation among Insert, Remove, Change, Eval and Succ can be resolved in time O(log2T+log|V|)O(\log^{2}T+\log|V|).

Proof.

The initial state and execution-log are stored by separate data structures.

The initial state 𝑿0QV\bm{X}_{0}\in Q^{V} is maintained by a deterministic dynamic dictionary, with (v,X0(v))(v,X_{0}(v)) for vertices vVv\in V as the key-value pairs. Such a deterministic data structure answers queries of X0(v)X_{0}(v) given any vVv\in V while VV is dynamically changing.

The execution-log vt,Xt(vt)t=1T(V×Q)T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}\in(V\times Q)^{T} is stored by |V||V| balanced search trees (𝒯v)vV(\mathcal{T}_{v})_{v\in V} (e.g., red-black trees). In each tree 𝒯v\mathcal{T}_{v}, each node in 𝒯v\mathcal{T}_{v} stores a distinct transition vt,Xt(vt)\left\langle\,{v_{t},X_{t}(v_{t})}\,\right\rangle with vt=vv_{t}=v, such that the in-order tree walk of 𝒯v\mathcal{T}_{v} prints all vt,Xt(vt)\left\langle\,{v_{t},X_{t}(v_{t})}\,\right\rangle with vt=vv_{t}=v in the order they appear in the execution-log vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}. Altogether these trees (𝒯v)vV(\mathcal{T}_{v})_{v\in V} have TT nodes in total. Besides, these trees (𝒯v)vV(\mathcal{T}_{v})_{v\in V} are indexed by another deterministic dynamic dictionary, with (v,pv)(v,p_{v}) for vertices vVv\in V as key-value pairs, where each pvp_{v} is the pointer to the root of tree 𝒯v\mathcal{T}_{v}. This dictionary provides random accesses to the trees 𝒯v\mathcal{T}_{v} for all vVv\in V, while VV is dynamically changing.

Given any tt, we want to answer predecessor (or successor) search for the largest iti\leq t (or smallest i>ti>t) such that vi=vv_{i}=v. This is achieved with assistance from another data structure, an order-statistic tree (or OS-tree) 𝒯^\widehat{\mathcal{T}} [5, Section 14]. In 𝒯^\widehat{\mathcal{T}}, each node stores the “identity” of an individual transition vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} (which is actually a pointer to the node storing the transition vt,Xt(vt)\left\langle\,{v_{t},X_{t}(v_{t})}\,\right\rangle in the tree 𝒯v\mathcal{T}_{v} with vt=vv_{t}=v). In particular, the in-order tree walk of 𝒯^\widehat{\mathcal{T}} prints all vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} in that order. Such a data structure supports two query functions: (1) Select: given any tt, returns the identity of the tt-th transition vt,Xt(vt)\left\langle\,{v_{t},X_{t}(v_{t})}\,\right\rangle; and (2) Rank: given the identity of any transition vt,Xt(vt)\left\langle\,{v_{t},X_{t}(v_{t})}\,\right\rangle, returns its rank tt in the sequence vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}. Besides, the OS-tree 𝒯^\widehat{\mathcal{T}} also supports standard insertion (of a new transition v,c\left\langle\,{v,c}\,\right\rangle to a given rank tt) and deletion (of the transition vt,Xt(vt)\left\langle\,{v_{t},X_{t}(v_{t})}\,\right\rangle at a given rank tt). As a balanced tree, all these queries and updates for the OS-tree 𝒯^\widehat{\mathcal{T}} can be resolved in O(logT)O(\log T) time.

The successor and predecessor searches mentioned above, for any vVv\in V and tt, can then be resolved by binary searches in the balanced search tree 𝒯v\mathcal{T}_{v} while querying the OS-tree 𝒯^\widehat{\mathcal{T}} as an oracle for ordering, which takes time at most O(log2T+log|V|)O(\log^{2}T+\log|V|) in total, where the log|V|\log|V| cost is used for accessing the root of 𝒯v\mathcal{T}_{v} via the dynamic dictionary that indexes the trees (𝒯v)vV(\mathcal{T}_{v})_{v\in V}.

This solves the successor query Succ(t,v)\textsf{Succ}(t,v) as well as the evaluation query Eval(t,v)\textsf{Eval}(t,v) for Gibbs sampling, both within time cost O(log2T+log|V|)O(\log^{2}T+\log|V|), where the latter is actually solved by the predecessor search for the largest iti\leq t such that vi=vv_{i}=v and returning the value of Xi(vi)X_{i}(v_{i}) recorded in the ii-th transition vi,Xi(vi)\left\langle\,{v_{i},X_{i}(v_{i})}\,\right\rangle or returning the value of X0(v)X_{0}(v) if no such ii exists.

It is also easy to verify that with the above dynamic data structures, all updates, including: Insert(t,v,c)\textsf{Insert}(t,v,c), Remove(t)\textsf{Remove}(t) and Change(t,c)\textsf{Change}(t,c), can be implemented with cost at most O(log2T+log|V|)O(\log^{2}T+\log|V|), and the data structures together use O(T+|V|)O(T+|V|) words in total, where each word consists of O(logT+log|V|+log|Q|)O(\log T+\log|V|+\log|Q|) bits. ∎

6.3. Single-sample dynamic Gibbs sampling algorithm

With the data structure for Gibbs sampling stated in Theorem 6.12, the couplings constructed in Section 6.1 can be implemented as the algorithm for dynamic Gibbs sampling. Recall dgraph(,)d_{\textsf{graph}}(\cdot,\cdot) and dHamil(,)d_{\textsf{Hamil}}(\cdot,\cdot) are defined in (2).

Lemma 6.13 (single-sample dynamic Gibbs sampling algorithm).

Let ϵ:+(0,1)\epsilon:\mathbb{N}^{+}\to(0,1) be an error function. Let =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) be an MRF instance with n=|V|n=|V| and =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime}) the updated instance with n=|V|n^{\prime}=|V^{\prime}|. Denote T=T()T=T(\mathcal{I}), T=T()T^{\prime}=T(\mathcal{I}^{\prime}) and Tmax=max{T,T}T_{\max}=\max\{T,T^{\prime}\}. Assume dgraph(,)L𝗀𝗋𝖺𝗉𝗁=o(n)d_{\textsf{graph}}(\mathcal{I},\mathcal{I}^{\prime})\leq L_{\mathsf{graph}}=o(n), dHamil(,)L𝖧𝖺𝗆𝗂𝗅d_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}^{\prime})\leq L_{\mathsf{Hamil}}, and T,TΩ(nlogn)T,T^{\prime}\in\Omega(n\log n). The single-sample dynamic Gibbs sampling algorithm (Algorithm 2) does the following:

  • (space cost) The algorithm maintains an explicit copy of a sample 𝑿QV\bm{X}\in Q^{V} for the current instance \mathcal{I}, and also a data structure using O(T)O(T) memory words, each of O(logT)O(\log T) bits, for representing an initial state 𝑿0QV\bm{X}_{0}\in Q^{V} and an execution-log 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Xt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I},T)=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} for the Gibbs sampling (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} on \mathcal{I} generating sample 𝑿=𝑿T\bm{X}=\bm{X}_{T}.

  • (correctness) Assuming that Condition 6.2 holds for 𝑿0\bm{X}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)\mathsf{Exe\text{-}Log}(\mathcal{I},T) for the Gibbs sampling on \mathcal{I}, upon each update that modifies \mathcal{I} to \mathcal{I}^{\prime}, the algorithm updates 𝑿\bm{X} to an explicit copy of a sample 𝒀QV\bm{Y}\in Q^{V^{\prime}} for the new instance \mathcal{I}^{\prime}, and correspondingly updates the 𝑿0\bm{X}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)\mathsf{Exe\text{-}Log}(\mathcal{I},T) represented by the data structure to a 𝒀0QV\bm{Y}_{0}\in Q^{V^{\prime}} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Yt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T^{\prime})=\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T^{\prime}}} for the Gibbs sampling (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T^{\prime}} on \mathcal{I}^{\prime} generating the new sample 𝒀=𝒀T\bm{Y}=\bm{Y}_{T^{\prime}}, where 𝒀0\bm{Y}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T^{\prime}) satisfy Condition 6.2 for the Gibbs sampling on \mathcal{I}^{\prime}, therefore,

    dTV(𝒀,μ)ϵ(n).d_{\mathrm{TV}}\left({\bm{Y}},{\mu_{\mathcal{I}^{\prime}}}\right)\leq\epsilon(n^{\prime}).
  • (time cost) Assuming Condition 6.2 for 𝑿0\bm{X}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)\mathsf{Exe\text{-}Log}(\mathcal{I},T) for the Gibbs sampling on \mathcal{I}, the expected time complexity for resolving an update is

    O(Δn+Δ(|TT|+Tmax(L𝖧𝖺𝗆𝗂𝗅+L𝗀𝗋𝖺𝗉𝗁)n+𝔼[R𝖧𝖺𝗆𝗂𝗅]+𝔼[R𝗀𝗋𝖺𝗉𝗁])log2Tmax),\displaystyle O\left(\Delta n+\Delta\left(|T-T^{\prime}|+\frac{T_{\max}(L_{\mathsf{Hamil}}+L_{\mathsf{graph}})}{n}+\mathbb{E}\left[{R_{\mathsf{Hamil}}}\right]+\mathbb{E}\left[{R_{\mathsf{graph}}}\right]\right)\log^{2}T_{\max}\right),

    where Δ=max{ΔG,ΔG}\Delta=\max\{\Delta_{G},\Delta_{G^{\prime}}\}, ΔG\Delta_{G},ΔG\Delta_{G^{\prime}} denote the maximum degrees of G=(V,E)G=(V,E) and G=(V,E)G^{\prime}=(V^{\prime},E^{\prime}), R𝖧𝖺𝗆𝗂𝗅R_{\mathsf{Hamil}} is defined in (17) for the subroutine UpdateHamiltonian in Algorithm 2, and R𝗀𝗋𝖺𝗉𝗁R_{\mathsf{graph}} is defined in (23) for the subroutine UpdateEdge in Algorithm 2.

We remark that the O(Δn)O(\Delta n) term in the time cost is necessary because the update from \mathcal{I} to \mathcal{I}^{\prime} may change all the potentials of vertices and edges. One can remove the O(Δn)O(\Delta n) term from the time cost if we further restrict each update to change only a constant number of vertices, edges, and potentials.

The following result is a corollary from Lemma 6.13.

Corollary 6.14.

Assume ϵ:+(0,1)\epsilon:\mathbb{N}^{+}\to(0,1) in Lemma 6.13 satisfies the bounded difference condition in Definition 2.3. Assume \mathcal{I} and \mathcal{I}^{\prime} in Lemma 6.13 both satisfy Dobrushin-Shlosman condition (3.1) with constant δ>0\delta>0. The single-sample dynamic Gibbs sampling algorithm (Algorithm 2) uses O(nlogn)O(n\log n) memory words, each of O(logn)O(\log n) bits to maintain the sample for current instance \mathcal{I}, and resolves the update from \mathcal{I} to \mathcal{I}^{\prime} with expected time cost O(Δn+Δ2(L𝗀𝗋𝖺𝗉𝗁+L𝖧𝖺𝗆𝗂𝗅)log3n)O\left(\Delta n+\Delta^{2}(L_{\mathsf{graph}}+L_{\mathsf{Hamil}})\log^{3}n\right).

Proof of Lemma 6.13.

The dynamic Gibbs sampling algorithm is implemented as follows. The algorithm uses the dynamic data structure in Theorem 6.12 to maintain the initial state 𝑿0\bm{X}_{0} and execution-log 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Xt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I},T)=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}. Besides, the algorithm maintains the explicit copy of the sample 𝑿QV\bm{X}\in Q^{V} by a deterministic dynamic dictionary, with (v,X(v))(v,X(v)) for vertices vVv\in V as the key-value pairs. The lemma is proved as follows.

Space cost: Note that T=Ω(nlogn),|V|=nT=\Omega(n\log n),|V|=n and |Q|=O(1)|Q|=O(1). We have O(n)=O(T)O(n)=O(T) and O(logT+log|V|+log|Q|)=O(logT)O(\log T+\log|V|+\log|Q|)=O(\log T). The dynamic dictionary for sample 𝑿\bm{X} uses O(n)O(n) memory words, each of O(log|V|+log|Q|)O(\log|V|+\log|Q|) bits. Combining this with Theorem 6.12, the algorithm uses O(T)O(T) memory words, each of O(logT+log|V|+log|Q|)=O(logT)O(\log T+\log|V|+\log|Q|)=O(\log T) bits, to maintain the initial state, the execution-log and the random sample.

Correctness: The invariants for execution-log (Condition 6.2) are preserved by the coupling simulated by the algorithm. The correctness holds as a consequence.

Time cost: Consider the update that modifies \mathcal{I} to \mathcal{I}^{\prime}. We divide the algorithm into two stages.

  • Preparation stage: construct the updated instances \mathcal{I}^{\prime} and other middle instances 𝗆𝗂𝖽,1,2\mathcal{I}_{\mathsf{mid}},\mathcal{I}_{1},\mathcal{I}_{2} in (8), (18) and (19); compute pv𝗎𝗉p^{\mathsf{up}}_{v} in (12) for all vVv\in V and construct the random set 𝒫[T]={1,2,,T}\mathcal{P}\subseteq[T]=\{1,2,\ldots,T\} in (13).

  • Update stage: given 𝒫\mathcal{P} and pv𝗎𝗉p^{\mathsf{up}}_{v} for all vVv\in V, update the initial state 𝑿0\bm{X}_{0} to 𝒀0\bm{Y}_{0}, the execution-log 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Xt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I},T)=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} to 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Yt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T^{\prime})=\left\langle{v^{\prime}_{t}},Y_{t}({v^{\prime}_{t}})\right\rangle_{t=1}^{{T^{\prime}}}, and the sample 𝑿\bm{X} to 𝒀\bm{Y}.

We make the following two claims.

Claim 6.15.

The expected running time of the preparation stage is

𝔼[T𝗉𝗋𝖾𝗉𝖺𝗋𝖺𝗍𝗂𝗈𝗇𝗌𝗂𝗇𝗀𝗅𝖾]=O(Δn+𝔼[|𝒫|]log2Tmax),\displaystyle\mathbb{E}\left[{T_{\mathsf{preparation}}^{\mathsf{single}}}\right]=O\left(\Delta n+\mathbb{E}\left[{\left|\mathcal{P}\right|}\right]\log^{2}T_{\max}\right),

and the expected size of 𝒫\mathcal{P} is at most 4TmaxL𝖧𝖺𝗆𝗂𝗅n\frac{4T_{\max}L_{\mathsf{Hamil}}}{n}.

Claim 6.16.

The expected running time of the update stage is

(24) 𝔼[T𝗎𝗉𝖽𝖺𝗍𝖾𝗌𝗂𝗇𝗀𝗅𝖾]=O(Δ(|TT|+TmaxL𝗀𝗋𝖺𝗉𝗁n+𝔼[R𝖧𝖺𝗆𝗂𝗅]+𝔼[R𝗀𝗋𝖺𝗉𝗁])log2Tmax),\displaystyle\mathbb{E}\left[{T_{\mathsf{update}}^{\mathsf{single}}}\right]=O\left(\Delta\left(|T-T^{\prime}|+\frac{T_{\max}L_{\mathsf{graph}}}{n}+\mathbb{E}\left[{R_{\mathsf{Hamil}}}\right]+\mathbb{E}\left[{R_{\mathsf{graph}}}\right]\right)\log^{2}T_{\max}\right),

R𝖧𝖺𝗆𝗂𝗅R_{\mathsf{Hamil}} is defined in (17) for the subroutine UpdateHamiltonian in Algorithm 2, and R𝗀𝗋𝖺𝗉𝗁R_{\mathsf{graph}} is defined in (23) for the subroutine UpdateEdge in Algorithm 2.

By the linearity of expectation, the expected time cost of the algorithm is 𝔼[T𝗉𝗋𝖾𝗉𝖺𝗋𝖺𝗍𝗂𝗈𝗇𝗌𝗂𝗇𝗀𝗅𝖾]+𝔼[T𝗎𝗉𝖽𝖺𝗍𝖾𝗌𝗂𝗇𝗀𝗅𝖾]\mathbb{E}\left[{T_{\mathsf{preparation}}^{\mathsf{single}}}\right]+\mathbb{E}\left[{T_{\mathsf{update}}^{\mathsf{single}}}\right]. This proves the time cost.

We introduce the following technical lemma to prove Corollary 6.14.

Lemma 6.17.

Let ϵ:+(0,1)\epsilon:\mathbb{N}^{+}\to(0,1) be a function such that there exists a constant C>0C>0 such that

n+:|ϵ(n+1)ϵ(n)|Cnϵ(n).\displaystyle\forall n\in\mathbb{N}^{+}:\quad\left|\epsilon(n+1)-\epsilon(n)\right|\leq\frac{C}{n}\epsilon(n).

Then the function \epsilon has the following properties:

  • for any n+n\in\mathbb{N}^{+}, it holds that ϵ(n)1poly(n)\epsilon(n)\geq\frac{1}{\mathrm{poly}(n)};

  • let α1\alpha\geq 1 be a constant, given any n,n+n,n^{\prime}\in\mathbb{N}^{+} such that 1αnnα\frac{1}{\alpha}\leq\frac{n^{\prime}}{n}\leq\alpha,

    \displaystyle\left|n\log\frac{n}{\epsilon(n)}-n^{\prime}\log\frac{n^{\prime}}{\epsilon(n^{\prime})}\right|\leq C^{\prime}\left|n^{\prime}-n\right|\log n,

    where CC^{\prime} is a constant that depends only on α,C\alpha,C and ϵ(3C)\epsilon(3\lceil C\rceil).

Proof.

By the condition, we have ϵ(t)(1+CtC)ϵ(t+1)\epsilon(t)\leq\left(1+\frac{C}{t-C}\right)\epsilon(t+1) for all t>C+1t>\lceil C+1\rceil. Thus for all n>l=3Cn>l=3\lceil C\rceil,

(25) ϵ(l)i=ln1(1+CiC)ϵ(n)ϵ(n)exp(Ci=2n11i)ϵ(n)exp(Clnn)=ϵ(n)nC.\displaystyle\epsilon(l)\leq\prod_{i=l}^{n-1}\left(1+\frac{C}{i-C}\right)\epsilon(n)\leq\epsilon(n)\exp\left(C\sum_{i=2}^{n-1}\frac{1}{i}\right)\leq\epsilon(n)\exp(C\ln n)=\epsilon(n)n^{C}.

Thus, we have ϵ(n)1poly(n)\epsilon(n)\geq\frac{1}{\mathrm{poly}(n)}.

We then prove the second property. We may assume without loss of generality that \min\{n,n^{\prime}\}\geq l, since otherwise we can choose C^{\prime} sufficiently large so that the second property holds. First, we prove the case n>n^{\prime}. We have \left|\log\frac{n}{n^{\prime}}\right|\leq\frac{n-n^{\prime}}{n^{\prime}}. By \epsilon(t)\leq\left(1+\frac{C}{t-C}\right)\epsilon(t+1) for all t>\lceil C+1\rceil, we also have

ϵ(n)i=nn1(1+CiC)ϵ(n)ϵ(n)exp(C(nn)nC).\displaystyle\epsilon(n^{\prime})\leq\prod_{i=n^{\prime}}^{n-1}\left(1+\frac{C}{i-C}\right)\epsilon(n)\leq\epsilon(n)\exp\left(\frac{C(n-n^{\prime})}{n^{\prime}-C}\right).

Thus,

(26) |lognϵ(n)lognϵ(n)||lognn|+|logϵ(n)ϵ(n)|nnn+C(nn)nC(2C+1)(nn)n.\displaystyle\left|\log\frac{n}{\epsilon(n)}-\log\frac{n^{\prime}}{\epsilon(n^{\prime})}\right|\leq\left|\log\frac{n}{n^{\prime}}\right|+\left|\log\frac{\epsilon(n)}{\epsilon(n^{\prime})}\right|\leq\frac{n-n^{\prime}}{n^{\prime}}+\frac{C(n-n^{\prime})}{n^{\prime}-C}\leq\frac{(2C+1)(n-n^{\prime})}{n^{\prime}}.

The last inequality holds because 2(n^{\prime}-C)\geq n^{\prime}+l-2C\geq n^{\prime}. Let C^{\prime}=2+\left|\log\epsilon(l)\right|+3C. We have

|nlognϵ(n)nlognϵ(n)||(nn)lognϵ(n)|+|n(lognϵ(n)lognϵ(n))|C|nn|logn.\displaystyle\left|n\log\frac{n}{\epsilon(n)}-n^{\prime}\log\frac{n^{\prime}}{\epsilon(n^{\prime})}\right|\leq\left|(n^{\prime}-n)\log\frac{n}{\epsilon(n)}\right|+\left|n^{\prime}\left(\log\frac{n}{\epsilon(n)}-\log\frac{n^{\prime}}{\epsilon(n^{\prime})}\right)\right|\leq C^{\prime}\left|n^{\prime}-n\right|\log n.

The last inequality is due to (25) and (26). Similarly, we can also prove the lemma if n<nn<n^{\prime}. ∎
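As a quick illustration of the bounded difference condition (not part of the proof), a polynomially decaying error function such as \epsilon(n)=n^{-2} satisfies it with C=2; the following sketch checks this numerically.

    # Sanity check (illustration only): eps(n) = n^{-2} satisfies
    # |eps(n+1) - eps(n)| <= (C/n) * eps(n) with C = 2.
    def eps(n):
        return n ** -2

    C = 2.0
    assert all(abs(eps(n + 1) - eps(n)) <= (C / n) * eps(n) for n in range(1, 10000))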

Proof of Corollary 6.14.

By L𝗀𝗋𝖺𝗉𝗁=o(n)L_{\mathsf{graph}}=o(n), we have n=Θ(n)n^{\prime}=\Theta(n). Since \mathcal{I} and \mathcal{I}^{\prime} both satisfy Dobrushin-Shlosman condition (3.1) with constant δ>0\delta>0, we can set T,TT,T^{\prime} as in (7) such that

T\displaystyle T =nδlognε(n)=Θ(nlogn)\displaystyle=\left\lceil\frac{n}{\delta}\log\frac{n}{\varepsilon(n)}\right\rceil=\Theta(n\log n)
T\displaystyle T^{\prime} =nδlognε(n)=Θ(nlogn).\displaystyle=\left\lceil\frac{n^{\prime}}{\delta}\log\frac{n^{\prime}}{\varepsilon(n^{\prime})}\right\rceil=\Theta(n\log n).

The equations hold because n=Θ(n)n^{\prime}=\Theta(n) and the error function ϵ\epsilon satisfies ϵ()1poly()\epsilon(\ell)\geq\frac{1}{\mathrm{poly}(\ell)} by Lemma 6.17. Thus, we have

(27) Tmax=max{T,T}=O(nlogn).\displaystyle T_{\max}=\max\{T,T^{\prime}\}=O(n\log n).

By Lemma 6.17 and |nn|L𝗀𝗋𝖺𝗉𝗁=o(n)|n^{\prime}-n|\leq L_{\mathsf{graph}}=o(n), we have

(28) |TT|=O(L𝗀𝗋𝖺𝗉𝗁logn).\displaystyle\left|T-T^{\prime}\right|=O(L_{\mathsf{graph}}\log n).

Let 𝗆𝗂𝖽=(V,E,Q,Φ𝗆𝗂𝖽)\mathcal{I}_{\mathsf{mid}}=(V,E,Q,\Phi^{\mathsf{mid}}) be the middle instance constructed as in (8). In Algorithm 2, we call the subroutine UpdateHamiltonian for instances \mathcal{I} and 𝗆𝗂𝖽\mathcal{I}_{\mathsf{mid}}. Since \mathcal{I} satisfies the Dobrushin-Shlosman condition, by Lemma 6.8 and d(,𝗆𝗂𝖽)d(,)L𝖧𝖺𝗆𝗂𝗅\,\mathrm{d}(\mathcal{I},\mathcal{I}_{\mathsf{mid}})\leq d(\mathcal{I},\mathcal{I}^{\prime})\leq L_{\mathsf{Hamil}}, we have

(29) 𝔼[R𝖧𝖺𝗆𝗂𝗅]=O(ΔTL𝖧𝖺𝗆𝗂𝗅δn)=O(ΔL𝖧𝖺𝗆𝗂𝗅logn),\displaystyle\mathbb{E}\left[{R_{\mathsf{Hamil}}}\right]=O\left(\frac{\Delta TL_{\mathsf{Hamil}}}{\delta n}\right)=O(\Delta L_{\mathsf{Hamil}}\log n),

where R𝖧𝖺𝗆𝗂𝗅R_{\mathsf{Hamil}} is defined in (17) for the subroutine UpdateHamiltonian.

We also call the subroutine UpdateGraph for instances \mathcal{I}_{\mathsf{mid}} and \mathcal{I}^{\prime} in Algorithm 2. The subroutine is shown in Algorithm 5. We first add isolated vertices to update \mathcal{I}_{\mathsf{mid}} to \mathcal{I}_{1}, then update edges to turn \mathcal{I}_{1} into \mathcal{I}_{2}, and finally delete isolated vertices to update \mathcal{I}_{2} to \mathcal{I}^{\prime}. Since \mathcal{I}^{\prime} satisfies the Dobrushin-Shlosman condition and the only difference between \mathcal{I}_{2} and \mathcal{I}^{\prime} is that \mathcal{I}_{2} contains extra isolated vertices, it is easy to verify that \mathcal{I}_{2} also satisfies the Dobrushin-Shlosman condition. In Algorithm 5, the subroutine UpdateEdge is called for \mathcal{I}_{1} and \mathcal{I}_{2}. By Lemma 6.11, we have

(30) \displaystyle\mathbb{E}\left[{R_{\mathsf{graph}}}\right]=O\left(\frac{\Delta TL_{\mathsf{graph}}}{\delta n}\right)=O(\Delta L_{\mathsf{graph}}\log n),

where R𝗀𝗋𝖺𝗉𝗁R_{\mathsf{graph}} is defined in (23) for the subroutine UpdateEdge.

Combining (27), (28), (29), (30) with Lemma 6.13, we have the expected time cost is

𝔼[T𝖼𝗈𝗌𝗍]\displaystyle\mathbb{E}\left[{T_{\mathsf{cost}}}\right] =O(Δn+Δ(|TT|+Tmax(L𝖧𝖺𝗆𝗂𝗅+L𝗀𝗋𝖺𝗉𝗁)n+𝔼[R𝖧𝖺𝗆𝗂𝗅]+𝔼[R𝗀𝗋𝖺𝗉𝗁])log2Tmax)\displaystyle=O\left(\Delta n+\Delta\left(|T-T^{\prime}|+\frac{T_{\max}(L_{\mathsf{Hamil}}+L_{\mathsf{graph}})}{n}+\mathbb{E}\left[{R_{\mathsf{Hamil}}}\right]+\mathbb{E}\left[{R_{\mathsf{graph}}}\right]\right)\log^{2}T_{\max}\right)
=O(Δn+Δ2(L𝗀𝗋𝖺𝗉𝗁+L𝖧𝖺𝗆𝗂𝗅)log3n).\displaystyle=O\left(\Delta n+\Delta^{2}(L_{\mathsf{graph}}+L_{\mathsf{Hamil}})\log^{3}n\right).\qed

6.4. Multi-sample dynamic Gibbs sampling algorithm

In this section, we give a multi-sample dynamic Gibbs sampling algorithm that maintains multiple independent random samples for the current MRF instance. Theorem 6.1 follows immediately from the following lemma.

Lemma 6.18 (multi-sample dynamic Gibbs sampling algorithm).

Let N:++N:\mathbb{N}^{+}\to\mathbb{N}^{+} and ϵ:+(0,1)\epsilon:\mathbb{N}^{+}\to(0,1) be two functions satisfying the bounded difference condition in Definition 2.3. Let =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) be an MRF instance with n=|V|n=|V| and =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime}) the updated instance with n=|V|n^{\prime}=|V^{\prime}|. Assume that \mathcal{I} and \mathcal{I}^{\prime} both satisfy Dobrushin-Shlosman condition with constant δ>0\delta>0, dgraph(,)L𝗀𝗋𝖺𝗉𝗁=o(n)d_{\textsf{graph}}(\mathcal{I},\mathcal{I}^{\prime})\leq L_{\mathsf{graph}}=o(n) and dHamil(,)L𝖧𝖺𝗆𝗂𝗅d_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}^{\prime})\leq L_{\mathsf{Hamil}}. Denote T=nδlognϵ(n)T=\lceil\frac{n}{\delta}\log\frac{n}{\epsilon(n)}\rceil, T=nδlognϵ(n)T^{\prime}=\lceil\frac{n^{\prime}}{\delta}\log\frac{n^{\prime}}{\epsilon(n^{\prime})}\rceil.

There is an algorithm which does the following:

  • (space cost) The algorithm maintains N(n) explicit copies of independent samples \bm{X}^{(1)},\ldots,\bm{X}^{(N(n))}, where \bm{X}^{(i)}\in Q^{V} for all 1\leq i\leq N(n), for the current instance \mathcal{I}, and also a data structure using O(nN(n)\log n) memory words, each of O(\log n) bits, for representing the initial states \bm{X}^{(i)}_{0}\in Q^{V} and the execution-logs \mathsf{Exe\text{-}Log}^{(i)}(\mathcal{I},T)=\left\langle{v^{(i)}_{t}},X^{(i)}_{t}({v^{(i)}_{t}})\right\rangle_{t=1}^{{T}} for 1\leq i\leq N(n), such that each Gibbs sampling chain (\bm{X}^{(i)}_{t})_{t=0}^{T} on \mathcal{I} generates an independent sample \bm{X}^{(i)}=\bm{X}^{(i)}_{T}.

  • (correctness) Assuming that Condition 6.2 holds for each \bm{X}^{(i)}_{0} and \mathsf{Exe\text{-}Log}^{(i)}(\mathcal{I},T) for the Gibbs sampling on \mathcal{I}, upon each update that modifies \mathcal{I} to \mathcal{I}^{\prime}, the algorithm updates \bm{X}^{(1)},\bm{X}^{(2)},\ldots,\bm{X}^{(N(n))} to N(n^{\prime}) explicit copies of independent samples \bm{Y}^{(1)},\bm{Y}^{(2)},\ldots,\bm{Y}^{(N(n^{\prime}))}\in Q^{V^{\prime}} for the new instance \mathcal{I}^{\prime}, and correspondingly updates the data represented by the data structure to \bm{Y}^{(i)}_{0}\in Q^{V^{\prime}} and \mathsf{Exe\text{-}Log}^{(i)}(\mathcal{I}^{\prime},T^{\prime})=\left\langle{u^{(i)}_{t}},Y^{(i)}_{t}({u^{(i)}_{t}})\right\rangle_{t=1}^{{T^{\prime}}} for 1\leq i\leq N(n^{\prime}), such that each Gibbs sampling chain (\bm{Y}^{(i)}_{t})_{t=0}^{T^{\prime}} on \mathcal{I}^{\prime} generates a new sample \bm{Y}^{(i)}=\bm{Y}^{(i)}_{T^{\prime}}, where each \bm{Y}^{(i)}_{0} and \mathsf{Exe\text{-}Log}^{(i)}(\mathcal{I}^{\prime},T^{\prime}) satisfy Condition 6.2 for the Gibbs sampling on \mathcal{I}^{\prime}; therefore,

    dTV(𝒀(i),μ)ϵ(n).d_{\mathrm{TV}}\left({\bm{Y}^{(i)}},{\mu_{\mathcal{I}^{\prime}}}\right)\leq\epsilon(n^{\prime}).
  • (time cost) Assuming Condition 6.2 for each 𝑿0(i)\bm{X}^{(i)}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(i)(,T)\mathsf{Exe\text{-}Log}^{(i)}(\mathcal{I},T) for the Gibbs sampling on \mathcal{I}, the time complexity for resolving an update is

    O(Δ2(L𝖧𝖺𝗆𝗂𝗅+L𝗀𝗋𝖺𝗉𝗁)N(n)log3n+Δn),O\left(\Delta^{2}(L_{\mathsf{Hamil}}+L_{\mathsf{graph}})N(n)\cdot\log^{3}n+\Delta n\right),

    where Δ=max{ΔG,ΔG}\Delta=\max\{\Delta_{G},\Delta_{G^{\prime}}\}, and ΔG,ΔG\Delta_{G},\Delta_{G^{\prime}} denote the maximum degree of G=(V,E)G=(V,E) and G=(V,E)G^{\prime}=(V^{\prime},E^{\prime}).

The following technical lemma will be used to prove Lemma 6.18.

Lemma 6.19.

Let N:++N:\mathbb{N}^{+}\to\mathbb{N}^{+} be a function such that there exists a constant C>0C>0 such that

n+:|N(n+1)N(n)|CnN(n).\displaystyle\forall n\in\mathbb{N}^{+}:\quad\left|N(n+1)-N(n)\right|\leq\frac{C}{n}N(n).

Then the function NN has the following properties

  • for any n+n\in\mathbb{N}^{+}, it holds that N(n)poly(n)N(n)\leq\mathrm{poly}(n);

  • let α1\alpha\geq 1 be a constant, given any n,n+n,n^{\prime}\in\mathbb{N}^{+} such that 1αnnα\frac{1}{\alpha}\leq\frac{n^{\prime}}{n}\leq\alpha,

    \displaystyle\left|N(n)-N(n^{\prime})\right|\leq C^{\prime}(\alpha,C)\cdot\frac{\left|n-n^{\prime}\right|}{n}N(n),

    where C(α,C)C^{\prime}(\alpha,C) is a constant that depends only on α\alpha and CC.

Proof.

By the condition, we have N(n+1)(1+Cn)N(n)N(n+1)\leq\left(1+\frac{C}{n}\right)N(n). Thus for all n+n\in\mathbb{N}^{+},

N(n)N(1)i=1n1(1+Ci)N(1)exp(Ci=1n11i)=N(1)exp(Θ(lnn))=poly(n).\displaystyle N(n)\leq N(1)\prod_{i=1}^{n-1}\left(1+\frac{C}{i}\right)\leq N(1)\exp\left(C\sum_{i=1}^{n-1}\frac{1}{i}\right)=N(1)\exp(\Theta(\ln n))=\mathrm{poly}(n).

We then prove the second property. Note that \frac{\left|n-n^{\prime}\right|}{n}\leq\alpha; it suffices to prove

(31) |N(n)N(n)1|C(α,C)|nn|n.\displaystyle\left|\frac{N(n^{\prime})}{N(n)}-1\right|\leq C^{\prime}(\alpha,C)\cdot\frac{\left|n-n^{\prime}\right|}{n}.

Assume that min{n,n}2Cα\min\{n,n^{\prime}\}\leq 2C\alpha. Then, we have max{n,n}2Cα2\max\{n,n^{\prime}\}\leq 2C\alpha^{2}. We can choose C(α,C)C^{\prime}(\alpha,C) sufficiently large so that (31) holds. Assume n>n>2αCn^{\prime}>n>2\alpha C. Note that |nn|nα\frac{\left|n-n^{\prime}\right|}{n}\leq\alpha. We have

1C|nn|n(1Cn)|nn|N(n)N(n)(1+Cn)|nn|1+Cexp(αC)|nn|n,\displaystyle 1-\frac{C\left|n-n^{\prime}\right|}{n}\leq\left(1-\frac{C}{n}\right)^{\left|n-n^{\prime}\right|}\leq\frac{N(n^{\prime})}{N(n)}\leq\left(1+\frac{C}{n}\right)^{\left|n-n^{\prime}\right|}\leq 1+\frac{C\exp(\alpha C)\left|n-n^{\prime}\right|}{n},

which implies (31) holds if C(α,C)Cexp(αC)C^{\prime}(\alpha,C)\geq C\exp(\alpha C). Assume n>n>2αCn>n^{\prime}>2\alpha C. Note that |nn|nα\frac{\left|n-n^{\prime}\right|}{n}\leq\alpha and nnαn^{\prime}\geq\frac{n}{\alpha}. We have

1αC|nn|n(1αCn)|nn|N(n)N(n)(1+αCn)|nn|1+Cαexp(α2C)|nn|n.\displaystyle 1-\frac{\alpha C\left|n-n^{\prime}\right|}{n}\leq\left(1-\frac{\alpha C}{n}\right)^{\left|n-n^{\prime}\right|}\leq\frac{N(n^{\prime})}{N(n)}\leq\left(1+\frac{\alpha C}{n}\right)^{\left|n-n^{\prime}\right|}\leq 1+\frac{C\alpha\exp(\alpha^{2}C)\left|n-n^{\prime}\right|}{n}.

which implies (31) holds if C(α,C)Cαexp(α2C)C^{\prime}(\alpha,C)\geq C\alpha\exp(\alpha^{2}C). ∎

Proof of Lemma 6.18.

The main idea of the multi-sample dynamic Gibbs sampling algorithm is to use the single-sample dynamic Gibbs sampling algorithm (Algorithm 2) to maintain each sample \bm{X}^{(i)}\in Q^{V} for 1\leq i\leq N(n). We need a careful implementation to guarantee the time cost in Lemma 6.18.

Space cost: Note that T=nδlognε(n)=Θ(nlogn)T=\left\lceil\frac{n}{\delta}\log\frac{n}{\varepsilon(n)}\right\rceil=\Theta(n\log n) due to Lemma 6.17 and N(n)poly(n)N(n)\leq\mathrm{poly}(n) due to Lemma 6.19. The dynamic dictionary for each sample 𝑿(i)\bm{X}^{(i)} uses O(n)O(n) memory words, each of O(logn)O(\log n) bits. Hence, the algorithm uses O(TN(n))=O(nN(n)logn)O(T\cdot N(n))=O(nN(n)\log n) memory words to maintain all the initial states, execution-logs and the random samples due to Theorem 6.12.

Correctness: The invariants for execution-log (Condition 6.2) are preserved by the coupling simulated by the algorithm. The correctness holds as a consequence.

Time cost: Define N_{\min}\triangleq\min\{N(n),N(n^{\prime})\}. Fix 1\leq k\leq N_{\min}. We use Algorithm 2 to update the sample \bm{X}^{(k)} to \bm{Y}^{(k)}. Let \mathcal{P}_{k}\subseteq[T] denote the set defined in (13) for the subroutine UpdateHamiltonian in Algorithm 2. The multi-sample dynamic Gibbs sampling algorithm has the following three stages (a schematic driver is sketched after the list).

  • Preparation stage: construct the updated instance \mathcal{I}^{\prime} and the intermediate instances \mathcal{I}_{\mathsf{mid}},\mathcal{I}_{1},\mathcal{I}_{2} in (8), (18), (6.1.2); compute p^{\mathsf{up}}_{v} in (12) for all v\in V; and construct the random sets \mathcal{P}_{1},\mathcal{P}_{2},\ldots,\mathcal{P}_{N_{\min}}.

  • Update stage: given the (pv𝗎𝗉)vV(p^{\mathsf{up}}_{v})_{v\in V} and (𝒫i)1iNmin(\mathcal{P}_{i})_{1\leq i\leq N_{\min}}, for each 1iNmin1\leq i\leq N_{\min}, use Algorithm 2 to update the initial state 𝑿0(i)\bm{X}_{0}^{(i)} to 𝒀0(i)\bm{Y}_{0}^{(i)}, the execution-log 𝖤𝗑𝖾-𝖫𝗈𝗀(i)(,T)=vt(i),Xt(i)(vt(i))t=1T\mathsf{Exe\text{-}Log}^{(i)}(\mathcal{I},T)=\left\langle{v_{t}^{(i)}},X^{(i)}_{t}({v_{t}^{(i)}})\right\rangle_{t=1}^{{T}} to 𝖤𝗑𝖾-𝖫𝗈𝗀(i)(,T)=u(i),Yt(i)(u(i))t=1T\mathsf{Exe\text{-}Log}^{(i)}(\mathcal{I}^{\prime},T^{\prime})=\left\langle{u^{(i)}},Y^{(i)}_{t}({u^{(i)}})\right\rangle_{t=1}^{{T^{\prime}}}, and the sample 𝑿(i)\bm{X}^{(i)} to 𝒀(i)\bm{Y}^{(i)}.

  • Completion stage: If N(n)<N(n)N(n^{\prime})<N(n), for each N(n)<iN(n)N(n^{\prime})<i\leq N(n), remove the sample 𝑿(i)\bm{X}^{(i)}, the initial state 𝑿0(i)\bm{X}^{(i)}_{0} and the execution-log 𝖤𝗑𝖾-𝖫𝗈𝗀(i)(,T)=vt(i),Xt(i)(vt(i))t=1T\mathsf{Exe\text{-}Log}^{(i)}(\mathcal{I},T)=\left\langle{v_{t}^{(i)}},X^{(i)}_{t}({v_{t}^{(i)}})\right\rangle_{t=1}^{{T}} from the data; if N(n)>N(n)N(n^{\prime})>N(n), for each N(n)<iN(n)N(n)<i\leq N(n^{\prime}), construct an independent Gibbs sampling chain (𝒀t(i))t=0T(\bm{Y}^{(i)}_{t})_{t=0}^{T^{\prime}} on instance \mathcal{I}^{\prime}, write the sample 𝒀(i)=𝒀T(i)\bm{Y}^{(i)}=\bm{Y}^{(i)}_{T^{\prime}}, the initial state 𝒀0(i)\bm{Y}^{(i)}_{0} and the execution-log 𝖤𝗑𝖾-𝖫𝗈𝗀(i)(,T)=ut(i),Yt(i)(ut(i))t=1T\mathsf{Exe\text{-}Log}^{(i)}(\mathcal{I}^{\prime},T^{\prime})=\left\langle{u_{t}^{(i)}},Y^{(i)}_{t}({u_{t}^{(i)}})\right\rangle_{t=1}^{{T^{\prime}}} into the data.
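The following schematic driver (an illustration only; single_sample_update and fresh_chain are hypothetical stand-ins for Algorithm 2 and for running a fresh Gibbs sampling chain on \mathcal{I}^{\prime}) summarizes the update and completion stages; the shared preparation stage is assumed to have been carried out once beforehand.

    # Schematic multi-sample driver (illustration only).
    def multi_sample_update(states, I_old, I_new, N_old, N_new,
                            single_sample_update, fresh_chain):
        N_min = min(N_old, N_new)
        # Update stage: reuse the single-sample algorithm for the first N_min chains.
        for i in range(N_min):
            single_sample_update(states[i], I_old, I_new)
        # Completion stage: drop surplus chains, or append fresh chains on I_new.
        if N_new < N_old:
            del states[N_new:]
        else:
            states.extend(fresh_chain(I_new) for _ in range(N_new - N_old))
        return states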

Let T_{\mathsf{preparation}}^{\mathsf{multi}}, T_{\mathsf{update}}^{\mathsf{multi}} and T_{\mathsf{completion}}^{\mathsf{multi}} denote the running times of the corresponding stages. Note that the update stage of the multi-sample dynamic sampling algorithm repeats the update stage of the single-sample algorithm N_{\min} times. Also note that both \mathcal{I} and \mathcal{I}^{\prime} satisfy the Dobrushin-Shlosman condition. Combining (24), (27), (28), (29), and (30), we have

𝔼[T𝗎𝗉𝖽𝖺𝗍𝖾𝗆𝗎𝗅𝗍𝗂]=i=1Nmin𝔼[T𝗎𝗉𝖽𝖺𝗍𝖾𝗌𝗂𝗇𝗀𝗅𝖾,(i)]\displaystyle\mathbb{E}\left[{T_{\mathsf{update}}^{\mathsf{multi}}}\right]=\sum_{i=1}^{N_{\min}}\mathbb{E}\left[{T_{\mathsf{update}}^{\mathsf{single},(i)}}\right] =O(NminΔ2(L𝗀𝗋𝖺𝗉𝗁+L𝖧𝖺𝗆𝗂𝗅)log3n)\displaystyle=O(N_{\min}\Delta^{2}(L_{\mathsf{graph}}+L_{\mathsf{Hamil}})\log^{3}n)
(32) (by NminN(n))\displaystyle(\text{by }N_{\min}\leq N(n))\quad =O(N(n)Δ2(L𝗀𝗋𝖺𝗉𝗁+L𝖧𝖺𝗆𝗂𝗅)log3n)\displaystyle=O(N(n)\Delta^{2}(L_{\mathsf{graph}}+L_{\mathsf{Hamil}})\log^{3}n)

where T𝗎𝗉𝖽𝖺𝗍𝖾𝗌𝗂𝗇𝗀𝗅𝖾,(i)T_{\mathsf{update}}^{\mathsf{single},(i)} is the running time of the update stage of the Algorithm 2 that updates the ii-th sample.

In the completion stage, we either remove the surplus chains from the data structure or generate new chains and write them into the data structure. It is easy to see that the running time of the completion stage satisfies

𝔼[T𝖼𝗈𝗆𝗉𝗅𝖾𝗍𝗂𝗈𝗇𝗆𝗎𝗅𝗍𝗂]\displaystyle\mathbb{E}\left[{T_{\mathsf{completion}}^{\mathsf{multi}}}\right] =O(|N(n)N(n)|TmaxlogTmax)=O(n|N(n)N(n)|log2n)\displaystyle=O(\left|N(n)-N(n^{\prime})\right|T_{\max}\log T_{\max})=O(n\left|N(n)-N(n^{\prime})\right|\log^{2}n)
(by Lemma 6.19)\displaystyle(\text{by \lx@cref{creftypecap~refnum}{lemma-smooth-function}})\quad =O(|nn|N(n)log2n)=O(L𝗀𝗋𝖺𝗉𝗁N(n)log2n),\displaystyle=O(\left|n-n^{\prime}\right|N(n)\log^{2}n)=O(L_{\mathsf{graph}}N(n)\log^{2}n),

where T_{\max}=\max\{T,T^{\prime}\}=O(n\log n) since n^{\prime}=\Theta(n) and \epsilon(n^{\prime})\geq\frac{1}{\mathrm{poly}(n^{\prime})} (by L_{\mathsf{graph}}=o(n) and Lemma 6.17).

We make the following claim about the preparation stage.

Claim 6.20.

The expected running time of the preparation stage is

𝔼[T𝗉𝗋𝖾𝗉𝖺𝗋𝖺𝗍𝗂𝗈𝗇𝗆𝗎𝗅𝗍𝗂]=O(Δn+log2ni=1Nmin𝔼[|𝒫i|]),\displaystyle\mathbb{E}\left[{T_{\mathsf{preparation}}^{\mathsf{multi}}}\right]=O\left(\Delta n+\log^{2}n\sum_{i=1}^{N_{\min}}\mathbb{E}\left[{|\mathcal{P}_{i}|}\right]\right),

and the expected size of 𝒫i\mathcal{P}_{i} is at most 4TmaxL𝖧𝖺𝗆𝗂𝗅n\frac{4T_{\max}L_{\mathsf{Hamil}}}{n} for each 1iNmin1\leq i\leq N_{\min}.

By Claim 6.20, we have

𝔼[T𝗉𝗋𝖾𝗉𝖺𝗋𝖺𝗍𝗂𝗈𝗇𝗆𝗎𝗅𝗍𝗂]=O(Δn+N(n)L𝖧𝖺𝗆𝗂𝗅log3n).\displaystyle\mathbb{E}\left[{T_{\mathsf{preparation}}^{\mathsf{multi}}}\right]=O\left(\Delta n+N(n)L_{\mathsf{Hamil}}\log^{3}n\right).

By the linearity of expectation, the expected time cost of the algorithm is 𝔼[T𝗉𝗋𝖾𝗉𝖺𝗋𝖺𝗍𝗂𝗈𝗇𝗆𝗎𝗅𝗍𝗂]+𝔼[T𝗎𝗉𝖽𝖺𝗍𝖾𝗆𝗎𝗅𝗍𝗂]+𝔼[T𝖼𝗈𝗆𝗉𝗅𝖾𝗍𝗂𝗈𝗇𝗆𝗎𝗅𝗍𝗂]\mathbb{E}\left[{T_{\mathsf{preparation}}^{\mathsf{multi}}}\right]+\mathbb{E}\left[{T_{\mathsf{update}}^{\mathsf{multi}}}\right]+\mathbb{E}\left[{T_{\mathsf{completion}}^{\mathsf{multi}}}\right]. This proves the time cost.

7. Proofs for dynamic Gibbs sampling

7.1. Analysis of the couplings

We analyze the couplings in the dynamic Gibbs sampling algorithm. In Section 7.1.1, we analyze the coupling for the Hamiltonian update. In Section 7.1.2, we analyze the coupling for the graph update.

7.1.1. Proofs for the coupling for Hamiltonian update

In this section, we prove Lemma 6.5, Lemma 6.6, and Lemma 6.8.

The validity of the coupling (proof of Lemma 6.5)

We first prove that the distribution \nu^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(\cdot) in (10) is valid. We draw samples from \nu^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(\cdot) only if the coin flip comes up HEADS, which implies \mu_{v,\mathcal{I}}(x\mid\tau)>\mu_{v,\mathcal{I}^{\prime}}(x\mid\tau) for some x\in Q. Thus, the two distributions \mu_{v,\mathcal{I}}(\cdot\mid\tau) and \mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\tau) are not identical, and

xQmax{0,μv,(xτ)μv,(xτ)}>0.\displaystyle\sum_{x\in Q}\max\left\{0,\mu_{v,\mathcal{I}}(x\mid\tau)-\mu_{v,\mathcal{I}^{\prime}}(x\mid\tau)\right\}>0.

Hence, the denominator of νv,vτ()\nu^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(\cdot) is positive. Besides, since both μv,(τ)\mu_{v,\mathcal{I}}(\cdot\mid\tau) and μv,(τ)\mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\tau) are distributions over QQ, we have

xQmax{0,μv,(xτ)μv,(xτ)}=xQmax{0,μv,(xτ)μv,(xτ)}.\displaystyle\sum_{x\in Q}\max\left\{0,\mu_{v,\mathcal{I}^{\prime}}(x\mid\tau)-\mu_{v,\mathcal{I}}(x\mid\tau)\right\}=\sum_{x\in Q}\max\left\{0,\mu_{v,\mathcal{I}}(x\mid\tau)-\mu_{v,\mathcal{I}^{\prime}}(x\mid\tau)\right\}.

Thus we have \sum_{x\in Q}\nu_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\tau}(x)=1. Hence, \nu^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(\cdot) is a valid distribution.

We next prove the coupling Dv,vσ,τ(,)D_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau}(\cdot,\cdot) in Definition 6.4 is a valid coupling between μv,(τ)\mu_{v,\mathcal{I}}(\cdot\mid\tau) and μv,(τ)\mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\tau). If μv,(τ)\mu_{v,\mathcal{I}}(\cdot\mid\tau) and μv,(τ)\mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\tau) are identical, the result holds trivially. We may assume μv,(τ)\mu_{v,\mathcal{I}}(\cdot\mid\tau) and μv,(τ)\mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\tau) are not identical, thus the distribution νv,vτ()\nu^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(\cdot) is well-defined.

The coupling D_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau}(\cdot,\cdot) in Definition 6.4 returns a pair (c,c^{\prime})\in Q^{2}. It is easy to see that c follows the law \mu_{v,\mathcal{I}}(\cdot\mid\sigma). We prove that c^{\prime} follows the law \mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\tau). By the definition of D_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau}(\cdot,\cdot), c^{\prime}\in Q is generated by the following procedure (a code sketch follows the list):

  • sample aQa\in Q from the distribution μv,(τ)\mu_{v,\mathcal{I}}(\cdot\mid\tau);

  • sample bQb\in Q from the distribution νv,vτ\nu^{\tau}_{\mathcal{I}_{v},\mathcal{I}_{v}^{\prime}} defined in (10), set

    c={bwith probability pvt,vtτ(a)awith probability 1pvt,vtτ(a).\displaystyle c^{\prime}=\begin{cases}b&\text{with probability }p^{\tau}_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}(a)\\ a&\text{with probability }1-p^{\tau}_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}(a).\end{cases}
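For concreteness, a minimal sketch of this resampling step (illustration only; the conditional marginals \mu_{v,\mathcal{I}}(\cdot\mid\tau) and \mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\tau) are assumed to be given explicitly as dictionaries over Q):

    import random

    def resample_c_prime(mu, mu_prime):
        """One draw of c' in the coupling of Definition 6.4.
        mu, mu_prime: dicts mapping each spin in Q to mu_{v,I}(.|tau) and mu_{v,I'}(.|tau)."""
        Q = list(mu)
        # p(y) = max{0, (mu(y) - mu'(y)) / mu(y)} (and 0 when mu(y) = 0), as in (9)
        p = {y: 0.0 if mu[y] == 0 else max(0.0, (mu[y] - mu_prime[y]) / mu[y]) for y in Q}
        # nu(x) proportional to max{0, mu'(x) - mu(x)}, as in (10)
        surplus = {x: max(0.0, mu_prime[x] - mu[x]) for x in Q}
        Z = sum(surplus.values())
        a = random.choices(Q, weights=[mu[x] for x in Q])[0]             # a ~ mu(.|tau)
        if random.random() < p[a] and Z > 0:
            return random.choices(Q, weights=[surplus[x] for x in Q])[0]  # b ~ nu
        return a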

Note that aa follows the law μv,(τ)\mu_{v,\mathcal{I}}(\cdot\mid\tau). We have for each xQx\in Q,

Pr[c=x]\displaystyle\Pr[c^{\prime}=x] =Pr[a=x](1pv,vτ(x))+yQPr[a=y]pv,vτ(y)νv,vτ(x)\displaystyle=\Pr[a=x]\cdot(1-p^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(x))+\sum_{y\in Q}\Pr[a=y]\cdot p^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(y)\cdot\nu_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\tau}(x)
=μv,(xτ)(1pv,vτ(x))+νv,vτ(x)yQμv,(yτ)pv,vτ(y).\displaystyle=\mu_{v,\mathcal{I}}(x\mid\tau)\cdot(1-p^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(x))+\nu_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\tau}(x)\sum_{y\in Q}\mu_{v,\mathcal{I}}(y\mid\tau)\cdot p^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(y).

By the definition of pv,vτ(y)p^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(y) in (9), we have

yQ,μv,(yτ)pv,vτ(y)={0if μv,(yτ)μv,(yτ)μv,(yτ)μv,(yτ)otherwise.\displaystyle\forall y\in Q,\quad\mu_{v,\mathcal{I}}(y\mid\tau)\cdot p^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(y)=\begin{cases}0&\text{if }\mu_{v,\mathcal{I}}(y\mid\tau)\leq\mu_{v,\mathcal{I}^{\prime}}(y\mid\tau)\\ \mu_{v,\mathcal{I}}(y\mid\tau)-\mu_{v,\mathcal{I}^{\prime}}(y\mid\tau)&\text{otherwise}.\end{cases}

This implies μv,(yτ)pv,vτ(y)=max{0,μv,(yτ)μv,(yτ)}\mu_{v,\mathcal{I}}(y\mid\tau)\cdot p^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(y)=\max\left\{0,\mu_{v,\mathcal{I}}(y\mid\tau)-\mu_{v,\mathcal{I}^{\prime}}(y\mid\tau)\right\}. We have

νv,vτ(x)yQμv,(yτ)pv,vτ(y)\displaystyle\nu_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\tau}(x)\sum_{y\in Q}\mu_{v,\mathcal{I}}(y\mid\tau)\cdot p^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(y)
=\displaystyle= max{0,μv,(xτ)μv,(xτ)}yQmax{0,μv,(yτ)μv,(yτ)}yQmax{0,μv,(yτ)μv,(yτ)}\displaystyle\,\frac{\max\left\{0,\mu_{v,\mathcal{I}^{\prime}}(x\mid\tau)-\mu_{v,\mathcal{I}}(x\mid\tau)\right\}}{\sum_{y\in Q}\max\left\{0,\mu_{v,\mathcal{I}}(y\mid\tau)-\mu_{v,\mathcal{I}^{\prime}}(y\mid\tau)\right\}}\sum_{y\in Q}\max\left\{0,\mu_{v,\mathcal{I}}(y\mid\tau)-\mu_{v,\mathcal{I}^{\prime}}(y\mid\tau)\right\}
=\displaystyle= max{0,μv,(xτ)μv,(xτ)}.\displaystyle\,\max\left\{0,\mu_{v,\mathcal{I}^{\prime}}(x\mid\tau)-\mu_{v,\mathcal{I}}(x\mid\tau)\right\}.

Hence, we have

Pr[c=x]=μv,(xτ)(1pv,vτ(x))+max{0,μv,(xτ)μv,(xτ)}.\displaystyle\Pr[c^{\prime}=x]=\mu_{v,\mathcal{I}}(x\mid\tau)\cdot(1-p^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(x))+\max\left\{0,\mu_{v,\mathcal{I}^{\prime}}(x\mid\tau)-\mu_{v,\mathcal{I}}(x\mid\tau)\right\}.

Suppose μv,(xτ)μv,(xτ)\mu_{v,\mathcal{I}}(x\mid\tau)\leq\mu_{v,\mathcal{I}^{\prime}}(x\mid\tau), then we have pv,vτ(x)=0p^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(x)=0. In this case, we have

Pr[c=x]=μv,(xτ)+μv,(xτ)μv,(xτ)=μv,(xτ).\displaystyle\Pr[c^{\prime}=x]=\mu_{v,\mathcal{I}}(x\mid\tau)+\mu_{v,\mathcal{I}^{\prime}}(x\mid\tau)-\mu_{v,\mathcal{I}}(x\mid\tau)=\mu_{v,\mathcal{I}^{\prime}}(x\mid\tau).

Suppose μv,(xτ)>μv,(xτ)\mu_{v,\mathcal{I}}(x\mid\tau)>\mu_{v,\mathcal{I}^{\prime}}(x\mid\tau), then we have

Pr[c=x]\displaystyle\Pr[c^{\prime}=x] =μv,(xτ)(1pv,vτ(x))=μv,(xτ).\displaystyle=\mu_{v,\mathcal{I}}(x\mid\tau)\cdot(1-p^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(x))=\mu_{v,\mathcal{I}^{\prime}}(x\mid\tau).

Combining these two cases proves that cc^{\prime} follows the law μv,(τ)\mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\tau). ∎

The upper bound of the probability pv,v()p_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\cdot}(\cdot) (proof of Lemma 6.6)

It suffices to prove that for any two instances =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) and =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V,E,Q,\Phi^{\prime}) of MRF model, and any vV,cQv\in V,c\in Q and σQΓG(v)\sigma\in Q^{\Gamma_{G}(v)},

(33) μv,(cσ)μv,(cσ)2μv,(cσ)(ϕvϕv1+e={u,v}Eϕeϕe1).\displaystyle\mu_{v,\mathcal{I}}(c\mid\sigma)-\mu_{v,\mathcal{I}^{\prime}}(c\mid\sigma)\leq 2\mu_{v,\mathcal{I}}(c\mid\sigma)\left(\|\phi_{v}-\phi^{\prime}_{v}\|_{1}+\sum_{e=\{u,v\}\in E}\|\phi_{e}-\phi^{\prime}_{e}\|_{1}\right).

Note that if μv,(cσ)=0\mu_{v,\mathcal{I}}(c\mid\sigma)=0, then pv,vτ(c)=0p_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\tau}(c)=0; otherwise pv,vτ(c)=max{0,μv,(cσ)μv,(cσ)μv,(cσ)}p_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\tau}(c)=\max\left\{0,\frac{\mu_{v,\mathcal{I}}(c\mid\sigma)-\mu_{v,\mathcal{I}^{\prime}}(c\mid\sigma)}{\mu_{v,\mathcal{I}}(c\mid\sigma)}\right\}. Hence, inequality (33) proves the lemma.

We now prove (33). Suppose \mu_{v,\mathcal{I}}(c\mid\sigma)=0. Then the LHS of (33) is at most 0, while the RHS is at least 0, so the inequality holds.

We next assume μv,(cσ)>0\mu_{v,\mathcal{I}}(c\mid\sigma)>0. Then it suffices to prove

μv,(cσ)μv,(cσ)μv,(cσ)=1μv,(cσ)μv,(cσ)2(ϕvϕv1+e={u,v}Eϕeϕe1).\displaystyle\frac{\mu_{v,\mathcal{I}}(c\mid\sigma)-\mu_{v,\mathcal{I}^{\prime}}(c\mid\sigma)}{\mu_{v,\mathcal{I}}(c\mid\sigma)}=1-\frac{\mu_{v,\mathcal{I}^{\prime}}(c\mid\sigma)}{\mu_{v,\mathcal{I}}(c\mid\sigma)}\leq 2\left(\|\phi_{v}-\phi^{\prime}_{v}\|_{1}+\sum_{e=\{u,v\}\in E}\|\phi_{e}-\phi^{\prime}_{e}\|_{1}\right).

By the definitions of ϕv,ϕv,ϕe,ϕe\phi_{v},\phi^{\prime}_{v},\phi_{e},\phi^{\prime}_{e}, we can write the ratio as

μv,(cσ)μv,(cσ)=exp(ϕv(c)+uΓvϕuv(σu,c))exp(ϕv(c)+uΓvϕuv(σu,c))aQexp(ϕv(a)+uΓvϕuv(σu,a))aQexp(ϕv(a)+uΓvϕuv(σu,a)),\displaystyle\frac{\mu_{v,\mathcal{I}^{\prime}}(c\mid\sigma)}{\mu_{v,\mathcal{I}}(c\mid\sigma)}=\frac{\exp\left(\phi^{\prime}_{v}(c)+\sum_{u\in\Gamma_{v}}\phi^{\prime}_{uv}(\sigma_{u},c)\right)}{\exp\left(\phi_{v}(c)+\sum_{u\in\Gamma_{v}}\phi_{uv}(\sigma_{u},c)\right)}\frac{\sum_{a\in Q}\exp\left(\phi_{v}(a)+\sum_{u\in\Gamma_{v}}\phi_{uv}(\sigma_{u},a)\right)}{\sum_{a\in Q}\exp\left(\phi^{\prime}_{v}(a)+\sum_{u\in\Gamma_{v}}\phi^{\prime}_{uv}(\sigma_{u},a)\right)},

where Γv\Gamma_{v} denotes the neighborhood of vv in GG. Next, we assume that

(34) cQ:ϕv(c)=ϕv(c)=uΓv,c,cQ:ϕuv(c,c)=ϕuv(c,c)=.\begin{split}\forall c\in Q:\quad&\phi_{v}(c)=-\infty\quad\Longleftrightarrow\quad\phi^{\prime}_{v}(c)=-\infty\\ \forall u\in\Gamma_{v},c,c^{\prime}\in Q:\quad&\phi_{uv}(c,c^{\prime})=-\infty\quad\Longleftrightarrow\quad\phi^{\prime}_{uv}(c,c^{\prime})=-\infty.\end{split}

Otherwise, the RHS of (33) is \infty and (33) holds trivially. Thus we can define the set

Q{aQϕv(a)+uΓvϕuv(σu,a)}={aQϕv(a)+uΓvϕuv(σu,a)}.\displaystyle Q^{\prime}\triangleq\left\{a\in Q\mid\phi_{v}(a)+\sum_{u\in\Gamma_{v}}\phi_{uv}(\sigma_{u},a)\neq-\infty\right\}=\left\{a\in Q\mid\phi^{\prime}_{v}(a)+\sum_{u\in\Gamma_{v}}\phi^{\prime}_{uv}(\sigma_{u},a)\neq-\infty\right\}.

Since exp()=0\exp(-\infty)=0, we have

μv,(cσ)μv,(cσ)=exp(ϕv(c)+uΓvϕuv(σu,c))exp(ϕv(c)+uΓvϕuv(σu,c))aQexp(ϕv(a)+uΓvϕuv(σu,a))aQexp(ϕv(a)+uΓvϕuv(σu,a)).\displaystyle\frac{\mu_{v,\mathcal{I}^{\prime}}(c\mid\sigma)}{\mu_{v,\mathcal{I}}(c\mid\sigma)}=\frac{\exp\left(\phi^{\prime}_{v}(c)+\sum_{u\in\Gamma_{v}}\phi^{\prime}_{uv}(\sigma_{u},c)\right)}{\exp\left(\phi_{v}(c)+\sum_{u\in\Gamma_{v}}\phi_{uv}(\sigma_{u},c)\right)}\frac{\sum_{a\in Q^{\prime}}\exp\left(\phi_{v}(a)+\sum_{u\in\Gamma_{v}}\phi_{uv}(\sigma_{u},a)\right)}{\sum_{a\in Q^{\prime}}\exp\left(\phi^{\prime}_{v}(a)+\sum_{u\in\Gamma_{v}}\phi^{\prime}_{uv}(\sigma_{u},a)\right)}.

We then show that

(35) aQ:exp(ϕv(a)+uΓvϕuv(σu,a))exp(ϕv(a)+uΓvϕuv(σu,a))exp(ϕvϕv1e={u,v}Eϕeϕe1)aQ:exp(ϕv(a)+uΓvϕuv(σu,a))exp(ϕv(a)+uΓvϕuv(σu,a))exp(ϕvϕv1e={u,v}Eϕeϕe1)\begin{split}\forall a\in Q^{\prime}:\quad&\frac{\exp\left(\phi_{v}(a)+\sum_{u\in\Gamma_{v}}\phi_{uv}(\sigma_{u},a)\right)}{\exp\left(\phi^{\prime}_{v}(a)+\sum_{u\in\Gamma_{v}}\phi^{\prime}_{uv}(\sigma_{u},a)\right)}\geq\exp\left(-\|\phi_{v}-\phi^{\prime}_{v}\|_{1}-\sum_{e=\{u,v\}\in E}\|\phi_{e}-\phi^{\prime}_{e}\|_{1}\right)\\ \forall a\in Q^{\prime}:\quad&\frac{\exp\left(\phi^{\prime}_{v}(a)+\sum_{u\in\Gamma_{v}}\phi^{\prime}_{uv}(\sigma_{u},a)\right)}{\exp\left(\phi_{v}(a)+\sum_{u\in\Gamma_{v}}\phi_{uv}(\sigma_{u},a)\right)}\geq\exp\left(-\|\phi_{v}-\phi^{\prime}_{v}\|_{1}-\sum_{e=\{u,v\}\in E}\|\phi_{e}-\phi^{\prime}_{e}\|_{1}\right)\end{split}

We first use (35) to prove (33). Since \mu_{v,\mathcal{I}}(c\mid\sigma)>0, we have c\in Q^{\prime}. By (35), we have

1μv,(cσ)μv,(cσ)\displaystyle 1-\frac{\mu_{v,\mathcal{I}^{\prime}}(c\mid\sigma)}{\mu_{v,\mathcal{I}}(c\mid\sigma)} 1exp(2ϕvϕv12e={u,v}Eϕeϕe1)\displaystyle\leq 1-\exp\left(-2\|\phi_{v}-\phi^{\prime}_{v}\|_{1}-2\sum_{e=\{u,v\}\in E}\|\phi_{e}-\phi^{\prime}_{e}\|_{1}\right)
2(ϕvϕv1+e={u,v}Eϕeϕe1).\displaystyle\leq 2\left(\|\phi_{v}-\phi^{\prime}_{v}\|_{1}+\sum_{e=\{u,v\}\in E}\|\phi_{e}-\phi^{\prime}_{e}\|_{1}\right).

This proves the lemma.

We now prove (35). For any aQa\in Q^{\prime}, it holds that

exp(ϕv(a)+uΓvϕuv(σu,a))exp(ϕv(a)+uΓvϕuv(σu,a))=exp(ϕv(a)ϕv(a)+uΓvϕuv(σu,a)uΓvϕuv(σu,a)).\displaystyle\frac{\exp\left(\phi_{v}(a)+\sum_{u\in\Gamma_{v}}\phi_{uv}(\sigma_{u},a)\right)}{\exp\left(\phi^{\prime}_{v}(a)+\sum_{u\in\Gamma_{v}}\phi^{\prime}_{uv}(\sigma_{u},a)\right)}=\exp\left(\phi_{v}(a)-\phi^{\prime}_{v}(a)+\sum_{u\in\Gamma_{v}}\phi_{uv}(\sigma_{u},a)-\sum_{u\in\Gamma_{v}}\phi^{\prime}_{uv}(\sigma_{u},a)\right).

Then (35) holds because

ϕv(a)ϕv(a)\displaystyle\phi_{v}(a)-\phi^{\prime}_{v}(a) cQ|ϕv(c)ϕv(c)|=ϕvϕv1;\displaystyle\geq-\sum_{c\in Q}|\phi_{v}(c)-\phi^{\prime}_{v}(c)|=-\|\phi_{v}-\phi^{\prime}_{v}\|_{1};
uΓvϕuv(σu,a)uΓvϕuv(σu,a)\displaystyle\sum_{u\in\Gamma_{v}}\phi_{uv}(\sigma_{u},a)-\sum_{u\in\Gamma_{v}}\phi^{\prime}_{uv}(\sigma_{u},a) e={u,v}Ec,cQ|ϕe(c,c)ϕe(c,c)|=e={u,v}Eϕeϕe1.\displaystyle\geq-\sum_{e=\{u,v\}\in E}\sum_{c,c^{\prime}\in Q}|\phi_{e}(c,c^{\prime})-\phi^{\prime}_{e}(c,c^{\prime})|=-\sum_{e=\{u,v\}\in E}\|\phi_{e}-\phi^{\prime}_{e}\|_{1}.

The lower bound of exp(ϕv(a)+uΓvϕuv(σu,a))exp(ϕv(a)+uΓvϕuv(σu,a))\frac{\exp\left(\phi^{\prime}_{v}(a)+\sum_{u\in\Gamma_{v}}\phi^{\prime}_{uv}(\sigma_{u},a)\right)}{\exp\left(\phi_{v}(a)+\sum_{u\in\Gamma_{v}}\phi_{uv}(\sigma_{u},a)\right)} can be proved in a similar way. ∎

The cost of the coupling for UpdateHamiltonian (proof of Lemma 6.8)

By the definition of the indicator random variable γt\gamma_{t} in (17), we have

Pr[γt=1𝒟t1]\displaystyle\Pr[\gamma_{t}=1\mid\mathcal{D}_{t-1}] Pr[t𝒫𝒟t1]+Pr[vtΓG+(𝒟t1)𝒟t1]\displaystyle\leq\Pr\left[t\in\mathcal{P}\mid\mathcal{D}_{t-1}\right]+\Pr\left[v_{t}\in\Gamma_{G}^{+}(\mathcal{D}_{t-1})\mid\mathcal{D}_{t-1}\right]
(Δ+1)|𝒟t1|n+vVpv𝗎𝗉n.\displaystyle\leq\frac{(\Delta+1)|\mathcal{D}_{t-1}|}{n}+\sum_{v\in V}\frac{p^{\mathsf{up}}_{v}}{n}.

By the definition of pv𝗎𝗉p^{\mathsf{up}}_{v} in  (12) and dHamil(,)=vVϕvϕv1+eEϕeϕe1Ld_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}^{\prime})=\sum_{v\in V}\left\|\phi_{v}-\phi^{\prime}_{v}\right\|_{1}+\sum_{e\in E}\left\|\phi_{e}-\phi^{\prime}_{e}\right\|_{1}\leq L, we have

Pr[γt=1𝒟t1](Δ+1)|𝒟t1|n+4Ln.\displaystyle\Pr[\gamma_{t}=1\mid\mathcal{D}_{t-1}]\leq\frac{(\Delta+1)|\mathcal{D}_{t-1}|}{n}+\frac{4L}{n}.

By the definition of R𝖧𝖺𝗆𝗂𝗅t=1TγtR_{\mathsf{Hamil}}\triangleq\sum_{t=1}^{T}\gamma_{t}, we have

(36) 𝔼[R𝖧𝖺𝗆𝗂𝗅]=t=1T𝔼[γt]=t=1T𝔼[𝔼[γt𝒟t1]]t=1T((Δ+1)𝔼[|𝒟t1|]n+4Ln).\displaystyle\mathbb{E}\left[{R_{\mathsf{Hamil}}}\right]=\sum_{t=1}^{T}\mathbb{E}\left[{\gamma_{t}}\right]=\sum_{t=1}^{T}{\mathbb{E}\left[{\mathbb{E}\left[{\gamma_{t}\mid\mathcal{D}_{t-1}}\right]}\right]}\leq\sum_{t=1}^{T}\left(\frac{(\Delta+1)\mathbb{E}\left[{|\mathcal{D}_{t-1}|}\right]}{n}+\frac{4L}{n}\right).

Next, we bound the expectation \mathbb{E}\left[{|\mathcal{D}_{t}|}\right]. Recall that the one-step local coupling for the Hamiltonian update (Definition 6.3) is implemented as follows. We first construct the random set \mathcal{P}\subseteq[T] in (13). In the t-th step, where 1\leq t\leq T, given any \bm{X}_{t-1} and \bm{Y}_{t-1}, the pair (\bm{X}_{t},\bm{Y}_{t}) is generated as follows (a schematic step is sketched after the list).

  • Let X(u)=Xt1(u)X^{\prime}(u)=X_{t-1}(u) and Y(u)=Yt1(u)Y^{\prime}(u)=Y_{t-1}(u) for all uV{vt}u\in V\setminus\{v_{t}\}, sample (X(vt),Y(vt))Q2(X^{\prime}(v_{t}),Y^{\prime}(v_{t}))\in Q^{2} jointly from the optimal coupling D𝗈𝗉𝗍,vtσ,τD^{\sigma,\tau}_{\mathsf{opt},\mathcal{I}_{v_{t}}} of the marginal distributions μvt,(σ)\mu_{v_{t},\mathcal{I}}(\cdot\mid\sigma) and μvt,(τ)\mu_{v_{t},\mathcal{I}}(\cdot\mid\tau), where σ=Xt1(ΓG(vt))\sigma=X_{t-1}(\Gamma_{G}(v_{t})) and τ=Yt1(ΓG(vt))\tau=Y_{t-1}(\Gamma_{G}(v_{t})).

  • Let 𝑿t=𝑿\bm{X}_{t}=\bm{X}^{\prime} and 𝒀t=𝒀\bm{Y}_{t}=\bm{Y}^{\prime}. If t𝒫t\in\mathcal{P}, update the value of Yt(vt)Y_{t}(v_{t}) using (14).
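A schematic rendering of one such coupling step (illustration only; optimal_coupling_draw and rule_14 are hypothetical stand-ins for the one-step optimal coupling on \mathcal{I} and for the adjustment rule (14)):

    # One step of the local coupling for the Hamiltonian update (Definition 6.3).
    def hamiltonian_coupling_step(X, Y, v_t, t, P, optimal_coupling_draw, rule_14):
        sigma = dict(X)                                  # X_{t-1}
        tau = dict(Y)                                    # Y_{t-1}
        cX, cY = optimal_coupling_draw(v_t, sigma, tau)  # (X'(v_t), Y'(v_t))
        X[v_t], Y[v_t] = cX, cY
        if t in P:                                       # extra adjustment only at steps in P
            Y[v_t] = rule_14(v_t, tau, cY)
        return X, Y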

Hence, for any vertex v\in V, X_{t}(v)\neq Y_{t}(v) only if one of the following two events occurs: (1) X^{\prime}(v)\neq Y^{\prime}(v); (2) v_{t}=v and t\in\mathcal{P}. Then for any v\in V, we have

Pr[Xt(v)Yt(v)𝑿t1,𝒀t1]\displaystyle\Pr[X_{t}(v)\neq Y_{t}(v)\mid\bm{X}_{t-1},\bm{Y}_{t-1}] Pr[X(v)Y(v)𝑿t1,𝒀t1]+Pr[v=vtt𝒫𝑿t1,𝒀t1]\displaystyle\leq\Pr[X^{\prime}(v)\neq Y^{\prime}(v)\mid\bm{X}_{t-1},\bm{Y}_{t-1}]+\Pr[v=v_{t}\land t\in\mathcal{P}\mid\bm{X}_{t-1},\bm{Y}_{t-1}]
(37) =Pr[X(v)Y(v)𝑿t1,𝒀t1]+Pr[v=vtt𝒫],\displaystyle=\Pr[X^{\prime}(v)\neq Y^{\prime}(v)\mid\bm{X}_{t-1},\bm{Y}_{t-1}]+\Pr[v=v_{t}\land t\in\mathcal{P}],

where the equation holds because v=vtt𝒫v=v_{t}\land t\in\mathcal{P} is independent of 𝑿t1,𝒀t1\bm{X}_{t-1},\bm{Y}_{t-1}. Given 𝑿t1,𝒀t1\bm{X}_{t-1},\bm{Y}_{t-1}, the random pair 𝑿,𝒀\bm{X}^{\prime},\bm{Y}^{\prime} are obtained by the one-step optimal coupling for Gibbs sampling on instance \mathcal{I} (Definition 4.2). Since \mathcal{I} satisfies the Dobrushin-Shlosman condition with constant 0<δ<10<\delta<1, by Proposition 4.3, we have

(38) 𝔼[H(𝑿,𝒀)𝑿t1,𝒀t1](1δn)H(𝑿t1,𝒀t1)=(1δn)|𝒟t1|.\displaystyle\mathbb{E}\left[{H(\bm{X}^{\prime},\bm{Y}^{\prime})\mid\bm{X}_{t-1},\bm{Y}_{t-1}}\right]\leq\left(1-\frac{\delta}{n}\right)H(\bm{X}_{t-1},\bm{Y}_{t-1})=\left(1-\frac{\delta}{n}\right)|\mathcal{D}_{t-1}|.

where H(\bm{X},\bm{Y})=|\{v\in V\mid X(v)\neq Y(v)\}| denotes the Hamming distance. Combining (37) and (38),

𝔼[|𝒟t|𝒟t1]\displaystyle\mathbb{E}\left[{|\mathcal{D}_{t}|\mid\mathcal{D}_{t-1}}\right] vVPr[X(v)Y(v)𝒟t1]+vVPr[t𝒫v=vt𝒟t1]\displaystyle\leq\sum_{v\in V}\Pr[X^{\prime}(v)\neq Y^{\prime}(v)\mid\mathcal{D}_{t-1}]+\sum_{v\in V}\Pr[t\in\mathcal{P}\land v=v_{t}\mid\mathcal{D}_{t-1}]
(1δn)|𝒟t1|+vVpv𝗎𝗉n\displaystyle\leq\left(1-\frac{\delta}{n}\right)|\mathcal{D}_{t-1}|+\sum_{v\in V}\frac{p^{\mathsf{up}}_{v}}{n}
(by(12))\displaystyle(\text{by}~\eqref{eq-def-Ising-up})\quad (1δn)|𝒟t1|+2nvV(ϕvϕv1+e={u,v}Eϕeϕe1)\displaystyle\leq\left(1-\frac{\delta}{n}\right)|\mathcal{D}_{t-1}|+\frac{2}{n}\sum_{v\in V}\left(\|\phi_{v}-\phi^{\prime}_{v}\|_{1}+\sum_{e=\{u,v\}\in E}\|\phi_{e}-\phi^{\prime}_{e}\|_{1}\right)
(bydHamil(,)L)\displaystyle(\text{by}~d_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}^{\prime})\leq L)\quad (1δn)|𝒟t1|+4Ln.\displaystyle\leq\left(1-\frac{\delta}{n}\right)|\mathcal{D}_{t-1}|+\frac{4L}{n}.

Thus, we have

𝔼[|𝒟t|](1δn)𝔼[|𝒟t1|]+4Ln.\displaystyle\mathbb{E}\left[{|\mathcal{D}_{t}|}\right]\leq\left(1-\frac{\delta}{n}\right)\mathbb{E}\left[{|\mathcal{D}_{t-1}|}\right]+\frac{4L}{n}.

Note that |\mathcal{D}_{0}|=0. Unrolling the recursion gives \mathbb{E}\left[{|\mathcal{D}_{t}|}\right]\leq\frac{4L}{n}\sum_{i=0}^{t-1}\left(1-\frac{\delta}{n}\right)^{i}\leq\frac{4L}{\delta}. In particular, this implies

(39) 𝔼[|𝒟t|]8Lδ.\displaystyle\mathbb{E}\left[{|\mathcal{D}_{t}|}\right]\leq\frac{8L}{\delta}.

Thus, by (36), we have

𝔼[R𝖧𝖺𝗆𝗂𝗅]20ΔTLδn=O(ΔTLδn).\displaystyle\mathbb{E}\left[{R_{\mathsf{Hamil}}}\right]\leq\frac{20\Delta TL}{\delta n}=O\left(\frac{\Delta TL}{\delta n}\right).

7.1.2. Proofs for the coupling for graph update

In this section, we prove Lemma 6.11.

Cost of the coupling for UpdateEdge (Proof of Lemma 6.11)

By the definition of R𝗀𝗋𝖺𝗉𝗁R_{\mathsf{graph}} in (23) and the linearity of the expectation, we have

𝔼[R𝗀𝗋𝖺𝗉𝗁]\displaystyle\mathbb{E}\left[{R_{\mathsf{graph}}}\right] =t=1T𝔼[γt]=t=1T𝔼[𝔼[γt𝒟t1]].\displaystyle=\sum_{t=1}^{T}\mathbb{E}\left[{\gamma_{t}}\right]=\sum_{t=1}^{T}\mathbb{E}\left[{\mathbb{E}\left[{\gamma_{t}\mid\mathcal{D}_{t-1}}\right]}\right].

Recall \gamma_{t}=\mathbf{1}\left[v_{t}\in\mathcal{S}\cup\Gamma^{+}_{G}(\mathcal{D}_{t-1})\right] and that v_{t}\in V is chosen uniformly at random given \mathcal{D}_{t-1}. Note that |\Gamma^{+}_{G}(\mathcal{D}_{t-1})|\leq(\Delta+1)|\mathcal{D}_{t-1}| and |\mathcal{S}|\leq 2|E\oplus E^{\prime}|\leq 2L. We have

(40) 𝔼[R𝗀𝗋𝖺𝗉𝗁]t=1T𝔼[(Δ+1)|𝒟t1|+2Ln]=(Δ+1)nt=1T𝔼[|𝒟t1|]+2LTn.\displaystyle\mathbb{E}\left[{R_{\mathsf{graph}}}\right]\leq\sum_{t=1}^{T}\mathbb{E}\left[{\frac{(\Delta+1)|\mathcal{D}_{t-1}|+2L}{n}}\right]=\frac{(\Delta+1)}{n}\sum_{t=1}^{T}\mathbb{E}\left[{|\mathcal{D}_{t-1}|}\right]+\frac{2LT}{n}.

Since \mathcal{I}^{\prime} satisfies the Dobrushin-Shlosman condition (3.1) with constant \delta>0, we claim

(41)  0tT:𝔼[|𝒟t|]8Lδ.\displaystyle\forall\,0\leq t\leq T:\quad\mathbb{E}\left[{|\mathcal{D}_{t}|}\right]\leq\frac{8L}{\delta}.

Combining (40) and (41), we have

𝔼[R𝗀𝗋𝖺𝗉𝗁]18ΔLTδn=O(ΔLTn).\displaystyle\mathbb{E}\left[{R_{\mathsf{graph}}}\right]\leq\frac{18\Delta LT}{\delta n}=O\left(\frac{\Delta LT}{n}\right).

This proves the lemma.

We now prove (41). Let (𝑿t,𝒀t)t0(\bm{X}_{t},\bm{Y}_{t})_{t\geq 0} be the one-step local coupling for updating edges (Definition 6.9). We claim the following result

(42) σ,τΩ:𝔼[H(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ](1δn)H(σ,τ)+4Ln,\displaystyle\forall\,\sigma,\tau\in\Omega:\qquad\mathbb{E}\left[{\,H(\bm{X}_{t},\bm{Y}_{t})\mid\bm{X}_{t-1}=\sigma\land\bm{Y}_{t-1}=\tau\,}\right]\leq\left(1-\frac{\delta}{n}\right)\cdot H(\sigma,\tau)+\frac{4L}{n},

where H(σ,τ)=|{vVσ(v)τ(v)}|H(\sigma,\tau)=|\{v\in V\mid\sigma(v)\neq\tau(v)\}| denotes the Hamming distance. Assume (42) holds. Taking expectation over 𝑿t1\bm{X}_{t-1} and 𝒀t1\bm{Y}_{t-1}, we have

(43) 𝔼[H(𝑿t,𝒀t)](1δn)𝔼[H(𝑿t1,𝒀t1)]+4Ln.\displaystyle\mathbb{E}\left[{H(\bm{X}_{t},\bm{Y}_{t})}\right]\leq\left(1-\frac{\delta}{n}\right)\mathbb{E}\left[{H(\bm{X}_{t-1},\bm{Y}_{t-1})}\right]+\frac{4L}{n}.

Since \bm{X}_{0}=\bm{Y}_{0}, we have

(44) H(𝑿0,𝒀0)=0.\displaystyle H(\bm{X}_{0},\bm{Y}_{0})=0.

Combining (43) with (44) implies

(45)  0tT:𝔼[|𝒟t|]=𝔼[H(𝑿t,𝒀t)]8Lδ.\displaystyle\forall\,0\leq t\leq T:\quad\mathbb{E}\left[{|\mathcal{D}_{t}|}\right]=\mathbb{E}\left[{H(\bm{X}_{t},\bm{Y}_{t})}\right]\leq\frac{8L}{\delta}.

This proves the claim in (41).

We finish the proof by proving the claim in (42). The main idea is to compare the one-step local coupling for updating edges (Definition 6.9) with the one-step optimal coupling for Gibbs sampling on instance \mathcal{I}^{\prime} (Definition 4.2). Let (𝑿t,𝒀t)t0(\bm{X}^{\prime}_{t},\bm{Y}^{\prime}_{t})_{t\geq 0} be the coupling for Gibbs sampling on \mathcal{I}^{\prime}. Since \mathcal{I}^{\prime} satisfies Dobrushin-Shlosman condition, by Proposition 4.3, we have

(46) σ,τΩ=QV:𝔼[H(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ](1δn)H(σ,τ).\displaystyle\forall\,\sigma,\tau\in\Omega=Q^{V}:\quad\mathbb{E}\left[{\,H(\bm{X}^{\prime}_{t},\bm{Y}^{\prime}_{t})\mid\bm{X}^{\prime}_{t-1}=\sigma\land\bm{Y}^{\prime}_{t-1}=\tau\,}\right]\leq\left(1-\frac{\delta}{n}\right)\cdot H(\sigma,\tau).

According to the coupling, we can rewrite the expectation in (46) as follows:

(47) 𝔼[H(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ]=1nvV𝔼[H(σvCvX,τvCvY)],\displaystyle\mathbb{E}\left[{H(\bm{X}^{\prime}_{t},\bm{Y}^{\prime}_{t})\mid\bm{X}^{\prime}_{t-1}=\sigma\land\bm{Y}^{\prime}_{t-1}=\tau}\right]=\frac{1}{n}\sum_{v\in V}\mathbb{E}\left[{H\left(\sigma^{v\leftarrow C_{v}^{X^{\prime}}},\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\right)}\right],

where (CvX,CvY)D𝗈𝗉𝗍,vσ,τ(C^{X^{\prime}}_{v},C^{Y^{\prime}}_{v})\sim D^{\sigma,\tau}_{\mathsf{opt},\mathcal{I}^{\prime}_{v}}, D𝗈𝗉𝗍,vσ,τD^{\sigma,\tau}_{\mathsf{opt},\mathcal{I}^{\prime}_{v}} is the optimal coupling between μv,(σ)\mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\sigma) and μv,(τ)\mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\tau), and the configuration σvCvXQV\sigma^{v\leftarrow C_{v}^{X^{\prime}}}\in Q^{V} is defined as

σvCvX(u){CvXif u=vσ(u)if uv\displaystyle\sigma^{v\leftarrow C_{v}^{X^{\prime}}}(u)\triangleq\begin{cases}C^{X^{\prime}}_{v}&\text{if }u=v\\ \sigma(u)&\text{if }u\neq v\end{cases}

and the configuration τvCvYQV\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\in Q^{V} is defined in a similar way.

Similarly, we can rewrite the expectation in (42) as follows:

(48) 𝔼[H(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ]=1nvV𝔼[H(σvCvX,τvCvY)],\displaystyle\mathbb{E}\left[{H(\bm{X}_{t},\bm{Y}_{t})\mid\bm{X}_{t-1}=\sigma\land\bm{Y}_{t-1}=\tau}\right]=\frac{1}{n}\sum_{v\in V}\mathbb{E}\left[{H\left(\sigma^{v\leftarrow C_{v}^{X}},\tau^{v\leftarrow C_{v}^{Y}}\right)}\right],

where (C^{X}_{v},C^{Y}_{v})\sim D_{\mathcal{I}_{v},\mathcal{I}_{v}^{\prime}}^{\sigma,\tau}, and D_{\mathcal{I}_{v},\mathcal{I}_{v}^{\prime}}^{\sigma,\tau} is the local coupling defined in (21).

The following two properties hold for (47) and (48).

  • If v𝒮v\not\in\mathcal{S}, by the definition of Dv,vσ,τ(,)D_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau}(\cdot,\cdot) in (21), it holds that Dv,vσ,τ=D𝗈𝗉𝗍,vσ,τD_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau}=D_{\mathsf{opt},\mathcal{I}_{v}}^{\sigma,\tau}. Hence

    v𝒮:𝔼[H(σvCvX,τvCvY)]=𝔼[H(σvCvX,τvCvY)].\displaystyle\forall v\not\in\mathcal{S}:\quad\mathbb{E}\left[{H\left(\sigma^{v\leftarrow C_{v}^{X^{\prime}}},\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\right)}\right]=\mathbb{E}\left[{H\left(\sigma^{v\leftarrow C_{v}^{X}},\tau^{v\leftarrow C_{v}^{Y}}\right)}\right].
  • If v𝒮v\in\mathcal{S}, then it holds that H(σvCvX,σvCvX)1H(\sigma^{v\leftarrow C_{v}^{X}},\sigma^{v\leftarrow C_{v}^{X^{\prime}}})\leq 1 and H(τvCvY,τvCvY)1H(\tau^{v\leftarrow C_{v}^{Y^{\prime}}},\tau^{v\leftarrow C_{v}^{Y}})\leq 1. By the triangle inequality of the Hamming distance, we have

    H(σvCvX,τvCvY)\displaystyle H\left(\sigma^{v\leftarrow C_{v}^{X}},\tau^{v\leftarrow C_{v}^{Y}}\right) H(σvCvX,σvCvX)+H(σvCvX,τvCvY)+H(τvCvY,τvCvY)\displaystyle\leq H\left(\sigma^{v\leftarrow C_{v}^{X}},\sigma^{v\leftarrow C_{v}^{X^{\prime}}}\right)+H\left(\sigma^{v\leftarrow C_{v}^{X^{\prime}}},\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\right)+H\left(\tau^{v\leftarrow C_{v}^{Y^{\prime}}},\tau^{v\leftarrow C_{v}^{Y}}\right)
    H(σvCvX,τvCvY)+2.\displaystyle\leq H\left(\sigma^{v\leftarrow C_{v}^{X^{\prime}}},\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\right)+2.

    This implies

    v𝒮:𝔼[H(σvCvX,τvCvY)]𝔼[H(σvCvX,τvCvY)]+2.\displaystyle\forall v\in\mathcal{S}:\quad\mathbb{E}\left[{H\left(\sigma^{v\leftarrow C_{v}^{X}},\tau^{v\leftarrow C_{v}^{Y}}\right)}\right]\leq\mathbb{E}\left[{H\left(\sigma^{v\leftarrow C_{v}^{X^{\prime}}},\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\right)}\right]+2.

Combining the above two properties with (47) and (48), we have for any \sigma,\tau\in\Omega,

𝔼[H(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ]\displaystyle\mathbb{E}\left[{H(\bm{X}_{t},\bm{Y}_{t})\mid\bm{X}_{t-1}=\sigma\land\bm{Y}_{t-1}=\tau}\right]
=\displaystyle= 1nvV𝔼[H(σvCvX,τvCvY)]\displaystyle\,\frac{1}{n}\sum_{v\in V}\mathbb{E}\left[{H\left(\sigma^{v\leftarrow C_{v}^{X}},\tau^{v\leftarrow C_{v}^{Y}}\right)}\right]
\displaystyle\leq 1nv𝒮𝔼[H(σvCvX,τvCvY)]+1nv𝒮(𝔼[H(σvCvX,τvCvY)]+2)\displaystyle\,\frac{1}{n}\sum_{v\not\in\mathcal{S}}\mathbb{E}\left[{H\left(\sigma^{v\leftarrow C_{v}^{X^{\prime}}},\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\right)}\right]+\frac{1}{n}\sum_{v\in\mathcal{S}}\left(\mathbb{E}\left[{H\left(\sigma^{v\leftarrow C_{v}^{X^{\prime}}},\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\right)}\right]+2\right)
()\displaystyle(\ast)\quad\leq 𝔼[H(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ]+4Ln\displaystyle\,\mathbb{E}\left[{H(\bm{X}^{\prime}_{t},\bm{Y}^{\prime}_{t})\mid\bm{X}^{\prime}_{t-1}=\sigma\land\bm{Y}^{\prime}_{t-1}=\tau}\right]+\frac{4L}{n}
\displaystyle\leq (1δn)H(σ,τ)+4Ln,\displaystyle\,\left(1-\frac{\delta}{n}\right)\cdot H(\sigma,\tau)+\frac{4L}{n},

where ()(\ast) holds due to |𝒮|2L|\mathcal{S}|\leq 2L. This proves the claim in (42). ∎

7.2. Implementation of the algorithms

In this section, we prove Claims 6.15, 6.16 and 6.20 by giving the implementations of the algorithms.

7.2.1. Proofs of Claims 6.15 and 6.20

We prove Claim 6.20; Claim 6.15 can be proved in a similar way.

It is easy to verify that the updated instance \mathcal{I}^{\prime}, all the probabilities (p^{\mathsf{up}}_{v})_{v\in V} in (12), and all intermediate instances \mathcal{I}_{\mathsf{mid}},\mathcal{I}_{1},\mathcal{I}_{2} in (8), (18), (6.1.2) can be computed with time cost O(\Delta n). We focus on constructing \mathcal{P}_{i} for 1\leq i\leq N_{\min}.

The multi-sample dynamic Gibbs sampling algorithm uses the data structure in Theorem 6.12 to maintain N(n) independent Gibbs sampling chains on instance \mathcal{I}, represented by \bm{X}^{(i)}_{0} and \mathsf{Exe\text{-}Log}^{(i)}\left(\mathcal{I},T\right)=\left\langle{v^{(i)}_{t}},X^{(i)}_{t}({v^{(i)}_{t}})\right\rangle_{t=1}^{{T}}. To construct the random sets \mathcal{P}_{i} for 1\leq i\leq N_{\min}, we need an additional data structure to maintain the following data. Define the set H_{v} as

Hv{(i,t)[N(n)]×[T]vt(i)=v}.\displaystyle H_{v}\triangleq\{(i,t)\in[N(n)]\times[T]\mid v^{(i)}_{t}=v\}.

H_{v} contains all the transition steps, over the N(n) independent chains, that pick the vertex v. The algorithm uses an extra data structure \mathcal{H} to maintain all (H_{v})_{v\in V}. The data structure \mathcal{H} contains n balanced binary search trees (\mathcal{H}_{v})_{v\in V}, where each \mathcal{H}_{v} maintains the set H_{v} in a similar way as the main data structure in Theorem 6.12. Since T=O(n\log n) and N(n)\leq\mathrm{poly}(n), the space cost of \mathcal{H} is O(nN(n)\log n) memory words, each of O(\log n) bits, which is dominated by the space cost in Lemma 6.18. The time cost of adding, deleting, or searching a transition step in \mathcal{H} is O(\log^{2}n). We need to update \mathcal{H} when \mathcal{I} is updated to \mathcal{I}^{\prime}; one can verify that this time cost is dominated by the time cost in Lemma 6.18.

Then for each vVv\in V, we pick each element in HvH_{v} with probability pv𝗎𝗉p^{\mathsf{up}}_{v} to construct the set

vHv.\displaystyle\mathcal{B}_{v}\subseteq H_{v}.

This is the standard Bernoulli process. With the data structure v\mathcal{H}_{v}, the time complexity of constructing the set v\mathcal{B}_{v} is O(|v|log2n)O(\left|\mathcal{B}_{v}\right|\log^{2}n). Given all the sets v\mathcal{B}_{v}, it is easy to construct all the sets 𝒫i\mathcal{P}_{i}. Hence,

T𝗉𝗋𝖾𝗉𝖺𝗋𝖺𝗍𝗂𝗈𝗇𝗆𝗎𝗅𝗍𝗂=O(Δn+vV|v|log2n)=O(Δn+i=1Nmin|𝒫i|log2n).\displaystyle T_{\mathsf{preparation}}^{\mathsf{multi}}=O\left(\Delta n+\sum_{v\in V}\left|\mathcal{B}_{v}\right|\log^{2}n\right)=O\left(\Delta n+\sum_{i=1}^{N_{\min}}\left|\mathcal{P}_{i}\right|\log^{2}n\right).
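One way to touch only the selected elements (a sketch under the assumption that each \mathcal{H}_{v} supports order-statistics queries, so that the r-th element of H_{v} can be retrieved in O(\log^{2}n) time) is to jump between successes of the Bernoulli process with geometrically distributed gaps:

    import math
    import random

    def bernoulli_ranks_by_skips(size, p):
        """Return the (1-based) ranks of a Bernoulli(p) subset of a set with
        `size` elements, visiting only the selected ranks; each selected rank is
        then turned into an element of H_v via an order-statistics query."""
        selected, pos = [], 0
        if p <= 0.0:
            return selected
        if p >= 1.0:
            return list(range(1, size + 1))
        while True:
            u = 1.0 - random.random()                       # u in (0, 1]
            gap = int(math.log(u) / math.log(1.0 - p)) + 1  # gap ~ Geometric(p)
            pos += gap
            if pos > size:
                return selected
            selected.append(pos)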

In the preparation stage of multi-sample dynamic Gibbs sampling algorithm, we first construct the 𝗆𝗂𝖽=(V,E,Q,Φ𝗆𝗂𝖽)\mathcal{I}_{\mathsf{mid}}=(V,E,Q,\Phi^{\mathsf{mid}}) as in (8), and each 𝒫i\mathcal{P}_{i} (1iNmin1\leq i\leq N_{\min}) is constructed with respect to \mathcal{I} and 𝗆𝗂𝖽\mathcal{I}_{\mathsf{mid}}. Note that dHamil(,𝗆𝗂𝖽)dHamil(,)L𝖧𝖺𝗆𝗂𝗅d_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}_{\mathsf{mid}})\leq d_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}^{\prime})\leq L_{\mathsf{Hamil}}. By (12), we have for each 1iNmin1\leq i\leq N_{\min},

𝔼[|𝒫i|]t=1TvVpv𝗎𝗉n4TL𝖧𝖺𝗆𝗂𝗅n.\displaystyle\mathbb{E}\left[{\left|\mathcal{P}_{i}\right|}\right]\leq\sum_{t=1}^{T}\sum_{v\in V}\frac{p^{\mathsf{up}}_{v}}{n}\leq\frac{4TL_{\mathsf{Hamil}}}{n}.

This proves the claim. ∎

7.2.2. Proof of Claim 6.16

We give the implementation of the update stage of the single-sample dynamic Gibbs sampling algorithm (Algorithm 2). The algorithm updates the MRF instance from \mathcal{I} to \mathcal{I}^{\prime} as follows,

𝗆𝗂𝖽12,\displaystyle\mathcal{I}\quad\to\quad\mathcal{I}_{\mathsf{mid}}\quad\to\quad\mathcal{I}_{1}\quad\to\quad\mathcal{I}_{2}\quad\to\quad\mathcal{I}^{\prime},

where \mathcal{I}_{\mathsf{mid}} is defined in (8), \mathcal{I}_{1}=\mathcal{I}_{1}(\mathcal{I}_{\mathsf{mid}},\mathcal{I}^{\prime}) is defined in (18), and \mathcal{I}_{2}=\mathcal{I}_{2}(\mathcal{I}_{\mathsf{mid}},\mathcal{I}^{\prime}) is defined in (6.1.2). Then the algorithm calls LengthFix to modify the length of the execution log from T to T^{\prime}.

The preparation stage computes all probabilities (p^{\mathsf{up}}_{v})_{v\in V} in (12), the set \mathcal{P} in (13), and all instances \mathcal{I}_{\mathsf{mid}},\mathcal{I}_{1},\mathcal{I}_{2}. Consider the time cost of the update stage. In the update from \mathcal{I}_{\mathsf{mid}} to \mathcal{I}_{1}, we only add the isolated vertices in V^{\prime}\setminus V; using the data structure in Theorem 6.12, the expected time cost is

𝔼[T𝗆𝗂𝖽1]=O(|VV||V|Tmaxlog2Tmax)=O(L𝗀𝗋𝖺𝗉𝗁nTmaxlog2Tmax).\displaystyle\mathbb{E}\left[{T_{\mathcal{I}_{\mathsf{mid}}\to\mathcal{I}_{1}}}\right]=O\left(\frac{\left|V^{\prime}\setminus V\right|}{\left|V\right|}T_{\max}\log^{2}{T_{\max}}\right)=O\left(\frac{L_{\mathsf{graph}}}{n}T_{\max}\log^{2}{T_{\max}}\right).

In the update from 2\mathcal{I}_{2} to \mathcal{I}^{\prime}, we only delete isolated vertices in VVV\setminus V^{\prime}, thus

\displaystyle\mathbb{E}\left[{T_{\mathcal{I}_{2}\to\mathcal{I}^{\prime}}}\right]=O\left(\frac{\left|V\setminus V^{\prime}\right|}{\left|V\cup V^{\prime}\right|}T_{\max}\log^{2}{T_{\max}}\right)=O\left(\frac{L_{\mathsf{graph}}}{n}T_{\max}\log^{2}{T_{\max}}\right).

It is also easy to observe that the expected time cost of LengthFix is

𝔼[TLengthFix]=O(Δ|TT|log2Tmax).\displaystyle\mathbb{E}\left[{T_{\textsf{LengthFix}}}\right]=O\left(\Delta\left|T-T^{\prime}\right|\log^{2}T_{\max}\right).

We then prove that

(49) 𝔼[T𝗆𝗂𝖽]\displaystyle\mathbb{E}\left[{T_{\mathcal{I}\to\mathcal{I}_{\mathsf{mid}}}}\right] =O(Δ𝔼[R𝖧𝖺𝗆𝗂𝗅]log2Tmax)\displaystyle=O\left(\Delta\mathbb{E}\left[{R_{\mathsf{Hamil}}}\right]\log^{2}T_{\max}\right)
(50) 𝔼[T12]\displaystyle\mathbb{E}\left[{T_{\mathcal{I}_{1}\to\mathcal{I}_{2}}}\right] =O(Δ𝔼[R𝗀𝗋𝖺𝗉𝗁]log2Tmax).\displaystyle=O\left(\Delta\mathbb{E}\left[{R_{\mathsf{graph}}}\right]\log^{2}T_{\max}\right).

Combining all the running times proves Claim 6.16.

We give the implementation of Algorithm 4 to prove (49); Algorithm 6 can be implemented in a similar way to prove (50). Since (p^{\mathsf{up}}_{v})_{v\in V} and \mathcal{P} are given, the running time of Algorithm 4 is dominated by the while-loop. We implement Algorithm 4 such that after each execution of the while-loop, the first t_{0} transition steps of the Gibbs sampling on instance \mathcal{I} have been updated to the first t_{0} transition steps of the Gibbs sampling on instance \mathcal{I}^{\prime}, namely, (\bm{X}_{t})_{t=0}^{t_{0}} has been updated to (\bm{Y}_{t})_{t=0}^{t_{0}}, where t_{0} is the variable in Algorithm 4. Recall the sets \mathcal{D} and \mathcal{P} in Algorithm 4. We need the following temporary data structures (the lookup of the next relevant time step is sketched after the list):

  • a balanced binary search tree 𝒯\mathcal{T} to maintain the set 𝒟\mathcal{D} and the configuration Xt01(𝒟)X_{t_{0}-1}(\mathcal{D});

  • a heap 1\mathcal{H}_{1} to maintain the set 𝒫\mathcal{P};

  • a heap 2\mathcal{H}_{2} such that once a vertex vv is added into 𝒟\mathcal{D}, the update times Succ(t0,u)\textsf{Succ}(t_{0},u) for all uΓG(v){v}u\in\Gamma_{G}(v)\cup\{v\} are added into 2\mathcal{H}_{2}, where Succ is the operation of the data structure in Theorem 6.12.
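As an illustration of how \mathcal{H}_{1} and \mathcal{H}_{2} interact (a hypothetical sketch; plain Python lists maintained with heapq stand in for the structures above), the while-loop repeatedly jumps to the smallest time step after t_{0} that either lies in \mathcal{P} or is an update time of a vertex in, or adjacent to, the discrepancy set \mathcal{D}:

    import heapq

    def next_relevant_step(H1, H2, t0):
        """H1: min-heap of the steps in P; H2: min-heap of successor update times
        of vertices in and adjacent to D.  Returns the smallest step > t0 that the
        while-loop still has to process, or None if no such step remains."""
        for heap in (H1, H2):
            while heap and heap[0] <= t0:   # discard entries already processed
                heapq.heappop(heap)
        candidates = [heap[0] for heap in (H1, H2) if heap]
        return min(candidates) if candidates else None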

The while-loop of Algorithm 4 can be implemented using \mathcal{H}_{1}, \mathcal{H}_{2} and \mathcal{T}, and the remaining steps of Algorithm 4 can be implemented using \mathcal{T} together with the main data structure in Theorem 6.12. Note that the time cost of each operation of \mathcal{T} is O(\log n)=O(\log T_{\max}). Also note that at most \Delta R_{\mathsf{Hamil}} elements can be added into \mathcal{H}_{2}. Hence, the total time cost contributed by \mathcal{H}_{2} is O(\Delta R_{\mathsf{Hamil}}\log(\Delta R_{\mathsf{Hamil}}))=O(\Delta R_{\mathsf{Hamil}}\log T_{\max}). One can verify that the total running time is

T𝗆𝗂𝖽=O(ΔR𝖧𝖺𝗆𝗂𝗅log2Tmax).\displaystyle T_{\mathcal{I}\to\mathcal{I}_{\mathsf{mid}}}=O\left(\Delta R_{\mathsf{Hamil}}\log^{2}T_{\max}\right).

This proves (49). ∎

7.3. Dynamic Gibbs sampling for specific models

In this section, we apply our algorithm to the Ising model, graph q-colorings, and the hardcore model. We prove the following theorem.

Theorem 7.1.

There exist dynamic sampling algorithms as stated in Theorem 6.1 with the same space cost O\left(nN(n)\log n\right) and expected time cost O\left(\Delta^{2}(L_{\mathsf{graph}}+L_{\mathsf{Hamil}})N(n)\log^{3}n+\Delta n\right) for each update, if the input instance \mathcal{I} with n vertices and the updated instance \mathcal{I}^{\prime}, which satisfy d_{\textsf{graph}}(\mathcal{I},\mathcal{I}^{\prime})\leq L_{\mathsf{graph}}=o(n) and d_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}^{\prime})\leq L_{\mathsf{Hamil}}, both are:

  • Ising models with temperature β\beta and arbitrary local fields where exp(2|β|)12δΔ+1\exp(-2|\beta|)\geq 1-\frac{2-\delta}{\Delta+1};

  • proper qq-colorings with q(2+δ)Δq\geq(2+\delta)\Delta;

  • hardcore models with fugacity λ2δΔ2\lambda\leq\frac{2-\delta}{\Delta-2}, but with an alternative time cost for each update

    (51) O(Δ3(L𝗀𝗋𝖺𝗉𝗁+L𝖧𝖺𝗆𝗂𝗅)N(n)log3n+Δn),\displaystyle O\left(\Delta^{3}(L_{\mathsf{graph}}+L_{\mathsf{Hamil}})N(n)\log^{3}n+\Delta n\right),

where δ>0\delta>0 is a constant, Δ=max{ΔG,ΔG}\Delta=\max\{\Delta_{G},\Delta_{G^{\prime}}\}, ΔG\Delta_{G} denotes the maximum degree of the input graph, and ΔG\Delta_{G^{\prime}} denotes the maximum degree of the updated graph.

In Theorem 7.1, the regimes for the Ising model and q-colorings match the Dobrushin-Shlosman condition, so those results are corollaries of Theorem 6.1. The regime for the hardcore model is better than the one implied by the Dobrushin-Shlosman condition. We give the proof for the hardcore model.

We use =(V,E,λ)\mathcal{I}=(V,E,\lambda) to specify the hardcore model on graph G=(V,E)G=(V,E) with fugacity λ\lambda. A configuration of the hardcore model is σ{0,1}V\sigma\in\{0,1\}^{V}, where σv=1\sigma_{v}=1 indicates that vv is occupied and σv=0\sigma_{v}=0 indicates that vv is unoccupied. If σ\sigma forms an independent set, then μ(σ)λσ\mu_{\mathcal{I}}(\sigma)\propto\lambda^{\|\sigma\|}; otherwise, μ(σ)=0\mu_{\mathcal{I}}(\sigma)=0. We need the following lemma, which is proved using Vigoda’s coupling technique [37].
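
As a quick illustration of this definition (not needed for any proof), the unnormalized weight of a configuration can be computed as follows; the dictionary-based representation of the configuration is our own choice.

def hardcore_weight(sigma, E, lam):
    # sigma: dict mapping each vertex to 0/1; E: iterable of edges (u, v); lam: fugacity.
    # Returns the unnormalized weight, so that mu_I(sigma) is proportional to it.
    if any(sigma[u] == 1 and sigma[v] == 1 for (u, v) in E):
        return 0.0                        # sigma is not an independent set
    return lam ** sum(sigma.values())     # lambda^{||sigma||} on independent sets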

Lemma 7.2.

Let δ>0\delta>0 be a constant. Let =(V,E,λ)\mathcal{I}=(V,E,\lambda) be a hardcore instance, where n=|V|n=|V|, and Ω{σ{0,1}Vμ(σ)>0}\Omega_{\mathcal{I}}\triangleq\{\sigma\in\{0,1\}^{V}\mid\mu_{\mathcal{I}}(\sigma)>0\}. Assume λ2δΔ2\lambda\leq\frac{2-\delta}{\Delta-2}, where Δ\Delta is the maximum degree of G=(V,E)G=(V,E). There exists a potential function ρ:Ω×Ω0\rho_{\mathcal{I}}:\Omega_{\mathcal{I}}\times\Omega_{\mathcal{I}}\to\mathbb{R}_{\geq 0}, where σ,τΩ\forall\sigma,\tau\in\Omega_{\mathcal{I}}, ρ(σ,τ)=0\rho_{\mathcal{I}}(\sigma,\tau)=0 if σ=τ\sigma=\tau and ρ(σ,τ)1\rho_{\mathcal{I}}(\sigma,\tau)\geq 1 if στ\sigma\neq\tau, and Diammaxσ,τΩρ(σ,τ)Δn\mathrm{Diam}_{\mathcal{I}}\triangleq\max_{\sigma,\tau\in\Omega_{\mathcal{I}}}\rho_{\mathcal{I}}(\sigma,\tau)\leq\Delta n, such that the one-step optimal coupling (Definition 4.2) (𝐗t,𝐘t)t0(\bm{X}_{t},\bm{Y}_{t})_{t\geq 0} of Gibbs sampling on \mathcal{I} satisfies

  1. (1)

    (step-wise decay) for the coupling (𝑿t,𝒀t)t0(\bm{X}_{t},\bm{Y}_{t})_{t\geq 0} of Gibbs sampling, it holds that

    (52) σ,τΩ:𝔼[ρ(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ](1βn)ρ(σ,τ),\displaystyle\forall\,\sigma,\tau\in\Omega_{\mathcal{I}}:\quad\mathbb{E}\left[{\,\rho_{\mathcal{I}}(\bm{X}_{t},\bm{Y}_{t})\mid\bm{X}_{t-1}=\sigma\land\bm{Y}_{t-1}=\tau\,}\right]\leq\mbox{$\left(1-\frac{\beta}{n}\right)$}\cdot\rho_{\mathcal{I}}(\sigma,\tau),

    where β=δ/96\beta=\frac{\delta}{96}, which implies τ𝗆𝗂𝗑(,ϵ)nβlogDiamϵ=O(nlognϵ)\tau_{\mathsf{mix}}(\mathcal{I},\epsilon)\leq\lceil\frac{n}{\beta}\log\frac{\mathrm{Diam}_{\mathcal{I}}}{\epsilon}\rceil=O(n\log\frac{n}{\epsilon}) (a short derivation of this implication is given after the lemma statement).

  2. (2)

    (upper bound on Hamming distance) for all σ,τΩ\sigma,\tau\in\Omega_{\mathcal{I}}, H(σ,τ)ρ(σ,τ)H(\sigma,\tau)\leq\rho_{\mathcal{I}}(\sigma,\tau), where H(σ,τ)H(\sigma,\tau) denotes the Hamming distance between σ\sigma and τ\tau.

  3. (3)

    (Lipschitz) function ρ(,)\rho_{\mathcal{I}}(\cdot,\cdot), seen as a function of 2n2n variables, is KK-Lipschitz, that is,

    \forall\,\sigma,\sigma^{\prime},\tau,\tau^{\prime}\in\Omega_{\mathcal{I}}:\quad\left|\rho_{\mathcal{I}}(\sigma,\tau)-\rho_{\mathcal{I}}(\sigma^{\prime},\tau^{\prime})\right|\leq K\cdot H(\sigma\tau,\sigma^{\prime}\tau^{\prime}),

    where K=12ΔK=12\Delta and στ\sigma\tau denotes the concatenation of σ\sigma and τ\tau, viewed as a configuration on 2n2n variables.
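
For completeness, we sketch why the step-wise decay (52) yields the stated mixing time; this is the standard coupling argument and uses only that the potential is at least 1 on distinct configurations and that Diam_I ≤ Δn:

\Pr\left[\bm{X}_{T}\neq\bm{Y}_{T}\right]\leq\mathbb{E}\left[\rho_{\mathcal{I}}(\bm{X}_{T},\bm{Y}_{T})\right]\leq\left(1-\frac{\beta}{n}\right)^{T}\mathrm{Diam}_{\mathcal{I}}\leq\exp\left(-\frac{\beta T}{n}\right)\mathrm{Diam}_{\mathcal{I}}\leq\epsilon\quad\text{whenever}\quad T\geq\frac{n}{\beta}\log\frac{\mathrm{Diam}_{\mathcal{I}}}{\epsilon},

so by the coupling inequality, starting the second chain from the stationary distribution gives total variation distance at most ε after T steps; with Diam_I ≤ Δn this is O(n log(n/ε)) steps.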

Compared with Proposition 4.3, the step-wise decay property in (52) holds only for feasible configurations σ\sigma and τ\tau, and the decay is established for the potential function ρ\rho_{\mathcal{I}} rather than the Hamming distance HH. We first use Lemma 7.2 to prove Theorem 7.1, and then prove Lemma 7.2 at the end of this section.

Recall that the error function ϵ\epsilon satisfies ϵ()1poly()\epsilon(\ell)\geq\frac{1}{\mathrm{poly}(\ell)} by Lemma 6.17. Recall Δ=max{ΔG,ΔG}\Delta=\max\{\Delta_{G},\Delta_{G^{\prime}}\}. By Lemma 7.2 and n=Θ(n)n^{\prime}=\Theta(n) (since L𝗀𝗋𝖺𝗉𝗁=o(n)L_{\mathsf{graph}}=o(n)), we can set

T\displaystyle T =T()=96nδlognΔϵ(n)=O(nlogn)\displaystyle=T(\mathcal{I})=\left\lceil\frac{96n}{\delta}\log\frac{n\Delta}{\epsilon(n)}\right\rceil=O\left(n\log n\right)
T\displaystyle T^{\prime} =T()=96nδlognΔϵ(n)=O(nlogn).\displaystyle=T(\mathcal{I}^{\prime})=\left\lceil\frac{96n^{\prime}}{\delta}\log\frac{n^{\prime}\Delta}{\epsilon(n^{\prime})}\right\rceil=O\left(n\log n\right).

We modify Algorithm 2 for the hardcore model as follows. Suppose the current instance is =(V,E,λ)\mathcal{I}=(V,E,\lambda). We set the initial configuration 𝑿0\bm{X}_{0} as

vV,X0(v)=0.\displaystyle\forall v\in V,\quad X_{0}(v)=0.

Thus 𝑿0\bm{X}_{0} is feasible. Suppose the instance =(V,E,λ)\mathcal{I}=(V,E,\lambda) is updated to =(V,E,λ)\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},\lambda^{\prime}). We divide the update into the following steps (a code sketch of this decomposition is given after the list):

𝗆𝗂𝖽123,\displaystyle\mathcal{I}\quad\to\quad\mathcal{I}_{\mathsf{mid}}\quad\to\quad\mathcal{I}_{1}\quad\to\quad\mathcal{I}_{2}\quad\to\quad\mathcal{I}_{3}\quad\to\quad\mathcal{I}^{\prime},
  • change fugacity to update =(V,E,λ)\mathcal{I}=(V,E,\lambda) to 𝗆𝗂𝖽=(V,E,λ)\mathcal{I}_{\mathsf{mid}}=(V,E,\lambda^{\prime}) using UpdateHamiltonian;

  • add isolated vertices in VVV^{\prime}\setminus V to update 𝗆𝗂𝖽=(V,E,λ)\mathcal{I}_{\mathsf{mid}}=(V,E,\lambda^{\prime}) to 1=(VV,E,λ)\mathcal{I}_{1}=(V\cup V^{\prime},E,\lambda^{\prime}) using AddVertex;

  • delete edges in EEE\setminus E^{\prime} to update 1=(VV,E,λ)\mathcal{I}_{1}=(V\cup V^{\prime},E,\lambda^{\prime}) to 2=(VV,EE,λ)\mathcal{I}_{2}=(V\cup V^{\prime},E\cap E^{\prime},\lambda^{\prime}) using UpdateEdge;

  • add edges in EEE^{\prime}\setminus E to update 2=(VV,EE,λ)\mathcal{I}_{2}=(V\cup V^{\prime},E\cap E^{\prime},\lambda^{\prime}) to 3=(VV,E,λ)\mathcal{I}_{3}=(V\cup V^{\prime},E^{\prime},\lambda^{\prime}) using UpdateEdge;

  • delete isolated vertices in VVV\setminus V^{\prime} to update 3=(VV,E,λ)\mathcal{I}_{3}=(V\cup V^{\prime},E^{\prime},\lambda^{\prime}) to =(V,E,λ)\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},\lambda^{\prime});

  • fix the length of the execution log from TT to TT^{\prime}.
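
The intermediate instances in this decomposition can be computed by straightforward bookkeeping, as in the following sketch; the tuple representation is ours, the actual work is done by the subroutines named in the list above, and edges are stored in a canonical form so that the set operations are well defined.

def update_pipeline(V, E, lam, V_new, E_new, lam_new):
    # Returns the chain of intermediate hardcore instances I -> I_mid -> I_1 -> I_2 -> I_3 -> I'.
    V, V_new = set(V), set(V_new)
    E, E_new = {frozenset(e) for e in E}, {frozenset(e) for e in E_new}
    I_mid = (V,         E,          lam_new)   # change the fugacity          (UpdateHamiltonian)
    I_1   = (V | V_new, E,          lam_new)   # add isolated vertices        (AddVertex)
    I_2   = (V | V_new, E & E_new,  lam_new)   # delete the edges in E \ E'   (UpdateEdge)
    I_3   = (V | V_new, E_new,      lam_new)   # add the edges in E' \ E      (UpdateEdge)
    I_fin = (V_new,     E_new,      lam_new)   # delete isolated vertices in V \ V'
    return [I_mid, I_1, I_2, I_3, I_fin]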

Compared to Algorithm 2, we further divide the update of the edges into two steps: first delete edges, then add edges. Thus, we have the following observation.

Observation 7.3.

The following results hold:

  • Ω=Ω𝗆𝗂𝖽\Omega_{\mathcal{I}}=\Omega_{\mathcal{I}_{\mathsf{mid}}}, Ω1Ω2\Omega_{\mathcal{I}_{1}}\subseteq\Omega_{\mathcal{I}_{2}} and Ω3Ω2\Omega_{\mathcal{I}_{3}}\subseteq\Omega_{\mathcal{I}_{2}}, where Ω𝒥\Omega_{\mathcal{J}} is the set of feasible configurations for any instance 𝒥\mathcal{J}.

  • the instances ,2,3,\mathcal{I},\mathcal{I}_{2},\mathcal{I}_{3},\mathcal{I}^{\prime} all satisfy λ2δΔ2\lambda\leq\frac{2-\delta}{\Delta-2}, where λ\lambda and Δ\Delta are the fugacity and maximum degree of the corresponding instance.

By Observation 7.3, we have Ω=Ω𝗆𝗂𝖽\Omega_{\mathcal{I}}=\Omega_{\mathcal{I}_{\mathsf{mid}}}, Ω1Ω2\Omega_{\mathcal{I}_{1}}\subseteq\Omega_{\mathcal{I}_{2}} and Ω3Ω2\Omega_{\mathcal{I}_{3}}\subseteq\Omega_{\mathcal{I}_{2}}; thus we can apply Lemma 7.2, because the step-wise decay property (52) is established only for feasible configurations.

We need to analyze R𝖧𝖺𝗆𝗂𝗅R_{\mathsf{Hamil}} and R𝗀𝗋𝖺𝗉𝗁R_{\mathsf{graph}}, defined in (17) and (23), for the hardcore model. We prove the following two lemmas.

Lemma 7.4.

Consider UpdateHamiltonian(,,𝐗0,vt,Xt(vt)t=1T)\textsf{UpdateHamiltonian}\left(\mathcal{I},\mathcal{I}^{\prime},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}\right). Let =(V,E,λ)\mathcal{I}=(V,E,\lambda) be the current instance and =(V,E,λ)\mathcal{I}^{\prime}=(V,E,\lambda^{\prime}) the updated instance. Assume λ2δΔ2\lambda\leq\frac{2-\delta}{\Delta-2}, where δ>0\delta>0 is a constant and Δ\Delta is the maximum degree of G=(V,E)G=(V,E). Also assume dHamil(,)=n|lnλlnλ|Ld_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}^{\prime})=n\left|\ln\lambda-\ln\lambda^{\prime}\right|\leq L. Then 𝔼[R𝖧𝖺𝗆𝗂𝗅]=O(Δ2TLnδ)\mathbb{E}\left[{R_{\mathsf{Hamil}}}\right]=O\left(\frac{\Delta^{2}TL}{n\delta}\right), where n=|V|n=|V|.

Lemma 7.5.

Consider UpdateEdge(,,𝐗0,vt,Xt(vt)t=1T)\textsf{UpdateEdge}\left(\mathcal{I},\mathcal{I}^{\prime},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}\right). Let =(V,E,λ)\mathcal{I}=(V,E,\lambda) be the current instance and =(V,E,λ)\mathcal{I}^{\prime}=(V,E^{\prime},\lambda) the updated instance. Assume |EE|L|E\oplus E^{\prime}|\leq L. Also assume one of the following two conditions holds for some constant δ>0\delta>0:

  • λ2δΔG2\lambda\leq\frac{2-\delta}{\Delta_{G}-2} and ΩΩ\Omega_{\mathcal{I}^{\prime}}\subseteq\Omega_{\mathcal{I}}, where ΔG\Delta_{G} is the maximum degree of G=(V,E)G=(V,E);

  • λ2δΔG2\lambda\leq\frac{2-\delta}{\Delta_{G^{\prime}}-2} and ΩΩ\Omega_{\mathcal{I}}\subseteq\Omega_{\mathcal{I}^{\prime}}, where ΔG\Delta_{G^{\prime}} is the maximum degree of G=(V,E)G^{\prime}=(V,E^{\prime}).

Then 𝔼[R𝗀𝗋𝖺𝗉𝗁]=O(Δ2TLnδ)\mathbb{E}\left[{R_{\mathsf{graph}}}\right]=O\left(\frac{\Delta^{2}TL}{n\delta}\right), where n=|V|n=|V| and Δ=max{ΔG,ΔG}\Delta=\max\{\Delta_{G},\Delta_{G^{\prime}}\}.

Note that we call the subroutine UpdateHamiltonian for the update modifying \mathcal{I} to 𝗆𝗂𝖽\mathcal{I}_{\mathsf{mid}}. By Observation 7.3, the condition in Lemma 7.4 holds. We call the subroutine UpdateEdge for the update modifying 1\mathcal{I}_{1} to 2\mathcal{I}_{2} and the update modifying 2\mathcal{I}_{2} to 3\mathcal{I}_{3}. By Observation 7.3, the condition in Lemma 7.5 holds in both calls of UpdateEdge. Then Theorem 7.1 for the hardcore model can be proved by going through the proof in Section 6. Compared to Lemma 6.8 and Lemma 6.11, 𝔼[R𝖧𝖺𝗆𝗂𝗅],𝔼[R𝗀𝗋𝖺𝗉𝗁]\mathbb{E}\left[{R_{\mathsf{Hamil}}}\right],\mathbb{E}\left[{R_{\mathsf{graph}}}\right] in Lemma 7.4 and Lemma 7.5 are bounded by O(Δ2TLnδ)O\left(\frac{\Delta^{2}TL}{n\delta}\right) rather than O(ΔTLnδ)O\left(\frac{\Delta TL}{n\delta}\right). This is why the hardcore model has an alternative running time in (51).

The proofs of Lemma 7.4 and Lemma 7.5 are similar to the proofs of Lemma 6.8 and Lemma 6.11. We give the proofs here for completeness.

Proof of Lemma 7.4.

By the definition of the indicator γt\gamma_{t} in (17), we have

Pr[γt=1𝒟t1]\displaystyle\Pr[\gamma_{t}=1\mid\mathcal{D}_{t-1}] Pr[t𝒫]+Pr[vtΓG+(𝒟t1)]=(Δ+1)|𝒟t1|n+vVpv𝗎𝗉n.\displaystyle\leq\Pr\left[t\in\mathcal{P}\right]+\Pr\left[v_{t}\in\Gamma_{G}^{+}(\mathcal{D}_{t-1})\right]=\frac{(\Delta+1)|\mathcal{D}_{t-1}|}{n}+\sum_{v\in V}\frac{p^{\mathsf{up}}_{v}}{n}.

By the definition of pv𝗎𝗉p^{\mathsf{up}}_{v} in  (12) and dHamil(,)=n|lnλlnλ|Ld_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}^{\prime})=n\left|\ln\lambda-\ln\lambda^{\prime}\right|\leq L, we have

Pr[γt=1𝒟t1](Δ+1)|𝒟t1|n+2Ln.\displaystyle\Pr[\gamma_{t}=1\mid\mathcal{D}_{t-1}]\leq\frac{(\Delta+1)|\mathcal{D}_{t-1}|}{n}+\frac{2L}{n}.

By the definition of R𝖧𝖺𝗆𝗂𝗅t=1TγtR_{\mathsf{Hamil}}\triangleq\sum_{t=1}^{T}\gamma_{t}, we have

(53) 𝔼[R𝖧𝖺𝗆𝗂𝗅]=t=1T𝔼[γt]=t=1T𝔼[𝔼[γt𝒟t1]]t=1T((Δ+1)𝔼[|𝒟t1|]n+2Ln).\displaystyle\mathbb{E}\left[{R_{\mathsf{Hamil}}}\right]=\sum_{t=1}^{T}\mathbb{E}\left[{\gamma_{t}}\right]=\sum_{t=1}^{T}{\mathbb{E}\left[{\mathbb{E}\left[{\gamma_{t}\mid\mathcal{D}_{t-1}}\right]}\right]}\leq\sum_{t=1}^{T}\left(\frac{(\Delta+1)\mathbb{E}\left[{|\mathcal{D}_{t-1}|}\right]}{n}+\frac{2L}{n}\right).

Next, we bound the expectation 𝔼[|𝒟t|]\mathbb{E}\left[{|\mathcal{D}_{t}|}\right]. In our implementation of the one-step local coupling for the Hamiltonian update (Definition 6.3), we first construct the random set 𝒫V\mathcal{P}\subseteq V in (13). In the tt-th step, where 1tT1\leq t\leq T, given any 𝑿t1\bm{X}_{t-1} and 𝒀t1\bm{Y}_{t-1}, the configurations 𝑿t\bm{X}_{t} and 𝒀t\bm{Y}_{t} are generated as follows (a sketch of one such coupled step is given after the list).

  • Let X(u)=Xt1(u)X^{\prime}(u)=X_{t-1}(u) and Y(u)=Yt1(u)Y^{\prime}(u)=Y_{t-1}(u) for all uV{vt}u\in V\setminus\{v_{t}\}, sample (X(vt),Y(vt)){0,1}2(X^{\prime}(v_{t}),Y^{\prime}(v_{t}))\in\{0,1\}^{2} jointly from the optimal coupling D𝗈𝗉𝗍,vtσ,τD^{\sigma,\tau}_{\mathsf{opt},\mathcal{I}_{v_{t}}} of the marginal distributions μvt,(σ)\mu_{v_{t},\mathcal{I}}(\cdot\mid\sigma) and μvt,(τ)\mu_{v_{t},\mathcal{I}}(\cdot\mid\tau), where σ=Xt1(ΓG(vt))\sigma=X_{t-1}(\Gamma_{G}(v_{t})) and τ=Yt1(ΓG(vt))\tau=Y_{t-1}(\Gamma_{G}(v_{t})).

  • Let 𝑿t=𝑿\bm{X}_{t}=\bm{X}^{\prime} and 𝒀t=𝒀\bm{Y}_{t}=\bm{Y}^{\prime}. If t𝒫t\in\mathcal{P}, update the value of Yt(vt)Y_{t}(v_{t}) using (14).
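
The following sketch shows one such coupled step for the hardcore model. The conditional marginal at the chosen vertex depends only on whether some neighbor is occupied, and drawing both copies from a single uniform random number realizes the optimal coupling of the two Bernoulli marginals; the correction (14) is abstracted as a callable adjust, since its exact form depends on the probabilities p_v^up, and all names here are our own.

import random

def coupled_step(X, Y, v, G, lam, t_in_P, adjust):
    # One transition (X_{t-1}, Y_{t-1}) -> (X_t, Y_t) of the one-step local coupling.
    # X, Y: dicts vertex -> {0,1}; G: dict vertex -> set of neighbors; lam: fugacity.
    p = 0.0 if any(X[u] == 1 for u in G[v]) else lam / (1.0 + lam)  # marginal of v given X's neighborhood
    q = 0.0 if any(Y[u] == 1 for u in G[v]) else lam / (1.0 + lam)  # marginal of v given Y's neighborhood
    U = random.random()                   # one shared uniform couples the two Bernoulli marginals optimally
    X[v], Y[v] = int(U < p), int(U < q)
    if t_in_P:
        Y[v] = adjust(Y, v)               # the extra correction (14), applied only when t belongs to P
    return X, Y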

Note that Ω=Ω\Omega_{\mathcal{I}}=\Omega_{\mathcal{I}^{\prime}}. Since \mathcal{I} satisfies λ2δΔ2\lambda\leq\frac{2-\delta}{\Delta-2} with constant δ>0\delta>0, by Lemma 7.2, for any feasible 𝑿t1,𝒀t1Ω=Ω\bm{X}_{t-1},\bm{Y}_{t-1}\in\Omega_{\mathcal{I}}=\Omega_{\mathcal{I}^{\prime}}, we have

(54) 𝔼[ρ(𝑿,𝒀)𝑿t1,𝒀t1](1δ96n)ρ(𝑿t1,𝒀t1).\displaystyle\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}^{\prime},\bm{Y}^{\prime})\mid\bm{X}_{t-1},\bm{Y}_{t-1}}\right]\leq\left(1-\frac{\delta}{96n}\right)\rho_{\mathcal{I}}(\bm{X}_{t-1},\bm{Y}_{t-1}).

By Lemma 7.2, the function ρ(,)\rho_{\mathcal{I}}(\cdot,\cdot), seen as a function of 2n2n variables, is 12Δ12\Delta-Lipschitz. Let \mathcal{F} be the indicator of the event t𝒫t\in\mathcal{P}. We flip the value of Yt(vt)Y_{t}(v_{t}) only if \mathcal{F} occurs. By (54), we have

𝔼[ρ(𝑿t,𝒀t)𝑿t1,𝒀t1]\displaystyle\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}_{t},\bm{Y}_{t})\mid\bm{X}_{t-1},\bm{Y}_{t-1}}\right] 𝔼[ρ(𝑿,𝒀)+12Δ𝑿t1,𝒀t1]\displaystyle\leq\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}^{\prime},\bm{Y}^{\prime})+12\Delta\mathcal{F}\mid\bm{X}_{t-1},\bm{Y}_{t-1}}\right]
=𝔼[ρ(𝑿,𝒀)𝑿t1,𝒀t1]+𝔼[12Δ𝑿t1,𝒀t1]\displaystyle=\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}^{\prime},\bm{Y}^{\prime})\mid\bm{X}_{t-1},\bm{Y}_{t-1}}\right]+\mathbb{E}\left[{12\Delta\mathcal{F}\mid\bm{X}_{t-1},\bm{Y}_{t-1}}\right]
( is independent of 𝑿t1,𝒀t1)\displaystyle(\text{$\mathcal{F}$ is independent of $\bm{X}_{t-1},\bm{Y}_{t-1}$})\quad (1δ96n)ρ(𝑿t1,𝒀t1)+12Δ𝔼[]\displaystyle\leq\left(1-\frac{\delta}{96n}\right)\rho_{\mathcal{I}}(\bm{X}_{t-1},\bm{Y}_{t-1})+12\Delta\mathbb{E}\left[{\mathcal{F}}\right]
(1δ96n)ρ(𝑿t1,𝒀t1)+12ΔvVpv𝗎𝗉n\displaystyle\leq\left(1-\frac{\delta}{96n}\right)\rho_{\mathcal{I}}(\bm{X}_{t-1},\bm{Y}_{t-1})+12\Delta\sum_{v\in V}\frac{p^{\mathsf{up}}_{v}}{n}
(by(12))\displaystyle(\text{by}~\eqref{eq-def-Ising-up})\quad (1δ96n)ρ(𝑿t1,𝒀t1)+24ΔnvV|lnλlnλ|\displaystyle\leq\left(1-\frac{\delta}{96n}\right)\rho_{\mathcal{I}}(\bm{X}_{t-1},\bm{Y}_{t-1})+\frac{24\Delta}{n}\sum_{v\in V}\left|\ln\lambda-\ln\lambda^{\prime}\right|
(bydHamil(,)L)\displaystyle(\text{by}~d_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}^{\prime})\leq L)\quad (1δ96n)ρ(𝑿t1,𝒀t1)+24LΔn.\displaystyle\leq\left(1-\frac{\delta}{96n}\right)\rho_{\mathcal{I}}(\bm{X}_{t-1},\bm{Y}_{t-1})+\frac{24L\Delta}{n}.

Note that 𝑿0(v)=𝒀0(v)=0\bm{X}_{0}(v)=\bm{Y}_{0}(v)=0 for all vVv\in V, hence ρ(𝑿0,𝒀0)=0\rho_{\mathcal{I}}(\bm{X}_{0},\bm{Y}_{0})=0 and the configurations 𝑿t,𝒀t\bm{X}_{t},\bm{Y}_{t} are feasible for all t0t\geq 0. Taking expectations, we have

𝔼[ρ(𝑿t,𝒀t)](1δ96n)𝔼[ρ(𝑿t1,𝒀t1)]+24LΔn.\displaystyle\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}_{t},\bm{Y}_{t})}\right]\leq\left(1-\frac{\delta}{96n}\right)\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}_{t-1},\bm{Y}_{t-1})}\right]+\frac{24L\Delta}{n}.
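
Unrolling this recurrence with ρ_I(X_0, Y_0) = 0 is a routine geometric-series calculation:

\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}_{t},\bm{Y}_{t})}\right]\leq\frac{24L\Delta}{n}\sum_{i=0}^{t-1}\left(1-\frac{\delta}{96n}\right)^{i}\leq\frac{24L\Delta}{n}\cdot\frac{96n}{\delta}=\frac{2304L\Delta}{\delta}.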

Thus 𝔼[ρ(𝑿t,𝒀t)]5000LΔδ.\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}_{t},\bm{Y}_{t})}\right]\leq\frac{5000L\Delta}{\delta}. By the upper bound on Hamming distance in Lemma 7.2, we have

𝔼[|𝒟t|]5000LΔδ.\displaystyle\mathbb{E}\left[{|\mathcal{D}_{t}|}\right]\leq\frac{5000L\Delta}{\delta}.

Thus, by (53), we have

𝔼[R𝖧𝖺𝗆𝗂𝗅]50000Δ2TLδn=O(Δ2TLδn).\displaystyle\mathbb{E}\left[{R_{\mathsf{Hamil}}}\right]\leq\frac{50000\Delta^{2}TL}{\delta n}=O\left(\frac{\Delta^{2}TL}{\delta n}\right).

Proof of Lemma 7.5.

By the definition of R𝗀𝗋𝖺𝗉𝗁R_{\mathsf{graph}} in (23) and the linearity of the expectation, we have

𝔼[R𝗀𝗋𝖺𝗉𝗁]\displaystyle\mathbb{E}\left[{R_{\mathsf{graph}}}\right] =t=1T𝔼[γt]=t=1T𝔼[𝔼[γt𝒟t1]].\displaystyle=\sum_{t=1}^{T}\mathbb{E}\left[{\gamma_{t}}\right]=\sum_{t=1}^{T}\mathbb{E}\left[{\mathbb{E}\left[{\gamma_{t}\mid\mathcal{D}_{t-1}}\right]}\right].

Recall that γt=𝟏[vt𝒮ΓG+(𝒟t1)]\gamma_{t}=\mathbf{1}\left[v_{t}\in\mathcal{S}\cup\Gamma^{+}_{G}(\mathcal{D}_{t-1})\right] and that vtVv_{t}\in V is chosen uniformly at random given 𝒟t1\mathcal{D}_{t-1}. Note that |ΓG+(𝒟t1)|(Δ+1)|𝒟t1||\Gamma^{+}_{G}(\mathcal{D}_{t-1})|\leq(\Delta+1)|\mathcal{D}_{t-1}| and |𝒮|2|EE|2L|\mathcal{S}|\leq 2|E\oplus E^{\prime}|\leq 2L. We have

(55) 𝔼[R𝗀𝗋𝖺𝗉𝗁]t=1T𝔼[(Δ+1)|𝒟t1|+2Ln]=(Δ+1)nt=1T𝔼[|𝒟t1|]+2LTn.\displaystyle\mathbb{E}\left[{R_{\mathsf{graph}}}\right]\leq\sum_{t=1}^{T}\mathbb{E}\left[{\frac{(\Delta+1)|\mathcal{D}_{t-1}|+2L}{n}}\right]=\frac{(\Delta+1)}{n}\sum_{t=1}^{T}\mathbb{E}\left[{|\mathcal{D}_{t-1}|}\right]+\frac{2LT}{n}.

Suppose λ2δΔG2\lambda\leq\frac{2-\delta}{\Delta_{G}-2} and ΩΩ\Omega_{\mathcal{I}^{\prime}}\subseteq\Omega_{\mathcal{I}}; the other case follows by symmetry. We claim that

(56)  0tT:𝔼[|𝒟t|]10000ΔLδ.\displaystyle\forall\,0\leq t\leq T:\quad\mathbb{E}\left[{|\mathcal{D}_{t}|}\right]\leq\frac{10000\Delta L}{\delta}.

Combining (55) and (56), we have

𝔼[R𝗀𝗋𝖺𝗉𝗁]100000Δ2LTnδ=O(Δ2LTnδ).\displaystyle\mathbb{E}\left[{R_{\mathsf{graph}}}\right]\leq\frac{100000\Delta^{2}LT}{n\delta}=O\left(\frac{\Delta^{2}LT}{n\delta}\right).

This proves the lemma.

We now prove (56). Let (𝑿t,𝒀t)t0(\bm{X}_{t},\bm{Y}_{t})_{t\geq 0} be the one-step local coupling for updating edges (Definition 6.9). We claim the following result

(57) σΩ,τΩΩ,𝔼[ρ(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ](1δ96n)ρ(σ,τ)+48ΔLn,\displaystyle\forall\,\sigma\in\Omega_{\mathcal{I}},\tau\in\Omega_{\mathcal{I}^{\prime}}\subseteq\Omega_{\mathcal{I}},\mathbb{E}\left[{\,\rho_{\mathcal{I}}(\bm{X}_{t},\bm{Y}_{t})\mid\bm{X}_{t-1}=\sigma\land\bm{Y}_{t-1}=\tau\,}\right]\leq\left(1-\frac{\delta}{96n}\right)\cdot\rho_{\mathcal{I}}(\sigma,\tau)+\frac{48\Delta L}{n},

where ρ\rho_{\mathcal{I}} is the potential function in Lemma 7.2. Assume (57) holds. Since 𝑿0=𝒀0={0}V\bm{X}_{0}=\bm{Y}_{0}=\{0\}^{V} and ΩΩ\Omega_{\mathcal{I}^{\prime}}\subseteq\Omega_{\mathcal{I}}, we must have 𝑿t1,𝒀t1Ω\bm{X}_{t-1},\bm{Y}_{t-1}\in\Omega_{\mathcal{I}}. Taking expectation over 𝑿t1\bm{X}_{t-1} and 𝒀t1\bm{Y}_{t-1}, we have

(58) 𝔼[ρ(𝑿t,𝒀t)](1δ96n)𝔼[ρ(𝑿t1,𝒀t1)]+48ΔLn.\displaystyle\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}_{t},\bm{Y}_{t})}\right]\leq\left(1-\frac{\delta}{96n}\right)\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}_{t-1},\bm{Y}_{t-1})}\right]+\frac{48\Delta L}{n}.

Since 𝑿0=𝒀0\bm{X}_{0}=\bm{Y}_{0}, we have

(59) ρ(𝑿0,𝒀0)=0.\displaystyle\rho_{\mathcal{I}}(\bm{X}_{0},\bm{Y}_{0})=0.

Combining (58), (59) and the upper bound on Hamming distance in Lemma 7.2 implies

 0tT:𝔼[|𝒟t|]𝔼[ρ(𝑿t,𝒀t)]10000ΔLδ.\displaystyle\forall\,0\leq t\leq T:\quad\mathbb{E}\left[{\left|\mathcal{D}_{t}\right|}\right]\leq\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}_{t},\bm{Y}_{t})}\right]\leq\frac{10000\Delta L}{\delta}.

This proves the claim in (56).

We finish the proof by proving the claim in (57). Let (𝑿t,𝒀t)t0(\bm{X}^{\prime}_{t},\bm{Y}^{\prime}_{t})_{t\geq 0} be the one-step optimal coupling for Gibbs sampling on instance \mathcal{I} (Definition 4.2). Since \mathcal{I} satisfies λ2δΔG2\lambda\leq\frac{2-\delta}{\Delta_{G}-2}, by Lemma 7.2, we have

(60) σ,τΩ:𝔼[ρ(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ](1δ96n)ρ(σ,τ).\displaystyle\forall\,\sigma,\tau\in\Omega_{\mathcal{I}}:\quad\mathbb{E}\left[{\,\rho_{\mathcal{I}}(\bm{X}^{\prime}_{t},\bm{Y}^{\prime}_{t})\mid\bm{X}^{\prime}_{t-1}=\sigma\land\bm{Y}^{\prime}_{t-1}=\tau\,}\right]\leq\left(1-\frac{\delta}{96n}\right)\cdot\rho_{\mathcal{I}}(\sigma,\tau).

According to the coupling, we can rewrite the expectation in (60) as follows:

(61) 𝔼[ρ(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ]=1nvV𝔼[ρ(σvCvX,τvCvY)],\displaystyle\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}^{\prime}_{t},\bm{Y}^{\prime}_{t})\mid\bm{X}^{\prime}_{t-1}=\sigma\land\bm{Y}^{\prime}_{t-1}=\tau}\right]=\frac{1}{n}\sum_{v\in V}\mathbb{E}\left[{\rho_{\mathcal{I}}\left(\sigma^{v\leftarrow C_{v}^{X^{\prime}}},\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\right)}\right],

where (CvX,CvY)D𝗈𝗉𝗍,vσ,τ(C^{X^{\prime}}_{v},C^{Y^{\prime}}_{v})\sim D^{\sigma,\tau}_{\mathsf{opt},\mathcal{I}_{v}}, the distribution D𝗈𝗉𝗍,vσ,τD^{\sigma,\tau}_{\mathsf{opt},\mathcal{I}_{v}} is the optimal coupling between μv,(σ)\mu_{v,\mathcal{I}}(\cdot\mid\sigma) and μv,(τ)\mu_{v,\mathcal{I}}(\cdot\mid\tau), and the configuration σvCvXQV\sigma^{v\leftarrow C_{v}^{X^{\prime}}}\in Q^{V} is defined as

σvCvX(u){CvXif u=vσ(u)if uv\displaystyle\sigma^{v\leftarrow C_{v}^{X^{\prime}}}(u)\triangleq\begin{cases}C^{X^{\prime}}_{v}&\text{if }u=v\\ \sigma(u)&\text{if }u\neq v\end{cases}

and the configuration τvCvYQV\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\in Q^{V} is defined in a similar way.
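
For intuition, sampling a pair from such an optimal (maximal) coupling of two distributions over a finite domain can be done as in the following sketch, which attains Pr[C_X ≠ C_Y] equal to the total variation distance between the two marginals; the construction is the standard maximal coupling and the helper names are ours, not code from the paper.

import random

def sample_from(dist):
    # dist: dict q -> probability.  Basic inverse-transform sampling.
    r, acc = random.random(), 0.0
    for q, p in dist.items():
        acc += p
        if r < acc:
            return q
    return q   # guard against floating-point rounding

def sample_optimal_coupling(mu, nu):
    # mu, nu: dicts over the same finite domain Q.  With probability 1 - d_TV(mu, nu) the two
    # coordinates agree (drawn from the normalized overlap); otherwise they are drawn from the
    # normalized residuals, whose supports are disjoint, so Pr[C_X != C_Y] = d_TV(mu, nu).
    overlap = {q: min(mu[q], nu[q]) for q in mu}
    same = sum(overlap.values())
    if random.random() < same:
        q = sample_from({k: v / same for k, v in overlap.items()})
        return q, q
    d = 1.0 - same
    res_mu = {q: (mu[q] - overlap[q]) / d for q in mu}
    res_nu = {q: (nu[q] - overlap[q]) / d for q in nu}
    return sample_from(res_mu), sample_from(res_nu)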

Similarly, we can rewrite the expectation in (57) as follows:

(62) 𝔼[ρ(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ]=1nvV𝔼[ρ(σvCvX,τvCvY)],\displaystyle\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}_{t},\bm{Y}_{t})\mid\bm{X}_{t-1}=\sigma\land\bm{Y}_{t-1}=\tau}\right]=\frac{1}{n}\sum_{v\in V}\mathbb{E}\left[{\rho_{\mathcal{I}}\left(\sigma^{v\leftarrow C_{v}^{X}},\tau^{v\leftarrow C_{v}^{Y}}\right)}\right],

where (CvX,CvY)Dv,vσ,τ(C^{X}_{v},C^{Y}_{v})\sim D_{\mathcal{I}_{v},\mathcal{I}_{v}^{\prime}}^{\sigma,\tau} and Dv,vσ,τD_{\mathcal{I}_{v},\mathcal{I}_{v}^{\prime}}^{\sigma,\tau} is the local coupling defined in (21).

The following two properties hold for (61) and (62).

  • If v𝒮v\not\in\mathcal{S}, by the definition of Dv,vσ,τ(,)D_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau}(\cdot,\cdot) in (21), it holds that Dv,vσ,τ=D𝗈𝗉𝗍,vσ,τD_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau}=D_{\mathsf{opt},\mathcal{I}_{v}}^{\sigma,\tau}. Hence

    v𝒮:𝔼[ρ(σvCvX,τvCvY)]=𝔼[ρ(σvCvX,τvCvY)].\displaystyle\forall v\not\in\mathcal{S}:\quad\mathbb{E}\left[{\rho_{\mathcal{I}}\left(\sigma^{v\leftarrow C_{v}^{X^{\prime}}},\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\right)}\right]=\mathbb{E}\left[{\rho_{\mathcal{I}}\left(\sigma^{v\leftarrow C_{v}^{X}},\tau^{v\leftarrow C_{v}^{Y}}\right)}\right].
  • If v𝒮v\in\mathcal{S}, then it holds that H(σvCvX,σvCvX)1H(\sigma^{v\leftarrow C_{v}^{X}},\sigma^{v\leftarrow C_{v}^{X^{\prime}}})\leq 1 and H(τvCvY,τvCvY)1H(\tau^{v\leftarrow C_{v}^{Y^{\prime}}},\tau^{v\leftarrow C_{v}^{Y}})\leq 1, where HH is the Hamming distance. Since ΩΩ\Omega_{\mathcal{I}^{\prime}}\subseteq\Omega_{\mathcal{I}}, it holds that σvCvX,σvCvX,τvCvY,τvCvYΩ\sigma^{v\leftarrow C_{v}^{X^{\prime}}},\sigma^{v\leftarrow C_{v}^{X}},\tau^{v\leftarrow C_{v}^{Y}},\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\in\Omega_{\mathcal{I}}. Since the function ρ\rho_{\mathcal{I}} is 12Δ12\Delta-Lipschitz, we have

    v𝒮:𝔼[ρ(σvCvX,τvCvY)]𝔼[ρ(σvCvX,τvCvY)]+24Δ.\displaystyle\forall v\in\mathcal{S}:\quad\mathbb{E}\left[{\rho_{\mathcal{I}}\left(\sigma^{v\leftarrow C_{v}^{X}},\tau^{v\leftarrow C_{v}^{Y}}\right)}\right]\leq\mathbb{E}\left[{\rho_{\mathcal{I}}\left(\sigma^{v\leftarrow C_{v}^{X^{\prime}}},\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\right)}\right]+24\Delta.

Combining the above two properties with (60), (61) and (62), we have, for any σΩ\sigma\in\Omega_{\mathcal{I}} and τΩ\tau\in\Omega_{\mathcal{I}^{\prime}},

𝔼[ρ(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ]\displaystyle\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}_{t},\bm{Y}_{t})\mid\bm{X}_{t-1}=\sigma\land\bm{Y}_{t-1}=\tau}\right]
=\displaystyle= 1nvV𝔼[ρ(σvCvX,τvCvY)]\displaystyle\,\frac{1}{n}\sum_{v\in V}\mathbb{E}\left[{\rho_{\mathcal{I}}\left(\sigma^{v\leftarrow C_{v}^{X}},\tau^{v\leftarrow C_{v}^{Y}}\right)}\right]
\displaystyle\leq 1nv𝒮𝔼[ρ(σvCvX,τvCvY)]+1nv𝒮(𝔼[ρ(σvCvX,τvCvY)]+24Δ)\displaystyle\,\frac{1}{n}\sum_{v\not\in\mathcal{S}}\mathbb{E}\left[{\rho_{\mathcal{I}}\left(\sigma^{v\leftarrow C_{v}^{X^{\prime}}},\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\right)}\right]+\frac{1}{n}\sum_{v\in\mathcal{S}}\left(\mathbb{E}\left[{\rho_{\mathcal{I}}\left(\sigma^{v\leftarrow C_{v}^{X^{\prime}}},\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\right)}\right]+24\Delta\right)
()\displaystyle(\ast)\quad\leq 𝔼[ρ(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ]+48LΔn\displaystyle\,\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}^{\prime}_{t},\bm{Y}^{\prime}_{t})\mid\bm{X}^{\prime}_{t-1}=\sigma\land\bm{Y}^{\prime}_{t-1}=\tau}\right]+\frac{48L\Delta}{n}
\displaystyle\leq (1δ96n)ρ(σ,τ)+48LΔn,\displaystyle\,\left(1-\frac{\delta}{96n}\right)\cdot\rho_{\mathcal{I}}(\sigma,\tau)+\frac{48L\Delta}{n},

where ()(\ast) holds due to |𝒮|2L|\mathcal{S}|\leq 2L. This proves the claim in (57). ∎

Finally, we prove Lemma 7.2. This proof is based on the coupling technique in [37].

Proof of Lemma 7.2.

We give a potential function ρ\rho_{\mathcal{I}} for the hardcore instance \mathcal{I}. We mainly use Vigoda’s potential function from [37]; however, we slightly modify it to handle isolated vertices.

Recall that for the hardcore model, Q={0,1}Q=\{0,1\}. For any σQV\sigma\in Q^{V}, σ(v)=1\sigma(v)=1 indicates that vv is occupied and σ(v)=0\sigma(v)=0 indicates that vv is unoccupied. For each vertex vVv\in V, we use deg(v)\mathrm{deg}(v) to denote the degree of vv in graph G=(V,E)G=(V,E). We divide the graph G=(V,E)G=(V,E) into two graphs G1=(V1,E1)G_{1}=(V_{1},E_{1}) and G2=(V2,E2)G_{2}=(V_{2},E_{2}) such that

V1={vVdeg(v)=0},E1=,\displaystyle V_{1}=\{v\in V\mid\deg(v)=0\},\quad E_{1}=\varnothing,
V2=VV1,E2=E.\displaystyle V_{2}=V\setminus V_{1},\quad E_{2}=E.

Thus G1G_{1} is an empty graph and G2G_{2} contains no isolated vertex. The potential function ρ\rho_{\mathcal{I}} is defined as

σ,τΩ:ρ(σ,τ)4ρ1(σ(V1),τ(V1))+4ρ2(σ(V2),τ(V2)).\displaystyle\forall\sigma,\tau\in\Omega_{\mathcal{I}}:\quad\rho_{\mathcal{I}}(\sigma,\tau)\triangleq 4\rho_{1}(\sigma(V_{1}),\tau(V_{1}))+4\rho_{2}(\sigma(V_{2}),\tau(V_{2})).

Here, ρ1\rho_{1} is the potential function on G1G_{1}, which is the Hamming distance:

ρ1(σ(V1),τ(V1))=vV1𝟏[σ(v)τ(v)].\displaystyle\rho_{1}(\sigma(V_{1}),\tau(V_{1}))=\sum_{v\in V_{1}}\mathbf{1}\left[\sigma(v)\neq\tau(v)\right].

And ρ2(σ(V2),τ(V2))\rho_{2}(\sigma(V_{2}),\tau(V_{2})) is the Vigoda’s potential function [37] on the graph G2G_{2}. Formally, let D={vV2σ(v)τ(v)}D=\{v\in V_{2}\mid\sigma(v)\neq\tau(v)\}. For each vV2v\in V_{2}, let dv=|DΓG2(v)|d_{v}=|D\cap\Gamma_{G_{2}}(v)|. Let c=ΔλΔλ+2c=\frac{\Delta\lambda}{\Delta\lambda+2}, where Δ\Delta is the maximum degree of graph GG. Note that the maximum degree of graph G2G_{2} is also Δ\Delta. The potential function ρ2(σ(V2),τ(V2))\rho_{2}(\sigma(V_{2}),\tau(V_{2})) is defined as

αv={deg(v)if vD0otherwise;βv={cdvif wΓG2(v) such that σ(w)=τ(w)=1c(dv1)if there is no such w and dv>1 0otherwise;\displaystyle\alpha_{v}=\begin{cases}\deg(v)&\text{if }v\in D\\ 0&\text{otherwise};\end{cases}\quad\beta_{v}=\begin{cases}-cd_{v}&\text{if $\exists\,w\in\Gamma_{G_{2}}(v)$ such that $\sigma(w)=\tau(w)=1$}\\ -c(d_{v}-1)&\text{if there is no such $w$ and $d_{v}>1$ }\\ 0&\text{otherwise};\end{cases}
ρ2(σ(V2),τ(V2))=vV2(αv+βv).\displaystyle\rho_{2}(\sigma(V_{2}),\tau(V_{2}))=\sum_{v\in V_{2}}(\alpha_{v}+\beta_{v}).
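
For concreteness, the potential ρ_I can be evaluated directly from this definition, as in the following sketch; the dictionary-based representation and the function name are ours.

def rho_I(sigma, tau, G, lam, Delta):
    # sigma, tau: dicts vertex -> {0,1}; G: dict vertex -> set of neighbors; Delta: max degree of G.
    V1 = {v for v in G if not G[v]}                  # isolated vertices (graph G_1)
    V2 = set(G) - V1                                 # the remaining vertices (graph G_2, all edges of G)
    rho1 = sum(sigma[v] != tau[v] for v in V1)       # Hamming distance on V_1
    D = {v for v in V2 if sigma[v] != tau[v]}
    c = Delta * lam / (Delta * lam + 2.0)
    rho2 = 0.0
    for v in V2:
        d_v = len(D & G[v])
        alpha = len(G[v]) if v in D else 0           # alpha_v = deg(v) exactly when v is in D
        if any(sigma[w] == 1 and tau[w] == 1 for w in G[v]):
            beta = -c * d_v
        elif d_v > 1:
            beta = -c * (d_v - 1)
        else:
            beta = 0.0
        rho2 += alpha + beta
    return 4 * rho1 + 4 * rho2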

It is easy to see that ρ(σ,σ)=0\rho_{\mathcal{I}}(\sigma,\sigma)=0 and maxσ,τΩρ(σ,τ)Δn\max_{\sigma,\tau\in\Omega_{\mathcal{I}}}\rho_{\mathcal{I}}(\sigma,\tau)\leq\Delta n. We then verify the other properties of ρ\rho_{\mathcal{I}}.

First, we prove the upper bound on Hamming distance. For the function ρ1\rho_{1}, it holds that

ρ1(σ(V1),τ(V1))=H(σ(V1),τ(V1)).\displaystyle\rho_{1}(\sigma(V_{1}),\tau(V_{1}))=H(\sigma(V_{1}),\tau(V_{1})).

For function ρ2\rho_{2}, it holds that

ρ2(σ(V2),τ(V2))=vV2(αv+βv)=vDαv+vV2βvvDwΓG2(v)(1c),\displaystyle\rho_{2}(\sigma(V_{2}),\tau(V_{2}))=\sum_{v\in V_{2}}(\alpha_{v}+\beta_{v})=\sum_{v\in D}\alpha_{v}+\sum_{v\in V_{2}}\beta_{v}\geq\sum_{v\in D}\sum_{w\in\Gamma_{G_{2}}(v)}(1-c),

where the last inequality holds because vV2βvvV2cdv=cvDdeg(v)\sum_{v\in V_{2}}\beta_{v}\geq-\sum_{v\in V_{2}}cd_{v}=-c\sum_{v\in D}\deg(v). Since the graph G2G_{2} contains no isolated vertex, we have |ΓG2(v)|=deg(v)1|\Gamma_{G_{2}}(v)|=\deg(v)\geq 1 for all vDv\in D. Note that c<1c<1. Thus

ρ2(σ(V2),τ(V2))|D|(1c)=|D|2Δλ+2|D|4=14H(σ(V2),τ(V2)),\displaystyle\rho_{2}(\sigma(V_{2}),\tau(V_{2}))\geq|D|(1-c)=|D|\frac{2}{\Delta\lambda+2}\geq\frac{|D|}{4}=\frac{1}{4}H(\sigma(V_{2}),\tau(V_{2})),

where 2λΔ+214\frac{2}{\lambda\Delta+2}\geq\frac{1}{4} holds because λ<2Δ2\lambda<\frac{2}{\Delta-2} and Δ3\Delta\geq 3. Combining the two bounds, we have

ρ(σ,τ)=4ρ1(σ(V1),τ(V1))+4ρ2(σ(V2),τ(V2))H(σ,τ).\displaystyle\rho_{\mathcal{I}}(\sigma,\tau)=4\rho_{1}(\sigma(V_{1}),\tau(V_{1}))+4\rho_{2}(\sigma(V_{2}),\tau(V_{2}))\geq H(\sigma,\tau).

This also implies ρ(σ,τ)𝟏[στ]\rho_{\mathcal{I}}(\sigma,\tau)\geq\mathbf{1}\left[\sigma\neq\tau\right].

Next, we show the function ρ\rho_{\mathcal{I}} is 12Δ12\Delta-Lipschitz. Recall V1V2=V_{1}\cap V_{2}=\varnothing, V1V2=VV_{1}\cup V_{2}=V and

ρ(σ,τ)=4ρ1(σ(V1),τ(V1))+4ρ2(σ(V2),τ(V2)).\displaystyle\rho_{\mathcal{I}}(\sigma,\tau)=4\rho_{1}(\sigma(V_{1}),\tau(V_{1}))+4\rho_{2}(\sigma(V_{2}),\tau(V_{2})).

Since ρ1\rho_{1} is the Hamming distance, it is easy to see that ρ1\rho_{1} is 11-Lipschitz. To bound the Lipschitz constant of ρ2\rho_{2}, we extend the function ρ2\rho_{2} as follows: the defining formula of ρ2\rho_{2} is applied to all pairs in QV2×QV2Q^{V_{2}}\times Q^{V_{2}}, where Q={0,1}Q=\{0,1\}, not only to feasible pairs. For any x,y,x,yQV2x,y,x^{\prime},y^{\prime}\in Q^{V_{2}} such that H(xy,xy)=1H(xy,x^{\prime}y^{\prime})=1, it is easy to verify that the extended function ρ2\rho_{2} satisfies

|ρ2(x,y)ρ2(x,y)|3Δ.\displaystyle|\rho_{2}(x,y)-\rho_{2}(x^{\prime},y^{\prime})|\leq 3\Delta.

This implies the original function ρ2\rho_{2} is 3Δ3\Delta-Lipschitz. Hence, the function ρ\rho_{\mathcal{I}} is 12Δ12\Delta-Lipschitz.

Finally, we prove the step-wise decay property. Let (𝑿t(1))t0,(𝒀t(1))t0(\bm{X}_{t}^{(1)})_{t\geq 0},(\bm{Y}_{t}^{(1)})_{t\geq 0} be the Gibbs sampling chains for the hardcore model on graph G1G_{1}. Since G1G_{1} consists of isolated vertices, the one-step optimal coupling (𝑿t(1),𝒀t(1))t0(\bm{X}_{t}^{(1)},\bm{Y}_{t}^{(1)})_{t\geq 0} satisfies

σ,τΩ:𝔼[ρ1(𝑿t(1),𝒀t(1))𝑿t1(1)=σ(V1)𝒀t1(1)=τ(V1)](11|V1|)ρ1(σ(V1),τ(V1)).\displaystyle\forall\sigma,\tau\in\Omega_{\mathcal{I}}:\,\mathbb{E}\left[{\rho_{1}\left(\bm{X}^{(1)}_{t},\bm{Y}^{(1)}_{t}\right)\mid\bm{X}^{(1)}_{t-1}=\sigma(V_{1})\land\bm{Y}^{(1)}_{t-1}=\tau(V_{1})}\right]\leq\left(1-\frac{1}{|V_{1}|}\right)\rho_{1}(\sigma(V_{1}),\tau(V_{1})).

Let (𝑿t(2))t0,(𝒀t(2))t0(\bm{X}_{t}^{(2)})_{t\geq 0},(\bm{Y}_{t}^{(2)})_{t\geq 0} be the Gibbs sampling chains for the hardcore model on graph G2G_{2}. If λ2δΔ2=2(1δ/2)Δ2\lambda\leq\frac{2-\delta}{\Delta-2}=\frac{2(1-\delta/2)}{\Delta-2}, then by Vigoda’s proof (it can be verified that in Vigoda’s proof [37], the Markov chain for sampling the hardcore model is indeed the Gibbs sampling, and the coupling used in the analysis is indeed the one-step optimal coupling for Gibbs sampling), the one-step optimal coupling (𝑿t(2),𝒀t(2))t0(\bm{X}_{t}^{(2)},\bm{Y}_{t}^{(2)})_{t\geq 0} satisfies:

σ,τΩ:𝔼[ρ2(𝑿t(2),𝒀t(2))𝑿t1(2)=σ(V2)𝒀t1(2)=τ(V2)](1δ96|V2|)ρ2(σ(V2),τ(V2)).\displaystyle\forall\sigma,\tau\in\Omega_{\mathcal{I}}:\,\mathbb{E}\left[{\rho_{2}\left(\bm{X}^{(2)}_{t},\bm{Y}^{(2)}_{t}\right)\mid\bm{X}^{(2)}_{t-1}=\sigma(V_{2})\land\bm{Y}^{(2)}_{t-1}=\tau(V_{2})}\right]\leq\left(1-\frac{\delta}{96|V_{2}|}\right)\rho_{2}(\sigma(V_{2}),\tau(V_{2})).

Let (𝑿t)t0,(𝒀t)t0(\bm{X}_{t})_{t\geq 0},(\bm{Y}_{t})_{t\geq 0} be the Gibbs sampling chains for the hardcore model on graph GG. If λ2δΔ2\lambda\leq\frac{2-\delta}{\Delta-2}, then the one-step optimal coupling (𝑿t,𝒀t)t0(\bm{X}_{t},\bm{Y}_{t})_{t\geq 0} satisfies:

σ,τΩ:\displaystyle\forall\sigma,\tau\in\Omega_{\mathcal{I}}:\quad 𝔼[ρ(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ]\displaystyle\mathbb{E}\left[{\rho_{\mathcal{I}}\left(\bm{X}_{t},\bm{Y}_{t}\right)\mid\bm{X}_{t-1}=\sigma\land\bm{Y}_{t-1}=\tau}\right]
\displaystyle\leq |V1|n((11|V1|)4ρ1(σ(V1),τ(V1))+4ρ2(σ(V2),τ(V2)))\,\frac{|V_{1}|}{n}\left(\left(1-\frac{1}{|V_{1}|}\right)4\rho_{1}(\sigma(V_{1}),\tau(V_{1}))+4\rho_{2}(\sigma(V_{2}),\tau(V_{2}))\right)
+|V2|n(4ρ1(σ(V1),τ(V1))+(1δ96|V2|)4ρ2(σ(V2),τ(V2)))\displaystyle+\frac{|V_{2}|}{n}\left(4\rho_{1}(\sigma(V_{1}),\tau(V_{1}))+\left(1-\frac{\delta}{96|V_{2}|}\right)4\rho_{2}(\sigma(V_{2}),\tau(V_{2}))\right)
\displaystyle\leq (1min{δ/96,1}n)ρ(σ,τ).\displaystyle\,\left(1-\frac{\min\{\delta/96,1\}}{n}\right)\rho_{\mathcal{I}}(\sigma,\tau).

Thus, since δ<2\delta<2 and hence min{δ/96,1}=δ/96\min\{\delta/96,1\}=\delta/96, the potential function ρ\rho_{\mathcal{I}} satisfies the step-wise decay property:

σ,τΩ:\displaystyle\forall\sigma,\tau\in\Omega_{\mathcal{I}}:\quad 𝔼[ρ(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ](1δ/96n)ρ(σ,τ).\displaystyle\mathbb{E}\left[{\rho_{\mathcal{I}}\left(\bm{X}_{t},\bm{Y}_{t}\right)\mid\bm{X}_{t-1}=\sigma\land\bm{Y}_{t-1}=\tau}\right]\leq\left(1-\frac{\delta/96}{n}\right)\rho_{\mathcal{I}}(\sigma,\tau).

This proves the lemma. ∎

8. Proofs for dynamic inference

8.1. Proof of the main theorem

Our dynamic inference algorithm is given as follows. For each MRF instance =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi), where n=|V|n=|V|, our dynamic inference algorithm maintains N(n)N(n) independent samples 𝑿(1),𝑿(2),,𝑿(N(n))QV\bm{X}^{(1)},\bm{X}^{(2)},\ldots,\bm{X}^{(N(n))}\in Q^{V}, each satisfying dTV(μ,𝑿(i))ϵ(n)d_{\mathrm{TV}}\left({\mu_{\mathcal{I}}},{\bm{X}^{(i)}}\right)\leq\epsilon(n), together with the estimator 𝜽^()=(𝑿(1),𝑿(2),,𝑿(N(n)))\hat{\bm{\theta}}(\mathcal{I})=\mathcal{E}(\bm{X}^{(1)},\bm{X}^{(2)},\ldots,\bm{X}^{(N(n))}) for 𝜽()\bm{\theta}(\mathcal{I}). Given an update that modifies \mathcal{I} to =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime}), where n=|V|n^{\prime}=|V^{\prime}|, our algorithm proceeds as follows.

  • Update the sample sequence. Update 𝑿(1),𝑿(2),,𝑿(N(n))\bm{X}^{(1)},\bm{X}^{(2)},\ldots,\bm{X}^{(N(n))} to N(n)N(n^{\prime}) independent random samples 𝒀(1),𝒀(2),,𝒀(N(n))QV\bm{Y}^{(1)},\bm{Y}^{(2)},\ldots,\bm{Y}^{(N(n^{\prime}))}\in Q^{V^{\prime}} such that each dTV(μ,𝒀(i))ϵ(n)d_{\mathrm{TV}}\left({\mu_{\mathcal{I}^{\prime}}},{\bm{Y}^{(i)}}\right)\leq\epsilon(n^{\prime}) and output the difference between two sample sequences.

  • Update the estimator. Given the difference between two sample sequences 𝑿(1),𝑿(2),,𝑿(N(n))\bm{X}^{(1)},\bm{X}^{(2)},\ldots,\bm{X}^{(N(n))} and 𝒀(1),𝒀(2),,𝒀(N(n))\bm{Y}^{(1)},\bm{Y}^{(2)},\ldots,\bm{Y}^{(N(n^{\prime}))}, update 𝜽^()\hat{\bm{\theta}}(\mathcal{I}) to 𝜽^()=𝜽(𝒀(1),𝒀(2),,𝒀(N(n)))\hat{\bm{\theta}}(\mathcal{I}^{\prime})=\mathcal{E}_{\bm{\theta}}(\bm{Y}^{(1)},\bm{Y}^{(2)},\ldots,\bm{Y}^{(N(n^{\prime}))}) using the black-box algorithm in Definition 2.3.

Obviously, 𝜽^()\hat{\bm{\theta}}(\mathcal{I}^{\prime}) is an (N,ϵ)(N,\epsilon)-estimator for 𝜽()\bm{\theta}(\mathcal{I}^{\prime}).
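
As a toy illustration of the second step, suppose the estimator is simply the empirical mean of a statistic f over the N samples (a special case of the black-box estimator of Definition 2.3). The estimator can then be refreshed by touching only the changed samples reported by the dynamic sampling algorithm, which is why its update cost is governed by the size of the difference; the diff format and the function name below are ours.

def update_mean_estimator(total, N, diff, f):
    # total: current sum of f over all N samples; f: the statistic being averaged;
    # diff: list of (index, old_sample, new_sample) triples describing the changed samples.
    for _, old, new in diff:
        total += f(new) - f(old)          # only the modified samples are re-evaluated
    return total, total / N               # updated running sum and updated estimate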

The sample sequence is maintained and updated by the dynamic sampling algorithm in Theorem 6.1. By Theorem 6.1, the space cost for maintaining the sample sequence is O(nN(n)logn)O\left(nN(n)\log n\right) memory words, each of O(logn)O(\log n) bits. By following the proof of Theorem 6.1, it is easy to verify that the expected time cost for each update is O(Δ2LN(n)log3n+Δn)O\left(\Delta^{2}LN(n)\log^{3}n+\Delta n\right).

The estimator is maintained and updated by the black-box algorithm in Definition 2.3. By Lemma 6.19, we have N(n)poly(n)N(n)\leq\mathrm{poly}(n). Combining this with Definition 2.3, the space cost for maintaining the estimator is (nN(n)+K)polylog(n)(n\cdot N(n)+K)\mathrm{polylog}(n) bits. Let 𝒟\mathcal{D} be the size of the difference between the two sample sequences, as defined in (3). We can follow the proof of Theorem 6.1 to bound the expectation of 𝒟\mathcal{D}. Let T=nδlognϵ(n)T=\left\lceil\frac{n}{\delta}\log\frac{n}{\epsilon(n)}\right\rceil and T=nδlognϵ(n)T^{\prime}=\left\lceil\frac{n^{\prime}}{\delta}\log\frac{n^{\prime}}{\epsilon(n^{\prime})}\right\rceil. Since |nn|L=o(n)\left|n-n^{\prime}\right|\leq L=o(n), we have |TT|=O(Llogn)\left|T-T^{\prime}\right|=O(L\log n) (due to Lemma 6.17). Combining (39), (45) and (7) yields

𝔼[𝒟]\displaystyle\mathbb{E}\left[{\mathcal{D}}\right] =|N(n)N(n)|max{n,n}+O(L+|TT|)N(n)=O(LN(n)logn),\displaystyle=\left|N(n)-N(n^{\prime})\right|\cdot\max\{n,n^{\prime}\}+O\left(L+\left|T-T^{\prime}\right|\right)\cdot N(n)=O(LN(n)\log n),

where the last equality holds because |N(n)N(n)|max{n,n}=O(LN(n))\left|N(n)-N(n^{\prime})\right|\cdot\max\{n,n^{\prime}\}=O\left(LN(n)\right) (due to Lemma 6.19). Combining this with Definition 2.3, the expected time cost for updating the estimator is LN(n)polylog(n)LN(n)\mathrm{polylog}(n).

In summary, our dynamic inference algorithm maintains an estimator for the current MRF instance \mathcal{I}, using extra O~(nN(n)+K)\widetilde{O}\left(nN(n)+K\right) memory words, each of O(logn)O(\log n) bits, such that when \mathcal{I} is updated to \mathcal{I}^{\prime}, the algorithm updates the estimator within expected time cost

𝔼[T𝖼𝗈𝗌𝗍]\displaystyle\mathbb{E}\left[{T_{\mathsf{cost}}}\right] =𝔼[T𝗌𝖺𝗆𝗉𝗅𝖾]+𝔼[T𝖾𝗌𝗍𝗂𝗆𝖺𝗍𝗈𝗋]\displaystyle=\mathbb{E}\left[{T_{\mathsf{sample}}}\right]+\mathbb{E}\left[{T_{\mathsf{estimator}}}\right]
=O(Δ2LN(n)log3n+Δn)+LN(n)polylog(n)\displaystyle=O\left(\Delta^{2}LN(n)\log^{3}n+\Delta n\right)+LN(n)\mathrm{polylog}(n)
=O~(Δ2LN(n)+Δn).\displaystyle=\widetilde{O}\left(\Delta^{2}LN(n)+\Delta n\right).

8.2. Dynamic inference on specific models

Applying our dynamic inference algorithm to the Ising model, qq-coloring, and the hardcore model yields the following result.

Theorem 8.1.

There exist dynamic inference algorithms as stated in Theorem 3.2 with the same space cost O~(nN(n)+K)\widetilde{O}\left(nN(n)+K\right), and expected time cost O~(Δ2LN(n)+Δn)\widetilde{O}\left(\Delta^{2}LN(n)+\Delta n\right) for each update, if the input instance \mathcal{I} with nn vertices and the updated instance \mathcal{I}^{\prime} with d(,)L=o(n)d(\mathcal{I},\mathcal{I}^{\prime})\leq L=o(n) are both:

  • Ising models with temperature β\beta and arbitrary local fields where exp(2|β|)12δΔ+1\exp(-2|\beta|)\geq 1-\frac{2-\delta}{\Delta+1};

  • proper qq-colorings with q(2+δ)Δq\geq(2+\delta)\Delta;

  • hardcore models with fugacity λ2δΔ2\lambda\leq\frac{2-\delta}{\Delta-2}, but with an alternative time cost for each update

    O~(Δ3LN(n)+Δn),\displaystyle\widetilde{O}\left(\Delta^{3}LN(n)+\Delta n\right),

where δ>0\delta>0 is a constant, Δ=max{ΔG,ΔG}\Delta=\max\{\Delta_{G},\Delta_{G^{\prime}}\}, ΔG\Delta_{G} and ΔG\Delta_{G^{\prime}} denote the maximum degree of the input graph and updated graph respectively.

With the dynamic sampling algorithms in Theorem 7.1, Theorem 8.1 can be proved by going through the same proof as in Section 8.1.

9. Conclusion

In this paper we study the probabilistic inference problem in a graphical model when the model itself is changing dynamically with time. We allow non-local updates, so that two consecutive graphical models may differ everywhere as long as the total amount of their difference is bounded. This general setting covers many typical applications. We give a sampling-based dynamic inference algorithm that efficiently maintains an inference solution against such dynamic inputs. The algorithm significantly improves the time cost compared to the static sampling-based inference algorithm.

Our algorithm generically reduces the dynamic inference problem to the dynamic sampling problem. Our main technical contribution is a dynamic Gibbs sampling algorithm that maintains random samples for graphical models dynamically changed by non-local updates. This technique extends to all single-site dynamics, which gives us a systematic approach for transforming classic MCMC samplers on static inputs into sampling and inference algorithms in the dynamic setting. Our dynamic algorithms are efficient as long as the one-step optimal coupling exhibits step-wise decay, a key property that has been widely used to support efficient MCMC sampling in the classic static setting and is captured by the Dobrushin-Shlosman condition.

Our result is the first to show the possibility of efficient probabilistic inference in dynamically changing graphical models, especially when the graphical models are changed by non-local updates. Our dynamic inference algorithm has the potential to speed up iterative algorithms for learning graphical models, which deserves further theoretical and experimental research. In this paper, we focus on discrete graphical models and sampling-based inference algorithms. Important future directions include considering more general distributions and dynamic algorithms based on other inference techniques.

References

  • ADK+ [16] Ittai Abraham, David Durfee, Ioannis Koutis, Sebastian Krinninger, and Richard Peng. On fully dynamic graph sparsifiers. In FOCS, 2016.
  • AQ+ [17] Osvaldo Anacleto, Catriona Queen, et al. Dynamic chain graph models for time series network data. Bayesian Anal., 12(2):491–509, 2017.
  • BC [16] Aaron Bernstein and Shiri Chechik. Deterministic decremental single source shortest paths: beyond the o(mn)o(mn) bound. In STOC, 2016.
  • BD [97] Russ Bubley and Martin Dyer. Path coupling: A technique for proving rapid mixing in Markov chains. In FOCS, 1997.
  • CLRS [09] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to algorithms. MIT press, 2009.
  • CW [07] Carlos M. Carvalho and Mike West. Dynamic matrix-variate graphical models. Bayesian Anal., 2(1):69–97, 2007.
  • DG [00] Martin Dyer and Catherine Greenhill. On Markov chains for independent sets. J. Algorithms, 35(1):17–49, 2000.
  • DGGP [18] David Durfee, Yu Gao, Gramoz Goranci, and Richard Peng. Fully dynamic effective resistances. arXiv preprint arXiv:1804.04038, 2018.
  • DGGP [19] David Durfee, Yu Gao, Gramoz Goranci, and Richard Peng. Fully dynamic spectral vertex sparsifiers and applications. In STOC, 2019.
  • DGJ [08] Martin Dyer, Leslie Ann Goldberg, and Mark Jerrum. Dobrushin conditions and systematic scan. Combin. Probab. Comput., 17(6):761–779, 2008.
  • [11] Roland L Dobrushin and Senya B Shlosman. Completely analytical Gibbs fields. In Statistical Physics and Dynamical Systems, pages 371–403. Springer, 1985.
  • [12] Roland Lvovich Dobrushin and Senya B Shlosman. Constructive criterion for the uniqueness of Gibbs field. In Statistical Physics and Dynamical Systems, pages 347–370. Springer, 1985.
  • DS [87] RL Dobrushin and SB Shlosman. Completely analytical interactions: constructive description. J. Statist. Phys., 46(5-6):983–1014, 1987.
  • DSOR [16] Christopher De Sa, Kunle Olukotun, and Christopher Ré. Ensuring rapid mixing and low bias for asynchronous Gibbs sampling. In ICML, 2016.
  • FG [19] Sebastian Forster and Gramoz Goranci. Dynamic low-stretch trees via dynamic low-diameter decompositions. In STOC, pages 377–388, 2019.
  • FVY [19] Weiming Feng, Nisheeth K Vishnoi, and Yitong Yin. Dynamic sampling from graphical models. In STOC, 2019.
  • GHP [18] Gramoz Goranci, Monika Henzinger, and Pan Peng. Dynamic Effective Resistances and Approximate Schur Complement on Separable Graphs. In ESA, volume 112, 2018.
  • GŠV [15] Andreas Galanis, Daniel Štefankovič, and Eric Vigoda. Inapproximability for antiferromagnetic spin systems in the tree nonuniqueness region. J. ACM, 62(6):50, 2015.
  • GŠV [16] Andreas Galanis, Daniel Štefankovič, and Eric Vigoda. Inapproximability of the partition function for the antiferromagnetic Ising and hard-core models. Combin. Probab. Comput., 25(04):500–559, 2016.
  • Hay [06] Thomas P Hayes. A simple condition implying rapid mixing of single-site dynamics on spin systems. In FOCS, 2006.
  • Hin [12] Geoffrey E Hinton. A practical guide to training restricted boltzmann machines. In Neural Networks: Tricks of the Trade, pages 599–619. Springer, 2012.
  • HKN [14] Monika Henzinger, Sebastian Krinninger, and Danupon Nanongkai. Decremental single-source shortest paths on undirected graphs in near-linear total update time. In FOCS, 2014.
  • HKN [16] Monika Henzinger, Sebastian Krinninger, and Danupon Nanongkai. Dynamic approximate all-pairs shortest paths: Breaking the O(mn)O(mn) barrier and derandomization. SIAM J. Comput., 45(3):947–1006, 2016.
  • Jer [95] Mark Jerrum. A very simple algorithm for estimating the number of kk-colorings of a low-degree graph. Random Structures & Algorithms, 7(2):157–165, 1995.
  • JVV [86] Mark Jerrum, Leslie G. Valiant, and Vijay V. Vazirani. Random generation of combinatorial structures from a uniform distribution. Theoret. Comput. Sci., 43:169–188, 1986.
  • KFB [09] Daphne Koller, Nir Friedman, and Francis Bach. Probabilistic graphical models: principles and techniques. MIT press, 2009.
  • LMV [19] Holden Lee, Oren Mangoubi, and Nisheeth Vishnoi. Online sampling from log-concave distributions. In NIPS, 2019.
  • LP [17] David A Levin and Yuval Peres. Markov chains and mixing times. American Mathematical Soc., 2017.
  • LV [99] Michael Luby and Eric Vigoda. Fast convergence of the Glauber dynamics for sampling independent sets. Random Structures & Algorithms, 15(3-4):229–241, 1999.
  • MM [09] Marc Mezard and Andrea Montanari. Information, physics, and computation. Oxford University Press, 2009.
  • NR [17] Hariharan Narayanan and Alexander Rakhlin. Efficient sampling from time-varying log-concave distributions. J. Mach. Learn. Res., 18(1):4017–4045, 2017.
  • NSWN [17] Danupon Nanongkai, Thatchaphol Saranurak, and Christian Wulff-Nilsen. Dynamic minimum spanning forest with subpolynomial worst-case update time. In FOCS, 2017.
  • QS [93] Catriona M. Queen and Jim Q. Smith. Multiregression dynamic models. J. Roy. Statist. Soc. Ser. B, 55(4):849–870, 1993.
  • RKD+ [19] Cedric Renggli, Bojan Karlaš, Bolin Ding, Feng Liu, Kevin Schawinski, Wentao Wu, and Ce Zhang. Continuous integration of machine learning models: A rigorous yet practical treatment. In SysML, 2019.
  • ŠVV [09] Daniel Štefankovič, Santosh Vempala, and Eric Vigoda. Adaptive simulated annealing: A near-optimal connection between sampling and counting. J. ACM, 56(3):18, 2009.
  • SWA [09] Padhraic Smyth, Max Welling, and Arthur U Asuncion. Asynchronous distributed learning of topic models. In NIPS, 2009.
  • Vig [99] Eric Vigoda. Fast convergence of the Glauber dynamics for sampling independent sets: Part II. Technical Report TR-99-003, International Computer Science Institute, 1999.
  • WJ [08] Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference. Now Publishers Inc, 2008.
  • WN [17] Christian Wulff-Nilsen. Fully-dynamic minimum spanning forest with improved worst-case update time. In STOC, 2017.