
Dynamic inference in probabilistic graphical models

Weiming Feng, Kun He, Xiaoming Sun, and Yitong Yin. Weiming Feng and Yitong Yin: State Key Laboratory for Novel Software Technology, Nanjing University. E-mail: fengwm@smail.nju.edu.cn, yinyt@nju.edu.cn. Kun He: Shenzhen University; Shenzhen Institute of Computing Sciences. E-mail: hekun.threebody@foxmail.com. Xiaoming Sun: CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences. E-mail: sunxiaoming@ict.ac.cn.
Abstract.

Probabilistic graphical models, such as Markov random fields (MRFs), are useful for describing high-dimensional distributions in terms of local dependence structures. Probabilistic inference is a fundamental problem related to graphical models, and sampling is a main approach to it. In this paper, we study probabilistic inference problems when the graphical model itself is changing dynamically with time. Such dynamic inference problems arise naturally in today's applications, e.g. multivariate time-series data analysis and practical learning procedures.

We give a dynamic algorithm for sampling-based probabilistic inference in MRFs, where each dynamic update can change the underlying graph and all parameters of the MRF simultaneously, as long as the total amount of changes is bounded. More precisely, suppose that the MRF has $n$ variables and polylogarithmically bounded maximum degree, and $N(n)$ independent samples are sufficient for the inference for a polynomial function $N(\cdot)$. Our algorithm dynamically maintains an answer to the inference problem using $\widetilde{O}(nN(n))$ space cost and $\widetilde{O}(N(n)+n)$ incremental time cost upon each update to the MRF, as long as the Dobrushin-Shlosman condition is satisfied by the MRFs. This well-known condition has long been used to guarantee the efficiency of Markov chain Monte Carlo (MCMC) sampling in the traditional static setting. Compared to the static case, which requires $\Omega(nN(n))$ time cost for redrawing all $N(n)$ samples whenever the MRF changes, our dynamic algorithm gives an $\widetilde{\Omega}(\min\{n,N(n)\})$-factor speedup. Our approach relies on a novel dynamic sampling technique, which transforms local Markov chains (a.k.a. single-site dynamics) into dynamic sampling algorithms, and an "algorithmic Lipschitz" condition that we establish for sampling from graphical models: when the MRF changes by a small difference, the samples can be modified to reflect the new distribution, with cost proportional to the difference between the MRFs.

Weiming Feng and Yitong Yin are supported by the National Key R&D Program of China 2018YFB1003202 and the National Natural Science Foundation of China under Grant Nos. 61722207 and 61672275. Kun He and Xiaoming Sun are supported by the National Natural Science Foundation of China under Grant Nos. 61832003 and 61433014, and the K.C. Wong Education Foundation.

1. Introduction

Probabilistic graphical models provide a rich language for describing high-dimensional distributions in terms of the dependence structures between random variables. The Markov random field (MRF) is a basic graphical model that encodes pairwise interactions of complex systems. Given a graph $G=(V,E)$, each vertex $v\in V$ is associated with a function $\phi_{v}:Q\to\mathbb{R}$, called the vertex potential, on a finite domain $Q=[q]$ of $q$ spin states, and each edge $e\in E$ is associated with a symmetric function $\phi_{e}:Q^{2}\to\mathbb{R}$, called the edge potential, which describes a pairwise interaction. Together, these induce a probability distribution $\mu$ over all configurations $\sigma\in Q^{V}$:

\mu(\sigma)\propto\exp(H(\sigma))=\exp\Big(\sum_{v\in V}\phi_{v}(\sigma_{v})+\sum_{e=\{u,v\}\in E}\phi_{e}(\sigma_{u},\sigma_{v})\Big).

This distribution $\mu$ is known as the Gibbs distribution and $H(\sigma)$ is the Hamiltonian. It arises naturally from various physical models, statistics or learning problems, and combinatorial problems in computer science [30, 26].

Probabilistic inference is one of the most fundamental computational problems in graphical models. Some basic inference problems ask to calculate the marginal distribution, conditional distribution, or maximum-a-posteriori probabilities of one or several random variables [38]. Sampling is perhaps the most widely used approach for probabilistic inference. Given a graphical model, independent samples are drawn from the Gibbs distribution and certain statistics are computed from the samples to give estimates for the inferred quantity. For most typical inference problems, such statistics are easy to compute once the samples are given; for instance, for estimating the marginal distribution on a variable subset $S$, the statistic is the frequency of each configuration in $Q^{S}$ among the samples. Thus the cost of inference is dominated by the cost of generating random samples [25, 35].

Classic probabilistic inference assumes a static setting, where the input graphical model is fixed. In today's applications, dynamically changing graphical models naturally arise in many scenarios. In various practical algorithms for learning graphical models, e.g. the contrastive divergence algorithm for learning the restricted Boltzmann machine [21] and the iterative proportional fitting algorithm for maximum likelihood estimation of graphical models [38], the optimal model $\mathcal{I}^{*}$ is obtained by updating the parameters of the graphical model iteratively (usually by gradient descent), which generates a sequence of graphical models $\mathcal{I}_{1},\mathcal{I}_{2},\ldots,\mathcal{I}_{M}$, with the goal that $\mathcal{I}_{M}$ is a good approximation of $\mathcal{I}^{*}$. Also, in the study of multivariate time-series data, the dynamic Gaussian graphical model [6], the multiregression dynamic model [33], the dynamic graphical model [16], and dynamic chain graph models [2] are all dynamically changing graphical models and have been used in a variety of applications. Meanwhile, with the advent of Big Data, scalable machine learning systems need to deal with continuously evolving graphical models (see e.g. [34] and [36]).

Theoretical studies of probabilistic inference in dynamically changing graphical models are lacking. In the aforementioned practical scenarios, it is common that a sequence of graphical models is presented over time, where any two consecutive graphical models can differ from each other in all potentials, but by a small total amount. Recomputing the inference from scratch every time the graphical model changes gives the correct solution, but is very wasteful. A fundamental question is whether probabilistic inference can be solved dynamically and efficiently.

In this paper, we study the problem of probabilistic inference in an MRF when the MRF itself is changing dynamically with time. At each step, the whole graphical model, including all vertices and edges as well as their potentials, is subject to changes. Such non-local updates are very general and cover all the applications mentioned above. The dynamic inference problem then asks to maintain a correct answer to the inference in a dynamically changing MRF, with low incremental cost proportional to the amount of changes made to the graphical model at each step.

1.1. Our results

We give a dynamic algorithm for sampling-based probabilistic inference. Given an MRF instance with $n$ vertices, suppose that $N(n)$ independent samples are sufficient to give an approximate solution to the inference problem, where $N:\mathbb{N}^{+}\to\mathbb{N}^{+}$ is a polynomial function. We give dynamic algorithms for general inference problems on dynamically changing MRFs.

Suppose that the current MRF has $n$ vertices and polylogarithmically bounded maximum degree, and each update to the MRF may change the underlying graph and/or all vertex/edge potentials, as long as the total amount of changes is bounded. Our algorithm maintains an approximate solution to the inference with $\widetilde{O}(nN(n))$ space cost, and with $\widetilde{O}(N(n)+n)$ incremental time cost upon each update, assuming that the MRFs satisfy the Dobrushin-Shlosman condition [11, 12, 13]. The condition has been widely used to imply the efficiency of Markov chain Monte Carlo (MCMC) sampling (e.g. see [20, 10]). Compared to the static algorithm, which requires $\Omega(nN(n))$ time for redrawing all $N(n)$ samples each time, our dynamic algorithm significantly improves the time cost with an $\widetilde{\Omega}(\min\{n,N(n)\})$-factor speedup.

On specific models, the Dobrushin-Shlosman condition has been established in the literature, which directly gives us the following efficient dynamic inference algorithms, with $\widetilde{O}\left(nN(n)\right)$ space cost and $\widetilde{O}\left(N(n)+n\right)$ time cost per update, on graphs with $n$ vertices and maximum degree $\Delta=O(1)$:

  • for the Ising model with temperature $\beta$ satisfying $\mathrm{e}^{-2|\beta|}>1-\frac{2}{\Delta+1}$, which is close to the uniqueness threshold $\mathrm{e}^{-2|\beta_{c}|}=1-\frac{2}{\Delta}$, beyond which the static versions of the sampling and marginal inference problems for the anti-ferromagnetic Ising model are intractable [19, 18];

  • for the hardcore model with fugacity $\lambda<\frac{2}{\Delta-2}$, which matches the best known bound for sampling algorithms with near-linear running time on general graphs with bounded maximum degree [37, 29, 7];

  • for proper $q$-coloring with $q>2\Delta$, which matches the best known bound for sampling algorithms with near-linear running time on general graphs with bounded maximum degree [24].

Our dynamic inference algorithm is based on a dynamic sampling algorithm, which efficiently maintains $N(n)$ independent samples for the current MRF while the MRF is subject to changes. More specifically, we give a dynamic version of the Gibbs sampling algorithm, a local Markov chain for sampling from the Gibbs distribution that has been studied extensively. Our techniques are based on: (1) couplings for dynamic instances of graphical models; and (2) dynamic data structures for representing single-site Markov chains so that the couplings can be realized algorithmically in sub-linear time. Both these techniques are of independent interest, and can be naturally extended to more general settings with multi-body interactions.

Our results show that on dynamically changing graphical models, sampling-based probabilistic inference can be solved significantly faster than rerunning the static algorithm every time the model changes. This has practical significance in speeding up the iterative procedures for learning graphical models.

1.2. Related work

The problem of dynamic sampling from graphical models was introduced very recently in [16]. There, a dynamic sampling algorithm was given for graphical models with soft constraints, which can only deal with local updates that change a single vertex or edge at a time. The regimes in which such dynamic sampling algorithms are efficient are much more restrictive than the conditions for the rapid mixing of Markov chains. Our algorithm greatly improves the regimes for efficient dynamic sampling for the Ising and hardcore models in [16], and, for the first time, can handle non-local updates that change all vertex/edge potentials simultaneously. Besides, dynamic/online sampling from log-concave distributions was also studied in [31, 27].

Another related topic is dynamic graph problems, which ask to maintain a solution (e.g. spanners [15, 32, 39] or shortest paths [3, 23, 22]) while the input graph is dynamically changing. More recently, important progress has been made on dynamically maintaining structures related to graph random walks, such as spectral sparsifiers [9, 1] or effective resistances [8, 17]. Instead of one particular solution, dynamic inference problems ask to maintain an estimate of a statistic, and such a statistic comes from an exponential-sized probability space described by a dynamically changing graphical model.

1.3. Organization of the paper.

In Section 2, we formally introduce the dynamic inference problem. In Section 3, we formally state the main results. Preliminaries are given in Section 4. In Section 5, we outline our dynamic inference algorithm. In Section 6, we present the algorithms for dynamic Gibbs sampling. The analyses of these dynamic sampling algorithms are given in Section 7. The proof of the main theorem on dynamic inference is given in Section 8. The conclusion is given in Section 9.

2. Dynamic inference problem

2.1. Markov random fields.

An instance of a Markov random field (MRF) is specified by a tuple $\mathcal{I}=(V,E,Q,\Phi)$, where $G=(V,E)$ is an undirected simple graph; $Q$ is a domain of $q=|Q|$ spin states, for some finite $q>1$; and $\Phi=(\phi_{a})_{a\in V\cup E}$ associates with each $v\in V$ a vertex potential $\phi_{v}:Q\to\mathbb{R}$ and with each $e\in E$ an edge potential $\phi_{e}:Q^{2}\to\mathbb{R}$, where $\phi_{e}$ is symmetric.

A configuration $\sigma\in Q^{V}$ maps each vertex $v\in V$ to a spin state in $Q$, so that each vertex can be interpreted as a variable. The Hamiltonian of a configuration $\sigma\in Q^{V}$ is defined as:

H(\sigma)\triangleq\sum_{v\in V}\phi_{v}(\sigma_{v})+\sum_{e=\{u,v\}\in E}\phi_{e}(\sigma_{u},\sigma_{v}).

This defines the Gibbs distribution $\mu_{\mathcal{I}}$, which is a probability distribution over $Q^{V}$ such that

\forall\sigma\in Q^{V},\quad\mu_{\mathcal{I}}(\sigma)=\frac{1}{Z}\exp(H(\sigma)),

where the normalizing factor $Z\triangleq\sum_{\sigma\in Q^{V}}\exp(H(\sigma))$ is called the partition function.
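For concreteness, here is a minimal Python sketch of how the Hamiltonian $H(\sigma)$ and the unnormalized Gibbs weight $\exp(H(\sigma))$ could be evaluated; the dictionary representation of potentials is our own illustrative choice, not part of the paper's algorithms.

```python
import math

def hamiltonian(V, E, phi_v, phi_e, sigma):
    """Evaluate H(sigma) = sum_v phi_v(sigma_v) + sum_{e={u,v}} phi_e(sigma_u, sigma_v).

    V: iterable of vertices; E: iterable of frozenset({u, v}) edges;
    phi_v[v][c] is the vertex potential and phi_e[e][(a, b)] the symmetric
    edge potential (so the (a, b) and (b, a) entries agree).
    """
    h = sum(phi_v[v][sigma[v]] for v in V)
    for e in E:
        u, v = tuple(e)
        h += phi_e[e][(sigma[u], sigma[v])]
    return h

def gibbs_weight(V, E, phi_v, phi_e, sigma):
    """Unnormalized weight exp(H(sigma)); mu_I(sigma) is this divided by Z."""
    return math.exp(hamiltonian(V, E, phi_v, phi_e, sigma))
```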

The Gibbs measure $\mu(\sigma)$ can be $0$ since the functions $\phi_{v},\phi_{e}$ may take the value $-\infty$. A configuration $\sigma$ is called feasible if $\mu(\sigma)>0$. To make the problem of constructing a feasible configuration trivial, we further assume the following natural condition for the MRF instances considered in this paper:\footnote{This condition guarantees that the marginal probabilities are always well-defined, and that the problem of constructing a feasible configuration $\sigma$, where $\mu_{\mathcal{I}}(\sigma)>0$, is trivial. The condition holds for all MRFs with soft constraints, or with hard constraints where there is a permissive spin, e.g. the hardcore model. For MRFs with truly repulsive hard constraints such as proper $q$-coloring, the condition may translate to the requirement $q\geq\Delta+1$, where $\Delta$ is the maximum degree of the graph $G$, which is necessary for the irreducibility of local Markov chains for $q$-colorings.}

(1) \forall\,v\in V,\,\,\forall\sigma\in Q^{\Gamma_{G}(v)}:\quad\sum_{c\in Q}\exp\left(\phi_{v}(c)+\sum_{u\in\Gamma_{G}(v)}\phi_{uv}(\sigma_{u},c)\right)>0,

where $\Gamma_{G}(v)\triangleq\{u\in V\mid\{u,v\}\in E\}$ denotes the neighborhood of $v$ in the graph $G=(V,E)$.

Some well-studied MRFs include:

  • Ising model: The domain of each spin is $Q=\{-1,+1\}$. Each edge $e\in E$ is associated with a temperature $\beta_{e}\in\mathbb{R}$, and each vertex $v\in V$ is associated with a local field $h_{v}\in\mathbb{R}$. For each configuration $\sigma\in\{-1,+1\}^{V}$, $\mu_{\mathcal{I}}(\sigma)\propto\exp\left(\sum_{e=\{u,v\}\in E}\beta_{e}\sigma_{u}\sigma_{v}+\sum_{v\in V}h_{v}\sigma_{v}\right)$.

  • Hardcore model: The domain is $Q=\{0,1\}$. Each feasible configuration $\sigma\in Q^{V}$ indicates an independent set in $G=(V,E)$, and $\mu_{\mathcal{I}}(\sigma)\propto\lambda^{\left\|\sigma\right\|}$, where $\lambda>0$ is the fugacity parameter.

  • Proper $q$-coloring: the uniform distribution over all proper $q$-colorings of $G=(V,E)$.

2.2. Probabilistic inference and sampling

In graphical models, the task of probabilistic inference is to derive probabilities regarding one or more random variables of the model. Abstractly, this is described by a function $\bm{\theta}:\mathfrak{M}\rightarrow\mathbb{R}^{K}$ that maps each graphical model $\mathcal{I}\in\mathfrak{M}$ to a target $K$-dimensional probability vector, where $\mathfrak{M}$ is the class of graphical models containing the random variables we are interested in and the $K$-dimensional vector describes the probabilities we want to derive. Given $\bm{\theta}(\cdot)$ and an MRF instance $\mathcal{I}\in\mathfrak{M}$, the inference problem asks to estimate the probability vector $\bm{\theta}(\mathcal{I})$.

Here are some fundamental inference problems [38] for MRF instances. Let $\mathcal{I}=(V,E,Q,\Phi)$ be an MRF instance and $A,B\subseteq V$ two disjoint sets of vertices (so that $A\uplus B\subseteq V$).

  • Marginal inference: estimate the marginal distribution $\mu_{A,\mathcal{I}}(\cdot)$ of the variables in $A$, where

    \forall\sigma_{A}\in Q^{A},\quad\mu_{A,\mathcal{I}}(\sigma_{A})\triangleq\sum_{\tau\in Q^{V\setminus A}}\mu_{\mathcal{I}}(\sigma_{A},\tau).

  • Posterior inference: given any $\tau_{B}\in Q^{B}$, estimate the posterior distribution $\mu_{A,\mathcal{I}}(\cdot\mid\tau_{B})$ for the variables in $A$, where

    \forall\sigma_{A}\in Q^{A},\quad\mu_{A,\mathcal{I}}(\sigma_{A}\mid\tau_{B})\triangleq\frac{\mu_{A\cup B,\mathcal{I}}(\sigma_{A},\tau_{B})}{\mu_{B,\mathcal{I}}(\tau_{B})}.

  • Maximum-a-posteriori (MAP) inference: find the maximum-a-posteriori (MAP) probabilities $P_{A,\mathcal{I}}^{\ast}(\cdot)$ for the configurations over $A$, where

    \forall\sigma_{A}\in Q^{A},\quad P^{\ast}_{A,\mathcal{I}}(\sigma_{A})\triangleq\max_{\tau_{B}\in Q^{B}}\mu_{A\cup B,\mathcal{I}}(\sigma_{A},\tau_{B}).

All these fundamental inference problems can be described abstractly by a function $\bm{\theta}:\mathfrak{M}\rightarrow\mathbb{R}^{K}$. For instance, for marginal inference, $\mathfrak{M}$ contains all MRF instances where $A$ is a subset of the vertices, $K=\left|Q\right|^{|A|}$, and $\bm{\theta}(\mathcal{I})=(\mu_{A,\mathcal{I}}(\sigma_{A}))_{\sigma_{A}\in Q^{A}}$; and for posterior or MAP inference, $\mathfrak{M}$ contains all MRF instances where $A\uplus B$ is a subset of the vertices, $K=\left|Q\right|^{|A|}$, and $\bm{\theta}(\mathcal{I})=(\mu_{A,\mathcal{I}}(\sigma_{A}\mid\tau_{B}))_{\sigma_{A}\in Q^{A}}$ (for posterior inference) or $\bm{\theta}(\mathcal{I})=(P^{\ast}_{A,\mathcal{I}}(\sigma_{A}))_{\sigma_{A}\in Q^{A}}$ (for MAP inference).

One canonical approach for probabilistic inference is by sampling: sufficiently many independent samples are drawn (approximately) from the Gibbs distribution of the MRF instance, and an estimate of the target probabilities is calculated from these samples. Given a probabilistic inference problem $\bm{\theta}(\cdot)$, we use $\mathcal{E}_{\bm{\theta}}(\cdot)$ to denote an estimating function that approximates $\bm{\theta}(\mathcal{I})$ using independent samples drawn approximately from $\mu_{\mathcal{I}}$. For the aforementioned problems of marginal, posterior and MAP inference, such an estimating function $\mathcal{E}_{\bm{\theta}}(\cdot)$ simply counts the frequency of the samples that satisfy certain properties.
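For example, a minimal sketch of such a frequency-counting estimating function for marginal inference on a small set $A$ might look as follows (Python; representing samples as dictionaries from vertices to spins is an assumption made for illustration only):

```python
from collections import Counter
from itertools import product

def estimate_marginal(samples, A, Q):
    """Empirical estimate of (mu_A(sigma_A))_{sigma_A in Q^A}: the frequency of
    each configuration on A among the samples.

    samples: list of dicts mapping each vertex to a spin in Q.
    A: list of vertices; Q: list of spins.
    Returns a dict from tuples sigma_A (ordered as A) to estimated probabilities.
    """
    counts = Counter(tuple(X[v] for v in A) for X in samples)
    N = len(samples)
    return {sigma_A: counts[sigma_A] / N for sigma_A in product(Q, repeat=len(A))}
```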

The sampling cost of an estimator is captured in two aspects: the number of samples it uses and the accuracy of each individual sample it requires.

Definition 2.1 ($(N,\epsilon)$-estimator for $\bm{\theta}$).

Let $\bm{\theta}:\mathfrak{M}\to\mathbb{R}^{K}$ be a probabilistic inference problem and $\mathcal{E}_{\bm{\theta}}(\cdot)$ an estimating function for $\bm{\theta}(\cdot)$ that, for each instance $\mathcal{I}=(V,E,Q,\Phi)\in\mathfrak{M}$, maps samples in $Q^{V}$ to an estimate of $\bm{\theta}(\mathcal{I})$. Let $N:\mathbb{N}^{+}\to\mathbb{N}^{+}$ and $\epsilon:\mathbb{N}^{+}\to(0,1)$. For any instance $\mathcal{I}=(V,E,Q,\Phi)\in\mathfrak{M}$ where $n=|V|$, the random variable $\mathcal{E}_{\bm{\theta}}(\bm{X}^{(1)},\ldots,\bm{X}^{(N(n))})$ is said to be an $(N,\epsilon)$-estimator for $\bm{\theta}(\mathcal{I})$ if $\bm{X}^{(1)},\ldots,\bm{X}^{(N(n))}\in Q^{V}$ are $N(n)$ independent samples drawn approximately from $\mu_{\mathcal{I}}$ such that $d_{\mathrm{TV}}\left({\bm{X}^{(j)}},{\mu_{\mathcal{I}}}\right)\leq\epsilon(n)$ for all $1\leq j\leq N(n)$.

In Definition 2.1, an estimator is viewed as a black-box algorithm specified by two functions $N$ and $\epsilon$. Usually, the estimator is more accurate if more independent samples are drawn and each sample provides a higher level of accuracy. Thus, one can choose a large $N$ and a small $\epsilon$ to achieve a desired quality of estimate.

2.3. Dynamic inference problem

We consider the inference problem where the input graphical model is changed dynamically: at each step, the current MRF instance $\mathcal{I}=(V,E,Q,\Phi)$ is updated to a new instance $\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime})$. We consider general update operations for MRFs that can change both the underlying graph and all edge/vertex potentials simultaneously, where the update request is made by a non-adaptive adversary independently of the randomness used by the inference algorithm. Such updates are general enough to cover many applications, e.g. analyses of time-series network data [6, 33, 16, 2] and learning algorithms for graphical models [21, 38].

The difference between the original and the updated instances is measured as follows.

Definition 2.2 (difference between MRF instances).

The difference between two MRF instances $\mathcal{I}=(V,E,Q,\Phi)$ and $\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime})$, where $\Phi=(\phi_{a})_{a\in V\cup E}$ and $\Phi^{\prime}=(\phi^{\prime}_{a})_{a\in V^{\prime}\cup E^{\prime}}$, is defined as

(2) d(\mathcal{I},\mathcal{I}^{\prime})\triangleq\sum_{v\in V\cap V^{\prime}}\left\|\phi_{v}-\phi^{\prime}_{v}\right\|_{1}+\sum_{e\in E\cap E^{\prime}}\left\|\phi_{e}-\phi^{\prime}_{e}\right\|_{1}+|V\oplus V^{\prime}|+|E\oplus E^{\prime}|,

where $A\oplus B=(A\setminus B)\cup(B\setminus A)$ stands for the symmetric difference between two sets $A$ and $B$, $\left\|\phi_{v}-\phi^{\prime}_{v}\right\|_{1}\triangleq\sum_{c\in Q}\left|\phi_{v}(c)-\phi^{\prime}_{v}(c)\right|$, and $\left\|\phi_{e}-\phi^{\prime}_{e}\right\|_{1}\triangleq\sum_{c,c^{\prime}\in Q}\left|\phi_{e}(c,c^{\prime})-\phi^{\prime}_{e}(c,c^{\prime})\right|$.
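The definition (2) translates directly into code; the following sketch assumes the same dictionary representation of potentials as in the earlier sketches and is purely illustrative.

```python
def instance_difference(V1, E1, phi1, V2, E2, phi2, Q):
    """Compute d(I, I') as in (2): L1-differences of the shared potentials plus
    the sizes of the symmetric differences of the vertex and edge sets.

    phi1, phi2 map each vertex v to a dict {c: phi_v(c)} and each edge
    frozenset({u, v}) to a dict {(a, b): phi_e(a, b)}.
    """
    V1, V2, E1, E2 = set(V1), set(V2), set(E1), set(E2)
    d = len(V1 ^ V2) + len(E1 ^ E2)               # |V (+) V'| + |E (+) E'|
    for v in V1 & V2:                             # shared vertices
        d += sum(abs(phi1[v][c] - phi2[v][c]) for c in Q)
    for e in E1 & E2:                             # shared edges
        d += sum(abs(phi1[e][(a, b)] - phi2[e][(a, b)]) for a in Q for b in Q)
    return d
```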

Given a probability vector specified by the function $\bm{\theta}:\mathfrak{M}\to\mathbb{R}^{K}$, the dynamic inference problem asks to maintain an estimator $\hat{\bm{\theta}}(\mathcal{I})$ of $\bm{\theta}(\mathcal{I})$ for the current MRF instance $\mathcal{I}=(V,E,Q,\Phi)\in\mathfrak{M}$, with a data structure, such that when $\mathcal{I}$ is updated to $\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime})\in\mathfrak{M}$, the algorithm updates $\hat{\bm{\theta}}(\mathcal{I})$ to an estimator $\hat{\bm{\theta}}(\mathcal{I}^{\prime})$ of the new vector $\bm{\theta}(\mathcal{I}^{\prime})$, or equivalently, outputs the difference between the estimators $\hat{\bm{\theta}}(\mathcal{I})$ and $\hat{\bm{\theta}}(\mathcal{I}^{\prime})$.

It is desirable to have a dynamic inference algorithm that maintains an $(N,\epsilon)$-estimator for $\bm{\theta}(\mathcal{I})$ for the current instance $\mathcal{I}$. However, the dynamic algorithm cannot be efficient if $N(n)$ and $\epsilon(n)$ change drastically with $n$ (so that significantly more samples or substantially more accurate samples may be needed when a new vertex is added), or if recalculating the estimating function $\mathcal{E}_{\bm{\theta}}(\cdot)$ itself is expensive. We introduce a notion of dynamical efficiency for the estimators that are suitable for dynamic inference.

Definition 2.3 (dynamical efficiency).

Let $N:\mathbb{N}^{+}\to\mathbb{N}^{+}$ and $\epsilon:\mathbb{N}^{+}\to(0,1)$. Let $\mathcal{E}(\cdot)$ be an estimating function for some $K$-dimensional probability vector of MRF instances. A tuple $(N,\epsilon,\mathcal{E})$ is said to be dynamically efficient if it satisfies:

  • (bounded difference) there exist constants $C_{1},C_{2}>0$ such that for any $n\in\mathbb{N}^{+}$,

    \left|N(n+1)-N(n)\right|\leq\frac{C_{1}\cdot N(n)}{n}\quad\text{ and }\quad\left|\epsilon(n+1)-\epsilon(n)\right|\leq\frac{C_{2}\cdot\epsilon(n)}{n};

  • (small incremental cost) there is a deterministic algorithm that maintains $\mathcal{E}(\bm{X}^{(1)},\ldots,\bm{X}^{(m)})$ using $(mn+K)\cdot\mathrm{polylog}(mn)$ bits, where $\bm{X}^{(1)},\ldots,\bm{X}^{(m)}\in Q^{V}$ and $n=|V|$, such that when $\bm{X}^{(1)},\ldots,\bm{X}^{(m)}\in Q^{V}$ are updated to $\bm{Y}^{(1)},\ldots,\bm{Y}^{(m^{\prime})}\in Q^{V^{\prime}}$, where $n^{\prime}=|V^{\prime}|$, the algorithm updates $\mathcal{E}(\bm{X}^{(1)},\ldots,\bm{X}^{(m)})$ to $\mathcal{E}(\bm{Y}^{(1)},\ldots,\bm{Y}^{(m^{\prime})})$ within time cost $\mathcal{D}\cdot\mathrm{polylog}(mm^{\prime}nn^{\prime})+O(m+m^{\prime})$, where $\mathcal{D}$ is the size of the difference between the two sample sequences, defined as:

    (3) \mathcal{D}\triangleq\sum_{i\leq\max\{m,m^{\prime}\}}\sum_{v\in V\cup V^{\prime}}\mathbf{1}\left[\bm{X}^{(i)}(v)\neq\bm{Y}^{(i)}(v)\right],

    where an unassigned $\bm{X}^{(i)}(v)$ or $\bm{Y}^{(i)}(v)$ is not equal to any assigned spin.

Dynamical efficiency basically asks $N(\cdot)$, $\epsilon(\cdot)$, and $\mathcal{E}(\cdot)$ to have some sort of "Lipschitz" property. To satisfy the bounded difference condition, $N(n)$ and $1/\epsilon(n)$ are necessarily polynomially bounded, and they can be any constant, polylogarithmic, or polynomial functions, or products of such functions. The small incremental cost condition also holds very commonly. In particular, it is satisfied by the estimating functions for all the aforementioned marginal, posterior and MAP inference problems as long as the sets of variables have sizes $\left|A\right|,\left|B\right|=O(\log n)$, as illustrated by the sketch below. We remark that the $O(\log n)$ upper bound is essentially necessary for the efficiency of inference, because otherwise the dimension of $\bm{\theta}(\mathcal{I})$ itself (which is at least $q^{|A|}$) becomes super-polynomial in $n$.
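As an illustration of the small incremental cost condition, the sketch below (our own simplified stand-in, not the paper's data structure) maintains the frequency-counting marginal estimator by storing only the projections of the samples onto $A$; replacing one sample touches a single projection, so the total cost is proportional to the number of changed samples.

```python
from collections import Counter

class MarginalEstimator:
    """Maintain the empirical marginal on a small vertex set A over a list of samples.

    Only the projections of the samples onto A are stored, so replacing sample i
    costs O(|A|) time regardless of the total number of vertices.
    """
    def __init__(self, samples, A):
        self.A = list(A)
        self.proj = [tuple(X[v] for v in self.A) for X in samples]
        self.counts = Counter(self.proj)

    def estimate(self):
        """Current estimate: a dict from configurations on A to their frequencies."""
        N = len(self.proj)
        return {sigma_A: c / N for sigma_A, c in self.counts.items()}

    def update_sample(self, i, new_sample):
        """Replace the i-th sample (or append it if i equals the current length)."""
        new_proj = tuple(new_sample[v] for v in self.A)
        if i < len(self.proj):
            old_proj = self.proj[i]
            if old_proj == new_proj:
                return
            self.counts[old_proj] -= 1
            if self.counts[old_proj] == 0:
                del self.counts[old_proj]
            self.proj[i] = new_proj
        else:
            self.proj.append(new_proj)
        self.counts[new_proj] += 1
```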

3. Main results

Let $\mathcal{I}=(V,E,Q,\Phi)$ be an MRF instance, where $G=(V,E)$. Let $\Gamma_{G}(v)$ denote the neighborhood of $v$ in $G$. For any vertex $v\in V$ and any configuration $\sigma\in Q^{\Gamma_{G}(v)}$, we use $\mu_{v,\mathcal{I}}^{\sigma}(\cdot)=\mu_{v,\mathcal{I}}(\cdot\mid\sigma)$ to denote the marginal distribution on $v$ conditional on $\sigma$:

\forall c\in Q:\quad\mu_{v,\mathcal{I}}^{\sigma}(c)=\mu_{v,\mathcal{I}}(c\mid\sigma)\triangleq\frac{\exp\left(\phi_{v}(c)+\sum_{u\in\Gamma_{G}(v)}\phi_{uv}(\sigma_{u},c)\right)}{\sum_{a\in Q}\exp\left(\phi_{v}(a)+\sum_{u\in\Gamma_{G}(v)}\phi_{uv}(\sigma_{u},a)\right)}.

Due to the assumption in (1), the marginal distribution is always well-defined. The following condition is the Dobrushin-Shlosman condition [11, 12, 13, 20, 10].

Condition 3.1 (Dobrushin-Shlosman condition).

Let $\mathcal{I}=(V,E,Q,\Phi)$ be an MRF instance with Gibbs distribution $\mu=\mu_{\mathcal{I}}$. Let $A_{\mathcal{I}}\in\mathbb{R}_{\geq 0}^{V\times V}$ be the influence matrix, which is defined as

A_{\mathcal{I}}(u,v)\triangleq\begin{cases}\max_{(\sigma,\tau)\in B_{u,v}}d_{\mathrm{TV}}\left({\mu^{\sigma}_{v}},{\mu_{v}^{\tau}}\right),&\{u,v\}\in E,\\ 0,&\{u,v\}\not\in E,\end{cases}

where the maximum is taken over the set $B_{u,v}$ of all $(\sigma,\tau)\in Q^{\Gamma_{G}(v)}\times Q^{\Gamma_{G}(v)}$ that differ only at $u$, and $d_{\mathrm{TV}}\left({\mu^{\sigma}_{v}},{\mu_{v}^{\tau}}\right)\triangleq\frac{1}{2}\sum_{c\in Q}\left|\mu^{\sigma}_{v}(c)-\mu^{\tau}_{v}(c)\right|$ is the total variation distance between $\mu^{\sigma}_{v}$ and $\mu^{\tau}_{v}$. An MRF instance $\mathcal{I}$ is said to satisfy the Dobrushin-Shlosman condition if there is a constant $\delta>0$ such that

\max_{u\in V}\sum_{v\in V}A_{\mathcal{I}}(u,v)\leq 1-\delta.
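On small instances the condition can be verified by brute force; the following sketch (exponential in the maximum degree, for illustration only, using the dictionary representation of potentials assumed earlier) computes the influence matrix entries from the conditional marginals defined above and checks the row-sum bound.

```python
import math
from itertools import product

def conditional_marginal(v, sigma, Q, phi_v, phi_e, neighbors):
    """The marginal mu_v^sigma(.) at v given the boundary configuration sigma on Gamma(v)."""
    weights = [math.exp(phi_v[v][c] + sum(phi_e[frozenset({u, v})][(sigma[u], c)]
                                          for u in neighbors[v]))
               for c in Q]
    Z = sum(weights)
    return [w / Z for w in weights]

def dobrushin_shlosman_holds(V, Q, phi_v, phi_e, neighbors, delta):
    """Check max_u sum_v A_I(u, v) <= 1 - delta by enumerating all pairs in B_{u,v}."""
    row_sum = {u: 0.0 for u in V}
    for v in V:
        Gamma = list(neighbors[v])
        for u in Gamma:                      # A_I(u, v) = 0 unless {u, v} is an edge
            others = [w for w in Gamma if w != u]
            A_uv = 0.0
            for rest in product(Q, repeat=len(others)):
                base = dict(zip(others, rest))
                for a in Q:
                    for b in Q:
                        if a == b:
                            continue
                        sigma = dict(base); sigma[u] = a
                        tau = dict(base); tau[u] = b
                        p = conditional_marginal(v, sigma, Q, phi_v, phi_e, neighbors)
                        q = conditional_marginal(v, tau, Q, phi_v, phi_e, neighbors)
                        tv = 0.5 * sum(abs(x - y) for x, y in zip(p, q))
                        A_uv = max(A_uv, tv)
            row_sum[u] += A_uv
    return max(row_sum.values()) <= 1 - delta
```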

Our main theorem assumes the following setup: Let $\bm{\theta}:\mathfrak{M}\to\mathbb{R}^{K}$ be a probabilistic inference problem that maps each MRF instance in $\mathfrak{M}$ to a $K$-dimensional probability vector, and let $\mathcal{E}_{\bm{\theta}}$ be its estimating function. Let $N:\mathbb{N}^{+}\to\mathbb{N}^{+}$ and $\epsilon:\mathbb{N}^{+}\to(0,1)$. We use $\mathcal{I}=(V,E,Q,\Phi)\in\mathfrak{M}$, where $n=|V|$, to denote the current instance and $\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime})\in\mathfrak{M}$, where $n^{\prime}=|V^{\prime}|$, to denote the updated instance.

Theorem 3.2 (dynamic inference algorithm).

Assume that $(N,\epsilon,\mathcal{E}_{\bm{\theta}})$ is dynamically efficient, both $\mathcal{I}$ and $\mathcal{I}^{\prime}$ satisfy the Dobrushin-Shlosman condition, and $d(\mathcal{I},\mathcal{I}^{\prime})\leq L=o(n)$.

There is an algorithm that maintains an $(N,\epsilon)$-estimator $\hat{\bm{\theta}}(\mathcal{I})$ of the probability vector $\bm{\theta}(\mathcal{I})$ for the current MRF instance $\mathcal{I}$, using $\widetilde{O}\left(nN(n)+K\right)$ bits, such that when $\mathcal{I}$ is updated to $\mathcal{I}^{\prime}$, the algorithm updates $\hat{\bm{\theta}}(\mathcal{I})$ to an $(N,\epsilon)$-estimator $\hat{\bm{\theta}}(\mathcal{I}^{\prime})$ of $\bm{\theta}(\mathcal{I}^{\prime})$ for the new instance $\mathcal{I}^{\prime}$, within expected time cost

\widetilde{O}\left(\Delta^{2}LN(n)+\Delta n\right),

where $\widetilde{O}(\cdot)$ hides a $\mathrm{polylog}(n)$ factor, $\Delta=\max\{\Delta_{G},\Delta_{G^{\prime}}\}$, and $\Delta_{G}$ and $\Delta_{G^{\prime}}$ denote the maximum degrees of $G=(V,E)$ and $G^{\prime}=(V^{\prime},E^{\prime})$, respectively.

Note that the extra $O(\Delta n)$ cost is necessary for editing the current MRF instance $\mathcal{I}$ to $\mathcal{I}^{\prime}$.

Typically, the difference between two MRF instances $\mathcal{I},\mathcal{I}^{\prime}$ is small\footnote{In multivariate time-series data analysis, the MRF instances at two consecutive times are similar. In the iterative algorithms for learning graphical models, the difference between two consecutive MRF instances generated by gradient descent is bounded to prevent oscillations; in particular, the difference is very small when the iterative algorithm approaches convergence [21, 38].}, and the underlying graphs are sparse [14], that is, $L,\Delta\leq\mathrm{polylog}(n)$. In such cases, our algorithm updates the estimator within time cost $\widetilde{O}(N(n)+n)$, which significantly outperforms static sampling-based inference algorithms that require time cost $\Omega(n^{\prime}N(n^{\prime}))=\Omega(nN(n))$ for redrawing all $N(n^{\prime})$ independent samples.


Dynamic sampling

The core of our dynamic inference algorithm is a dynamic sampling algorithm: assuming the Dobrushin-Shlosman condition, the algorithm can maintain a sequence of $N(n)$ independent samples $\bm{X}^{(1)},\ldots,\bm{X}^{(N(n))}\in Q^{V}$ that are $\epsilon(n)$-close to $\mu_{\mathcal{I}}$ in total variation distance, and when $\mathcal{I}$ is updated to $\mathcal{I}^{\prime}$ with difference $d(\mathcal{I},\mathcal{I}^{\prime})\leq L=o(n)$, the algorithm can update the maintained samples to $N(n^{\prime})$ independent samples $\bm{Y}^{(1)},\ldots,\bm{Y}^{(N(n^{\prime}))}\in Q^{V^{\prime}}$ that are $\epsilon(n^{\prime})$-close to $\mu_{\mathcal{I}^{\prime}}$ in total variation distance, using a time cost of $\widetilde{O}\left(\Delta^{2}LN(n)+\Delta n\right)$ in expectation. This shows that an "algorithmic Lipschitz" condition holds for sampling from Gibbs distributions: when the MRF changes insignificantly, a population of samples can be modified to reflect the new distribution, with cost proportional to the difference between the MRFs. We show that such a property is guaranteed by the Dobrushin-Shlosman condition. This dynamic sampling algorithm is formally described in Theorem 6.1 and is of independent interest [16].


Applications on specific models

On specific models, we have the following results, where $\delta>0$ is an arbitrary constant.

model | regime | space cost | time cost per update
Ising | $\mathrm{e}^{-2|\beta|}\geq 1-\frac{2-\delta}{\Delta+1}$ | $\widetilde{O}\left(nN(n)+K\right)$ | $\widetilde{O}\left(\Delta^{2}LN(n)+\Delta n\right)$
hardcore | $\lambda\leq\frac{2-\delta}{\Delta-2}$ | $\widetilde{O}\left(nN(n)+K\right)$ | $\widetilde{O}\left(\Delta^{3}LN(n)+\Delta n\right)$
$q$-coloring | $q\geq(2+\delta)\Delta$ | $\widetilde{O}\left(nN(n)+K\right)$ | $\widetilde{O}\left(\Delta^{2}LN(n)+\Delta n\right)$
Table 1. Dynamic inference for specific models.

The results for the Ising model and $q$-coloring are corollaries of Theorem 3.2. The regime for the hardcore model is better than the Dobrushin-Shlosman condition (which is $\lambda\leq\frac{1-\delta}{\Delta-1}$), because we use the coupling introduced by Vigoda [37] to analyze the algorithm.

4. Preliminaries

Total variation distance and coupling

Let $\mu$ and $\nu$ be two distributions over $\Omega$. The total variation distance between $\mu$ and $\nu$ is defined as

d_{\mathrm{TV}}\left({\mu},{\nu}\right)\triangleq\frac{1}{2}\sum_{x\in\Omega}\left|\mu(x)-\nu(x)\right|.

A coupling of $\mu$ and $\nu$ is a joint distribution $(X,Y)\in\Omega\times\Omega$ such that the marginal distribution of $X$ is $\mu$ and the marginal distribution of $Y$ is $\nu$. The following coupling lemma is well known.

Proposition 4.1 (coupling lemma).

For any coupling $(X,Y)$ of $\mu$ and $\nu$, it holds that

\Pr[X\neq Y]\geq d_{\mathrm{TV}}\left({\mu},{\nu}\right).

Furthermore, there is an optimal coupling that achieves equality.
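For finite distributions, an optimal coupling can be sampled explicitly: place as much probability mass as possible on the diagonal, and draw from the residual distributions otherwise. A minimal sketch, assuming $\mu$ and $\nu$ are given as dictionaries over a common finite support:

```python
import random

def _draw(dist, rng):
    """Draw one outcome from a dict {outcome: probability}."""
    r, acc = rng.random(), 0.0
    for x, p in dist.items():
        acc += p
        if r < acc:
            return x
    return x  # guard against floating-point round-off

def sample_optimal_coupling(mu, nu, rng=random):
    """Sample (X, Y) with X ~ mu, Y ~ nu and Pr[X != Y] = d_TV(mu, nu).

    mu, nu: dicts over the same finite set of outcomes.
    """
    overlap = {x: min(mu[x], nu[x]) for x in mu}
    p_same = sum(overlap.values())            # equals 1 - d_TV(mu, nu)
    if rng.random() < p_same:
        x = _draw({k: p / p_same for k, p in overlap.items()}, rng)
        return x, x
    # residual distributions; their supports are disjoint, so X != Y on this branch
    res_mu = {x: (mu[x] - overlap[x]) / (1 - p_same) for x in mu}
    res_nu = {x: (nu[x] - overlap[x]) / (1 - p_same) for x in nu}
    return _draw(res_mu, rng), _draw(res_nu, rng)
```

Since the two residual distributions have disjoint supports, $\Pr[X\neq Y]$ equals exactly $1-\sum_{x}\min\{\mu(x),\nu(x)\}=d_{\mathrm{TV}}(\mu,\nu)$, matching the equality case of the coupling lemma.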

Local neighborhood

Let $G=(V,E)$ be a graph. For any vertex $v\in V$, let $\Gamma_{G}(v)\triangleq\{u\in V\mid\{u,v\}\in E\}$ denote the neighborhood of $v$, and $\Gamma^{+}_{G}(v)\triangleq\Gamma_{G}(v)\cup\{v\}$ the inclusive neighborhood of $v$. We simply write $\Gamma_{v}=\Gamma(v)=\Gamma_{G}(v)$ and $\Gamma_{v}^{+}=\Gamma^{+}(v)=\Gamma_{G}^{+}(v)$ for short when $G$ is clear from the context. We use $\Delta=\Delta_{G}\triangleq\max_{v\in V}|\Gamma_{v}|$ to denote the maximum degree of the graph $G$.

A notion of local neighborhood for MRFs is frequently used. Let $\mathcal{I}=(V,E,Q,\Phi)$ be an MRF instance. For $v\in V$, we denote by $\mathcal{I}_{v}\triangleq\mathcal{I}[\Gamma_{v}^{+}]$ the restriction of $\mathcal{I}$ to the inclusive neighborhood $\Gamma_{v}^{+}$ of $v$, i.e. $\mathcal{I}_{v}=(\Gamma^{+}_{v},E_{v},Q,\Phi_{v})$, where $E_{v}=\{\{u,v\}\in E\}$ and $\Phi_{v}=(\phi_{a})_{a\in\Gamma_{v}^{+}\cup E_{v}}$.

Gibbs sampling

Gibbs sampling (a.k.a. heat-bath, Glauber dynamics) is a classic Markov chain for sampling from Gibbs distributions. Let $\mathcal{I}=(V,E,Q,\Phi)$ be an MRF instance and $\mu=\mu_{\mathcal{I}}$ its Gibbs distribution. The chain of Gibbs sampling (Algorithm 1) is on the space $\Omega\triangleq Q^{V}$, and has stationary distribution $\mu_{\mathcal{I}}$ [28, Chapter 3].

Initialization: an initial state $\bm{X}_{0}\in\Omega$ (not necessarily feasible);
for $t=1,2,\ldots,T$ do
  pick $v_{t}\in V$ uniformly at random;
  draw a random value $c\in Q$ from the marginal distribution $\mu_{v_{t}}(\cdot\mid X_{t-1}(\Gamma_{v_{t}}))$;
  $X_{t}(v_{t})\leftarrow c$ and $X_{t}(u)\leftarrow X_{t-1}(u)$ for all $u\in V\setminus\{v_{t}\}$;
Algorithm 1: Gibbs sampling
Marginal distributions

Here $\mu_{v}(\cdot\mid\sigma(\Gamma_{v}))=\mu_{v,\mathcal{I}}(\cdot\mid\sigma(\Gamma_{v}))$ denotes the marginal distribution at $v\in V$ conditioning on $\sigma(\Gamma_{v})\in Q^{\Gamma_{v}}$, which is computed as:

(4) \forall c\in Q:\quad\mu_{v}(c\mid\sigma(\Gamma_{v}))=\frac{\exp\left(\phi_{v}(c)+\sum_{u\in\Gamma_{v}}\phi_{uv}(\sigma_{u},c)\right)}{\sum_{c^{\prime}\in Q}\exp\left(\phi_{v}(c^{\prime})+\sum_{u\in\Gamma_{v}}\phi_{uv}(\sigma_{u},c^{\prime})\right)}.

Due to assumption (1), this marginal distribution is always well-defined, and its computation uses only the information in $\mathcal{I}_{v}$.
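Putting Algorithm 1 and (4) together, one run of the Gibbs sampling chain can be sketched as follows; this is an illustrative transcription using the dictionary representation of potentials assumed earlier, not the data structure used by the dynamic algorithm.

```python
import math
import random

def conditional_marginal(v, X, Q, phi_v, phi_e, neighbors):
    """The marginal mu_v(. | X(Gamma_v)) of (4); well defined under assumption (1)."""
    weights = [math.exp(phi_v[v][c] + sum(phi_e[frozenset({u, v})][(X[u], c)]
                                          for u in neighbors[v]))
               for c in Q]
    Z = sum(weights)
    return [w / Z for w in weights]

def gibbs_sampling(V, Q, phi_v, phi_e, neighbors, T, X0, rng=random):
    """Run T steps of Algorithm 1 from the initial state X0 and return X_T."""
    V = list(V)
    X = dict(X0)
    for _ in range(T):
        v = rng.choice(V)                        # pick v_t uniformly at random
        probs = conditional_marginal(v, X, Q, phi_v, phi_e, neighbors)
        X[v] = rng.choices(Q, weights=probs)[0]  # resample X(v_t) from the marginal
        # all other coordinates stay unchanged
    return X
```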

Coupling for mixing time

Consider a chain $(\bm{X}_{t})_{t=0}^{\infty}$ on space $\Omega$ with stationary distribution $\mu_{\mathcal{I}}$ for an MRF instance $\mathcal{I}$. The mixing rate is defined as: for $\epsilon>0$,

\tau_{\mathsf{mix}}(\mathcal{I},\epsilon)\triangleq\max_{\bm{X}_{0}}\min\left\{t\mid d_{\mathrm{TV}}\left({\bm{X}_{t}},{\mu_{\mathcal{I}}}\right)\leq\epsilon\right\},

where $d_{\mathrm{TV}}\left({\bm{X}_{t}},{\mu_{\mathcal{I}}}\right)$ denotes the total variation distance between $\mu_{\mathcal{I}}$ and the distribution of $\bm{X}_{t}$.

A coupling of a Markov chain is a joint process $(\bm{X}_{t},\bm{Y}_{t})_{t\geq 0}$ such that $(\bm{X}_{t})_{t\geq 0}$ and $(\bm{Y}_{t})_{t\geq 0}$ marginally follow the same transition rule as the original chain. Consider the following type of coupling.

Definition 4.2 (one-step optimal coupling for Gibbs sampling).

A coupling $(\bm{X}_{t},\bm{Y}_{t})_{t\geq 0}$ of Gibbs sampling on an MRF instance $\mathcal{I}=(V,E,Q,\Phi)$ is a one-step optimal coupling if it is constructed as follows: for $t=1,2,\ldots$,

  1. pick the same random $v_{t}\in V$, and let $(X_{t}(u),Y_{t}(u))\leftarrow(X_{t-1}(u),Y_{t-1}(u))$ for all $u\neq v_{t}$;

  2. sample $(X_{t}(v_{t}),Y_{t}(v_{t}))$ from an optimal coupling $D_{\mathsf{opt},\mathcal{I}_{v_{t}}}^{\sigma,\tau}(\cdot,\cdot)$ of the marginal distributions $\mu_{v_{t}}(\cdot\mid\sigma)$ and $\mu_{v_{t}}(\cdot\mid\tau)$, where $\sigma=X_{t-1}(\Gamma_{v_{t}})$ and $\tau=Y_{t-1}(\Gamma_{v_{t}})$.

The coupling $D_{\mathsf{opt},\mathcal{I}_{v_{t}}}^{\sigma,\tau}(\cdot,\cdot)$ is an optimal coupling of $\mu_{v_{t}}(\cdot\mid\sigma)$ and $\mu_{v_{t}}(\cdot\mid\tau)$ that attains the maximum $\Pr[\bm{x}=\bm{y}]$ over all couplings $(\bm{x},\bm{y})$ of $\bm{x}\sim\mu_{v_{t}}(\cdot\mid\sigma)$ and $\bm{y}\sim\mu_{v_{t}}(\cdot\mid\tau)$. The coupling $D_{\mathsf{opt},\mathcal{I}_{v_{t}}}^{\sigma,\tau}(\cdot,\cdot)$ is determined by the local information $\mathcal{I}_{v_{t}}$ and $\sigma,\tau\in Q^{\mathrm{deg}(v_{t})}$.
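A single coupled transition of Definition 4.2 can be sketched as follows, assuming the helper routines conditional_marginal and sample_optimal_coupling from the earlier sketches in this section are in scope; it illustrates the coupling only, not the dynamic update procedure.

```python
import random

def one_step_optimal_coupling(X, Y, V, Q, phi_v, phi_e, neighbors, rng=random):
    """One coupled transition (Definition 4.2) of two Gibbs sampling chains on the
    same instance; assumes conditional_marginal and sample_optimal_coupling from
    the sketches above are in scope."""
    v = rng.choice(list(V))                      # the same random vertex v_t for both chains
    mu_sigma = conditional_marginal(v, X, Q, phi_v, phi_e, neighbors)
    mu_tau = conditional_marginal(v, Y, Q, phi_v, phi_e, neighbors)
    x, y = sample_optimal_coupling(dict(zip(Q, mu_sigma)),
                                   dict(zip(Q, mu_tau)), rng)
    X, Y = dict(X), dict(Y)                      # all other coordinates are copied
    X[v], Y[v] = x, y
    return X, Y
```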

With such a coupling, we can establish the following relation between the Dobrushin-Shlosman condition and the rapid mixing of the Gibbs sampling [11, 12, 13, 4, 20, 10].

Proposition 4.3 ([4, 20]).

Let $\mathcal{I}=(V,E,Q,\Phi)$ be an MRF instance with $n=|V|$, and $\Omega=Q^{V}$ the state space. Let $H(\sigma,\tau)\triangleq\left|\{v\in V\mid\sigma_{v}\neq\tau_{v}\}\right|$ denote the Hamming distance between $\sigma\in\Omega$ and $\tau\in\Omega$. If $\mathcal{I}$ satisfies the Dobrushin-Shlosman condition (Condition 3.1) with constant $\delta>0$, then the one-step optimal coupling $(\bm{X}_{t},\bm{Y}_{t})_{t\geq 0}$ for Gibbs sampling (Definition 4.2) satisfies

\forall\,\sigma,\tau\in\Omega:\quad\mathbb{E}\left[{\,H(\bm{X}_{t},\bm{Y}_{t})\mid\bm{X}_{t-1}=\sigma\land\bm{Y}_{t-1}=\tau\,}\right]\leq\left(1-\frac{\delta}{n}\right)\cdot H(\sigma,\tau),

and hence the mixing rate of Gibbs sampling on $\mathcal{I}$ is bounded as $\tau_{\mathsf{mix}}(\mathcal{I},\epsilon)\leq\left\lceil\frac{n}{\delta}\log\frac{n}{\epsilon}\right\rceil$.

5. Outline of the algorithm

Let $\bm{\theta}:\mathfrak{M}\to\mathbb{R}^{K}$ be a probabilistic inference problem that maps each MRF instance in $\mathfrak{M}$ to a $K$-dimensional probability vector, and let $\mathcal{E}_{\bm{\theta}}$ be its estimating function. Let $\mathcal{I}=(V,E,Q,\Phi)\in\mathfrak{M}$ be the current instance, where $n=|V|$. Our dynamic inference algorithm maintains a sequence of $N(n)$ independent samples $\bm{X}^{(1)},\ldots,\bm{X}^{(N(n))}\in Q^{V}$ which are $\epsilon(n)$-close to the Gibbs distribution $\mu_{\mathcal{I}}$ in total variation distance, and an $(N,\epsilon)$-estimator $\hat{\bm{\theta}}(\mathcal{I})$ of $\bm{\theta}(\mathcal{I})$ such that

\hat{\bm{\theta}}(\mathcal{I})=\mathcal{E}_{\bm{\theta}}(\bm{X}^{(1)},\bm{X}^{(2)},\ldots,\bm{X}^{(N(n))}).

Upon an update request that modifies $\mathcal{I}$ to a new instance $\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime})\in\mathfrak{M}$, where $n^{\prime}=|V^{\prime}|$, our algorithm does the following:

  • Update the sample sequence. Update $\bm{X}^{(1)},\ldots,\bm{X}^{(N(n))}$ to a new sequence of $N(n^{\prime})$ independent samples $\bm{Y}^{(1)},\ldots,\bm{Y}^{(N(n^{\prime}))}\in Q^{V^{\prime}}$ that are $\epsilon(n^{\prime})$-close to $\mu_{\mathcal{I}^{\prime}}$ in total variation distance, and output the difference between the two sample sequences.

  • Update the estimator. Given the difference between the two sample sequences, update $\hat{\bm{\theta}}(\mathcal{I})$ to $\hat{\bm{\theta}}(\mathcal{I}^{\prime})=\mathcal{E}_{\bm{\theta}}(\bm{Y}^{(1)},\ldots,\bm{Y}^{(N(n^{\prime}))})$ by accessing the oracle in Definition 2.3.

Obviously, the updated estimator $\hat{\bm{\theta}}(\mathcal{I}^{\prime})$ is an $(N,\epsilon)$-estimator for $\bm{\theta}(\mathcal{I}^{\prime})$.

Our main technical contribution is an algorithm that dynamically maintains a sequence of $N(n)$ independent samples for $\mu_{\mathcal{I}}$ while $\mathcal{I}$ itself is dynamically changing. The dynamic sampling problem was recently introduced in [16]. The dynamic sampling algorithm given there only handles updates of a single vertex or edge and works only for graphical models with soft constraints.

In contrast, our dynamic sampling algorithm maintains a sequence of $N(n)$ independent samples for $\mu_{\mathcal{I}}$ within total variation distance $\epsilon(n)$, while the entire specification of the graphical model $\mathcal{I}$ is subject to dynamic update (to a new $\mathcal{I}^{\prime}$ with difference $d(\mathcal{I},\mathcal{I}^{\prime})\leq L=o(n)$). Specifically, the algorithm updates the sample sequence within expected time $O(\Delta^{2}N(n)L\log^{3}n+\Delta n)$. Note that the extra $O(\Delta n)$ cost is necessary just for editing the current MRF instance $\mathcal{I}$ to $\mathcal{I}^{\prime}$, because a single update may change all the vertex and edge potentials simultaneously. This incremental time cost dominates the time cost of the dynamic inference algorithm, and is efficient for maintaining $N(n)$ independent samples, especially when $N(n)$ is sufficiently large, e.g. $N(n)=\Omega(n/L)$, in which case the average incremental cost for updating each sample is $O(\Delta^{2}L\log^{3}n+{\Delta n}/{N(n)})=O(\Delta^{2}L\log^{3}n)$.

We illustrate the main idea by explaining how to maintain one sample. The idea is to represent the trace of the Markov chain for generating the sample by a dynamic data structure, and when the MRF instance is changed, this trace is modified to that of the new chain for generating the sample for the updated instance. This is achieved by both a set of efficient dynamic data structures and the coupling between the two Markov chains.

Specifically, let $(\bm{X}_{t})_{t=0}^{T}$ be the Gibbs sampler chain for the distribution $\mu_{\mathcal{I}}$. When the chain is rapidly mixing, starting from an arbitrary initial configuration $\bm{X}_{0}\in Q^{V}$, after suitably many steps $\bm{X}=\bm{X}_{T}$ is an accurate enough sample for $\mu_{\mathcal{I}}$. At each step, $\bm{X}_{t-1}$ and $\bm{X}_{t}$ may differ only at a vertex $v_{t}$ which is picked from $V$ uniformly and independently at random. The evolution of the chain is fully captured by the initial state $\bm{X}_{0}$ and the sequence of pairs $\left\langle\,{v_{t},X_{t}(v_{t})}\,\right\rangle$, from $t=1$ to $t=T$, which is called the execution log of the chain in this paper.
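Conceptually, the execution log can be stored as a map from steps to transitions together with, for each vertex, the sorted list of steps at which it was resampled. The sketch below is a simplified stand-in (with our own naming) for the dynamic data structures developed in Section 6, supporting the two operations used informally here: reading $X_{t}(v)$ for any $t$, and locating the steps at which a given vertex was resampled.

```python
import bisect
from collections import defaultdict

class ExecutionLog:
    """Execution log <v_t, X_t(v_t)>_{t=1..T} of a Gibbs sampling chain.

    Stores the initial state, the transition of every step, and, per vertex,
    the increasing list of steps at which that vertex was resampled, so that
    X_t(v) can be read without replaying the chain.
    """
    def __init__(self, X0):
        self.X0 = dict(X0)
        self.transitions = {}               # t -> (v_t, X_t(v_t))
        self.times_of = defaultdict(list)   # v -> increasing list of steps t with v_t = v

    def append(self, t, v, value):
        """Record the transition <v_t, X_t(v_t)> of step t (steps arrive in order)."""
        self.transitions[t] = (v, value)
        self.times_of[v].append(t)

    def value_at(self, v, t):
        """Return X_t(v): the value set by the last resampling of v at a step <= t."""
        times = self.times_of[v]
        i = bisect.bisect_right(times, t)
        return self.X0[v] if i == 0 else self.transitions[times[i - 1]][1]

    def steps_updating(self, v):
        """All steps t with v_t = v (the candidate steps to edit after an update)."""
        return list(self.times_of[v])
```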

Now suppose that the current instance $\mathcal{I}$ is updated to $\mathcal{I}^{\prime}$. We construct a coupling between the original chain $(\bm{X}_{t})_{t=0}^{T}$ and the new chain $(\bm{Y}_{t})_{t=0}^{T}$ such that $(\bm{Y}_{t})_{t=0}^{T}$ is a faithful Gibbs sampling chain for the updated instance $\mathcal{I}^{\prime}$ given that $(\bm{X}_{t})_{t=0}^{T}$ is a faithful chain for $\mathcal{I}$, and the difference between the two chains is small, in the sense that they have almost the same execution logs except for about $O(TL/n)$ steps, where $L$ is the difference between $\mathcal{I}$ and $\mathcal{I}^{\prime}$.

To simplify the exposition of this coupling, for now we restrict ourselves to the case where the update to the instance $\mathcal{I}$ does not change the set of variables. Without loss of generality, we only consider the following two basic update operations that modify $\mathcal{I}$ to $\mathcal{I}^{\prime}$.

  • Graph update. The update only adds or deletes some edges, while all vertex potentials and the potentials of unaffected edges are not changed.

  • Hamiltonian update. The update changes (possibly all) potentials of vertices and edges, while the underlying graph remains unchanged.

A general update of the graphical model can be obtained by combining these two basic operations.

Then the new chain $(\bm{Y}_{t})_{t=0}^{T}$ can be coupled with $(\bm{X}_{t})_{t=0}^{T}$ by using the same initial configuration $\bm{Y}_{0}=\bm{X}_{0}$ and the same sequence $v_{1},v_{2},\ldots,v_{T}\in V$ of randomly picked vertices. For $t=1,2,\ldots,T$, the transition $\left\langle\,{v_{t},Y_{t}(v_{t})}\,\right\rangle$ of the new chain can be generated using the same vertex $v_{t}$ as in the original chain $(\bm{X}_{t})_{t=0}^{T}$, and a random $Y_{t}(v_{t})$ generated according to a coupling of the marginal distributions of $X_{t}(v_{t})$ and $Y_{t}(v_{t})$, conditioning respectively on the current states of the neighborhood of $v_{t}$ in $(\bm{X}_{t})_{t=0}^{T}$ and $(\bm{Y}_{t})_{t=0}^{T}$. Note that these two marginal distributions must be identical unless (I) $\bm{X}_{t-1}$ and $\bm{Y}_{t-1}$ differ from each other over the neighborhood of $v_{t}$, or (II) $v_{t}$ itself is incident to where the models $\mathcal{I}$ and $\mathcal{I}^{\prime}$ differ. The event (II) occurs rarely, for the following reasons.

  • For a graph update, event (II) occurs only if $v_{t}$ is incident to an updated edge. Since only $L$ edges are updated, the event occurs in at most $O(TL/n)$ steps in expectation.

  • For a Hamiltonian update, all the potentials of vertices and edges can be changed, thus $\mathcal{I},\mathcal{I}^{\prime}$ may differ everywhere. The key observation is that, as the total difference between the current and updated potentials is bounded by $L$, we can apply a filter to first select all candidate steps where the coupling may actually fail due to the difference between $\mathcal{I}$ and $\mathcal{I}^{\prime}$, whose number can be as small as $O(TL/n)$, and the actual coupling between $(\bm{X}_{t})_{t=0}^{T}$ and $(\bm{Y}_{t})_{t=0}^{T}$ is constructed with such a prior.

Finally, when $\mathcal{I}$ and $\mathcal{I}^{\prime}$ both satisfy the Dobrushin-Shlosman condition, the percolation of disagreements between $(\bm{X}_{t})_{t=0}^{T}$ and $(\bm{Y}_{t})_{t=0}^{T}$ is bounded, and we show that the two chains are almost always identically coupled, i.e. $\left\langle\,{v_{t},X_{t}(v_{t})}\,\right\rangle=\left\langle\,{v_{t},Y_{t}(v_{t})}\,\right\rangle$, with exceptions at only $O(TL/n)$ steps. The original chain $(\bm{X}_{t})_{t=0}^{T}$ can then be updated to the new chain $(\bm{Y}_{t})_{t=0}^{T}$ by only editing these $O(TL/n)$ local transitions $\left\langle\,{v_{t},Y_{t}(v_{t})}\,\right\rangle$ which differ from $\left\langle\,{v_{t},X_{t}(v_{t})}\,\right\rangle$. This is aided by the dynamic data structure for the execution log of the chain, which is of independent interest.

6. Dynamic Gibbs sampling

In this section, we give the dynamic sampling algorithm that updates the sample sequences.

In the following theorem, we use $\mathcal{I}=(V,E,Q,\Phi)$, where $n=|V|$, to denote the current MRF instance and $\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime})$, where $n^{\prime}=|V^{\prime}|$, to denote the updated MRF instance. Define

d_{\textsf{graph}}(\mathcal{I},\mathcal{I}^{\prime})\triangleq|V\oplus V^{\prime}|+|E\oplus E^{\prime}|,
d_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}^{\prime})\triangleq\sum_{v\in V\cap V^{\prime}}\left\|\phi_{v}-\phi^{\prime}_{v}\right\|_{1}+\sum_{e\in E\cap E^{\prime}}\left\|\phi_{e}-\phi^{\prime}_{e}\right\|_{1}.

Note that $d(\mathcal{I},\mathcal{I}^{\prime})=d_{\textsf{graph}}(\mathcal{I},\mathcal{I}^{\prime})+d_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}^{\prime})$, where $d(\mathcal{I},\mathcal{I}^{\prime})$ is defined in (2).

Theorem 6.1 (dynamic sampling algorithm).

Let $N:\mathbb{N}^{+}\to\mathbb{N}^{+}$ and $\epsilon:\mathbb{N}^{+}\to(0,1)$ be two functions satisfying the bounded difference condition in Definition 2.3. Assume that $\mathcal{I}$ and $\mathcal{I}^{\prime}$ both satisfy the Dobrushin-Shlosman condition, $d_{\textsf{graph}}(\mathcal{I},\mathcal{I}^{\prime})\leq L_{\mathsf{graph}}=o(n)$, and $d_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}^{\prime})\leq L_{\mathsf{Hamil}}$.

There is an algorithm that maintains a sequence of $N(n)$ independent samples $\bm{X}^{(1)},\ldots,\bm{X}^{(N(n))}\in Q^{V}$, where $d_{\mathrm{TV}}\left({\mu_{\mathcal{I}}},{\bm{X}^{(i)}}\right)\leq\epsilon(n)$ for all $1\leq i\leq N(n)$, using $O\left(nN(n)\log n\right)$ memory words, each of $O(\log n)$ bits, such that when $\mathcal{I}$ is updated to $\mathcal{I}^{\prime}$, the algorithm updates the sequence to $N(n^{\prime})$ independent samples $\bm{Y}^{(1)},\ldots,\bm{Y}^{(N(n^{\prime}))}\in Q^{V^{\prime}}$, where $d_{\mathrm{TV}}\left({\mu_{\mathcal{I}^{\prime}}},{\bm{Y}^{(i)}}\right)\leq\epsilon(n^{\prime})$ for all $1\leq i\leq N(n^{\prime})$, within expected time cost

(5) O\left(\Delta^{2}(L_{\mathsf{graph}}+L_{\mathsf{Hamil}})N(n)\log^{3}n+\Delta n\right),

where $\Delta=\max\{\Delta_{G},\Delta_{G^{\prime}}\}$, and $\Delta_{G},\Delta_{G^{\prime}}$ denote the maximum degrees of $G=(V,E)$ and $G^{\prime}=(V^{\prime},E^{\prime})$.

Our algorithm is based on the Gibbs sampling algorithm. Let $N:\mathbb{N}^{+}\to\mathbb{N}^{+}$ and $\epsilon:\mathbb{N}^{+}\to(0,1)$ be the two functions in Theorem 6.1. We first give the single-sample dynamic Gibbs sampling algorithm (Algorithm 2), which maintains a single sample $\bm{X}\in Q^{V}$ for the current MRF instance $\mathcal{I}=(V,E,Q,\Phi)$, where $n=|V|$, such that $d_{\mathrm{TV}}\left({\bm{X}},{\mu_{\mathcal{I}}}\right)\leq\epsilon(n)$. We then use this algorithm to obtain the multi-sample dynamic Gibbs sampling algorithm that maintains $N(n)$ independent samples for the current instance.

Given the error function $\epsilon:\mathbb{N}^{+}\to(0,1)$, suppose that $T(\mathcal{I})$ is an easy-to-compute integer-valued function that upper bounds the mixing time on instance $\mathcal{I}$, such that

(6) T(\mathcal{I})\geq\tau_{\textsf{mix}}(\mathcal{I},\epsilon(n)),

where $\tau_{\textsf{mix}}(\mathcal{I},\epsilon(n))$ denotes the mixing rate of the Gibbs sampling chain $(\bm{X}_{t})_{t\geq 0}$ on instance $\mathcal{I}$. By Proposition 4.3, if the Dobrushin-Shlosman condition is satisfied, we can set

(7) T(\mathcal{I})=\left\lceil\frac{n}{\delta}\log\frac{n}{\epsilon(n)}\right\rceil.

Our algorithm for single-sample dynamic Gibbs sampling maintains a random process $(\bm{X}_{t})_{t=0}^{T}$, which is a Gibbs sampling chain on instance $\mathcal{I}$ of length $T=T(\mathcal{I})$, where $T(\mathcal{I})$ satisfies (6). Clearly $\bm{X}_{T}$ is a sample for $\mu_{\mathcal{I}}$ with $d_{\mathrm{TV}}\left({\bm{X}_{T}},{\mu_{\mathcal{I}}}\right)\leq\epsilon(n)$.

When the current instance $\mathcal{I}$ is updated to a new instance $\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime})$, where $n^{\prime}=|V^{\prime}|$, the original process $(\bm{X}_{t})_{t=0}^{T}$ is transformed to a new process $(\bm{Y}_{t})_{t=0}^{T^{\prime}}$ such that the following holds as an invariant: $(\bm{Y}_{t})_{t=0}^{T^{\prime}}$ is a Gibbs sampling chain on $\mathcal{I}^{\prime}$ with $T^{\prime}=T(\mathcal{I}^{\prime})$. Hence $\bm{Y}_{T^{\prime}}$ is a sample for the new instance $\mathcal{I}^{\prime}$ with $d_{\mathrm{TV}}\left({\bm{Y}_{T^{\prime}}},{\mu_{\mathcal{I}^{\prime}}}\right)\leq\epsilon(n^{\prime})$. This is achieved through the following two steps:

  1. We construct couplings between $(\bm{X}_{t})_{t=0}^{T}$ and $(\bm{Y}_{t})_{t=0}^{T^{\prime}}$, so that the new process $(\bm{Y}_{t})_{t=0}^{T^{\prime}}$ for $\mathcal{I}^{\prime}$ can be obtained by making small changes to the original process $(\bm{X}_{t})_{t=0}^{T}$ for $\mathcal{I}$.

  2. We give a data structure which represents $(\bm{X}_{t})_{t=0}^{T}$ incrementally and supports various updates and queries to $(\bm{X}_{t})_{t=0}^{T}$ so that the above coupling can be generated efficiently.

6.1. Coupling for dynamic instances

The Gibbs sampling chain $(\bm{X}_{t})_{t=0}^{T}$ can be uniquely and fully recovered from: the initial state $\bm{X}_{0}\in Q^{V}$, and the pairs $\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{T}$ that record the transitions. We call $\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{T}$ the execution-log for the chain $(\bm{X}_{t})_{t=0}^{T}$, and denote it by

\mathsf{Exe\text{-}Log}(\mathcal{I},T)\triangleq\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{T}.

The following invariants are assumed for the random execution-log with an initial state.

Condition 6.2 (invariants for Exe-Log).

Fix an initial state $\bm{X}_{0}\in Q^{V}$. The following invariants hold for the random execution-log $\mathsf{Exe\text{-}Log}(\mathcal{I},T)=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{T}$ of the Gibbs sampling chain $(\bm{X}_{t})_{t=0}^{T}$ on instance $\mathcal{I}=(V,E,Q,\Phi)$:

  • $T=T(\mathcal{I})$, where $T(\mathcal{I})$ satisfies (6);

  • the random process $(\bm{X}_{t})_{t=0}^{T}$, uniquely recovered from the transitions $\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{T}$ and the initial state $\bm{X}_{0}$, is identically distributed as the Gibbs sampling chain (Algorithm 1) on instance $\mathcal{I}$ starting from initial state $\bm{X}_{0}$ with $v_{t}$ as the vertex picked at the $t$-th step.

These invariants guarantee that $\bm{X}_{T}$ provides a sample for $\mu_{\mathcal{I}}$ with $d_{\mathrm{TV}}\left({\bm{X}_{T}},{\mu_{\mathcal{I}}}\right)\leq\epsilon(|V|)$.

Suppose the current instance \mathcal{I} is updated to a new instance \mathcal{I}^{\prime}. We construct couplings between the execution-log 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Xt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I},T)=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} with initial state 𝑿0QV\bm{X}_{0}\in Q^{V} for \mathcal{I} and the execution-log 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Yt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T^{\prime})=\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T^{\prime}}} with initial state 𝒀0QV\bm{Y}_{0}\in Q^{V^{\prime}} for \mathcal{I}^{\prime}. Our goal is as follows: assuming Condition 6.2 for 𝑿0\bm{X}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)\mathsf{Exe\text{-}Log}(\mathcal{I},T), the same condition should hold invariantly for 𝒀0\bm{Y}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T^{\prime}).

Unlike traditional coupling of Markov chains for the analysis of mixing time, where the two chains start from arbitrarily distinct initial states but proceed by the same transition rule, here the two chains (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} and (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} start from similar states but have to obey different transition rules due to differences between instances \mathcal{I} and \mathcal{I}^{\prime}.

For technical reasons, we divide the update from =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) to =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime}) into two steps: we first update =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) to

(8) 𝗆𝗂𝖽=(V,E,Q,Φ𝗆𝗂𝖽),\displaystyle\mathcal{I}_{\mathsf{mid}}=(V,E,Q,\Phi^{\mathsf{mid}}),

where the potentials Φ𝗆𝗂𝖽=(ϕa𝗆𝗂𝖽)aVE\Phi^{\mathsf{mid}}=(\phi^{\mathsf{mid}}_{a})_{a\in V\cup E} in the middle instance 𝗆𝗂𝖽\mathcal{I}_{\mathsf{mid}} are defined as

aVE,ϕa𝗆𝗂𝖽{ϕaif aVEϕaif aVE;\displaystyle\forall a\in V\cup E,\quad\phi^{\mathsf{mid}}_{a}\triangleq\begin{cases}\phi^{\prime}_{a}&\text{if }a\in V^{\prime}\cup E^{\prime}\\ \phi_{a}&\text{if }a\not\in V^{\prime}\cup E^{\prime};\end{cases}

then we update 𝗆𝗂𝖽=(V,E,Q,Φ𝗆𝗂𝖽)\mathcal{I}_{\mathsf{mid}}=(V,E,Q,\Phi^{\mathsf{mid}}) to =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime}). In other words, the update from \mathcal{I} to 𝗆𝗂𝖽\mathcal{I}_{\mathsf{mid}} is only caused by updating the potentials of vertices and edges, while the underlying graph remains unchanged; and the update from 𝗆𝗂𝖽\mathcal{I}_{\mathsf{mid}} to \mathcal{I}^{\prime} is only caused by updating the underlying graph, i.e. adding vertices, deleting vertices, adding edges and deleting edges.
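As a small illustration of this decomposition (a sketch under our own representation, not the paper's pseudocode; the names phi, phi_new and build_phi_mid are hypothetical), the middle potentials Φ𝗆𝗂𝖽\Phi^{\mathsf{mid}} of (8) can be assembled by a dictionary merge over the old vertices and edges: any aVEa\in V\cup E that survives into \mathcal{I}^{\prime} takes its new potential, and the rest keep their old ones.

# Minimal sketch of (8): build Phi^mid on the OLD graph (V, E).
# phi has the keys V ∪ E (old instance), phi_new has the keys V' ∪ E' (new
# instance); values are the potential tables.
def build_phi_mid(phi, phi_new):
    phi_mid = {}
    for a, table in phi.items():           # a ranges over V ∪ E
        # take phi'_a if a ∈ V' ∪ E', otherwise keep phi_a
        phi_mid[a] = phi_new.get(a, table)
    return phi_mid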

The dynamic Gibbs sampling algorithm can be outlined as follows.

  • UpdateHamiltonian: update 𝑿0\bm{X}_{0} and vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} to a new initial state 𝒁0\bm{Z}_{0} and a new execution log 𝖤𝗑𝖾-𝖫𝗈𝗀(𝗆𝗂𝖽,T)=ut,Zt(ut)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}_{\mathsf{mid}},T)=\left\langle{u_{t}},Z_{t}({u_{t}})\right\rangle_{t=1}^{{T}} such that the random process (𝒁t)t=0T(\bm{Z}_{t})_{t=0}^{T} is the Gibbs sampling on instance 𝗆𝗂𝖽\mathcal{I}_{\mathsf{mid}}.

  • UpdateGraph: update 𝒁0\bm{Z}_{0} and ut,Zt(ut)t=1T\left\langle{u_{t}},Z_{t}({u_{t}})\right\rangle_{t=1}^{{T}} to a new initial state 𝒀0\bm{Y}_{0} and a new execution log 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Yt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T)=\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T}} such that the random process (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} is the Gibbs sampling on instance \mathcal{I}^{\prime}.

  • LengthFix: change the length of the execution log vt,Yt(vt)t=1T\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T}} from TT to TT^{\prime}, where T=T()T^{\prime}=T(\mathcal{I}^{\prime}) and T()T(\mathcal{I}^{\prime}) satisfies (6).

The dynamic Gibbs sampling algorithm is given in Algorithm 2.

Data : 𝑿0QV\bm{X}_{0}\in Q^{V} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Xt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I},T)=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} for current =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi).
Update : an update that modifies \mathcal{I} to =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime}).
1 compute T=T()T^{\prime}=T(\mathcal{I}^{\prime}) satisfying (6) and construct 𝗆𝗂𝖽=(V,E,Q,Φ𝗆𝗂𝖽)\mathcal{I}_{\mathsf{mid}}=(V,E,Q,\Phi^{\mathsf{mid}}) as in (8);
2 (𝒁0,ut,Zt(ut)t=1T)UpdateHamiltonian(,𝗆𝗂𝖽,𝑿0,vt,Xt(vt)t=1T)\left(\bm{Z}_{0},\left\langle{u_{t}},Z_{t}({u_{t}})\right\rangle_{t=1}^{{T}}\right)\leftarrow\textsf{UpdateHamiltonian}\left(\mathcal{I},\mathcal{I}_{\mathsf{mid}},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}\right);
// update the potentials: 𝗆𝗂𝖽\mathcal{I}\to\mathcal{I}_{\mathsf{mid}}
3 (𝒀0,vt,Yt(vt)t=1T)UpdateGraph(𝗆𝗂𝖽,,𝒁0,ut,Zt(ut)t=1T)\left(\bm{Y}_{0},\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T}}\right)\leftarrow\textsf{UpdateGraph}\left(\mathcal{I}_{\mathsf{mid}},\mathcal{I}^{\prime},\bm{Z}_{0},\left\langle{u_{t}},Z_{t}({u_{t}})\right\rangle_{t=1}^{{T}}\right);
// update the underlying graph: 𝗆𝗂𝖽\mathcal{I}_{\mathsf{mid}}\to\mathcal{I}^{\prime}
4 (𝒀0,vt,Yt(vt)t=1T)LengthFix(,𝒀0,vt,Yt(vt)t=1T,T)\left(\bm{Y}_{0},\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T^{\prime}}}\right)\leftarrow\textsf{LengthFix}\left(\mathcal{I}^{\prime},\bm{Y}_{0},\left\langle{v^{\prime}_{t}},Y_{t}({v^{\prime}_{t}})\right\rangle_{t=1}^{{T}},T^{\prime}\right), where T=T()T^{\prime}=T(\mathcal{I}^{\prime}) ;
// change the length of the execution log from TT to T=T()T^{\prime}=T(\mathcal{I}^{\prime})
5 update the data to 𝒀0\bm{Y}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Yt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T^{\prime})=\left\langle{v^{\prime}_{t}},Y_{t}({v^{\prime}_{t}})\right\rangle_{t=1}^{{T^{\prime}}};
Algorithm 2 Dynamic Gibbs sampling
Data : 𝑿0QV\bm{X}_{0}\in Q^{V} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Xt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I},T)=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} for current =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi).
Input : the new length T>0T^{\prime}>0.
1 if T<TT^{\prime}<T then
2  truncate vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} to vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T^{\prime}}};
3else
4  extend vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} to vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T^{\prime}}} by simulating the Gibbs sampling chain on \mathcal{I} for TTT^{\prime}-T more steps;
update the data to 𝑿0\bm{X}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Xt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I},T^{\prime})=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T^{\prime}}}
Algorithm 3 LengthFix(,𝑿0,vt,Xt(vt)t=1T,T)\textsf{LengthFix}\left(\mathcal{I},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}},T^{\prime}\right)

The subroutine LengthFix is given in Algorithm 3. We then describe UpdateHamiltonian (Section 6.1.1) and UpdateGraph (Section 6.1.2).

6.1.1. Coupling for Hamiltonian update

We consider the update that changes the potentials of vertices and edges. The update does not change the underlying graph. Let =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) be the current MRF instance. Let 𝑿0\bm{X}_{0} and vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} be the current initial state and execution log such that the random process (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} is the Gibbs sampling on instance \mathcal{I}. Upon such an update, the new instance becomes =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V,E,Q,\Phi^{\prime}). The algorithm UpdateHamiltonian(,,𝑿0,vt,Xt(vt)t=1T)\textsf{UpdateHamiltonian}(\mathcal{I},\mathcal{I}^{\prime},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}) updates the data to 𝒀0\bm{Y}_{0} and vt,Yt(vt)t=1T\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T}} such that the random process (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} is the Gibbs sampling on instance \mathcal{I}^{\prime}.

We transform the pair of 𝑿0QV\bm{X}_{0}\in Q^{V} and vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} to a new pair of 𝒀0QV\bm{Y}_{0}\in Q^{V} and vt,Yt(vt)t=1T\left\langle{v_{t}},Y_{t}({v_{t}})\right\rangle_{t=1}^{{T}} for \mathcal{I}^{\prime}. This is achieved as follows: the vertex sequence (vt)t=1T(v_{t})_{t=1}^{T} is identically coupled and the chain (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} is transformed to (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} by the following one-step local coupling between 𝑿\bm{X} and 𝒀\bm{Y}.

Definition 6.3 (one-step local coupling for Hamiltonian update).

The two chains (𝑿t)t=0(\bm{X}_{t})_{t=0}^{\infty} on instance =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) and (𝒀t)t=0(\bm{Y}_{t})_{t=0}^{\infty} on instance =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V,E,Q,\Phi^{\prime}) are coupled as:

  • Initially 𝑿0=𝒀0QV\bm{X}_{0}=\bm{Y}_{0}\in Q^{V};

  • for t=1,2,t=1,2,\ldots, the two chains 𝑿\bm{X} and 𝒀\bm{Y} jointly do:

    1. (1)

      pick the same vtVv_{t}\in V, and let (Xt(u),Yt(u))(Xt1(u),Yt1(u))(X_{t}(u),Y_{t}(u))\leftarrow(X_{t-1}(u),Y_{t-1}(u)) for all uV{vt}u\in V\setminus\{v_{t}\};

    2. (2)

      sample (Xt(vt),Yt(vt))(X_{t}(v_{t}),Y_{t}(v_{t})) from a coupling Dvt,vtσ,τ(,)D_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}^{\sigma,\tau}(\cdot,\cdot) of the marginal distributions μvt,(σ)\mu_{{v_{t}},{\mathcal{I}}}(\cdot\mid\sigma) and μvt,(τ)\mu_{{v_{t}},{\mathcal{I}^{\prime}}}(\cdot\mid\tau) with σ=Xt1(ΓG(vt))\sigma=X_{t-1}(\Gamma_{G}({v_{t}})) and τ=Yt1(ΓG(vt))\tau=Y_{t-1}(\Gamma_{G}({v_{t}})), where G=(V,E)G=(V,E).

The local coupling Dv,vσ,τ(,)D_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau}(\cdot,\cdot) for Hamiltonian update is specified as follows.

Definition 6.4 (local coupling Dv,vσ,τ(,)D_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau}(\cdot,\cdot) for Hamiltonian update).

Let vVv\in V be a vertex and let σ,τQΓG(v)\sigma,\tau\in Q^{\Gamma_{G}(v)} be two configurations, where G=(V,E)G=(V,E). We say a random pair (c,c)Q2(c,c^{\prime})\in Q^{2} is drawn from the coupling Dv,vσ,τ(,)D_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau}(\cdot,\cdot) if (c,c)(c,c^{\prime}) is generated by the following two steps:

  • sampling step: sample (c,c)Q2(c,c^{\prime})\in Q^{2} jointly from an optimal coupling D𝗈𝗉𝗍,vσ,τD^{\sigma,\tau}_{\mathsf{opt},\mathcal{I}_{v}} of the marginal distributions μv,(σ)\mu_{v,\mathcal{I}}(\cdot\mid\sigma) and μv,(τ)\mu_{v,\mathcal{I}}(\cdot\mid\tau), such that cμv,(σ)c\sim\mu_{v,\mathcal{I}}(\cdot\mid\sigma) and cμv,(τ)c^{\prime}\sim\mu_{v,\mathcal{I}}(\cdot\mid\tau);

  • resampling step: flip a coin independently with the probability of HEADS being

    (9) pv,vτ(c){0if μv,(cτ)μv,(cτ),μv,(cτ)μv,(cτ)μv,(cτ)otherwise ;\displaystyle p_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\tau}(c^{\prime})\triangleq\begin{cases}0&\text{if }\mu_{v,\mathcal{I}}(c^{\prime}\mid\tau)\leq\mu_{v,\mathcal{I}^{\prime}}(c^{\prime}\mid\tau),\\ \frac{\mu_{v,\mathcal{I}}(c^{\prime}\mid\tau)-\mu_{v,\mathcal{I}^{\prime}}(c^{\prime}\mid\tau)}{\mu_{v,\mathcal{I}}(c^{\prime}\mid\tau)}&\text{otherwise };\end{cases}

    if the outcome of coin flipping is HEADS, resample cc^{\prime} from the distribution νv,vτ\nu_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\tau} independently, where the distribution νv,vτ\nu_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\tau} is defined as

    (10) bQ:νv,vτ(b)max{0,μv,(bτ)μv,(bτ)}xQmax{0,μv,(xτ)μv,(xτ)}.\displaystyle\forall b\in Q:\quad\nu_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\tau}(b)\triangleq\frac{\max\left\{0,\mu_{v,\mathcal{I}^{\prime}}(b\mid\tau)-\mu_{v,\mathcal{I}}(b\mid\tau)\right\}}{\sum_{x\in Q}\max\left\{0,\mu_{v,\mathcal{I}}(x\mid\tau)-\mu_{v,\mathcal{I}^{\prime}}(x\mid\tau)\right\}}.
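The two steps of Definition 6.4 can be read off directly from a short sketch. The following Python fragment is our illustration only (the helpers draw, normalize and optimal_coupling are hypothetical, and the marginal distributions are passed in as plain dictionaries over QQ): it first couples the two old marginals optimally, then applies the correction of (9) and (10) so that the second coordinate follows the new marginal.

import random

# Our sketch of Definition 6.4. mu_old_sigma = mu_{v,I}(.|sigma),
# mu_old_tau = mu_{v,I}(.|tau), mu_new_tau = mu_{v,I'}(.|tau); all are dicts
# over the same spin set Q with values summing to 1.

def draw(mu):
    r, acc = random.random(), 0.0
    for x, p in mu.items():
        acc += p
        if r < acc:
            return x
    return x                                   # guard against floating-point slack

def normalize(w):
    s = sum(w.values())
    return {x: p / s for x, p in w.items()}

def optimal_coupling(mu1, mu2):
    """Draw (c, c') with c ~ mu1, c' ~ mu2, maximizing Pr[c = c']."""
    overlap = {x: min(mu1[x], mu2[x]) for x in mu1}
    same = sum(overlap.values())               # equals 1 - d_TV(mu1, mu2)
    if random.random() < same:
        c = draw(normalize(overlap))
        return c, c
    c = draw(normalize({x: max(0.0, mu1[x] - mu2[x]) for x in mu1}))
    cp = draw(normalize({x: max(0.0, mu2[x] - mu1[x]) for x in mu2}))
    return c, cp

def local_coupling_hamiltonian(mu_old_sigma, mu_old_tau, mu_new_tau):
    # sampling step: couple the two marginals of the OLD instance optimally
    c, cp = optimal_coupling(mu_old_sigma, mu_old_tau)
    # resampling step: correct c' so that it follows the NEW marginal, as in (9)
    p = max(0.0, mu_old_tau[cp] - mu_new_tau[cp]) / mu_old_tau[cp]
    if random.random() < p:
        nu = normalize({b: max(0.0, mu_new_tau[b] - mu_old_tau[b]) for b in mu_new_tau})
        cp = draw(nu)                          # the distribution of (10)
    return c, cp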
Lemma 6.5.

Dv,vσ,τ(,)D_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau}(\cdot,\cdot) in Definition 6.4 is a valid coupling between μv,(σ)\mu_{v,\mathcal{I}}(\cdot\mid\sigma) and μv,(τ)\mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\tau).

By Lemma 6.5, the resulting (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} is a faithful copy of the Gibbs sampling on instance \mathcal{I}^{\prime}, assuming that (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} is such a chain on instance \mathcal{I}.

Next we give an upper bound for the probability pv,vτ()p_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\tau}(\cdot) defined in (9).

Lemma 6.6.

For any two instances =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) and =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V,E,Q,\Phi^{\prime}) of the MRF model, and any vVv\in V, cQc\in Q and τQΓG(v)\tau\in Q^{\Gamma_{G}(v)}, it holds that

(11) pv,vτ(c)2(ϕvϕv1+e={u,v}Eϕeϕe1),\displaystyle p_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\tau}(c)\leq 2\left(\|\phi_{v}-\phi^{\prime}_{v}\|_{1}+\sum_{e=\{u,v\}\in E}\|\phi_{e}-\phi^{\prime}_{e}\|_{1}\right),

where ϕvϕv1=cQ|ϕv(c)ϕv(c)|\|\phi_{v}-\phi^{\prime}_{v}\|_{1}=\sum_{c\in Q}|\phi_{v}(c)-\phi^{\prime}_{v}(c)| and ϕeϕe1=c,cQ|ϕe(c,c)ϕe(c,c)|\|\phi_{e}-\phi^{\prime}_{e}\|_{1}=\sum_{c,c^{\prime}\in Q}|\phi_{e}(c,c^{\prime})-\phi^{\prime}_{e}(c,c^{\prime})|.

By Lemma 6.6, for each vertex vVv\in V, we define an upper bound of the probability pv,v()p^{\cdot}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(\cdot) as

(12) pv𝗎𝗉min{2(ϕvϕv1+e={u,v}Eϕeϕe1),1}.\displaystyle p^{\mathsf{up}}_{v}\triangleq\min\left\{2\left(\|\phi_{v}-\phi^{\prime}_{v}\|_{1}+\sum_{e=\{u,v\}\in E}\|\phi_{e}-\phi^{\prime}_{e}\|_{1}\right),1\right\}.

With pv𝗎𝗉p^{\mathsf{up}}_{v}, we can implement the one-step local coupling in Definition 6.3 as follows. We first sample each viVv_{i}\in V for 1iT1\leq i\leq T uniformly and independently. For each vertex vVv\in V, let Tv{1tTvt=v}T_{v}\triangleq\{1\leq t\leq T\mid v_{t}=v\} be the set of all the steps that pick the vertex vv. We select each tTvt\in T_{v} independently with probability pv𝗎𝗉p^{\mathsf{up}}_{v} to construct a random subset 𝒫vTv\mathcal{P}_{v}\subseteq T_{v}, and let

(13) 𝒫vV𝒫v.\displaystyle\mathcal{P}\triangleq\bigcup_{v\in V}\mathcal{P}_{v}.

We then couple the two chains (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} and (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T}. First set 𝑿0=𝒀0\bm{X}_{0}=\bm{Y}_{0}. For each 1tT1\leq t\leq T, we set (Xt(u),Yt(u))(Xt1(u),Yt1(u))(X_{t}(u),Y_{t}(u))\leftarrow(X_{t-1}(u),Y_{t-1}(u)) for all uV{vt}u\in V\setminus\{v_{t}\}; then generate the random pair (Xt(vt),Yt(vt))(X_{t}(v_{t}),Y_{t}(v_{t})) by the following procedure.

  • sampling step: Let σ=Xt1(ΓG(vt))\sigma=X_{t-1}(\Gamma_{G}(v_{t})) and τ=Yt1(ΓG(vt))\tau=Y_{t-1}(\Gamma_{G}(v_{t})). We draw a random pair (c,c)Q2(c,c^{\prime})\in Q^{2} from the optimal coupling D𝗈𝗉𝗍,vσ,τD^{\sigma,\tau}_{\mathsf{opt},\mathcal{I}_{v}} of the marginal distributions μv,(σ)\mu_{v,\mathcal{I}}(\cdot\mid\sigma) and μv,(τ)\mu_{v,\mathcal{I}}(\cdot\mid\tau) such that cμv,(σ)c\sim\mu_{v,\mathcal{I}}(\cdot\mid\sigma) and cμv,(τ)c^{\prime}\sim\mu_{v,\mathcal{I}}(\cdot\mid\tau);

  • resampling step: If t𝒫t\notin\mathcal{P}, set Xt(vt)=cX_{t}(v_{t})=c and Yt(vt)=cY_{t}(v_{t})=c^{\prime}. Otherwise, set Xt(vt)=cX_{t}(v_{t})=c and

    (14) Yt(vt)={bνvt,vtτwith probability pvt,vtτ(c)/pvt𝗎𝗉cwith probability 1pvt,vtτ(c)/pvt𝗎𝗉.\displaystyle Y_{t}(v_{t})=\begin{cases}b\sim\nu_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}^{\tau}&\text{with probability }p^{\tau}_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}(c^{\prime})/p^{\mathsf{up}}_{v_{t}}\\ c^{\prime}&\text{with probability }1-p^{\tau}_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}(c^{\prime})/p^{\mathsf{up}}_{v_{t}}.\end{cases}

Note that pvt𝗎𝗉>0p^{\mathsf{up}}_{v_{t}}>0 if t𝒫t\in\mathcal{P}. By Lemma 6.6, it must hold that pvt,vtτ(c)pvt𝗎𝗉p^{\tau}_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}(c^{\prime})\leq p^{\mathsf{up}}_{v_{t}}. Hence, the ratio pvt,vtτ(c)/pvt𝗎𝗉p^{\tau}_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}(c^{\prime})/p^{\mathsf{up}}_{v_{t}} is a valid probability. Since 𝒫\mathcal{P} is generated independently of the chains, the overall probability (conditioned on cc^{\prime}) that Yt(vt)Y_{t}(v_{t}) is resampled from νvt,vtτ\nu_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}^{\tau} is

Pr[Yt(vt) is resampled]=Pr[t𝒫]pvt,vtτ(c)pvt𝗎𝗉=pvt𝗎𝗉pvt,vtτ(c)pvt𝗎𝗉=pvt,vtτ(c).\displaystyle\Pr[Y_{t}(v_{t})\text{ is resampled}]=\Pr\left[t\in\mathcal{P}\right]\cdot\frac{p^{\tau}_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}(c^{\prime})}{p^{\mathsf{up}}_{v_{t}}}=p^{\mathsf{up}}_{v_{t}}\cdot\frac{p^{\tau}_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}(c^{\prime})}{p^{\mathsf{up}}_{v_{t}}}=p^{\tau}_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}(c^{\prime}).

Hence, our implementation perfectly simulates the coupling in Definition 6.3.
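The decomposition behind this calculation is worth isolating. The following sketch (our illustration; the names are hypothetical, and p, p_up stand for pvt,vtτ(c)p^{\tau}_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}(c^{\prime}) and pvt𝗎𝗉p^{\mathsf{up}}_{v_{t}}) shows the two-stage decision: a preselection coin with bias pvt𝗎𝗉p^{\mathsf{up}}_{v_{t}} decides whether step tt enters 𝒫\mathcal{P}, and only then is the finer acceptance with probability p/p𝗎𝗉p/p^{\mathsf{up}} evaluated, so the overall resampling probability is exactly pp while the finer computation is carried out only at the preselected steps.

import random

# Our sketch of the two-stage resampling decision: preselect with bias p_up,
# then accept with p / p_up; the overall probability of a resample is p.
def resample_decision(p, p_up, in_P):
    """in_P is the outcome of the preselection coin with bias p_up; p <= p_up."""
    return in_P and (random.random() < p / p_up)

# empirical check of Pr[resample] = p_up * (p / p_up) = p
p, p_up, trials = 0.03, 0.1, 200_000
hits = sum(resample_decision(p, p_up, random.random() < p_up) for _ in range(trials))
print(hits / trials)   # should be close to 0.03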

Let 𝒟t\mathcal{D}_{t} denote the set of disagreements between 𝑿t\bm{X}_{t} and 𝒀t\bm{Y}_{t}. Formally,

(15) 𝒟t{vVXt(v)Yt(v)}.\displaystyle\mathcal{D}_{t}\triangleq\{v\in V\mid X_{t}(v)\neq Y_{t}(v)\}.

Note that if vtΓG(𝒟t1)v_{t}\notin\Gamma_{G}(\mathcal{D}_{t-1}), the random pair (c,c)(c,c^{\prime}) drawn from the coupling D𝗈𝗉𝗍,vσ,τD^{\sigma,\tau}_{\mathsf{opt},\mathcal{I}_{v}} must satisfy c=cc=c^{\prime}. Thus it is easy to make the following observation for the (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} and (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} coupled as above.

Observation 6.7.

For any integer t[1,T]t\in[1,T], if vtΓG+(𝒟t1)v_{t}\notin\Gamma_{G}^{+}(\mathcal{D}_{t-1}) and t𝒫t\notin\mathcal{P}, then Xt(vt)=Yt(vt)X_{t}(v_{t})=Y_{t}(v_{t}) and 𝒟t=𝒟t1\mathcal{D}_{t}=\mathcal{D}_{t-1}.

With this observation, the new 𝒀0\bm{Y}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Yt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T)=\left\langle{v_{t}},Y_{t}({v_{t}})\right\rangle_{t=1}^{{T}} can be generated from 𝑿0\bm{X}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Xt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I},T)=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} as in Algorithm 4.

Data : 𝑿0QV\bm{X}_{0}\in Q^{V} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Xt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I},T)=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} for =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi).
Update : an update that modifies \mathcal{I} to =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V,E,Q,\Phi^{\prime}).
1 t00t_{0}\leftarrow 0, 𝒟\mathcal{D}\leftarrow\varnothing, and set 𝒀0𝑿0\bm{Y}_{0}\leftarrow\bm{X}_{0};
2 for each vVv\in V, construct a random subset 𝒫vTv{1tTvt=v}\mathcal{P}_{v}\subseteq T_{v}\triangleq\{1\leq t\leq T\mid v_{t}=v\} such that each element in TvT_{v} is selected independently with probability pv𝗎𝗉p^{\mathsf{up}}_{v} defined in (12);
3 construct the set 𝒫vV𝒫v\mathcal{P}\leftarrow\bigcup_{v\in V}\mathcal{P}_{v};
4 while t0<tT\exists\,t_{0}<t\leq T such that vtΓG+(𝒟)v_{t}\in\Gamma_{G}^{+}(\mathcal{D}) or t𝒫t\in\mathcal{P} do
5   find the smallest t>t0t>t_{0} such that vtΓG+(𝒟)v_{t}\in\Gamma_{G}^{+}(\mathcal{D}) or t𝒫t\in\mathcal{P};
6   for all t0<i<tt_{0}<i<t, let Yi(vi)=Xi(vi)Y_{i}(v_{i})=X_{i}(v_{i});
7   sample Yt(vt)QY_{t}(v_{t})\in Q conditioning on Xt(vt)X_{t}(v_{t}) according to the optimal coupling between μvt,(Xt1(ΓG(vt)))\mu_{v_{t},\mathcal{I}}(\cdot\mid X_{t-1}(\Gamma_{G}(v_{t}))) and μvt,(Yt1(ΓG(vt)))\mu_{v_{t},\mathcal{I}}(\cdot\mid Y_{t-1}(\Gamma_{G}(v_{t})));
8 if t𝒫t\in\mathcal{P} then
9    with probability pvt,vtτ(Yt(vt))/pvt𝗎𝗉p^{\tau}_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}(Y_{t}(v_{t}))/p^{\mathsf{up}}_{v_{t}} where τ=Yt1(ΓG(vt))\tau=Y_{t-1}(\Gamma_{G}(v_{t}))  do
10         resample Yt(vt)νvt,vtτY_{t}(v_{t})\sim\nu_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}^{\tau}, where νvt,vtτ\nu_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}^{\tau} is defined in (10) ;
11       
12    
13 if Xt(vt)Yt(vt)X_{t}(v_{t})\neq Y_{t}(v_{t}) then 𝒟𝒟{vt}\mathcal{D}\leftarrow\mathcal{D}\cup\{v_{t}\} else 𝒟𝒟{vt}\mathcal{D}\leftarrow\mathcal{D}\setminus\{v_{t}\};
14 t0tt_{0}\leftarrow t;
15 
16for all remaining t0<iTt_{0}<i\leq T: let Yi(vi)=Xi(vi)Y_{i}(v_{i})=X_{i}(v_{i});
17 update the data to 𝒀0\bm{Y}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Yt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T)=\left\langle{v_{t}},Y_{t}({v_{t}})\right\rangle_{t=1}^{{T}};
Algorithm 4 UpdateHamiltonian(,,𝑿0,vt,Xt(vt)t=1T)\textsf{UpdateHamiltonian}\left(\mathcal{I},\mathcal{I}^{\prime},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}\right)

Observation 6.7 says that the nontrivial coupling between Xt(vt)X_{t}(v_{t}) and Yt(vt)Y_{t}(v_{t}) is only needed when vtΓG+(𝒟t1)v_{t}\in\Gamma_{G}^{+}(\mathcal{D}_{t-1}) or t𝒫t\in\mathcal{P}, which occurs rarely as long as 𝒟t1\mathcal{D}_{t-1} and 𝒫\mathcal{P} are small. This is key to ensuring the small incremental time cost of Algorithm 4. For the chains (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} and (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} coupled as above and any 1tT1\leq t\leq T, let γt\gamma_{t} indicate whether the event t𝒫vtΓG+(𝒟t1)t\in\mathcal{P}\lor v_{t}\in\Gamma_{G}^{+}(\mathcal{D}_{t-1}) occurs:

(16) γt𝟏[t𝒫vtΓG+(𝒟t1)],\displaystyle\gamma_{t}\triangleq\mathbf{1}\left[t\in\mathcal{P}\lor v_{t}\in\Gamma_{G}^{+}(\mathcal{D}_{t-1})\right],

and R𝖧𝖺𝗆𝗂𝗅R_{\mathsf{Hamil}} denote the number of occurrences of such bad events:

(17) R𝖧𝖺𝗆𝗂𝗅t=1Tγt.\displaystyle R_{\mathsf{Hamil}}\triangleq\sum_{t=1}^{T}\gamma_{t}.

The following lemma bounds the expectation of R𝖧𝖺𝗆𝗂𝗅R_{\mathsf{Hamil}}.

Lemma 6.8 (cost of the coupling for UpdateHamiltonian).

Let =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) be the current MRF instance and =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V,E,Q,\Phi^{\prime}) the updated instance. Assume that \mathcal{I} satisfies Dobrushin-Shlosman condition (3.1) with constant δ>0\delta>0, and dHamil(,)=vVϕvϕv1+eEϕeϕe1Ld_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}^{\prime})=\sum_{v\in V}\left\|\phi_{v}-\phi^{\prime}_{v}\right\|_{1}+\sum_{e\in E}\left\|\phi_{e}-\phi^{\prime}_{e}\right\|_{1}\leq L. It holds that 𝔼[R𝖧𝖺𝗆𝗂𝗅]=O(ΔTLnδ)\mathbb{E}\left[{R_{\mathsf{Hamil}}}\right]=O\left(\frac{\Delta TL}{n\delta}\right), where n=|V|n=|V|, Δ\Delta is the maximum degree of graph G=(V,E)G=(V,E).

6.1.2. Coupling for graph update

Let =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) be an MRF instance, where Φ=(ϕa)aVE\Phi=(\phi_{a})_{a\in V\cup E}. Let 𝑿0\bm{X}_{0} and vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} be the current initial state and execution log such that the random process (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} is the Gibbs sampling on instance \mathcal{I}. Let =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime}) be the new instance obtained by updating the underlying graph, where Φ=(ϕa)aVE\Phi^{\prime}=(\phi_{a})_{a\in V^{\prime}\cup E^{\prime}} satisfies

a(VV)(EE),ϕa=ϕa.\displaystyle\forall a\in(V\cap V^{\prime})\cup(E\cap E^{\prime}),\quad\phi_{a}=\phi^{\prime}_{a}.

Given the update from \mathcal{I} to \mathcal{I}^{\prime}, the subroutine UpdateGraph(,,𝑿0,vt,Xt(vt)t=1T)\textsf{UpdateGraph}\left(\mathcal{I},\mathcal{I}^{\prime},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}\right) updates the data to a new initial state 𝒀0\bm{Y}_{0} and a new execution-log vt,Yt(vt)t=1T\left\langle{v^{\prime}_{t}},Y_{t}({v^{\prime}_{t}})\right\rangle_{t=1}^{{T}} such that the random process (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} is the Gibbs sampling on instance \mathcal{I}^{\prime}.

The subroutine UpdateGraph proceeds in the following three steps.

  • AddVertex: add isolated vertices in VVV^{\prime}\setminus V with potentials (ϕv)vVV(\phi_{v})_{v\in V^{\prime}\setminus V}, and update the instance =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) to a new instance

    (18) 1=1(,)(VV,E,Q,Φ(ϕv)vVV);\displaystyle\mathcal{I}_{1}=\mathcal{I}_{1}(\mathcal{I},\mathcal{I}^{\prime})\triangleq\left(V\cup V^{\prime},E,Q,\Phi\cup(\phi_{v})_{v\in V^{\prime}\setminus V}\right);

    then update 𝑿0\bm{X}_{0} and vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} to 𝒁0\bm{Z}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(1,T)=ut,Zt(ut)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}_{1},T)=\left\langle{u_{t}},Z_{t}({u_{t}})\right\rangle_{t=1}^{{T}} such that the random process (𝒁t)t=0T(\bm{Z}_{t})_{t=0}^{T} is the Gibbs sampling on instance 1\mathcal{I}_{1}.

  • UpdateEdge: add new edges in EEE^{\prime}\setminus E with potentials (ϕe)eEE(\phi_{e})_{e\in E^{\prime}\setminus E}, delete edges in EEE\setminus E^{\prime} , and update the instance 1\mathcal{I}_{1} to a new instance

    2=2(,)\displaystyle\mathcal{I}_{2}=\mathcal{I}_{2}(\mathcal{I},\mathcal{I}^{\prime}) (VV,E,Q,Φ(ϕv)vVV(ϕe)eEE(ϕe)eEE)\displaystyle\triangleq\left(V\cup V^{\prime},E^{\prime},Q,\Phi\cup(\phi_{v})_{v\in V^{\prime}\setminus V}\cup(\phi_{e})_{e\in E^{\prime}\setminus E}\setminus(\phi_{e})_{e\in E\setminus E^{\prime}}\right)
    (19) =(VV,E,Q,Φ(ϕv)vVV);\displaystyle=\left(V\cup V^{\prime},E^{\prime},Q,\Phi^{\prime}\cup(\phi_{v})_{v\in V\setminus V^{\prime}}\right);

    then update 𝒁0\bm{Z}_{0} and ut,Zt(ut)t=1T\left\langle{u_{t}},Z_{t}({u_{t}})\right\rangle_{t=1}^{{T}} to 𝒁0\bm{Z}^{{}^{\prime}}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(2,T)=wt,Zt(wt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}_{2},T)=\left\langle{w_{t}},Z^{{}^{\prime}}_{t}({w_{t}})\right\rangle_{t=1}^{{T}} such that the random process (𝒁t)t=0T(\bm{Z}^{{}^{\prime}}_{t})_{t=0}^{T} is the Gibbs sampling on instance 2\mathcal{I}_{2}.

  • DeleteVertex: delete isolated vertices in VVV\setminus V^{\prime}, and update the instance 2\mathcal{I}_{2} to =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime}); then update 𝒁0\bm{Z}^{{}^{\prime}}_{0} and wt,Zt(wt)t=1T\left\langle{w_{t}},Z^{{}^{\prime}}_{t}({w_{t}})\right\rangle_{t=1}^{{T}} to 𝒀0\bm{Y}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Yt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T)=\left\langle{v^{\prime}_{t}},Y_{t}({v^{\prime}_{t}})\right\rangle_{t=1}^{{T}} such that the random process (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} is the Gibbs sampling on instance \mathcal{I}^{\prime}.

The algorithm UpdateGraph is given in Algorithm 5.

Data : 𝑿0QV\bm{X}_{0}\in Q^{V} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Xt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I},T)=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} for current =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi).
Update : an update of the underlying graph that modifies \mathcal{I} to =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime}).
1 construct instances 1\mathcal{I}_{1} and 2\mathcal{I}_{2} as in (18) and (19);
2 (𝒁0,ut,Zt(ut)t=1T)AddVertex(,1,𝑿0,vt,Xt(vt)t=1T)\left(\bm{Z}_{0},\left\langle{u_{t}},Z_{t}({u_{t}})\right\rangle_{t=1}^{{T}}\right)\leftarrow\textsf{AddVertex}\left(\mathcal{I},\mathcal{I}_{1},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}\right);
// add isolated vertices to update \mathcal{I} to 1\mathcal{I}_{1}
3 (𝒁0,wt,Zt(wt)t=1T)UpdateEdge(1,2,𝒁0,ut,Zt(ut)t=1T)\left(\bm{Z}^{\prime}_{0},\left\langle{w_{t}},Z^{\prime}_{t}({w_{t}})\right\rangle_{t=1}^{{T}}\right)\leftarrow\textsf{UpdateEdge}\left(\mathcal{I}_{1},\mathcal{I}_{2},\bm{Z}_{0},\left\langle{u_{t}},Z_{t}({u_{t}})\right\rangle_{t=1}^{{T}}\right);
// add and delete edges to update 1\mathcal{I}_{1} to 2\mathcal{I}_{2}
4 (𝒀0,vt,Yt(vt)t=1T)DeleteVertex(2,,𝒁0,wt,Zt(wt)t=1T)\left(\bm{Y}_{0},\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T}}\right)\leftarrow\textsf{DeleteVertex}\left(\mathcal{I}_{2},\mathcal{I}^{\prime},\bm{Z}^{\prime}_{0},\left\langle{w_{t}},Z^{\prime}_{t}({w_{t}})\right\rangle_{t=1}^{{T}}\right);
// delete isolated vertices to update 2\mathcal{I}_{2} to \mathcal{I}^{\prime}
5 update the data to 𝒀0\bm{Y}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Yt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T)=\left\langle{v^{\prime}_{t}},Y_{t}({v^{\prime}_{t}})\right\rangle_{t=1}^{{T}};
Algorithm 5 UpdateGraph(,,𝑿0,vt,Xt(vt)t=1T)\textsf{UpdateGraph}\left(\mathcal{I},\mathcal{I}^{\prime},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}\right)

The subroutines AddVertex and DeleteVertex are simple, because they only deal with isolated variables. We first describe the main subroutine UpdateEdge, then describe AddVertex and DeleteVertex.

The coupling for UpdateEdge

We first consider the update of adding and deleting edges. The update does not change the set of variables. Let =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) be the current MRF instance. Let 𝑿0\bm{X}_{0} and vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} be the current initial state and execution log such that the random process (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} is the Gibbs sampling on instance \mathcal{I}. Upon such an update, the new instance becomes =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V,E^{\prime},Q,\Phi^{\prime}), where ϕa=ϕa\phi^{\prime}_{a}=\phi_{a} for all aV(EE)a\in V\cup(E\cap E^{\prime}). The subroutine UpdateEdge(,,𝑿0,vt,Xt(vt)t=1T)\textsf{UpdateEdge}(\mathcal{I},\mathcal{I}^{\prime},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}) updates the data to 𝒀0\bm{Y}_{0} and vt,Yt(vt)t=1T\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T}} such that the random process (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} is the Gibbs sampling on instance \mathcal{I}^{\prime}.

We use 𝒮V\mathcal{S}\subseteq V to denote the set of vertices affected by the update from \mathcal{I} to \mathcal{I}^{\prime}:

(20) 𝒮(u,v)EE{u,v},\displaystyle\mathcal{S}\triangleq\bigcup_{(u,v)\in E\oplus E^{\prime}}\{u,v\},

where EEE\oplus E^{\prime} is the symmetric difference between EE and EE^{\prime}.

We transform this pair of 𝑿0QV\bm{X}_{0}\in Q^{V} and vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} to a new pair of 𝒀0QV\bm{Y}_{0}\in Q^{V} and vt,Yt(vt)t=1T\left\langle{v_{t}},Y_{t}({v_{t}})\right\rangle_{t=1}^{{T}} for \mathcal{I}^{\prime}. This is achieved as follows: the vertex sequence (vt)t=1T(v_{t})_{t=1}^{T} is identically coupled and the chain (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} is transformed to (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} by the following one-step local coupling between 𝑿\bm{X} and 𝒀\bm{Y}.

Definition 6.9 (one-step local coupling for UpdateEdge).

The two chains (𝑿t)t=0(\bm{X}_{t})_{t=0}^{\infty} on instance =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) and (𝒀t)t=0(\bm{Y}_{t})_{t=0}^{\infty} on instance =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V,E^{\prime},Q,\Phi^{\prime}) are coupled as:

  • Initially 𝑿0=𝒀0QV\bm{X}_{0}=\bm{Y}_{0}\in Q^{V};

  • for t=1,2,t=1,2,\ldots, the two chains 𝑿\bm{X} and 𝒀\bm{Y} jointly do:

    1. (1)

      pick the same vtVv_{t}\in V, and let (Xt(u),Yt(u))(Xt1(u),Yt1(u))(X_{t}(u),Y_{t}(u))\leftarrow(X_{t-1}(u),Y_{t-1}(u)) for all uV{vt}u\in V\setminus\{v_{t}\};

    2. (2)

      sample (Xt(vt),Yt(vt))(X_{t}(v_{t}),Y_{t}(v_{t})) from a coupling Dvt,vtσ,τ(,)D_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}^{\sigma,\tau}(\cdot,\cdot) of the marginal distributions μvt,(σ)\mu_{{v_{t}},{\mathcal{I}}}(\cdot\mid\sigma) and μvt,(τ)\mu_{{v_{t}},{\mathcal{I}^{\prime}}}(\cdot\mid\tau) with σ=Xt1(ΓG(vt))\sigma=X_{t-1}(\Gamma_{G}({v_{t}})) and τ=Yt1(ΓG(vt))\tau=Y_{t-1}(\Gamma_{G^{\prime}}({v_{t}})), where G=(V,E)G=(V,E) and G=(V,E)G^{\prime}=(V,E^{\prime}).

The local coupling Dv,vσ,τ(,)D_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau}(\cdot,\cdot) for UpdateEdge is specified as follows.

(21) σQΓG(v),τQΓG(v):Dv,vσ,τ(,)={D𝗈𝗉𝗍,vσ,τ(,)if v𝒮,μv,(σ)×μv,(τ)if v𝒮,\displaystyle\forall\sigma\in Q^{\Gamma_{G}(v)},\tau\in Q^{\Gamma_{G^{\prime}}(v)}:\quad D_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau}(\cdot,\cdot)=\begin{cases}D_{\mathsf{opt},\mathcal{I}_{v}}^{\sigma,\tau}(\cdot,\cdot)&\text{if }v\not\in\mathcal{S},\\ \mu_{v,\mathcal{I}}(\cdot\mid\sigma)\times\mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\tau)&\text{if }v\in\mathcal{S},\end{cases}

where D𝗈𝗉𝗍,vσ,τD^{\sigma,\tau}_{\mathsf{opt},\mathcal{I}_{v}} is an optimal coupling of the marginal distributions μv,(σ)\mu_{v,\mathcal{I}}(\cdot\mid\sigma) and μv,(τ)\mu_{v,\mathcal{I}}(\cdot\mid\tau). Recall v=(Γv+,Ev,Q,Φv)\mathcal{I}_{v}=(\Gamma^{+}_{v},E_{v},Q,\Phi_{v}) where Ev={{u,v}E}E_{v}=\{\{u,v\}\in E\} and Φv=(ϕa)aΓv+Ev\Phi_{v}=(\phi_{a})_{a\in\Gamma_{v}^{+}\cup E_{v}}. Obviously, Dv,vσ,τD_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau} is a valid coupling of μv,(σ)\mu_{v,\mathcal{I}}(\cdot\mid\sigma) and μv,(τ)\mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\tau): for any v𝒮v\not\in\mathcal{S}, we have v=v\mathcal{I}_{v}=\mathcal{I}^{\prime}_{v}, hence μv,(σ)\mu_{v,\mathcal{I}}(\cdot\mid\sigma) and μv,(τ)\mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\tau) are marginal distributions of the same local instance, both defined by (4) on v\mathcal{I}_{v}, and thus they can be coupled by D𝗈𝗉𝗍,vσ,τD_{\mathsf{opt},\mathcal{I}_{v}}^{\sigma,\tau}; for v𝒮v\in\mathcal{S}, the product coupling is trivially valid.

Obviously the resulting (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} is a faithful copy of the Gibbs sampling on instance \mathcal{I}^{\prime}, assuming that (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} is such a chain on instance \mathcal{I}.

Recall 𝒟t{vVXt(v)Yt(v)}\mathcal{D}_{t}\triangleq\{v\in V\mid X_{t}(v)\neq Y_{t}(v)\} is the set of disagreements between 𝑿t\bm{X}_{t} and 𝒀t\bm{Y}_{t}. The following observation is easy to make for the (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} and (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} coupled as above.

Observation 6.10.

For any t[1,T]t\in[1,T], if vt𝒮ΓG+(𝒟t1)v_{t}\not\in\mathcal{S}\cup\Gamma_{G}^{+}(\mathcal{D}_{t-1}) then 𝐗t(vt)=𝐘t(vt)\bm{X}_{t}(v_{t})=\bm{Y}_{t}(v_{t}) and 𝒟t=𝒟t1\mathcal{D}_{t}=\mathcal{D}_{t-1}.

With this observation, the new 𝒀0\bm{Y}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Yt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T)=\left\langle{v_{t}},Y_{t}({v_{t}})\right\rangle_{t=1}^{{T}} can be generated from 𝑿0\bm{X}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Xt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I},T)=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} as in Algorithm 6.

Data : 𝑿0QV\bm{X}_{0}\in Q^{V} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Xt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I},T)=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} for current =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi).
Update : an update of adding and deleting edges that modifies \mathcal{I} to =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V,E^{\prime},Q,\Phi^{\prime}).
1 t00t_{0}\leftarrow 0, 𝒟\mathcal{D}\leftarrow\varnothing, 𝒀0𝑿0\bm{Y}_{0}\leftarrow\bm{X}_{0} and construct 𝒮(u,v)EE{u,v}\mathcal{S}\leftarrow\bigcup_{(u,v)\in E\oplus E^{\prime}}\{u,v\} ;
2 while t0<tT\exists\,t_{0}<t\leq T such that vt𝒮ΓG+(𝒟)v_{t}\in\mathcal{S}\cup\Gamma_{G}^{+}(\mathcal{D}) do
3   find the smallest t>t0t>t_{0} such that vt𝒮ΓG+(𝒟)v_{t}\in\mathcal{S}\cup\Gamma_{G}^{+}(\mathcal{D});
4   for all t0<i<tt_{0}<i<t, let Yi(vi)=Xi(vi)Y_{i}(v_{i})=X_{i}(v_{i});
5   sample Yt(vt)Y_{t}(v_{t}) conditioning on Xt(vt)X_{t}(v_{t}) according to the coupling Dvtσ,τ(,)D_{v_{t}}^{\sigma,\tau}(\cdot,\cdot) (constructed in (21)), where σ=Xt1(ΓG(vt))\sigma=X_{t-1}(\Gamma_{G}({v_{t}})) and τ=Yt1(ΓG(vt))\tau=Y_{t-1}(\Gamma_{G^{\prime}}({v_{t}}));
6 if Xt(vt)Yt(vt)X_{t}(v_{t})\neq Y_{t}(v_{t}) then 𝒟𝒟{vt}\mathcal{D}\leftarrow\mathcal{D}\cup\{v_{t}\} else 𝒟𝒟{vt}\mathcal{D}\leftarrow\mathcal{D}\setminus\{v_{t}\};
7 t0tt_{0}\leftarrow t;
8 
9for all remaining t0<iTt_{0}<i\leq T: let 𝒀i(vi)=Xi(vi)\bm{Y}_{i}(v_{i})=X_{i}(v_{i});
10 update the data to 𝒀0\bm{Y}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Yt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T)=\left\langle{v_{t}},Y_{t}({v_{t}})\right\rangle_{t=1}^{{T}};
Algorithm 6 UpdateEdge(,,𝑿0,vt,Xt(vt)t=1T)\textsf{UpdateEdge}(\mathcal{I},\mathcal{I}^{\prime},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}})

Observation 6.10 says that the nontrivial coupling between Xt(vt)X_{t}(v_{t}) and Yt(vt)Y_{t}(v_{t}) is only needed when vt𝒮ΓG+(𝒟t1)v_{t}\in\mathcal{S}\cup\Gamma_{G}^{+}(\mathcal{D}_{t-1}), which occurs rarely as long as 𝒟t1\mathcal{D}_{t-1} remains small. This is key to ensuring the small incremental time cost of Algorithm 6. Formally, for the chains (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} and (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} coupled as above and any 1tT1\leq t\leq T, let γt\gamma_{t} indicate whether this bad event occurs:

(22) γt𝟏[vt𝒮ΓG+(𝒟t1)],\displaystyle\gamma_{t}\triangleq\mathbf{1}\left[v_{t}\in\mathcal{S}\cup\Gamma_{G}^{+}(\mathcal{D}_{t-1})\right],

and let R𝗀𝗋𝖺𝗉𝗁R_{\mathsf{graph}} denote the number of occurrences of such bad events:

(23) R𝗀𝗋𝖺𝗉𝗁t=1Tγt.\displaystyle R_{\mathsf{graph}}\triangleq\sum_{t=1}^{T}\gamma_{t}.

We will see that R𝗀𝗋𝖺𝗉𝗁R_{\mathsf{graph}} dominates the cost of Algorithm 6, once a data structure is given that encodes the execution-log and efficiently resolves the updates and the various queries to the data made in Algorithm 6.

Lemma 6.11 (cost of the coupling for UpdateEdge).

Let =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) be the current MRF instance and =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V,E^{\prime},Q,\Phi^{\prime}) the updated instance. Assume that \mathcal{I}^{\prime} satisfies Dobrushin-Shlosman condition (3.1) with constant δ>0\delta>0, and |EE|L|E\oplus E^{\prime}|\leq L. It holds that 𝔼[R𝗀𝗋𝖺𝗉𝗁]=O(ΔTLnδ)\mathbb{E}\left[{R_{\mathsf{graph}}}\right]=O\left(\frac{\Delta TL}{n\delta}\right), where n=|V|n=|V|, Δ=max{ΔG,ΔG}\Delta=\max\{\Delta_{G},\Delta_{G^{\prime}}\}, and ΔG,ΔG\Delta_{G},\Delta_{G^{\prime}} denote the maximum degree of G=(V,E)G=(V,E) and G=(V,E)G^{\prime}=(V,E^{\prime}).

Coupling for AddVertex

Let =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) be the current MRF instance. Let 𝑿0\bm{X}_{0} and vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} be the current initial state and execution log such that the random process (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} is the Gibbs sampling on instance \mathcal{I}. The update adds a set of isolated vertices SS with potentials (ϕa)aS(\phi_{a})_{a\in S}. Upon such an update, the new instance becomes

=(V,E,Q,Φ)=(VS,E,Q,Φ(ϕa)aS).\displaystyle\mathcal{I}^{\prime}=(V^{\prime},E,Q,\Phi^{\prime})=(V\cup S,E,Q,\Phi\cup(\phi_{a})_{a\in S}).

The subroutine AddVertex(,,𝑿0,vt,Xt(vt)t=1T)\textsf{AddVertex}(\mathcal{I},\mathcal{I}^{\prime},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}) updates the data to 𝒀0\bm{Y}_{0} and vt,Yt(vt)t=1T\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T}} such that the random process (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} is the Gibbs sampling on instance \mathcal{I}^{\prime}.

Since the new instance \mathcal{I}^{\prime} is the same as \mathcal{I} except for the isolated vertices in SS, we can construct 𝒀0\bm{Y}_{0} by setting 𝒀0(V)=𝑿0\bm{Y}_{0}(V)=\bm{X}_{0} and choosing 𝒀0(S)QS\bm{Y}_{0}(S)\in Q^{S} arbitrarily; the execution-log 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Yt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T)=\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T}} can be constructed by inserting random appearances of vertices in SS into (vt)t=1T(v_{t})_{t=1}^{T}, while for any vSv\in S, the Yt(v)Y_{t}(v) at the inserted steps tt are sampled i.i.d. from the marginal distribution μv,()\mu_{{v},{\mathcal{I}^{\prime}}}(\cdot), which is just a distribution over QQ proportional to exp(ϕv())\exp(\phi_{v}(\cdot)) in the case of Gibbs sampling, since vv is an isolated vertex. Let [T]{1,2,,T}[T]\triangleq\{1,2,\ldots,T\}. Formally:

  1. (1)

    Let P[T]P\subseteq[T] be a random subset such that each t[T]t\in[T] is selected into PP independently with probability |S||SV|\frac{|S|}{|S\cup V|}. Let h=|P|h=|P| and enumerate all elements in PP as r1<r2<<rhr_{1}<r_{2}<\ldots<r_{h}. Let m=Thm=T-h and enumerate all elements in [T]P[T]\setminus P as 1<2<<m\ell_{1}<\ell_{2}<\cdots<\ell_{m}.

  2. (2)

    For each 1ih1\leq i\leq h, sample uiSu_{i}\in S uniformly and independently.

  3. (3)

    Let vt,Xt(vt)t=1mLengthFix(,𝑿0,vt,Xt(vt)t=1T,m)\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{m}}\leftarrow\textsf{LengthFix}\left(\mathcal{I},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}},m\right).

  4. (4)

    Construct vt,Yt(vt)t=1T\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T}} as follows:

    t=rkP\displaystyle\forall\,t=r_{k}\in P :vt=ukand Yt(vt)μuk,(), where μuk,(c)exp(ϕuk(c));\displaystyle:\quad v^{\prime}_{t}=u_{k}\quad\text{and }\quad Y_{t}(v_{t}^{\prime})\sim\mu_{{u_{k}},{\mathcal{I}^{\prime}}}(\cdot),\text{ where }\mu_{{u_{k}},{\mathcal{I}^{\prime}}}(c)\propto\exp(\phi_{u_{k}}(c));
    t=k[T]P\displaystyle\forall\,t=\ell_{k}\in[T]\setminus P :vt=vkand Yt(vt)=Xk(vt)=Xk(vk).\displaystyle:\quad v^{\prime}_{t}=v_{k}\quad\text{and }\quad Y_{t}(v_{t}^{\prime})=X_{k}(v_{t}^{\prime})=X_{k}(v_{k}).

It is easy to see that (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} is a faithful copy of the Gibbs sampling on instance \mathcal{I}^{\prime}.
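The interleaving in steps (1)-(4) can be summarized by a short sketch (our illustration; the representation of the execution-log as a list of (vertex, spin) pairs and the helper names are assumptions): the positions P[T]P\subseteq[T] are selected by independent coins, the old log is shortened to length m=T|P|m=T-|P| as LengthFix would do, and each inserted step picks a uniform new vertex and samples its spin from the distribution proportional to exp(ϕv())\exp(\phi_{v}(\cdot)).

import math, random

# Our sketch of AddVertex. old_log_full is the old execution-log of length T;
# n_old = |V|; phi_S maps each new isolated vertex to its potential over Q.
def draw_from_potential(phi_v):
    """Sample a spin c with probability proportional to exp(phi_v(c))."""
    weights = {c: math.exp(w) for c, w in phi_v.items()}
    r, acc = random.uniform(0, sum(weights.values())), 0.0
    for c, w in weights.items():
        acc += w
        if r < acc:
            return c
    return c

def add_vertex_log(T, n_old, phi_S, old_log_full):
    S = list(phi_S)
    P = {t for t in range(T) if random.random() < len(S) / (len(S) + n_old)}
    old_log = iter(old_log_full[: T - len(P)])   # role of LengthFix: keep m = T - |P| steps
    new_log = []
    for t in range(T):
        if t in P:                               # inserted step: a uniform new isolated vertex
            u = random.choice(S)
            new_log.append((u, draw_from_potential(phi_S[u])))
        else:                                    # copied step from the (shortened) old chain
            new_log.append(next(old_log))
    return new_log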

Coupling for DeleteVertex

Let =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) be the current MRF instance. The update deletes a set of isolated variables SVS\subseteq V. Let 𝑿0\bm{X}_{0} and vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} be the current initial state and execution log such that the random process (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} is the Gibbs sampling on instance \mathcal{I}. Upon such update, the instance is updated to =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V^{\prime},E,Q,\Phi^{\prime}), where V=VSV^{\prime}=V\setminus S and Φ=Φ(ϕv)vS\Phi^{\prime}=\Phi\setminus(\phi_{v})_{v\in S}. The subroutine DeleteVertex(,,𝑿0,vt,Xt(vt)t=1T)\textsf{DeleteVertex}(\mathcal{I},\mathcal{I}^{\prime},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}) updates the data to 𝒀0\bm{Y}_{0} and vt,Yt(vt)t=1T\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T}} such that the random process (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} is the Gibbs sampling on instance \mathcal{I}^{\prime}.

We can simply construct 𝒀0=𝑿0(V)\bm{Y}_{0}=\bm{X}_{0}(V^{\prime}). The new execution-log 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Yt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T)=\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T}} can be constructed from the original 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Xt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I},T)=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} by simply deleting all appearances of vertices vSv\in S in (vt)t=1T(v_{t})_{t=1}^{T} and the corresponding trivial transitions Xt(v)X_{t}(v), followed by calling LengthFix on instance \mathcal{I}^{\prime} to properly extend the chain back to length TT.

It is easy to see that (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T} is a faithful copy of the Gibbs sampling on instance \mathcal{I}^{\prime}.
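In the same spirit, a minimal sketch of DeleteVertex (our illustration; the names are hypothetical) just filters out the deleted isolated vertices from the initial state and from the execution-log; LengthFix then extends the shortened log back to the target length on \mathcal{I}^{\prime}.

# Our sketch of DeleteVertex: drop every transition of a deleted isolated vertex
# and its entry in the initial state; the resulting log is shorter than T and is
# then re-extended by LengthFix on the new instance.
def delete_vertex_log(X0, exe_log, S):
    S = set(S)
    Y0 = {v: c for v, c in X0.items() if v not in S}
    return Y0, [(v, c) for v, c in exe_log if v not in S]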

6.2. Data structure for Gibbs sampling

We now describe an efficient data structure for Gibbs sampling (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T}. Let =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) be an MRF instance. The data structure should provide the following functionalities.

  • Data: an initial state 𝑿0QV\bm{X}_{0}\in Q^{V} and an execution-log vt,Xt(vt)t=1T(V×Q)T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}\in(V\times Q)^{T} that records the TT transitions of the Gibbs sampling (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T};

  • updates:

    • Insert(t,v,c)\textsf{Insert}(t,v,c), which inserts a transition v,c\left\langle\,{v,c}\,\right\rangle after the (t1)(t-1)-th transition vt1,Xt1(vt1)\left\langle\,{v_{t-1},X_{t-1}(v_{t-1})}\,\right\rangle;

    • Remove(t)\textsf{Remove}(t), which deletes the tt-th transition vt,Xt(vt)\left\langle\,{v_{t},X_{t}(v_{t})}\,\right\rangle;

    • Change(t,c)\textsf{Change}(t,c), which changes the tt-th transition vt,Xt(vt)\left\langle\,{v_{t},X_{t}(v_{t})}\,\right\rangle to vt,c\left\langle\,{v_{t},c}\,\right\rangle;

    Note that the updates Insert(t,v,c)\textsf{Insert}(t,v,c) and Remove(t)\textsf{Remove}(t) change the length TT of the chain, as well as the order-numbers of all transitions after the inserted/deleted transition.

  • queries:

    • Eval(t,v)\textsf{Eval}(t,v), which returns the value of Xt(v)X_{t}(v) for arbitrary tt and vv (not necessarily =vt=v_{t});

    • Succ(t,v)\textsf{Succ}(t,v), which returns ii for the smallest i>ti>t such that vi=vv_{i}=v if such ii exists, or returns \perp if otherwise.

It is not difficult to realize that the query Eval(t,v)\textsf{Eval}(t,v) can actually be solved by a predecessor search defined symmetrically to Succ(t,v)\textsf{Succ}(t,v). This data structure problem for Gibbs sampling is quite natural and is of independent interest.
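To make the interface concrete before stating the theorem, the following is a simplified reference implementation (our illustration only): it stores the execution-log as a plain Python list, so Insert and Remove take linear time and Eval/Succ use linear scans, whereas Theorem 6.12 below achieves polylogarithmic time per operation with balanced search trees and an order-statistic tree. Steps are 1-indexed, matching vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}.

# Simplified (not asymptotically optimal) reference implementation of the interface.
class ExecutionLog:
    def __init__(self, X0, log):
        self.X0 = dict(X0)                 # initial state X_0
        self.log = list(log)               # [(v_1, X_1(v_1)), ..., (v_T, X_T(v_T))]

    def insert(self, t, v, c):             # Insert(t, v, c): new t-th transition <v, c>
        self.log.insert(t - 1, (v, c))

    def remove(self, t):                   # Remove(t): delete the t-th transition
        del self.log[t - 1]

    def change(self, t, c):                # Change(t, c): keep v_t, replace its value by c
        v, _ = self.log[t - 1]
        self.log[t - 1] = (v, c)

    def eval(self, t, v):                  # Eval(t, v): the value of X_t(v)
        for i in range(t, 0, -1):          # predecessor search: largest i <= t with v_i = v
            if self.log[i - 1][0] == v:
                return self.log[i - 1][1]
        return self.X0[v]                  # no update up to step t: keep the initial value

    def succ(self, t, v):                  # Succ(t, v): smallest i > t with v_i = v, else None
        for i in range(t + 1, len(self.log) + 1):
            if self.log[i - 1][0] == v:
                return i
        return None

# toy usage
log = ExecutionLog({"a": 0, "b": 1}, [("a", 1), ("b", 0), ("a", 0)])
print(log.eval(2, "a"), log.succ(1, "a"))  # 1 3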

Theorem 6.12 (data structure for Gibbs sampling).

There exists a deterministic dynamic data structure which stores an arbitrary initial state 𝐗0QV\bm{X}_{0}\in Q^{V} and an execution-log vt,Xt(vt)t=1T(V×Q)T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}\in(V\times Q)^{T} for Gibbs sampling using O(T+|V|)O(T+|V|) memory words, each of O(logT+log|V|+log|Q|)O(\log T+\log|V|+\log|Q|) bits, such that each operation among Insert, Remove, Change, Eval and Succ can be resolved in time O(log2T+log|V|)O(\log^{2}T+\log|V|).

Proof.

The initial state and execution-log are stored by separate data structures.

The initial state 𝑿0QV\bm{X}_{0}\in Q^{V} is maintained by a deterministic dynamic dictionary, with (v,X0(v))(v,X_{0}(v)) for vertices vVv\in V as the key-value pairs. Such a deterministic data structure answers queries of X0(v)X_{0}(v) given any vVv\in V while VV is dynamically changing.

The execution-log vt,Xt(vt)t=1T(V×Q)T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}\in(V\times Q)^{T} is stored by |V||V| balanced search trees (𝒯v)vV(\mathcal{T}_{v})_{v\in V} (e.g., red-black trees). In each tree 𝒯v\mathcal{T}_{v}, each node in 𝒯v\mathcal{T}_{v} stores a distinct transition vt,Xt(vt)\left\langle\,{v_{t},X_{t}(v_{t})}\,\right\rangle with vt=vv_{t}=v, such that the in-order tree walk of 𝒯v\mathcal{T}_{v} prints all vt,Xt(vt)\left\langle\,{v_{t},X_{t}(v_{t})}\,\right\rangle with vt=vv_{t}=v in the order they appear in the execution-log vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}. Altogether these trees (𝒯v)vV(\mathcal{T}_{v})_{v\in V} have TT nodes in total. Besides, these trees (𝒯v)vV(\mathcal{T}_{v})_{v\in V} are indexed by another deterministic dynamic dictionary, with (v,pv)(v,p_{v}) for vertices vVv\in V as key-value pairs, where each pvp_{v} is the pointer to the root of tree 𝒯v\mathcal{T}_{v}. This dictionary provides random accesses to the trees 𝒯v\mathcal{T}_{v} for all vVv\in V, while VV is dynamically changing.

Given any tt, we want to answer predecessor (or successor) search for the largest iti\leq t (or smallest i>ti>t) such that vi=vv_{i}=v. This is achieved with assistance from another data structure, an order-statistic tree (or OS-tree) 𝒯^\widehat{\mathcal{T}} [5, Section 14]. In 𝒯^\widehat{\mathcal{T}}, each node stores the “identity” of an individual transition vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} (which is actually a pointer to the node storing the transition vt,Xt(vt)\left\langle\,{v_{t},X_{t}(v_{t})}\,\right\rangle in the tree 𝒯v\mathcal{T}_{v} with vt=vv_{t}=v). In particular, the in-order tree walk of 𝒯^\widehat{\mathcal{T}} prints all vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} in that order. Such a data structure supports two query functions: (1) Select: given any tt, returns the identity of the tt-th transition vt,Xt(vt)\left\langle\,{v_{t},X_{t}(v_{t})}\,\right\rangle; and (2) Rank: given the identity of any transition vt,Xt(vt)\left\langle\,{v_{t},X_{t}(v_{t})}\,\right\rangle, returns its rank tt in the sequence vt,Xt(vt)t=1T\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}. Besides, the OS-tree 𝒯^\widehat{\mathcal{T}} also supports standard insertion (of a new transition v,c\left\langle\,{v,c}\,\right\rangle to a given rank tt) and deletion (of the transition vt,Xt(vt)\left\langle\,{v_{t},X_{t}(v_{t})}\,\right\rangle at a given rank tt). As a balanced tree, all these queries and updates for the OS-tree 𝒯^\widehat{\mathcal{T}} can be resolved in O(logT)O(\log T) time.

The successor and predecessor searches mentioned above, for any vVv\in V and tt, can then be resolved by binary searches in the balanced search tree 𝒯v\mathcal{T}_{v} while querying the OS-tree 𝒯^\widehat{\mathcal{T}} as an oracle for ordering, which takes time at most O(log2T+log|V|)O(\log^{2}T+\log|V|) in total, where the log|V|\log|V| cost is used for accessing the root of 𝒯v\mathcal{T}_{v} via the dynamic dictionary that indexes the trees (𝒯v)vV(\mathcal{T}_{v})_{v\in V}.

This solves the successor query Succ(t,v)\textsf{Succ}(t,v) as well as the evaluation query Eval(t,v)\textsf{Eval}(t,v) for Gibbs sampling, both within time cost O(log2T+log|V|)O(\log^{2}T+\log|V|), where the latter is actually solved by the predecessor search for the largest iti\leq t such that vi=vv_{i}=v and returning the value of Xi(vi)X_{i}(v_{i}) recorded in the ii-th transition vi,Xi(vi)\left\langle\,{v_{i},X_{i}(v_{i})}\,\right\rangle or returning the value of X0(v)X_{0}(v) if no such ii exists.

It is also easy to verify that with the above dynamic data structures, all updates, including: Insert(t,v,c)\textsf{Insert}(t,v,c), Remove(t)\textsf{Remove}(t) and Change(t,c)\textsf{Change}(t,c), can be implemented with cost at most O(log2T+log|V|)O(\log^{2}T+\log|V|), and the data structures together use O(T+|V|)O(T+|V|) words in total, where each word consists of O(logT+log|V|+log|Q|)O(\log T+\log|V|+\log|Q|) bits. ∎

6.3. Single-sample dynamic Gibbs sampling algorithm

With the data structure for Gibbs sampling stated in Theorem 6.12, the couplings constructed in Section 6.1 can be implemented as the algorithm for dynamic Gibbs sampling. Recall dgraph(,)d_{\textsf{graph}}(\cdot,\cdot) and dHamil(,)d_{\textsf{Hamil}}(\cdot,\cdot) are defined in (2).

Lemma 6.13 (single-sample dynamic Gibbs sampling algorithm).

Let ϵ:+(0,1)\epsilon:\mathbb{N}^{+}\to(0,1) be an error function. Let =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) be an MRF instance with n=|V|n=|V| and =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime}) the updated instance with n=|V|n^{\prime}=|V^{\prime}|. Denote T=T()T=T(\mathcal{I}), T=T()T^{\prime}=T(\mathcal{I}^{\prime}) and Tmax=max{T,T}T_{\max}=\max\{T,T^{\prime}\}. Assume dgraph(,)L𝗀𝗋𝖺𝗉𝗁=o(n)d_{\textsf{graph}}(\mathcal{I},\mathcal{I}^{\prime})\leq L_{\mathsf{graph}}=o(n), dHamil(,)L𝖧𝖺𝗆𝗂𝗅d_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}^{\prime})\leq L_{\mathsf{Hamil}}, and T,TΩ(nlogn)T,T^{\prime}\in\Omega(n\log n). The single-sample dynamic Gibbs sampling algorithm (Algorithm 2) does the following:

  • (space cost) The algorithm maintains an explicit copy of a sample 𝑿QV\bm{X}\in Q^{V} for the current instance \mathcal{I}, and also a data structure using O(T)O(T) memory words, each of O(logT)O(\log T) bits, for representing an initial state 𝑿0QV\bm{X}_{0}\in Q^{V} and an execution-log 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Xt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I},T)=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} for the Gibbs sampling (𝑿t)t=0T(\bm{X}_{t})_{t=0}^{T} on \mathcal{I} generating sample 𝑿=𝑿T\bm{X}=\bm{X}_{T}.

  • (correctness) Assuming that Condition 6.2 holds for 𝑿0\bm{X}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)\mathsf{Exe\text{-}Log}(\mathcal{I},T) for the Gibbs sampling on \mathcal{I}, upon each update that modifies \mathcal{I} to \mathcal{I}^{\prime}, the algorithm updates 𝑿\bm{X} to an explicit copy of a sample 𝒀QV\bm{Y}\in Q^{V^{\prime}} for the new instance \mathcal{I}^{\prime}, and correspondingly updates the 𝑿0\bm{X}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)\mathsf{Exe\text{-}Log}(\mathcal{I},T) represented by the data structure to a 𝒀0QV\bm{Y}_{0}\in Q^{V^{\prime}} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Yt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T^{\prime})=\left\langle{v_{t}^{\prime}},Y_{t}({v_{t}^{\prime}})\right\rangle_{t=1}^{{T^{\prime}}} for the Gibbs sampling (𝒀t)t=0T(\bm{Y}_{t})_{t=0}^{T^{\prime}} on \mathcal{I}^{\prime} generating the new sample 𝒀=𝒀T\bm{Y}=\bm{Y}_{T^{\prime}}, where 𝒀0\bm{Y}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T^{\prime}) satisfy Condition 6.2 for the Gibbs sampling on \mathcal{I}^{\prime}, therefore,

    dTV(𝒀,μ)ϵ(n).d_{\mathrm{TV}}\left({\bm{Y}},{\mu_{\mathcal{I}^{\prime}}}\right)\leq\epsilon(n^{\prime}).
  • (time cost) Assuming Condition 6.2 for 𝑿0\bm{X}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)\mathsf{Exe\text{-}Log}(\mathcal{I},T) for the Gibbs sampling on \mathcal{I}, the expected time complexity for resolving an update is

    O(Δn+Δ(|TT|+Tmax(L𝖧𝖺𝗆𝗂𝗅+L𝗀𝗋𝖺𝗉𝗁)n+𝔼[R𝖧𝖺𝗆𝗂𝗅]+𝔼[R𝗀𝗋𝖺𝗉𝗁])log2Tmax),\displaystyle O\left(\Delta n+\Delta\left(|T-T^{\prime}|+\frac{T_{\max}(L_{\mathsf{Hamil}}+L_{\mathsf{graph}})}{n}+\mathbb{E}\left[{R_{\mathsf{Hamil}}}\right]+\mathbb{E}\left[{R_{\mathsf{graph}}}\right]\right)\log^{2}T_{\max}\right),

    where Δ=max{ΔG,ΔG}\Delta=\max\{\Delta_{G},\Delta_{G^{\prime}}\}, ΔG\Delta_{G},ΔG\Delta_{G^{\prime}} denote the maximum degrees of G=(V,E)G=(V,E) and G=(V,E)G^{\prime}=(V^{\prime},E^{\prime}), R𝖧𝖺𝗆𝗂𝗅R_{\mathsf{Hamil}} is defined in (17) for the subroutine UpdateHamiltonian in Algorithm 2, and R𝗀𝗋𝖺𝗉𝗁R_{\mathsf{graph}} is defined in (23) for the subroutine UpdateEdge in Algorithm 2.

We remark that the O(Δn)O(\Delta n) term in the time cost is necessary because the update from \mathcal{I} to \mathcal{I}^{\prime} may change all the potentials of vertices and edges. One can remove the O(Δn)O(\Delta n) term from the time cost if we further restrict each update to change only a constant number of vertices, edges, and potentials.

The following result is a corollary from Lemma 6.13.

Corollary 6.14.

Assume ϵ:+(0,1)\epsilon:\mathbb{N}^{+}\to(0,1) in Lemma 6.13 satisfies the bounded difference condition in Definition 2.3. Assume \mathcal{I} and \mathcal{I}^{\prime} in Lemma 6.13 both satisfy Dobrushin-Shlosman condition (3.1) with constant δ>0\delta>0. The single-sample dynamic Gibbs sampling algorithm (Algorithm 2) uses O(nlogn)O(n\log n) memory words, each of O(logn)O(\log n) bits to maintain the sample for current instance \mathcal{I}, and resolves the update from \mathcal{I} to \mathcal{I}^{\prime} with expected time cost O(Δn+Δ2(L𝗀𝗋𝖺𝗉𝗁+L𝖧𝖺𝗆𝗂𝗅)log3n)O\left(\Delta n+\Delta^{2}(L_{\mathsf{graph}}+L_{\mathsf{Hamil}})\log^{3}n\right).

Proof of Lemma 6.13.

The dynamic Gibbs sampling algorithm is implemented as follows. The algorithm uses the dynamic data structure in Theorem 6.12 to maintain the initial state 𝑿0\bm{X}_{0} and execution-log 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Xt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I},T)=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}. Besides, the algorithm maintains the explicit copy of the sample 𝑿QV\bm{X}\in Q^{V} by a deterministic dynamic dictionary, with (v,X(v))(v,X(v)) for vertices vVv\in V as the key-value pairs. The lemma is proved as follows.

Space cost: Note that T=Ω(nlogn),|V|=nT=\Omega(n\log n),|V|=n and |Q|=O(1)|Q|=O(1). We have O(n)=O(T)O(n)=O(T) and O(logT+log|V|+log|Q|)=O(logT)O(\log T+\log|V|+\log|Q|)=O(\log T). The dynamic dictionary for sample 𝑿\bm{X} uses O(n)O(n) memory words, each of O(log|V|+log|Q|)O(\log|V|+\log|Q|) bits. Combining this with Theorem 6.12, the algorithm uses O(T)O(T) memory words, each of O(logT+log|V|+log|Q|)=O(logT)O(\log T+\log|V|+\log|Q|)=O(\log T) bits, to maintain the initial state, the execution-log and the random sample.

Correctness: The invariants for execution-log (Condition 6.2) are preserved by the coupling simulated by the algorithm. The correctness holds as a consequence.

Time cost: Consider the update that modifies \mathcal{I} to \mathcal{I}^{\prime}. We divide the algorithm into two stages.

  • Preparation stage: construct the updated instances \mathcal{I}^{\prime} and other middle instances 𝗆𝗂𝖽,1,2\mathcal{I}_{\mathsf{mid}},\mathcal{I}_{1},\mathcal{I}_{2} in (8), (18) and (19); compute pv𝗎𝗉p^{\mathsf{up}}_{v} in (12) for all vVv\in V and construct the random set 𝒫[T]={1,2,,T}\mathcal{P}\subseteq[T]=\{1,2,\ldots,T\} in (13).

  • Update stage: given 𝒫\mathcal{P} and pv𝗎𝗉p^{\mathsf{up}}_{v} for all vVv\in V, update the initial state 𝑿0\bm{X}_{0} to 𝒀0\bm{Y}_{0}, the execution-log 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Xt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I},T)=\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}} to 𝖤𝗑𝖾-𝖫𝗈𝗀(,T)=vt,Yt(vt)t=1T\mathsf{Exe\text{-}Log}(\mathcal{I}^{\prime},T^{\prime})=\left\langle{v^{\prime}_{t}},Y_{t}({v^{\prime}_{t}})\right\rangle_{t=1}^{{T^{\prime}}}, and the sample 𝑿\bm{X} to 𝒀\bm{Y}.

We make the following two claims.

Claim 6.15.

The expected running time of the preparation stage is

𝔼[T𝗉𝗋𝖾𝗉𝖺𝗋𝖺𝗍𝗂𝗈𝗇𝗌𝗂𝗇𝗀𝗅𝖾]=O(Δn+𝔼[|𝒫|]log2Tmax),\displaystyle\mathbb{E}\left[{T_{\mathsf{preparation}}^{\mathsf{single}}}\right]=O\left(\Delta n+\mathbb{E}\left[{\left|\mathcal{P}\right|}\right]\log^{2}T_{\max}\right),

and the expected size of 𝒫\mathcal{P} is at most 4TmaxL𝖧𝖺𝗆𝗂𝗅n\frac{4T_{\max}L_{\mathsf{Hamil}}}{n}.

Claim 6.16.

The expected running time of the update stage is

(24) 𝔼[T𝗎𝗉𝖽𝖺𝗍𝖾𝗌𝗂𝗇𝗀𝗅𝖾]=O(Δ(|TT|+TmaxL𝗀𝗋𝖺𝗉𝗁n+𝔼[R𝖧𝖺𝗆𝗂𝗅]+𝔼[R𝗀𝗋𝖺𝗉𝗁])log2Tmax),\displaystyle\mathbb{E}\left[{T_{\mathsf{update}}^{\mathsf{single}}}\right]=O\left(\Delta\left(|T-T^{\prime}|+\frac{T_{\max}L_{\mathsf{graph}}}{n}+\mathbb{E}\left[{R_{\mathsf{Hamil}}}\right]+\mathbb{E}\left[{R_{\mathsf{graph}}}\right]\right)\log^{2}T_{\max}\right),

R𝖧𝖺𝗆𝗂𝗅R_{\mathsf{Hamil}} is defined in (17) for the subroutine UpdateHamiltonian in Algorithm 2, and R𝗀𝗋𝖺𝗉𝗁R_{\mathsf{graph}} is defined in (23) for the subroutine UpdateEdge in Algorithm 2.

By the linearity of expectation, the expected time cost of the algorithm is 𝔼[T𝗉𝗋𝖾𝗉𝖺𝗋𝖺𝗍𝗂𝗈𝗇𝗌𝗂𝗇𝗀𝗅𝖾]+𝔼[T𝗎𝗉𝖽𝖺𝗍𝖾𝗌𝗂𝗇𝗀𝗅𝖾]\mathbb{E}\left[{T_{\mathsf{preparation}}^{\mathsf{single}}}\right]+\mathbb{E}\left[{T_{\mathsf{update}}^{\mathsf{single}}}\right]. This proves the time cost.

We introduce the following technical lemma to prove Corollary 6.14.

Lemma 6.17.

Let ϵ:+(0,1)\epsilon:\mathbb{N}^{+}\to(0,1) be a function such that there exists a constant C>0C>0 such that

n+:|ϵ(n+1)ϵ(n)|Cnϵ(n).\displaystyle\forall n\in\mathbb{N}^{+}:\quad\left|\epsilon(n+1)-\epsilon(n)\right|\leq\frac{C}{n}\epsilon(n).

Then the function \epsilon has the following properties:

  • for any n+n\in\mathbb{N}^{+}, it holds that ϵ(n)1poly(n)\epsilon(n)\geq\frac{1}{\mathrm{poly}(n)};

  • let α1\alpha\geq 1 be a constant, given any n,n+n,n^{\prime}\in\mathbb{N}^{+} such that 1αnnα\frac{1}{\alpha}\leq\frac{n^{\prime}}{n}\leq\alpha,

    \displaystyle\left|n\log\frac{n}{\epsilon(n)}-n^{\prime}\log\frac{n^{\prime}}{\epsilon(n^{\prime})}\right|\leq C^{\prime}\left|n^{\prime}-n\right|\log n,

    where CC^{\prime} is a constant that depends only on α,C\alpha,C and ϵ(3C)\epsilon(3\lceil C\rceil).

Proof.

By the condition, we have ϵ(t)(1+CtC)ϵ(t+1)\epsilon(t)\leq\left(1+\frac{C}{t-C}\right)\epsilon(t+1) for all t>C+1t>\lceil C+1\rceil. Thus for all n>l=3Cn>l=3\lceil C\rceil,

(25) ϵ(l)i=ln1(1+CiC)ϵ(n)ϵ(n)exp(Ci=2n11i)ϵ(n)exp(Clnn)=ϵ(n)nC.\displaystyle\epsilon(l)\leq\prod_{i=l}^{n-1}\left(1+\frac{C}{i-C}\right)\epsilon(n)\leq\epsilon(n)\exp\left(C\sum_{i=2}^{n-1}\frac{1}{i}\right)\leq\epsilon(n)\exp(C\ln n)=\epsilon(n)n^{C}.

Thus, we have ϵ(n)1poly(n)\epsilon(n)\geq\frac{1}{\mathrm{poly}(n)}.

We then prove the second property. We may assume without loss of generality that \min\{n,n^{\prime}\}\geq l, since otherwise we can choose C^{\prime} sufficiently large so that the second property holds. First, we prove the case n>n^{\prime}. We have \left|\log\frac{n}{n^{\prime}}\right|\leq\frac{n-n^{\prime}}{n^{\prime}}. By \epsilon(t)\leq\left(1+\frac{C}{t-C}\right)\epsilon(t+1) for all t>\lceil C+1\rceil, we also have

ϵ(n)i=nn1(1+CiC)ϵ(n)ϵ(n)exp(C(nn)nC).\displaystyle\epsilon(n^{\prime})\leq\prod_{i=n^{\prime}}^{n-1}\left(1+\frac{C}{i-C}\right)\epsilon(n)\leq\epsilon(n)\exp\left(\frac{C(n-n^{\prime})}{n^{\prime}-C}\right).

Thus,

(26) |lognϵ(n)lognϵ(n)||lognn|+|logϵ(n)ϵ(n)|nnn+C(nn)nC(2C+1)(nn)n.\displaystyle\left|\log\frac{n}{\epsilon(n)}-\log\frac{n^{\prime}}{\epsilon(n^{\prime})}\right|\leq\left|\log\frac{n}{n^{\prime}}\right|+\left|\log\frac{\epsilon(n)}{\epsilon(n^{\prime})}\right|\leq\frac{n-n^{\prime}}{n^{\prime}}+\frac{C(n-n^{\prime})}{n^{\prime}-C}\leq\frac{(2C+1)(n-n^{\prime})}{n^{\prime}}.

The last inequality holds because 2(n^{\prime}-C)\geq n^{\prime}+l-2C\geq n^{\prime}. Let C^{\prime}=2+\left|\log\epsilon(l)\right|+3C. We have

|nlognϵ(n)nlognϵ(n)||(nn)lognϵ(n)|+|n(lognϵ(n)lognϵ(n))|C|nn|logn.\displaystyle\left|n\log\frac{n}{\epsilon(n)}-n^{\prime}\log\frac{n^{\prime}}{\epsilon(n^{\prime})}\right|\leq\left|(n^{\prime}-n)\log\frac{n}{\epsilon(n)}\right|+\left|n^{\prime}\left(\log\frac{n}{\epsilon(n)}-\log\frac{n^{\prime}}{\epsilon(n^{\prime})}\right)\right|\leq C^{\prime}\left|n^{\prime}-n\right|\log n.

The last inequality is due to (25) and (26). Similarly, we can also prove the lemma if n<nn<n^{\prime}. ∎
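As a quick illustration of the bounded difference condition (not part of the proof), a polynomially decaying error function such as \epsilon(n)=n^{-2} satisfies it with C=2; the following sketch checks this numerically.

    # Sanity check (illustration only): eps(n) = n^{-2} satisfies
    # |eps(n+1) - eps(n)| <= (C/n) * eps(n) with C = 2.
    def eps(n):
        return n ** -2

    C = 2.0
    assert all(abs(eps(n + 1) - eps(n)) <= (C / n) * eps(n) for n in range(1, 10000))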

Proof of Corollary 6.14.

By L𝗀𝗋𝖺𝗉𝗁=o(n)L_{\mathsf{graph}}=o(n), we have n=Θ(n)n^{\prime}=\Theta(n). Since \mathcal{I} and \mathcal{I}^{\prime} both satisfy Dobrushin-Shlosman condition (3.1) with constant δ>0\delta>0, we can set T,TT,T^{\prime} as in (7) such that

T\displaystyle T =nδlognε(n)=Θ(nlogn)\displaystyle=\left\lceil\frac{n}{\delta}\log\frac{n}{\varepsilon(n)}\right\rceil=\Theta(n\log n)
T\displaystyle T^{\prime} =nδlognε(n)=Θ(nlogn).\displaystyle=\left\lceil\frac{n^{\prime}}{\delta}\log\frac{n^{\prime}}{\varepsilon(n^{\prime})}\right\rceil=\Theta(n\log n).

The equations hold because n=Θ(n)n^{\prime}=\Theta(n) and the error function ϵ\epsilon satisfies ϵ()1poly()\epsilon(\ell)\geq\frac{1}{\mathrm{poly}(\ell)} by Lemma 6.17. Thus, we have

(27) Tmax=max{T,T}=O(nlogn).\displaystyle T_{\max}=\max\{T,T^{\prime}\}=O(n\log n).

By Lemma 6.17 and |nn|L𝗀𝗋𝖺𝗉𝗁=o(n)|n^{\prime}-n|\leq L_{\mathsf{graph}}=o(n), we have

(28) |TT|=O(L𝗀𝗋𝖺𝗉𝗁logn).\displaystyle\left|T-T^{\prime}\right|=O(L_{\mathsf{graph}}\log n).

Let 𝗆𝗂𝖽=(V,E,Q,Φ𝗆𝗂𝖽)\mathcal{I}_{\mathsf{mid}}=(V,E,Q,\Phi^{\mathsf{mid}}) be the middle instance constructed as in (8). In Algorithm 2, we call the subroutine UpdateHamiltonian for instances \mathcal{I} and 𝗆𝗂𝖽\mathcal{I}_{\mathsf{mid}}. Since \mathcal{I} satisfies the Dobrushin-Shlosman condition, by Lemma 6.8 and d(,𝗆𝗂𝖽)d(,)L𝖧𝖺𝗆𝗂𝗅\,\mathrm{d}(\mathcal{I},\mathcal{I}_{\mathsf{mid}})\leq d(\mathcal{I},\mathcal{I}^{\prime})\leq L_{\mathsf{Hamil}}, we have

(29) 𝔼[R𝖧𝖺𝗆𝗂𝗅]=O(ΔTL𝖧𝖺𝗆𝗂𝗅δn)=O(ΔL𝖧𝖺𝗆𝗂𝗅logn),\displaystyle\mathbb{E}\left[{R_{\mathsf{Hamil}}}\right]=O\left(\frac{\Delta TL_{\mathsf{Hamil}}}{\delta n}\right)=O(\Delta L_{\mathsf{Hamil}}\log n),

where R𝖧𝖺𝗆𝗂𝗅R_{\mathsf{Hamil}} is defined in (17) for the subroutine UpdateHamiltonian.

We also call the subroutine UpdateGraph for instances \mathcal{I}_{\mathsf{mid}} and \mathcal{I}^{\prime} in Algorithm 2. The subroutine is shown in Algorithm 5. We first add isolated vertices to update \mathcal{I}_{\mathsf{mid}} to \mathcal{I}_{1}, then update edges to turn \mathcal{I}_{1} into \mathcal{I}_{2}, and finally delete isolated vertices to update \mathcal{I}_{2} to \mathcal{I}^{\prime}. Since \mathcal{I}^{\prime} satisfies the Dobrushin-Shlosman condition and the only difference between \mathcal{I}_{2} and \mathcal{I}^{\prime} is that \mathcal{I}_{2} contains extra isolated vertices, it is easy to verify that \mathcal{I}_{2} also satisfies the Dobrushin-Shlosman condition. In Algorithm 5, the subroutine UpdateEdge is called for \mathcal{I}_{1} and \mathcal{I}_{2}. By Lemma 6.11, we have

(30) \displaystyle\mathbb{E}\left[{R_{\mathsf{graph}}}\right]=O\left(\frac{\Delta TL_{\mathsf{graph}}}{\delta n}\right)=O(\Delta L_{\mathsf{graph}}\log n),

where R𝗀𝗋𝖺𝗉𝗁R_{\mathsf{graph}} is defined in (23) for the subroutine UpdateEdge.

Combining (27), (28), (29), (30) with Lemma 6.13, we have the expected time cost is

𝔼[T𝖼𝗈𝗌𝗍]\displaystyle\mathbb{E}\left[{T_{\mathsf{cost}}}\right] =O(Δn+Δ(|TT|+Tmax(L𝖧𝖺𝗆𝗂𝗅+L𝗀𝗋𝖺𝗉𝗁)n+𝔼[R𝖧𝖺𝗆𝗂𝗅]+𝔼[R𝗀𝗋𝖺𝗉𝗁])log2Tmax)\displaystyle=O\left(\Delta n+\Delta\left(|T-T^{\prime}|+\frac{T_{\max}(L_{\mathsf{Hamil}}+L_{\mathsf{graph}})}{n}+\mathbb{E}\left[{R_{\mathsf{Hamil}}}\right]+\mathbb{E}\left[{R_{\mathsf{graph}}}\right]\right)\log^{2}T_{\max}\right)
=O(Δn+Δ2(L𝗀𝗋𝖺𝗉𝗁+L𝖧𝖺𝗆𝗂𝗅)log3n).\displaystyle=O\left(\Delta n+\Delta^{2}(L_{\mathsf{graph}}+L_{\mathsf{Hamil}})\log^{3}n\right).\qed

6.4. Multi-sample dynamic Gibbs sampling algorithm

In this section, we give a multi-sample dynamic Gibbs sampling algorithm that maintains multiple independent random samples for the current MRF instance. Theorem 6.1 follows immediately from the following lemma.

Lemma 6.18 (multi-sample dynamic Gibbs sampling algorithm).

Let N:++N:\mathbb{N}^{+}\to\mathbb{N}^{+} and ϵ:+(0,1)\epsilon:\mathbb{N}^{+}\to(0,1) be two functions satisfying the bounded difference condition in Definition 2.3. Let =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) be an MRF instance with n=|V|n=|V| and =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime}) the updated instance with n=|V|n^{\prime}=|V^{\prime}|. Assume that \mathcal{I} and \mathcal{I}^{\prime} both satisfy Dobrushin-Shlosman condition with constant δ>0\delta>0, dgraph(,)L𝗀𝗋𝖺𝗉𝗁=o(n)d_{\textsf{graph}}(\mathcal{I},\mathcal{I}^{\prime})\leq L_{\mathsf{graph}}=o(n) and dHamil(,)L𝖧𝖺𝗆𝗂𝗅d_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}^{\prime})\leq L_{\mathsf{Hamil}}. Denote T=nδlognϵ(n)T=\lceil\frac{n}{\delta}\log\frac{n}{\epsilon(n)}\rceil, T=nδlognϵ(n)T^{\prime}=\lceil\frac{n^{\prime}}{\delta}\log\frac{n^{\prime}}{\epsilon(n^{\prime})}\rceil.

There is an algorithm which does the following:

  • (space cost) The algorithm maintains N(n) explicit copies of independent samples \bm{X}^{(1)},\ldots,\bm{X}^{(N(n))}, where \bm{X}^{(i)}\in Q^{V} for all 1\leq i\leq N(n), for the current instance \mathcal{I}, and also a data structure using O(nN(n)\log n) memory words, each of O(\log n) bits, for representing the initial states \bm{X}^{(i)}_{0}\in Q^{V} and the execution-logs \mathsf{Exe\text{-}Log}^{(i)}(\mathcal{I},T)=\left\langle{v^{(i)}_{t}},X^{(i)}_{t}({v^{(i)}_{t}})\right\rangle_{t=1}^{{T}} for 1\leq i\leq N(n), such that each Gibbs sampling chain (\bm{X}^{(i)}_{t})_{t=0}^{T} on \mathcal{I} generates an independent sample \bm{X}^{(i)}=\bm{X}^{(i)}_{T}.

  • (correctness) Assuming that Condition 6.2 holds for each \bm{X}^{(i)}_{0} and \mathsf{Exe\text{-}Log}^{(i)}(\mathcal{I},T) for the Gibbs sampling on \mathcal{I}, upon each update that modifies \mathcal{I} to \mathcal{I}^{\prime}, the algorithm updates \bm{X}^{(1)},\bm{X}^{(2)},\ldots,\bm{X}^{(N(n))} to N(n^{\prime}) explicit copies of independent samples \bm{Y}^{(1)},\bm{Y}^{(2)},\ldots,\bm{Y}^{(N(n^{\prime}))}\in Q^{V^{\prime}} for the new instance \mathcal{I}^{\prime}, and correspondingly updates the data represented by the data structure to \bm{Y}^{(i)}_{0}\in Q^{V^{\prime}} and \mathsf{Exe\text{-}Log}^{(i)}(\mathcal{I}^{\prime},T^{\prime})=\left\langle{u^{(i)}_{t}},Y^{(i)}_{t}({u^{(i)}_{t}})\right\rangle_{t=1}^{{T^{\prime}}} for 1\leq i\leq N(n^{\prime}), such that each Gibbs sampling chain (\bm{Y}^{(i)}_{t})_{t=0}^{T^{\prime}} on \mathcal{I}^{\prime} generates a new sample \bm{Y}^{(i)}=\bm{Y}^{(i)}_{T^{\prime}}, where each \bm{Y}^{(i)}_{0} and \mathsf{Exe\text{-}Log}^{(i)}(\mathcal{I}^{\prime},T^{\prime}) satisfy Condition 6.2 for the Gibbs sampling on \mathcal{I}^{\prime}; therefore,

    dTV(𝒀(i),μ)ϵ(n).d_{\mathrm{TV}}\left({\bm{Y}^{(i)}},{\mu_{\mathcal{I}^{\prime}}}\right)\leq\epsilon(n^{\prime}).
  • (time cost) Assuming Condition 6.2 for each 𝑿0(i)\bm{X}^{(i)}_{0} and 𝖤𝗑𝖾-𝖫𝗈𝗀(i)(,T)\mathsf{Exe\text{-}Log}^{(i)}(\mathcal{I},T) for the Gibbs sampling on \mathcal{I}, the time complexity for resolving an update is

    O(Δ2(L𝖧𝖺𝗆𝗂𝗅+L𝗀𝗋𝖺𝗉𝗁)N(n)log3n+Δn),O\left(\Delta^{2}(L_{\mathsf{Hamil}}+L_{\mathsf{graph}})N(n)\cdot\log^{3}n+\Delta n\right),

    where Δ=max{ΔG,ΔG}\Delta=\max\{\Delta_{G},\Delta_{G^{\prime}}\}, and ΔG,ΔG\Delta_{G},\Delta_{G^{\prime}} denote the maximum degree of G=(V,E)G=(V,E) and G=(V,E)G^{\prime}=(V^{\prime},E^{\prime}).

The following technical lemma will be used to prove Lemma 6.18.

Lemma 6.19.

Let N:++N:\mathbb{N}^{+}\to\mathbb{N}^{+} be a function such that there exists a constant C>0C>0 such that

n+:|N(n+1)N(n)|CnN(n).\displaystyle\forall n\in\mathbb{N}^{+}:\quad\left|N(n+1)-N(n)\right|\leq\frac{C}{n}N(n).

Then the function NN has the following properties

  • for any n+n\in\mathbb{N}^{+}, it holds that N(n)poly(n)N(n)\leq\mathrm{poly}(n);

  • let α1\alpha\geq 1 be a constant, given any n,n+n,n^{\prime}\in\mathbb{N}^{+} such that 1αnnα\frac{1}{\alpha}\leq\frac{n^{\prime}}{n}\leq\alpha,

    \displaystyle\left|N(n)-N(n^{\prime})\right|\leq C^{\prime}(\alpha,C)\cdot\frac{\left|n-n^{\prime}\right|}{n}N(n),

    where C(α,C)C^{\prime}(\alpha,C) is a constant that depends only on α\alpha and CC.

Proof.

By the condition, we have N(n+1)(1+Cn)N(n)N(n+1)\leq\left(1+\frac{C}{n}\right)N(n). Thus for all n+n\in\mathbb{N}^{+},

N(n)N(1)i=1n1(1+Ci)N(1)exp(Ci=1n11i)=N(1)exp(Θ(lnn))=poly(n).\displaystyle N(n)\leq N(1)\prod_{i=1}^{n-1}\left(1+\frac{C}{i}\right)\leq N(1)\exp\left(C\sum_{i=1}^{n-1}\frac{1}{i}\right)=N(1)\exp(\Theta(\ln n))=\mathrm{poly}(n).

We then prove the second property. Note that \frac{\left|n-n^{\prime}\right|}{n}\leq\alpha; it suffices to prove

(31) |N(n)N(n)1|C(α,C)|nn|n.\displaystyle\left|\frac{N(n^{\prime})}{N(n)}-1\right|\leq C^{\prime}(\alpha,C)\cdot\frac{\left|n-n^{\prime}\right|}{n}.

Assume that min{n,n}2Cα\min\{n,n^{\prime}\}\leq 2C\alpha. Then, we have max{n,n}2Cα2\max\{n,n^{\prime}\}\leq 2C\alpha^{2}. We can choose C(α,C)C^{\prime}(\alpha,C) sufficiently large so that (31) holds. Assume n>n>2αCn^{\prime}>n>2\alpha C. Note that |nn|nα\frac{\left|n-n^{\prime}\right|}{n}\leq\alpha. We have

1C|nn|n(1Cn)|nn|N(n)N(n)(1+Cn)|nn|1+Cexp(αC)|nn|n,\displaystyle 1-\frac{C\left|n-n^{\prime}\right|}{n}\leq\left(1-\frac{C}{n}\right)^{\left|n-n^{\prime}\right|}\leq\frac{N(n^{\prime})}{N(n)}\leq\left(1+\frac{C}{n}\right)^{\left|n-n^{\prime}\right|}\leq 1+\frac{C\exp(\alpha C)\left|n-n^{\prime}\right|}{n},

which implies (31) holds if C(α,C)Cexp(αC)C^{\prime}(\alpha,C)\geq C\exp(\alpha C). Assume n>n>2αCn>n^{\prime}>2\alpha C. Note that |nn|nα\frac{\left|n-n^{\prime}\right|}{n}\leq\alpha and nnαn^{\prime}\geq\frac{n}{\alpha}. We have

1αC|nn|n(1αCn)|nn|N(n)N(n)(1+αCn)|nn|1+Cαexp(α2C)|nn|n.\displaystyle 1-\frac{\alpha C\left|n-n^{\prime}\right|}{n}\leq\left(1-\frac{\alpha C}{n}\right)^{\left|n-n^{\prime}\right|}\leq\frac{N(n^{\prime})}{N(n)}\leq\left(1+\frac{\alpha C}{n}\right)^{\left|n-n^{\prime}\right|}\leq 1+\frac{C\alpha\exp(\alpha^{2}C)\left|n-n^{\prime}\right|}{n}.

which implies (31) holds if C(α,C)Cαexp(α2C)C^{\prime}(\alpha,C)\geq C\alpha\exp(\alpha^{2}C). ∎

Proof of Lemma 6.18.

The main idea of the multi-sample dynamic Gibbs sampling algorithm is to use the single-sample dynamic Gibbs sampling algorithm (Algorithm 2) to maintain each sample \bm{X}^{(i)}\in Q^{V} for 1\leq i\leq N(n). We need a careful implementation to guarantee the time cost in Lemma 6.18.

Space cost: Note that T=nδlognε(n)=Θ(nlogn)T=\left\lceil\frac{n}{\delta}\log\frac{n}{\varepsilon(n)}\right\rceil=\Theta(n\log n) due to Lemma 6.17 and N(n)poly(n)N(n)\leq\mathrm{poly}(n) due to Lemma 6.19. The dynamic dictionary for each sample 𝑿(i)\bm{X}^{(i)} uses O(n)O(n) memory words, each of O(logn)O(\log n) bits. Hence, the algorithm uses O(TN(n))=O(nN(n)logn)O(T\cdot N(n))=O(nN(n)\log n) memory words to maintain all the initial states, execution-logs and the random samples due to Theorem 6.12.

Correctness: The invariants for execution-log (Condition 6.2) are preserved by the coupling simulated by the algorithm. The correctness holds as a consequence.

Time cost: Define N_{\min}\triangleq\min\{N(n),N(n^{\prime})\}. Fix 1\leq k\leq N_{\min}. We use Algorithm 2 to update the sample \bm{X}^{(k)} to \bm{Y}^{(k)}. Let \mathcal{P}_{k}\subseteq[T] denote the set defined in (13) for the subroutine UpdateHamiltonian in Algorithm 2. The multi-sample dynamic Gibbs sampling algorithm has the following three stages (a schematic driver is sketched after the list).

  • Preparation stage: construct the updated instance \mathcal{I}^{\prime} and the intermediate instances \mathcal{I}_{\mathsf{mid}},\mathcal{I}_{1},\mathcal{I}_{2} in (8), (18), (6.1.2); compute p^{\mathsf{up}}_{v} in (12) for all v\in V; and construct the random sets \mathcal{P}_{1},\mathcal{P}_{2},\ldots,\mathcal{P}_{N_{\min}}.

  • Update stage: given the (pv𝗎𝗉)vV(p^{\mathsf{up}}_{v})_{v\in V} and (𝒫i)1iNmin(\mathcal{P}_{i})_{1\leq i\leq N_{\min}}, for each 1iNmin1\leq i\leq N_{\min}, use Algorithm 2 to update the initial state 𝑿0(i)\bm{X}_{0}^{(i)} to 𝒀0(i)\bm{Y}_{0}^{(i)}, the execution-log 𝖤𝗑𝖾-𝖫𝗈𝗀(i)(,T)=vt(i),Xt(i)(vt(i))t=1T\mathsf{Exe\text{-}Log}^{(i)}(\mathcal{I},T)=\left\langle{v_{t}^{(i)}},X^{(i)}_{t}({v_{t}^{(i)}})\right\rangle_{t=1}^{{T}} to 𝖤𝗑𝖾-𝖫𝗈𝗀(i)(,T)=u(i),Yt(i)(u(i))t=1T\mathsf{Exe\text{-}Log}^{(i)}(\mathcal{I}^{\prime},T^{\prime})=\left\langle{u^{(i)}},Y^{(i)}_{t}({u^{(i)}})\right\rangle_{t=1}^{{T^{\prime}}}, and the sample 𝑿(i)\bm{X}^{(i)} to 𝒀(i)\bm{Y}^{(i)}.

  • Completion stage: If N(n)<N(n)N(n^{\prime})<N(n), for each N(n)<iN(n)N(n^{\prime})<i\leq N(n), remove the sample 𝑿(i)\bm{X}^{(i)}, the initial state 𝑿0(i)\bm{X}^{(i)}_{0} and the execution-log 𝖤𝗑𝖾-𝖫𝗈𝗀(i)(,T)=vt(i),Xt(i)(vt(i))t=1T\mathsf{Exe\text{-}Log}^{(i)}(\mathcal{I},T)=\left\langle{v_{t}^{(i)}},X^{(i)}_{t}({v_{t}^{(i)}})\right\rangle_{t=1}^{{T}} from the data; if N(n)>N(n)N(n^{\prime})>N(n), for each N(n)<iN(n)N(n)<i\leq N(n^{\prime}), construct an independent Gibbs sampling chain (𝒀t(i))t=0T(\bm{Y}^{(i)}_{t})_{t=0}^{T^{\prime}} on instance \mathcal{I}^{\prime}, write the sample 𝒀(i)=𝒀T(i)\bm{Y}^{(i)}=\bm{Y}^{(i)}_{T^{\prime}}, the initial state 𝒀0(i)\bm{Y}^{(i)}_{0} and the execution-log 𝖤𝗑𝖾-𝖫𝗈𝗀(i)(,T)=ut(i),Yt(i)(ut(i))t=1T\mathsf{Exe\text{-}Log}^{(i)}(\mathcal{I}^{\prime},T^{\prime})=\left\langle{u_{t}^{(i)}},Y^{(i)}_{t}({u_{t}^{(i)}})\right\rangle_{t=1}^{{T^{\prime}}} into the data.
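The following schematic driver (an illustration only; single_sample_update and fresh_chain are hypothetical stand-ins for Algorithm 2 and for running a fresh Gibbs sampling chain on \mathcal{I}^{\prime}) summarizes the update and completion stages; the shared preparation stage is assumed to have been carried out once beforehand.

    # Schematic multi-sample driver (illustration only).
    def multi_sample_update(states, I_old, I_new, N_old, N_new,
                            single_sample_update, fresh_chain):
        N_min = min(N_old, N_new)
        # Update stage: reuse the single-sample algorithm for the first N_min chains.
        for i in range(N_min):
            single_sample_update(states[i], I_old, I_new)
        # Completion stage: drop surplus chains, or append fresh chains on I_new.
        if N_new < N_old:
            del states[N_new:]
        else:
            states.extend(fresh_chain(I_new) for _ in range(N_new - N_old))
        return states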

Let T_{\mathsf{preparation}}^{\mathsf{multi}}, T_{\mathsf{update}}^{\mathsf{multi}} and T_{\mathsf{completion}}^{\mathsf{multi}} denote the running times of the corresponding stages. Note that the update stage of the multi-sample dynamic sampling algorithm repeats the update stage of the single-sample algorithm N_{\min} times. Also note that both \mathcal{I} and \mathcal{I}^{\prime} satisfy the Dobrushin-Shlosman condition. Combining (24), (27), (28), (29), and (30), we have

𝔼[T𝗎𝗉𝖽𝖺𝗍𝖾𝗆𝗎𝗅𝗍𝗂]=i=1Nmin𝔼[T𝗎𝗉𝖽𝖺𝗍𝖾𝗌𝗂𝗇𝗀𝗅𝖾,(i)]\displaystyle\mathbb{E}\left[{T_{\mathsf{update}}^{\mathsf{multi}}}\right]=\sum_{i=1}^{N_{\min}}\mathbb{E}\left[{T_{\mathsf{update}}^{\mathsf{single},(i)}}\right] =O(NminΔ2(L𝗀𝗋𝖺𝗉𝗁+L𝖧𝖺𝗆𝗂𝗅)log3n)\displaystyle=O(N_{\min}\Delta^{2}(L_{\mathsf{graph}}+L_{\mathsf{Hamil}})\log^{3}n)
(32) (by NminN(n))\displaystyle(\text{by }N_{\min}\leq N(n))\quad =O(N(n)Δ2(L𝗀𝗋𝖺𝗉𝗁+L𝖧𝖺𝗆𝗂𝗅)log3n)\displaystyle=O(N(n)\Delta^{2}(L_{\mathsf{graph}}+L_{\mathsf{Hamil}})\log^{3}n)

where T𝗎𝗉𝖽𝖺𝗍𝖾𝗌𝗂𝗇𝗀𝗅𝖾,(i)T_{\mathsf{update}}^{\mathsf{single},(i)} is the running time of the update stage of the Algorithm 2 that updates the ii-th sample.

In the completion stage, we either remove the surplus chains from the data structure or generate new chains and write them into the data structure. It is easy to see that the running time of the completion stage satisfies

𝔼[T𝖼𝗈𝗆𝗉𝗅𝖾𝗍𝗂𝗈𝗇𝗆𝗎𝗅𝗍𝗂]\displaystyle\mathbb{E}\left[{T_{\mathsf{completion}}^{\mathsf{multi}}}\right] =O(|N(n)N(n)|TmaxlogTmax)=O(n|N(n)N(n)|log2n)\displaystyle=O(\left|N(n)-N(n^{\prime})\right|T_{\max}\log T_{\max})=O(n\left|N(n)-N(n^{\prime})\right|\log^{2}n)
(by Lemma 6.19)\displaystyle(\text{by \lx@cref{creftypecap~refnum}{lemma-smooth-function}})\quad =O(|nn|N(n)log2n)=O(L𝗀𝗋𝖺𝗉𝗁N(n)log2n),\displaystyle=O(\left|n-n^{\prime}\right|N(n)\log^{2}n)=O(L_{\mathsf{graph}}N(n)\log^{2}n),

where T_{\max}=\max\{T,T^{\prime}\}=O(n\log n) since n^{\prime}=\Theta(n) and \epsilon(n^{\prime})\geq\frac{1}{\mathrm{poly}(n^{\prime})} (by L_{\mathsf{graph}}=o(n) and Lemma 6.17).

We make the following claim about the preparation stage.

Claim 6.20.

The expected running time of the preparation stage is

𝔼[T𝗉𝗋𝖾𝗉𝖺𝗋𝖺𝗍𝗂𝗈𝗇𝗆𝗎𝗅𝗍𝗂]=O(Δn+log2ni=1Nmin𝔼[|𝒫i|]),\displaystyle\mathbb{E}\left[{T_{\mathsf{preparation}}^{\mathsf{multi}}}\right]=O\left(\Delta n+\log^{2}n\sum_{i=1}^{N_{\min}}\mathbb{E}\left[{|\mathcal{P}_{i}|}\right]\right),

and the expected size of 𝒫i\mathcal{P}_{i} is at most 4TmaxL𝖧𝖺𝗆𝗂𝗅n\frac{4T_{\max}L_{\mathsf{Hamil}}}{n} for each 1iNmin1\leq i\leq N_{\min}.

By Claim 6.20, we have

𝔼[T𝗉𝗋𝖾𝗉𝖺𝗋𝖺𝗍𝗂𝗈𝗇𝗆𝗎𝗅𝗍𝗂]=O(Δn+N(n)L𝖧𝖺𝗆𝗂𝗅log3n).\displaystyle\mathbb{E}\left[{T_{\mathsf{preparation}}^{\mathsf{multi}}}\right]=O\left(\Delta n+N(n)L_{\mathsf{Hamil}}\log^{3}n\right).

By the linearity of expectation, the expected time cost of the algorithm is 𝔼[T𝗉𝗋𝖾𝗉𝖺𝗋𝖺𝗍𝗂𝗈𝗇𝗆𝗎𝗅𝗍𝗂]+𝔼[T𝗎𝗉𝖽𝖺𝗍𝖾𝗆𝗎𝗅𝗍𝗂]+𝔼[T𝖼𝗈𝗆𝗉𝗅𝖾𝗍𝗂𝗈𝗇𝗆𝗎𝗅𝗍𝗂]\mathbb{E}\left[{T_{\mathsf{preparation}}^{\mathsf{multi}}}\right]+\mathbb{E}\left[{T_{\mathsf{update}}^{\mathsf{multi}}}\right]+\mathbb{E}\left[{T_{\mathsf{completion}}^{\mathsf{multi}}}\right]. This proves the time cost.

7. Proofs for dynamic Gibbs sampling

7.1. Analysis of the couplings

We analyze the couplings in the dynamic Gibbs sampling algorithm. In Section 7.1.1, we analyze the coupling for the Hamiltonian update. In Section 7.1.2, we analyze the coupling for the graph update.

7.1.1. Proofs for the coupling for Hamiltonian update

In this section, we prove Lemma 6.5, Lemma 6.6, and Lemma 6.8.

The validity of the coupling (proof of Lemma 6.5)

We first prove that the distribution \nu^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(\cdot) in (10) is valid. We draw samples from \nu^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(\cdot) only if the coin flip comes up HEADS, which implies \mu_{v,\mathcal{I}}(x\mid\tau)>\mu_{v,\mathcal{I}^{\prime}}(x\mid\tau) for some x\in Q. Thus, the two distributions \mu_{v,\mathcal{I}}(\cdot\mid\tau) and \mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\tau) are not identical, and

xQmax{0,μv,(xτ)μv,(xτ)}>0.\displaystyle\sum_{x\in Q}\max\left\{0,\mu_{v,\mathcal{I}}(x\mid\tau)-\mu_{v,\mathcal{I}^{\prime}}(x\mid\tau)\right\}>0.

Hence, the denominator of νv,vτ()\nu^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(\cdot) is positive. Besides, since both μv,(τ)\mu_{v,\mathcal{I}}(\cdot\mid\tau) and μv,(τ)\mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\tau) are distributions over QQ, we have

xQmax{0,μv,(xτ)μv,(xτ)}=xQmax{0,μv,(xτ)μv,(xτ)}.\displaystyle\sum_{x\in Q}\max\left\{0,\mu_{v,\mathcal{I}^{\prime}}(x\mid\tau)-\mu_{v,\mathcal{I}}(x\mid\tau)\right\}=\sum_{x\in Q}\max\left\{0,\mu_{v,\mathcal{I}}(x\mid\tau)-\mu_{v,\mathcal{I}^{\prime}}(x\mid\tau)\right\}.

Thus we have \sum_{x\in Q}\nu_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\tau}(x)=1. Hence, \nu^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(\cdot) is a valid distribution.

We next prove the coupling Dv,vσ,τ(,)D_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau}(\cdot,\cdot) in Definition 6.4 is a valid coupling between μv,(τ)\mu_{v,\mathcal{I}}(\cdot\mid\tau) and μv,(τ)\mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\tau). If μv,(τ)\mu_{v,\mathcal{I}}(\cdot\mid\tau) and μv,(τ)\mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\tau) are identical, the result holds trivially. We may assume μv,(τ)\mu_{v,\mathcal{I}}(\cdot\mid\tau) and μv,(τ)\mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\tau) are not identical, thus the distribution νv,vτ()\nu^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(\cdot) is well-defined.

The coupling D_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau}(\cdot,\cdot) in Definition 6.4 returns a pair (c,c^{\prime})\in Q^{2}. It is easy to see that c follows the law \mu_{v,\mathcal{I}}(\cdot\mid\sigma). We prove that c^{\prime} follows the law \mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\tau). By the definition of D_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau}(\cdot,\cdot), c^{\prime}\in Q is generated by the following procedure (a code sketch follows the list):

  • sample aQa\in Q from the distribution μv,(τ)\mu_{v,\mathcal{I}}(\cdot\mid\tau);

  • sample bQb\in Q from the distribution νv,vτ\nu^{\tau}_{\mathcal{I}_{v},\mathcal{I}_{v}^{\prime}} defined in (10), set

    c={bwith probability pvt,vtτ(a)awith probability 1pvt,vtτ(a).\displaystyle c^{\prime}=\begin{cases}b&\text{with probability }p^{\tau}_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}(a)\\ a&\text{with probability }1-p^{\tau}_{\mathcal{I}_{v_{t}},\mathcal{I}^{\prime}_{v_{t}}}(a).\end{cases}
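For concreteness, a minimal sketch of this resampling step (illustration only; the conditional marginals \mu_{v,\mathcal{I}}(\cdot\mid\tau) and \mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\tau) are assumed to be given explicitly as dictionaries over Q):

    import random

    def resample_c_prime(mu, mu_prime):
        """One draw of c' in the coupling of Definition 6.4.
        mu, mu_prime: dicts mapping each spin in Q to mu_{v,I}(.|tau) and mu_{v,I'}(.|tau)."""
        Q = list(mu)
        # p(y) = max{0, (mu(y) - mu'(y)) / mu(y)} (and 0 when mu(y) = 0), as in (9)
        p = {y: 0.0 if mu[y] == 0 else max(0.0, (mu[y] - mu_prime[y]) / mu[y]) for y in Q}
        # nu(x) proportional to max{0, mu'(x) - mu(x)}, as in (10)
        surplus = {x: max(0.0, mu_prime[x] - mu[x]) for x in Q}
        Z = sum(surplus.values())
        a = random.choices(Q, weights=[mu[x] for x in Q])[0]             # a ~ mu(.|tau)
        if random.random() < p[a] and Z > 0:
            return random.choices(Q, weights=[surplus[x] for x in Q])[0]  # b ~ nu
        return a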

Note that aa follows the law μv,(τ)\mu_{v,\mathcal{I}}(\cdot\mid\tau). We have for each xQx\in Q,

Pr[c=x]\displaystyle\Pr[c^{\prime}=x] =Pr[a=x](1pv,vτ(x))+yQPr[a=y]pv,vτ(y)νv,vτ(x)\displaystyle=\Pr[a=x]\cdot(1-p^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(x))+\sum_{y\in Q}\Pr[a=y]\cdot p^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(y)\cdot\nu_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\tau}(x)
=μv,(xτ)(1pv,vτ(x))+νv,vτ(x)yQμv,(yτ)pv,vτ(y).\displaystyle=\mu_{v,\mathcal{I}}(x\mid\tau)\cdot(1-p^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(x))+\nu_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\tau}(x)\sum_{y\in Q}\mu_{v,\mathcal{I}}(y\mid\tau)\cdot p^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(y).

By the definition of pv,vτ(y)p^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(y) in (9), we have

yQ,μv,(yτ)pv,vτ(y)={0if μv,(yτ)μv,(yτ)μv,(yτ)μv,(yτ)otherwise.\displaystyle\forall y\in Q,\quad\mu_{v,\mathcal{I}}(y\mid\tau)\cdot p^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(y)=\begin{cases}0&\text{if }\mu_{v,\mathcal{I}}(y\mid\tau)\leq\mu_{v,\mathcal{I}^{\prime}}(y\mid\tau)\\ \mu_{v,\mathcal{I}}(y\mid\tau)-\mu_{v,\mathcal{I}^{\prime}}(y\mid\tau)&\text{otherwise}.\end{cases}

This implies μv,(yτ)pv,vτ(y)=max{0,μv,(yτ)μv,(yτ)}\mu_{v,\mathcal{I}}(y\mid\tau)\cdot p^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(y)=\max\left\{0,\mu_{v,\mathcal{I}}(y\mid\tau)-\mu_{v,\mathcal{I}^{\prime}}(y\mid\tau)\right\}. We have

νv,vτ(x)yQμv,(yτ)pv,vτ(y)\displaystyle\nu_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\tau}(x)\sum_{y\in Q}\mu_{v,\mathcal{I}}(y\mid\tau)\cdot p^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(y)
=\displaystyle= max{0,μv,(xτ)μv,(xτ)}yQmax{0,μv,(yτ)μv,(yτ)}yQmax{0,μv,(yτ)μv,(yτ)}\displaystyle\,\frac{\max\left\{0,\mu_{v,\mathcal{I}^{\prime}}(x\mid\tau)-\mu_{v,\mathcal{I}}(x\mid\tau)\right\}}{\sum_{y\in Q}\max\left\{0,\mu_{v,\mathcal{I}}(y\mid\tau)-\mu_{v,\mathcal{I}^{\prime}}(y\mid\tau)\right\}}\sum_{y\in Q}\max\left\{0,\mu_{v,\mathcal{I}}(y\mid\tau)-\mu_{v,\mathcal{I}^{\prime}}(y\mid\tau)\right\}
=\displaystyle= max{0,μv,(xτ)μv,(xτ)}.\displaystyle\,\max\left\{0,\mu_{v,\mathcal{I}^{\prime}}(x\mid\tau)-\mu_{v,\mathcal{I}}(x\mid\tau)\right\}.

Hence, we have

Pr[c=x]=μv,(xτ)(1pv,vτ(x))+max{0,μv,(xτ)μv,(xτ)}.\displaystyle\Pr[c^{\prime}=x]=\mu_{v,\mathcal{I}}(x\mid\tau)\cdot(1-p^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(x))+\max\left\{0,\mu_{v,\mathcal{I}^{\prime}}(x\mid\tau)-\mu_{v,\mathcal{I}}(x\mid\tau)\right\}.

Suppose μv,(xτ)μv,(xτ)\mu_{v,\mathcal{I}}(x\mid\tau)\leq\mu_{v,\mathcal{I}^{\prime}}(x\mid\tau), then we have pv,vτ(x)=0p^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(x)=0. In this case, we have

Pr[c=x]=μv,(xτ)+μv,(xτ)μv,(xτ)=μv,(xτ).\displaystyle\Pr[c^{\prime}=x]=\mu_{v,\mathcal{I}}(x\mid\tau)+\mu_{v,\mathcal{I}^{\prime}}(x\mid\tau)-\mu_{v,\mathcal{I}}(x\mid\tau)=\mu_{v,\mathcal{I}^{\prime}}(x\mid\tau).

Suppose μv,(xτ)>μv,(xτ)\mu_{v,\mathcal{I}}(x\mid\tau)>\mu_{v,\mathcal{I}^{\prime}}(x\mid\tau), then we have

Pr[c=x]\displaystyle\Pr[c^{\prime}=x] =μv,(xτ)(1pv,vτ(x))=μv,(xτ).\displaystyle=\mu_{v,\mathcal{I}}(x\mid\tau)\cdot(1-p^{\tau}_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}(x))=\mu_{v,\mathcal{I}^{\prime}}(x\mid\tau).

Combining these two cases proves that cc^{\prime} follows the law μv,(τ)\mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\tau). ∎

The upper bound of the probability pv,v()p_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\cdot}(\cdot) (proof of Lemma 6.6)

It suffices to prove that for any two instances =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi) and =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V,E,Q,\Phi^{\prime}) of MRF model, and any vV,cQv\in V,c\in Q and σQΓG(v)\sigma\in Q^{\Gamma_{G}(v)},

(33) μv,(cσ)μv,(cσ)2μv,(cσ)(ϕvϕv1+e={u,v}Eϕeϕe1).\displaystyle\mu_{v,\mathcal{I}}(c\mid\sigma)-\mu_{v,\mathcal{I}^{\prime}}(c\mid\sigma)\leq 2\mu_{v,\mathcal{I}}(c\mid\sigma)\left(\|\phi_{v}-\phi^{\prime}_{v}\|_{1}+\sum_{e=\{u,v\}\in E}\|\phi_{e}-\phi^{\prime}_{e}\|_{1}\right).

Note that if μv,(cσ)=0\mu_{v,\mathcal{I}}(c\mid\sigma)=0, then pv,vτ(c)=0p_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\tau}(c)=0; otherwise pv,vτ(c)=max{0,μv,(cσ)μv,(cσ)μv,(cσ)}p_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\tau}(c)=\max\left\{0,\frac{\mu_{v,\mathcal{I}}(c\mid\sigma)-\mu_{v,\mathcal{I}^{\prime}}(c\mid\sigma)}{\mu_{v,\mathcal{I}}(c\mid\sigma)}\right\}. Hence, inequality (33) proves the lemma.

We now prove (33). Suppose \mu_{v,\mathcal{I}}(c\mid\sigma)=0. Then the LHS of (33) is at most 0, while the RHS is at least 0, so the inequality holds.

We next assume μv,(cσ)>0\mu_{v,\mathcal{I}}(c\mid\sigma)>0. Then it suffices to prove

μv,(cσ)μv,(cσ)μv,(cσ)=1μv,(cσ)μv,(cσ)2(ϕvϕv1+e={u,v}Eϕeϕe1).\displaystyle\frac{\mu_{v,\mathcal{I}}(c\mid\sigma)-\mu_{v,\mathcal{I}^{\prime}}(c\mid\sigma)}{\mu_{v,\mathcal{I}}(c\mid\sigma)}=1-\frac{\mu_{v,\mathcal{I}^{\prime}}(c\mid\sigma)}{\mu_{v,\mathcal{I}}(c\mid\sigma)}\leq 2\left(\|\phi_{v}-\phi^{\prime}_{v}\|_{1}+\sum_{e=\{u,v\}\in E}\|\phi_{e}-\phi^{\prime}_{e}\|_{1}\right).

By the definitions of ϕv,ϕv,ϕe,ϕe\phi_{v},\phi^{\prime}_{v},\phi_{e},\phi^{\prime}_{e}, we can write the ratio as

μv,(cσ)μv,(cσ)=exp(ϕv(c)+uΓvϕuv(σu,c))exp(ϕv(c)+uΓvϕuv(σu,c))aQexp(ϕv(a)+uΓvϕuv(σu,a))aQexp(ϕv(a)+uΓvϕuv(σu,a)),\displaystyle\frac{\mu_{v,\mathcal{I}^{\prime}}(c\mid\sigma)}{\mu_{v,\mathcal{I}}(c\mid\sigma)}=\frac{\exp\left(\phi^{\prime}_{v}(c)+\sum_{u\in\Gamma_{v}}\phi^{\prime}_{uv}(\sigma_{u},c)\right)}{\exp\left(\phi_{v}(c)+\sum_{u\in\Gamma_{v}}\phi_{uv}(\sigma_{u},c)\right)}\frac{\sum_{a\in Q}\exp\left(\phi_{v}(a)+\sum_{u\in\Gamma_{v}}\phi_{uv}(\sigma_{u},a)\right)}{\sum_{a\in Q}\exp\left(\phi^{\prime}_{v}(a)+\sum_{u\in\Gamma_{v}}\phi^{\prime}_{uv}(\sigma_{u},a)\right)},

where Γv\Gamma_{v} denotes the neighborhood of vv in GG. Next, we assume that

(34) cQ:ϕv(c)=ϕv(c)=uΓv,c,cQ:ϕuv(c,c)=ϕuv(c,c)=.\begin{split}\forall c\in Q:\quad&\phi_{v}(c)=-\infty\quad\Longleftrightarrow\quad\phi^{\prime}_{v}(c)=-\infty\\ \forall u\in\Gamma_{v},c,c^{\prime}\in Q:\quad&\phi_{uv}(c,c^{\prime})=-\infty\quad\Longleftrightarrow\quad\phi^{\prime}_{uv}(c,c^{\prime})=-\infty.\end{split}

Otherwise, the RHS of (33) is \infty and (33) holds trivially. Thus we can define the set

Q{aQϕv(a)+uΓvϕuv(σu,a)}={aQϕv(a)+uΓvϕuv(σu,a)}.\displaystyle Q^{\prime}\triangleq\left\{a\in Q\mid\phi_{v}(a)+\sum_{u\in\Gamma_{v}}\phi_{uv}(\sigma_{u},a)\neq-\infty\right\}=\left\{a\in Q\mid\phi^{\prime}_{v}(a)+\sum_{u\in\Gamma_{v}}\phi^{\prime}_{uv}(\sigma_{u},a)\neq-\infty\right\}.

Since exp()=0\exp(-\infty)=0, we have

μv,(cσ)μv,(cσ)=exp(ϕv(c)+uΓvϕuv(σu,c))exp(ϕv(c)+uΓvϕuv(σu,c))aQexp(ϕv(a)+uΓvϕuv(σu,a))aQexp(ϕv(a)+uΓvϕuv(σu,a)).\displaystyle\frac{\mu_{v,\mathcal{I}^{\prime}}(c\mid\sigma)}{\mu_{v,\mathcal{I}}(c\mid\sigma)}=\frac{\exp\left(\phi^{\prime}_{v}(c)+\sum_{u\in\Gamma_{v}}\phi^{\prime}_{uv}(\sigma_{u},c)\right)}{\exp\left(\phi_{v}(c)+\sum_{u\in\Gamma_{v}}\phi_{uv}(\sigma_{u},c)\right)}\frac{\sum_{a\in Q^{\prime}}\exp\left(\phi_{v}(a)+\sum_{u\in\Gamma_{v}}\phi_{uv}(\sigma_{u},a)\right)}{\sum_{a\in Q^{\prime}}\exp\left(\phi^{\prime}_{v}(a)+\sum_{u\in\Gamma_{v}}\phi^{\prime}_{uv}(\sigma_{u},a)\right)}.

We then show that

(35) aQ:exp(ϕv(a)+uΓvϕuv(σu,a))exp(ϕv(a)+uΓvϕuv(σu,a))exp(ϕvϕv1e={u,v}Eϕeϕe1)aQ:exp(ϕv(a)+uΓvϕuv(σu,a))exp(ϕv(a)+uΓvϕuv(σu,a))exp(ϕvϕv1e={u,v}Eϕeϕe1)\begin{split}\forall a\in Q^{\prime}:\quad&\frac{\exp\left(\phi_{v}(a)+\sum_{u\in\Gamma_{v}}\phi_{uv}(\sigma_{u},a)\right)}{\exp\left(\phi^{\prime}_{v}(a)+\sum_{u\in\Gamma_{v}}\phi^{\prime}_{uv}(\sigma_{u},a)\right)}\geq\exp\left(-\|\phi_{v}-\phi^{\prime}_{v}\|_{1}-\sum_{e=\{u,v\}\in E}\|\phi_{e}-\phi^{\prime}_{e}\|_{1}\right)\\ \forall a\in Q^{\prime}:\quad&\frac{\exp\left(\phi^{\prime}_{v}(a)+\sum_{u\in\Gamma_{v}}\phi^{\prime}_{uv}(\sigma_{u},a)\right)}{\exp\left(\phi_{v}(a)+\sum_{u\in\Gamma_{v}}\phi_{uv}(\sigma_{u},a)\right)}\geq\exp\left(-\|\phi_{v}-\phi^{\prime}_{v}\|_{1}-\sum_{e=\{u,v\}\in E}\|\phi_{e}-\phi^{\prime}_{e}\|_{1}\right)\end{split}

We first use (35) to prove (33). Since \mu_{v,\mathcal{I}}(c\mid\sigma)>0, we have c\in Q^{\prime}. By (35), we have

1μv,(cσ)μv,(cσ)\displaystyle 1-\frac{\mu_{v,\mathcal{I}^{\prime}}(c\mid\sigma)}{\mu_{v,\mathcal{I}}(c\mid\sigma)} 1exp(2ϕvϕv12e={u,v}Eϕeϕe1)\displaystyle\leq 1-\exp\left(-2\|\phi_{v}-\phi^{\prime}_{v}\|_{1}-2\sum_{e=\{u,v\}\in E}\|\phi_{e}-\phi^{\prime}_{e}\|_{1}\right)
2(ϕvϕv1+e={u,v}Eϕeϕe1).\displaystyle\leq 2\left(\|\phi_{v}-\phi^{\prime}_{v}\|_{1}+\sum_{e=\{u,v\}\in E}\|\phi_{e}-\phi^{\prime}_{e}\|_{1}\right).

This proves the lemma.

We now prove (35). For any aQa\in Q^{\prime}, it holds that

exp(ϕv(a)+uΓvϕuv(σu,a))exp(ϕv(a)+uΓvϕuv(σu,a))=exp(ϕv(a)ϕv(a)+uΓvϕuv(σu,a)uΓvϕuv(σu,a)).\displaystyle\frac{\exp\left(\phi_{v}(a)+\sum_{u\in\Gamma_{v}}\phi_{uv}(\sigma_{u},a)\right)}{\exp\left(\phi^{\prime}_{v}(a)+\sum_{u\in\Gamma_{v}}\phi^{\prime}_{uv}(\sigma_{u},a)\right)}=\exp\left(\phi_{v}(a)-\phi^{\prime}_{v}(a)+\sum_{u\in\Gamma_{v}}\phi_{uv}(\sigma_{u},a)-\sum_{u\in\Gamma_{v}}\phi^{\prime}_{uv}(\sigma_{u},a)\right).

Then (35) holds because

ϕv(a)ϕv(a)\displaystyle\phi_{v}(a)-\phi^{\prime}_{v}(a) cQ|ϕv(c)ϕv(c)|=ϕvϕv1;\displaystyle\geq-\sum_{c\in Q}|\phi_{v}(c)-\phi^{\prime}_{v}(c)|=-\|\phi_{v}-\phi^{\prime}_{v}\|_{1};
uΓvϕuv(σu,a)uΓvϕuv(σu,a)\displaystyle\sum_{u\in\Gamma_{v}}\phi_{uv}(\sigma_{u},a)-\sum_{u\in\Gamma_{v}}\phi^{\prime}_{uv}(\sigma_{u},a) e={u,v}Ec,cQ|ϕe(c,c)ϕe(c,c)|=e={u,v}Eϕeϕe1.\displaystyle\geq-\sum_{e=\{u,v\}\in E}\sum_{c,c^{\prime}\in Q}|\phi_{e}(c,c^{\prime})-\phi^{\prime}_{e}(c,c^{\prime})|=-\sum_{e=\{u,v\}\in E}\|\phi_{e}-\phi^{\prime}_{e}\|_{1}.

The lower bound of exp(ϕv(a)+uΓvϕuv(σu,a))exp(ϕv(a)+uΓvϕuv(σu,a))\frac{\exp\left(\phi^{\prime}_{v}(a)+\sum_{u\in\Gamma_{v}}\phi^{\prime}_{uv}(\sigma_{u},a)\right)}{\exp\left(\phi_{v}(a)+\sum_{u\in\Gamma_{v}}\phi_{uv}(\sigma_{u},a)\right)} can be proved in a similar way. ∎

The cost of the coupling for UpdateHamiltonian (proof of Lemma 6.8)

By the definition of the indicator random variable γt\gamma_{t} in (17), we have

Pr[γt=1𝒟t1]\displaystyle\Pr[\gamma_{t}=1\mid\mathcal{D}_{t-1}] Pr[t𝒫𝒟t1]+Pr[vtΓG+(𝒟t1)𝒟t1]\displaystyle\leq\Pr\left[t\in\mathcal{P}\mid\mathcal{D}_{t-1}\right]+\Pr\left[v_{t}\in\Gamma_{G}^{+}(\mathcal{D}_{t-1})\mid\mathcal{D}_{t-1}\right]
(Δ+1)|𝒟t1|n+vVpv𝗎𝗉n.\displaystyle\leq\frac{(\Delta+1)|\mathcal{D}_{t-1}|}{n}+\sum_{v\in V}\frac{p^{\mathsf{up}}_{v}}{n}.

By the definition of pv𝗎𝗉p^{\mathsf{up}}_{v} in  (12) and dHamil(,)=vVϕvϕv1+eEϕeϕe1Ld_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}^{\prime})=\sum_{v\in V}\left\|\phi_{v}-\phi^{\prime}_{v}\right\|_{1}+\sum_{e\in E}\left\|\phi_{e}-\phi^{\prime}_{e}\right\|_{1}\leq L, we have

Pr[γt=1𝒟t1](Δ+1)|𝒟t1|n+4Ln.\displaystyle\Pr[\gamma_{t}=1\mid\mathcal{D}_{t-1}]\leq\frac{(\Delta+1)|\mathcal{D}_{t-1}|}{n}+\frac{4L}{n}.

By the definition of R𝖧𝖺𝗆𝗂𝗅t=1TγtR_{\mathsf{Hamil}}\triangleq\sum_{t=1}^{T}\gamma_{t}, we have

(36) 𝔼[R𝖧𝖺𝗆𝗂𝗅]=t=1T𝔼[γt]=t=1T𝔼[𝔼[γt𝒟t1]]t=1T((Δ+1)𝔼[|𝒟t1|]n+4Ln).\displaystyle\mathbb{E}\left[{R_{\mathsf{Hamil}}}\right]=\sum_{t=1}^{T}\mathbb{E}\left[{\gamma_{t}}\right]=\sum_{t=1}^{T}{\mathbb{E}\left[{\mathbb{E}\left[{\gamma_{t}\mid\mathcal{D}_{t-1}}\right]}\right]}\leq\sum_{t=1}^{T}\left(\frac{(\Delta+1)\mathbb{E}\left[{|\mathcal{D}_{t-1}|}\right]}{n}+\frac{4L}{n}\right).

Next, we bound the expectation \mathbb{E}\left[{|\mathcal{D}_{t}|}\right]. Recall that the one-step local coupling for the Hamiltonian update (Definition 6.3) is implemented as follows. We first construct the random set \mathcal{P}\subseteq[T] in (13). In the t-th step, where 1\leq t\leq T, given any \bm{X}_{t-1} and \bm{Y}_{t-1}, the pair (\bm{X}_{t},\bm{Y}_{t}) is generated as follows (a schematic step is sketched after the list).

  • Let X(u)=Xt1(u)X^{\prime}(u)=X_{t-1}(u) and Y(u)=Yt1(u)Y^{\prime}(u)=Y_{t-1}(u) for all uV{vt}u\in V\setminus\{v_{t}\}, sample (X(vt),Y(vt))Q2(X^{\prime}(v_{t}),Y^{\prime}(v_{t}))\in Q^{2} jointly from the optimal coupling D𝗈𝗉𝗍,vtσ,τD^{\sigma,\tau}_{\mathsf{opt},\mathcal{I}_{v_{t}}} of the marginal distributions μvt,(σ)\mu_{v_{t},\mathcal{I}}(\cdot\mid\sigma) and μvt,(τ)\mu_{v_{t},\mathcal{I}}(\cdot\mid\tau), where σ=Xt1(ΓG(vt))\sigma=X_{t-1}(\Gamma_{G}(v_{t})) and τ=Yt1(ΓG(vt))\tau=Y_{t-1}(\Gamma_{G}(v_{t})).

  • Let 𝑿t=𝑿\bm{X}_{t}=\bm{X}^{\prime} and 𝒀t=𝒀\bm{Y}_{t}=\bm{Y}^{\prime}. If t𝒫t\in\mathcal{P}, update the value of Yt(vt)Y_{t}(v_{t}) using (14).
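A schematic rendering of one such coupling step (illustration only; optimal_coupling_draw and rule_14 are hypothetical stand-ins for the one-step optimal coupling on \mathcal{I} and for the adjustment rule (14)):

    # One step of the local coupling for the Hamiltonian update (Definition 6.3).
    def hamiltonian_coupling_step(X, Y, v_t, t, P, optimal_coupling_draw, rule_14):
        sigma = dict(X)                                  # X_{t-1}
        tau = dict(Y)                                    # Y_{t-1}
        cX, cY = optimal_coupling_draw(v_t, sigma, tau)  # (X'(v_t), Y'(v_t))
        X[v_t], Y[v_t] = cX, cY
        if t in P:                                       # extra adjustment only at steps in P
            Y[v_t] = rule_14(v_t, tau, cY)
        return X, Y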

Hence, for any vertex v\in V, X_{t}(v)\neq Y_{t}(v) only if one of the following two events occurs: (1) X^{\prime}(v)\neq Y^{\prime}(v); (2) v_{t}=v and t\in\mathcal{P}. Then for any v\in V, we have

Pr[Xt(v)Yt(v)𝑿t1,𝒀t1]\displaystyle\Pr[X_{t}(v)\neq Y_{t}(v)\mid\bm{X}_{t-1},\bm{Y}_{t-1}] Pr[X(v)Y(v)𝑿t1,𝒀t1]+Pr[v=vtt𝒫𝑿t1,𝒀t1]\displaystyle\leq\Pr[X^{\prime}(v)\neq Y^{\prime}(v)\mid\bm{X}_{t-1},\bm{Y}_{t-1}]+\Pr[v=v_{t}\land t\in\mathcal{P}\mid\bm{X}_{t-1},\bm{Y}_{t-1}]
(37) =Pr[X(v)Y(v)𝑿t1,𝒀t1]+Pr[v=vtt𝒫],\displaystyle=\Pr[X^{\prime}(v)\neq Y^{\prime}(v)\mid\bm{X}_{t-1},\bm{Y}_{t-1}]+\Pr[v=v_{t}\land t\in\mathcal{P}],

where the equation holds because v=vtt𝒫v=v_{t}\land t\in\mathcal{P} is independent of 𝑿t1,𝒀t1\bm{X}_{t-1},\bm{Y}_{t-1}. Given 𝑿t1,𝒀t1\bm{X}_{t-1},\bm{Y}_{t-1}, the random pair 𝑿,𝒀\bm{X}^{\prime},\bm{Y}^{\prime} are obtained by the one-step optimal coupling for Gibbs sampling on instance \mathcal{I} (Definition 4.2). Since \mathcal{I} satisfies the Dobrushin-Shlosman condition with constant 0<δ<10<\delta<1, by Proposition 4.3, we have

(38) 𝔼[H(𝑿,𝒀)𝑿t1,𝒀t1](1δn)H(𝑿t1,𝒀t1)=(1δn)|𝒟t1|.\displaystyle\mathbb{E}\left[{H(\bm{X}^{\prime},\bm{Y}^{\prime})\mid\bm{X}_{t-1},\bm{Y}_{t-1}}\right]\leq\left(1-\frac{\delta}{n}\right)H(\bm{X}_{t-1},\bm{Y}_{t-1})=\left(1-\frac{\delta}{n}\right)|\mathcal{D}_{t-1}|.

where H(\bm{X},\bm{Y})=|\{v\in V\mid X(v)\neq Y(v)\}| denotes the Hamming distance. Combining (37) and (38),

𝔼[|𝒟t|𝒟t1]\displaystyle\mathbb{E}\left[{|\mathcal{D}_{t}|\mid\mathcal{D}_{t-1}}\right] vVPr[X(v)Y(v)𝒟t1]+vVPr[t𝒫v=vt𝒟t1]\displaystyle\leq\sum_{v\in V}\Pr[X^{\prime}(v)\neq Y^{\prime}(v)\mid\mathcal{D}_{t-1}]+\sum_{v\in V}\Pr[t\in\mathcal{P}\land v=v_{t}\mid\mathcal{D}_{t-1}]
(1δn)|𝒟t1|+vVpv𝗎𝗉n\displaystyle\leq\left(1-\frac{\delta}{n}\right)|\mathcal{D}_{t-1}|+\sum_{v\in V}\frac{p^{\mathsf{up}}_{v}}{n}
(by(12))\displaystyle(\text{by}~\eqref{eq-def-Ising-up})\quad (1δn)|𝒟t1|+2nvV(ϕvϕv1+e={u,v}Eϕeϕe1)\displaystyle\leq\left(1-\frac{\delta}{n}\right)|\mathcal{D}_{t-1}|+\frac{2}{n}\sum_{v\in V}\left(\|\phi_{v}-\phi^{\prime}_{v}\|_{1}+\sum_{e=\{u,v\}\in E}\|\phi_{e}-\phi^{\prime}_{e}\|_{1}\right)
(bydHamil(,)L)\displaystyle(\text{by}~d_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}^{\prime})\leq L)\quad (1δn)|𝒟t1|+4Ln.\displaystyle\leq\left(1-\frac{\delta}{n}\right)|\mathcal{D}_{t-1}|+\frac{4L}{n}.

Thus, we have

𝔼[|𝒟t|](1δn)𝔼[|𝒟t1|]+4Ln.\displaystyle\mathbb{E}\left[{|\mathcal{D}_{t}|}\right]\leq\left(1-\frac{\delta}{n}\right)\mathbb{E}\left[{|\mathcal{D}_{t-1}|}\right]+\frac{4L}{n}.

Note that |\mathcal{D}_{0}|=0. Unrolling the recursion gives \mathbb{E}\left[{|\mathcal{D}_{t}|}\right]\leq\frac{4L}{n}\sum_{i=0}^{t-1}\left(1-\frac{\delta}{n}\right)^{i}\leq\frac{4L}{\delta}. In particular, this implies

(39) 𝔼[|𝒟t|]8Lδ.\displaystyle\mathbb{E}\left[{|\mathcal{D}_{t}|}\right]\leq\frac{8L}{\delta}.

Thus, by (36), we have

𝔼[R𝖧𝖺𝗆𝗂𝗅]20ΔTLδn=O(ΔTLδn).\displaystyle\mathbb{E}\left[{R_{\mathsf{Hamil}}}\right]\leq\frac{20\Delta TL}{\delta n}=O\left(\frac{\Delta TL}{\delta n}\right).

7.1.2. Proofs for the coupling for graph update

In this section, we prove Lemma 6.11.

Cost of the coupling for UpdateEdge (Proof of Lemma 6.11)

By the definition of R𝗀𝗋𝖺𝗉𝗁R_{\mathsf{graph}} in (23) and the linearity of the expectation, we have

𝔼[R𝗀𝗋𝖺𝗉𝗁]\displaystyle\mathbb{E}\left[{R_{\mathsf{graph}}}\right] =t=1T𝔼[γt]=t=1T𝔼[𝔼[γt𝒟t1]].\displaystyle=\sum_{t=1}^{T}\mathbb{E}\left[{\gamma_{t}}\right]=\sum_{t=1}^{T}\mathbb{E}\left[{\mathbb{E}\left[{\gamma_{t}\mid\mathcal{D}_{t-1}}\right]}\right].

Recall \gamma_{t}=\mathbf{1}\left[v_{t}\in\mathcal{S}\cup\Gamma^{+}_{G}(\mathcal{D}_{t-1})\right] and that v_{t}\in V is chosen uniformly at random given \mathcal{D}_{t-1}. Note that |\Gamma^{+}_{G}(\mathcal{D}_{t-1})|\leq(\Delta+1)|\mathcal{D}_{t-1}| and |\mathcal{S}|\leq 2|E\oplus E^{\prime}|\leq 2L. We have

(40) 𝔼[R𝗀𝗋𝖺𝗉𝗁]t=1T𝔼[(Δ+1)|𝒟t1|+2Ln]=(Δ+1)nt=1T𝔼[|𝒟t1|]+2LTn.\displaystyle\mathbb{E}\left[{R_{\mathsf{graph}}}\right]\leq\sum_{t=1}^{T}\mathbb{E}\left[{\frac{(\Delta+1)|\mathcal{D}_{t-1}|+2L}{n}}\right]=\frac{(\Delta+1)}{n}\sum_{t=1}^{T}\mathbb{E}\left[{|\mathcal{D}_{t-1}|}\right]+\frac{2LT}{n}.

Since \mathcal{I}^{\prime} satisfies the Dobrushin-Shlosman condition (3.1) with constant \delta>0, we claim

(41)  0tT:𝔼[|𝒟t|]8Lδ.\displaystyle\forall\,0\leq t\leq T:\quad\mathbb{E}\left[{|\mathcal{D}_{t}|}\right]\leq\frac{8L}{\delta}.

Combining (40) and (41), we have

𝔼[R𝗀𝗋𝖺𝗉𝗁]18ΔLTδn=O(ΔLTn).\displaystyle\mathbb{E}\left[{R_{\mathsf{graph}}}\right]\leq\frac{18\Delta LT}{\delta n}=O\left(\frac{\Delta LT}{n}\right).

This proves the lemma.

We now prove (41). Let (𝑿t,𝒀t)t0(\bm{X}_{t},\bm{Y}_{t})_{t\geq 0} be the one-step local coupling for updating edges (Definition 6.9). We claim the following result

(42) σ,τΩ:𝔼[H(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ](1δn)H(σ,τ)+4Ln,\displaystyle\forall\,\sigma,\tau\in\Omega:\qquad\mathbb{E}\left[{\,H(\bm{X}_{t},\bm{Y}_{t})\mid\bm{X}_{t-1}=\sigma\land\bm{Y}_{t-1}=\tau\,}\right]\leq\left(1-\frac{\delta}{n}\right)\cdot H(\sigma,\tau)+\frac{4L}{n},

where H(σ,τ)=|{vVσ(v)τ(v)}|H(\sigma,\tau)=|\{v\in V\mid\sigma(v)\neq\tau(v)\}| denotes the Hamming distance. Assume (42) holds. Taking expectation over 𝑿t1\bm{X}_{t-1} and 𝒀t1\bm{Y}_{t-1}, we have

(43) 𝔼[H(𝑿t,𝒀t)](1δn)𝔼[H(𝑿t1,𝒀t1)]+4Ln.\displaystyle\mathbb{E}\left[{H(\bm{X}_{t},\bm{Y}_{t})}\right]\leq\left(1-\frac{\delta}{n}\right)\mathbb{E}\left[{H(\bm{X}_{t-1},\bm{Y}_{t-1})}\right]+\frac{4L}{n}.

Since \bm{X}_{0}=\bm{Y}_{0}, we have

(44) H(𝑿0,𝒀0)=0.\displaystyle H(\bm{X}_{0},\bm{Y}_{0})=0.

Combining (43) with (44) implies

(45)  0tT:𝔼[|𝒟t|]=𝔼[H(𝑿t,𝒀t)]8Lδ.\displaystyle\forall\,0\leq t\leq T:\quad\mathbb{E}\left[{|\mathcal{D}_{t}|}\right]=\mathbb{E}\left[{H(\bm{X}_{t},\bm{Y}_{t})}\right]\leq\frac{8L}{\delta}.

This proves the claim in (41).

We finish the proof by proving the claim in (42). The main idea is to compare the one-step local coupling for updating edges (Definition 6.9) with the one-step optimal coupling for Gibbs sampling on instance \mathcal{I}^{\prime} (Definition 4.2). Let (𝑿t,𝒀t)t0(\bm{X}^{\prime}_{t},\bm{Y}^{\prime}_{t})_{t\geq 0} be the coupling for Gibbs sampling on \mathcal{I}^{\prime}. Since \mathcal{I}^{\prime} satisfies Dobrushin-Shlosman condition, by Proposition 4.3, we have

(46) σ,τΩ=QV:𝔼[H(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ](1δn)H(σ,τ).\displaystyle\forall\,\sigma,\tau\in\Omega=Q^{V}:\quad\mathbb{E}\left[{\,H(\bm{X}^{\prime}_{t},\bm{Y}^{\prime}_{t})\mid\bm{X}^{\prime}_{t-1}=\sigma\land\bm{Y}^{\prime}_{t-1}=\tau\,}\right]\leq\left(1-\frac{\delta}{n}\right)\cdot H(\sigma,\tau).

According to the coupling, we can rewrite the expectation in (46) as follows:

(47) 𝔼[H(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ]=1nvV𝔼[H(σvCvX,τvCvY)],\displaystyle\mathbb{E}\left[{H(\bm{X}^{\prime}_{t},\bm{Y}^{\prime}_{t})\mid\bm{X}^{\prime}_{t-1}=\sigma\land\bm{Y}^{\prime}_{t-1}=\tau}\right]=\frac{1}{n}\sum_{v\in V}\mathbb{E}\left[{H\left(\sigma^{v\leftarrow C_{v}^{X^{\prime}}},\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\right)}\right],

where (CvX,CvY)D𝗈𝗉𝗍,vσ,τ(C^{X^{\prime}}_{v},C^{Y^{\prime}}_{v})\sim D^{\sigma,\tau}_{\mathsf{opt},\mathcal{I}^{\prime}_{v}}, D𝗈𝗉𝗍,vσ,τD^{\sigma,\tau}_{\mathsf{opt},\mathcal{I}^{\prime}_{v}} is the optimal coupling between μv,(σ)\mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\sigma) and μv,(τ)\mu_{v,\mathcal{I}^{\prime}}(\cdot\mid\tau), and the configuration σvCvXQV\sigma^{v\leftarrow C_{v}^{X^{\prime}}}\in Q^{V} is defined as

σvCvX(u){CvXif u=vσ(u)if uv\displaystyle\sigma^{v\leftarrow C_{v}^{X^{\prime}}}(u)\triangleq\begin{cases}C^{X^{\prime}}_{v}&\text{if }u=v\\ \sigma(u)&\text{if }u\neq v\end{cases}

and the configuration τvCvYQV\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\in Q^{V} is defined in a similar way.

Similarly, we can rewrite the expectation in (42) as follows:

(48) 𝔼[H(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ]=1nvV𝔼[H(σvCvX,τvCvY)],\displaystyle\mathbb{E}\left[{H(\bm{X}_{t},\bm{Y}_{t})\mid\bm{X}_{t-1}=\sigma\land\bm{Y}_{t-1}=\tau}\right]=\frac{1}{n}\sum_{v\in V}\mathbb{E}\left[{H\left(\sigma^{v\leftarrow C_{v}^{X}},\tau^{v\leftarrow C_{v}^{Y}}\right)}\right],

where (C^{X}_{v},C^{Y}_{v})\sim D_{\mathcal{I}_{v},\mathcal{I}_{v}^{\prime}}^{\sigma,\tau}, and D_{\mathcal{I}_{v},\mathcal{I}_{v}^{\prime}}^{\sigma,\tau} is the local coupling defined in (21).

The following two properties hold for (47) and (48).

  • If v𝒮v\not\in\mathcal{S}, by the definition of Dv,vσ,τ(,)D_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau}(\cdot,\cdot) in (21), it holds that Dv,vσ,τ=D𝗈𝗉𝗍,vσ,τD_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau}=D_{\mathsf{opt},\mathcal{I}_{v}}^{\sigma,\tau}. Hence

    v𝒮:𝔼[H(σvCvX,τvCvY)]=𝔼[H(σvCvX,τvCvY)].\displaystyle\forall v\not\in\mathcal{S}:\quad\mathbb{E}\left[{H\left(\sigma^{v\leftarrow C_{v}^{X^{\prime}}},\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\right)}\right]=\mathbb{E}\left[{H\left(\sigma^{v\leftarrow C_{v}^{X}},\tau^{v\leftarrow C_{v}^{Y}}\right)}\right].
  • If v𝒮v\in\mathcal{S}, then it holds that H(σvCvX,σvCvX)1H(\sigma^{v\leftarrow C_{v}^{X}},\sigma^{v\leftarrow C_{v}^{X^{\prime}}})\leq 1 and H(τvCvY,τvCvY)1H(\tau^{v\leftarrow C_{v}^{Y^{\prime}}},\tau^{v\leftarrow C_{v}^{Y}})\leq 1. By the triangle inequality of the Hamming distance, we have

    H(σvCvX,τvCvY)\displaystyle H\left(\sigma^{v\leftarrow C_{v}^{X}},\tau^{v\leftarrow C_{v}^{Y}}\right) H(σvCvX,σvCvX)+H(σvCvX,τvCvY)+H(τvCvY,τvCvY)\displaystyle\leq H\left(\sigma^{v\leftarrow C_{v}^{X}},\sigma^{v\leftarrow C_{v}^{X^{\prime}}}\right)+H\left(\sigma^{v\leftarrow C_{v}^{X^{\prime}}},\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\right)+H\left(\tau^{v\leftarrow C_{v}^{Y^{\prime}}},\tau^{v\leftarrow C_{v}^{Y}}\right)
    H(σvCvX,τvCvY)+2.\displaystyle\leq H\left(\sigma^{v\leftarrow C_{v}^{X^{\prime}}},\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\right)+2.

    This implies

    v𝒮:𝔼[H(σvCvX,τvCvY)]𝔼[H(σvCvX,τvCvY)]+2.\displaystyle\forall v\in\mathcal{S}:\quad\mathbb{E}\left[{H\left(\sigma^{v\leftarrow C_{v}^{X}},\tau^{v\leftarrow C_{v}^{Y}}\right)}\right]\leq\mathbb{E}\left[{H\left(\sigma^{v\leftarrow C_{v}^{X^{\prime}}},\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\right)}\right]+2.

Combining the above two properties with (47) and (48), we have for any \sigma,\tau\in\Omega,

𝔼[H(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ]\displaystyle\mathbb{E}\left[{H(\bm{X}_{t},\bm{Y}_{t})\mid\bm{X}_{t-1}=\sigma\land\bm{Y}_{t-1}=\tau}\right]
=\displaystyle= 1nvV𝔼[H(σvCvX,τvCvY)]\displaystyle\,\frac{1}{n}\sum_{v\in V}\mathbb{E}\left[{H\left(\sigma^{v\leftarrow C_{v}^{X}},\tau^{v\leftarrow C_{v}^{Y}}\right)}\right]
\displaystyle\leq 1nv𝒮𝔼[H(σvCvX,τvCvY)]+1nv𝒮(𝔼[H(σvCvX,τvCvY)]+2)\displaystyle\,\frac{1}{n}\sum_{v\not\in\mathcal{S}}\mathbb{E}\left[{H\left(\sigma^{v\leftarrow C_{v}^{X^{\prime}}},\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\right)}\right]+\frac{1}{n}\sum_{v\in\mathcal{S}}\left(\mathbb{E}\left[{H\left(\sigma^{v\leftarrow C_{v}^{X^{\prime}}},\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\right)}\right]+2\right)
()\displaystyle(\ast)\quad\leq 𝔼[H(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ]+4Ln\displaystyle\,\mathbb{E}\left[{H(\bm{X}^{\prime}_{t},\bm{Y}^{\prime}_{t})\mid\bm{X}^{\prime}_{t-1}=\sigma\land\bm{Y}^{\prime}_{t-1}=\tau}\right]+\frac{4L}{n}
\displaystyle\leq (1δn)H(σ,τ)+4Ln,\displaystyle\,\left(1-\frac{\delta}{n}\right)\cdot H(\sigma,\tau)+\frac{4L}{n},

where ()(\ast) holds due to |𝒮|2L|\mathcal{S}|\leq 2L. This proves the claim in (42). ∎

7.2. Implementation of the algorithms

In this section, we prove Claims 6.15, 6.16 and 6.20 by giving the implementations of the algorithms.

7.2.1. Proofs of Claims 6.15 and 6.20

We prove Claim 6.20; Claim 6.15 can be proved in a similar way.

It is easy to verify that the updated instance \mathcal{I}^{\prime}, all the probabilities (p^{\mathsf{up}}_{v})_{v\in V} in (12), and all intermediate instances \mathcal{I}_{\mathsf{mid}},\mathcal{I}_{1},\mathcal{I}_{2} in (8), (18), (6.1.2) can be computed with time cost O(\Delta n). We focus on constructing \mathcal{P}_{i} for 1\leq i\leq N_{\min}.

The multi-sample dynamic Gibbs sampling algorithm uses the data structure in Theorem 6.12 to maintain N(n) independent Gibbs sampling chains on instance \mathcal{I}, represented by \bm{X}^{(i)}_{0} and \mathsf{Exe\text{-}Log}^{(i)}\left(\mathcal{I},T\right)=\left\langle{v^{(i)}_{t}},X^{(i)}_{t}({v^{(i)}_{t}})\right\rangle_{t=1}^{{T}}. To construct the random sets \mathcal{P}_{i} for 1\leq i\leq N_{\min}, we need an additional data structure to maintain the following data. Define the set H_{v} as

Hv{(i,t)[N(n)]×[T]vt(i)=v}.\displaystyle H_{v}\triangleq\{(i,t)\in[N(n)]\times[T]\mid v^{(i)}_{t}=v\}.

H_{v} contains all the transition steps, over the N(n) independent chains, that pick the vertex v. The algorithm uses an extra data structure \mathcal{H} to maintain all (H_{v})_{v\in V}. The data structure \mathcal{H} contains n balanced binary search trees (\mathcal{H}_{v})_{v\in V}, where each \mathcal{H}_{v} maintains the set H_{v} in a similar way as the main data structure in Theorem 6.12. Since T=O(n\log n) and N(n)\leq\mathrm{poly}(n), the space cost of \mathcal{H} is O(nN(n)\log n) memory words, each of O(\log n) bits, which is dominated by the space cost in Lemma 6.18. The time cost of adding, deleting, or searching a transition step in \mathcal{H} is O(\log^{2}n). We need to update \mathcal{H} when \mathcal{I} is updated to \mathcal{I}^{\prime}; one can verify that this time cost is dominated by the time cost in Lemma 6.18.

Then for each vVv\in V, we pick each element in HvH_{v} with probability pv𝗎𝗉p^{\mathsf{up}}_{v} to construct the set

vHv.\displaystyle\mathcal{B}_{v}\subseteq H_{v}.

This is the standard Bernoulli process. With the data structure v\mathcal{H}_{v}, the time complexity of constructing the set v\mathcal{B}_{v} is O(|v|log2n)O(\left|\mathcal{B}_{v}\right|\log^{2}n). Given all the sets v\mathcal{B}_{v}, it is easy to construct all the sets 𝒫i\mathcal{P}_{i}. Hence,

T𝗉𝗋𝖾𝗉𝖺𝗋𝖺𝗍𝗂𝗈𝗇𝗆𝗎𝗅𝗍𝗂=O(Δn+vV|v|log2n)=O(Δn+i=1Nmin|𝒫i|log2n).\displaystyle T_{\mathsf{preparation}}^{\mathsf{multi}}=O\left(\Delta n+\sum_{v\in V}\left|\mathcal{B}_{v}\right|\log^{2}n\right)=O\left(\Delta n+\sum_{i=1}^{N_{\min}}\left|\mathcal{P}_{i}\right|\log^{2}n\right).
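One way to touch only the selected elements (a sketch under the assumption that each \mathcal{H}_{v} supports order-statistics queries, so that the r-th element of H_{v} can be retrieved in O(\log^{2}n) time) is to jump between successes of the Bernoulli process with geometrically distributed gaps:

    import math
    import random

    def bernoulli_ranks_by_skips(size, p):
        """Return the (1-based) ranks of a Bernoulli(p) subset of a set with
        `size` elements, visiting only the selected ranks; each selected rank is
        then turned into an element of H_v via an order-statistics query."""
        selected, pos = [], 0
        if p <= 0.0:
            return selected
        if p >= 1.0:
            return list(range(1, size + 1))
        while True:
            u = 1.0 - random.random()                       # u in (0, 1]
            gap = int(math.log(u) / math.log(1.0 - p)) + 1  # gap ~ Geometric(p)
            pos += gap
            if pos > size:
                return selected
            selected.append(pos)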

In the preparation stage of multi-sample dynamic Gibbs sampling algorithm, we first construct the 𝗆𝗂𝖽=(V,E,Q,Φ𝗆𝗂𝖽)\mathcal{I}_{\mathsf{mid}}=(V,E,Q,\Phi^{\mathsf{mid}}) as in (8), and each 𝒫i\mathcal{P}_{i} (1iNmin1\leq i\leq N_{\min}) is constructed with respect to \mathcal{I} and 𝗆𝗂𝖽\mathcal{I}_{\mathsf{mid}}. Note that dHamil(,𝗆𝗂𝖽)dHamil(,)L𝖧𝖺𝗆𝗂𝗅d_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}_{\mathsf{mid}})\leq d_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}^{\prime})\leq L_{\mathsf{Hamil}}. By (12), we have for each 1iNmin1\leq i\leq N_{\min},

𝔼[|𝒫i|]t=1TvVpv𝗎𝗉n4TL𝖧𝖺𝗆𝗂𝗅n.\displaystyle\mathbb{E}\left[{\left|\mathcal{P}_{i}\right|}\right]\leq\sum_{t=1}^{T}\sum_{v\in V}\frac{p^{\mathsf{up}}_{v}}{n}\leq\frac{4TL_{\mathsf{Hamil}}}{n}.

This proves the claim. ∎

7.2.2. Proof of Claim 6.16

We give the implementation of the update stage of the single-sample dynamic Gibbs sampling algorithm (Algorithm 2). The algorithm updates the MRF instance from \mathcal{I} to \mathcal{I}^{\prime} as follows,

𝗆𝗂𝖽12,\displaystyle\mathcal{I}\quad\to\quad\mathcal{I}_{\mathsf{mid}}\quad\to\quad\mathcal{I}_{1}\quad\to\quad\mathcal{I}_{2}\quad\to\quad\mathcal{I}^{\prime},

where \mathcal{I}_{\mathsf{mid}} is defined in (8), \mathcal{I}_{1}=\mathcal{I}_{1}(\mathcal{I}_{\mathsf{mid}},\mathcal{I}^{\prime}) is defined in (18), and \mathcal{I}_{2}=\mathcal{I}_{2}(\mathcal{I}_{\mathsf{mid}},\mathcal{I}^{\prime}) is defined in (6.1.2). Then the algorithm calls LengthFix to modify the length of the execution log from T to T^{\prime}.

The preparation stage computes all probabilities (p^{\mathsf{up}}_{v})_{v\in V} in (12), the set \mathcal{P} in (13), and all instances \mathcal{I}_{\mathsf{mid}},\mathcal{I}_{1},\mathcal{I}_{2}. Consider the time cost of the update stage. In the update from \mathcal{I}_{\mathsf{mid}} to \mathcal{I}_{1}, we only add the isolated vertices in V^{\prime}\setminus V; using the data structure in Theorem 6.12, the expected time cost is

𝔼[T𝗆𝗂𝖽1]=O(|VV||V|Tmaxlog2Tmax)=O(L𝗀𝗋𝖺𝗉𝗁nTmaxlog2Tmax).\displaystyle\mathbb{E}\left[{T_{\mathcal{I}_{\mathsf{mid}}\to\mathcal{I}_{1}}}\right]=O\left(\frac{\left|V^{\prime}\setminus V\right|}{\left|V\right|}T_{\max}\log^{2}{T_{\max}}\right)=O\left(\frac{L_{\mathsf{graph}}}{n}T_{\max}\log^{2}{T_{\max}}\right).

In the update from 2\mathcal{I}_{2} to \mathcal{I}^{\prime}, we only delete isolated vertices in VVV\setminus V^{\prime}, thus

\displaystyle\mathbb{E}\left[{T_{\mathcal{I}_{2}\to\mathcal{I}^{\prime}}}\right]=O\left(\frac{\left|V\setminus V^{\prime}\right|}{\left|V\cup V^{\prime}\right|}T_{\max}\log^{2}{T_{\max}}\right)=O\left(\frac{L_{\mathsf{graph}}}{n}T_{\max}\log^{2}{T_{\max}}\right).

It is also easy to observe that the expected time cost of LengthFix is

𝔼[TLengthFix]=O(Δ|TT|log2Tmax).\displaystyle\mathbb{E}\left[{T_{\textsf{LengthFix}}}\right]=O\left(\Delta\left|T-T^{\prime}\right|\log^{2}T_{\max}\right).

We then prove that

(49) 𝔼[T𝗆𝗂𝖽]\displaystyle\mathbb{E}\left[{T_{\mathcal{I}\to\mathcal{I}_{\mathsf{mid}}}}\right] =O(Δ𝔼[R𝖧𝖺𝗆𝗂𝗅]log2Tmax)\displaystyle=O\left(\Delta\mathbb{E}\left[{R_{\mathsf{Hamil}}}\right]\log^{2}T_{\max}\right)
(50) 𝔼[T12]\displaystyle\mathbb{E}\left[{T_{\mathcal{I}_{1}\to\mathcal{I}_{2}}}\right] =O(Δ𝔼[R𝗀𝗋𝖺𝗉𝗁]log2Tmax).\displaystyle=O\left(\Delta\mathbb{E}\left[{R_{\mathsf{graph}}}\right]\log^{2}T_{\max}\right).

Combining all the running times proves Claim 6.16.

We give the implementation of Algorithm 4 to prove (49); Algorithm 6 can be implemented in a similar way to prove (50). Since (p^{\mathsf{up}}_{v})_{v\in V} and \mathcal{P} are given, the running time of Algorithm 4 is dominated by the while-loop. We implement Algorithm 4 such that after each execution of the while-loop, the first t_{0} transition steps of the Gibbs sampling on instance \mathcal{I} have been updated to the first t_{0} transition steps of the Gibbs sampling on instance \mathcal{I}^{\prime}, namely, (\bm{X}_{t})_{t=0}^{t_{0}} has been updated to (\bm{Y}_{t})_{t=0}^{t_{0}}, where t_{0} is the variable in Algorithm 4. Recall the sets \mathcal{D} and \mathcal{P} in Algorithm 4. We need the following temporary data structures (the lookup of the next relevant time step is sketched after the list):

  • a balanced binary search tree 𝒯\mathcal{T} to maintain the set 𝒟\mathcal{D} and the configuration Xt01(𝒟)X_{t_{0}-1}(\mathcal{D});

  • a heap 1\mathcal{H}_{1} to maintain the set 𝒫\mathcal{P};

  • a heap 2\mathcal{H}_{2} such that once a vertex vv is added into 𝒟\mathcal{D}, the update times Succ(t0,u)\textsf{Succ}(t_{0},u) for all uΓG(v){v}u\in\Gamma_{G}(v)\cup\{v\} are added into 2\mathcal{H}_{2}, where Succ is the operation of the data structure in Theorem 6.12.
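As an illustration of how \mathcal{H}_{1} and \mathcal{H}_{2} interact (a hypothetical sketch; plain Python lists maintained with heapq stand in for the structures above), the while-loop repeatedly jumps to the smallest time step after t_{0} that either lies in \mathcal{P} or is an update time of a vertex in, or adjacent to, the discrepancy set \mathcal{D}:

    import heapq

    def next_relevant_step(H1, H2, t0):
        """H1: min-heap of the steps in P; H2: min-heap of successor update times
        of vertices in and adjacent to D.  Returns the smallest step > t0 that the
        while-loop still has to process, or None if no such step remains."""
        for heap in (H1, H2):
            while heap and heap[0] <= t0:   # discard entries already processed
                heapq.heappop(heap)
        candidates = [heap[0] for heap in (H1, H2) if heap]
        return min(candidates) if candidates else None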

The while-loop of Algorithm 4 can be implemented using \mathcal{H}_{1}, \mathcal{H}_{2} and \mathcal{T}, and the remaining steps of Algorithm 4 can be implemented using \mathcal{T} together with the main data structure in Theorem 6.12. Note that the time cost of each operation of \mathcal{T} is O(\log n)=O(\log T_{\max}). Also note that at most \Delta R_{\mathsf{Hamil}} elements can be added into \mathcal{H}_{2}. Hence, the total time cost contributed by \mathcal{H}_{2} is O(\Delta R_{\mathsf{Hamil}}\log(\Delta R_{\mathsf{Hamil}}))=O(\Delta R_{\mathsf{Hamil}}\log T_{\max}). One can verify that the total running time is

T𝗆𝗂𝖽=O(ΔR𝖧𝖺𝗆𝗂𝗅log2Tmax).\displaystyle T_{\mathcal{I}\to\mathcal{I}_{\mathsf{mid}}}=O\left(\Delta R_{\mathsf{Hamil}}\log^{2}T_{\max}\right).

This proves (49). ∎

7.3. Dynamic Gibbs sampling for specific models

In this section, we apply our algorithm to the Ising model, graph q-colorings, and the hardcore model. We prove the following theorem.

Theorem 7.1.

There exist dynamic sampling algorithms as stated in Theorem 6.1 with the same space cost O\left(nN(n)\log n\right) and expected time cost O\left(\Delta^{2}(L_{\mathsf{graph}}+L_{\mathsf{Hamil}})N(n)\log^{3}n+\Delta n\right) for each update, if the input instance \mathcal{I} with n vertices and the updated instance \mathcal{I}^{\prime}, which satisfy d_{\textsf{graph}}(\mathcal{I},\mathcal{I}^{\prime})\leq L_{\mathsf{graph}}=o(n) and d_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}^{\prime})\leq L_{\mathsf{Hamil}}, both are:

  • Ising models with temperature β\beta and arbitrary local fields where exp(2|β|)12δΔ+1\exp(-2|\beta|)\geq 1-\frac{2-\delta}{\Delta+1};

  • proper qq-colorings with q(2+δ)Δq\geq(2+\delta)\Delta;

  • hardcore models with fugacity λ2δΔ2\lambda\leq\frac{2-\delta}{\Delta-2}, but with an alternative time cost for each update

    (51) O(Δ3(L𝗀𝗋𝖺𝗉𝗁+L𝖧𝖺𝗆𝗂𝗅)N(n)log3n+Δn),\displaystyle O\left(\Delta^{3}(L_{\mathsf{graph}}+L_{\mathsf{Hamil}})N(n)\log^{3}n+\Delta n\right),

where δ>0\delta>0 is a constant, Δ=max{ΔG,ΔG}\Delta=\max\{\Delta_{G},\Delta_{G^{\prime}}\}, ΔG\Delta_{G} denotes the maximum degree of the input graph, and ΔG\Delta_{G^{\prime}} denotes the maximum degree of the updated graph.

In Theorem 7.1, the regimes for the Ising model and q-colorings match the Dobrushin-Shlosman condition, so those results are corollaries of Theorem 6.1. The regime for the hardcore model is better than the one implied by the Dobrushin-Shlosman condition. We give the proof for the hardcore model.

We use =(V,E,λ)\mathcal{I}=(V,E,\lambda) to specify the hardcore model on graph G=(V,E)G=(V,E) with fugacity λ\lambda. A configuration of the hardcore model is σ{0,1}V\sigma\in\{0,1\}^{V}, where σv=1\sigma_{v}=1 indicates that vv is occupied and σv=0\sigma_{v}=0 indicates that vv is unoccupied. If σ\sigma forms an independent set, then μ(σ)λσ\mu_{\mathcal{I}}(\sigma)\propto\lambda^{\|\sigma\|}; otherwise, μ(σ)=0\mu_{\mathcal{I}}(\sigma)=0. We need the following lemma, which is proved using Vigoda’s coupling technique [37].
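
As a quick illustration of this definition (not needed for any proof), the unnormalized weight of a configuration can be computed as follows; the dictionary-based representation of the configuration is our own choice.

def hardcore_weight(sigma, E, lam):
    # sigma: dict mapping each vertex to 0/1; E: iterable of edges (u, v); lam: fugacity.
    # Returns the unnormalized weight, so that mu_I(sigma) is proportional to it.
    if any(sigma[u] == 1 and sigma[v] == 1 for (u, v) in E):
        return 0.0                        # sigma is not an independent set
    return lam ** sum(sigma.values())     # lambda^{||sigma||} on independent sets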

Lemma 7.2.

Let δ>0\delta>0 be a constant. Let =(V,E,λ)\mathcal{I}=(V,E,\lambda) be a hardcore instance, where n=|V|n=|V|, and Ω{σ{0,1}Vμ(σ)>0}\Omega_{\mathcal{I}}\triangleq\{\sigma\in\{0,1\}^{V}\mid\mu_{\mathcal{I}}(\sigma)>0\}. Assume λ2δΔ2\lambda\leq\frac{2-\delta}{\Delta-2}, where Δ\Delta is the maximum degree of G=(V,E)G=(V,E). There exists a potential function ρ:Ω×Ω0\rho_{\mathcal{I}}:\Omega_{\mathcal{I}}\times\Omega_{\mathcal{I}}\to\mathbb{R}_{\geq 0}, where σ,τΩ\forall\sigma,\tau\in\Omega_{\mathcal{I}}, ρ(σ,τ)=0\rho_{\mathcal{I}}(\sigma,\tau)=0 if σ=τ\sigma=\tau and ρ(σ,τ)1\rho_{\mathcal{I}}(\sigma,\tau)\geq 1 if στ\sigma\neq\tau, and Diammaxσ,τΩρ(σ,τ)Δn\mathrm{Diam}_{\mathcal{I}}\triangleq\max_{\sigma,\tau\in\Omega_{\mathcal{I}}}\rho_{\mathcal{I}}(\sigma,\tau)\leq\Delta n, such that the one-step optimal coupling (Definition 4.2) (𝐗t,𝐘t)t0(\bm{X}_{t},\bm{Y}_{t})_{t\geq 0} of Gibbs sampling on \mathcal{I} satisfies

  1. (1)

    (step-wise decay) for the coupling (𝑿t,𝒀t)t0(\bm{X}_{t},\bm{Y}_{t})_{t\geq 0} of Gibbs sampling, it holds that

    (52) σ,τΩ:𝔼[ρ(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ](1βn)ρ(σ,τ),\displaystyle\forall\,\sigma,\tau\in\Omega_{\mathcal{I}}:\quad\mathbb{E}\left[{\,\rho_{\mathcal{I}}(\bm{X}_{t},\bm{Y}_{t})\mid\bm{X}_{t-1}=\sigma\land\bm{Y}_{t-1}=\tau\,}\right]\leq\mbox{$\left(1-\frac{\beta}{n}\right)$}\cdot\rho_{\mathcal{I}}(\sigma,\tau),

    where β=δ/96\beta=\frac{\delta}{96}, which implies τ𝗆𝗂𝗑(,ϵ)nβlogDiamϵ=O(nlognϵ)\tau_{\mathsf{mix}}(\mathcal{I},\epsilon)\leq\lceil\frac{n}{\beta}\log\frac{\mathrm{Diam}_{\mathcal{I}}}{\epsilon}\rceil=O(n\log\frac{n}{\epsilon}) (a short derivation of this implication is given after the lemma statement).

  2. (2)

    (upper bound on Hamming distance) for all σ,τΩ\sigma,\tau\in\Omega_{\mathcal{I}}, H(σ,τ)ρ(σ,τ)H(\sigma,\tau)\leq\rho_{\mathcal{I}}(\sigma,\tau), where H(σ,τ)H(\sigma,\tau) denotes the Hamming distance between σ\sigma and τ\tau.

  3. (3)

    (Lipschitz) function ρ(,)\rho_{\mathcal{I}}(\cdot,\cdot), seen as a function of 2n2n variables, is KK-Lipschitz, that is,

    \forall\,\sigma,\sigma^{\prime},\tau,\tau^{\prime}\in\Omega_{\mathcal{I}}:\quad\left|\rho_{\mathcal{I}}(\sigma,\tau)-\rho_{\mathcal{I}}(\sigma^{\prime},\tau^{\prime})\right|\leq K\cdot H(\sigma\tau,\sigma^{\prime}\tau^{\prime}),

    where K=12ΔK=12\Delta and στ\sigma\tau denotes the concatenation of σ\sigma and τ\tau, viewed as a configuration on 2n2n variables.
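
For completeness, we sketch why the step-wise decay (52) yields the stated mixing time; this is the standard coupling argument and uses only that the potential is at least 1 on distinct configurations and that Diam_I ≤ Δn:

\Pr\left[\bm{X}_{T}\neq\bm{Y}_{T}\right]\leq\mathbb{E}\left[\rho_{\mathcal{I}}(\bm{X}_{T},\bm{Y}_{T})\right]\leq\left(1-\frac{\beta}{n}\right)^{T}\mathrm{Diam}_{\mathcal{I}}\leq\exp\left(-\frac{\beta T}{n}\right)\mathrm{Diam}_{\mathcal{I}}\leq\epsilon\quad\text{whenever}\quad T\geq\frac{n}{\beta}\log\frac{\mathrm{Diam}_{\mathcal{I}}}{\epsilon},

so by the coupling inequality, starting the second chain from the stationary distribution gives total variation distance at most ε after T steps; with Diam_I ≤ Δn this is O(n log(n/ε)) steps.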

Compared with Proposition 4.3, the step-wise decay property in (52) holds only for feasible configurations σ\sigma and τ\tau, and the decay is established for the potential function ρ\rho_{\mathcal{I}} rather than the Hamming distance HH. We first use Lemma 7.2 to prove Theorem 7.1, and then prove Lemma 7.2 at the end of this section.

Recall that the error function ϵ\epsilon satisfies ϵ()1poly()\epsilon(\ell)\geq\frac{1}{\mathrm{poly}(\ell)} by Lemma 6.17. Recall Δ=max{ΔG,ΔG}\Delta=\max\{\Delta_{G},\Delta_{G^{\prime}}\}. By Lemma 7.2 and n=Θ(n)n^{\prime}=\Theta(n) (since L𝗀𝗋𝖺𝗉𝗁=o(n)L_{\mathsf{graph}}=o(n)), we can set

T\displaystyle T =T()=96nδlognΔϵ(n)=O(nlogn)\displaystyle=T(\mathcal{I})=\left\lceil\frac{96n}{\delta}\log\frac{n\Delta}{\epsilon(n)}\right\rceil=O\left(n\log n\right)
T\displaystyle T^{\prime} =T()=96nδlognΔϵ(n)=O(nlogn).\displaystyle=T(\mathcal{I}^{\prime})=\left\lceil\frac{96n^{\prime}}{\delta}\log\frac{n^{\prime}\Delta}{\epsilon(n^{\prime})}\right\rceil=O\left(n\log n\right).

We modify Algorithm 2 for the hardcore model as follows. Suppose the current instance is =(V,E,λ)\mathcal{I}=(V,E,\lambda). We set the initial configuration 𝑿0\bm{X}_{0} as

vV,X0(v)=0.\displaystyle\forall v\in V,\quad X_{0}(v)=0.

Thus 𝑿0\bm{X}_{0} is feasible. Suppose the instance =(V,E,λ)\mathcal{I}=(V,E,\lambda) is updated to =(V,E,λ)\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},\lambda^{\prime}). We divide the update into the following steps (a code sketch of this decomposition is given after the list):

𝗆𝗂𝖽123,\displaystyle\mathcal{I}\quad\to\quad\mathcal{I}_{\mathsf{mid}}\quad\to\quad\mathcal{I}_{1}\quad\to\quad\mathcal{I}_{2}\quad\to\quad\mathcal{I}_{3}\quad\to\quad\mathcal{I}^{\prime},
  • change fugacity to update =(V,E,λ)\mathcal{I}=(V,E,\lambda) to 𝗆𝗂𝖽=(V,E,λ)\mathcal{I}_{\mathsf{mid}}=(V,E,\lambda^{\prime}) using UpdateHamiltonian;

  • add isolated vertices in VVV^{\prime}\setminus V to update 𝗆𝗂𝖽=(V,E,λ)\mathcal{I}_{\mathsf{mid}}=(V,E,\lambda^{\prime}) to 1=(VV,E,λ)\mathcal{I}_{1}=(V\cup V^{\prime},E,\lambda^{\prime}) using AddVertex;

  • delete edges in EEE\setminus E^{\prime} to update 1=(VV,E,λ)\mathcal{I}_{1}=(V\cup V^{\prime},E,\lambda^{\prime}) to 2=(VV,EE,λ)\mathcal{I}_{2}=(V\cup V^{\prime},E\cap E^{\prime},\lambda^{\prime}) using UpdateEdge;

  • add edges in EEE^{\prime}\setminus E to update 2=(VV,EE,λ)\mathcal{I}_{2}=(V\cup V^{\prime},E\cap E^{\prime},\lambda^{\prime}) to 3=(VV,E,λ)\mathcal{I}_{3}=(V\cup V^{\prime},E^{\prime},\lambda^{\prime}) using UpdateEdge;

  • delete isolated vertices in VVV\setminus V^{\prime} to update 3=(VV,E,λ)\mathcal{I}_{3}=(V\cup V^{\prime},E^{\prime},\lambda^{\prime}) to =(V,E,λ)\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},\lambda^{\prime});

  • fix the length of the execution log from TT to TT^{\prime}.
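
The intermediate instances in this decomposition can be computed by straightforward bookkeeping, as in the following sketch; the tuple representation is ours, the actual work is done by the subroutines named in the list above, and edges are stored in a canonical form so that the set operations are well defined.

def update_pipeline(V, E, lam, V_new, E_new, lam_new):
    # Returns the chain of intermediate hardcore instances I -> I_mid -> I_1 -> I_2 -> I_3 -> I'.
    V, V_new = set(V), set(V_new)
    E, E_new = {frozenset(e) for e in E}, {frozenset(e) for e in E_new}
    I_mid = (V,         E,          lam_new)   # change the fugacity          (UpdateHamiltonian)
    I_1   = (V | V_new, E,          lam_new)   # add isolated vertices        (AddVertex)
    I_2   = (V | V_new, E & E_new,  lam_new)   # delete the edges in E \ E'   (UpdateEdge)
    I_3   = (V | V_new, E_new,      lam_new)   # add the edges in E' \ E      (UpdateEdge)
    I_fin = (V_new,     E_new,      lam_new)   # delete isolated vertices in V \ V'
    return [I_mid, I_1, I_2, I_3, I_fin]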

Compared to Algorithm 2, we further divide the update of the edges into two steps: first delete edges, then add edges. Thus, we have the following observation.

Observation 7.3.

The following results hold:

  • Ω=Ω𝗆𝗂𝖽\Omega_{\mathcal{I}}=\Omega_{\mathcal{I}_{\mathsf{mid}}}, Ω1Ω2\Omega_{\mathcal{I}_{1}}\subseteq\Omega_{\mathcal{I}_{2}} and Ω3Ω2\Omega_{\mathcal{I}_{3}}\subseteq\Omega_{\mathcal{I}_{2}}, where Ω𝒥\Omega_{\mathcal{J}} is the set of feasible configurations for any instance 𝒥\mathcal{J}.

  • the instances ,2,3,\mathcal{I},\mathcal{I}_{2},\mathcal{I}_{3},\mathcal{I}^{\prime} all satisfy λ2δΔ2\lambda\leq\frac{2-\delta}{\Delta-2}, where λ\lambda and Δ\Delta are the fugacity and maximum degree of the corresponding instance.

By Observation 7.3, we have Ω=Ω𝗆𝗂𝖽\Omega_{\mathcal{I}}=\Omega_{\mathcal{I}_{\mathsf{mid}}}, Ω1Ω2\Omega_{\mathcal{I}_{1}}\subseteq\Omega_{\mathcal{I}_{2}} and Ω3Ω2\Omega_{\mathcal{I}_{3}}\subseteq\Omega_{\mathcal{I}_{2}}; thus we can apply Lemma 7.2, because the step-wise decay property (52) is established only for feasible configurations.

We need to analyze R𝖧𝖺𝗆𝗂𝗅R_{\mathsf{Hamil}} and R𝗀𝗋𝖺𝗉𝗁R_{\mathsf{graph}}, defined in (17) and (23), for the hardcore model. We prove the following two lemmas.

Lemma 7.4.

Consider UpdateHamiltonian(,,𝐗0,vt,Xt(vt)t=1T)\textsf{UpdateHamiltonian}\left(\mathcal{I},\mathcal{I}^{\prime},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}\right). Let =(V,E,λ)\mathcal{I}=(V,E,\lambda) be the current instance and =(V,E,λ)\mathcal{I}^{\prime}=(V,E,\lambda^{\prime}) the updated instance. Assume λ2δΔ2\lambda\leq\frac{2-\delta}{\Delta-2}, where δ>0\delta>0 is a constant and Δ\Delta is the maximum degree of G=(V,E)G=(V,E). Also assume dHamil(,)=n|lnλlnλ|Ld_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}^{\prime})=n\left|\ln\lambda-\ln\lambda^{\prime}\right|\leq L. Then 𝔼[R𝖧𝖺𝗆𝗂𝗅]=O(Δ2TLnδ)\mathbb{E}\left[{R_{\mathsf{Hamil}}}\right]=O\left(\frac{\Delta^{2}TL}{n\delta}\right), where n=|V|n=|V|.

Lemma 7.5.

Consider UpdateEdge(,,𝐗0,vt,Xt(vt)t=1T)\textsf{UpdateEdge}\left(\mathcal{I},\mathcal{I}^{\prime},\bm{X}_{0},\left\langle{v_{t}},X_{t}({v_{t}})\right\rangle_{t=1}^{{T}}\right). Let =(V,E,λ)\mathcal{I}=(V,E,\lambda) be the current instance and =(V,E,λ)\mathcal{I}^{\prime}=(V,E^{\prime},\lambda) the updated instance. Assume |EE|L|E\oplus E^{\prime}|\leq L. Also assume one of the following two conditions holds for some constant δ>0\delta>0:

  • λ2δΔG2\lambda\leq\frac{2-\delta}{\Delta_{G}-2} and ΩΩ\Omega_{\mathcal{I}^{\prime}}\subseteq\Omega_{\mathcal{I}}, where ΔG\Delta_{G} is the maximum degree of G=(V,E)G=(V,E);

  • λ2δΔG2\lambda\leq\frac{2-\delta}{\Delta_{G^{\prime}}-2} and ΩΩ\Omega_{\mathcal{I}}\subseteq\Omega_{\mathcal{I}^{\prime}}, where ΔG\Delta_{G^{\prime}} is the maximum degree of G=(V,E)G^{\prime}=(V,E^{\prime}).

Then 𝔼[R𝗀𝗋𝖺𝗉𝗁]=O(Δ2TLnδ)\mathbb{E}\left[{R_{\mathsf{graph}}}\right]=O\left(\frac{\Delta^{2}TL}{n\delta}\right), where n=|V|n=|V| and Δ=max{ΔG,ΔG}\Delta=\max\{\Delta_{G},\Delta_{G^{\prime}}\}.

Note that we call the subroutine UpdateHamiltonian for the update modifying \mathcal{I} to 𝗆𝗂𝖽\mathcal{I}_{\mathsf{mid}}. By Observation 7.3, the condition in Lemma 7.4 holds. We call the subroutine UpdateEdge for the update modifying 1\mathcal{I}_{1} to 2\mathcal{I}_{2} and the update modifying 2\mathcal{I}_{2} to 3\mathcal{I}_{3}. By Observation 7.3, the condition in Lemma 7.5 holds in both calls of UpdateEdge. Then Theorem 7.1 for the hardcore model can be proved by going through the proof in Section 6. Compared to Lemma 6.8 and Lemma 6.11, 𝔼[R𝖧𝖺𝗆𝗂𝗅],𝔼[R𝗀𝗋𝖺𝗉𝗁]\mathbb{E}\left[{R_{\mathsf{Hamil}}}\right],\mathbb{E}\left[{R_{\mathsf{graph}}}\right] in Lemma 7.4 and Lemma 7.5 are bounded by O(Δ2TLnδ)O\left(\frac{\Delta^{2}TL}{n\delta}\right) rather than O(ΔTLnδ)O\left(\frac{\Delta TL}{n\delta}\right). This is why the hardcore model has an alternative running time in (51).

The proofs of Lemma 7.4 and Lemma 7.5 are similar to the proofs of Lemma 6.8 and Lemma 6.11. We give the proofs here for completeness.

Proof of Lemma 7.4.

By the definition of the indicator γt\gamma_{t} in (17), we have

Pr[γt=1𝒟t1]\displaystyle\Pr[\gamma_{t}=1\mid\mathcal{D}_{t-1}] Pr[t𝒫]+Pr[vtΓG+(𝒟t1)]=(Δ+1)|𝒟t1|n+vVpv𝗎𝗉n.\displaystyle\leq\Pr\left[t\in\mathcal{P}\right]+\Pr\left[v_{t}\in\Gamma_{G}^{+}(\mathcal{D}_{t-1})\right]=\frac{(\Delta+1)|\mathcal{D}_{t-1}|}{n}+\sum_{v\in V}\frac{p^{\mathsf{up}}_{v}}{n}.

By the definition of pv𝗎𝗉p^{\mathsf{up}}_{v} in  (12) and dHamil(,)=n|lnλlnλ|Ld_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}^{\prime})=n\left|\ln\lambda-\ln\lambda^{\prime}\right|\leq L, we have

Pr[γt=1𝒟t1](Δ+1)|𝒟t1|n+2Ln.\displaystyle\Pr[\gamma_{t}=1\mid\mathcal{D}_{t-1}]\leq\frac{(\Delta+1)|\mathcal{D}_{t-1}|}{n}+\frac{2L}{n}.

By the definition of R𝖧𝖺𝗆𝗂𝗅t=1TγtR_{\mathsf{Hamil}}\triangleq\sum_{t=1}^{T}\gamma_{t}, we have

(53) 𝔼[R𝖧𝖺𝗆𝗂𝗅]=t=1T𝔼[γt]=t=1T𝔼[𝔼[γt𝒟t1]]t=1T((Δ+1)𝔼[|𝒟t1|]n+2Ln).\displaystyle\mathbb{E}\left[{R_{\mathsf{Hamil}}}\right]=\sum_{t=1}^{T}\mathbb{E}\left[{\gamma_{t}}\right]=\sum_{t=1}^{T}{\mathbb{E}\left[{\mathbb{E}\left[{\gamma_{t}\mid\mathcal{D}_{t-1}}\right]}\right]}\leq\sum_{t=1}^{T}\left(\frac{(\Delta+1)\mathbb{E}\left[{|\mathcal{D}_{t-1}|}\right]}{n}+\frac{2L}{n}\right).

Next, we bound the expectation 𝔼[|𝒟t|]\mathbb{E}\left[{|\mathcal{D}_{t}|}\right]. In our implementation of the one-step local coupling for the Hamiltonian update (Definition 6.3), we first construct the random set 𝒫V\mathcal{P}\subseteq V in (13). In the tt-th step, where 1tT1\leq t\leq T, given any 𝑿t1\bm{X}_{t-1} and 𝒀t1\bm{Y}_{t-1}, the configurations 𝑿t\bm{X}_{t} and 𝒀t\bm{Y}_{t} are generated as follows (a sketch of one such coupled step is given after the list).

  • Let X(u)=Xt1(u)X^{\prime}(u)=X_{t-1}(u) and Y(u)=Yt1(u)Y^{\prime}(u)=Y_{t-1}(u) for all uV{vt}u\in V\setminus\{v_{t}\}, sample (X(vt),Y(vt)){0,1}2(X^{\prime}(v_{t}),Y^{\prime}(v_{t}))\in\{0,1\}^{2} jointly from the optimal coupling D𝗈𝗉𝗍,vtσ,τD^{\sigma,\tau}_{\mathsf{opt},\mathcal{I}_{v_{t}}} of the marginal distributions μvt,(σ)\mu_{v_{t},\mathcal{I}}(\cdot\mid\sigma) and μvt,(τ)\mu_{v_{t},\mathcal{I}}(\cdot\mid\tau), where σ=Xt1(ΓG(vt))\sigma=X_{t-1}(\Gamma_{G}(v_{t})) and τ=Yt1(ΓG(vt))\tau=Y_{t-1}(\Gamma_{G}(v_{t})).

  • Let 𝑿t=𝑿\bm{X}_{t}=\bm{X}^{\prime} and 𝒀t=𝒀\bm{Y}_{t}=\bm{Y}^{\prime}. If t𝒫t\in\mathcal{P}, update the value of Yt(vt)Y_{t}(v_{t}) using (14).
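
The following sketch shows one such coupled step for the hardcore model. The conditional marginal at the chosen vertex depends only on whether some neighbor is occupied, and drawing both copies from a single uniform random number realizes the optimal coupling of the two Bernoulli marginals; the correction (14) is abstracted as a callable adjust, since its exact form depends on the probabilities p_v^up, and all names here are our own.

import random

def coupled_step(X, Y, v, G, lam, t_in_P, adjust):
    # One transition (X_{t-1}, Y_{t-1}) -> (X_t, Y_t) of the one-step local coupling.
    # X, Y: dicts vertex -> {0,1}; G: dict vertex -> set of neighbors; lam: fugacity.
    p = 0.0 if any(X[u] == 1 for u in G[v]) else lam / (1.0 + lam)  # marginal of v given X's neighborhood
    q = 0.0 if any(Y[u] == 1 for u in G[v]) else lam / (1.0 + lam)  # marginal of v given Y's neighborhood
    U = random.random()                   # one shared uniform couples the two Bernoulli marginals optimally
    X[v], Y[v] = int(U < p), int(U < q)
    if t_in_P:
        Y[v] = adjust(Y, v)               # the extra correction (14), applied only when t belongs to P
    return X, Y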

Note that Ω=Ω\Omega_{\mathcal{I}}=\Omega_{\mathcal{I}^{\prime}}. Since \mathcal{I} satisfies λ2δΔ2\lambda\leq\frac{2-\delta}{\Delta-2} with constant δ>0\delta>0, by Lemma 7.2, for any feasible 𝑿t1,𝒀t1Ω=Ω\bm{X}_{t-1},\bm{Y}_{t-1}\in\Omega_{\mathcal{I}}=\Omega_{\mathcal{I}^{\prime}}, we have

(54) 𝔼[ρ(𝑿,𝒀)𝑿t1,𝒀t1](1δ96n)ρ(𝑿t1,𝒀t1).\displaystyle\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}^{\prime},\bm{Y}^{\prime})\mid\bm{X}_{t-1},\bm{Y}_{t-1}}\right]\leq\left(1-\frac{\delta}{96n}\right)\rho_{\mathcal{I}}(\bm{X}_{t-1},\bm{Y}_{t-1}).

By Lemma 7.2, the function ρ(,)\rho_{\mathcal{I}}(\cdot,\cdot), seen as a function of 2n2n variables, is 12Δ12\Delta-Lipschitz. Let \mathcal{F} be the indicator of the event t𝒫t\in\mathcal{P}. We flip the value of Yt(vt)Y_{t}(v_{t}) only if \mathcal{F} occurs. By (54), we have

𝔼[ρ(𝑿t,𝒀t)𝑿t1,𝒀t1]\displaystyle\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}_{t},\bm{Y}_{t})\mid\bm{X}_{t-1},\bm{Y}_{t-1}}\right] 𝔼[ρ(𝑿,𝒀)+12Δ𝑿t1,𝒀t1]\displaystyle\leq\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}^{\prime},\bm{Y}^{\prime})+12\Delta\mathcal{F}\mid\bm{X}_{t-1},\bm{Y}_{t-1}}\right]
=𝔼[ρ(𝑿,𝒀)𝑿t1,𝒀t1]+𝔼[12Δ𝑿t1,𝒀t1]\displaystyle=\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}^{\prime},\bm{Y}^{\prime})\mid\bm{X}_{t-1},\bm{Y}_{t-1}}\right]+\mathbb{E}\left[{12\Delta\mathcal{F}\mid\bm{X}_{t-1},\bm{Y}_{t-1}}\right]
( is independent of 𝑿t1,𝒀t1)\displaystyle(\text{$\mathcal{F}$ is independent of $\bm{X}_{t-1},\bm{Y}_{t-1}$})\quad (1δ96n)ρ(𝑿t1,𝒀t1)+12Δ𝔼[]\displaystyle\leq\left(1-\frac{\delta}{96n}\right)\rho_{\mathcal{I}}(\bm{X}_{t-1},\bm{Y}_{t-1})+12\Delta\mathbb{E}\left[{\mathcal{F}}\right]
(1δ96n)ρ(𝑿t1,𝒀t1)+12ΔvVpv𝗎𝗉n\displaystyle\leq\left(1-\frac{\delta}{96n}\right)\rho_{\mathcal{I}}(\bm{X}_{t-1},\bm{Y}_{t-1})+12\Delta\sum_{v\in V}\frac{p^{\mathsf{up}}_{v}}{n}
(by(12))\displaystyle(\text{by}~\eqref{eq-def-Ising-up})\quad (1δ96n)ρ(𝑿t1,𝒀t1)+24ΔnvV|lnλlnλ|\displaystyle\leq\left(1-\frac{\delta}{96n}\right)\rho_{\mathcal{I}}(\bm{X}_{t-1},\bm{Y}_{t-1})+\frac{24\Delta}{n}\sum_{v\in V}\left|\ln\lambda-\ln\lambda^{\prime}\right|
(bydHamil(,)L)\displaystyle(\text{by}~d_{\textsf{Hamil}}(\mathcal{I},\mathcal{I}^{\prime})\leq L)\quad (1δ96n)ρ(𝑿t1,𝒀t1)+24LΔn.\displaystyle\leq\left(1-\frac{\delta}{96n}\right)\rho_{\mathcal{I}}(\bm{X}_{t-1},\bm{Y}_{t-1})+\frac{24L\Delta}{n}.

Note that 𝑿0(v)=𝒀0(v)=0\bm{X}_{0}(v)=\bm{Y}_{0}(v)=0 for all vVv\in V, hence ρ(𝑿0,𝒀0)=0\rho_{\mathcal{I}}(\bm{X}_{0},\bm{Y}_{0})=0 and the configurations 𝑿t,𝒀t\bm{X}_{t},\bm{Y}_{t} are feasible for all t0t\geq 0. Taking expectations, we have

𝔼[ρ(𝑿t,𝒀t)](1δ96n)𝔼[ρ(𝑿t1,𝒀t1)]+24LΔn.\displaystyle\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}_{t},\bm{Y}_{t})}\right]\leq\left(1-\frac{\delta}{96n}\right)\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}_{t-1},\bm{Y}_{t-1})}\right]+\frac{24L\Delta}{n}.
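
Unrolling this recurrence with ρ_I(X_0, Y_0) = 0 is a routine geometric-series calculation:

\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}_{t},\bm{Y}_{t})}\right]\leq\frac{24L\Delta}{n}\sum_{i=0}^{t-1}\left(1-\frac{\delta}{96n}\right)^{i}\leq\frac{24L\Delta}{n}\cdot\frac{96n}{\delta}=\frac{2304L\Delta}{\delta}.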

Thus 𝔼[ρ(𝑿t,𝒀t)]5000LΔδ.\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}_{t},\bm{Y}_{t})}\right]\leq\frac{5000L\Delta}{\delta}. By the upper bound on Hamming distance in Lemma 7.2, we have

𝔼[|𝒟t|]5000LΔδ.\displaystyle\mathbb{E}\left[{|\mathcal{D}_{t}|}\right]\leq\frac{5000L\Delta}{\delta}.

Thus, by (53), we have

𝔼[R𝖧𝖺𝗆𝗂𝗅]50000Δ2TLδn=O(Δ2TLδn).\displaystyle\mathbb{E}\left[{R_{\mathsf{Hamil}}}\right]\leq\frac{50000\Delta^{2}TL}{\delta n}=O\left(\frac{\Delta^{2}TL}{\delta n}\right).

Proof of Lemma 7.5.

By the definition of R𝗀𝗋𝖺𝗉𝗁R_{\mathsf{graph}} in (23) and the linearity of the expectation, we have

𝔼[R𝗀𝗋𝖺𝗉𝗁]\displaystyle\mathbb{E}\left[{R_{\mathsf{graph}}}\right] =t=1T𝔼[γt]=t=1T𝔼[𝔼[γt𝒟t1]].\displaystyle=\sum_{t=1}^{T}\mathbb{E}\left[{\gamma_{t}}\right]=\sum_{t=1}^{T}\mathbb{E}\left[{\mathbb{E}\left[{\gamma_{t}\mid\mathcal{D}_{t-1}}\right]}\right].

Recall that γt=𝟏[vt𝒮ΓG+(𝒟t1)]\gamma_{t}=\mathbf{1}\left[v_{t}\in\mathcal{S}\cup\Gamma^{+}_{G}(\mathcal{D}_{t-1})\right] and that vtVv_{t}\in V is chosen uniformly at random given 𝒟t1\mathcal{D}_{t-1}. Note that |ΓG+(𝒟t1)|(Δ+1)|𝒟t1||\Gamma^{+}_{G}(\mathcal{D}_{t-1})|\leq(\Delta+1)|\mathcal{D}_{t-1}| and |𝒮|2|EE|2L|\mathcal{S}|\leq 2|E\oplus E^{\prime}|\leq 2L. We have

(55) 𝔼[R𝗀𝗋𝖺𝗉𝗁]t=1T𝔼[(Δ+1)|𝒟t1|+2Ln]=(Δ+1)nt=1T𝔼[|𝒟t1|]+2LTn.\displaystyle\mathbb{E}\left[{R_{\mathsf{graph}}}\right]\leq\sum_{t=1}^{T}\mathbb{E}\left[{\frac{(\Delta+1)|\mathcal{D}_{t-1}|+2L}{n}}\right]=\frac{(\Delta+1)}{n}\sum_{t=1}^{T}\mathbb{E}\left[{|\mathcal{D}_{t-1}|}\right]+\frac{2LT}{n}.

Suppose λ2δΔG2\lambda\leq\frac{2-\delta}{\Delta_{G}-2} and ΩΩ\Omega_{\mathcal{I}^{\prime}}\subseteq\Omega_{\mathcal{I}}; the other case follows by symmetry. We claim that

(56)  0tT:𝔼[|𝒟t|]10000ΔLδ.\displaystyle\forall\,0\leq t\leq T:\quad\mathbb{E}\left[{|\mathcal{D}_{t}|}\right]\leq\frac{10000\Delta L}{\delta}.

Combining (55) and (56), we have

𝔼[R𝗀𝗋𝖺𝗉𝗁]100000Δ2LTnδ=O(Δ2LTnδ).\displaystyle\mathbb{E}\left[{R_{\mathsf{graph}}}\right]\leq\frac{100000\Delta^{2}LT}{n\delta}=O\left(\frac{\Delta^{2}LT}{n\delta}\right).

This proves the lemma.

We now prove (56). Let (𝑿t,𝒀t)t0(\bm{X}_{t},\bm{Y}_{t})_{t\geq 0} be the one-step local coupling for updating edges (Definition 6.9). We claim the following result

(57) σΩ,τΩΩ,𝔼[ρ(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ](1δ96n)ρ(σ,τ)+48ΔLn,\displaystyle\forall\,\sigma\in\Omega_{\mathcal{I}},\tau\in\Omega_{\mathcal{I}^{\prime}}\subseteq\Omega_{\mathcal{I}},\mathbb{E}\left[{\,\rho_{\mathcal{I}}(\bm{X}_{t},\bm{Y}_{t})\mid\bm{X}_{t-1}=\sigma\land\bm{Y}_{t-1}=\tau\,}\right]\leq\left(1-\frac{\delta}{96n}\right)\cdot\rho_{\mathcal{I}}(\sigma,\tau)+\frac{48\Delta L}{n},

where ρ\rho_{\mathcal{I}} is the potential function in Lemma 7.2. Assume (57) holds. Since 𝑿0=𝒀0={0}V\bm{X}_{0}=\bm{Y}_{0}=\{0\}^{V} and ΩΩ\Omega_{\mathcal{I}^{\prime}}\subseteq\Omega_{\mathcal{I}}, we must have 𝑿t1,𝒀t1Ω\bm{X}_{t-1},\bm{Y}_{t-1}\in\Omega_{\mathcal{I}}. Taking expectation over 𝑿t1\bm{X}_{t-1} and 𝒀t1\bm{Y}_{t-1}, we have

(58) 𝔼[ρ(𝑿t,𝒀t)](1δ96n)𝔼[ρ(𝑿t1,𝒀t1)]+48ΔLn.\displaystyle\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}_{t},\bm{Y}_{t})}\right]\leq\left(1-\frac{\delta}{96n}\right)\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}_{t-1},\bm{Y}_{t-1})}\right]+\frac{48\Delta L}{n}.

Since 𝑿0=𝒀0\bm{X}_{0}=\bm{Y}_{0}, we have

(59) ρ(𝑿0,𝒀0)=0.\displaystyle\rho_{\mathcal{I}}(\bm{X}_{0},\bm{Y}_{0})=0.

Combining (58), (59) and the upper bound on Hamming distance in Lemma 7.2 implies

 0tT:𝔼[|𝒟t|]𝔼[ρ(𝑿t,𝒀t)]10000ΔLδ.\displaystyle\forall\,0\leq t\leq T:\quad\mathbb{E}\left[{\left|\mathcal{D}_{t}\right|}\right]\leq\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}_{t},\bm{Y}_{t})}\right]\leq\frac{10000\Delta L}{\delta}.

This proves the claim in (56).

We finish the proof by proving the claim in (57). Let (𝑿t,𝒀t)t0(\bm{X}^{\prime}_{t},\bm{Y}^{\prime}_{t})_{t\geq 0} be the one-step optimal coupling for Gibbs sampling on instance \mathcal{I} (Definition 4.2). Since \mathcal{I} satisfies λ2δΔG2\lambda\leq\frac{2-\delta}{\Delta_{G}-2}, by Lemma 7.2, we have

(60) σ,τΩ:𝔼[ρ(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ](1δ96n)ρ(σ,τ).\displaystyle\forall\,\sigma,\tau\in\Omega_{\mathcal{I}}:\quad\mathbb{E}\left[{\,\rho_{\mathcal{I}}(\bm{X}^{\prime}_{t},\bm{Y}^{\prime}_{t})\mid\bm{X}^{\prime}_{t-1}=\sigma\land\bm{Y}^{\prime}_{t-1}=\tau\,}\right]\leq\left(1-\frac{\delta}{96n}\right)\cdot\rho_{\mathcal{I}}(\sigma,\tau).

According to the coupling, we can rewrite the expectation in (60) as follows:

(61) 𝔼[ρ(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ]=1nvV𝔼[ρ(σvCvX,τvCvY)],\displaystyle\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}^{\prime}_{t},\bm{Y}^{\prime}_{t})\mid\bm{X}^{\prime}_{t-1}=\sigma\land\bm{Y}^{\prime}_{t-1}=\tau}\right]=\frac{1}{n}\sum_{v\in V}\mathbb{E}\left[{\rho_{\mathcal{I}}\left(\sigma^{v\leftarrow C_{v}^{X^{\prime}}},\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\right)}\right],

where (CvX,CvY)D𝗈𝗉𝗍,vσ,τ(C^{X^{\prime}}_{v},C^{Y^{\prime}}_{v})\sim D^{\sigma,\tau}_{\mathsf{opt},\mathcal{I}_{v}}, the distribution D𝗈𝗉𝗍,vσ,τD^{\sigma,\tau}_{\mathsf{opt},\mathcal{I}_{v}} is the optimal coupling between μv,(σ)\mu_{v,\mathcal{I}}(\cdot\mid\sigma) and μv,(τ)\mu_{v,\mathcal{I}}(\cdot\mid\tau), and the configuration σvCvXQV\sigma^{v\leftarrow C_{v}^{X^{\prime}}}\in Q^{V} is defined as

σvCvX(u){CvXif u=vσ(u)if uv\displaystyle\sigma^{v\leftarrow C_{v}^{X^{\prime}}}(u)\triangleq\begin{cases}C^{X^{\prime}}_{v}&\text{if }u=v\\ \sigma(u)&\text{if }u\neq v\end{cases}

and the configuration τvCvYQV\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\in Q^{V} is defined in a similar way.
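
For intuition, sampling a pair from such an optimal (maximal) coupling of two distributions over a finite domain can be done as in the following sketch, which attains Pr[C_X ≠ C_Y] equal to the total variation distance between the two marginals; the construction is the standard maximal coupling and the helper names are ours, not code from the paper.

import random

def sample_from(dist):
    # dist: dict q -> probability.  Basic inverse-transform sampling.
    r, acc = random.random(), 0.0
    for q, p in dist.items():
        acc += p
        if r < acc:
            return q
    return q   # guard against floating-point rounding

def sample_optimal_coupling(mu, nu):
    # mu, nu: dicts over the same finite domain Q.  With probability 1 - d_TV(mu, nu) the two
    # coordinates agree (drawn from the normalized overlap); otherwise they are drawn from the
    # normalized residuals, whose supports are disjoint, so Pr[C_X != C_Y] = d_TV(mu, nu).
    overlap = {q: min(mu[q], nu[q]) for q in mu}
    same = sum(overlap.values())
    if random.random() < same:
        q = sample_from({k: v / same for k, v in overlap.items()})
        return q, q
    d = 1.0 - same
    res_mu = {q: (mu[q] - overlap[q]) / d for q in mu}
    res_nu = {q: (nu[q] - overlap[q]) / d for q in nu}
    return sample_from(res_mu), sample_from(res_nu)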

Similarly, we can rewrite the expectation in (57) as follows:

(62) 𝔼[ρ(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ]=1nvV𝔼[ρ(σvCvX,τvCvY)],\displaystyle\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}_{t},\bm{Y}_{t})\mid\bm{X}_{t-1}=\sigma\land\bm{Y}_{t-1}=\tau}\right]=\frac{1}{n}\sum_{v\in V}\mathbb{E}\left[{\rho_{\mathcal{I}}\left(\sigma^{v\leftarrow C_{v}^{X}},\tau^{v\leftarrow C_{v}^{Y}}\right)}\right],

where (CvX,CvY)Dv,vσ,τ(C^{X}_{v},C^{Y}_{v})\sim D_{\mathcal{I}_{v},\mathcal{I}_{v}^{\prime}}^{\sigma,\tau} and Dv,vσ,τD_{\mathcal{I}_{v},\mathcal{I}_{v}^{\prime}}^{\sigma,\tau} is the local coupling defined in (21).

The following two properties hold for (61) and (62).

  • If v𝒮v\not\in\mathcal{S}, by the definition of Dv,vσ,τ(,)D_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau}(\cdot,\cdot) in (21), it holds that Dv,vσ,τ=D𝗈𝗉𝗍,vσ,τD_{\mathcal{I}_{v},\mathcal{I}^{\prime}_{v}}^{\sigma,\tau}=D_{\mathsf{opt},\mathcal{I}_{v}}^{\sigma,\tau}. Hence

    v𝒮:𝔼[ρ(σvCvX,τvCvY)]=𝔼[ρ(σvCvX,τvCvY)].\displaystyle\forall v\not\in\mathcal{S}:\quad\mathbb{E}\left[{\rho_{\mathcal{I}}\left(\sigma^{v\leftarrow C_{v}^{X^{\prime}}},\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\right)}\right]=\mathbb{E}\left[{\rho_{\mathcal{I}}\left(\sigma^{v\leftarrow C_{v}^{X}},\tau^{v\leftarrow C_{v}^{Y}}\right)}\right].
  • If v𝒮v\in\mathcal{S}, then it holds that H(σvCvX,σvCvX)1H(\sigma^{v\leftarrow C_{v}^{X}},\sigma^{v\leftarrow C_{v}^{X^{\prime}}})\leq 1 and H(τvCvY,τvCvY)1H(\tau^{v\leftarrow C_{v}^{Y^{\prime}}},\tau^{v\leftarrow C_{v}^{Y}})\leq 1, where HH is the Hamming distance. Since ΩΩ\Omega_{\mathcal{I}^{\prime}}\subseteq\Omega_{\mathcal{I}}, it holds that σvCvX,σvCvX,τvCvY,τvCvYΩ\sigma^{v\leftarrow C_{v}^{X^{\prime}}},\sigma^{v\leftarrow C_{v}^{X}},\tau^{v\leftarrow C_{v}^{Y}},\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\in\Omega_{\mathcal{I}}. Since the function ρ\rho_{\mathcal{I}} is 12Δ12\Delta-Lipschitz, we have

    v𝒮:𝔼[ρ(σvCvX,τvCvY)]𝔼[ρ(σvCvX,τvCvY)]+24Δ.\displaystyle\forall v\in\mathcal{S}:\quad\mathbb{E}\left[{\rho_{\mathcal{I}}\left(\sigma^{v\leftarrow C_{v}^{X}},\tau^{v\leftarrow C_{v}^{Y}}\right)}\right]\leq\mathbb{E}\left[{\rho_{\mathcal{I}}\left(\sigma^{v\leftarrow C_{v}^{X^{\prime}}},\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\right)}\right]+24\Delta.

Combining the above two properties with (60), (61) and (62), we have, for any σΩ\sigma\in\Omega_{\mathcal{I}} and τΩ\tau\in\Omega_{\mathcal{I}^{\prime}},

𝔼[ρ(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ]\displaystyle\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}_{t},\bm{Y}_{t})\mid\bm{X}_{t-1}=\sigma\land\bm{Y}_{t-1}=\tau}\right]
=\displaystyle= 1nvV𝔼[ρ(σvCvX,τvCvY)]\displaystyle\,\frac{1}{n}\sum_{v\in V}\mathbb{E}\left[{\rho_{\mathcal{I}}\left(\sigma^{v\leftarrow C_{v}^{X}},\tau^{v\leftarrow C_{v}^{Y}}\right)}\right]
\displaystyle\leq 1nv𝒮𝔼[ρ(σvCvX,τvCvY)]+1nv𝒮(𝔼[ρ(σvCvX,τvCvY)]+24Δ)\displaystyle\,\frac{1}{n}\sum_{v\not\in\mathcal{S}}\mathbb{E}\left[{\rho_{\mathcal{I}}\left(\sigma^{v\leftarrow C_{v}^{X^{\prime}}},\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\right)}\right]+\frac{1}{n}\sum_{v\in\mathcal{S}}\left(\mathbb{E}\left[{\rho_{\mathcal{I}}\left(\sigma^{v\leftarrow C_{v}^{X^{\prime}}},\tau^{v\leftarrow C_{v}^{Y^{\prime}}}\right)}\right]+24\Delta\right)
()\displaystyle(\ast)\quad\leq 𝔼[ρ(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ]+48LΔn\displaystyle\,\mathbb{E}\left[{\rho_{\mathcal{I}}(\bm{X}^{\prime}_{t},\bm{Y}^{\prime}_{t})\mid\bm{X}^{\prime}_{t-1}=\sigma\land\bm{Y}^{\prime}_{t-1}=\tau}\right]+\frac{48L\Delta}{n}
\displaystyle\leq (1δ96n)ρ(σ,τ)+48LΔn,\displaystyle\,\left(1-\frac{\delta}{96n}\right)\cdot\rho_{\mathcal{I}}(\sigma,\tau)+\frac{48L\Delta}{n},

where ()(\ast) holds due to |𝒮|2L|\mathcal{S}|\leq 2L. This proves the claim in (57). ∎

Finally, we prove Lemma 7.2. This proof is based on the coupling technique in [37].

Proof of Lemma 7.2.

We give a potential function ρ\rho_{\mathcal{I}} for the hardcore instance \mathcal{I}. We mainly use Vigoda’s potential function from [37]; however, we slightly modify it to handle isolated vertices.

Recall that for the hardcore model, Q={0,1}Q=\{0,1\}. For any σQV\sigma\in Q^{V}, σ(v)=1\sigma(v)=1 indicates that vv is occupied and σ(v)=0\sigma(v)=0 indicates that vv is unoccupied. For each vertex vVv\in V, we use deg(v)\mathrm{deg}(v) to denote the degree of vv in graph G=(V,E)G=(V,E). We divide the graph G=(V,E)G=(V,E) into two graphs G1=(V1,E1)G_{1}=(V_{1},E_{1}) and G2=(V2,E2)G_{2}=(V_{2},E_{2}) such that

V1={vVdeg(v)=0},E1=,\displaystyle V_{1}=\{v\in V\mid\deg(v)=0\},\quad E_{1}=\varnothing,
V2=VV1,E2=E.\displaystyle V_{2}=V\setminus V_{1},\quad E_{2}=E.

Thus G1G_{1} is an empty graph and G2G_{2} contains no isolated vertex. The potential function ρ\rho_{\mathcal{I}} is defined as

σ,τΩ:ρ(σ,τ)4ρ1(σ(V1),τ(V1))+4ρ2(σ(V2),τ(V2)).\displaystyle\forall\sigma,\tau\in\Omega_{\mathcal{I}}:\quad\rho_{\mathcal{I}}(\sigma,\tau)\triangleq 4\rho_{1}(\sigma(V_{1}),\tau(V_{1}))+4\rho_{2}(\sigma(V_{2}),\tau(V_{2})).

Here, ρ1\rho_{1} is the potential function on G1G_{1}, which is the Hamming distance:

ρ1(σ(V1),τ(V1))=vV1𝟏[σ(v)τ(v)].\displaystyle\rho_{1}(\sigma(V_{1}),\tau(V_{1}))=\sum_{v\in V_{1}}\mathbf{1}\left[\sigma(v)\neq\tau(v)\right].

And ρ2(σ(V2),τ(V2))\rho_{2}(\sigma(V_{2}),\tau(V_{2})) is the Vigoda’s potential function [37] on the graph G2G_{2}. Formally, let D={vV2σ(v)τ(v)}D=\{v\in V_{2}\mid\sigma(v)\neq\tau(v)\}. For each vV2v\in V_{2}, let dv=|DΓG2(v)|d_{v}=|D\cap\Gamma_{G_{2}}(v)|. Let c=ΔλΔλ+2c=\frac{\Delta\lambda}{\Delta\lambda+2}, where Δ\Delta is the maximum degree of graph GG. Note that the maximum degree of graph G2G_{2} is also Δ\Delta. The potential function ρ2(σ(V2),τ(V2))\rho_{2}(\sigma(V_{2}),\tau(V_{2})) is defined as

αv={deg(v)if vD0otherwise;βv={cdvif wΓG2(v) such that σ(w)=τ(w)=1c(dv1)if there is no such w and dv>1 0otherwise;\displaystyle\alpha_{v}=\begin{cases}\deg(v)&\text{if }v\in D\\ 0&\text{otherwise};\end{cases}\quad\beta_{v}=\begin{cases}-cd_{v}&\text{if $\exists\,w\in\Gamma_{G_{2}}(v)$ such that $\sigma(w)=\tau(w)=1$}\\ -c(d_{v}-1)&\text{if there is no such $w$ and $d_{v}>1$ }\\ 0&\text{otherwise};\end{cases}
ρ2(σ(V2),τ(V2))=vV2(αv+βv).\displaystyle\rho_{2}(\sigma(V_{2}),\tau(V_{2}))=\sum_{v\in V_{2}}(\alpha_{v}+\beta_{v}).
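
For concreteness, the potential ρ_I can be evaluated directly from this definition, as in the following sketch; the dictionary-based representation and the function name are ours.

def rho_I(sigma, tau, G, lam, Delta):
    # sigma, tau: dicts vertex -> {0,1}; G: dict vertex -> set of neighbors; Delta: max degree of G.
    V1 = {v for v in G if not G[v]}                  # isolated vertices (graph G_1)
    V2 = set(G) - V1                                 # the remaining vertices (graph G_2, all edges of G)
    rho1 = sum(sigma[v] != tau[v] for v in V1)       # Hamming distance on V_1
    D = {v for v in V2 if sigma[v] != tau[v]}
    c = Delta * lam / (Delta * lam + 2.0)
    rho2 = 0.0
    for v in V2:
        d_v = len(D & G[v])
        alpha = len(G[v]) if v in D else 0           # alpha_v = deg(v) exactly when v is in D
        if any(sigma[w] == 1 and tau[w] == 1 for w in G[v]):
            beta = -c * d_v
        elif d_v > 1:
            beta = -c * (d_v - 1)
        else:
            beta = 0.0
        rho2 += alpha + beta
    return 4 * rho1 + 4 * rho2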

It is easy to see that ρ(σ,σ)=0\rho_{\mathcal{I}}(\sigma,\sigma)=0 and maxσ,τΩρ(σ,τ)Δn\max_{\sigma,\tau\in\Omega_{\mathcal{I}}}\rho_{\mathcal{I}}(\sigma,\tau)\leq\Delta n. We then verify the other properties of ρ\rho_{\mathcal{I}}.

First, we prove the upper bound on Hamming distance. For the function ρ1\rho_{1}, it holds that

ρ1(σ(V1),τ(V1))=H(σ(V1),τ(V1)).\displaystyle\rho_{1}(\sigma(V_{1}),\tau(V_{1}))=H(\sigma(V_{1}),\tau(V_{1})).

For function ρ2\rho_{2}, it holds that

ρ2(σ(V2),τ(V2))=vV2(αv+βv)=vDαv+vV2βvvDwΓG2(v)(1c),\displaystyle\rho_{2}(\sigma(V_{2}),\tau(V_{2}))=\sum_{v\in V_{2}}(\alpha_{v}+\beta_{v})=\sum_{v\in D}\alpha_{v}+\sum_{v\in V_{2}}\beta_{v}\geq\sum_{v\in D}\sum_{w\in\Gamma_{G_{2}}(v)}(1-c),

where the last inequality holds because vV2βvvV2cdv=cvDdeg(v)\sum_{v\in V_{2}}\beta_{v}\geq-\sum_{v\in V_{2}}cd_{v}=-c\sum_{v\in D}\deg(v). Since the graph G2G_{2} contains no isolated vertex, we have |ΓG2(v)|=deg(v)1|\Gamma_{G_{2}}(v)|=\deg(v)\geq 1 for all vDv\in D. Note that c<1c<1. Thus

ρ2(σ(V2),τ(V2))|D|(1c)=|D|2Δλ+2|D|4=14H(σ(V2),τ(V2)),\displaystyle\rho_{2}(\sigma(V_{2}),\tau(V_{2}))\geq|D|(1-c)=|D|\frac{2}{\Delta\lambda+2}\geq\frac{|D|}{4}=\frac{1}{4}H(\sigma(V_{2}),\tau(V_{2})),

where 2λΔ+214\frac{2}{\lambda\Delta+2}\geq\frac{1}{4} holds because λ<2Δ2\lambda<\frac{2}{\Delta-2} and Δ3\Delta\geq 3. Combining the two bounds, we have

ρ(σ,τ)=4ρ1(σ(V1),τ(V1))+4ρ2(σ(V2),τ(V2))H(σ,τ).\displaystyle\rho_{\mathcal{I}}(\sigma,\tau)=4\rho_{1}(\sigma(V_{1}),\tau(V_{1}))+4\rho_{2}(\sigma(V_{2}),\tau(V_{2}))\geq H(\sigma,\tau).

This also implies ρ(σ,τ)𝟏[στ]\rho_{\mathcal{I}}(\sigma,\tau)\geq\mathbf{1}\left[\sigma\neq\tau\right].

Next, we show the function ρ\rho_{\mathcal{I}} is 12Δ12\Delta-Lipschitz. Recall V1V2=V_{1}\cap V_{2}=\varnothing, V1V2=VV_{1}\cup V_{2}=V and

ρ(σ,τ)=4ρ1(σ(V1),τ(V1))+4ρ2(σ(V2),τ(V2)).\displaystyle\rho_{\mathcal{I}}(\sigma,\tau)=4\rho_{1}(\sigma(V_{1}),\tau(V_{1}))+4\rho_{2}(\sigma(V_{2}),\tau(V_{2})).

Since ρ1\rho_{1} is the Hamming distance, it is easy to see that ρ1\rho_{1} is 11-Lipschitz. To bound the Lipschitz constant of ρ2\rho_{2}, we extend the function ρ2\rho_{2} as follows: the defining formula of ρ2\rho_{2} is applied to all pairs in QV2×QV2Q^{V_{2}}\times Q^{V_{2}}, where Q={0,1}Q=\{0,1\}, not only to feasible pairs. For any x,y,x,yQV2x,y,x^{\prime},y^{\prime}\in Q^{V_{2}} such that H(xy,xy)=1H(xy,x^{\prime}y^{\prime})=1, it is easy to verify that the extended function ρ2\rho_{2} satisfies

|ρ2(x,y)ρ2(x,y)|3Δ.\displaystyle|\rho_{2}(x,y)-\rho_{2}(x^{\prime},y^{\prime})|\leq 3\Delta.

This implies the original function ρ2\rho_{2} is 3Δ3\Delta-Lipschitz. Hence, the function ρ\rho_{\mathcal{I}} is 12Δ12\Delta-Lipschitz.

Finally, we prove the step-wise decay property. Let (𝑿t(1))t0,(𝒀t(1))t0(\bm{X}_{t}^{(1)})_{t\geq 0},(\bm{Y}_{t}^{(1)})_{t\geq 0} be the Gibbs sampling chains for the hardcore model on graph G1G_{1}. Since G1G_{1} consists of isolated vertices, the one-step optimal coupling (𝑿t(1),𝒀t(1))t0(\bm{X}_{t}^{(1)},\bm{Y}_{t}^{(1)})_{t\geq 0} satisfies

σ,τΩ:𝔼[ρ1(𝑿t(1),𝒀t(1))𝑿t1(1)=σ(V1)𝒀t1(1)=τ(V1)](11|V1|)ρ1(σ(V1),τ(V1)).\displaystyle\forall\sigma,\tau\in\Omega_{\mathcal{I}}:\,\mathbb{E}\left[{\rho_{1}\left(\bm{X}^{(1)}_{t},\bm{Y}^{(1)}_{t}\right)\mid\bm{X}^{(1)}_{t-1}=\sigma(V_{1})\land\bm{Y}^{(1)}_{t-1}=\tau(V_{1})}\right]\leq\left(1-\frac{1}{|V_{1}|}\right)\rho_{1}(\sigma(V_{1}),\tau(V_{1})).

Let (𝑿t(2))t0,(𝒀t(2))t0(\bm{X}_{t}^{(2)})_{t\geq 0},(\bm{Y}_{t}^{(2)})_{t\geq 0} be the Gibbs sampling chains for the hardcore model on graph G2G_{2}. If λ2δΔ2=2(1δ/2)Δ2\lambda\leq\frac{2-\delta}{\Delta-2}=\frac{2(1-\delta/2)}{\Delta-2}, then by Vigoda’s proof (it can be verified that in Vigoda’s proof [37], the Markov chain for sampling the hardcore model is indeed the Gibbs sampling, and the coupling used in the analysis is indeed the one-step optimal coupling for Gibbs sampling), the one-step optimal coupling (𝑿t(2),𝒀t(2))t0(\bm{X}_{t}^{(2)},\bm{Y}_{t}^{(2)})_{t\geq 0} satisfies:

σ,τΩ:𝔼[ρ2(𝑿t(2),𝒀t(2))𝑿t1(2)=σ(V2)𝒀t1(2)=τ(V2)](1δ96|V2|)ρ2(σ(V2),τ(V2)).\displaystyle\forall\sigma,\tau\in\Omega_{\mathcal{I}}:\,\mathbb{E}\left[{\rho_{2}\left(\bm{X}^{(2)}_{t},\bm{Y}^{(2)}_{t}\right)\mid\bm{X}^{(2)}_{t-1}=\sigma(V_{2})\land\bm{Y}^{(2)}_{t-1}=\tau(V_{2})}\right]\leq\left(1-\frac{\delta}{96|V_{2}|}\right)\rho_{2}(\sigma(V_{2}),\tau(V_{2})).

Let (𝑿t)t0,(𝒀t)t0(\bm{X}_{t})_{t\geq 0},(\bm{Y}_{t})_{t\geq 0} be the Gibbs sampling chains for the hardcore model on graph GG. If λ2δΔ2\lambda\leq\frac{2-\delta}{\Delta-2}, then the one-step optimal coupling (𝑿t,𝒀t)t0(\bm{X}_{t},\bm{Y}_{t})_{t\geq 0} satisfies:

σ,τΩ:\displaystyle\forall\sigma,\tau\in\Omega_{\mathcal{I}}:\quad 𝔼[ρ(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ]\displaystyle\mathbb{E}\left[{\rho_{\mathcal{I}}\left(\bm{X}_{t},\bm{Y}_{t}\right)\mid\bm{X}_{t-1}=\sigma\land\bm{Y}_{t-1}=\tau}\right]
\displaystyle\leq |V1|n((11|V1|)4ρ1(σ(V1),τ(V1))+4ρ2(σ(V2),τ(V2)))\,\frac{|V_{1}|}{n}\left(\left(1-\frac{1}{|V_{1}|}\right)4\rho_{1}(\sigma(V_{1}),\tau(V_{1}))+4\rho_{2}(\sigma(V_{2}),\tau(V_{2}))\right)
+|V2|n(4ρ1(σ(V1),τ(V1))+(1δ96|V2|)4ρ2(σ(V2),τ(V2)))\displaystyle+\frac{|V_{2}|}{n}\left(4\rho_{1}(\sigma(V_{1}),\tau(V_{1}))+\left(1-\frac{\delta}{96|V_{2}|}\right)4\rho_{2}(\sigma(V_{2}),\tau(V_{2}))\right)
\displaystyle\leq (1min{δ/96,1}n)ρ(σ,τ).\displaystyle\,\left(1-\frac{\min\{\delta/96,1\}}{n}\right)\rho_{\mathcal{I}}(\sigma,\tau).

Thus, since δ<2\delta<2 and hence min{δ/96,1}=δ/96\min\{\delta/96,1\}=\delta/96, the potential function ρ\rho_{\mathcal{I}} satisfies the step-wise decay property:

σ,τΩ:\displaystyle\forall\sigma,\tau\in\Omega_{\mathcal{I}}:\quad 𝔼[ρ(𝑿t,𝒀t)𝑿t1=σ𝒀t1=τ](1δ/96n)ρ(σ,τ).\displaystyle\mathbb{E}\left[{\rho_{\mathcal{I}}\left(\bm{X}_{t},\bm{Y}_{t}\right)\mid\bm{X}_{t-1}=\sigma\land\bm{Y}_{t-1}=\tau}\right]\leq\left(1-\frac{\delta/96}{n}\right)\rho_{\mathcal{I}}(\sigma,\tau).

This proves the lemma. ∎

8. Proofs for dynamic inference

8.1. Proof of the main theorem

Our dynamic inference algorithm is given as follows. For each MRF instance =(V,E,Q,Φ)\mathcal{I}=(V,E,Q,\Phi), where n=|V|n=|V|, our dynamic inference algorithm maintains N(n)N(n) independent samples 𝑿(1),𝑿(2),,𝑿(N(n))QV\bm{X}^{(1)},\bm{X}^{(2)},\ldots,\bm{X}^{(N(n))}\in Q^{V}, each satisfying dTV(μ,𝑿(i))ϵ(n)d_{\mathrm{TV}}\left({\mu_{\mathcal{I}}},{\bm{X}^{(i)}}\right)\leq\epsilon(n), together with the estimator 𝜽^()=(𝑿(1),𝑿(2),,𝑿(N(n)))\hat{\bm{\theta}}(\mathcal{I})=\mathcal{E}(\bm{X}^{(1)},\bm{X}^{(2)},\ldots,\bm{X}^{(N(n))}) for 𝜽()\bm{\theta}(\mathcal{I}). Given an update that modifies \mathcal{I} to =(V,E,Q,Φ)\mathcal{I}^{\prime}=(V^{\prime},E^{\prime},Q,\Phi^{\prime}), where n=|V|n^{\prime}=|V^{\prime}|, our algorithm proceeds as follows.

  • Update the sample sequence. Update 𝑿(1),𝑿(2),,𝑿(N(n))\bm{X}^{(1)},\bm{X}^{(2)},\ldots,\bm{X}^{(N(n))} to N(n)N(n^{\prime}) independent random samples 𝒀(1),𝒀(2),,𝒀(N(n))QV\bm{Y}^{(1)},\bm{Y}^{(2)},\ldots,\bm{Y}^{(N(n^{\prime}))}\in Q^{V^{\prime}} such that each dTV(μ,𝒀(i))ϵ(n)d_{\mathrm{TV}}\left({\mu_{\mathcal{I}^{\prime}}},{\bm{Y}^{(i)}}\right)\leq\epsilon(n^{\prime}) and output the difference between two sample sequences.

  • Update the estimator. Given the difference between two sample sequences 𝑿(1),𝑿(2),,𝑿(N(n))\bm{X}^{(1)},\bm{X}^{(2)},\ldots,\bm{X}^{(N(n))} and 𝒀(1),𝒀(2),,𝒀(N(n))\bm{Y}^{(1)},\bm{Y}^{(2)},\ldots,\bm{Y}^{(N(n^{\prime}))}, update 𝜽^()\hat{\bm{\theta}}(\mathcal{I}) to 𝜽^()=𝜽(𝒀(1),𝒀(2),,𝒀(N(n)))\hat{\bm{\theta}}(\mathcal{I}^{\prime})=\mathcal{E}_{\bm{\theta}}(\bm{Y}^{(1)},\bm{Y}^{(2)},\ldots,\bm{Y}^{(N(n^{\prime}))}) using the black-box algorithm in Definition 2.3.

Obviously, 𝜽^()\hat{\bm{\theta}}(\mathcal{I}^{\prime}) is an (N,ϵ)(N,\epsilon)-estimator for 𝜽()\bm{\theta}(\mathcal{I}^{\prime}).
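
As a toy illustration of the second step, suppose the estimator is simply the empirical mean of a statistic f over the N samples (a special case of the black-box estimator of Definition 2.3). The estimator can then be refreshed by touching only the changed samples reported by the dynamic sampling algorithm, which is why its update cost is governed by the size of the difference; the diff format and the function name below are ours.

def update_mean_estimator(total, N, diff, f):
    # total: current sum of f over all N samples; f: the statistic being averaged;
    # diff: list of (index, old_sample, new_sample) triples describing the changed samples.
    for _, old, new in diff:
        total += f(new) - f(old)          # only the modified samples are re-evaluated
    return total, total / N               # updated running sum and updated estimate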

The sample sequence is maintained and updated by the dynamic sampling algorithm in Theorem 6.1. By Theorem 6.1, the space cost for maintaining the sample sequence is O(nN(n)logn)O\left(nN(n)\log n\right) memory words, each of O(logn)O(\log n) bits. By following the proof of Theorem 6.1, it is easy to verify that the expected time cost for each update is O(Δ2LN(n)log3n+Δn)O\left(\Delta^{2}LN(n)\log^{3}n+\Delta n\right).

The estimator is maintained and updated by the black-box algorithm in Definition 2.3. By Lemma 6.19, we have N(n)poly(n)N(n)\leq\mathrm{poly}(n). Combining this with Definition 2.3, the space cost for maintaining the estimator is (nN(n)+K)polylog(n)(n\cdot N(n)+K)\mathrm{polylog}(n) bits. Let 𝒟\mathcal{D} be the size of the difference between the two sample sequences, as defined in (3). We can follow the proof of Theorem 6.1 to bound the expectation of 𝒟\mathcal{D}. Let T=nδlognϵ(n)T=\left\lceil\frac{n}{\delta}\log\frac{n}{\epsilon(n)}\right\rceil and T=nδlognϵ(n)T^{\prime}=\left\lceil\frac{n^{\prime}}{\delta}\log\frac{n^{\prime}}{\epsilon(n^{\prime})}\right\rceil. Since |nn|L=o(n)\left|n-n^{\prime}\right|\leq L=o(n), we have |TT|=O(Llogn)\left|T-T^{\prime}\right|=O(L\log n) (due to Lemma 6.17). Combining (39), (45) and (7) yields

𝔼[𝒟]\displaystyle\mathbb{E}\left[{\mathcal{D}}\right] =|N(n)N(n)|max{n,n}+O(L+|TT|)N(n)=O(LN(n)logn),\displaystyle=\left|N(n)-N(n^{\prime})\right|\cdot\max\{n,n^{\prime}\}+O\left(L+\left|T-T^{\prime}\right|\right)\cdot N(n)=O(LN(n)\log n),

where the last equality holds because |N(n)N(n)|max{n,n}=O(LN(n))\left|N(n)-N(n^{\prime})\right|\cdot\max\{n,n^{\prime}\}=O\left(LN(n)\right) (due to Lemma 6.19). Combining this with Definition 2.3, the expected time cost for updating the estimator is LN(n)polylog(n)LN(n)\mathrm{polylog}(n).

In summary, our dynamic inference algorithm maintains an estimator for the current MRF instance \mathcal{I}, using extra O~(nN(n)+K)\widetilde{O}\left(nN(n)+K\right) memory words, each of O(logn)O(\log n) bits, such that when \mathcal{I} is updated to \mathcal{I}^{\prime}, the algorithm updates the estimator within expected time cost

𝔼[T𝖼𝗈𝗌𝗍]\displaystyle\mathbb{E}\left[{T_{\mathsf{cost}}}\right] =𝔼[T𝗌𝖺𝗆𝗉𝗅𝖾]+𝔼[T𝖾𝗌𝗍𝗂𝗆𝖺𝗍𝗈𝗋]\displaystyle=\mathbb{E}\left[{T_{\mathsf{sample}}}\right]+\mathbb{E}\left[{T_{\mathsf{estimator}}}\right]
=O(Δ2LN(n)log3n+Δn)+LN(n)polylog(n)\displaystyle=O\left(\Delta^{2}LN(n)\log^{3}n+\Delta n\right)+LN(n)\mathrm{polylog}(n)
=O~(Δ2LN(n)+Δn).\displaystyle=\widetilde{O}\left(\Delta^{2}LN(n)+\Delta n\right).

8.2. Dynamic inference on specific models

Applying our dynamic inference algorithm to the Ising model, qq-coloring, and the hardcore model yields the following result.

Theorem 8.1.

There exist dynamic inference algorithms as stated in Theorem 3.2 with the same space cost O~(nN(n)+K)\widetilde{O}\left(nN(n)+K\right), and expected time cost O~(Δ2LN(n)+Δn)\widetilde{O}\left(\Delta^{2}LN(n)+\Delta n\right) for each update, if the input instance \mathcal{I} with nn vertices and the updated instance \mathcal{I}^{\prime} with d(,)L=o(n)d(\mathcal{I},\mathcal{I}^{\prime})\leq L=o(n) are both:

  • Ising models with temperature β\beta and arbitrary local fields where exp(2|β|)12δΔ+1\exp(-2|\beta|)\geq 1-\frac{2-\delta}{\Delta+1};

  • proper qq-colorings with q(2+δ)Δq\geq(2+\delta)\Delta;

  • hardcore models with fugacity λ2δΔ2\lambda\leq\frac{2-\delta}{\Delta-2}, but with an alternative time cost for each update

    O~(Δ3LN(n)+Δn),\displaystyle\widetilde{O}\left(\Delta^{3}LN(n)+\Delta n\right),

where δ>0\delta>0 is a constant, Δ=max{ΔG,ΔG}\Delta=\max\{\Delta_{G},\Delta_{G^{\prime}}\}, ΔG\Delta_{G} and ΔG\Delta_{G^{\prime}} denote the maximum degree of the input graph and updated graph respectively.

With the dynamic sampling algorithms in Theorem 7.1, Theorem 8.1 can be proved by going through the same proof as in Section 8.1.

9. Conclusion

In this paper we study the probabilistic inference problem in a graphical model when the model itself is changing dynamically with time. We allow non-local updates, so that two consecutive graphical models may differ everywhere as long as the total amount of their difference is bounded. This general setting covers many typical applications. We give a sampling-based dynamic inference algorithm that efficiently maintains an inference solution against such dynamic inputs. The algorithm significantly improves the time cost compared to the static sampling-based inference algorithm.

Our algorithm generically reduces the dynamic inference problem to the dynamic sampling problem. Our main technical contribution is a dynamic Gibbs sampling algorithm that maintains random samples for graphical models dynamically changed by non-local updates. This technique extends to all single-site dynamics, which gives us a systematic approach for transforming classic MCMC samplers on static inputs into sampling and inference algorithms in the dynamic setting. Our dynamic algorithms are efficient as long as the one-step optimal coupling exhibits step-wise decay, a key property that has been widely used to support efficient MCMC sampling in the classic static setting and is captured by the Dobrushin-Shlosman condition.

Our result is the first to show the possibility of efficient probabilistic inference in dynamically changing graphical models, especially when the graphical models are changed by non-local updates. Our dynamic inference algorithm has the potential to speed up iterative algorithms for learning graphical models, which deserves further theoretical and experimental research. In this paper, we focus on discrete graphical models and sampling-based inference algorithms. Important future directions include considering more general distributions and dynamic algorithms based on other inference techniques.

References

  • ADK+ [16] Ittai Abraham, David Durfee, Ioannis Koutis, Sebastian Krinninger, and Richard Peng. On fully dynamic graph sparsifiers. In FOCS, 2016.
  • AQ+ [17] Osvaldo Anacleto, Catriona Queen, et al. Dynamic chain graph models for time series network data. Bayesian Anal., 12(2):491–509, 2017.
  • BC [16] Aaron Bernstein and Shiri Chechik. Deterministic decremental single source shortest paths: beyond the o(mn)o(mn) bound. In STOC, 2016.
  • BD [97] Russ Bubley and Martin Dyer. Path coupling: A technique for proving rapid mixing in Markov chains. In FOCS, 1997.
  • CLRS [09] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to algorithms. MIT press, 2009.
  • CW [07] Carlos M. Carvalho and Mike West. Dynamic matrix-variate graphical models. Bayesian Anal., 2(1):69–97, 2007.
  • DG [00] Martin Dyer and Catherine Greenhill. On Markov chains for independent sets. J. Algorithms, 35(1):17–49, 2000.
  • DGGP [18] David Durfee, Yu Gao, Gramoz Goranci, and Richard Peng. Fully dynamic effective resistances. arXiv preprint arXiv:1804.04038, 2018.
  • DGGP [19] David Durfee, Yu Gao, Gramoz Goranci, and Richard Peng. Fully dynamic spectral vertex sparsifiers and applications. In STOC, 2019.
  • DGJ [08] Martin Dyer, Leslie Ann Goldberg, and Mark Jerrum. Dobrushin conditions and systematic scan. Combin. Probab. Comput., 17(6):761–779, 2008.
  • [11] Roland L Dobrushin and Senya B Shlosman. Completely analytical Gibbs fields. In Statistical Physics and Dynamical Systems, pages 371–403. Springer, 1985.
  • [12] Roland Lvovich Dobrushin and Senya B Shlosman. Constructive criterion for the uniqueness of Gibbs field. In Statistical Physics and Dynamical Systems, pages 347–370. Springer, 1985.
  • DS [87] RL Dobrushin and SB Shlosman. Completely analytical interactions: constructive description. J. Statist. Phys., 46(5-6):983–1014, 1987.
  • DSOR [16] Christopher De Sa, Kunle Olukotun, and Christopher Ré. Ensuring rapid mixing and low bias for asynchronous Gibbs sampling. In ICML, 2016.
  • FG [19] Sebastian Forster and Gramoz Goranci. Dynamic low-stretch trees via dynamic low-diameter decompositions. In STOC, pages 377–388, 2019.
  • FVY [19] Weiming Feng, Nisheeth K Vishnoi, and Yitong Yin. Dynamic sampling from graphical models. In STOC, 2019.
  • GHP [18] Gramoz Goranci, Monika Henzinger, and Pan Peng. Dynamic Effective Resistances and Approximate Schur Complement on Separable Graphs. In ESA, volume 112, 2018.
  • GŠV [15] Andreas Galanis, Daniel Štefankovič, and Eric Vigoda. Inapproximability for antiferromagnetic spin systems in the tree nonuniqueness region. J. ACM, 62(6):50, 2015.
  • GŠV [16] Andreas Galanis, Daniel Štefankovič, and Eric Vigoda. Inapproximability of the partition function for the antiferromagnetic Ising and hard-core models. Combin. Probab. Comput., 25(04):500–559, 2016.
  • Hay [06] Thomas P Hayes. A simple condition implying rapid mixing of single-site dynamics on spin systems. In FOCS, 2006.
  • Hin [12] Geoffrey E Hinton. A practical guide to training restricted boltzmann machines. In Neural Networks: Tricks of the Trade, pages 599–619. Springer, 2012.
  • HKN [14] Monika Henzinger, Sebastian Krinninger, and Danupon Nanongkai. Decremental single-source shortest paths on undirected graphs in near-linear total update time. In FOCS, 2014.
  • HKN [16] Monika Henzinger, Sebastian Krinninger, and Danupon Nanongkai. Dynamic approximate all-pairs shortest paths: Breaking the O(mn)O(mn) barrier and derandomization. SIAM J. Comput., 45(3):947–1006, 2016.
  • Jer [95] Mark Jerrum. A very simple algorithm for estimating the number of kk-colorings of a low-degree graph. Random Structures & Algorithms, 7(2):157–165, 1995.
  • JVV [86] Mark Jerrum, Leslie G. Valiant, and Vijay V. Vazirani. Random generation of combinatorial structures from a uniform distribution. Theoret. Comput. Sci., 43:169–188, 1986.
  • KFB [09] Daphne Koller, Nir Friedman, and Francis Bach. Probabilistic graphical models: principles and techniques. MIT press, 2009.
  • LMV [19] Holden Lee, Oren Mangoubi, and Nisheeth Vishnoi. Online sampling from log-concave distributions. In NIPS, 2019.
  • LP [17] David A Levin and Yuval Peres. Markov chains and mixing times. American Mathematical Soc., 2017.
  • LV [99] Michael Luby and Eric Vigoda. Fast convergence of the Glauber dynamics for sampling independent sets. Random Structures & Algorithms, 15(3-4):229–241, 1999.
  • MM [09] Marc Mezard and Andrea Montanari. Information, physics, and computation. Oxford University Press, 2009.
  • NR [17] Hariharan Narayanan and Alexander Rakhlin. Efficient sampling from time-varying log-concave distributions. J. Mach. Learn. Res., 18(1):4017–4045, 2017.
  • NSWN [17] Danupon Nanongkai, Thatchaphol Saranurak, and Christian Wulff-Nilsen. Dynamic minimum spanning forest with subpolynomial worst-case update time. In FOCS, 2017.
  • QS [93] Catriona M. Queen and Jim Q. Smith. Multiregression dynamic models. J. Roy. Statist. Soc. Ser. B, 55(4):849–870, 1993.
  • RKD+ [19] Cedric Renggli, Bojan Karlaš, Bolin Ding, Feng Liu, Kevin Schawinski, Wentao Wu, and Ce Zhang. Continuous integration of machine learning models: A rigorous yet practical treatment. In SysML, 2019.
  • ŠVV [09] Daniel Štefankovič, Santosh Vempala, and Eric Vigoda. Adaptive simulated annealing: A near-optimal connection between sampling and counting. J. ACM, 56(3):18, 2009.
  • SWA [09] Padhraic Smyth, Max Welling, and Arthur U Asuncion. Asynchronous distributed learning of topic models. In NIPS, 2009.
  • Vig [99] Eric Vigoda. Fast convergence of the Glauber dynamics for sampling independent sets: Part II. Technical Report TR-99-003, International Computer Science Institute, 1999.
  • WJ [08] Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference. Now Publishers Inc, 2008.
  • WN [17] Christian Wulff-Nilsen. Fully-dynamic minimum spanning forest with improved worst-case update time. In STOC, 2017.