
A Geometric Reduction Approach for
Identity Testing of Reversible Markov Chains

Geoffrey Wolfer (email: geoffrey.wolfer@riken.jp), RIKEN Center for AI Project. The author is supported by the Special Postdoctoral Researcher Program (SPDR) of RIKEN.

Shun Watanabe (email: shunwata@cc.tuat.ac.jp), Department of Computer and Information Sciences, Tokyo University of Agriculture and Technology. The author is supported in part by Japan Society for the Promotion of Science KAKENHI under Grant 20H02144.
Abstract

We consider the problem of testing the identity of a reversible Markov chain against a reference from a single trajectory of observations. Employing the recently introduced notion of a lumping-congruent Markov embedding, we show that, at least in a mildly restricted setting, testing identity to a reversible chain reduces to testing identity to a symmetric chain over a larger state space, and we recover the state-of-the-art sample complexity for the problem.

Keywords— Information geometry; Irreducible Markov chain; Identity testing; Congruent embedding; Markov morphism; Lumpability.

1   Introduction

Uniformity testing is the flagship problem of distribution property testing. From $n$ independent observations sampled from an unknown distribution $\mu$ on a finite space $\mathcal{X}$, the goal is to distinguish between the two cases where $\mu$ is uniform and where $\mu$ is $\varepsilon$-far from being uniform with respect to some notion of distance. The complexity of this problem is known to be of the order of $\tilde{\Theta}(\sqrt{|\mathcal{X}|}/\varepsilon^{2})$ Paninski (2008), which compares favorably with the linear dependency in $|\mathcal{X}|$ required for estimating the distribution to precision $\varepsilon$ Waggoner (2015). Interestingly, the uniform distribution can be replaced with an arbitrary reference at the same statistical cost. In fact, Goldreich (2016) proved that the latter problem formally reduces to the former. Inspired by his approach, we seek and obtain a reduction result in the much less understood and more challenging Markovian setting.

Informal Markovian problem statement —

The scientist is given the full description of a reference transition matrix $\overline{P}$ and a single Markov chain (MC) $X_{1}^{n}$ sampled with respect to some unknown transition operator $P$ and arbitrary initial distribution. For fixed proximity parameter $\varepsilon>0$, the goal is to design an algorithm that distinguishes between the two cases $P=\overline{P}$ and $K(P,\overline{P})>\varepsilon$, with high probability, where $K$ is a contrast function between stochastic matrices.

Related work —

Under the contrast function (1) described in Section 2, and the hypothesis that $P$ and $\overline{P}$ are both irreducible and symmetric over a finite space $\mathcal{X}$, Daskalakis et al. (2018) constructed a tester with sample complexity $\tilde{\mathcal{O}}(|\mathcal{X}|/\varepsilon+h)$, where $h$ (Daskalakis et al., 2018, Definition 3) corresponds to some hitting property of the chain, together with a lower bound of $\Omega(|\mathcal{X}|/\varepsilon)$. In Cherapanamjeri and Bartlett (2019), a graph partitioning algorithm delivers, under the same symmetry assumption, a testing procedure with sample complexity $\mathcal{O}(|\mathcal{X}|/\varepsilon^{4})$, i.e. independent of hitting properties. More recently, Fried and Wolfer (2022) relaxed the symmetry requirement, replacing it with a more natural reversibility assumption. Their algorithm has a sample complexity of $\mathcal{O}(1/(\overline{\pi}_{\star}\varepsilon^{4}))$, where $\overline{\pi}_{\star}$ is the minimum stationary probability of $\overline{P}$, gracefully recovering Cherapanamjeri and Bartlett (2019) under symmetry. In parallel, Wolfer and Kontorovich (2020) started the research program of inspecting the problem under the infinity norm for matrices, and derived nearly minimax-optimal bounds.

Contribution —

We show how to mostly recover Fried and Wolfer (2022) under additional assumptions (see Section 3), with a technique based on a geometry-preserving embedding. We obtain a more economical proof than Fried and Wolfer (2022), who went through the process of re-deriving a graph partitioning algorithm for the reversible case. Furthermore, our approach, by its generality, is also applicable to related inference problems.

2   Preliminaries

We let $\mathcal{X},\mathcal{Y}$ be finite sets, and denote $\mathcal{P}(\mathcal{X})$ the set of all probability distributions over $\mathcal{X}$. All vectors are written as row vectors. For matrices $A,B$, $A\circ B$ is their Hadamard product and $\rho(A)$ is the spectral radius of $A$. For $n\in\mathbb{N}$, we use the compact notation $x_{1}^{n}=(x_{1},\dots,x_{n})$. $\mathcal{W}(\mathcal{X})$ is the set of all row-stochastic matrices over the state space $\mathcal{X}$, and $\pi$ is called a stationary distribution for $P\in\mathcal{W}(\mathcal{X})$ when $\pi P=\pi$.

Irreducibility and reversibility —

We denote $\mathcal{W}(\mathcal{X},\mathcal{D})$ the set of irreducible stochastic matrices over a strongly connected digraph $(\mathcal{X},\mathcal{D})$. When $P\in\mathcal{W}(\mathcal{X},\mathcal{D})$, $\pi$ is unique and we denote $\pi_{\star}=\min_{x\in\mathcal{X}}\pi(x)>0$. When $P$ satisfies the detailed-balance equation $\pi(x)P(x,x^{\prime})=\pi(x^{\prime})P(x^{\prime},x)$ for any $(x,x^{\prime})\in\mathcal{D}$, we say that $P$ is reversible.

Lumpability —

In contradistinction with the distribution setting, merging symbols in a Markov chain may break the Markov property. For $P\in\mathcal{W}(\mathcal{Y},\mathcal{E})$ and a surjective map $\kappa\colon\mathcal{Y}\to\mathcal{X}$ merging elements of $\mathcal{Y}$ together, we say that $P$ is $\kappa$-lumpable Kemeny and Snell (1983) when the resulting process still defines a MC. If so, the resulting transition matrix, which we denote $\kappa_{\star}P\in\mathcal{W}(\mathcal{X},\kappa_{2}(\mathcal{E}))$, can be found in (Kemeny and Snell, 1983, Theorem 6.3.2), with

\[
\kappa_{2}(\mathcal{E})\doteq\left\{(x,x^{\prime})\in\mathcal{X}^{2}\colon\exists(y,y^{\prime})\in\mathcal{E},\ (\kappa(y),\kappa(y^{\prime}))=(x,x^{\prime})\right\}.
\]
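To make the lumping operation concrete, here is a minimal sketch in Python with NumPy (the helper name and the toy matrix are ours, not taken from the references): it checks the Kemeny and Snell (1983) lumpability criterion, namely that all states within a block assign the same total probability to every block, and returns the lumped matrix $\kappa_{\star}P$ when the criterion holds.

```python
import numpy as np

def lump(P, kappa, num_blocks, atol=1e-10):
    """Check Kemeny-Snell lumpability of P w.r.t. the block map kappa
    and return the lumped transition matrix when it applies.

    P     : (m, m) row-stochastic matrix over Y
    kappa : length-m integer array, kappa[y] = block of state y
    """
    P = np.asarray(P, dtype=float)
    kappa = np.asarray(kappa)
    # block_mass[y, xp] = total probability of jumping from y into block xp
    block_mass = np.zeros((P.shape[0], num_blocks))
    for xp in range(num_blocks):
        block_mass[:, xp] = P[:, kappa == xp].sum(axis=1)
    lumped = np.zeros((num_blocks, num_blocks))
    for x in range(num_blocks):
        rows = block_mass[kappa == x]        # rows of states lumped into block x
        if not np.allclose(rows, rows[0], atol=atol):
            raise ValueError("P is not kappa-lumpable")
        lumped[x] = rows[0]                  # common value = (kappa_* P)(x, .)
    return lumped

# Toy example: a 3-state chain lumpable onto 2 macro-states, {0, 1} -> 0 and {2} -> 1.
P = np.array([[0.2, 0.3, 0.5],
              [0.4, 0.1, 0.5],   # states 0 and 1 both send mass 0.5 into each block
              [0.3, 0.3, 0.4]])
print(lump(P, kappa=[0, 0, 1], num_blocks=2))
```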

Contrast function —

We consider the following notion of discrepancy between two stochastic matrices $P,P^{\prime}\in\mathcal{W}(\mathcal{X})$,

\[
K(P,P^{\prime})\doteq 1-\rho\left(P^{\circ 1/2}\circ P^{\prime\circ 1/2}\right).\tag{1}
\]

Although $K$ first appeared in Daskalakis et al. (2018) in the context of MC testing, its inception can be traced back to Kazakos (1978). It is instructive to observe that $K$ vanishes on chains that share an identical component and does not satisfy the triangle inequality for reducible matrices, hence it is not a proper metric on $\mathcal{W}(\mathcal{X})$ (Daskalakis et al., 2018, p.10, footnote 13). Some additional properties of $K$ of possible interest are listed in (Fried and Wolfer, 2022, Section 7).
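For intuition, the contrast (1) is straightforward to evaluate numerically. The sketch below (Python with NumPy; the helper is our own) computes $K$ as one minus the spectral radius of the Hadamard product of the entry-wise square roots, and checks that identical chains sit at contrast zero.

```python
import numpy as np

def contrast(P, Pp):
    """K(P, P') = 1 - rho(P^{o1/2} o P'^{o1/2}), cf. Eq. (1)."""
    M = np.sqrt(np.asarray(P) * np.asarray(Pp))         # Hadamard product of square roots
    return 1.0 - np.max(np.abs(np.linalg.eigvals(M)))   # 1 minus the spectral radius

P  = np.array([[0.5, 0.5], [0.5, 0.5]])
Pp = np.array([[0.9, 0.1], [0.1, 0.9]])
print(contrast(P, P))    # 0.0: identical chains are at contrast zero
print(contrast(P, Pp))   # > 0: the two chains are distinguishable
```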

Reduction approach for identity testing of distributions —

Problem reduction is ubiquitous in the property testing literature. Our work takes inspiration from Goldreich (2016), who introduced two so-called “stochastic filters” in order to show how, in the distribution setting, identity testing is reducible to uniformity testing, thereby recovering the known complexity of $\mathcal{O}(\sqrt{|\mathcal{X}|}/\varepsilon^{2})$ obtained more directly by Valiant and Valiant (2017). Other notable works include Diakonikolas and Kane (2016), who reduced a collection of distribution testing problems to $\ell_{2}$-identity testing.

3   The restricted identity testing problem

We cast our problem in the minimax framework by defining the risk $\mathcal{R}_{n}(\varepsilon)$,

\[
\mathcal{R}_{n}(\varepsilon)\doteq\min_{\phi\colon\mathcal{X}^{n}\to\{0,1\}}\left\{\mathbb{P}_{X_{1}^{n}\sim\overline{\pi},\overline{P}}\left(\phi(X_{1}^{n})=1\right)+\max_{P\in\mathcal{H}_{1}(\varepsilon)}\mathbb{P}_{X_{1}^{n}\sim\pi,P}\left(\phi(X_{1}^{n})=0\right)\right\},
\]

the sample complexity $n_{\star}(\varepsilon,\delta)\doteq\min\left\{n\in\mathbb{N}\colon\mathcal{R}_{n}(\varepsilon)<\delta\right\}$, and where

\[
\mathcal{H}_{0}=\left\{\overline{P}\right\},\qquad\mathcal{H}_{1}(\varepsilon)=\left\{P\in\mathcal{V}_{\mathsf{test}}\colon K(P,\overline{P})>\varepsilon\right\},
\]

with $\mathcal{H}_{0},\mathcal{H}_{1}(\varepsilon)\subset\mathcal{V}_{\mathsf{test}}$, the subset of stochastic matrices under consideration. We note the presence of an exclusion region, and that the problem can be regarded as a Bayesian testing problem with a prior which is uniform over $\mathcal{H}_{0}$ and $\mathcal{H}_{1}(\varepsilon)$ and vanishes on the exclusion region. We briefly recall the assumptions made in Fried and Wolfer (2022). For $(P,\overline{P})\in(\mathcal{H}_{1}(\varepsilon),\mathcal{H}_{0})$:

(A.1) $P$ and $\overline{P}$ are irreducible and reversible.

(A.2) $P$ and $\overline{P}$ share the same stationary distribution $\overline{\pi}=\pi$. (Fried and Wolfer (2022) in fact slightly loosen this requirement, from exactly matching stationary distributions to closeness in the sense that $\left\|\pi/\overline{\pi}-1\right\|_{\infty}<\varepsilon$.)

The following additional assumptions will make our approach readily applicable.

(B.1) $P$ and $\overline{P}$ share the same connection graph, $P,\overline{P}\in\mathcal{W}(\mathcal{X},\mathcal{D})$.

(B.2) The common stationary distribution is rational, $\overline{\pi}\in\mathbb{Q}^{\mathcal{X}}$.

Remark 3.1.

A sufficient condition for $\overline{\pi}\in\mathbb{Q}^{\mathcal{X}}$ is $\overline{P}(x,x^{\prime})\in\mathbb{Q}$ for any $x,x^{\prime}\in\mathcal{X}$.

Without loss of generality, we express $\overline{\pi}=\left(p_{1},p_{2},\dots,p_{\left|\mathcal{X}\right|}\right)/\Delta$, for some $\Delta\in\mathbb{N}$ and $p\in\mathbb{N}^{\left|\mathcal{X}\right|}$ with $0<p_{1}\leq p_{2}\leq\dots\leq p_{\left|\mathcal{X}\right|}<\Delta$. We denote by $\mathcal{V}_{\mathsf{test}}$ the set of stochastic matrices that satisfy assumptions $(A.1)$, $(A.2)$, $(B.1)$ and $(B.2)$ for some fixed and positive $\overline{\pi}\in\mathcal{P}(\mathcal{X})$. The theorem stated below provides an upper bound on the sample complexity $n_{\star}(\varepsilon,\delta)$ of order $\widetilde{\mathcal{O}}(1/(\overline{\pi}_{\star}\varepsilon^{4}))$.

Theorem 3.1.

Let $\varepsilon,\delta\in(0,1)$ and let $\overline{P}\in\mathcal{V}_{\mathsf{test}}\subset\mathcal{W}(\mathcal{X},\mathcal{D})$. There exists a testing procedure $\phi\colon\mathcal{X}^{n}\to\left\{0,1\right\}$, with $n=\tilde{\mathcal{O}}(1/(\overline{\pi}_{\star}\varepsilon^{4}))$, such that the following holds. For any $P\in\mathcal{V}_{\mathsf{test}}$ and $X_{1}^{n}$ sampled according to $P$, $\phi$ distinguishes between the cases $P=\overline{P}$ and $K(P,\overline{P})>\varepsilon$ with probability at least $1-\delta$.

Proof sketch.

Our strategy consists of two steps. First, we employ a transformation on Markov chains termed Markov embedding Wolfer and Watanabe (2022) in order to symmetrize both the reference chain (algebraically, by computing the new transition matrix) and the unknown chain (operationally, by simulating an embedded trajectory). Crucially, our transformation preserves the contrast between two chains and their embedded versions (Lemma 5.2). Second, we invoke the known tester of Cherapanamjeri and Bartlett (2019) for symmetric chains as a black box and report its output. The proof is deferred to Section 6. ∎

Remark 3.2.

Our reduction could also be applied in the robust testing setting, where the two competing hypotheses are $K(P,\overline{P})<\varepsilon/2$ and $K(P,\overline{P})>\varepsilon$. (Note that even in the symmetric setting, the robust problem remains open.)

4   Symmetrization of reversible Markov chains

Information geometry —

Our construction and notation follow Nagaoka (2005), who established the dually-flat structure $(\mathcal{W}(\mathcal{X},\mathcal{D}),\mathfrak{g},\nabla^{(e)},\nabla^{(m)})$ on the space of irreducible stochastic matrices. Writing $P_{\theta}\in\mathcal{W}(\mathcal{X},\mathcal{D})$ for the transition matrix at coordinates $\theta\in\Theta\subset\mathbb{R}^{d}$, with $d$ the dimension of the manifold, and with the shorthand $\partial_{i}\doteq\partial/\partial\theta^{i}$, recall that the Fisher metric is expressed in the chart-induced basis $(\partial_{i})_{i\in[d]}$ as

\[
\mathfrak{g}_{ij}(\theta)=\sum_{(x,x^{\prime})\in\mathcal{D}}\pi_{\theta}(x)P_{\theta}(x,x^{\prime})\,\partial_{i}\log P_{\theta}(x,x^{\prime})\,\partial_{j}\log P_{\theta}(x,x^{\prime}),\quad\text{for }i,j\in[d].
\]

Embeddings —

In Wolfer and Watanabe (2022), the following notion of an embedding for stochastic matrices is proposed.

Definition 4.1 (Markov embedding for Markov chains Wolfer and Watanabe (2022)).

We call Markov embedding a map $\Lambda_{\star}\colon\mathcal{W}(\mathcal{X},\mathcal{D})\to\mathcal{W}(\mathcal{Y},\mathcal{E}),\ P\mapsto\Lambda_{\star}P$, such that for any $(y,y^{\prime})\in\mathcal{E}$,

\[
\Lambda_{\star}P(y,y^{\prime})=P(\kappa(y),\kappa(y^{\prime}))\Lambda(y,y^{\prime}),
\]

and where $\kappa$ and $\Lambda$ satisfy the following requirements:

(i) $\kappa\colon\mathcal{Y}\to\mathcal{X}$ is a lumping function for which $\kappa_{2}(\mathcal{E})=\mathcal{D}$.

(ii) $\Lambda$ is a positive function over the edge set, $\Lambda\colon\mathcal{E}\to\mathbb{R}_{+}$.

(iii) Writing $\bigcup_{x\in\mathcal{X}}\mathcal{S}_{x}=\mathcal{Y}$ for the partition defined by $\kappa$, $\Lambda$ is such that for any $y\in\mathcal{Y}$ and $x^{\prime}\in\mathcal{X}$, $(\kappa(y),x^{\prime})\in\mathcal{D}\implies(\Lambda(y,y^{\prime}))_{y^{\prime}\in\mathcal{S}_{x^{\prime}}}\in\mathcal{P}(\mathcal{S}_{x^{\prime}})$.

The above embeddings are characterized as the linear maps over lumpable matrices that satisfy some monotonicity requirements and are congruent with respect to the lumping operation (Wolfer and Watanabe, 2022, Theorem 3.1). When, for any $y,y^{\prime}\in\mathcal{Y}$, it additionally holds that $\Lambda(y,y^{\prime})=\Lambda(y^{\prime})\,\delta\left[(\kappa(y),\kappa(y^{\prime}))\in\mathcal{D}\right]$, the embedding $\Lambda_{\star}$ is called memoryless (Wolfer and Watanabe, 2022, Section 3.4.2) and is e/m-geodesic affine (Wolfer and Watanabe, 2022, Theorem 3.2, Lemma 3.6), preserving both exponential and mixture families of MC.

Given $\overline{\pi}$ and $\Delta$ as defined in Section 3, from (Wolfer and Watanabe, 2022, Corollary 3.3), there exists a lumping function $\kappa\colon[\Delta]\to\mathcal{X}$, and a memoryless embedding $\sigma^{\overline{\pi}}_{\star}\colon\mathcal{W}(\mathcal{X},\mathcal{D})\to\mathcal{W}([\Delta],\mathcal{E})$ with $\mathcal{E}=\left\{(y,y^{\prime})\in[\Delta]^{2}\colon(\kappa(y),\kappa(y^{\prime}))\in\mathcal{D}\right\}$, such that $\sigma^{\overline{\pi}}_{\star}\overline{P}$ is symmetric. Furthermore, identifying $\mathcal{X}=\left\{1,2,\dots,\left|\mathcal{X}\right|\right\}$, its existence is given constructively by

\[
\kappa(j)=\operatorname*{arg\,min}_{1\leq i\leq\left|\mathcal{X}\right|}\left\{\sum_{k=1}^{i}p_{k}\geq j\right\},\qquad\sigma^{\overline{\pi}}(j)=p^{-1}_{\kappa(j)},\qquad\text{for any }1\leq j\leq\Delta.
\]
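As a sanity check of the construction above, the following sketch (Python with NumPy; helper names and the toy reference chain are ours, with 0-indexed states) builds $\kappa$ and $\sigma^{\overline{\pi}}$ from $p$ and $\Delta$, forms the embedded matrix $\sigma^{\overline{\pi}}_{\star}\overline{P}(y,y^{\prime})=\overline{P}(\kappa(y),\kappa(y^{\prime}))\,\sigma^{\overline{\pi}}(y^{\prime})$ as in Definition 4.1, and verifies that the result is a symmetric stochastic matrix when $\overline{P}$ is reversible with stationary distribution $p/\Delta$.

```python
import numpy as np

def build_symmetrizer(p):
    """kappa maps [Delta] onto X by allotting p[x] copies to state x (0-indexed);
    sigma(j) = 1 / p[kappa(j)] is the memoryless weight."""
    kappa = np.repeat(np.arange(len(p)), p)
    sigma = 1.0 / np.asarray(p, dtype=float)[kappa]
    return kappa, sigma

def embed(P, kappa, sigma):
    """sigma_* P (y, y') = P(kappa(y), kappa(y')) * sigma(y'), cf. Definition 4.1."""
    return np.asarray(P)[np.ix_(kappa, kappa)] * sigma[None, :]

# Reversible reference on X = {0, 1, 2} with stationary distribution p / Delta.
p, Delta = np.array([1, 2, 3]), 6                 # pi_bar = (1, 2, 3) / 6
P_bar = np.array([[0.0, 0.4, 0.6],
                  [0.2, 0.5, 0.3],
                  [0.2, 0.2, 0.6]])               # satisfies detailed balance w.r.t. p / Delta
kappa, sigma = build_symmetrizer(p)
S = embed(P_bar, kappa, sigma)                    # 6 x 6 embedded matrix
print(np.allclose(S, S.T), np.allclose(S.sum(axis=1), 1.0))  # True True
```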

As a consequence, we have both:

1. The expression of $\sigma^{\overline{\pi}}_{\star}\overline{P}$, following the algebraic manipulations in Definition 4.1.

2. A randomized algorithm Wolfer and Watanabe (2022) to simulate trajectories from $\sigma^{\overline{\pi}}_{\star}P$ out of trajectories from $P$ (see (Wolfer and Watanabe, 2022, Section 3.1)); a sketch is given below.
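Operationally, the simulation step amounts to drawing, independently at each time $t$, a representative $y_{t}$ of the observed state $x_{t}$ inside the block $\mathcal{S}_{x_{t}}$ with probability $\sigma^{\overline{\pi}}(y_{t})$, which for the symmetrizer is simply a uniform choice among the $p_{x_{t}}$ copies of $x_{t}$. A minimal sketch under these assumptions (our own code, not the implementation of the cited works):

```python
import numpy as np

def embed_trajectory(x_traj, p, rng=None):
    """Simulate a trajectory of the embedded chain sigma_* P from a trajectory of P.

    At each time t, a representative of x_t is drawn uniformly among the p[x_t]
    states of the block S_{x_t} inside [Delta], independently of the past.
    States of [Delta] are labelled so that block x occupies offsets
    [sum(p[:x]), sum(p[:x]) + p[x]).
    """
    rng = np.random.default_rng(rng)
    p = np.asarray(p)
    offsets = np.concatenate(([0], np.cumsum(p)[:-1]))   # first label of each block
    x_traj = np.asarray(x_traj)
    return offsets[x_traj] + rng.integers(0, p[x_traj])  # uniform pick inside each block

# Example: embed a path of a 3-state chain into [Delta] = [6] with p = (1, 2, 3).
x_traj = np.array([2, 1, 2, 2, 0, 1])
print(embed_trajectory(x_traj, p=[1, 2, 3], rng=0))
```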

5   Contrast preservation

It was established in (Wolfer and Watanabe, 2022, Lemma 3.1) that Markov embeddings preserve the Fisher information metric, the affine connections and the KL divergence between points. In this section, we show that memoryless embeddings, such as the symmetrizer $\sigma^{\overline{\pi}}_{\star}$ introduced in Section 4, also preserve the contrast function $K$. Our proof will rely on first showing that memoryless embeddings induce natural Markov morphisms Čencov (1978) from distributions over $\mathcal{X}^{n}$ to $\mathcal{Y}^{n}$.

Lemma 5.1.

Let $\kappa\colon\mathcal{Y}\to\mathcal{X}$ be a lumping function, and let

\[
L_{\star}\colon\mathcal{W}(\mathcal{X},\mathcal{D})\rightarrow\mathcal{W}(\mathcal{Y},\mathcal{E})
\]

be a $\kappa$-congruent memoryless Markov embedding. For $P\in\mathcal{W}(\mathcal{X},\mathcal{D})$, let $Q^{n}\in\mathcal{P}(\mathcal{X}^{n})$ (resp. $\widetilde{Q}^{n}\in\mathcal{P}(\mathcal{Y}^{n})$) be the unique distribution over stationary paths of length $n$ induced from $P$ (resp. $L_{\star}P$). Then there exists a Markov morphism $M_{\star}\colon\mathcal{P}(\mathcal{X}^{n})\to\mathcal{P}(\mathcal{Y}^{n})$ such that $M_{\star}Q^{n}=\widetilde{Q}^{n}$.

Proof.

Let $\kappa_{n}\colon\mathcal{Y}^{n}\to\mathcal{X}^{n}$ be the lumping function on blocks induced from $\kappa$,

\[
\forall y_{1}^{n}\in\mathcal{Y}^{n},\quad\kappa_{n}(y_{1}^{n})=(\kappa(y_{t}))_{1\leq t\leq n}\in\mathcal{X}^{n},
\]

and introduce

\[
\mathcal{Y}^{n}=\bigcup_{x_{1}^{n}\in\mathcal{X}^{n}}\mathcal{S}_{x_{1}^{n}},\quad\text{with }\mathcal{S}_{x_{1}^{n}}=\left\{y_{1}^{n}\in\mathcal{Y}^{n}\colon\kappa_{n}(y_{1}^{n})=x_{1}^{n}\right\},
\]

the partition associated to $\kappa_{n}$. For any realizable path $x_{1}^{n}$, i.e. such that $Q^{n}(x_{1}^{n})>0$, we define a distribution $M^{x_{1}^{n}}\in\mathcal{P}(\mathcal{Y}^{n})$ concentrated on $\mathcal{S}_{x_{1}^{n}}$, and such that for any $y_{1}^{n}\in\mathcal{S}_{x_{1}^{n}}$, $M^{x_{1}^{n}}(y_{1}^{n})=\prod_{t=1}^{n}L(y_{t})$. Non-negativity of $M^{x_{1}^{n}}$ is immediate, and

\[
\sum_{y_{1}^{n}\in\mathcal{Y}^{n}}M^{x_{1}^{n}}(y_{1}^{n})=\sum_{y_{1}^{n}\in\mathcal{Y}^{n}\colon\kappa_{n}(y_{1}^{n})=x_{1}^{n}}M^{x_{1}^{n}}(y_{1}^{n})=\prod_{t=1}^{n}\left(\sum_{y_{t}\in\mathcal{S}_{x_{t}}}L(y_{t})\right)=1,
\]

thus $M^{x_{1}^{n}}$ is well-defined. Furthermore, for $y_{1}^{n}\in\mathcal{Y}^{n}$, it holds that

\[
\begin{split}
\widetilde{Q}^{n}(y_{1}^{n})&=L_{\star}\pi(y_{1})\prod_{t=1}^{n-1}L_{\star}P(y_{t},y_{t+1})\stackrel{(\spadesuit)}{=}\pi(\kappa(y_{1}))L(y_{1})\prod_{t=1}^{n-1}P(\kappa(y_{t}),\kappa(y_{t+1}))L(y_{t+1})\\
&=Q^{n}(\kappa(y_{1}),\dots,\kappa(y_{n}))\prod_{t=1}^{n}L(y_{t})=Q^{n}(\kappa_{n}(y_{1}^{n}))\prod_{t=1}^{n}L(y_{t})\\
&=\sum_{x_{1}^{n}\in\mathcal{X}^{n}}Q^{n}(x_{1}^{n})M^{x_{1}^{n}}(y_{1}^{n})=M_{\star}Q^{n}(y_{1}^{n}),
\end{split}
\]

where $(\spadesuit)$ stems from (Wolfer and Watanabe, 2022, Lemma 3.5), whence our claim holds. ∎

Lemma 5.1 essentially states that the following diagram commutes

\[
\begin{array}{ccc}
\mathcal{W}(\mathcal{X},\mathcal{D}) & \xrightarrow{\ L_{\star}\ } & L_{\star}\mathcal{W}(\mathcal{X},\mathcal{D})\\
\downarrow & & \downarrow\\
\mathcal{Q}^{n}_{\mathcal{W}(\mathcal{X},\mathcal{D})} & \xrightarrow{\ M_{\star}\ } & \mathcal{Q}^{n}_{L_{\star}\mathcal{W}(\mathcal{X},\mathcal{D})},
\end{array}
\]

for some Markov morphism $M_{\star}$, and where we denoted by $\mathcal{Q}^{n}_{\mathcal{W}(\mathcal{X},\mathcal{D})}\subset\mathcal{P}(\mathcal{X}^{n})$ the set of all distributions over paths of length $n$ induced from the family $\mathcal{W}(\mathcal{X},\mathcal{D})$ (the vertical arrows map a transition matrix to its induced distribution over stationary paths of length $n$). As a consequence, we can unambiguously write $L_{\star}Q^{n}\in\mathcal{Q}^{n}_{L_{\star}\mathcal{W}(\mathcal{X},\mathcal{D})}$ for the distribution over stationary paths of length $n$ that pertains to $L_{\star}P$.

Lemma 5.2.

Let $L_{\star}\colon\mathcal{W}(\mathcal{X},\mathcal{D})\to\mathcal{W}(\mathcal{Y},\mathcal{E})$ be a memoryless embedding. Then, for any $P,\overline{P}\in\mathcal{W}(\mathcal{X},\mathcal{D})$,

\[
K(L_{\star}P,L_{\star}\overline{P})=K(P,\overline{P}).
\]
Proof.

We recall, for two distributions $\mu,\nu\in\mathcal{P}(\mathcal{X})$, the definition of $R_{1/2}$, the Rényi divergence of order $1/2$,

\[
R_{1/2}(\mu\|\nu)\doteq-2\log\left(\sum_{x\in\mathcal{X}}\sqrt{\mu(x)\nu(x)}\right),
\]

and note that $R_{1/2}$ is closely related to the Hellinger distance between $\mu$ and $\nu$. This definition extends to a divergence rate between stochastic processes $(X_{t})_{t\in\mathbb{N}},(X^{\prime}_{t})_{t\in\mathbb{N}}$ on $\mathcal{X}$ as follows,

\[
R_{1/2}\left((X_{t})_{t\in\mathbb{N}}\|(X^{\prime}_{t})_{t\in\mathbb{N}}\right)=\lim_{n\to\infty}\frac{1}{n}R_{1/2}\left(X_{1}^{n}\|X_{1}^{\prime n}\right),
\]

and in the irreducible time-homogeneous Markovian setting where $(X_{t})_{t\in\mathbb{N}},(X^{\prime}_{t})_{t\in\mathbb{N}}$ evolve according to transition matrices $P$ and $P^{\prime}$, the above reduces Rached et al. (2001) to

\[
R_{1/2}\left((X_{t})_{t\in\mathbb{N}}\|(X^{\prime}_{t})_{t\in\mathbb{N}}\right)=-2\log\rho(P^{\circ 1/2}\circ P^{\prime\circ 1/2})=-2\log(1-K(P,P^{\prime})).
\]

Reorganizing terms and plugging in the embedded stochastic matrices,

\[
K(L_{\star}P,L_{\star}\overline{P})=1-\exp\left(-\frac{1}{2}\lim_{n\to\infty}\frac{1}{n}R_{1/2}\left(L_{\star}Q^{n}\|L_{\star}\overline{Q}^{n}\right)\right),
\]

where $L_{\star}\overline{Q}^{n}$ is the distribution over stationary paths of length $n$ induced by the embedded $L_{\star}\overline{P}$. For any $n\in\mathbb{N}$, from Lemma 5.1 and information monotonicity of the Rényi divergence (which holds with equality here, since lumping the paths by $\kappa_{n}$ recovers the original path distributions), $R_{1/2}\left(L_{\star}Q^{n}\|L_{\star}\overline{Q}^{n}\right)=R_{1/2}\left(Q^{n}\|\overline{Q}^{n}\right)$, hence our claim. ∎
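The preservation property is also easy to illustrate numerically. The self-contained sketch below (Python with NumPy; the two chains are toy examples of ours, both reversible with respect to $p/\Delta$) compares the contrast of two chains before and after applying the symmetrizer; the two printed values coincide up to floating-point error, as predicted by Lemma 5.2.

```python
import numpy as np

def contrast(P, Pp):
    M = np.sqrt(P * Pp)                                 # Hadamard product of square roots
    return 1.0 - np.max(np.abs(np.linalg.eigvals(M)))   # 1 - spectral radius, Eq. (1)

def embed(P, p):
    kappa = np.repeat(np.arange(len(p)), p)             # lumping map [Delta] -> X
    sigma = 1.0 / np.asarray(p, float)[kappa]           # memoryless weights
    return P[np.ix_(kappa, kappa)] * sigma[None, :]

p = np.array([1, 2, 3])                                 # pi_bar = p / 6
P_bar = np.array([[0.0, 0.4, 0.6],
                  [0.2, 0.5, 0.3],
                  [0.2, 0.2, 0.6]])                     # reversible w.r.t. p / 6
P     = np.array([[0.0, 0.4, 0.6],
                  [0.2, 0.2, 0.6],
                  [0.2, 0.4, 0.4]])                     # another chain reversible w.r.t. p / 6
print(contrast(P, P_bar), contrast(embed(P, p), embed(P_bar, p)))  # the two values coincide
```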

6   Proof of Theorem 3.1

We assume that $P$ and $\overline{P}$ are in $\mathcal{V}_{\mathsf{test}}$. We reduce the problem as follows. We construct $\sigma^{\overline{\pi}}_{\star}$, the symmetrizer defined in Section 4. (If we wish to test the identity of multiple chains against the same reference, this step only needs to be performed once.) We proceed to embed both the reference chain (using Definition 4.1) and the unknown trajectory (using the operational definition in (Wolfer and Watanabe, 2022, Section 3.1)). We invoke the tester of Cherapanamjeri and Bartlett (2019) as a black box, and report its answer.

Figure 1: Reduction of the testing problem by isometric embedding. The symmetrizer $\sigma_{\star}^{\overline{\pi}}$ maps the reversible family $\mathcal{W}_{\mathsf{rev}}(\mathcal{X},\mathcal{D})$ into the symmetric family $\mathcal{W}_{\mathsf{sym}}([\Delta],\mathcal{E})$, sending $\overline{P}$ to $\sigma^{\overline{\pi}}_{\star}\overline{P}$, $\mathcal{H}_{1}(\varepsilon)$ to $\sigma_{\star}^{\overline{\pi}}\mathcal{H}_{1}(\varepsilon)$ and $\mathcal{V}_{\mathsf{test}}$ to $\sigma_{\star}^{\overline{\pi}}\mathcal{V}_{\mathsf{test}}$, while preserving the separation $\varepsilon$.

Completeness case.

It is immediate that $P=\overline{P}\implies\sigma^{\overline{\pi}}_{\star}P=\sigma^{\overline{\pi}}_{\star}\overline{P}$.

Soundness case.

From Lemma 5.2, $K(P,\overline{P})>\varepsilon\implies K(\sigma^{\overline{\pi}}_{\star}P,\sigma^{\overline{\pi}}_{\star}\overline{P})>\varepsilon$.

As a consequence of (Cherapanamjeri and Bartlett, 2019, Theorem 10), the sample complexity of testing is upper bounded by $\mathcal{O}(\Delta/\varepsilon^{4})$. With $\overline{\pi}_{\star}=p_{1}/\Delta$ and treating $p_{1}$ as a small constant, we recover the known sample complexity of $\mathcal{O}(1/(\overline{\pi}_{\star}\varepsilon^{4}))$. ∎
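For completeness, the reduction can be summarized in code form. In the sketch below (Python; naming is ours), the symmetric-chain tester is kept as an abstract argument symmetric_identity_tester, a placeholder for the procedure of Cherapanamjeri and Bartlett (2019) which we do not reimplement; the remaining steps follow Sections 4 and 6.

```python
import numpy as np

def reduce_and_test(x_traj, P_bar, p, symmetric_identity_tester, eps, rng=None):
    """Identity testing of a reversible chain by reduction to the symmetric case.

    x_traj : observed trajectory of the unknown chain P over X = {0, ..., |X|-1}
    P_bar  : reference transition matrix, reversible with stationary distribution p / sum(p)
    p      : positive integer vector, pi_bar = p / Delta with Delta = sum(p)
    symmetric_identity_tester : black-box tester for symmetric chains
                                (placeholder for Cherapanamjeri and Bartlett, 2019)
    """
    rng = np.random.default_rng(rng)
    p = np.asarray(p)
    # Step 1: build the symmetrizer (kappa, sigma) once for the reference.
    kappa = np.repeat(np.arange(len(p)), p)
    sigma = 1.0 / p.astype(float)[kappa]
    # Step 2a: embed the reference algebraically (Definition 4.1); the result is symmetric.
    S_bar = np.asarray(P_bar)[np.ix_(kappa, kappa)] * sigma[None, :]
    # Step 2b: embed the observed trajectory operationally (uniform pick inside each block).
    offsets = np.concatenate(([0], np.cumsum(p)[:-1]))
    x_traj = np.asarray(x_traj)
    y_traj = offsets[x_traj] + rng.integers(0, p[x_traj])
    # Step 3: invoke the symmetric-chain tester as a black box and report its answer.
    return symmetric_identity_tester(y_traj, S_bar, eps)
```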

References

  • Cherapanamjeri and Bartlett (2019) Y. Cherapanamjeri and P. L. Bartlett. Testing symmetric Markov chains without hitting. In Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 758–785. PMLR, 2019.
  • Daskalakis et al. (2018) C. Daskalakis, N. Dikkala, and N. Gravin. Testing symmetric Markov chains from a single trajectory. In Conference On Learning Theory, pages 385–409. PMLR, 2018.
  • Diakonikolas and Kane (2016) I. Diakonikolas and D. M. Kane. A new approach for testing properties of discrete distributions. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 685–694. IEEE, 2016.
  • Fried and Wolfer (2022) S. Fried and G. Wolfer. Identity testing of reversible Markov chains. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 798–817. PMLR, 2022.
  • Goldreich (2016) O. Goldreich. The uniform distribution is complete with respect to testing identity to a fixed distribution. In Electron. Colloquium Comput. Complex., volume 23, page 15, 2016.
  • Kazakos (1978) D. Kazakos. The Bhattacharyya distance and detection between Markov chains. IEEE Transactions on Information Theory, 24(6):747–754, 1978.
  • Kemeny and Snell (1983) J. G. Kemeny and J. L. Snell. Finite Markov chains: with a new appendix ”Generalization of a fundamental matrix”. Springer, 1983.
  • Nagaoka (2005) H. Nagaoka. The exponential family of Markov chains and its information geometry. In The proceedings of the Symposium on Information Theory and Its Applications, volume 28(2), pages 601–604, 2005.
  • Paninski (2008) L. Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory, 54(10):4750–4755, 2008.
  • Rached et al. (2001) Z. Rached, F. Alajaji, and L. L. Campbell. Rényi’s divergence and entropy rates for finite alphabet Markov sources. IEEE Transactions on Information theory, 47(4):1553–1561, 2001.
  • Valiant and Valiant (2017) G. Valiant and P. Valiant. An automatic inequality prover and instance optimal identity testing. SIAM Journal on Computing, 46(1):429–455, 2017.
  • Čencov (1978) N. N. Čencov. Algebraic foundation of mathematical statistics. Series Statistics, 9(2):267–276, 1978. doi: 10.1080/02331887808801428.
  • Waggoner (2015) B. Waggoner. $l_{p}$ testing and learning of discrete distributions. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pages 347–356, 2015.
  • Wolfer and Kontorovich (2020) G. Wolfer and A. Kontorovich. Minimax testing of identity to a reference ergodic Markov chain. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 191–201. PMLR, 26–28 Aug 2020.
  • Wolfer and Watanabe (2022) G. Wolfer and S. Watanabe. Geometric aspects of data-processing of Markov chains, 2022. arXiv:2203.04575.