
A Geometric Reduction Approach for
Identity Testing of Reversible Markov Chains

Geoffrey Wolfer (email: geoffrey.wolfer@riken.jp), RIKEN Center for AI Project. The author is supported by the Special Postdoctoral Researcher Program (SPDR) of RIKEN.

Shun Watanabe (email: shunwata@cc.tuat.ac.jp), Department of Computer and Information Sciences, Tokyo University of Agriculture and Technology. The author is supported in part by Japan Society for the Promotion of Science KAKENHI under Grant 20H02144.
Abstract

We consider the problem of testing the identity of a reversible Markov chain against a reference from a single trajectory of observations. Employing the recently introduced notion of a lumping-congruent Markov embedding, we show that, at least in a mildly restricted setting, testing identity to a reversible chain reduces to testing identity to a symmetric chain over a larger state space, and we recover the state-of-the-art sample complexity for the problem.

Keywords— Information geometry; Irreducible Markov chain; Identity testing; Congruent embedding; Markov morphism; Lumpability.

1   Introduction

Uniformity testing is the flagship problem of distribution property testing. From $n$ independent observations sampled from an unknown distribution $\mu$ on a finite space $\mathcal{X}$, the goal is to distinguish between the two cases where $\mu$ is uniform and where $\mu$ is $\varepsilon$-far from being uniform with respect to some notion of distance. The complexity of this problem is known to be of the order of $\tilde{\Theta}(\sqrt{|\mathcal{X}|}/\varepsilon^{2})$ Paninski (2008), which compares favorably with the linear dependency in $|\mathcal{X}|$ required for estimating the distribution to precision $\varepsilon$ Waggoner (2015). Interestingly, the uniform distribution can be replaced with an arbitrary reference at the same statistical cost. In fact, Goldreich (2016) proved that the latter problem formally reduces to the former. Inspired by his approach, we seek and obtain a reduction result in the much less understood and more challenging Markovian setting.

Informal Markovian problem statement —

The scientist is given the full description of a reference transition matrix $\overline{P}$ and a single Markov chain (MC) $X_{1}^{n}$ sampled with respect to some unknown transition operator $P$ and arbitrary initial distribution. For fixed proximity parameter $\varepsilon>0$, the goal is to design an algorithm that distinguishes between the two cases $P=\overline{P}$ and $K(P,\overline{P})>\varepsilon$, with high probability, where $K$ is a contrast function between stochastic matrices.

Related work —

Under the contrast function (1) described in Section 2, and the hypothesis that $P$ and $\overline{P}$ are both irreducible and symmetric over a finite space $\mathcal{X}$, Daskalakis et al. (2018) constructed a tester with sample complexity $\tilde{\mathcal{O}}(|\mathcal{X}|/\varepsilon+h)$, where $h$ (Daskalakis et al., 2018, Definition 3) corresponds to some hitting property of the chain, together with a lower bound of $\Omega(|\mathcal{X}|/\varepsilon)$. In Cherapanamjeri and Bartlett (2019), a graph partitioning algorithm delivers, under the same symmetry assumption, a testing procedure with sample complexity $\mathcal{O}(|\mathcal{X}|/\varepsilon^{4})$, i.e. independent of hitting properties. More recently, Fried and Wolfer (2022) relaxed the symmetry requirement, replacing it with a more natural reversibility assumption. Their algorithm has a sample complexity of $\mathcal{O}(1/(\overline{\pi}_{\star}\varepsilon^{4}))$, where $\overline{\pi}_{\star}$ is the minimum stationary probability of $\overline{P}$, gracefully recovering Cherapanamjeri and Bartlett (2019) under symmetry. In parallel, Wolfer and Kontorovich (2020) started the research program of inspecting the problem under the infinity norm for matrices, and derived nearly minimax-optimal bounds.

Contribution —

We show how to mostly recover Fried and Wolfer (2022) under additional assumptions (see Section 3), with a technique based on a geometry-preserving embedding. We obtain a more economical proof than Fried and Wolfer (2022), who went through the process of re-deriving a graph partitioning algorithm for the reversible case. Furthermore, our approach, by its generality, is also applicable to related inference problems.

2   Preliminaries

We let $\mathcal{X},\mathcal{Y}$ be finite sets, and denote $\mathcal{P}(\mathcal{X})$ the set of all probability distributions over $\mathcal{X}$. All vectors are written as row vectors. For matrices $A,B$, $A\circ B$ is their Hadamard product and $\rho(A)$ is the spectral radius of $A$. For $n\in\mathbb{N}$, we use the compact notation $x_{1}^{n}=(x_{1},\dots,x_{n})$. $\mathcal{W}(\mathcal{X})$ is the set of all row-stochastic matrices over the state space $\mathcal{X}$, and $\pi$ is called a stationary distribution for $P\in\mathcal{W}(\mathcal{X})$ when $\pi P=\pi$.

Irreducibility and reversibility —

We denote $\mathcal{W}(\mathcal{X},\mathcal{D})$ the set of irreducible stochastic matrices over a strongly connected digraph $(\mathcal{X},\mathcal{D})$. When $P\in\mathcal{W}(\mathcal{X},\mathcal{D})$, $\pi$ is unique and we denote $\pi_{\star}=\min_{x\in\mathcal{X}}\pi(x)>0$. When $P$ satisfies the detailed-balance equation $\pi(x)P(x,x^{\prime})=\pi(x^{\prime})P(x^{\prime},x)$ for any $(x,x^{\prime})\in\mathcal{D}$, we say that $P$ is reversible.

Lumpability —

In contradistinction with the distribution setting, merging symbols in a Markov chain may break the Markov property. For $P\in\mathcal{W}(\mathcal{Y},\mathcal{E})$ and a surjective map $\kappa\colon\mathcal{Y}\to\mathcal{X}$ merging elements of $\mathcal{Y}$ together, we say that $P$ is $\kappa$-lumpable Kemeny and Snell (1983) when the resulting process still defines a MC. If so, the resulting transition matrix, which we denote $\kappa_{\star}P\in\mathcal{W}(\mathcal{X},\kappa_{2}(\mathcal{E}))$, can be found in (Kemeny and Snell, 1983, Theorem 6.3.2), with

\[
\kappa_{2}(\mathcal{E})\doteq\left\{(x,x^{\prime})\in\mathcal{X}^{2}\colon\exists(y,y^{\prime})\in\mathcal{E},\ (\kappa(y),\kappa(y^{\prime}))=(x,x^{\prime})\right\}.
\]
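To make the lumping operation concrete, here is a minimal sketch in Python with NumPy (the helper name and the toy matrix are ours, not taken from the references): it checks the Kemeny and Snell (1983) lumpability criterion, namely that all states within a block assign the same total probability to every block, and returns the lumped matrix $\kappa_{\star}P$ when the criterion holds.

```python
import numpy as np

def lump(P, kappa, num_blocks, atol=1e-10):
    """Check Kemeny-Snell lumpability of P w.r.t. the block map kappa
    and return the lumped transition matrix when it applies.

    P     : (m, m) row-stochastic matrix over Y
    kappa : length-m integer array, kappa[y] = block of state y
    """
    P = np.asarray(P, dtype=float)
    kappa = np.asarray(kappa)
    # block_mass[y, xp] = total probability of jumping from y into block xp
    block_mass = np.zeros((P.shape[0], num_blocks))
    for xp in range(num_blocks):
        block_mass[:, xp] = P[:, kappa == xp].sum(axis=1)
    lumped = np.zeros((num_blocks, num_blocks))
    for x in range(num_blocks):
        rows = block_mass[kappa == x]        # rows of states lumped into block x
        if not np.allclose(rows, rows[0], atol=atol):
            raise ValueError("P is not kappa-lumpable")
        lumped[x] = rows[0]                  # common value = (kappa_* P)(x, .)
    return lumped

# Toy example: a 3-state chain lumpable onto 2 macro-states, {0, 1} -> 0 and {2} -> 1.
P = np.array([[0.2, 0.3, 0.5],
              [0.4, 0.1, 0.5],   # states 0 and 1 both send mass 0.5 into each block
              [0.3, 0.3, 0.4]])
print(lump(P, kappa=[0, 0, 1], num_blocks=2))
```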

Contrast function —

We consider the following notion of discrepancy between two stochastic matrices $P,P^{\prime}\in\mathcal{W}(\mathcal{X})$,

\[
K(P,P^{\prime})\doteq 1-\rho\left(P^{\circ 1/2}\circ P^{\prime\circ 1/2}\right).\tag{1}
\]

Although $K$ first appeared in Daskalakis et al. (2018) in the context of MC testing, its inception can be traced back to Kazakos (1978). It is instructive to observe that $K$ vanishes on chains that share an identical component and does not satisfy the triangle inequality for reducible matrices, hence it is not a proper metric on $\mathcal{W}(\mathcal{X})$ (Daskalakis et al., 2018, p.10, footnote 13). Some additional properties of $K$ of possible interest are listed in (Fried and Wolfer, 2022, Section 7).
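For intuition, the contrast (1) is straightforward to evaluate numerically. The sketch below (Python with NumPy; the helper is our own) computes $K$ as one minus the spectral radius of the Hadamard product of the entry-wise square roots, and checks that identical chains sit at contrast zero.

```python
import numpy as np

def contrast(P, Pp):
    """K(P, P') = 1 - rho(P^{o1/2} o P'^{o1/2}), cf. Eq. (1)."""
    M = np.sqrt(np.asarray(P) * np.asarray(Pp))         # Hadamard product of square roots
    return 1.0 - np.max(np.abs(np.linalg.eigvals(M)))   # 1 minus the spectral radius

P  = np.array([[0.5, 0.5], [0.5, 0.5]])
Pp = np.array([[0.9, 0.1], [0.1, 0.9]])
print(contrast(P, P))    # 0.0: identical chains are at contrast zero
print(contrast(P, Pp))   # > 0: the two chains are distinguishable
```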

Reduction approach for identity testing of distributions —

Problem reduction is ubiquitous in the property testing literature. Our work takes inspiration from Goldreich (2016), who introduced two so-called “stochastic filters” in order to show how, in the distribution setting, identity testing is reducible to uniformity testing, thereby recovering the known complexity of $\mathcal{O}(\sqrt{|\mathcal{X}|}/\varepsilon^{2})$ obtained more directly by Valiant and Valiant (2017). Other notable works include Diakonikolas and Kane (2016), who reduced a collection of distribution testing problems to $\ell_{2}$-identity testing.

3   The restricted identity testing problem

We cast our problem in the minimax framework by defining the risk $\mathcal{R}_{n}(\varepsilon)$,

\[
\mathcal{R}_{n}(\varepsilon)\doteq\min_{\phi\colon\mathcal{X}^{n}\to\{0,1\}}\left\{\mathbb{P}_{X_{1}^{n}\sim\overline{\pi},\overline{P}}\left(\phi(X_{1}^{n})=1\right)+\max_{P\in\mathcal{H}_{1}(\varepsilon)}\mathbb{P}_{X_{1}^{n}\sim\pi,P}\left(\phi(X_{1}^{n})=0\right)\right\},
\]

the sample complexity $n_{\star}(\varepsilon,\delta)\doteq\min\left\{n\in\mathbb{N}\colon\mathcal{R}_{n}(\varepsilon)<\delta\right\}$, and where

\[
\mathcal{H}_{0}=\left\{\overline{P}\right\},\qquad\mathcal{H}_{1}(\varepsilon)=\left\{P\in\mathcal{V}_{\mathsf{test}}\colon K(P,\overline{P})>\varepsilon\right\},
\]

with $\mathcal{H}_{0},\mathcal{H}_{1}(\varepsilon)\subset\mathcal{V}_{\mathsf{test}}$, the subset of stochastic matrices under consideration. We note the presence of an exclusion region, and that the problem can be regarded as a Bayesian testing problem with a prior which is uniform over $\mathcal{H}_{0}$ and $\mathcal{H}_{1}(\varepsilon)$ and vanishes on the exclusion region. We briefly recall the assumptions made in Fried and Wolfer (2022). For $(P,\overline{P})\in(\mathcal{H}_{1}(\varepsilon),\mathcal{H}_{0})$:

(A.1) $P$ and $\overline{P}$ are irreducible and reversible.

(A.2) $P$ and $\overline{P}$ share the same stationary distribution $\overline{\pi}=\pi$. (Fried and Wolfer (2022) in fact slightly loosen this requirement, from exactly matching stationary distributions to closeness in the sense that $\left\|\pi/\overline{\pi}-1\right\|_{\infty}<\varepsilon$.)

The following additional assumptions will make our approach readily applicable.

(B.1) $P$ and $\overline{P}$ share the same connection graph, $P,\overline{P}\in\mathcal{W}(\mathcal{X},\mathcal{D})$.

(B.2) The common stationary distribution is rational, $\overline{\pi}\in\mathbb{Q}^{\mathcal{X}}$.

Remark 3.1.

A sufficient condition for $\overline{\pi}\in\mathbb{Q}^{\mathcal{X}}$ is $\overline{P}(x,x^{\prime})\in\mathbb{Q}$ for any $x,x^{\prime}\in\mathcal{X}$.

Without loss of generality, we express $\overline{\pi}=\left(p_{1},p_{2},\dots,p_{\left|\mathcal{X}\right|}\right)/\Delta$, for some $\Delta\in\mathbb{N}$ and $p\in\mathbb{N}^{\left|\mathcal{X}\right|}$ with $0<p_{1}\leq p_{2}\leq\dots\leq p_{\left|\mathcal{X}\right|}<\Delta$. We denote by $\mathcal{V}_{\mathsf{test}}$ the set of stochastic matrices that satisfy assumptions $(A.1)$, $(A.2)$, $(B.1)$ and $(B.2)$ for some fixed and positive $\overline{\pi}\in\mathcal{P}(\mathcal{X})$. The theorem stated below provides an upper bound on the sample complexity $n_{\star}(\varepsilon,\delta)$ of order $\widetilde{\mathcal{O}}(1/(\overline{\pi}_{\star}\varepsilon^{4}))$.

Theorem 3.1.

Let $\varepsilon,\delta\in(0,1)$ and let $\overline{P}\in\mathcal{V}_{\mathsf{test}}\subset\mathcal{W}(\mathcal{X},\mathcal{D})$. There exists a testing procedure $\phi\colon\mathcal{X}^{n}\to\left\{0,1\right\}$, with $n=\tilde{\mathcal{O}}(1/(\overline{\pi}_{\star}\varepsilon^{4}))$, such that the following holds. For any $P\in\mathcal{V}_{\mathsf{test}}$ and $X_{1}^{n}$ sampled according to $P$, $\phi$ distinguishes between the cases $P=\overline{P}$ and $K(P,\overline{P})>\varepsilon$ with probability at least $1-\delta$.

Proof sketch.

Our strategy consists of two steps. First, we employ a transformation on Markov chains termed Markov embedding Wolfer and Watanabe (2022) in order to symmetrize both the reference chain (algebraically, by computing the new transition matrix) and the unknown chain (operationally, by simulating an embedded trajectory). Crucially, our transformation preserves the contrast between two chains and their embedded versions (Lemma 5.2). Second, we invoke the known tester of Cherapanamjeri and Bartlett (2019) for symmetric chains as a black box and report its output. The proof is deferred to Section 6. ∎

Remark 3.2.

Our reduction could also be applied in the robust testing setting, where the two competing hypotheses are $K(P,\overline{P})<\varepsilon/2$ and $K(P,\overline{P})>\varepsilon$. (Note that even in the symmetric setting, the robust problem remains open.)

4   Symmetrization of reversible Markov chains

Information geometry —

Our construction and notation follow Nagaoka (2005), who established the dually-flat structure $(\mathcal{W}(\mathcal{X},\mathcal{D}),\mathfrak{g},\nabla^{(e)},\nabla^{(m)})$ on the space of irreducible stochastic matrices. Writing $P_{\theta}\in\mathcal{W}(\mathcal{X},\mathcal{D})$ for the transition matrix at coordinates $\theta\in\Theta\subset\mathbb{R}^{d}$, with $d$ the dimension of the manifold, and with the shorthand $\partial_{i}\doteq\partial/\partial\theta^{i}$, recall that the Fisher metric is expressed in the chart-induced basis $(\partial_{i})_{i\in[d]}$ as

\[
\mathfrak{g}_{ij}(\theta)=\sum_{(x,x^{\prime})\in\mathcal{D}}\pi_{\theta}(x)P_{\theta}(x,x^{\prime})\,\partial_{i}\log P_{\theta}(x,x^{\prime})\,\partial_{j}\log P_{\theta}(x,x^{\prime}),\quad\text{for }i,j\in[d].
\]

Embeddings —

In Wolfer and Watanabe (2022), the following notion of an embedding for stochastic matrices is proposed.

Definition 4.1 (Markov embedding for Markov chains Wolfer and Watanabe (2022)).

We call Markov embedding a map $\Lambda_{\star}\colon\mathcal{W}(\mathcal{X},\mathcal{D})\to\mathcal{W}(\mathcal{Y},\mathcal{E}),\ P\mapsto\Lambda_{\star}P$, such that for any $(y,y^{\prime})\in\mathcal{E}$,

\[
\Lambda_{\star}P(y,y^{\prime})=P(\kappa(y),\kappa(y^{\prime}))\Lambda(y,y^{\prime}),
\]

and where $\kappa$ and $\Lambda$ satisfy the following requirements:

(i) $\kappa\colon\mathcal{Y}\to\mathcal{X}$ is a lumping function for which $\kappa_{2}(\mathcal{E})=\mathcal{D}$.

(ii) $\Lambda$ is a positive function over the edge set, $\Lambda\colon\mathcal{E}\to\mathbb{R}_{+}$.

(iii) Writing $\bigcup_{x\in\mathcal{X}}\mathcal{S}_{x}=\mathcal{Y}$ for the partition defined by $\kappa$, $\Lambda$ is such that for any $y\in\mathcal{Y}$ and $x^{\prime}\in\mathcal{X}$, $(\kappa(y),x^{\prime})\in\mathcal{D}\implies(\Lambda(y,y^{\prime}))_{y^{\prime}\in\mathcal{S}_{x^{\prime}}}\in\mathcal{P}(\mathcal{S}_{x^{\prime}})$.

The above embeddings are characterized as the linear maps over lumpable matrices that satisfy some monotonicity requirements and are congruent with respect to the lumping operation (Wolfer and Watanabe, 2022, Theorem 3.1). When, for any $y,y^{\prime}\in\mathcal{Y}$, it additionally holds that $\Lambda(y,y^{\prime})=\Lambda(y^{\prime})\,\delta\left[(\kappa(y),\kappa(y^{\prime}))\in\mathcal{D}\right]$, the embedding $\Lambda_{\star}$ is called memoryless (Wolfer and Watanabe, 2022, Section 3.4.2) and is e/m-geodesic affine (Wolfer and Watanabe, 2022, Theorem 3.2, Lemma 3.6), preserving both exponential and mixture families of MC.

Given $\overline{\pi}$ and $\Delta$ as defined in Section 3, from (Wolfer and Watanabe, 2022, Corollary 3.3), there exists a lumping function $\kappa\colon[\Delta]\to\mathcal{X}$, and a memoryless embedding $\sigma^{\overline{\pi}}_{\star}\colon\mathcal{W}(\mathcal{X},\mathcal{D})\to\mathcal{W}([\Delta],\mathcal{E})$ with $\mathcal{E}=\left\{(y,y^{\prime})\in[\Delta]^{2}\colon(\kappa(y),\kappa(y^{\prime}))\in\mathcal{D}\right\}$, such that $\sigma^{\overline{\pi}}_{\star}\overline{P}$ is symmetric. Furthermore, identifying $\mathcal{X}=\left\{1,2,\dots,\left|\mathcal{X}\right|\right\}$, its existence is given constructively by

\[
\kappa(j)=\operatorname*{arg\,min}_{1\leq i\leq\left|\mathcal{X}\right|}\left\{\sum_{k=1}^{i}p_{k}\geq j\right\},\qquad\sigma^{\overline{\pi}}(j)=p^{-1}_{\kappa(j)},\qquad\text{for any }1\leq j\leq\Delta.
\]
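As a sanity check of the construction above, the following sketch (Python with NumPy; helper names and the toy reference chain are ours, with 0-indexed states) builds $\kappa$ and $\sigma^{\overline{\pi}}$ from $p$ and $\Delta$, forms the embedded matrix $\sigma^{\overline{\pi}}_{\star}\overline{P}(y,y^{\prime})=\overline{P}(\kappa(y),\kappa(y^{\prime}))\,\sigma^{\overline{\pi}}(y^{\prime})$ as in Definition 4.1, and verifies that the result is a symmetric stochastic matrix when $\overline{P}$ is reversible with stationary distribution $p/\Delta$.

```python
import numpy as np

def build_symmetrizer(p):
    """kappa maps [Delta] onto X by allotting p[x] copies to state x (0-indexed);
    sigma(j) = 1 / p[kappa(j)] is the memoryless weight."""
    kappa = np.repeat(np.arange(len(p)), p)
    sigma = 1.0 / np.asarray(p, dtype=float)[kappa]
    return kappa, sigma

def embed(P, kappa, sigma):
    """sigma_* P (y, y') = P(kappa(y), kappa(y')) * sigma(y'), cf. Definition 4.1."""
    return np.asarray(P)[np.ix_(kappa, kappa)] * sigma[None, :]

# Reversible reference on X = {0, 1, 2} with stationary distribution p / Delta.
p, Delta = np.array([1, 2, 3]), 6                 # pi_bar = (1, 2, 3) / 6
P_bar = np.array([[0.0, 0.4, 0.6],
                  [0.2, 0.5, 0.3],
                  [0.2, 0.2, 0.6]])               # satisfies detailed balance w.r.t. p / Delta
kappa, sigma = build_symmetrizer(p)
S = embed(P_bar, kappa, sigma)                    # 6 x 6 embedded matrix
print(np.allclose(S, S.T), np.allclose(S.sum(axis=1), 1.0))  # True True
```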

As a consequence, we have both:

1. The expression of $\sigma^{\overline{\pi}}_{\star}\overline{P}$, following the algebraic manipulations in Definition 4.1.

2. A randomized algorithm Wolfer and Watanabe (2022) to simulate trajectories from $\sigma^{\overline{\pi}}_{\star}P$ out of trajectories from $P$ (see (Wolfer and Watanabe, 2022, Section 3.1)); a sketch is given below.
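Operationally, the simulation step amounts to drawing, independently at each time $t$, a representative $y_{t}$ of the observed state $x_{t}$ inside the block $\mathcal{S}_{x_{t}}$ with probability $\sigma^{\overline{\pi}}(y_{t})$, which for the symmetrizer is simply a uniform choice among the $p_{x_{t}}$ copies of $x_{t}$. A minimal sketch under these assumptions (our own code, not the implementation of the cited works):

```python
import numpy as np

def embed_trajectory(x_traj, p, rng=None):
    """Simulate a trajectory of the embedded chain sigma_* P from a trajectory of P.

    At each time t, a representative of x_t is drawn uniformly among the p[x_t]
    states of the block S_{x_t} inside [Delta], independently of the past.
    States of [Delta] are labelled so that block x occupies offsets
    [sum(p[:x]), sum(p[:x]) + p[x]).
    """
    rng = np.random.default_rng(rng)
    p = np.asarray(p)
    offsets = np.concatenate(([0], np.cumsum(p)[:-1]))   # first label of each block
    x_traj = np.asarray(x_traj)
    return offsets[x_traj] + rng.integers(0, p[x_traj])  # uniform pick inside each block

# Example: embed a path of a 3-state chain into [Delta] = [6] with p = (1, 2, 3).
x_traj = np.array([2, 1, 2, 2, 0, 1])
print(embed_trajectory(x_traj, p=[1, 2, 3], rng=0))
```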

5   Contrast preservation

It was established in (Wolfer and Watanabe, 2022, Lemma 3.1) that Markov embeddings preserve the Fisher information metric, the affine connections and the KL divergence between points. In this section, we show that memoryless embeddings, such as the symmetrizer $\sigma^{\overline{\pi}}_{\star}$ introduced in Section 4, also preserve the contrast function $K$. Our proof will rely on first showing that memoryless embeddings induce natural Markov morphisms Čencov (1978) from distributions over $\mathcal{X}^{n}$ to $\mathcal{Y}^{n}$.

Lemma 5.1.

Let $\kappa\colon\mathcal{Y}\to\mathcal{X}$ be a lumping function, and let

\[
L_{\star}\colon\mathcal{W}(\mathcal{X},\mathcal{D})\rightarrow\mathcal{W}(\mathcal{Y},\mathcal{E})
\]

be a $\kappa$-congruent memoryless Markov embedding. For $P\in\mathcal{W}(\mathcal{X},\mathcal{D})$, let $Q^{n}\in\mathcal{P}(\mathcal{X}^{n})$ (resp. $\widetilde{Q}^{n}\in\mathcal{P}(\mathcal{Y}^{n})$) be the unique distribution over stationary paths of length $n$ induced from $P$ (resp. $L_{\star}P$). Then there exists a Markov morphism $M_{\star}\colon\mathcal{P}(\mathcal{X}^{n})\to\mathcal{P}(\mathcal{Y}^{n})$ such that $M_{\star}Q^{n}=\widetilde{Q}^{n}$.

Proof.

Let $\kappa_{n}\colon\mathcal{Y}^{n}\to\mathcal{X}^{n}$ be the lumping function on blocks induced from $\kappa$,

\[
\forall y_{1}^{n}\in\mathcal{Y}^{n},\quad\kappa_{n}(y_{1}^{n})=(\kappa(y_{t}))_{1\leq t\leq n}\in\mathcal{X}^{n},
\]

and introduce

\[
\mathcal{Y}^{n}=\bigcup_{x_{1}^{n}\in\mathcal{X}^{n}}\mathcal{S}_{x_{1}^{n}},\quad\text{with }\mathcal{S}_{x_{1}^{n}}=\left\{y_{1}^{n}\in\mathcal{Y}^{n}\colon\kappa_{n}(y_{1}^{n})=x_{1}^{n}\right\},
\]

the partition associated to $\kappa_{n}$. For any realizable path $x_{1}^{n}$, i.e. such that $Q^{n}(x_{1}^{n})>0$, we define a distribution $M^{x_{1}^{n}}\in\mathcal{P}(\mathcal{Y}^{n})$ concentrated on $\mathcal{S}_{x_{1}^{n}}$, and such that for any $y_{1}^{n}\in\mathcal{S}_{x_{1}^{n}}$, $M^{x_{1}^{n}}(y_{1}^{n})=\prod_{t=1}^{n}L(y_{t})$. Non-negativity of $M^{x_{1}^{n}}$ is immediate, and

\[
\sum_{y_{1}^{n}\in\mathcal{Y}^{n}}M^{x_{1}^{n}}(y_{1}^{n})=\sum_{y_{1}^{n}\in\mathcal{Y}^{n}\colon\kappa_{n}(y_{1}^{n})=x_{1}^{n}}M^{x_{1}^{n}}(y_{1}^{n})=\prod_{t=1}^{n}\left(\sum_{y_{t}\in\mathcal{S}_{x_{t}}}L(y_{t})\right)=1,
\]

thus $M^{x_{1}^{n}}$ is well-defined. Furthermore, for $y_{1}^{n}\in\mathcal{Y}^{n}$, it holds that

\[
\begin{split}
\widetilde{Q}^{n}(y_{1}^{n})&=L_{\star}\pi(y_{1})\prod_{t=1}^{n-1}L_{\star}P(y_{t},y_{t+1})\stackrel{(\spadesuit)}{=}\pi(\kappa(y_{1}))L(y_{1})\prod_{t=1}^{n-1}P(\kappa(y_{t}),\kappa(y_{t+1}))L(y_{t+1})\\
&=Q^{n}(\kappa(y_{1}),\dots,\kappa(y_{n}))\prod_{t=1}^{n}L(y_{t})=Q^{n}(\kappa_{n}(y_{1}^{n}))\prod_{t=1}^{n}L(y_{t})\\
&=\sum_{x_{1}^{n}\in\mathcal{X}^{n}}Q^{n}(x_{1}^{n})M^{x_{1}^{n}}(y_{1}^{n})=M_{\star}Q^{n}(y_{1}^{n}),
\end{split}
\]

where $(\spadesuit)$ stems from (Wolfer and Watanabe, 2022, Lemma 3.5), whence our claim holds. ∎

Lemma 5.1 essentially states that the following diagram commutes

\[
\begin{array}{ccc}
\mathcal{W}(\mathcal{X},\mathcal{D}) & \xrightarrow{\ L_{\star}\ } & L_{\star}\mathcal{W}(\mathcal{X},\mathcal{D})\\
\downarrow & & \downarrow\\
\mathcal{Q}^{n}_{\mathcal{W}(\mathcal{X},\mathcal{D})} & \xrightarrow{\ M_{\star}\ } & \mathcal{Q}^{n}_{L_{\star}\mathcal{W}(\mathcal{X},\mathcal{D})},
\end{array}
\]

for some Markov morphism $M_{\star}$, and where we denoted by $\mathcal{Q}^{n}_{\mathcal{W}(\mathcal{X},\mathcal{D})}\subset\mathcal{P}(\mathcal{X}^{n})$ the set of all distributions over paths of length $n$ induced from the family $\mathcal{W}(\mathcal{X},\mathcal{D})$ (the vertical arrows map a transition matrix to its induced distribution over stationary paths of length $n$). As a consequence, we can unambiguously write $L_{\star}Q^{n}\in\mathcal{Q}^{n}_{L_{\star}\mathcal{W}(\mathcal{X},\mathcal{D})}$ for the distribution over stationary paths of length $n$ that pertains to $L_{\star}P$.

Lemma 5.2.

Let $L_{\star}\colon\mathcal{W}(\mathcal{X},\mathcal{D})\to\mathcal{W}(\mathcal{Y},\mathcal{E})$ be a memoryless embedding. Then, for any $P,\overline{P}\in\mathcal{W}(\mathcal{X},\mathcal{D})$,

\[
K(L_{\star}P,L_{\star}\overline{P})=K(P,\overline{P}).
\]
Proof.

We recall, for two distributions $\mu,\nu\in\mathcal{P}(\mathcal{X})$, the definition of $R_{1/2}$, the Rényi divergence of order $1/2$,

\[
R_{1/2}(\mu\|\nu)\doteq-2\log\left(\sum_{x\in\mathcal{X}}\sqrt{\mu(x)\nu(x)}\right),
\]

and note that $R_{1/2}$ is closely related to the Hellinger distance between $\mu$ and $\nu$. This definition extends to a divergence rate between stochastic processes $(X_{t})_{t\in\mathbb{N}},(X^{\prime}_{t})_{t\in\mathbb{N}}$ on $\mathcal{X}$ as follows,

\[
R_{1/2}\left((X_{t})_{t\in\mathbb{N}}\|(X^{\prime}_{t})_{t\in\mathbb{N}}\right)=\lim_{n\to\infty}\frac{1}{n}R_{1/2}\left(X_{1}^{n}\|X_{1}^{\prime n}\right),
\]

and in the irreducible time-homogeneous Markovian setting where $(X_{t})_{t\in\mathbb{N}},(X^{\prime}_{t})_{t\in\mathbb{N}}$ evolve according to transition matrices $P$ and $P^{\prime}$, the above reduces Rached et al. (2001) to

\[
R_{1/2}\left((X_{t})_{t\in\mathbb{N}}\|(X^{\prime}_{t})_{t\in\mathbb{N}}\right)=-2\log\rho(P^{\circ 1/2}\circ P^{\prime\circ 1/2})=-2\log(1-K(P,P^{\prime})).
\]

Reorganizing terms and plugging in the embedded stochastic matrices,

\[
K(L_{\star}P,L_{\star}\overline{P})=1-\exp\left(-\frac{1}{2}\lim_{n\to\infty}\frac{1}{n}R_{1/2}\left(L_{\star}Q^{n}\|L_{\star}\overline{Q}^{n}\right)\right),
\]

where $L_{\star}\overline{Q}^{n}$ is the distribution over stationary paths of length $n$ induced by the embedded $L_{\star}\overline{P}$. For any $n\in\mathbb{N}$, from Lemma 5.1 and information monotonicity of the Rényi divergence (which holds with equality here, since lumping the paths by $\kappa_{n}$ recovers the original path distributions), $R_{1/2}\left(L_{\star}Q^{n}\|L_{\star}\overline{Q}^{n}\right)=R_{1/2}\left(Q^{n}\|\overline{Q}^{n}\right)$, hence our claim. ∎
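The preservation property is also easy to illustrate numerically. The self-contained sketch below (Python with NumPy; the two chains are toy examples of ours, both reversible with respect to $p/\Delta$) compares the contrast of two chains before and after applying the symmetrizer; the two printed values coincide up to floating-point error, as predicted by Lemma 5.2.

```python
import numpy as np

def contrast(P, Pp):
    M = np.sqrt(P * Pp)                                 # Hadamard product of square roots
    return 1.0 - np.max(np.abs(np.linalg.eigvals(M)))   # 1 - spectral radius, Eq. (1)

def embed(P, p):
    kappa = np.repeat(np.arange(len(p)), p)             # lumping map [Delta] -> X
    sigma = 1.0 / np.asarray(p, float)[kappa]           # memoryless weights
    return P[np.ix_(kappa, kappa)] * sigma[None, :]

p = np.array([1, 2, 3])                                 # pi_bar = p / 6
P_bar = np.array([[0.0, 0.4, 0.6],
                  [0.2, 0.5, 0.3],
                  [0.2, 0.2, 0.6]])                     # reversible w.r.t. p / 6
P     = np.array([[0.0, 0.4, 0.6],
                  [0.2, 0.2, 0.6],
                  [0.2, 0.4, 0.4]])                     # another chain reversible w.r.t. p / 6
print(contrast(P, P_bar), contrast(embed(P, p), embed(P_bar, p)))  # the two values coincide
```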

6   Proof of Theorem 3.1

We assume that $P$ and $\overline{P}$ are in $\mathcal{V}_{\mathsf{test}}$. We reduce the problem as follows. We construct $\sigma^{\overline{\pi}}_{\star}$, the symmetrizer defined in Section 4. (If we wish to test the identity of multiple chains against the same reference, this step only needs to be performed once.) We proceed to embed both the reference chain (using Definition 4.1) and the unknown trajectory (using the operational definition in (Wolfer and Watanabe, 2022, Section 3.1)). We invoke the tester of Cherapanamjeri and Bartlett (2019) as a black box, and report its answer.

Figure 1: Reduction of the testing problem by isometric embedding. The symmetrizer $\sigma_{\star}^{\overline{\pi}}$ maps the reversible family $\mathcal{W}_{\mathsf{rev}}(\mathcal{X},\mathcal{D})$ into the symmetric family $\mathcal{W}_{\mathsf{sym}}([\Delta],\mathcal{E})$, sending $\overline{P}$ to $\sigma^{\overline{\pi}}_{\star}\overline{P}$, $\mathcal{H}_{1}(\varepsilon)$ to $\sigma_{\star}^{\overline{\pi}}\mathcal{H}_{1}(\varepsilon)$ and $\mathcal{V}_{\mathsf{test}}$ to $\sigma_{\star}^{\overline{\pi}}\mathcal{V}_{\mathsf{test}}$, while preserving the separation $\varepsilon$.

Completeness case.

It is immediate that $P=\overline{P}\implies\sigma^{\overline{\pi}}_{\star}P=\sigma^{\overline{\pi}}_{\star}\overline{P}$.

Soundness case.

From Lemma 5.2, $K(P,\overline{P})>\varepsilon\implies K(\sigma^{\overline{\pi}}_{\star}P,\sigma^{\overline{\pi}}_{\star}\overline{P})>\varepsilon$.

As a consequence of (Cherapanamjeri and Bartlett, 2019, Theorem 10), the sample complexity of testing is upper bounded by $\mathcal{O}(\Delta/\varepsilon^{4})$. With $\overline{\pi}_{\star}=p_{1}/\Delta$ and treating $p_{1}$ as a small constant, we recover the known sample complexity of $\mathcal{O}(1/(\overline{\pi}_{\star}\varepsilon^{4}))$. ∎
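For completeness, the reduction can be summarized in code form. In the sketch below (Python; naming is ours), the symmetric-chain tester is kept as an abstract argument symmetric_identity_tester, a placeholder for the procedure of Cherapanamjeri and Bartlett (2019) which we do not reimplement; the remaining steps follow Sections 4 and 6.

```python
import numpy as np

def reduce_and_test(x_traj, P_bar, p, symmetric_identity_tester, eps, rng=None):
    """Identity testing of a reversible chain by reduction to the symmetric case.

    x_traj : observed trajectory of the unknown chain P over X = {0, ..., |X|-1}
    P_bar  : reference transition matrix, reversible with stationary distribution p / sum(p)
    p      : positive integer vector, pi_bar = p / Delta with Delta = sum(p)
    symmetric_identity_tester : black-box tester for symmetric chains
                                (placeholder for Cherapanamjeri and Bartlett, 2019)
    """
    rng = np.random.default_rng(rng)
    p = np.asarray(p)
    # Step 1: build the symmetrizer (kappa, sigma) once for the reference.
    kappa = np.repeat(np.arange(len(p)), p)
    sigma = 1.0 / p.astype(float)[kappa]
    # Step 2a: embed the reference algebraically (Definition 4.1); the result is symmetric.
    S_bar = np.asarray(P_bar)[np.ix_(kappa, kappa)] * sigma[None, :]
    # Step 2b: embed the observed trajectory operationally (uniform pick inside each block).
    offsets = np.concatenate(([0], np.cumsum(p)[:-1]))
    x_traj = np.asarray(x_traj)
    y_traj = offsets[x_traj] + rng.integers(0, p[x_traj])
    # Step 3: invoke the symmetric-chain tester as a black box and report its answer.
    return symmetric_identity_tester(y_traj, S_bar, eps)
```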

References

  • Cherapanamjeri and Bartlett (2019) Y. Cherapanamjeri and P. L. Bartlett. Testing symmetric Markov chains without hitting. In Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 758–785. PMLR, 2019.
  • Daskalakis et al. (2018) C. Daskalakis, N. Dikkala, and N. Gravin. Testing symmetric Markov chains from a single trajectory. In Conference On Learning Theory, pages 385–409. PMLR, 2018.
  • Diakonikolas and Kane (2016) I. Diakonikolas and D. M. Kane. A new approach for testing properties of discrete distributions. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 685–694. IEEE, 2016.
  • Fried and Wolfer (2022) S. Fried and G. Wolfer. Identity testing of reversible Markov chains. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 798–817. PMLR, 2022.
  • Goldreich (2016) O. Goldreich. The uniform distribution is complete with respect to testing identity to a fixed distribution. In Electron. Colloquium Comput. Complex., volume 23, page 15, 2016.
  • Kazakos (1978) D. Kazakos. The Bhattacharyya distance and detection between Markov chains. IEEE Transactions on Information Theory, 24(6):747–754, 1978.
  • Kemeny and Snell (1983) J. G. Kemeny and J. L. Snell. Finite Markov chains: with a new appendix ”Generalization of a fundamental matrix”. Springer, 1983.
  • Nagaoka (2005) H. Nagaoka. The exponential family of Markov chains and its information geometry. In The proceedings of the Symposium on Information Theory and Its Applications, volume 28(2), pages 601–604, 2005.
  • Paninski (2008) L. Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory, 54(10):4750–4755, 2008.
  • Rached et al. (2001) Z. Rached, F. Alajaji, and L. L. Campbell. Rényi’s divergence and entropy rates for finite alphabet Markov sources. IEEE Transactions on Information theory, 47(4):1553–1561, 2001.
  • Valiant and Valiant (2017) G. Valiant and P. Valiant. An automatic inequality prover and instance optimal identity testing. SIAM Journal on Computing, 46(1):429–455, 2017.
  • Čencov (1978) N. N. Čencov. Algebraic foundation of mathematical statistics. Series Statistics, 9(2):267–276, 1978. doi: 10.1080/02331887808801428.
  • Waggoner (2015) B. Waggoner. $l_{p}$ testing and learning of discrete distributions. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pages 347–356, 2015.
  • Wolfer and Kontorovich (2020) G. Wolfer and A. Kontorovich. Minimax testing of identity to a reference ergodic Markov chain. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 191–201. PMLR, 26–28 Aug 2020.
  • Wolfer and Watanabe (2022) G. Wolfer and S. Watanabe. Geometric aspects of data-processing of Markov chains, 2022. arXiv:2203.04575.