A Geometric Reduction Approach for
Identity Testing of Reversible Markov Chains
Abstract
We consider the problem of testing the identity of a reversible Markov chain against a reference from a single trajectory of observations. Employing the recently introduced notion of a lumping-congruent Markov embedding, we show that, at least in a mildly restricted setting, testing identity to a reversible chain reduces to testing identity to a symmetric chain over a larger state space, and we recover the state-of-the-art sample complexity for the problem.
Keywords— Information geometry; Irreducible Markov chain; Identity testing; Congruent embedding; Markov morphism; Lumpability.
1 Introduction
Uniformity testing is the flagship problem of distribution property testing. From independent observations sampled from an unknown distribution $\mu$ on a finite space of size $n$, the goal is to distinguish between the two cases where $\mu$ is uniform and where $\mu$ is $\varepsilon$-far from being uniform with respect to some notion of distance. The complexity of this problem is known to be of the order of $\sqrt{n}/\varepsilon^2$ (Paninski, 2008), which compares favorably with the linear dependency in $n$ required for estimating the distribution to precision $\varepsilon$ (Waggoner, 2015). Interestingly, the uniform distribution can be replaced with some arbitrary reference at the same statistical cost. In fact, Goldreich (2016) proved that the latter problem formally reduces to the former. Inspired by his approach, we seek and obtain a reduction result in the much less understood and more challenging Markovian setting.
Informal Markovian problem statement —
The scientist is given the full description of a reference transition matrix $\bar{P}$ and a single Markov chain (MC) $X_1, X_2, \ldots, X_m$ sampled with respect to some unknown transition operator $P$ and arbitrary initial distribution. For a fixed proximity parameter $\varepsilon > 0$, the goal is to design an algorithm that distinguishes between the two cases $P = \bar{P}$ and $\mathcal{J}(P, \bar{P}) > \varepsilon$, with high probability, where $\mathcal{J}$ is a contrast function between stochastic matrices.
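The single-trajectory observation model above is easy to simulate. The sketch below (with `sample_trajectory` a hypothetical helper name, not part of the paper) samples one trajectory from a transition matrix under an arbitrary initial distribution.

```python
import numpy as np

def sample_trajectory(P, length, rng=None, init=None):
    """Sample a single trajectory X_1, ..., X_length from transition matrix P.

    `init` is an initial distribution over states (arbitrary, per the
    problem statement); defaults to uniform.
    """
    rng = np.random.default_rng(rng)
    n = P.shape[0]
    init = np.full(n, 1.0 / n) if init is None else np.asarray(init)
    traj = np.empty(length, dtype=int)
    traj[0] = rng.choice(n, p=init)
    for t in range(1, length):
        # One Markov step: next state drawn from the row of the current state
        traj[t] = rng.choice(n, p=P[traj[t - 1]])
    return traj
```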
Related work —
Under the contrast function (1) described in Section 2, and the hypothesis that $P$ and $\bar{P}$ are both irreducible and symmetric over a finite space of size $n$, Daskalakis et al. (2018) constructed a tester whose sample complexity is controlled by a parameter $h$ (Daskalakis et al., 2018, Definition 3) corresponding to some hitting property of the chain, along with a lower bound. In Cherapanamjeri and Bartlett (2019), a graph partitioning algorithm delivers, under the same symmetry assumption, a testing procedure with sample complexity $\tilde{\mathcal{O}}(n/\varepsilon^4)$, i.e. independent of hitting properties. More recently, Fried and Wolfer (2022) relaxed the symmetry requirement, replacing it with a more natural reversibility assumption. Their algorithm has a sample complexity of $\tilde{\mathcal{O}}(1/(\pi_\star \varepsilon^4))$, where $\pi_\star$ is the minimum stationary probability of $\bar{P}$, gracefully recovering Cherapanamjeri and Bartlett (2019) under symmetry. In parallel, Wolfer and Kontorovich (2020) started the research program of inspecting the problem under the infinity norm for matrices, and derived nearly minimax-optimal bounds.
Contribution —
We show how to mostly recover Fried and Wolfer (2022) under additional assumptions (see Section 3), with a technique based on a geometry preserving embedding. We obtain a more economical proof than Fried and Wolfer (2022) who went through the process of re-deriving a graph partitioning algorithm for the reversible case. Furthermore, our approach, by its generality, is also applicable to related inference problems.
2 Preliminaries
We let $\mathcal{X}, \mathcal{Y}$ be finite sets, and denote $\mathcal{P}(\mathcal{X})$ the set of all probability distributions over $\mathcal{X}$. All vectors are written as row vectors. For matrices $A, B$, $A \circ B$ is their Hadamard (entry-wise) product and $\rho(A)$ is the spectral radius of $A$. For non-negative $A$, we use the compact notation $\sqrt{A}$ for the entry-wise square root. $\mathcal{W}(\mathcal{X})$ is the set of all row-stochastic matrices over the state space $\mathcal{X}$, and $\pi \in \mathcal{P}(\mathcal{X})$ is called a stationary distribution for $P \in \mathcal{W}(\mathcal{X})$ when $\pi P = \pi$.
Irreducibility and reversibility —
We denote $\mathcal{W}(\mathcal{X}, E)$ the set of irreducible stochastic matrices over a strongly connected digraph $(\mathcal{X}, E)$. When $P$ is irreducible, its stationary distribution is unique and we denote it $\pi$. When $P$ verifies the detailed-balance equation $\pi(i) P(i, j) = \pi(j) P(j, i)$ for any $i, j \in \mathcal{X}$, we say that $P$ is reversible.
Lumpability —
In contradistinction with the distribution setting, merging symbols in a Markov chain may break the Markov property. For $P \in \mathcal{W}(\mathcal{X})$ and a surjective map $\kappa \colon \mathcal{X} \to \mathcal{Y}$, merging elements of $\mathcal{X}$ together, we say that $P$ is $\kappa$-lumpable (Kemeny and Snell, 1983) when the resulting process still defines a MC. If so, the resulting transition matrix can be found in (Kemeny and Snell, 1983, Theorem 6.3.2), which we denote as $\kappa_\star P$, with

$$\kappa_\star P(s, t) = \sum_{j \in \kappa^{-1}(t)} P(i, j), \quad \text{for any } i \in \kappa^{-1}(s),$$

the right-hand side being independent of the choice of $i$ by lumpability.
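As an illustration, the sketch below (with `lump` a hypothetical helper name) checks $\kappa$-lumpability numerically and returns the lumped transition matrix: within each block, the probability mass sent to every block must not depend on the representative state.

```python
import numpy as np

def lump(P, kappa, n_blocks, atol=1e-10):
    """Check kappa-lumpability of P and return the lumped transition matrix.

    kappa: sequence mapping each state of X to its block in Y.
    P is kappa-lumpable iff, within every block S, the mass P(i, S') sent
    to each block S' is the same for all i in S; the lumped matrix collects
    these common block-to-block masses.
    """
    P = np.asarray(P, dtype=float)
    kappa = np.asarray(kappa)
    # block_mass[i, t] = sum of P(i, j) over states j lumped into block t
    block_mass = np.zeros((P.shape[0], n_blocks))
    for j, t in enumerate(kappa):
        block_mass[:, t] += P[:, j]
    Q = np.zeros((n_blocks, n_blocks))
    for s in range(n_blocks):
        rows = block_mass[kappa == s]
        if not np.allclose(rows, rows[0], atol=atol):
            raise ValueError("P is not kappa-lumpable")
        Q[s] = rows[0]
    return Q
```

For instance, a 3-state chain whose last two states behave identically lumps onto a 2-state chain.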
Contrast function —
We consider the following notion of discrepancy between two stochastic matrices $P, \bar{P} \in \mathcal{W}(\mathcal{X})$,

$$\mathcal{J}(P, \bar{P}) \triangleq 1 - \rho\left(\sqrt{P \circ \bar{P}}\right). \qquad (1)$$
Although it first appeared in Daskalakis et al. (2018) in the context of MC testing, its inception can be traced back to Kazakos (1978). It is instructive to observe that $\mathcal{J}$ vanishes on pairs of chains that share an identical irreducible component and does not satisfy the triangle inequality for reducible matrices, hence is not a proper metric on $\mathcal{W}(\mathcal{X})$ (Daskalakis et al., 2018, p.10, footnote 13). Some additional properties of $\mathcal{J}$ of possible interest are listed in (Fried and Wolfer, 2022, Section 7).
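Numerically, the contrast (1) is a one-liner over the entry-wise square root of the Hadamard product; the helper name below is illustrative.

```python
import numpy as np

def contrast(P, P_bar):
    """Contrast (1): 1 minus the spectral radius of the entry-wise
    square root of the Hadamard product of P and P_bar."""
    M = np.sqrt(np.asarray(P) * np.asarray(P_bar))
    return 1.0 - np.max(np.abs(np.linalg.eigvals(M)))
```

Note that for $P = \bar{P}$ stochastic, $\sqrt{P \circ P} = P$ has spectral radius one, so the contrast vanishes, as expected.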
Reduction approach for identity testing of distributions —
Problem reduction is ubiquitous in the property testing literature. Our work takes inspiration from Goldreich (2016), who introduced two so-called “stochastic filters” in order to show how, in the distribution setting, identity testing is reducible to uniformity testing, thereby recovering the known complexity of $\Theta(\sqrt{n}/\varepsilon^2)$ obtained more directly by Valiant and Valiant (2017). Other notable works include Diakonikolas and Kane (2016), who reduced a collection of distribution testing problems to $\ell_2$-identity testing.
3 The restricted identity testing problem
We cast our problem in the minimax framework by defining the risk of a testing procedure $\phi$ at trajectory length $m$,

$$\mathcal{R}(\phi, \bar{P}, \varepsilon, m) \triangleq \max\left\{ \mathbb{P}_{\bar{P}}\left(\phi(X_1, \ldots, X_m) = 1\right), \; \sup_{\substack{P \in \mathcal{C} \colon \mathcal{J}(P, \bar{P}) > \varepsilon}} \mathbb{P}_{P}\left(\phi(X_1, \ldots, X_m) = 0\right) \right\},$$

and the sample complexity as the smallest $m$ for which some procedure $\phi$ achieves risk at most $1/3$, where $\mathcal{C} \subseteq \mathcal{W}(\mathcal{X})$ is the subset of stochastic matrices under consideration. We note the presence of an exclusion region $\{P \colon 0 < \mathcal{J}(P, \bar{P}) \leq \varepsilon\}$, and that the problem can be regarded as a Bayesian testing problem with a prior that places equal mass on the null and the alternative and vanishes on the exclusion region. We briefly recall the assumptions made in Fried and Wolfer (2022). For $P, \bar{P} \in \mathcal{W}(\mathcal{X})$,
- $P$ and $\bar{P}$ are irreducible and reversible.
- $P$ and $\bar{P}$ share the same stationary distribution $\pi$. (We note that Fried and Wolfer (2022) also slightly loosen the requirement of having matching stationary distributions to the two stationary distributions being sufficiently close.)
The following additional assumptions will make our approach readily applicable.
- $P$ and $\bar{P}$ share the same connection graph, $E = \bar{E}$.
- The common stationary distribution is rational, $\pi \in \mathbb{Q}^n$.
Remark 3.1.
A sufficient condition for the rationality of $\pi$ is $\bar{P}(i, j) \in \mathbb{Q}$ for any $(i, j) \in \bar{E}$, since the stationary distribution then solves a linear system with rational coefficients.
Without loss of generality, we express $\pi = (q_1/m, \ldots, q_n/m)$, for some $m \in \mathbb{N}$ and $q_1, \ldots, q_n \in \mathbb{N}$ with $q_1 + \cdots + q_n = m$. We denote by $\mathcal{C}$ the set of stochastic matrices that verify the four assumptions above for some fixed connection graph $\bar{E}$ and positive stationary distribution $\pi$. Our below-stated theorem provides an upper bound on the sample complexity in $\mathcal{C}$.
Theorem 3.1.
Let $\varepsilon > 0$ and let $\bar{P} \in \mathcal{C}$. There exists a testing procedure $\phi$, with sample complexity $\tilde{\mathcal{O}}(m/\varepsilon^4)$, such that the following holds. For any $P \in \mathcal{C}$ and a single trajectory sampled according to $P$, $\phi$ distinguishes between the cases $P = \bar{P}$ and $\mathcal{J}(P, \bar{P}) > \varepsilon$ with probability at least $2/3$.
Proof sketch.
Our strategy consists of two steps. First, we employ a transformation on Markov chains termed Markov embedding (Wolfer and Watanabe, 2022) in order to symmetrize both the reference chain (algebraically, by computing the new transition matrix) and the unknown chain (operationally, by simulating an embedded trajectory). Crucially, our transformation preserves the contrast between two chains and their embedded versions (Lemma 5.2). Second, we invoke the known tester of Cherapanamjeri and Bartlett (2019) for symmetric chains as a black box and report its output. The full proof is deferred to Section 6. ∎
Remark 3.2.
Our reduction could also be applied in the robust testing setting (note that even in the symmetric setting, the robust problem remains open), where the two competing hypotheses are $\mathcal{J}(P, \bar{P}) \leq \varepsilon_1$ and $\mathcal{J}(P, \bar{P}) > \varepsilon_2$, for $0 \leq \varepsilon_1 < \varepsilon_2$.
4 Symmetrization of reversible Markov chains
Information geometry —
Our construction and notation follow Nagaoka (2005), who established the dually-flat structure on the space of irreducible stochastic matrices. Writing $P_\theta$ for the transition matrix at coordinates $\theta = (\theta^1, \ldots, \theta^d)$, with $d$ the dimension of the manifold, and with the shorthand $\partial_u = \partial / \partial \theta^u$, recall that the Fisher metric is expressed in the chart induced basis as

$$g_{uv}(\theta) = \sum_{(i, j) \in E} \pi_\theta(i) P_\theta(i, j) \, \partial_u \log P_\theta(i, j) \, \partial_v \log P_\theta(i, j).$$
Embeddings —
In Wolfer and Watanabe (2022), the following notion of an embedding for stochastic matrices is proposed.
Definition 4.1 (Markov embedding for Markov chains, Wolfer and Watanabe (2022)).
We call Markov embedding a map $\mathcal{E} \colon \mathcal{W}(\mathcal{Y}) \to \mathcal{W}(\mathcal{X})$, such that for any $P \in \mathcal{W}(\mathcal{Y})$ and $(i, j) \in \mathcal{X}^2$,

$$\mathcal{E}(P)(i, j) = P(\kappa(i), \kappa(j)) \, \Lambda(i, j),$$

and where $\kappa$ and $\Lambda$ satisfy the following requirements:
- $\kappa \colon \mathcal{X} \to \mathcal{Y}$ is a lumping function for which $\kappa_\star \mathcal{E}(P) = P$.
- $\Lambda$ is a positive function over the edge set.
- Writing $\{S_y\}_{y \in \mathcal{Y}}$ for the partition defined by $\kappa$, $\Lambda$ is such that for any $i \in \mathcal{X}$ and $y \in \mathcal{Y}$, $\sum_{j \in S_y} \Lambda(i, j) = 1$.
The above embeddings are characterized as the linear maps over lumpable matrices that satisfy some monotonicity requirements and are congruent with respect to the lumping operation (Wolfer and Watanabe, 2022, Theorem 3.1). When for any $i, i', j \in \mathcal{X}$ it additionally holds that $\Lambda(i, j) = \Lambda(i', j)$, the embedding is called memoryless (Wolfer and Watanabe, 2022, Section 3.4.2) and is e/m-geodesic affine (Wolfer and Watanabe, 2022, Th. 3.2, Lemma 3.6), preserving both exponential and mixture families of MCs.
Given $\bar{P}$ and $\pi = (q_1/m, \ldots, q_n/m)$ as defined in Section 3, from (Wolfer and Watanabe, 2022, Corollary 3.3), there exists a lumping function $\kappa$ from a state space of size $m$ onto $\mathcal{X}$ with $|\kappa^{-1}(i)| = q_i$, and a memoryless embedding $\mathcal{E}$ with respect to $\kappa$, such that $\mathcal{E}(\bar{P})$ is symmetric. Furthermore, identifying the embedded state space with pairs $(i, a)$, where $i \in \mathcal{X}$ and $a \in \{1, \ldots, q_i\}$, its existence is given constructively by

$$\mathcal{E}(P)\left((i, a), (j, b)\right) = \frac{P(i, j)}{q_j}.$$

As a consequence, we have both,
- 1. The expression of $\mathcal{E}(\bar{P})$ following the algebraic manipulations in Definition 4.1.
- 2. An operational way of simulating a trajectory of $\mathcal{E}(P)$ from a trajectory of the unknown $P$ (Wolfer and Watanabe, 2022, Section 3.1).
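The constructive definition above is straightforward to implement. The sketch below (hypothetical helper name, assuming the $P(i, j)/q_j$ form stated above) builds the embedded matrix, whose symmetry for a reversible input with stationary distribution $q/m$ can be checked directly.

```python
import numpy as np

def symmetrize(P, q):
    """Memoryless embedding that symmetrizes a reversible P whose
    stationary distribution is pi = q / m, with q integer counts.

    State i of the base chain is split into q[i] copies; a transition
    into any copy of j receives weight P(i, j) / q[j]."""
    q = np.asarray(q, dtype=int)
    m = q.sum()
    # kappa maps each embedded state to the base state it copies
    kappa = np.repeat(np.arange(len(q)), q)
    E = np.zeros((m, m))
    for x in range(m):
        for y in range(m):
            E[x, y] = P[kappa[x], kappa[y]] / q[kappa[y]]
    return E, kappa
```

For example, the reversible chain $P$ with rows $(0.5, 0.5)$ and $(0.25, 0.75)$ has stationary distribution $(1/3, 2/3)$, so $q = (1, 2)$ and the embedded chain lives on $m = 3$ states.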
5 Contrast preservation
It was established in (Wolfer and Watanabe, 2022, Lemma 3.1) that Markov embeddings preserve the Fisher information metric, the affine connections, and the KL divergence between points. In this section, we show that memoryless embeddings, such as the symmetrizer introduced in Section 4, also preserve the contrast function $\mathcal{J}$. Our proof relies on first showing that memoryless embeddings induce natural Markov morphisms (Čencov, 1978) from distributions over paths in $\mathcal{Y}$ to distributions over paths in $\mathcal{X}$.
Lemma 5.1.
Let $\kappa \colon \mathcal{X} \to \mathcal{Y}$ be a lumping function, and let $\mathcal{E}$ be a congruent memoryless Markov embedding. For $k \in \mathbb{N}$, let $\mu_k$ (resp. $\tilde{\mu}_k$) be the unique distribution over stationary paths of length $k$ induced from $P$ (resp. $\mathcal{E}(P)$). Then there exists a Markov morphism $M$ such that $\tilde{\mu}_k = \mu_k M$.
Proof.
Let $\kappa_k \colon \mathcal{X}^k \to \mathcal{Y}^k$ be the lumping function on paths induced from $\kappa$ coordinate-wise, and introduce $\{S_w\}_{w \in \mathcal{Y}^k}$, the partition associated to $\kappa_k$. Since $\mathcal{E}$ is memoryless, write $\lambda(j)$ for the common value of $\Lambda(\cdot, j)$. For any realizable path $w \in \mathcal{Y}^k$, we define a distribution $M_w$ concentrated on $S_w$, and such that for any $x = (x_1, \ldots, x_k) \in S_w$, $M_w(x) = \prod_{t=1}^{k} \lambda(x_t)$. Non-negativity of $M_w$ is immediate, and

$$\sum_{x \in S_w} M_w(x) = \prod_{t=1}^{k} \sum_{x_t \in \kappa^{-1}(w_t)} \lambda(x_t) = 1,$$

thus $M_w$ is well-defined. Collecting the $M_w$ as the rows of a stochastic map $M$, it holds for $x \in S_w$ that

$$\tilde{\mu}_k(x) = \tilde{\pi}(x_1) \prod_{t=1}^{k-1} \mathcal{E}(P)(x_t, x_{t+1}) \overset{(\star)}{=} \pi(w_1) \lambda(x_1) \prod_{t=1}^{k-1} P(w_t, w_{t+1}) \lambda(x_{t+1}) = \mu_k(w) M_w(x),$$

where $(\star)$ stems from (Wolfer and Watanabe, 2022, Lemma 3.5), whence our claim holds. ∎
Lemma 5.1 essentially states that inducing the path distribution and then applying the Markov morphism $M$ coincides with embedding the chain and then inducing the path distribution; in other words, the corresponding diagram between $\mathcal{W}(\mathcal{Y})$, $\mathcal{W}(\mathcal{X})$ and the sets of distributions over paths of length $k$ induced from these families commutes. As a consequence, we can unambiguously write $\tilde{\mu}_k$ for the distribution over stationary paths of length $k$ that pertains to $\mathcal{E}(P)$.
Lemma 5.2.
Let $\mathcal{E}$ be a memoryless embedding. Then for any $P, \bar{P} \in \mathcal{W}(\mathcal{Y}, E)$,

$$\mathcal{J}\left(\mathcal{E}(P), \mathcal{E}(\bar{P})\right) = \mathcal{J}(P, \bar{P}).$$
Proof.
We recall for two distributions $\mu, \nu \in \mathcal{P}(\mathcal{X})$ the definition of the Rényi divergence of order $1/2$,

$$D_{1/2}(\mu \| \nu) = -2 \log \sum_{x \in \mathcal{X}} \sqrt{\mu(x) \nu(x)},$$

and note that $D_{1/2}$ is closely related to the Hellinger distance between $\mu$ and $\nu$. This definition extends to a divergence rate between stochastic processes $\mathbb{P}$ and $\mathbb{Q}$ on $\mathcal{X}^{\mathbb{N}}$ as follows,

$$D_{1/2}(\mathbb{P} \| \mathbb{Q}) = \lim_{k \to \infty} \frac{1}{k} D_{1/2}(\mu_k \| \nu_k),$$

where $\mu_k$ and $\nu_k$ are the respective marginals over paths of length $k$, and in the irreducible time-homogeneous Markovian setting where $\mathbb{P}, \mathbb{Q}$ evolve according to transition matrices $P$ and $\bar{P}$, the above reduces (Rached et al., 2001) to

$$D_{1/2}(\mathbb{P} \| \mathbb{Q}) = -2 \log \rho\left(\sqrt{P \circ \bar{P}}\right).$$

Reorganizing terms and plugging in the embedded stochastic matrices,

$$\mathcal{J}\left(\mathcal{E}(P), \mathcal{E}(\bar{P})\right) = 1 - \exp\left(-\frac{1}{2} \lim_{k \to \infty} \frac{1}{k} D_{1/2}(\tilde{\mu}_k \| \tilde{\nu}_k)\right),$$

where $\tilde{\mu}_k$ (resp. $\tilde{\nu}_k$) is the distribution over stationary paths of length $k$ induced by the embedded $\mathcal{E}(P)$ (resp. $\mathcal{E}(\bar{P})$). For any $k$, from Lemma 5.1 and information monotonicity of the Rényi divergence, $D_{1/2}(\tilde{\mu}_k \| \tilde{\nu}_k) = D_{1/2}(\mu_k \| \nu_k)$, hence our claim. ∎
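The contrast-preservation property can be sanity-checked numerically: under a symmetrizing embedding of the constructive $P(i,j)/q_j$ form of Section 4, the contrast between two reversible chains sharing a rational stationary distribution coincides with the contrast between their embeddings. Helper names below are illustrative.

```python
import numpy as np

def contrast(P, Pb):
    """Contrast (1): 1 minus the spectral radius of sqrt(P ∘ Pb) (entry-wise)."""
    return 1.0 - np.max(np.abs(np.linalg.eigvals(np.sqrt(P * Pb))))

def symmetrize(P, q):
    """Memoryless symmetrizing embedding: split state i into q[i] copies,
    with transition weight P(i, j) / q[j] into each copy of j."""
    q = np.asarray(q, dtype=int)
    kappa = np.repeat(np.arange(len(q)), q)  # embedded state -> base state
    return P[np.ix_(kappa, kappa)] / q[kappa][None, :]

# Two reversible chains sharing the stationary distribution (1/3, 2/3):
P  = np.array([[0.5, 0.5], [0.25, 0.75]])
Pb = np.array([[0.8, 0.2], [0.1, 0.9]])
q = [1, 2]  # pi = (q_1/m, q_2/m) with m = 3
J_base = contrast(P, Pb)
J_embedded = contrast(symmetrize(P, q), symmetrize(Pb, q))
```

The agreement of `J_base` and `J_embedded` reflects the fact that the embedded matrix $\sqrt{\mathcal{E}(P) \circ \mathcal{E}(\bar{P})}$ has the same non-zero spectrum as its base counterpart.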
6 Proof of Theorem 3.1
We assume that $P$ and $\bar{P}$ are in $\mathcal{C}$. We reduce the problem as follows. We construct $\mathcal{E}$, the symmetrizer defined in Section 4 (if we wish to test the identity of multiple chains against a same reference, we only need to perform this step once). We proceed to embed both the reference chain (using Definition 4.1) and the unknown trajectory (using the operational definition in (Wolfer and Watanabe, 2022, Section 3.1)). We invoke the tester of Cherapanamjeri and Bartlett (2019) as a black box, and report its answer.
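The trajectory-embedding step above can be sketched operationally: because the symmetrizer is memoryless with uniform weights $1/q_j$ inside each block, embedding an observed trajectory amounts to replacing each visit to a state $j$ by a uniformly drawn copy of $j$ (hypothetical helper name below).

```python
import numpy as np

def embed_trajectory(traj, q, rng=None):
    """Operationally embed an observed trajectory of the base chain into a
    trajectory of the symmetrized chain: each visit to state j is replaced
    by a uniformly chosen one of the q[j] copies of j."""
    rng = np.random.default_rng(rng)
    q = np.asarray(q, dtype=int)
    traj = np.asarray(traj)
    # Index of the first copy of each base state in the embedded space
    offsets = np.concatenate(([0], np.cumsum(q)[:-1]))
    return offsets[traj] + rng.integers(0, q[traj])
```

Lumping the embedded trajectory back through $\kappa$ recovers the original observations, as expected from congruence.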
Completeness case.
It is immediate that $P = \bar{P}$ implies $\mathcal{E}(P) = \mathcal{E}(\bar{P})$.
Soundness case.
From Lemma 5.2, $\mathcal{J}(\mathcal{E}(P), \mathcal{E}(\bar{P})) = \mathcal{J}(P, \bar{P}) > \varepsilon$.
As a consequence of (Cherapanamjeri and Bartlett, 2019, Theorem 10), the sample complexity of testing over the embedded space of size $m$ is upper bounded by $\tilde{\mathcal{O}}(m/\varepsilon^4)$. With $m = q_\star / \pi_\star$, where $q_\star = \min_i q_i$ and $\pi_\star = \min_i \pi(i)$, and treating $q_\star$ as a small constant, we recover the known sample complexity $\tilde{\mathcal{O}}(1/(\pi_\star \varepsilon^4))$ of Fried and Wolfer (2022).
References
- Cherapanamjeri and Bartlett (2019) Y. Cherapanamjeri and P. L. Bartlett. Testing symmetric Markov chains without hitting. In Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 758–785. PMLR, 2019.
- Daskalakis et al. (2018) C. Daskalakis, N. Dikkala, and N. Gravin. Testing symmetric Markov chains from a single trajectory. In Conference On Learning Theory, pages 385–409. PMLR, 2018.
- Diakonikolas and Kane (2016) I. Diakonikolas and D. M. Kane. A new approach for testing properties of discrete distributions. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 685–694. IEEE, 2016.
- Fried and Wolfer (2022) S. Fried and G. Wolfer. Identity testing of reversible Markov chains. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 798–817. PMLR, 2022.
- Goldreich (2016) O. Goldreich. The uniform distribution is complete with respect to testing identity to a fixed distribution. In Electron. Colloquium Comput. Complex., volume 23, page 15, 2016.
- Kazakos (1978) D. Kazakos. The Bhattacharyya distance and detection between Markov chains. IEEE Transactions on Information Theory, 24(6):747–754, 1978.
- Kemeny and Snell (1983) J. G. Kemeny and J. L. Snell. Finite Markov chains: with a new appendix “Generalization of a fundamental matrix”. Springer, 1983.
- Nagaoka (2005) H. Nagaoka. The exponential family of Markov chains and its information geometry. In The proceedings of the Symposium on Information Theory and Its Applications, volume 28(2), pages 601–604, 2005.
- Paninski (2008) L. Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory, 54(10):4750–4755, 2008.
- Rached et al. (2001) Z. Rached, F. Alajaji, and L. L. Campbell. Rényi’s divergence and entropy rates for finite alphabet Markov sources. IEEE Transactions on Information theory, 47(4):1553–1561, 2001.
- Valiant and Valiant (2017) G. Valiant and P. Valiant. An automatic inequality prover and instance optimal identity testing. SIAM Journal on Computing, 46(1):429–455, 2017.
- Čencov (1978) N. N. Čencov. Algebraic foundation of mathematical statistics. Series Statistics, 9(2):267–276, 1978. doi: 10.1080/02331887808801428.
- Waggoner (2015) B. Waggoner. $\ell_p$ testing and learning of discrete distributions. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pages 347–356, 2015.
- Wolfer and Kontorovich (2020) G. Wolfer and A. Kontorovich. Minimax testing of identity to a reference ergodic Markov chain. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 191–201. PMLR, 26–28 Aug 2020.
- Wolfer and Watanabe (2022) G. Wolfer and S. Watanabe. Geometric aspects of data-processing of Markov chains, 2022. arXiv:2203.04575.