A Simple Non-Stationary Mean Ergodic Theorem, with Bonus Weak Law of
Large Numbers
Cosma Rohilla Shalizi
Departments of Statistics and of Machine
Learning, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, and the Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501.
(1 November 2021, small revision 19 March 2022)
Abstract
This brief pedagogical note re-proves a simple theorem on the convergence, in
$L_2$ and in probability, of time averages of non-stationary time series to
the mean of the expectation values. The basic condition is that the sum of
covariances grows sub-quadratically with the length of the time series. I
make no claim to originality; the result is a widely, but unevenly, spread bit
of folklore among users of applied probability. The goal of this note
is merely to even out that distribution.
I assume some familiarity with basic probability and stochastic processes,
along the lines of Grimmett and
Stirzaker (1992), but re-prove a number of basic
results, to show that everything really is elementary.
Let $X_1, X_2, \ldots, X_n, \ldots$ be a sequence of real-valued random
variables. Assume that $\mu_t \equiv \mathbb{E}[X_t]$ and $\sigma^2_t \equiv \operatorname{Var}[X_t]$ exist and are finite for all $t$. The time
average of the $X_t$s,
$$\overline{X}_n \equiv \frac{1}{n}\sum_{t=1}^{n} X_t \tag{1}$$
converges on the average of the expectation values,
$$\overline{\mu}_n \equiv \frac{1}{n}\sum_{t=1}^{n} \mu_t \tag{2}$$
under a condition on the sum of the covariances,
$$\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s] = o(n^2). \tag{3}$$
The condition will be obvious after the following lemma, which will also be
useful for extensions.
Lemma 1
$$\mathbb{E}\left[\left(\overline{X}_n - \overline{\mu}_n\right)^2\right] = \frac{1}{n^2}\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s] \tag{4}$$
Proof: Since, for any random variable $Y$ with finite variance,
$\operatorname{Var}[Y] = \mathbb{E}\left[(Y - \mathbb{E}[Y])^2\right]$, we have
$$\mathbb{E}\left[\left(\overline{X}_n - \mathbb{E}\left[\overline{X}_n\right]\right)^2\right] = \operatorname{Var}\left[\overline{X}_n\right]. \tag{5}$$
By linearity of expectation,
$$\mathbb{E}\left[\overline{X}_n\right] = \frac{1}{n}\sum_{t=1}^{n}\mathbb{E}[X_t] = \overline{\mu}_n. \tag{6}$$
On the other hand,
$$\operatorname{Var}\left[\overline{X}_n\right] = \frac{1}{n^2}\operatorname{Var}\left[\sum_{t=1}^{n} X_t\right] \tag{7}$$
$$= \frac{1}{n^2}\operatorname{Cov}\left[\sum_{t=1}^{n} X_t, \sum_{s=1}^{n} X_s\right] \tag{8}$$
$$= \frac{1}{n^2}\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s] \tag{9}$$
through repeated application of the identity $\operatorname{Cov}[aY + bZ, W] = a\operatorname{Cov}[Y, W] + b\operatorname{Cov}[Z, W]$, and the definition $\operatorname{Var}[Y] = \operatorname{Cov}[Y, Y]$. Substituting Eqs. 6 and 9 into Eq. 5,
$$\mathbb{E}\left[\left(\overline{X}_n - \overline{\mu}_n\right)^2\right] = \frac{1}{n^2}\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s], \tag{10}$$
as was to be shown.
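(As a quick sanity check on Eq. 4, not part of the original argument: the following Python/NumPy sketch compares a Monte Carlo estimate of $\mathbb{E}\left[\left(\overline{X}_n - \overline{\mu}_n\right)^2\right]$ with $n^{-2}\sum_{t,s}\operatorname{Cov}[X_t, X_s]$ for an arbitrarily chosen non-stationary Gaussian sequence; the particular means and covariances are illustrative assumptions of mine, not anything from the note.)

```python
# Monte Carlo check of Lemma 1 / Eq. 4 on an invented non-stationary example.
import numpy as np

rng = np.random.default_rng(42)
n = 50
t = np.arange(1, n + 1)
mu = np.sin(t / 5.0)                                   # time-varying means
sd = 1.0 + 0.5 * np.cos(t / 7.0)                       # time-varying scales
Sigma = np.outer(sd, sd) * 0.9 ** np.abs(np.subtract.outer(t, t))  # decaying correlations

draws = rng.multivariate_normal(mu, Sigma, size=100_000)
xbar = draws.mean(axis=1)                              # time average, one per replicate
mubar = mu.mean()

print(np.mean((xbar - mubar) ** 2))                    # simulated E[(Xbar_n - mubar_n)^2]
print(Sigma.sum() / n ** 2)                            # n^{-2} sum_{t,s} Cov[X_t, X_s]
```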
Theorem 1 (Mean ergodic theorem)
If $\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s] = o(n^2)$, then $\overline{X}_n - \overline{\mu}_n \to 0$ in $L_2$ as $n \to \infty$.
Proof: Convergence in $L_2$ means that $\mathbb{E}\left[\left(\overline{X}_n - \overline{\mu}_n\right)^2\right] \to 0$, and, by the lemma, $\mathbb{E}\left[\left(\overline{X}_n - \overline{\mu}_n\right)^2\right] = n^{-2}\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s]$. By
the assumption of the theorem, $\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s] = o(n^2)$, hence $n^{-2}\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s] = o(1)$, so we
have
$$\lim_{n \to \infty}\mathbb{E}\left[\left(\overline{X}_n - \overline{\mu}_n\right)^2\right] = 0. \tag{11}$$
Thus $\overline{X}_n - \overline{\mu}_n \to 0$ in $L_2$; this is a mean (or mean-square) ergodic
theorem.
Since $\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s]$ is a sum of $n^2$ terms, if the sum is to be $o(n^2)$,
most of those terms must be shrinking rapidly to zero as $n$ grows. That is,
$\operatorname{Cov}[X_t, X_s]$ must go to zero as $|t - s| \to \infty$. Stationarity
(see immediately below) is not required.
Definition 1
The sequence of $X_t$s is weakly or second-order stationary when
$\mathbb{E}[X_t] = \mu$ and $\operatorname{Cov}[X_t, X_{t+h}] = \gamma(h)$ for some constant $\mu$ and some function $\gamma$,
and all $t$ and $h$.
If one does assume weak stationarity, a sufficient condition for the theorem is that $\sum_{h=-\infty}^{\infty}|\gamma(h)| < \infty$, since in that case $\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s] \leq n\sum_{h=-\infty}^{\infty}|\gamma(h)| = O(n) = o(n^2)$. The quantity $\tau \equiv \sum_{h=-\infty}^{\infty}\gamma(h)/\gamma(0)$ has names like “correlation
time”, “autocorrelation time”, “integrated autocovariance time”,
etc. From Lemma 1, for
uncorrelated variables, $\operatorname{Var}\left[\overline{X}_n\right] = \gamma(0)/n$, but when the covariances are summable, $\operatorname{Var}\left[\overline{X}_n\right] \approx \tau\gamma(0)/n$ for large $n$, so the “effective sample
size” is reduced from $n$ to $n/\tau$.
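To see the $n/\tau$ effect numerically, here is a small simulation sketch (mine, not the note's) for a weakly stationary AR(1) process, $X_t = \phi X_{t-1} + \epsilon_t$ with standard Gaussian innovations, for which $\gamma(h) = \gamma(0)\phi^{|h|}$ and so $\tau = (1+\phi)/(1-\phi)$.

```python
# Illustration of the reduced "effective sample size" n/tau for an AR(1) process.
import numpy as np

rng = np.random.default_rng(0)
phi, n, reps = 0.8, 5_000, 1_000
gamma0 = 1.0 / (1 - phi ** 2)            # Var[X_t] for unit-variance innovations
tau = (1 + phi) / (1 - phi)              # integrated autocorrelation time (= 9 here)

x = np.empty((reps, n))
x[:, 0] = rng.standard_normal(reps) * np.sqrt(gamma0)  # start in the stationary law
eps = rng.standard_normal((reps, n))
for i in range(1, n):
    x[:, i] = phi * x[:, i - 1] + eps[:, i]

print(x.mean(axis=1).var())              # empirical Var[Xbar_n]
print(tau * gamma0 / n)                  # ~ tau * gamma(0) / n, as in the text
print(gamma0 / n)                        # the iid value, smaller by a factor of tau
```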
It is unnecessary for the theorem to assume that $\overline{\mu}_n$ converges, i.e., that
the instantaneous expectations $\mu_t$ are Cesàro-convergent. Still less
is it necessary to assume that the $\mu_t$ have a limit. $\operatorname{Var}[X_t] = \sigma^2_t$ need not
be constant or tending to a limit either, though it cannot grow too fast.
$L_2$ convergence implies convergence in probability, or what is usually
known as the weak law of large numbers.
Corollary 1 (Weak law of large numbers)
Under the conditions of the theorem, $\overline{X}_n - \overline{\mu}_n \to 0$ in probability.
Proof: Use Chebyshev’s inequality (re-proved below as Proposition
1): for any random variable $Z$ with finite variance, and any $\epsilon > 0$,
$$\Pr\left(|Z - \mathbb{E}[Z]| \geq \epsilon\right) \leq \frac{\operatorname{Var}[Z]}{\epsilon^2}. \tag{12}$$
Applied to $\overline{X}_n$, this gives, for each fixed $\epsilon > 0$,
$$\Pr\left(\left|\overline{X}_n - \overline{\mu}_n\right| \geq \epsilon\right) \leq \frac{\operatorname{Var}\left[\overline{X}_n\right]}{\epsilon^2} \to 0, \tag{13}$$
which is the definition of convergence to 0 in probability.
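As a concrete, if unnecessary, illustration of Corollary 1 (again a sketch of my own, with an arbitrarily invented independent-but-non-identically-distributed sequence), the estimated probability of a deviation of at least $\epsilon$ shrinks with $n$ and stays below the Chebyshev bound $\operatorname{Var}\left[\overline{X}_n\right]/\epsilon^2$:

```python
# Weak-law check: Pr(|Xbar_n - mubar_n| >= eps) versus the Chebyshev bound.
import numpy as np

rng = np.random.default_rng(1)
eps, reps = 0.1, 2_000
for n in (100, 1_000, 10_000):
    t = np.arange(1, n + 1)
    mu = np.log(t)                       # non-constant means
    sd = 1.0 + 1.0 / t                   # non-constant standard deviations
    x = rng.normal(mu, sd, size=(reps, n))   # independent, non-identically distributed
    dev = np.abs(x.mean(axis=1) - mu.mean())
    bound = (sd ** 2).sum() / n ** 2 / eps ** 2
    print(n, np.mean(dev >= eps), bound)
```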
The convergence results carry over easily to $d$-dimensional vectors, at least
if $d$ is fixed as $n$ grows.
Corollary 2
Let $\vec{X}_1, \vec{X}_2, \ldots$ be a sequence of
$d$-dimensional random vectors, with expected values $\vec{\mu}_t \equiv \mathbb{E}\left[\vec{X}_t\right]$, and define
$\overline{\vec{X}}_n \equiv n^{-1}\sum_{t=1}^{n}\vec{X}_t$ and $\overline{\vec{\mu}}_n \equiv n^{-1}\sum_{t=1}^{n}\vec{\mu}_t$. If $\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_{ti}, X_{si}] = o(n^2)$ for all coordinates $i \in 1{:}d$, then
$$\left\|\overline{\vec{X}}_n - \overline{\vec{\mu}}_n\right\| \to 0 \tag{14}$$
in $L_2$ and in probability.
Proof: Apply the theorem and the previous corollary to each coordinate
of $\vec{X}$ separately to get convergence along each coordinate, and hence
convergence of the Euclidean distance to zero.
If the sum of the covariances grows too fast, then the convergence to a deterministic limit
fails.
Corollary 3
If $\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s] \neq o(n^2)$, then $\overline{X}_n - \overline{\mu}_n \not\to 0$ in $L_2$.
Proof: $\overline{X}_n - \overline{\mu}_n \to 0$ in $L_2$ means that $\mathbb{E}\left[\left(\overline{X}_n - \overline{\mu}_n\right)^2\right] \to 0$.
From the lemma, we know that $\mathbb{E}\left[\left(\overline{X}_n - \overline{\mu}_n\right)^2\right] = n^{-2}\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s]$, which does not go to zero,
so convergence in $L_2$ must fail.
Convergence in probability is slightly more delicate.
Corollary 4
If $\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s] \neq o(n^2)$ and $\mathbb{E}\left[\left(\overline{X}_n - \overline{\mu}_n\right)^4\right] = O\left(\left(\operatorname{Var}\left[\overline{X}_n\right]\right)^2\right)$, then $\overline{X}_n - \overline{\mu}_n \not\to 0$ in probability.
Proof: Begin with the previous corollary, and apply the
Paley-Zygmund inequality (Proposition 3) to the
non-negative random variable $Z_n \equiv \left(\overline{X}_n - \overline{\mu}_n\right)^2$, whose expected
value is (again, from Lemma 1) $V_n \equiv \operatorname{Var}\left[\overline{X}_n\right] = n^{-2}\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s]$. By the
inequality, for any $\epsilon \in [0, V_n)$,
$$\Pr\left(Z_n \geq \epsilon\right) \geq \frac{\left(V_n - \epsilon\right)^2}{\mathbb{E}\left[Z_n^2\right]} \tag{15}$$
$$= \frac{\left(V_n - \epsilon\right)^2}{\mathbb{E}\left[\left(\overline{X}_n - \overline{\mu}_n\right)^4\right]}. \tag{16}$$
Since $V_n$ does not go to zero, there is a $v > 0$ with $V_n \geq 2v$ for infinitely many $n$. Restrict ourselves to $0 < \epsilon < v$. Then, for all sufficiently large such $n$,
$$\Pr\left(Z_n \geq \epsilon\right) \geq \frac{\left(V_n - \epsilon\right)^2}{\mathbb{E}\left[Z_n^2\right]} \tag{17}$$
$$\geq \frac{\left(V_n/2\right)^2}{C\,V_n^2} \tag{18}$$
$$= \frac{1}{4C} > 0, \tag{19}$$
where $C$ is a constant for which $\mathbb{E}\left[Z_n^2\right] \leq C V_n^2$ eventually, as guaranteed by the second assumption. Since $\Pr\left(\left|\overline{X}_n - \overline{\mu}_n\right| \geq \sqrt{\epsilon}\right) = \Pr\left(Z_n \geq \epsilon\right)$ does not tend to zero, $\overline{X}_n - \overline{\mu}_n \not\to 0$ in probability.
Remark 3: I suspect the extra condition needed to force non-convergence
in probability can be weakened, because the underlying Paley-Zygmund inequality
used in the proof isn’t necessarily sharp. But an example helps show that some condition
is necessary. Suppose that for each $t$, $X_t = \pm t^2$ with probability
$1/(2t^2)$ each, otherwise $X_t = 0$, and that the $X_t$ are all mutually
independent. Then $\mathbb{E}[X_t] = 0$ for all $t$, so $\overline{\mu}_n = 0$. Moreover, $\Pr\left(X_t \neq 0\right) = t^{-2}$.
Since those probabilities are summable, by the Borel-Cantelli lemma, $\Pr\left(X_t \neq 0 \text{ for infinitely many } t\right) = 0$. But then $X_t = 0$ for all but
finitely many $t$ almost surely, hence $\overline{X}_n \to 0 = \overline{\mu}_n$ in probability. On
the other hand, $\operatorname{Var}[X_t] = t^2$, so $\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s] = \sum_{t=1}^{n} t^2 \neq o(n^2)$, $\operatorname{Var}\left[\overline{X}_n\right] \to \infty$,
and $\overline{X}_n \not\to 0$ in $L_2$. Verifying that the second condition of
the corollary does not hold involves some straightforward but detailed
algebra, given in Appendix C. While this
example is deliberately stylized, it does get at what’s needed to have
convergence in probability without convergence in $L_2$: the probability that
$\left|\overline{X}_n - \overline{\mu}_n\right| > \epsilon$ has to be going to zero, no matter how small we set
$\epsilon$, but when there are fluctuations in $\overline{X}_n$ away from $\overline{\mu}_n$, they need to
be getting larger and larger. Whether this is a realistic concern or a
paranoid fear will depend on the application.
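A simulation sketch (not in the original note) may make the picture more vivid; it uses the same $X_t = \pm t^2$ with probability $1/(2t^2)$ distribution described above.

```python
# Simulation sketch of the Remark 3 example: Xbar_n -> 0 in probability, yet
# Var[Xbar_n] grows, because the rare non-zero values are enormous.
import numpy as np

rng = np.random.default_rng(7)
n, reps = 5_000, 2_000
t = np.arange(1, n + 1)
p = 1.0 / t ** 2                                  # Pr(X_t != 0), summable

nonzero = rng.random((reps, n)) < p
signs = rng.choice([-1.0, 1.0], size=(reps, n))
x = nonzero * signs * t.astype(float) ** 2
xbar_n = x.sum(axis=1) / n                        # time average at time n

print(np.mean(np.abs(xbar_n) > 0.1))              # small, and -> 0 as n grows
print((t.astype(float) ** 2).sum() / n ** 2)      # exact Var[Xbar_n], roughly n/3
print(xbar_n.var())                               # empirical variance: huge and noisy,
                                                  # driven entirely by a few rare paths
```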
Remark 4: Convergence of the $\overline{X}_n$ not to the deterministic $\overline{\mu}_n$ but to
a random limit, as in the full mean-square ergodic theorem for weakly
stationary sequences (as given by, e.g., Grimmett and
Stirzaker (1992, §9.5, theorem
3)), would seem to require more advanced tools.
Credit
I do not know the history of Theorem 1, but I want to
emphasize again that it is not original to me. I learned it, without
attribution to any particular source or even a name, when studying statistical
mechanics in the physics department of the University of Wisconsin-Madison in
the mid-1990s. A version of the argument which assumes weak (second-order)
stationarity of the $X_t$ but allows for continuous time appears in Frisch (1995, pp. 50–51). The oldest version of the result I have been able
to locate is Taylor (1922). (While the
paper was not published until 1922, it was read before the London Mathematical
Society in 1920.) This again develops the result assuming weak stationarity,
but in both discrete and continuous time. Taylor presents this as a new
result, but someone else might be able to claim historical priority.
Acknowledgments
I am grateful to David Darmon and Paul J. Wolfson for correspondence which led
me to write this; to Carnegie Mellon University for supporting a sabbatical
year in 2017–2018; and to my students in 36-462, “Data over Space and Time”,
in 2018 and 2020, for letting me test versions of this material on them.
References
Frisch, Uriel (1995). Turbulence: The Legacy of A. N. Kolmogorov. Cambridge, England: Cambridge University Press.
Grimmett, G. R. and D. R. Stirzaker (1992). Probability and Random Processes. Oxford: Oxford University Press, 2nd edn.
Taylor, G. I. (1922). “Diffusion by Continuous Movements.” Proceedings of the London Mathematical Society, 20: 196–212. doi:10.1112/plms/s2-20.1.196.
Appendix A Upper Bounds: Markov and Chebyshev
Going from $L_2$ convergence to convergence in probability uses an inequality
which has come to be associated with the name of Chebyshev:
Proposition 1 (Chebyshev inequality)
For any real-valued random variable $Z$ with a finite variance, and any $\epsilon > 0$,
$$\Pr\left(|Z - \mathbb{E}[Z]| \geq \epsilon\right) \leq \frac{\operatorname{Var}[Z]}{\epsilon^2}.$$
This is itself an easy consequence of another inequality:
Proposition 2 (Markov inequality)
For any non-negative, real-valued random variable $Y$, and any $c > 0$,
$$\Pr\left(Y \geq c\right) \leq \frac{\mathbb{E}[Y]}{c}.$$
The intuition behind Markov’s inequality is simple: the probability of
$Y$ being large can’t be too big, without also driving up the expected value of
$Y$.
Proof (of Proposition 2): For any event
$A$, $Y = Y\mathbf{1}_A + Y\mathbf{1}_{A^c}$, where $\mathbf{1}_A$ is the indicator variable of $A$. So, clearly,
$$\mathbb{E}[Y] = \mathbb{E}\left[Y\mathbf{1}_A\right] + \mathbb{E}\left[Y\mathbf{1}_{A^c}\right],$$
and thus
$$\mathbb{E}[Y] = \mathbb{E}\left[Y\mathbf{1}(Y \geq c)\right] + \mathbb{E}\left[Y\mathbf{1}(Y < c)\right] \tag{20}$$
$$\geq \mathbb{E}\left[Y\mathbf{1}(Y \geq c)\right] \tag{21}$$
$$\geq \mathbb{E}\left[c\,\mathbf{1}(Y \geq c)\right] \tag{22}$$
$$= c\,\Pr\left(Y \geq c\right); \tag{23}$$
dividing through by $c$ now gives the inequality,
as was to be shown.
Proof (of Proposition 1):
$|Z - \mathbb{E}[Z]| \geq \epsilon$ if and only if
$(Z - \mathbb{E}[Z])^2 \geq \epsilon^2$. But $(Z - \mathbb{E}[Z])^2$ is a
non-negative, real-valued random variable, with expected value $\operatorname{Var}[Z]$, so
the proposition follows by applying Proposition 2 with $c = \epsilon^2$.
Appendix B Lower Bounds: Paley-Zygmund
Markov’s inequality says that the probability of large values can’t be too
high, without increasing the expected value. A counterpart inequality
essentially says that the probability of large values can’t be too small,
either, without decreasing the expected value. Start with
Equation 20, again assuming $Y \geq 0$, and now taking $0 \leq c < \mathbb{E}[Y]$:
$$\mathbb{E}[Y] = \mathbb{E}\left[Y\mathbf{1}(Y \geq c)\right] + \mathbb{E}\left[Y\mathbf{1}(Y < c)\right] \tag{24}$$
$$\leq \mathbb{E}\left[Y\mathbf{1}(Y \geq c)\right] + c \tag{25}$$
$$\leq \sqrt{\mathbb{E}\left[Y^2\right]\,\mathbb{E}\left[\mathbf{1}(Y \geq c)^2\right]} + c \tag{26}$$
$$= \sqrt{\mathbb{E}\left[Y^2\right]\Pr\left(Y \geq c\right)} + c, \tag{27}$$
so, re-arranging,
$$\Pr\left(Y \geq c\right) \geq \frac{\left(\mathbb{E}[Y] - c\right)^2}{\mathbb{E}\left[Y^2\right]}, \tag{28}$$
where Eq. 26 uses the Cauchy-Schwarz inequality.
We have thus proved
Proposition 3 (Paley-Zygmund Inequality)
For a non-negative random variable $Y$ with $\mathbb{E}\left[Y^2\right] < \infty$, and any $0 \leq c < \mathbb{E}[Y]$,
$$\Pr\left(Y \geq c\right) \geq \frac{\left(\mathbb{E}[Y] - c\right)^2}{\mathbb{E}\left[Y^2\right]}. \tag{29}$$
(The proposition is usually stated in the form
$$\Pr\left(Y \geq \theta\,\mathbb{E}[Y]\right) \geq (1 - \theta)^2\,\frac{\left(\mathbb{E}[Y]\right)^2}{\mathbb{E}\left[Y^2\right]} \tag{30}$$
for $\theta \in [0, 1)$, but this is clearly equivalent.)
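A quick numerical spot-check of the three inequalities (not in the original note), using a standard exponential random variable, for which $\mathbb{E}[Y] = 1$, $\operatorname{Var}[Y] = 1$, and $\mathbb{E}\left[Y^2\right] = 2$:

```python
# Numerical spot-check of the Markov, Chebyshev, and Paley-Zygmund inequalities.
import numpy as np

rng = np.random.default_rng(3)
y = rng.exponential(scale=1.0, size=1_000_000)    # E[Y] = 1, Var[Y] = 1, E[Y^2] = 2

c = 3.0
print(np.mean(y >= c), "<=", y.mean() / c)                             # Markov (Prop. 2)
eps = 2.0
print(np.mean(np.abs(y - y.mean()) >= eps), "<=", y.var() / eps ** 2)  # Chebyshev (Prop. 1)
theta = 0.5
lower = (1 - theta) ** 2 * y.mean() ** 2 / np.mean(y ** 2)             # Eq. 30
print(np.mean(y >= theta * y.mean()), ">=", lower)                     # Paley-Zygmund (Prop. 3)
```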
Appendix C Algebraic Details for Remark 3
The example in Remark 3 converges in probability but not in $L_2$, so we need
to verify that one or the other of the conditions of Corollary
4 fails. In fact, the failing condition is the one
about fourth moments, that $\mathbb{E}\left[\left(\overline{X}_n - \overline{\mu}_n\right)^4\right] = O\left(\left(\operatorname{Var}\left[\overline{X}_n\right]\right)^2\right)$. Since $\overline{\mu}_n = 0$ here, the relevant quantity is the fourth moment of $\overline{X}_n$ itself.
$$\mathbb{E}\left[\overline{X}_n^4\right] = \frac{1}{n^4}\mathbb{E}\left[\left(\sum_{t=1}^{n} X_t\right)^4\right] \tag{31}$$
$$= \frac{1}{n^4}\sum_{t=1}^{n}\sum_{s=1}^{n}\sum_{u=1}^{n}\sum_{v=1}^{n}\mathbb{E}\left[X_t X_s X_u X_v\right]. \tag{32}$$
Grouping the quadruples $(t, s, u, v)$ according to which indices coincide, the terms in this sum are of the forms $\mathbb{E}\left[X_t^4\right]$, $\mathbb{E}\left[X_t^3 X_s\right]$, $\mathbb{E}\left[X_t^2 X_s^2\right]$, $\mathbb{E}\left[X_t^2 X_s X_u\right]$ and $\mathbb{E}\left[X_t X_s X_u X_v\right]$ (with distinct indices throughout). Take the terms appearing in the expression for $\mathbb{E}\left[\overline{X}_n^4\right]$ one at a time:
$$\mathbb{E}\left[X_t\right] = \frac{1}{2t^2}\left(t^2\right) + \frac{1}{2t^2}\left(-t^2\right) = 0 \tag{33}$$
$$\mathbb{E}\left[X_t^2\right] = \operatorname{Var}\left[X_t\right] = \frac{1}{t^2}\,t^4 = t^2 \tag{34}$$
$$\mathbb{E}\left[X_t^3\right] = 0 \tag{35}$$
$$\mathbb{E}\left[X_t^4\right] = \frac{1}{t^2}\,t^8 = t^6 \tag{36}$$
$$\mathbb{E}\left[X_t^2 X_s^2\right] = \mathbb{E}\left[X_t^2\right]\mathbb{E}\left[X_s^2\right] \quad (t \neq s) \tag{37}$$
$$= t^2 s^2, \tag{38}$$
using the fact that the expectation of the product of two independent
variables is the product of their expectations.
On the other hand, for $t \neq s$,
$$\mathbb{E}\left[X_t^3 X_s\right] = \mathbb{E}\left[X_t^3\right]\mathbb{E}\left[X_s\right] = 0 \tag{39}$$
by independence of the $X_t$s. Similarly, unless $u = t$ or $u = s$, $\mathbb{E}\left[X_t X_s X_u^2\right] = \mathbb{E}\left[X_t\right]\mathbb{E}\left[X_s\right]\mathbb{E}\left[X_u^2\right] = 0$. Since $t \neq s$, at most one of $t$ and $s$ could equal $u$, so,
without loss of generality, say it’s $s$. Then we have
$$\mathbb{E}\left[X_t X_s X_u^2\right] = \mathbb{E}\left[X_t X_s^3\right] \tag{40}$$
$$= \mathbb{E}\left[X_t\right]\mathbb{E}\left[X_s^3\right] \tag{41}$$
$$= 0. \tag{42}$$
So the terms of the form $\mathbb{E}\left[X_t^2 X_s X_u\right]$ vanish
as well. Likewise if all the indices are
distinct; even when indices overlap, everything boils down to zero unless the indices pair off into the $\mathbb{E}\left[X_t^4\right]$ or $\mathbb{E}\left[X_t^2 X_s^2\right]$ patterns.
Going back to the fourth moment of $\overline{X}_n$, then,
$$\mathbb{E}\left[\overline{X}_n^4\right] = \frac{1}{n^4}\left(\sum_{t=1}^{n}\mathbb{E}\left[X_t^4\right] + 3\sum_{t \neq s}\mathbb{E}\left[X_t^2\right]\mathbb{E}\left[X_s^2\right]\right) = \frac{1}{n^4}\left(\sum_{t=1}^{n} t^6 + 3\sum_{t \neq s} t^2 s^2\right) = \Theta\left(n^3\right). \tag{43}$$
Now $V_n = \operatorname{Var}\left[\overline{X}_n\right] = n^{-2}\sum_{t=1}^{n} t^2 = \Theta(n)$, so $V_n^2 = \Theta\left(n^2\right)$, but $\mathbb{E}\left[Z_n^2\right] = \mathbb{E}\left[\overline{X}_n^4\right] = \Theta\left(n^3\right)$, so the second condition of the corollary is
indeed violated.
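The same conclusion can be checked numerically from the exact moments (a sketch of mine, using the same specific example as above):

```python
# Exact-moment check that the fourth-moment condition of Corollary 4 fails here:
# E[Xbar_n^4] / Var[Xbar_n]^2 grows without bound instead of staying O(1).
import numpy as np

for n in (10, 100, 1_000, 10_000):
    t = np.arange(1, n + 1, dtype=float)
    var_t = t ** 2                                # Var[X_t] = E[X_t^2]
    m4_t = t ** 6                                 # E[X_t^4]
    V_n = var_t.sum() / n ** 2                    # Var[Xbar_n], which does not go to 0
    # Only the all-equal and paired index patterns survive in E[(sum X_t)^4]:
    EZ2 = (m4_t.sum() + 3.0 * (var_t.sum() ** 2 - (var_t ** 2).sum())) / n ** 4
    print(n, V_n, EZ2 / V_n ** 2)                 # the ratio grows roughly linearly in n
```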