A Simple Non-Stationary Mean Ergodic Theorem, with Bonus Weak Law of
Large Numbers
Cosma Rohilla Shalizi
Departments of Statistics and of Machine
Learning, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, and the Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501.
(1 November 2021, small revision 19 March 2022)
Abstract
This brief pedagogical note re-proves a simple theorem on the convergence, in
$L_2$ and in probability, of time averages of non-stationary time series to
the mean of the expectation values. The basic condition is that the sum of
covariances grows sub-quadratically with the length of the time series. I
make no claim to originality; the result is a widely, but unevenly, spread bit
of folklore among users of applied probability. The goal of this note
is merely to even out that distribution.
I assume some familiarity with basic probability and stochastic processes,
along the lines of Grimmett and
Stirzaker (1992), but re-prove a number of basic
results, to show that everything really is elementary.
Let $X_1, X_2, \ldots, X_n, \ldots$ be a sequence of real-valued random
variables. Assume that $\mu_t \equiv \mathbb{E}[X_t]$ and $\sigma^2_t \equiv \operatorname{Var}[X_t]$ exist and are finite for all $t$. The time
average of the $X_t$s,
$$\overline{X}_n \equiv \frac{1}{n}\sum_{t=1}^{n} X_t \tag{1}$$
converges on the average of the expectation values,
$$\overline{\mu}_n \equiv \frac{1}{n}\sum_{t=1}^{n} \mu_t \tag{2}$$
under a condition on the sum of the covariances,
$$\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s] = o(n^2). \tag{3}$$
The condition will be obvious after the following lemma, which will also be
useful for extensions.
Lemma 1
$$\mathbb{E}\left[\left(\overline{X}_n - \overline{\mu}_n\right)^2\right] = \frac{1}{n^2}\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s] \tag{4}$$
Proof: Since, for any random variable $Y$ with finite variance,
$\operatorname{Var}[Y] = \mathbb{E}\left[(Y - \mathbb{E}[Y])^2\right]$, we have
$$\mathbb{E}\left[\left(\overline{X}_n - \mathbb{E}\left[\overline{X}_n\right]\right)^2\right] = \operatorname{Var}\left[\overline{X}_n\right]. \tag{5}$$
By linearity of expectation,
$$\mathbb{E}\left[\overline{X}_n\right] = \frac{1}{n}\sum_{t=1}^{n}\mathbb{E}[X_t] = \overline{\mu}_n. \tag{6}$$
On the other hand,
$$\operatorname{Var}\left[\overline{X}_n\right] = \frac{1}{n^2}\operatorname{Var}\left[\sum_{t=1}^{n} X_t\right] \tag{7}$$
$$= \frac{1}{n^2}\operatorname{Cov}\left[\sum_{t=1}^{n} X_t, \sum_{s=1}^{n} X_s\right] \tag{8}$$
$$= \frac{1}{n^2}\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s] \tag{9}$$
through repeated application of the identity $\operatorname{Cov}[aY + bZ, W] = a\operatorname{Cov}[Y, W] + b\operatorname{Cov}[Z, W]$, and the definition $\operatorname{Var}[Y] = \operatorname{Cov}[Y, Y]$. Substituting Eqs. 6 and 9 into Eq. 5,
$$\mathbb{E}\left[\left(\overline{X}_n - \overline{\mu}_n\right)^2\right] = \frac{1}{n^2}\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s], \tag{10}$$
as was to be shown.
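(As a quick sanity check on Eq. 4, not part of the original argument: the following Python/NumPy sketch compares a Monte Carlo estimate of $\mathbb{E}\left[\left(\overline{X}_n - \overline{\mu}_n\right)^2\right]$ with $n^{-2}\sum_{t,s}\operatorname{Cov}[X_t, X_s]$ for an arbitrarily chosen non-stationary Gaussian sequence; the particular means and covariances are illustrative assumptions of mine, not anything from the note.)

```python
# Monte Carlo check of Lemma 1 / Eq. 4 on an invented non-stationary example.
import numpy as np

rng = np.random.default_rng(42)
n = 50
t = np.arange(1, n + 1)
mu = np.sin(t / 5.0)                                   # time-varying means
sd = 1.0 + 0.5 * np.cos(t / 7.0)                       # time-varying scales
Sigma = np.outer(sd, sd) * 0.9 ** np.abs(np.subtract.outer(t, t))  # decaying correlations

draws = rng.multivariate_normal(mu, Sigma, size=100_000)
xbar = draws.mean(axis=1)                              # time average, one per replicate
mubar = mu.mean()

print(np.mean((xbar - mubar) ** 2))                    # simulated E[(Xbar_n - mubar_n)^2]
print(Sigma.sum() / n ** 2)                            # n^{-2} sum_{t,s} Cov[X_t, X_s]
```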
Theorem 1 (Mean ergodic theorem)
If $\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s] = o(n^2)$, then $\overline{X}_n - \overline{\mu}_n \to 0$ in $L_2$ as $n \to \infty$.
Proof: Convergence in $L_2$ means that $\mathbb{E}\left[\left(\overline{X}_n - \overline{\mu}_n\right)^2\right] \to 0$, and, by the lemma, $\mathbb{E}\left[\left(\overline{X}_n - \overline{\mu}_n\right)^2\right] = n^{-2}\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s]$. By
the assumption of the theorem, $\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s] = o(n^2)$, hence $n^{-2}\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s] = o(1)$, so we
have
$$\lim_{n \to \infty}\mathbb{E}\left[\left(\overline{X}_n - \overline{\mu}_n\right)^2\right] = 0. \tag{11}$$
Thus $\overline{X}_n - \overline{\mu}_n \to 0$ in $L_2$; this is a mean (or mean-square) ergodic
theorem.
Since $\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s]$ is a sum of $n^2$ terms, if the sum is to be $o(n^2)$,
most of those terms must be shrinking rapidly to zero as $n$ grows. That is,
$\operatorname{Cov}[X_t, X_s]$ must go to zero as $|t - s| \to \infty$. Stationarity
(see immediately below) is not required.
Definition 1
The sequence of $X_t$s is weakly or second-order stationary when
$\mathbb{E}[X_t] = \mu$ and $\operatorname{Cov}[X_t, X_{t+h}] = \gamma(h)$ for some constant $\mu$ and some function $\gamma$,
and all $t$ and $h$.
If one does assume weak stationarity, a sufficient condition for the theorem is that $\sum_{h=-\infty}^{\infty}|\gamma(h)| < \infty$, since in that case $\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s] \leq n\sum_{h=-\infty}^{\infty}|\gamma(h)| = O(n) = o(n^2)$. The quantity $\tau \equiv \sum_{h=-\infty}^{\infty}\gamma(h)/\gamma(0)$ has names like “correlation
time”, “autocorrelation time”, “integrated autocovariance time”,
etc. From Lemma 1, for
uncorrelated variables, $\operatorname{Var}\left[\overline{X}_n\right] = \gamma(0)/n$, but when the covariances are summable, $\operatorname{Var}\left[\overline{X}_n\right] \approx \tau\gamma(0)/n$ for large $n$, so the “effective sample
size” is reduced from $n$ to $n/\tau$.
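To see the $n/\tau$ effect numerically, here is a small simulation sketch (mine, not the note's) for a weakly stationary AR(1) process, $X_t = \phi X_{t-1} + \epsilon_t$ with standard Gaussian innovations, for which $\gamma(h) = \gamma(0)\phi^{|h|}$ and so $\tau = (1+\phi)/(1-\phi)$.

```python
# Illustration of the reduced "effective sample size" n/tau for an AR(1) process.
import numpy as np

rng = np.random.default_rng(0)
phi, n, reps = 0.8, 5_000, 1_000
gamma0 = 1.0 / (1 - phi ** 2)            # Var[X_t] for unit-variance innovations
tau = (1 + phi) / (1 - phi)              # integrated autocorrelation time (= 9 here)

x = np.empty((reps, n))
x[:, 0] = rng.standard_normal(reps) * np.sqrt(gamma0)  # start in the stationary law
eps = rng.standard_normal((reps, n))
for i in range(1, n):
    x[:, i] = phi * x[:, i - 1] + eps[:, i]

print(x.mean(axis=1).var())              # empirical Var[Xbar_n]
print(tau * gamma0 / n)                  # ~ tau * gamma(0) / n, as in the text
print(gamma0 / n)                        # the iid value, smaller by a factor of tau
```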
It is unnecessary for the theorem to assume that $\overline{\mu}_n$ converges, i.e., that
the instantaneous expectations $\mu_t$ are Cesàro-convergent. Still less
is it necessary to assume that the $\mu_t$ have a limit. $\operatorname{Var}[X_t] = \sigma^2_t$ need not
be constant or tending to a limit either, though it cannot grow too fast.
$L_2$ convergence implies convergence in probability, or what is usually
known as the weak law of large numbers.
Corollary 1 (Weak law of large numbers)
Under the conditions of the theorem, $\overline{X}_n - \overline{\mu}_n \to 0$ in probability.
Proof: Use Chebyshev’s inequality (re-proved below as Proposition
1): for any random variable $Z$ with finite variance, and any $\epsilon > 0$,
$$\Pr\left(|Z - \mathbb{E}[Z]| \geq \epsilon\right) \leq \frac{\operatorname{Var}[Z]}{\epsilon^2}. \tag{12}$$
Applied to $\overline{X}_n$, this gives, for each fixed $\epsilon > 0$,
$$\Pr\left(\left|\overline{X}_n - \overline{\mu}_n\right| \geq \epsilon\right) \leq \frac{\operatorname{Var}\left[\overline{X}_n\right]}{\epsilon^2} \to 0, \tag{13}$$
which is the definition of convergence to 0 in probability.
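As a concrete, if unnecessary, illustration of Corollary 1 (again a sketch of my own, with an arbitrarily invented independent-but-non-identically-distributed sequence), the estimated probability of a deviation of at least $\epsilon$ shrinks with $n$ and stays below the Chebyshev bound $\operatorname{Var}\left[\overline{X}_n\right]/\epsilon^2$:

```python
# Weak-law check: Pr(|Xbar_n - mubar_n| >= eps) versus the Chebyshev bound.
import numpy as np

rng = np.random.default_rng(1)
eps, reps = 0.1, 2_000
for n in (100, 1_000, 10_000):
    t = np.arange(1, n + 1)
    mu = np.log(t)                       # non-constant means
    sd = 1.0 + 1.0 / t                   # non-constant standard deviations
    x = rng.normal(mu, sd, size=(reps, n))   # independent, non-identically distributed
    dev = np.abs(x.mean(axis=1) - mu.mean())
    bound = (sd ** 2).sum() / n ** 2 / eps ** 2
    print(n, np.mean(dev >= eps), bound)
```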
The convergence results carry over easily to $d$-dimensional vectors, at least
if $d$ is fixed as $n$ grows.
Corollary 2
Let $\vec{X}_1, \vec{X}_2, \ldots$ be a sequence of
$d$-dimensional random vectors, with expected values $\vec{\mu}_t \equiv \mathbb{E}\left[\vec{X}_t\right]$, and define
$\overline{\vec{X}}_n \equiv n^{-1}\sum_{t=1}^{n}\vec{X}_t$ and $\overline{\vec{\mu}}_n \equiv n^{-1}\sum_{t=1}^{n}\vec{\mu}_t$. If $\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_{ti}, X_{si}] = o(n^2)$ for all coordinates $i \in 1{:}d$, then
$$\left\|\overline{\vec{X}}_n - \overline{\vec{\mu}}_n\right\| \to 0 \tag{14}$$
in $L_2$ and in probability.
Proof: Apply the theorem and the previous corollary to each coordinate
of $\vec{X}$ separately to get convergence along each coordinate, and hence
convergence of the Euclidean distance to zero.
If the sum of the covariances grows too fast, then the convergence to a deterministic limit
fails.
Corollary 3
If $\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s] \neq o(n^2)$, then $\overline{X}_n - \overline{\mu}_n \not\to 0$ in $L_2$.
Proof: $\overline{X}_n - \overline{\mu}_n \to 0$ in $L_2$ means that $\mathbb{E}\left[\left(\overline{X}_n - \overline{\mu}_n\right)^2\right] \to 0$.
From the lemma, we know that $\mathbb{E}\left[\left(\overline{X}_n - \overline{\mu}_n\right)^2\right] = n^{-2}\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s]$, which does not go to zero,
so convergence in $L_2$ must fail.
Convergence in probability is slightly more delicate.
Corollary 4
If $\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s] \neq o(n^2)$ and $\mathbb{E}\left[\left(\overline{X}_n - \overline{\mu}_n\right)^4\right] = O\left(\left(\operatorname{Var}\left[\overline{X}_n\right]\right)^2\right)$, then $\overline{X}_n - \overline{\mu}_n \not\to 0$ in probability.
Proof: Begin with the previous corollary, and apply the
Paley-Zygmund inequality (Proposition 3) to the
non-negative random variable $Z_n \equiv \left(\overline{X}_n - \overline{\mu}_n\right)^2$, whose expected
value is (again, from Lemma 1) $V_n \equiv \operatorname{Var}\left[\overline{X}_n\right] = n^{-2}\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s]$. By the
inequality, for any $\epsilon \in [0, V_n)$,
$$\Pr\left(Z_n \geq \epsilon\right) \geq \frac{\left(V_n - \epsilon\right)^2}{\mathbb{E}\left[Z_n^2\right]} \tag{15}$$
$$= \frac{\left(V_n - \epsilon\right)^2}{\mathbb{E}\left[\left(\overline{X}_n - \overline{\mu}_n\right)^4\right]}. \tag{16}$$
Since $V_n$ does not go to zero, there is a $v > 0$ with $V_n \geq 2v$ for infinitely many $n$. Restrict ourselves to $0 < \epsilon < v$. Then, for all sufficiently large such $n$,
$$\Pr\left(Z_n \geq \epsilon\right) \geq \frac{\left(V_n - \epsilon\right)^2}{\mathbb{E}\left[Z_n^2\right]} \tag{17}$$
$$\geq \frac{\left(V_n/2\right)^2}{C\,V_n^2} \tag{18}$$
$$= \frac{1}{4C} > 0, \tag{19}$$
where $C$ is a constant for which $\mathbb{E}\left[Z_n^2\right] \leq C V_n^2$ eventually, as guaranteed by the second assumption. Since $\Pr\left(\left|\overline{X}_n - \overline{\mu}_n\right| \geq \sqrt{\epsilon}\right) = \Pr\left(Z_n \geq \epsilon\right)$ does not tend to zero, $\overline{X}_n - \overline{\mu}_n \not\to 0$ in probability.
Remark 3: I suspect the extra condition needed to force non-convergence
in probability can be weakened, because the underlying Paley-Zygmund inequality
used in the proof isn’t necessarily sharp. But an example helps show that some condition
is necessary. Suppose that for each $t$, $X_t = \pm t^2$ with probability
$1/(2t^2)$ each, otherwise $X_t = 0$, and that the $X_t$ are all mutually
independent. Then $\mathbb{E}[X_t] = 0$ for all $t$, so $\overline{\mu}_n = 0$. Moreover, $\Pr\left(X_t \neq 0\right) = t^{-2}$.
Since those probabilities are summable, by the Borel-Cantelli lemma, $\Pr\left(X_t \neq 0 \text{ for infinitely many } t\right) = 0$. But then $X_t = 0$ for all but
finitely many $t$ almost surely, hence $\overline{X}_n \to 0 = \overline{\mu}_n$ in probability. On
the other hand, $\operatorname{Var}[X_t] = t^2$, so $\sum_{t=1}^{n}\sum_{s=1}^{n}\operatorname{Cov}[X_t, X_s] = \sum_{t=1}^{n} t^2 \neq o(n^2)$, $\operatorname{Var}\left[\overline{X}_n\right] \to \infty$,
and $\overline{X}_n \not\to 0$ in $L_2$. Verifying that the second condition of
the corollary does not hold involves some straightforward but detailed
algebra, given in Appendix C. While this
example is deliberately stylized, it does get at what’s needed to have
convergence in probability without convergence in $L_2$: the probability that
$\left|\overline{X}_n - \overline{\mu}_n\right| > \epsilon$ has to be going to zero, no matter how small we set
$\epsilon$, but when there are fluctuations in $\overline{X}_n$ away from $\overline{\mu}_n$, they need to
be getting larger and larger. Whether this is a realistic concern or a
paranoid fear will depend on the application.
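A simulation sketch (not in the original note) may make the picture more vivid; it uses the same $X_t = \pm t^2$ with probability $1/(2t^2)$ distribution described above.

```python
# Simulation sketch of the Remark 3 example: Xbar_n -> 0 in probability, yet
# Var[Xbar_n] grows, because the rare non-zero values are enormous.
import numpy as np

rng = np.random.default_rng(7)
n, reps = 5_000, 2_000
t = np.arange(1, n + 1)
p = 1.0 / t ** 2                                  # Pr(X_t != 0), summable

nonzero = rng.random((reps, n)) < p
signs = rng.choice([-1.0, 1.0], size=(reps, n))
x = nonzero * signs * t.astype(float) ** 2
xbar_n = x.sum(axis=1) / n                        # time average at time n

print(np.mean(np.abs(xbar_n) > 0.1))              # small, and -> 0 as n grows
print((t.astype(float) ** 2).sum() / n ** 2)      # exact Var[Xbar_n], roughly n/3
print(xbar_n.var())                               # empirical variance: huge and noisy,
                                                  # driven entirely by a few rare paths
```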
Remark 4: Convergence of the $\overline{X}_n$ not to the deterministic $\overline{\mu}_n$ but to
a random limit, as in the full mean-square ergodic theorem for weakly
stationary sequences (as given by, e.g., Grimmett and
Stirzaker (1992, §9.5, theorem
3)), would seem to require more advanced tools.
Credit
I do not know the history of Theorem 1, but I want to
emphasize again that it is not original to me. I learned it, without
attribution to any particular source or even a name, when studying statistical
mechanics in the physics department of the University of Wisconsin-Madison in
the mid-1990s. A version of the argument which assumes weak (second-order)
stationarity of the $X_t$ but allows for continuous time appears in Frisch (1995, pp. 50–51). The oldest version of the result I have been able
to locate is Taylor (1922). (While the
paper was not published until 1922, it was read before the London Mathematical
Society in 1920.) This again develops the result assuming weak stationarity,
but in both discrete and continuous time. Taylor presents this as a new
result, but someone else might be able to claim historical priority.
Acknowledgments
I am grateful to David Darmon and Paul J. Wolfson for correspondence which led
me to write this; to Carnegie Mellon University for supporting a sabbatical
year in 2017–2018; and to my students in 36-462, “Data over Space and Time”,
in 2018 and 2020, for letting me test versions of this material on them.
References
Frisch, Uriel (1995). Turbulence: The Legacy of A. N. Kolmogorov. Cambridge, England: Cambridge University Press.
Grimmett, G. R. and D. R. Stirzaker (1992). Probability and Random Processes. Oxford: Oxford University Press, 2nd edn.
Taylor, G. I. (1922). “Diffusion by Continuous Movements.” Proceedings of the London Mathematical Society, 20: 196–212. doi:10.1112/plms/s2-20.1.196.
Appendix A Upper Bounds: Markov and Chebyshev
Going from $L_2$ convergence to convergence in probability uses an inequality
which has come to be associated with the name of Chebyshev:
Proposition 1 (Chebyshev inequality)
For any real-valued random variable $Z$ with a finite variance, and any $\epsilon > 0$,
$$\Pr\left(|Z - \mathbb{E}[Z]| \geq \epsilon\right) \leq \frac{\operatorname{Var}[Z]}{\epsilon^2}.$$
This is itself an easy consequence of another inequality:
Proposition 2 (Markov inequality)
For any non-negative, real-valued random variable $Y$, and any $c > 0$,
$$\Pr\left(Y \geq c\right) \leq \frac{\mathbb{E}[Y]}{c}.$$
The intuition behind Markov’s inequality is simple: the probability of
$Y$ being large can’t be too big, without also driving up the expected value of
$Y$.
Proof (of Proposition 2): For any event
$A$, $Y = Y\mathbf{1}_A + Y\mathbf{1}_{A^c}$, where $\mathbf{1}_A$ is the indicator variable of $A$. So, clearly,
$$\mathbb{E}[Y] = \mathbb{E}\left[Y\mathbf{1}_A\right] + \mathbb{E}\left[Y\mathbf{1}_{A^c}\right],$$
and thus
$$\mathbb{E}[Y] = \mathbb{E}\left[Y\mathbf{1}(Y \geq c)\right] + \mathbb{E}\left[Y\mathbf{1}(Y < c)\right] \tag{20}$$
$$\geq \mathbb{E}\left[Y\mathbf{1}(Y \geq c)\right] \tag{21}$$
$$\geq \mathbb{E}\left[c\,\mathbf{1}(Y \geq c)\right] \tag{22}$$
$$= c\,\Pr\left(Y \geq c\right); \tag{23}$$
dividing through by $c$ now gives the inequality,
as was to be shown.
Proof (of Proposition 1):
$|Z - \mathbb{E}[Z]| \geq \epsilon$ if and only if
$(Z - \mathbb{E}[Z])^2 \geq \epsilon^2$. But $(Z - \mathbb{E}[Z])^2$ is a
non-negative, real-valued random variable, with expected value $\operatorname{Var}[Z]$, so
the proposition follows by applying Proposition 2 with $c = \epsilon^2$.
Appendix B Lower Bounds: Paley-Zygmund
Markov’s inequality says that the probability of large values can’t be too
high, without increasing the expected value. A counterpart inequality
essentially says that the probability of large values can’t be too small,
either, without decreasing the expected value. Start with
Equation 20, again assuming $Y \geq 0$, and now taking $0 \leq c < \mathbb{E}[Y]$:
$$\mathbb{E}[Y] = \mathbb{E}\left[Y\mathbf{1}(Y \geq c)\right] + \mathbb{E}\left[Y\mathbf{1}(Y < c)\right] \tag{24}$$
$$\leq \mathbb{E}\left[Y\mathbf{1}(Y \geq c)\right] + c \tag{25}$$
$$\leq \sqrt{\mathbb{E}\left[Y^2\right]\,\mathbb{E}\left[\mathbf{1}(Y \geq c)^2\right]} + c \tag{26}$$
$$= \sqrt{\mathbb{E}\left[Y^2\right]\Pr\left(Y \geq c\right)} + c, \tag{27}$$
so, re-arranging,
$$\Pr\left(Y \geq c\right) \geq \frac{\left(\mathbb{E}[Y] - c\right)^2}{\mathbb{E}\left[Y^2\right]}, \tag{28}$$
where Eq. 26 uses the Cauchy-Schwarz inequality.
We have thus proved
Proposition 3 (Paley-Zygmund Inequality)
For a non-negative random variable $Y$ with $\mathbb{E}\left[Y^2\right] < \infty$, and any $0 \leq c < \mathbb{E}[Y]$,
$$\Pr\left(Y \geq c\right) \geq \frac{\left(\mathbb{E}[Y] - c\right)^2}{\mathbb{E}\left[Y^2\right]}. \tag{29}$$
(The proposition is usually stated in the form
$$\Pr\left(Y \geq \theta\,\mathbb{E}[Y]\right) \geq (1 - \theta)^2\,\frac{\left(\mathbb{E}[Y]\right)^2}{\mathbb{E}\left[Y^2\right]} \tag{30}$$
for $\theta \in [0, 1)$, but this is clearly equivalent.)
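A quick numerical spot-check of the three inequalities (not in the original note), using a standard exponential random variable, for which $\mathbb{E}[Y] = 1$, $\operatorname{Var}[Y] = 1$, and $\mathbb{E}\left[Y^2\right] = 2$:

```python
# Numerical spot-check of the Markov, Chebyshev, and Paley-Zygmund inequalities.
import numpy as np

rng = np.random.default_rng(3)
y = rng.exponential(scale=1.0, size=1_000_000)    # E[Y] = 1, Var[Y] = 1, E[Y^2] = 2

c = 3.0
print(np.mean(y >= c), "<=", y.mean() / c)                             # Markov (Prop. 2)
eps = 2.0
print(np.mean(np.abs(y - y.mean()) >= eps), "<=", y.var() / eps ** 2)  # Chebyshev (Prop. 1)
theta = 0.5
lower = (1 - theta) ** 2 * y.mean() ** 2 / np.mean(y ** 2)             # Eq. 30
print(np.mean(y >= theta * y.mean()), ">=", lower)                     # Paley-Zygmund (Prop. 3)
```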
Appendix C Algebraic Details for Remark 3
The example in Remark 3 converges in probability but not in $L_2$, so we need
to verify that one or the other of the conditions of Corollary
4 fails. In fact, the failing condition is the one
about fourth moments, that $\mathbb{E}\left[\left(\overline{X}_n - \overline{\mu}_n\right)^4\right] = O\left(\left(\operatorname{Var}\left[\overline{X}_n\right]\right)^2\right)$. Since $\overline{\mu}_n = 0$ here, the relevant quantity is the fourth moment of $\overline{X}_n$ itself.
$$\mathbb{E}\left[\overline{X}_n^4\right] = \frac{1}{n^4}\mathbb{E}\left[\left(\sum_{t=1}^{n} X_t\right)^4\right] \tag{31}$$
$$= \frac{1}{n^4}\sum_{t=1}^{n}\sum_{s=1}^{n}\sum_{u=1}^{n}\sum_{v=1}^{n}\mathbb{E}\left[X_t X_s X_u X_v\right]. \tag{32}$$
Grouping the quadruples $(t, s, u, v)$ according to which indices coincide, the terms in this sum are of the forms $\mathbb{E}\left[X_t^4\right]$, $\mathbb{E}\left[X_t^3 X_s\right]$, $\mathbb{E}\left[X_t^2 X_s^2\right]$, $\mathbb{E}\left[X_t^2 X_s X_u\right]$ and $\mathbb{E}\left[X_t X_s X_u X_v\right]$ (with distinct indices throughout). Take the terms appearing in the expression for $\mathbb{E}\left[\overline{X}_n^4\right]$ one at a time:
$$\mathbb{E}\left[X_t\right] = \frac{1}{2t^2}\left(t^2\right) + \frac{1}{2t^2}\left(-t^2\right) = 0 \tag{33}$$
$$\mathbb{E}\left[X_t^2\right] = \operatorname{Var}\left[X_t\right] = \frac{1}{t^2}\,t^4 = t^2 \tag{34}$$
$$\mathbb{E}\left[X_t^3\right] = 0 \tag{35}$$
$$\mathbb{E}\left[X_t^4\right] = \frac{1}{t^2}\,t^8 = t^6 \tag{36}$$
$$\mathbb{E}\left[X_t^2 X_s^2\right] = \mathbb{E}\left[X_t^2\right]\mathbb{E}\left[X_s^2\right] \quad (t \neq s) \tag{37}$$
$$= t^2 s^2, \tag{38}$$
using the fact that the expectation of the product of two independent
variables is the product of their expectations.
On the other hand, for $t \neq s$,
$$\mathbb{E}\left[X_t^3 X_s\right] = \mathbb{E}\left[X_t^3\right]\mathbb{E}\left[X_s\right] = 0 \tag{39}$$
by independence of the $X_t$s. Similarly, unless $u = t$ or $u = s$, $\mathbb{E}\left[X_t X_s X_u^2\right] = \mathbb{E}\left[X_t\right]\mathbb{E}\left[X_s\right]\mathbb{E}\left[X_u^2\right] = 0$. Since $t \neq s$, at most one of $t$ and $s$ could equal $u$, so,
without loss of generality, say it’s $s$. Then we have
$$\mathbb{E}\left[X_t X_s X_u^2\right] = \mathbb{E}\left[X_t X_s^3\right] \tag{40}$$
$$= \mathbb{E}\left[X_t\right]\mathbb{E}\left[X_s^3\right] \tag{41}$$
$$= 0. \tag{42}$$
So the terms of the form $\mathbb{E}\left[X_t^2 X_s X_u\right]$ vanish
as well. Likewise if all the indices are
distinct; even when indices overlap, everything boils down to zero unless the indices pair off into the $\mathbb{E}\left[X_t^4\right]$ or $\mathbb{E}\left[X_t^2 X_s^2\right]$ patterns.
Going back to the fourth moment of $\overline{X}_n$, then,
$$\mathbb{E}\left[\overline{X}_n^4\right] = \frac{1}{n^4}\left(\sum_{t=1}^{n}\mathbb{E}\left[X_t^4\right] + 3\sum_{t \neq s}\mathbb{E}\left[X_t^2\right]\mathbb{E}\left[X_s^2\right]\right) = \frac{1}{n^4}\left(\sum_{t=1}^{n} t^6 + 3\sum_{t \neq s} t^2 s^2\right) = \Theta\left(n^3\right). \tag{43}$$
Now $V_n = \operatorname{Var}\left[\overline{X}_n\right] = n^{-2}\sum_{t=1}^{n} t^2 = \Theta(n)$, so $V_n^2 = \Theta\left(n^2\right)$, but $\mathbb{E}\left[Z_n^2\right] = \mathbb{E}\left[\overline{X}_n^4\right] = \Theta\left(n^3\right)$, so the second condition of the corollary is
indeed violated.
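The same conclusion can be checked numerically from the exact moments (a sketch of mine, using the same specific example as above):

```python
# Exact-moment check that the fourth-moment condition of Corollary 4 fails here:
# E[Xbar_n^4] / Var[Xbar_n]^2 grows without bound instead of staying O(1).
import numpy as np

for n in (10, 100, 1_000, 10_000):
    t = np.arange(1, n + 1, dtype=float)
    var_t = t ** 2                                # Var[X_t] = E[X_t^2]
    m4_t = t ** 6                                 # E[X_t^4]
    V_n = var_t.sum() / n ** 2                    # Var[Xbar_n], which does not go to 0
    # Only the all-equal and paired index patterns survive in E[(sum X_t)^4]:
    EZ2 = (m4_t.sum() + 3.0 * (var_t.sum() ** 2 - (var_t ** 2).sum())) / n ** 4
    print(n, V_n, EZ2 / V_n ** 2)                 # the ratio grows roughly linearly in n
```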