Analyzing Generalized Pólya Urn Models using Martingales, with an Application to Viral Evolution
Abstract
The randomized play-the-winner (RPW) model is a generalized Pólya Urn process with broad applications ranging from clinical trials to molecular evolution. We derive an exact expression for the variance of the RPW model by transforming the Pólya Urn process into a martingale, correcting an earlier result of Matthews and Rosenberger (1997). We then use this result to approximate the full probability mass function of the RPW model for certain parameter values relevant to genetic applications. Finally, we fit our model to genomic sequencing data of SARS-CoV-2, demonstrating a novel method of estimating the viral mutation rate that yields results comparable to those in the existing literature.
Keywords: Pólya Urn models, branching processes, martingales, applied probability, computational genetics.
I Introduction
Consider the following generalized Pólya Urn model: An urn starts out with white balls and black balls, with . At each step , a ball in the urn is chosen uniformly at random. If the chosen ball is white, a black ball is added to the urn with probability and a white ball is added with probability , where . If the chosen ball is black, a white ball is added to the urn with probability and a black ball is added with probability , where . The originally chosen ball is then returned to the urn, so that at step , the total number of balls equals . This model, known as the randomized play-the-winner (RPW) model, has been widely studied in theoretical and applied contexts, with applications ranging from clinical trials to genetic mutations. Several papers, including Wei and Durham (1978), Smythe and Rosenberger (1995), Smythe (1996), and Rosenberger and Sriram (1997), have considered the RPW model’s asymptotic properties, deriving limit theorems related to the asymptotic fraction of white (or black) balls in the urn.
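The urn dynamics described above can be sketched as a short simulation. This is a minimal illustration, not code from the paper: the function name simulate_rpw and the parameter names p_white_to_black and p_black_to_white are hypothetical labels for the two color-switching probabilities.

```python
import random


def simulate_rpw(white, black, p_white_to_black, p_black_to_white, n_steps, seed=None):
    """Run the urn for n_steps draws and return the final (white, black) counts.

    All names here are illustrative assumptions, not the paper's notation.
    The urn must start with at least one ball.
    """
    rng = random.Random(seed)
    for _ in range(n_steps):
        # Choose a ball uniformly at random; it is returned to the urn afterwards.
        drew_white = rng.random() < white / (white + black)
        if drew_white:
            # White drawn: add black w.p. p_white_to_black, otherwise add white.
            add_white = rng.random() >= p_white_to_black
        else:
            # Black drawn: add white w.p. p_black_to_white, otherwise add black.
            add_white = rng.random() < p_black_to_white
        if add_white:
            white += 1
        else:
            black += 1
    return white, black
```

For example, simulate_rpw(1, 1, 0.05, 0.05, 1000, seed=0) returns the counts after 1000 draws. Exactly one ball is added per step, so the total always equals the initial total plus the number of steps.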
The distribution of this fraction after finitely many steps, however, is less well understood. Rosenberger and Sriram (1997) prove an expression for its expectation, which we re-derive using an alternative approach. Matthews and Rosenberger (1997) propose an expression for its variance; however, we claim that this expression is erroneous, as it returns an incorrect variance for the special case , which reduces the RPW model to the Binomial model (see Appendix). In this paper, we introduce a new approach to compute the variance of the number of white balls in the RPW model after finitely many steps. Our method involves rewriting the RPW model as a martingale process. This transformation may be applied not only to the RPW model, but to any process represented by a sequence of random variables with finite first and second moments such that is a non-constant linear function of and is a quadratic function of , for all . We then use our variance formula to approximate the full probability mass function (PMF) of when and are small. For such , , the RPW model aptly characterizes branching processes that arise in viral genetics, with the balls representing viral particles and colors representing variants. We conclude with a novel method of estimating mutation rates of pathogens by fitting our approximate PMF to genome sequencing data of viruses.
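For a small number of steps, the finite-step distribution of the white-ball count can be computed exactly by propagating the PMF one draw at a time. The sketch below is a brute-force recursion in exact rational arithmetic, not the martingale method developed in this paper; all function and parameter names are hypothetical. It also provides a sanity check: if the probability of adding a white ball does not depend on the color drawn (p_black_to_white = 1 - p_white_to_black), the additions are independent Bernoulli draws, so the number of white balls added is Binomial.

```python
from fractions import Fraction
from math import comb


def white_count_pmf(w0, b0, p_white_to_black, p_black_to_white, n_steps):
    """Exact PMF of the white-ball count after n_steps, as {count: probability}."""
    p_wb = Fraction(p_white_to_black)
    p_bw = Fraction(p_black_to_white)
    pmf = {w0: Fraction(1)}
    for step in range(n_steps):
        total = w0 + b0 + step  # balls in the urn before this draw
        nxt = {}
        for white, prob in pmf.items():
            draw_white = Fraction(white, total)
            # Probability that the added ball is white, given the current count.
            add_white = draw_white * (1 - p_wb) + (1 - draw_white) * p_bw
            nxt[white + 1] = nxt.get(white + 1, Fraction(0)) + prob * add_white
            nxt[white] = nxt.get(white, Fraction(0)) + prob * (1 - add_white)
        pmf = nxt
    return pmf


def mean_and_variance(pmf):
    """Mean and variance of a PMF given as {value: probability}."""
    mean = sum(k * p for k, p in pmf.items())
    var = sum((k - mean) ** 2 * p for k, p in pmf.items())
    return mean, var


def binomial_pmf(n, q, k):
    """P(Binomial(n, q) = k), used only for the sanity check."""
    q = Fraction(q)
    return comb(n, k) * q**k * (1 - q) ** (n - k)
```

Passing probabilities as Fraction values (for example Fraction(1, 4)) keeps every result exact, so a candidate variance formula can be compared against mean_and_variance without floating-point error.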
II Constructing the Martingale
Our central idea behind computing the variance of any in the aforementioned process is to construct a martingale such that each is a linear function of (and is the canonical -algebra, ). For a martingale, computations of and are more straightforward, and from them, we may obtain and . Our first step in developing this method is an elementary lemma that allows us to compute for any , when is a non-constant linear function of :
Lemma 1.
Let be a sequence of random variables with finite first moment, and let almost surely (a.s.). Let be the natural filtration. Suppose for all , we have , where each and is fixed and known and each . Define
and let
for . Set a.s. Then is a martingale.
Proof.
Clearly is -adapted because is a deterministic function of and each is -measurable. is finite for all because is finite for all , and is a linear function of . Finally, we compute that
as desired. ∎
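The construction in Lemma 1 and the one-step computation in its proof can be sketched as follows. This is one standard construction consistent with the lemma's hypotheses; the symbols $X_n$, $\mathcal{F}_n$, $a_n$, $b_n$, $c_n$, and $M_n$ are assumed notation and may differ from the original displays.

```latex
% Assumed notation: X_n is the process, F_n its natural filtration, and
% a_n (nonzero) and b_n are the known coefficients of the linear recursion.
\[
  \mathbb{E}[X_n \mid \mathcal{F}_{n-1}] = a_n X_{n-1} + b_n,
  \qquad a_n \neq 0 .
\]
% Normalizing products and the candidate martingale:
\[
  c_n = \prod_{i=1}^{n} a_i \quad (c_0 = 1), \qquad
  M_n = \frac{X_n}{c_n} - \sum_{i=1}^{n} \frac{b_i}{c_i}, \qquad
  M_0 = X_0 .
\]
% One-step check, using a_n / c_n = 1 / c_{n-1}:
\begin{align*}
  \mathbb{E}[M_n \mid \mathcal{F}_{n-1}]
    &= \frac{a_n X_{n-1} + b_n}{c_n} - \sum_{i=1}^{n} \frac{b_i}{c_i} \\
    &= \frac{X_{n-1}}{c_{n-1}} + \frac{b_n}{c_n}
       - \sum_{i=1}^{n} \frac{b_i}{c_i} \\
    &= \frac{X_{n-1}}{c_{n-1}} - \sum_{i=1}^{n-1} \frac{b_i}{c_i}
     = M_{n-1} .
\end{align*}
```

Under this sketch each $M_n$ is a linear function of $X_n$ with deterministic coefficients, so $\mathbb{E}[X_n]$ follows from $\mathbb{E}[M_n] = \mathbb{E}[M_0]$ by inverting the linear map.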