¹ Harvard College, Faculty of Arts and Sciences, Harvard University, Cambridge, MA 02138. ORCiD: 0000-0003-2834-8191.

² Department of Computer Science, School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138. ORCiD: 0000-0001-5430-5457.

* For correspondence: ispecht@college.harvard.edu.

Analyzing Generalized Pólya Urn Models using Martingales, with an Application to Viral Evolution

Ivan Specht Michael Mitzenmacher

Abstract

The randomized play-the-winner (RPW) model is a generalized Pólya Urn process with broad applications ranging from clinical trials to molecular evolution. We derive an exact expression for the variance of the RPW model by transforming the Pólya Urn process into a martingale, correcting an earlier result of Matthews and Rosenberger (1997). We then use this result to approximate the full probability mass function of the RPW model for certain parameter values relevant to genetic applications. Finally, we fit our model to genomic sequencing data of SARS-CoV-2, demonstrating a novel method of estimating the viral mutation rate that delivers comparable results to existing scientific literature.

Keywords: Pólya Urn models, branching processes, martingales, applied probability, computational genetics.

I Introduction

Consider the following generalized Pólya Urn model: An urn starts out with $u$ white balls and $v$ black balls, with $u+v>0$. At each step $i=1,2,3,\dots$, a ball in the urn is chosen uniformly at random. If the chosen ball is white, a black ball is added to the urn with probability $p_{W}$ and a white ball is added with probability $1-p_{W}$, where $0<p_{W}<1$. If the chosen ball is black, a white ball is added to the urn with probability $p_{B}$ and a black ball is added with probability $1-p_{B}$, where $0<p_{B}<1$. The originally chosen ball is then returned to the urn, so that after step $i$, the total number of balls equals $u+v+i$. This model, known as the randomized play-the-winner (RPW) model, has been widely studied in theoretical and applied contexts, with applications ranging from clinical trials to genetic mutations. Several papers, including Wei and Durham (1978), Smythe and Rosenberger (1995), Smythe (1996), and Rosenberger and Sriram (1997), have considered the RPW model’s asymptotic properties, deriving limit theorems related to the asymptotic fraction of white (or black) balls in the urn.
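The urn dynamics described above can be sketched as a short simulation. This is a minimal illustration rather than code from the paper: the function name `simulate_rpw` and its parameters are hypothetical, and the sketch assumes that a drawn white ball adds a black ball with probability $p_{W}$ while a drawn black ball adds a white ball with probability $p_{B}$.

```python
import random


def simulate_rpw(u, v, p_w, p_b, steps, rng=None):
    """Simulate the RPW urn for `steps` draws and return (white, black).

    u, v     : initial numbers of white and black balls (u + v > 0)
    p_w, p_b : probability that the added ball has the opposite colour
               of the drawn ball, when the drawn ball is white / black
    rng      : optional random.Random instance (hypothetical convenience
               parameter, used only to make runs reproducible)
    """
    if u + v <= 0:
        raise ValueError("the urn must start with at least one ball")
    rng = rng or random.Random()
    white, black = u, v
    for _ in range(steps):
        # Draw uniformly at random: white with probability white / total.
        drew_white = rng.random() < white / (white + black)
        # The added ball switches colour with probability p_w or p_b.
        switch = rng.random() < (p_w if drew_white else p_b)
        if drew_white != switch:
            white += 1
        else:
            black += 1
        # The drawn ball is returned, so the total grows by exactly one.
    return white, black
```

Repeating the simulation many times with small `p_w` and `p_b` gives an empirical distribution of the white-ball count against which exact or approximate formulas can be compared.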

The distribution of this fraction after finitely many steps, however, is less well understood. Rosenberger and Sriram (1997) prove an expression for its expectation, which we re-derive using an alternative approach. Matthews and Rosenberger (1997) propose an expression for its variance; however, we claim that this expression is erroneous, as it returns the incorrect variance for the special case $p_{W}=p_{B}=\frac{1}{2}$, which reduces the RPW model to the Binomial$(\frac{1}{2})$ model (see Appendix). In this paper, we introduce a new approach to compute the variance of the number of white balls in the RPW model after finitely many steps. Our method involves rewriting the RPW model as a martingale process. This transformation applies not only to the RPW model, but to any process represented by a sequence of random variables $M_{0},M_{1},M_{2},\dots$ with finite first and second moments such that ${\mathbb{E}}[M_{i+1}|M_{i}]$ is a non-constant linear function of $M_{i}$ and $\text{Var}[M_{i+1}|M_{i}]$ is a quadratic function of $M_{i}$, for all $i$. We then use our variance formula to approximate the full probability mass function (PMF) of $M_{i}$ when $p_{W}$ and $p_{B}$ are small. For such $p_{W}$, $p_{B}$, the RPW model aptly characterizes branching processes that arise in viral genetics, with the balls representing viral particles and colors representing variants. We conclude with a novel method of estimating mutation rates of pathogens by fitting our approximate PMF to genome sequencing data of viruses.
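As a brief sketch of why the RPW model meets these two conditions, write $W_{i}$ for the number of white balls in the urn after step $i$, so that $W_{0}=u$ (the symbols $W_{i}$ and $\pi_{i}$ are notation introduced here for illustration and need not match the notation used later). Given $W_{i}$, the ball added at step $i+1$ is white with probability

$$
\pi_{i}=\frac{W_{i}}{u+v+i}(1-p_{W})+\left(1-\frac{W_{i}}{u+v+i}\right)p_{B}=p_{B}+\frac{(1-p_{W}-p_{B})\,W_{i}}{u+v+i}.
$$

Since $W_{i+1}-W_{i}$ is a Bernoulli$(\pi_{i})$ random variable given $W_{i}$,

$$
{\mathbb{E}}[W_{i+1}|W_{i}]=\left(1+\frac{1-p_{W}-p_{B}}{u+v+i}\right)W_{i}+p_{B},\qquad\text{Var}[W_{i+1}|W_{i}]=\pi_{i}(1-\pi_{i}).
$$

The conditional expectation is linear in $W_{i}$ with a strictly positive slope, because $|1-p_{W}-p_{B}|<1\leq u+v+i$, and the conditional variance is a polynomial of degree at most two in $W_{i}$. In the special case $p_{W}=p_{B}=\frac{1}{2}$, we get $\pi_{i}=\frac{1}{2}$ regardless of $W_{i}$, so the number of white balls added after $n$ steps is Binomial$(n,\frac{1}{2})$ with variance $\frac{n}{4}$.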

II Constructing the Martingale

Our central idea for computing the variance of any $M_{n}$ in the aforementioned process $\{M_{i}\}_{i\geq 0}$ is to construct a martingale $(X_{i},{\mathcal{F}}_{i})$ such that each $X_{i}$ is a linear function of $M_{i}$ (and ${\mathcal{F}}_{i}$ is the canonical $\sigma$-algebra, $\sigma(X_{0},\dots,X_{i})$). For a martingale, computations of ${\mathbb{E}}[X_{n}]$ and $\text{Var}[X_{n}]$ are more straightforward, and from them, we may obtain ${\mathbb{E}}[M_{n}]$ and $\text{Var}[M_{n}]$. Our first step in developing this method is an elementary lemma that allows us to compute ${\mathbb{E}}[M_{n}]$ for any $n$, when ${\mathbb{E}}[M_{i+1}|M_{i}]$ is a non-constant linear function of $M_{i}$:

Lemma 1.

Let $\{M_{i}\}_{i\geq 0}$ be a sequence of random variables with finite first moments, and let $M_{0}=0$ almost surely (a.s.). Let ${\mathcal{F}}_{i}=\sigma(M_{0},\dots,M_{i})$ be the natural filtration. Suppose for all $i\geq 0$, we have ${\mathbb{E}}[M_{i+1}|{\mathcal{F}}_{i}]=a_{i}M_{i}+b_{i}$, where each $a_{i}$ and $b_{i}$ is a known constant and each $a_{i}\neq 0$. Define

$$q_{i}=\prod_{j=0}^{i}a_{j},$$

and let

$$X_{i}=\frac{M_{i}}{q_{i-1}}-\sum_{j=0}^{i-1}\frac{b_{j}}{q_{j}}$$

for $i\geq 1$. Set $X_{0}=0$ a.s. Then $(X_{i},{\mathcal{F}}_{i})$ is a martingale.

Proof.

Clearly $X_{i}$ is ${\mathcal{F}}_{i}$-adapted because $X_{i}$ is a deterministic function of $M_{i}$ and each $M_{i}$ is ${\mathcal{F}}_{i}$-measurable. Moreover, ${\mathbb{E}}|X_{i}|$ is finite for all $i$ because ${\mathbb{E}}|M_{i}|$ is finite for all $i$ and $X_{i}$ is a linear function of $M_{i}$. Finally, we compute that

$$
\begin{aligned}
{\mathbb{E}}[X_{i+1}|{\mathcal{F}}_{i}]&=\frac{{\mathbb{E}}[M_{i+1}|{\mathcal{F}}_{i}]}{q_{i}}-\sum_{j=0}^{i}\frac{b_{j}}{q_{j}}\\
&=\frac{a_{i}M_{i}+b_{i}}{q_{i}}-\sum_{j=0}^{i}\frac{b_{j}}{q_{j}}\\
&=\frac{a_{i}M_{i}}{\prod_{j=0}^{i}a_{j}}+\frac{b_{i}}{q_{i}}-\sum_{j=0}^{i}\frac{b_{j}}{q_{j}}\\
&=\frac{M_{i}}{\prod_{j=0}^{i-1}a_{j}}-\sum_{j=0}^{i-1}\frac{b_{j}}{q_{j}}\\
&=X_{i}
\end{aligned}
$$

as desired. ∎
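One consequence of Lemma 1 can be checked numerically. Because a martingale has constant expectation, ${\mathbb{E}}[X_{n}]={\mathbb{E}}[X_{0}]=0$, and rearranging the definition of $X_{n}$ gives ${\mathbb{E}}[M_{n}]=q_{n-1}\sum_{j=0}^{n-1}b_{j}/q_{j}$. The sketch below compares this closed form with the direct recursion ${\mathbb{E}}[M_{i+1}]=a_{i}{\mathbb{E}}[M_{i}]+b_{i}$ in exact rational arithmetic. The function names are hypothetical, and any coefficient sequences passed in are illustrative inputs rather than values taken from the paper.

```python
from fractions import Fraction


def expected_value_closed_form(a, b, n):
    """Return E[M_n] = q_{n-1} * sum_{j=0}^{n-1} b_j / q_j.

    a, b : sequences of the coefficients a_i (non-zero) and b_i
    n    : number of steps (n = 0 gives 0, matching M_0 = 0)
    """
    q = Fraction(1)      # running product; equals q_j after processing index j
    total = Fraction(0)  # running sum of b_j / q_j
    for j in range(n):
        q *= a[j]
        total += b[j] / q
    return q * total     # after the loop, q equals q_{n-1}


def expected_value_recursion(a, b, n):
    """Return E[M_n] by iterating E[M_{i+1}] = a_i E[M_i] + b_i from E[M_0] = 0."""
    m = Fraction(0)
    for i in range(n):
        m = a[i] * m + b[i]
    return m
```

Supplying the coefficients as `Fraction` objects keeps the comparison exact, so the two functions can be tested for equality without a floating-point tolerance.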