¹ Harvard College, Faculty of Arts and Sciences, Harvard University, Cambridge, MA 02138. ORCiD: 0000-0003-2834-8191.
² Department of Computer Science, School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138. ORCiD: 0000-0001-5430-5457.
* For correspondence: ispecht@college.harvard.edu.

Analyzing Generalized Pólya Urn Models using Martingales, with an Application to Viral Evolution

Ivan Specht, Michael Mitzenmacher
Abstract

The randomized play-the-winner (RPW) model is a generalized Pólya Urn process with broad applications ranging from clinical trials to molecular evolution. We derive an exact expression for the variance of the RPW model by transforming the Pólya Urn process into a martingale, correcting an earlier result of Matthews and Rosenberger (1997). We then use this result to approximate the full probability mass function of the RPW model for certain parameter values relevant to genetic applications. Finally, we fit our model to genomic sequencing data of SARS-CoV-2, demonstrating a novel method of estimating the viral mutation rate that yields results comparable to those in the existing scientific literature.

Keywords: Pólya Urn models, branching processes, martingales, applied probability, computational genetics.

I Introduction

Consider the following generalized Pólya Urn model: An urn starts out with $u$ white balls and $v$ black balls, with $u+v>0$. At each step $i=1,2,3,\dots$, a ball in the urn is chosen uniformly at random. If the chosen ball is white, a black ball is added to the urn with probability $p_W$ and a white ball is added with probability $1-p_W$, where $0<p_W<1$. If the chosen ball is black, a white ball is added to the urn with probability $p_B$ and a black ball is added with probability $1-p_B$, where $0<p_B<1$. The originally chosen ball is then returned to the urn, so that after step $i$, the total number of balls equals $u+v+i$. This model, known as the randomized play-the-winner (RPW) model, has been widely studied in theoretical and applied contexts, with applications ranging from clinical trials to genetic mutations. Several papers, including Wei and Durham (1978), Smythe and Rosenberger (1995), Smythe (1996), and Rosenberger and Sriram (1997), have considered the RPW model's asymptotic properties, deriving limit theorems related to the asymptotic fraction of white (or black) balls in the urn.
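To make the urn dynamics concrete, the following is a minimal simulation sketch of a single RPW trajectory. The function name `simulate_rpw` and its parameters are hypothetical names introduced here for illustration; the sketch assumes that a black draw adds a white ball with probability $p_B$ and a black ball otherwise, mirroring the rule for a white draw.

```python
import random


def simulate_rpw(u, v, p_w, p_b, n_steps, seed=None):
    """Simulate n_steps of the RPW urn (illustrative sketch, not the authors' code).

    Returns the list of white-ball counts after steps 0, 1, ..., n_steps.
    """
    if u + v <= 0:
        raise ValueError("the urn must start with at least one ball")
    rng = random.Random(seed)
    white, black = u, v
    history = [white]
    for _ in range(n_steps):
        # Draw a ball uniformly at random; the drawn ball is returned afterwards.
        drew_white = rng.random() < white / (white + black)
        # Probability that the added ball has the opposite color of the drawn one.
        p_switch = p_w if drew_white else p_b
        switched = rng.random() < p_switch
        # The added ball is white if (white drawn, no switch) or (black drawn, switch).
        if drew_white != switched:
            white += 1
        else:
            black += 1
        history.append(white)
    return history


if __name__ == "__main__":
    path = simulate_rpw(u=2, v=3, p_w=0.1, p_b=0.2, n_steps=20, seed=0)
    print(path)
```

Each step adds exactly one ball, so after step $i$ the urn holds $u+v+i$ balls and the white count never decreases.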

The distribution of this fraction after finitely many steps, however, is less well understood. Rosenberger and Sriram (1997) prove an expression for its expectation, which we re-derive using an alternative approach. Matthews and Rosenberger (1997) propose an expression for its variance; however, we claim that this expression is erroneous, as it returns the incorrect variance for the special case $p_W=p_B=\frac{1}{2}$, which reduces the RPW model to the Binomial$(\frac{1}{2})$ model (see Appendix). In this paper, we introduce a new approach to compute the variance of the number of white balls in the RPW model after finitely many steps. Our method involves rewriting the RPW model as a martingale process, a transformation that may be applied not only to the RPW model, but to any process represented by a sequence of random variables $M_0,M_1,M_2,\dots$ with finite first and second moments such that $\mathbb{E}[M_{i+1}|M_i]$ is a non-constant linear function of $M_i$ and $\text{Var}[M_{i+1}|M_i]$ is a quadratic function of $M_i$, for all $i$. We then use our variance formula to approximate the full probability mass function (PMF) of $M_i$ when $p_W$ and $p_B$ are small. For such $p_W$, $p_B$, the RPW model aptly characterizes branching processes that arise in viral genetics, with the balls representing viral particles and colors representing variants. We conclude with a novel method of estimating mutation rates of pathogens by fitting our approximate PMF to genome sequencing data of viruses.
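Two claims in the preceding paragraph can be checked directly from the urn rules. The notation $W_i$ (number of white balls after step $i$), $T_i = u+v+i$ (total number of balls after step $i$), and $\mathcal{F}_i$ (the history up to step $i$) is introduced here only for illustration and is an assumption of this sketch, not notation taken from the paper.

```latex
\begin{align*}
\Pr[\text{white ball added at step } i+1 \mid \mathcal{F}_i]
  &= \frac{W_i}{T_i}\,(1-p_W) + \Bigl(1-\frac{W_i}{T_i}\Bigr)\,p_B
   = p_B + (1-p_W-p_B)\,\frac{W_i}{T_i}, \\
\mathbb{E}[W_{i+1} \mid \mathcal{F}_i]
  &= W_i + \Pr[\text{white ball added at step } i+1 \mid \mathcal{F}_i]
   = \Bigl(1+\frac{1-p_W-p_B}{T_i}\Bigr) W_i + p_B .
\end{align*}
```

When $p_W=p_B=\frac{1}{2}$, the first line equals $\frac{1}{2}$ regardless of $W_i$, so the number of white balls added in $n$ steps is a sum of $n$ independent fair coin flips, with variance $n/4$. The second line shows that the conditional expectation is linear in $W_i$ with a nonzero slope, since $|1-p_W-p_B|<1\leq T_i$.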

II Constructing the Martingale

Our central idea behind computing the variance of any $M_n$ in the aforementioned process $\{M_i\}_{i\geq 0}$ is to construct a martingale $(X_i,\mathcal{F}_i)$ such that each $X_i$ is a linear function of $M_i$ (and $\mathcal{F}_i$ is the canonical $\sigma$-algebra, $\sigma(X_0,\dots,X_i)$). For a martingale, computations of $\mathbb{E}[X_n]$ and $\text{Var}[X_n]$ are more straightforward, and from them, we may obtain $\mathbb{E}[M_n]$ and $\text{Var}[M_n]$. Our first step in developing this method is an elementary lemma that allows us to compute $\mathbb{E}[M_n]$ for any $n$, when $\mathbb{E}[M_{i+1}|M_i]$ is a non-constant linear function of $M_i$:

Lemma 1.

Let $\{M_i\}_{i\geq 0}$ be a sequence of random variables with finite first moment, and let $M_0=0$ almost surely (a.s.). Let $\mathcal{F}_i=\sigma(M_0,\dots,M_i)$ be the natural filtration. Suppose for all $i\geq 0$, we have $\mathbb{E}[M_{i+1}|\mathcal{F}_i]=a_iM_i+b_i$, where each $a_i$ and $b_i$ is fixed and known and each $a_i\neq 0$. Define

\[
q_i=\prod_{j=0}^{i}a_j,
\]

and let

\[
X_i=\frac{M_i}{q_{i-1}}-\sum_{j=0}^{i-1}\frac{b_j}{q_j}
\]

for $i\geq 1$. Set $X_0=0$ a.s. Then $(X_i,\mathcal{F}_i)$ is a martingale.

Proof.

Clearly $X_i$ is $\mathcal{F}_i$-adapted because $X_i$ is a deterministic function of $M_i$ and each $M_i$ is $\mathcal{F}_i$-measurable. Moreover, $\mathbb{E}|X_i|$ is finite for all $i$ because $\mathbb{E}|M_i|$ is finite for all $i$, and $X_i$ is a linear function of $M_i$. Finally, we compute that

\begin{align*}
\mathbb{E}[X_{i+1}|\mathcal{F}_i] &= \frac{\mathbb{E}[M_{i+1}|\mathcal{F}_i]}{q_i}-\sum_{j=0}^{i}\frac{b_j}{q_j} \\
&= \frac{a_iM_i+b_i}{q_i}-\sum_{j=0}^{i}\frac{b_j}{q_j} \\
&= \frac{a_iM_i}{\prod_{j=0}^{i}a_j}+\frac{b_i}{q_i}-\sum_{j=0}^{i}\frac{b_j}{q_j} \\
&= \frac{M_i}{\prod_{j=0}^{i-1}a_j}-\sum_{j=0}^{i-1}\frac{b_j}{q_j} \\
&= X_i
\end{align*}

as desired. ∎
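Because a martingale has constant expectation, Lemma 1 gives $\mathbb{E}[X_n]=X_0=0$, and hence $\mathbb{E}[M_n]=q_{n-1}\sum_{j=0}^{n-1}b_j/q_j$. The sketch below checks this identity numerically for the RPW urn. It rests on assumptions that are not taken from the text above: $M_i$ is taken to be the number of white balls added during the first $i$ steps (so that $M_0=0$), which from the urn rules gives $a_i=1+\frac{1-p_W-p_B}{u+v+i}$ and $b_i=p_B+\frac{u(1-p_W-p_B)}{u+v+i}$. The function names are hypothetical.

```python
from fractions import Fraction


def rpw_added_white_pmf(u, v, p_w, p_b, n):
    """Exact PMF of M_n, the number of white balls added in n steps (sketch)."""
    pmf = {0: Fraction(1)}
    for i in range(n):
        total = u + v + i  # balls in the urn before step i + 1
        nxt = {}
        for m, prob in pmf.items():
            frac_white = Fraction(u + m, total)
            # White is added after a white draw w.p. 1 - p_w, after a black draw w.p. p_b.
            p_add_white = frac_white * (1 - p_w) + (1 - frac_white) * p_b
            nxt[m + 1] = nxt.get(m + 1, 0) + prob * p_add_white
            nxt[m] = nxt.get(m, 0) + prob * (1 - p_add_white)
        pmf = nxt
    return pmf


def lemma_mean(u, v, p_w, p_b, n):
    """E[M_n] = q_{n-1} * sum_{j<n} b_j / q_j, using the assumed a_j and b_j."""
    c = 1 - p_w - p_b
    q = Fraction(1)    # running product q_j = a_0 * ... * a_j
    acc = Fraction(0)  # running sum of b_j / q_j
    for j in range(n):
        total = u + v + j
        a_j = 1 + c / total
        b_j = p_b + u * c / total
        q *= a_j
        acc += b_j / q
    return q * acc


if __name__ == "__main__":
    p_w, p_b = Fraction(1, 10), Fraction(1, 5)
    pmf = rpw_added_white_pmf(2, 3, p_w, p_b, 12)
    direct_mean = sum(m * p for m, p in pmf.items())
    # Exact rational arithmetic, so the two means can be compared with ==.
    print(direct_mean == lemma_mean(2, 3, p_w, p_b, 12))
```

Passing the probabilities as `Fraction` values keeps the arithmetic exact, so the comparison is free of rounding error. In the special case $p_W=p_B=\frac{1}{2}$, every $a_j$ equals 1 and every $b_j$ equals $\frac{1}{2}$, and the formula reduces to the Binomial mean $n/2$.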