
Efficient recycled algorithms for quantitative trait models on phylogenies

Gordon Hiscott1    Colin Fox2    Matthew Parry1    David Bryant1
(1Department of Mathematics and Statistics, 2Department of Physics,
University of Otago, Dunedin, New Zealand
)
Abstract

We present an efficient and flexible method for computing likelihoods of phenotypic traits on a phylogeny. The method does not resort to Monte Carlo computation but instead blends Felsenstein’s discrete character pruning algorithm with methods for numerical quadrature. It is not limited to Gaussian models and adapts readily to model uncertainty in the observed trait values. We demonstrate the framework by developing efficient algorithms for likelihood calculation and ancestral state reconstruction under Wright’s threshold model, applying our methods to a dataset of trait data for extrafloral nectaries (EFNs) across a phylogeny of 839 Fabales species.

Keywords: Likelihood algorithm, quantitative traits, continuous traits, comparative method, numerical quadrature, numerical integration.

Introduction

Models for the evolution of continuous traits on a phylogeny are an essential tool of evolutionary genomics. These models are used to describe a huge range of biological phenomena, including morphological characteristics (Felsenstein, 2002, Harmon et al., 2010, Ronquist, 2004, Stevens, 1991), expression levels (Khaitovich et al., 2005, 2006), geography (Lemey et al., 2010), and gene frequency dynamics (Cavalli-Sforza and Edwards, 1967, Felsenstein, 1981a, Sirén et al., 2011). Model-based statistical techniques can detect or test for association between different traits, reconstruct ancestral states, and infer subtle features of the evolutionary process that generated them.

The usefulness of continuous trait models is contingent on our ability to compute with them. Early work of Felsenstein (Felsenstein, 1968, 1973) demonstrated that if traits are evolving according to Brownian motion then we can compute likelihoods quickly and (up to numerical precision) exactly. Felsenstein’s method extends directly to other Gaussian processes, notably the Ornstein-Uhlenbeck (OU) process (Felsenstein, 1988, Hansen, 1997, Lande, 1976). These methods break down for more complex models, in which case researchers have typically resorted to Monte Carlo strategies (e.g. Landis et al., 2013, Ronquist, 2004).

Computing the probability of a quantitative character is essentially a numerical integration (quadrature) problem. For most models, if we know the value of the trait at each ancestral node in the phylogeny we can quickly compute the various transition probabilities. Since we do not usually know these ancestral trait values we integrate them out. This is a multi-dimensional integration problem with one dimension for each ancestral node (or two dimensions for each node if we are modelling covarying traits).

Methods for estimating or approximating integrals are usually judged by their rate of convergence: how quickly the error of approximation decreases as the amount of work (function evaluations) increases. Consider the problem of computing a one-dimensional integral

$\int_{0}^{1}f(x)\,\mathrm{d}x$ (1)

where $f$ is a ‘nice’ function with continuous and bounded derivatives. Simpson’s rule, a simple textbook method reviewed below, can be shown to have an $O(N^{-4})$ rate of convergence, meaning that, asymptotically in $N$, evaluating 10 times more points reduces the error by a factor of $10^{4}$. In contrast, a standard Monte Carlo method has a rate of convergence of $O(N^{-\frac{1}{2}})$, meaning that evaluating 10 times more points will only reduce the error by a factor of around 3. For this reason, numerical analysis texts often refer to Monte Carlo approaches as ‘methods of last resort.’
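
The contrast is easy to check numerically. The following sketch is our own illustration (the test integrand and all names are ours, not from the paper); it compares SciPy's composite Simpson routine with plain Monte Carlo on a smooth integral over [0, 1]:

    import numpy as np
    from scipy.integrate import simpson

    f = lambda x: np.exp(-x) * np.sin(5 * x)                     # smooth test integrand
    exact = (5 - np.exp(-1) * (np.sin(5) + 5 * np.cos(5))) / 26  # closed form on [0, 1]

    rng = np.random.default_rng(0)
    for N in (8, 32, 128, 512):
        x = np.linspace(0, 1, N + 1)
        err_simpson = abs(simpson(f(x), x=x) - exact)            # decays like N^-4
        err_mc = abs(f(rng.uniform(0, 1, N)).mean() - exact)     # decays like N^-1/2
        print(f"N={N:4d}   Simpson {err_simpson:.1e}   Monte Carlo {err_mc:.1e}")

Each factor of 4 increase in $N$ removes roughly two and a half decimal digits of Simpson error, but only halves the Monte Carlo error.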

Despite this apparently lacklustre performance guarantee, Monte Carlo methods have revolutionised phylogenetics in general and the analysis of quantitative characters in particular. The reason is their partial immunity to the curse of dimensionality. Methods like Simpson’s rule are not practical for a high number of dimensions as the asymptotic convergence rate, quoted above, is only achieved for an infeasibly large number of function evaluations $N$. The effective convergence rate for small $N$ can be very poor, and typically worse than Monte Carlo. In contrast, there are Monte Carlo approaches which achieve close to $O(N^{-\frac{1}{2}})$ convergence irrespective of dimension. This has been critical when computing with complex evolutionary models with as many dimensions as there are nodes in the phylogeny.

The main contribution of our paper is to demonstrate how to efficiently and accurately compute likelihoods on a phylogeny using a sequence of one-dimensional integrations. We obtain a fast algorithm with convergence guarantees that far exceed what can be obtained by Monte Carlo integration. Our approach combines two standard tools: classical numerical integrators and Felsenstein’s pruning algorithm for discrete characters (Felsenstein, 1981b). Indeed, the only real difference between our approach and Felsenstein’s discrete character algorithm is that we use numerical integration techniques to integrate states at ancestral nodes, instead of just carrying out a summation.

The running time of the algorithm is $O(N^{2}n)$, where $N$ is the number of points used in the numerical integration at each node and $n$ is the number of taxa (leaves) in the tree. Using Simpson’s method, we obtain a convergence rate of $O(nN^{-4})$, meaning that if we increase $N$ by a factor of $10$ we will obtain an estimate which is accurate to four more decimal places.

To illustrate the application of our general framework, we develop an efficient algorithm for computing the likelihood of a tree under the threshold model of Sewall Wright and Felsenstein (Felsenstein, 2005, 2012, Wright, 1934). We also show how to infer marginal trait densities at ancestral nodes. We have implemented these algorithms and used them to study the evolution of extrafloral nectaries on an 839-taxon phylogeny (Marazzi et al., 2012).

The combination of numerical integrators and the pruning algorithm opens up a large range of potential models and approaches which we have only just begun to explore. In the discussion, we briefly review developments in numerical integration techniques that could well be brought to bear on these problems, and offer a few suggestions of directions and problems which can now be addressed.

Material and Methods

Models for continuous trait evolution

Phylogenetic models for continuous trait evolution, like those for discrete traits, are specified by the density of trait values at the root and the transition densities along the branches. We use $f(x_{r}|\theta_{r})$ to denote the density for the trait value at the root, where $\theta_{r}$ is a set of relevant model parameters. We use $f(x_{i}|x_{j},\theta_{i})$ to denote the transition density for the value at node $i$, conditional on the trait value at its parent node $j$. Here, $\theta_{i}$ represents a bundle of parameters related to node $i$, such as branch length, population size, and mutation rate. All of these parameters could vary throughout the tree.

To see how the model works, consider how continuous traits might be simulated. A state $X_{r}$ is sampled from the root density $f(X_{r}|\theta_{r})$. We then proceed through the phylogeny from the root to the tips, visiting each node only after its parent has been visited. For each node $i$, we generate the value at that node from the density $f(X_{i}|x_{j},\theta_{i})$, where $x_{j}$ is the simulated trait value at node $j$, the parent of node $i$. In this way, we eventually generate trait values for the tips.
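
To make this concrete, here is a minimal simulation sketch for the Gaussian (Brownian motion) transition densities discussed below; the parent-array tree encoding and all names are our own choices, not part of the original paper.

    import numpy as np

    rng = np.random.default_rng(42)

    def simulate_trait(parent, t, mu_r, var_r, var):
        # Nodes are numbered with children before parents; parent[i] is the
        # parent of node i (parent[root] == -1), t[i] the branch length above i.
        m = len(parent)
        x = np.empty(m)
        x[m - 1] = rng.normal(mu_r, np.sqrt(var_r))      # sample the root state
        for i in range(m - 2, -1, -1):                   # parents are visited first
            x[i] = rng.normal(x[parent[i]], np.sqrt(var * t[i]))
        return x

    # Three-tip tree ((A,B),C): nodes 0-2 are tips, node 3 internal, node 4 the root.
    parent = [3, 3, 4, 4, -1]
    t = [1.0, 1.0, 1.0, 1.0, 0.0]
    print(simulate_trait(parent, t, mu_r=0.0, var_r=1.0, var=1.0))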

We use $X_{1},\ldots,X_{n}$ to denote the random trait values at the tips and $X_{n+1},\ldots,X_{2n-1}$ to denote the random trait values at the internal nodes, ordered so that children come before parents. Hence $X_{2n-1}$ is the state assigned to the root. Let

$\mathcal{E}(T)=\{(i,j):\mbox{node $i$ is a child of node $j$}\}$ (2)

denote the set of branches in the tree. The joint density for all trait values, observed and ancestral, is given by multiplying the root density with all of the transition densities

$f(x_{1},\ldots,x_{n},x_{n+1},\ldots,x_{2n-1}|\theta)=f(x_{2n-1}|\theta_{r})\prod_{(i,j)\in\mathcal{E}(T)}f(x_{i}|x_{j},\theta_{i}).$ (3)

The probability of the observed trait values $x_{1},\ldots,x_{n}$ is now determined by integrating out all of the ancestral trait values:

$\mathcal{L}(T)=f(x_{1},\ldots,x_{n}|\theta)=\int\!\!\int\cdots\!\!\int f(x_{2n-1}|\theta_{r})\prod_{(i,j)\in\mathcal{E}(T)}f(x_{i}|x_{j},\theta_{i})\,\mathrm{d}x_{n+1}\cdots\mathrm{d}x_{2n-1}.$ (4)

In these integrals, the bounds of integration will vary according to the model.

The oldest, and most widely used, continuous trait models assume that traits (or transformed gene frequencies) evolve like Brownian motion (Cavalli-Sforza and Edwards, 1967, Felsenstein, 1973). For these models, the root density $f(x_{r}|\theta)$ is Gaussian (normal) with mean 0 and unknown variance $\sigma_{r}^{2}$. The transition densities $f(x_{i}|x_{j},\theta_{i})$ are also Gaussian, with mean $x_{j}$ (the trait value of the parent) and variance proportional to branch length. Note that there are identifiability issues which arise with the inference of the root position under this model, necessitating a few tweaks in practice.

It can be shown that when the root density and transition densities are all Gaussian, the joint density (3) is multivariate Gaussian. Furthermore, the covariance matrix for this density has a special structure which methods such as the pruning technique of Felsenstein (1973) exploit. This technique continues to work when Brownian motion is replaced by an OU process (Felsenstein, 1988, Hansen, 1997, Lande, 1976).

Landis et al. (2013) discuss a class of continuous trait models which are based on Lévy processes and include jumps. At particular times, as governed by a Poisson process, the trait value jumps to a value drawn from a given density. Examples include a compound Poisson process with Gaussian jumps and a variance Gamma model with Gamma distributed jumps. Both of these processes have analytical transition probabilities in some special cases.

Lepage et al. (2006) use the Cox-Ingersoll-Ross (CIR) process to model rate variation across a phylogeny. Like the OU process (but unlike Brownian motion), the CIR process is ergodic. It has a stationary Gamma density which can be used for the root density. The transition density is a particular non-central Chi-squared density and the process only assumes positive values.

Kutsukake and Innan (2013) examine a family of compound Poisson models, focusing particularly on a model where the trait values make exponentially distributed jumps upwards or downwards. In the case that the rates of upward and downward jumps are the same, the model reduces to the variance Gamma model of Landis et al. (2013) and has an analytical probability density function.

Sirén et al. (2011) propose a simple and elegant model for gene frequencies whereby the root value is drawn from a Beta distribution and each transitional density is Beta with appropriately chosen parameters.

Trait values at the tips are not always observed directly. A simple, but important, example of this is the threshold model of Wright (1934), explored by Felsenstein (2005). Under this model, the trait value itself is censored and we observe only whether the value is positive or negative. A similar complication arises when dealing with gene frequency data, as we typically do not observe the actual gene frequency but instead a binomially distributed sample based on that frequency (Sirén et al., 2011).

If the trait values at the tips are not directly observed, we integrate over these values as well. Let $\pi(z_{i}|x_{i},\theta_{i})$ denote the probability of observing $z_{i}$ given the trait value $x_{i}$. The marginalised likelihood is then

$\mathcal{L}(T|z_{1},\ldots,z_{n})=\int\!\!\int\cdots\!\!\int f(x_{2n-1}|\theta_{r})\prod_{(i,j)\in\mathcal{E}(T)}f(x_{i}|x_{j},\theta_{i})\prod_{i=1}^{n}\pi(z_{i}|x_{i},\theta_{i})\,\mathrm{d}x_{1}\cdots\mathrm{d}x_{2n-1}.$ (5)

Numerical integration

Analytical integration can be difficult or impossible. It is unusual for an integral to have an analytical solution, and there is no general method for finding it when one does exist. In contrast, numerical integration techniques (also known as numerical quadrature) are remarkably effective and are often easy to implement. A numerical integration method computes an approximation of the integral from function values at a finite number of points. Hence we can obtain approximate integrals of functions even when we do not have an equation for the function itself. See Cheney and Kincaid (2012) for an introduction to numerical integration, and Dahlquist and Björck (2008) and Davis and Rabinowitz (2007) for more comprehensive technical surveys.

The idea behind most numerical integration techniques is to approximate the target function using a function which is easy to integrate. In this paper we restrict our attention to Simpson’s method, which approximates the original function using piecewise quadratic functions. To approximate an integral $\int_{a}^{b}f(x)\,\mathrm{d}x$ we first determine $N+1$ equally spaced points

$x_{0}=a,\quad x_{1}=a+\frac{b-a}{N},\quad x_{2}=a+2\frac{b-a}{N},\quad\ldots,\quad x_{k}=a+k\frac{b-a}{N},\quad\ldots,\quad x_{N}=b.$ (6)

We now divide the integration into $N/2$ intervals

$\int_{a}^{b}f(x)\,\mathrm{d}x=\sum_{\ell=1}^{N/2}\int_{x_{2\ell-2}}^{x_{2\ell}}f(x)\,\mathrm{d}x.$ (7)

Within each interval $[x_{2\ell-2},x_{2\ell}]$, there is a unique quadratic function which equals $f(x)$ at each of the three points $x=x_{2\ell-2}$, $x=x_{2\ell-1}$ and $x=x_{2\ell}$. The integral of this quadratic on the interval $[x_{2\ell-2},x_{2\ell}]$ is $\frac{(b-a)}{3N}\left(f(x_{2\ell-2})+4f(x_{2\ell-1})+f(x_{2\ell})\right)$. Summing over $\ell$, we obtain the approximation

$\int_{a}^{b}f(x)\,\mathrm{d}x\approx\sum_{\ell=1}^{N/2}\frac{(b-a)}{3N}\left(f(x_{2\ell-2})+4f(x_{2\ell-1})+f(x_{2\ell})\right).$ (8)

With a little rearrangement, the approximation can be written in the form

$\int_{a}^{b}f(x)\,\mathrm{d}x\approx\frac{(b-a)}{N}\sum_{k=0}^{N}w_{k}f(x_{k})$ (9)

where $w_{k}=4/3$ when $k$ is odd and $w_{k}=2/3$ when $k$ is even, with the exception of $w_{0}$ and $w_{N}$, which both equal $1/3$. Simpson’s method is embarrassingly easy to implement and has a convergence rate of $O(N^{-4})$: increasing the number of intervals by a factor of $10$ reduces the error by a factor of $10^{4}$. See Dahlquist and Björck (2008) and Davis and Rabinowitz (2007) for further details.
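
As a concrete check, the sketch below (our own; the toy integral and names are assumptions) implements the weight form of (9) directly and verifies the fourth-order behaviour:

    import numpy as np

    def simpson(f, a, b, N):
        # Composite Simpson's rule in the weight form of (9); N must be even.
        x = np.linspace(a, b, N + 1)
        w = np.full(N + 1, 2.0 / 3.0)
        w[1::2] = 4.0 / 3.0
        w[0] = w[-1] = 1.0 / 3.0
        return (b - a) / N * np.dot(w, f(x))

    # Doubling N should shrink the error by roughly 2^4 = 16.
    exact = 1 - np.cos(1.0)                      # integral of sin(x) over [0, 1]
    errors = [abs(simpson(np.sin, 0.0, 1.0, N) - exact) for N in (8, 16, 32)]
    print([errors[i] / errors[i + 1] for i in (0, 1)])   # ratios close to 16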

It should be remembered, however, that the convergence rate is only an asymptotic bound, and gives no guarantees on how well the method performs for a specific function and choice of $N$. Simpson’s method, for example, can perform quite poorly when the function being integrated has rapid changes or sharp peaks. We observed this behaviour when implementing threshold models, as described below. Our response was to better tailor the integration method to the functions that arise. We noted that the numerical integrations we carried out all had the form

$\int_{a}^{b}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}f(x)\,\mathrm{d}x$ (10)

where $\mu$ and $\sigma$ varied. Using the same general approach as Simpson’s rule, we approximated $f(x)$, rather than the whole integrand $e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}f(x)$, by a piecewise quadratic function $p(x)$. We could then use standard techniques and tools to evaluate $\int_{a}^{b}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}p(x)\,\mathrm{d}x$ numerically. The resulting integration formula, which we call the Gaussian kernel method, gives a significant improvement in numerical accuracy.
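
The paper does not spell out the resulting formula, so the sketch below reconstructs one natural reading of the Gaussian kernel method: fit a quadratic through each consecutive triple of grid points (exactly as in Simpson's rule) and integrate it against the kernel exactly, using closed-form partial moments of the Gaussian expressed through the error function. All function names and the test case are our own assumptions.

    import numpy as np
    from scipy.special import erf
    from scipy.integrate import quad

    def gaussian_moments(l, u, mu, sigma):
        # M_m = integral over [l, u] of x^m exp(-(x-mu)^2/(2 sigma^2)) dx, m = 0, 1, 2.
        Z = lambda x: np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
        s = sigma * np.sqrt(2.0)
        M0 = sigma * np.sqrt(np.pi / 2) * (erf((u - mu) / s) - erf((l - mu) / s))
        M1 = mu * M0 - sigma ** 2 * (Z(u) - Z(l))
        M2 = ((sigma ** 2 + mu ** 2) * M0
              - sigma ** 2 * ((u - mu) * Z(u) - (l - mu) * Z(l))
              - 2 * mu * sigma ** 2 * (Z(u) - Z(l)))
        return M0, M1, M2

    def gaussian_kernel_quad(f, a, b, mu, sigma, N):
        # Integrate exp(-(x-mu)^2/(2 sigma^2)) f(x) over [a, b], with f
        # approximated by a piecewise quadratic on N subintervals (N even).
        x = np.linspace(a, b, N + 1)
        total = 0.0
        for k in range(0, N, 2):             # one quadratic per pair of subintervals
            c2, c1, c0 = np.polyfit(x[k:k + 3], f(x[k:k + 3]), 2)
            M0, M1, M2 = gaussian_moments(x[k], x[k + 2], mu, sigma)
            total += c0 * M0 + c1 * M1 + c2 * M2
        return total

    # Check against brute-force adaptive quadrature for a sharply peaked kernel.
    f = lambda x: 1.0 / (1.0 + x ** 2)
    kern = lambda x: np.exp(-(x - 0.5) ** 2 / (2 * 0.05 ** 2)) * f(x)
    print(gaussian_kernel_quad(f, -4, 4, 0.5, 0.05, 32),
          quad(kern, -4, 4, points=[0.5])[0])

Because the sharply peaked kernel is handled exactly, only the slowly varying factor $f$ needs to be well approximated by the piecewise quadratic, which is what rescues accuracy on short branches.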

A further complication is that, in models of continuous traits, the trait value often ranges over the whole real line, or at least over the set of positive reals. Hence, we need to approximate integrals of the form

$\int_{-\infty}^{\infty}f(x)\,\mathrm{d}x\quad\mbox{ or }\quad\int_{0}^{\infty}f(x)\,\mathrm{d}x$ (11)

though the methods discussed above only apply to integrals on finite intervals. We truncate these integrals, determining values $U$ and $L$ such that the difference

$\int_{-\infty}^{\infty}f(x)\,\mathrm{d}x-\int_{L}^{U}f(x)\,\mathrm{d}x$ (12)

between the full integral $\int_{-\infty}^{\infty}f(x)\,\mathrm{d}x$ and the truncated integral $\int_{L}^{U}f(x)\,\mathrm{d}x$ can be bounded analytically. Other strategies are possible; see Dahlquist and Björck (2008) for a comprehensive review.
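
For the Gaussian densities used later in the paper, suitable truncation points can be read straight off the inverse CDF. A minimal sketch (our own, with arbitrary parameter values):

    import numpy as np
    from scipy.stats import norm

    # Choose [L, U] so the Gaussian mass outside the interval is below eps;
    # for a bounded integrand the truncation error is then O(eps).
    mu, sd, eps = 0.0, 2.0, 1e-8
    L = norm.ppf(eps / 2, loc=mu, scale=sd)
    U = norm.ppf(1 - eps / 2, loc=mu, scale=sd)
    print(L, U, 1 - (norm.cdf(U, mu, sd) - norm.cdf(L, mu, sd)))  # approximately eps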

A pruning algorithm for integrating continuous traits

Felsenstein has developed pruning algorithms for both continuous and discrete characters (Felsenstein, 1981a, b). His algorithm for continuous characters works only for Gaussian processes. Our approach is to take his algorithm for discrete characters and adapt it to continuous characters.

The (discrete character) pruning algorithm is an application of dynamic programming. For each node $i$, and each state $x$, we compute the probability of observing the states for all tips which are descendants of node $i$, conditional on node $i$ having ancestral state $x$. This probability is called the partial likelihood at node $i$ given state $x$. Our algorithm follows the same scheme, with one major difference. Since traits are continuous, we cannot store all possible partial likelihoods. Instead, we store likelihoods for a finite set of values and plug these values into a numerical integration routine.

Let $i$ be the index of a node in the tree not equal to the root, and let node $j$ be its parent. We define the partial likelihood $\mathcal{F}_{i}(x)$ to be the likelihood for the observed trait values at the tips which are descendants of node $i$, conditional on the parent node $j$ having trait value $x$. If node $i$ is a tip with observed trait value $x_{i}$, we have

$\mathcal{F}_{i}(x)=f(x_{i}|x,\theta_{i})$ (13)

recalling that $f(x_{i}|x,\theta_{i})$ is the density for the value of the trait at node $i$ conditional on the value of the trait for its parent. More generally, we may only observe some value $z_{i}$ for which we have the conditional probability $\pi(z_{i}|x_{i},\theta_{i})$ given the trait value $x_{i}$. In this case the partial likelihood is given by

$\mathcal{F}_{i}(x)=\int f(\tilde{x}|x,\theta_{i})\,\pi(z_{i}|\tilde{x})\,\mathrm{d}\tilde{x}.$ (14)

Suppose node $i$ is not the root and that it has two children $u,v$. Since trait evolution is conditionally independent on disjoint subtrees, we obtain the recursive formula

$\mathcal{F}_{i}(x)=\int f(\tilde{x}|x,\theta_{i})\,\mathcal{F}_{u}(\tilde{x})\,\mathcal{F}_{v}(\tilde{x})\,\mathrm{d}\tilde{x}.$ (15)

Finally, suppose that node $i$ is the root and has two children $u,v$. We evaluate the complete tree likelihood using the density of the trait value at the root,

$\mathcal{L}(T)=\int f(x|\theta_{r})\,\mathcal{F}_{u}(x)\,\mathcal{F}_{v}(x)\,\mathrm{d}x.$ (16)

The bounds of integration in (14)–(16) will vary according to the model.

We use numerical integration techniques to approximate (14)–(16) and dynamic programming to avoid an exponential explosion in the computation time. Let $N$ denote the number of function evaluations for each node. In practice, this might vary over the tree, but for simplicity we assume that it is constant. For each node $i$, we select $N+1$ trait values

$X_{i}[0]<X_{i}[1]<\cdots<X_{i}[N].$ (17)

How we do this will depend on the trait model and the numerical integration technique. If, for example, the trait values vary between $a$ and $b$ and we are applying Simpson’s method with $N$ intervals, we would use $X_{i}[k]=a+\frac{b-a}{N}k$ for $k=0,1,2,\ldots,N$.

We traverse the tree starting at the tips and working towards the root. For each non-root node $i$ and $k=0,1,\ldots,N$ we compute and store an approximation $F_{i}[k]$ of $\mathcal{F}_{i}(X_{j}[k])$, where node $j$ is the parent of node $i$. Note that this is an approximation of $\mathcal{F}_{i}(X_{j}[k])$ rather than of $\mathcal{F}_{i}(X_{i}[k])$, since $\mathcal{F}_{i}(x)$ is the partial likelihood conditional on the trait value for the parent of node $i$. The approximation $F_{i}[k]$ is computed by applying the numerical integration method to the appropriate integral among (14)–(16), where we replace function evaluations with approximations previously computed. See below for a worked example of this general approach.

The numerical integration methods we use run in time linear in the number of points being evaluated. Hence if $n$ is the number of tips in the tree, the algorithm will run in time $O(nN^{2})$. For the integration techniques described above, the convergence rate (in $N$) for the likelihood on the entire tree has the same order as the convergence rate for the individual one-dimensional integrations (see below for a formal proof for a specific model). We have therefore avoided the computational blow-out typically associated with such high-dimensional integrations, and achieve this without sacrificing accuracy.

Posterior densities for ancestral states

The algorithms we have described compute the joint density of the states at the tips, given the tree, the branch lengths, and other parameters. As with discrete traits, the algorithms can be modified to infer ancestral states for internal nodes in the tree. Here we show how to carry out reconstruction of the marginal posterior density of a state at a particular node. The differences between marginal and joint reconstructions are reviewed by Yang (2006, p. 121).

First consider marginal reconstruction of ancestral states at the root. Let $u$ and $v$ be the children of the root. The product $\mathcal{F}_{u}(x)\mathcal{F}_{v}(x)$ equals the probability of the observed character conditional on the tree, branch lengths, parameters, and a state of $x$ at the root. The marginal probability of $x$, ignoring the data, is given by the root density $f(x|\theta_{r})$. Integrating the product of $\mathcal{F}_{u}(x)\mathcal{F}_{v}(x)$ and $f(x|\theta_{r})$ gives the likelihood $\mathcal{L}(T)$, as in (16). Plugging these into Bayes’ rule, we obtain the posterior density of the state at the root:

$f(x_{r}|z_{1},\ldots,z_{n})=\frac{\mathcal{F}_{u}(x_{r})\,\mathcal{F}_{v}(x_{r})\,f(x_{r}|\theta_{r})}{\mathcal{L}(T)}.$ (18)

With the general time reversible models used in phylogenetics, the posterior distributions at other nodes can be found by changing the root of the tree. Unfortunately the same trick does not work for many quantitative trait models, including the threshold model we study here. Furthermore, recomputing likelihoods for each possible root entails a large amount of unnecessary computation.

Instead we derive a second recursion, this one starting at the root and working towards the tips. A similar trick is used to compute derivatives of the likelihood function in Felsenstein and Churchill (1996). For a node $i$ and state $x$ we let $\mathcal{G}_{i}(x)$ denote the likelihood for the trait values at tips which are not descendants of node $i$, conditional on node $i$ having trait value $x$. If node $i$ is the root $r$, then $\mathcal{G}_{r}(x)$ is $1$ for all $x$.

Let node $i$ be any node apart from the root, let node $j$ be its parent, and let node $u$ be the other child of $j$ (that is, the sibling of node $i$). We let $\tilde{x}$ denote the trait value at node $j$. Then $\mathcal{G}_{i}(x)$ can be written

$\mathcal{G}_{i}(x)=\int f(\tilde{x}|x,\theta_{i})\,\mathcal{G}_{j}(\tilde{x})\,\mathcal{F}_{u}(\tilde{x})\,\mathrm{d}\tilde{x}.$ (19)

This integral can be evaluated using the same numerical integrators used when computing likelihoods. Note that $f(\tilde{x}|x,\theta_{i})$ is the conditional density of the parent state given the child state, which is the reverse of the transition densities used to formulate the model. How this is computed will depend on the model and its properties; see below for an implementation of this calculation in the threshold model.

Once $\mathcal{G}_{i}(x)$ has been computed for all nodes, the actual (marginal) posterior densities are computed from Bayes’ rule. Letting $u,v$ be the children of node $i$,

$f(x_{i}|z_{1},\ldots,z_{n})=\frac{\mathcal{G}_{i}(x_{i})\,\mathcal{F}_{u}(x_{i})\,\mathcal{F}_{v}(x_{i})\,f(x_{i})}{\mathcal{L}(T)}.$ (20)

Case study: threshold models

In this section we show how the general framework can be applied to the threshold model of Wright (1934) and Felsenstein (2005, 2012). Each trait is modelled by a continuously varying liability which evolves along branches according to a Brownian motion process. While the underlying liability is continuous, the observed data is discrete: at each tip we observe only whether the liability is above or below some threshold.

We will use standard notation for Gaussian densities. Let $\phi(x|\mu,\sigma^{2})$ denote the density of a Gaussian random variable $x$ with mean $\mu$ and variance $\sigma^{2}$; let

$\Phi(y|\mu,\sigma^{2})=\int_{-\infty}^{y}\phi(x|\mu,\sigma^{2})\,\mathrm{d}x$ (21)

denote its cumulative distribution function, with inverse $\Phi^{-1}(\alpha|\mu,\sigma^{2})$.

Let $X_{1},\ldots,X_{2n-1}$ denote the (unobserved) liability values at the $n$ tips and $n-1$ internal nodes. As above, we assume that $i<j$ whenever node $i$ is a child of node $j$, so that the root has index $2n-1$.

The liability value at the root has a Gaussian density with mean $\mu_{r}$ and variance $\sigma_{r}^{2}$:

$f(x_{2n-1}|\theta_{r})=\phi(x_{2n-1}|\mu_{r},\sigma_{r}^{2}).$ (22)

Consider any non-root node $i$ and let $j$ be the index of its parent. Let $t_{i}$ denote the length of the branch connecting nodes $i$ and $j$. Then $X_{i}$ has a Gaussian density with mean $x_{j}$ and variance $\sigma^{2}t_{i}$:

$f(x_{i}|x_{j},\theta_{i})=\phi(x_{i}|x_{j},\sigma^{2}t_{i}).$ (23)

Following Felsenstein (2005), we assume the thresholds for the tips are all set at zero. We observe $1$ if the liability is positive, $0$ if the liability is negative, and $?$ if data is missing. We can include the threshold step in our earlier framework by defining

$\pi(z_{i}|x_{i})=\begin{cases}1&\mbox{if $z_{i}=1$ and $x_{i}>0$, or $z_{i}=0$ and $x_{i}\leq 0$, or $z_{i}=?$}\\ 0&\mbox{otherwise.}\end{cases}$ (24)

The likelihood function for observed discrete values $z_{1},\ldots,z_{n}$ is then given by integrating over liability values for all nodes in the tree:

$\mathcal{L}(T|z_{1},\ldots,z_{n})=\int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty}\phi(x_{2n-1}|\mu_{r},\sigma_{r}^{2})\prod_{(i,j)}\phi(x_{i}|x_{j},\sigma^{2}t_{i})\prod_{i=1}^{n}\pi(z_{i}|x_{i})\,\mathrm{d}x_{1}\cdots\mathrm{d}x_{2n-1}.$ (25)

The first step towards computing $\mathcal{L}(T|z_{1},\ldots,z_{n})$ is to bound the domain of integration so that we can apply Simpson’s method. Ideally, we would like these bounds to be as tight as possible, for improved efficiency. For the moment we will just outline a general procedure which can be adapted to a wide range of evolutionary models.

The marginal (prior) density of a single liability or trait value at a single node is the density for that liability value marginalizing over all other values and data. With the threshold model, the marginal density for the liability at node $i$ is Gaussian with mean $\mu_{r}$ (like the root) and variance $v_{i}$ equal to the sum of the variance at the root and the transition variances on the path from the root to node $i$. If $P_{i}$ is the set of nodes on the path from the root to node $i$, then

$v_{i}=\sigma_{r}^{2}+\sigma^{2}\sum_{j\in P_{i}}t_{j}.$ (26)

The goal is to constrain the error introduced by truncating the integrals with infinite domain. Let $\epsilon$ be the desired bound on this truncation error. Recall that the number of internal nodes in the tree is $n-1$. Define

$L_{i}=\Phi^{-1}\left(\frac{\epsilon}{2(n-1)}\,\Big{|}\,\mu_{r},v_{i}\right)$ (27)

and

$U_{i}=\Phi^{-1}\left(1-\frac{\epsilon}{2(n-1)}\,\Big{|}\,\mu_{r},v_{i}\right)$ (28)

so that the probability that $X_{i}$ lies outside the interval $[L_{i},U_{i}]$ is at most $\epsilon/(n-1)$. By the union bound, the probability that $X_{i}\not\in[L_{i},U_{i}]$ for some internal node $i$ is at most $\epsilon$. We use this fact to bound the contribution of the regions outside these bounds.

$\displaystyle\int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty}f(x_{2n-1}|\mu_{r},\sigma_{r}^{2})\prod_{(i,j)}f(x_{i}|x_{j},\theta_{i})\prod_{i=1}^{n}\pi(z_{i}|x_{i})\,\mathrm{d}x_{1}\cdots\mathrm{d}x_{2n-1}$ (29)
$\displaystyle\qquad-\int_{L_{2n-1}}^{U_{2n-1}}\!\cdots\!\int_{L_{n+1}}^{U_{n+1}}\int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty}f(x_{2n-1}|\mu_{r},\sigma_{r}^{2})\prod_{(i,j)}f(x_{i}|x_{j},\theta_{i})\prod_{i=1}^{n}\pi(z_{i}|x_{i})\,\mathrm{d}x_{1}\cdots\mathrm{d}x_{2n-1}$ (30)
$\displaystyle\leq\int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty}f(x_{2n-1}|\mu_{r},\sigma_{r}^{2})\prod_{(i,j)}f(x_{i}|x_{j},\theta_{i})\,\mathrm{d}x_{1}\cdots\mathrm{d}x_{2n-1}$ (31)
$\displaystyle\qquad-\int_{L_{2n-1}}^{U_{2n-1}}\!\cdots\!\int_{L_{n+1}}^{U_{n+1}}\int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty}f(x_{2n-1}|\mu_{r},\sigma_{r}^{2})\prod_{(i,j)}f(x_{i}|x_{j},\theta_{i})\,\mathrm{d}x_{1}\cdots\mathrm{d}x_{2n-1}$ (32)
$\displaystyle\leq P\Big{(}X_{n+1}\not\in[L_{n+1},U_{n+1}]\mbox{ or }X_{n+2}\not\in[L_{n+2},U_{n+2}]\mbox{ or }\cdots\mbox{ or }X_{2n-1}\not\in[L_{2n-1},U_{2n-1}]\Big{)}$ (33)
$\displaystyle<\epsilon.$ (34)

We therefore compute values $L_{i},U_{i}$ for $n+1\leq i\leq 2n-1$ and use these bounds when carrying out integration at the internal nodes. We define

$X_{i}[k]=L_{i}+\frac{U_{i}-L_{i}}{N}\,k$ (35)

for $k=0,1,\ldots,N$ for each internal node $i$.

The next step is to use dynamic programming and numerical integration to compute the approximate likelihood. Let node $i$ be a tip of the tree, let node $j$ be its parent, and let $z_{i}$ be the binary trait value at this tip. For each $k=0,1,\ldots,N$ we use standard error functions to compute

$F_{i}[k]=\mathcal{F}_{i}(X_{j}[k])$ (36)
$\phantom{F_{i}[k]}=\begin{cases}\int_{0}^{\infty}\phi(\tilde{x}|X_{j}[k],\sigma^{2}t_{i})\,\mathrm{d}\tilde{x}&\mbox{if $z_{i}=1$}\\ \int_{-\infty}^{0}\phi(\tilde{x}|X_{j}[k],\sigma^{2}t_{i})\,\mathrm{d}\tilde{x}&\mbox{if $z_{i}=0$}\\ 1&\mbox{if $z_{i}=?$.}\end{cases}$ (37)

Here $\phi(x|\mu,\sigma^{2})$ is the density of a Gaussian with mean $\mu$ and variance $\sigma^{2}$.

Now suppose that node $i$ is an internal node with parent node $j$ and children $u$ and $v$. Applying Simpson’s rule with the bounds $L_{i},U_{i}$ to (15), we have for each $k=0,1,\ldots,N$:

$F_{i}[k]=\frac{U_{i}-L_{i}}{N}\sum_{\ell=0}^{N}w_{\ell}\,\phi(X_{i}[\ell]\,|\,X_{j}[k],\sigma^{2}t_{i})\,F_{u}[\ell]F_{v}[\ell]$ (38)
$\phantom{F_{i}[k]}\approx\mathcal{F}_{i}(X_{j}[k]).$ (39)

Suppose node $i$ is the root, and $u,v$ are its children. Applying Simpson’s rule to (16) gives

$L\leftarrow\frac{U_{2n-1}-L_{2n-1}}{N}\sum_{\ell=0}^{N}w_{\ell}\,\phi(X_{2n-1}[\ell]\,|\,\mu_{r},\sigma_{r}^{2})\,F_{u}[\ell]F_{v}[\ell]$ (40)
$\phantom{L}\approx\mathcal{L}(T|z_{1},\ldots,z_{n},\theta).$ (41)

Pseudo-code for the algorithm appears in Algorithm 1. Regarding efficiency and convergence we have:

Theorem 1.

Algorithm 1 runs in $O(nN^{2})$ time and approximates $\mathcal{L}(T)$ with $O(nN^{-4})$ error.

Proof
The running time follows from the fact that for each of the $O(n)$ nodes in the tree we carry out $O(N)$ applications of Simpson’s method, each of which requires $O(N)$ operations.

Simpson’s rule has $O(N^{-4})$ convergence on functions with bounded fourth derivatives (Dahlquist and Björck, 2008). The root density and each of the transition densities are Gaussian, so each has a bounded fourth derivative. For each node $i$, let $n_{i}$ denote the number of tips which are descendants of the node. Using induction on (15), we see that for all nodes $i$, the fourth derivative of $\mathcal{F}_{i}(x)$ is $O(n_{i})$.

Letting $\epsilon=nN^{-4}$, we have from above that replacing the infinite-domain integrals with integrals on $[L_{i},U_{i}]$ introduces at most $nN^{-4}$ error. Using a second induction on (15) and (38), together with the bound on fourth derivatives, we have that $|\mathcal{F}_{i}(X_{j}[k])-F_{i}[k]|$ is at most $O(n_{i}N^{-4})$ for all nodes $i$, where node $j$ is the parent of node $i$. In this way we obtain an error bound of $O(n_{2n-1}N^{-4})=O(nN^{-4})$ on the approximation of $\mathcal{L}(T|z_{1},\ldots,z_{n},\theta)$. $\Box$

Algorithm 1: Compute probability of a threshold character.
Input:
$N$: number of intervals in the numerical integration.
$t_{1},\ldots,t_{2n-2}$: branch lengths in the tree.
$\mu_{r},\sigma^{2}_{r}$: mean and variance of the root density.
$\sigma^{2}$: variance of the transition densities (per unit branch length).
$z_{1},\ldots,z_{n}$: observed character ($z_{i}\in\{1,0,?\}$).
Output:
Probability $L$ of the observed character under the threshold model.
Construct the vector $\mathbf{x}=[0,1,2,\ldots,N]/N$.
Construct the vector $\mathbf{w}=[1,4,2,4,2,\ldots,4,2,1]/3$ as in (9).
Compute the path length $p_{i}$ from the root to each node $i$.
Initialize $F_{i}[k]\leftarrow 1$ for all nodes $i$ and $0\leq k\leq N$.
For all $i=n+1,n+2,\ldots,2n-1$
$L_{i}\leftarrow\Phi^{-1}\!\left(\frac{nN^{-4}}{2(n-1)}\,\Big{|}\,\mu_{r},\sigma_{r}^{2}+\sigma^{2}p_{i}\right)$
$U_{i}\leftarrow\Phi^{-1}\!\left(1-\frac{nN^{-4}}{2(n-1)}\,\Big{|}\,\mu_{r},\sigma_{r}^{2}+\sigma^{2}p_{i}\right)$
$X_{i}\leftarrow(U_{i}-L_{i})\mathbf{x}+L_{i}$
For all tip nodes $i=1,2,\ldots,n$
Let $j$ be the index of the parent of node $i$
For $k=0,\ldots,N$
If $z_{i}=1$
$F_{i}[k]\leftarrow 1-\Phi(0\,|\,X_{j}[k],\sigma^{2}t_{i})$
else if $z_{i}=0$
$F_{i}[k]\leftarrow\Phi(0\,|\,X_{j}[k],\sigma^{2}t_{i})$
For all internal nodes $i=n+1,\ldots,2n-2$, excluding the root
Let $j$ be the index of the parent of node $i$
Let $u,v$ be the indices of the children of node $i$
For $k=0,1,\ldots,N$
$F_{i}[k]\leftarrow\frac{U_{i}-L_{i}}{N}\sum_{\ell=0}^{N}\mathbf{w}_{\ell}\,\phi(X_{i}[\ell]\,|\,X_{j}[k],\sigma^{2}t_{i})\,F_{u}[\ell]F_{v}[\ell]$
Let $u,v$ be the indices of the children of the root.
$L\leftarrow\frac{U_{2n-1}-L_{2n-1}}{N}\sum_{\ell=0}^{N}\mathbf{w}_{\ell}\,\phi(X_{2n-1}[\ell]\,|\,\mu_{r},\sigma_{r}^{2})\,F_{u}[\ell]F_{v}[\ell]$
Algorithm 1: Pseudo-code of the likelihood approximation algorithm for a single character under the threshold model. Nodes are numbered in increasing order from the tips to the root.
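
For readers who prefer runnable code, here is a compact Python transliteration of Algorithm 1. It is a sketch rather than the authors' implementation: it uses plain Simpson weights (the Gaussian kernel refinement would replace the weighted sums), 0-based node indices, and data structures of our own choosing.

    import numpy as np
    from scipy.stats import norm

    def simpson_weights(N):
        # Weights of (9): 1/3 at the endpoints, 4/3 for odd k, 2/3 for even interior k.
        w = np.full(N + 1, 2.0 / 3.0)
        w[1::2] = 4.0 / 3.0
        w[0] = w[-1] = 1.0 / 3.0
        return w

    def threshold_likelihood(n, parent, t, children, z, mu_r, var_r, var, N=64):
        # Rooted binary tree with n tips and m = 2n-1 nodes, numbered 0..m-1 with
        # tips first, children before parents, root last. parent[i] and t[i] give
        # the parent and branch length of node i (unused at the root); children[i]
        # is a pair for each internal i; z[i] in {0, 1, None} is the state at tip i.
        m = 2 * n - 1
        root = m - 1
        p = np.zeros(m)                      # path lengths from the root
        for i in range(m - 2, -1, -1):       # parents are visited before children
            p[i] = p[parent[i]] + t[i]
        eps = n * float(N) ** -4
        X = {}                               # truncation grids, as in (27), (28), (35)
        for i in range(n, m):
            sd = np.sqrt(var_r + var * p[i])
            X[i] = np.linspace(norm.ppf(eps / (2 * (n - 1)), mu_r, sd),
                               norm.ppf(1 - eps / (2 * (n - 1)), mu_r, sd), N + 1)
        w = simpson_weights(N)
        F = {}
        for i in range(n):                   # tips: Gaussian tail probabilities (37)
            j, sd = parent[i], np.sqrt(var * t[i])
            if z[i] is None:
                F[i] = np.ones(N + 1)
            else:
                F[i] = 1 - norm.cdf(0, X[j], sd) if z[i] == 1 else norm.cdf(0, X[j], sd)
        for i in range(n, root):             # internal nodes: Simpson applied to (15)
            u, v = children[i]
            h = (X[i][-1] - X[i][0]) / N
            dens = norm.pdf(X[i][None, :], X[parent[i]][:, None], np.sqrt(var * t[i]))
            F[i] = h * dens @ (w * F[u] * F[v])
        u, v = children[root]                # root: integrate against the root density (16)
        h = (X[root][-1] - X[root][0]) / N
        return h * np.sum(w * norm.pdf(X[root], mu_r, np.sqrt(var_r)) * F[u] * F[v])

    # Example: three-tip tree ((A,B),C) with unit branch lengths and states 1, 1, 0.
    n = 3
    parent = [3, 3, 4, 4, -1]
    t = [1.0, 1.0, 1.0, 1.0, 0.0]
    children = {3: (0, 1), 4: (3, 2)}
    print(threshold_likelihood(n, parent, t, children, [1, 1, 0],
                               mu_r=0.0, var_r=1.0, var=1.0))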

We can estimate posterior densities using the recursion (19) followed by equation (20). Writing $p_{i}$ for the path length from the root to node $i$ (as computed in Algorithm 1), the conditional density

$f(\tilde{x}|x,\theta_{i})=\phi\left(\tilde{x}\,\Big{|}\,\mu_{r}+\frac{\sigma_{r}^{2}+\sigma^{2}p_{j}}{\sigma_{r}^{2}+\sigma^{2}p_{i}}\left(x-\mu_{r}\right),\ \frac{\sigma^{2}t_{i}\left(\sigma_{r}^{2}+\sigma^{2}p_{j}\right)}{\sigma_{r}^{2}+\sigma^{2}p_{i}}\right)$ (42)

can be obtained by plugging the transition density

$f(x|\tilde{x},\theta_{i})=\phi(x|\tilde{x},\sigma^{2}t_{i})$ (43)

and the two marginal densities (26)

$f(\tilde{x})=\phi(\tilde{x}|\mu_{r},\sigma_{r}^{2}+\sigma^{2}p_{j}),\quad f(x)=\phi(x|\mu_{r},\sigma_{r}^{2}+\sigma^{2}p_{i})$ (44)

into the identity $f(\tilde{x}|x,\theta_{i})=f(x|\tilde{x},\theta_{i})\,\frac{f(\tilde{x})}{f(x)}$. We thereby obtain the recursion

$\mathcal{G}_{i}(x)=\int\phi\left(\tilde{x}\,\Big{|}\,\mu_{r}+\frac{\sigma_{r}^{2}+\sigma^{2}p_{j}}{\sigma_{r}^{2}+\sigma^{2}p_{i}}\left(x-\mu_{r}\right),\ \frac{\sigma^{2}t_{i}\left(\sigma_{r}^{2}+\sigma^{2}p_{j}\right)}{\sigma_{r}^{2}+\sigma^{2}p_{i}}\right)\mathcal{G}_{j}(\tilde{x})\,\mathcal{F}_{u}(\tilde{x})\,\mathrm{d}\tilde{x}$ (45)

which we estimate using Simpson’s method. Algorithm 2 estimates values of the posterior densities at each node, evaluated using the same set of grid points as used in Algorithm 1. An additional round of numerical integration can be used to obtain posterior means and variances.
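
The Bayes flip behind (42) is easy to verify numerically for the Brownian case; in the snippet below (our own check, with arbitrary parameter values) both sides of (42) agree to floating-point precision:

    import numpy as np
    from scipy.stats import norm

    mu_r, var_r, var, t_i, p_j = 0.3, 0.5, 1.0, 0.4, 1.2
    v_j = var_r + var * p_j                 # marginal variance at the parent j
    v_i = v_j + var * t_i                   # marginal variance at the child i
    x, xt = 0.8, -0.2                       # child value x, parent value x-tilde

    # Left side: f(xt | x) = f(x | xt) f(xt) / f(x).
    lhs = (norm.pdf(x, xt, np.sqrt(var * t_i))
           * norm.pdf(xt, mu_r, np.sqrt(v_j)) / norm.pdf(x, mu_r, np.sqrt(v_i)))
    # Right side: the closed form of (42).
    rhs = norm.pdf(xt, mu_r + (v_j / v_i) * (x - mu_r),
                   np.sqrt(var * t_i * v_j / v_i))
    print(lhs, rhs)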

Algorithm 2: Compute posterior densities
Input:
$N$, $t_{1},\ldots,t_{2n-2}$, $\mu_{r}$, $\sigma^{2}_{r}$, and $\sigma^{2}$ as in Algorithm 1.
Path lengths $p_{i}$, likelihood $L$, and arrays $F_{i}$ computed in Algorithm 1.
Output:
Arrays $P_{i}$ for each internal node $i$.
Construct the vectors $\mathbf{x}$ and $\mathbf{w}$, and the bounds $L_{i},U_{i}$ and grids $X_{i}$, as in Algorithm 1.
$G_{2n-1}[k]\leftarrow 1$ for all $k$.
For all $i=2n-2,2n-3,\ldots,n+1$
Let $j$ be the index of the parent of node $i$.
Let $v$ be the index of the sibling of node $i$.
For $k=0,1,\ldots,N$
$\mu\leftarrow\mu_{r}+\frac{\sigma_{r}^{2}+\sigma^{2}p_{j}}{\sigma_{r}^{2}+\sigma^{2}p_{i}}\left(X_{i}[k]-\mu_{r}\right)$
$V\leftarrow\frac{\sigma^{2}t_{i}\left(\sigma_{r}^{2}+\sigma^{2}p_{j}\right)}{\sigma_{r}^{2}+\sigma^{2}p_{i}}$
$G_{i}[k]\leftarrow\frac{U_{j}-L_{j}}{N}\sum_{\ell=0}^{N}\mathbf{w}_{\ell}\,\phi(X_{j}[\ell]\,|\,\mu,V)\,G_{j}[\ell]F_{v}[\ell]$
For all $i=n+1,\ldots,2n-1$
Let $u,v$ be the children of node $i$.
For all $k=0,1,\ldots,N$
$P_{i}[k]\leftarrow\frac{1}{L}\,G_{i}[k]F_{u}[k]F_{v}[k]\,\phi(X_{i}[k]\,|\,\mu_{r},\sigma_{r}^{2}+\sigma^{2}p_{i})$
Algorithm 2: Pseudo-code for the algorithm to efficiently compute ancestral posterior densities under the threshold model. At the termination of the algorithm, $P_{i}[k]$ is an estimate of the posterior density at internal node $i$, evaluated at $x=X_{i}[k]$.

Evolutionary precursors of plant extrafloral nectaries

To study the methods in practice, we reanalyse trait data published by Marazzi et al. (2012), using a fixed phylogeny. Marazzi et al. (2012) introduce and apply a new discrete state model for morphological traits which, in addition to states for presence and absence, incorporates an intermediate ‘precursor’ state. Whenever the intermediate state is observed at the tips it is coded as ‘absent’. The motivation behind the model is that the intermediate state represents evolutionary precursors: changes which are necessary for the evolution of a new state but which may not be directly observed. These precursors could explain repeated parallel evolution of a trait in closely related lineages (Marazzi et al., 2012). They compiled a data set recording presence or absence of plant extrafloral nectaries (EFNs) across a phylogeny of 839 species of Fabales, fitting their models to these data.

The threshold model also involves evolutionary precursors, in the form of changes in ancestral liabilities. We use this model, together with our new algorithms, to analyse the EFN dataset. Our analysis also makes use of the time-calibrated phylogeny inferred by Simon et al. (2009), although unlike Marazzi et al. (2012) we ignore phylogenetic uncertainty.

Experimental protocol

We conduct three separate experiments. For the first experiment, we examine the rate of convergence of the likelihood algorithm as we increase $N$. This is done for the ‘All’ EFN character (Character 1 in Marazzi et al. (2012)) for a range of estimates of the liability variance at the root, $\sigma_{r}^{2}$. The interest in $\sigma_{r}^{2}$ stems from its use in determining the bounds $L_{i},U_{i}$ for each node, with the expectation that as $\sigma_{r}^{2}$ increases, the convergence of the integration algorithm will slow. The mean liability at the root, $\mu_{r}$, was determined from the data by maximum likelihood estimation.

We also examined convergence of the algorithm on randomly generated characters. We first evolved liabilities according to the threshold model, using the parameter settings obtained above. To examine the difference in performance on non-phylogenetic characters, we also simulated binary characters by coin flipping. Twenty replicates were carried out in each case.

The second experiment extends the model comparisons carried out in Marazzi et al. (2012) to include the threshold model. For this comparison we fix the transition variance $\sigma^{2}$ at one, since changing this value corresponds to a rescaling of the Brownian process, with no change in likelihood. With only one character, the maximum likelihood estimate of the root variance $\sigma_{r}^{2}$ is zero, irrespective of the data. This leaves a single parameter to infer: the value of the liability at the root. We computed a maximum likelihood estimate for the root liability, then applied our algorithm with a sufficiently large value of $N$ to be sure of convergence. The Akaike Information Criterion (AIC) was computed and compared with those obtained for the models of Marazzi et al. (2012).
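
With the likelihood algorithm in hand, this maximisation is a one-dimensional optimisation. A hedged sketch, reusing the threshold_likelihood function from the sketch after Algorithm 1 (we substitute a small positive root variance for the degenerate MLE $\sigma_{r}^{2}=0$ so that the quadrature grids stay well defined):

    import numpy as np
    from scipy.optimize import minimize_scalar

    # Assumes threshold_likelihood and a tree (n, parent, t, children) are in
    # scope, with z holding one observed binary character.
    z = [1, 1, 0]
    neg_log_lik = lambda mu: -np.log(
        threshold_likelihood(n, parent, t, children, z,
                             mu_r=mu, var_r=1e-4, var=1.0, N=256))
    res = minimize_scalar(neg_log_lik, bounds=(-10.0, 10.0), method="bounded")
    print(res.x, -res.fun)    # ML root liability and the corresponding log-likelihood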

For the third experiment, we determine the marginal posterior densities for the liabilities at internal nodes, using Algorithm 2. These posterior probabilities are then mapped onto the phylogeny, using shading to denote the (marginal) posterior probability that a liability is larger than zero. We therefore obtain a figure analogous to Supplementary Figure 7 of Marazzi et al. (2012).

Results

Convergence of the algorithm

Plots of error versus $N$ are given in Figure 1, both for Simpson’s method (left) and for the modified Gaussian kernel method (right). For larger $N$, the error in a log-log plot decreases with slope at most $-4$ (as indicated), corresponding to $N^{-4}$ convergence of the method. Log-log plots of error versus $N$ for the simulated data are given in Figure 2. In each case, the method converges by $N\approx 30$.


Figure 1: Log-log plots of error as a function of $N$ for the dynamic programming algorithm with Simpson’s method (left) and with the Gaussian kernel method (right). The likelihoods were computed under the threshold model on EFN trait data for an 839 taxon tree. Dotted lines have slope $-4$, corresponding to a convergence rate of $N^{-4}$. Note the difference in scale for the two methods. Logarithms computed to base 10.


Figure 2: Plots of log-likelihood values as a function of $\log(N)$ for the two types of simulated data, computed using our algorithm together with the Gaussian kernel method. Logarithms computed to base 10.

While both methods achieve the expected rate of convergence, the accuracy of the method based on Simpson’s rule is far worse. When a branch length is short, the transition density becomes highly peaked, as does the function being integrated. Such functions are difficult to approximate with piecewise quadratics, and Simpson’s method can fail miserably. Indeed, for $N<50$ we would often observe negative estimated probabilities, or estimates greater than one! (These were omitted from the plots.) While we can always bound estimates computed by the algorithm, a sounder approach is to improve the integration technique. This we did using the Gaussian kernel method, and the result was far better accuracy for little additional computation. For the remainder of the experiments with this model we used the Gaussian kernel method when carrying out numerical integration.

Model comparison

Marazzi et al. (2012) describe AIC comparisons between their precursor model and a conventional binary trait model. We extend this comparison to include the threshold model. This is a one parameter model, the parameter being the value of the liability at the root. We used the MATLAB command fminsearch with multiple starting points to compute the maximum likelihood estimate for this value. The resulting log-likelihood was $\log L=-240.6$, giving an AIC of $483.2$. This compares to an AIC of $507.4$ for the (two parameter) binary character model and an AIC of $495.4$ for the (one parameter) precursor model of Marazzi et al. (2012).
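
The AIC values follow directly from $\mathrm{AIC}=2k-2\log L$, where $k$ is the number of free parameters; for example:

    # AIC = 2k - 2 log L for the three models on trait 1 ('All').
    for model, k, logL in (("binary", 2, -251.7),
                           ("precursor", 1, -246.7),
                           ("threshold", 1, -240.6)):
        print(model, 2 * k - 2 * logL)   # 507.4, 495.4, 483.2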

We analyzed the five other EFN traits in the same way, and present the computed AIC values in Table 1, together with AIC values for the two parameter binary state model and one parameter precursor model computed by Marazzi et al. (2012) (and the two parameter precursor model for trait 6). We see that the threshold model fits better than either the binary or precursor models for all six traits.

Trait              Model      k   log L    AIC
1 (All)            Binary     2   -251.7   507.4
                   Precursor  1   -246.7   495.4
                   Threshold  1   -240.6   483.2 *
2 (Leaves)         Binary     2   -240.3   484.6
                   Precursor  1   -234.5   470.9
                   Threshold  1   -230.6   463.1 *
3 (Inflorescence)  Binary     2   -108.3   220.5
                   Precursor  1   -110.9   223.9
                   Threshold  1   -108.3   218.5 *
4 (Trichomes)      Binary     2    -86.7   177.3
                   Precursor  1    -86.9   175.8
                   Threshold  1    -85.8   173.5 *
5 (Substitutive)   Binary     2   -163.0   330.1
                   Precursor  1   -161.6   325.3
                   Threshold  1   -161.3   324.6 *
6 (True)           Binary     2   -132.3   268.7
                   Precursor  1   -131.1   264.3
                   Precursor  2   -126.7   257.3
                   Threshold  1   -125.3   252.6 *
Table 1: Log-likelihood and AIC values for the binary character, precursor, and threshold models on six EFN traits. Column k gives the number of parameters in each model. Values for the binary and precursor models are from Table 1 in Marazzi et al. (2012). All likelihoods and AIC values are rounded to 1 d.p. An asterisk marks the best fitting model for each trait.

It is not clear, a priori, why the threshold model should fit these data better than the precursor model, since the two models appear to capture similar evolutionary phenomena. It would be useful to explore this observation more thoroughly with the new computational tools, perhaps incorporating phylogenetic error in a manner similar to Marazzi et al. (2012).

Inferring ancestral liabilities

Figure 3 gives a representation of how the (marginal) posterior liabilities change over the tree. Branches are divided into three classes according to the posterior probability that the liability is positive: lineages with posterior probability $>0.7$ colored red, lineages with posterior probability $<0.3$ colored white, and the remaining lineages colored pink.

This diagram can be compared to Supplementary Figure 7 of Marazzi et al. (2012). The representations are, on the whole, directly comparable. A positive liability corresponds, roughly, to an ancestral precursor state. Both analyses suggest multiple origins of a precursor state, for example for a large clade of Mimosoideae. Interestingly, there are several clades where the analysis of Marazzi et al. (2012) suggests widespread ancestral distribution of the precursor state whereas our analysis indicates a negative liability at the same nodes.


Figure 3: Marginal posterior probabilities for the liabilities, for EFN trait 1 of Marazzi et al. (2012) on the phylogeny inferred by Simon et al. (2009). Lineages with posterior probability $>0.7$ are colored red, lineages with posterior probability $<0.3$ are colored white, and the remaining lineages are colored pink.

Once again, our analysis is only preliminary; our goal here is simply to demonstrate what calculations can now be carried out.

Discussion

We have introduced a new framework for the computation of likelihoods from continuous characters, and illustrated the framework using an efficient algorithm for evaluating (approximate) likelihoods under Wright and Felsenstein’s threshold model.

This framework opens up possibilities in several directions. The numerical integration, or numerical quadrature, literature is vast. In this article we have focused on a popular and simple numerical integration method, and our algorithm should be seen as a proof of principle rather than a definitive threshold likelihood method. There is no question that the numerical efficiency of Algorithm 1 could be improved significantly through the use of more sophisticated techniques: better basis functions or adaptive quadrature methods, for a start.

The connection with Felsenstein’s (discrete character) pruning algorithm also opens up opportunities for efficiency gains. Techniques such as storing partial likelihoods, or approximating local neighborhoods, are fundamental to efficient phylogenetic computations on sequence data (Felsenstein, 1981b, Larget and Simon, 1998, Pond and Muse, 2004, Stamatakis, 2006, Swofford, 2003). These tricks could all be now applied to the calculation of likelihoods from continuous traits.

Finally, we stress that the algorithm does not depend on special characteristics of the continuous trait model, beyond conditional independence of separate lineages. Felsenstein’s pruning algorithm for continuous characters is limited to Gaussian processes and breaks down if, for example, the transition probabilities are governed by Lévy processes (Landis et al., 2013). In contrast, our approach works whenever we can numerically evaluate the transition densities, and indeed only a few minor changes would transform Algorithm 1 into one implementing a far more complex evolutionary process.

Acknowledgements

This research was supported by an Allan Wilson Centre Doctoral Scholarship to GH, financial support to DB from the Allan Wilson Centre, a Marsden grant to DB, and financial support to all authors from the University of Otago.

References

  • Cavalli-Sforza and Edwards [1967] L. L. Cavalli-Sforza and A. W. Edwards. Phylogenetic analysis: models and estimation procedures. American journal of human genetics, 19(3 Pt 1):233, 1967.
  • Cheney and Kincaid [2012] E. Cheney and D. Kincaid. Numerical Mathematics and Computing. Cengage Learning, 2012.
  • Dahlquist and Björck [2008] G. Dahlquist and Å. Björck. Chapter 5: Numerical integration. In Numerical Methods in Scientific Computing, Volume I. SIAM, 2008.
  • Davis and Rabinowitz [2007] P. J. Davis and P. Rabinowitz. Methods of numerical integration. Courier Corporation, 2007.
  • Felsenstein [1968] J. Felsenstein. Statistical inference and the estimation of phylogenies. PhD thesis, University of Chicago, 1968.
  • Felsenstein [1973] J. Felsenstein. Maximum likelihood and minimum-steps methods for estimating evolutionary trees from data on discrete characters. Systematic zoology, pages 240–249, 1973.
  • Felsenstein [1981a] J. Felsenstein. Evolutionary Trees From Gene Frequencies and Quantitative Characters: Finding Maximum Likelihood Estimates. Evolution, 35(6):1229–1242, Nov. 1981a. ISSN 0014-3820. doi: 10.2307/2408134. URL http://www.jstor.org/stable/2408134.
  • Felsenstein [1981b] J. Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of molecular evolution, 17(6):368–376, 1981b.
  • Felsenstein [1988] J. Felsenstein. Phylogenies and quantitative characters. Annual Review of Ecology and Systematics, pages 445–471, 1988.
  • Felsenstein [2002] J. Felsenstein. Quantitative characters, phylogenies, and morphometrics. Morphology, shape and phylogeny, pages 27–44, 2002.
  • Felsenstein [2005] J. Felsenstein. Using the quantitative genetic threshold model for inferences between and within species. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1459):1427–1434, 2005.
  • Felsenstein [2012] J. Felsenstein. A comparative method for both discrete and continuous characters using the threshold model. The American Naturalist, 179(2):145–156, 2012.
  • Felsenstein and Churchill [1996] J. Felsenstein and G. A. Churchill. A hidden Markov model approach to variation among sites in rate of evolution. Molecular Biology and Evolution, 13(1):93–104, 1996.
  • Hansen [1997] T. F. Hansen. Stabilizing selection and the comparative analysis of adaptation. Evolution, pages 1341–1351, 1997.
  • Harmon et al. [2010] L. J. Harmon, J. B. Losos, T. Jonathan Davies, R. G. Gillespie, J. L. Gittleman, W. Bryan Jennings, K. H. Kozak, M. A. McPeek, F. Moreno-Roark, T. J. Near, et al. Early bursts of body size and shape evolution are rare in comparative data. Evolution, 64(8):2385–2396, 2010.
  • Khaitovich et al. [2005] P. Khaitovich, S. Pääbo, and G. Weiss. Toward a neutral evolutionary model of gene expression. Genetics, 170(2):929–939, 2005. URL http://www.genetics.org/content/170/2/929.short.
  • Khaitovich et al. [2006] P. Khaitovich, W. Enard, M. Lachmann, and S. Pääbo. Evolution of primate gene expression. Nature Reviews Genetics, 7(9):693–702, Sept. 2006. ISSN 1471-0056. doi: 10.1038/nrg1940. URL http://www.nature.com/nrg/journal/v7/n9/full/nrg1940.html.
  • Kutsukake and Innan [2013] N. Kutsukake and H. Innan. Simulation-based likelihood approach for evolutionary models of phenotypic traits on phylogeny. Evolution, 67(2):355–367, 2013.
  • Lande [1976] R. Lande. Natural selection and random genetic drift in phenotypic evolution. Evolution, pages 314–334, 1976.
  • Landis et al. [2013] M. J. Landis, J. G. Schraiber, and M. Liang. Phylogenetic analysis using lévy processes: finding jumps in the evolution of continuous traits. Systematic biology, 62(2):193–204, 2013.
  • Larget and Simon [1998] B. Larget and D. Simon. Faster likelihood calculations on trees. Pittsburgh, Pennsylvania: Department of Mathematics and Computer Science, Duquesne University, pages 98–02, 1998.
  • Lemey et al. [2010] P. Lemey, A. Rambaut, J. J. Welch, and M. A. Suchard. Phylogeography takes a relaxed random walk in continuous space and time. Molecular biology and evolution, 27(8):1877–1885, 2010.
  • Lepage et al. [2006] T. Lepage, S. Lawi, P. Tupper, and D. Bryant. Continuous and tractable models for the variation of evolutionary rates. Mathematical biosciences, 199(2):216–233, 2006.
  • Marazzi et al. [2012] B. Marazzi, C. Ané, M. F. Simon, A. Delgado-Salinas, M. Luckow, and M. J. Sanderson. Locating evolutionary precursors on a phylogenetic tree. Evolution, 66(12):3918–3930, 2012.
  • Pond and Muse [2004] S. L. K. Pond and S. V. Muse. Column sorting: Rapid calculation of the phylogenetic likelihood function. Systematic biology, 53(5):685–692, 2004.
  • Ronquist [2004] F. Ronquist. Bayesian inference of character evolution. Trends in Ecology & Evolution, 19(9):475–481, 2004. URL http://www.sciencedirect.com/science/article/pii/S0169534704001831.
  • Simon et al. [2009] M. F. Simon, R. Grether, L. P. de Queiroz, C. Skema, R. T. Pennington, and C. E. Hughes. Recent assembly of the cerrado, a neotropical plant diversity hotspot, by in situ evolution of adaptations to fire. Proceedings of the National Academy of Sciences, 106(48):20359–20364, 2009.
  • Sirén et al. [2011] J. Sirén, P. Marttinen, and J. Corander. Reconstructing population histories from single nucleotide polymorphism data. Molecular biology and evolution, 28(1):673–683, 2011.
  • Stamatakis [2006] A. Stamatakis. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics, 22(21):2688–2690, 2006.
  • Stevens [1991] P. F. Stevens. Character states, morphological variation, and phylogenetic analysis: a review. Systematic Botany, pages 553–583, 1991. URL http://www.jstor.org/stable/2419343.
  • Swofford [2003] D. L. Swofford. PAUP*: Phylogenetic Analysis Using Parsimony (*and Other Methods), Version 4. Sinauer Associates, 2003.
  • Wright [1934] S. Wright. An analysis of variability in number of digits in an inbred strain of guinea pigs. Genetics, 19(6):506, 1934.
  • Yang [2006] Z. Yang. Computational molecular evolution, volume 21. Oxford University Press Oxford, 2006.