
Simultaneous SNP identification in association studies with missing data

Zhen Li, Vikneswaran Gopal, Xiaobo Li, John M. Davis and George Casella (casella@stat.ufl.edu)

State Street Corporation, University of Florida, University of Florida, University of Florida and University of Florida

Z. Li
State Street Corporation
1 Lincoln Street, 15th floor
Boston, Massachusetts 02111
USA
V. Gopal
Department of Statistics
University of Florida
Gainesville, Florida 32611
USA
X. Li
J. M. Davis
School of Forest Resources
 and Conservation
University of Florida
Gainesville, Florida 32611
USA
G. Casella
Department of Statistics
 and Genetics Institute
University of Florida
Gainesville, Florida 32611
USA
(2012; received December 2010; revised September 2011)
Abstract

Association testing aims to discover the underlying relationship between genotypes (usually Single Nucleotide Polymorphisms, or SNPs) and phenotypes (attributes, or traits). The typically large data sets used in association testing often contain missing values. Standard statistical methods either impute the missing values using relatively simple assumptions, or delete them, or both, which can generate biased results. Here we describe the Bayesian hierarchical model BAMD (Bayesian Association with Missing Data). BAMD is a Gibbs sampler, in which missing values are multiply imputed based upon all of the available information in the data set. We estimate the parameters and prove that updating one SNP at each iteration preserves the ergodic property of the Markov chain, and at the same time improves computational speed. We also implement a model selection option in BAMD, which enables potential detection of SNP interactions. Simulations show that unbiased estimates of SNP effects are recovered with missing genotype data. Also, we validate associations between SNPs and a carbon isotope discrimination phenotype that were previously reported using a family based method, and discover an additional SNP associated with the trait. BAMD is available as an R-package from http://cran.r-project.org/package=BAMD.

Keywords: Hierarchical models, Bayes models, Gibbs sampling, genome-wide association

doi: 10.1214/11-AOAS516

Volume 6, Issue 2


1 Introduction

This work was motivated by a study of the genomics of loblolly pine, an economically and ecologically important tree species in the United States. The native range of loblolly pine extends from Maryland, south to Florida, and west to Texas. Its annual harvest value is approximately 19 billion dollars [McKeever and Howard (1996)]. Pine species in the southern states produce 58% of the timber in the US and 15.8% of the world's timber [Wear and Greis (2002)]. We are interested in discovering the relationship between phenotypic traits and the genes underlying complex traits in loblolly pine, so we can understand their evolution and apply that knowledge to genetic improvement. We are especially interested in SNPs associated with disease resistance and response to water deficit.

Large genomic data sets typically contain missing data. Missing data create imbalance and complicate calculations required for statistical analyses. There are various approaches to dealing with missing data. Eliminating cases is one approach, but undesirable in large data sets where most or all cases have missing data. Imputation is more commonly used [Huisman (2000), Dai et al. (2006)]. Single imputation using haplotype data [Marchini et al. (2007), Su et al. (2008), Sun and Kardia (2008), Szatkiewicz et al. (2008)], either implicitly or explicitly, relies on linkage disequilibrium among markers, or information that can be extracted from other data sets [Stephens, Smith and Donnelly (2001), Scheet and Stephens (2006), Servin and Stephens (2007)]. However, there is no reference genome sequence for loblolly pine, so it is not possible to impute missing SNPs from flanking SNPs.

It is well established that single imputation approaches, while fast, can give biased parameter estimates [Greenland and Finkle (1995); see also van der Heijden et al. (2006)]. The best approach is to average over the missing data using the formal missing data distribution, rather than to impute a single value based on a possibly ad hoc scheme. This is appealing because it addresses uncertainty and variability in the missing data [Little and Rubin (2002), Dai et al. (2006)], particularly in species or genomic regions where LD decays rapidly and thus adjacent SNPs are not necessarily correlated [Flint-Garcia, Thornsberry and Buckler (2003), Neale and Ingvarsson (2008)]. However, multiple imputation is so computationally intensive that, prior to the present work, it has not been feasible for larger genomic data sets.

Several approaches have been developed recently to enable association testing. Association testing identifies relationships between polymorphisms in DNA sequence (most commonly Single Nucleotide Polymorphisms, or SNPs) and phenotypes, as a strategy to identify the genes that control traits [Flint-Garcia, Thornsberry and Buckler (2003), Hirschhorn and Daly (2005), Balding (2006)]. For family-based analysis, Chen and Abecasis (2007) used an identity by descent parameter to measure correlation among SNPs, and a kinship coefficient to model the correlation among siblings to develop the Quantitative Transmission Disequilibrium Test (QTDT). Other approaches allow association testing in populations with recent mating or historical (unrecorded) mating, or combinations. TASSEL fits a mixed model to detect associations while taking into account both “coarse-scale” relatedness based on population structure [Pritchard, Stephens and Donnelly (2000)] and “fine-scale” relatedness based on a matrix of kinship coefficients [Yu et al. (2006)]. Our approach for family based analyses accomplishes the same goal by employing the numerator relationship matrix [Henderson (1976); see also Quaas (1976)], which avoids complications arising from nonpositive definite matrices derived from complex interrelationships. A desirable feature of any association testing approach is simultaneous solution of multiple SNP effects to prevent upward bias in parameter estimates, and to appropriately model the underlying biological system in which many SNPs act in concert to condition the phenotype. Such an approach is developed in Wilson et al. (2010), who introduce Multilevel Inference for SNP Association (MISA), using imputation from fastPHASE [Stephens, Smith and Donnelly (2001), Servin and Stephens (2007)] and a Bayes-driven stochastic search method to find good models.

In this paper we introduce BAMD (Bayesian Association with Missing Data) and show that computation time required for formal multiple imputation can be reduced without sacrificing accuracy, establishing the feasibility of using BAMD on genomic data sets. Our approach is to use all available data in imputation of missing SNPs. This approach is motivated statistically; we use all of the available information to estimate SNP effects on phenotypes, across all possible values for missing SNPs. Prior knowledge such as pedigree structure may be used as constraints. Simulations show that BAMD detects SNP effects efficiently.

On the loblolly pine genomic data, we used a series of tag SNPs [González-Martínez et al. (2008)]. Tag SNPs are markers that are relatively evenly dispersed throughout the genome, and are used to survey chromosomal segments for genes that underlie phenotypes. One assessment of the performance of BAMD was to use it on this same population, genotype, and phenotype data, in which it had been found that four of the tag SNPs were significantly associated with carbon isotope discrimination [a measure of water use efficiency, González-Martínez et al. (2008)]. BAMD detected a fifth tag SNP in addition to the four tag SNPs that were detected using QTDT in that previous work.

An additional feature of BAMD is the variable selector. The variable selector searches model space for the most parsimonious set of SNPs that explain the phenotype. This feature is designed for unsupervised discovery of interactions among SNPs, and should find application in situations where epistatic interactions are important determinants of phenotype.

The remainder of the paper is organized as follows. In Section 2 we describe the model and the estimation of parameters, and Section 3 describes the variable selector, including the use of Bayes factors, the stochastic search, and computational strategies. In Section 4.1, we investigate the amount of missing data that BAMD can handle through a simulation. Section 4.2 compares our procedure to BIMBAM, a popular genomics program that does both imputation and variable selection. Section 5 analyzes the loblolly pine data, where we discover a previously unreported SNP association. Section 6 contains a concluding discussion. Computational implementation is described in the Appendix, and the accompanying theorems and proofs can be found in the online Supplemental Information [Li et al. (2011)].

2 Model

Our method can be viewed as a two-stage procedure. The first stage involves identifying individual SNPs that have significant effects on the phenotype, with all SNPs in the model. The second stage searches for the best subset of SNPs, from those picked out in the first stage. First we describe the model.

2.1 Conceptual framework for BAMD

The response is assumed to be continuous, following a normal distribution. The data set has fully observed family covariates for all the observations. Missing values are imputed only among SNPs, although the method can be modified to impute missing values for phenotypes as well. We focus on testing the relationship between the response and the SNPs. We assume only additive effects among SNPs, although the method can be adapted to quantifying additive and dominance effects of SNPs.

We begin with the linear mixed model

\[
\mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\mathbf{Z}\boldsymbol{\gamma}+\boldsymbol{\varepsilon}, \qquad (1)
\]

where $\mathbf{Y}_{n\times 1}$ is the phenotypic trait, $\mathbf{X}_{n\times p}$ is the design matrix for family covariates, $\boldsymbol{\beta}_{p\times 1}$ are the coefficients for the family effect, $\mathbf{Z}_{n\times s}$ is the design matrix for SNPs (genotypes), $\boldsymbol{\gamma}_{s\times 1}$ are the coefficients of the additive effect for SNPs, and $\boldsymbol{\varepsilon}_{n\times 1}\sim N(0,\sigma^{2}\mathbf{R})$. The matrix $\mathbf{R}$ is the numerator relationship matrix, describing the degree of kinship between different individuals. (Details on the calculation of $\mathbf{R}$ are given in Appendix A.)

For our application here (carbon isotope data), we have $n=1000$ and $s=450$. In another application of BAMD [Quesada et al. (2010)], they used $n=450$ and $s=400$. In both cases the number of covariates, $p$, was less than 6. With a fully Bayesian implementation, it is possible to adapt BAMD to the $p\gg n$ case.

Each row of the matrix $\mathbf{Z}$, $\mathbf{Z}_{i}$, $i=1,\ldots,n$, corresponds to the SNP genotype information of one individual, which can be homozygous for either of the two nucleotides ($-1$, $1$) or heterozygous ($0$). Some of this information may be missing, and we write $\mathbf{Z}_{i}=(\mathbf{Z}_{i}^{\mathrm{obs}},\mathbf{Z}_{i}^{\mathrm{miss}})$, where $\mathbf{Z}_{i}^{\mathrm{obs}}$ are the observed genotypes for the $i$th individual, and $\mathbf{Z}_{i}^{\mathrm{miss}}$ are the missing genotypes. Note two aspects of this framework:

(1) The values of $\mathbf{Z}_{i}^{\mathrm{miss}}$ are not observed. Thus, if $*$ denotes one missing SNP, a possible $\mathbf{Z}_{i}$ is $\mathbf{Z}_{i}=(1,*,0,0,*,*,1)$.

(2) Individuals are likely to have missing data at different SNP loci. So for 2 different individuals, we might have
\[
\mathbf{Z}_{i}=(1,*,0,0,*,*,1)\quad\mbox{and}\quad\mathbf{Z}_{i^{\prime}}=(*,*,1,0,0,1,1).
\]

For a Bayesian model, we want to put prior distributions on the parameters. We put a noninformative uniform prior on $\boldsymbol{\beta}$, which essentially leads us to least squares estimation. For $\boldsymbol{\gamma}$, we use the normal prior $\boldsymbol{\gamma}\sim N(\mathbf{0},\sigma^{2}\phi^{2}\mathbf{I}_{s})$. Here $\phi^{2}$ is a scale parameter for the variance and $\sigma^{2}$ is the variance parameter. For $\sigma^{2}$ and $\phi^{2}$, we use inverted Gamma priors: $\sigma^{2}\sim \operatorname{IG}(a,b)$ and $\phi^{2}\sim \operatorname{IG}(c,d)$, where IG stands for the inverted Gamma distribution, and $a$, $b$, $c$, and $d$ are constants used in the priors. For specified $a,b,c,d$, the resulting posterior distribution is proper [see Hobert and Casella (1996)].

We consider the case of tag SNP markers in the loblolly pine genome with no significant linkage disequilibrium between them [González-Martínez et al. (2006)]. Therefore, noninformative priors are used for the missing SNPs, meaning that missing data have equal probability of any allelic state. As information increases due to higher marker density, or parental information, or allele frequency in the population, missing data imputation could be constrained accordingly.

We assume that missing SNPs in the data set are Missing at Random (MAR). In particular, let the value of the random variable $T$ denote whether $Z$ is observed, with $T=1$ if the value is observed and $T=0$ if it is missing. If $\xi$ is the parameter of the missing-data mechanism, then, under the model (1), the MAR assumption results in

\[
P(T|Y,Z^{\mathrm{obs}},Z^{\mathrm{miss}},\xi)=P(T|Y,Z^{\mathrm{obs}},\xi).
\]

So the distribution of the missing SNP may depend on the observed SNPs and the observed phenotypes. Of course, this does not require such a dependence; it only allows for it.

Other assumptions about missing data mechanisms are less common than MAR. The strongest assumption, and the most difficult to justify, is Missing Completely at Random (MCAR). Under this assumption, the missing data distribution is independent of all observed data, the complete cases can be regarded as a random subsample from the population, and statistical inference based on the complete cases is fully valid. Under the model (1), the MCAR assumption can be expressed as

\[
P(T|Y,Z^{\mathrm{obs}},Z^{\mathrm{miss}},\xi)=P(T|\xi).
\]

MCAR is regarded as unrealistic and, in most cases, it is not satisfied. It is typically not used to model missing data, and we do not use it here.

Conditional on the MAR assumption, we impute the missing SNPs based on the correlation between SNPs within individuals and between individuals, and use the phenotypic trait information to improve the power of imputation.

In this model, the covariance matrix $\mathbf{R}$ models the covariance between individuals within the same family, and the covariance between individuals across families. Phenotypic traits of related individuals are alike because they share some proportion of SNPs, and genotypes of relatives are similar because they share the same alleles passed on from parents. Various methods can be used to calculate the relationship matrix, such as using a co-ancestry matrix, a kinship matrix, etc. The basic idea is to calculate the probability that 2 individuals share SNPs that are identical by descent. Some methods use pairwise calculations and thus do not guarantee a positive definite relationship matrix, which is unsatisfactory when the relationship matrix is used as a covariance matrix. We use the recursive calculation method of Henderson (1976), which gives a numerator relationship matrix that quantifies the probability of sharing a SNP from the same ancestry, based on known family pedigree and parent pedigree in the population. By calculating this relationship matrix we obtain a probability of 0.5 for the case that two siblings are within the same control-pollinated family and therefore share the same copy of a SNP, or a probability of 0.25 if the two siblings have only one parent in common. For the complex pedigree that we analyze here, there are a total of 9 categories of relatedness.
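To make the recursion concrete, the rule behind Henderson's method is: the diagonal entry for individual $i$ with parents $s$ and $d$ is $1+\tfrac12 A_{sd}$, and the off-diagonal entry with an earlier individual $j$ is $\tfrac12(A_{js}+A_{jd})$. Below is a minimal sketch of that recursion; it is our own illustration (the function name and pedigree encoding are assumptions), not the BAMD package code.

```python
import numpy as np

def numerator_relationship(sires, dams):
    """Henderson's (1976) recursive numerator relationship matrix.

    sires[i], dams[i] are 0-based parent indices of individual i
    (None if unknown); parents must be listed before their offspring.
    """
    n = len(sires)
    A = np.zeros((n, n))
    for i in range(n):
        s, d = sires[i], dams[i]
        # diagonal: 1 plus half the parents' relationship (inbreeding term)
        A[i, i] = 1.0 + (0.5 * A[s, d] if s is not None and d is not None else 0.0)
        for j in range(i):
            # off-diagonal: average relationship of j with i's parents
            a_js = A[j, s] if s is not None else 0.0
            a_jd = A[j, d] if d is not None else 0.0
            A[i, j] = A[j, i] = 0.5 * (a_js + a_jd)
    return A

# two unrelated parents (0, 1) with two full-sib offspring (2, 3)
A = numerator_relationship([None, None, 0, 0], [None, None, 1, 1])
```

Running this recovers the value quoted in the text: the two full sibs (individuals 2 and 3) have relationship 0.5.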

2.2 Estimation of parameters

The model (1) along with the prior specification allows the use of a Gibbs sampler to estimate parameters. We can iteratively sample from the full conditionals, given by

\[
\begin{aligned}
\beta &\sim N\bigl((X'R^{-1}X)^{-1}X'R^{-1}(Y-\mathbf{Z}\gamma),\ \sigma^{2}(X'R^{-1}X)^{-1}\bigr),\\
\gamma &\sim N\biggl(\biggl(\mathbf{Z}'R^{-1}\mathbf{Z}+\frac{I}{\phi^{2}}\biggr)^{-1}\mathbf{Z}'R^{-1}(Y-X\beta),\ \sigma^{2}\biggl(\mathbf{Z}'R^{-1}\mathbf{Z}+\frac{I}{\phi^{2}}\biggr)^{-1}\biggr),\\
\sigma^{2} &\sim \frac{1}{(\sigma^{2})^{n/2+s/2+a+1}}\exp\biggl(-\frac{(Y-X\beta-\mathbf{Z}\gamma)'R^{-1}(Y-X\beta-\mathbf{Z}\gamma)+|\gamma|^{2}/\phi^{2}+2b}{2\sigma^{2}}\biggr),\\
\phi^{2} &\sim \frac{1}{(\phi^{2})^{s/2+c+1}}\exp\biggl(-\frac{|\gamma|^{2}/\sigma^{2}+2d}{2\phi^{2}}\biggr).
\end{aligned}
\qquad (2)
\]
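The four conditionals above translate directly into one cycle of the sampler: $\beta$ and $\gamma$ are multivariate normal draws, while $\sigma^{2}$ and $\phi^{2}$ are inverted Gamma draws. Below is a minimal numpy sketch of one such cycle; it is our own illustration with hypothetical argument names, not the BAMD package implementation.

```python
import numpy as np

def gibbs_cycle(Y, X, Z, Rinv, beta, gamma, sigma2, phi2, a, b, c, d, rng):
    """One cycle through the four full conditionals of model (1)."""
    n, s = Z.shape
    # beta | rest: generalized least squares regression of Y - Z gamma on X
    Vb = np.linalg.inv(X.T @ Rinv @ X)
    beta = rng.multivariate_normal(Vb @ X.T @ Rinv @ (Y - Z @ gamma), sigma2 * Vb)
    # gamma | rest: ridge-type regression of Y - X beta on Z
    Vg = np.linalg.inv(Z.T @ Rinv @ Z + np.eye(s) / phi2)
    gamma = rng.multivariate_normal(Vg @ Z.T @ Rinv @ (Y - X @ beta), sigma2 * Vg)
    # sigma^2 | rest: inverted Gamma(n/2 + s/2 + a, .); IG draw = scale / Gamma draw
    r = Y - X @ beta - Z @ gamma
    sigma2 = ((r @ Rinv @ r + gamma @ gamma / phi2) / 2 + b) / rng.gamma(n / 2 + s / 2 + a)
    # phi^2 | rest: inverted Gamma(s/2 + c, .)
    phi2 = ((gamma @ gamma / sigma2) / 2 + d) / rng.gamma(s / 2 + c)
    return beta, gamma, sigma2, phi2
```

Here `Rinv` is the precomputed inverse of the numerator relationship matrix $\mathbf{R}$; Section 3.3 discusses how the $\gamma$-step inversion can be done more cheaply.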

The SNPs are contained in the $\mathbf{Z}$ matrix, which includes both the observed SNPs and missing SNPs, and we use the Gibbs sampler to impute the missing SNPs. The Gibbs sampler for the missing data simulates values of $Z_{i}^{m}$ according to the distribution of each missing SNP conditional on the rest of the observed SNPs and sampled missing SNPs. For a particular SNP $Z^{m}_{ij}$, the $j$th missing SNP in the $i$th individual, the conditional distribution given the rest of the vector $Z^{m}_{i(-j)}$ and all other parameters in the model is

\[
P\bigl(Z^{m}_{ij}=c\,|\,Z^{m}_{i(-j)}\bigr)=\frac{\exp(-(Y_{i}-X_{i}\beta-Z_{i}^{o}\gamma_{i}^{o}-Z^{m}_{i(-j)}\gamma^{m}_{i(-j)}-c\gamma^{m}_{ij})^{2}/(2\sigma^{2}))}{\sum_{\ell=1}^{3}\exp(-(Y_{i}-X_{i}\beta-Z_{i}^{o}\gamma_{i}^{o}-Z^{m}_{i(-j)}\gamma^{m}_{i(-j)}-c_{\ell}\gamma^{m}_{ij})^{2}/(2\sigma^{2}))}. \qquad (3)
\]

The value $c$ is the genotype currently being considered for that missing SNP, and $c_{\ell}$ represents any one of the possible genotypes for the SNP. Notice there are only 3 terms in the denominator sum for each SNP; this is a key reason why Gibbs sampling is computationally feasible for our situation with many SNPs and many observations. We also note that the EM algorithm, which provides an alternative method of parameter estimation, can require a prohibitive amount of computation. See Appendix B.
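For concreteness, the three-category draw in (3) can be sketched as below. This is our own illustration (the name `resid_wo_j` is an assumption): it stands for $Y_i - X_i\beta$ minus the contributions of all SNPs other than $j$, so the candidate genotype $c$ contributes $c\,\gamma^{m}_{ij}$ to the mean.

```python
import numpy as np

def draw_missing_snp(resid_wo_j, gamma_j, sigma2, rng, genotypes=(-1, 0, 1)):
    """Sample one missing SNP genotype from the 3-point conditional in (3)."""
    # unnormalized log-probabilities for the three genotype categories
    logp = np.array([-(resid_wo_j - c * gamma_j) ** 2 / (2 * sigma2)
                     for c in genotypes])
    # normalize on the probability scale (subtract max for stability)
    p = np.exp(logp - logp.max())
    p /= p.sum()
    return rng.choice(genotypes, p=p)
```

When $\gamma^{m}_{ij}=0$ the three categories are equally likely, which is the noninformative behavior described above; as $\sigma^{2}$ shrinks, the draw concentrates on the genotype that best fits the residual.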

3 Variable selection

The Gibbs sampler will estimate the full set of parameters in model (1). However, it is often the case that a small set of SNPs will explain a sufficient proportion of the variability, and such a set may also be more biologically meaningful. To this end, along with the Gibbs sampler, we run a second Markov chain that searches through the space of available models, looking for the one with the largest Bayes factor.

A model is specified by a vector $\delta$ of length $s$, whose entries are either 0 or 1. The $\gamma$ vector of model (1) becomes $\gamma_{\delta}=\gamma\star\delta$, where "$\star$" denotes componentwise multiplication. The corresponding columns of $\mathbf{Z}$ are deleted, giving $\mathbf{Z}_{\delta}$, and the reduced model is

\[
\mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\mathbf{Z}_{\delta}\boldsymbol{\gamma}_{\delta}+\boldsymbol{\varepsilon}. \qquad (4)
\]

Thus, the components of $\gamma_{i}$ corresponding to $\delta_{i}=0$ are excluded from the model. Correspondingly, let $\boldsymbol{\theta}$ denote the random vector consisting of all parameters in the full model, so $\boldsymbol{\theta}:=(\boldsymbol{\beta},\boldsymbol{\gamma},\sigma^{2},\phi^{2},\mathbf{Z})$ and, naturally, $\boldsymbol{\theta}_{\delta}:=(\boldsymbol{\beta},\boldsymbol{\gamma}_{\delta},\sigma^{2},\phi^{2},\mathbf{Z}_{\delta})$. Let $m_{\delta}$, $\pi_{\delta}$, and $p_{\delta}$ denote the marginal distribution of $\mathbf{Y}$, the prior distribution on $\boldsymbol{\theta}_{\delta}$, and the conditional distribution of $\mathbf{Y}$, respectively. We also write, if needed, $\boldsymbol{\theta}=(\boldsymbol{\theta}_{\delta},\boldsymbol{\theta}_{\delta^{c}})$, the latter containing the remaining parameters not specified by $\delta$. For the full model containing all parameters we omit the subscript.

3.1 Searching with Bayes factors

In order to compare models, we shall use the Bayes factor comparing each candidate model to the full model, given by

\[
\mathrm{BF}_{\delta}=\frac{m_{\delta}(\mathbf{Y})}{m(\mathbf{Y})}=\frac{\int \pi_{\delta}(\boldsymbol{\theta}_{\delta})p_{\delta}(\mathbf{Y}|\boldsymbol{\theta}_{\delta})\,\mathrm{d}\boldsymbol{\theta}_{\delta}}{\int \pi(\boldsymbol{\theta})p(\mathbf{Y}|\boldsymbol{\theta})\,\mathrm{d}\boldsymbol{\theta}}, \qquad (5)
\]

where $p$ denotes the density under the full model. We now can compare models $\delta$ and $\delta'$ through their Bayes factors, as a larger Bayes factor corresponds to a model that explains more variability, when compared to the full model. These pairwise comparisons result in a consistent model selector [O'Hagan and Forster (2004)], and have an advantage over BIC, which is overly biased toward smaller models [Casella et al. (2009)].

We now set up a Metropolis–Hastings (MH) search whose target distribution is proportional to the Bayes factor $\mathrm{BF}_{\delta}$. Given that we are at model $\delta$, with probability $a$ we choose a candidate $\delta'$ from a random walk (choose one component at random and switch $0\rightarrow 1$ or $1\rightarrow 0$) and, with probability $1-a$, we do an independent jump. This is a symmetric candidate, and $\delta'$ is accepted with probability $\min\{1,\mathrm{BF}_{\delta'}/\mathrm{BF}_{\delta}\}$.
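A sketch of this propose-and-accept step is below. It is a generic illustration: `bf` is assumed to return the Bayes factor (estimate) for a model, and we take the independent jump to be uniform over models, which is one symmetric choice consistent with the acceptance probability above.

```python
import numpy as np

def propose(delta, a, rng):
    """Symmetric proposal for the 0/1 model indicator vector delta."""
    if rng.random() < a:
        # random-walk move: flip one coordinate 0 -> 1 or 1 -> 0
        new = delta.copy()
        j = rng.integers(len(delta))
        new[j] = 1 - new[j]
        return new
    # independent jump: draw a fresh model uniformly at random
    return rng.integers(0, 2, size=len(delta))

def mh_step(delta, bf, a, rng):
    """One Metropolis-Hastings step targeting a distribution prop. to bf(delta)."""
    cand = propose(delta, a, rng)
    # symmetric candidate => accept with min{1, BF(cand)/BF(delta)}
    if rng.random() < min(1.0, bf(cand) / bf(delta)):
        return cand
    return delta
```

In BAMD the role of `bf` is played by the running estimate described in Section 3.2, recomputed from the current Gibbs samples.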

3.2 Estimating the Bayes factor

Calculating the Bayes factor in (5) requires knowing the $\mathbf{Z}$ matrix, which is not available with missing data. Thus, to calculate the Bayes factor, we need to use the imputed $\mathbf{Z}$ matrix from the Gibbs sampler. We therefore run two Markov chains simultaneously:

(1) A Gibbs sampler on the full model, to impute the missing data in $\mathbf{Z}$ and estimate all parameters.

(2) A Metropolis–Hastings algorithm on $\delta$, in model space, to find the best model. The aim is to find $\delta^{*}=\arg\max_{\delta}\mathrm{BF}_{\delta}$, but since we are not able to compute $\mathrm{BF}_{\delta}$ exactly for any given $\delta$, we estimate it from the current values in the Gibbs chain, which yields a strongly consistent estimator. We then use the estimated Bayes factor as the target in the stochastic search driven by the Metropolis–Hastings algorithm.

A typical method of estimating a quantity such as (5) would be to use bridge sampling [Meng and Wong (1996)]. However, since the numerator and denominator have different dimensions (but the numerator model is always nested in the denominator model), ordinary bridge sampling will not work. A variation [Chen and Shao (1997)] which accounts for this introduces a weight function to handle the dimension difference. We summarize this strategy in the following proposition.

Proposition 1

Referring to (5), let $g(\boldsymbol{\theta})$ be such that $\int p(\mathbf{Y}|\boldsymbol{\theta})g(\boldsymbol{\theta})\,\mathrm{d}\boldsymbol{\theta}_{\delta^{c}}=p_{\delta}(\mathbf{Y}|\boldsymbol{\theta}_{\delta})$. Then if the expectation is taken with respect to the posterior distribution $\pi(\boldsymbol{\theta}|\mathbf{Y})$,

\[
\mathbb{E}\biggl[\frac{\pi_{\delta}(\boldsymbol{\theta}_{\delta})g(\boldsymbol{\theta})}{\pi(\boldsymbol{\theta})}\biggr]=\mathrm{BF}_{\delta}.
\]

One particular $g$ function is defined as follows. Let $P_{\delta^{c}}:=\mathbf{Z}_{\delta^{c}}(\mathbf{Z}'_{\delta^{c}}\mathbf{Z}_{\delta^{c}})^{-1}\mathbf{Z}'_{\delta^{c}}$, $\mathbf{C}_{\delta}:=\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}-\mathbf{Z}_{\delta}\boldsymbol{\gamma}_{\delta}$, and

\[
g(\boldsymbol{\theta})=(2\pi\sigma^{2})^{-d^{c}/2}|\mathbf{Z}'_{\delta^{c}}\mathbf{Z}_{\delta^{c}}|^{1/2}\exp\biggl(-\frac{1}{2\sigma^{2}}\mathbf{C}'_{\delta}P_{\delta^{c}}\mathbf{C}_{\delta}\biggr), \qquad (6)
\]

which leads to the strongly consistent Bayes factor estimator

\[
\widehat{\mathrm{BF}}_{\delta}=\frac{1}{N}\sum_{i=1}^{N}\bigl(\phi^{2(i)}\bigr)^{d^{c}/2}\bigl|\mathbf{Z}^{(i)\prime}_{\delta^{c}}\mathbf{Z}^{(i)}_{\delta^{c}}\bigr|^{1/2}\exp\biggl(-\frac{1}{2\sigma^{2(i)}}\biggl(\frac{|\boldsymbol{\gamma}^{(i)}_{\delta^{c}}|^{2}}{\phi^{2(i)}}+\mathbf{C}^{(i)\prime}_{\delta}P^{(i)}_{\delta^{c}}\mathbf{C}^{(i)}_{\delta}\biggr)\biggr). \qquad (7)
\]

Details and proofs of the results given here are in Supplemental Information, Section D [Li et al. (2011)].

3.3 Increasing computational speed

For data sets with large numbers of SNPs and phenotypes, the slow computation speed of the Gibbs sampler can be a major problem. We have identified two bottlenecks. First, if the number of SNPs is increased, then for each iteration, the number of missing SNPs to be updated will also increase. Second, in the iterations of the Gibbs sampler, the generation of $\gamma$ involves inverting the matrix $\mathbf{Z}'R^{-1}\mathbf{Z}+(1/\phi^{2})I$ each time, as the $\mathbf{Z}$ matrix changes at each iteration. We address these bottlenecks in the following sections.

3.3.1 SNP updating

To speed up calculation, we show that instead of updating all the SNPs at each iteration, updating only one column of SNPs (that is, one SNP updated for all observations) at each cycle still preserves the target stationary distribution and the ergodicity of the chain. As each SNP has only three possible values, this change should not have a great effect on mixing.

A consequence of this result is that instead of updating tens or hundreds of SNPs in one cycle, we need to update just one SNP in each cycle. This single-SNP updating dramatically speeds up computation, especially when there are large numbers of SNPs, or large numbers of observations, in the data. (See Section E of the Supplemental Information [Li et al. (2011)].)

3.3.2 Matrix inverse updating

In the iterations of the Gibbs sampler, a major bottleneck is the generation of $\gamma$, since it involves inverting the matrix $\mathbf{Z}'R^{-1}\mathbf{Z}+(1/\phi^{2})I$ each time, as the $\mathbf{Z}$ matrix changes at each iteration. Two modifications will speed up this calculation, each based on Woodbury's formula [see Hager (1989) and Appendix C].

By Woodbury's formula, if the matrices $A$ and $I+VA^{-1}U$ are both invertible, then

\[
(A+UV)^{-1}=A^{-1}-A^{-1}U(I+VA^{-1}U)^{-1}VA^{-1}. \qquad (8)
\]

If $U$ and $V$ are vectors, the inverse takes a particularly nice form:

\[
(A+uv')^{-1}=A^{-1}-\frac{A^{-1}uv'A^{-1}}{1+v'A^{-1}u}, \qquad (9)
\]

so if we have A1A^{-1}, no further inversion is needed.
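The rank-one form (9) is easy to verify numerically. The sketch below (our own illustration, with arbitrary stand-in matrices) compares the Sherman–Morrison-style update against a direct inversion.

```python
import numpy as np

# verify (A + u v')^{-1} = A^{-1} - A^{-1} u v' A^{-1} / (1 + v' A^{-1} u)
A = np.diag([1.5, 2.0, 2.5, 3.0])       # an invertible 4 x 4 matrix
u = np.array([0.1, 0.2, 0.3, 0.4])
v = np.array([0.4, 0.3, 0.2, 0.1])
Ainv = np.linalg.inv(A)
# rank-one update of the known inverse: no new matrix inversion needed
updated = Ainv - np.outer(Ainv @ u, v @ Ainv) / (1.0 + v @ Ainv @ u)
# direct (expensive) route: invert the perturbed matrix from scratch
direct = np.linalg.inv(A + np.outer(u, v))
assert np.allclose(updated, direct)
```

The update costs only matrix-by-vector products once $A^{-1}$ is in hand, which is the point exploited below.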

First, relating to the generation of γ\gamma in (2.2), (8) leads to the identity

\[
\biggl(\mathbf{Z}'R^{-1}\mathbf{Z}+\frac{1}{\phi^{2}}I\biggr)^{-1}=\phi^{2}\biggl[I-\mathbf{Z}'\biggl(\frac{1}{\phi^{2}}R+\mathbf{Z}\mathbf{Z}'\biggr)^{-1}\mathbf{Z}\biggr], \qquad (10)
\]

where the left-hand side involves the inversion of an s×ss\times s matrix, and the right-hand side involves the inversion of an n×nn\times n matrix. Thus, we can always choose to invert the smaller matrix.
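This dimension trade-off can also be checked numerically. The sketch below (our own illustration; the matrices are arbitrary stand-ins for $\mathbf{Z}$ and $R$) confirms that inverting the $s\times s$ matrix and inverting the $n\times n$ matrix give the same result.

```python
import numpy as np

rng = np.random.default_rng(1)
n, s, phi2 = 6, 3, 0.7
Z = rng.integers(-1, 2, size=(n, s)).astype(float)  # genotype-like entries
R = np.eye(n) + 0.25 * np.ones((n, n))              # a positive definite R
Rinv = np.linalg.inv(R)
# left-hand side of (10): invert an s x s matrix
lhs = np.linalg.inv(Z.T @ Rinv @ Z + np.eye(s) / phi2)
# right-hand side of (10): invert an n x n matrix instead
rhs = phi2 * (np.eye(s) - Z.T @ np.linalg.inv(R / phi2 + Z @ Z.T) @ Z)
assert np.allclose(lhs, rhs)
```

With $s\gg n$ one would evaluate the right-hand side; with $n\gg s$, the left.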

Next we look at inverting $\mathbf{Z}'R^{-1}\mathbf{Z}+(1/\phi^{2})I$ [a similar argument can be developed for the right-hand side of (10)]. Suppose, at the current iteration, we have $A_{0}=\mathbf{Z}'_{0}R^{-1}\mathbf{Z}_{0}+(1/\phi_{0}^{2})I$, and we update to $A_{1}=\mathbf{Z}'_{1}R^{-1}\mathbf{Z}_{1}+(1/\phi_{1}^{2})I$. Because we update one column of SNPs at each iteration, we have $\mathbf{Z}_{1}=\mathbf{Z}_{0}+\Delta$, where $\Delta$ is a matrix of all 0's, except for one column. This column contains the differences of the respective columns from $\mathbf{Z}_{1}$ and $\mathbf{Z}_{0}$. Thus, $\Delta=(0\cdots 0\ \delta\ 0\cdots 0)$, and

\[
A_{1}=A_{0}+\Delta'R^{-1}\mathbf{Z}_{0}+\mathbf{Z}'_{0}R^{-1}\Delta+\Delta'R^{-1}\Delta+\biggl(\frac{1}{\phi_{1}^{2}}-\frac{1}{\phi_{0}^{2}}\biggr)I.
\]

The three matrices on the right-hand side involving $\Delta$ are all rank-one matrices, that is, they are of the form $uv'$ for column vectors $u$ and $v$. Moreover, we can write $I=\sum_{j=1}^{s}e_{j}e'_{j}$, where $e_{j}$ is a column vector of zeros with a 1 in the $j$th position. We can then apply (9) three times to get the inverse of $A_{1}$. This calculation involves only matrix-by-vector multiplications for the middle three terms on the right-hand side. For the $e_{j}$ vectors, the multiplications reduce to an element extraction. (See Appendix C for details.)

4 Empirical analyses of BAMD

4.1 Percentage of missingness handled

In this subsection we apply BAMD to simulated data in order to assess the procedure's performance as we increase the percentage of missing data in the $\mathbf{Z}$ matrix. We simulated a data set with six families, 20 observations in each family, and 5 SNPs per observation. The five SNPs are independent of each other. The six families are also independent, so that the parents of the six families are not related and individuals across families are independent. On the other hand, the individuals within each family share the same parents; this relationship is captured via the numerator relationship matrix. From this data set, four data sets with different percentages of missing values (5%, 10%, 15%, and 20%) were randomly derived. The family effects, $\beta$, which were used to simulate the data, are listed in Table 1. The true SNP effects (additive and dominant effects) used to generate the data are listed in Table 2. When simulating, we set the variance parameter $\sigma^{2}=1$. Our proposed methodology was applied to the data without missing values and also to the four data sets containing missing values.

Table 1: The true family effects for the simulated data set are given in the first row of the table. The remaining rows indicate the estimated means returned from running BAMD on the data sets derived by setting different degrees of missing values in the SNP matrix of the simulated data set
$\beta_{1}$ $\beta_{2}$ $\beta_{3}$ $\beta_{4}$ $\beta_{5}$ $\beta_{6}$
Actual value 15.00 20.00 25.00 30.00 35.00 40.00
0% missing 15.45 20.65 25.48 29.84 34.76 40.40
5% missing 15.16 20.74 25.46 28.29 33.43 38.62
10% missing 16.18 21.38 25.65 30.71 35.86 40.81
15% missing 15.45 19.63 24.59 30.18 35.38 40.18
20% missing 14.87 20.18 24.68 30.08 34.88 40.13
Table 2: The true additive and dominant effects for each SNP in the simulated data set are given in the first row of the table. The remaining rows indicate the estimated SNP effects returned from running BAMD on the data sets derived by setting different degrees of missing values in the SNP matrix of the simulated data set
             SNP1:a  SNP1:d  SNP2:a  SNP2:d  SNP3:a  SNP3:d  SNP4:a  SNP4:d  SNP5:a  SNP5:d
Actual SNP   -2.00    1.00    1.00   -1.00    3.00    0.00    2.50    0.10    0.30    3.00
0% missing   -2.16    1.00    0.82   -0.75    2.59    0.30    2.43    0.60   -0.04    2.59
5% missing   -1.86    1.14    1.16   -1.05    3.00    0.05    2.21   -0.20    0.48    3.00
10% missing  -1.95    0.77    1.18   -1.52    2.74    0.18    2.51    0.13    0.00    2.74
15% missing  -1.80    0.78    0.99   -0.96    2.48    0.67    2.43    0.47    0.73    2.48
20% missing  -2.08    1.29    1.21   -0.76    3.10    1.32    1.87   -0.20    0.47    3.10

Note that for this small simulation, we used a parameterization different from the $\{-1,0,1\}$ coding that we use for larger numbers of SNPs. In this example, each SNP effect is represented as $(\gamma_{a},\gamma_{d})$, the additive and dominant effects of the SNP genotypes.

Tables 1 and 2 summarize the parameter estimation capabilities of BAMD for family and SNP effects. All calculations were based on samples obtained after an initial burn-in of 20,000 iterations of BAMD. The results show that when the percentage of missing values is less than 15%, the proposed methodology yields good estimates for the parameters of direct interest. When the percentage of missing values is greater than 15%, we should be wary of interpreting the results. For example, the true dominant effect for SNP 3 is 0, but the estimate is 1.32 when the percentage of missing values is 20%. Note that the estimate in this case is accurate when the percentage of missing values is less than 10%. We believe the discrepancy arises because one category of genotype for SNP 3 has substantially higher probability and it overpowers the other two categories. When the percentage of missing values increases, the dominated genotype category has only a small chance to be well represented and thus may have unreliable estimates.

Table 3: The true genotype probabilities for the SNPs used to generate the simulated data set are given in the first 3 rows. The final row identifies the frequency with which the true genotype was imputed when running BAMD with 10% of missing data in the SNP matrix
                        SNP1      SNP2      SNP3      SNP4      SNP5
                        a = -2    a = 1     a = 3     a = 2.5   a = 0.3
                        d = 1     d = -1    d = 0.5   d = 0.1   d = 3
Actual SNP
   Pr(GG)               0.1309    0.3012    0.8181    0.7719    0.3983
   Pr(GC)               0.5307    0.3875    0.0796    0.1950    0.5425
   Pr(CC)               0.3384    0.3113    0.1023    0.0331    0.0592
Frequency of correct
   imputation           0.5500    0.5479    0.6337    0.8508    0.6516

Our ultimate goal is to identify significant SNPs among the candidate SNPs. Since we view imputation as a tool to obtain better estimates of the parameters, we are not particularly interested in recovering the actual values of the missing SNPs. Nonetheless, the simulation results in Table 3 show that the imputed SNPs are correctly identified with probability ranging from 0.55 to 0.85, and that when the probability of one genotype for a certain SNP is dominantly high (see SNPs 3 and 4), that genotype is correctly imputed more frequently than under more balanced genotype distributions.

4.2 Comparison with BIMBAM

Here we compare our multiple-imputation missing data algorithm with a program called BIMBAM [Servin and Stephens (2007), http://stephenslab.uchicago.edu/software.html], which is a popular program among geneticists for association genetics and variable selection with missing data (using single imputation).

BAMD and BIMBAM both use a two-stage procedure: first a set of significant SNPs is found, and then those SNPs are run through a variable selection procedure that finds the best subset describing the variation in the phenotype. Hence, in this study, BIMBAM and BAMD are assessed both through the SNPs they find in the first stage and through the final model they put forth.

For the evaluation, we simulated data from the model given in equation (1). The dimensions of the model were fixed at $n=50$, $p=3$, and $s=25$ throughout. The three families comprised 16, 17, and 17 individuals, respectively. In addition, the $\mathbf{X}$ and $\beta$ matrices were fixed. Entries in the $\mathbf{Z}$ matrix took three possible values, mirroring the real-life situation in which they represent genotypes. Interest lies in discovering the significant coordinates of the $\gamma$ vector (which correspond to SNP effects) in the presence of missing values in the $\mathbf{Z}$ matrix.

In the simulation study, three factors were varied across different levels: the percentage of missing values in $\mathbf{Z}$, the magnitude of the $\gamma$ effects, and the degree of correlation within a family. When a particular factor was being investigated, the others were held constant. In the main paper we present only two specific comparisons; the remainder of the results from the simulation study can be found in Supplemental Information [Li et al. (2011)], Section F. In running the study, we simulated several data sets for each case and observed very consistent results; hence, in presenting our results, we focus on a single representative data set in each case.

In both of the studies presented here, the $\gamma$ vector was generated from a multivariate normal distribution, and any values less than 3 in absolute value were set to 0. After generating the $\mathbf{Y}$ responses, 20% of the entries in the $\mathbf{Z}$ matrix were set to missing before being passed to BAMD and BIMBAM.
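For concreteness, this generation scheme can be sketched as follows. This is an illustrative NumPy sketch under our own simplifying assumptions: independent normal SNP effects as a special case of the multivariate normal, a $\{-1,0,1\}$ genotype coding, independent errors (the $\mathbf{R}=\mathbf{I}_{n}$ setting), and family effect values of our own choosing. It is not the BAMD simulation code.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 50, 3, 25  # individuals, families, SNPs, as in the study

# Family design matrix: three families of sizes 16, 17, and 17.
X = np.zeros((n, p))
for fam, (start, size) in enumerate(zip([0, 16, 33], [16, 17, 17])):
    X[start:start + size, fam] = 1.0
beta = np.array([15.0, 20.0, 25.0])  # illustrative family effects

# Genotypes coded with three values, mirroring {-1, 0, 1}.
Z = rng.integers(-1, 2, size=(n, s)).astype(float)

# SNP effects: normal draws, with values below 3 in absolute value zeroed.
gamma = rng.normal(0.0, 3.0, size=s)
gamma[np.abs(gamma) < 3.0] = 0.0

Y = X @ beta + Z @ gamma + rng.normal(0.0, 1.0, size=n)

# Set roughly 20% of the Z entries to missing before analysis.
mask = rng.random(Z.shape) < 0.20
Z_missing = Z.copy()
Z_missing[mask] = np.nan
```

The masked matrix `Z_missing` is what the two programs would receive; the complete `Z` is retained only to score the imputations.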

The first comparison measures the performance of the procedures when an equicorrelation structure ($\rho$ was set to 0.8) exists within each of the three families. The second comparison aims to see whether BAMD turns up many false positives. There, the $\gamma$ vector was generated in the same way as earlier, but only the coordinates with the five largest values were kept; the rest were set to 0. In addition, the individuals were assumed to be uncorrelated, that is, $\mathbf{R}=\mathbf{I}_{n}$.

Figure 1: In the upper panel, triangles and squares represent the true coordinates of the $\gamma$ vector; the true nonzero SNPs in the model were (1), (2), (4), (5), (8), (9), (10), (11), (16), (17), (18), (22), (24), and (25). A solid triangle means that BIMBAM found that SNP to be significant at the $\alpha=0.05$ level; the remaining SNPs are squares. Horizontal lines represent highest posterior density intervals returned by BAMD; solid lines indicate that the 95% HPD interval found that SNP to be significant. Thus, in the SNP-discovery stage, BIMBAM found SNPs (1), (5), (8), and (25) to be significantly nonzero, while BAMD picked out SNPs (2), (4), (5), (8), (11), (18), and (25). In the lower panel, the gray numbers are SNPs that were exactly 0 in the true model, and the black numbers are SNPs with nonzero effects; the circled numbers are the SNPs in the best model found by each procedure. Thus, the best model found by BIMBAM contains only SNP (8), whereas the best model found by BAMD contains SNPs (2), (4), (5), (8), (11), and (25).

The results for each comparison are summarized through two diagrams. The first (the upper panels in Figures 1 and 2) displays the SNPs that BAMD and BIMBAM found to be significant in the first stage; the lower panels in Figures 1 and 2 display the output from the variable selection procedure.

Figure 2: In the upper panel, triangles and squares represent the true coordinates of the $\gamma$ vector; the true nonzero SNPs in the model were (9), (10), (17), (22), and (24). A solid triangle means that BIMBAM found that SNP to be significant at the $\alpha=0.05$ level; the remaining SNPs are squares. Horizontal lines represent highest posterior density intervals returned by BAMD; solid lines indicate that the 95% HPD interval found that SNP to be significant. Thus, in the SNP-discovery stage, BIMBAM found SNPs (10), (13), and (22) to be significantly nonzero, while BAMD found SNPs (9), (10), (17), (22), and (24). In the lower panel, the gray numbers are SNPs that were exactly 0 in the true model, and the black numbers are SNPs with nonzero effects; the circled numbers are the SNPs in the best model found by each procedure. The best model found by BIMBAM contains only SNP (22), whereas the best model found by BAMD contains SNPs (9), (10), (17), and (22).

Each figure shows that BAMD significantly outperforms BIMBAM. In the first example, BIMBAM found only one of the 14 significant SNPs, while BAMD found six. In the second example, there were five significant SNPs; BIMBAM again found only one, while BAMD found four of them.

5 Analysis of the loblolly pine data

Carbon isotope discrimination (CID) is a time-integrated trait measure of water use efficiency. González-Martínez et al. (2008) used the family-based approach of the Quantitative Transmission Disequilibrium Test (QTDT) to detect SNPs associated with CID. We utilized the family structure of this population [also described in Kayihan et al. (2005)] in the design matrix of our model. Each of the 61 control-pollinated families measured for CID had approximately 15 offspring that were clonally propagated by rooted cuttings to generate ramets (genetically identical replicates). Each genotype has two ramets, one sampled from each of two testing sites, at Cuthbert, GA and Palatka, FL. Our approach enables us to utilize the family pedigree and parental information to recover missing SNP genotypes. With informative priors, we infer the progeny SNP genotype through Mendelian randomization [Falconer and Mackay (1996)]. With uninformative priors, we assume SNPs are missing at random and assign equal probability to each genotype class for missing SNPs.

All SNPs are simultaneously tested under our association model. The Gibbs sampler ran for 50,000 iterations. The first 10,000 iterations were burn-in, after which we thinned the chain by keeping every fourth iteration; the autocorrelation was reduced significantly after thinning (data not shown). This left a total of 10,000 samples for each chain for our statistics. We then applied the variable selector to the four SNPs that were found when using the informative prior.

Table 4: Significant SNPs from QTDT tests and the results from the BAMD association model, with 95% confidence intervals

SNP                  Informative prior   Uninformative prior   Type‡
                     95% C.I.            95% C.I.
(3) caf1_s1∗         (-0.008, 0.110)     (0.013, 0.129)        Syn
(5) ccoaomt_s10∗†    (-0.103, -0.012)    (-0.097, -0.005)      NC (intron)
(6) cpk3_s5          (-0.052, -0.004)    (-0.048, 0.001)       Syn
(29) dhn1_s2∗†       (0.065, 0.113)      (0.044, 0.092)        NC (3′UTR)
(31) ein2_s1∗†       (0.077, 0.142)      (0.067, 0.126)        NC (3′UTR)

∗ Indicates significant in González-Martínez et al. (2008); as indicated there, additional SNPs are marginally significant at α=0.1, which we also detected. Intervals that exclude zero are significant at the 5% level from our association testing, the rest being nonsignificant. † Indicates presence in the best model found by the variable selector. ‡ Syn, synonymous SNP; NC, noncoding; UTR, untranslated region.

We detected significant effects of several SNPs on CID using 95% Bayesian confidence intervals (Table 4). Using the uninformative prior, we found 3 significant SNPs [(3) caf1_s1, (5) ccoaomt_s10, (31) ein2_s1]. Using the informative prior, we detected 4 SNPs [(5) ccoaomt_s10, (6) cpk3_s5, (29) dhn1_s2, (31) ein2_s1] as significant. Note that (6) and (29) are close to significant using the uninformative prior, and (3) is close to significant using the informative prior. This suggests that for these data, the effect of the prior information is important. The QTDT test resulted in 4 significant SNPs, (3), (5), (29), (31), all of which were detected by BAMD, which, in addition, found SNP (6). Moreover, it is important to note that BAMD detected these SNPs simultaneously, an indication that their collective effect on the phenotype is being detected.

The use of tag SNPs in a pedigree does not allow for “fine mapping” to SNP effects [Neale and Ingvarsson (2008), Flint-Garcia, Thornsberry and Buckler (2003)]. Thus, the effects of these SNPs on carbon isotope discrimination may reflect the involvement of many linked genes on the phenotype.

We also provide Figures 3 and 4, showing the results for all of the SNPs in the data set. Figure 3 is based on uninformative SNP priors, while Figure 4 uses informative priors. Although there are few differences between the graphs (showing the strength of the data with respect to the model), we see that the prior can matter. For example, SNP (3) is significant when the noninformative prior is used, but not when we use the informative prior; the opposite holds for SNP (6). Looking at the figures, we see that the intervals in question only barely cross zero; thus, the inclusion of relevant prior information can be quite important.

Figure 3: 95% confidence intervals for the 44 SNPs from the carbon isotope data, based on 10,000 Gibbs samples from the BAMD model using uninformative priors (equal probability) for the missing SNPs. The significant SNPs are those with intervals that do not cross 0: SNPs (3) caf1, (5) ccoaomt, and (31) ein2.
Figure 4: 95% confidence intervals for the 44 SNPs from the carbon isotope data, based on 10,000 Gibbs samples from the BAMD model using informative priors (Mendelian randomization) for the missing SNPs. The significant SNPs are those with intervals that do not cross 0: SNPs (5) ccoaomt, (6) cpk3, (29) dhn1, and (31) ein2.

The four SNPs picked out when using the informative prior were (5), (6), (29), and (31). Due to the small number of variables under consideration, the variable selector procedure was able to run through all 16 possible models. The one with the highest Bayes factor was found to contain SNPs (5), (29), and (31).

6 Discussion

Association testing is being applied to discover relationships among SNPs and complex traits in plants and animals [Flint-Garcia, Thornsberry and Buckler (2003), Hirschhorn and Daly (2005), Balding (2006), Zhu et al. (2008)]. Our model was developed specifically for detecting associations in loblolly pine data, but can be applied to other species as well. Here we discuss some features and limitations of the method.

Multiple imputation

Multiple imputation of missing SNP data is the best way to ensure unbiased parameter estimates. This is an important consideration, because SNP effects tend to be small for the complex traits of greatest biological interest, and because the results of association studies typically motivate more detailed and labor-intensive investigations of how and why associations were detected.

We used simulation to compare BAMD and BIMBAM on their detection of “correct” vs. “incorrect” SNPs, and found that BAMD performed better than BIMBAM. In practice, this advantage of BAMD over BIMBAM would likely be greatest when missing SNPs are not in LD with nearby SNPs (or adjacency cannot be determined). This is the case in many species, including loblolly pine, in which LD is low and genomic resources such as high-resolution genomic maps and high-density SNP chips for genome scanning are not as well developed as they are for the human genome. The higher computational intensity required for formal multiple imputation in BAMD is a trade-off; however, this has not restricted its practical use in most data sets. For very large data sets, parallel processing seems a logical next step in further increasing the computational efficiency of BAMD.

Family structure

Our method can be applied to family-based association populations, populations of unrelated genotypes, or combination populations. It can incorporate prior information if known. [The application of BAMD in Quesada et al. (2010) was to a population of unrelated genotypes, where significant SNPs related to disease resistance were found.]

Probit models

Although here we assume a continuous response variable, the method can be adapted to discrete phenotypes using a probit link. For example, in a case control study, the response would be either case or control status, and with a probit model we add a latent variable in the Gibbs sampler.
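The latent-variable step just described can be sketched with the standard probit data augmentation: given the current linear predictor, each latent variable is drawn from a normal distribution truncated to the side determined by the observed case/control status. The sketch below is our own illustration (not the BAMD implementation), using only NumPy and the standard library; the helper name `sample_latent` is hypothetical.

```python
import numpy as np
from statistics import NormalDist

_std = NormalDist()  # standard normal, for cdf and inverse cdf

def sample_latent(y, mu, rng):
    """One data-augmentation step for a probit response (illustrative):
    draw W_i ~ N(mu_i, 1) truncated to (0, inf) when y_i = 1 (case)
    and to (-inf, 0) when y_i = 0 (control), by inverse-CDF sampling."""
    w = np.empty(len(y))
    for i, (yi, mi) in enumerate(zip(y, mu)):
        p0 = _std.cdf(-mi)                       # P(W_i <= 0)
        u = rng.random()
        # Map u onto the admissible part of the CDF, then invert.
        q = p0 + u * (1.0 - p0) if yi == 1 else u * p0
        w[i] = mi + _std.inv_cdf(q)
    return w
```

The rest of the Gibbs sampler would then treat the drawn `w` as the continuous response in place of $Y$.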

SNP detection

Although BAMD successfully detected the same significant SNPs as were previously detected using the family-based method QTDT [González-Martínez et al. (2008)], as well as an additional significant SNP, the BAMD variable selector indicated that a subset of the significant SNPs was sufficient to explain variation in the phenotype carbon isotope discrimination. This is a useful tool for biologists because a simultaneous solution for SNP effects enables detection of numerous SNPs that collectively explain phenotypes, which in turn enables further biological experiments to investigate their underlying basis.

However, the candidate SNPs found by BAMD and QTDT cannot necessarily be deemed “correct” or “incorrect” without additional biological experiments. As such, little more can be stated about the correctness of SNPs 3, 5, 29, and 31 without validation experiments. In the broader context of association testing, it is relevant to note that the use of QTDT is limited to families, whereas BAMD and BIMBAM can be used to detect associations in families as well as populations of unrelated individuals. The ability to use BAMD and BIMBAM in many different types of populations is appealing.

Simultaneous vs. genome-wide

Genome-wide association studies are, typically, marginal processors of the data. That is, each SNP is assessed individually for its association, so simultaneous assessment, or epistasis, cannot be detected. A model such as (1) is assessing the SNPs simultaneously—that is its strength. But how many SNPs should we expect to be able to handle in one model? Computational issues aside, if the number of SNPs is greatly increased, we are then susceptible to the usual regression pitfalls—multicollinearity being the most prevalent. Thus, we recommend using BAMD on smaller sets of SNPs that have had some preprocessing. Thus far, BAMD has been used successfully on a model with 400 SNPs [Quesada et al. (2010)], and we have tested it on as many as 800 SNPs.

Missing data

The missing data problem is common across all genomics data sets, so there is broad potential utility of this method. The assumption of MAR (missing at random), which is reasonable in these contexts, may bear additional research attention. If there are quality concerns about SNP data, there are some statistical steps forward, as noted by Wilson et al. (2010), such as using indicator variables of missingness as predictors. This approach can even be extended to test if missingness is a heritable trait and, if so, the MAR assumption is invalid. Next generation sequencing platforms may generate sufficient data to enable this assumption to be tested and, if borne out, may motivate placement of priors on SNP calls in certain sequence contexts.

Last, software to run the Gibbs sampler and variable selector is available in the R package BAMD.

Appendix A Calculating the numerator relationship matrix

The algorithm is due to Henderson (1976) and Quaas (1976). The individuals within the 61 families and the parents of the 61 families are ordered together such that the first $1,\ldots,a$ subjects are unrelated and are used as a “base” population. Let the total number of subjects (individuals within families plus parents) be $n$; the numerator relationship matrix then has dimension $n\times n$. As the first $a$ subjects (being part of the parents of the 61 families) are unrelated, the upper-left $a\times a$ submatrix of the numerator relationship matrix is the identity matrix $I$. This identity submatrix is expanded iteratively until it reaches dimension $n\times n$.

Since the sub-numerator relationship matrix for the first $a$ unrelated subjects is the identity, we now detail how to calculate the remaining cells of the numerator relationship matrix for the related subjects. Consider the $j$th and $i$th subjects from the ordered list above.

(1) If both parents of the $j$th individual are known, say $g$ and $h$, then

$$R_{ji}=R_{ij}=0.5(R_{ig}+R_{ih}),\qquad i=1,\ldots,j-1;$$
$$R_{jj}=1+0.5R_{gh},$$

where $R_{ji}$ is the cell of the numerator relationship matrix in the $j$th row and $i$th column.

(2) If only one parent is known for the $j$th subject, say $g$, then

$$R_{ji}=R_{ij}=0.5R_{ig},\qquad i=1,\ldots,j-1;$$
$$R_{jj}=1.$$

(3) If neither parent is known for the $j$th subject,

$$R_{ji}=R_{ij}=0,\qquad i=1,\ldots,j-1;$$
$$R_{jj}=1.$$
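The three cases above can be implemented directly. This is an illustrative Python sketch under our own conventions (0-based indices and `None` for an unknown parent); it is not the BAMD code, and the small full-sib pedigree at the end is only a check of the recursion.

```python
import numpy as np

def numerator_relationship(parents):
    """Henderson's recursion for the numerator relationship matrix R.
    parents[j] = (g, h) holds the 0-based indices of subject j's parents,
    with None for an unknown parent; subjects must be ordered so that
    parents appear before their offspring."""
    n = len(parents)
    R = np.zeros((n, n))
    for j in range(n):
        g, h = parents[j]
        if g is None and h is not None:   # normalize: known parent first
            g, h = h, g
        for i in range(j):
            if g is not None and h is not None:   # case (1)
                R[j, i] = R[i, j] = 0.5 * (R[i, g] + R[i, h])
            elif g is not None:                   # case (2)
                R[j, i] = R[i, j] = 0.5 * R[i, g]
            # case (3): both unknown, R[j, i] stays 0
        R[j, j] = 1.0 + 0.5 * R[g, h] if g is not None and h is not None else 1.0
    return R

# Two unrelated parents (subjects 0, 1) and two full sibs (subjects 2, 3):
R = numerator_relationship([(None, None), (None, None), (0, 1), (0, 1)])
```

For this small pedigree the recursion gives the familiar values: 0.5 between a parent and an offspring, and 0.5 between the two full sibs.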

For the loblolly pine data, we have 44 pines acting as grandparents, and they produce the 61 pine families. The 61 families contain 888 individual pine trees altogether, also called clones. The phenotypic responses are taken from the individual clones, so our interest is in calculating the relationship matrix for the 888 clones, which has dimension $888\times 888$. Following Henderson's method, we ordered the 44 grandparent pines and 888 individual pines together such that the first $a$ pines are unrelated. Starting from the $(a+1)$th pine, we applied the iterative calculation algorithm above, obtaining in the end a relationship matrix of dimension $932\times 932$ for all the grandparent pines and all individual clones. The bottom-right $888\times 888$ submatrix of this matrix is the numerator relationship matrix we used in the loblolly pine data analysis.

Appendix B Estimation with the EM algorithm

B.1 Missing data

The EM algorithm begins by building the complete data likelihood, which is the likelihood function that would be used if the missing data were observed. When we fill in the missing data we write $Z_{i}^{*}=(Z_{i}^{o},Z_{i}^{m})$, and the complete data are $(Y,Z^{*})$ with likelihood function

$$L_{C}\propto\prod_{i\in I_{o}}\exp\biggl(-\frac{1}{2\sigma^{2}}(Y_{i}-X_{i}\beta-Z_{i}\gamma)^{2}\biggr)\times\prod_{i\in I_{M}}\exp\biggl(-\frac{1}{2\sigma^{2}}(Y_{i}-X_{i}\beta-Z_{i}^{*}\gamma)^{2}\biggr),$$

where $I_{o}$ indexes those individuals with complete SNP data, and $I_{M}$ indexes those individuals with missing SNP information.

The observed data likelihood, which is the function that we eventually use to estimate the parameters, must be summed over all possible values of the missing data. So we have

$$L_{o}\propto\frac{1}{(2\pi\sigma^{2})^{n/2}}\prod_{i\in I_{o}}\exp\biggl(-\frac{1}{2\sigma^{2}}(Y_{i}-X_{i}\beta-Z_{i}\gamma)^{2}\biggr)\times\prod_{i\in I_{M}}\sum_{Z_{i}^{*}}\exp\biggl(-\frac{1}{2\sigma^{2}}(Y_{i}-X_{i}\beta-Z_{i}^{*}\gamma)^{2}\biggr).$$

The distribution of the missing data $Z_{i}^{*}$ is given by the ratio $L_{C}/L_{o}$:

$$P(Z_{i}^{*})=\frac{\exp(-(Y_{i}-X_{i}\beta-Z_{i}^{*}\gamma)^{2}/(2\sigma^{2}))}{\sum_{Z_{i}^{*}}\exp(-(Y_{i}-X_{i}\beta-Z_{i}^{*}\gamma)^{2}/(2\sigma^{2}))},\qquad(12)$$

where the sum in the denominator is over all possible realizations of $Z_{i}^{*}$. This is a discrete distribution on the missing SNP data for each individual. To understand it, look at one individual.

Suppose that there are $g$ possible genotypes (typically $g=2$ or $3$) and individual $i$ has missing data on $k$ SNPs. So the data for individual $i$ are $Z_{i}=(Z^{o},Z^{m})$, where $Z^{m}$ has $k$ elements, each of which could be one of $g$ classes. For example, if $g=3$ and $k=7$, then $Z^{m}$ can take values in the following:

              SNP
              1   2   3   4   5   6   7
Genotype 1    *           *
         2        *   *           *
         3                    *       *

where the $\ast$ show one possible value of $Z_{i}^{m}$. For the example, there are $3^{7}=2187$ possible values for $Z_{i}^{m}$. In a real data set this could grow out of hand. For example, if there were 12 missing SNPs, then there are 531,441 possible values for $Z_{i}^{m}$; with 20 missing SNPs the number grows to 3,486,784,401 (about 3.5 billion).
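A short sketch makes the combinatorics concrete; the genotype labels here are illustrative, and the enumeration is exactly the support of the discrete distribution (12) for one individual.

```python
from itertools import product

# All configurations of k = 7 missing SNPs, each taking one of
# g = 3 genotype classes (labels illustrative).
g_classes = ("GG", "GC", "CC")
k = 7
configs = list(product(g_classes, repeat=k))
assert len(configs) == 3 ** 7 == 2187

# The support grows as g ** k, which is why full enumeration fails:
assert 3 ** 12 == 531_441            # 12 missing SNPs
assert 3 ** 20 == 3_486_784_401      # 20 missing SNPs, ~3.5 billion
```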

B.2 An EM algorithm

To compute the expected value of the log of the complete data likelihood (B.1), we need only deal with the second term (the one involving the missing data): the expectation does not change the piece with no missing data, but does change the second piece. Standard calculations give

$$\mathrm{E}\biggl(\frac{1}{2\sigma^{2}}(Y_{i}-X_{i}\beta-Z_{i}^{*}\gamma)^{2}\biggr)=\frac{1}{2\sigma^{2}}\bigl[\bigl(Y_{i}-X_{i}\beta-\mathrm{E}(Z_{i}^{*})\gamma\bigr)^{2}+\operatorname{Var}(Z_{i}^{*}\gamma)\bigr],$$

where

$$\mathrm{E}(Z_{i}^{*})=(Z_{i}^{o},\mathrm{E}(Z_{i}^{m}))\quad\mbox{and}\quad\operatorname{Var}(Z_{i}^{*}\gamma)=\operatorname{Var}(Z_{i}^{m}\gamma_{i}^{m}).\qquad(13)$$

If we define

$$Y_{n\times 1}=\pmatrix{Y_{1}\cr Y_{2}\cr\vdots\cr Y_{n}},\qquad X_{n\times p}=\pmatrix{X_{1}\cr X_{2}\cr\vdots\cr X_{n}},\qquad\mathbf{Z}_{n\times s}=\pmatrix{(Z_{1}^{o},\mathrm{E}(Z_{1}^{m}))\cr(Z_{2}^{o},\mathrm{E}(Z_{2}^{m}))\cr\vdots\cr(Z_{n}^{o},\mathrm{E}(Z_{n}^{m}))},$$

the expected complete data log likelihood is

$$\mathrm{E}\log L_{C}=-\frac{n}{2}\log\sigma^{2}-\frac{1}{2\sigma^{2}}|Y-X\beta-\mathbf{Z}\gamma|^{2}-\frac{1}{2\sigma^{2}}\gamma^{\prime}V_{Z}\gamma,\qquad(14)$$

where $V_{Z_{i}}$ is the variance–covariance matrix of the vector $Z_{i}$ with elements given by

$$(V_{Z_{i}})_{jj^{\prime}}=\cases{0,&\quad if either $Z_{ij}$ or $Z_{ij^{\prime}}$ is observed,\cr\operatorname{Cov}(Z_{ij},Z_{ij^{\prime}}),&\quad if neither $Z_{ij}$ nor $Z_{ij^{\prime}}$ is observed,}$$

and $V_{Z}=\sum_{i\in I_{M}}V_{Z_{i}}$. Standard calculus will show that the MLEs from (14) are given by

$$\hat{\beta}=(X^{\prime}X)^{-1}X^{\prime}(Y-\mathbf{Z}\hat{\gamma}),$$
$$\hat{\gamma}=(\mathbf{Z}^{\prime}\mathbf{Z}-V_{Z})^{-1}\mathbf{Z}^{\prime}(I-H)Y,\qquad(15)$$
$$\hat{\sigma}^{2}=\frac{1}{n}\bigl(|Y-X\hat{\beta}-\mathbf{Z}\hat{\gamma}|^{2}+\hat{\gamma}^{\prime}V_{Z}\hat{\gamma}\bigr).$$

The algorithm now iterates between (12), (13), and (B.2) until convergence.
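A minimal NumPy sketch of the M-step updates in (15) follows, using the formulas exactly as printed there and assuming the conditional expectations $\mathrm{E}(Z_{i}^{m})$ and the summed covariance $V_{Z}$ have already been produced by the E-step. The function and variable names are ours, not part of the BAMD package.

```python
import numpy as np

def m_step(Y, X, Z, V_Z):
    """M-step updates of (15), as printed. Z is the SNP matrix with each
    missing entry replaced by its conditional expectation from the
    E-step, and V_Z is the summed conditional covariance of the imputed
    entries (the zero matrix when no data are missing)."""
    n = len(Y)
    H = X @ np.linalg.solve(X.T @ X, X.T)                 # hat matrix of X
    gamma = np.linalg.solve(Z.T @ Z - V_Z, Z.T @ ((np.eye(n) - H) @ Y))
    beta = np.linalg.solve(X.T @ X, X.T @ (Y - Z @ gamma))
    resid = Y - X @ beta - Z @ gamma
    sigma2 = (resid @ resid + gamma @ V_Z @ gamma) / n
    return beta, gamma, sigma2
```

With no missing data (`V_Z = 0`) the updates reduce to one pass of ordinary least-squares-style estimation, which is a convenient sanity check.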

B.3 Implementation

To implement the EM algorithm, we must be able to either:

(1) calculate the expectation and variance in (13), or

(2) generate a random sample from (12) and calculate the terms in (13) by simulation.

The first option is impossible; the second is computationally intensive, but it is the only way.

Going back to (12), note that this is the distribution of the vector of missing values for individual $i$. If the data are $Z_{i}^{*}=(Z_{i}^{o},Z_{i}^{m})$, we are only concerned with $Z_{i}^{m}=(Z^{m}_{i1},\ldots,Z^{m}_{ik})$, and for $\mathbf{c}=(c_{1},\ldots,c_{k})$,

$$P(Z_{i}^{m}=\mathbf{c}_{0})=\frac{\exp(-(Y_{i}-X_{i}\beta-Z_{i}^{o}\gamma_{i}^{o}-\mathbf{c}_{0}\gamma_{i}^{m})^{2}/(2\sigma^{2}))}{\sum_{\mathrm{all}\ \mathbf{c}}\exp(-(Y_{i}-X_{i}\beta-Z_{i}^{o}\gamma_{i}^{o}-\mathbf{c}\gamma_{i}^{m})^{2}/(2\sigma^{2}))},$$

where the sum in the denominator can easily have over a billion terms.

A possible alternative is to use a Gibbs sampler to simulate the distribution of $Z_{i}^{m}$ by calculating the distribution of each element conditional on the rest of the vector. For a particular element $Z^{m}_{ij}$, the conditional distribution given the rest of the vector, $Z^{m}_{i(-j)}$, is given in (2.2). So to produce a sample of $Z_{i}^{m}$, we loop through a Gibbs sampler.

Unfortunately, this algorithm may still be too computationally intensive: the Gibbs samplers (2.2) and (2.2) need to be run for every iteration of the EM algorithm, and each EM iteration may need 20–50 thousand Gibbs iterations. If there is a lot of missing data, this could result in a very slow algorithm.
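The single-site update just described can be sketched as follows. This is an illustrative NumPy version for one individual, with all names, the genotype coding, and the argument layout our own assumptions rather than the BAMD code; each missing coordinate is redrawn from its full conditional by evaluating the normal kernel at every candidate genotype level.

```python
import numpy as np

def gibbs_sweep_missing(z, missing_idx, y_i, xb_i, gamma, sigma2,
                        levels, rng):
    """One Gibbs sweep over the missing genotypes of one individual:
    each missing coordinate z[j] is redrawn from its full conditional,
    proportional to exp(-(y_i - xb_i - z.gamma)^2 / (2 sigma2))
    evaluated at each candidate genotype level."""
    for j in missing_idx:
        logw = np.empty(len(levels))
        for a, level in enumerate(levels):
            z[j] = level
            r = y_i - xb_i - z @ gamma
            logw[a] = -r * r / (2.0 * sigma2)
        w = np.exp(logw - logw.max())            # stabilized weights
        z[j] = levels[rng.choice(len(levels), p=w / w.sum())]
    return z
```

Observed coordinates are never visited, so they pass through the sweep unchanged.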

Appendix C Matrix inverse updates

We are interested in matrices of the form $A_{0}+\sum_{k=1}^{p}u_{k}v_{k}^{\prime}$, where $u_{k},v_{k}$, $k=1,\ldots,p$, are vectors. For this form we have the following lemma, which follows from Woodbury's formula.

Lemma 1

Let $A_{0}$ be invertible, and let $u_{j},v_{j}$, $j=1,\ldots,p$, be vectors. Define

$$A_{k}=A_{0}+\sum_{j=1}^{k}u_{j}v_{j}^{\prime}.$$

Then for $k=1,\ldots,p$,

$$A_{k}^{-1}=A_{k-1}^{-1}-\frac{A_{k-1}^{-1}u_{k}v^{\prime}_{k}A_{k-1}^{-1}}{1+v^{\prime}_{k}A_{k-1}^{-1}u_{k}}.\qquad(16)$$

Then, to calculate $A_{p}^{-1}=(A_{0}+\sum_{k=1}^{p}u_{k}v_{k}^{\prime})^{-1}$, we can start with $A_{0}^{-1}$ and use the recursion to reach $A_{p}^{-1}$. Note that each step of the recursion requires only multiplication of matrices by vectors. Moreover, in many applications the vectors $u_{k},v_{k}$ are sparse, so the multiplication amounts to extracting elements.
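The recursion (16) can be sketched and checked against a direct inverse on a small example. This is an illustrative NumPy sketch (the function name and the test matrices are ours); each step is a rank-one Sherman–Morrison update costing only matrix–vector products.

```python
import numpy as np

def recursive_inverse(A0_inv, us, vs):
    """Apply the recursion (16): starting from the inverse of A_0,
    fold in each rank-one term u_k v_k' with a Sherman-Morrison
    update, using only matrix-vector products."""
    A_inv = A0_inv.copy()
    for u, v in zip(us, vs):
        Au = A_inv @ u                    # A_{k-1}^{-1} u_k
        vA = v @ A_inv                    # v_k' A_{k-1}^{-1}
        A_inv -= np.outer(Au, vA) / (1.0 + v @ Au)
    return A_inv

# Check against a direct inverse on a small example.
rng = np.random.default_rng(0)
A0 = 2.0 * np.eye(4)
us = [rng.normal(size=4) for _ in range(3)]
vs = [0.1 * rng.normal(size=4) for _ in range(3)]
direct = np.linalg.inv(A0 + sum(np.outer(u, v) for u, v in zip(us, vs)))
recursive = recursive_inverse(np.linalg.inv(A0), us, vs)
assert np.allclose(direct, recursive)
```

When the $u_{k},v_{k}$ are sparse (such as the $e_{j}$ vectors of Section 3), the products `A_inv @ u` and `v @ A_inv` reduce to extracting columns and rows.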

Supplement: Theory and additional simulations (DOI: 10.1214/11-AOAS516SUPP; http://lib.stat.cmu.edu/aoas/516/supplement.pdf). The Supplemental Information contains details on the variable selector and the proof of convergence of the two Markov chains (the Gibbs sampler and the model search). In addition, there are further comparisons between BAMD and BIMBAM.

References

  • Balding, D. J. (2006). A tutorial on statistical methods for population association studies. Nat. Rev. Genet. 7 781–791.
  • Casella, G., Girón, F. J., Martínez, M. L. and Moreno, E. (2009). Consistency of Bayesian procedures for variable selection. Ann. Statist. 37 1207–1228.
  • Chen, W. M. and Abecasis, G. R. (2007). Family-based association tests for genomewide association scan. The American Journal of Human Genetics 81 913–926.
  • Chen, M.-H. and Shao, Q.-M. (1997). Estimating ratios of normalizing constants for densities with different dimensions. Statist. Sinica 7 607–630.
  • Dai, J. Y., Ruczinski, I., LeBlanc, M. and Kooperberg, C. (2006). Imputation methods to improve inference in SNP association studies. Genet. Epidemiol. 30 690–702.
  • Falconer, D. S. and Mackay, T. F. C. (1996). Introduction to Quantitative Genetics, 4th ed. Longman, Harlow.
  • Flint-Garcia, S. A., Thornsberry, J. M. and Buckler, E. S. (2003). Structure of linkage disequilibrium in plants. Annu. Rev. Plant Biol. 54 357–374.
  • González-Martínez, S. C., Ersoz, E., Brown, G. R., Wheeler, N. C. and Neale, D. B. (2006). DNA sequence variation and selection of tag single-nucleotide polymorphisms at candidate genes for drought-stress response in Pinus taeda L. Genetics 172 1915–1926.
  • González-Martínez et al. (2008) {barticle}[auto:STB—2012/01/05—16:28:07] \bauthor\bsnmGonzález-Martínez, \bfnmS. C.\binitsS. C., \bauthor\bsnmHuber, \bfnmD. A.\binitsD. A., \bauthor\bsnmErsoz, \bfnmE.\binitsE., \bauthor\bsnmDavis, \bfnmJ. M.\binitsJ. M. and \bauthor\bsnmNeale, \bfnmD. B.\binitsD. B. (\byear2008). \btitleAssociation genetics in Pinus taeda L. II. Carbon isotope discrimination. \bjournalHeredity \bvolume101 \bpages19–26. \bptokimsref \endbibitem
  • Greenland and Finkle (1995) {barticle}[auto:STB—2012/01/05—16:28:07] \bauthor\bsnmGreenland, \bfnmS.\binitsS. and \bauthor\bsnmFinkle, \bfnmW. D.\binitsW. D. (\byear1995). \btitleA critical look at methods for handling missing covariates in epidemiologic regression analyses. \bjournalAm. J. Epidemiol. \bvolume142 \bpages1255–1264. \bptokimsref \endbibitem
  • Hager (1989) {barticle}[mr] \bauthor\bsnmHager, \bfnmWilliam W.\binitsW. W. (\byear1989). \btitleUpdating the inverse of a matrix. \bjournalSIAM Rev. \bvolume31 \bpages221–239. \biddoi=10.1137/1031049, issn=0036-1445, mr=0997457 \bptokimsref \endbibitem
  • Henderson (1976) {barticle}[auto:STB—2012/01/05—16:28:07] \bauthor\bsnmHenderson, \bfnmC. R.\binitsC. R. (\byear1976). \btitleA simple method for computing the inverse of a numerator relationship matrix used in prediction of breeding values. \bjournalBiometrics \bvolume32 \bpages69–83. \bptokimsref \endbibitem
  • Hirschhorn and Daly (2005) {barticle}[auto:STB—2012/01/05—16:28:07] \bauthor\bsnmHirschhorn, \bfnmJ. N.\binitsJ. N. and \bauthor\bsnmDaly, \bfnmM. J.\binitsM. J. (\byear2005). \btitleGenome-wide association studies for common diseases and complex traits. \bjournalNature Genetics \bvolume6 \bpages95–108. \bptokimsref \endbibitem
  • Hobert and Casella (1996) {barticle}[mr] \bauthor\bsnmHobert, \bfnmJames P.\binitsJ. P. and \bauthor\bsnmCasella, \bfnmGeorge\binitsG. (\byear1996). \btitleThe effect of improper priors on Gibbs sampling in hierarchical linear mixed models. \bjournalJ. Amer. Statist. Assoc. \bvolume91 \bpages1461–1473. \bidissn=0162-1459, mr=1439086 \bptokimsref \endbibitem
  • Huisman (2000) {barticle}[auto:STB—2012/01/05—16:28:07] \bauthor\bsnmHuisman, \bfnmM.\binitsM. (\byear2000). \btitleImputation of missing item responses: Some simple techniques. \bjournalQuality and Quantity \bvolume34 \bpages331–351. \bptokimsref \endbibitem
  • Kayihan et al. (2005) {barticle}[auto:STB—2012/01/05—16:28:07] \bauthor\bsnmKayihan, \bfnmG. C.\binitsG. C., \bauthor\bsnmHuber, \bfnmD. A.\binitsD. A., \bauthor\bsnmMorse, \bfnmA. M.\binitsA. M., \bauthor\bsnmWhite, \bfnmT. T.\binitsT. T. and \bauthor\bsnmDavis, \bfnmJ. M.\binitsJ. M. (\byear2005). \btitleGenetic dissection of fusiform rust and pitch canker disease traits in loblolly pine. \bjournalTheory of Applied Genetics \bvolume110 \bpages948–958. \bptokimsref \endbibitem
  • Li et al. (2011) {bmisc}[auto:STB—2012/01/09—08:49:38] \bauthor\bsnmLi, \bfnmZhen\binitsZ., \bauthor\bsnmGopal, \bfnmVikneswaran\binitsV., \bauthor\bsnmLi, \bfnmXiaobo\binitsX., \bauthor\bsnmDavis, \bfnmJohn M.\binitsJ. M. and \bauthor\bsnmCasella, \bfnmGeorge\binitsG. (\byear2011). \bhowpublishedSupplement to “Simultaneous SNP identification in association studies with missing data.” DOI:\doiurl10.1214/11-AOAS516SUPP. \bptokimsref \endbibitem
  • Little and Rubin (2002) {bbook}[mr] \bauthor\bsnmLittle, \bfnmRoderick J. A.\binitsR. J. A. and \bauthor\bsnmRubin, \bfnmDonald B.\binitsD. B. (\byear2002). \btitleStatistical Analysis with Missing Data. \bpublisherWiley, \baddressNew York. \bptokimsref \endbibitem
  • Marchini et al. (2007) {barticle}[pbm] \bauthor\bsnmMarchini, \bfnmJonathan\binitsJ., \bauthor\bsnmHowie, \bfnmBryan\binitsB., \bauthor\bsnmMyers, \bfnmSimon\binitsS., \bauthor\bsnmMcVean, \bfnmGil\binitsG. and \bauthor\bsnmDonnelly, \bfnmPeter\binitsP. (\byear2007). \btitleA new multipoint method for genome-wide association studies by imputation of genotypes. \bjournalNat. Genet. \bvolume39 \bpages906–913. \biddoi=10.1038/ng2088, issn=1061-4036, pii=ng2088, pmid=17572673 \bptokimsref \endbibitem
  • McKeever and Howard (1996) {barticle}[auto:STB—2012/01/05—16:28:07] \bauthor\bsnmMcKeever, \bfnmD. B.\binitsD. B. and \bauthor\bsnmHoward, \bfnmJ. L.\binitsJ. L. (\byear1996). \btitleValue of timber and agricultural products in the United States 1991. \bjournalForest Products Journal \bvolume46 \bpages45–50. \bptokimsref \endbibitem
  • Meng and Wong (1996) {barticle}[mr] \bauthor\bsnmMeng, \bfnmXiao-Li\binitsX.-L. and \bauthor\bsnmWong, \bfnmWing Hung\binitsW. H. (\byear1996). \btitleSimulating ratios of normalizing constants via a simple identity: A theoretical exploration. \bjournalStatist. Sinica \bvolume6 \bpages831–860. \bidissn=1017-0405, mr=1422406 \bptokimsref \endbibitem
  • Neale and Ingvarsson (2008) {barticle}[pbm] \bauthor\bsnmNeale, \bfnmDavid B.\binitsD. B. and \bauthor\bsnmIngvarsson, \bfnmPär K.\binitsP. K. (\byear2008). \btitlePopulation, quantitative and comparative genomics of adaptation in forest trees. \bjournalCurr. Opin. Plant Biol. \bvolume11 \bpages149–155. \biddoi=10.1016/j.pbi.2007.12.004, issn=1369-5266, pii=S1369-5266(07)00177-X, pmid=18262830 \bptokimsref \endbibitem
  • O’Hagan and Forster (2004) {bbook}[auto:STB—2012/01/05—16:28:07] \bauthor\bsnmO’Hagan, \bfnmA.\binitsA. and \bauthor\bsnmForster, \bfnmJ.\binitsJ. (\byear2004). \btitleKendall’s Advanced Theory of Statistics: Vol. 2B: Bayesian Inference. \bpublisherArnold, \baddressLondon. \bptokimsref \endbibitem
  • Pritchard, Stephens and Donnelly (2000) {barticle}[pbm] \bauthor\bsnmPritchard, \bfnmJ. K.\binitsJ. K., \bauthor\bsnmStephens, \bfnmM.\binitsM. and \bauthor\bsnmDonnelly, \bfnmP.\binitsP. (\byear2000). \btitleInference of population structure using multilocus genotype data. \bjournalGenetics \bvolume155 \bpages945–959. \bidissn=0016-6731, pmcid=1461096, pmid=10835412 \bptokimsref \endbibitem
  • Quaas (1976) {barticle}[auto:STB—2012/01/05—16:28:07] \bauthor\bsnmQuaas, \bfnmR. L.\binitsR. L. (\byear1976). \btitleComputing the diagonal elements and inverse of a large numerator relationship matrix. \bjournalBiometrics \bvolume46 \bpages949–953. \bptokimsref \endbibitem
  • Quesada et al. (2010) {barticle}[auto:STB—2012/01/05—16:28:07] \bauthor\bsnmQuesada, \bfnmT.\binitsT., \bauthor\bsnmGopal, \bfnmV.\binitsV., \bauthor\bsnmCumbie, \bfnmW. P.\binitsW. P., \bauthor\bsnmEckert, \bfnmA. J.\binitsA. J., \bauthor\bsnmWegrzyn, \bfnmJ. L.\binitsJ. L., \bauthor\bsnmNeale, \bfnmD. B.\binitsD. B., \bauthor\bsnmGoldfarb, \bfnmB.\binitsB., \bauthor\bsnmHuber, \bfnmD. A.\binitsD. A., \bauthor\bsnmCasella, \bfnmG.\binitsG. and \bauthor\bsnmDavis, \bfnmJ. M.\binitsJ. M. (\byear2010). \btitleAssociation mapping of quantitative disease resistance in a natural population of Loblolly pine (Pinus taeda L.). \bjournalGenetics \bvolume186 \bpages677–686. \bptokimsref \endbibitem
  • Scheet and Stephens (2006) {barticle}[auto:STB—2012/01/05—16:28:07] \bauthor\bsnmScheet, \bfnmP.\binitsP. and \bauthor\bsnmStephens, \bfnmM. A.\binitsM. A. (\byear2006). \btitleA fast and flexible statistical model for large-scale population genotype data: Applications to inferring missing genotypes and haplotypic phase. \bjournalAm. Journ. Hum. Genetics \bvolume78 \bpages629–644. \bptokimsref \endbibitem
  • Servin and Stephens (2007) {barticle}[auto:STB—2012/01/05—16:28:07] \bauthor\bsnmServin, \bfnmB.\binitsB. and \bauthor\bsnmStephens, \bfnmM.\binitsM. (\byear2007). \btitleImputation-based analysis of association studies: Candidate regions and quantitative traits. \bjournalPLoS Genet. \bvolume3 \bpagese114. \biddoi=10.1371/journal.pgen.0030114 \bptokimsref \endbibitem
  • Stephens, Smith and Donnelly (2001) {barticle}[pbm] \bauthor\bsnmStephens, \bfnmM.\binitsM., \bauthor\bsnmSmith, \bfnmN. J.\binitsN. J. and \bauthor\bsnmDonnelly, \bfnmP.\binitsP. (\byear2001). \btitleA new statistical method for haplotype reconstruction from population data. \bjournalAm. J. Hum. Genet. \bvolume68 \bpages978–989. \biddoi=10.1086/319501, issn=0002-9297, pii=S0002-9297(07)61424-4, pmcid=1275651, pmid=11254454 \bptokimsref \endbibitem
  • Su et al. (2008) {barticle}[auto:STB—2012/01/05—16:28:07] \bauthor\bsnmSu, \bfnmS. Y.\binitsS. Y., \bauthor\bsnmWhite, \bfnmJ.\binitsJ., \bauthor\bsnmBalding, \bfnmD. J.\binitsD. J. and \bauthor\bsnmCoin, \bfnmL. J. M.\binitsL. J. M. (\byear2008). \btitleInference of haplotypic phase and missing genotypes in polyploid organisms and variable copy number genomic regions. \bjournalBMC Bioinformatics \bvolume9 \bpagesArt. 513. \bptokimsref \endbibitem
  • Sun and Kardia (2008) {barticle}[auto:STB—2012/01/05—16:28:07] \bauthor\bsnmSun, \bfnmY. V.\binitsY. V. and \bauthor\bsnmKardia, \bfnmS. L. R.\binitsS. L. R. (\byear2008). \btitleImputing missing genotypic data of single-nucleotide polymorphisms using neural networks. \bjournalEuropean Journal of Human Genetics \bvolume16 \bpages487–495. \bptokimsref \endbibitem
  • Szatkiewicz et al. (2008) {barticle}[auto:STB—2012/01/05—16:28:07] \bauthor\bsnmSzatkiewicz, \bfnmJ. P.\binitsJ. P., \bauthor\bsnmBeane, \bfnmG. L.\binitsG. L., \bauthor\bsnmDing, \bfnmY.\binitsY., \bauthor\bsnmHutchins, \bfnmL.\binitsL., \bauthor\bparticlede \bsnmVillena, \bfnmF. P.\binitsF. P. and \bauthor\bsnmChurchill, \bfnmG. A.\binitsG. A. (\byear2008). \btitleAn imputed genotype resource for the laboratory mouse. \bjournalMammalian Genome \bvolume19 \bpages199–208. \bptokimsref \endbibitem
  • van der Heijden et al. (2006) {barticle}[pbm] \bauthor\bparticlevan der \bsnmHeijden, \bfnmGeert J.\binitsG. J., \bauthor\bsnmDonders, \bfnmA. Rogier\binitsA. R., \bauthor\bsnmStijnen, \bfnmTheo\binitsT. and \bauthor\bsnmMoons, \bfnmKarel G.\binitsK. G. (\byear2006). \btitleImputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: A clinical example. \bjournalJ. Clin. Epidemiol. \bvolume59 \bpages1102–1109. \biddoi=10.1016/j.jclinepi.2006.01.015, issn=0895-4356, pii=S0895-4356(06)00198-3, pmid=16980151 \bptokimsref \endbibitem
  • Wear and Greis (2002) {barticle}[auto:STB—2012/01/05—16:28:07] \bauthor\bsnmWear, \bfnmD. N.\binitsD. N. and \bauthor\bsnmGreis, \bfnmJ. G.\binitsJ. G. (\byear2002). \btitleSouthern forest resource assessment: Summary of findings. \bjournalJournal of Forestry \bvolume100 \bpages6–14. \bptokimsref \endbibitem
  • Wilson et al. (2010) {barticle}[mr] \bauthor\bsnmWilson, \bfnmMelanie A.\binitsM. A., \bauthor\bsnmIversen, \bfnmEdwin S.\binitsE. S., \bauthor\bsnmClyde, \bfnmMerlise A.\binitsM. A., \bauthor\bsnmSchmidler, \bfnmScott C.\binitsS. C. and \bauthor\bsnmSchildkraut, \bfnmJoellen M.\binitsJ. M. (\byear2010). \btitleBayesian model search and multilevel inference for SNP association studies. \bjournalAnn. Appl. Stat. \bvolume4 \bpages1342–1364. \biddoi=10.1214/09-AOAS322, issn=1932-6157, mr=2758331 \bptokimsref \endbibitem
  • Yu et al. (2006) {barticle}[auto:STB—2012/01/05—16:28:07] \bauthor\bsnmYu, \bfnmJ. M.\binitsJ. M., \bauthor\bsnmPressoir, \bfnmG.\binitsG., \bauthor\bsnmBriggs, \bfnmW. H.\binitsW. H., \bauthor\bsnmBi, \bfnmI. V.\binitsI. V., \bauthor\bsnmYamasaki, \bfnmM.\binitsM., \bauthor\bsnmDoebley, \bfnmJ.\binitsJ., \bauthor\bsnmMcMullen, \bfnmM. D.\binitsM. D., \bauthor\bsnmGaut, \bfnmB. S.\binitsB. S., \bauthor\bsnmNielsen, \bfnmD. M.\binitsD. M., \bauthor\bsnmHolland, \bfnmJ. B.\binitsJ. B., \bauthor\bsnmKresovich, \bfnmS.\binitsS. and \bauthor\bsnmBuckler, \bfnmE. S.\binitsE. S. (\byear2006). \btitleA unified mixed model method for association mapping that accounts for multiple levels of relatedness. \bjournalNature Genetics \bvolume38 \bpages203–208. \bptokimsref \endbibitem
  • Zhu et al. (2008) {barticle}[auto:STB—2012/01/05—16:28:07] \bauthor\bsnmZhu, \bfnmC.\binitsC., \bauthor\bsnmGore, \bfnmM.\binitsM., \bauthor\bsnmBuckler, \bfnmE. S.\binitsE. S. and \bauthor\bsnmYu, \bfnmJ.\binitsJ. (\byear2008). \btitleStatus and prospects of association mapping in plants. \bjournalThe Plant Genome \bvolume1 \bpages5–20. \bptokimsref \endbibitem