This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Disentangling Linkage and Population Structure in Association Mapping

Hanbin Lee†,‡ and Moohyuk Lee Department of Medicine, Seoul National University College of Medicine
Seoul, Republic of Korea
Department of Mathematical Sciences, Seoul National University
Seoul, Republic of Korea
hanbin973@snu.ac.kr
Abstract.

Genome-wide association study (GWAS) tests single nucleotide polymorphism (SNP) markers across the genome to localize the underlying causal variant of a trait. Because causal variants are seldom observed directly, a surrogate model based on genotyped markers are widely considered. Although many methods estimating the parameters of the surrogate model have been proposed, the connection between the surrogate model and the true causal model is yet investigated. In this work, we establish the connection between the surrogate model and the true causal model. The connection shows that population structure is accounted in GWAS by modelling the variant of interest and not the trait. Such observation explains how environmental confounding can be partially corrected using genetic covariates and why the previously claimed connection between PC correction and linear mixed models is incorrect.

1. Introduction

Genome-wide association study (GWAS) identifies regions in the genome responsible for the variation in a trait through single nucleotide polymorphisms (SNPs) distributed across the genome. Each SNP is tested for its association with the trait (generally through regression models) and its regression coefficient (henceforth, marker effect size) is recorded. Although markers themselves may not be causal, a significant effect size hints a true causal variant nearby the marker being tested. This is due to linkage, the correlation between two physically proximal variants. Earlier studies have found that SNPs are dense and widespread across the genome [1, 2]. Therefore, testing SNPs across the genome enables the identification of genetic signals genome-wide even if some causal variants are missing which gives the name of GWAS.

For continuous traits, the quantitative trait model is widely adopted where the trait is an additive function respect to the observed SNPs. From now on, we call this model as the marker-additive model (MAM). The estimation of MAM parameters have been extensively studied and many methods have been proposed [3, 4, 5]. MAM, however, is a surrogate model where the parameters (the marker effect sizes) have no direct causal interpretation since SNP markers are seldom causal themselves. Furthermore, both empirical and theoretical work have shown that stratification of allele frequencies due to population structure can produce spurious signals without linkage [6].

In this work, we consider an additive model respect to the causal variants, the causal-additive model (CAM), and establish the connection with the MAM. CAM parameters have direct causal interpretation because it directly involves often unobserved causal variants of the trait. We deliver a formula showing that marker effect size can be decomposed into three terms that reflect linkage and population structure separately. Best to our knowledge, our work is the first attempt to address the identification problem of covariate adjusted regression in GWAS.

2. Setup

2.1. Notations and Models

The data discussed in this work is (Yi,𝐂i,𝐌i,𝐒i)i=1n(Y_{i},\mathbf{C}_{i},\mathbf{M}_{i},\mathbf{S}_{i})_{i=1}^{n} where YiY_{i}\in\mathbb{R} is the trait, 𝐂ip\mathbf{C}_{i}\in\mathbb{R}^{p} is the causal variants, 𝐌iq\mathbf{M}_{i}\in\mathbb{R}^{q} is the markers and 𝐒i\mathbf{S}_{i} is the population membership. In this study, we focus on diploid populations where elements of 𝐂i\mathbf{C}_{i} and 𝐌i\mathbf{M}_{i} take values 0,1,20,1,2. h=𝐦,𝐩h=\mathbf{m},\mathbf{p} denotes the haplotype index (𝐦\mathbf{m} for maternal and 𝐩\mathbf{p} for paternal haplotype) so that Cijh=0,1C_{ijh}=0,1 and Mikh=0,1M_{ikh}=0,1, accordingly. The marker-additive model (MAM) is

Yi=𝐌iT𝜷+β0+δi and 𝔼[δi𝐌i]=0\displaystyle Y_{i}=\mathbf{M}_{i}^{T}\boldsymbol{\beta}+\beta_{0}+\delta_{i}\text{\quad and\quad}\mathbb{E}[\delta_{i}\mid\mathbf{M}_{i}]=0 (2.1)

where 𝜷q\boldsymbol{\beta}\in\mathbb{R}^{q} is the marker effect size and δi\delta_{i} is the noise variable. In most cases with very few exceptions, markers are tested marginally rather than as a whole. Without loss of generality, we write Mi1M_{i1} as the marker being tested which leads to the following regression.

Yi=Mi1β1+𝐌i(1)T𝜷(1)+β0+δi\displaystyle Y_{i}=M_{i1}\beta_{1}+\mathbf{M}_{i(-1)}^{T}\boldsymbol{\beta}_{(-1)}+\beta_{0}+\delta_{i} (2.2)

where (1)(-1) in the subscript means that the first element is dropped from the original variable. Although model (2.1) contains all genotyped markers, in marginal testing, markers that are physically close to the variant being tested are removed from the regression. One such way is the leave one chromosome out (LOCO) approach where markers appearing in 𝐌i(1)\mathbf{M}_{i(-1)} are all sampled outside the chromosome which contains the variant being tested [7].

The causal-additive model (CAM), analogous to MAM, is

Yi=𝐂iT𝜶+α0+ϵi and 𝔼[ϵi𝐂i]=0\displaystyle Y_{i}=\mathbf{C}_{i}^{T}\boldsymbol{\alpha}+\alpha_{0}+\epsilon_{i}\text{\quad and\quad}\mathbb{E}[\epsilon_{i}\mid\mathbf{C}_{i}]=0 (2.3)

where 𝜶p\boldsymbol{\alpha}\in\mathbb{R}^{p} is the causal effect size and ϵi\epsilon_{i} is the noise variable. Since changes in 𝐂i\mathbf{C}_{i} has direct influence on YiY_{i}, 𝜶\boldsymbol{\alpha} has a straightforward causal interpretation.

Several population structure models are discussed in this work. When the sample is obtained from LL discrete populations indexed by ll, 𝐒iL\mathbf{S}_{i}\in\mathbb{R}^{L} where

𝐒i=(0,,0,,0) if the i-th individual is from population l=1or𝐒i=(0,,1lth,,0) if the i-th individual is from population l1\displaystyle\begin{gathered}\mathbf{S}_{i}=(0,\ldots,0,\ldots,0)\text{ if the $i$-th individual is from population }l=1\\ \text{or}\\ \mathbf{S}_{i}=(0,\ldots,\underset{\underset{l-\text{th}}{\uparrow}}{1},\ldots,0)\text{ if the $i$-th individual is from population }l\neq 1\end{gathered}

The Pritchard-Stephens-Donnelly (PSD) model is a more general model that describes population structure incorporating admixture, where the genetic materials of an individual originates from multiple ancestries (indexed by ll with total LL of them) rather than a single one [8]. In this case, 𝐒i\mathbf{S}_{i} is often called the admixture proportion where

𝐒i=(ai1,,ail,,aiL)\displaystyle\mathbf{S}_{i}=(a_{i1},\ldots,a_{il},\ldots,a_{iL})

with l=1Lail=1\sum_{l=1}^{L}a_{il}=1. The ll-th element is the proportion of genome of an individual that originates from the ll-th ancestral population. If one restricts the elements of 𝐒i\mathbf{S}_{i} to take values 0 or 11, the PSD model reduces to the discrete populations model. The PSD model further assumes that the Hardy-Weinberg Equilibrium (HWE) holds conditional on 𝐒i\mathbf{S}_{i}. That being said, random mating is assumed and as a consequence, an individual receives its genome by randomly selecting a haplotype from the parent generation.

We define haplotype frequencies and linkage disequilibrium (LD) parameters. fj=(Cijh=1)f_{j}=\mathbb{P}(C_{ijh}=1), gk=(Mikh=1)g_{k}=\mathbb{P}(M_{ikh}=1) and hjk=(Cijh=1,Mikh=1)h_{jk}=\mathbb{P}(C_{ijh}=1,M_{ikh}=1) are allele frequencies and joint allele frequency of causal variant jj, marker variant kk and both variant jj and variant kk. LD covariance Djk=hjkfjhkD_{jk}=h_{jk}-f_{j}h_{k} is the covariance of variant jj and kk. When considering population membership, population specific analogues should be defined. fj𝐬=(Cijh=1𝐒i=𝐬)f_{j}^{\mathbf{s}}=\mathbb{P}(C_{ijh}=1\mid\mathbf{S}_{i}=\mathbf{s}), gk𝐬=(Mikh=1𝐒i=𝐬)g_{k}^{\mathbf{s}}=\mathbb{P}(M_{ikh}=1\mid\mathbf{S}_{i}=\mathbf{s}) and hjk𝐬=(Cijh=1,Mikh=1𝐒i=𝐬)h_{jk}^{\mathbf{s}}=\mathbb{P}(C_{ijh}=1,M_{ikh}=1\mid\mathbf{S}_{i}=\mathbf{s}) are the population specific allele frequencies and joint allele frequency. Population specific LD covariance is defined accordingly Djk𝐬=hjk𝐬fj𝐬hk𝐬D_{jk}^{\mathbf{s}}=h_{jk}^{\mathbf{s}}-f_{j}^{\mathbf{s}}h_{k}^{\mathbf{s}}.

Finally, linear projection of XX respect to YY, L[XY]\mathrm{L}[X\mid Y], will be frequently used throughout [9]. Note that the variables appearing on the right side the vertical sign (||) only contain the random components, and the intercept 11 is always included but ignored in the notation for convenience.

2.2. The Causal Structure

The conditional independence implied by the causal structure plays a pivotal role in this work. The causal structure is summarized in the following causal diagram.

𝐌i𝐦\mathbf{M}_{i\mathbf{m}}𝐒i\mathbf{S}_{i}𝐌i𝐩\mathbf{M}_{i\mathbf{p}}𝐂i𝐦\mathbf{C}_{i\mathbf{m}}𝐂i𝐩\mathbf{C}_{i\mathbf{p}}YiY_{i}𝐌i\mathbf{M}_{i}𝐂i\mathbf{C}_{i}LinkagePopulation structure

The key condition we later use implied by the diagram is 𝐌i𝐦,𝐂i𝐦𝐌i𝐩,𝐂i𝐩𝐒i\mathbf{M}_{i\mathbf{m}},\mathbf{C}_{i\mathbf{m}}\perp\!\!\!\perp\mathbf{M}_{i\mathbf{p}},\mathbf{C}_{i\mathbf{p}}\mid\mathbf{S}_{i}. In other words, the genotype at different haplotypes are conditionally independent which is also implied by the conditional HWE assumption.

The diagram shows the two sources of association between the marker (𝐌i\mathbf{M}_{i}) and the trait (YiY_{i}):

  1. (1)
    𝐌ih\mathbf{M}_{ih}𝐂ih\mathbf{C}_{ih}YiY_{i}

    : Association due to linkage.

  2. (2)
    𝐌ih\mathbf{M}_{ih}𝐒i\mathbf{S}_{i}𝐂ih\mathbf{C}_{ih}YiY_{i}

    : Association due to population structure.

In GWAS, we only want the first association to be present. This can be achieved by conditioning on 𝐒i\mathbf{S}_{i}, but it is generally not observed in population samples. We later show that this problem is partially, but not fully, accounted by including many markers to the regression as in (2.2).

2.3. The Goal

Our goal is to characterize the estimand of the regression (2.2) under the CAM (2.3). Following the practice of excluding variants that are near to the marker being tested, we assume that Mi1𝐌i(1)𝐒iM_{i1}\perp\!\!\!\perp\mathbf{M}_{i(-1)}\mid\mathbf{S}_{i}. Due to the causal structure, this is equivalent to saying that Mi1M_{i1} is unlinked to the rest of the variants in the regression. Hence, conditional on 𝐒i\mathbf{S}_{i}, the following regression and regression (2.2) has the same estimand.

Yi=Mi1β1,nocov+β0,nocov+δi,nocov\displaystyle Y_{i}=M_{i1}\beta_{1,\mathrm{nocov}}+\beta_{0,\mathrm{nocov}}+\delta_{i,\mathrm{nocov}} (2.4)

i.e. β1=β1,nocov\beta_{1}=\beta_{1,\mathrm{nocov}} when the regression is restricted to a fixed 𝐒i\mathbf{S}_{i}. Because the regression is conditioned on 𝐒i\mathbf{S}_{i}, β1\beta_{1} only reflects route (1) in section (2.2). To distinguish it from the regression coefficient obtained from the whole sample without restricting 𝐒i\mathbf{S}_{i}, we write this estimand as β1(𝐒i)\beta_{1}(\mathbf{S}_{i}). It can be shown that

β1(𝐒i=𝐬)=jDj1𝐬g1𝐬(1g1𝐬)αj=jβ1j(𝐒i=𝐬)\displaystyle\begin{aligned} \beta_{1}(\mathbf{S}_{i}=\mathbf{s})&=\sum_{j}\frac{D_{j1}^{\mathbf{s}}}{g_{1}^{\mathbf{s}}(1-g_{1}^{\mathbf{s}})}\alpha_{j}\\ &=\sum_{j}\beta_{1j}(\mathbf{S}_{i}=\mathbf{s})\end{aligned} (2.5)

by the formula of univariate linear regression. Given that 𝔼[Cij𝐒i]\mathbb{E}[C_{ij}\mid\mathbf{S}_{i}] and 𝔼[Mik𝐒i]\mathbb{E}[M_{ik}\mid\mathbf{S}_{i}] are linear functions respect to 𝐒i\mathbf{S}_{i} (e.g. admixture), the regression applied to the whole population

Yi=M1iβ1,𝐬+𝐒i𝜸𝐬+β0,𝐬+δi,𝐬\displaystyle Y_{i}=M_{1i}\beta_{1,\mathbf{s}}+\mathbf{S}_{i}\boldsymbol{\gamma}_{\mathbf{s}}+\beta_{0,\mathbf{s}}+\delta_{i,\mathbf{s}}

has the estimand

β1,𝐬=j𝔼𝐒[β1j(𝐒i)Var(Mi1𝐒i)]𝔼𝐒[Var(Mi1𝐒i)]=𝔼𝐒[β1(𝐒i)Var(Mi1𝐒i)]𝔼𝐒[Var(Mi1𝐒i)]\displaystyle\begin{aligned} \beta_{1,\mathbf{s}}&=\sum_{j}\frac{\mathbb{E}_{\mathbf{S}}[\beta_{1j}(\mathbf{S}_{i})\mathrm{Var}(M_{i1}\mid\mathbf{S}_{i})]}{\mathbb{E}_{\mathbf{S}}[\mathrm{Var}(M_{i1}\mid\mathbf{S}_{i})]}\\ &=\frac{\mathbb{E}_{\mathbf{S}}[\beta_{1}(\mathbf{S}_{i})\mathrm{Var}(M_{i1}\mid\mathbf{S}_{i})]}{\mathbb{E}_{\mathbf{S}}[\mathrm{Var}(M_{i1}\mid\mathbf{S}_{i})]}\end{aligned} (2.6)

similar to the case of OLS under heterogeneous treatment effect (HTE) [10]. Equation (2.6) is a weighted average over β1(𝐒i)\beta_{1}(\mathbf{S}_{i}) with weights that sum up to 11.

In the whole sample, two regressions (2.2) and (2.4) have different estimands because Mi1𝐌i(1)M_{i1}\not\!\perp\!\!\!\perp\mathbf{M}_{i(-1)} in general. We aim to understand how β1\beta_{1}, the marker effect size obtained from regression (2.2) applied to the whole population, is related to β1(𝐒i)\beta_{1}(\mathbf{S}_{i}), the desired estimand obtained from ideal populations each in HWE.

3. Main Results

In GWAS, we are only interested in the contribution of linkage with a physically proximal causal variant. It should be desirable if path (1) in section (2.2) can be separated from path (2). However, we show that this is only partially achievable under the population-based design. Nevertheless, it is achievable under the within-sibship design [11].

3.1. Population-based GWAS

We begin by stating the result for a simple regression (2.4) without any covariates.

Proposition 3.1.

The estimand of regression (2.4) on the whole population is

β1,nocov=j𝔼𝐒[β1j(𝐒i)Var(Mi1𝐒i)]Var[Mi1]linkage+jCov𝐒[𝔼(Cij𝐒i),𝔼(Mi1𝐒i)]Var[Mi1]αjpopulation structure=𝔼𝐒[β1(𝐒i)Var(Mi1𝐒i)]Var[Mi1]+jCov𝐒[𝔼(Cij𝐒i),𝔼(Mi1𝐒i)]Var[Mi1]αj\displaystyle\begin{aligned} \beta_{1,\mathrm{nocov}}&=\sum_{j}\underbrace{\frac{\mathbb{E}_{\mathbf{S}}[\beta_{1j}(\mathbf{S}_{i})\mathrm{Var}(M_{i1}\mid\mathbf{S}_{i})]}{\mathrm{Var}[M_{i1}]}}_{\text{linkage}}+\sum_{j}\underbrace{\frac{\mathrm{Cov}_{\mathbf{S}}[\mathbb{E}(C_{ij}\mid\mathbf{S}_{i}),\mathbb{E}(M_{i1}\mid\mathbf{S}_{i})]}{\mathrm{Var}[M_{i1}]}\cdot\alpha_{j}}_{\text{population structure}}\\ &=\frac{\mathbb{E}_{\mathbf{S}}[\beta_{1}(\mathbf{S}_{i})\mathrm{Var}(M_{i1}\mid\mathbf{S}_{i})]}{\mathrm{Var}[M_{i1}]}+\sum_{j}\frac{\mathrm{Cov}_{\mathbf{S}}[\mathbb{E}(C_{ij}\mid\mathbf{S}_{i}),\mathbb{E}(M_{i1}\mid\mathbf{S}_{i})]}{\mathrm{Var}[M_{i1}]}\cdot\alpha_{j}\end{aligned}
Remark 1 (The Wahlund effect).

The first term can be viewed as a average of β1(𝐒i)\beta_{1}(\mathbf{S}_{i}) weighted by Var(Mi1𝐒i)(𝐒i)Var[Mi1]\frac{\mathrm{Var}(M_{i1}\mid\mathbf{S}_{i})\mathbb{P}(\mathbf{S}_{i})}{\mathrm{Var}[M_{i1}]}. An important observation is that the weights sum up to a value smaller than 11 as long as there is allele frequency stratification of Mi1M_{i1} across 𝐒i\mathbf{S}_{i}. To see this,

Var[Mi1]=𝔼𝐒[Var(Mi1𝐒i)]+Var𝐒[𝔼𝐒(Mi1𝐒i)]\displaystyle\mathrm{Var}[M_{i1}]=\mathbb{E}_{\mathbf{S}}[\mathrm{Var}(M_{i1}\mid\mathbf{S}_{i})]+\mathrm{Var}_{\mathbf{S}}[\mathbb{E}_{\mathbf{S}}(M_{i1}\mid\mathbf{S}_{i})]

which is due to the law of total variance. This was first reported in 1928 by Wahlund [12]. The sum over the weights can also be expressed in terms of Wright’s FF-statistics [13] which is a popular measure of population structure.

𝔼𝐒[Var(Mi1𝐒i)]Var[Mi1]=1FST\displaystyle\frac{\mathbb{E}_{\mathbf{S}}[\mathrm{Var}(M_{i1}\mid\mathbf{S}_{i})]}{\mathrm{Var}[M_{i1}]}=1-F_{\mathrm{ST}}

A straightforward consequence of the Wahlund effect is that the desired effect β1(𝐒i)\beta_{1}(\mathbf{S}_{i}) is attenuated and the attenuation is proportional to the strength of population structure, namely, FSTF_{\mathrm{ST}}.

The second term due to population structure in Proposition 3.1 has a more familiar interpretation. The numerator is non-zero if and only if the causal variant and the marker variant allele frequencies align simultaneously across 𝐒i\mathbf{S}_{i}. In sum, Proposition 3.1 shows that population structure affects β1,nocov\beta_{1,\mathrm{nocov}} in two folds. It attenuates the true signal and puts undesirable signals into the estimand.

Now we propose our main theorem which delivers the estimand of regression (2.2).

Theorem 3.2.

Let M~i1=Mi1L[Mi1𝐌i(1)]\widetilde{M}_{i1}=M_{i1}-\mathrm{L}[M_{i1}\mid\mathbf{M}_{i(-1)}]. The estimand of regression (2.2) is

β1=j𝔼𝐒[β1j(𝐒i)Var(Mi1𝐒i)]Var[M~i1]linkage(=𝔼𝐒[β1(𝐒i)Var(Mi1𝐒i)]Var[M~i1])+j𝔼[Cij(𝔼[Mi1𝐒i]𝔼𝐒[𝔼(Mi1𝐒i)𝐌i(1)])]Var[M~i1]αjprediction error+j𝔼[Cij(𝔼[Mi1𝐌i(1)]L[Mi1𝐌i(1)])]Var[M~i1]αjfunctional misspecification\displaystyle\begin{aligned} \beta_{1}&=\sum_{j}\underbrace{\frac{\mathbb{E}_{\mathbf{S}}[\beta_{1j}(\mathbf{S}_{i})\mathrm{Var}(M_{i1}\mid\mathbf{S}_{i})]}{\mathrm{Var}[\widetilde{M}_{i1}]}}_{\text{linkage}}\;\left(=\frac{\mathbb{E}_{\mathbf{S}}[\beta_{1}(\mathbf{S}_{i})\mathrm{Var}(M_{i1}\mid\mathbf{S}_{i})]}{\mathrm{Var}[\widetilde{M}_{i1}]}\right)\\ &+\sum_{j}\underbrace{\frac{\mathbb{E}[C_{ij}(\mathbb{E}[M_{i1}\mid\mathbf{S}_{i}]-\mathbb{E}_{\mathbf{S}}[\mathbb{E}(M_{i1}\mid\mathbf{S}_{i})\mid\mathbf{M}_{i(-1)}])]}{\mathrm{Var}[\widetilde{M}_{i1}]}\cdot\alpha_{j}}_{\text{prediction error}}\\ &+\sum_{j}\underbrace{\frac{\mathbb{E}[C_{ij}(\mathbb{E}[M_{i1}\mid\mathbf{M}_{i(-1)}]-\mathrm{L}[M_{i1}\mid\mathbf{M}_{i(-1)}])]}{\mathrm{Var}[\widetilde{M}_{i1}]}\cdot\alpha_{j}}_{\text{functional misspecification}}\end{aligned}
Proposition 3.3.

The weights of the linkage term in Theorem 3.2 sum up to a value smaller than 11.

𝔼𝐒[Var(Mi1𝐒i)]Var[M~i1]1\displaystyle\frac{\mathbb{E}_{\mathbf{S}}[\mathrm{Var}(M_{i1}\mid\mathbf{S}_{i})]}{\mathrm{Var}[\widetilde{M}_{i1}]}\leq 1

where the equality holds if and only if 𝔼[Mi1𝐌i(1)]=L[Mi1𝐌i(1)]\mathbb{E}[M_{i1}\mid\mathbf{M}_{i(-1)}]=\mathrm{L}[M_{i1}\mid\mathbf{M}_{i(-1)}].

Theorem 3.4.

Let 𝐌i(1)(q)\mathbf{M}_{i(-1)}^{(q)} be a sequence of markers other than 11 with length qq. Assume that 𝐌i(1)(q)\mathbf{M}_{i(-1)}^{(q)} defines a monotonically increasing filteration σ𝐌i(1)(q)\sigma\langle\mathbf{M}_{i(-1)}^{(q)}\rangle respect to qq. Under the casual structure of section (2.2),

𝔼𝐒[𝔼(Mi1𝐒i)𝐌i(1)]q𝔼[Mi1𝐒i=𝐬]\displaystyle\mathbb{E}_{\mathbf{S}}[\mathbb{E}(M_{i1}\mid\mathbf{S}_{i})\mid\mathbf{M}_{i(-1)}]\xrightarrow[q\rightarrow\infty]{}\mathbb{E}[M_{i1}\mid\mathbf{S}_{i}=\mathbf{s}]

in L1()L^{1}(\mathbb{P}).

Theorem 3.4 tells us that with sufficiently large number of markers as covariates, the prediction error becomes negligible. This is very likely in modern GWAS since millions of variants are genotyped. We defer the detailed proof to the Appendix but provide a brief sketch here.

To prove the theorem, we resort ourselves to a more general problem where we consider the convergence of distributions, namely, (𝐌i(1))(𝐒i=𝐬)\mathbb{P}(\;\cdot\mid\mathbf{M}_{i(-1)})\rightarrow\mathbb{P}(\;\cdot\mid\mathbf{S}_{i}=\mathbf{s}) with 𝐒i=𝐬\mathbf{S}_{i}=\mathbf{s} fixed for an individual. If we view 𝐬\mathbf{s} as the true parameter, (𝐒i)\mathbb{P}(\mathbf{S}_{i}) as the prior distribution and (𝐒i𝐌i(1))\mathbb{P}(\mathbf{S}_{i}\mid\mathbf{M}_{i(-1)}) (where 𝐌i(1)\mathbf{M}_{i(-1)} is the data) as the posterior distribution, the convergence is simply a posterior consistency problem in Bayesian statistics [14, 15]. Loosely speaking, posterior consistency means that the posterior distribution degenerates to a point mass as the number of data increases. Hence, we evoke the Doob’s theorem to prove our statement [15]. The theorem tells us that we can exactly predict the population membership 𝐒i\mathbf{S}_{i} with large number of markers.

Remark 2.

The functional misspecification 𝔼[Mi1𝐌i(1)]L[Mi1𝐌i(1)]\mathbb{E}[M_{i1}\mid\mathbf{M}_{i(-1)}]-\mathrm{L}[M_{i1}\mid\mathbf{M}_{i(-1)}] is generally nonzero. One concrete example is the two discrete population model. A closed-form analysis can be delivered assuming that elements of 𝐌i(1)\mathbf{M}_{i(-1)} are mutually unlinked. By the Bayes theorem,

(𝐒i𝐌i(1))=(𝐌i(1)𝐒i)(𝐒i)(𝐌i(1))\displaystyle\mathbb{P}(\mathbf{S}_{i}\mid\mathbf{M}_{i(-1)})=\frac{\mathbb{P}(\mathbf{M}_{i(-1)}\mid\mathbf{S}_{i})\;\mathbb{P}(\mathbf{S}_{i})}{\mathbb{P}(\mathbf{M}_{i(-1)})}

This gives

(𝐒i=1𝐌i(1))1(𝐒i=1𝐌i(1))\displaystyle\frac{\mathbb{P}(\mathbf{S}_{i}=1\mid\mathbf{M}_{i(-1)})}{1-\mathbb{P}(\mathbf{S}_{i}=1\mid\mathbf{M}_{i(-1)})} =(𝐒i=1𝐌i(1))(𝐒i=0𝐌i(1))\displaystyle=\frac{\mathbb{P}(\mathbf{S}_{i}=1\mid\mathbf{M}_{i(-1)})}{\mathbb{P}(\mathbf{S}_{i}=0\mid\mathbf{M}_{i(-1)})}
=(𝐌i(1)𝐒i=1)(𝐌i(1)𝐒i=0)(𝐒i=1)(𝐒i=0)\displaystyle=\frac{\mathbb{P}(\mathbf{M}_{i(-1)}\mid\mathbf{S}_{i}=1)}{\mathbb{P}(\mathbf{M}_{i(-1)}\mid\mathbf{S}_{i}=0)}\cdot\frac{\mathbb{P}(\mathbf{S}_{i}=1)}{\mathbb{P}(\mathbf{S}_{i}=0)}

On the other hand, (𝐌i(1)𝐒i)\mathbb{P}(\mathbf{M}_{i(-1)}\mid\mathbf{S}_{i}) is factorized due to the conditional independence MikMik𝐒iM_{ik}\perp\!\!\!\perp M_{ik^{\prime}}\mid\mathbf{S}_{i}.

(𝐌i(1)𝐒i=𝐬)\displaystyle\mathbb{P}(\mathbf{M}_{i(-1)}\mid\mathbf{S}_{i}=\mathbf{s}) =k1(Mik𝐒i)\displaystyle=\prod_{k\neq 1}\mathbb{P}(M_{ik}\mid\mathbf{S}_{i})
=k1(2Mik)(gk𝐬)Mik(1gk𝐬)2Mik\displaystyle=\prod_{k\neq 1}\binom{2}{M_{ik}}\left(g_{k}^{\mathbf{s}}\right)^{M_{ik}}\left(1-g_{k}^{\mathbf{s}}\right)^{2-M_{ik}}

Finally, applying log()\log(\;\cdot\;) on both sides of the equation shows that 𝔼[𝐒i𝐌i(1)]\mathbb{E}[\mathbf{S}_{i}\mid\mathbf{M}_{i(-1)}] is a logit-linear function respect to 𝐌i(1)\mathbf{M}_{i(-1)}.

logit(𝔼[𝐒iMi(1)])\displaystyle\mathrm{logit}(\mathbb{E}[\mathbf{S}_{i}\mid M_{i(-1)}]) =k1[Miklog(gk1gk0)+(2Mik)log(1gk11gk0)]+log((𝐒i=1)(𝐒i=0))\displaystyle=\sum_{k\neq 1}\left[M_{ik}\log\left(\frac{g_{k}^{1}}{g_{k}^{0}}\right)+(2-M_{ik})\log\left(\frac{1-g_{k}^{1}}{1-g_{k}^{0}}\right)\right]+\log\left(\frac{\mathbb{P}(\mathbf{S}_{i}=1)}{\mathbb{P}(\mathbf{S}_{i}=0)}\right)

In sum, even with handful of covariates, the non-vanishing functional misspecification attenuates the true signal according to Proposition 3.3 and leaves an additional confounding term according to Theorem 3.2.

3.2. Within-sibship GWAS

In within-sibship GWAS, we observe the family membership of each individual. Denote the family membership of individual ii as 𝐅i\mathbf{F}_{i}. The families are index by ff.

Marginal testing of markers are performed using the following regression.

Yi=Mi1β1,𝐟+f𝟏(𝐅i=f)γf+β0+δi\displaystyle Y_{i}=M_{i1}\beta_{1,\mathbf{f}}+\sum_{f}\mathbf{1}(\mathbf{F}_{i}=f)\gamma_{f}+\beta_{0}+\delta_{i} (3.1)

The OLS estimator of regression (3.1) is equivalent to the famous sibling difference regression when there are only two siblings per family [16].

YiYi=(Mi1Mi1)β1,sd+(δiδi) where 𝐅i=𝐅i\displaystyle Y_{i}-Y_{i^{\prime}}=(M_{i1}-M_{i^{\prime}1})\beta_{1,\mathrm{sd}}+(\delta_{i}-\delta_{i^{\prime}})\text{\quad where\quad}\mathbf{F}_{i}=\mathbf{F}_{i^{\prime}} (3.2)

The proof of the algebraic equivalence can be found in Wooldrdige (2021) [17].

The estimand of regression (3.1) is

β1,𝐟=j𝔼𝐅[β1j(𝐅i)Var(Mi1𝐅i)]𝔼𝐅[Var(Mi1𝐅i)]=𝔼𝐅[β1(𝐅i)Var(Mi1𝐅i)]𝔼𝐅[Var(Mi1𝐅i)]\displaystyle\begin{aligned} \beta_{1,\mathbf{f}}&=\sum_{j}\frac{\mathbb{E}_{\mathbf{F}}[\beta_{1j}(\mathbf{F}_{i})\mathrm{Var}(M_{i1}\mid\mathbf{F}_{i})]}{\mathbb{E}_{\mathbf{F}}[\mathrm{Var}(M_{i1}\mid\mathbf{F}_{i})]}\\ &=\frac{\mathbb{E}_{\mathbf{F}}[\beta_{1}(\mathbf{F}_{i})\mathrm{Var}(M_{i1}\mid\mathbf{F}_{i})]}{\mathbb{E}_{\mathbf{F}}[\mathrm{Var}(M_{i1}\mid\mathbf{F}_{i})]}\end{aligned} (3.3)

where β1(𝐅i=f)\beta_{1}(\mathbf{F}_{i}=f) is the estimand of regression (3.1) restricted to family 𝐅i=f\mathbf{F}_{i}=f. By combining the facts (1) regressions (3.1) and (3.2) have algebraically identical OLS estimators and (2) OLS consistently estimates β1,𝐟\beta_{1,\mathbf{f}}, we show that the estimand β1,𝐟\beta_{1,\mathbf{f}} is equal to β1,𝐬\beta_{1,\mathbf{s}}, the population estimand when 𝐒i\mathbf{S}_{i} is known.

Theorem 3.5.

If we ignore recombination,

𝔼𝐅[β1(𝐅i)Var(Mi1𝐅i)]𝔼𝐅[Var(Mi1𝐅i)]=𝔼𝐒[β1(𝐒i)Var(Mi1𝐒i)]𝔼𝐒[Var(Mi1𝐒i)]\displaystyle\frac{\mathbb{E}_{\mathbf{F}}[\beta_{1}(\mathbf{F}_{i})\mathrm{Var}(M_{i1}\mid\mathbf{F}_{i})]}{\mathbb{E}_{\mathbf{F}}[\mathrm{Var}(M_{i1}\mid\mathbf{F}_{i})]}=\frac{\mathbb{E}_{\mathbf{S}}[\beta_{1}(\mathbf{S}_{i})\mathrm{Var}(M_{i1}\mid\mathbf{S}_{i})]}{\mathbb{E}_{\mathbf{S}}[\mathrm{Var}(M_{i1}\mid\mathbf{S}_{i})]}

3.3. Non-genetic confounding

We initially assumed that 𝔼[ϵi𝐂i,𝐌i]=0\mathbb{E}[\epsilon_{i}\mid\mathbf{C}_{i},\mathbf{M}_{i}]=0 in the CAM model (2.3). This can be relaxed so that ϵi=νi+Ui\epsilon_{i}=\nu_{i}+U_{i} where 𝔼[νi𝐂i,𝐌i]=0\mathbb{E}[\nu_{i}\mid\mathbf{C}_{i},\mathbf{M}_{i}]=0 and 𝔼[Ui]=0\mathbb{E}[U_{i}]=0. UiU_{i} may contain non-genetic random variables.

Equation (2.6) still holds provided

  1. (1)
    𝐌i,𝐂i\mathbf{M}_{i},\mathbf{C}_{i}𝐒i\mathbf{S}_{i}UiU_{i}

    : UiU_{i} is correlated with the genetic variables only through 𝐒i\mathbf{S}_{i}.

  2. (2)

    𝔼[Ui𝐒i]\mathbb{E}[U_{i}\mid\mathbf{S}_{i}] is linear respect to 𝐒i\mathbf{S}_{i}.

which are identical to the conditions in which genetic confounding is resolved with linear regression.

The estimand using regression (2.2) is also similar to Theorem 3.2 with two additional terms.

+𝔼[Ui(𝔼[Mi1𝐒i]𝔼𝐒[𝔼(Mi1𝐒i)𝐌i(1)])]Var[M~i1]prediction error+𝔼[Ui(𝔼[Mi1𝐌i(1)]L[Mi1𝐌i(1)])]Var[M~i1]functional misspecification\displaystyle\begin{aligned} &+\underbrace{\frac{\mathbb{E}[U_{i}(\mathbb{E}[M_{i1}\mid\mathbf{S}_{i}]-\mathbb{E}_{\mathbf{S}}[\mathbb{E}(M_{i1}\mid\mathbf{S}_{i})\mid\mathbf{M}_{i(-1)}])]}{\mathrm{Var}[\widetilde{M}_{i1}]}}_{\text{prediction error}}\\ &+\underbrace{\frac{\mathbb{E}[U_{i}(\mathbb{E}[M_{i1}\mid\mathbf{M}_{i(-1)}]-\mathrm{L}[M_{i1}\mid\mathbf{M}_{i(-1)}])]}{\mathrm{Var}[\widetilde{M}_{i1}]}}_{\text{functional misspecification}}\end{aligned} (3.4)

Prediction error vanishes and the functional misspecification term is generally non-zero as before. The derivation is essentially the same as in Theorem 3.2 in which CijC_{ij} is simply replaced by UiU_{i}. Note that the result does not depend on the two conditions in the previous paragraph. The formula shows how non-genetic confounding is partially resolved with genetic markers in linear regression: genetic markers indirectly controls the non-genetic variables by predicting 𝐒i\mathbf{S}_{i}.

4. Conclusions

In this work, we aimed to solve the identification problem of GWAS of quantitative traits using linear regression in a structured population consisting of subpopulations that are conditionally in HWE. We established the connection between the CAM and the MAM which provides a closed-form formula for what is being estimated using linear regression. The formula shows that under the popular population design, population structure exhibits a two-fold effect in which it induces an additive confounding term together with an attenuation of the true effect of a causal variant (Proposition 3.1). As expected, within-sibship design can overcome this problem due to direct access to family membership.

The work shows how population structure is corrected in GWAS of quantitative traits. By including unlinked markers to the regressions as covariates, linear regression implicitly removes the variation associated to the population structure from the variant being tested. The remaining variation after the removal is then regressed on the trait which gives a less biased estimate. Importantly, this means that the bias is corrected by modelling the distribution of the variant and not the trait as many have believed [18, 19]. Nevertheless, the bias is never completely removed because the expectation of the variant being tested is never truly linear respect to the covariate markers (Theorem 3.2). In this viewpoint, it becomes clear how genetic covariates can further correct environmental confounding even without directly observing them as shown in equation (3.4).

We expect our framework to be extended to incorporate other important evolutionary processes such as assortative mating and inbreeding. As the (conditional) independence between the two haplotypes of an individual plays an important role in our results and proofs, haplotype dependence induced by such evolutionary processes is likely to have an non-trivial impact on GWAS estimands [20]. Another shortcoming of our work is that it only deals with the identification and tells little about the estimation process. Therefore, although our work deals with the biases in the estimate, it remains silent about the power and the precision of the estimators.

Among popular methods, only linear mixed models (LMMs) exactly conform to equation (2.2). Since PC correction applies principal component analysis (PCA) to 𝐌i(1)\mathbf{M}_{i(-1)} prior to regression, it is a biased estimator for β1\beta_{1} [21]. This fact, together with previous works on PCA that show that it asymptotically recovers admixture proportions, shows that a previously believed relationship between PC correction and LMMs is incorrect [22, 23]. Previous work had argued that LMMs confer a stronger correction against population structure compared to PC correction [24, 25].

Two distinct arguments show that this claim is wrong. First, because PCA recovers the admixture proportion asymptotically, including the PC covariates properly adjusts population structure with the estimand in equation (2.6). Meanwhile, LMM fails to do so because it conforms to the MAM that leads to Theorem 3.2. Second, an important feature of LMMs, that was missed by previous analyses based on the number of eigen components considered, is that random effects are assumed to be independent from the variant being tested. Such an assumption is equivalent to the claim that the variant of interest is not subject to population structure which makes the attempt to correct population structure obsolete. This is a general shortcoming of random effects estimators that led to alternative methods in other fields [9, 26]. On the other hand, PC correction includes covariates as fixed effects which does not assume such independence thereby making a correct adjustment.

Acknowledgements

We thank the following colleagues who provided helpful feedbacks after reading an early version of the draft. Our work would have been impossible without them. Doc Edge (University of Southern California, US) gave important comments on a population genetic perspective. Qingyuan Zhao (University of Cambridge, UK) suggested recent literature in statistics and probability related to our work.

Proofs

Proof of Proposition 3.1.

It follows from Theorem 3.2 by setting 𝐌i(1)\mathbf{M}_{i(-1)} empty. ∎

Proof of Theorem 3.2.

By the Frisch-Waugh-Lovell (FWL) theorem [9], the estimand of regression (2.2) is

β1=𝔼[YiM~i1]𝔼[M~i12]\displaystyle\beta_{1}=\frac{\mathbb{E}[Y_{i}\widetilde{M}_{i1}]}{\mathbb{E}[\widetilde{M}_{i1}^{2}]}

and the estimand of the regression restricted to 𝐒i=𝐬\mathbf{S}_{i}=\mathbf{s} is

β1(𝐒i=𝐬)=𝔼[Yi(Mi1𝔼[Mi1𝐒i=𝐬])]Var[Mi1𝐒i]\displaystyle\beta_{1}(\mathbf{S}_{i}=\mathbf{s})=\frac{\mathbb{E}[Y_{i}(M_{i1}-\mathbb{E}[M_{i1}\mid\mathbf{S}_{i}=\mathbf{s}])]}{\mathrm{Var}[M_{i1}\mid\mathbf{S}_{i}]}

Now we expand the numerator of β1\beta_{1}. Substituting YiY_{i} with the CAM model (2.3) gives

𝔼[YiM~i1]=j𝔼[CijM~i1]αj+𝔼[ϵiM~i1]=j𝔼[CijM~i1]αj=j(𝔼[CijM˙i1]+𝔼[Cij(𝔼[Mi1𝐒i]L[Mi1𝐌i(1))])αj\displaystyle\begin{aligned} \mathbb{E}[Y_{i}\widetilde{M}_{i1}]&=\sum_{j}\mathbb{E}[C_{ij}\widetilde{M}_{i1}]\alpha_{j}+\mathbb{E}[\epsilon_{i}\widetilde{M}_{i1}]\\ &=\sum_{j}\mathbb{E}[C_{ij}\widetilde{M}_{i1}]\alpha_{j}\\ &=\sum_{j}\left(\mathbb{E}[C_{ij}\dot{M}_{i1}]+\mathbb{E}[C_{ij}(\mathbb{E}[M_{i1}\mid\mathbf{S}_{i}]-\mathrm{L}[M_{i1}\mid\mathbf{M}_{i(-1)})]\right)\cdot\alpha_{j}\end{aligned}

The first term 𝔼[CijM˙i1]αj\mathbb{E}[C_{ij}\dot{M}_{i1}]\alpha_{j} can be expressed in terms of β1j(𝐒i=𝐬)\beta_{1j}(\mathbf{S}_{i}=\mathbf{s}) by substituting the expression of it in the first paragraph of the proof.

𝔼[CijM˙i1]αj=𝔼𝐒[𝔼(CijM˙i1𝐒i)αj]=𝔼𝐒[β1j(𝐒i)Var(Mi1𝐒i)]\displaystyle\begin{aligned} \mathbb{E}[C_{ij}\dot{M}_{i1}]\alpha_{j}&=\mathbb{E}_{\mathbf{S}}[\mathbb{E}(C_{ij}\dot{M}_{i1}\mid\mathbf{S}_{i})\alpha_{j}]\\ &=\mathbb{E}_{\mathbf{S}}[\beta_{1j}(\mathbf{S}_{i})\mathrm{Var}(M_{i1}\mid\mathbf{S}_{i})]\end{aligned}

The second term is expanded by adding and subtracting E[Mi1𝐌i(1)]E[M_{i1}\mid\mathbf{M}_{i(-1)}].

𝔼[Cij(𝔼[Mi1𝐒i]L[Mi1𝐌i(1))]=𝔼[Cij(𝔼[Mi1𝐒i]𝔼[Mi1𝐌i(1)])]+𝔼[Cij(𝔼[Mi1𝐌i(1)]L[Mi1𝐌i(1)])]\displaystyle\begin{aligned} &\mathbb{E}[C_{ij}(\mathbb{E}[M_{i1}\mid\mathbf{S}_{i}]-\mathrm{L}[M_{i1}\mid\mathbf{M}_{i(-1)})]\\ &=\mathbb{E}[C_{ij}(\mathbb{E}[M_{i1}\mid\mathbf{S}_{i}]-\mathbb{E}[M_{i1}\mid\mathbf{M}_{i(-1)}])]+\mathbb{E}[C_{ij}(\mathbb{E}[M_{i1}\mid\mathbf{M}_{i(-1)}]-\mathrm{L}[M_{i1}\mid\mathbf{M}_{i(-1)}])]\end{aligned}

Finally, the following reformulation of 𝔼[Mi1𝐌i(1)]\mathbb{E}[M_{i1}\mid\mathbf{M}_{i(-1)}] completes the proof.

𝔼[Mi1𝐌i(1)]=𝔼𝐒[𝔼(Mi1𝐌i(1),𝐒i)𝐌i(1)]=𝔼𝐒[𝔼(Mi1𝐒i)𝐌i(1)]\displaystyle\begin{aligned} \mathbb{E}[M_{i1}\mid\mathbf{M}_{i(-1)}]&=\mathbb{E}_{\mathbf{S}}[\mathbb{E}(M_{i1}\mid\mathbf{M}_{i(-1)},\mathbf{S}_{i})\mid\mathbf{M}_{i(-1)}]\\ &=\mathbb{E}_{\mathbf{S}}[\mathbb{E}(M_{i1}\mid\mathbf{S}_{i})\mid\mathbf{M}_{i(-1)}]\end{aligned}

due to Mi1𝐌i(1)𝐒iM_{i1}\perp\!\!\!\perp\mathbf{M}_{i(-1)}\mid\mathbf{S}_{i}.

Proof of Proposition 3.3.

Let M˙i1=Mi1𝔼[Mi1𝐒i]\dot{M}_{i1}=M_{i1}-\mathbb{E}[M_{i1}\mid\mathbf{S}_{i}].

𝔼𝐒[Var(Mi1𝐒i)]Var[M~i1]=𝔼𝐒[Var(Mi1𝐒i)]Var[M˙i1]Var[M˙i1]Var[M~i1]\displaystyle\begin{aligned} \frac{\mathbb{E}_{\mathbf{S}}[\mathrm{Var}(M_{i1}\mid\mathbf{S}_{i})]}{\mathrm{Var}[\widetilde{M}_{i1}]}&=\frac{\mathbb{E}_{\mathbf{S}}[\mathrm{Var}(M_{i1}\mid\mathbf{S}_{i})]}{\mathrm{Var}[\dot{M}_{i1}]}\cdot\frac{\mathrm{Var}[\dot{M}_{i1}]}{\mathrm{Var}[\widetilde{M}_{i1}]}\end{aligned}

The first term is smaller than 11 due to the law of total variance.

Now we show that the second term is smaller than 11.

Var[M˙i1]=𝔼[(Mi1𝔼[Mi1𝐒i])2]=𝔼[(Mi1𝔼[Mi1𝐒i,𝐌i(1)])2]𝔼[(Mi1𝔼[Mi1𝐌i(1)])2]𝔼[(Mi1L[Mi1𝐌i(1)])2]=Var[M~i1]\displaystyle\begin{aligned} \mathrm{Var}[\dot{M}_{i1}]&=\mathbb{E}\left[(M_{i1}-\mathbb{E}[M_{i1}\mid\mathbf{S}_{i}])^{2}\right]\\ &=\mathbb{E}\left[(M_{i1}-\mathbb{E}[M_{i1}\mid\mathbf{S}_{i},\mathbf{M}_{i(-1)}])^{2}\right]\\ &\leq\mathbb{E}\left[(M_{i1}-\mathbb{E}[M_{i1}\mid\mathbf{M}_{i(-1)}])^{2}\right]\\ &\leq\mathbb{E}\left[(M_{i1}-\mathrm{L}[M_{i1}\mid\mathbf{M}_{i(-1)}])^{2}\right]\\ &=\mathrm{Var}[\widetilde{M}_{i1}]\end{aligned}

The second line follows from the conditional independence implied by the causal structure. The third line follows from the property of conditional expectation. The forth line follows from the fact that the conditional expectation minimizes the square norm.

The result can be informally inferred in a intuitive way using the causal graph. Explaining the variance of Mi1M_{i1} with 𝐒i\mathbf{S}_{i} is more effective than with 𝐌i(1)\mathbf{M}_{i(-1)} because the path from 𝐌i(1)\mathbf{M}_{i(-1)} from Mi1M_{i1}, 𝐌i(1)𝐒iMi1\mathbf{M}_{i(-1)}\leftarrow\mathbf{S}_{i}\rightarrow M_{i1}, must go through the path 𝐒iMi1\mathbf{S}_{i}\rightarrow M_{i1}.

Proof of Theorem 3.4.

See Theorem 6.9 of Ghosal and van der Vaart [15] (Doob’s theorem). First, substitute the variables appearing in the theorem according to our notation. Replace X(n)X^{(n)} to 𝐌i(q)\mathbf{M}_{i}^{(q)}, σX(1),X(2),\sigma\langle X^{(1)},X^{(2)},\ldots\rangle to σMi2,Mi3,\sigma\langle M_{i2},M_{i3},\ldots\rangle, θ\theta to 𝐬\mathbf{s} and Π\Pi to (𝐒i)\mathbb{P}(\mathbf{S}_{i}). Next, apply the theorem to f(𝐬)=𝔼[Mi1𝐒i=𝐬]f(\mathbf{s})=\mathbb{E}[M_{i1}\mid\mathbf{S}_{i}=\mathbf{s}] which gives the desired result. ∎

Proof of Theorem 3.5.

The proof is based on Veller and Coop (2023) [cite]. For a fixed 𝐒i=𝐬\mathbf{S}_{i}=\mathbf{s}, equation (7) of Veller and Coop shows that

β1,𝐟(𝐒i=𝐬)=2H1(𝐒i=𝐬)jDj1𝐬αj\displaystyle\beta_{1,\mathbf{f}}(\mathbf{S}_{i}=\mathbf{s})=\frac{2}{H_{1}(\mathbf{S}_{i}=\mathbf{s})}\sum_{j}D_{j1}^{\mathbf{s}}\alpha_{j}

where H1(𝐒i=𝐬)=2g1𝐬(1g1𝐬)H_{1}(\mathbf{S}_{i}=\mathbf{s})=2g_{1}^{\mathbf{s}}(1-g_{1}^{\mathbf{s}}) and Dj1𝐬=hj1𝐬fj𝐬g1𝐬D_{j1}^{\mathbf{s}}=h_{j1}^{\mathbf{s}}-f_{j}^{\mathbf{s}}g_{1}^{\mathbf{s}}. Therefore, β1,𝐟(𝐒i=𝐬)=β1(𝐒i=𝐬)\beta_{1,\mathbf{f}}(\mathbf{S}_{i}=\mathbf{s})=\beta_{1}(\mathbf{S}_{i}=\mathbf{s}). Note that λ\lambda was replaced to 11 and ll was replaced to jj to match our notation. This implicitly assumes that families are nested within populations. Finally, substituting the result to equation (3.1) gives the desired result.

References

  • [1] The International HapMap Consortium. A haplotype map of the human genome. Nature, 437(7063):1299–1320, oct 2005.
  • [2] The International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature, 449(7164):851–861, oct 2007.
  • [3] Alkes L Price, Nick J Patterson, Robert M Plenge, et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet, 38(8):904–909, jul 2006.
  • [4] Hyun Min Kang, Jae Hoon Sul, Susan K Service, et al. Variance component model to account for sample structure in genome-wide association studies. Nat Genet, 42(4):348–354, mar 2010.
  • [5] Joelle Mbatchou, Leland Barnard, Joshua Backman, et al. Computationally efficient whole-genome regression for quantitative and binary traits. Nat Genet, 53(7):1097–1103, may 2021.
  • [6] Noah A Rosenberg and Magnus Nordborg. A general population-genetic model for the production by population structure of spurious genotype–phenotype associations in discrete, admixed or spatially distributed populations. Genetics, 173(3):1665–1678, jul 2006.
  • [7] Jian Yang, Noah A Zaitlen, Michael E Goddard, et al. Advantages and pitfalls in the application of mixed-model association methods. Nat Genet, 46(2):100–106, jan 2014.
  • [8] Jonathan K Pritchard, Matthew Stephens, and Peter Donnelly. Inference of population structure using multilocus genotype data. Genetics, 155(2):945–959, jun 2000.
  • [9] Jeffrey M. Wooldridge. Econometric Analysis of Cross Section and Panel Data, second edition. The MIT Press, 10 2010.
  • [10] Guido W Imbens and Jeffrey M Wooldridge. Recent developments in the econometrics of program evaluation. Journal of Economic Literature, 47(1):5–86, mar 2009.
  • [11] Laurence J. Howe, Michel G. Nivard, Tim T. Morris, et al. Within-sibship genome-wide association analyses decrease bias in estimates of direct genetic effects. Nat Genet, 54(5):581–592, may 2022.
  • [12] STEN WAHLUND. ZUSAMMENSETZUNG VON POPULATIONEN UND KORRELATIONSERSCHEINUNGEN VOM STANDPUNKT DER VERERBUNGSLEHRE AUS BETRACHTET. Hereditas, 11(1):65–106, jul 2010.
  • [13] Sewall Wright. ISOLATION BY DISTANCE. Genetics, 28(2):114–138, mar 1943.
  • [14] Subhashis Ghosal and Aad van der Vaart. Convergence rates of posterior distributions for noniid observations. Ann. Statist., 35(1), feb 2007.
  • [15] Subhashis Ghosal and Aad van der Vaart. Fundamentals of Nonparametric Bayesian Inference. Cambridge University Press, 06 2017.
  • [16] Ben Brumpton, Eleanor Sanderson, Karl Heilbron, et al. Avoiding dynastic, assortative mating, and population stratification biases in mendelian randomization through within-family analyses. Nat Commun, 11(1), jul 2020.
  • [17] Jeffrey M. Wooldridge. Two-way fixed effects, the two-way mundlak regression, and difference-in-differences estimators. SSRN Journal Electronic Journal, 2021.
  • [18] Jae Hoon Sul, Lana S. Martin, and Eleazar Eskin. Population structure in genetic studies: Confounding factors and mixed models. PLoS Genetics, 14(12):e1007309, dec 2018.
  • [19] Zachary R. McCaw, Thomas Colthurst, Taedong Yun, et al. DeepNull models non-linear covariate effects to improve phenotypic prediction and association power. Nat Commun, 13(1), jan 2022.
  • [20] Carl Veller and Graham Coop. Interpreting population and family-based genome-wide association studies in the presence of confounding. bioRxiv, feb 2023.
  • [21] The Tien Mai and Pierre Alquier. Understanding the population structure correction regression. In Proceedings of the 4th International Conference on Statistics: Theory and Applications. Avestia Publishing, aug 2022.
  • [22] Gil McVean. A genealogical interpretation of principal components analysis. PLoS Genetics, 5(10):e1000686, oct 2009.
  • [23] Xiuwen Zheng and Bruce S. Weir. Eigenanalysis of SNP data with an identity by descent interpretation. Theoretical Population Biology, 107:65–76, feb 2016.
  • [24] Bjarni J. Vilhjálmsson and Magnus Nordborg. The nature of confounding in genome-wide association studies. Nat Rev Genet, 14(1):1–2, nov 2012.
  • [25] Gabriel E. Hoffman. Correcting for population structure and kinship using the linear mixed model: Theory and extensions. PLoS ONE, 8(10):e75707, oct 2013.
  • [26] Gary Chamberlain. Multivariate regression models for panel data. Journal of Econometrics, 18(1):5–46, jan 1982.