This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Parsimonious and Efficient Likelihood Composition by Gibbs Sampling

Davide Ferrari and Guoqi Qian
Department of Mathematics and Statistics, University of Melbourne
Abstract

The traditional maximum likelihood estimator (MLE) is often of limited use in complex high-dimensional data due to the intractability of the underlying likelihood function. Maximum composite likelihood estimation (McLE) avoids full likelihood specification by combining a number of partial likelihood objects depending on small data subsets, thus enabling inference for complex data. A fundamental difficulty in making the McLE approach practicable is the selection from numerous candidate likelihood objects for constructing the composite likelihood function. In this paper, we propose a flexible Gibbs sampling scheme for optimal selection of sub-likelihood components. The sampled composite likelihood functions are shown to converge to the one maximally informative on the unknown parameters in equilibrium, since sub-likelihood objects are chosen with probability depending on the variance of the corresponding McLE. A penalized version of our method generates sparse likelihoods with a relatively small number of components when the data complexity is intense. Our algorithms are illustrated through numerical examples on simulated data as well as real genotype SNP data from a case-control study.

Keywords: Composite likelihood estimation, Gibbs sampling, Jackknife, Efficiency, Model Selection

1 Introduction

While maximum likelihood estimation plays a central role in statistical inference, today its application is challenged in a number of fields where modern technologies allow scientists to collect data in unprecedented size and complexity. These fields include genetics, biology, environmental research, meteorology and physics, to name a few. Two main issues arise when attempting to apply traditional maximum likelihood to high-dimensional or complex data. The first concerns modelling and model selection, since high-dimensional data typically imply complex models for which the full likelihood function is difficult, or impossible, to specify. The second relates to computing, when the full likelihood function is available but it is just too complex to be evaluated.

The above limitations of traditional maximum likelihood have motivated the development of composite likelihood methods, which avoid the issues from full likelihood maximization by combining a set of low-dimensional likelihood objects. ? was an early proponent of composite likelihood estimation in the context of data with spatial dependence; ? developed composite likelihood inference in its generality and systematically studied its properties. Over the years, composite likelihood methods have proved useful in a range of complex applications, including models for geostatistics, spatial extremes and statistical genetics. See ? for a comprehensive survey of composite likelihood theory and applications.

Let 𝑿=(X1,,Xd)T\boldsymbol{X}=(X_{1},\dots,X_{d})^{T} be a d×1d\times 1 random vector with pdf (or pmf) f(𝒙|𝜽0)f(\boldsymbol{x}|\boldsymbol{\theta}_{0}), where 𝜽0𝚯p\boldsymbol{\theta}_{0}\in\boldsymbol{\Theta}\subseteq\mathbb{R}^{p}, p1p\geq 1, is the unknown parameter. From independent observations 𝑿(1),,𝑿(n)\boldsymbol{X}^{(1)},\dots,\boldsymbol{X}^{(n)}, one typically computes the maximum likelihood estimator (MLE), 𝜽^mle\hat{\boldsymbol{\boldsymbol{\theta}}}_{mle}, by maximizing the likelihood function Lmle(𝜽)=i=1nf(𝑿(i)|𝜽)L_{mle}(\boldsymbol{\theta})=\prod_{i=1}^{n}f(\boldsymbol{X}^{(i)}|\boldsymbol{\theta}). Now suppose that complications in the dd-dimensional pdf (pmf) f(𝒙|𝜽)f(\boldsymbol{x}|\boldsymbol{\theta}) make it difficult to specify (or compute) Lmle(𝜽)L_{mle}(\boldsymbol{\theta}) as the data dimension dd grows, but it is relatively easy to specify (or compute) one-, two-,…, dimensional distributions up to some order for some functions of X1,,XdX_{1},\dots,X_{d}. One can then follow ? to estimate 𝜽\boldsymbol{\theta} by the maximum composite likelihood estimator (McLE), which maximizes the composite likelihood function:

Lcl(𝜽)=m=1MLm(𝜽),L_{cl}(\boldsymbol{\theta})=\prod_{m=1}^{M}L_{m}(\boldsymbol{\theta}), (1)

where each Lm(𝜽)L_{m}(\boldsymbol{\theta}) is a user-specified partial likelihood (or sub-likelihood) depending on marginal or conditional events on variables. For example, LmL_{m} can be defined using a marginal event {xm}\{x_{m}\} (marginal composite likelihood), pairs of variables such as {x1,x2}\{x_{1},x_{2}\} (pairwise likelihood), or conditional events like {x1,x2}|{x1}\{x_{1},x_{2}\}|\{x_{1}\} (conditional composite likelihood). For simplicity, we assume that 𝜽\boldsymbol{\theta} is common to all sub-likelihood components, so that any factorization based on a subset of {Lm(𝜽)\{L_{m}(\boldsymbol{\theta}), m=1,,M}m=1,\dots,M\} yields a valid objective function.

Although the composite likelihood approach provides a flexible framework with a sound theory for making inference about 𝜽\boldsymbol{\theta} in situations involving multivariate data, there exist at least two challenges hindering the efficiency improvement and feasible computing of McLE in applications. The first challenge lies with selecting the right sub-likelihood components for constructing an informative composite likelihood function. The current practice of keeping all plausible factors in (1) is not well justified in terms of efficiency relative to MLE, since inclusion of redundant factors can deteriorate dramatically the variance of the corresponding composite likelihood estimator (e.g., see ?). A better strategy would be to choose a subset of likelihood components which are maximally informative on 𝜽0\boldsymbol{\theta}_{0}, and drop noisy or redundant components to the maximum extent. However, little work is found in the literature in regard to optimal selection of sub-likelihood components. The second challenge lies with the computational complexity involved in the maximization of Lcl(𝜽)L_{cl}(\boldsymbol{\theta}), which can go quickly out of reach as dd (and MM) increases. Particularly, note that computing LclL_{cl} involves M×Nops(dcl)M\times N_{ops}(d_{cl}) operations, where Nops(dcl)N_{ops}(d_{cl}) is the number of operations for each sub-likelihood component. The computational burden is exacerbated when 𝚯\boldsymbol{\Theta} is relatively large and the different sub-likelihood factors Lm(𝜽)L_{m}(\boldsymbol{\theta}) do not depend on distinct elements of 𝜽\boldsymbol{\theta}. One would like to see this computing complexity be alleviated to a manageable level by applying parsimonious likelihood composition in the presence of high (or ultra-high) dimensional data.

Motivated by the aforementioned challenges, in this paper we propose a new class of stochastic selection algorithms for optimal likelihood composition. Our method uses Gibbs sampler, a specific Markov Chain Monte Carlo (MCMC) sampling scheme to generate informative – yet parsimonious – solutions. The resulting estimates will converge to the one maximally informative on the target parameter 𝜽0\boldsymbol{\theta}_{0} as the underlying Markov chain reaches equilibrium. This is because sub-likelihood components generated in the algorithms are drawn according to probabilities determined by a criterion minimizing the McLE’s variance or its consistent approximation. Theory of unbiased estimating equations prescribes McLE’s asymptotic variance as an optimality criterion, i.e., the OFO_{F}-optimality criterion, see ?, but such an ideal objective has scarce practical utility due to the mathematical complexity and computational cost involved in evaluating the numerous covariance terms implied by the asymptotic variance expression (?). To address this issue, we replace the asymptotic variance by a rather inexpensive jackknife variance estimator computed by a one-step Newton-Raphson iteration. Such a replacement is shown to work very well based on our numerical study.

Another advantage of our approach is that proper use of the Gibbs sampler can generate sparse composition rules, i.e., composite likelihoods involving a relatively small number of informative sub-likelihood components. Note that the model space implied by the set of all available sub-likelihood components can be large, even when dd is moderate. For example, if d=20d=20, we have 2M=2(d2)=21902^{M}=2^{{d}\choose{2}}=2^{190} possible composition rules based on pair-wise likelihood components alone. To cope with such a high-dimensionality, we combine Gibbs sampling with a composite likelihood stochastic stability mechanism. Analogously to the stochastic stability selection proposed by ? in the context of high-dimensional model selection, our approach selects a small but sufficient number of informative likelihood components through the control of the error rates of false discoveries.

The paper is organized as follows: In Section 2, we outline the main framework and basic concepts related to composite likelihood estimation. We also describe the OFO_{F}-optimality criterion and introduce its jackknife approximation. In Section 3, we describe our core algorithm for simultaneous likelihood estimation and selection. In Section 4, we discuss an extension of our algorithm by incorporating the ideas of model complexity penalty and stochastic stability selection. This leads to a second algorithm for parsimonious likelihoods composition for high-dimensional data. In Section 5, we illustrate our methods through numerical examples involving simulated data and real genetic single nucleotide polymorphism (SNP) data from a breast cancer case-control study. Section 6 concludes the paper with final remarks.

2 Sparse Composite Likelihood Functions

2.1 Binary Likelihood Composition

Let {𝒜1,,𝒜M}\{\mathcal{A}_{1},\dots,\mathcal{A}_{M}\} be a set of marginal or conditional sample sub-spaces associated with probability density functions (pdfs) fm(𝒙𝒜m|𝜽)f_{m}(\boldsymbol{x}\in\mathcal{A}_{m}|\boldsymbol{\theta}). See ? for interpretation and examples of 𝒜m\mathcal{A}_{m}. Given independent dd-dimensional observations 𝑿(1),,𝑿(n)f(𝒙|𝜽0)\boldsymbol{X}^{(1)},\dots,\boldsymbol{X}^{(n)}\sim f(\boldsymbol{x}|\boldsymbol{\theta}_{0}), 𝜽0𝚯p\boldsymbol{\theta}_{0}\in\boldsymbol{\Theta}\subseteq\mathbb{R}^{p}, p1p\geq 1, we define the composite log-likelihood function:

cl(𝜽,𝝎)=m=1Mωmm(𝜽),\displaystyle\ell_{cl}(\boldsymbol{\theta},\boldsymbol{\omega})=\sum_{m=1}^{M}\omega_{m}\ell_{m}(\boldsymbol{\theta}), (2)

where 𝝎=(ω1,,ωM)T𝛀={0,1}M\boldsymbol{\omega}=(\omega_{1},\dots,\omega_{M})^{T}\in\boldsymbol{\Omega}=\{0,1\}^{M}, and m(𝜽)\ell_{m}(\boldsymbol{\theta}) is the partial log-likelihood

m(𝜽)=i=1nlogfm(𝑿(i)𝒜m|𝜽).\displaystyle\ell_{m}(\boldsymbol{\theta})=\sum_{i=1}^{n}\log f_{m}(\boldsymbol{X}^{(i)}\in\mathcal{A}_{m}|\boldsymbol{\theta}). (3)

Each partial likelihood object (sub-likelihood) m(𝜽)\ell_{m}(\boldsymbol{\theta}) is allowed to be selected or not, depending on whether ωm\omega_{m} is 11 or 0. We aim to approximate the unknown complete log-likelihood function mle(𝜽)=logLmle(𝜽)\ell_{mle}(\boldsymbol{\theta})=\log L_{mle}(\boldsymbol{\theta}) by selecting a few – yet the most informative – sub-likelihood objects from the MM available objects, where MM is allowed to be larger than nn and pp. Given the composition rule 𝝎𝛀\boldsymbol{\omega}\in\boldsymbol{\Omega}, the maximum composite likelihood estimator (McLE), denoted by 𝜽^(𝝎)\hat{\boldsymbol{\boldsymbol{\theta}}}(\boldsymbol{\omega}), is defined by the solution of the following system of estimating equations:

𝟎=i=1n𝑼(i)(𝜽,𝝎):=i=1nm=1Mωm𝑼m(i)(𝜽),\displaystyle\mathbf{0}=\sum_{i=1}^{n}\boldsymbol{U}^{(i)}(\boldsymbol{\theta},\boldsymbol{\omega}):=\sum_{i=1}^{n}\sum_{m=1}^{M}\omega_{m}\boldsymbol{U}^{(i)}_{m}(\boldsymbol{\theta}), (4)

where 𝑼m(i)(𝜽)=𝜽logfm(𝑿(i)𝒜m|𝜽)\boldsymbol{U}^{(i)}_{m}(\boldsymbol{\theta})=\nabla_{\boldsymbol{\theta}}\log f_{m}(\boldsymbol{X}^{(i)}\in\mathcal{A}_{m}|\boldsymbol{\theta}), m=1,,Mm=1,\dots,M, i=1,,ni=1,\dots,n, are unbiased partial scores functions. Under standard regularity conditions on the sub-likelihoods (?), n(𝜽^(𝝎)𝜽0(𝝎))𝒟Np(𝟎,𝐕0)\sqrt{n}(\hat{\boldsymbol{\boldsymbol{\theta}}}(\boldsymbol{\omega})-\boldsymbol{\theta}_{0}(\boldsymbol{\omega}))\overset{\mathcal{D}}{\rightarrow}N_{p}(\mathbf{0},\mathbf{V}_{0}) with asymptotic variance given by the p×pp\times p matrix

𝐕0(𝝎)=𝐕(𝜽0,𝝎)=𝐇(𝜽0,𝝎)1𝐊(𝜽0,𝝎)𝐇(𝜽0,𝝎)1,\displaystyle\mathbf{V}_{0}(\boldsymbol{\omega})=\mathbf{V}(\boldsymbol{\theta}_{0},\boldsymbol{\omega})=\mathbf{H}(\boldsymbol{\theta}_{0},\boldsymbol{\omega})^{-1}\mathbf{K}(\boldsymbol{\theta}_{0},\boldsymbol{\omega})\mathbf{H}(\boldsymbol{\theta}_{0},\boldsymbol{\omega})^{-1}, (5)

where

𝐇(𝜽,𝝎)=m=1Mωm𝐇m(𝜽),𝐊(𝜽,𝝎)=Var[m=1Mωm𝑼m(𝜽)],\mathbf{H}(\boldsymbol{\theta},\boldsymbol{\omega})=\sum_{m=1}^{M}\omega_{m}\mathbf{H}_{m}(\boldsymbol{\theta}),\ \ \mathbf{K}(\boldsymbol{\theta},\boldsymbol{\omega})=Var\left[\sum_{m=1}^{M}\omega_{m}\boldsymbol{U}_{m}(\boldsymbol{\theta})\right],

𝐇m(𝜽)=Var(𝑼m(𝜽))\mathbf{H}_{m}(\boldsymbol{\theta})=Var(\boldsymbol{U}_{m}(\boldsymbol{\theta})) is the p×pp\times p sensitivity matrix for the mmth component, and 𝑼m(𝜽)=𝜽logfm(𝑿𝒜m|𝜽)\boldsymbol{U}_{m}(\boldsymbol{\theta})=\nabla_{\boldsymbol{\theta}}\log f_{m}(\boldsymbol{X}\in\mathcal{A}_{m}|\boldsymbol{\theta}).

The main approach followed here is to minimize, in some sense, the asymptotic variance, 𝐕0(𝝎)\mathbf{V}_{0}(\boldsymbol{\omega}). To this end, theory of unbiased estimating equations suggests that both matrices 𝐇\mathbf{H} and 𝐊\mathbf{K} should be considered in order to achieve this goal (e.g., see ?). On one hand, note that 𝐇\mathbf{H} measures the covariance between the composite likelihood score 𝑼(𝜽,𝝎)\boldsymbol{U}(\boldsymbol{\theta},\boldsymbol{\omega})=m=1Mωm𝑼m(𝜽)=m=1Mωm𝜽logfm(𝑿𝒜m|𝜽)=\sum_{m=1}^{M}\omega_{m}\boldsymbol{U}_{m}(\boldsymbol{\theta})=\sum_{m=1}^{M}\omega_{m}\nabla_{\boldsymbol{\theta}}\log f_{m}(\boldsymbol{X}\in\mathcal{A}_{m}|\boldsymbol{\theta}) and the MLE score 𝑼mle(𝜽)=𝜽logf(𝑿;𝜽)\boldsymbol{U}_{mle}(\boldsymbol{\theta})=\nabla_{\boldsymbol{\theta}}\log f(\boldsymbol{X};\boldsymbol{\theta}). To see this, differentiate both sides in E[𝑼(𝜽0,𝝎)]=𝟎E[\boldsymbol{U}(\boldsymbol{\theta}_{0},\boldsymbol{\omega})]=\mathbf{0} and obtain 𝐇(𝜽0,𝝎)=E[𝑼mle(𝜽0)𝑼(𝜽0,𝝎)T].\mathbf{H}(\boldsymbol{\theta}_{0},\boldsymbol{\omega})=E\left[\boldsymbol{U}_{mle}(\boldsymbol{\theta}_{0})\boldsymbol{U}(\boldsymbol{\theta}_{0},\boldsymbol{\omega})^{T}\right]. This shows that adding sub-likelihood components is desirable, since it increases the covariance with the full likelihood. On the other hand, including too many correlated sub-likelihoods components inflates the variance through the covariance terms in 𝐊(𝜽0,𝝎)\mathbf{K}(\boldsymbol{\theta}_{0},\boldsymbol{\omega}).

2.2 Fixed-Sample Optimality and its Jackknife Approximation

The objective of minimizing the asymptotic variance is still undefined since 𝐕(𝜽0,𝝎)\mathbf{V}(\boldsymbol{\theta}_{0},\boldsymbol{\omega}) in (5) is a p×pp\times p positive semidefinite matrix. Therefore, we consider the following one-dimensional objective function

g0(𝝎)=logdet{𝐕(𝜽0,𝝎)}=logdet{𝐊(𝜽0,𝝎)}2logdet{𝐇(𝜽0,𝝎)}.\displaystyle g_{0}(\boldsymbol{\omega})=\log\det\{\mathbf{V}(\boldsymbol{\theta}_{0},\boldsymbol{\omega})\}=\log\det\{\mathbf{K}(\boldsymbol{\theta}_{0},\boldsymbol{\omega})\}-2\log\det\{\mathbf{H}(\boldsymbol{\theta}_{0},\boldsymbol{\omega})\}. (6)

The minimizer, 𝝎0\boldsymbol{\omega}_{0}, of the ideal objective (6) corresponds to the OFO_{F}-optimal solution (fixed-sample optimality) (e.g., see ?, Chapter 2). Clearly, such a program still lacks practical relevance, since (6) depends on the unknown parameter 𝜽0\boldsymbol{\theta}_{0}. Therefore, g0(𝝎)g_{0}(\boldsymbol{\omega}) should be replaced by some sample-based estimate, say g^0(𝝎)\hat{g}_{0}(\boldsymbol{\omega}).

One option is to use the following consistent estimates of 𝐇(𝜽0,𝝎)\mathbf{H}(\boldsymbol{\theta}_{0},\boldsymbol{\omega}) and 𝐊(𝜽0,𝝎)\mathbf{K}(\boldsymbol{\theta}_{0},\boldsymbol{\omega}) in (6):

𝐇^(𝝎)=1n1m=1Mi=1n𝝎m𝑼m(i)(𝜽^)𝑼m(i)(𝜽^)T,𝐊^(𝝎)=1ni=1n𝑼(i)(𝜽^,𝝎)𝑼(i)(𝜽^,𝝎)T,\displaystyle\hat{\mathbf{H}}(\boldsymbol{\omega})=\dfrac{1}{n-1}\sum_{m=1}^{M}\sum_{i=1}^{n}\boldsymbol{\omega}_{m}\boldsymbol{U}^{(i)}_{m}(\hat{\boldsymbol{\boldsymbol{\theta}}})\boldsymbol{U}^{(i)}_{m}(\hat{\boldsymbol{\boldsymbol{\theta}}})^{T},\ \ \hat{\mathbf{K}}(\boldsymbol{\omega})=\dfrac{1}{n}\sum_{i=1}^{n}\boldsymbol{U}^{(i)}(\hat{\boldsymbol{\boldsymbol{\theta}}},\boldsymbol{\omega})\boldsymbol{U}^{(i)}(\hat{\boldsymbol{\boldsymbol{\theta}}},\boldsymbol{\omega})^{T}, (7)

where the estimator 𝜽^=𝜽^(𝝎)\hat{\boldsymbol{\boldsymbol{\theta}}}=\hat{\boldsymbol{\boldsymbol{\theta}}}(\boldsymbol{\omega}) is the McLE. Although this strategy works in simple models when nn is relatively large and MM is small, the estimator 𝐊^(𝝎)\hat{\mathbf{K}}(\boldsymbol{\omega}) is knowingly unstable when nn is small compared to dim(𝚯)\dim(\boldsymbol{\Theta}) (?). Another issue with this approach in high-dimensional datasets is that the number of operations required to compute 𝐊^(𝝎)\hat{\mathbf{K}}(\boldsymbol{\omega}) (or 𝐊(𝜽,𝝎)\mathbf{K}(\boldsymbol{\theta},\boldsymbol{\omega})) increases quadratically in m=1Mωm\sum_{m=1}^{M}\omega_{m}.

To reduce the computational burden and avoid numerical instabilities, we estimate g0g_{0} by the following one-step jackknife criterion:

g^(𝝎)=logdet{i=1n(𝜽^(𝝎)(i)𝜽¯(𝝎))(𝜽^(𝝎)(i)𝜽¯(𝝎))T},\hat{g}(\boldsymbol{\omega})=\log\det\left\{\sum_{i=1}^{n}\left(\hat{\boldsymbol{\boldsymbol{\theta}}}(\boldsymbol{\omega})^{(-i)}-\overline{\boldsymbol{\theta}}(\boldsymbol{\omega})\right)\left(\hat{\boldsymbol{\boldsymbol{\theta}}}(\boldsymbol{\omega})^{(-i)}-\overline{\boldsymbol{\theta}}(\boldsymbol{\omega})\right)^{T}\right\}, (8)

where the pseudo-value 𝜽^(𝝎)(i)\hat{\boldsymbol{\boldsymbol{\theta}}}(\boldsymbol{\omega})^{(-i)} is a composite likelihood estimator based on a sample without observation 𝑿(i)\boldsymbol{X}^{(i)}, and 𝜽¯(𝝎)=i=1n𝜽^(𝝎)(i)/n\overline{\boldsymbol{\theta}}(\boldsymbol{\omega})=\sum_{i=1}^{n}\hat{\boldsymbol{\boldsymbol{\theta}}}(\boldsymbol{\omega})^{(-i)}/n. Alternatively, one could use the delete-kk jackknife estimate where k>1k>1 observations at the time are deleted to compute the pseudo-values. The delete-kk version is computationally cheaper than delete-11 jackknife and therefore should be preferred when the sample size, nn, is moderate or large. Other approaches – including bootstrap – should be considered depending on the model set up. For example, block re-sampling techniques such as the block-bootstrap (see ? and subsequent papers) are viable options for spatial data and time series.

When the sub-likelihood scores are in closed form, the pseudo-values can be efficiently approximated by the following one-step Newton-Raphson iteration:

𝜽^(𝝎)(i)=𝜽~+(m=1Mωmji𝑼m(j)(𝜽~)𝑼m(j)(𝜽~)T)1(m=1Mωmji𝑼m(j)(𝜽~)),\hat{\boldsymbol{\boldsymbol{\theta}}}(\boldsymbol{\omega})^{(-i)}=\tilde{\boldsymbol{\theta}}+\left(\sum_{m=1}^{M}\omega_{m}\sum_{j\neq i}\boldsymbol{U}_{m}^{(j)}(\tilde{\boldsymbol{\theta}})\boldsymbol{U}_{m}^{(j)}(\tilde{\boldsymbol{\theta}})^{T}\right)^{-1}\left(\sum_{m=1}^{M}\omega_{m}\sum_{j\neq i}\boldsymbol{U}_{m}^{(j)}(\tilde{\boldsymbol{\theta}})\right), (9)

where 𝜽~\tilde{\boldsymbol{\theta}} is any root-nn consistent estimator of 𝜽\boldsymbol{\theta}. Note that 𝜽~\tilde{\boldsymbol{\theta}} does not need to coincide with the McLE, 𝜽^(𝝎)\hat{\boldsymbol{\boldsymbol{\theta}}}(\boldsymbol{\omega}), so a computationally cheap initial estimator – based only on a few small sub-likelihoods subset – may be considered. Remarkably, the number of operations required in the Newton-Raphson iteration (9) grows linearly in the number of sub-likelihood components, so that the one-step jackknife objective has computational complexity comparable to a single evaluation of the composite likelihood function (4). Large sample properties of jackknife estimator of McLE’s asympotic variance can be derived under regularity conditions on 𝑼m\boldsymbol{U}_{m} and 𝐇m\mathbf{H}_{m} analogous to those described in ?. Then, for the one-step jackknife using the root-nn consistent starting point 𝜽~\tilde{\boldsymbol{\theta}}, we have n(g^(𝝎)g0(𝝎))𝑝0,asn.n\left(\hat{g}(\boldsymbol{\omega})-g_{0}(\boldsymbol{\omega})\right)\overset{p}{\rightarrow}0,\ \ \text{as}\ \ n\rightarrow\infty. uniformly on 𝛀\boldsymbol{\Omega}. Moreover, the estimator g^(𝝎)\hat{g}(\boldsymbol{\omega}) is asymptotically equivalent to the classic jackknife estimator.

3 Parameter estimation and likelihood composition

3.1 Likelihood Composition via Gibbs Sampling

When computing the McLE of 𝜽\boldsymbol{\theta}, our objective is to find an optimal binary vector 𝝎\boldsymbol{\omega}^{\ast} to estimate 𝝎0=argmin𝝎Ωg0(𝝎),\boldsymbol{\omega}_{0}={\text{argmin}}_{\boldsymbol{\omega}\in\Omega}\ g_{0}(\boldsymbol{\omega}), where g0()g_{0}(\cdot) is the ideal objective function defined in (6). Typically, the population quantity g0g_{0} cannot be directly assessed, so we replace g0g_{0} with the sample-based jacknife estimate g^\hat{g} described in Section 2 and aim at finding

𝝎=argmin𝝎Ωg^(𝝎).\boldsymbol{\omega}^{\ast}=\underset{\boldsymbol{\omega}\in\Omega}{\text{argmin}}\ \hat{g}(\boldsymbol{\omega}).

This task, however, is computationally infeasible through enumerating the space 𝛀\boldsymbol{\Omega} if dd is even moderately large. For example, for composite likelihoods defined based on all pairs of variables (Xs,Xt)(X_{s},X_{t}), 1s<td1\leq s<t\leq d (pairwise likelihood), 𝛀\boldsymbol{\Omega} contains 2(d2)=21902^{{d}\choose{2}}=2^{190} elements when d=20d=20.

To overcome this enumeration complexity, we carry out a random search method based on Gibbs sampling. We regard the weight vector 𝝎\boldsymbol{\omega} as a random vector following the joint probability mass function (pmf)

πτ(𝝎)=1Z(τ)exp{τg^(𝝎)},𝝎𝛀,\pi_{\tau}(\boldsymbol{\omega})=\dfrac{1}{Z(\tau)}\exp\left\{-\tau\hat{g}(\boldsymbol{\omega})\right\},\ \ \boldsymbol{\omega}\in\boldsymbol{\Omega}, (10)

where Z(τ)=𝝎𝛀exp{τg^(𝝎)}Z(\tau)=\sum_{\boldsymbol{\omega}\in\boldsymbol{\Omega}}\exp\{-\tau\hat{g}(\boldsymbol{\omega})\} is the normalizing constant. The above distribution depends on the tuning parameter τ>0\tau>0, which controls the extent to which we emphasize larger probabilities (and reduce smaller probabilities) on 𝛀\boldsymbol{\Omega}. Then 𝝎\boldsymbol{\omega}^{*} is also the mode of πτ(𝝎)\pi_{\tau}(\boldsymbol{\omega}), meaning that 𝝎\boldsymbol{\omega}^{*} will have the highest probability to appear, and will be more likely to appear earlier rather than later, if a random sequence of 𝝎\boldsymbol{\omega} is to be generated from πτ(𝝎)\pi_{\tau}(\boldsymbol{\omega}).

Therefore, estimating 𝝎\boldsymbol{\omega}^{*} can be readily done based on the random sequence generated from πτ(𝝎)\pi_{\tau}(\boldsymbol{\omega}). But generating a random sample from πτ(𝝎)\pi_{\tau}(\boldsymbol{\omega}) directly is difficult because πτ(𝝎)\pi_{\tau}(\boldsymbol{\omega}) contains an intractable normalizing constant Z(τ)Z(\tau). Instead, we will generate a Markov chain using the product of all univariate conditional pmf’s with respect to πτ(𝝎)\pi_{\tau}(\boldsymbol{\omega}) as the transitional kernel. The stationary distribution of such a Markov chain can be proved to be πτ(𝝎)\pi_{\tau}(\boldsymbol{\omega}) (?, Chapter 10). Hence, the part of this Markov chain after reaching equilibrium can be regarded as a random sample from πτ(𝝎)\pi_{\tau}(\boldsymbol{\omega}) for most purposes. The MCMC method just described is in fact the so-called Gibbs sampling method. The key factor for the Gibbs sampling to work is all the univariate conditional probability distributions of the target distribution can be relatively easily simulated.

Let us write 𝝎=(ω1,,ωM)\boldsymbol{\omega}=(\omega_{1},\cdots,\omega_{M}); 𝝎m1:m2=(ωm1,ωm1+1,,ωm2)\boldsymbol{\omega}_{m_{1}:m_{2}}=(\omega_{m_{1}},\omega_{m_{1}+1},\cdots,\omega_{m_{2}}), if m1m2m_{1}\leq m_{2}, and 𝝎m1:m2=\boldsymbol{\omega}_{m_{1}:m_{2}}=\emptyset otherwise; and 𝝎m=(ω1,,ωm1,ωm+1,,ωM)\boldsymbol{\omega}_{-m}=(\omega_{1},\cdots,\omega_{m-1},\omega_{m+1},\cdots,\omega_{M}). Then it is easy to see that the conditional pmf of ωm\omega_{m} given 𝝎m\boldsymbol{\omega}_{-m}, m=1,,Mm=1,\cdots,M, is

πτ(ωm|𝝎m)=exp{τg^(𝝎)}exp{τg^(𝝎m[0])}+exp{τg^(𝝎m[1])},ωm=0,1,\pi_{\tau}(\omega_{m}|\boldsymbol{\omega}_{-m})=\frac{\exp\{-\tau\hat{g}(\boldsymbol{\omega})\}}{\exp\{-\tau\hat{g}(\boldsymbol{\omega}_{m}^{[0]})\}+\exp\{-\tau\hat{g}(\boldsymbol{\omega}_{m}^{[1]})\}},\quad\omega_{m}=0,1, (11)

where ωm[j]=(ω1:(m1),j,ω(m+1):M)\omega_{m}^{[j]}=(\omega_{1:(m-1)},j,\omega_{(m+1):M}), j=0,1j=0,1. Note that πτ(ωm|𝝎m)\pi_{\tau}(\omega_{m}|\boldsymbol{\omega}_{-m}) is simply a Bernoulli pmf and so it is easily generated. For each binary vector 𝝎\boldsymbol{\omega}, g^(𝝎)\hat{g}(\boldsymbol{\omega}) is computed using the one-step jackknife estimator described in Section 2. Therefore, the probability mass function πτ(𝝎)\pi_{\tau}(\boldsymbol{\omega}) is well defined for any τ>0\tau>0. On the other hand, we have shown that the Gibbs sampling can be used to generate a Markov chain from πτ(𝝎)\pi_{\tau}(\boldsymbol{\omega}), from which we can find a consistent estimator 𝝎^\hat{\boldsymbol{\omega}}^{*} for the mode 𝝎\boldsymbol{\omega}^{*} if 𝝎\boldsymbol{\omega}^{*} is unique. Consequently, the universally maximum composite log-likelihood estimator of 𝜽\boldsymbol{\theta} can be approximated by the McLE, 𝜽^(𝝎^)\hat{\boldsymbol{\boldsymbol{\theta}}}({\hat{\boldsymbol{\omega}}}^{*}).

Note that the mode 𝝎\boldsymbol{\omega}^{*} is not necessarily unique. In this case one can still consider consistent estimation for 𝝎\boldsymbol{\omega}^{*} but in the following meaning. We know g^(𝝎)\hat{g}(\boldsymbol{\omega}^{*}) is the minimum and always unique according to its definition. Thus g^(𝝎)\hat{g}(\boldsymbol{\omega}^{*}) can be consistently and uniquely estimated based on the Markov chain of g^(𝝎)\hat{g}(\boldsymbol{\omega}) induced from the Markov chain of 𝝎\boldsymbol{\omega} generated from πτ(𝝎)\pi_{\tau}(\boldsymbol{\omega}) when the length of the Markov chain goes to infinity. Therefore, any estimator 𝝎^\hat{\boldsymbol{\omega}}^{*} such that g^(𝝎^)\hat{g}(\hat{\boldsymbol{\omega}}^{*}) becomes consistent with g^(𝝎)\hat{g}(\boldsymbol{\omega}^{*}) can be regarded as a consistent estimator of 𝝎\boldsymbol{\omega}^{*}. Consequently, the McLE 𝜽^(𝝎^)\hat{\boldsymbol{\boldsymbol{\theta}}}({\hat{\boldsymbol{\omega}}}^{*}) for each such 𝝎^\hat{\boldsymbol{\omega}}^{*} is still the universally McLE but not necessarily unique. From a practitioner’s view point there is no need to identify all consistent estimators of 𝝎\boldsymbol{\omega}^{*} and all universally McLE of 𝜽\boldsymbol{\theta}. Finding out one such 𝝎^\hat{\boldsymbol{\omega}}^{*} or a tight superset of it would be a sufficient advance in parsimonious and efficient likelihood composition.

3.2 Algorithm 1: MCMC Composite Likelihood Selection (MCMC-CLS)

The above discussion motivates the steps of our core Gibbs sampling algorithm for simultaneous composite likelihood estimation and selection. Let τ\tau be given and fixed.

  1. 0.

    For t=0t=0, choose an initial binary vector of weights 𝝎(0)\boldsymbol{\omega}^{(0)} and compute the one-step jackknife estimator g^(𝝎(0))\hat{g}(\boldsymbol{\omega}^{(0)}). E.g. randomly set 5 elements of 𝝎(0)\boldsymbol{\omega}^{(0)} to 1 and the rest to 0.

  2. 1.

    For each t=1,,Tt=1,\cdots,T for a given TT, obtain 𝝎(t)\boldsymbol{\omega}^{(t)} by repeating 1.11.1 to 1.41.4 for each m=1,,Mm=1,\cdots,M sequentially.

    1. 1.1.

      Compute, if not available yet, g^(ω1:(m1)(t),j,ω(m+1):M(t1))\hat{g}(\omega_{1:(m-1)}^{(t)},j,\omega_{(m+1):M}^{(t-1)}), j=0,1j=0,1.

    2. 1.2.

      Compute the conditional pmf of ωm\omega_{m} given (ω1:(m1)(t),ω(m+1):M(t1))(\omega_{1:(m-1)}^{(t)},\omega_{(m+1):M}^{(t-1)}):

      πτ(ωm=j|ω1:(m1)(t),ω(m+1):M(t1))exp{τg^(ω1:(m1)(t),j,ω(m+1):M(t1))}\pi_{\tau}(\omega_{m}=j|\omega_{1:(m-1)}^{(t)},\omega_{(m+1):M}^{(t-1)})\propto\exp\{-\tau\hat{g}(\omega_{1:(m-1)}^{(t)},j,\omega_{(m+1):M}^{(t-1)})\} (12)

      where j=0,1j=0,1. Note this is a Bernoulli pmf.

    3. 1.3.

      Generate a random number from the Bernoulli pmf obtained in 1.21.2, and denote the result as ωm(t)\omega_{m}^{(t)}.

    4. 1.4.

      Set 𝝎(t)(ω1:(m1)(t),ωm(t),ω(m+1):M(t1))T\boldsymbol{\omega}^{(t)}\leftarrow(\omega_{1:(m-1)}^{(t)},\omega_{m}^{(t)},\omega_{(m+1):M}^{(t-1)})^{T}. Also compute and record g^(𝝎(t))\hat{g}(\boldsymbol{\omega}^{(t)}).

  3. 2.

    Compute 𝝎^=argmin1tTg^(𝝎(t)),\hat{\boldsymbol{\omega}}^{*}=\arg\min_{1\leq t\leq T}\hat{g}(\boldsymbol{\omega}^{(t)}), and regard it as the estimate of 𝝎\boldsymbol{\omega}^{*}. Alternatively, column-combine 𝝎(1),,𝝎(T)\boldsymbol{\omega}^{(1)},\cdots,\boldsymbol{\omega}^{(T)} generated in Step 11 into an M×TM\times T matrix 𝐖^\hat{\mathbf{W}}; then compute row averages of 𝐖^\hat{\mathbf{W}}, say ω¯1,,ω¯M\overline{\omega}_{1},\cdots,\overline{\omega}_{M}, and set 𝝎~=(ω~1,,ω~M)\tilde{\boldsymbol{\omega}}^{*}=(\tilde{\omega}_{1}^{*},\cdots,\tilde{\omega}_{M}^{*}) where ω~m=1\tilde{\omega}_{m}^{*}=1, if ω¯mξ\overline{\omega}_{m}\geq\xi, and ω~m=0\tilde{\omega}_{m}^{*}=0 if ω¯m<ξ\overline{\omega}_{m}<\xi, where ξ\xi is some constant larger than 0.50.5.

  4. 3.

    Finally, compute 𝜽^(𝝎^)\hat{\boldsymbol{\boldsymbol{\theta}}}(\hat{\boldsymbol{\omega}}^{*}) (or 𝜽^(𝝎~)\hat{\boldsymbol{\boldsymbol{\theta}}}(\tilde{{\boldsymbol{\omega}}}^{*})) and g^(𝝎^)\hat{g}(\hat{\boldsymbol{\omega}}^{*}) (or g^(𝝎~)\hat{g}(\tilde{{\boldsymbol{\omega}}}^{*})).

Firstly, note that Gibbs sampling has been used in various contexts in the literature of model selection. ? used a similar strategy to generate the distribution of the variable indicators in Bayesian linear regression. ? used Gibbs sampler for selecting robust linear regression models. ? used the Gibbs sampler for selecting generalized linear regression models. ? and ? used Gibbs sampling for selection in the context of time series models. However, to our knowledge this is the first work proposing a general-purpose Gibbs sampler for construction of composite likelihoods.

Secondly, the sequence 𝝎(1),,𝝎(T)\boldsymbol{\omega}^{(1)},\dots,\boldsymbol{\omega}^{(T)} is a Markov chain, which requires an initial vector 𝝎(0)\boldsymbol{\omega}^{(0)} and a burn-in period to be in equilibrium. Values of 𝝎(0)\boldsymbol{\omega}^{(0)} do not affect the eventual attainment of equilibrium so can be arbitrarily chosen. From a computational point of view most components of 𝝎(0)\boldsymbol{\omega}^{(0)} should be set to 0 to reduce computing load. For example, we can randomly set all but 5 of the components to 0. To assess whether the chain has reached equilibrium, we suggest the control-chart method discussed in ?. For the random variable g^(𝝎)\hat{g}({\boldsymbol{\omega}}), we have the following probability inequality for any given b>1b>1:

Pr(g^(𝝎)ming^(𝝎)bVar[g^(𝝎)]+(E[g^(𝝎)]ming^(𝝎))2)1b2Pr\left(\hat{g}(\boldsymbol{\omega})-\min\hat{g}(\boldsymbol{\omega})\geq b\sqrt{Var[\hat{g}(\boldsymbol{\omega})]+(E[\hat{g}(\boldsymbol{\omega})]-\min\hat{g}(\boldsymbol{\omega}))^{2}}\right)\leq\dfrac{1}{b^{2}}

This inequality can be used to find an upper control limit for g^(𝝎)\hat{g}({\boldsymbol{\omega}}). For example, by setting b=10b=\sqrt{10}, an at least 90%90\% upper control limit for g^(𝝎)\hat{g}({\boldsymbol{\omega}}) can be estimated as g^+10s2+10(g¯g^)2,\hat{g}^{\ast}+\sqrt{10s^{2}+10(\overline{g}-\hat{g}^{\ast})^{2}}, where g^\hat{g}^{\ast}, g¯\overline{g} and s2s^{2} are the minimum, sample mean and sample variance based on the first NN observations, g^(𝝎(1)),,g^(𝝎(N))\hat{g}(\boldsymbol{\omega}^{(1)}),\dots,\hat{g}(\boldsymbol{\omega}^{(N)}), N<TN<T, where typically we set N=T/2N=\lfloor T/2\rfloor. We then count the number of observations passing the upper control limit in the remaining sample g^(𝝎(N+1)),,g^(𝝎(T))\hat{g}(\boldsymbol{\omega}^{(N+1)}),\dots,\hat{g}(\boldsymbol{\omega}^{(T)}). If more than 10% of the points are above the control limit, then at a significance level not more than 90% there is statistical evidence against equilibrium. Upper control limits of different levels for g^(𝝎)\hat{g}({\boldsymbol{\omega}}) can be similarly calculated and interpreted by choosing different values of bb.

Thirdly, 𝝎^\hat{\boldsymbol{\omega}}^{*} computed in Step 2 is simply a sample mode of πτ(𝝎)\pi_{\tau}(\boldsymbol{\omega}) based on its definition in (10). Hence, by the Ergodic Theorem for stationary Markov chain, 𝝎^\hat{\boldsymbol{\omega}}^{*} is a strongly consistent estimator of 𝝎\boldsymbol{\omega} under minimal regularity conditions. With similar arguments ω¯m\overline{\omega}_{m} is a strongly consistent estimator of the success probability involved in the marginal distribution of ωm\omega_{m} induced from πτ(𝝎)\pi_{\tau}(\boldsymbol{\omega}). Hence, it is not difficult to see that the resultant estimator 𝝎~\tilde{\boldsymbol{\omega}}^{*} should satisfy ω~mωm\tilde{\omega}_{m}^{*}\geq\omega_{m}^{*} without requiring TT to be very large. Propositions 1 and 2 in ? provide an exposition of this property. Therefore, the estimator 𝝎~\tilde{\boldsymbol{\omega}}^{*} captures all informative sub-likelihood components with high probability.

Finally, the tuning constant τ\tau adjusts the mixing behavior of the chain, which has important consequences on the exploration/exploitation trade-off on the search space 𝛀\boldsymbol{\Omega}. If τ\tau is too small, the algorithm produces solutions approaching the global optimal value 𝝎\boldsymbol{\omega}^{\ast} slowly; if τ\tau is large, then the algorithm finds local optima and may not reach 𝝎\boldsymbol{\omega}^{\ast}. The former behavior corresponds to a rapidly mixing chain, while the latter occurs when the chain is mixing too slowly. In the composite likelihood selection setting, the main hurdle is the computational cost, so τ\tau should be set according to the available computing capacity, after running some graphical or numerical diagnostics (e.g., see ?). We choose to use τ=d\tau=d in our empirical study, which does not seem to create adverse effects.

4 An extension for high-dimensional data

4.1 Sparsity-enforcing penalization

Without additional modifications, Algorithm 1 ignores the likelihood complexity, since solutions with many sub-likelihoods have in principle the same chance to occur as those with fewer components. To discourage selection of overly complex likelihoods, we augment the Gibbs distribution (10) as follows:

πτ,λ(𝝎)=Z(τ,λ)1exp{τg^λ(𝝎)},\pi_{\tau,\lambda}(\boldsymbol{\omega})=Z(\tau,\lambda)^{-1}\exp\{-\tau\hat{g}_{\lambda}(\boldsymbol{\omega})\}, (13)

where

g^λ(𝝎)=g^(𝝎)+pen(𝝎|λ),τ>0,λ>0,\hat{g}_{\lambda}(\boldsymbol{\omega})=\hat{g}(\boldsymbol{\omega})+{pen}(\boldsymbol{\omega}|\lambda),\quad\tau>0,\;\lambda>0, (14)

g^(𝝎)\hat{g}(\boldsymbol{\omega}) is the jackknifed variance objective defined in Section 2, Z(τ,λ)Z(\tau,\lambda) is the normalization constant, and pen(𝝎)pen(\boldsymbol{\omega}) is a complexity penalty enforcing sparse solutions when dim(𝛀)\dim(\boldsymbol{\Omega}) is large. Maximization of πτ,λ(𝝎)\pi_{\tau,\lambda}(\boldsymbol{\omega}) is interpreted as a maximum a posteriori estimation for 𝝎\boldsymbol{\omega}, where the probability distribution proportional to exp{pen(𝝎|λ)}\exp\{-pen(\boldsymbol{\omega}|\lambda)\} is regarded as a prior pmf over 𝛀\boldsymbol{\Omega}. In this paper, we use the penalty term of form pen(𝝎|λ)=λm=1Mωm\text{pen}(\boldsymbol{\omega}|\lambda)=\lambda\sum_{m=1}^{M}\omega_{m}, since it corresponds to well-established model-selection criteria. For example, choices λ=1\lambda=1, λ=21logn\lambda=2^{-1}\log n and λ=loglogn\lambda=\log\log n correspond to the AIC, BIC and HQC criteria, respectively (e.g., see ?). Other penalties could be considered as well depending on the model structure and available prior information; however, these are not shown to be crucial based on our empirical study so will not be explored in this paper.

4.2 Composite Likelihood Stability Selection

To find the optimal solution 𝝎\boldsymbol{\omega}^{\ast}, one could compute a sequence of optimal values 𝝎^λ1,,𝝎^λB\hat{\boldsymbol{\omega}}^{\ast}_{\lambda_{1}},\dots,\hat{\boldsymbol{\omega}}^{\ast}_{\lambda_{B}} and then take min1bBg^(𝝎^λb)\min_{1\leq b\leq B}\hat{g}(\hat{\boldsymbol{\omega}}^{\ast}_{\lambda_{b}}). There are, however, various issues in this approach: first, the globally optimal value 𝝎\boldsymbol{\omega}^{\ast} might not be a member of the set {𝝎^λb}b=1B\{\hat{\boldsymbol{\omega}}^{\ast}_{\lambda_{b}}\}_{b=1}^{B}, since the mode of πτ,λ(𝝎)\pi_{\tau,\lambda}(\boldsymbol{\omega}) is not necessarily the composite likelihood solution which minimizes g^(𝝎)\hat{g}(\boldsymbol{\omega}). Second, even if 𝝎\boldsymbol{\omega}^{\ast} is in such a set, determining λ\lambda is typically challenging. To address the above issues, we employ the idea of stability selection, introduced by ? in the context of variable selection for linear models. Given an arbitrary value for λ\lambda, stochastic stability exploits the variability of random samples generated from πτ,λ(𝝎)\pi_{\tau,\lambda}(\boldsymbol{\omega}) by the Gibbs procedure, say 𝝎λ(1),,𝝎λ(T)\boldsymbol{\omega}^{(1)}_{\lambda},\dots,\boldsymbol{\omega}^{(T)}_{\lambda} and choses all the partial likelihoods that occur in a large fraction of generated samples. For a given 0<ξ<10<\xi<1, we define the set of stable likelihoods by the vector 𝝎^stable\hat{\boldsymbol{\omega}}^{\text{stable}}, with elements

ω^mstable={1,if 1Tt=1Tωλ,m(t)ξ,0,otherwise,\hat{\omega}_{m}^{\text{stable}}=\left\{\begin{array}[]{ll}1,&\text{if }\ \ \dfrac{1}{T}\sum_{t=1}^{T}\omega_{\lambda,m}^{(t)}\geq\xi,\\ 0,&\text{otherwise},\end{array}\right. (15)

so we regard as stable those sub-likelihoods selected more frequently and disregard sub-likelihood items with low selection probabilities. Following ?, we choose the tuning constant ξ\xi using the following bound on the expected number of false selections, VV:

E(V)1(2ξ1)ηλM,E(V)\leq\dfrac{1}{(2\xi-1)}\dfrac{\eta_{\lambda}}{M}, (16)

where ηλ\eta_{\lambda} is average number of selected sub-likelihood components. In multiple testing, the quantity α=E(V)/M\alpha=E(V)/M is sometimes called the per-comparison error rate (PCER). By increasing ξ\xi, only few likelihood components are selected, so that we reduce the expected number of falsely selected variables. We choose the threshold ξ\xi by fixing the PCER at some desired value (e.g., α=0.10\alpha=0.10), and then choose ξ\xi corresponding to the desired error rate. The unknown quantity ηλ\eta_{\lambda} in our setting can be estimated by the average number of sub-likelihood components over TT Gibbs samples.

Finally, note that tuning ξ\xi according to (16) makes redundant the determination of the optimal λ\lambda value as long as pen(𝝎|λ)pen(\boldsymbol{\omega}|\lambda) in (14) is not dominant over g^(𝝎)\hat{g}({\boldsymbol{\omega}}). This is further supported by our empirical study where we found the effect of λ\lambda on 𝝎^stable\hat{\boldsymbol{\omega}}^{\text{stable}} is negligible.

4.3 Algorithm 2: MCMC Composite Likelihood with Stability Selection (MCMC-CLS2)

The preceding discussions lead to Algorithm 2 which is essentially the same as Algorithm 1 with two exceptions: (i) we replace g^\hat{g} in Algorithm 1 by the augmented objective function g^λ\hat{g}_{\lambda} defined in (14); (ii) Steps 2 in Algorithm 1 is replaced by the following stability selection step.

  • 22^{\prime}.

    Column-combine 𝝎(1),,𝝎(T)\boldsymbol{\omega}^{(1)},\cdots,\boldsymbol{\omega}^{(T)} generated in Step 11 into an M×TM\times T matrix 𝐖^\hat{\mathbf{W}}. Compute the row averages of 𝐖^\hat{\mathbf{W}}, denoted as (ω^1,,ω^M)(\hat{\omega}_{1},\cdots,\hat{\omega}_{M}), i.e. ω^m=T1t=1Tωm(t)\hat{\omega}_{m}=T^{-1}\sum_{t=1}^{T}\omega_{m}^{(t)}, m=1,,Mm=1,\cdots,M. Then set 𝝎^stable=(ω^1,,ω^M)\hat{\boldsymbol{\omega}}^{\text{stable}}=(\hat{\omega}_{1}^{*},\cdots,\hat{\omega}_{M}^{*}) where ω^m=1\hat{\omega}_{m}^{*}=1 if ω^mξ^λ\hat{\omega}_{m}\geq\hat{\xi}_{\lambda} and ω^m=0\hat{\omega}_{m}^{*}=0 if ω^m<ξ^λ\hat{\omega}_{m}<\hat{\xi}_{\lambda}, m=1,,Mm=1,\cdots,M, where ξ^λ=12(η^λαM2+1),\hat{\xi}_{\lambda}=\dfrac{1}{2}\left(\dfrac{\hat{\eta}_{\lambda}}{\alpha M^{2}}+1\right), and α\alpha is a nominal level for per-comparison error rate (e.g., 0.05 or 0.1).

The estimated threshold ξ^λ\hat{\xi}_{\lambda} is obtained from (16) by plugging-in η^λ=t=1Tm=1Mωm(t)/T\hat{\eta}_{\lambda}=\sum_{t=1}^{T}\sum_{m=1}^{M}\omega^{(t)}_{m}/T, the sample average of the numbers of selected sub-likelihood components in TT Gibbs samples.

5 Numerical Examples

5.1 Normal Variables with Common Location

Let 𝑿Nd(μ𝟏,𝚺)\boldsymbol{X}\sim N_{d}(\mu\bf 1,\mathbf{\Sigma}), where the parameter of interest is the common location μ\mu. We study the scenario where many components bring redundant information on μ\mu by considering covariance matrix 𝚺\mathbf{\Sigma} with elements {𝚺}mm=1\{\mathbf{\Sigma}\}_{mm}=1, for all 1md1\leq m\leq d, and off-diagonal elements {𝚺}lm=ρ>0\{\mathbf{\Sigma}\}_{lm}=\rho>0 if l,mdl,m\leq d^{\ast}, for some d<dd^{\ast}<d, while {𝚺}lm=0\{\mathbf{\Sigma}\}_{lm}=0 elsewhere.

We consider one-wise score composite likelihood estimator solving 0=m=1dωmUm(μ)0=\sum_{m=1}^{d}\omega_{m}U_{m}(\mu), where Um(μ)=i=1n(Xm(i)μ)U_{m}(\mu)=\sum_{i=1}^{n}(X^{(i)}_{m}-\mu). It is easy to find the composite likelihood estimator is μ^cl(𝝎)=m=1dωmX¯m/m=1dωm\widehat{\mu}_{cl}(\boldsymbol{\omega})={\sum_{m=1}^{d}\omega_{m}\overline{X}_{m}}/{\sum_{m=1}^{d}\omega_{m}} where X¯m=i=1nXm(i)/n\overline{X}_{m}=\sum_{i=1}^{n}X_{m}^{(i)}/n. For this simple model, the jackknife criterion g^()\hat{g}(\cdot) can be easily computed in closed form. The pseudo-values are μ^cl(i)(𝝎)=m=1dωmX¯m(i)/m=1dωm\widehat{\mu}^{(-i)}_{cl}(\boldsymbol{\omega})=\sum_{m=1}^{d}\omega_{m}\overline{X}^{(-i)}_{m}/\sum_{m=1}^{d}\omega_{m}, and the average of pseudo-values is μ^cl\widehat{\mu}_{cl}. It can be shown that

g^(𝝎)=logi=1n(m=1dωm(Xm(i)X¯m))22log(m=1dωm),\hat{g}(\boldsymbol{\omega})=\log\sum_{i=1}^{n}\left(\sum_{m=1}^{d}\omega_{m}(X_{m}^{(i)}-\overline{X}_{m})\right)^{2}-2\log\left(\sum_{m=1}^{d}\omega_{m}\right),

up to a constant not depending on 𝝎\boldsymbol{\omega}. It can also be shown that OFO_{F}-criterion has the following expression

g0(𝝎)=logVar(μ^cl(𝝎))log(m=1dωm+2ρl<mdωlωm)2log(m=1dωm),\displaystyle g_{0}(\boldsymbol{\omega})=\log Var\left(\widehat{\mu}_{cl}(\boldsymbol{\omega})\right)\propto\log\left(\sum_{m=1}^{d}\omega_{m}+2\rho\sum_{l<m\leq d^{\ast}}\omega_{l}\omega_{m}\right)-2\log\left(\sum_{m=1}^{d}\omega_{m}\right), (17)

depending on the unknown parameter ρ\rho. This suggests that including too many correlated (redundant) components (with ρ0\rho\neq 0) damages McLE’s variance as dd increases. Particularly, setting ωj=1\omega_{j}=1, for all j=1,,dj=1,\dots,d, implies Var(μ^cl(𝝎))=O(1)Var(\widehat{\mu}_{cl}(\boldsymbol{\omega}))=O(1), while choosing only uncorrelated sub-likelihoods (ωj=0\omega_{j}=0 if 2jd2\leq j\leq d^{\ast}, and ωj=1\omega_{j}=1 elsewhere), gives Var(μ^cl(𝝎))=O(d1)Var(\widehat{\mu}_{cl}(\boldsymbol{\omega}))=O(d^{-1}).

In Table 1, we show Monte Carlo simulation results from B=250B=250 runs of Algorithm 1 (MCMC-CLS1) using the two approaches described in Section 3.2: one consists of choosing the best weights vector (CLS1 min); the other uses thresholding, i.e. selects elements when the weights are selected with a sufficiently large frequency (CLS1 thres.). In the same table we also report results on the Algorithm 2 (MCMC-CLS2) based on the stochastic stability selection approach. Our algorithms are compared with the estimator including all one-wise sub-likelihood components (No selection) and the optimal maximum likelihood estimator (MLE). We compute the MLE of μ\mu in two ways: either based on using the known Σ\Sigma value or based on using the sample covariance Σ^=(n1)1i=1n(XiX¯)(XiX¯)T\hat{\Sigma}=(n-1)^{-1}\sum_{i=1}^{n}(X_{i}-\overline{X})(X_{i}-\overline{X})^{T}. Note that the latter is not available when d>nd>n. Note our algorithms are designed to obtain simultaneous estimation and dimension reduction in the presence of limited information (i.e. d>nd>n or dnd\gg n) where the optimal weights are difficult to estimate from the data. Stochastic selection with ξ=0.7\xi=0.7 and T=10dT=10d was carried out for samples of n=5,25n=5,25 and 100100 observations from a model with d=0.8×dd^{\ast}=0.8\times d correlated components, with d=10,30d=10,30. To compare the methods we computed Monte Carlo estimates of the variance (VarVar) and squared bias (Bias2Bias^{2}) of μ^cl(𝝎)\widehat{\mu}_{cl}(\boldsymbol{\omega}). Table 2 further reports the average number of selected likelihood components (no. comp).

For all considered data dimensions, our selection methods outperform the all one-wise likelihood (AOW or No selection) estimator and show relatively small losses in terms of mean squared error (Var+Bias2Var+Bias^{2}) compared to the MLEs. The gains in terms of variance reduction are particularly evident for larger data dimensions (e.g., see d=30d=30). At the same time, our method tends to select mostly uncorrelated sub-likelihoods, while discarding the redundant components which do not contribute useful information on μ\mu.

No selection CLS1 (min) CLS1 (thresh.) CLS2 MLE (known Σ\Sigma) MLE (unknown Σ\Sigma)
nn dd ρ\rho VarVar Bias2Bias^{2} VarVar Bias2Bias^{2} VarVar Bias2Bias^{2} VarVar Bias2Bias^{2} VarVar Bias2Bias^{2} VarVar Bias2Bias^{2}
5 10 0.50 80.26 7.19 79.18 0.04 81.36 0.00 87.55 0.01 52.52 0.04 NA NA
0.90 115.06 10.31 96.97 0.05 114.82 0.08 114.90 0.02 62.28 0.04 NA NA
30 0.50 80.08 7.18 60.61 1.60 54.06 1.18 52.16 0.74 26.22 0.03 NA NA
0.90 103.50 9.28 61.27 0.01 44.14 0.01 44.68 0.00 24.93 0.02 NA NA
25 10 0.50 17.14 8.28 15.27 0.18 17.45 0.06 17.73 0.06 12.01 0.03 20.99 0.03
0.90 19.89 1.78 14.75 0.07 18.65 0.03 19.08 0.02 13.02 0.02 21.64 0.04
30 0.50 11.87 1.06 7.38 0.12 6.14 0.00 6.07 0.00 4.86 0.00 NA NA
0.90 24.51 2.20 11.59 0.02 6.69 0.00 6.75 0.00 5.73 0.01 NA NA
100 10 0.50 4.14 0.37 3.09 0.02 3.55 0.03 4.24 0.09 2.56 0.02 2.79 0.02
0.90 6.18 0.55 3.25 0.00 4.58 0.00 4.58 0.00 3.12 0.00 3.37 0.00
30 0.50 3.40 0.30 1.79 0.00 1.60 0.00 1.60 0.00 1.20 0.00 1.69 0.00
0.90 5.54 0.50 2.60 0.01 1.68 0.01 1.68 0.01 1.41 0.01 2.03 0.01
Table 1: Bias and variance of estimators for the location model 𝑿Nd(μ𝟏,𝚺(ρ))\boldsymbol{X}\sim N_{d}(\mu\bf 1,\mathbf{\Sigma}(\rho)) by different methods. MCMC-CLS1 algorithm with and without thresholding (CLS1 (min) and MCMC-CLS1 (thresh.), respectively); MCMC-CLS2 algorithm with stability selection (CLS2) with α=0.1\alpha=0.1 and λ=1\lambda=1; Maximum likelihood estimator based on Σ1\Sigma^{-1} where Σ\Sigma is known; Maximum likelihood estimator based on Σ^1\hat{\Sigma}^{-1}, where Σ^\hat{\Sigma} can be estimated if d<nd<n, otherwise the value is missing and denoted by NA. For each method we show the finite sample variance (VarVar) and squared bias (Bias2Bias^{2}) for n=5,25,100n=5,25,100, d=10,30d=10,30 and ρ=0.5,0.9\rho=0.5,0.9. Estimates based on B=250B=250 Monte Carlo runs (simulation settings: τ=d\tau=d, T=10dT=10d , ξ=0.7\xi=0.7). Monte Carlo standard errors are smaller than 0.0010.001.
nn dd ρ\rho No selection CLS1 (min) CLS1 (thres.) CLS2
5 10 0.50 10 3.77 3.92 3.72
0.90 10 3.28 2.58 2.36
30 0.50 30 13.94 11.01 10.84
0.90 30 12.04 6.53 6.43
25 10 0.50 10 4.30 3.58 3.26
0.90 10 3.27 2.04 2.00
30 0.50 30 12.81 7.70 7.48
0.90 30 11.96 5.98 5.98
100 10 0.50 10 4.28 2.56 2.31
0.90 10 3.17 2.00 2.00
30 0.50 30 12.22 6.08 6.05
0.90 30 11.94 6.00 6.00
Table 2: Number of selected sub-likelihoods by different methods. MCMC-CLS1 algorithm with and without thresholding (CLS1 (min) and MCMC-CLS1 (thresh.), respectively); MCMC-CLS2 algorithm with stability selection (CLS2) with α=0.1\alpha=0.1 and λ=1\lambda=1. For each method we show results for n=5,25,100n=5,25,100, d=10,30d=10,30 and ρ=0.5,0.9\rho=0.5,0.9. Estimates based on B=250B=250 Monte Carlo runs (simulation settings: τ=d\tau=d, T=10dT=10d , ξ=0.7\xi=0.7). Monte Carlo standard errors are smaller than 0.010.01.

Next, we illustrate the selection procedure based on MCMC-CLS2 (Algorithm 2). We draw a random sample of n=50n=50 observations from the model Nd(μ𝟏,𝚺(ρ))N_{d}(\mu\bf 1,\mathbf{\Sigma}(\rho)) described above with d=250d=250, d=0.8dd^{\ast}=0.8d, and ρ=0.9\rho=0.9. This corresponds to 200 strongly redundant variables and 50 independent variables. We applied Algorithm 2 with α=0.1\alpha=0.1 and λ=1\lambda=1 (corresponding to the AIC penalty). In Figure 1 (b), we show the relative frequencies of the components, ω¯m=t=1Tωm(t)\overline{\omega}_{m}=\sum_{t=1}^{T}\omega^{(t)}_{m}, m=1,,250m=1,\dots,250, in T=1000T=1000 MCMC iterations. As expected, the uncorrelated sub-likelihood components (components 201–250) are sampled much more frequently than the redundant ones (components 1–200). Figure 1 (c) shows the objective function g^λ(𝝎^stable)\hat{g}_{\lambda}(\hat{\boldsymbol{\omega}}^{\text{stable}}) (up to a constant) evaluated at the best solution computed from past MCMC samples (ξ=0.7\xi=0.7). Figure 1 (d) shows the Hamming distance, dist(𝝎,𝝎)=jI(ωjωj)dist(\boldsymbol{\omega},\boldsymbol{\omega}^{\prime})=\sum_{j}I(\omega_{j}\neq\omega_{j}^{\prime}), between the current selected rule, 𝝎^\hat{\boldsymbol{\omega}}^{\ast} and true optimal value, 𝝎=(1,0,,0199,1,,150)\boldsymbol{\omega}^{\ast}=(1,\underbrace{0,\dots,0}_{199},\underbrace{1,\dots,1}_{50}). Overall, our algorithm quickly approaches the optimal solution and the final selection has 94.0% asymptotic relative efficiency (ARE) compared to MLE. When no stochastic selection is applied and all 250 sub-likelihood components are included, the relative efficiency is only 13%. As far as computing time is concerned, our non-optimized R implementations of the MCMC-CLS1 and MCMC-CLS2 algorithms for the above example takes, respectively 4.25 and 1.87 seconds per MCMC iteration, respectively. The computing time was measured on a laptop computer with Intel ®CoreTM{}^{\text{TM}} i7-2620M CPU 2.70GHz.

    (a)     (b)
Refer to caption Refer to caption
    (c)      (d)
Refer to caption Refer to caption
Figure 1: Stochastic selection for Model 1, Nd(μ𝟏,𝚺)N_{d}(\mu\bf 1,\mathbf{\Sigma}), based on Algorithm 2. (a) Objective function g^λ(𝝎)\hat{g}_{\lambda}(\boldsymbol{\omega}) evaluated at samples drawn from πτ,λ(𝝎)\pi_{\tau,\lambda}(\boldsymbol{\omega}); the horizontal solid line is the control chart limit as described in Section 3.2. (b) Observed frequency of sampled sub-likelihood components. (c) Estimated objective function evaluated at the progressively selected likelihood, 𝝎^stable\hat{\boldsymbol{\omega}}^{\text{stable}}, based on past samples. (d) Hamming distance between the progressively selected likelihood, 𝝎^stable\hat{\boldsymbol{\omega}}^{\text{stable}}, and the globally optimal solution, 𝝎\boldsymbol{\omega}^{\ast}. Simulation settings: n=50n=50, d=250d=250, d=200d^{\ast}=200 τ=d\tau=d, T=1000T=1000 , burn-in length = 250250, α=0.1\alpha=0.1, λ=1\lambda=1.

5.2 Exchangeable Normal Variables with Unknown Correlation

Let 𝑿Nd(𝟎,𝚺(ρ))\boldsymbol{X}\sim N_{d}(\bf 0,\mathbf{\Sigma}(\rho)), where 𝚺(ρ)={(1ρ)𝑰d+ρ𝟏d𝟏dT}\mathbf{\Sigma}(\rho)=\{(1-\rho)\boldsymbol{I}_{d}+\rho{\bf 1}_{d}{\bf 1}_{d}^{T}\} and 0ρ10\leq\rho\leq 1 is the unknown parameter of interest. The marginal univariate sub-likelihoods do not contain information on ρ\rho, so we consider pairwise sub-likelihoods

lm(ρ)=n2log(1ρ2)(SSll+SSmm)2(1ρ2)+ρSSlm1ρ2, 1l<md,\displaystyle\ell_{lm}(\rho)=-\frac{n}{2}\log(1-\rho^{2})-\frac{(SS_{ll}+SS_{mm})}{2(1-\rho^{2})}+\frac{\rho SS_{lm}}{1-\rho^{2}},\ \ \ 1\leq l<m\leq d, (18)

where SSmm=i=1n(Xm(i))2SS_{mm}=\sum_{i=1}^{n}(X^{(i)}_{m})^{2} and SSlm=i=1nXl(i)Xm(i)SS_{lm}=\sum_{i=1}^{n}X^{(i)}_{l}X^{(i)}_{m}. Given 𝝎\boldsymbol{\omega}, we estimate ρ\rho by solving the composite score equation 0=j<kωjkUjk(ρ)0=\sum_{j<k}\omega_{jk}U_{jk}(\rho) by Newton-Raphson iterations where each pairwise score

Ujk(ρ)=(1+ρ2)SSjkρ(SSjj+SSkk)+nρ(1ρ2)U_{jk}(\rho)=(1+\rho^{2})SS_{jk}-\rho(SS_{jj}+SS_{kk})+n\rho(1-\rho^{2}) (19)

is a cubic function of ρ\rho. It is well known that the composite likelihood estimation can lead to poor results for this model. ? carries out asymptotic variance calculations, showing that efficiency losses compared to MLE occur whenever d>2d>2, with more pronounced efficiency losses for ρ\rho near 0.50.5. Particularly, if d=2d=2 (exactly one pair), ARE=1ARE=1; if d>2d>2, ARE=1 if ρ\rho approaches 0 or 11. The next simulation results show that in finite samples composite likelihood selection is advantageous even in such a challenging situation.

Since closed-form pairwise score expressions are available for this model, we use the objective function g^(𝝎)\hat{g}(\boldsymbol{\omega}) based on the one-step jackknife with pseudo-values computed as in Equation (9). In Table 3, we present results from B=250B=250 Monte Carlo runs of the MCMC-CLS algorithm applied to the model with d=5,8,10d=5,8,10 dimensions, which correspond to M=(d2)=10,28,45M={{d}\choose{2}}=10,28,45 sub-likelihoods, respectively. We compute Monte Carlo estimates for the finite-sample relative efficiency RE=Var^(ρ^apw)/Var^(ρ^cl(𝝎^))RE=\widehat{Var}(\hat{\rho}_{apw})/\widehat{Var}(\hat{\rho}_{cl}(\hat{\boldsymbol{\omega}}^{\ast})), where Var^\widehat{Var} denotes the Monte Carlo estimate of the finite-sample variance ρ^cl(𝝎^)\hat{\rho}_{cl}(\hat{\boldsymbol{\omega}}^{\ast}) is the estimator selected by the MCMC-CLS algorithm, while ρ^apw\hat{\rho}_{apw} is the all pairwise (APW) estimator obtained by including all available pairs (thus RE>1RE>1 indicates that our stochastic selection outperforms no selection). We show values of ρ\rho around 0.50.5, since they correspond to the largest asymptotic efficiency losses of pairwise likelihood compared to MLE (see ?, Figure 1). In all considered cases, our stochastic selection method improves the efficiency of the estimator based on all pairwise components; at the same time, our composite likelihoods employ a much smaller number of components. For example, when ρ=0.6\rho=0.6 the efficiency improvements range from 8% to 39%, using only about half of the available components.

n=10n=10 n=50n=50
M=(d2)=M={{d}\choose{2}}= 10 28 45 10 28 45
ρ=0.4\rho=0.4 RERE 1.27(0.03) 1.11(0.01) 1.07(0.01) 1.15(0.01) 1.12(0.01) 1.06(0.01)
No. comp. 5.83(0.09) 13.70(0.16) 21.12(0.22) 7.30(0.08) 13.82(0.16) 21.45(0.22)
ρ=0.5\rho=0.5 RERE 1.36(0.02) 1.14(0.01) 1.06(0.01) 1.19(0.01) 1.11(0.01) 1.06(0.01)
No. comp. 5.51(0.08) 13.32(0.07) 20.96(0.22) 7.01(0.07) 13.64(0.18) 21.23(0.22)
ρ=0.6\rho=0.6 RERE 1.39(0.02) 1.17(0.01) 1.10(0.01) 1.38(0.02) 1.14(0.01) 1.08(0.01)
No. comp. 5.80(0.08) 12.60(0.16) 20.99(0.20) 5.39(0.08) 13.27(0.15) 21.45(0.21)
Table 3: Pairwise likelihood selection for Model 2, Nd(𝟎,𝚺(ρ))N_{d}(\mathbf{0},\mathbf{\Sigma}(\rho)) based on Algorithm 1: Monte Carlo estimates for: (i) the relative efficiency of the parameter estimate under selection versus no selection ( RE=Var(ρ^apw)/Var(ρ^cl(𝝎^))RE=Var(\hat{\rho}_{apw})/Var(\hat{\rho}_{cl}(\hat{\boldsymbol{\omega}}^{\ast})), so that RE>1RE>1 indicates that the selection outperforms no selection); (ii) and number of sub-likelihood components (No. comp.). Monte Carlo standard errors are in parenthesis. Simulation settings: τ=d\tau=d, chain length T=10dT=10d, ξ=0.7\xi=0.7.

5.3 Real Data Analysis: Australian Breast Cancer Family Study

In this section, we apply the MCMC-CLS algorithm to a real genetic dataset of women with breast cancer obtained from the Australian Breast Cancer Family Study (ABCFS) (?) and control subjects from the Australian Mammographic Density Twins and Sisters Study (?). All women were genotyped using a Human610-Quad beadchip array. The final data set that we used consisted of a subset of 20 single nucleotide polymorphisms (SNPs) corresponding to genes encoding a candidate susceptibility pathway, which is motivated by biological considerations. After recommended data cleaning and quality control procedures (e.g., checks for SNP missingness, duplicate relatedness, population outliers (?)), the final data consisted of n=333n=333 observations (6767 cases and 266266 controls).

To detect group effects due to cancer, we consider an extension of the latent multivariate Gaussian model first introduced by ?. Let 𝒀(i)=(Y1(i),,Yd(i))\boldsymbol{Y}^{(i)}=(Y_{1}^{(i)},\dots,Y_{d}^{(i)}), i=1,,ni=1,\dots,n, be independent observations of a multivariate categorical variable measured on nn subjects. Each variable Yk(i)Y^{(i)}_{k} can take values 0,10,1 or 22, representing the copy number of one of the alleles of SNP kk of subject ii. The binary variable X(i)=x(i)=0X^{(i)}=x^{(i)}=0 or 11 represents disease status of the iith subject (0 = control and 1 = disease). We assume a latent random dd-vector 𝒁(i)=(Z1(i),,Zd(i))Nd(𝝁(i)(θ),𝑹)\boldsymbol{Z}^{(i)}=(Z^{(i)}_{1},\dots,Z^{(i)}_{d})\sim N_{d}(\boldsymbol{\mu}^{(i)}(\theta),\boldsymbol{R}), where 𝝁(i)(θ)\boldsymbol{\mu}^{(i)}(\theta) is a conditional mean vector with elements μ1(i)(θ)==μd(i)(θ)=θx(i)\mu_{1}^{(i)}(\theta)=\cdots=\mu_{d}^{(i)}(\theta)=\theta x^{(i)} and 𝑹\boldsymbol{R} is the correlation matrix. Our main interest is on the unknown mean parameter θ\theta, which is common to all the SNP variables and represents the main effect due to disease. We assume P(Yk(i)=0|X(i)=x(i))=P(Zk(i)γk1)P(Y_{k}^{(i)}=0|X^{(i)}=x^{(i)})=P(Z^{(i)}_{k}\leq\gamma_{k1}), P(Yk(i)=1|X(i)=x(i))=P(γk1<Zk(i)γk2)P(Y_{k}^{(i)}=1|X^{(i)}=x^{(i)})=P(\gamma_{k1}<Z^{(i)}_{k}\leq\gamma_{k2}), and P(Yk(i)=2|X(i)=x(i))=P(Zk(i)>γk2)P(Y_{k}^{(i)}=2|X^{(i)}=x^{(i)})=P(Z^{(i)}_{k}>\gamma_{k2}), where γk1\gamma_{k1} and γk2\gamma_{k2} are SNP-specific thresholds. The above model reflects the ordinal nature of genotypes and assumes absence of the Hardy-Weinberg equilibrium (HWE) (allele frequencies and genotypes in a population are constant from generation to generation). If the HWE holds the parameters γ1k\gamma_{1k} and γ2k\gamma_{2k} are not needed, since we have the additional constraint P(Yk(i)=2)=P(Yk(i)=1)2P(Y_{k}^{(i)}=2)=P(Y_{k}^{(i)}=1)^{2}.

Let 𝜸={(γ1k,γ2k):k=1,,d}\boldsymbol{\gamma}=\{(\gamma_{1k},\gamma_{2k}):k=1,\dots,d\} and define intervals Γk(Yk(i))\Gamma_{k}(Y^{(i)}_{k}) to be (,γk1](-\infty,\gamma_{k1}], (γk1,γk2](\gamma_{k1},\gamma_{k2}] and [γk2,)[\gamma_{k2},\infty), corresponding to Yk(i)=0,1Y^{(i)}_{k}=0,1 and 22, respectively. The full log-likelihood function is

(θ,𝜸,𝑹)\displaystyle\ell(\theta,\boldsymbol{\gamma},\boldsymbol{R}) =i=1nlogP(𝒀(i)=𝐲(i)|X(i)=x(i))\displaystyle=\sum_{i=1}^{n}\log P(\boldsymbol{Y}^{(i)}=\mathbf{y}^{(i)}|X^{(i)}=x^{(i)})
=i=1nlogΓ1(y1(i))Γd(yd(i))f(z1,,zd|𝝁(i)(θ),𝑹)𝑑z1𝑑zd,\displaystyle=\sum_{i=1}^{n}\log\int_{\Gamma_{1}(y^{(i)}_{1})}\cdots\int_{\Gamma_{d}(y^{(i)}_{d})}f(z_{1},\dots,z_{d}|\boldsymbol{\mu}^{(i)}(\theta),\boldsymbol{R})dz_{1}\cdots dz_{d},

where f(z1,,zd|𝝁,𝑹)f(z_{1},\dots,z_{d}|\boldsymbol{\mu},\boldsymbol{R}) is the pdf of the dd-variate normal density function with mean 𝝁\boldsymbol{\mu} and correlation matrix 𝑹\boldsymbol{R}. Clearly, the full log-likelihood function is intractable when dd is moderate or large, due to the multivariate integral in the likelihood expression. Note that for the marginal latent components, we have Zk(i)N1(θx(i),1)Z_{k}^{(i)}\sim N_{1}(\theta x^{(i)},1), so 𝜸\boldsymbol{\gamma} and θ\theta can be estimated by minimizing the one-wise composite log-likelihood

cl(θ,𝜸,𝝎)\displaystyle\ell_{cl}(\theta,\boldsymbol{\gamma},\boldsymbol{\omega}) =k=1dωki=1nlogP(Yk(i)=yk(i)|X(i)=x(i))\displaystyle=\sum_{k=1}^{d}\omega_{k}\sum_{i=1}^{n}\log P(Y_{k}^{(i)}=y_{k}^{(i)}|X^{(i)}=x^{(i)}) (20)
=k=1dωki=1nlogΓk(yk(i))ϕ(zk|θx(i),1)𝑑zk,\displaystyle=\sum_{k=1}^{d}\omega_{k}\sum_{i=1}^{n}\log\int_{\Gamma_{k}(y^{(i)}_{k})}\phi(z_{k}|\theta x^{(i)},1)dz_{k}, (21)

where ϕ(|μ,1)\phi(\cdot|\mu,1) denotes the normal pdf with mean μ\mu and unit variance. We focus on using the one-wise composite log-likelihood in this section, except when 𝑹\boldsymbol{R} is to be estimated where we use pairwise composite log-likelihood. Differently from the expression in ?, the disease group effect θ\theta is common to multiple sub-likelihood components; also, we allow for the inclusion/exclusion of particular sub-likelihood components (corresponding to SNPs) by selecting 𝝎\boldsymbol{\omega}.

(a) (b)
Refer to caption Refer to caption
(c) (d)
Refer to caption Refer to caption
Figure 2: Composite likelihood selection for the ABCFS data by Algorithm 1: (a) Frequency of the sampled marginal likelihood components; (b) Objective function evaluated at the current solution 𝝎^\hat{\boldsymbol{\omega}}^{\ast} (computed from past samples with ξ=0.7\xi=0.7); (c) Parameter estimates, θ^(t)=θ^(𝝎(t))\hat{\theta}^{(t)}=\hat{\theta}(\boldsymbol{\omega}^{(t)}), based on sampled composition rules, 𝝎(t)\boldsymbol{\omega}^{(t)} with optimal parameter estimate θ^(𝝎)\hat{\theta}(\boldsymbol{\omega}^{\ast}) corresponding to vertical dashed line. (d) Pairwise likelihood estimates for the correlation matrix 𝑹\boldsymbol{R} for SNPs in the susceptibility pathway.

We estimated the optimal composition rule 𝝎^\hat{\boldsymbol{\omega}}^{\ast} based on Gibbs samples from Algorithm 1, where the objective function g^(𝝎)=logVar(θ^(𝝎))\hat{g}(\boldsymbol{\omega})=\log Var(\hat{\theta}(\boldsymbol{\omega})) was estimated by delete-10 jackknife. We selected five marginal likelihoods (SNPs) occurring with at least ξ=0.7\xi=0.7 frequency in 250 runs of the Gibbs sampler (see Figure 2 (a)). In Figure 2 (b), we show the trajectory of the objective function g^(𝝎)\hat{g}(\boldsymbol{\omega}) evaluated at the current optimal solution 𝝎^\hat{\boldsymbol{\omega}}^{\ast} (optimal solutions are computed from past samples using a ξ=0.7\xi=0.7 threshold). The estimated variance tends to sway toward the minimum as more composition rules are drawn by our Gibbs sampler. This behavior is in agreement with preliminary simulation results carried out on this model (not presented here) as well as the examples presented in Sections 5.1 and 5.2.

Figure 2 (c) shows the empirical distribution of parameter estimates, θ^(t)=θ^(𝝎(t))\hat{\theta}^{(t)}=\hat{\theta}(\boldsymbol{\omega}^{(t)}), based on sampled vectors 𝝎(t)\boldsymbol{\omega}^{(t)}, t=1,,250t=1,\dots,250. The vertical dashed line corresponds to the selected parameter estimate θ^(𝝎^)\hat{\theta}(\hat{\boldsymbol{\omega}}^{\ast}), which is located near the mode of the empirical distribution. Particularly, the selected McLE is θ^(𝝎^)=0.125\hat{\theta}(\hat{\boldsymbol{\omega}}^{\ast})=-0.125 and the corresponding delete-10 jackknife standard error is sd^(θ^(𝝎^))=0.012\hat{sd}(\hat{\theta}(\hat{\boldsymbol{\omega}}^{\ast}))=0.012. The McLE based on using all 20 target SNPs is θ^all=0.112\hat{\theta}_{all}=-0.112 with the delete-10 jackknife standard error sd^(θ^all)=0.042\hat{sd}(\hat{\theta}_{all})=0.042. Our estimator gives a substantial accuracy improvement, supporting the conclusion of a difference between case and control groups (i.e., θ0\theta\neq 0) with higher confidence. Finally, in Figure 2 (d), we show estimates for the correlation matrix 𝑹\boldsymbol{R} for the target SNPs, based on the pairwise composite likelihood described in ?.

6 Final remarks

Composite likelihood estimation is a rapidly-growing need for a number of fields, due to the astonishing growth of data complexity and the limitations of traditional maximum likelihood estimation in this context. The Gibbs sampling protocol proposed in this paper addresses an important unresolved issue by providing a tool to automatically select the most useful sub-likelihoods from a pool of feasible components. Our numerical results on simulated and real data show that the composition rules generated by our MCMC approach are useful to improve the variance of traditional McLE estimators, typically obtained by using all sub-likelihood components available. Another advantage deriving from our method is the possibility to generate sparse composition rules, since our Gibbs sampler selects only a (relatively small) subset of informative sub-likelihoods while discarding non-informative or redundant components.

In the present paper, likelihood sparsity derives naturally from the discrete nature of our MCMC approach based on binary composition rules with 𝝎{0,1}M\boldsymbol{\omega}\in\{0,1\}^{M}. On the other hand, the development of sparsity-enforcing likelihood selection methods suited to real-valued weights would be valuable as well and could result in more efficient composition rules. For example, ? discuss the optimality of composite likelihood estimating functions involving both positive and negative real-valued weights. Actually, the row averages ω¯1,,ω¯M\overline{\omega}_{1},\cdots,\overline{\omega}_{M} computed in Step 2 of Algorithm 1 can also be used to replace 𝝎=(ω1,,ωM)T\boldsymbol{\omega}=(\omega_{1},\cdots,\omega_{M})^{T} in cl(𝜽,𝝎)\ell_{cl}(\boldsymbol{\theta},\boldsymbol{\omega}) defined by (2), providing a composite log-likelihood with between-0-and-1 weights. It would be of interest to see how well this form of composite log-likelihood gets on in regard to the efficiency. This, however, was not pursued in this paper and is left for future exploration. Finally, the penalized version of the objective function described in Section 4 enforces sparse likelihood functions, which is necessary in situations where the model complexity is relatively large compared to the sample size. Thus developing a thorough theoretical understanding on the effect of the penalty on the selection as d,nd,n\rightarrow\infty would be very valuable for improved selection algorithm in high-dimensions.

REFERENCES

  • [2] [] Besag, J. (1974), “Spatial interaction and the statistical analysis of lattice systems,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 192–236.
  • [4] [] Brooks, S., Friel, N., and King, R. (2003), “Classical model selection via simulated annealing,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2), 503–520.
  • [6] [] Claeskens, G., and Hjort, N. L. (2008), Model selection and model averaging, Vol. 330 Cambridge University Press Cambridge.
  • [8] [] Cox, D. R., and Reid, N. (2004), “A note on pseudolikelihood constructed from marginal densities,” Biometrika, 91(3), 729–737.
  • [10] [] Dite, G. S., Jenkins, M. A., Southey, M. C., Hocking, J. S., Giles, G. G., McCredie, M. R., Venter, D. J., and Hopper, J. L. (2003), “Familial risks, early-onset breast cancer, and BRCA1 and BRCA2 germline mutations,” Journal of the National Cancer Institute, 95(6), 448–457.
  • [12] [] George, E. I., and McCulloch, R. E. (1993), “Variable selection via Gibbs sampling,” Journal of the American Statistical Association, 88(423), 881–889.
  • [14] [] Hall, P., Horowitz, J. L., and Jing, B.-Y. (1995), “On blocking rules for the bootstrap with dependent data,” Biometrika, 82(3), 561–574.
  • [16] [] Han, F., and Pan, W. (2012), “A Composite Likelihood Approach to Latent Multivariate Gaussian Modeling of SNP Data with Application to Genetic Association Testing,” Biometrics, 68(1), 307–315.
  • [18] [] Heyde, C. C. (1997), Quasi-likelihood and its application: a general approach to optimal parameter estimation. Springer.
  • [20] [] Lindsay, B. G. (1988), “Composite likelihood methods,” Contemporary Mathematics, 80, 221–239.
  • [22] [] Lindsay, B. G., Yi, G. Y., and Sun, J. (2011), “Issues and Strategies in the Selection of Composite Likelihoods,” Statistica Sinica, 21(1), 71–105.
  • [24] [] Meinshausen, N., and Bühlmann, P. (2010), “Stability selection,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417–473.
  • [26] [] Odefrey, F., Stone, J., Gurrin, L. C., Byrnes, G. B., Apicella, C., Dite, G. S., Cawson, J. N., Giles, G. G., Treloar, S. A., English, D. R. et al. (2010), “Common genetic variants associated with breast cancer and mammographic density measures that predict disease,” Cancer research, 70(4), 1449–1458.
  • [28] [] Qian, G. (1999), “Computations and analysis in robust regression model selection using stochastic complexity,” Computational Statistics, 14, 293–314.
  • [30] [] Qian, G., and Field, C. (2002), “Using MCMC for logistic regression model selection involving large number of candidate models,” in Monte Carlo and Quasi-Monte Carlo Methods 2000 Springer, pp. 460–474.
  • [32] [] Qian, G., and Zhao, X. (2007), “On time series model selection involving many candidate ARMA models,” Computational Statistics & Data Analysis, 51(12), 6180–6196.
  • [34] [] Robert, C. P., and Casella, G. (2004), Monte Carlo statistical methods, Vol. 319 Citeseer.
  • [36] [] Shao, J. (1992), “One-step jackknife for M-estimators computed using Newton’s method,” Annals of the Institute of Statistical Mathematics, 44(4), 687–701.
  • [38] [] Varin, C., Reid, N., and Firth, D. (2011), “An Overview of Composite Likelihood Methods,” Statistica Sinica, 21, 5–42.
  • [40] [] Weale, M. E. (2010), “Quality control for genome-wide association studies,” in Genetic Variation Springer, pp. 341–372.
  • [41]