
Joint and Individual Component Regression

Peiyao Wang1, Haodong Wang1, Quefeng Li2, Dinggang Shen4,5,6,
Yufeng Liu1,2,3∗

1Department of Statistics and Operations Research,
2Department of Biostatistics, 3Department of Genetics,
University of North Carolina at Chapel Hill;
4School of Biomedical Engineering, ShanghaiTech University;
5Shanghai United Imaging Intelligence Co., Ltd.;
6Department of Artificial Intelligence, Korea University
email: yfliu@email.unc.edu
Abstract

Multi-group data, which consist of data from multiple groups, are common in practice and can be challenging to analyze due to data heterogeneity. We propose a novel Joint and Individual Component Regression (JICO) model to analyze multi-group data. In particular, our proposed model decomposes the response into shared and group-specific components, which are driven by low-rank approximations of joint and individual structures from the predictors respectively. The joint structure has the same regression coefficients across multiple groups, whereas individual structures have group-specific regression coefficients. Moreover, the choice of joint and individual ranks allows our model to cover global and group-specific models as special cases. For model estimation, we formulate this framework under the representation of latent components and propose an iterative algorithm to solve for the joint and individual scores under the new representation. To construct the latent scores, we utilize Continuum Regression (CR), which provides a unified framework that covers Ordinary Least Squares (OLS), Partial Least Squares (PLS), and Principal Component Regression (PCR) as special cases. We show that JICO attains a good balance between global and group-specific models and remains flexible due to the usage of CR. Finally, we conduct simulation studies and an analysis of an Alzheimer's disease dataset to further demonstrate the effectiveness of JICO. An R implementation of JICO is available online at https://github.com/peiyaow/JICO.

Keywords: Continuum Regression; Heterogeneity; Latent Component Regression; Multi-group Data.

1 Introduction

Many fields of scientific research involve the analysis of heterogeneous data. In particular, data may appear in the form of multiple matrices, with data heterogeneity arising from either variables or samples. One example is multi-view/source data, which measure different sets of variables on the same set of samples. The sets of variables may come from different platforms/sources/modalities. For instance, in genomics studies, measurements are collected as different biomarkers, such as mRNA and miRNA (Muniategui et al., 2013). Another example is multi-group data, which measure the same set of variables on disparate sets of samples, leading to heterogeneous subpopulations/subgroups in the entire population. For instance, in the Alzheimer's Disease (AD) study, subjects can have different subtypes, such as Normal Control (NC), Mild Cognitive Impairment (MCI), and AD.

We study the classical regression problem with one continuous response for multi-group data. Although there are many well-established regression techniques for homogeneous data (Hoerl and Kennard, 1970; Tibshirani, 1996), they may not be suitable for multi-group data. One naive approach is to ignore data heterogeneity and fit a global model using these techniques. However, a single global model can be too restrictive because the diverse information from different subgroups may not be identified. On the other hand, one can train separate group-specific models. Despite its flexibility, the information that is jointly shared across different groups is not sufficiently captured. Therefore, it is desirable to build a flexible statistical model that can simultaneously quantify the jointly shared global information and individual group-specific information for heterogeneous data.

There are several existing methods proposed in the literature for regression with multi-group data. Meinshausen and Bühlmann (2015) took a conservative approach and proposed a maximin effects method that is reliable for all possible subsets of the data. Zhao et al. (2016) proposed a partially linear regression framework for massive heterogeneous data, where the goal is to extract common features across all subpopulations while exploring the heterogeneity of each subpopulation. Tang and Song (2016) and Ma and Huang (2017) proposed fused penalties to estimate regression coefficients that capture subgroup structures in a linear regression framework. Wang et al. (2018) proposed a locally-weighted penalized model to perform subject-wise variable selection. Wang et al. (2023) proposed a factor regression model for heterogeneous subpopulations under a high-dimensional factor decomposition. However, these models either are not specifically designed to identify the globally-shared and group-specific structures, or impose strong theoretical assumptions on the covariates. On the other hand, there exist some works that study this problem for multi-source data. Lock et al. (2013) proposed JIVE to learn joint and individual structures from multiple data matrices by low-rank approximations. Some extensions of JIVE can be found in Feng et al. (2018) and Gaynanova and Li (2019). All of these decomposition methods are fully unsupervised. Recently, Li and Li (2021) proposed a supervised integrative factor regression model for multi-source data and studied its statistical properties with hypothesis tests. Palzer et al. (2022) proposed sJIVE, which extends JIVE with supervision from the response. Both of these methods are supervised, but they focus on regression for multi-source data.

In this paper, we consider the supervised learning problem of predicting a response with multi-group data. We propose Joint and Individual COmponent Regression (JICO), a novel latent component regression model that covers JIVE as a special case. Our proposed model decomposes the response into jointly shared and group-specific components, which are driven by low-rank approximations of joint and individual structures from the predictors respectively. The joint structure shares the same coefficients across all groups, whereas individual structures have group-specific coefficients. Moreover, by choosing different ranks for the joint and individual structures, our model covers global and group-specific models as special cases. To estimate JICO, we propose an iterative algorithm that solves for joint and individual scores using the latent component representation. To construct the latent scores, we utilize Continuum Regression (CR) (Stone and Brooks, 1990), which provides a unified framework that covers OLS, PLS, and PCR as special cases. Some follow-up studies and modern extensions of CR can be found in Björkström and Sundberg (1996) and Lee and Liu (2013). Embracing this flexibility and generalizability of CR, our proposed JICO model extends to the heterogeneous data setup and is able to achieve different model configurations on the spectrum of CR under this more complicated setting. JICO attains a good balance between global and group-specific models, and further achieves its flexibility by extending CR.

The rest of this paper is organized as follows. In Section 2, we briefly review JIVE and introduce our proposed JICO model. We further provide sufficient conditions to make JICO identifiable. In Section 3, after two motivating special cases, we introduce our iterative algorithm. In Sections 4 and 5, we evaluate the performance of JICO by simulation studies and real data analysis on the Alzheimer’s disease dataset, respectively. In Section 6, we conclude this paper with some discussion and possible extensions.

2 Motivation and Model Framework

Suppose we observe data pairs $(\mathbf{X}_{g},\bm{Y}_{g})_{g=1}^{G}$ from $G$ groups, where $\mathbf{X}_{g}\in\mathbb{R}^{n_{g}\times p}$ and $\bm{Y}_{g}\in\mathbb{R}^{n_{g}}$ are the data matrix and the response vector for the $g$th group respectively. In this setting, each data matrix has the same set of $p$ explanatory variables, whereas the samples vary across groups. We let $\mathbf{X}=[\mathbf{X}_{1}^{\prime},\ldots,\mathbf{X}_{G}^{\prime}]^{\prime}\in\mathbb{R}^{n\times p}$ and $\bm{Y}=[\bm{Y}_{1}^{\prime},\ldots,\bm{Y}_{G}^{\prime}]^{\prime}\in\mathbb{R}^{n}$, where $n=\sum_{g=1}^{G}n_{g}$.

Our model is closely related to JIVE, which provides a general formulation to decompose multiple data matrices into joint and individual structures. JIVE decomposes $\mathbf{X}_{g}$ as

$$\mathbf{X}_{g}=\mathbf{J}_{g}+\mathbf{A}_{g}+\mathbf{E}_{g}, \qquad (2.1)$$

where $\mathbf{J}_{g}\in\mathbb{R}^{n_{g}\times p}$ represents the submatrix of the joint structure of $\mathbf{X}_{g}$, $\mathbf{A}_{g}\in\mathbb{R}^{n_{g}\times p}$ represents the individual structure of $\mathbf{X}_{g}$, and $\mathbf{E}_{g}\in\mathbb{R}^{n_{g}\times p}$ is the error matrix. We consider that $\bm{Y}_{g}$ has a similar decomposition into joint and individual signals

$$\bm{Y}_{g}=\bm{J}^{Y}_{g}+\bm{A}^{Y}_{g}+\mathbf{e}_{g}, \qquad (2.2)$$

where $\mathbf{e}_{g}\in\mathbb{R}^{n_{g}}$ is the noise from the $g$-th group. Let $\tilde{\mathbf{X}}_{g}=\mathbf{J}_{g}+\mathbf{A}_{g}$ and $\tilde{\bm{Y}}_{g}=\bm{J}^{Y}_{g}+\bm{A}^{Y}_{g}$ be the noiseless counterparts of $\mathbf{X}_{g}$ and $\bm{Y}_{g}$. Lemma 1 gives conditions to ensure that $\mathbf{J}_{g}$, $\mathbf{A}_{g}$, $\bm{J}_{g}^{Y}$, and $\bm{A}_{g}^{Y}$ are identifiable.

Lemma 1

Given $\{\tilde{\mathbf{X}}_{g},\tilde{\bm{Y}}_{g}\}_{g=1}^{G}$ with $\tilde{\bm{Y}}_{g}\in\text{col}(\tilde{\mathbf{X}}_{g})$, there exist unique $\mathbf{J}_{g}$ and $\mathbf{A}_{g}$ such that $\tilde{\mathbf{X}}_{g}=\mathbf{J}_{g}+\mathbf{A}_{g}$ and they satisfy the following conditions:

  (i) $\text{row}(\mathbf{J}_{1})=\cdots=\text{row}(\mathbf{J}_{G})\subset\text{row}(\tilde{\mathbf{X}}_{g})$;

  (ii) $\text{row}(\mathbf{J}_{g})\perp\text{row}(\mathbf{A}_{g})$, for $g=1,\ldots,G$;

  (iii) $\bigcap_{g=1}^{G}\text{row}(\mathbf{A}_{g})=\{\mathbf{0}_{p\times 1}\}$.

Moreover, if $\text{col}(\mathbf{J}_{g})\perp\text{col}(\mathbf{A}_{g})$, then there exist unique $\bm{J}^{Y}_{g}$ and $\bm{A}^{Y}_{g}$ such that $\tilde{\bm{Y}}_{g}=\bm{J}^{Y}_{g}+\bm{A}^{Y}_{g}$, $\bm{J}^{Y}_{g}\in\text{col}(\mathbf{J}_{g})$, and $\bm{A}^{Y}_{g}\in\text{col}(\mathbf{A}_{g})$.

Lemma 1 shows that $\tilde{\mathbf{X}}_{g}$ can be uniquely decomposed into the sum of $\mathbf{J}_{g}$ and $\mathbf{A}_{g}$ if we require them to satisfy conditions (i)-(iii), following similar arguments as in Feng et al. (2018). To ensure the unique decomposition of $\tilde{\bm{Y}}_{g}$, we further require $\text{col}(\mathbf{J}_{g})\perp\text{col}(\mathbf{A}_{g})$, which is different from Palzer et al. (2022), who require $\text{row}(\mathbf{J}_{g})\perp\text{row}(\mathbf{A}_{g})$.

In practice, only $\mathbf{X}_{g}$ and $\bm{Y}_{g}$ are observable. In Lemma 2(a), we show that the identifiability conditions in Lemma 1 can still be achieved given the observed $\{\mathbf{X}_{g},\bm{Y}_{g}\}_{g=1}^{G}$. Moreover, in Lemma 2(b), we show that if $\mathbf{J}_{g}$ and $\mathbf{A}_{g}$ are assumed to have low ranks, they can be further decomposed as $\mathbf{J}_{g}=\mathbf{S}_{g}\mathbf{U}$, where $\mathbf{S}_{g}$ is an $n_{g}\times K$ score matrix, $\mathbf{U}$ is a $K\times p$ loading matrix, and $K=\text{rank}(\mathbf{J}_{g})$; and $\mathbf{A}_{g}=\mathbf{T}_{g}\mathbf{U}_{g}$, where $\mathbf{T}_{g}$ is an $n_{g}\times K_{g}$ score matrix, $\mathbf{U}_{g}$ is a $K_{g}\times p$ loading matrix, and $K_{g}=\text{rank}(\mathbf{A}_{g})$. Under this formulation, if $\mathbf{S}_{g}^{\prime}\mathbf{T}_{g}=\bm{0}_{K\times K_{g}}$, then (2.1) and (2.2) can be expressed as

$$\mathbf{X}_{g}=\mathbf{S}_{g}\mathbf{U}+\mathbf{T}_{g}\mathbf{U}_{g}+\mathbf{E}_{g}, \qquad (2.3)$$
$$\bm{Y}_{g}=\mathbf{S}_{g}\bm{\alpha}+\mathbf{T}_{g}\bm{\alpha}_{g}+\mathbf{e}_{g}, \qquad (2.4)$$

where $\bm{\alpha}\in\mathbb{R}^{K}$ and $\bm{\alpha}_{g}\in\mathbb{R}^{K_{g}}$ are the coefficients of the joint and individual components respectively. Model (2.4) gives a unified framework to model multi-group data. When $K=0$, the joint term $\mathbf{S}_{g}\bm{\alpha}$ vanishes and (2.4) reduces to the group-specific model $\bm{Y}_{g}=\mathbf{T}_{g}\bm{\alpha}_{g}+\mathbf{e}_{g}$. On the other hand, when $K_{1}=\cdots=K_{G}=0$, the individual term $\mathbf{T}_{g}\bm{\alpha}_{g}$ vanishes and (2.4) reduces to the global model $\bm{Y}_{g}=\mathbf{S}_{g}\bm{\alpha}+\mathbf{e}_{g}$. When $K\neq 0$ and $K_{g}\neq 0$, (2.4) has both global and group-specific components, and thus lies between the above two extreme cases.
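To make the formulation concrete, below is a minimal R sketch that simulates multi-group data from (2.3)-(2.4). The dimensions, ranks, and noise levels are illustrative assumptions, and the orthogonality conditions discussed in this section are not enforced here; the sketch only illustrates the generative form of the model.

```r
# Minimal sketch: simulate multi-group data from the joint + individual
# latent component model (2.3)-(2.4). Dimensions, ranks, and noise levels
# are illustrative assumptions; orthogonality constraints are not enforced.
set.seed(1)
G <- 2; p <- 20; n_g <- c(30, 40); K <- 1; K_g <- c(1, 1)

U       <- matrix(rnorm(K * p), K, p)                            # joint loading (K x p)
U_g     <- lapply(K_g, function(k) matrix(rnorm(k * p), k, p))   # individual loadings
alpha   <- rnorm(K)                                              # joint coefficients
alpha_g <- lapply(K_g, rnorm)                                    # individual coefficients

dat <- lapply(seq_len(G), function(g) {
  S_g <- matrix(rnorm(n_g[g] * K), n_g[g], K)                    # joint scores
  T_g <- matrix(rnorm(n_g[g] * K_g[g]), n_g[g], K_g[g])          # individual scores
  X_g <- S_g %*% U + T_g %*% U_g[[g]] +
         matrix(rnorm(n_g[g] * p, sd = 0.1), n_g[g], p)          # model (2.3)
  Y_g <- drop(S_g %*% alpha + T_g %*% alpha_g[[g]]) +
         rnorm(n_g[g], sd = 0.1)                                 # model (2.4)
  list(X = X_g, Y = Y_g)
})
str(dat[[1]])
```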

Lemma 2

Given $\{\mathbf{X}_{g},\bm{Y}_{g}\}_{g=1}^{G}$, it holds that

  (a) There exist matrices $\mathbf{J}_{g}$ and $\mathbf{A}_{g}$ that satisfy conditions (i)-(iii) in Lemma 1 and $\text{col}(\mathbf{J}_{g})\perp\text{col}(\mathbf{A}_{g})$.

  (b) There exist matrices $\mathbf{U},\mathbf{U}_{g},\mathbf{S}_{g},\mathbf{T}_{g}$ such that $\mathbf{J}_{g}$ and $\mathbf{A}_{g}$ can be expressed as $\mathbf{J}_{g}=\mathbf{S}_{g}\mathbf{U}$ and $\mathbf{A}_{g}=\mathbf{T}_{g}\mathbf{U}_{g}$, where $\mathbf{S}_{g}^{\prime}\mathbf{T}_{g}=\bm{0}_{K\times K_{g}}$. Moreover, there exist $\bm{J}^{Y}_{g}$ and $\bm{A}^{Y}_{g}$ that can be expressed as $\bm{J}^{Y}_{g}=\mathbf{S}_{g}\bm{\alpha}$ and $\bm{A}^{Y}_{g}=\mathbf{T}_{g}\bm{\alpha}_{g}$, where $\bm{\alpha}=(\mathbf{S}^{\prime}\mathbf{S})^{-1}\mathbf{S}^{\prime}\bm{Y}$ and $\bm{\alpha}_{g}=(\mathbf{T}_{g}^{\prime}\mathbf{T}_{g})^{-1}\mathbf{T}_{g}^{\prime}\bm{Y}_{g}$, with $\mathbf{S}=[\mathbf{S}_{1}^{\prime},\ldots,\mathbf{S}_{G}^{\prime}]^{\prime}$.

Corollary 1

There exist matrices $\mathbf{W}\in\mathbb{R}^{p\times K}$ and $\mathbf{W}_{g}\in\mathbb{R}^{p\times K_{g}}$ such that $\mathbf{J}_{g}=\mathbf{S}_{g}\mathbf{U}$ and $\mathbf{A}_{g}=\mathbf{T}_{g}\mathbf{U}_{g}$, defined by $\mathbf{S}_{g}=\mathbf{X}_{g}\mathbf{W}$, $\mathbf{T}_{g}=\mathbf{X}_{g}\mathbf{W}_{g}$, $\mathbf{U}=(\mathbf{W}^{\prime}\mathbf{W})^{-1}\mathbf{W}^{\prime}$ and $\mathbf{U}_{g}=(\mathbf{W}_{g}^{\prime}\mathbf{W}_{g})^{-1}\mathbf{W}_{g}^{\prime}$ as in Lemma 2(b), satisfy conditions (i)-(iii) in Lemma 1 and $\text{col}(\mathbf{J}_{g})\perp\text{col}(\mathbf{A}_{g})$, if $\mathbf{W}^{\prime}\mathbf{W}_{g}=\bm{0}$ and $\mathbf{W}^{\prime}\mathbf{X}_{g}^{\prime}\mathbf{X}_{g}\mathbf{W}_{g}=\bm{0}$, for all $g$.

Corollary 1 follows directly from the proof of Lemma 2(b). As a remark, the columns of $\mathbf{W}$ and $\mathbf{W}_{g}$ form the sets of bases that span the row spaces of $\mathbf{J}_{g}$ and $\mathbf{A}_{g}$ respectively. Hence, $\mathbf{W}^{\prime}\mathbf{W}_{g}=\bm{0}$ is a sufficient and necessary condition for $\text{row}(\mathbf{J}_{g})\perp\text{row}(\mathbf{A}_{g})$. Moreover, note that in Lemma 2(b), $\mathbf{S}_{g}^{\prime}\mathbf{T}_{g}=\bm{0}$ directly implies that $\mathbf{A}_{g}^{\prime}\mathbf{J}_{g}=\bm{0}$, the latter being a sufficient condition for $\text{col}(\mathbf{J}_{g})\perp\text{col}(\mathbf{A}_{g})$. Therefore, in Corollary 1, $\mathbf{W}^{\prime}\mathbf{X}_{g}^{\prime}\mathbf{X}_{g}\mathbf{W}_{g}=\bm{0}$ provides a sufficient condition for $\text{col}(\mathbf{J}_{g})\perp\text{col}(\mathbf{A}_{g})$, which satisfies one of the identifiability constraints for the unique decomposition of $\mathbf{X}_{g}$ in Lemma 1. In Section 3, we describe the algorithm to solve for $\mathbf{W}$ and $\mathbf{W}_{g}$ respectively.
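As a concrete illustration of these conditions, the following R sketch constructs, for a single group, a basis $\mathbf{W}_{g}$ orthogonal to a given joint basis $\mathbf{W}$ in the sense of Corollary 1 (using the linear system (A.1) from Appendix A), and then verifies the row and column orthogonality of the resulting $\mathbf{J}_{g}$ and $\mathbf{A}_{g}$ numerically. The dimensions are arbitrary illustrative choices.

```r
# Numerical check of Corollary 1 for one group (illustrative dimensions).
# W_g is built from the null space of rbind(W', W' X_g' X_g), i.e. the
# linear system (A.1) in Appendix A, so that W'W_g = 0 and W'X_g'X_g W_g = 0.
set.seed(2)
n_g <- 15; p <- 10; K <- 2; K_g <- 3
X_g <- matrix(rnorm(n_g * p), n_g, p)
W   <- qr.Q(qr(matrix(rnorm(p * K), p, K)))           # joint basis (p x K)

M   <- rbind(t(W), t(W) %*% crossprod(X_g))           # 2K x p constraint matrix
V   <- svd(M, nu = 0, nv = ncol(M))$v                 # full set of right singular vectors
W_g <- V[, (qr(M)$rank + 1):p][, seq_len(K_g)]        # basis of the null space of M

J_g <- X_g %*% W   %*% solve(crossprod(W))   %*% t(W)
A_g <- X_g %*% W_g %*% solve(crossprod(W_g)) %*% t(W_g)

max(abs(t(W) %*% W_g))        # ~ 0: W'W_g = 0
max(abs(J_g %*% t(A_g)))      # ~ 0: row(J_g) is orthogonal to row(A_g)
max(abs(crossprod(J_g, A_g))) # ~ 0: col(J_g) is orthogonal to col(A_g)
```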

3 Model Estimation

The key to estimating (2.3) and (2.4) is the construction of the score matrices $\mathbf{S}_{g}$ and $\mathbf{T}_{g}$. To motivate our estimation procedure, in Section 3.1 we discuss the joint and individual score estimation under two special cases. In Section 3.2, we introduce an iterative algorithm for the general case.

3.1 Joint and Individual Score Estimation

We first consider the special case where $K_{g}=0$ for $g=1,\ldots,G$. Under this setup, the individual components vanish and (2.3) and (2.4) reduce to the following model:

$$\mathbf{X}=\mathbf{S}\mathbf{U}+\mathbf{E},\quad\bm{Y}=\mathbf{S}\bm{\alpha}+\mathbf{e}, \qquad (3.1)$$

where $\mathbf{S}=[\mathbf{S}_{1}^{\prime},\ldots,\mathbf{S}_{G}^{\prime}]^{\prime}$, $\mathbf{E}=[\mathbf{E}_{1}^{\prime},\ldots,\mathbf{E}_{G}^{\prime}]^{\prime}$, and $\mathbf{e}=[\mathbf{e}_{1}^{\prime},\ldots,\mathbf{e}_{G}^{\prime}]^{\prime}$.

The formulation of (3.1) covers many existing classic methods. For example, in PCR, $\mathbf{S}$ is chosen to be the score matrix of the first $K$ principal components of $\mathbf{X}^{\prime}\mathbf{X}$. However, PCR is inherently unsupervised and ignores the information from $\bm{Y}$. Among the other supervised methods, PLS regression is a popular approach that incorporates regression on the latent scores. When $K=1$ and $p<n$, the standard OLS can also be cast under the above setup.

According to the proof of our Lemma 2, $\mathbf{S}$ can be constructed with the basis matrix $\mathbf{W}$. For the estimation of $\mathbf{W}$, we utilize the continuum regression (CR) algorithm (Stone and Brooks, 1990), the result of which covers OLS, PLS, and PCR as special cases. For $k=1,\ldots,K$, CR sequentially solves $\mathbf{w}_{k}$ from the following optimization problem:

$$\begin{aligned}
\max_{\mathbf{w}}\quad & \text{cov}(\mathbf{X}\mathbf{w},\bm{Y})^{2}\,\text{var}(\mathbf{X}\mathbf{w})^{\gamma-1} \qquad (3.2)\\
\text{s.t.}\quad & \mathbf{w}^{\prime}\mathbf{w}=1,\quad \mathbf{w}^{\prime}\mathbf{X}^{\prime}\mathbf{X}\mathbf{w}_{j}=0;\quad j=1,\ldots,k-1 \text{ if } k\geq 2,
\end{aligned}$$

where $\text{cov}(\mathbf{X}\mathbf{w},\bm{Y})=\mathbf{w}^{\prime}\mathbf{X}^{\prime}\bm{Y}$ and $\text{var}(\mathbf{X}\mathbf{w})=\mathbf{w}^{\prime}\mathbf{X}^{\prime}\mathbf{X}\mathbf{w}$, once the columns of $\mathbf{X}$ and $\bm{Y}$ are centered to have mean zero. Here, $\gamma\geq 0$ is a tuning parameter that controls how much variability of $\mathbf{X}$ is taken into account for the construction of $\mathbf{S}$. When $\gamma\rightarrow\infty$, the objective function in (3.2) is dominated by $\text{var}(\mathbf{X}\mathbf{w})^{\gamma-1}$ and $\bm{Y}$ does not play a role. The CR solution of $\mathbf{W}$ then seeks the principal component directions that maximize the variation of $\mathbf{X}$. It can be shown that (3.2) coincides with the OLS and PLS solutions when $\gamma=0$ and $1$ respectively.

Let $\hat{\mathbf{W}}$ denote the solution to (3.2) and $\hat{\mathbf{S}}=\mathbf{X}\hat{\mathbf{W}}$. Then $\bm{\alpha}$ can be estimated by $\hat{\bm{\alpha}}=(\hat{\mathbf{S}}^{\prime}\hat{\mathbf{S}})^{-1}\hat{\mathbf{S}}^{\prime}\bm{Y}$. As illustrated in Lemma 2, $\mathbf{J}_{g}$ is the projection of $\mathbf{X}_{g}$ onto the column space spanned by $\mathbf{W}$. Hence, we have $\hat{\mathbf{J}}_{g}=\mathbf{X}_{g}\hat{\mathbf{W}}(\hat{\mathbf{W}}^{\prime}\hat{\mathbf{W}})^{-1}\hat{\mathbf{W}}^{\prime}$, which further gives $\hat{\mathbf{U}}=(\hat{\mathbf{W}}^{\prime}\hat{\mathbf{W}})^{-1}\hat{\mathbf{W}}^{\prime}$.
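For intuition, the following R sketch computes the first CR direction for the joint model (3.1) in the three special cases named above, using their standard closed forms: the OLS direction for $\gamma=0$, the PLS direction $\mathbf{X}^{\prime}\bm{Y}$ for $\gamma=1$, and the top eigenvector of $\mathbf{X}^{\prime}\mathbf{X}$ for $\gamma\rightarrow\infty$. It handles only the first component ($k=1$) and is a sketch rather than the general CR algorithm of Appendix B; the helper names are ours, and MASS::ginv is used for the pseudo-inverse.

```r
# First continuum regression direction for the joint model (3.1), using the
# closed forms of the three special cases (k = 1 only; X and Y are assumed
# to be column-centered). This is a sketch, not the general CR algorithm.
cr_first_direction <- function(X, Y, case = c("ols", "pls", "pcr")) {
  case <- match.arg(case)
  s <- drop(crossprod(X, Y))                                   # X'Y
  w <- switch(case,
    ols = drop(MASS::ginv(crossprod(X)) %*% s),                # gamma = 0
    pls = s,                                                   # gamma = 1
    pcr = eigen(crossprod(X), symmetric = TRUE)$vectors[, 1])  # gamma -> infinity
  w / sqrt(sum(w^2))                                           # enforce w'w = 1
}

# Joint fit with K = 1: score S = Xw and least squares coefficient alpha-hat.
fit_joint_k1 <- function(X, Y, case) {
  w <- cr_first_direction(X, Y, case)
  S <- drop(X %*% w)
  list(w = w, S = S, alpha = sum(S * Y) / sum(S^2))
}
```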

Next we consider model estimation under the special case where $K=0$. In this case, the joint component vanishes, and (2.3) and (2.4) reduce to the following individual model:

$$\mathbf{X}_{g}=\mathbf{T}_{g}\mathbf{U}_{g}+\mathbf{E}_{g},\quad\bm{Y}_{g}=\mathbf{T}_{g}\bm{\alpha}_{g}+\mathbf{e}_{g}. \qquad (3.3)$$

As in the joint score estimation above, we utilize CR to construct $\mathbf{T}_{g}=\mathbf{X}_{g}\mathbf{W}_{g}$ as a linear transformation of $\mathbf{X}_{g}$, where $\mathbf{W}_{g}$ is a $p\times K_{g}$ basis matrix whose columns span $\text{row}(\mathbf{A}_{g})$. Let $\mathbf{W}_{g}=[\mathbf{w}_{g1},\ldots,\mathbf{w}_{gK_{g}}]$. Given group $g$, for $k=1,\ldots,K_{g}$, CR sequentially solves $\mathbf{w}_{gk}$ from the following optimization problem:

$$\begin{aligned}
\max_{\mathbf{w}}\quad & \text{cov}(\mathbf{X}_{g}\mathbf{w},\bm{Y}_{g})^{2}\,\text{var}(\mathbf{X}_{g}\mathbf{w})^{\gamma-1} \qquad (3.4)\\
\text{s.t.}\quad & \mathbf{w}^{\prime}\mathbf{w}=1,\quad \mathbf{w}^{\prime}\mathbf{X}_{g}^{\prime}\mathbf{X}_{g}\mathbf{w}_{gj}=0;\quad j=1,\ldots,k-1 \text{ if } k\geq 2.
\end{aligned}$$

Denote by $\hat{\mathbf{W}}_{g}$ the solution to (3.4). Similar to the joint estimation, once $\hat{\mathbf{T}}_{g}=\mathbf{X}_{g}\hat{\mathbf{W}}_{g}$ is constructed, $\bm{\alpha}_{g}$ can be obtained as the least squares solution $\hat{\bm{\alpha}}_{g}=(\hat{\mathbf{T}}_{g}^{\prime}\hat{\mathbf{T}}_{g})^{-1}\hat{\mathbf{T}}_{g}^{\prime}\bm{Y}_{g}$. We then obtain $\hat{\mathbf{A}}_{g}=\mathbf{X}_{g}\hat{\mathbf{W}}_{g}(\hat{\mathbf{W}}_{g}^{\prime}\hat{\mathbf{W}}_{g})^{-1}\hat{\mathbf{W}}_{g}^{\prime}$ and $\hat{\mathbf{U}}_{g}=(\hat{\mathbf{W}}_{g}^{\prime}\hat{\mathbf{W}}_{g})^{-1}\hat{\mathbf{W}}_{g}^{\prime}$.
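A per-group version of the earlier sketch makes the $K=0$ case concrete: for $K_{g}=1$ and $\gamma=1$, the first direction reduces to the normalized $\mathbf{X}_{g}^{\prime}\bm{Y}_{g}$ (the PLS direction). The data below are simulated placeholders with assumed dimensions.

```r
# Group-specific fit (3.3)-(3.4) for a single group with K_g = 1 and gamma = 1,
# where the first CR direction is the normalized X_g'Y_g. Data are placeholders.
set.seed(3)
X_g <- matrix(rnorm(50 * 20), 50, 20); Y_g <- rnorm(50)

w_g <- drop(crossprod(X_g, Y_g)); w_g <- w_g / sqrt(sum(w_g^2))  # gamma = 1 direction
T_g <- drop(X_g %*% w_g)                                         # individual score
alpha_g <- sum(T_g * Y_g) / sum(T_g^2)                           # least squares coefficient
A_g_hat <- tcrossprod(X_g %*% w_g, w_g)   # X_g w_g (w_g'w_g)^{-1} w_g', with w_g'w_g = 1
```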

3.2 JICO Algorithm

In this section, we consider the general case where $K$ and $K_{g}$ can both be nonzero. Since solving (3.2) and (3.4) simultaneously can be hard when both joint and individual structures are specified in the full model (2.3) and (2.4), we propose to iteratively solve one while fixing the other. This leads to the following iterative procedure.

  • Given $\hat{\mathbf{W}}_{g}$, solve the following constrained problem sequentially for $\mathbf{w}_{1},\ldots,\mathbf{w}_{K}$:

$$\begin{aligned}
\max_{\mathbf{w}}\quad & \text{cov}(\mathbf{X}^{\text{Joint}}\mathbf{w},\bm{Y}^{\text{Joint}})^{2}\,\text{var}(\mathbf{X}^{\text{Joint}}\mathbf{w})^{\gamma-1} \qquad (3.5)\\
\text{s.t.}\quad & \mathbf{w}^{\prime}\mathbf{w}=1,\\
& \mathbf{w}^{\prime}{\mathbf{X}^{\text{Joint}}}^{\prime}\mathbf{X}^{\text{Joint}}\mathbf{w}_{j}=0;\quad j=1,\ldots,k-1 \text{ if } k\geq 2,\\
& \hat{\mathbf{W}}_{g}^{\prime}\mathbf{w}=\mathbf{0}_{K_{g}\times 1};\quad g=1,\ldots,G,\\
& \hat{\mathbf{W}}_{g}^{\prime}{\mathbf{X}_{g}^{\text{Indiv}}}^{\prime}\mathbf{X}_{g}^{\text{Joint}}\mathbf{w}=\mathbf{0}_{K_{g}\times 1};\quad g=1,\ldots,G.
\end{aligned}$$

  • Given $\hat{\mathbf{W}}$, for any $1\leq g\leq G$, solve the following constrained problem sequentially for $\mathbf{w}_{g,1},\ldots,\mathbf{w}_{g,K_{g}}$:

$$\begin{aligned}
\max_{\mathbf{w}}\quad & \text{cov}(\mathbf{X}_{g}^{\text{Indiv}}\mathbf{w},\bm{Y}_{g}^{\text{Indiv}})^{2}\,\text{var}(\mathbf{X}_{g}^{\text{Indiv}}\mathbf{w})^{\gamma-1} \qquad (3.6)\\
\text{s.t.}\quad & \mathbf{w}^{\prime}\mathbf{w}=1,\\
& \mathbf{w}^{\prime}{\mathbf{X}_{g}^{\text{Indiv}}}^{\prime}\mathbf{X}_{g}^{\text{Indiv}}\mathbf{w}_{gj}=0;\quad j=1,\ldots,k-1 \text{ if } k\geq 2,\\
& \hat{\mathbf{W}}^{\prime}\mathbf{w}=\mathbf{0}_{K\times 1},\\
& \hat{\mathbf{W}}^{\prime}{\mathbf{X}_{g}^{\text{Joint}}}^{\prime}\mathbf{X}_{g}^{\text{Indiv}}\mathbf{w}=\mathbf{0}.
\end{aligned}$$

  • Repeat the above two steps until convergence.

Note that in (3.5) and (3.6), we denote

$$\mathbf{X}^{\text{Joint}}=\left[\begin{array}{c}\mathbf{X}_{1}-\mathbf{T}_{1}\mathbf{U}_{1}\\ \vdots\\ \mathbf{X}_{G}-\mathbf{T}_{G}\mathbf{U}_{G}\end{array}\right],\quad\bm{Y}^{\text{Joint}}=\left[\begin{array}{c}\bm{Y}_{1}-\mathbf{T}_{1}\bm{\alpha}_{1}\\ \vdots\\ \bm{Y}_{G}-\mathbf{T}_{G}\bm{\alpha}_{G}\end{array}\right],$$

and $\mathbf{X}_{g}^{\text{Indiv}}=\mathbf{X}_{g}-\mathbf{S}_{g}\mathbf{U}$, $\bm{Y}_{g}^{\text{Indiv}}=\bm{Y}_{g}-\mathbf{S}_{g}\bm{\alpha}$, for $g=1,\ldots,G$. Moreover, the last two constraints in (3.5) and (3.6) correspond to the two sufficient conditions in Corollary 1 that ensure the identifiability conditions $\text{row}(\mathbf{J}_{g})\perp\text{row}(\mathbf{A}_{g})$ and $\text{col}(\mathbf{J}_{g})\perp\text{col}(\mathbf{A}_{g})$ needed in Lemma 1.

We formulate (3.5) and (3.6) as a generic CR problem and derive an algorithm to solve it in Appendix B. Furthermore, we describe the convergence criterion for the iterative procedure and give its pseudo code in Algorithm 1 of Appendix C. Empirically, the algorithm always meets our convergence criterion, albeit there is no theoretical guarantee. In practice, we recommend starting the algorithm with multiple initial values and choosing the one with the smallest cross-validated mean squared error. To predict the response using JICO estimates, we let $\mathbf{S}_{g,\text{test}}=\mathbf{X}_{g,\text{test}}\hat{\mathbf{W}}$ and $\mathbf{T}_{g,\text{test}}=\mathbf{X}_{g,\text{test}}\hat{\mathbf{W}}_{g}$, where $\mathbf{X}_{g,\text{test}}$ is the test set. The prediction of the response is then given by $\mathbf{S}_{g,\text{test}}\hat{\bm{\alpha}}+\mathbf{T}_{g,\text{test}}\hat{\bm{\alpha}}_{g}$; a simplified sketch of this alternating scheme is given below.
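The following R sketch conveys the flavor of the alternation with $K=K_{g}=1$ and $\gamma=1$ (PLS-type first directions). For brevity it residualizes only the responses and drops the orthogonality constraints of (3.5)-(3.6), so it is a simplified illustration rather than the full JICO algorithm; see Algorithm 1 in Appendix C and the R package for the complete version.

```r
# Simplified alternating scheme with K = K_g = 1 and gamma = 1: only the
# responses are residualized and the orthogonality constraints in (3.5)-(3.6)
# are omitted, so this is an illustration, not the full JICO algorithm.
jico_sketch <- function(Xs, Ys, n_iter = 20) {
  G <- length(Xs); p <- ncol(Xs[[1]])
  w <- rep(0, p); alpha <- 0
  wg <- replicate(G, rep(0, p), simplify = FALSE); alphag <- rep(0, G)

  for (it in seq_len(n_iter)) {
    # Joint step: fit the shared component on responses minus individual fits.
    Yj <- unlist(Map(function(X, Y, v, a) Y - drop(X %*% v) * a,
                     Xs, Ys, wg, alphag))
    Xj <- do.call(rbind, Xs)
    w  <- drop(crossprod(Xj, Yj)); w <- w / sqrt(sum(w^2))
    S  <- drop(Xj %*% w)
    alpha <- sum(S * Yj) / sum(S^2)

    # Individual step: fit each group's component on responses minus the joint fit.
    for (g in seq_len(G)) {
      Yi <- Ys[[g]] - drop(Xs[[g]] %*% w) * alpha
      v  <- drop(crossprod(Xs[[g]], Yi)); v <- v / sqrt(sum(v^2))
      Tg <- drop(Xs[[g]] %*% v)
      wg[[g]] <- v; alphag[g] <- sum(Tg * Yi) / sum(Tg^2)
    }
  }
  list(w = w, alpha = alpha, wg = wg, alphag = alphag)
}

# Prediction for a new matrix X_new from group g:
# drop(X_new %*% fit$w) * fit$alpha + drop(X_new %*% fit$wg[[g]]) * fit$alphag[g]
```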

In practice, we need to select the tuning parameters $K$, $K_{g}$, and $\gamma$. We describe how to select them in Appendix D. Since we use a grid search for the tuning parameters, if the difference between the ranks of the individual matrices and the joint matrix is large, our algorithm may take a long time to compute. The goal of our algorithm is to approximate the joint and individual structures of the data by low-rank matrices and to use the resulting low-rank matrices for prediction. Therefore, when the individual and joint matrices are too complex to be approximated by low-rank matrices, we suggest using alternative methods.

Finally, we point out that our method includes JIVE-predict (Kaplan and Lock, 2017) as a special case. JIVE-predict is a two-stage method that first implements JIVE on $\mathbf{X}$ and then regresses the responses on the loading matrix. When we let $\gamma\to\infty$ in (3.2) and (3.4), JICO is equivalent to performing JIVE on $\mathbf{X}$. For that reason, our method in that case is equivalent to JIVE-predict.

4 Simulation Studies

One significant advantage of our proposed model is its flexibility to lie between global and group-specific models. Moreover, the choice of the parameter $\gamma$ in CR allows it to identify the model that best fits the data. In this section, we conduct multiple simulation studies to further demonstrate the advantages of our proposed model.

We consider three simulation settings in this section. In the first two settings, we generate data according to models that contain both global and group-specific components. The data are generated in a way that PCR and PLS solutions are favored respectively. In the last setting, we simulate data from two special cases: a global model and a group-specific model. The data are simulated so that the OLS is favored for both cases. For all three settings, JICO can adaptively choose the correct model parameter γ\gamma so that it has the optimal performance. Moreover, we further illustrate how the rank selection impacts the performance of JICO by examining the results using mis-specified ranks.

We fix $G=2$, $p=200$, and $n_{1}=n_{2}=50$. In each replication, we generate 100 training samples to train the models and evaluate the corresponding Mean Squared Error (MSE) on an independent test set of 100 samples. We repeat the simulation 50 times.

For $g=1,\ldots,G$, we generate $\mathbf{X}_{g}$ as i.i.d. samples from $\mathcal{N}(\bm{0},\mathbf{I}_{p\times p})$. For simplicity, we generate $\bm{Y}_{g}$ from the following model with two latent components:

$$\bm{Y}_{g}=\alpha\bm{S}_{g}+\alpha_{g}\bm{T}_{g}+\mathbf{e}_{g}, \qquad (4.1)$$

where $\bm{S}_{g}=\mathbf{X}_{g}\mathbf{w}\in\mathbb{R}^{n_{g}}$ is the joint latent score vector with coefficient $\alpha$, $\bm{T}_{g}=\mathbf{X}_{g}\mathbf{w}_{g}\in\mathbb{R}^{n_{g}}$ is the individual latent score vector with coefficient $\alpha_{g}$, and $\mathbf{e}_{g}$ is generated as i.i.d. samples from $\mathcal{N}(0,0.04)$. Here, $\mathbf{w}$ and $\mathbf{w}_{g}$ are $p\times 1$ vectors constructed such that $\mathbf{w}^{\prime}\mathbf{w}_{g}=0$. We vary the choices of $\mathbf{w}$, $\mathbf{w}_{g}$, $\alpha$, and $\alpha_{g}$, as discussed below.

4.1 PCR Setting

In this section, we simulate a model that favors $\gamma=\infty$. In this case, the CR solutions to (3.2) and (3.4) coincide with PCR, which essentially takes the top eigenvectors of the corresponding covariance matrices.

To simulate this setup, given training data $\mathbf{X}=[\mathbf{X}_{1}^{\prime},\mathbf{X}_{2}^{\prime}]^{\prime}$, we let $\mathbf{w}$ be the top eigenvector of $\mathbf{X}^{\prime}\mathbf{X}$. We further set $\mathbf{w}_{g}$ as the top eigenvector of $\tilde{\mathbf{X}}_{g}^{\prime}\tilde{\mathbf{X}}_{g}$, where $\tilde{\mathbf{X}}_{g}=\mathbf{X}_{g}(\mathbf{I}-\mathbf{w}\mathbf{w}^{\prime})$ is the data matrix after projecting $\mathbf{X}_{g}$ onto the linear subspace orthogonal to $\mathbf{w}$. This projection ensures that the constructed $\mathbf{w}$ and $\mathbf{w}_{g}$ satisfy $\mathbf{w}^{\prime}\mathbf{w}_{g}=0$. To generate $\bm{Y}_{g}$, we let $\alpha=1$ and $\alpha_{g}=1$ for $g=1,2$.
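The construction described above can be reproduced directly; below is a short R sketch of this PCR-setting data generation (the noise standard deviation 0.2 corresponds to the variance 0.04 stated earlier).

```r
# PCR-setting data generation: w is the top eigenvector of X'X, and w_g is the
# top eigenvector of the covariance of X_g projected off w, so that w'w_g = 0.
set.seed(4)
G <- 2; p <- 200; n_g <- 50
Xs <- lapply(seq_len(G), function(g) matrix(rnorm(n_g * p), n_g, p))
X  <- do.call(rbind, Xs)

w  <- eigen(crossprod(X), symmetric = TRUE)$vectors[, 1]       # top eigenvector of X'X
ws <- lapply(Xs, function(X_g) {
  X_tilde <- X_g %*% (diag(p) - tcrossprod(w))                 # project off w
  eigen(crossprod(X_tilde), symmetric = TRUE)$vectors[, 1]
})
Ys <- Map(function(X_g, w_g) drop(X_g %*% w) + drop(X_g %*% w_g) +
            rnorm(nrow(X_g), sd = 0.2),                        # alpha = alpha_g = 1
          Xs, ws)
crossprod(w, ws[[1]])   # ~ 0 by construction
```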

We train JICO over a wide range of $\gamma\in[0,\infty)$, using different combinations of $K$, $K_{1}$, $K_{2}$, with a maximum of 300 iterations. Figure 4.1(a) shows the MSEs evaluated on the test data over 50 repetitions. For better illustration, we plot the MSE curves as a function of $a=\gamma/(\gamma+1)$, a one-to-one monotone map from $\gamma\in[0,\infty)$ to $a\in[0,1]$. In particular, $a=0$, $0.5$, and $1$ correspond to $\gamma=0$, $1$, and $\infty$, i.e., the cases of OLS, PLS, and PCR respectively. The solid curve shows the model performance given the true ranks $K=K_{1}=K_{2}=1$, whereas the gray curves show the performance of models with mis-specified ranks. In particular, we consider four mis-specified rank combinations. Two of them ($K=1$, $K_{1}=K_{2}=0$; $K=2$, $K_{1}=K_{2}=0$) correspond to joint models, and the other two ($K=0$, $K_{1}=K_{2}=1$; $K=0$, $K_{1}=K_{2}=2$) correspond to group-specific models. We can see from Figure 4.1(a) that the absolute minimum is attained by the model with true ranks at $a=1$, which corresponds to the underlying true model. Looking at the curves over the spectrum of $a$ as a whole, the joint models with $K=1$ or $2$ and $K_{1}=K_{2}=0$ always perform worse than the model with $K=K_{1}=K_{2}=1$, because they are unable to capture the group-specific information in the underlying model. The model with true ranks performs better than the individual models with $K=0$, $K_{1}=K_{2}=1$ or $2$ for larger values of $a$, because the latter cannot capture as much global information as the former. However, the model with $K=K_{1}=K_{2}=1$ performs worse than the individual models for smaller values of $a$, where the latter achieve considerably better performance. This means that the choice of optimal ranks for our model can be sensitive to the choice of $\gamma$. For smaller $\gamma$ values, individual models tend to be more reliable under the PCR setting. We notice that the end of the curve is not very smooth when $K=2$, $K_{1}=K_{2}=0$. One possible reason is that the solution path of CR can sometimes be discontinuous with respect to $\gamma$ (Björkström and Sundberg, 1996), so the CR algorithm may be numerically unstable for certain $\gamma$ values.

(a) PCR Example; (b) PLS Example.

Figure 4.1: MSE curves for JICO models with different ranks under the PCR setting (a) and PLS setting (b). This figure appears in color in the electronic version of this article, and any mention of color refers to that version.

We further illustrate the performance of JICO by comparing it with several existing methods: ridge regression (Ridge), partial least squares (PLS), and principal component regression (PCR). For JICO, we select the models trained under the true ranks $K=K_{1}=K_{2}=1$ (performance as illustrated by the solid curve in Figure 4.1(a)), with $\gamma=0,1,\infty$, which correspond to the cases of OLS, PLS, and PCR respectively. For a fair comparison, for the PLS and PCR methods, we fix the number of components to be 2 for both the global fit and the group-specific fit. Table 1(a) summarizes the MSEs of these methods. The numbers in brackets are standard errors. The first two columns summarize the performance for each group ($g=1,2$), and the last column summarizes the overall performance. The JICO model with $\gamma=\infty$ performs significantly better than the rest, because it agrees with the underlying true model. Among the other mis-specified methods, group-specific PLS is relatively more robust to model mis-specification.

(a) PCR Example
Method                  g=1             g=2             Overall
JICO, $\gamma=0$        1.994 (0.063)   2.012 (0.068)   2.003 (0.053)
JICO, $\gamma=1$        0.679 (0.018)   0.701 (0.026)   0.69 (0.017)
JICO, $\gamma=\infty$   0.04 (0.001)    0.04 (0.001)    0.04 (0.001)
Global Ridge            1.734 (0.056)   1.78 (0.065)    1.757 (0.052)
Global PLS              1.163 (0.033)   1.194 (0.045)   1.178 (0.031)
Global PCR              0.946 (0.04)    0.977 (0.044)   0.961 (0.022)
Group-specific Ridge    0.252 (0.009)   0.27 (0.009)    0.261 (0.005)
Group-specific PLS      0.254 (0.009)   0.272 (0.009)   0.263 (0.005)
Group-specific PCR      0.68 (0.042)    0.71 (0.05)     0.695 (0.032)

(b) PLS Example
Method                  g=1             g=2             Overall
JICO, $\gamma=0$        0.57 (0.023)    0.567 (0.021)   0.569 (0.018)
JICO, $\gamma=1$        0.211 (0.008)   0.218 (0.008)   0.215 (0.006)
JICO, $\gamma=\infty$   1.236 (0.038)   1.277 (0.037)   1.256 (0.025)
Global Ridge            1.698 (0.064)   1.742 (0.06)    1.72 (0.055)
Global PLS              0.3 (0.011)     0.297 (0.01)    0.299 (0.008)
Global PCR              1.229 (0.036)   1.298 (0.041)   1.263 (0.025)
Group-specific Ridge    0.375 (0.013)   0.425 (0.016)   0.4 (0.01)
Group-specific PLS      0.412 (0.014)   0.406 (0.016)   0.409 (0.008)
Group-specific PCR      1.234 (0.037)   1.25 (0.036)    1.242 (0.024)

(c) OLS Example (a)
Method                  g=1             g=2             Overall
JICO, $\gamma=0$        0.082 (0.002)   0.083 (0.003)   0.082 (0.002)
JICO, $\gamma=1$        0.403 (0.011)   0.419 (0.011)   0.411 (0.007)
JICO, $\gamma=\infty$   1.006 (0.031)   1.07 (0.03)     1.038 (0.02)
Global Ridge            0.084 (0.004)   0.084 (0.003)   0.084 (0.003)
Global PLS              0.221 (0.007)   0.226 (0.006)   0.223 (0.005)
Global PCR              0.991 (0.032)   1.069 (0.030)   1.030 (0.020)
Group-specific Ridge    0.574 (0.017)   0.599 (0.024)   0.586 (0.013)
Group-specific PLS      0.572 (0.016)   0.599 (0.024)   0.585 (0.013)
Group-specific PCR      0.996 (0.032)   1.061 (0.030)   1.028 (0.021)

(d) OLS Example (b)
Method                  g=1             g=2             Overall
JICO, $\gamma=0$        0.063 (0.002)   0.066 (0.004)   0.064 (0.002)
JICO, $\gamma=1$        0.257 (0.009)   0.27 (0.009)    0.264 (0.006)
JICO, $\gamma=\infty$   1.004 (0.031)   1.002 (0.03)    1.003 (0.023)
Global Ridge            0.646 (0.021)   0.673 (0.024)   0.66 (0.019)
Global PLS              0.957 (0.027)   0.971 (0.032)   0.964 (0.022)
Global PCR              1.023 (0.031)   1.016 (0.031)   1.019 (0.023)
Group-specific Ridge    0.076 (0.004)   0.072 (0.005)   0.074 (0.003)
Group-specific PLS      0.113 (0.003)   0.116 (0.005)   0.115 (0.003)
Group-specific PCR      0.978 (0.03)    0.987 (0.029)   0.983 (0.022)

Table 1: Groupwise and overall MSEs under the PCR, PLS, and OLS settings. Numbers in brackets are standard errors.

4.2 PLS Setting

In this section, we consider a model setup that is more favorable to $\gamma=1$. In this scenario, the CR solutions to (3.2) and (3.4) coincide with the PLS solutions. As in Section 4.1, we construct the weights as linear transformations of eigenvectors.

Given training data $\mathbf{X}=[\mathbf{X}_{1}^{\prime},\mathbf{X}_{2}^{\prime}]^{\prime}$, denote by $\mathbf{V}_{p\times q}$ the matrix of the top $q$ eigenvectors of $\mathbf{X}^{\prime}\mathbf{X}$. We let $\mathbf{w}=\mathbf{V}\bm{1}_{q}/\sqrt{q}$, where $\bm{1}_{q}$ denotes a $q\times 1$ vector with all elements equal to $1$. In this way, the top $q$ eigenvectors contribute equally to the construction of $\bm{S}_{g}$. Similarly, we let $\tilde{\mathbf{X}}_{g}=\mathbf{X}_{g}(\mathbf{I}-\mathbf{w}\mathbf{w}^{\prime})$ and $\mathbf{V}_{g}$ be the $p\times q_{g}$ matrix of the top $q_{g}$ eigenvectors of $\tilde{\mathbf{X}}_{g}^{\prime}\tilde{\mathbf{X}}_{g}$. Then we let $\mathbf{w}_{g}=\mathbf{V}_{g}\bm{1}_{q_{g}}/\sqrt{q_{g}}$. To construct a model more favorable to PLS, in this section, we let $q=n/2$ and $q_{g}=n_{g}/2$. We generate $\bm{Y}_{g}$ from (4.1) with $\alpha=1$ and $\alpha_{g}=0.5$.
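A small helper makes the construction above explicit; it averages the top $q$ eigenvectors with equal weights and can be applied to $\mathbf{X}$ and to each projected $\mathbf{X}_{g}$ as described. The helper name and the commented usage (which reuses objects from the PCR-setting sketch above) are our assumptions.

```r
# Equal-weight average of the top q eigenvectors of a covariance-type matrix,
# used to build w and w_g in the PLS setting (helper name is ours).
top_eig_avg <- function(M, q) {
  V <- eigen(M, symmetric = TRUE)$vectors[, seq_len(q), drop = FALSE]
  drop(V %*% rep(1, q)) / sqrt(q)
}
# Example usage with X and Xs from the PCR-setting sketch above:
# w   <- top_eig_avg(crossprod(X), q = nrow(X) / 2)
# w_1 <- top_eig_avg(crossprod(Xs[[1]] %*% (diag(ncol(X)) - tcrossprod(w))),
#                    q = nrow(Xs[[1]]) / 2)
```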

Similar to the PCR setting, in Figure 4.1(b) we illustrate the MSE curves of JICO models with different rank combinations over the spectrum of $a$, where $\gamma=a/(1-a)$. Again, the solid curve represents the model with true ranks, while the gray curves represent models with mis-specified ranks. The absolute minimum is attained by the solid curve at $a$ around 0.5, which corresponds to the underlying true model. Moreover, the solid curve gives almost uniformly the best performance over the spectrum of $a$ compared with the gray curves, except on a small range of $a$ close to 0. Hence, under the PLS setting, the optimal ranks can be less sensitive to the choice of $\gamma$. For small values of $a$, the solid curve almost overlaps with the gray curve representing the joint model with $K=1$, $K_{1}=K_{2}=0$. This means that when $\gamma$ is close to 0, the individual signals identified by the full model with $K=K_{1}=K_{2}=1$ can be ignored. Therefore, the two group-specific models that capture more individual information give the best performance in this case. For $a$ values closer to 1, the gray curve representing the joint model with $K=2$, $K_{1}=K_{2}=0$ is very close to the solid curve. This means that the effects of individual components estimated by JICO tend to become more similar across groups for larger $\gamma$.

In Table 1(b), we summarize the MSEs of JICO models trained with the true ranks $K=K_{1}=K_{2}=1$ and $\gamma=0,1,\infty$, along with the other methods described in Section 4.1. JICO with $\gamma=1$ shows the best performance among all methods, followed by the global PLS method, since the true model favors PLS and the coefficient $\alpha_{g}=0.5$ for the group-specific component is relatively small.

4.3 OLS Setting

In this section, we simulate the setting that favors $\gamma=0$, which corresponds to the case of OLS in CR. It is shown in Stone and Brooks (1990) that when $\gamma=0$, there is only one non-degenerate direction that can be constructed from the CR algorithm. Hence, under the JICO framework, the model that favors $\gamma=0$ embraces two special cases: a global model with $K=1$, $K_{g}=0$ and a group-specific model with $K=0$, $K_{g}=1$.

For the two cases, we simulate $\bm{Y}_{g}$ with (a) $\alpha=1$, $\alpha_{g}=0$ and (b) $\alpha=0$, $\alpha_{g}=1$ respectively. The construction of $\mathbf{w}$ and $\mathbf{w}_{g}$ is the same as that in Section 4.2 with $q=n$ and $q_{g}=n_{g}$.

Figure 4.2 illustrates the MSE curves for the two cases, where (a) represents the global model and (b) represents the group-specific model. In both cases, the absolute minimum is found on the solid curves at $a=0$, which represent the models with true ranks $K=1$, $K_{g}=0$ and $K=0$, $K_{g}=1$ respectively. In (a), there are two competitive models against the model with true ranks: another global model with $K=2$, $K_{g}=0$ and the model with $K=K_{1}=K_{2}=1$. They both achieve the same performance as the solid curve at $a=0$, and stay low at larger values of $a$. This is because, when $\gamma$ is mis-specified, additional model ranks help capture more information from the data. The $K=2$, $K_{g}=0$ model performs better because the underlying model is a global model. This is also true for (b). The global minimum is found at $a=0$ on the solid curve, while the $K=0$, $K_{g}=2$ model performs better when $a$ gets larger. Again, this is because a larger $K_{g}$ helps capture more information from the data. The $K=K_{1}=K_{2}=1$ model does not perform as well, because the estimated joint information dominates, which does not agree with the true model. We observe some discontinuities on the $K=2$, $K_{g}=0$ curve, since the CR solution path can sometimes be discontinuous with respect to $\gamma$, as discussed in the PCR setting in Section 4.1.

(a) $\alpha=1$, $\alpha_{g}=0$; (b) $\alpha=0$, $\alpha_{g}=1$.

Figure 4.2: MSE curves for JICO models with different ranks under OLS settings (a) and (b). (a) is generated under a global model and (b) is generated under a group-specific model. This figure appears in color in the electronic version of this article, and any mention of color refers to that version.

In Table 1(c) and (d), we summarize the MSEs of JICO models trained with the true ranks and $\gamma=0,1,\infty$, along with the other methods described in Section 4.1. For a fair comparison, the number of components for PCR and PLS is chosen to be 1 for both the global and group-specific fits. The JICO model with $\gamma=0$ and Ridge always achieve better performance than all other methods. It is interesting to note that in (c), the JICO models with $\gamma=1$ and $\infty$ coincide with global PLS and PCR respectively, and hence they achieve the same performance. Similarly, in (d), the JICO models with $\gamma=1$ and $\infty$ coincide with group-specific PLS and PCR respectively, and they achieve the same performance correspondingly. In addition, when $K=1$, $K_{g}=0$, the solution of the CR algorithm coincides with the global OLS model; thus the JICO model with $\gamma=0$ and the global Ridge have similar performance in (c). Similarly, when $K=0$, $K_{g}=1$, the JICO model with $\gamma=0$ and the group-specific Ridge have similar performance in (d).

5 Applications to ADNI Data Analysis

We apply our proposed method to analyze data from the Alzheimer's Disease (AD) Neuroimaging Initiative (ADNI). It is well known that AD accounts for most forms of dementia, characterized by progressive cognitive and memory deficits. The increasing incidence of AD makes it a very important health issue and has attracted a lot of scientific attention. To track AD progression, it is important to develop efficient methods for predicting disease status and clinical scores (e.g., the Mini Mental State Examination (MMSE) score and the AD Assessment Scale-Cognitive Subscale (ADAS-Cog) score). In this analysis, we are interested in predicting the ADAS-Cog score by features extracted from 93 brain regions in structural magnetic resonance imaging (MRI) scans. All subjects in our analysis are from the ADNI2 phase of the study. There are 494 subjects in total in our analysis, belonging to 3 subgroups: NC (178), eMCI (178), and AD (145), where the numbers in parentheses indicate the sample sizes for each subgroup. As a reminder, NC stands for Normal Control, and eMCI stands for the early stage of Mild Cognitive Impairment in AD progression.

For each group, we randomly partition the data into two parts: 80% for training the model and the rest for testing. We repeat the random split 50 times. The testing MSEs and the corresponding standard errors are reported in Table 2, with both groupwise and overall performance summarized. We compare our proposed JICO model with three methods: ridge regression (Ridge), PLS, and PCR. We perform both a global and a group-specific fit for Ridge, PLS, and PCR, where the regularization parameter in Ridge and the number of components in PCR or PLS are tuned by 10-fold cross validation (CV). For our proposed JICO model, we report results with fixed $\gamma=0,0.25,1,\infty$, and with $\gamma$ tuned by CV. In practice, using exhaustive search to select the optimal values for $K$ and $K_{g}$ can be computationally cumbersome, because the number of combinations grows exponentially with the number of candidates for each parameter. Based on our numerical experience, we find that setting all $K_{g}$ to be equal does not affect the prediction performance much; details are discussed in Appendix D. Therefore, in all these cases, the optimal ranks for JICO are chosen by an exhaustive search over $K\in\{0,1\}$ and $K_{1}=K_{2}=K_{3}\in\{0,1\}$ to see which combination gives the best MSE. We choose $K$ and $K_{g}$ to be small to improve model interpretability. The optimal value of $\gamma$ is chosen by 10-fold CV.

As shown in Table 2, JICO performs the best among all competitors. Fitting JICO with $\gamma=0.25$ yields the smallest overall MSE. JICO with parameters chosen by CV performs slightly worse, but is still better than the other global or group-specific methods. The results of JICO with $\gamma=0,1$, and $\infty$, which correspond to OLS, PLS, and PCR, are also provided in Table 2. Even though their prediction is not the best, an interesting observation is that they always perform better than their global or group-specific counterparts. For example, when $\gamma=1$, JICO has a much better overall prediction than the group-specific PLS. This indicates that it is beneficial to capture global and individual structures for regression when subpopulations exist in the data.

In Table 2, the global models perform the worst, because they do not take group heterogeneity into consideration. The group-specific Ridge appears to be the most competitive among the group-specific methods. Note that for the AD group, our JICO model with $\gamma=0.25$ or tuned $\gamma$ outperforms the group-specific Ridge by a large margin.

Method                  NC              eMCI            AD              Overall
JICO, $\gamma=0$        6.671 (0.137)   11.319 (0.309)  55.798 (1.556)  22.821 (0.466)
JICO, $\gamma=0.25$     6.316 (0.121)   10.394 (0.279)  40.853 (1.294)  17.951 (0.394)
JICO, $\gamma=1$        6.443 (0.124)   10.353 (0.291)  44.054 (1.449)  18.929 (0.441)
JICO, $\gamma=\infty$   6.608 (0.138)   11.121 (0.308)  49.997 (1.832)  21.013 (0.558)
JICO, CV                6.414 (0.129)   10.333 (0.289)  41.297 (1.348)  18.096 (0.401)
Global Ridge            23.450 (0.751)  21.276 (0.796)  63.989 (2.657)  34.692 (0.840)
Global PLS              26.310 (0.787)  22.672 (0.915)  68.193 (3.183)  37.442 (0.982)
Global PCR              25.228 (0.771)  21.966 (0.802)  69.541 (2.969)  37.209 (0.907)
Group-specific Ridge    6.336 (0.116)   10.353 (0.278)  42.271 (1.315)  18.364 (0.392)
Group-specific PLS      6.656 (0.136)   11.298 (0.306)  48.434 (1.725)  20.629 (0.524)
Group-specific PCR      6.656 (0.136)   11.346 (0.304)  47.357 (1.484)  20.327 (0.454)

Table 2: Groupwise and overall MSEs on the ADNI data. Numbers in brackets are standard errors.

To make the results more interpretable, we further apply the JICO model to the NC and AD groups only. We run 50 replications of 10-fold CV to see which combination of tuning parameters gives the smallest overall MSE. The best choice is $\gamma=\infty$, $K=1$, $K_{NC}=K_{AD}=3$. We then apply JICO with this choice of tuning parameters and show the heatmaps of the estimated $\hat{\mathbf{J}}_{g}$ (left column) and $\hat{\mathbf{A}}_{g}$ (right column) in Figure 5.1. Rows of the heatmaps represent samples and columns represent MRI features. We use Ward's linkage to perform hierarchical clustering on the rows of $\hat{\mathbf{J}}_{g}$, and arrange the rows of $\hat{\mathbf{J}}_{g}$ and $\hat{\mathbf{A}}_{g}$ in the same order for each group. Moreover, we apply the same clustering algorithm to the columns of $\hat{\mathbf{J}}_{g}$ to arrange the columns in the same order across the two disease groups for both joint and individual structures. Figure 5.1 shows that JICO separates joint and individual structures effectively. The joint structures across the two disease groups share a very similar pattern, whereas the individual structures appear to be very distinct. We further magnify the right column of Figure 5.1 in Figure 5.2, with the brain region names listed. We find that the variation in $\hat{\mathbf{A}}_{g}$ for the AD group is much larger than its counterpart for the NC group. We highlight the brain regions that differ the most between the two groups. The highlighted regions play crucial roles in human cognition and are thus important for early AD diagnosis (Michon et al., 1994; Killiany et al., 1993). For example, Michon et al. (1994) suggested that anosognosia in AD results in part from frontal dysfunction. Killiany et al. (1993) showed that the temporal horn of the lateral ventricles can be used as an antemortem marker of AD.
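The heatmap ordering described above can be sketched as follows in R; the specific Ward linkage option and the random placeholder matrices standing in for $\hat{\mathbf{J}}_{g}$ and $\hat{\mathbf{A}}_{g}$ are our assumptions for illustration.

```r
# Sketch of the heatmap row/column ordering: Ward's linkage hierarchical
# clustering on the rows (and columns) of J_g-hat, with the same order then
# reused for A_g-hat. J_hat and A_hat below are random placeholders.
set.seed(5)
J_hat <- matrix(rnorm(178 * 93), 178, 93)                    # placeholder for J_g-hat
A_hat <- matrix(rnorm(178 * 93), 178, 93)                    # placeholder for A_g-hat

row_order <- hclust(dist(J_hat),    method = "ward.D2")$order
col_order <- hclust(dist(t(J_hat)), method = "ward.D2")$order

J_ordered <- J_hat[row_order, col_order]
A_ordered <- A_hat[row_order, col_order]                     # same ordering reused
# image(t(J_ordered)) and image(t(A_ordered)) would display the heatmaps
```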

$\mathbf{J}_{g}$ (left) and $\mathbf{A}_{g}$ (right), for NC (top) and AD (bottom).

Figure 5.1: Heatmaps of joint and individual structures from NC and AD estimated from JICO. This figure appears in color in the electronic version of this article, and any mention of color refers to that version.

Figure 5.2: Heatmaps of individual structures from NC and AD estimated from JICO with MRI feature names. The highlighted regions are frontal lobe WM, temporal lobe WM, occipital lobe WM, parietal lobe WM, superior parietal lobule, and middle temporal gyrus. The top row is from NC and the bottom row is from AD. This figure appears in color in the electronic version of this article, and any mention of color refers to that version.

6 Discussion

In this paper, we propose JICO, a latent component regression model for multi-group heterogeneous data. Our proposed model decomposes the response into jointly shared and group-specific components, which are driven by low-rank approximations of joint and individual structures from the predictors respectively. For model estimation, we propose an iterative procedure to solve for the model components and utilize the CR algorithm, which covers OLS, PLS, and PCR as special cases. As a result, the proposed procedure is able to extend many regression algorithms covered by CR to the setting of heterogeneous data. Extensive simulation studies and a real data analysis on the ADNI data further demonstrate the competitive performance of JICO.

JICO is designed to be very flexible for multi-group data. It is able to choose the optimal parameter that determines the regression algorithm best suited to the data, so that prediction power is retained. At the same time, it is also able to select the optimal joint and individual ranks that best describe the degree of heterogeneity residing in each subgroup. The application of JICO to the ADNI data illustrates that our proposed model can provide informative visualization of the joint and individual components from the entire dataset without losing much prediction power.

DATA AVAILABILITY STATEMENT

The data that support the findings of this paper are available in the Alzheimer’s Disease Neuroimaging Initiative at https://adni.loni.usc.edu/.

Appendix A Proofs

Proof of Lemma 1. The proof of the first part of this lemma follows similar arguments as in Feng et al. (2018). Define $\text{row}(\mathbf{J})=\bigcap_{g=1}^{G}\text{row}(\tilde{\mathbf{X}}_{g})$. We assume that $\text{row}(\mathbf{J})\neq\{\bm{0}\}$ for non-trivial cases. For each $g$, $\mathbf{J}_{g}$ and $\mathbf{A}_{g}$ can be obtained by projecting $\tilde{\mathbf{X}}_{g}$ onto $\text{row}(\mathbf{J})$ and its orthogonal complement in $\text{row}(\tilde{\mathbf{X}}_{g})$. Thus, by construction, we have $\tilde{\mathbf{X}}_{g}=\mathbf{J}_{g}+\mathbf{A}_{g}$ and $\text{row}(\mathbf{J}_{g})\perp\text{row}(\mathbf{A}_{g})$. Since for all $g$ we have $\text{row}(\mathbf{A}_{g})\perp\text{row}(\mathbf{J})$, $\bigcap_{g=1}^{G}\text{row}(\mathbf{A}_{g})$ has a zero projection matrix. Therefore, we have $\bigcap_{g=1}^{G}\text{row}(\mathbf{A}_{g})=\{\mathbf{0}\}$. On the other hand, since $\text{row}(\mathbf{J}_{g})\subset\text{row}(\tilde{\mathbf{X}}_{g})$, the orthogonal projection of $\tilde{\mathbf{X}}_{g}$ onto $\text{row}(\mathbf{J}_{g})$ is unique. Therefore, the matrices $\mathbf{J}_{g}$ and $\mathbf{A}_{g}$ are uniquely defined.

For the second part of Lemma 1, note that $\tilde{\bm{Y}}_{g}\in\text{col}(\tilde{\mathbf{X}}_{g})\subset\text{col}(\mathbf{J}_{g})+\text{col}(\mathbf{A}_{g})$. Then, $\bm{J}_{g}^{Y}$ and $\bm{A}_{g}^{Y}$ can be obtained by projecting $\tilde{\bm{Y}}_{g}$ onto $\text{col}(\mathbf{J}_{g})$ and $\text{col}(\mathbf{A}_{g})$ respectively. Because $\text{col}(\mathbf{J}_{g})\perp\text{col}(\mathbf{A}_{g})$, $\bm{J}_{g}^{Y}$ and $\bm{A}_{g}^{Y}$ are uniquely defined.

Proof of Lemma 2. We explicitly describe how to find $\mathbf{J}_{g}$, $\mathbf{A}_{g}$, $\mathbf{U}$, $\mathbf{U}_{g}$, $\mathbf{S}_{g}$, $\mathbf{T}_{g}$, $\bm{\alpha}$, $\bm{\alpha}_{g}$ to satisfy the requirements. First, $\mathbf{J}_{g}$ can be obtained by finding an arbitrary set of bases that span $\bigcap_{g=1}^{G}\text{row}(\mathbf{X}_{g})$. Then, $\mathbf{A}_{g}$ can be obtained by solving a system of linear equations. We prove that such $\mathbf{J}_{g}$ and $\mathbf{A}_{g}$ satisfy conditions (i)-(iii) in Lemma 1 and $\text{col}(\mathbf{J}_{g})\perp\text{col}(\mathbf{A}_{g})$. Given $\mathbf{J}_{g}$ and $\mathbf{A}_{g}$, we construct $\mathbf{U},\mathbf{U}_{g},\mathbf{S}_{g},\mathbf{T}_{g}$ so that they satisfy $\mathbf{J}_{g}=\mathbf{S}_{g}\mathbf{U}$, $\mathbf{A}_{g}=\mathbf{T}_{g}\mathbf{U}_{g}$, and $\mathbf{S}_{g}^{\prime}\mathbf{T}_{g}=\bm{0}_{K\times K_{g}}$.

Let $\mathbf{w}_{1},\ldots,\mathbf{w}_{K}\in\mathbb{R}^{p}$ be an arbitrary set of bases that span $\bigcap_{g=1}^{G}\text{row}(\mathbf{X}_{g})$. Denote $\mathbf{W}_{p\times K}=[\mathbf{w}_{1},\ldots,\mathbf{w}_{K}]$. Let $\mathbf{J}_{g}=\mathbf{X}_{g}\mathbf{W}(\mathbf{W}^{\prime}\mathbf{W})^{-1}\mathbf{W}^{\prime}$. Next, we show that $\mathbf{J}_{g}$ satisfies condition (i) in Lemma 1 for any $1\leq g_{1}\neq g_{2}\leq G$. It suffices to show that $\text{row}(\mathbf{J}_{g_{1}})\subset\text{row}(\mathbf{J}_{g_{2}})$, i.e., there exists $\mathbf{C}_{g_{1},n_{g_{1}}\times n_{g_{2}}}^{g_{2}}$ such that $\mathbf{C}^{g_{2}}_{g_{1}}\mathbf{J}_{g_{2}}=\mathbf{J}_{g_{1}}$. Since the columns of $\mathbf{W}$ form the bases of $\bigcap_{g=1}^{G}\text{row}(\mathbf{X}_{g})$, there exists $\mathbf{Q}^{g_{2}}_{K\times n_{g_{2}}}$ such that $\mathbf{Q}^{g_{2}}\mathbf{X}_{g_{2}}=\mathbf{W}^{\prime}$. Then, we have $\mathbf{Q}^{g_{2}}\mathbf{J}_{g_{2}}=\mathbf{Q}^{g_{2}}\mathbf{X}_{g_{2}}\mathbf{W}(\mathbf{W}^{\prime}\mathbf{W})^{-1}\mathbf{W}^{\prime}=(\mathbf{W}^{\prime}\mathbf{W})(\mathbf{W}^{\prime}\mathbf{W})^{-1}\mathbf{W}^{\prime}=\mathbf{W}^{\prime}$. Let $\mathbf{R}_{g_{1}}=\mathbf{X}_{g_{1}}\mathbf{W}(\mathbf{W}^{\prime}\mathbf{W})^{-1}$ and $\mathbf{C}_{g_{1}}^{g_{2}}=\mathbf{R}_{g_{1}}\mathbf{Q}^{g_{2}}$; then we have

$$\mathbf{C}_{g_{1}}^{g_{2}}\mathbf{J}_{g_{2}}=\mathbf{R}_{g_{1}}\mathbf{Q}^{g_{2}}\mathbf{J}_{g_{2}}=\mathbf{X}_{g_{1}}\mathbf{W}(\mathbf{W}^{\prime}\mathbf{W})^{-1}\mathbf{W}^{\prime}=\mathbf{J}_{g_{1}}.$$

Given $\mathbf{W}$, we propose to solve

$$\begin{bmatrix}\mathbf{W}^{\prime}\\ \mathbf{W}^{\prime}\mathbf{X}_{g}^{\prime}\mathbf{X}_{g}\end{bmatrix}\mathbf{x}=\bm{0}, \qquad (A.1)$$

where $\mathbf{x}\in\mathbb{R}^{p}$ is unknown. We first show that (A.1) has non-zero solutions. Since $\mathbf{Q}^{g}\mathbf{X}_{g}=\mathbf{W}^{\prime}$, (A.1) can be rewritten as $\tilde{\mathbf{Q}}^{g}\mathbf{B}_{g}\mathbf{X}_{g}\mathbf{x}=\bm{0}$, where

$$\tilde{\mathbf{Q}}^{g}=\begin{pmatrix}\mathbf{Q}^{g}&\\ &\mathbf{Q}^{g}\end{pmatrix}\text{ and }\mathbf{B}_{g}=\begin{pmatrix}\mathbf{I}_{n_{g}\times n_{g}}\\ \mathbf{X}_{g}\mathbf{X}_{g}^{\prime}\end{pmatrix}.$$

Since

$$\text{rank}(\tilde{\mathbf{Q}}^{g}\mathbf{B}_{g}\mathbf{X}_{g})\leq\min\{\text{rank}(\tilde{\mathbf{Q}}^{g}),\text{rank}(\mathbf{B}_{g}),\text{rank}(\mathbf{X}_{g})\}\leq\min\{2K,n_{g},\text{rank}(\mathbf{X}_{g})\}\leq\text{rank}(\mathbf{X}_{g})<p,$$

(A.1) has non-zero solutions.

Let $\mathbf{W}_{g}=[\mathbf{w}_{g,1},\ldots,\mathbf{w}_{g,K_{g}}]$ be an arbitrary set of bases that spans the space of solutions to (A.1). Let $\mathbf{A}_{g}=\mathbf{X}_{g}\mathbf{W}_{g}(\mathbf{W}_{g}^{\prime}\mathbf{W}_{g})^{-1}\mathbf{W}_{g}^{\prime}$. Next, we show that $\mathbf{A}_{g}$ satisfies $(ii)$-$(iii)$ in Lemma 1 and $\text{col}(\mathbf{A}_{g})\perp\text{col}(\mathbf{J}_{g})$. Since the columns of $\mathbf{W}_{g}$ satisfy (A.1), we have $\mathbf{W}^{\prime}\mathbf{W}_{g}=\bm{0}$ and $\mathbf{W}^{\prime}\mathbf{X}_{g}^{\prime}\mathbf{X}_{g}\mathbf{W}_{g}=\bm{0}$. Then, by definition, $\mathbf{A}_{g}^{\prime}\mathbf{J}_{g}=\bm{0}$ and $\mathbf{J}_{g}\mathbf{A}_{g}^{\prime}=\bm{0}$, which imply that $\text{row}(\mathbf{A}_{g})\perp\text{row}(\mathbf{J}_{g})$ and $\text{col}(\mathbf{A}_{g})\perp\text{col}(\mathbf{J}_{g})$. Since for all $g$, $\text{row}(\mathbf{A}_{g})$ is perpendicular to the columns of $\mathbf{W}$, we have $\bigcap_{g=1}^{G}\text{row}(\mathbf{A}_{g})=\{\bm{0}\}$.

Finally, letting $\mathbf{S}_{g}=\mathbf{X}_{g}\mathbf{W}$, $\mathbf{T}_{g}=\mathbf{X}_{g}\mathbf{W}_{g}$, $\mathbf{U}=(\mathbf{W}^{\prime}\mathbf{W})^{-1}\mathbf{W}^{\prime}$, and $\mathbf{U}_{g}=(\mathbf{W}_{g}^{\prime}\mathbf{W}_{g})^{-1}\mathbf{W}_{g}^{\prime}$, we have $\mathbf{S}_{g}^{\prime}\mathbf{T}_{g}=\bm{0}$. Let $\bm{\alpha}=(\mathbf{S}^{\prime}\mathbf{S})^{-1}\mathbf{S}^{\prime}\bm{Y}$ and $\bm{\alpha}_{g}=(\mathbf{T}_{g}^{\prime}\mathbf{T}_{g})^{-1}\mathbf{T}_{g}^{\prime}\bm{Y}_{g}$, where $\mathbf{S}=\mathbf{X}\mathbf{W}$. Then we obtain $\bm{J}^{Y}_{g}$ and $\bm{A}^{Y}_{g}$ by letting $\bm{J}^{Y}_{g}=\mathbf{S}_{g}\bm{\alpha}\in\text{col}(\mathbf{J}_{g})$ and $\bm{A}^{Y}_{g}=\mathbf{T}_{g}\bm{\alpha}_{g}\in\text{col}(\mathbf{A}_{g})$.
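
To make the construction above concrete, the following base-R sketch computes $\mathbf{W}$, $\mathbf{W}_{g}$, $\mathbf{S}_{g}$, and $\mathbf{T}_{g}$ on simulated noiseless multi-group data and numerically checks $\mathbf{S}_{g}^{\prime}\mathbf{T}_{g}=\bm{0}$ for the first group. It is only an illustration of the proof under our own simulated dimensions and object names, not the JICO package implementation.

set.seed(1)
p <- 20; K <- 2; Kg <- c(2, 3); n <- c(15, 12)

## Noiseless two-group data whose row spaces share an exact rank-K intersection.
W0  <- qr.Q(qr(matrix(rnorm(p * K), p, K)))
Wg0 <- lapply(Kg, function(k) qr.Q(qr(matrix(rnorm(p * k), p, k))))
Xg  <- mapply(function(ng, B) matrix(rnorm(ng * K), ng, K) %*% t(W0) +
                              matrix(rnorm(ng * ncol(B)), ng, ncol(B)) %*% t(B),
              n, Wg0, SIMPLIFY = FALSE)

row_proj <- function(X, tol = 1e-8) {        ## projector onto row(X)
  s <- svd(X); V <- s$v[, s$d > tol * s$d[1], drop = FALSE]; V %*% t(V)
}
null_basis <- function(A, tol = 1e-8) {      ## orthonormal basis of the null space of A
  s <- svd(A, nu = 0, nv = ncol(A))
  d <- c(s$d, rep(0, ncol(A) - length(s$d)))
  s$v[, d < tol * max(s$d), drop = FALSE]
}

## W spans the intersection of the row spaces: (I - P_g) w = 0 for every g.
W <- null_basis(do.call(rbind, lapply(Xg, function(X) diag(p) - row_proj(X))))

## For each group, W_g spans the solution set of (A.1): W'x = 0 and W'X_g'X_g x = 0.
Wg <- lapply(Xg, function(X) null_basis(rbind(t(W), t(W) %*% crossprod(X))))

## Joint and individual scores, and the orthogonality S_g'T_g = 0 used in the proof.
Sg <- lapply(Xg, function(X) X %*% W)
Tg <- mapply(function(X, B) X %*% B, Xg, Wg, SIMPLIFY = FALSE)
max(abs(t(Sg[[1]]) %*% Tg[[1]]))             ## numerically zero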

Appendix B Derivation of the CR algorithm for solving (3.5) and (3.6)

Consider the following CR problem that covers (3.5) and (3.6) as special cases:

$\max_{\mathbf{w}}\ \ \text{cov}(\mathbf{X}\mathbf{w},\bm{Y})^{2}\,\text{var}(\mathbf{X}\mathbf{w})^{\gamma-1}$ (B.1)
$\text{s.t. }\ \mathbf{w}^{\prime}\mathbf{w}=1,$
$\mathbf{w}^{\prime}\mathbf{X}^{\prime}\mathbf{X}\mathbf{w}_{j}=0;\quad j=1,\ldots,k-1\text{ if }k\geq 2,$
$\hat{\mathbf{W}}^{\prime}\mathbf{w}=\bm{0};$
$\hat{\mathbf{S}}^{\prime}\mathbf{X}\mathbf{w}=\bm{0},$

where $\hat{\mathbf{W}}$, $\hat{\mathbf{S}}$, $\mathbf{X}$, and $\bm{Y}$ are given a priori.

The solution to (B.1) resides in $\text{col}(\hat{\mathbf{W}})^{\perp}$, i.e., the space that is orthogonal to the columns of $\hat{\mathbf{W}}$. Hence, we let $\hat{\mathbf{P}}=\hat{\mathbf{W}}(\hat{\mathbf{W}}^{\prime}\hat{\mathbf{W}})^{-1}\hat{\mathbf{W}}^{\prime}$ and $\hat{\mathbf{X}}=\mathbf{X}(\mathbf{I}-\hat{\mathbf{P}})$, the latter being the projection of $\mathbf{X}$ onto the space that is orthogonal to the columns of $\hat{\mathbf{W}}$. Let $m$ be the rank of $\hat{\mathbf{X}}$, and let $\hat{\mathbf{X}}=\mathbf{U}\mathbf{D}\mathbf{V}^{\prime}$ be the corresponding rank-$m$ SVD, with $\mathbf{U}\in\mathbb{R}^{n\times m}$, $\mathbf{V}\in\mathbb{R}^{p\times m}$, and $\mathbf{D}$ the $m\times m$ diagonal matrix of singular values. Since the representation of the solution to (B.1) might not be unique, to avoid ambiguity we write the solution as a linear combination of the column vectors of $\mathbf{V}$, i.e., $\mathbf{w}=\mathbf{V}\mathbf{z}$ for some $\mathbf{z}\in\mathbb{R}^{m}$. Note that any $\mathbf{w}=\mathbf{V}\mathbf{z}$ satisfies $\hat{\mathbf{W}}^{\prime}\mathbf{w}=\bm{0}$. Hence, the constraint $\hat{\mathbf{W}}^{\prime}\mathbf{w}=\bm{0}$ in (B.1) is automatically satisfied under this representation.
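
As a small illustration of this reparameterization step (a sketch only, with simulated $\mathbf{X}$ and $\hat{\mathbf{W}}$; all object names are ours), one may compute $\hat{\mathbf{X}}$ and its rank-$m$ SVD in R as follows.

set.seed(1)
n <- 30; p <- 10
X     <- matrix(rnorm(n * p), n, p)
W_hat <- matrix(rnorm(p * 2), p, 2)                       ## columns to be orthogonal to

P_hat <- W_hat %*% solve(crossprod(W_hat)) %*% t(W_hat)   ## projector onto col(W_hat)
X_hat <- X %*% (diag(p) - P_hat)                          ## project X off col(W_hat)

s    <- svd(X_hat)
keep <- s$d > 1e-8 * s$d[1]                               ## keep the rank-m part
U <- s$u[, keep, drop = FALSE]
D <- diag(s$d[keep], nrow = sum(keep))
V <- s$v[, keep, drop = FALSE]

## Any w = V z automatically satisfies W_hat' w = 0:
z <- rnorm(sum(keep)); w <- V %*% z
max(abs(t(W_hat) %*% w))                                  ## numerically zero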

At step $k+1$, the original optimization problem can be reformulated as follows:

$\max_{\mathbf{z}}\ (\mathbf{z}^{\prime}\mathbf{d})^{2}(\mathbf{z}^{\prime}\tilde{\mathbf{E}}\mathbf{z})^{\gamma-1}\quad\text{s.t. }\mathbf{z}^{\prime}\mathbf{z}=1,\ \mathbf{z}^{\prime}\tilde{\mathbf{E}}\mathbf{Z}_{k}=\bm{0},\text{ and }\mathbf{z}^{\prime}\mathbf{D}\mathbf{U}^{\prime}\hat{\mathbf{S}}=\bm{0},$ (B.2)

where $\mathbf{d}=\mathbf{V}^{\prime}\hat{\mathbf{X}}^{\prime}\bm{Y}$, $\mathbf{Z}_{k}=[\mathbf{z}_{1},\ldots,\mathbf{z}_{k}]$, and $\tilde{\mathbf{E}}=\mathbf{D}^{2}$. To solve (B.2), we can expand the objective into its Lagrangian form:

$T^{*}(\mathbf{z})=(\mathbf{z}^{\prime}\mathbf{d})^{2}(\mathbf{z}^{\prime}\tilde{\mathbf{E}}\mathbf{z})^{\gamma-1}-\lambda_{0}(\mathbf{z}^{\prime}\mathbf{z}-1)-2\mathbf{z}^{\prime}[\tilde{\mathbf{E}}\mathbf{Z}_{k}\ \ \mathbf{D}\mathbf{U}^{\prime}\hat{\mathbf{S}}]\bm{\Lambda}_{k},$

where $\bm{\Lambda}_{k}=[\lambda_{1},\ldots,\lambda_{k+K}]^{\prime}$ and $\lambda_{0},\ldots,\lambda_{k+K}$ are Lagrange multipliers. To solve (B.2), we take the derivative of $T^{*}$ with respect to $\mathbf{z}$; the optimizer is then a solution to the following:

${\partial T^{*}\over\partial\mathbf{z}}=2(\mathbf{z}^{\prime}\mathbf{d})(\mathbf{z}^{\prime}\tilde{\mathbf{E}}\mathbf{z})^{\gamma-1}\mathbf{d}+2(\gamma-1)(\mathbf{z}^{\prime}\mathbf{d})^{2}(\mathbf{z}^{\prime}\tilde{\mathbf{E}}\mathbf{z})^{\gamma-2}\tilde{\mathbf{E}}\mathbf{z}-2\lambda_{0}\mathbf{z}-2[\tilde{\mathbf{E}}\mathbf{Z}_{k}\ \ \mathbf{D}\mathbf{U}^{\prime}\hat{\mathbf{S}}]\bm{\Lambda}_{k}=\bm{0}.$ (B.3)

Left-multiplying (B.3) by $\mathbf{z}^{\prime}$ and applying the constraints, we conclude that $\lambda_{0}=\gamma(\mathbf{z}^{\prime}\mathbf{d})^{2}(\mathbf{z}^{\prime}\tilde{\mathbf{E}}\mathbf{z})^{\gamma-1}$. Plugging this back into (B.3) and letting $\tau=\mathbf{z}^{\prime}\mathbf{d}$ and $\rho=\mathbf{z}^{\prime}\tilde{\mathbf{E}}\mathbf{z}$, we have

$\gamma\tau^{2}\rho^{\gamma-1}\mathbf{z}+(1-\gamma)\tau^{2}\rho^{\gamma-2}\tilde{\mathbf{E}}\mathbf{z}+[\tilde{\mathbf{E}}\mathbf{Z}_{k}\ \ \mathbf{D}\mathbf{U}^{\prime}\hat{\mathbf{S}}]\bm{\Lambda}_{k}=\tau\rho^{\gamma-1}\mathbf{d}.$

A simple reformulation of the above, together with the constraints $\mathbf{z}^{\prime}[\tilde{\mathbf{E}}\mathbf{Z}_{k}\ \ \mathbf{D}\mathbf{U}^{\prime}\hat{\mathbf{S}}]=\bm{0}$, yields the following matrix form:

$\left[\begin{array}{cc}\mathbf{A}&\mathbf{B}\\ \mathbf{B}^{\prime}&\bm{0}\end{array}\right]\left[\begin{array}{c}\mathbf{z}\\ \bm{\Lambda}_{k}\end{array}\right]=\left[\begin{array}{c}\mathbf{q}\\ \bm{0}\end{array}\right],$ (B.4)

where $\mathbf{A}=\tau^{2}[\gamma\rho^{\gamma-1}\mathbf{I}+(1-\gamma)\rho^{\gamma-2}\tilde{\mathbf{E}}]$, $\mathbf{B}=[\tilde{\mathbf{E}}\mathbf{Z}_{k}\ \ \mathbf{D}\mathbf{U}^{\prime}\hat{\mathbf{S}}]$, and $\mathbf{q}=\tau\rho^{\gamma-1}\mathbf{d}$. By the standard formula for the inverse of a partitioned matrix and the constraint $\mathbf{z}^{\prime}\mathbf{z}=1$, we obtain

$\mathbf{z}={\mathbf{M}\mathbf{q}\over\|\mathbf{M}\mathbf{q}\|},$ (B.5)

where $\mathbf{M}=\mathbf{A}^{-1}-\mathbf{A}^{-1}\mathbf{B}(\mathbf{B}^{\prime}\mathbf{A}^{-1}\mathbf{B})^{-1}\mathbf{B}^{\prime}\mathbf{A}^{-1}$. Note that $\mathbf{M}$ carries a factor of $\tau^{-2}$ and $\mathbf{q}$ carries a factor of $\tau$, so $\mathbf{M}\mathbf{q}$ carries a factor of $\tau^{-1}$ that cancels out during the normalization in (B.5). Therefore, $\mathbf{z}$ does not depend on the value of $\tau$, and without loss of generality we can set $\tau=1$. The only unknown parameter is then $\rho$. Hence, we can formulate (B.4) and (B.5) as a fixed point problem in $\rho$, viewing $\mathbf{z}(\rho)^{\prime}\tilde{\mathbf{E}}\mathbf{z}(\rho)$ as a function of $\rho$. More specifically, we seek $\rho^{*}$ that satisfies $\mathbf{z}(\rho^{*})^{\prime}\tilde{\mathbf{E}}\mathbf{z}(\rho^{*})=\rho^{*}$. Afterwards, we obtain $\mathbf{z}^{*}=\mathbf{M}^{*}\mathbf{q}^{*}/\|\mathbf{M}^{*}\mathbf{q}^{*}\|$, where $\mathbf{M}^{*}$ and $\mathbf{q}^{*}$ are computed from $\rho^{*}$. We summarize the procedure to solve (B.2) in the following two steps:

Step 1: Solve for the fixed point $\rho^{*}$ of $\rho=\mathbf{z}(\rho)^{\prime}\tilde{\mathbf{E}}\mathbf{z}(\rho)$ with $\mathbf{z}(\rho)=\mathbf{M}\mathbf{q}/\|\mathbf{M}\mathbf{q}\|$, where

$\mathbf{M}=\mathbf{A}^{-1}-\mathbf{A}^{-1}\mathbf{B}(\mathbf{B}^{\prime}\mathbf{A}^{-1}\mathbf{B})^{-1}\mathbf{B}^{\prime}\mathbf{A}^{-1},$
$\mathbf{A}=\gamma\rho^{\gamma-1}\mathbf{I}+(1-\gamma)\rho^{\gamma-2}\tilde{\mathbf{E}},$
$\mathbf{B}=[\tilde{\mathbf{E}}\mathbf{Z}_{k}\ \ \mathbf{D}\mathbf{U}^{\prime}\hat{\mathbf{S}}],$
$\mathbf{q}=\rho^{\gamma-1}\mathbf{d};$

Step 2: Compute $\mathbf{M}^{*}$ and $\mathbf{q}^{*}$ from the fixed point $\rho^{*}$ obtained in Step 1. The solution to (B.2) is then given by $\mathbf{z}^{*}=\mathbf{M}^{*}\mathbf{q}^{*}/\|\mathbf{M}^{*}\mathbf{q}^{*}\|$.

The most challenging part is Step 1, where a nonlinear equation needs to be solved. This can be done numerically by several existing algorithms. For example, we can use Newton's method (Kelley, 2003), which is implemented in many optimization packages and converges quickly in practice.
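
As a concrete illustration, the following R sketch implements Steps 1 and 2 for a generic $0<\gamma<1$, using uniroot for the one-dimensional fixed-point equation instead of Newton's method. The inputs $\mathbf{d}$, $\tilde{\mathbf{E}}$, and $\mathbf{B}$ are simulated here, and all names are illustrative rather than taken from the JICO package.

set.seed(1)
m     <- 8
E     <- sort(runif(m, 0.5, 3), decreasing = TRUE)   ## diagonal of E-tilde = D^2
d     <- rnorm(m)                                    ## d = V' X-hat' Y
B     <- matrix(rnorm(m * 2), m, 2)                  ## [E-tilde Z_k  D U' S-hat]
gamma <- 0.5

z_of_rho <- function(rho) {
  a    <- gamma * rho^(gamma - 1) + (1 - gamma) * rho^(gamma - 2) * E  ## diagonal of A
  Ainv <- diag(1 / a)
  M    <- Ainv - Ainv %*% B %*% solve(t(B) %*% Ainv %*% B) %*% t(B) %*% Ainv
  q    <- rho^(gamma - 1) * d
  v    <- M %*% q
  v / sqrt(sum(v^2))
}

## Step 1: find rho* with z(rho*)' E z(rho*) = rho*; a root is bracketed by the
## extreme diagonal entries of E-tilde because z(rho) is a unit vector.
g        <- function(rho) { z <- z_of_rho(rho); sum(E * z^2) - rho }
rho_star <- uniroot(g, interval = c(min(E), max(E)))$root

## Step 2: recover the solution z* of (B.2) from rho*.
z_star <- z_of_rho(rho_star)
c(rho_star, sum(E * z_star^2))   ## the two numbers agree at the fixed point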

Appendix C JICO iterative algorithm

After each iteration, we obtain $K$ joint objective values as a $K\times 1$ vector:

$\mathcal{L}=\text{diag}(\mathbf{W}^{\prime}\mathbf{X}^{\text{Joint}\prime}\mathbf{X}^{\text{Joint}}\mathbf{W})^{\gamma-1}(\mathbf{W}^{\prime}\mathbf{X}^{\text{Joint}\prime}\bm{Y}^{\text{Joint}})^{2},$ (C.1)

and $K_{g}$ individual objective values as a $K_{g}\times 1$ vector for $g=1,\ldots,G$:

$\mathcal{L}_{g}=\text{diag}(\mathbf{W}_{g}^{\prime}\mathbf{X}_{g}^{\text{Indiv}\prime}\mathbf{X}_{g}^{\text{Indiv}}\mathbf{W}_{g})^{\gamma-1}(\mathbf{W}_{g}^{\prime}\mathbf{X}_{g}^{\text{Indiv}\prime}\bm{Y}_{g}^{\text{Indiv}})^{2},$ (C.2)

and compare them with the corresponding vectors obtained from the previous iteration. The iterative procedure stops when the differences between two consecutive iterations fall below a given tolerance level. Empirically, the algorithm always meets this convergence criterion, although there is no theoretical guarantee for it.

Data: $\{\mathbf{X}_{g},\bm{Y}_{g}\}_{g=1}^{G}$;
Parameters: tolerance level $\tau$; joint rank $K$ and individual ranks $K_{g}$;
Initialize $\mathbf{T}_{g}=\bm{0}_{n_{g}\times K_{g}}$; $\mathbf{U}_{g}=\bm{0}_{K_{g}\times p}$; $\bm{\alpha}_{g}=\bm{0}_{K_{g}\times 1}$;
while Euclidean distances $\|\nabla\mathcal{L}\|,\|\nabla\mathcal{L}_{g}\|>\tau$ do
       $\mathbf{X}_{g}^{\text{Joint}}=\mathbf{X}_{g}-\mathbf{T}_{g}\mathbf{U}_{g}$; $\bm{Y}_{g}^{\text{Joint}}=\bm{Y}_{g}-\mathbf{T}_{g}\bm{\alpha}_{g}$;
       Estimate $\mathbf{W}=(\mathbf{w}_{1},\ldots,\mathbf{w}_{K})$ from (3.2);
       $\mathbf{S}=\mathbf{X}^{\text{Joint}}\mathbf{W}$; $\mathbf{U}=(\mathbf{W}^{\prime}\mathbf{W})^{-1}\mathbf{W}^{\prime}$; $\bm{\alpha}=(\mathbf{S}^{\prime}\mathbf{S})^{-1}\mathbf{S}^{\prime}\bm{Y}^{\text{Joint}}$;
       for $g=1,\ldots,G$ do
             $\mathbf{X}_{g}^{\text{Indiv}}=\mathbf{X}_{g}-\mathbf{S}_{g}\mathbf{U}$; $\bm{Y}_{g}^{\text{Indiv}}=\bm{Y}_{g}-\mathbf{S}_{g}\bm{\alpha}$;
             Estimate $\mathbf{W}_{g}=(\mathbf{w}_{g1},\ldots,\mathbf{w}_{gK_{g}})$ from (3.4);
             $\mathbf{T}_{g}=\mathbf{X}_{g}^{\text{Indiv}}\mathbf{W}_{g}$; $\mathbf{U}_{g}=(\mathbf{W}_{g}^{\prime}\mathbf{W}_{g})^{-1}\mathbf{W}_{g}^{\prime}$; $\bm{\alpha}_{g}=(\mathbf{T}_{g}^{\prime}\mathbf{T}_{g})^{-1}\mathbf{T}_{g}^{\prime}\bm{Y}_{g}^{\text{Indiv}}$;
       end for
       $\tilde{\mathbf{W}}=[\mathbf{W}_{1},\ldots,\mathbf{W}_{G}]$;
end while
Algorithm 1: JICO Algorithm
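
To illustrate the flow of Algorithm 1, the following R sketch implements the alternating joint and individual updates. It is a simplified illustration only, not the JICO package implementation: the direction-estimation steps (3.2) and (3.4) are replaced by their PCR special case (leading right singular vectors, i.e., $\gamma=\infty$), the additional orthogonality constraints are omitted, convergence is monitored through a simple surrogate objective, and all function and variable names are our own.

jico_sketch <- function(Xg, Yg, K, Kg, tau = 1e-6, max_iter = 100) {
  G <- length(Xg); p <- ncol(Xg[[1]]); ng <- sapply(Xg, nrow)
  Tg <- lapply(seq_len(G), function(g) matrix(0, ng[g], Kg[g]))
  Ug <- lapply(seq_len(G), function(g) matrix(0, Kg[g], p))
  ag <- lapply(seq_len(G), function(g) matrix(0, Kg[g], 1))
  obj_old <- Inf
  for (iter in seq_len(max_iter)) {
    ## Joint step: remove the current individual fits, then estimate W, S, U, alpha.
    XJ <- do.call(rbind, lapply(seq_len(G), function(g) Xg[[g]] - Tg[[g]] %*% Ug[[g]]))
    YJ <- do.call(rbind, lapply(seq_len(G), function(g) Yg[[g]] - Tg[[g]] %*% ag[[g]]))
    W  <- svd(XJ)$v[, seq_len(K), drop = FALSE]       ## PCR surrogate for (3.2)
    S  <- XJ %*% W
    U  <- solve(crossprod(W)) %*% t(W)
    a  <- solve(crossprod(S)) %*% t(S) %*% YJ
    Sg <- split.data.frame(S, rep(seq_len(G), ng))    ## group-wise joint scores
    ## Individual step for each group.
    for (g in seq_len(G)) {
      XI <- Xg[[g]] - Sg[[g]] %*% U
      YI <- Yg[[g]] - Sg[[g]] %*% a
      Wg <- svd(XI)$v[, seq_len(Kg[g]), drop = FALSE] ## PCR surrogate for (3.4)
      Tg[[g]] <- XI %*% Wg
      Ug[[g]] <- solve(crossprod(Wg)) %*% t(Wg)
      ag[[g]] <- solve(crossprod(Tg[[g]])) %*% t(Tg[[g]]) %*% YI
    }
    ## Surrogate objective (total score variation) for the convergence check.
    obj <- sum(S^2) + sum(sapply(Tg, function(T) sum(T^2)))
    if (abs(obj - obj_old) < tau) break
    obj_old <- obj
  }
  list(W = W, U = U, alpha = a, Tg = Tg, Ug = Ug, alpha_g = ag, iterations = iter)
}

## Hypothetical usage on simulated two-group data:
set.seed(2)
Xg <- list(matrix(rnorm(50 * 20), 50, 20), matrix(rnorm(60 * 20), 60, 20))
Yg <- lapply(Xg, function(X) X[, 1, drop = FALSE] + matrix(rnorm(nrow(X)), ncol = 1))
fit <- jico_sketch(Xg, Yg, K = 1, Kg = c(1, 1))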

Appendix D Selection of Tuning Parameters

To apply the proposed JICO method, we need to select the tuning parameters $K$, $K_{g}$, and $\gamma$. Based on our numerical experience, the performance of JICO depends more on the choice of $K$ and less on $K_{g}$. We also find that even when the true $K_{g}$ differs across groups, setting them to be equal does not substantially affect the MSE in predicting the response. Thus, to save computational time, we can choose $K_{g}$ to be the same for all groups. We propose to perform an exhaustive search over $0\leq K\leq D_{1}$ and $0\leq K_{g}\leq D_{2}$ to find the combination that gives the smallest MSE, where $D_{1}$ and $D_{2}$ are two user-defined constants. In this way, we need to try $(D_{1}+1)\times(D_{2}+1)$ combinations in total. Since we assume $\mathbf{J}_{g}$ and $\mathbf{A}_{g}$ are both low-rank, we can set $D_{1}$ and $D_{2}$ to be relatively small. As for the tuning of $\gamma$, we propose to perform a grid search and choose the value that minimizes the MSE in predicting the responses. A sketch of this search is given below.
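
The following R sketch outlines the exhaustive search just described. The functions jico_fit and jico_predict are hypothetical placeholders for a JICO fitting and prediction routine (the actual JICO package interface may differ), and the training and validation splits are assumed to be available.

select_jico_tuning <- function(Xg_tr, Yg_tr, Xg_val, Yg_val,
                               D1 = 3, D2 = 3, gammas = c(0, 0.5, 1, Inf)) {
  grid <- expand.grid(K = 0:D1, Kg = 0:D2, gamma = gammas)
  grid$mse <- apply(grid, 1, function(par) {
    ## jico_fit and jico_predict are placeholders, not an actual API.
    fit  <- jico_fit(Xg_tr, Yg_tr, K = par["K"],
                     Kg = rep(par["Kg"], length(Xg_tr)), gamma = par["gamma"])
    pred <- jico_predict(fit, Xg_val)
    mean(unlist(mapply(function(yhat, y) (yhat - y)^2, pred, Yg_val)))
  })
  grid[which.min(grid$mse), ]    ## combination with the smallest validation MSE
}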

Appendix E The impact of using different initial values $\mathbf{I}_{g}$

The following simulation study illustrates the impact of using different matrices as the initial value for our algorithm.

We use the same data generating scheme as described in Section 4. We fix $G=2$, $p=200$, and $n_{1}=n_{2}=50$. In each replication, we generate 100 training samples to train the models and evaluate the corresponding Mean Squared Error (MSE) on an independent test set of 100 samples. We also record the objective values as described below:

  • When $\gamma=0$, we compare the converged objective value as in (3.6) for $g=1,2$.

  • When $\gamma=1$, we compare $\text{cov}(\mathbf{X}^{\text{Joint}}\mathbf{w},\bm{Y}^{\text{Joint}})^{2}$ and $\text{cov}(\mathbf{X}_{g}^{\text{Indiv}}\mathbf{w},\bm{Y}_{g}^{\text{Indiv}})^{2}$ for $g=1,2$.

  • When $\gamma=\infty$, we compare $\text{var}(\mathbf{X}^{\text{Joint}}\mathbf{w})$ and $\text{var}(\mathbf{X}_{g}^{\text{Indiv}}\mathbf{w})$ for $g=1,2$.

We compare the performance of our algorithm by presenting boxplots of the MSEs and the converged objective values under different initial values. We repeat the simulation 50 times, with three kinds of initial values for the individual matrices $\mathbf{I}_{g}$ as follows:

  • Zero matrix: In this setting, our algorithm starts with zero matrices as the initial value, where all entries in $\mathbf{I}_{g}$ are equal to 0.

  • Large matrix: In this setting, our algorithm starts with matrices with large entries as the initial value, where all entries in $\mathbf{I}_{g}$ are equal to 3.

  • Joint matrix: In this setting, our algorithm starts with the estimated joint matrices $\hat{\mathbf{J}}_{g}$ as the initial value, where $\hat{\mathbf{J}}_{g}$ are derived from the first setting, that is, using zero matrices as the initial value.

Figure E.2 shows the converged value of the joint objective function under the three different initial values of the individual matrices $\mathbf{I}_{g}$ mentioned above. Similarly, Figure E.3 shows the converged value of the first individual objective function under the three different initial values, and Figure E.4 shows the converged value of the second individual objective function. The results show that our algorithm may not converge to the same objective function value under different initial values when $\gamma=1$, but Figure E.1 shows that the results tend to have similar performance in terms of prediction error, regardless of how the initial individual matrices are chosen. To ensure better performance, we recommend running our algorithm on the same dataset multiple times with different initial values and then choosing the result with the best performance; a sketch of such a comparison is given after the figures.

Figure E.1: The impact of using different initial individual matrices on MSE under different settings.
Figure E.2: The impact of using different initial individual matrices on the converged value of the joint objective function under different settings.
Figure E.3: The impact of using different initial individual matrices on the converged value of the first individual objective function under different settings.
Figure E.4: The impact of using different initial individual matrices on the converged value of the second individual objective function under different settings.
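
As a concrete template for this recommendation, the following R sketch runs the fitting procedure under the three initializations above and keeps the result with the smallest test MSE. The functions jico_fit_with_init and jico_predict are hypothetical placeholders (the actual JICO package interface may differ), and the training and test objects are assumed to be available from the simulation.

## Hypothetical placeholders; not the actual JICO package interface.
init_list <- list(
  zero  = lapply(Xg_train, function(X) matrix(0, nrow(X), ncol(X))),
  large = lapply(Xg_train, function(X) matrix(3, nrow(X), ncol(X)))
)
## "Joint matrix" initialization: reuse the estimated joint matrices from the zero start.
fit0 <- jico_fit_with_init(Xg_train, Yg_train, K, Kg, init_Ig = init_list$zero)
init_list$joint <- fit0$Jg_hat

mse <- sapply(init_list, function(Ig0) {
  fit  <- jico_fit_with_init(Xg_train, Yg_train, K, Kg, init_Ig = Ig0)
  pred <- jico_predict(fit, Xg_test)
  mean(unlist(mapply(function(yhat, y) (yhat - y)^2, pred, Yg_test)))
})
best_fit <- jico_fit_with_init(Xg_train, Yg_train, K, Kg,
                               init_Ig = init_list[[which.min(mse)]])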

References

  • Björkström, A. and Sundberg, R. (1996). Continuum regression is not always continuous. Journal of the Royal Statistical Society: Series B (Methodological), 58(4):703–710.
  • Feng, Q., Jiang, M., Hannig, J., and Marron, J. (2018). Angle-based joint and individual variation explained. Journal of Multivariate Analysis, 166:241–265.
  • Gaynanova, I. and Li, G. (2019). Structural learning and integrative decomposition of multi-view data. Biometrics, 75(4):1121–1132.
  • Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.
  • Kaplan, A. and Lock, E. F. (2017). Prediction with dimension reduction of multiple molecular data sources for patient survival. Cancer Informatics, 16:1176935117718517.
  • Kelley, C. T. (2003). Solving Nonlinear Equations with Newton's Method, volume 1. SIAM.
  • Killiany, R. J., Moss, M. B., Albert, M. S., Sandor, T., Tieman, J., and Jolesz, F. (1993). Temporal lobe regions on magnetic resonance imaging identify patients with early Alzheimer's disease. Archives of Neurology, 50(9):949–954.
  • Lee, M. H. and Liu, Y. (2013). Kernel continuum regression. Computational Statistics and Data Analysis, 68:190–201.
  • Li, Q. and Li, L. (2021). Integrative factor regression and its inference for multimodal data analysis. Journal of the American Statistical Association, pages 1–15.
  • Lock, E. F., Hoadley, K. A., Marron, J. S., and Nobel, A. B. (2013). Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. The Annals of Applied Statistics, 7(1):523–542.
  • Ma, S. and Huang, J. (2017). A concave pairwise fusion approach to subgroup analysis. Journal of the American Statistical Association, 112(517):410–423.
  • Meinshausen, N. and Bühlmann, P. (2015). Maximin effects in inhomogeneous large-scale data. The Annals of Statistics, 43(4):1801–1830.
  • Michon, A., Deweer, B., Pillon, B., Agid, Y., and Dubois, B. (1994). Relation of anosognosia to frontal lobe dysfunction in Alzheimer's disease. Journal of Neurology, Neurosurgery and Psychiatry, 57(7):805–809.
  • Muniategui, A., Pey, J., Planes, F. J., and Rubio, A. (2013). Joint analysis of miRNA and mRNA expression data. Briefings in Bioinformatics, 14(3):263–278.
  • Palzer, E. F., Wendt, C. H., Bowler, R. P., Hersh, C. P., Safo, S. E., and Lock, E. F. (2022). sJIVE: Supervised joint and individual variation explained. Computational Statistics & Data Analysis, 175:107547.
  • Stone, M. and Brooks, R. J. (1990). Continuum regression: cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares and principal components regression. Journal of the Royal Statistical Society: Series B (Methodological), 52(2):237–258.
  • Tang, L. and Song, P. X. (2016). Fused lasso approach in regression coefficients clustering: learning parameter heterogeneity in data integration. The Journal of Machine Learning Research, 17(1):3915–3937.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.
  • Wang, P., Li, Q., Shen, D., and Liu, Y. (2023). High-dimensional factor regression for heterogeneous subpopulations. Statistica Sinica, 33:1–27.
  • Wang, P., Liu, Y., and Shen, D. (2018). Flexible locally weighted penalized regression with applications on prediction of Alzheimer's disease neuroimaging initiative's clinical scores. IEEE Transactions on Medical Imaging, 38(6):1398–1408.
  • Zhao, T., Cheng, G., and Liu, H. (2016). A partially linear framework for massive heterogeneous data. Annals of Statistics, 44(4):1400–1437.