
Joint and Individual Component Regression

Peiyao Wang1, Haodong Wang1, Quefeng Li2, Dinggang Shen4,5,6,
Yufeng Liu1,2,3∗

1Department of Statistics and Operations Research,
2Department of Biostatistics, 3Department of Genetics,
University of North Carolina at Chapel Hill;
4School of Biomedical Engineering, ShanghaiTech University;
5Shanghai United Imaging Intelligence Co., Ltd.;
6Department of Artificial Intelligence, Korea University
email: yfliu@email.unc.edu
Abstract

Multi-group data, which consist of data from multiple groups, are common in practice and can be challenging to analyze due to data heterogeneity. We propose a novel Joint and Individual Component Regression (JICO) model to analyze multi-group data. In particular, our proposed model decomposes the response into shared and group-specific components, which are driven by low-rank approximations of joint and individual structures from the predictors respectively. The joint structure has the same regression coefficients across multiple groups, whereas individual structures have group-specific regression coefficients. Moreover, the choice of joint and individual ranks allows our model to cover global and group-specific models as special cases. For model estimation, we formulate this framework under the representation of latent components and propose an iterative algorithm to solve for the joint and individual scores under the new representation. To construct the latent scores, we utilize Continuum Regression (CR), which provides a unified framework that covers Ordinary Least Squares (OLS), Partial Least Squares (PLS), and Principal Component Regression (PCR) as special cases. We show that JICO attains a good balance between global and group-specific models and remains flexible due to the usage of CR. Finally, we conduct simulation studies and an analysis of an Alzheimer's disease dataset to further demonstrate the effectiveness of JICO. An R implementation of JICO is available online at https://github.com/peiyaow/JICO.

Keywords: Continuum Regression; Heterogeneity; Latent Component Regression; Multi-group Data.

1 Introduction

Many fields of scientific research involve the analysis of heterogeneous data. In particular, data may appear in the form of multiple matrices, with data heterogeneity arising from either variables or samples. One example is multi-view/source data, which measure different sets of variables on the same set of samples. The sets of variables may come from different platforms/sources/modalities. For instance, in genomics studies, measurements are collected as different biomarkers, such as mRNA and miRNA (Muniategui et al., 2013). Another example is multi-group data, which measure the same set of variables on disparate sets of samples, leading to heterogeneous subpopulations/subgroups in the entire population. For instance, in the Alzheimer's Disease (AD) study, subjects can have different subtypes, such as Normal Control (NC), Mild Cognitive Impairment (MCI), and AD.

We study the classical regression problem with one continuous response for multi-group data. Although there are many well-established regression techniques for homogeneous data (Hoerl and Kennard, 1970; Tibshirani, 1996), they may not be suitable for multi-group data. One naive approach is to ignore data heterogeneity and fit a global model using these techniques. However, a single global model can be too restrictive because the diverse information from different subgroups may not be identified. On the other hand, one can train separate group-specific models. Despite its flexibility, the information that is jointly shared across different groups is not sufficiently captured. Therefore, it is desirable to build a flexible statistical model that can simultaneously quantify the jointly shared global information and individual group-specific information for heterogeneous data.

There are several existing methods proposed in the literature for regression with multi-group data. Meinshausen and Bühlmann (2015) took a conservative approach and proposed a maximin effects method that is reliable for all possible subsets of the data. Zhao et al. (2016) proposed a partially linear regression framework for massive heterogeneous data, where the goal is to extract common features across all subpopulations while exploring the heterogeneity of each subpopulation. Tang and Song (2016) and Ma and Huang (2017) proposed fused penalties to estimate regression coefficients that capture subgroup structures in a linear regression framework. Wang et al. (2018) proposed a locally-weighted penalized model to perform subject-wise variable selection. Wang et al. (2023) proposed a factor regression model for heterogeneous subpopulations under a high-dimensional factor decomposition. However, these models either are not specifically designed to identify the globally-shared and group-specific structures, or impose strong theoretical assumptions on the covariates. On the other hand, there exist some works that study this problem for multi-source data. Lock et al. (2013) proposed JIVE to learn joint and individual structures from multiple data matrices by low-rank approximations. Some extensions of JIVE can be found in Feng et al. (2018) and Gaynanova and Li (2019). All of these decomposition methods are fully unsupervised. Recently, Li and Li (2021) proposed a supervised integrative factor regression model for multi-source data and studied its statistical properties with hypothesis tests. Palzer et al. (2022) proposed sJIVE, which extends JIVE with supervision from the response. Both of these methods are supervised, but they focus on regression for multi-source data.

In this paper, we consider the supervised learning problem of predicting a response with multi-group data. We propose Joint and Individual COmponent Regression (JICO), a novel latent component regression model that covers JIVE as a special case. Our proposed model decomposes the response into jointly shared and group-specific components, which are driven by low-rank approximations of joint and individual structures from the predictors respectively. The joint structure shares the same coefficients across all groups, whereas individual structures have group-specific coefficients. Moreover, by choosing different ranks for the joint and individual structures, our model covers global and group-specific models as special cases. To estimate JICO, we propose an iterative algorithm that solves for joint and individual scores using the latent component representation. To construct the latent scores, we utilize Continuum Regression (CR) (Stone and Brooks, 1990), which provides a unified framework that covers OLS, PLS, and PCR as special cases. Some follow-up studies and modern extensions of CR can be found in Björkström and Sundberg (1996) and Lee and Liu (2013). Embracing this flexibility and generalizability of CR, our proposed JICO model extends to the heterogeneous data setup and is able to achieve different model configurations on the spectrum of CR under this more complicated setting. JICO attains a good balance between global and group-specific models, and further achieves its flexibility by extending CR.

The rest of this paper is organized as follows. In Section 2, we briefly review JIVE and introduce our proposed JICO model. We further provide sufficient conditions to make JICO identifiable. In Section 3, after two motivating special cases, we introduce our iterative algorithm. In Sections 4 and 5, we evaluate the performance of JICO by simulation studies and real data analysis on the Alzheimer’s disease dataset, respectively. In Section 6, we conclude this paper with some discussion and possible extensions.

2 Motivation and Model Framework

Suppose we observe data pairs $(\mathbf{X}_{g},\bm{Y}_{g})_{g=1}^{G}$ from $G$ groups, where $\mathbf{X}_{g}\in\mathbb{R}^{n_{g}\times p}$ and $\bm{Y}_{g}\in\mathbb{R}^{n_{g}}$ are the data matrix and the response vector for the $g$th group respectively. In this setting, each data matrix has the same set of $p$ explanatory variables, whereas the samples vary across groups. We let $\mathbf{X}=[\mathbf{X}_{1}^{\prime},\ldots,\mathbf{X}_{G}^{\prime}]^{\prime}\in\mathbb{R}^{n\times p}$ and $\bm{Y}=[\bm{Y}_{1}^{\prime},\ldots,\bm{Y}_{G}^{\prime}]^{\prime}\in\mathbb{R}^{n}$, where $n=\sum_{g=1}^{G}n_{g}$.

Our model is closely related to JIVE, which provides a general formulation to decompose multiple data matrices into joint and individual structures. JIVE decomposes $\mathbf{X}_{g}$ as

$$\mathbf{X}_{g}=\mathbf{J}_{g}+\mathbf{A}_{g}+\mathbf{E}_{g}, \qquad (2.1)$$

where $\mathbf{J}_{g}\in\mathbb{R}^{n_{g}\times p}$ represents the submatrix of the joint structure of $\mathbf{X}_{g}$, $\mathbf{A}_{g}\in\mathbb{R}^{n_{g}\times p}$ represents the individual structure of $\mathbf{X}_{g}$, and $\mathbf{E}_{g}\in\mathbb{R}^{n_{g}\times p}$ is the error matrix. We consider that $\bm{Y}_{g}$ has a similar decomposition into joint and individual signals

$$\bm{Y}_{g}=\bm{J}^{Y}_{g}+\bm{A}^{Y}_{g}+\mathbf{e}_{g}, \qquad (2.2)$$

where $\mathbf{e}_{g}\in\mathbb{R}^{n_{g}}$ is the noise from the $g$-th group. Let $\tilde{\mathbf{X}}_{g}=\mathbf{J}_{g}+\mathbf{A}_{g}$ and $\tilde{\bm{Y}}_{g}=\bm{J}^{Y}_{g}+\bm{A}^{Y}_{g}$ be the noiseless counterparts of $\mathbf{X}_{g}$ and $\bm{Y}_{g}$. Lemma 1 gives conditions to ensure that $\mathbf{J}_{g}$, $\mathbf{A}_{g}$, $\bm{J}_{g}^{Y}$, and $\bm{A}_{g}^{Y}$ are identifiable.

Lemma 1

Given $\{\tilde{\mathbf{X}}_{g},\tilde{\bm{Y}}_{g}\}_{g=1}^{G}$ with $\tilde{\bm{Y}}_{g}\in\text{col}(\tilde{\mathbf{X}}_{g})$, there exist unique $\mathbf{J}_{g}$ and $\mathbf{A}_{g}$ such that $\tilde{\mathbf{X}}_{g}=\mathbf{J}_{g}+\mathbf{A}_{g}$ and they satisfy the following conditions:

  (i) $\text{row}(\mathbf{J}_{1})=\cdots=\text{row}(\mathbf{J}_{G})\subset\text{row}(\tilde{\mathbf{X}}_{g})$;

  (ii) $\text{row}(\mathbf{J}_{g})\perp\text{row}(\mathbf{A}_{g})$, for $g=1,\ldots,G$;

  (iii) $\bigcap_{g=1}^{G}\text{row}(\mathbf{A}_{g})=\{\mathbf{0}_{p\times 1}\}$.

Moreover, if $\text{col}(\mathbf{J}_{g})\perp\text{col}(\mathbf{A}_{g})$, then there exist unique $\bm{J}^{Y}_{g}$ and $\bm{A}^{Y}_{g}$ such that $\tilde{\bm{Y}}_{g}=\bm{J}^{Y}_{g}+\bm{A}^{Y}_{g}$, $\bm{J}^{Y}_{g}\in\text{col}(\mathbf{J}_{g})$, and $\bm{A}^{Y}_{g}\in\text{col}(\mathbf{A}_{g})$.

Lemma 1 shows that $\tilde{\mathbf{X}}_{g}$ can be uniquely decomposed into the sum of $\mathbf{J}_{g}$ and $\mathbf{A}_{g}$ if we require them to satisfy conditions (i)-(iii), following similar arguments as in Feng et al. (2018). To ensure the unique decomposition of $\tilde{\bm{Y}}_{g}$, we further require $\text{col}(\mathbf{J}_{g})\perp\text{col}(\mathbf{A}_{g})$, which is different from Palzer et al. (2022), who require $\text{row}(\mathbf{J}_{g})\perp\text{row}(\mathbf{A}_{g})$.

In practice, only $\mathbf{X}_{g}$ and $\bm{Y}_{g}$ are observable. In Lemma 2(a), we show that the identifiability conditions in Lemma 1 can still be achieved given the observed $\{\mathbf{X}_{g},\bm{Y}_{g}\}_{g=1}^{G}$. Moreover, in Lemma 2(b), we show that if $\mathbf{J}_{g}$ and $\mathbf{A}_{g}$ are assumed to have low ranks, they can be further decomposed as $\mathbf{J}_{g}=\mathbf{S}_{g}\mathbf{U}$, where $\mathbf{S}_{g}$ is an $n_{g}\times K$ score matrix, $\mathbf{U}$ is a $K\times p$ loading matrix, and $K=\text{rank}(\mathbf{J}_{g})$; and $\mathbf{A}_{g}=\mathbf{T}_{g}\mathbf{U}_{g}$, where $\mathbf{T}_{g}$ is an $n_{g}\times K_{g}$ score matrix, $\mathbf{U}_{g}$ is a $K_{g}\times p$ loading matrix, and $K_{g}=\text{rank}(\mathbf{A}_{g})$. Under this formulation, if $\mathbf{S}_{g}^{\prime}\mathbf{T}_{g}=\bm{0}_{K\times K_{g}}$, then (2.1) and (2.2) can be expressed as

$$\mathbf{X}_{g}=\mathbf{S}_{g}\mathbf{U}+\mathbf{T}_{g}\mathbf{U}_{g}+\mathbf{E}_{g}, \qquad (2.3)$$
$$\bm{Y}_{g}=\mathbf{S}_{g}\bm{\alpha}+\mathbf{T}_{g}\bm{\alpha}_{g}+\mathbf{e}_{g}, \qquad (2.4)$$

where $\bm{\alpha}\in\mathbb{R}^{K}$ and $\bm{\alpha}_{g}\in\mathbb{R}^{K_{g}}$ are the coefficients of the joint and individual components respectively. Model (2.4) gives a unified framework to model multi-group data. When $K=0$, the joint term $\mathbf{S}_{g}\bm{\alpha}$ vanishes and (2.4) reduces to the group-specific model $\bm{Y}_{g}=\mathbf{T}_{g}\bm{\alpha}_{g}+\mathbf{e}_{g}$. On the other hand, when $K_{1}=\cdots=K_{G}=0$, the individual term $\mathbf{T}_{g}\bm{\alpha}_{g}$ vanishes and (2.4) reduces to the global model $\bm{Y}_{g}=\mathbf{S}_{g}\bm{\alpha}+\mathbf{e}_{g}$. When $K\neq 0$ and $K_{g}\neq 0$, (2.4) has both global and group-specific components, and thus lies between the above two extreme cases.
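To make the formulation concrete, below is a minimal R sketch that simulates multi-group data from (2.3)-(2.4). The dimensions, ranks, and noise levels are illustrative assumptions, and the orthogonality conditions discussed in this section are not enforced here; the sketch only illustrates the generative form of the model.

```r
# Minimal sketch: simulate multi-group data from the joint + individual
# latent component model (2.3)-(2.4). Dimensions, ranks, and noise levels
# are illustrative assumptions; orthogonality constraints are not enforced.
set.seed(1)
G <- 2; p <- 20; n_g <- c(30, 40); K <- 1; K_g <- c(1, 1)

U       <- matrix(rnorm(K * p), K, p)                            # joint loading (K x p)
U_g     <- lapply(K_g, function(k) matrix(rnorm(k * p), k, p))   # individual loadings
alpha   <- rnorm(K)                                              # joint coefficients
alpha_g <- lapply(K_g, rnorm)                                    # individual coefficients

dat <- lapply(seq_len(G), function(g) {
  S_g <- matrix(rnorm(n_g[g] * K), n_g[g], K)                    # joint scores
  T_g <- matrix(rnorm(n_g[g] * K_g[g]), n_g[g], K_g[g])          # individual scores
  X_g <- S_g %*% U + T_g %*% U_g[[g]] +
         matrix(rnorm(n_g[g] * p, sd = 0.1), n_g[g], p)          # model (2.3)
  Y_g <- drop(S_g %*% alpha + T_g %*% alpha_g[[g]]) +
         rnorm(n_g[g], sd = 0.1)                                 # model (2.4)
  list(X = X_g, Y = Y_g)
})
str(dat[[1]])
```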

Lemma 2

Given $\{\mathbf{X}_{g},\bm{Y}_{g}\}_{g=1}^{G}$, it holds that

  (a) There exist matrices $\mathbf{J}_{g}$ and $\mathbf{A}_{g}$ that satisfy conditions (i)-(iii) in Lemma 1 and $\text{col}(\mathbf{J}_{g})\perp\text{col}(\mathbf{A}_{g})$.

  (b) There exist matrices $\mathbf{U},\mathbf{U}_{g},\mathbf{S}_{g},\mathbf{T}_{g}$ such that $\mathbf{J}_{g}$ and $\mathbf{A}_{g}$ can be expressed as $\mathbf{J}_{g}=\mathbf{S}_{g}\mathbf{U}$ and $\mathbf{A}_{g}=\mathbf{T}_{g}\mathbf{U}_{g}$, where $\mathbf{S}_{g}^{\prime}\mathbf{T}_{g}=\bm{0}_{K\times K_{g}}$. Moreover, there exist $\bm{J}^{Y}_{g}$ and $\bm{A}^{Y}_{g}$ that can be expressed as $\bm{J}^{Y}_{g}=\mathbf{S}_{g}\bm{\alpha}$ and $\bm{A}^{Y}_{g}=\mathbf{T}_{g}\bm{\alpha}_{g}$, where $\bm{\alpha}=(\mathbf{S}^{\prime}\mathbf{S})^{-1}\mathbf{S}^{\prime}\bm{Y}$ and $\bm{\alpha}_{g}=(\mathbf{T}_{g}^{\prime}\mathbf{T}_{g})^{-1}\mathbf{T}_{g}^{\prime}\bm{Y}_{g}$, with $\mathbf{S}=[\mathbf{S}_{1}^{\prime},\ldots,\mathbf{S}_{G}^{\prime}]^{\prime}$.

Corollary 1

There exist matrices $\mathbf{W}\in\mathbb{R}^{p\times K}$ and $\mathbf{W}_{g}\in\mathbb{R}^{p\times K_{g}}$ such that $\mathbf{J}_{g}=\mathbf{S}_{g}\mathbf{U}$ and $\mathbf{A}_{g}=\mathbf{T}_{g}\mathbf{U}_{g}$, defined by $\mathbf{S}_{g}=\mathbf{X}_{g}\mathbf{W}$, $\mathbf{T}_{g}=\mathbf{X}_{g}\mathbf{W}_{g}$, $\mathbf{U}=(\mathbf{W}^{\prime}\mathbf{W})^{-1}\mathbf{W}^{\prime}$ and $\mathbf{U}_{g}=(\mathbf{W}_{g}^{\prime}\mathbf{W}_{g})^{-1}\mathbf{W}_{g}^{\prime}$ as in Lemma 2(b), satisfy conditions (i)-(iii) in Lemma 1 and $\text{col}(\mathbf{J}_{g})\perp\text{col}(\mathbf{A}_{g})$, if $\mathbf{W}^{\prime}\mathbf{W}_{g}=\bm{0}$ and $\mathbf{W}^{\prime}\mathbf{X}_{g}^{\prime}\mathbf{X}_{g}\mathbf{W}_{g}=\bm{0}$, for all $g$.

Corollary 1 follows directly from the proof of Lemma 2(b). As a remark, the columns of $\mathbf{W}$ and $\mathbf{W}_{g}$ form the sets of bases that span the row spaces of $\mathbf{J}_{g}$ and $\mathbf{A}_{g}$ respectively. Hence, $\mathbf{W}^{\prime}\mathbf{W}_{g}=\bm{0}$ is a sufficient and necessary condition for $\text{row}(\mathbf{J}_{g})\perp\text{row}(\mathbf{A}_{g})$. Moreover, note that in Lemma 2(b), $\mathbf{S}_{g}^{\prime}\mathbf{T}_{g}=\bm{0}$ directly implies that $\mathbf{A}_{g}^{\prime}\mathbf{J}_{g}=\bm{0}$, the latter being a sufficient condition for $\text{col}(\mathbf{J}_{g})\perp\text{col}(\mathbf{A}_{g})$. Therefore, in Corollary 1, $\mathbf{W}^{\prime}\mathbf{X}_{g}^{\prime}\mathbf{X}_{g}\mathbf{W}_{g}=\bm{0}$ provides a sufficient condition for $\text{col}(\mathbf{J}_{g})\perp\text{col}(\mathbf{A}_{g})$, which satisfies one of the identifiability constraints for the unique decomposition of $\mathbf{X}_{g}$ in Lemma 1. In Section 3, we describe the algorithm to solve for $\mathbf{W}$ and $\mathbf{W}_{g}$ respectively.
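As a concrete illustration of these conditions, the following R sketch constructs, for a single group, a basis $\mathbf{W}_{g}$ orthogonal to a given joint basis $\mathbf{W}$ in the sense of Corollary 1 (using the linear system (A.1) from Appendix A), and then verifies the row and column orthogonality of the resulting $\mathbf{J}_{g}$ and $\mathbf{A}_{g}$ numerically. The dimensions are arbitrary illustrative choices.

```r
# Numerical check of Corollary 1 for one group (illustrative dimensions).
# W_g is built from the null space of rbind(W', W' X_g' X_g), i.e. the
# linear system (A.1) in Appendix A, so that W'W_g = 0 and W'X_g'X_g W_g = 0.
set.seed(2)
n_g <- 15; p <- 10; K <- 2; K_g <- 3
X_g <- matrix(rnorm(n_g * p), n_g, p)
W   <- qr.Q(qr(matrix(rnorm(p * K), p, K)))           # joint basis (p x K)

M   <- rbind(t(W), t(W) %*% crossprod(X_g))           # 2K x p constraint matrix
V   <- svd(M, nu = 0, nv = ncol(M))$v                 # full set of right singular vectors
W_g <- V[, (qr(M)$rank + 1):p][, seq_len(K_g)]        # basis of the null space of M

J_g <- X_g %*% W   %*% solve(crossprod(W))   %*% t(W)
A_g <- X_g %*% W_g %*% solve(crossprod(W_g)) %*% t(W_g)

max(abs(t(W) %*% W_g))        # ~ 0: W'W_g = 0
max(abs(J_g %*% t(A_g)))      # ~ 0: row(J_g) is orthogonal to row(A_g)
max(abs(crossprod(J_g, A_g))) # ~ 0: col(J_g) is orthogonal to col(A_g)
```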

3 Model Estimation

The key to estimating (2.3) and (2.4) is the construction of the score matrices $\mathbf{S}_{g}$ and $\mathbf{T}_{g}$. To motivate our estimation procedure, in Section 3.1 we discuss the joint and individual score estimation under two special cases. In Section 3.2, we introduce an iterative algorithm for the general case.

3.1 Joint and Individual Score Estimation

We first consider the special case where $K_{g}=0$ for $g=1,\ldots,G$. Under this setup, the individual components vanish and (2.3) and (2.4) reduce to the following model:

$$\mathbf{X}=\mathbf{S}\mathbf{U}+\mathbf{E},\quad\bm{Y}=\mathbf{S}\bm{\alpha}+\mathbf{e}, \qquad (3.1)$$

where $\mathbf{S}=[\mathbf{S}_{1}^{\prime},\ldots,\mathbf{S}_{G}^{\prime}]^{\prime}$, $\mathbf{E}=[\mathbf{E}_{1}^{\prime},\ldots,\mathbf{E}_{G}^{\prime}]^{\prime}$, and $\mathbf{e}=[\mathbf{e}_{1}^{\prime},\ldots,\mathbf{e}_{G}^{\prime}]^{\prime}$.

The formulation of (3.1) covers many existing classic methods. For example, in PCR, $\mathbf{S}$ is chosen to be the score matrix of the first $K$ principal components of $\mathbf{X}^{\prime}\mathbf{X}$. However, PCR is inherently unsupervised and ignores the information from $\bm{Y}$. Among the other supervised methods, PLS regression is a popular approach that incorporates regression on the latent scores. When $K=1$ and $p<n$, the standard OLS can also be cast under the above setup.

According to the proof of our Lemma 2, $\mathbf{S}$ can be constructed with the basis matrix $\mathbf{W}$. For the estimation of $\mathbf{W}$, we utilize the continuum regression (CR) algorithm (Stone and Brooks, 1990), the result of which covers OLS, PLS, and PCR as special cases. For $k=1,\ldots,K$, CR sequentially solves $\mathbf{w}_{k}$ from the following optimization problem:

$$\begin{aligned}
\max_{\mathbf{w}}\quad & \text{cov}(\mathbf{X}\mathbf{w},\bm{Y})^{2}\,\text{var}(\mathbf{X}\mathbf{w})^{\gamma-1} \qquad (3.2)\\
\text{s.t.}\quad & \mathbf{w}^{\prime}\mathbf{w}=1,\quad \mathbf{w}^{\prime}\mathbf{X}^{\prime}\mathbf{X}\mathbf{w}_{j}=0;\quad j=1,\ldots,k-1 \text{ if } k\geq 2,
\end{aligned}$$

where $\text{cov}(\mathbf{X}\mathbf{w},\bm{Y})=\mathbf{w}^{\prime}\mathbf{X}^{\prime}\bm{Y}$ and $\text{var}(\mathbf{X}\mathbf{w})=\mathbf{w}^{\prime}\mathbf{X}^{\prime}\mathbf{X}\mathbf{w}$, once the columns of $\mathbf{X}$ and $\bm{Y}$ are centered to have mean zero. Here, $\gamma\geq 0$ is a tuning parameter that controls how much variability of $\mathbf{X}$ is taken into account for the construction of $\mathbf{S}$. When $\gamma\rightarrow\infty$, the objective function in (3.2) is dominated by $\text{var}(\mathbf{X}\mathbf{w})^{\gamma-1}$ and $\bm{Y}$ does not play a role. The CR solution of $\mathbf{W}$ then seeks the principal component directions that maximize the variation of $\mathbf{X}$. It can be shown that (3.2) coincides with the OLS and PLS solutions when $\gamma=0$ and $1$ respectively.

Let $\hat{\mathbf{W}}$ denote the solution to (3.2) and $\hat{\mathbf{S}}=\mathbf{X}\hat{\mathbf{W}}$. Then $\bm{\alpha}$ can be estimated by $\hat{\bm{\alpha}}=(\hat{\mathbf{S}}^{\prime}\hat{\mathbf{S}})^{-1}\hat{\mathbf{S}}^{\prime}\bm{Y}$. As illustrated in Lemma 2, $\mathbf{J}_{g}$ is the projection of $\mathbf{X}_{g}$ onto the column space spanned by $\mathbf{W}$. Hence, we have $\hat{\mathbf{J}}_{g}=\mathbf{X}_{g}\hat{\mathbf{W}}(\hat{\mathbf{W}}^{\prime}\hat{\mathbf{W}})^{-1}\hat{\mathbf{W}}^{\prime}$, which further gives $\hat{\mathbf{U}}=(\hat{\mathbf{W}}^{\prime}\hat{\mathbf{W}})^{-1}\hat{\mathbf{W}}^{\prime}$.
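For intuition, the following R sketch computes the first CR direction for the joint model (3.1) in the three special cases named above, using their standard closed forms: the OLS direction for $\gamma=0$, the PLS direction $\mathbf{X}^{\prime}\bm{Y}$ for $\gamma=1$, and the top eigenvector of $\mathbf{X}^{\prime}\mathbf{X}$ for $\gamma\rightarrow\infty$. It handles only the first component ($k=1$) and is a sketch rather than the general CR algorithm of Appendix B; the helper names are ours, and MASS::ginv is used for the pseudo-inverse.

```r
# First continuum regression direction for the joint model (3.1), using the
# closed forms of the three special cases (k = 1 only; X and Y are assumed
# to be column-centered). This is a sketch, not the general CR algorithm.
cr_first_direction <- function(X, Y, case = c("ols", "pls", "pcr")) {
  case <- match.arg(case)
  s <- drop(crossprod(X, Y))                                   # X'Y
  w <- switch(case,
    ols = drop(MASS::ginv(crossprod(X)) %*% s),                # gamma = 0
    pls = s,                                                   # gamma = 1
    pcr = eigen(crossprod(X), symmetric = TRUE)$vectors[, 1])  # gamma -> infinity
  w / sqrt(sum(w^2))                                           # enforce w'w = 1
}

# Joint fit with K = 1: score S = Xw and least squares coefficient alpha-hat.
fit_joint_k1 <- function(X, Y, case) {
  w <- cr_first_direction(X, Y, case)
  S <- drop(X %*% w)
  list(w = w, S = S, alpha = sum(S * Y) / sum(S^2))
}
```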

Next we consider model estimation under the special case where $K=0$. In this case, the joint component vanishes, and (2.3) and (2.4) reduce to the following individual model:

$$\mathbf{X}_{g}=\mathbf{T}_{g}\mathbf{U}_{g}+\mathbf{E}_{g},\quad\bm{Y}_{g}=\mathbf{T}_{g}\bm{\alpha}_{g}+\mathbf{e}_{g}. \qquad (3.3)$$

As in the joint score estimation above, we utilize CR to construct $\mathbf{T}_{g}=\mathbf{X}_{g}\mathbf{W}_{g}$ as a linear transformation of $\mathbf{X}_{g}$, where $\mathbf{W}_{g}$ is a $p\times K_{g}$ basis matrix whose columns span $\text{row}(\mathbf{A}_{g})$. Let $\mathbf{W}_{g}=[\mathbf{w}_{g1},\ldots,\mathbf{w}_{gK_{g}}]$. Given group $g$, for $k=1,\ldots,K_{g}$, CR sequentially solves $\mathbf{w}_{gk}$ from the following optimization problem:

$$\begin{aligned}
\max_{\mathbf{w}}\quad & \text{cov}(\mathbf{X}_{g}\mathbf{w},\bm{Y}_{g})^{2}\,\text{var}(\mathbf{X}_{g}\mathbf{w})^{\gamma-1} \qquad (3.4)\\
\text{s.t.}\quad & \mathbf{w}^{\prime}\mathbf{w}=1,\quad \mathbf{w}^{\prime}\mathbf{X}_{g}^{\prime}\mathbf{X}_{g}\mathbf{w}_{gj}=0;\quad j=1,\ldots,k-1 \text{ if } k\geq 2.
\end{aligned}$$

Denote by $\hat{\mathbf{W}}_{g}$ the solution to (3.4). Similar to the joint estimation, once $\hat{\mathbf{T}}_{g}=\mathbf{X}_{g}\hat{\mathbf{W}}_{g}$ is constructed, $\bm{\alpha}_{g}$ can be obtained as the least squares solution $\hat{\bm{\alpha}}_{g}=(\hat{\mathbf{T}}_{g}^{\prime}\hat{\mathbf{T}}_{g})^{-1}\hat{\mathbf{T}}_{g}^{\prime}\bm{Y}_{g}$. We then obtain $\hat{\mathbf{A}}_{g}=\mathbf{X}_{g}\hat{\mathbf{W}}_{g}(\hat{\mathbf{W}}_{g}^{\prime}\hat{\mathbf{W}}_{g})^{-1}\hat{\mathbf{W}}_{g}^{\prime}$ and $\hat{\mathbf{U}}_{g}=(\hat{\mathbf{W}}_{g}^{\prime}\hat{\mathbf{W}}_{g})^{-1}\hat{\mathbf{W}}_{g}^{\prime}$.
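A per-group version of the earlier sketch makes the $K=0$ case concrete: for $K_{g}=1$ and $\gamma=1$, the first direction reduces to the normalized $\mathbf{X}_{g}^{\prime}\bm{Y}_{g}$ (the PLS direction). The data below are simulated placeholders with assumed dimensions.

```r
# Group-specific fit (3.3)-(3.4) for a single group with K_g = 1 and gamma = 1,
# where the first CR direction is the normalized X_g'Y_g. Data are placeholders.
set.seed(3)
X_g <- matrix(rnorm(50 * 20), 50, 20); Y_g <- rnorm(50)

w_g <- drop(crossprod(X_g, Y_g)); w_g <- w_g / sqrt(sum(w_g^2))  # gamma = 1 direction
T_g <- drop(X_g %*% w_g)                                         # individual score
alpha_g <- sum(T_g * Y_g) / sum(T_g^2)                           # least squares coefficient
A_g_hat <- tcrossprod(X_g %*% w_g, w_g)   # X_g w_g (w_g'w_g)^{-1} w_g', with w_g'w_g = 1
```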

3.2 JICO Algorithm

In this section, we consider the general case where $K$ and $K_{g}$ can both be nonzero. Since solving (3.2) and (3.4) simultaneously can be hard when both joint and individual structures are specified in the full model (2.3) and (2.4), we propose to iteratively solve one while fixing the other. This leads to the following iterative procedure.

  • Given $\hat{\mathbf{W}}_{g}$, solve the following constrained problem sequentially for $\mathbf{w}_{1},\ldots,\mathbf{w}_{K}$:

$$\begin{aligned}
\max_{\mathbf{w}}\quad & \text{cov}(\mathbf{X}^{\text{Joint}}\mathbf{w},\bm{Y}^{\text{Joint}})^{2}\,\text{var}(\mathbf{X}^{\text{Joint}}\mathbf{w})^{\gamma-1} \qquad (3.5)\\
\text{s.t.}\quad & \mathbf{w}^{\prime}\mathbf{w}=1,\\
& \mathbf{w}^{\prime}{\mathbf{X}^{\text{Joint}}}^{\prime}\mathbf{X}^{\text{Joint}}\mathbf{w}_{j}=0;\quad j=1,\ldots,k-1 \text{ if } k\geq 2,\\
& \hat{\mathbf{W}}_{g}^{\prime}\mathbf{w}=\mathbf{0}_{K_{g}\times 1};\quad g=1,\ldots,G,\\
& \hat{\mathbf{W}}_{g}^{\prime}{\mathbf{X}_{g}^{\text{Indiv}}}^{\prime}\mathbf{X}_{g}^{\text{Joint}}\mathbf{w}=\mathbf{0}_{K_{g}\times 1};\quad g=1,\ldots,G.
\end{aligned}$$

  • Given $\hat{\mathbf{W}}$, for any $1\leq g\leq G$, solve the following constrained problem sequentially for $\mathbf{w}_{g,1},\ldots,\mathbf{w}_{g,K_{g}}$:

$$\begin{aligned}
\max_{\mathbf{w}}\quad & \text{cov}(\mathbf{X}_{g}^{\text{Indiv}}\mathbf{w},\bm{Y}_{g}^{\text{Indiv}})^{2}\,\text{var}(\mathbf{X}_{g}^{\text{Indiv}}\mathbf{w})^{\gamma-1} \qquad (3.6)\\
\text{s.t.}\quad & \mathbf{w}^{\prime}\mathbf{w}=1,\\
& \mathbf{w}^{\prime}{\mathbf{X}_{g}^{\text{Indiv}}}^{\prime}\mathbf{X}_{g}^{\text{Indiv}}\mathbf{w}_{gj}=0;\quad j=1,\ldots,k-1 \text{ if } k\geq 2,\\
& \hat{\mathbf{W}}^{\prime}\mathbf{w}=\mathbf{0}_{K\times 1},\\
& \hat{\mathbf{W}}^{\prime}{\mathbf{X}_{g}^{\text{Joint}}}^{\prime}\mathbf{X}_{g}^{\text{Indiv}}\mathbf{w}=\mathbf{0}.
\end{aligned}$$

  • Repeat the above two steps until convergence.

Note that in (3.5) and (3.6), we denote

$$\mathbf{X}^{\text{Joint}}=\left[\begin{array}{c}\mathbf{X}_{1}-\mathbf{T}_{1}\mathbf{U}_{1}\\ \vdots\\ \mathbf{X}_{G}-\mathbf{T}_{G}\mathbf{U}_{G}\end{array}\right],\quad\bm{Y}^{\text{Joint}}=\left[\begin{array}{c}\bm{Y}_{1}-\mathbf{T}_{1}\bm{\alpha}_{1}\\ \vdots\\ \bm{Y}_{G}-\mathbf{T}_{G}\bm{\alpha}_{G}\end{array}\right],$$

and $\mathbf{X}_{g}^{\text{Indiv}}=\mathbf{X}_{g}-\mathbf{S}_{g}\mathbf{U}$, $\bm{Y}_{g}^{\text{Indiv}}=\bm{Y}_{g}-\mathbf{S}_{g}\bm{\alpha}$, for $g=1,\ldots,G$. Moreover, the last two constraints in (3.5) and (3.6) correspond to the two sufficient conditions in Corollary 1 that ensure the identifiability conditions $\text{row}(\mathbf{J}_{g})\perp\text{row}(\mathbf{A}_{g})$ and $\text{col}(\mathbf{J}_{g})\perp\text{col}(\mathbf{A}_{g})$ needed in Lemma 1.

We formulate (3.5) and (3.6) as a generic CR problem and derive an algorithm to solve it in Appendix B. Furthermore, we describe the convergence criterion for the iterative procedure and give its pseudo code in Algorithm 1 of Appendix C. Empirically, the algorithm always meets our convergence criterion, albeit there is no theoretical guarantee. In practice, we recommend starting the algorithm with multiple initial values and choosing the one with the smallest cross-validated mean squared error. To predict the response using JICO estimates, we let $\mathbf{S}_{g,\text{test}}=\mathbf{X}_{g,\text{test}}\hat{\mathbf{W}}$ and $\mathbf{T}_{g,\text{test}}=\mathbf{X}_{g,\text{test}}\hat{\mathbf{W}}_{g}$, where $\mathbf{X}_{g,\text{test}}$ is the test set. The prediction of the response is then given by $\mathbf{S}_{g,\text{test}}\hat{\bm{\alpha}}+\mathbf{T}_{g,\text{test}}\hat{\bm{\alpha}}_{g}$; a simplified sketch of this alternating scheme is given below.
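The following R sketch conveys the flavor of the alternation with $K=K_{g}=1$ and $\gamma=1$ (PLS-type first directions). For brevity it residualizes only the responses and drops the orthogonality constraints of (3.5)-(3.6), so it is a simplified illustration rather than the full JICO algorithm; see Algorithm 1 in Appendix C and the R package for the complete version.

```r
# Simplified alternating scheme with K = K_g = 1 and gamma = 1: only the
# responses are residualized and the orthogonality constraints in (3.5)-(3.6)
# are omitted, so this is an illustration, not the full JICO algorithm.
jico_sketch <- function(Xs, Ys, n_iter = 20) {
  G <- length(Xs); p <- ncol(Xs[[1]])
  w <- rep(0, p); alpha <- 0
  wg <- replicate(G, rep(0, p), simplify = FALSE); alphag <- rep(0, G)

  for (it in seq_len(n_iter)) {
    # Joint step: fit the shared component on responses minus individual fits.
    Yj <- unlist(Map(function(X, Y, v, a) Y - drop(X %*% v) * a,
                     Xs, Ys, wg, alphag))
    Xj <- do.call(rbind, Xs)
    w  <- drop(crossprod(Xj, Yj)); w <- w / sqrt(sum(w^2))
    S  <- drop(Xj %*% w)
    alpha <- sum(S * Yj) / sum(S^2)

    # Individual step: fit each group's component on responses minus the joint fit.
    for (g in seq_len(G)) {
      Yi <- Ys[[g]] - drop(Xs[[g]] %*% w) * alpha
      v  <- drop(crossprod(Xs[[g]], Yi)); v <- v / sqrt(sum(v^2))
      Tg <- drop(Xs[[g]] %*% v)
      wg[[g]] <- v; alphag[g] <- sum(Tg * Yi) / sum(Tg^2)
    }
  }
  list(w = w, alpha = alpha, wg = wg, alphag = alphag)
}

# Prediction for a new matrix X_new from group g:
# drop(X_new %*% fit$w) * fit$alpha + drop(X_new %*% fit$wg[[g]]) * fit$alphag[g]
```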

In practice, we need to select the tuning parameters $K$, $K_{g}$, and $\gamma$. We describe how to select them in Appendix D. Since we use a grid search for the tuning parameters, if the difference between the ranks of the individual matrices and the joint matrix is large, our algorithm may take a long time to compute. The goal of our algorithm is to approximate the joint and individual structures of the data by low-rank matrices and to use the resulting low-rank matrices for prediction. Therefore, when the individual and joint matrices are too complex to be approximated by low-rank matrices, we suggest using alternative methods.

Finally, we point out that our method includes JIVE-predict (Kaplan and Lock, 2017) as a special case. JIVE-predict is a two-stage method that first implements JIVE on $\mathbf{X}$ and then regresses the responses on the loading matrix. When we let $\gamma\to\infty$ in (3.2) and (3.4), JICO is equivalent to performing JIVE on $\mathbf{X}$. For that reason, our method in that case is equivalent to JIVE-predict.

4 Simulation Studies

One significant advantage of our proposed model is its flexibility to lie between global and group-specific models. Moreover, the choice of the parameter $\gamma$ in CR allows it to identify the model that best fits the data. In this section, we conduct multiple simulation studies to further demonstrate the advantages of our proposed model.

We consider three simulation settings in this section. In the first two settings, we generate data according to models that contain both global and group-specific components. The data are generated in a way that PCR and PLS solutions are favored respectively. In the last setting, we simulate data from two special cases: a global model and a group-specific model. The data are simulated so that the OLS is favored for both cases. For all three settings, JICO can adaptively choose the correct model parameter γ\gamma so that it has the optimal performance. Moreover, we further illustrate how the rank selection impacts the performance of JICO by examining the results using mis-specified ranks.

We fix $G=2$, $p=200$, and $n_{1}=n_{2}=50$. In each replication, we generate 100 training samples to train the models and evaluate the corresponding Mean Squared Error (MSE) on an independent test set of 100 samples. We repeat the simulation 50 times.

For $g=1,\ldots,G$, we generate $\mathbf{X}_{g}$ as i.i.d. samples from $\mathcal{N}(\bm{0},\mathbf{I}_{p\times p})$. For simplicity, we generate $\bm{Y}_{g}$ from the following model with two latent components:

$$\bm{Y}_{g}=\alpha\bm{S}_{g}+\alpha_{g}\bm{T}_{g}+\mathbf{e}_{g}, \qquad (4.1)$$

where $\bm{S}_{g}=\mathbf{X}_{g}\mathbf{w}\in\mathbb{R}^{n_{g}}$ is the joint latent score vector with coefficient $\alpha$, $\bm{T}_{g}=\mathbf{X}_{g}\mathbf{w}_{g}\in\mathbb{R}^{n_{g}}$ is the individual latent score vector with coefficient $\alpha_{g}$, and $\mathbf{e}_{g}$ is generated as i.i.d. samples from $\mathcal{N}(0,0.04)$. Here, $\mathbf{w}$ and $\mathbf{w}_{g}$ are $p\times 1$ vectors constructed such that $\mathbf{w}^{\prime}\mathbf{w}_{g}=0$. We vary the choices of $\mathbf{w}$, $\mathbf{w}_{g}$, $\alpha$, and $\alpha_{g}$, as discussed below.

4.1 PCR Setting

In this section, we simulate a model that favors $\gamma=\infty$. In this case, the CR solutions to (3.2) and (3.4) coincide with PCR, which essentially takes the top eigenvectors of the corresponding covariance matrices.

To simulate this setup, given training data $\mathbf{X}=[\mathbf{X}_{1}^{\prime},\mathbf{X}_{2}^{\prime}]^{\prime}$, we let $\mathbf{w}$ be the top eigenvector of $\mathbf{X}^{\prime}\mathbf{X}$. We further set $\mathbf{w}_{g}$ as the top eigenvector of $\tilde{\mathbf{X}}_{g}^{\prime}\tilde{\mathbf{X}}_{g}$, where $\tilde{\mathbf{X}}_{g}=\mathbf{X}_{g}(\mathbf{I}-\mathbf{w}\mathbf{w}^{\prime})$ is the data matrix after projecting $\mathbf{X}_{g}$ onto the linear subspace orthogonal to $\mathbf{w}$. This projection ensures that the constructed $\mathbf{w}$ and $\mathbf{w}_{g}$ satisfy $\mathbf{w}^{\prime}\mathbf{w}_{g}=0$. To generate $\bm{Y}_{g}$, we let $\alpha=1$ and $\alpha_{g}=1$ for $g=1,2$.
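The construction described above can be reproduced directly; below is a short R sketch of this PCR-setting data generation (the noise standard deviation 0.2 corresponds to the variance 0.04 stated earlier).

```r
# PCR-setting data generation: w is the top eigenvector of X'X, and w_g is the
# top eigenvector of the covariance of X_g projected off w, so that w'w_g = 0.
set.seed(4)
G <- 2; p <- 200; n_g <- 50
Xs <- lapply(seq_len(G), function(g) matrix(rnorm(n_g * p), n_g, p))
X  <- do.call(rbind, Xs)

w  <- eigen(crossprod(X), symmetric = TRUE)$vectors[, 1]       # top eigenvector of X'X
ws <- lapply(Xs, function(X_g) {
  X_tilde <- X_g %*% (diag(p) - tcrossprod(w))                 # project off w
  eigen(crossprod(X_tilde), symmetric = TRUE)$vectors[, 1]
})
Ys <- Map(function(X_g, w_g) drop(X_g %*% w) + drop(X_g %*% w_g) +
            rnorm(nrow(X_g), sd = 0.2),                        # alpha = alpha_g = 1
          Xs, ws)
crossprod(w, ws[[1]])   # ~ 0 by construction
```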

We train JICO over a wide range of $\gamma\in[0,\infty)$, using different combinations of $K$, $K_{1}$, $K_{2}$, with a maximum of 300 iterations. Figure 4.1(a) shows the MSEs evaluated on the test data over 50 repetitions. For better illustration, we plot the MSE curves as a function of $a=\gamma/(\gamma+1)$, a one-to-one monotone map from $\gamma\in[0,\infty)$ to $a\in[0,1]$. In particular, $a=0$, $0.5$, and $1$ correspond to $\gamma=0$, $1$, and $\infty$, i.e., the cases of OLS, PLS, and PCR respectively. The solid curve shows the model performance given the true ranks $K=K_{1}=K_{2}=1$, whereas the gray curves show the performance of models with mis-specified ranks. In particular, we consider four mis-specified rank combinations. Two of them ($K=1$, $K_{1}=K_{2}=0$; $K=2$, $K_{1}=K_{2}=0$) correspond to joint models, and the other two ($K=0$, $K_{1}=K_{2}=1$; $K=0$, $K_{1}=K_{2}=2$) correspond to group-specific models. We can see from Figure 4.1(a) that the absolute minimum is attained by the model with true ranks at $a=1$, which corresponds to the underlying true model. Looking at the curves over the spectrum of $a$ as a whole, the joint models with $K=1$ or $2$ and $K_{1}=K_{2}=0$ always perform worse than the model with $K=K_{1}=K_{2}=1$, because they are unable to capture the group-specific information in the underlying model. The model with true ranks performs better than the individual models with $K=0$, $K_{1}=K_{2}=1$ or $2$ for larger values of $a$, because the latter cannot capture as much global information as the former. However, the model with $K=K_{1}=K_{2}=1$ performs worse than the individual models for smaller values of $a$, where the latter achieve considerably better performance. This means that the choice of optimal ranks for our model can be sensitive to the choice of $\gamma$. For smaller $\gamma$ values, individual models tend to be more reliable under the PCR setting. We notice that the end of the curve is not very smooth when $K=2$, $K_{1}=K_{2}=0$. One possible reason is that the solution path of CR can sometimes be discontinuous with respect to $\gamma$ (Björkström and Sundberg, 1996), so the CR algorithm may be numerically unstable for certain $\gamma$ values.

(a) PCR Example; (b) PLS Example.

Figure 4.1: MSE curves for JICO models with different ranks under the PCR setting (a) and PLS setting (b). This figure appears in color in the electronic version of this article, and any mention of color refers to that version.

We further illustrate the performance of JICO by comparing it with several existing methods: ridge regression (Ridge), partial least squares (PLS), and principal component regression (PCR). For JICO, we select the models trained under the true ranks $K=K_{1}=K_{2}=1$ (performance as illustrated by the solid curve in Figure 4.1(a)), with $\gamma=0,1,\infty$, which correspond to the cases of OLS, PLS, and PCR respectively. For a fair comparison, for the PLS and PCR methods, we fix the number of components to be 2 for both the global fit and the group-specific fit. Table 1(a) summarizes the MSEs of these methods. The numbers in brackets are standard errors. The first two columns summarize the performance for each group ($g=1,2$), and the last column summarizes the overall performance. The JICO model with $\gamma=\infty$ performs significantly better than the rest, because it agrees with the underlying true model. Among the other mis-specified methods, group-specific PLS is relatively more robust to model mis-specification.

(a) PCR Example
Method                  g=1             g=2             Overall
JICO, $\gamma=0$        1.994 (0.063)   2.012 (0.068)   2.003 (0.053)
JICO, $\gamma=1$        0.679 (0.018)   0.701 (0.026)   0.69 (0.017)
JICO, $\gamma=\infty$   0.04 (0.001)    0.04 (0.001)    0.04 (0.001)
Global Ridge            1.734 (0.056)   1.78 (0.065)    1.757 (0.052)
Global PLS              1.163 (0.033)   1.194 (0.045)   1.178 (0.031)
Global PCR              0.946 (0.04)    0.977 (0.044)   0.961 (0.022)
Group-specific Ridge    0.252 (0.009)   0.27 (0.009)    0.261 (0.005)
Group-specific PLS      0.254 (0.009)   0.272 (0.009)   0.263 (0.005)
Group-specific PCR      0.68 (0.042)    0.71 (0.05)     0.695 (0.032)

(b) PLS Example
Method                  g=1             g=2             Overall
JICO, $\gamma=0$        0.57 (0.023)    0.567 (0.021)   0.569 (0.018)
JICO, $\gamma=1$        0.211 (0.008)   0.218 (0.008)   0.215 (0.006)
JICO, $\gamma=\infty$   1.236 (0.038)   1.277 (0.037)   1.256 (0.025)
Global Ridge            1.698 (0.064)   1.742 (0.06)    1.72 (0.055)
Global PLS              0.3 (0.011)     0.297 (0.01)    0.299 (0.008)
Global PCR              1.229 (0.036)   1.298 (0.041)   1.263 (0.025)
Group-specific Ridge    0.375 (0.013)   0.425 (0.016)   0.4 (0.01)
Group-specific PLS      0.412 (0.014)   0.406 (0.016)   0.409 (0.008)
Group-specific PCR      1.234 (0.037)   1.25 (0.036)    1.242 (0.024)

(c) OLS Example (a)
Method                  g=1             g=2             Overall
JICO, $\gamma=0$        0.082 (0.002)   0.083 (0.003)   0.082 (0.002)
JICO, $\gamma=1$        0.403 (0.011)   0.419 (0.011)   0.411 (0.007)
JICO, $\gamma=\infty$   1.006 (0.031)   1.07 (0.03)     1.038 (0.02)
Global Ridge            0.084 (0.004)   0.084 (0.003)   0.084 (0.003)
Global PLS              0.221 (0.007)   0.226 (0.006)   0.223 (0.005)
Global PCR              0.991 (0.032)   1.069 (0.030)   1.030 (0.020)
Group-specific Ridge    0.574 (0.017)   0.599 (0.024)   0.586 (0.013)
Group-specific PLS      0.572 (0.016)   0.599 (0.024)   0.585 (0.013)
Group-specific PCR      0.996 (0.032)   1.061 (0.030)   1.028 (0.021)

(d) OLS Example (b)
Method                  g=1             g=2             Overall
JICO, $\gamma=0$        0.063 (0.002)   0.066 (0.004)   0.064 (0.002)
JICO, $\gamma=1$        0.257 (0.009)   0.27 (0.009)    0.264 (0.006)
JICO, $\gamma=\infty$   1.004 (0.031)   1.002 (0.03)    1.003 (0.023)
Global Ridge            0.646 (0.021)   0.673 (0.024)   0.66 (0.019)
Global PLS              0.957 (0.027)   0.971 (0.032)   0.964 (0.022)
Global PCR              1.023 (0.031)   1.016 (0.031)   1.019 (0.023)
Group-specific Ridge    0.076 (0.004)   0.072 (0.005)   0.074 (0.003)
Group-specific PLS      0.113 (0.003)   0.116 (0.005)   0.115 (0.003)
Group-specific PCR      0.978 (0.03)    0.987 (0.029)   0.983 (0.022)

Table 1: Groupwise and overall MSEs under the PCR, PLS, and OLS settings. Numbers in brackets are standard errors.

4.2 PLS Setting

In this section, we consider a model setup that is more favorable to $\gamma=1$. In this scenario, the CR solutions to (3.2) and (3.4) coincide with the PLS solutions. As in Section 4.1, we construct the weights as linear transformations of eigenvectors.

Given training data $\mathbf{X}=[\mathbf{X}_{1}^{\prime},\mathbf{X}_{2}^{\prime}]^{\prime}$, denote by $\mathbf{V}_{p\times q}$ the matrix of the top $q$ eigenvectors of $\mathbf{X}^{\prime}\mathbf{X}$. We let $\mathbf{w}=\mathbf{V}\bm{1}_{q}/\sqrt{q}$, where $\bm{1}_{q}$ denotes a $q\times 1$ vector with all elements equal to $1$. In this way, the top $q$ eigenvectors contribute equally to the construction of $\bm{S}_{g}$. Similarly, we let $\tilde{\mathbf{X}}_{g}=\mathbf{X}_{g}(\mathbf{I}-\mathbf{w}\mathbf{w}^{\prime})$ and $\mathbf{V}_{g}$ be the $p\times q_{g}$ matrix of the top $q_{g}$ eigenvectors of $\tilde{\mathbf{X}}_{g}^{\prime}\tilde{\mathbf{X}}_{g}$. Then we let $\mathbf{w}_{g}=\mathbf{V}_{g}\bm{1}_{q_{g}}/\sqrt{q_{g}}$. To construct a model more favorable to PLS, in this section, we let $q=n/2$ and $q_{g}=n_{g}/2$. We generate $\bm{Y}_{g}$ from (4.1) with $\alpha=1$ and $\alpha_{g}=0.5$.
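A small helper makes the construction above explicit; it averages the top $q$ eigenvectors with equal weights and can be applied to $\mathbf{X}$ and to each projected $\mathbf{X}_{g}$ as described. The helper name and the commented usage (which reuses objects from the PCR-setting sketch above) are our assumptions.

```r
# Equal-weight average of the top q eigenvectors of a covariance-type matrix,
# used to build w and w_g in the PLS setting (helper name is ours).
top_eig_avg <- function(M, q) {
  V <- eigen(M, symmetric = TRUE)$vectors[, seq_len(q), drop = FALSE]
  drop(V %*% rep(1, q)) / sqrt(q)
}
# Example usage with X and Xs from the PCR-setting sketch above:
# w   <- top_eig_avg(crossprod(X), q = nrow(X) / 2)
# w_1 <- top_eig_avg(crossprod(Xs[[1]] %*% (diag(ncol(X)) - tcrossprod(w))),
#                    q = nrow(Xs[[1]]) / 2)
```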

Similar to the PCR setting, in Figure 4.1(b) we illustrate the MSE curves of JICO models with different rank combinations over the spectrum of $a$, where $\gamma=a/(1-a)$. Again, the solid curve represents the model with true ranks, while the gray curves represent models with mis-specified ranks. The absolute minimum is attained by the solid curve at $a$ around 0.5, which corresponds to the underlying true model. Moreover, the solid curve gives almost uniformly the best performance over the spectrum of $a$ compared with the gray curves, except on a small range of $a$ close to 0. Hence, under the PLS setting, the optimal ranks can be less sensitive to the choice of $\gamma$. For small values of $a$, the solid curve almost overlaps with the gray curve representing the joint model with $K=1$, $K_{1}=K_{2}=0$. This means that when $\gamma$ is close to 0, the individual signals identified by the full model with $K=K_{1}=K_{2}=1$ can be ignored. Therefore, the two group-specific models that capture more individual information give the best performance in this case. For $a$ values closer to 1, the gray curve representing the joint model with $K=2$, $K_{1}=K_{2}=0$ is very close to the solid curve. This means that the effects of individual components estimated by JICO tend to become more similar across groups for larger $\gamma$.

In Table 1(b), we summarize the MSEs of JICO models trained with the true ranks $K=K_{1}=K_{2}=1$ and $\gamma=0,1,\infty$, along with the other methods described in Section 4.1. JICO with $\gamma=1$ shows the best performance among all methods, followed by the global PLS method, since the true model favors PLS and the coefficient $\alpha_{g}=0.5$ for the group-specific component is relatively small.

4.3 OLS Setting

In this section, we simulate the setting that favors $\gamma=0$, which corresponds to the case of OLS in CR. It is shown in Stone and Brooks (1990) that when $\gamma=0$, there is only one non-degenerate direction that can be constructed from the CR algorithm. Hence, under the JICO framework, the model that favors $\gamma=0$ embraces two special cases: a global model with $K=1$, $K_{g}=0$ and a group-specific model with $K=0$, $K_{g}=1$.

For the two cases, we simulate $\bm{Y}_{g}$ with (a) $\alpha=1$, $\alpha_{g}=0$ and (b) $\alpha=0$, $\alpha_{g}=1$ respectively. The construction of $\mathbf{w}$ and $\mathbf{w}_{g}$ is the same as that in Section 4.2 with $q=n$ and $q_{g}=n_{g}$.

Figure 4.2 illustrates the MSE curves for the two cases, where (a) represents the global model and (b) represents the group-specific model. In both cases, the absolute minimum is found on the solid curves at $a=0$, which represent the models with true ranks $K=1$, $K_{g}=0$ and $K=0$, $K_{g}=1$ respectively. In (a), there are two competitive models against the model with true ranks: another global model with $K=2$, $K_{g}=0$ and the model with $K=K_{1}=K_{2}=1$. They both achieve the same performance as the solid curve at $a=0$, and stay low at larger values of $a$. This is because, when $\gamma$ is mis-specified, additional model ranks help capture more information from the data. The $K=2$, $K_{g}=0$ model performs better because the underlying model is a global model. This is also true for (b). The global minimum is found at $a=0$ on the solid curve, while the $K=0$, $K_{g}=2$ model performs better when $a$ gets larger. Again, this is because a larger $K_{g}$ helps capture more information from the data. The $K=K_{1}=K_{2}=1$ model does not perform as well, because the estimated joint information dominates, which does not agree with the true model. We observe some discontinuities on the $K=2$, $K_{g}=0$ curve, since the CR solution path can sometimes be discontinuous with respect to $\gamma$, as discussed in the PCR setting in Section 4.1.

(a) $\alpha=1$, $\alpha_{g}=0$; (b) $\alpha=0$, $\alpha_{g}=1$.

Figure 4.2: MSE curves for JICO models with different ranks under OLS settings (a) and (b). (a) is generated under a global model and (b) is generated under a group-specific model. This figure appears in color in the electronic version of this article, and any mention of color refers to that version.

In Table 1(c) and (d), we summarize the MSEs of JICO models trained with the true ranks and $\gamma=0,1,\infty$, along with the other methods described in Section 4.1. For a fair comparison, the number of components for PCR and PLS is chosen to be 1 for both the global and group-specific fits. The JICO model with $\gamma=0$ and Ridge always achieve better performance than all other methods. It is interesting to note that in (c), the JICO models with $\gamma=1$ and $\infty$ coincide with global PLS and PCR respectively, and hence they achieve the same performance. Similarly, in (d), the JICO models with $\gamma=1$ and $\infty$ coincide with group-specific PLS and PCR respectively, and they achieve the same performance correspondingly. In addition, when $K=1$, $K_{g}=0$, the solution of the CR algorithm coincides with the global OLS model; thus the JICO model with $\gamma=0$ and the global Ridge have similar performance in (c). Similarly, when $K=0$, $K_{g}=1$, the JICO model with $\gamma=0$ and the group-specific Ridge have similar performance in (d).

5 Applications to ADNI Data Analysis

We apply our proposed method to analyze data from the Alzheimer's Disease (AD) Neuroimaging Initiative (ADNI). It is well known that AD accounts for most forms of dementia, characterized by progressive cognitive and memory deficits. The increasing incidence of AD makes it a very important health issue and has attracted a lot of scientific attention. To track AD progression, it is important to develop efficient methods for predicting disease status and clinical scores (e.g., the Mini Mental State Examination (MMSE) score and the AD Assessment Scale-Cognitive Subscale (ADAS-Cog) score). In this analysis, we are interested in predicting the ADAS-Cog score by features extracted from 93 brain regions in structural magnetic resonance imaging (MRI) scans. All subjects in our analysis are from the ADNI2 phase of the study. There are 494 subjects in total in our analysis, belonging to 3 subgroups: NC (178), eMCI (178), and AD (145), where the numbers in parentheses indicate the sample sizes for each subgroup. As a reminder, NC stands for Normal Control, and eMCI stands for the early stage of Mild Cognitive Impairment in AD progression.

For each group, we randomly partition the data into two parts: 80% for training the model and the rest for testing. We repeat the random split 50 times. The testing MSEs and the corresponding standard errors are reported in Table 2, with both groupwise and overall performance summarized. We compare our proposed JICO model with three methods: ridge regression (Ridge), PLS, and PCR. We perform both a global and a group-specific fit for Ridge, PLS, and PCR, where the regularization parameter in Ridge and the number of components in PCR or PLS are tuned by 10-fold cross validation (CV). For our proposed JICO model, we report results with fixed $\gamma=0,0.25,1,\infty$, and with $\gamma$ tuned by CV. In practice, using exhaustive search to select the optimal values for $K$ and $K_{g}$ can be computationally cumbersome, because the number of combinations grows exponentially with the number of candidates for each parameter. Based on our numerical experience, we find that setting all $K_{g}$ to be equal does not affect the prediction performance much; details are discussed in Appendix D. Therefore, in all these cases, the optimal ranks for JICO are chosen by an exhaustive search over $K\in\{0,1\}$ and $K_{1}=K_{2}=K_{3}\in\{0,1\}$ to see which combination gives the best MSE. We choose $K$ and $K_{g}$ to be small to improve model interpretability. The optimal value of $\gamma$ is chosen by 10-fold CV.

As shown in Table 2, JICO performs the best among all competitors. Fitting JICO with $\gamma=0.25$ yields the smallest overall MSE. JICO with parameters chosen by CV performs slightly worse, but is still better than the other global or group-specific methods. The results of JICO with $\gamma=0,1$, and $\infty$, which correspond to OLS, PLS, and PCR, are also provided in Table 2. Even though their prediction is not the best, an interesting observation is that they always perform better than their global or group-specific counterparts. For example, when $\gamma=1$, JICO has a much better overall prediction than the group-specific PLS. This indicates that it is beneficial to capture global and individual structures for regression when subpopulations exist in the data.

In Table 2, the global models perform the worst, because they do not take group heterogeneity into consideration. The group-specific Ridge appears to be the most competitive among the group-specific methods. Note that for the AD group, our JICO model with $\gamma=0.25$ or tuned $\gamma$ outperforms the group-specific Ridge by a large margin.

Method                  NC              eMCI            AD              Overall
JICO, $\gamma=0$        6.671 (0.137)   11.319 (0.309)  55.798 (1.556)  22.821 (0.466)
JICO, $\gamma=0.25$     6.316 (0.121)   10.394 (0.279)  40.853 (1.294)  17.951 (0.394)
JICO, $\gamma=1$        6.443 (0.124)   10.353 (0.291)  44.054 (1.449)  18.929 (0.441)
JICO, $\gamma=\infty$   6.608 (0.138)   11.121 (0.308)  49.997 (1.832)  21.013 (0.558)
JICO, CV                6.414 (0.129)   10.333 (0.289)  41.297 (1.348)  18.096 (0.401)
Global Ridge            23.450 (0.751)  21.276 (0.796)  63.989 (2.657)  34.692 (0.840)
Global PLS              26.310 (0.787)  22.672 (0.915)  68.193 (3.183)  37.442 (0.982)
Global PCR              25.228 (0.771)  21.966 (0.802)  69.541 (2.969)  37.209 (0.907)
Group-specific Ridge    6.336 (0.116)   10.353 (0.278)  42.271 (1.315)  18.364 (0.392)
Group-specific PLS      6.656 (0.136)   11.298 (0.306)  48.434 (1.725)  20.629 (0.524)
Group-specific PCR      6.656 (0.136)   11.346 (0.304)  47.357 (1.484)  20.327 (0.454)

Table 2: Groupwise and overall MSEs on the ADNI data. Numbers in brackets are standard errors.

To make the results more interpretable, we further apply the JICO model to the NC and AD groups only. We run 50 replications of 10-fold CV to see which combination of tuning parameters gives the smallest overall MSE. The best choice is $\gamma=\infty$, $K=1$, $K_{NC}=K_{AD}=3$. We then apply JICO with this choice of tuning parameters and show the heatmaps of the estimated $\hat{\mathbf{J}}_{g}$ (left column) and $\hat{\mathbf{A}}_{g}$ (right column) in Figure 5.1. Rows of the heatmaps represent samples and columns represent MRI features. We use Ward's linkage to perform hierarchical clustering on the rows of $\hat{\mathbf{J}}_{g}$, and arrange the rows of $\hat{\mathbf{J}}_{g}$ and $\hat{\mathbf{A}}_{g}$ in the same order for each group. Moreover, we apply the same clustering algorithm to the columns of $\hat{\mathbf{J}}_{g}$ to arrange the columns in the same order across the two disease groups for both joint and individual structures. Figure 5.1 shows that JICO separates joint and individual structures effectively. The joint structures across the two disease groups share a very similar pattern, whereas the individual structures appear to be very distinct. We further magnify the right column of Figure 5.1 in Figure 5.2, with the brain region names listed. We find that the variation in $\hat{\mathbf{A}}_{g}$ for the AD group is much larger than its counterpart for the NC group. We highlight the brain regions that differ the most between the two groups. The highlighted regions play crucial roles in human cognition and are thus important for early AD diagnosis (Michon et al., 1994; Killiany et al., 1993). For example, Michon et al. (1994) suggested that anosognosia in AD results in part from frontal dysfunction. Killiany et al. (1993) showed that the temporal horn of the lateral ventricles can be used as an antemortem marker of AD.
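The heatmap ordering described above can be sketched as follows in R; the specific Ward linkage option and the random placeholder matrices standing in for $\hat{\mathbf{J}}_{g}$ and $\hat{\mathbf{A}}_{g}$ are our assumptions for illustration.

```r
# Sketch of the heatmap row/column ordering: Ward's linkage hierarchical
# clustering on the rows (and columns) of J_g-hat, with the same order then
# reused for A_g-hat. J_hat and A_hat below are random placeholders.
set.seed(5)
J_hat <- matrix(rnorm(178 * 93), 178, 93)                    # placeholder for J_g-hat
A_hat <- matrix(rnorm(178 * 93), 178, 93)                    # placeholder for A_g-hat

row_order <- hclust(dist(J_hat),    method = "ward.D2")$order
col_order <- hclust(dist(t(J_hat)), method = "ward.D2")$order

J_ordered <- J_hat[row_order, col_order]
A_ordered <- A_hat[row_order, col_order]                     # same ordering reused
# image(t(J_ordered)) and image(t(A_ordered)) would display the heatmaps
```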

$\mathbf{J}_{g}$ (left) and $\mathbf{A}_{g}$ (right), for NC (top) and AD (bottom).

Figure 5.1: Heatmaps of joint and individual structures from NC and AD estimated from JICO. This figure appears in color in the electronic version of this article, and any mention of color refers to that version.

Figure 5.2: Heatmaps of individual structures from NC and AD estimated from JICO with MRI feature names. The highlighted regions are frontal lobe WM, temporal lobe WM, occipital lobe WM, parietal lobe WM, superior parietal lobule, and middle temporal gyrus. The top row is from NC and the bottom row is from AD. This figure appears in color in the electronic version of this article, and any mention of color refers to that version.

6 Discussion

In this paper, we propose JICO, a latent component regression model for multi-group heterogeneous data. Our proposed model decomposes the response into jointly shared and group-specific components, which are driven by low-rank approximations of joint and individual structures from the predictors respectively. For model estimation, we propose an iterative procedure to solve for the model components and utilize the CR algorithm, which covers OLS, PLS, and PCR as special cases. As a result, the proposed procedure is able to extend many regression algorithms covered by CR to the setting of heterogeneous data. Extensive simulation studies and a real data analysis on the ADNI data further demonstrate the competitive performance of JICO.

JICO is designed to be very flexible for multi-group data. It is able to choose the optimal parameter that determines the regression algorithm best suited to the data, so that prediction power is retained. At the same time, it is also able to select the optimal joint and individual ranks that best describe the degree of heterogeneity residing in each subgroup. The application of JICO to the ADNI data illustrates that our proposed model can provide informative visualization of the joint and individual components from the entire dataset without losing much prediction power.

DATA AVAILABILITY STATEMENT

The data that support the findings of this paper are available in the Alzheimer’s Disease Neuroimaging Initiative at https://adni.loni.usc.edu/.

Appendix A Proofs

Proof of Lemma 1. The proof of the first part of this lemma follows similar arguments as in Feng et al. (2018). Define $\text{row}(\mathbf{J})=\bigcap_{g=1}^{G}\text{row}(\tilde{\mathbf{X}}_{g})$. We assume that $\text{row}(\mathbf{J})\neq\{\bm{0}\}$ for non-trivial cases. For each $g$, $\mathbf{J}_{g}$ and $\mathbf{A}_{g}$ can be obtained by projecting $\tilde{\mathbf{X}}_{g}$ onto $\text{row}(\mathbf{J})$ and its orthogonal complement in $\text{row}(\tilde{\mathbf{X}}_{g})$. Thus, by construction, we have $\tilde{\mathbf{X}}_{g}=\mathbf{J}_{g}+\mathbf{A}_{g}$ and $\text{row}(\mathbf{J}_{g})\perp\text{row}(\mathbf{A}_{g})$. Since for all $g$ we have $\text{row}(\mathbf{A}_{g})\perp\text{row}(\mathbf{J})$, $\bigcap_{g=1}^{G}\text{row}(\mathbf{A}_{g})$ has a zero projection matrix. Therefore, we have $\bigcap_{g=1}^{G}\text{row}(\mathbf{A}_{g})=\{\mathbf{0}\}$. On the other hand, since $\text{row}(\mathbf{J}_{g})\subset\text{row}(\tilde{\mathbf{X}}_{g})$, the orthogonal projection of $\tilde{\mathbf{X}}_{g}$ onto $\text{row}(\mathbf{J}_{g})$ is unique. Therefore, the matrices $\mathbf{J}_{g}$ and $\mathbf{A}_{g}$ are uniquely defined.

For the second part of Lemma 1, note that $\tilde{\bm{Y}}_{g}\in\text{col}(\tilde{\mathbf{X}}_{g})\subset\text{col}(\mathbf{J}_{g})+\text{col}(\mathbf{A}_{g})$. Then, $\bm{J}_{g}^{Y}$ and $\bm{A}_{g}^{Y}$ can be obtained by projecting $\tilde{\bm{Y}}_{g}$ onto $\text{col}(\mathbf{J}_{g})$ and $\text{col}(\mathbf{A}_{g})$ respectively. Because $\text{col}(\mathbf{J}_{g})\perp\text{col}(\mathbf{A}_{g})$, $\bm{J}_{g}^{Y}$ and $\bm{A}_{g}^{Y}$ are uniquely defined.

Proof of Lemma 2. We explicitly describe how to find $\mathbf{J}_{g}$, $\mathbf{A}_{g}$, $\mathbf{U}$, $\mathbf{U}_{g}$, $\mathbf{S}_{g}$, $\mathbf{T}_{g}$, $\bm{\alpha}$, $\bm{\alpha}_{g}$ to satisfy the requirements. First, $\mathbf{J}_{g}$ can be obtained by finding an arbitrary set of bases that span $\bigcap_{g=1}^{G}\text{row}(\mathbf{X}_{g})$. Then, $\mathbf{A}_{g}$ can be obtained by solving a system of linear equations. We prove that such $\mathbf{J}_{g}$ and $\mathbf{A}_{g}$ satisfy conditions (i)-(iii) in Lemma 1 and $\text{col}(\mathbf{J}_{g})\perp\text{col}(\mathbf{A}_{g})$. Given $\mathbf{J}_{g}$ and $\mathbf{A}_{g}$, we construct $\mathbf{U},\mathbf{U}_{g},\mathbf{S}_{g},\mathbf{T}_{g}$ so that they satisfy $\mathbf{J}_{g}=\mathbf{S}_{g}\mathbf{U}$, $\mathbf{A}_{g}=\mathbf{T}_{g}\mathbf{U}_{g}$, and $\mathbf{S}_{g}^{\prime}\mathbf{T}_{g}=\bm{0}_{K\times K_{g}}$.

Let $\mathbf{w}_{1},\ldots,\mathbf{w}_{K}\in\mathbb{R}^{p}$ be an arbitrary set of bases that span $\bigcap_{g=1}^{G}\text{row}(\mathbf{X}_{g})$. Denote $\mathbf{W}_{p\times K}=[\mathbf{w}_{1},\ldots,\mathbf{w}_{K}]$. Let $\mathbf{J}_{g}=\mathbf{X}_{g}\mathbf{W}(\mathbf{W}^{\prime}\mathbf{W})^{-1}\mathbf{W}^{\prime}$. Next, we show that $\mathbf{J}_{g}$ satisfies condition (i) in Lemma 1 for any $1\leq g_{1}\neq g_{2}\leq G$. It suffices to show that $\text{row}(\mathbf{J}_{g_{1}})\subset\text{row}(\mathbf{J}_{g_{2}})$, i.e., there exists $\mathbf{C}_{g_{1},n_{g_{1}}\times n_{g_{2}}}^{g_{2}}$ such that $\mathbf{C}^{g_{2}}_{g_{1}}\mathbf{J}_{g_{2}}=\mathbf{J}_{g_{1}}$. Since the columns of $\mathbf{W}$ form the bases of $\bigcap_{g=1}^{G}\text{row}(\mathbf{X}_{g})$, there exists $\mathbf{Q}^{g_{2}}_{K\times n_{g_{2}}}$ such that $\mathbf{Q}^{g_{2}}\mathbf{X}_{g_{2}}=\mathbf{W}^{\prime}$. Then, we have $\mathbf{Q}^{g_{2}}\mathbf{J}_{g_{2}}=\mathbf{Q}^{g_{2}}\mathbf{X}_{g_{2}}\mathbf{W}(\mathbf{W}^{\prime}\mathbf{W})^{-1}\mathbf{W}^{\prime}=(\mathbf{W}^{\prime}\mathbf{W})(\mathbf{W}^{\prime}\mathbf{W})^{-1}\mathbf{W}^{\prime}=\mathbf{W}^{\prime}$. Let $\mathbf{R}_{g_{1}}=\mathbf{X}_{g_{1}}\mathbf{W}(\mathbf{W}^{\prime}\mathbf{W})^{-1}$ and $\mathbf{C}_{g_{1}}^{g_{2}}=\mathbf{R}_{g_{1}}\mathbf{Q}^{g_{2}}$; then we have

$$\mathbf{C}_{g_{1}}^{g_{2}}\mathbf{J}_{g_{2}}=\mathbf{R}_{g_{1}}\mathbf{Q}^{g_{2}}\mathbf{J}_{g_{2}}=\mathbf{X}_{g_{1}}\mathbf{W}(\mathbf{W}^{\prime}\mathbf{W})^{-1}\mathbf{W}^{\prime}=\mathbf{J}_{g_{1}}.$$

Given $\mathbf{W}$, we propose to solve

$$\begin{bmatrix}\mathbf{W}^{\prime}\\ \mathbf{W}^{\prime}\mathbf{X}_{g}^{\prime}\mathbf{X}_{g}\end{bmatrix}\mathbf{x}=\bm{0}, \qquad (A.1)$$

where $\mathbf{x}\in\mathbb{R}^{p}$ is unknown. We first show that (A.1) has non-zero solutions. Since $\mathbf{Q}^{g}\mathbf{X}_{g}=\mathbf{W}^{\prime}$, (A.1) can be rewritten as $\tilde{\mathbf{Q}}^{g}\mathbf{B}_{g}\mathbf{X}_{g}\mathbf{x}=\bm{0}$, where

$$\tilde{\mathbf{Q}}^{g}=\begin{pmatrix}\mathbf{Q}^{g}&\\ &\mathbf{Q}^{g}\end{pmatrix}\text{ and }\mathbf{B}_{g}=\begin{pmatrix}\mathbf{I}_{n_{g}\times n_{g}}\\ \mathbf{X}_{g}\mathbf{X}_{g}^{\prime}\end{pmatrix}.$$

Since

$$\text{rank}(\tilde{\mathbf{Q}}^{g}\mathbf{B}_{g}\mathbf{X}_{g})\leq\min\{\text{rank}(\tilde{\mathbf{Q}}^{g}),\text{rank}(\mathbf{B}_{g}),\text{rank}(\mathbf{X}_{g})\}\leq\min\{2K,n_{g},\text{rank}(\mathbf{X}_{g})\}\leq\text{rank}(\mathbf{X}_{g})<p,$$

(A.1) has non-zero solutions.

Let $\mathbf{W}_{g}=[\mathbf{w}_{g,1},\ldots,\mathbf{w}_{g,K_{g}}]$ be an arbitrary set of bases that spans the space of solutions to (A.1). Let $\mathbf{A}_{g}=\mathbf{X}_{g}\mathbf{W}_{g}(\mathbf{W}_{g}^{\prime}\mathbf{W}_{g})^{-1}\mathbf{W}_{g}^{\prime}$. Next, we show that $\mathbf{A}_{g}$ satisfies $(ii)$-$(iii)$ in Lemma 1 and $\text{col}(\mathbf{A}_{g})\perp\text{col}(\mathbf{J}_{g})$. Since the columns of $\mathbf{W}_{g}$ satisfy (A.1), we have $\mathbf{W}^{\prime}\mathbf{W}_{g}=\bm{0}$ and $\mathbf{W}^{\prime}\mathbf{X}_{g}^{\prime}\mathbf{X}_{g}\mathbf{W}_{g}=\bm{0}$. Then, by definition, $\mathbf{A}_{g}^{\prime}\mathbf{J}_{g}=\bm{0}$ and $\mathbf{J}_{g}\mathbf{A}_{g}^{\prime}=\bm{0}$, which imply that $\text{row}(\mathbf{A}_{g})\perp\text{row}(\mathbf{J}_{g})$ and $\text{col}(\mathbf{A}_{g})\perp\text{col}(\mathbf{J}_{g})$. Since for all $g$, $\text{row}(\mathbf{A}_{g})$ is perpendicular to the columns of $\mathbf{W}$, we have $\bigcap_{g=1}^{G}\text{row}(\mathbf{A}_{g})=\{\bm{0}\}$.

Finally, letting $\mathbf{S}_{g}=\mathbf{X}_{g}\mathbf{W}$, $\mathbf{T}_{g}=\mathbf{X}_{g}\mathbf{W}_{g}$, $\mathbf{U}=(\mathbf{W}^{\prime}\mathbf{W})^{-1}\mathbf{W}^{\prime}$, and $\mathbf{U}_{g}=(\mathbf{W}_{g}^{\prime}\mathbf{W}_{g})^{-1}\mathbf{W}_{g}^{\prime}$, we have $\mathbf{S}_{g}^{\prime}\mathbf{T}_{g}=\bm{0}$. Let $\bm{\alpha}=(\mathbf{S}^{\prime}\mathbf{S})^{-1}\mathbf{S}^{\prime}\bm{Y}$ and $\bm{\alpha}_{g}=(\mathbf{T}_{g}^{\prime}\mathbf{T}_{g})^{-1}\mathbf{T}_{g}^{\prime}\bm{Y}_{g}$, where $\mathbf{S}=\mathbf{X}\mathbf{W}$. Then we obtain $\bm{J}^{Y}_{g}$ and $\bm{A}^{Y}_{g}$ by letting $\bm{J}^{Y}_{g}=\mathbf{S}_{g}\bm{\alpha}\in\text{col}(\mathbf{J}_{g})$ and $\bm{A}^{Y}_{g}=\mathbf{T}_{g}\bm{\alpha}_{g}\in\text{col}(\mathbf{A}_{g})$.
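
To make the construction above concrete, the following base-R sketch computes $\mathbf{W}$, $\mathbf{W}_{g}$, $\mathbf{S}_{g}$, and $\mathbf{T}_{g}$ on simulated noiseless multi-group data and numerically checks $\mathbf{S}_{g}^{\prime}\mathbf{T}_{g}=\bm{0}$ for the first group. It is only an illustration of the proof under our own simulated dimensions and object names, not the JICO package implementation.

set.seed(1)
p <- 20; K <- 2; Kg <- c(2, 3); n <- c(15, 12)

## Noiseless two-group data whose row spaces share an exact rank-K intersection.
W0  <- qr.Q(qr(matrix(rnorm(p * K), p, K)))
Wg0 <- lapply(Kg, function(k) qr.Q(qr(matrix(rnorm(p * k), p, k))))
Xg  <- mapply(function(ng, B) matrix(rnorm(ng * K), ng, K) %*% t(W0) +
                              matrix(rnorm(ng * ncol(B)), ng, ncol(B)) %*% t(B),
              n, Wg0, SIMPLIFY = FALSE)

row_proj <- function(X, tol = 1e-8) {        ## projector onto row(X)
  s <- svd(X); V <- s$v[, s$d > tol * s$d[1], drop = FALSE]; V %*% t(V)
}
null_basis <- function(A, tol = 1e-8) {      ## orthonormal basis of the null space of A
  s <- svd(A, nu = 0, nv = ncol(A))
  d <- c(s$d, rep(0, ncol(A) - length(s$d)))
  s$v[, d < tol * max(s$d), drop = FALSE]
}

## W spans the intersection of the row spaces: (I - P_g) w = 0 for every g.
W <- null_basis(do.call(rbind, lapply(Xg, function(X) diag(p) - row_proj(X))))

## For each group, W_g spans the solution set of (A.1): W'x = 0 and W'X_g'X_g x = 0.
Wg <- lapply(Xg, function(X) null_basis(rbind(t(W), t(W) %*% crossprod(X))))

## Joint and individual scores, and the orthogonality S_g'T_g = 0 used in the proof.
Sg <- lapply(Xg, function(X) X %*% W)
Tg <- mapply(function(X, B) X %*% B, Xg, Wg, SIMPLIFY = FALSE)
max(abs(t(Sg[[1]]) %*% Tg[[1]]))             ## numerically zero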

Appendix B Derivation of the CR algorithm for solving (3.5) and (3.6)

Consider the following CR problem that covers (3.5) and (3.6) as special cases:

$\max_{\mathbf{w}}\ \ \text{cov}(\mathbf{X}\mathbf{w},\bm{Y})^{2}\,\text{var}(\mathbf{X}\mathbf{w})^{\gamma-1}$ (B.1)
$\text{s.t. }\ \mathbf{w}^{\prime}\mathbf{w}=1,$
$\mathbf{w}^{\prime}\mathbf{X}^{\prime}\mathbf{X}\mathbf{w}_{j}=0;\quad j=1,\ldots,k-1\text{ if }k\geq 2,$
$\hat{\mathbf{W}}^{\prime}\mathbf{w}=\bm{0};$
$\hat{\mathbf{S}}^{\prime}\mathbf{X}\mathbf{w}=\bm{0},$

where $\hat{\mathbf{W}}$, $\hat{\mathbf{S}}$, $\mathbf{X}$, and $\bm{Y}$ are given a priori.

The solution to (B.1) resides in $\text{col}(\hat{\mathbf{W}})^{\perp}$, i.e., the space that is orthogonal to the columns of $\hat{\mathbf{W}}$. Hence, we let $\hat{\mathbf{P}}=\hat{\mathbf{W}}(\hat{\mathbf{W}}^{\prime}\hat{\mathbf{W}})^{-1}\hat{\mathbf{W}}^{\prime}$ and $\hat{\mathbf{X}}=\mathbf{X}(\mathbf{I}-\hat{\mathbf{P}})$, the latter being the projection of $\mathbf{X}$ onto the space that is orthogonal to the columns of $\hat{\mathbf{W}}$. Let $m$ be the rank of $\hat{\mathbf{X}}$, and let $\hat{\mathbf{X}}=\mathbf{U}\mathbf{D}\mathbf{V}^{\prime}$ be the corresponding rank-$m$ SVD, with $\mathbf{U}\in\mathbb{R}^{n\times m}$, $\mathbf{V}\in\mathbb{R}^{p\times m}$, and $\mathbf{D}$ the $m\times m$ diagonal matrix of singular values. Since the representation of the solution to (B.1) might not be unique, to avoid ambiguity we write the solution as a linear combination of the column vectors of $\mathbf{V}$, i.e., $\mathbf{w}=\mathbf{V}\mathbf{z}$ for some $\mathbf{z}\in\mathbb{R}^{m}$. Note that any $\mathbf{w}=\mathbf{V}\mathbf{z}$ satisfies $\hat{\mathbf{W}}^{\prime}\mathbf{w}=\bm{0}$. Hence, the constraint $\hat{\mathbf{W}}^{\prime}\mathbf{w}=\bm{0}$ in (B.1) is automatically satisfied under this representation.
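
As a small illustration of this reparameterization step (a sketch only, with simulated $\mathbf{X}$ and $\hat{\mathbf{W}}$; all object names are ours), one may compute $\hat{\mathbf{X}}$ and its rank-$m$ SVD in R as follows.

set.seed(1)
n <- 30; p <- 10
X     <- matrix(rnorm(n * p), n, p)
W_hat <- matrix(rnorm(p * 2), p, 2)                       ## columns to be orthogonal to

P_hat <- W_hat %*% solve(crossprod(W_hat)) %*% t(W_hat)   ## projector onto col(W_hat)
X_hat <- X %*% (diag(p) - P_hat)                          ## project X off col(W_hat)

s    <- svd(X_hat)
keep <- s$d > 1e-8 * s$d[1]                               ## keep the rank-m part
U <- s$u[, keep, drop = FALSE]
D <- diag(s$d[keep], nrow = sum(keep))
V <- s$v[, keep, drop = FALSE]

## Any w = V z automatically satisfies W_hat' w = 0:
z <- rnorm(sum(keep)); w <- V %*% z
max(abs(t(W_hat) %*% w))                                  ## numerically zero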

At step $k+1$, the original optimization problem can be reformulated as follows:

$\max_{\mathbf{z}}\ (\mathbf{z}^{\prime}\mathbf{d})^{2}(\mathbf{z}^{\prime}\tilde{\mathbf{E}}\mathbf{z})^{\gamma-1}\quad\text{s.t. }\mathbf{z}^{\prime}\mathbf{z}=1,\ \mathbf{z}^{\prime}\tilde{\mathbf{E}}\mathbf{Z}_{k}=\bm{0},\text{ and }\mathbf{z}^{\prime}\mathbf{D}\mathbf{U}^{\prime}\hat{\mathbf{S}}=\bm{0},$ (B.2)

where $\mathbf{d}=\mathbf{V}^{\prime}\hat{\mathbf{X}}^{\prime}\bm{Y}$, $\mathbf{Z}_{k}=[\mathbf{z}_{1},\ldots,\mathbf{z}_{k}]$, and $\tilde{\mathbf{E}}=\mathbf{D}^{2}$. To solve (B.2), we can expand the objective into its Lagrangian form:

$T^{*}(\mathbf{z})=(\mathbf{z}^{\prime}\mathbf{d})^{2}(\mathbf{z}^{\prime}\tilde{\mathbf{E}}\mathbf{z})^{\gamma-1}-\lambda_{0}(\mathbf{z}^{\prime}\mathbf{z}-1)-2\mathbf{z}^{\prime}[\tilde{\mathbf{E}}\mathbf{Z}_{k}\ \ \mathbf{D}\mathbf{U}^{\prime}\hat{\mathbf{S}}]\bm{\Lambda}_{k},$

where $\bm{\Lambda}_{k}=[\lambda_{1},\ldots,\lambda_{k+K}]^{\prime}$ and $\lambda_{0},\ldots,\lambda_{k+K}$ are Lagrange multipliers. To solve (B.2), we take the derivative of $T^{*}$ with respect to $\mathbf{z}$; the optimizer is then a solution to the following:

${\partial T^{*}\over\partial\mathbf{z}}=2(\mathbf{z}^{\prime}\mathbf{d})(\mathbf{z}^{\prime}\tilde{\mathbf{E}}\mathbf{z})^{\gamma-1}\mathbf{d}+2(\gamma-1)(\mathbf{z}^{\prime}\mathbf{d})^{2}(\mathbf{z}^{\prime}\tilde{\mathbf{E}}\mathbf{z})^{\gamma-2}\tilde{\mathbf{E}}\mathbf{z}-2\lambda_{0}\mathbf{z}-2[\tilde{\mathbf{E}}\mathbf{Z}_{k}\ \ \mathbf{D}\mathbf{U}^{\prime}\hat{\mathbf{S}}]\bm{\Lambda}_{k}=\bm{0}.$ (B.3)

Left-multiplying (B.3) by $\mathbf{z}^{\prime}$ and applying the constraints, we conclude that $\lambda_{0}=\gamma(\mathbf{z}^{\prime}\mathbf{d})^{2}(\mathbf{z}^{\prime}\tilde{\mathbf{E}}\mathbf{z})^{\gamma-1}$. Plugging this back into (B.3) and letting $\tau=\mathbf{z}^{\prime}\mathbf{d}$ and $\rho=\mathbf{z}^{\prime}\tilde{\mathbf{E}}\mathbf{z}$, we have

$\gamma\tau^{2}\rho^{\gamma-1}\mathbf{z}+(1-\gamma)\tau^{2}\rho^{\gamma-2}\tilde{\mathbf{E}}\mathbf{z}+[\tilde{\mathbf{E}}\mathbf{Z}_{k}\ \ \mathbf{D}\mathbf{U}^{\prime}\hat{\mathbf{S}}]\bm{\Lambda}_{k}=\tau\rho^{\gamma-1}\mathbf{d}.$

A simple reformulation of the above, together with the constraints $\mathbf{z}^{\prime}[\tilde{\mathbf{E}}\mathbf{Z}_{k}\ \ \mathbf{D}\mathbf{U}^{\prime}\hat{\mathbf{S}}]=\bm{0}$, yields the following matrix form:

$\left[\begin{array}{cc}\mathbf{A}&\mathbf{B}\\ \mathbf{B}^{\prime}&\bm{0}\end{array}\right]\left[\begin{array}{c}\mathbf{z}\\ \bm{\Lambda}_{k}\end{array}\right]=\left[\begin{array}{c}\mathbf{q}\\ \bm{0}\end{array}\right],$ (B.4)

where $\mathbf{A}=\tau^{2}[\gamma\rho^{\gamma-1}\mathbf{I}+(1-\gamma)\rho^{\gamma-2}\tilde{\mathbf{E}}]$, $\mathbf{B}=[\tilde{\mathbf{E}}\mathbf{Z}_{k}\ \ \mathbf{D}\mathbf{U}^{\prime}\hat{\mathbf{S}}]$, and $\mathbf{q}=\tau\rho^{\gamma-1}\mathbf{d}$. By the standard formula for the inverse of a partitioned matrix and the constraint $\mathbf{z}^{\prime}\mathbf{z}=1$, we obtain

$\mathbf{z}={\mathbf{M}\mathbf{q}\over\|\mathbf{M}\mathbf{q}\|},$ (B.5)

where $\mathbf{M}=\mathbf{A}^{-1}-\mathbf{A}^{-1}\mathbf{B}(\mathbf{B}^{\prime}\mathbf{A}^{-1}\mathbf{B})^{-1}\mathbf{B}^{\prime}\mathbf{A}^{-1}$. Note that $\mathbf{M}$ carries a factor of $\tau^{-2}$ and $\mathbf{q}$ carries a factor of $\tau$, so $\mathbf{M}\mathbf{q}$ carries a factor of $\tau^{-1}$ that cancels out during the normalization in (B.5). Therefore, $\mathbf{z}$ does not depend on the value of $\tau$, and without loss of generality we can set $\tau=1$. The only unknown parameter is then $\rho$. Hence, we can formulate (B.4) and (B.5) as a fixed point problem in $\rho$, viewing $\mathbf{z}(\rho)^{\prime}\tilde{\mathbf{E}}\mathbf{z}(\rho)$ as a function of $\rho$. More specifically, we seek $\rho^{*}$ that satisfies $\mathbf{z}(\rho^{*})^{\prime}\tilde{\mathbf{E}}\mathbf{z}(\rho^{*})=\rho^{*}$. Afterwards, we obtain $\mathbf{z}^{*}=\mathbf{M}^{*}\mathbf{q}^{*}/\|\mathbf{M}^{*}\mathbf{q}^{*}\|$, where $\mathbf{M}^{*}$ and $\mathbf{q}^{*}$ are computed from $\rho^{*}$. We summarize the procedure to solve (B.2) in the following two steps:

Step 1: Solve for the fixed point $\rho^{*}$ of $\rho=\mathbf{z}(\rho)^{\prime}\tilde{\mathbf{E}}\mathbf{z}(\rho)$ with $\mathbf{z}(\rho)=\mathbf{M}\mathbf{q}/\|\mathbf{M}\mathbf{q}\|$, where

$\mathbf{M}=\mathbf{A}^{-1}-\mathbf{A}^{-1}\mathbf{B}(\mathbf{B}^{\prime}\mathbf{A}^{-1}\mathbf{B})^{-1}\mathbf{B}^{\prime}\mathbf{A}^{-1},$
$\mathbf{A}=\gamma\rho^{\gamma-1}\mathbf{I}+(1-\gamma)\rho^{\gamma-2}\tilde{\mathbf{E}},$
$\mathbf{B}=[\tilde{\mathbf{E}}\mathbf{Z}_{k}\ \ \mathbf{D}\mathbf{U}^{\prime}\hat{\mathbf{S}}],$
$\mathbf{q}=\rho^{\gamma-1}\mathbf{d};$

Step 2: Compute $\mathbf{M}^{*}$ and $\mathbf{q}^{*}$ from the fixed point $\rho^{*}$ obtained in Step 1. The solution to (B.2) is then given by $\mathbf{z}^{*}=\mathbf{M}^{*}\mathbf{q}^{*}/\|\mathbf{M}^{*}\mathbf{q}^{*}\|$.

The most challenging part is Step 1, where a nonlinear equation needs to be solved. This can be done numerically by several existing algorithms. For example, we can use Newton's method (Kelley, 2003), which is implemented in many optimization packages and converges quickly in practice.
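
As a concrete illustration, the following R sketch implements Steps 1 and 2 for a generic $0<\gamma<1$, using uniroot for the one-dimensional fixed-point equation instead of Newton's method. The inputs $\mathbf{d}$, $\tilde{\mathbf{E}}$, and $\mathbf{B}$ are simulated here, and all names are illustrative rather than taken from the JICO package.

set.seed(1)
m     <- 8
E     <- sort(runif(m, 0.5, 3), decreasing = TRUE)   ## diagonal of E-tilde = D^2
d     <- rnorm(m)                                    ## d = V' X-hat' Y
B     <- matrix(rnorm(m * 2), m, 2)                  ## [E-tilde Z_k  D U' S-hat]
gamma <- 0.5

z_of_rho <- function(rho) {
  a    <- gamma * rho^(gamma - 1) + (1 - gamma) * rho^(gamma - 2) * E  ## diagonal of A
  Ainv <- diag(1 / a)
  M    <- Ainv - Ainv %*% B %*% solve(t(B) %*% Ainv %*% B) %*% t(B) %*% Ainv
  q    <- rho^(gamma - 1) * d
  v    <- M %*% q
  v / sqrt(sum(v^2))
}

## Step 1: find rho* with z(rho*)' E z(rho*) = rho*; a root is bracketed by the
## extreme diagonal entries of E-tilde because z(rho) is a unit vector.
g        <- function(rho) { z <- z_of_rho(rho); sum(E * z^2) - rho }
rho_star <- uniroot(g, interval = c(min(E), max(E)))$root

## Step 2: recover the solution z* of (B.2) from rho*.
z_star <- z_of_rho(rho_star)
c(rho_star, sum(E * z_star^2))   ## the two numbers agree at the fixed point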

Appendix C JICO iterative algorithm

After each iteration, we obtain $K$ joint objective values as a $K\times 1$ vector:

$\mathcal{L}=\text{diag}(\mathbf{W}^{\prime}\mathbf{X}^{\text{Joint}\prime}\mathbf{X}^{\text{Joint}}\mathbf{W})^{\gamma-1}(\mathbf{W}^{\prime}\mathbf{X}^{\text{Joint}\prime}\bm{Y}^{\text{Joint}})^{2},$ (C.1)

and $K_{g}$ individual objective values as a $K_{g}\times 1$ vector for $g=1,\ldots,G$:

$\mathcal{L}_{g}=\text{diag}(\mathbf{W}_{g}^{\prime}\mathbf{X}_{g}^{\text{Indiv}\prime}\mathbf{X}_{g}^{\text{Indiv}}\mathbf{W}_{g})^{\gamma-1}(\mathbf{W}_{g}^{\prime}\mathbf{X}_{g}^{\text{Indiv}\prime}\bm{Y}_{g}^{\text{Indiv}})^{2},$ (C.2)

and compare them with the corresponding vectors obtained from the previous iteration. The iterative procedure stops when the differences between two consecutive iterations fall below a given tolerance level. Empirically, the algorithm always meets this convergence criterion, although there is no theoretical guarantee for it.

Data: $\{\mathbf{X}_{g},\bm{Y}_{g}\}_{g=1}^{G}$;
Parameters: tolerance level $\tau$; joint rank $K$ and individual ranks $K_{g}$;
Initialize $\mathbf{T}_{g}=\bm{0}_{n_{g}\times K_{g}}$; $\mathbf{U}_{g}=\bm{0}_{K_{g}\times p}$; $\bm{\alpha}_{g}=\bm{0}_{K_{g}\times 1}$;
while Euclidean distances $\|\nabla\mathcal{L}\|,\|\nabla\mathcal{L}_{g}\|>\tau$ do
       $\mathbf{X}_{g}^{\text{Joint}}=\mathbf{X}_{g}-\mathbf{T}_{g}\mathbf{U}_{g}$; $\bm{Y}_{g}^{\text{Joint}}=\bm{Y}_{g}-\mathbf{T}_{g}\bm{\alpha}_{g}$;
       Estimate $\mathbf{W}=(\mathbf{w}_{1},\ldots,\mathbf{w}_{K})$ from (3.2);
       $\mathbf{S}=\mathbf{X}^{\text{Joint}}\mathbf{W}$; $\mathbf{U}=(\mathbf{W}^{\prime}\mathbf{W})^{-1}\mathbf{W}^{\prime}$; $\bm{\alpha}=(\mathbf{S}^{\prime}\mathbf{S})^{-1}\mathbf{S}^{\prime}\bm{Y}^{\text{Joint}}$;
       for $g=1,\ldots,G$ do
             $\mathbf{X}_{g}^{\text{Indiv}}=\mathbf{X}_{g}-\mathbf{S}_{g}\mathbf{U}$; $\bm{Y}_{g}^{\text{Indiv}}=\bm{Y}_{g}-\mathbf{S}_{g}\bm{\alpha}$;
             Estimate $\mathbf{W}_{g}=(\mathbf{w}_{g1},\ldots,\mathbf{w}_{gK_{g}})$ from (3.4);
             $\mathbf{T}_{g}=\mathbf{X}_{g}^{\text{Indiv}}\mathbf{W}_{g}$; $\mathbf{U}_{g}=(\mathbf{W}_{g}^{\prime}\mathbf{W}_{g})^{-1}\mathbf{W}_{g}^{\prime}$; $\bm{\alpha}_{g}=(\mathbf{T}_{g}^{\prime}\mathbf{T}_{g})^{-1}\mathbf{T}_{g}^{\prime}\bm{Y}_{g}^{\text{Indiv}}$;
       end for
       $\tilde{\mathbf{W}}=[\mathbf{W}_{1},\ldots,\mathbf{W}_{G}]$;
end while
Algorithm 1: JICO Algorithm
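
To illustrate the flow of Algorithm 1, the following R sketch implements the alternating joint and individual updates. It is a simplified illustration only, not the JICO package implementation: the direction-estimation steps (3.2) and (3.4) are replaced by their PCR special case (leading right singular vectors, i.e., $\gamma=\infty$), the additional orthogonality constraints are omitted, convergence is monitored through a simple surrogate objective, and all function and variable names are our own.

jico_sketch <- function(Xg, Yg, K, Kg, tau = 1e-6, max_iter = 100) {
  G <- length(Xg); p <- ncol(Xg[[1]]); ng <- sapply(Xg, nrow)
  Tg <- lapply(seq_len(G), function(g) matrix(0, ng[g], Kg[g]))
  Ug <- lapply(seq_len(G), function(g) matrix(0, Kg[g], p))
  ag <- lapply(seq_len(G), function(g) matrix(0, Kg[g], 1))
  obj_old <- Inf
  for (iter in seq_len(max_iter)) {
    ## Joint step: remove the current individual fits, then estimate W, S, U, alpha.
    XJ <- do.call(rbind, lapply(seq_len(G), function(g) Xg[[g]] - Tg[[g]] %*% Ug[[g]]))
    YJ <- do.call(rbind, lapply(seq_len(G), function(g) Yg[[g]] - Tg[[g]] %*% ag[[g]]))
    W  <- svd(XJ)$v[, seq_len(K), drop = FALSE]       ## PCR surrogate for (3.2)
    S  <- XJ %*% W
    U  <- solve(crossprod(W)) %*% t(W)
    a  <- solve(crossprod(S)) %*% t(S) %*% YJ
    Sg <- split.data.frame(S, rep(seq_len(G), ng))    ## group-wise joint scores
    ## Individual step for each group.
    for (g in seq_len(G)) {
      XI <- Xg[[g]] - Sg[[g]] %*% U
      YI <- Yg[[g]] - Sg[[g]] %*% a
      Wg <- svd(XI)$v[, seq_len(Kg[g]), drop = FALSE] ## PCR surrogate for (3.4)
      Tg[[g]] <- XI %*% Wg
      Ug[[g]] <- solve(crossprod(Wg)) %*% t(Wg)
      ag[[g]] <- solve(crossprod(Tg[[g]])) %*% t(Tg[[g]]) %*% YI
    }
    ## Surrogate objective (total score variation) for the convergence check.
    obj <- sum(S^2) + sum(sapply(Tg, function(T) sum(T^2)))
    if (abs(obj - obj_old) < tau) break
    obj_old <- obj
  }
  list(W = W, U = U, alpha = a, Tg = Tg, Ug = Ug, alpha_g = ag, iterations = iter)
}

## Hypothetical usage on simulated two-group data:
set.seed(2)
Xg <- list(matrix(rnorm(50 * 20), 50, 20), matrix(rnorm(60 * 20), 60, 20))
Yg <- lapply(Xg, function(X) X[, 1, drop = FALSE] + matrix(rnorm(nrow(X)), ncol = 1))
fit <- jico_sketch(Xg, Yg, K = 1, Kg = c(1, 1))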

Appendix D Selection of Tuning Parameters

To apply the proposed JICO method, we need to select the tuning parameters $K$, $K_{g}$, and $\gamma$. Based on our numerical experience, the performance of JICO depends more on the choice of $K$ and less on $K_{g}$. We also find that even when the true $K_{g}$ differs across groups, setting them to be equal does not substantially affect the MSE in predicting the response. Thus, to save computational time, we can choose $K_{g}$ to be the same for all groups. We propose to perform an exhaustive search over $0\leq K\leq D_{1}$ and $0\leq K_{g}\leq D_{2}$ to find the combination that gives the smallest MSE, where $D_{1}$ and $D_{2}$ are two user-defined constants. In this way, we need to try $(D_{1}+1)\times(D_{2}+1)$ combinations in total. Since we assume $\mathbf{J}_{g}$ and $\mathbf{A}_{g}$ are both low-rank, we can set $D_{1}$ and $D_{2}$ to be relatively small. As for the tuning of $\gamma$, we propose to perform a grid search and choose the value that minimizes the MSE in predicting the responses. A sketch of this search is given below.
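
The following R sketch outlines the exhaustive search just described. The functions jico_fit and jico_predict are hypothetical placeholders for a JICO fitting and prediction routine (the actual JICO package interface may differ), and the training and validation splits are assumed to be available.

select_jico_tuning <- function(Xg_tr, Yg_tr, Xg_val, Yg_val,
                               D1 = 3, D2 = 3, gammas = c(0, 0.5, 1, Inf)) {
  grid <- expand.grid(K = 0:D1, Kg = 0:D2, gamma = gammas)
  grid$mse <- apply(grid, 1, function(par) {
    ## jico_fit and jico_predict are placeholders, not an actual API.
    fit  <- jico_fit(Xg_tr, Yg_tr, K = par["K"],
                     Kg = rep(par["Kg"], length(Xg_tr)), gamma = par["gamma"])
    pred <- jico_predict(fit, Xg_val)
    mean(unlist(mapply(function(yhat, y) (yhat - y)^2, pred, Yg_val)))
  })
  grid[which.min(grid$mse), ]    ## combination with the smallest validation MSE
}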

Appendix E The impact of using different initial values $\mathbf{I}_{g}$

The following simulation study illustrates the impact of using different matrices as the initial value for our algorithm.

We use the same data generating scheme as described in Section 4. We fix $G=2$, $p=200$, and $n_{1}=n_{2}=50$. In each replication, we generate 100 training samples to train the models and evaluate the corresponding Mean Squared Error (MSE) on an independent test set of 100 samples. We also record the objective values as described below:

  • When $\gamma=0$, we compare the converged objective value as in (3.6) for $g=1,2$.

  • When $\gamma=1$, we compare $\text{cov}(\mathbf{X}^{\text{Joint}}\mathbf{w},\bm{Y}^{\text{Joint}})^{2}$ and $\text{cov}(\mathbf{X}_{g}^{\text{Indiv}}\mathbf{w},\bm{Y}_{g}^{\text{Indiv}})^{2}$ for $g=1,2$.

  • When $\gamma=\infty$, we compare $\text{var}(\mathbf{X}^{\text{Joint}}\mathbf{w})$ and $\text{var}(\mathbf{X}_{g}^{\text{Indiv}}\mathbf{w})$ for $g=1,2$.

We compare the performance of our algorithm by presenting boxplots of the MSEs and the converged objective values under different initial values. We repeat the simulation 50 times, with three kinds of initial values for the individual matrices $\mathbf{I}_{g}$ as follows:

  • Zero matrix: In this setting, our algorithm starts with zero matrices as the initial value, where all entries in $\mathbf{I}_{g}$ are equal to 0.

  • Large matrix: In this setting, our algorithm starts with matrices with large entries as the initial value, where all entries in $\mathbf{I}_{g}$ are equal to 3.

  • Joint matrix: In this setting, our algorithm starts with the estimated joint matrices $\hat{\mathbf{J}}_{g}$ as the initial value, where $\hat{\mathbf{J}}_{g}$ are derived from the first setting, that is, using zero matrices as the initial value.

Figure E.2 shows the converged value of the joint objective function under the three different initial values of the individual matrices $\mathbf{I}_{g}$ mentioned above. Similarly, Figure E.3 shows the converged value of the first individual objective function under the three different initial values, and Figure E.4 shows the converged value of the second individual objective function. The results show that our algorithm may not converge to the same objective function value under different initial values when $\gamma=1$, but Figure E.1 shows that the results tend to have similar performance in terms of prediction error, regardless of how the initial individual matrices are chosen. To ensure better performance, we recommend running our algorithm on the same dataset multiple times with different initial values and then choosing the result with the best performance; a sketch of such a comparison is given after the figures.

Figure E.1: The impact of using different initial individual matrices on MSE under different settings.
Figure E.2: The impact of using different initial individual matrices on the converged value of the joint objective function under different settings.
Figure E.3: The impact of using different initial individual matrices on the converged value of the first individual objective function under different settings.
Figure E.4: The impact of using different initial individual matrices on the converged value of the second individual objective function under different settings.
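
As a concrete template for this recommendation, the following R sketch runs the fitting procedure under the three initializations above and keeps the result with the smallest test MSE. The functions jico_fit_with_init and jico_predict are hypothetical placeholders (the actual JICO package interface may differ), and the training and test objects are assumed to be available from the simulation.

## Hypothetical placeholders; not the actual JICO package interface.
init_list <- list(
  zero  = lapply(Xg_train, function(X) matrix(0, nrow(X), ncol(X))),
  large = lapply(Xg_train, function(X) matrix(3, nrow(X), ncol(X)))
)
## "Joint matrix" initialization: reuse the estimated joint matrices from the zero start.
fit0 <- jico_fit_with_init(Xg_train, Yg_train, K, Kg, init_Ig = init_list$zero)
init_list$joint <- fit0$Jg_hat

mse <- sapply(init_list, function(Ig0) {
  fit  <- jico_fit_with_init(Xg_train, Yg_train, K, Kg, init_Ig = Ig0)
  pred <- jico_predict(fit, Xg_test)
  mean(unlist(mapply(function(yhat, y) (yhat - y)^2, pred, Yg_test)))
})
best_fit <- jico_fit_with_init(Xg_train, Yg_train, K, Kg,
                               init_Ig = init_list[[which.min(mse)]])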

References

  • Björkström, A. and Sundberg, R. (1996). Continuum regression is not always continuous. Journal of the Royal Statistical Society: Series B (Methodological), 58(4):703–710.
  • Feng, Q., Jiang, M., Hannig, J., and Marron, J. (2018). Angle-based joint and individual variation explained. Journal of Multivariate Analysis, 166:241–265.
  • Gaynanova, I. and Li, G. (2019). Structural learning and integrative decomposition of multi-view data. Biometrics, 75(4):1121–1132.
  • Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.
  • Kaplan, A. and Lock, E. F. (2017). Prediction with dimension reduction of multiple molecular data sources for patient survival. Cancer Informatics, 16:1176935117718517.
  • Kelley, C. T. (2003). Solving Nonlinear Equations with Newton's Method, volume 1. SIAM.
  • Killiany, R. J., Moss, M. B., Albert, M. S., Sandor, T., Tieman, J., and Jolesz, F. (1993). Temporal lobe regions on magnetic resonance imaging identify patients with early Alzheimer's disease. Archives of Neurology, 50(9):949–954.
  • Lee, M. H. and Liu, Y. (2013). Kernel continuum regression. Computational Statistics and Data Analysis, 68:190–201.
  • Li, Q. and Li, L. (2021). Integrative factor regression and its inference for multimodal data analysis. Journal of the American Statistical Association, pages 1–15.
  • Lock, E. F., Hoadley, K. A., Marron, J. S., and Nobel, A. B. (2013). Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. The Annals of Applied Statistics, 7(1):523–542.
  • Ma, S. and Huang, J. (2017). A concave pairwise fusion approach to subgroup analysis. Journal of the American Statistical Association, 112(517):410–423.
  • Meinshausen, N. and Bühlmann, P. (2015). Maximin effects in inhomogeneous large-scale data. The Annals of Statistics, 43(4):1801–1830.
  • Michon, A., Deweer, B., Pillon, B., Agid, Y., and Dubois, B. (1994). Relation of anosognosia to frontal lobe dysfunction in Alzheimer's disease. Journal of Neurology, Neurosurgery and Psychiatry, 57(7):805–809.
  • Muniategui, A., Pey, J., Planes, F. J., and Rubio, A. (2013). Joint analysis of miRNA and mRNA expression data. Briefings in Bioinformatics, 14(3):263–278.
  • Palzer, E. F., Wendt, C. H., Bowler, R. P., Hersh, C. P., Safo, S. E., and Lock, E. F. (2022). sJIVE: Supervised joint and individual variation explained. Computational Statistics & Data Analysis, 175:107547.
  • Stone, M. and Brooks, R. J. (1990). Continuum regression: cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares and principal components regression. Journal of the Royal Statistical Society: Series B (Methodological), 52(2):237–258.
  • Tang, L. and Song, P. X. (2016). Fused lasso approach in regression coefficients clustering: learning parameter heterogeneity in data integration. The Journal of Machine Learning Research, 17(1):3915–3937.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.
  • Wang, P., Li, Q., Shen, D., and Liu, Y. (2023). High-dimensional factor regression for heterogeneous subpopulations. Statistica Sinica, 33:1–27.
  • Wang, P., Liu, Y., and Shen, D. (2018). Flexible locally weighted penalized regression with applications on prediction of Alzheimer's disease neuroimaging initiative's clinical scores. IEEE Transactions on Medical Imaging, 38(6):1398–1408.
  • Zhao, T., Cheng, G., and Liu, H. (2016). A partially linear framework for massive heterogeneous data. Annals of Statistics, 44(4):1400–1437.