Robust Unsupervised Multi-task and Transfer Learning on Gaussian Mixture Models
Abstract
Unsupervised learning has been widely used in many real-world applications. One of the simplest and most important unsupervised learning models is the Gaussian mixture model (GMM). In this work, we study the multi-task learning problem on GMMs, which aims to leverage potentially similar GMM parameter structures among tasks to obtain improved learning performance compared to single-task learning. We propose a multi-task GMM learning procedure based on the EM algorithm that effectively utilizes unknown similarities between related tasks and is robust against a fraction of outlier tasks from arbitrary distributions. The proposed procedure is shown to achieve the minimax optimal rate of convergence for both parameter estimation error and the excess mis-clustering error, in a wide range of regimes. Moreover, we generalize our approach to tackle the problem of transfer learning for GMMs, where similar theoretical results are derived. Additionally, iterative unsupervised multi-task and transfer learning methods may suffer from an initialization alignment problem, and two alignment algorithms are proposed to resolve the issue. Finally, we demonstrate the effectiveness of our methods through simulations and real data examples. To the best of our knowledge, this is the first work studying multi-task and transfer learning on GMMs with theoretical guarantees.
Keywords: Multi-task learning, transfer learning, unsupervised learning, Gaussian mixture models, robustness, minimax rate, EM algorithm
1 Introduction
1.1 Gaussian mixture models (GMMs)
Unsupervised learning that learns patterns from unlabeled data is a prevalent problem in statistics and machine learning. Clustering is one of the most important problems in unsupervised learning, where the goal is to group the observations based on some metrics of similarity. Researchers have developed numerous clustering methods including -means (Forgy, , 1965), -medians (Jain and Dubes, , 1988), spectral clustering (Ng et al., , 2001), and hierarchical clustering (Murtagh and Contreras, , 2012), among others. On the other hand, clustering problems have been analyzed from the perspective of the mixture of several probability distributions (Scott and Symons, , 1971). The mixture of Gaussian distributions is one of the simplest models in this category and has been widely applied in many real applications (Yang and Ahuja, , 1998; Lee et al., , 2012).
In the binary Gaussian mixture model (GMM) with a common covariance matrix, each observation $\mathbf{y} \in \mathbb{R}^p$ comes from the following mixture of two Gaussian distributions:

$$\mathbb{P}(z = 1) = 1 - \mathbb{P}(z = 2) = w, \qquad (1)$$

$$\mathbf{y} \mid z = r \sim N(\boldsymbol{\mu}_r, \boldsymbol{\Sigma}), \quad r = 1, 2, \qquad (2)$$

where $w \in (0, 1)$, $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2 \in \mathbb{R}^p$, and $\boldsymbol{\Sigma} \in \mathbb{R}^{p \times p}$ are the mixture proportion, the mean vectors, and the common covariance matrix, respectively. This is the same setting as the linear discriminant analysis (LDA) problem in classification (Hastie et al., 2009), except that the label $z$ is unknown in the clustering problem, while it is observed in the classification case. It has been shown that the Bayes classifier for the LDA problem is
$$f_{\mathrm{Bayes}}(\mathbf{y}) = \begin{cases} 1, & \text{if } (\mathbf{y} - \bar{\boldsymbol{\mu}})^{\top} \boldsymbol{\beta} + \log\frac{w}{1 - w} \ge 0, \\ 2, & \text{otherwise}, \end{cases} \qquad (3)$$

where $\bar{\boldsymbol{\mu}} = (\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)/2$ and $\boldsymbol{\beta} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$. Note that $\boldsymbol{\beta}$ is usually referred to as the discriminant coefficient (Anderson, 1958; Efron, 1975). Naturally, this classifier is useful in clustering too. In clustering, after learning $w$, $\bar{\boldsymbol{\mu}}$, and $\boldsymbol{\beta}$, we can plug their estimators into (3) to group any new observation. Generally, we define the mis-clustering error rate of any given clustering method $\hat{f}$ as
$$R(\hat{f}) = \min_{\pi} \mathbb{P}\big(\pi(\hat{f}(\mathbf{y}^{\mathrm{new}})) \neq z^{\mathrm{new}}\big), \qquad (4)$$

where $\pi$ is a permutation function on $\{1, 2\}$, $z^{\mathrm{new}}$ is the label of a future observation $\mathbf{y}^{\mathrm{new}}$, and the probability is taken w.r.t. the joint distribution of $(\mathbf{y}^{\mathrm{new}}, z^{\mathrm{new}})$ based on the parameters $w$, $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$, and $\boldsymbol{\Sigma}$. Here the error is calculated up to a permutation due to the lack of label information. It is clear that in the ideal case where the parameters are known, $f_{\mathrm{Bayes}}$ in (3) achieves the optimal mis-clustering error. Multi-cluster Gaussian mixture models with more than two components can be described in a similar way. We provide the details in Section S.1 of the supplementary materials.
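To make the plug-in rule concrete, the following R sketch clusters observations with estimated parameters via (3) and computes the empirical version of (4), minimized over the two label permutations. The function and variable names (plug_in_cluster, mis_clustering_error, and the hat-estimates passed in) are ours for illustration and are not taken from this paper.

```r
# Minimal R sketch (not the authors' code): the plug-in discriminant rule (3)
# and an empirical version of the mis-clustering error (4), up to label permutation.
plug_in_cluster <- function(Y, w_hat, mu1_hat, mu2_hat, Sigma_hat) {
  # Y: n x p matrix of observations
  beta_hat <- solve(Sigma_hat, mu1_hat - mu2_hat)   # estimated discriminant coefficient
  mu_bar   <- (mu1_hat + mu2_hat) / 2
  score    <- drop(sweep(Y, 2, mu_bar) %*% beta_hat) + log(w_hat / (1 - w_hat))
  ifelse(score >= 0, 1L, 2L)                        # assign cluster label 1 or 2
}

mis_clustering_error <- function(z_hat, z_true) {
  # empirical error rate, minimized over the two label permutations
  min(mean(z_hat != z_true), mean((3L - z_hat) != z_true))
}
```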
There is a large volume of published studies on learning a GMM. The vast majority of approaches can be roughly divided into three categories. The first category is the method of moments, where the parameters are estimated through several moment equations (Pearson, , 1894; Kalai et al., , 2010; Hsu and Kakade, , 2013; Ge et al., , 2015). The second category is the spectral method, where the estimation is based on the spectral decomposition (Vempala and Wang, , 2004; Hsu and Kakade, , 2013; Jin et al., , 2017). The last category is the likelihood-based method including the popular expectation-maximization (EM) algorithm as a canonical example. The general form of the EM algorithm was formalized by Dempster et al., (1977) in the context of incomplete data, though earlier works (Hartley, , 1958; Hasselblad, , 1966; Baum et al., , 1970; Sundberg, , 1974) have studied EM-style algorithms in various concrete settings. Classical convergence results on the EM algorithm (Wu, , 1983; Redner and Walker, , 1984; Meng and Rubin, , 1994; McLachlan and Krishnan, , 2007) guarantee local convergence of the algorithm to fixed points of the sample likelihood. Recent advances in the analysis of EM algorithm and its variants provide stronger guarantees by establishing geometric convergence rates of the algorithm to the underlying true parameters under mild initialization conditions. See, for example, Dasgupta and Schulman, (2013); Wang et al., (2014); Xu et al., (2016); Balakrishnan et al., (2017); Yan et al., (2017); Cai et al., (2019); Kwon and Caramanis, (2020); Zhao et al., (2020) for GMM related works. In this paper, we propose modified versions of the EM algorithm with similarly strong guarantees to learn GMMs, under the new context of multi-task and transfer learning.
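As a reference point for the modified EM procedures developed later, here is a compact R sketch of the standard EM iteration for a single binary GMM with a common covariance matrix. It requires the mvtnorm package for the Gaussian density; the code and its names are illustrative, not the implementation used in this paper.

```r
# Illustrative R sketch (not the paper's implementation) of the standard EM
# algorithm for a single binary GMM with a common covariance matrix.
em_binary_gmm <- function(Y, w, mu1, mu2, Sigma, max_iter = 100, tol = 1e-6) {
  for (iter in seq_len(max_iter)) {
    # E-step: posterior probability of cluster 1 for each observation
    d1 <- mvtnorm::dmvnorm(Y, mean = mu1, sigma = Sigma)
    d2 <- mvtnorm::dmvnorm(Y, mean = mu2, sigma = Sigma)
    gamma <- w * d1 / (w * d1 + (1 - w) * d2)

    # M-step: update mixture proportion, mean vectors, and pooled covariance
    w_new   <- mean(gamma)
    mu1_new <- colSums(gamma * Y) / sum(gamma)
    mu2_new <- colSums((1 - gamma) * Y) / sum(1 - gamma)
    R1 <- sweep(Y, 2, mu1_new)
    R2 <- sweep(Y, 2, mu2_new)
    Sigma_new <- (t(R1) %*% (gamma * R1) + t(R2) %*% ((1 - gamma) * R2)) / nrow(Y)

    change <- sum(abs(mu1_new - mu1)) + sum(abs(mu2_new - mu2))
    w <- w_new; mu1 <- mu1_new; mu2 <- mu2_new; Sigma <- Sigma_new
    if (change < tol) break
  }
  list(w = w, mu1 = mu1, mu2 = mu2, Sigma = Sigma,
       beta = solve(Sigma, mu1 - mu2))   # estimated discriminant coefficient
}
```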
1.2 Multi-task learning and transfer learning
Multi-tasking is an ability that helps people pay attention to more than one task simultaneously. Moreover, we often find that the knowledge learned from one task can also be useful in other tasks. Multi-task learning (MTL) is a learning paradigm inspired by the human learning ability, which aims to learn multiple tasks and improve performance by utilizing the similarity between these tasks (Zhang and Yang, , 2021). There has been numerous research on MTL, which can be classified into five categories (Zhang and Yang, , 2021): feature learning approach (Argyriou et al., , 2008; Obozinski et al., , 2006), low-rank approach (Ando et al., , 2005), task clustering approach (Thrun and O’Sullivan, , 1996), task relation learning approach (Evgeniou and Pontil, , 2004) and decomposition approach (Jalali et al., , 2010). The majority of existing works focus on the use of MTL in supervised learning problems, while the application of MTL in unsupervised learning, such as clustering, has received less attention. Zhang and Zhang, (2011) developed an MTL clustering method based on a penalization framework, where the objective function consists of a local loss function and a pairwise task regularization term, both of which are related to the Bregman divergence. In Gu et al., (2011), a reproducing kernel Hilbert space (RKHS) was first established, and then a multi-task kernel k-means clustering was applied based on that RKHS. Yang et al., (2014) proposed a spectral MTL clustering method with a novel -norm, which can also produce a linear regression function to predict labels for out-of-sample data. Zhang et al., (2018) suggested a new method based on the similarity matrix of samples in each task, which can learn the within-task clustering structure as well as the task relatedness simultaneously. Marfoq et al., (2021) established a new federated multi-task EM algorithm to learn the mixture of distributions and provided some theory on the convergence guarantee, but the statistical properties of the estimators were not fully understood. Zhang and Chen, (2022) proposed a distributed learning algorithm for GMMs based on transportation divergence when all GMMs are identical. In general, there are very few theoretical results about unsupervised MTL.
Transfer learning (TL) is another learning paradigm similar to multi-task learning but with different objectives. While MTL aims to learn all the tasks well with no priority for any specific task, the goal of TL is to improve the performance on the target task using the information from the source tasks (Zhang and Yang, 2021). According to Pan and Yang (2009), most TL approaches can be classified into four categories: instance-based transfer (Dai et al., 2007), feature representation transfer (Dai et al., 2008), parameter transfer (Lawrence and Platt, 2004), and relational-knowledge transfer (Mihalkova et al., 2007). Similar to MTL, most of the TL methods focus on supervised learning. Some TL approaches are also developed for the semi-supervised learning setting (Chattopadhyay et al., 2012; Li et al., 2013), where only part of the target or source data is labeled. There are much fewer discussions on unsupervised TL approaches (there are different definitions of unsupervised TL; sometimes the semi-supervised setting is also called unsupervised TL, and we follow the definition in Pan and Yang (2009) here), which focus on the cases where both target and source data are unlabeled. Dai et al. (2008) developed a co-clustering approach to transfer information from a single source to the target, which relies on the loss in mutual information and requires the features to be discrete. Wang et al. (2008) proposed a TL discriminant analysis method, where the target data is allowed to be unlabeled, but some labeled source data is necessary. In Wang et al. (2021), a TL approach was developed to learn Gaussian mixture models with only one source by weighting the target and source likelihood functions. Zuo et al. (2018) proposed a TL method based on infinite Gaussian mixture models and active learning, but their approach needs sufficient labeled source data and a few labeled target samples.
There are some recent studies on TL and MTL under various statistical settings, including high-dimensional linear regression (Xu and Bastani, , 2021; Li et al., 2022b, ; Zhang et al., , 2022; Li et al., 2022a, ), high-dimensional generalized linear models (Bastani, , 2021; Li et al., , 2023; Tian and Feng, , 2023), functional linear regression (Lin and Reimherr, , 2022), high-dimensional graphical models (Li et al., 2022b, ), reinforcement learning (Chen et al., , 2022), among others. The recent work Duan and Wang, (2023) developed an adaptive and robust MTL framework with sharp statistical guarantees for a broad class of models. We discuss its connection to our work in Section 2.
1.3 Our contributions and paper structure
Our main contributions in this work can be summarized in the following:
-
(i)
We develop efficient polynomial-time iterative procedures to learn GMMs in both MTL and TL settings. These procedures can be viewed as adaptations of the standard EM algorithm for MTL and TL problems.
-
(ii)
The developed procedures come with provable statistical guarantees. Specifically, we derive the upper bounds of their estimation and excess mis-clustering error rates under mild conditions. For MTL, it is shown that when the tasks are close to each other, our method can achieve better upper bounds than those from the single-task learning; when the tasks are substantially different from each other, our method can still obtain competitive convergence rates compared to single-task learning. Similarly for TL, our method can achieve better upper bounds than those from fitting GMM only to target data when the target and sources are similar, and remains competitive otherwise. In addition, the derived upper bounds reveal the robustness of our methods against a fraction of outlier tasks (for MTL) or outlier sources (for TL) from arbitrary distributions. These guarantees certify our procedures as adaptive (to the unknown task relatedness) and robust (to contaminated data) learning approaches.
-
(iii)
We derive the minimax lower bounds for parameter estimation and excess mis-clustering errors. In various regimes, the upper bounds from our methods match the lower bounds (up to small order terms), showing that the proposed methods are (nearly) minimax rate optimal.
-
(iv)
Our MTL and TL approaches require the initial estimates for different tasks to be “well-aligned”, due to the non-identifiability of GMM. We propose two pre-processing alignment algorithms to provably resolve the alignment problem. Similar problems arise in many unsupervised MTL settings. However, to our knowledge, there is no formal discussion of the alignment issue in the existing literature on unsupervised MTL (Gu et al., , 2011; Zhang and Zhang, , 2011; Yang et al., , 2014; Zhang et al., , 2018; Dieuleveut et al., , 2021; Marfoq et al., , 2021). Therefore, our rigorous treatment of the alignment problem is an important step forward in this field.
The rest of the paper is organized as follows. In Section 2, we first discuss the multi-task learning problem for binary GMMs, by introducing the problem setting, our method, and the associated theory. The above-mentioned alignment problem is discussed in Section 2.4. We present a simulation study in Section 3 to validate our theory. Finally, in Section 4, we point out some interesting future research directions. Due to the space limit, the extension to multi-cluster GMMs, additional numerical results, a full treatment of the transfer learning problem, and all the proofs are delegated to the supplementary materials.
We summarize the notations used throughout the paper here for convenience. We use bold capital letters (e.g., ) to denote matrices and use bold small letters (e.g., , ) to denote vectors. For a matrix , its 2-norm or spectral norm is defined as . If , becomes a -dimensional vector and equals its Euclidean norm. For symmetric , we define and as the maximum and minimum eigenvalues of , respectively. For two non-zero real sequences and , we use , or to represent as . And , or means . For two random variable sequences and , the notation means that for any , there exists a positive constant such that . For two real numbers and , and represent and , respectively. For any positive integer , both and stand for the set . For any set , denotes its cardinality, and denotes its complement. Without further notice, , , , , represent some positive constants and can change from line to line.
2 Multi-task Learning
2.1 Problem setting
Suppose there are $K$ tasks, and from the $k$-th task we observe samples $\{\mathbf{y}_i^{(k)}\}_{i=1}^{n_k}$. Suppose there exists an unknown subset $S \subseteq [K]$, such that observations from each task in $S$ independently follow a GMM, while samples from tasks outside $S$ can be arbitrarily distributed. This means,

$$\mathbb{P}(z_i^{(k)} = 1) = 1 - \mathbb{P}(z_i^{(k)} = 2) = w^{(k)}, \qquad \mathbf{y}_i^{(k)} \mid z_i^{(k)} = r \sim N(\boldsymbol{\mu}_r^{(k)}, \boldsymbol{\Sigma}^{(k)}), \quad r = 1, 2, \qquad (5)$$

for all $i = 1, \dots, n_k$ and $k \in S$, and

$$\{\mathbf{y}_i^{(k)}\}_{i=1}^{n_k} \sim \mathbb{Q}^{(k)}, \quad k \in S^c, \qquad (6)$$

where $\mathbb{Q}^{(k)}$ is an arbitrary probability measure on the sample space of the $k$-th task and $S^c = [K] \setminus S$. In unsupervised learning, we have no access to the true labels $\{z_i^{(k)}\}$. To formalize the multi-task learning problem, we first introduce the parameter space for a single GMM:
(7) | ||||
(8) |
where , and are some fixed positive constants. For simplicity, throughout the main text, we assume these constants are fixed. Hence, we have suppressed the dependency on them in the notation . The parameter space is a standard formulation. Similar parameter spaces have been considered, for example, in Cai et al., (2019).
Our goal of multi-task learning is to leverage the potential similarity shared by the different tasks in $S$ to collectively learn them all. The tasks outside $S$ can be arbitrarily distributed, and they can potentially be outlier tasks. This motivates us to define a joint parameter space for the GMMs of tasks in $S$:
(9) |
where $\boldsymbol{\beta}^{(k)} = (\boldsymbol{\Sigma}^{(k)})^{-1}(\boldsymbol{\mu}_1^{(k)} - \boldsymbol{\mu}_2^{(k)})$ is called the discriminant coefficient of the $k$-th task (recall Section 1.1). For convenience, we define $\bar{\boldsymbol{\mu}}^{(k)} = (\boldsymbol{\mu}_1^{(k)} + \boldsymbol{\mu}_2^{(k)})/2$, which together with $\boldsymbol{\beta}^{(k)}$ is part of the decision boundary. Note that this parameter space is defined only for GMMs of tasks in $S$. To model potentially corrupted or contaminated data, we do not impose any distributional constraints for tasks in $S^c$. Such a modeling framework is reminiscent of Huber's $\epsilon$-contamination model (Huber, 1964). Similar formulations have been adopted in recent multi-task learning research such as Konstantinov et al. (2020) and Duan and Wang (2023).
For GMMs of tasks in $S$, we assume that they share similar discriminant coefficients. The similarity is formalized by assuming that all the discriminant coefficients of tasks in $S$ lie within a (small) Euclidean distance from a common "center". Given that the discriminant coefficient has a major impact on the clustering performance (see the discriminant rule in (3)), the parameter space is tailored to characterize the task relatedness from the clustering perspective. A similar viewpoint that focuses on modeling the discriminant coefficient has appeared in the study of high-dimensional GMM clustering (Cai et al., 2019) and sparse linear discriminant analysis (Cai and Liu, 2011; Mai et al., 2012). With both $S$ and the similarity level being unknown in practice, we aim to develop a multi-task learning procedure that is robust to the outlier tasks in $S^c$, and achieves improved performance for tasks in $S$ (compared to single-task learning), in terms of discriminant coefficient estimation and clustering, whenever the tasks in $S$ are sufficiently similar.
The parameter space does not require the mean vectors or the covariance matrices to be similar, although they are not free parameters due to the constraint on . And the mixture proportions do not need to be similar either. We thus avoid imposing restrictive conditions on those parameters. On the other hand, it implies that estimation of the mixture proportions, mean vectors, and covariance matrices in multi-task learning may not be generally improvable over that in the single-task learning. This is verified by the theoretical results in Section 2.3. While the current treatment in the paper does not consider similarity structure among or , our methods and theory can be readily adapted to handle such scenarios, if desired.
There are two main reasons why this MTL problem can be challenging. First, commonly used strategies like data pooling are fragile with respect to outlier tasks and can lead to arbitrarily inaccurate outcomes in the presence of even a small number of outliers. Also, since the distribution of data from outlier tasks can be adversarial to the learner, the idea of outlier task detection in the recent literature (Li et al., , 2021; Tian and Feng, , 2023) may not be applicable. Second, to address the nonconvexity of the likelihood, we propose to explore the similarity among tasks via a generalization of the EM algorithm. However, a clear theoretical understanding of such an iterative procedure requires a delicate analysis of the whole iterative process. In particular, as in the analysis of EM algorithms (Cai et al., , 2019; Kwon and Caramanis, , 2020), the estimates of similar discriminant vectors and other potentially dissimilar parameters are entangled in the iterations. It is highly non-trivial to separate the impact of estimating and other parameters to derive the desired statistical error rates. We manage to address this challenge through a localization technique by carefully shrinking the analysis radius of estimators as the iteration proceeds.
2.2 Method
We aim to tackle the problem of GMM estimation under the context of multi-task learning. The EM algorithm is commonly used to address the non-convexity of the log-likelihood function arising from the latent labels. In the standard EM algorithm, we “classify” the observations (update the posterior) in E-step and update the parameter estimations in M-step (Redner and Walker, , 1984). For multi-task and transfer learning problems, the penalization framework is very popular, where we solve an optimization problem based on a new objective function. This objective function consists of a local loss function and a penalty term, forcing the estimators of similar tasks to be close to each other. For examples, see Zhang and Zhang, (2011); Zhang et al., (2015); Bastani, (2021); Xu and Bastani, (2021); Li et al., (2021); Duan and Wang, (2023); Lin and Reimherr, (2022); Li et al., (2023); Tian and Feng, (2023). Thus motivated, our method seeks a combination of the EM algorithm and the penalization framework.
In particular, we adapt the penalization framework of Duan and Wang (2023) and modify the updating formulas in the M-step accordingly. The proposed procedure is summarized in Algorithm 1. For simplicity, in Algorithm 1 we have used the notation

$$\gamma_{w, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}}(\mathbf{y}) = \frac{w\, \phi(\mathbf{y}; \boldsymbol{\mu}_1, \boldsymbol{\Sigma})}{w\, \phi(\mathbf{y}; \boldsymbol{\mu}_1, \boldsymbol{\Sigma}) + (1 - w)\, \phi(\mathbf{y}; \boldsymbol{\mu}_2, \boldsymbol{\Sigma})}, \qquad (11)$$

where $\phi(\cdot\,; \boldsymbol{\mu}, \boldsymbol{\Sigma})$ denotes the density of $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Note that $\gamma_{w, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}}(\mathbf{y})$ is the posterior probability $\mathbb{P}(z = 1 \mid \mathbf{y})$ given the observation $\mathbf{y}$. The estimated posterior probability is calculated in every E-step given the updated parameter estimates.
Recall that the parameter space introduced in (9) does not encode similarity for the mixture proportions $w^{(k)}$, mean vectors $\boldsymbol{\mu}_1^{(k)}, \boldsymbol{\mu}_2^{(k)}$, or covariance matrices $\boldsymbol{\Sigma}^{(k)}$. Hence, their updates in Steps 5-7 are kept the same as in the standard EM algorithm. Regarding the update for discriminant coefficients in Step 9, the quadratic loss function is motivated by the direct estimation of the discriminant coefficient in the high-dimensional GMM (Cai et al., 2019) and high-dimensional LDA literature (Cai and Liu, 2011; Witten and Tibshirani, 2011; Fan et al., 2012; Mai et al., 2012, 2019). The penalty term in Step 9 penalizes the contrasts of the $\boldsymbol{\beta}^{(k)}$'s to exploit the similarity structure among tasks. Having the "center" parameter in the penalization induces robustness against outlier tasks. We refer to Duan and Wang (2023) for a systematic treatment of this penalization framework. It is straightforward to verify that when the tuning parameters are set to zero, Algorithm 1 reduces to the standard EM algorithm performed separately on the $K$ tasks. That is, for each task $k$, given the parameter estimates from the previous step, we update $w^{(k)}$, $\boldsymbol{\mu}_1^{(k)}$, $\boldsymbol{\mu}_2^{(k)}$, and $\boldsymbol{\Sigma}^{(k)}$ as in Algorithm 1, and update $\boldsymbol{\beta}^{(k)}$ via

$$\hat{\boldsymbol{\beta}}^{(k)} = (\hat{\boldsymbol{\Sigma}}^{(k)})^{-1} (\hat{\boldsymbol{\mu}}_1^{(k)} - \hat{\boldsymbol{\mu}}_2^{(k)}). \qquad (12)$$
For the maximum number of iteration rounds, , our theory will show that is sufficient to reach the desired statistical error rates. In practice, we can terminate the iteration when the change of estimates within two successive rounds falls below some pre-set small tolerance level. We discuss the initialization in detail in Sections 2.3 and 2.4.
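To illustrate the structure of the penalized discriminant-coefficient update discussed above, here is a simplified R sketch. It is not the exact Step 9 of Algorithm 1: we replace the robust contrast penalty of Duan and Wang (2023) with a squared-norm pull toward a plain average center, and the function name, the fixed number of inner iterations, and the parameter lambda are our own illustrative choices.

```r
# Simplified R sketch of a penalized M-step for the discriminant coefficients.
# Assumption: a squared-norm pull toward a simple average "center"; Algorithm 1
# instead uses the robust penalization framework of Duan and Wang (2023).
update_betas_penalized <- function(Sigma_hat, mu1_hat, mu2_hat, lambda, n_inner = 20) {
  # Sigma_hat, mu1_hat, mu2_hat: lists of length K with the current M-step estimates
  K <- length(Sigma_hat)
  p <- nrow(Sigma_hat[[1]])
  beta <- lapply(1:K, function(k) solve(Sigma_hat[[k]], mu1_hat[[k]] - mu2_hat[[k]]))
  for (it in seq_len(n_inner)) {           # alternate center and task-wise updates
    beta_bar <- Reduce(`+`, beta) / K      # center as a plain average (not robust)
    beta <- lapply(1:K, function(k) {
      # minimizer of 0.5 b'Sigma_k b - b'(mu1_k - mu2_k) + 0.5 lambda ||b - beta_bar||^2
      solve(Sigma_hat[[k]] + lambda * diag(p),
            (mu1_hat[[k]] - mu2_hat[[k]]) + lambda * beta_bar)
    })
  }
  list(beta = beta, beta_bar = Reduce(`+`, beta) / K)
}
```

A plain average center is of course sensitive to outlier tasks; the actual penalty and center estimation in Algorithm 1 are designed so that the contribution of tasks in $S^c$ can be controlled, which is what underlies the robustness terms in Theorems 1 and 2.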
2.3 Theory
In this section, we develop statistical theories for our proposed procedure MTL-GMM (see Algorithm 1). As mentioned in Section 2.1, we are interested in the performance of both parameter estimation and clustering, although the latter is the main focus and motivation. First, we impose conditions in the following assumption set.
Assumption 1.
Denote $\Delta_k = \sqrt{(\boldsymbol{\mu}_1^{(k)} - \boldsymbol{\mu}_2^{(k)})^{\top} (\boldsymbol{\Sigma}^{(k)})^{-1} (\boldsymbol{\mu}_1^{(k)} - \boldsymbol{\mu}_2^{(k)})}$ for $k \in S$. The quantity $\Delta_k$ is the Mahalanobis distance between $\boldsymbol{\mu}_1^{(k)}$ and $\boldsymbol{\mu}_2^{(k)}$ with covariance matrix $\boldsymbol{\Sigma}^{(k)}$, and can be viewed as the signal-to-noise ratio (SNR) of the $k$-th task (Anderson, 1958). Suppose the following conditions hold:
-
(i)
with a constant ;
-
(ii)
with some constant ;
-
(iii)
Either of the following two conditions holds with some constant :
-
(a)
, ;
-
(b)
, .
-
(iv)
with some constant ;
Remark 1.
These are common and mild conditions related to the sample size, initialization, and signal-to-noise ratio of GMMs. Condition (i) requires the maximum sample size of all tasks not to be much larger than the average sample size of tasks in . Similar conditions can be found in Duan and Wang, (2023). Condition (ii) is the requirement of the sample size of tasks in . The usual condition for low-dimensional single-task learning is (Cai et al., , 2019). The additional term arises from the simultaneous control of performance on all tasks in , where can be as large as . Condition (iii) requires that the initialization should not be too far away from the truth, which is commonly assumed in either the analysis of EM algorithm (Redner and Walker, , 1984; Balakrishnan et al., , 2017; Cai et al., , 2019) or other iterative algorithms like the local estimation used in semi-parametric models (Carroll et al., , 1997; Li and Liang, , 2008) and adaptive Lasso (Zou, , 2006). The two possible forms considered in this condition are due to the fact that binary GMM is only identifiable up to label permutation. Condition (iv) requires that the signal strength of GMM (in terms of Mahalanobis distance) is strong enough, which is usually assumed in the literature about the likelihood-based methods of GMMs (Dasgupta and Schulman, , 2013; Azizyan et al., , 2013; Balakrishnan et al., , 2017; Cai et al., , 2019).
We first establish the rate of convergence for the estimation. Recalling the parameter space in (9), let us denote the true parameter by
To better present the results for parameters related to the optimal discriminant rule (3), we further denote
where . Note that is a function of . For the estimators returned by MTL-GMM (see Algorithm 1), we are particularly interested in the following two error metrics:
(13) | |||
(14) | |||
(15) |
where is a permutation on . Again, we take the minimum above because binary GMM is identifiable up to label permutation. The first error metric involves the error for discriminant coefficients and is closely related to the clustering performance. It reveals how well our method utilizes similarity structure in multi-task learning. The second error metric is about the mean vectors and covariance matrix. As discussed in Section 2.1, we shall not expect it to be improved compared to that in single-task learning, as these parameters are not necessarily similar.
We are ready to present upper bounds for the estimation error of MTL-GMM. We recall that and are the parameter space and probability measure that we use in Section 2.1 to describe the data distributions for tasks in and , respectively.
Theorem 1.
(Upper bounds of the estimation error of GMM parameters for MTL-GMM) Suppose Assumption 1 holds for some with and . Let , and with some constants (which depend on the fixed constants introduced earlier). Then there exists a constant , such that for any and any probability measure on , with probability , the following hold for all :
(16) |
(17) |
where is some constant and . When with a large constant , the last term on the right-hand side will be dominated by other terms in both inequalities.
The upper bound of contains two parts. The first part is comparable to the single-task learning rate (Cai et al., , 2019) (up to a term due to the simultaneous control over all tasks in ), and the second part characterizes the geometric convergence of iterates. As expected, since , , in are not necessarily similar, an improved error rate over single-task learning is generally impossible. The upper bound for is directly related to the clustering performance of our method. Thus we will provide a detailed discussion about it after presenting the clustering result in the next theorem.
As introduced in Section 1.1, using the estimates returned by Algorithm 1, we can construct a classifier for task $k$ as

$$\hat{f}^{(k)}(\mathbf{y}) = \begin{cases} 1, & \text{if } (\mathbf{y} - \hat{\bar{\boldsymbol{\mu}}}^{(k)})^{\top} \hat{\boldsymbol{\beta}}^{(k)} + \log\frac{\hat{w}^{(k)}}{1 - \hat{w}^{(k)}} \ge 0, \\ 2, & \text{otherwise}. \end{cases} \qquad (18)$$
Recall that for a clustering method $f$ on task $k$, its mis-clustering error rate under the GMM with parameters $(w^{(k)}, \boldsymbol{\mu}_1^{(k)}, \boldsymbol{\mu}_2^{(k)}, \boldsymbol{\Sigma}^{(k)})$ is

$$R^{(k)}(f) = \min_{\pi} \mathbb{P}\big(\pi(f(\mathbf{y}^{(k)})) \neq z^{(k)}\big), \qquad (19)$$

where $\mathbf{y}^{(k)}$ is a future observation associated with the label $z^{(k)}$, independent from the training data; the probability is w.r.t. $(\mathbf{y}^{(k)}, z^{(k)})$, and the minimum is taken over the two permutation functions on $\{1, 2\}$. Denote $f^{(k)}_{\mathrm{Bayes}}$ as the Bayes classifier that minimizes $R^{(k)}(\cdot)$. In the following theorem, we obtain the upper bound of the excess mis-clustering error of $\hat{f}^{(k)}$ for $k \in S$.
Theorem 2.
(Upper bound of the excess mis-clustering error for MTL-GMM) Suppose the same conditions as in Theorem 1 hold. Then there exists a constant such that for any and any probability measure on , with probability , the following holds for all :
(20) | ||||
(21) |
with some . When with a large constant , the last term on the right-hand side will be dominated by the second term.
The upper bounds of in Theorem 1 and in Theorem 2 consist of five parts with one-to-one correspondence. It is sufficient to discuss the bound of . Part (I) represents the "oracle rate", which can be achieved when all tasks in $S$ are the same. This is the best rate one can possibly achieve. Part (II) is a dimension-free error caused by estimating the scalar parameters that appear in the optimal discriminant rule. Part (III) includes the quantity that measures the degree of similarity among the tasks in $S$. When these tasks are very similar, this quantity will be small, contributing a small term to the upper bound. Nicely, even when the similarity level is large, this term is still comparable to the minimax error rate of single-task learning (e.g., Theorems 4.1 and 4.2 in Cai et al. (2019)). We have the extra term here due to the simultaneous control over all tasks in $S$. Part (IV) quantifies the influence from the outlier tasks in $S^c$. When there are more outlier tasks, this part increases, and the bound becomes worse. On the other hand, as long as the fraction of outlier tasks is small enough to make this term dominated by any other part, the error rate induced by outlier tasks becomes negligible. Given that data from outlier tasks can be arbitrarily contaminated, we can conclude that our method is robust against a fraction of outlier tasks from arbitrary sources. The term in Part (V) decreases geometrically in the iteration number, implying that the iterates in Algorithm 1 converge geometrically to a ball of radius determined by the errors from Parts (I)-(IV).
After explaining each part of the upper bound, we now compare it with the convergence rate (including here since we consider all the tasks simultaneously) in the single-task learning and reveal how our method performs. With a quick inspection, we can conclude the following:
-
•
The rate of the upper bound is never larger than . So, in terms of rate of convergence, our method MTL-GMM performs at least as well as single-task learning, regardless of the similarity level and outlier task fraction .
-
•
When (large total sample size for tasks in ), increases with (diverging dimension), (sufficient similarity between tasks in ), and (small fraction of outlier tasks), MTL-GMM attains a faster excess mis-clustering error rate and improves over single-task learning.
The preceding discussions on the upper bounds have demonstrated the superiority of our method. But can we do better? To further evaluate the upper bounds of our method, we next derive complementary minimax lower bounds for both estimation error and excess mis-clustering error. We will show that our method is (nearly) minimax rate optimal in a broad range of regimes.
Theorem 3.
(Lower bounds of the estimation error of GMM parameters in multi-task learning) Suppose . Suppose there exists a subset with such that and , where are some constants. Then
(22) | ||||
(23) |
(24) | ||||
(25) |
Theorem 4.
(Lower bound of the excess mis-clustering error in multi-task learning) Suppose the same conditions in Theorem 3 hold. Then
(26) | ||||
(27) |
Comparing the upper and lower bounds in Theorems 1-4, we make several remarks:
-
•
Regarding the estimation of mean vectors and covariance matrices , the upper and lower bounds match, hence our method is minimax rate optimal.
-
•
For the estimation error and excess mis-clustering error with , the first three terms in the upper and lower bounds match. Only the term involving in the lower bound differs from that in the upper bound by a factor or . As a result, in the classical low-dimensional regime where is bounded, the upper and lower bounds match (up to a logarithmic factor). Therefore, our method is (nearly) minimax rate optimal for estimating and clustering in such a classical regime.
-
•
When the dimension diverges, there might exist a non-negligible gap between the upper and lower bounds for and with . Nevertheless, this only occurs when the fraction of outlier tasks is above the threshold . Below the threshold, our method remains minimax rate optimal even when the dimension is unbounded.
-
•
Does the gap, when it exists, arise from the upper bound or the lower bound? We believe that it is the upper bound that sometimes becomes not sharp. As can be seen from the proof of Theorem 1, the term is due to the estimation of those “center” parameters in Algorithm 1. Recent advances in robust statistics (Chen et al., , 2018) have shown that estimators based on statistical depth functions such as Tukey’s depth function (Tukey, , 1975) can achieve optimal minimax rate under Huber’s -contamination model for location and covariance estimation. It might be possible to utilize depth functions to estimate “center” parameters in our problem and kill the factor in the upper bound. We leave a rigorous development of optimal robustness as an interesting future research. On the other hand, such statistical improvement may come with expensive computation, as depth function-based estimation typically requires solving a challenging non-convex optimization problem.
2.4 Initialization and cluster alignment
As specified by Condition (iii) in Assumption 1, our proposed learning procedure requires that for each task in , initial values of the GMM parameter estimates lie within a distance of SNR-order from the ground truth. This can be satisfied by the method of moments proposed in Ge et al., (2015). In practice, a natural initialization method is to run the standard EM algorithm or other common clustering methods like -means on each task and use the corresponding estimate as the initial values. We adopted the standard EM algorithm in our numerical experiments, and the numerical results in Section 3 and supplements showed that this practical initialization works quite well. However, in the context of multi-task learning, Condition (iii) further requires a correct alignment of those good initializations from each task, owing to the non-identifiability of GMMs. We discuss in detail the alignment issue in Section 2.4.1 and propose two algorithms to resolve this issue in Section 2.4.2.
2.4.1 The alignment issue
Recall that Section 2.1 introduces the binary GMM with parameters $(w^{(k)}, \boldsymbol{\mu}_1^{(k)}, \boldsymbol{\mu}_2^{(k)}, \boldsymbol{\Sigma}^{(k)})$ for each task $k \in S$. Because the two sets of parameter values $(w^{(k)}, \boldsymbol{\mu}_1^{(k)}, \boldsymbol{\mu}_2^{(k)}, \boldsymbol{\Sigma}^{(k)})$ and $(1 - w^{(k)}, \boldsymbol{\mu}_2^{(k)}, \boldsymbol{\mu}_1^{(k)}, \boldsymbol{\Sigma}^{(k)})$ index the same distribution, a good initialization close to the truth is only defined up to a permutation of the two cluster labels. The permutations in the initializations of different tasks could be different. Therefore, in light of the joint parameter space defined in (9) and Condition (iii) in Assumption 1, for given initializations from different tasks, we may need to permute their cluster labels to feed a well-aligned initialization into Algorithm 1.
We further elaborate on the alignment issue using Algorithm 1. The penalization in Step 9 aims to push the estimators $\hat{\boldsymbol{\beta}}^{(k)}$'s of different tasks towards each other, which is expected to improve the performance thanks to the similarity among the underlying true parameters $\boldsymbol{\beta}^{(k)}$'s. However, due to the potential permutation of the two cluster labels, the vanilla single-task initializations (without alignment) cannot guarantee that the estimators at each iteration are all estimating the corresponding $\boldsymbol{\beta}^{(k)}$'s (some may estimate $-\boldsymbol{\beta}^{(k)}$'s instead).
Figure 1 illustrates the alignment issue in the case of two tasks. The left-hand-side situation is ideal: the two initial estimates target $\boldsymbol{\beta}^{(1)}$ and $\boldsymbol{\beta}^{(2)}$ (which are similar). The right-hand-side situation is problematic because the two initial estimates target $\boldsymbol{\beta}^{(1)}$ and $-\boldsymbol{\beta}^{(2)}$ (which are not similar). Therefore, in practice, after obtaining the initializations from each task, it is necessary to align their cluster labels to ensure that estimators of similar parameters are correctly put together in the penalization framework of Algorithm 1. We formalize the problem and provide two solutions in the next subsection.
2.4.2 Two alignment algorithms
Suppose $\{\hat{\boldsymbol{\beta}}^{(k)[0]}\}_{k=1}^{K}$ are the initial estimates of the discriminant coefficients, with potentially bad alignment across the $K$ tasks. Note that a good initialization and alignment is not required (in fact, it is not even well defined) for the outlier tasks in $S^c$, because they can be from arbitrary distributions. However, since $S$ is unknown, we will have to address the alignment issue for tasks in $S$ based on the initial estimates from all the tasks. For binary GMMs, each alignment of the $K$ tasks can be represented by a $K$-dimensional Rademacher vector (a vector with entries $\pm 1$). Define the ideal alignment as the one under which the aligned initial estimates of tasks in $S$ all target the corresponding $\boldsymbol{\beta}^{(k)}$'s. The goal is to recover the well-aligned initializers from the initial estimates (equivalently, to recover the ideal alignment), which can then be fed into Algorithm 1. Once the discriminant coefficient estimates are well aligned, the other initial estimates in Algorithm 1 will be automatically well aligned.
In the following, we will introduce two alignment algorithms. The first one is the "exhaustive search" method (Algorithm 2), where we search among all possible alignments to find the best one. The second one is the "greedy search" method (Algorithm 3), where we flip the signs of the initial discriminant coefficient estimates one task at a time in a greedy way to recover the ideal alignment. Both methods are proved to recover the ideal alignment under mild conditions. The conditions required by the "exhaustive search" method are slightly weaker than those required by the "greedy search" method. As for computational complexity, the latter enjoys a linear time complexity $O(K)$ in the number of score evaluations, while the former suffers from an exponential time complexity $O(2^K)$ due to optimization over all possible alignments.
To this end, for a given alignment with the correspondingly aligned estimates , define its alignment score as
(28) |
The intuition is that as long as the initializations are close to the ground truth, a smaller score indicates less difference among the aligned estimates, which implies a better alignment. The score can thus be used to evaluate the quality of an alignment. Note that the score is defined in a symmetric way, that is, an alignment and its global sign flip receive the same score. The exhaustive search algorithm is presented in Algorithm 2, where the scores of all alignments are calculated, and an alignment that minimizes the score is output. Since the score is symmetric, there are at least two alignments with the minimum score. The algorithm can arbitrarily choose and output one of them.
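As an illustration of the exhaustive search, the R sketch below enumerates all sign vectors and keeps a score minimizer. Since the exact score (28) is not reproduced here, we substitute a simple stand-in, the sum of pairwise Euclidean distances among the aligned initial estimates; this stand-in and the function names are our own assumptions, but it shares the symmetry property noted above.

```r
# Illustrative R sketch of the exhaustive-search alignment (Algorithm 2).
# Stand-in score (our assumption, not (28)): sum of pairwise Euclidean distances
# among the aligned initial discriminant coefficient estimates.
alignment_score <- function(R, beta0) {    # R: +/-1 vector; beta0: list of K estimates
  aligned <- Map(function(r, b) r * b, R, beta0)
  K <- length(aligned)
  s <- 0
  for (k in 1:(K - 1)) for (l in (k + 1):K)
    s <- s + sqrt(sum((aligned[[k]] - aligned[[l]])^2))
  s
}

exhaustive_alignment <- function(beta0) {
  K <- length(beta0)
  signs <- as.matrix(expand.grid(rep(list(c(-1, 1)), K)))   # all 2^K sign vectors
  scores <- apply(signs, 1, function(R) alignment_score(R, beta0))
  as.numeric(signs[which.min(scores), ])   # one minimizer (its global flip ties with it)
}
```

With $K$ tasks this evaluates $2^K$ scores, which matches the exponential complexity discussed above.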
The following theorem reveals that the exhaustive search algorithm can successfully find the ideal alignment under mild conditions.
Theorem 5 (Alignment correctness for Algorithm 2).
Remark 2.
The conditions imposed in Theorem 5 are no stronger than conditions required by Theorem 1. First of all, Condition (i) is also required in Theorem 1. Moreover, from the definition of in (LABEL:eq:_parameter_space_mtl), it is bounded by a constant. This together with Conditions (iii) and (iv) in Assumption 1 implies Condition (ii) in Theorem 5.
Remark 3.
With Theorem 5, we can relax the original Condition (iii) in Assumption 1 to the following condition:
For all , either of the following two conditions holds with a sufficiently small constant :
-
(a)
, ;
-
(b)
, .
In the relaxed version, the initialization for each task only needs to be good up to an arbitrary permutation, while in the original version, the initialization for each task needs to be good under the same permutation.
Next, we would like to introduce the second alignment algorithm, the “greedy search” method, summarized in Algorithm 3. The main idea is to flip the sign of the discriminant coefficient estimates (equivalently, swap the two cluster labels) from tasks in a sequential fashion to check whether the alignment score decreases or not. If yes, we keep the alignment after the flip and proceed with the next task. Otherwise, we keep the alignment before the flip and proceed with the next task. A surprising fact of Algorithm 3 is that it is sufficient to iterate this procedure for all tasks just once to recover the ideal alignment, making the algorithm computationally efficient.
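The greedy counterpart needs only one pass over the tasks. The sketch below reuses the stand-in alignment_score() from the previous snippet and is likewise our own illustrative code rather than the authors' implementation of Algorithm 3.

```r
# Illustrative R sketch of the greedy-search alignment (Algorithm 3): sweep the
# tasks once, flip each task's sign in turn, and keep a flip only if it lowers
# the (stand-in) alignment score defined in the previous snippet.
greedy_alignment <- function(beta0, R_init = rep(1, length(beta0))) {
  R <- R_init
  current <- alignment_score(R, beta0)
  for (k in seq_along(R)) {
    R_flip <- R
    R_flip[k] <- -R_flip[k]                # swap the two cluster labels of task k
    flipped <- alignment_score(R_flip, beta0)
    if (flipped < current) {
      R <- R_flip
      current <- flipped
    }
  }
  R
}
```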
To help state the theory of the greedy search algorithm, we define the “mismatch proportion” of as
(30) |
Intuitively, represents the level of mismatch between the initial alignment and the ideal one. It’s straightforward to verify that ; means the initial alignment equals the ideal one, while (or when is odd) represents the “worst” alignment, where almost half of the tasks are badly-aligned. The smaller is, the better alignment is. Note that we only care about the alignment of tasks in .
The following theorem shows that the greedy search algorithm can succeed in finding the ideal alignment under mild conditions.
Theorem 6 (Alignment correctness for Algorithm 3).
Remark 4.
Conditions (i) and (ii) required by Theorem 6 are similar to the requirements in Theorem 5, which have been shown to be no stronger than conditions in Assumption 1 and Theorem 1 (See Remark 2). However, Condition (iii) is an additional requirement for the success of the greedy label-swapping algorithm. The intuition is that in the exhaustive search algorithm, we compare the scores of all alignments and only need to ensure the ideal alignment can defeat the badly-aligned ones in terms of the alignment score. In contrast, the success of the greedy search algorithm relies on the correct move at each step. We need to guarantee that the “better” alignment after the swap (which may still be badly aligned) can outperform the “worse” one before the swap. This is more difficult to satisfy. Hence, more conditions are needed for the success of Algorithm 3. Condition (iii) is one such condition to provide a reasonably good initial alignment to start the greedy search process. More details of the analysis can be found in the proofs of Theorems 5 and 6 in the supplements.
Remark 5.
In practice, Condition (iii) can fail to hold with a non-zero probability. One solution is to start with random alignments, run the greedy search algorithm multiple times, and use the alignment that appears most frequently in the output. Nevertheless, this will increase the computational burden. In our numerical studies, Algorithm 3 without multiple random alignments works well.
One appealing feature of the two alignment algorithms is that they are robust against a fraction of outlier tasks from arbitrary distributions. According to the definition of the alignment score, this may appear impossible at first glance because the score depends on the estimators from all tasks. However, it turns out that the impact of outliers when comparing the scores in Algorithm 2 and 3 can be bounded by parameters and constants that are unrelated to outlier tasks via the triangle inequality of Euclidean norms. The key idea is that the alignment of outlier tasks in does not matter in Theorems 5 and 6. More details can be found in the proof of Theorems 5 and 6 in the supplementary materials.
In contrast with supervised MTL, the alignment issue commonly exists in unsupervised MTL. It generally occurs when aggregating information (up to latent label permutation) across different tasks. Alignment pre-processing is thus necessary and important. However, to our knowledge, there is no formal discussion regarding alignment in the existing literature of unsupervised MTL (Gu et al., , 2011; Zhang and Zhang, , 2011; Yang et al., , 2014; Zhang et al., , 2018; Dieuleveut et al., , 2021; Marfoq et al., , 2021). Our treatment of alignment in Section 2.4 is an important step forward in this field. Our algorithms can be potentially extended to other unsupervised MTL scenarios and we leave it for future studies.
3 Simulations
In this section, we present a simulation study of our multi-task learning procedure MTL-GMM, i.e., Algorithm 1. The tuning parameter is set as , and the value of is determined by a 10-fold cross-validation based on the log-likelihood of the final fitted model. The candidates of are chosen in a data-driven way, which is described in detail in Section S.3.1.5 of the supplements. All the experiments in this section are implemented in R. The function Mclust in the R package mclust is called to fit a single GMM. We also conducted two additional simulation studies and two real-data studies. Due to the page limit, we include these in Section S.3 of the supplementary materials.
We consider a binary GMM setting. There are tasks of which each has sample size and dimension . When , we generate each from and from , where , , and let . When , the distributions still follow GMM, but we generate each from and from , and let , . In this setup, it is clear that quantifies the similarity among tasks in , and tasks in have very distinct distributions and can be viewed as outlier tasks. For a given , the outlier task index set in each replication is uniformly sampled from all subsets of with cardinality . We consider two cases:
-
(i)
No outlier tasks (), and changes from 0 to 10 with increment 1;
-
(ii)
2 outlier tasks (), and changes from 0 to 10 with increment 1;
We fit Single-task-GMM on each separate task, Pooled-GMM on the merged data of all tasks, and our MTL-GMM in Algorithm 1 coupled with the exhaustive search for the alignment in Algorithm 2. The performances of all three methods are evaluated by the estimation error of , , , , , and , as well as the empirical mis-clustering error calculated on a test data set of size 500, for tasks in . Due to page limit, we only present the estimation error of and the mis-clustering error here, and leave the others to Section S.3.1.1 of the supplements. These two errors are the maximum errors over tasks in . For each setting, the simulation is replicated 200 times, and the average of the maximum errors together with the standard deviation are reported in Figure 2.
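For concreteness, the snippet below sketches one replication of a simulation of this flavor in R. The numbers (K, n, p, the mean separation, the similarity perturbation h) and the data-generating details are hypothetical placeholders rather than the exact setting used above; the single-task fit uses mclust::Mclust with a common-covariance model, which is our choice, and mis_clustering_error() is the helper sketched in Section 1.1.

```r
# Hedged R sketch of one simulation replication; all numeric settings below are
# HYPOTHETICAL placeholders, not the exact configuration used in the paper.
set.seed(1)
K <- 10; n <- 100; p <- 5; h <- 1                 # placeholder sizes and similarity level
base_mu <- c(rep(1.5, 2), rep(0, p - 2))
tasks <- lapply(1:K, function(k) {
  mu1 <- base_mu + runif(p, -h / sqrt(p), h / sqrt(p))   # similar, but not identical, tasks
  mu2 <- -mu1
  z  <- rbinom(n, 1, 0.5) + 1L
  Y  <- t(sapply(z, function(zi) rnorm(p) + if (zi == 1) mu1 else mu2))
  list(Y = Y, z = z)
})

# Single-task baseline: fit a binary GMM with common covariance to each task,
# then record the maximum empirical mis-clustering error over the tasks.
fits <- lapply(tasks, function(tk) mclust::Mclust(tk$Y, G = 2, modelNames = "EEE"))
errs <- sapply(1:K, function(k)
  mis_clustering_error(fits[[k]]$classification, tasks[[k]]$z))
max(errs)
```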
When there are no outlier tasks, it can be seen that MTL-GMM and Pooled-GMM are competitive when is small (i.e., the tasks are similar), and they outperform Single-task-GMM. As increases (i.e., the tasks become more heterogeneous), MTL-GMM starts to outperform Pooled-GMM by a large margin. Moreover, MTL-GMM is significantly better than Single-task-GMM in terms of both estimation and mis-clustering errors over a wide range of . These comparisons demonstrate that MTL-GMM not only effectively utilizes the unknown similarity structure among tasks, but also adapts to it. When outlier tasks exist, even when is very small, MTL-GMM still performs better than Pooled-GMM, showing the robustness of MTL-GMM against a fraction of outlier tasks.
4 Discussions
We would like to highlight several interesting open problems for potential future work:
-
•
What if only some clusters are similar among different tasks? This may be a more realistic situation in particular when there are more than 2 clusters in each task. Our current proposed algorithms may not work well because they do not take into account this extra layer of heterogeneity. Furthermore, in this situation, different tasks may have a different number of Gaussian clusters. Such a setting with various numbers of clusters has been considered in some literature on general unsupervised multi-task learning (Zhang and Zhang, , 2011; Yang et al., , 2014; Zhang et al., , 2018). It would be of great interest to develop multi-task and transfer learning methods with provable guarantees for GMMs under these more complicated settings.
-
•
How to accommodate heterogeneous covariance matrices for different Gaussian clusters within each task? This is related to the quadratic discriminant analysis (QDA) in supervised learning where the Bayes classifier has a leading quadratic term. It may require more delicate analysis for methodological and theoretical development. Some recent QDA literature might be helpful (Li and Shao, , 2015; Fan et al., , 2015; Hao et al., , 2018; Jiang et al., , 2018).
-
•
In this paper, we have focused on the pure unsupervised learning problem, where all the samples are unlabeled. It would be interesting to consider the semi-supervised learning setting, where labels in some tasks (or sources) are known. Li et al., 2022a discusses a similar problem under the linear regression setting, but how the labeled data can help the estimation and clustering in the context of GMMs remains unknown.
References
- Anderson, (1958) Anderson, T. W. (1958). An introduction to multivariate statistical analysis: Wiley series in probability and mathematical statistics: Probability and mathematical statistics.
- Ando et al., (2005) Ando, R. K., Zhang, T., and Bartlett, P. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(11).
- Argyriou et al., (2008) Argyriou, A., Evgeniou, T., and Pontil, M. (2008). Convex multi-task feature learning. Machine learning, 73(3):243–272.
- Azizyan et al., (2013) Azizyan, M., Singh, A., and Wasserman, L. (2013). Minimax theory for high-dimensional gaussian mixtures with sparse mean separation. Advances in Neural Information Processing Systems, 26.
- Balakrishnan et al., (2017) Balakrishnan, S., Wainwright, M. J., and Yu, B. (2017). Statistical guarantees for the em algorithm: From population to sample-based analysis. The Annals of Statistics, 45(1):77–120.
- Bastani, (2021) Bastani, H. (2021). Predicting with proxies: Transfer learning in high dimension. Management Science, 67(5):2964–2984.
- Baum et al., (1970) Baum, L. E., Petrie, T., Soules, G., and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of markov chains. The Annals of Mathematical Statistics, 41(1):164–171.
- Cai and Liu, (2011) Cai, T. and Liu, W. (2011). A direct estimation approach to sparse linear discriminant analysis. Journal of the American statistical association, 106(496):1566–1577.
- Cai et al., (2019) Cai, T. T., Ma, J., and Zhang, L. (2019). Chime: Clustering of high-dimensional gaussian mixtures with em algorithm and its optimality. The Annals of Statistics, 47(3):1234–1267.
- Carroll et al., (1997) Carroll, R. J., Fan, J., Gijbels, I., and Wand, M. P. (1997). Generalized partially linear single-index models. Journal of the American Statistical Association, 92(438):477–489.
- Chattopadhyay et al., (2012) Chattopadhyay, R., Sun, Q., Fan, W., Davidson, I., Panchanathan, S., and Ye, J. (2012). Multisource domain adaptation and its application to early detection of fatigue. ACM Transactions on Knowledge Discovery from Data (TKDD), 6(4):1–26.
- Chen et al., (2022) Chen, E. Y., Jordan, M. I., and Li, S. (2022). Transferred q-learning. arXiv preprint arXiv:2202.04709.
- Chen et al., (2018) Chen, M., Gao, C., and Ren, Z. (2018). Robust covariance and scatter matrix estimation under huber’s contamination model. The Annals of Statistics, 46(5):1932–1960.
- Dai et al., (2007) Dai, W., Yang, Q., Xue, G., and Yu, Y. (2007). Boosting for transfer learning. In ACM International Conference Proceeding Series, volume 227, page 193.
- Dai et al., (2008) Dai, W., Yang, Q., Xue, G.-R., and Yu, Y. (2008). Self-taught clustering. In Proceedings of the 25th international conference on Machine learning, pages 200–207.
- Dasgupta and Schulman, (2013) Dasgupta, S. and Schulman, L. (2013). A two-round variant of em for gaussian mixtures. arXiv preprint arXiv:1301.3850.
- Dempster et al., (1977) Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22.
- Dieuleveut et al., (2021) Dieuleveut, A., Fort, G., Moulines, E., and Robin, G. (2021). Federated-em with heterogeneity mitigation and variance reduction. Advances in Neural Information Processing Systems, 34:29553–29566.
- Duan and Wang, (2023) Duan, Y. and Wang, K. (2023). Adaptive and robust multi-task learning. The Annals of Statistics, 51(5):2015–2039.
- Efron, (1975) Efron, B. (1975). The efficiency of logistic regression compared to normal discriminant analysis. Journal of the American Statistical Association, 70(352):892–898.
- Evgeniou and Pontil, (2004) Evgeniou, T. and Pontil, M. (2004). Regularized multi–task learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 109–117.
- Fan et al., (2012) Fan, J., Feng, Y., and Tong, X. (2012). A road to classification in high dimensional space: the regularized optimal affine discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(4):745–771.
- Fan et al., (2015) Fan, Y., Kong, Y., Li, D., and Zheng, Z. (2015). Innovated interaction screening for high-dimensional nonlinear classification. The Annals of Statistics, 43(3):1243–1272.
- Forgy, (1965) Forgy, E. W. (1965). Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics, 21:768–769.
- Ge et al., (2015) Ge, R., Huang, Q., and Kakade, S. M. (2015). Learning mixtures of gaussians in high dimensions. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 761–770.
- Gu et al., (2011) Gu, Q., Li, Z., and Han, J. (2011). Learning a kernel for multi-task clustering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 25, pages 368–373.
- Hao et al., (2018) Hao, N., Feng, Y., and Zhang, H. H. (2018). Model selection for high-dimensional quadratic regression via regularization. Journal of the American Statistical Association, 113(522):615–625.
- Hartley, (1958) Hartley, H. O. (1958). Maximum likelihood estimation from incomplete data. Biometrics, 14(2):174–194.
- Hasselblad, (1966) Hasselblad, V. (1966). Estimation of parameters for a mixture of normal distributions. Technometrics, 8(3):431–444.
- Hastie et al., (2009) Hastie, T., Tibshirani, R., Friedman, J. H., and Friedman, J. H. (2009). The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer.
- Hsu and Kakade, (2013) Hsu, D. and Kakade, S. M. (2013). Learning mixtures of spherical gaussians: moment methods and spectral decompositions. In Proceedings of the 4th conference on Innovations in Theoretical Computer Science, pages 11–20.
- Huber, (1964) Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, pages 73–101.
- Jain and Dubes, (1988) Jain, A. K. and Dubes, R. C. (1988). Algorithms for clustering data. Prentice-Hall, Inc.
- Jalali et al., (2010) Jalali, A., Sanghavi, S., Ruan, C., and Ravikumar, P. (2010). A dirty model for multi-task learning. Advances in neural information processing systems, 23.
- Jiang et al., (2018) Jiang, B., Wang, X., and Leng, C. (2018). A direct approach for sparse quadratic discriminant analysis. Journal of Machine Learning Research, 19(1):1098–1134.
- Jin et al., (2017) Jin, J., Ke, Z. T., and Wang, W. (2017). Phase transitions for high dimensional clustering and related problems. The Annals of Statistics, 45(5):2151–2189.
- Kalai et al., (2010) Kalai, A. T., Moitra, A., and Valiant, G. (2010). Efficiently learning mixtures of two gaussians. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 553–562.
- Konstantinov et al., (2020) Konstantinov, N., Frantar, E., Alistarh, D., and Lampert, C. (2020). On the sample complexity of adversarial multi-source pac learning. In International Conference on Machine Learning, pages 5416–5425. PMLR.
- Kwon and Caramanis, (2020) Kwon, J. and Caramanis, C. (2020). The em algorithm gives sample-optimality for learning mixtures of well-separated gaussians. In Conference on Learning Theory, pages 2425–2487. PMLR.
- Lawrence and Platt, (2004) Lawrence, N. D. and Platt, J. C. (2004). Learning to learn with the informative vector machine. In Proceedings of the twenty-first international conference on Machine learning, page 65.
- Lee et al., (2012) Lee, K., Guillemot, L., Yue, Y., Kramer, M., and Champion, D. (2012). Application of the gaussian mixture model in pulsar astronomy-pulsar classification and candidates ranking for the fermi 2fgl catalogue. Monthly Notices of the Royal Astronomical Society, 424(4):2832–2840.
- Li and Shao, (2015) Li, Q. and Shao, J. (2015). Sparse quadratic discriminant analysis for high dimensional data. Statistica Sinica, pages 457–473.
- Li and Liang, (2008) Li, R. and Liang, H. (2008). Variable selection in semiparametric regression modeling. The Annals of Statistics, 36(1):261–286.
- Li et al., (2023) Li, S., Cai, T., and Duan, R. (2023). Targeting underrepresented populations in precision medicine: A federated transfer learning approach. The Annals of Applied Statistics, 17(4):2970–2992.
- Li et al., (2021) Li, S., Cai, T. T., and Li, H. (2021). Transfer learning for high-dimensional linear regression: Prediction, estimation and minimax optimality. Journal of the Royal Statistical Society: Series B (Statistical Methodology), pages 1–25.
- (46) Li, S., Cai, T. T., and Li, H. (2022a). Estimation and inference with proxy data and its genetic applications. arXiv preprint arXiv:2201.03727.
- (47) Li, S., Cai, T. T., and Li, H. (2022b). Transfer learning in large-scale gaussian graphical models with false discovery rate control. Journal of the American Statistical Association, pages 1–13.
- Li et al., (2013) Li, W., Duan, L., Xu, D., and Tsang, I. W. (2013). Learning with augmented features for supervised and semi-supervised heterogeneous domain adaptation. IEEE transactions on pattern analysis and machine intelligence, 36(6):1134–1148.
- Lin and Reimherr, (2022) Lin, H. and Reimherr, M. (2022). On transfer learning in functional linear regression. arXiv preprint arXiv:2206.04277.
- Mai et al., (2019) Mai, Q., Yang, Y., and Zou, H. (2019). Multiclass sparse discriminant analysis. Statistica Sinica, 29(1):97–111.
- Mai et al., (2012) Mai, Q., Zou, H., and Yuan, M. (2012). A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99(1):29–42.
- Marfoq et al., (2021) Marfoq, O., Neglia, G., Bellet, A., Kameni, L., and Vidal, R. (2021). Federated multi-task learning under a mixture of distributions. Advances in Neural Information Processing Systems, 34:15434–15447.
- McLachlan and Krishnan, (2007) McLachlan, G. J. and Krishnan, T. (2007). The EM algorithm and extensions. John Wiley & Sons.
- Meng and Rubin, (1994) Meng, X.-L. and Rubin, D. B. (1994). On the global and componentwise rates of convergence of the EM algorithm. Linear Algebra and its Applications, 199:413–425.
- Mihalkova et al., (2007) Mihalkova, L., Huynh, T., and Mooney, R. J. (2007). Mapping and revising Markov logic networks for transfer learning. In Proceedings of the 22nd national conference on Artificial intelligence-Volume 1, pages 608–614.
- Murtagh and Contreras, (2012) Murtagh, F. and Contreras, P. (2012). Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(1):86–97.
- Ng et al., (2001) Ng, A., Jordan, M., and Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems, 14.
- Obozinski et al., (2006) Obozinski, G., Taskar, B., and Jordan, M. (2006). Multi-task feature selection. Statistics Department, UC Berkeley, Tech. Rep, 2(2.2):2.
- Pan and Yang, (2009) Pan, S. J. and Yang, Q. (2009). A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359.
- Pearson, (1894) Pearson, K. (1894). Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A, 185:71–110.
- Redner and Walker, (1984) Redner, R. A. and Walker, H. F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195–239.
- Scott and Symons, (1971) Scott, A. J. and Symons, M. J. (1971). Clustering methods based on likelihood ratio criteria. Biometrics, pages 387–397.
- Sundberg, (1974) Sundberg, R. (1974). Maximum likelihood theory for incomplete data from an exponential family. Scandinavian Journal of Statistics, pages 49–58.
- Thrun and O’Sullivan, (1996) Thrun, S. and O’Sullivan, J. (1996). Discovering structure in multiple learning tasks: The tc algorithm. In ICML, volume 96, pages 489–497.
- Tian and Feng, (2023) Tian, Y. and Feng, Y. (2023). Transfer learning under high-dimensional generalized linear models. Journal of the American Statistical Association, 118(544):2684–2697.
- Tukey, (1975) Tukey, J. W. (1975). Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Vancouver, 1975, volume 2, pages 523–531.
- Vempala and Wang, (2004) Vempala, S. and Wang, G. (2004). A spectral algorithm for learning mixture models. Journal of Computer and System Sciences, 68(4):841–860.
- Wang et al., (2021) Wang, R., Zhou, J., Jiang, H., Han, S., Wang, L., Wang, D., and Chen, Y. (2021). A general transfer learning-based Gaussian mixture model for clustering. International Journal of Fuzzy Systems, 23(3):776–793.
- Wang et al., (2014) Wang, Z., Gu, Q., Ning, Y., and Liu, H. (2014). High dimensional expectation-maximization algorithm: Statistical optimization and asymptotic normality. arXiv preprint arXiv:1412.8729.
- Wang et al., (2008) Wang, Z., Song, Y., and Zhang, C. (2008). Transferred dimensionality reduction. In Joint European conference on machine learning and knowledge discovery in databases, pages 550–565. Springer.
- Witten and Tibshirani, (2011) Witten, D. M. and Tibshirani, R. (2011). Penalized classification using Fisher’s linear discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(5):753–772.
- Wu, (1983) Wu, C. J. (1983). On the convergence properties of the EM algorithm. The Annals of Statistics, pages 95–103.
- Xu et al., (2016) Xu, J., Hsu, D. J., and Maleki, A. (2016). Global analysis of expectation maximization for mixtures of two Gaussians. Advances in Neural Information Processing Systems, 29.
- Xu and Bastani, (2021) Xu, K. and Bastani, H. (2021). Learning across bandits in high dimension via robust statistics. arXiv preprint arXiv:2112.14233.
- Yan et al., (2017) Yan, B., Yin, M., and Sarkar, P. (2017). Convergence of gradient em on multi-component mixture of gaussians. Advances in Neural Information Processing Systems, 30.
- Yang and Ahuja, (1998) Yang, M.-H. and Ahuja, N. (1998). Gaussian mixture model for human skin color and its applications in image and video databases. In Storage and retrieval for image and video databases VII, volume 3656, pages 458–466. SPIE.
- Yang et al., (2014) Yang, Y., Ma, Z., Yang, Y., Nie, F., and Shen, H. T. (2014). Multitask spectral clustering by exploring intertask correlation. IEEE transactions on cybernetics, 45(5):1083–1094.
- Zhang and Zhang, (2011) Zhang, J. and Zhang, C. (2011). Multitask Bregman clustering. Neurocomputing, 74(10):1720–1734.
- Zhang and Chen, (2022) Zhang, Q. and Chen, J. (2022). Distributed learning of finite Gaussian mixtures. Journal of Machine Learning Research, 23(99):1–40.
- Zhang et al., (2022) Zhang, X., Blanchet, J., Ghosh, S., and Squillante, M. S. (2022). A class of geometric structures in transfer learning: Minimax bounds and optimality. In International Conference on Artificial Intelligence and Statistics, pages 3794–3820. PMLR.
- Zhang et al., (2015) Zhang, X., Zhang, X., and Liu, H. (2015). Smart multitask Bregman clustering and multitask kernel clustering. ACM Transactions on Knowledge Discovery from Data (TKDD), 10(1):1–29.
- Zhang et al., (2018) Zhang, X., Zhang, X., Liu, H., and Luo, J. (2018). Multi-task clustering with model relation learning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 3132–3140.
- Zhang and Yang, (2021) Zhang, Y. and Yang, Q. (2021). A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering.
- Zhao et al., (2020) Zhao, R., Li, Y., and Sun, Y. (2020). Statistical convergence of the EM algorithm on Gaussian mixture models. Electronic Journal of Statistics, 14:632–660.
- Zou, (2006) Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429.
- Zuo et al., (2018) Zuo, H., Lu, J., Zhang, G., and Liu, F. (2018). Fuzzy transfer learning using an infinite Gaussian mixture model and active learning. IEEE Transactions on Fuzzy Systems, 27(2):291–303.
Supplementary Materials of “Robust Unsupervised Multi-task and Transfer Learning on Gaussian Mixture Models”
S.1 Extension to Multi-cluster GMMs
In the main text, we have discussed the MTL problem for binary GMMs. In this section, we extend our methods and theory to Gaussian mixtures with a general number of clusters.
We first generalize the problem setting introduced in Sections 1.1 and 2.1. There are tasks where we have observations from the -th task. An unknown subset denotes tasks whose samples follow multi-cluster GMMs, and refers to outlier tasks that can have arbitrary distributions. Specifically, for all ,
(S.1.32) |
with , and
(S.1.33) |
where is some probability measure on and . We focus on the following joint parameter space
(S.1.34) |
where is the -th discriminant coefficient in the -th task, and is the parameter space for a single multi-cluster GMM,
(S.1.35) | ||||
(S.1.36) |
Note that (S.1.34) and (S.1.36) are natural generalizations of the multi-task parameter space defined in the main text and (8), respectively.
Under a multi-cluster GMM with parameter , compared with (3), the optimal discriminant rule now becomes
(S.1.37) |
where . Once we have the parameter estimators, we plug them into the above rule to obtain the plug-in clustering method. Recall that for a clustering method , its mis-clustering error is
(S.1.38) |
Here, is an independent future observation associated with the label , the probability is w.r.t. , and the minimum is taken over permutation functions on . Since (S.1.37) is the optimal clustering method that minimizes , the excess mis-clustering error for a given clustering is . The rest of this section aims to extend the EM-stylized multi-task learning procedure and the two alignment algorithms in Section 2 to the general multi-cluster GMM setting, and to provide similar statistical guarantees in terms of estimation and excess mis-clustering errors. For simplicity, throughout this section, we assume the number of clusters to be bounded and known. We leave the case of a diverging number of clusters as future work.
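Because cluster labels are only identified up to a permutation, the mis-clustering error in (S.1.38) involves a minimum over label permutations. The sketch below is an empirical analogue of that evaluation step on a labeled test sample; the function name and inputs are illustrative, not part of the original algorithms.

```python
from itertools import permutations
import numpy as np

def misclustering_error(y_true, y_pred, R):
    """Empirical mis-clustering error, minimized over all label permutations.

    y_true, y_pred: integer arrays with values in {0, ..., R-1}.
    """
    best = 1.0
    for perm in permutations(range(R)):
        # relabel the predictions according to the candidate permutation
        y_perm = np.array([perm[r] for r in y_pred])
        best = min(best, float(np.mean(y_perm != y_true)))
    return best

# toy usage: a relabeled but otherwise perfect clustering has error 0
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([2, 2, 0, 0, 1, 1])
print(misclustering_error(y_true, y_pred, R=3))  # 0.0
```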
Since both the EM algorithm and the penalization framework work beyond the binary GMM, the methodological idea described in Section 2.2 can be directly adapted to extend Algorithm 1 to the multi-cluster case. We summarize the general procedure in Algorithm 4. As in Algorithm 1, we adopt the following notation for the posterior probability in Algorithm 4,
(S.1.39) |
where , , and . Specifically, is the posterior probability given the observation , when the true parameter of a multi-cluster GMM satisfies , for .
Having the estimates from Algorithm 4, we can plug them into (S.1.37) to construct the clustering method, denoted by . Equivalently,
(S.1.40) |
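To make the plug-in rule (S.1.40) concrete, the minimal sketch below assigns a new observation to the cluster with the largest estimated posterior probability under a common-covariance GMM. The variable names (weights w, means mu, covariance Sigma) are illustrative stand-ins for the estimates returned by Algorithm 4.

```python
import numpy as np

def plug_in_cluster(x, w, mu, Sigma):
    """Assign x to the argmax over r of the estimated posterior probability
    under a multi-cluster GMM with common covariance Sigma.

    w: (R,) mixing proportions; mu: (R, p) means; Sigma: (p, p) covariance.
    """
    Sigma_inv = np.linalg.inv(Sigma)
    # log posterior up to a constant: log w_r - 0.5 (x - mu_r)' Sigma^{-1} (x - mu_r)
    scores = [np.log(w[r]) - 0.5 * (x - mu[r]) @ Sigma_inv @ (x - mu[r])
              for r in range(len(w))]
    return int(np.argmax(scores))

# toy usage with two well-separated clusters
w = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])
Sigma = np.eye(2)
print(plug_in_cluster(np.array([2.8, 3.1]), w, mu, Sigma))  # 1
```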
S.1.1 Theory
We need the following assumption before stating the theory.
Assumption 2.
Denote for . Suppose the following conditions hold:
-
(i)
with a constant ;
-
(ii)
with some constant ;
-
(iii)
There exists a permutation such that
-
(a)
, with some constant ;
-
(b)
.
-
(a)
-
(iv)
with some constant ;
Remark 6.
We first present the result for parameter estimation. We adopt error metrics similar to those in (14) and (15). Specifically, denote the true parameter by , which belongs to the parameter space in (S.1.34). For each , define the functional , where . For the estimators returned by Algorithm 4, we are interested in the following error metrics (similar to the binary case, the minimum is taken due to the non-identifiability in multi-cluster GMMs):
(S.1.41) | |||
(S.1.42) | |||
(S.1.43) |
Theorem 7.
(Upper bounds of the estimation error of GMM parameters for multi-cluster MTL-GMM) Suppose Assumption 2 holds, , and . Let , and with some constants (which depend on the constants , , , etc.). Then there exists a constant such that for any and any probability measure on , with probability , the following hold for all :
(S.1.44) |
(S.1.45) |
where is some constant and . When with some large constant , the last term on the right-hand side will be dominated by other terms in both inequalities.
Recall the clustering method defined in (S.1.40). The next theorem obtains the upper bound of the excess mis-clustering error of for .
Theorem 8.
(Upper bound of the excess mis-clustering error for multi-cluster MTL-GMM) Suppose the same conditions in Theorem 7 hold. Then there exists a constant such that for any and any probability measure on , with probability at least , the following holds for all :
(S.1.46) | |||
(S.1.47) |
where is some constant. When with a large constant , the term involving on the right-hand side will be dominated by other terms.
Comparing the upper bounds in Theorems 7 and 8 with those in Theorems 1 and 2, the only difference is an extra logarithmic term in Theorem 8, which we believe is a proof artifact. Similar logarithmic terms appear in the multi-cluster GMM literature as well; see, for example, \citeappyan2017convergence and \citeappzhao2020statistical. To understand the upper bounds in Theorems 7 and 8, we can follow the discussion after Theorems 1 and 2, which we do not repeat here.
The following lower bounds together with the derived upper bounds will show that our method is (nearly) minimax optimal in a wide range of regimes.
Theorem 9.
(Lower bounds of the estimation error of GMM parameters in multi-task learning) Suppose . When there exists a subset with such that and , where are some constants, we have
(S.1.48) | ||||
(S.1.49) |
(S.1.50) | ||||
(S.1.51) |
Theorem 10.
(Lower bound of the excess mis-clustering error in multi-task learning) Suppose the same conditions in Theorem 9 hold. Then
(S.1.52) | ||||
(S.1.53) |
S.1.2 Alignment
Similar to the binary case, the alignment issue arises in the multi-cluster case as well. In this section, we propose two alignment algorithms as extensions of Algorithms 2 and 3.
In the multi-cluster case, the alignment of each task can be represented as a permutation of . Consider a series of permutations , where each is a permutation function on . Define a score of as
(S.1.54) |
We want to recover the correct alignment . We propose an exhaustive search algorithm, which is summarized in Algorithm 5.
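The sketch below illustrates the exhaustive search: it enumerates all combinations of per-task permutations of the cluster labels and keeps the combination with the smallest alignment score. Purely for illustration, the score here is the total pairwise discrepancy between permuted initial mean estimates, which may differ from the exact score in (S.1.54); the function and input names are hypothetical.

```python
from itertools import combinations, permutations, product
import numpy as np

def exhaustive_align(init_means):
    """Exhaustive search over per-task label permutations.

    init_means: list of K arrays, each of shape (R, p), holding the initial
    cluster-mean estimates of each task (illustrative inputs).
    Returns one permutation of {0, ..., R-1} per task.
    """
    K, R = len(init_means), init_means[0].shape[0]
    perms = list(permutations(range(R)))

    def score(assignment):
        # total pairwise discrepancy between permuted mean estimates
        total = 0.0
        for j, k in combinations(range(K), 2):
            a = init_means[j][list(assignment[j])]
            b = init_means[k][list(assignment[k])]
            total += np.linalg.norm(a - b)
        return total

    # brute-force search over all (R!)^K combinations of permutations
    return min(product(perms, repeat=K), key=score)

# toy usage: three tasks whose labels are arbitrary relabelings of the same means
base = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
tasks = [base, base[[2, 0, 1]], base[[1, 2, 0]]]
print(exhaustive_align(tasks))  # recovers a consistent relabeling (up to a common permutation)
```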
The following theorem shows that under certain conditions, the output from Algorithm 5 recovers the correct alignment up to a permutation.
Theorem 11 (Alignment correctness for Algorithm 5).
The biggest issue with Algorithm 5 is its computational cost: its time complexity is , because it needs to search over all permutations, which is not practically feasible when and are large. Therefore, we propose the following greedy search algorithm, summarized in Algorithm 6, to reduce the computational cost. Its main idea is similar to that of Algorithm 3 for the binary GMM, but the procedure is different. We define the score of the alignments of tasks - as
(S.1.58) | |||
(S.1.59) |
The subsequent theorem demonstrates that, under slightly stronger assumptions than those required by Algorithm 5, the greedy search algorithm can recover the correct alignment up to a permutation with high probability. Importantly, this approach reduces the computational cost from to .
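A sketch of the greedy alternative: fix the labels of the first task, then align each subsequent task, one at a time, by choosing the permutation minimizing its discrepancy with the tasks already aligned. The score is again an illustrative pairwise-distance criterion rather than the exact quantity in (S.1.58), and the names are hypothetical.

```python
from itertools import permutations
import numpy as np

def greedy_align(init_means):
    """Greedy label alignment: roughly O(K * R!) work instead of O((R!)^K).

    init_means: list of K arrays of shape (R, p) (illustrative inputs).
    Returns one permutation of {0, ..., R-1} per task.
    """
    K, R = len(init_means), init_means[0].shape[0]
    aligned = [init_means[0]]            # task 1 keeps its own labels
    assignment = [tuple(range(R))]
    for k in range(1, K):
        best_perm, best_score = None, np.inf
        for perm in permutations(range(R)):
            cand = init_means[k][list(perm)]
            # discrepancy with all previously aligned tasks
            s = sum(np.linalg.norm(cand - a) for a in aligned)
            if s < best_score:
                best_perm, best_score = perm, s
        aligned.append(init_means[k][list(best_perm)])
        assignment.append(best_perm)
    return assignment
```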
Theorem 12.
Assume there are no outlier tasks in the first tasks, and
-
(i)
with .
-
(ii)
;
-
(iii)
;
-
(iv)
,
where is the outlier task proportion and appears in the condition that . Then there exists a permutation , such that the output of Algorithm 6 satisfies
(S.1.60) |
for all .
Remark 7.
Conditions (ii)-(iv) are similar to the conditions in Theorem 6. The inclusion of Condition (i) aims to facilitate the analysis in the proof, and we conjecture that the obtained results persist even if this condition is omitted.
When is very large, the computational burden becomes prohibitive, rendering even the time complexity impractical. Addressing this computational challenge requires the development of more efficient alignment algorithms, a pursuit that we defer to future work. In addition, one caveat of the greedy search algorithm is that we need to know some non-outlier tasks a priori, which may be unrealistic in practice. In our empirical examinations, we enhance the algorithm’s performance by introducing a random shuffle of the tasks in each iteration. Specifically, we execute Algorithm 6 200 times, yielding 200 alignment candidates. The final alignment is then determined by selecting the configuration that attains the minimum score among the candidates.
S.2 Transfer Learning
S.2.1 Problem setting
In the main text and Section S.1, we discussed GMMs in the context of multi-task learning, where the goal is to learn all tasks jointly by utilizing the potential similarities shared by different tasks. In this section, we study binary GMMs in the transfer learning context, where the focus is on improving the learning of one target task by transferring knowledge from related source tasks. Multi-cluster results can be obtained similarly to the MTL case, and we omit the details given the extensive length of the paper.
Suppose that there are tasks in total, where the first task is called the target and the remaining ones are called sources. As in multi-task learning, we assume that there exists an unknown subset , such that samples from sources in follow an independent GMM, while samples from sources outside can be arbitrarily distributed. This means,
(S.2.61) |
(S.2.62) |
for all , , and
(S.2.63) |
where is some probability measure on and .
For the target task, we observe samples independently drawn from the following GMM:
(S.2.64) |
(S.2.65) |
The objective of transfer learning is to use source data to help improve GMM learning in the target task. As in multi-task learning, we measure the learning performance by both the parameter estimation error and the excess mis-clustering error, but only on the target GMM. Toward this end, we define the joint parameter space for the GMM parameters of the target and the sources in :
(S.2.66) |
where is the single-GMM parameter space introduced in (8), and , . Compared with the parameter space for multi-task learning defined in the main text, here the target discriminant coefficient serves as the “center” of the discriminant coefficients of the sources in . The quantity characterizes the closeness between the sources in and the target.
S.2.2 Method
Like the MTL-GMM procedure developed in Section 2.2, we combine the EM algorithm and the penalization framework to develop a variant of the EM algorithm for transfer learning. The key idea is to first apply MTL-GMM to all the sources to obtain estimates of the discriminant coefficient “center” as summary statistics of the source data sets, and then shrink the target discriminant coefficient towards these center estimates in the EM iterations to exploit the relatedness between the sources and the target. See Section 3.3 of \citeappduan2023adaptive for more general discussions of this idea. Our proposed transfer learning procedure TL-GMM is summarized in Algorithm 7.
While the steps of TL-GMM look very similar to those of MTL-GMM, there exist two major differences between them. First, for each optimization problem in TL-GMM, the first part of the objective function only involves the target data , while in MTL-GMM, it is a weighted average of all tasks. Second, in TL-GMM, the penalty is imposed on the distance between a discriminant coefficient estimator and a given center estimator produced by MTL-GMM from the source data. In contrast, the center is estimated simultaneously with other parameters through the penalization in MTL-GMM. In light of existing transfer learning approaches in the literature, TL-GMM can be considered as the “debiasing” step described in \citeappli2021transfer and \citeapptian2023transfer, which corrects potential bias of the center estimate using the target data.
The tuning parameters in Algorithm 7 control the amount of knowledge transferred from the sources. Setting the tuning parameters large enough forces the parameter estimates for the target task to be exactly equal to the center learned from the sources, while setting them to zero reduces TL-GMM to the standard EM algorithm on the target data.
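To illustrate how such a penalty acts, consider a simplified quadratic surrogate for the target-only objective: minimizing 0.5*||beta - beta_target||^2 + lam*||beta - beta_center||_2 has a closed-form solution that shrinks the target-only estimate toward the source center and collapses onto the center once lam exceeds the distance between the two. This is only a stylized stand-in for the actual penalized M-step in Algorithm 7, with hypothetical names.

```python
import numpy as np

def shrink_toward_center(beta_target, beta_center, lam):
    """Closed-form minimizer of 0.5*||b - beta_target||^2 + lam*||b - beta_center||_2.

    A stylized surrogate for the penalized update: the target-only estimate is
    pulled toward the source center, and equals the center when lam is large.
    """
    d = beta_target - beta_center
    r = np.linalg.norm(d)
    if r <= lam:
        return beta_center.copy()               # full transfer: estimate = center
    return beta_center + (1.0 - lam / r) * d    # partial shrinkage toward the center

beta_target = np.array([1.0, 2.0, 0.0])
beta_center = np.array([0.5, 1.5, 0.5])
print(shrink_toward_center(beta_target, beta_center, lam=0.0))   # no transfer
print(shrink_toward_center(beta_target, beta_center, lam=10.0))  # equals the center
```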
S.2.3 Theory
In this section, we will establish the upper and lower bounds for the GMM parameter estimation error and the excess mis-clustering error on the target task. First, we impose the following assumption set.
Assumption 3.
Denote . Assume the following conditions hold:
-
(i)
with constants and , where .
-
(ii)
with some constant ;
-
(iii)
Either of the following two conditions holds with some constant :
-
(a)
, ;
-
(b)
, .
-
(a)
-
(iv)
with some constant ;
Remark 8.
Condition (i) requires that the target sample size not be much smaller than the maximum source sample size, a condition that appears for technical reasons in the proof. Conditions (ii)-(iv) can be seen as the counterparts of Conditions (ii)-(iv) in Assumption 1 for the target GMM.
We are now in a position to present the upper bounds of the estimation error of GMM parameters for TL-GMM.
Theorem 13.
(Upper bounds of the estimation error of GMM parameters for TL-GMM) Suppose the conditions in Theorem 1 and Assumption 3 hold. Let , , with some specific constants . Then there exists a constant , such that for any and any probability measure on , we have
(S.2.67) | ||||
(S.2.68) |
(S.2.69) |
with probability at least , where and . When with a large constant , in both inequalities, the last term on the right-hand side will be dominated by other terms.
Next, we present the upper bound of the excess mis-clustering error on the target task for TL-GMM. Having the estimator and the truth , the clustering method and its mis-clustering error are defined in the same way as in (18) and (19).
Theorem 14.
(Upper bound of the target excess mis-clustering error for TL-GMM) Suppose the same conditions in Theorem 13 hold. Then there exists a constant such that for any and any probability measure on , with probability at least the following holds:
(S.2.70) | ||||
(S.2.71) |
with some constant . When with some large constant , the last term in the upper bound will be dominated by the second term.
Similar to the upper bounds of and in Theorems 1 and 2, the upper bounds for and consist of multiple parts with a one-to-one correspondence. We take the bound of in Theorem 14 as an example. Part (I) is the oracle rate . Part (II) is the error caused by estimating the scalar parameters and in the decision boundary, and thus does not depend on the dimension . Part (III) quantifies the contribution of related sources to learning the target task: the more related the sources in are to the target (i.e., the smaller is), the smaller Part (III) becomes. Part (IV) captures the impact of outlier sources on the estimation error. As increases (i.e., the proportion of outlier sources increases), Part (IV) first increases and then flattens out; it never exceeds the minimax rate of single-task learning on the target task \citepappbalakrishnan2017statistical, cai2019chime. Therefore, our method is robust against a fraction of outlier sources with arbitrarily contaminated data. Part (V) is an extra term caused by estimating the center in MTL-GMM, which by Assumption 3.(i) is smaller than the single-task learning rate . Part (VI) decreases geometrically in the iteration number of Algorithm 7 and becomes negligible when the number of iterations is set large enough.
Consider the general scenario . Then the upper bound of the excess mis-clustering error rate is guaranteed to be no worse than the optimal single-task learning rate . More importantly, in the general regime where (a small number of outlier sources), (enough similarity between sources and target), (large total source sample size), and (large maximum source sample size), TL-GMM improves GMM learning on the target task by achieving a better estimation error rate. As for the upper bounds of and , when , they have the single-task learning rate . This is expected, since the mean vectors and covariance matrices of the sources are not necessarily similar to those of the target in the parameter space .
The following result of minimax lower bounds shows that the upper bounds in Theorems 13 and 14 are optimal in a broad range of regimes.
Theorem 15.
(Lower bounds of the estimation error of GMM parameters in transfer learning) Suppose . Suppose there exists a subset with such that , and with some constants . Then we have
(S.2.72) | ||||
(S.2.73) |
(S.2.74) |
Theorem 16.
(Lower bound of the target excess mis-clustering error in transfer learning) Suppose the same conditions in Theorem 15 hold. Then we have
(S.2.75) | ||||
(S.2.76) |
Comparing the upper and lower bounds in Theorems 13-16, several remarks are in order:
-
•
With , our estimators , , achieve the minimax optimal rate for estimating the mean vectors , and the covariance matrix .
-
•
Regarding the target excess mis-clustering error, with the choices , Part (VI) in the upper bound becomes negligible. We thus compare the other five terms in the upper bound with the corresponding terms in the lower bound.
-
1.
Part (IV) in the upper bound differs from the corresponding term in the lower bound by a factor (up to ). Hence a gap can arise when the dimension diverges. The reason is similar to that in the multi-task learning setting, and using statistical-depth-function-based “center” estimates might close the gap. We refer to the paragraph after Theorem 4 for more details.
-
2.
Part (V) in the upper bound does not appear in the lower bound. This term comes from the upper bound on the center estimate in MTL-GMM. When , this term is dominated by Part (II).
-
3.
The other three terms from the upper bound match with the ones in the lower bound.
-
1.
-
•
Based on the above comparisons, we can conclude that under the mild condition , our method is minimax rate optimal for the estimation of in the classical low-dimensional regime . Even when is unbounded, the gap between the upper and lower bounds appears only when the fourth or fifth term is the dominating term in the upper bound. As in the discussion after Theorem 4, similar restricted regimes where our method might become sub-optimal can be derived.
S.2.4 Label alignment
As in multi-task learning, the alignment issue exists in transfer learning as well. Referring to the parameter space and the conditions on initialization in Assumptions 1 and 3, the success of Algorithm 7 requires correct alignments in two places. First, the center estimate used as input to Algorithm 7 is obtained from Algorithm 1, which involves the alignment of initial estimates for the sources. This alignment problem can be readily solved by Algorithm 2 or 3. Second, the initialization of the target problem needs to be correctly aligned with the aforementioned center estimates. This is easy to address using the alignment score described in Section 2.4.2, as there are only two different alignment options. We summarize the steps in Algorithm 8.
Like Algorithms 2 and 3, Algorithm 8 is able to find the correct alignments under mild conditions. Suppose are the initialization values with potentially wrong alignment. Define the correct alignment as with . For any which is a permutation order of and its corresponding alignment , define its alignment score as
(S.2.77) |
As expected, under the conditions of Algorithm 2 or 3 for the sources, together with similar conditions on the target, Algorithm 8 will output the ideal alignment (equivalently, a good initialization for Algorithm 7).
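Because the binary target initialization admits only two label orders (swapped or not), Algorithm 8 can simply evaluate the alignment score for both options and keep the better one. The sketch below uses an illustrative distance-to-center criterion in place of the exact score in (S.2.77); names and inputs are hypothetical.

```python
import numpy as np

def align_target_init(mu1_init, mu2_init, center1, center2):
    """Choose between the two label orders of a binary target initialization.

    Returns the (mu1, mu2) order whose total distance to the source "centers"
    (center1, center2) is smaller; the score is an illustrative proxy.
    """
    keep = np.linalg.norm(mu1_init - center1) + np.linalg.norm(mu2_init - center2)
    swap = np.linalg.norm(mu2_init - center1) + np.linalg.norm(mu1_init - center2)
    return (mu1_init, mu2_init) if keep <= swap else (mu2_init, mu1_init)
```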
S.3 Additional Numerical Studies
In this section, we present results from additional numerical studies, including supplementary results from the simulation study in Section 3 of the main text. Additionally, we provide results from two new MTL simulations, one TL simulation, explorations of different penalty parameters, and two real-data studies.
S.3.1 Simulations
S.3.1.1 Simulation 1 of MTL
In this subsection, we provide additional performance evaluations for the three methods (MTL-GMM, Pooled-GMM, and Single-task-GMM) in the simulation presented in the main text (referred to as Simulation 1). The results are displayed in Figures S.3 and S.4.
[Figure S.3]
[Figure S.4]
Referring to Figure S.3 for the case without outlier tasks, MTL-GMM outperforms Pooled-GMM in estimating all the time. This makes sense because Pooled-GMM does not take the heterogeneity of ’s into account. For the estimation of the other parameters (except ) and clustering, MTL-GMM and Pooled-GMM are competitive when is small (i.e., the tasks are similar). (In fact, it is not surprising that Pooled-GMM estimates , , , and better than MTL-GMM when is small in this example: these parameters are similar to each other, although MTL-GMM does not rely on this similarity, which makes pooling the data a good approach.) As increases (i.e., tasks become more heterogeneous), MTL-GMM starts to outperform Pooled-GMM by a large margin. Moreover, MTL-GMM is significantly better than Single-task-GMM in terms of both estimation and mis-clustering errors over a wide range of ; they only become comparable when is very large. These comparisons demonstrate that MTL-GMM not only effectively utilizes the unknown similarity structure among tasks, but also adapts to it.
The results for the case with two outlier tasks are shown in Figure S.4. It is clear that the comparison between MTL-GMM and Single-task-GMM is similar to the one in Figure S.3. What is new here is that even when is very small, MTL-GMM still performs much better than Pooled-GMM, showing the robustness of MTL-GMM against a fraction of outlier tasks. Note that in this simulation, for all , which might explain the phenomenon where Pooled-GMM outperforms MTL-GMM in estimating ’s.
S.3.1.2 Simulation 2
The second simulation is a multi-cluster example, which is built based on Simulation 1. Consider a multi-task learning problem with tasks, where each task has sample size and dimension , and follows a GMM with clusters. For all , we generate independently from with . When , we generate from , where , . When , we generate each from the same Dirichlet distribution and set and from for . For a given , in each replication the outlier task index set is uniformly sampled from all subsets of with cardinality . We consider two cases:
-
(i)
No outlier tasks (), and changes from 0 to 10 with increment 1;
-
(ii)
2 outlier tasks (), and changes from 0 to 10 with increment 1.
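A sketch of the data-generating mechanism described above is given below. The numerical settings (number of tasks, sample size, dimension, number of clusters, perturbation level, and the outlier mechanism) are illustrative placeholders, not the exact values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, p, R, h = 10, 100, 15, 3, 2.0   # illustrative values only

def simulate_task(base_means, h, outlier=False):
    """One task: Dirichlet mixture weights; means = base means + perturbation."""
    w = rng.dirichlet(np.ones(R))
    if outlier:
        means = rng.uniform(-10, 10, size=(R, p))      # arbitrary outlier mechanism (illustrative)
    else:
        means = base_means + rng.uniform(-h, h, size=(R, p))
    z = rng.choice(R, size=n, p=w)
    X = means[z] + rng.standard_normal((n, p))         # identity covariance for simplicity
    return X, z

base_means = np.vstack([np.zeros(p), 3.0 * np.eye(p)[0], 3.0 * np.eye(p)[1]])
tasks = [simulate_task(base_means, h, outlier=(k >= K - 2)) for k in range(K)]
```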
[Figure S.5]
Algorithm 4 is run with the alignment Algorithm 6. Its results, together with those of the other benchmarks, are reported in Figure S.5. The main message is the same as in Simulation 1: Pooled-GMM is sensitive to outlier tasks and suffers from negative transfer when is large, while MTL-GMM is robust to outliers and can adapt to the unknown similarity level . Note that in this example, are similar and are the same; therefore, running the EM algorithm by pooling all the data when is small and there are no outliers may be more effective than our MTL algorithm. This could explain why MTL-GMM performs slightly worse than Pooled-GMM in terms of the maximum mis-clustering error when is small and .
We also provide additional performance evaluations for the three methods in Simulation 2. The results are presented in Figures S.6 and S.7. The main takeaway is the same as in the previous simulation example: Pooled-GMM is sensitive to outlier tasks and suffers from negative transfer when is large, while MTL-GMM is robust to outliers and can adapt to the unknown similarity level . The results verify the theoretical findings in the multi-cluster case.
[Figure S.6]
[Figure S.7]
S.3.1.3 Simulation 3 of MTL
In the third simulation of MTL, we consider a different similarity structure among tasks in and a different type of outlier tasks. For a multi-task learning problem with tasks, set the sample size of each task equal to 100. Let , , and , i.e., the first task is not an outlier task. We generate each from for all . For , we generate as
(S.3.79) |
and set . Here, the value of is determined by for a given . Let and , . In this generation process, for all . The covariance matrix of tasks in can differ. When , we generate the data of task from two clusters with probability and , where . Samples from the second cluster follow , with coming from (S.3.79), , and . For each sample from the first cluster, each component is independently generated from a -distribution with degrees of freedom . In each replication, for given , the outlier task index set is uniformly sampled from all subsets of with cardinality (since task 1 has been fixed in ). We consider two cases:
-
(i)
No outlier tasks (), and changes from 0 to 10 with increment 1;
-
(ii)
2 outlier tasks (), and changes from 0 to 10 with increment 1.
[Figure S.8]
[Figure S.9]
We implement the same three methods as in Simulation 1 and the results are reported in Figures S.8 and S.9. When there are no outlier tasks, both MTL-GMM and Pooled-GMM significantly outperform Single-task-GMM. Note that in this simulation, and for all , which might explain the phenomenon where Pooled-GMM outperforms MTL-GMM in estimating and ’s. When there are two outlier tasks, Figure S.9 shows that Pooled-GMM performs much worse than Single-task-GMM on most of the estimation errors of GMM parameters as well as the mis-clustering error rate. In contrast, MTL-GMM greatly improves the performance of Single-task-GMM, showing the advantage of MTL-GMM when dealing with outlier tasks and heterogeneous covariance matrices.
S.3.1.4 Simulation of TL
Consider a transfer learning problem with source data sets, where all sources are from the same GMM. The setting is modified from Simulation 1 of MTL. The source and target sample sizes are equal to 100. For each of the source and target tasks, and , where , , and . We consider the case where changes from 0 to 10 with increment 1.
We compare five different methods: Target-GMM, fitted on the target data only; MTL-GMM, fitted on all the data; MTL-GMM-center, which fits MTL-GMM on the source data and outputs the estimated “center” as the target estimate (MTL-GMM-center only appears in the comparison of the estimation error of ’s); Pooled-GMM, which fits a merged GMM on all the data; and our TL-GMM. The performance is evaluated by the estimation errors of , , , , , and as well as the mis-clustering error rate calculated on an independent test target data set of size 500. Results are presented in Figure S.10.
[Figure S.10]
Figure S.10 shows that when is small, the performances of MTL-GMM, MTL-GMM-center, Pooled-GMM, and TL-GMM are comparable, and all of them are much better than Target-GMM. This is expected, because the sources are very similar to the target and can be easily used to improve the target task learning. As keeps increasing, the target and sources become increasingly different. This is the phase where the knowledge of the sources needs to be carefully transferred for a possible learning improvement on the target task. As is clear from Figure S.10, MTL-GMM, MTL-GMM-center, and Pooled-GMM do not handle heterogeneous sources well and are thus outperformed by Target-GMM. By contrast, TL-GMM remains effective in transferring source knowledge to improve over Target-GMM; when is very large so that the sources are no longer useful, TL-GMM is robust enough to remain competitive with Target-GMM.
[Figure S.11]
Figure S.11 shows the results when there are two outlier tasks (). It can be seen that TL-GMM is robust to outlier sources.
S.3.1.5 Tuning parameters and in Algorithms 1, 4, and 7
The candidates of and values used in the 10-fold cross-validation are chosen in a data-driven way. For in Algorithm 1, we first determine the smallest value that makes all estimators identical, denoted by . The candidates are then set to be a sequence between and that is equally spaced on the logarithmic scale. For in Algorithm 7, we first determine the smallest value that makes the estimator equal to , denoted by . The candidates are then set to be a sequence between and that is equally spaced on the logarithmic scale.
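For example, given a data-driven upper endpoint (the smallest value that makes all estimators identical) and a lower endpoint, a log-spaced candidate grid can be produced as follows; the endpoint values 0.1 and 10 below are illustrative and happen to reproduce the grid used in the next paragraph.

```python
import numpy as np

def lambda_candidates(lam_min, lam_max, num=10):
    """Candidate penalty values equally spaced on the logarithmic scale."""
    return np.exp(np.linspace(np.log(lam_min), np.log(lam_max), num))

print(lambda_candidates(0.1, 10.0, num=10).round(2))
# [ 0.1   0.17  0.28  0.46  0.77  1.29  2.15  3.59  5.99 10.  ]
```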
We also run MTL-GMM with different values in Simulation 1 to test the impact of the penalty parameter. The results are presented in Figure S.12. The values 1.29, 2.15, 3.59, 5.99, and 10 are the last 5 elements of a sequence from 0.1 to 10 that is equally spaced on the logarithmic scale. It can be seen that with small values such as 1.29 and 2.15, the performance of MTL-GMM is similar to that of Single-task-GMM, although MTL-GMM-2.15 improves considerably on Single-task-GMM when is small. With large values such as 5.99 and 10, MTL-GMM performs similarly to Pooled-GMM when is small while suffering from negative transfer when is large. However, as continues to increase, the performance of MTL-GMM with large values starts to improve and finally becomes similar to that of Single-task-GMM. This phenomenon accords with the theory, which predicts that MTL-GMM achieves the same rate as Single-task-GMM for large . The negative-transfer effect of MTL-GMM with large could be caused by large unknown constants in the upper bound. Comparing Figure S.12 with the figures in Sections 3 and S.3.1.1, we can see that cross-validation enhances the performance of MTL-GMM.
[Figure S.12]
S.3.1.6 Tuning parameter and in Algorithms 1, 4, and 7
We set in Algorithms 1, 4, and 7. We run MTL-GMM with different values in Simulation 1 to test the impact of on the performance. The results are presented in Figure S.13. We tried in Algorithm 1. It can be seen that the lines representing MTL-GMM with different values largely overlap with each other, which shows that the performance of MTL-GMM is very robust to the choice of . In practice, we take for convenience.
[Figure S.13]
S.3.2 Real-data analysis
S.3.2.1 Human activity recognition
The Human Activity Recognition (HAR) Using Smartphones data set contains data collected from 30 volunteers performing six activities (walking, walking upstairs, walking downstairs, sitting, standing, and laying) while wearing a smartphone \citepappanguita2013public. Each observation has 561 time- and frequency-domain variables. Each volunteer can be viewed as a task, and the sample size of each task varies from 281 to 409. The original data set is available at the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones.
Here, we first focus on two activities, standing and laying, and perform clustering without the label information, to test our method in the binary case. This is a binary MTL clustering problem with 30 tasks. The sample size of each task varies from 95 to 179. For each task, in each replication, we use 90% of the samples as training data and hold 10% of the samples as test data.
We first run a principal component analysis (PCA) on the training data of each task and project both the training and test data onto the first 15 principal components. PCA has often been used for dimension reduction in pre-processing the HAR data set \citepappzeng2014convolutional, walse2016pca, aljarrah2019human, duan2023adaptive. We fit Single-task-GMM on each task separately, Pooled-GMM on the merged data from all 30 tasks, and our MTL-GMM with the greedy label-swapping alignment algorithm. The performance of the three methods is evaluated by the mis-clustering error rate on the test data of all 30 tasks. The maximum and average mis-clustering errors among the 30 tasks are calculated in each replication. The mean and standard deviation of these two errors over 200 replications are reported on the left side of Table S.1. To better display the clustering performance on each task, we further generate box plots of the mis-clustering errors of the 30 tasks (averaged over 200 replications) for each method in the left panel of Figure S.14. It is clear that MTL-GMM outperforms both Pooled-GMM and Single-task-GMM.
Table S.1: Mis-clustering errors (mean and standard deviation over 200 replications) on the HAR data.
|            | Binary      |             |             | Multi-cluster |             |             |
| Method     | Single-task | Pooled      | MTL         | Single-task   | Pooled      | MTL         |
| Max. error | 0.49 (0.02) | 0.40 (0.12) | 0.37 (0.09) | 0.51 (0.04)   | 0.50 (0.04) | 0.51 (0.04) |
| Avg. error | 0.28 (0.02) | 0.18 (0.18) | 0.04 (0.04) | 0.25 (0.01)   | 0.35 (0.03) | 0.25 (0.01) |
[Figure S.14]
Next, we consider all six activities and compare the performance of the three approaches using the same sample-splitting strategy, to test our method in a multi-cluster scenario. Now the sample size of each task varies from 281 to 409. The maximum and average mis-clustering error rates and standard deviations over 200 replications are reported on the right side of Table S.1. We can see that Pooled-GMM might suffer from negative transfer with a worse performance than the other two methods, while MTL-GMM and Single-task-GMM have similar performances. The right plot in Figure S.14 reveals the same comparison results.
In summary, the HAR data set exhibits different levels of similarity in the binary and multi-cluster cases: tasks in the binary data are sufficiently similar that Pooled-GMM achieves a large margin of improvement over Single-task-GMM, while tasks in the multi-cluster data are much more heterogeneous, resulting in degraded performance of Pooled-GMM compared to Single-task-GMM. Nevertheless, our method MTL-GMM performs competitively with, or better than, the better of the two, regardless of the similarity level. These results lend further support to the effectiveness of our method.
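As a reference point for the pre-processing and single-task baseline described above, the sketch below projects one task's data onto its leading principal components and fits a two-component Gaussian mixture with scikit-learn, evaluating the error up to a label swap. It uses random placeholder data instead of the HAR features and is only a single-task illustration, not the MTL-GMM procedure.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 561))   # placeholder for one task's 561 HAR features
y = rng.integers(0, 2, size=150)      # placeholder labels (e.g., standing vs. laying)

# project onto the first 15 principal components, as in the pre-processing step
Z = PCA(n_components=15).fit_transform(X)

# single-task baseline: a two-component Gaussian mixture with a shared covariance
gmm = GaussianMixture(n_components=2, covariance_type="tied", random_state=0).fit(Z)
y_hat = gmm.predict(Z)

# mis-clustering error up to the binary label swap
err = min(np.mean(y_hat != y), np.mean((1 - y_hat) != y))
print(f"mis-clustering error: {err:.3f}")
```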
S.3.2.2 Pen-based recognition of handwritten digits (PRHD)
The Pen-based Recognition of Handwritten Digits (PRHD) data set contains 250 samples from each of 44 writers. Each writer was asked to write the digits 0-9 on a pressure-sensitive tablet with an integrated LCD display and a cordless stylus. The x and y tablet coordinates and pressure level values of the pen were recorded. After some transformations, each observation has 16 features. The data set and more information about it are available at the UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/81/pen+based+recognition+of+handwritten+digits.
Similar to the previous real-data example, we first focus on a binary clustering problem by clustering observations of digits 8 and 9. The number of observations varies between 47 and 48 among the 44 tasks, showing that this is a more balanced data set with a smaller sample size (per dimension) than the HAR data. For each task, in each replication, we use 90% of the samples as training data and hold 10% of the samples as test data. The maximum and average mis-clustering error rates and standard deviations over 200 replications are reported on the left side of Table S.2, and the box plots of mis-clustering errors of 44 tasks are shown in Figure S.15. We can see that Pooled-GMM and MTL-GMM perform similarly and are much better than Single-task-GMM.
Table S.2: Mis-clustering errors (mean and standard deviation over 200 replications) on the PRHD data.
|            | Binary      |             |             | Multi-cluster |             |             |
| Method     | Single-task | Pooled      | MTL         | Single-task   | Pooled      | MTL         |
| Max. error | 0.32 (0.10) | 0.03 (0.07) | 0.03 (0.09) | 0.26 (0.07)   | 0.37 (0.06) | 0.27 (0.07) |
| Avg. error | 0.02 (0.01) | 0.00 (0.00) | 0.00 (0.02) | 0.03 (0.01)   | 0.12 (0.01) | 0.03 (0.01) |
[Figure S.15]
Next, we consider the observations of digits 5-9, i.e., a 5-class clustering problem. The maximum and average mis-clustering error rates and standard deviations over 200 replications are reported on the right side of Table S.2, and the box plots of the mis-clustering errors of the 44 tasks are shown in Figure S.15. In this multi-cluster case, MTL-GMM and Single-task-GMM have similar performance, which is better than that of Pooled-GMM. As in the first real-data example, our method MTL-GMM adapts to the unknown similarity and is competitive with the best of the other two methods.
S.4 Technical Lemmas
S.4.1 General lemmas
Denote the unit ball and the unit sphere .
Lemma 1 (Covering number of the unit ball under Euclidean norm, Example 5.8 in \citealpappwainwright2019high).
Denote the -covering number of a unit ball in under Euclidean norm as , where the centers of covering balls are required to be on the sphere. We have .
Lemma 2 (Packing number of the unit sphere under Euclidean norm).
Denote the -packing number of the unit sphere in under Euclidean norm as . When , we have .
Lemma 3 (Fano’s lemma, see Chapter 2 of \citealpapptsybakov2009introduction, Chapter 15 of \citealpappwainwright2019high).
Suppose is a metric space and each in this space is associated with a probability measure . If is an -separated set (i.e. for any ), and , then
(S.4.80) |
where .
Lemma 4 (Packing number of the unit sphere in a quadrant under Euclidean norm).
In , we can use a vector to indicate each quadrant . Then when , there exists a quadrant such that .
Lemma 5.
For one-dimensional Gaussian mixture variable with , it is a -subGaussian variable. That means,
(S.4.81) |
Lemma 6 (\citealpappduan2023adaptive).
Let
(S.4.82) |
Suppose there exists such that the following conditions are satisfied:
-
(i)
For any , is -regular, that is
-
•
is convex and twice differentiable;
-
•
for all ;
-
•
.
-
•
-
(ii)
, , with .
Then we have the following conclusions:
-
(i)
for all .
-
(ii)
If
(S.4.83) where , , and , then
(S.4.84) Furthermore, if we also have
(S.4.85) then for all , and
(S.4.86)
Lemma 7.
Suppose
(S.4.87) |
with some . Assume is convex and twice differentiable, and for any . Then
-
(i)
, for any and ;
-
(ii)
, if and .
S.5 Proofs
S.5.1 Proofs of general lemmas
S.5.1.1 Proof of Lemma 2
The second inequality follows from Lemma 1, so it suffices to show the first one. For any , define . Then we can define a mapping
(S.5.88) |
with for and . It is easy to see that for any , , we have . Therefore, if is an -cover of under the Euclidean norm, then it must be an -cover of under the Euclidean norm. Then
(S.5.89) |
S.5.1.2 Proof of Lemma 4
If is an -packing of under Euclidean norm, then must be an -packing of under Euclidean norm. Then by Lemma 2,
(S.5.90) | ||||
(S.5.91) | ||||
(S.5.92) |
implying
(S.5.93) |
S.5.1.3 Proof of Lemma 5
Suppose . Then we can write , where is independent of and . Then
(S.5.94) | ||||
(S.5.95) | ||||
(S.5.96) |
where the second-to-last inequality follows from Jensen’s inequality and the independence of , , and . This completes the proof.
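For completeness, here is one standard moment-generating-function calculation behind this lemma (it differs slightly from the Jensen-based route above), assuming the symmetric two-component mixture y = δμ + σw with δ uniform on {±1}, w ~ N(0,1), and δ independent of w; this decomposition and notation are ours.

```latex
\mathbb{E}\,e^{\lambda y}
  = \mathbb{E}\,e^{\lambda\delta\mu}\cdot\mathbb{E}\,e^{\lambda\sigma w}
  = \cosh(\lambda\mu)\,e^{\lambda^{2}\sigma^{2}/2}
  \le e^{\lambda^{2}\mu^{2}/2}\,e^{\lambda^{2}\sigma^{2}/2}
  = e^{\lambda^{2}(\mu^{2}+\sigma^{2})/2}
  \qquad \text{for all } \lambda\in\mathbb{R},
```

using the elementary bound cosh(x) ≤ exp(x²/2), so y is sub-Gaussian with parameter √(μ² + σ²).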
S.5.1.4 Proof of Lemma 6
The result follows from Theorem A.2, Lemma B.1, and Claim B.1 in \citeappduan2023adaptive.
S.5.2 Proof of Theorem 1
Define the contraction basin of one GMM as
(S.5.97) | ||||
(S.5.98) |
which we may abbreviate as in the following. Given the index set , two joint contraction basins are defined as
(S.5.99) | ||||
(S.5.100) |
For simplicity, at some places, we will write them as and , respectively.
For and , define
(S.5.101) |
And denote the minimum SNR .
S.5.2.1 Lemmas
For GMM and any , define
(S.5.102) | ||||
(S.5.103) |
Lemma 8 (Contraction of binary GMMs, a special case of Lemma 37 when ).
When with a small constant and with a large constant , there exist positive constants and such that, for any ,
(S.5.104) |
where with a constant .
Lemma 9.
When , .
Lemma 10 (Theorem 3 in \citealpappmaurer2021concentration).
Let and be a vector of independent random variables with values in a space . Then for any we have
(S.5.105) |
where as a random function of is defined to be , the sub-Gaussian norm , and .
Lemma 11.
Suppose Assumption 1 holds.
-
(i)
With probability at least ,
(S.5.106) for all .
-
(ii)
With probability at least ,
(S.5.107)
Lemma 12.
Suppose Assumption 1 holds.
-
(i)
With probability at least ,
(S.5.108) (S.5.109) for all .
-
(ii)
With probability at least ,
(S.5.110) (S.5.111)
Lemma 13.
Suppose Assumption 1 holds.
-
(i)
With probability at least ,
(S.5.112) for all .
-
(ii)
With probability at least ,
(S.5.113) -
(iii)
With probability at least ,
(S.5.114)
Lemma 14.
Suppose Assumption 1 holds.
-
(i)
With probability at least ,
(S.5.115) for all .
-
(ii)
With probability at least ,
(S.5.116)
S.5.2.2 Main proof of Theorem 1
The proof of Theorem 1 consists of two cases. In Case 1, we study the scenario , where we take a fixed contraction radius. In this case, proving a single-task estimation error rate is sufficient, which is relatively straightforward. In Case 2, we explore the scenario , a regime in which multi-task learning can outperform classical single-task learning. In this case, the classical finite-sample analysis of EM in \citeappbalakrishnan2017statistical and \citeappcai2019chime, which uses a fixed contraction radius as we did in Case 1, does not work. This is because the heterogeneous and lead to an error of when estimating . This term ultimately affects the estimation errors of and , preventing us from proving the improvement of multi-task learning over single-task learning. To resolve this issue, we use a “localization” strategy to adaptively shrink the contraction radius in each iteration. This method effectively eliminates the term . By combining the two cases, we complete the proof.
(I) Case 1: Let us consider the case that . Consider an event defined to be the intersection of the events in Lemmas 11.(i), 12.(i), 13.(i), and 14.(i), with a large constant , which satisfies . Throughout the analysis in Case 1, we condition on ; therefore all the arguments hold with probability at least .
Consider the case . Lemma 6 tells us that when , we have
(S.5.117) |
And if further , we have (S.5.117) holds with for all . Note that
(S.5.118) |
And the first term on the RHS can be controlled as
(S.5.119) | |||
(S.5.120) | |||
(S.5.121) | |||
(S.5.122) |
Conditioned on , we have
(S.5.123) |
And
(S.5.124) | ||||
(S.5.125) |
where
(S.5.126) |
Before we discuss how to control the terms on the RHS, let us first try to control as it will be used to bound the existing terms. Note that by Lemma 8,
(S.5.127) | ||||
(S.5.128) | ||||
(S.5.129) | ||||
(S.5.130) |
where is a small constant. By Lemma 8 again,
(S.5.131) | ||||
(S.5.132) | ||||
(S.5.133) | ||||
(S.5.134) | ||||
(S.5.135) | ||||
(S.5.136) | ||||
(S.5.137) |
Therefore, we can bound the RHS of (S.5.126) as
(S.5.138) |
Similarly, we have
(S.5.139) |
Combining (S.5.138) and (S.5.139), we have
(S.5.140) |
Similarly, we can bound in the same way, and get
(S.5.141) |
Hence
(S.5.142) |
And the second term on the RHS of (S.5.118) satisfies
(S.5.143) |
Altogether, we have
(S.5.144) |
This implies that , therefore by (S.5.117),
(S.5.145) |
And by (S.5.129),
(S.5.146) |
Also,
(S.5.147) | ||||
(S.5.148) | ||||
(S.5.149) |
which entails that
(S.5.150) |
Combining (S.5.145), (S.5.146), and (S.5.150), we have
(S.5.151) |
Also,
(S.5.152) |
Assuming (S.5.145), (S.5.151), and (S.5.152) hold for all , the same analysis shows that (S.5.144) holds again for . Hence
(S.5.153) |
Then by (S.5.152) when ,
(S.5.154) | ||||
(S.5.155) | ||||
(S.5.156) |
where we need with a large constant . Recall that is one of the tuning parameters in the update formula of . Therefore we can follow the same arguments as above to obtain (S.5.129), (S.5.143), (S.5.145), (S.5.149), (S.5.151), (S.5.152) for .
So far, we have shown that (S.5.129), (S.5.143), (S.5.145), (S.5.149), (S.5.151), (S.5.152) hold for any . By the update formula of , when , we have
(S.5.157) |
Therefore by (S.5.151),
(S.5.158) | ||||
(S.5.159) | ||||
(S.5.160) | ||||
(S.5.161) |
Consider a new event defined to be the intersection of the events in Lemmas 11.(i), 12.(i), 13.(i), and 14.(i), with , which satisfies . Throughout the following analysis in Case 1, we condition on , therefore all the arguments hold with probability at least . When , since , we have . Furthermore, when with a large , we have and , where we used the fact again to get the second inequality.
Then by (S.5.129),
(S.5.165) | ||||
(S.5.166) | ||||
(S.5.167) | ||||
(S.5.168) |
Similarly, by (S.5.149),
(S.5.169) | ||||
(S.5.170) |
Therefore,
(S.5.171) | ||||
(S.5.172) | ||||
(S.5.173) | ||||
(S.5.174) |
Combining this with (S.5.164), we obtain that
(S.5.175) |
Plugging this back into (S.5.139), we get
(S.5.176) |
And the same bound holds for as well. The same bound for can be obtained in the same spirit as in (S.5.122).
(II) Case 2: We now focus on the case that . As mentioned at the beginning of this proof, we need to adaptively shrink the radius of the contraction basin to prove the desired convergence rate. The analysis of Case 2 can be divided into two stages. In the first stage, we use the same fixed contraction radius as in Case 1 and follow the same analysis until the iterative error has been reduced to the single-task error . In the second stage, we apply the localization argument to shrink the contraction basin until we achieve the desired rate of convergence.
Similar to Case 1, we consider an event defined to be the intersection of the events in Lemmas 11, 12, 13, and 14, with a large constant , which satisfies . Throughout the analysis in Case 2, we condition on , therefore all the arguments hold with probability at least .
Consider as the number of iterations in the first stage which satisfies . When , we can go through the same analysis as in Case 1, and show that conditioned on ,
(S.5.177) |
and
(S.5.178) | |||
(S.5.179) | |||
(S.5.180) |
Since , the rates above are the desired rates. In the following, we will derive the results for the case .
Define
(S.5.181) | ||||
(S.5.182) | ||||
(S.5.183) |
Consider an event defined to be the intersection of the events in Lemmas 11, 12, 13, and 14, with , which satisfies . In the following, we condition on , therefore all the arguments hold with probability at least .
Let . Since , by Lemma 6, we also have for all . Similar to (S.5.129),
(S.5.184) | ||||
(S.5.185) | ||||
(S.5.186) | ||||
(S.5.187) |
This implies that
(S.5.188) |
And by Lemma 6,
(S.5.189) | ||||
(S.5.190) |
where
(S.5.191) | ||||
(S.5.192) |
And the first term on the RHS can be controlled as
(S.5.193) | |||
(S.5.194) | |||
(S.5.195) | |||
(S.5.196) |
Conditioned on , we have
(S.5.197) |
And
(S.5.198) | ||||
(S.5.199) |
where
(S.5.200) |
For the first term, we have
(S.5.201) | |||
(S.5.202) | |||
(S.5.203) | |||
(S.5.204) | |||
(S.5.205) |
where is some constant such that under event . Note that the last inequality holds because the expectation is w.r.t. which is independent of . By Lemma 13.(iii) and the definition of , the second term can be bounded as
(S.5.206) |
Therefore,
(S.5.207) |
On the other hand, by simple calculations, we have
(S.5.208) | |||
(S.5.209) | |||
(S.5.210) | |||
(S.5.211) | |||
(S.5.212) | |||
(S.5.213) | |||
(S.5.214) | |||
(S.5.215) |
Hence
(S.5.216) |
A similar discussion leads to the same bound for . Therefore,
(S.5.217) |
And the same bound holds for , which can be shown in the same spirit. Putting all the pieces together,
(S.5.218) |
Therefore by Lemma 6 and (S.5.190), we have for all , and
(S.5.219) | ||||
(S.5.220) | ||||
(S.5.221) | ||||
(S.5.222) |
This entails that
(S.5.223) |
This implies that
(S.5.224) |
Also,
(S.5.225) | ||||
(S.5.226) | ||||
(S.5.227) | ||||
(S.5.228) | ||||
(S.5.229) | ||||
(S.5.230) | ||||
(S.5.231) | ||||
(S.5.232) |
Therefore,
(S.5.233) | ||||
(S.5.234) |
Hence
(S.5.235) | ||||
(S.5.236) | ||||
(S.5.237) | ||||
(S.5.238) |
and
(S.5.239) | ||||
(S.5.240) | ||||
(S.5.241) | ||||
(S.5.242) | ||||
(S.5.243) | ||||
(S.5.244) |
where .
Therefore, for , where hence , we have updating formulas (S.5.222) and (S.5.223) for . We can get
(S.5.245) | ||||
(S.5.246) | ||||
(S.5.247) |
and
(S.5.248) | ||||
(S.5.249) | ||||
(S.5.250) |
with
(S.5.251) |
where the last inequality is due to .
Consider a series of events, each defined to be the intersection of the events in Lemmas 11, 12, 13, and 14, with , which satisfies and hence . In the following, we condition on ; therefore all the arguments hold with probability at least . Then, for , (S.5.244) holds, which leads to
(S.5.252) | ||||
(S.5.253) | ||||
(S.5.254) | ||||
(S.5.255) | ||||
(S.5.256) | ||||
(S.5.257) | ||||
(S.5.258) |
where , which provides the desired rate for . When , by (S.5.251), we have
(S.5.259) | ||||
(S.5.260) |
which is the desired rate. This completes the proof of Theorem 1.
S.5.2.3 Proofs of lemmas
Proof of Lemma 11.
We prove part (i) first.
Denote . By the bounded difference inequality,
(S.5.261) |
with probability at least . By the generalized symmetrization inequality (Proposition 4.11 in \citeappwainwright2019high), with i.i.d. Rademacher variables ,
(S.5.262) | ||||
(S.5.263) | ||||
(S.5.264) |
where . Denote . By the contraction inequality for Rademacher variables (Theorem 11.6 in \citealpappboucheron2013concentration),
RHS of (S.5.261) | (S.5.266) | |||
(S.5.267) | ||||
(S.5.268) | ||||
(S.5.269) | ||||
(S.5.270) | ||||
(S.5.271) | ||||
(S.5.272) |
Since and are i.i.d. sub-Gaussian variables, we know that
(S.5.273) |
Suppose is a -cover of with (see Example 5.8 in \citealpappwainwright2019high). Hence by standard arguments,
(S.5.274) | |||
(S.5.275) |
Again, since are i.i.d. sub-Gaussian variables,
(S.5.276) |
Putting all the pieces together,
(S.5.277) |
Combining (S.5.261) and (S.5.277), we get the result in (i).
Next, we derive part (ii) using a similar analysis.
Denote .
By the same standard symmetrization and contraction arguments used in part (i), with i.i.d. Rademacher variables , for any , we have
(S.5.278) | |||
(S.5.279) | |||
(S.5.280) | |||
(S.5.281) | |||
(S.5.282) | |||
(S.5.283) |
where . Denote . Suppose is a -cover of with . Then by the Cauchy-Schwarz inequality and standard arguments,
(S.5.284) | |||
(S.5.285) | |||
(S.5.286) | |||
(S.5.287) | |||
(S.5.288) | |||
(S.5.289) | |||
(S.5.290) | |||
(S.5.291) | |||
(S.5.292) | |||
(S.5.293) |
Since , , and are independent sub-Gaussian variables, we can bound the three terms on the RHS as
(S.5.294) | ||||
(S.5.295) | ||||
(S.5.296) |
Putting all pieces together,
(S.5.297) |
Therefore, for any
(S.5.298) |
Letting and , we have
(S.5.299) |
which completes the proof. ∎
Proof of Lemma 12.
The proof of part (ii) is the same as the proof of part (ii) of Lemma 11, so we omit it. The only difference between the proofs of part (i) for the two lemmas is that here the bounded difference inequality is not available. Denote
(S.5.300) |
We need to use Lemma 10 to upper bound . Prior to that, we first verify the conditions required by the lemma. Fix , …, , , …, , and define
(S.5.301) |
By triangle inequality,
(S.5.302) | |||
(S.5.303) | |||
(S.5.304) | |||
(S.5.305) |
Note that , where
(S.5.306) | ||||
(S.5.307) | ||||
(S.5.308) |
Therefore . Hence by applying Lemma 10, we have
(S.5.309) |
with probability at least . ∎
Proof of Lemma 13.
For part (i), denote
(S.5.310) |
Suppose is a -cover of with . Define . Then by the generalized symmetrization inequality (Proposition 4.11 in \citealpappwainwright2019high), with i.i.d. Rademacher variables , for any ,
(S.5.311)–(S.5.320)
Note that since are i.i.d. sub-exponential variables and are i.i.d. sub-Gaussian variables, we have
(S.5.321)–(S.5.322)
where the first inequality holds when where is small. Therefore,
(S.5.323) |
when . The desired result follows from Chernoff’s bound.
The proofs of parts (ii) and (iii) are almost the same as the proofs of part (ii) of Lemma 11, so we do not repeat them here.
∎
Proof of Lemma 14.
Note that
(S.5.324) |
The bound on the RHS comes from Theorem 6.5 in \citeappwainwright2019high, and the bound in part (ii) can be proved in the same way. ∎
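The covariance deviation bound supplied by that theorem is of the following type (stated here for i.i.d. zero-mean $d$-dimensional sub-Gaussian vectors with parameter $\sigma$ and sample covariance $\widehat{\Sigma}$; the symbols are generic placeholders and the precise constants vary across references): with probability at least $1-2e^{-t}$,
\[
\bigl\|\widehat{\Sigma}-\Sigma\bigr\|_2\;\lesssim\;\sigma^{2}\Bigl(\sqrt{\frac{d+t}{n}}+\frac{d+t}{n}\Bigr).
\]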
S.5.3 Proof of Theorem 3
S.5.3.1 Lemmas
Recall the parameter space
(S.5.325) |
where and .
Lemma 15 (Lemma 8.4 in \citealpappcai2019chime).
For any , and , denote and . Then
(S.5.326) |
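A basic building block behind such GMM divergence bounds is the exact KL divergence between two Gaussians with a common covariance matrix:
\[
\mathrm{KL}\bigl(\mathcal{N}(\mu_1,\Sigma)\,\big\|\,\mathcal{N}(\mu_2,\Sigma)\bigr)
=\tfrac12\,(\mu_1-\mu_2)^{\top}\Sigma^{-1}(\mu_1-\mu_2).
\]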
Lemma 16.
For any , , , and , denote and . Then
(S.5.327) |
Lemma 17.
Denote distribution as for any , where . Then
(S.5.328) |
Lemma 18.
Denote distribution as for any . Then
(S.5.329) |
Lemma 19.
When there exists a subset such that with some constant , we have
(S.5.330)–(S.5.331)
Lemma 20.
Denote . Then
(S.5.332) |
Lemma 21 (The first variant of Theorem 5.1 in \citealpappchen2018robust).
Given a series of distributions , each of which is indexed by the same parameter . Consider independently for . Denote the joint distribution of as . Then
(S.5.333) |
where .
Lemma 22.
Suppose . Consider two data generating mechanisms:
(i) independently for , where ;
(ii) With a preserved set , generate and independently for .
Denote the joint distributions of in (i) and (ii) as and , respectively. We claim that if
(S.5.334) |
then
(S.5.335) |
where for any .
Lemma 23.
When there exists a subset such that with some constant , we have
(S.5.336)–(S.5.337)
S.5.3.2 Main proof of Theorem 3
S.5.3.3 Proofs of lemmas
Proof of Lemma 17.
Denote , , and .
By Taylor expansion,
(S.5.338) |
where is between and . By the property of score function,
(S.5.339) |
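The score-function property used for (S.5.339) is the standard identity that, under regularity conditions permitting the interchange of differentiation and integration,
\[
\mathbb{E}_{\theta}\bigl[\nabla_{\theta}\log f_{\theta}(X)\bigr]
=\int\frac{\nabla_{\theta}f_{\theta}(x)}{f_{\theta}(x)}\,f_{\theta}(x)\,dx
=\nabla_{\theta}\int f_{\theta}(x)\,dx=0 .
\]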
Besides,
(S.5.340) |
where . Note that
(S.5.341) |
for any . Therefore,
(S.5.342)–(S.5.344)
which completes the proof. ∎
Proof of Lemma 18.
Recall that we denote distribution as for any . By the bi-convexity of KL divergence, we have
(S.5.345)–(S.5.347)
which completes the proof. ∎
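The convexity property invoked in this proof is the joint convexity of the KL divergence (which in particular implies convexity in each argument): for any distributions $P_1,P_2,Q_1,Q_2$ and any $\lambda\in[0,1]$,
\[
\mathrm{KL}\bigl(\lambda P_1+(1-\lambda)P_2\,\big\|\,\lambda Q_1+(1-\lambda)Q_2\bigr)
\;\le\;\lambda\,\mathrm{KL}(P_1\|Q_1)+(1-\lambda)\,\mathrm{KL}(P_2\|Q_2).
\]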
Proof of Lemma 19.
WLOG, suppose . It’s easy to see that given any , , where
(S.5.348)–(S.5.351)
(i) By fixing an and a , we want to show
(S.5.352) |
By Lemma 4, there exist a quadrant of and a -packing of under the Euclidean norm: , where with a small constant and when . For any , denote distribution as , where can be any vector in with . Then
LHS: (S.5.353)–(S.5.354)
where the last inequality holds because it suffices to consider estimators satisfying almost surely. In addition, for any , , .
(ii) By fixing an and a , we want to show
(S.5.361) |
WLOG, suppose . We have
(S.5.362) |
By Lemma 4, there exist a quadrant of and a -packing of under the Euclidean norm: , where with a small constant and when . WLOG, assume . Denote . Let for all . And let with . For any , denote distribution as . Then similar to the arguments in (i),
LHS: (S.5.363)–(S.5.364)
Then by Lemma 15,
(S.5.365)–(S.5.369)
when and . By Fano’s lemma (See Corollary 2.6 in \citealpapptsybakov2009introduction),
(S.5.370) |
when , and .
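In paraphrase, the version of Fano's lemma used above states: if $\theta_0,\dots,\theta_M$ (with $M\ge 2$) are pairwise $2s$-separated under the loss $d$ and $\frac1M\sum_{j=1}^{M}\mathrm{KL}(P_{\theta_j},P_{\theta_0})\le\alpha\log M$ for some $0<\alpha<1/8$, then
\[
\inf_{\widehat{\theta}}\max_{0\le j\le M}
\mathbb{P}_{\theta_j}\bigl(d(\widehat{\theta},\theta_j)\ge s\bigr)
\;\ge\;\frac{\sqrt{M}}{1+\sqrt{M}}\Bigl(1-2\alpha-\sqrt{\frac{2\alpha}{\log M}}\Bigr)>0 .
\]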
(iii) By fixing an and a , we want to show
(S.5.371) |
Suppose and denote the associated quadrant , . Let with a small constant for . For any , where , denote distribution as , and the joint distribution of and as . And denote distribution as for any . Similar to the arguments in (i), since it suffices to consider the estimators satisfying almost surely and for any , , we have
LHS: (S.5.372)–(S.5.374)
Consider where for and , where . Define two new “distances” (which are not distances in the strict sense because the triangle inequality and definiteness do not hold) between and as
(S.5.375)–(S.5.376)
Therefore when . For , define . Because for any , and , it’s easy to see that
(S.5.377)–(S.5.380)
By Lemma 15,
(S.5.381)–(S.5.386)
when . By Fano’s lemma (See Corollary 2.6 in \citealpapptsybakov2009introduction),
(S.5.387) |
when , , and .
(iv) We want to show
(S.5.388)–(S.5.389)
The argument is similar to part (iii). The only two differences here are that the dimension of the parameter of interest equals 1, and Lemma 15 is replaced by Lemma 17.
(v) We want to show
(S.5.390)–(S.5.391)
The argument is similar to (iii). The only two differences here are that the dimension of the parameter of interest equals 1, and Lemma 15 is replaced by Lemma 18.
Finally, we get the desired conclusion by combining (i)-(v). ∎
Proof of Lemma 20.
Proof of Lemma 21.
The proof is similar to the proof of Theorem 5.1 in \citeappchen2018robust, so we omit it here. ∎
Proof of Lemma 22.
It’s easy to see that
(S.5.394) |
where can be any probability measure on all subsets of with .
Consider a special distribution as :
(S.5.395) |
for any with . Given , consider the distribution of as
(S.5.396) |
Then consider the distribution of as
(S.5.397) |
It’s easy to see that is the same as conditioning on the event . Therefore,
(S.5.398)–(S.5.405)
where the third-to-last inequality comes from Bernstein’s inequality, an application of Lemma 17, and the fact that . ∎
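One classical form of Bernstein's inequality (the variant invoked above may differ in constants) is: for independent variables $Y_1,\dots,Y_n$ with $\mathbb{E}Y_i=0$, $|Y_i|\le b$, and $v=\sum_{i=1}^{n}\mathbb{E}Y_i^{2}$,
\[
\mathbb{P}\Bigl(\sum_{i=1}^{n}Y_i\ge t\Bigr)\;\le\;\exp\Bigl(-\frac{t^{2}}{2(v+bt/3)}\Bigr),
\qquad t>0 .
\]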
Proof of Lemma 23.
(i) We want to show
(S.5.406) |
Fix and some . WLOG, assume . Then it suffices to show
(S.5.407) |
Consider a special subset of as
(S.5.408) |
where
(S.5.409) |
and is a small constant which we will specify later. For any , denote as . Therefore it suffices to show
(S.5.410) |
Note that for any , we can define . Then by triangle inequality and definition of , . Therefore
(S.5.411) |
Let where is a small constant. Since for any and , by Lemma D.2 in \citeappduan2023adaptive,
(S.5.412) |
Applying Assouad’s lemma (Theorem 2.12 in \citealpapptsybakov2009introduction or Lemma 2 in \citealpappcai2012optimal), we get
(S.5.413)–(S.5.414)
where is the Hamming distance. For the first term on the RHS, it’s easy to see that
(S.5.415) |
for any and . For the second term, by the density form of Gaussian distribution, we can show that if , then
(S.5.416)–(S.5.419)
Plugging this back into (S.5.414) and combining with (S.5.412), we have
(S.5.420) |
when and .
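In the form relevant to (S.5.413) (up to the exact constant, which depends on the version cited), Assouad's lemma can be stated as: if $d(\theta_\omega,\theta_{\omega'})\ge 2s\,\rho(\omega,\omega')$ for all $\omega,\omega'\in\{0,1\}^{m}$, with $\rho$ the Hamming distance, then
\[
\inf_{\widehat{\theta}}\max_{\omega\in\{0,1\}^{m}}
\mathbb{E}_{\omega}\,d(\widehat{\theta},\theta_\omega)
\;\ge\;\frac{ms}{2}\min_{\rho(\omega,\omega')=1}\bigl(1-\mathrm{TV}(P_\omega,P_{\omega'})\bigr).
\]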
(ii) We want to show
(S.5.421) |
The proof idea is similar to part (iii) of the proof of Lemma 19, so we omit the details here. It suffices to consider where when and .
∎
S.5.4 Proof of Theorem 2
We claim that with probability at least ,
(S.5.422) |
Then the conclusion immediately follows from Theorem 1. Hence it suffices to verify the claim. For convenience, we write simply as and as .
By simple calculations, we have
(S.5.423)–(S.5.426)
Then by Taylor expansion,
(S.5.427)–(S.5.432)
Denote and . By plugging in the density formula of standard Gaussian distribution, it is easy to see that
(S.5.433)–(S.5.440)
with probability at least , where and , so . Only the last inequality in (S.5.440) holds with high probability; the others are deterministic. It follows from the fact that for some with probability at least together with a direct application of Lemma 8.1 in \citeappcai2019chime. On the other hand, it is easy to see that . Combining these two facts leads to (S.5.422).
S.5.5 Proof of Theorem 4
S.5.5.1 Lemmas
Recall that for GMM associated with parameter set , we define the mis-clustering error rate of any classifier as , where represents the distribution of , i.e. . Denote as the Bayes classifier corresponding to . Define a surrogate loss .
Lemma 24.
Assume there exists a subset such that and with some constants . We have
(S.5.441)–(S.5.442)
Lemma 25.
Suppose satisfies with some constant and . Then there exists such that
(S.5.443) |
for any classifier , where , , and is the corresponding Bayes classifier.
Lemma 26.
Consider and satisfying with some constant . We have
(S.5.444) |
for some constants , .
Lemma 27.
Consider and satisfying , , , , , , , , and . We have
(S.5.445) |
for some constants , .
Lemma 28.
Denote . We have
(S.5.446) |
S.5.5.2 Main proof of Theorem 4
S.5.5.3 Proof of lemmas
Proof of Lemma 24.
Recall the definitions and proof idea of Lemma 19. We have , where
(S.5.447)–(S.5.451)
Recall the mis-clustering error for GMM associated with parameter set of any classifier is . To help the analysis, following \citeappazizyan2013minimax and \citeappcai2019chime, we define a surrogate loss , where is the Bayes classifier. Suppose .
(i) We want to show
(S.5.452) |
Consider and space . And
(S.5.453) |
Let with some small constant . For any , denote distribution as . Consider a -packing of : . By Lemma 2, . Denote , where . Then by definition of KL divergence and Lemma 8.4 in \citeappcai2019chime,
(S.5.454)–(S.5.458)
For simplicity, we write with and as . By Lemma 8.5 in \citeappcai2019chime,
(S.5.459) |
where . The last inequality holds because and when . Then by Lemma 3.5 in \citeappcai2019chime (Proposition 2 in \citealpappazizyan2013minimax), for any classifier , and ,
(S.5.460) |
For any , consider a test . Therefore if there exists such that , then by (S.5.460), we must have . Let , then by Fano’s lemma (Corollary 6 in \citealpapptsybakov2009introduction)
(S.5.461)–(S.5.465)
(ii) We want to show
(S.5.466) |
Fix an and a . Suppose . We have
(S.5.467) |
Let with a small constant . For any , denote distribution as . Consider a -packing of . By Lemma 2, . Denote . WLOG, assume . Let for all . Then by following the same arguments in (i) and part (ii) of the proof of Lemma 19, we can show that the RHS of (S.5.467) is larger than or equal to when .
(iii) We want to show
(S.5.468) |
This can be proved by following similar ideas used in step (iii) of the proof of Lemma 19, so we omit the proof here.
(iv) We want to show
(S.5.469) |
This can be similarly proved by following the arguments in part (i) with Lemmas 25 and 26.
(v) We want to show
(S.5.470) |
This can be similarly proved by following the arguments in part (i) with Lemmas 25 and 27.
Finally, we get the desired conclusion by combining (i)-(v). ∎
Proof of Lemma 25.
We follow a proof idea similar to the one used in the proof of Lemma 3.4 in \citeappcai2019chime. Let and be the densities of and , respectively. Denote and for any classifier . Note that . The permutation actually does not matter in the proof. WLOG, we drop the permutations in the definitions of the misclassification error and the surrogate loss by assuming to be the identity function. If is not the identity in the definition of , for example, we can define instead and all the following steps still go through.
By definition,
(S.5.471)–(S.5.472)
which leads to
(S.5.473)–(S.5.479)
where we let with . This completes the proof. The second-to-last inequality relies on the fact that
(S.5.480) |
holds for all . This is because
(S.5.481)–(S.5.489)
when . Note that (S.5.489) implies that a binary GMM under the separation assumption has Tsybakov’s margin with margin parameter . For the notion of Tsybakov’s margin, see \citeappaudibert2007fast. We will prove a more general result showing that a multi-cluster GMM under the separation assumption also has Tsybakov’s margin with margin parameter . This turns out to be useful in proving the upper and lower bounds of misclassification error. ∎
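For reference, one standard formulation of Tsybakov's margin condition with margin parameter $\alpha$ (here $\alpha=1$) is: writing $\eta(x)=\mathbb{P}(Y=1\mid X=x)$, there exist constants $C_0>0$ and $t_0>0$ such that
\[
\mathbb{P}_X\bigl(0<|\eta(X)-\tfrac12|\le t\bigr)\;\le\;C_0\,t^{\alpha}
\qquad\text{for all } t\in(0,t_0] .
\]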
Proof of Lemma 26.
WLOG, suppose . Similar to (S.5.486), it’s easy to see that
(S.5.490)–(S.5.494)
On the other hand,
(S.5.495)–(S.5.496)
which completes the proof. ∎
Proof of Lemma 28.
By Lemma 25, it suffices to prove
(S.5.497) |
For any , denote distribution as . For simplicity, we write with satisfying , and as . Consider as a loss function between and in Lemmas 21 and 22. Considering , by Lemma 15, note that
(S.5.498) |
By Lemma 8.5 in \citeappcai2019chime, this implies for some constants
(S.5.499)–(S.5.502)
S.5.6 Proof of Theorem 5
Denote . WLOG, assume and for all . Hence . For any , define
(S.5.503)–(S.5.505)
WLOG, it suffices to prove that
(S.5.506)–(S.5.507)
In fact, if this holds, then we must have
(S.5.508) |
Otherwise, according to (S.5.506), if , by replacing the first entries of with , we get a different alignment whose score is smaller than the score of , which contradicts the definition of . If , based on (S.5.507), by replacing the first entries of with , we get a different alignment whose score is smaller than the score of , which again contradicts the definition of .
In the following, we prove (S.5.506). The proof of (S.5.507) is almost the same, so we do not repeat it. Under the conditions we assume, it can be shown that
(S.5.509)–(S.5.514)
And
(S.5.515)–(S.5.519)
Combining all these pieces,
(S.5.520)–(S.5.524)
where (S.5.523) holds because and (S.5.524) is due to the condition (ii).
S.5.7 Proof of Theorem 6
Denote . WLOG, assume and for all . Hence . For any , define
(S.5.525)–(S.5.526)
By the definition of , we must have or . If and we have
(S.5.527) |
then for each in the for loop of Algorithm 3, the algorithm will flip the sign of to decrease the mis-alignment proportion . Then after the for loop, the mis-alignment proportion will become zero, which means the correct alignment is achieved. The case that can be similarly discussed.
Now we derive (S.5.527). Similar to the decomposition in (S.5.514), we have
(S.5.528)–(S.5.531)
Note that
(S.5.532)–(S.5.537)
Putting all pieces together,
(S.5.538)–(S.5.541)
where (S.5.540) holds because and (S.5.541) is due to the condition (iii).
S.5.8 Proof of Theorem 13
S.5.8.1 Lemmas
Define the contraction basin of one GMM as
(S.5.542) |
for which we may use the shorthand in the following.
For GMM and any , define
(S.5.543)–(S.5.544)
Lemma 29.
Suppose Assumption 3 holds.
(i) With probability at least ,
(S.5.545)
(ii) With probability at least ,
(S.5.546)
(iii) With probability at least ,
(S.5.547)
S.5.8.2 Main proof of Theorem 13
(I) Case 1: We first consider the case that . Consider an event defined to be the intersection of the events in Lemma 29, with a large constant , which satisfies . Throughout the analysis in Case 1, we condition on , therefore all the arguments hold with probability at least .
Similar to our analysis in the proof of Theorem 1, conditioned on , we have
(S.5.548)–(S.5.550)
Hence
(S.5.551)–(S.5.552)
By Lemma 7, we have
(S.5.553)–(S.5.554)
Combining these results, we have
(S.5.555) |
By the construction of , we know that
(S.5.556) |
implies that
(S.5.557)–(S.5.559)
which is the desired rate. The bounds of and can be derived similarly to (S.5.549) and (S.5.550).
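The mechanism behind iterating this bound is the standard unrolling of a one-step contraction: if an error sequence satisfies $a_{t+1}\le\kappa\,a_t+b$ with a contraction factor $\kappa\in(0,1)$ and a statistical error term $b\ge 0$ (the symbols $a_t$, $\kappa$, $b$ are generic placeholders abstracting (S.5.555)), then
\[
a_t\;\le\;\kappa^{t}a_0+b\sum_{s=0}^{t-1}\kappa^{s}\;\le\;\kappa^{t}a_0+\frac{b}{1-\kappa},
\]
so after $t\gtrsim\log(a_0/b)/\log(1/\kappa)$ iterations the geometric term is dominated by the statistical term.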
(II) Case 2: Next, we consider the case that . According to Assumption 3, we have . It is easy to see that the analysis in part (I) does not depend on the condition . Hence we have proved the desired bounds of and . Denote as an integer such that . When , the bound in part (I) is the desired bound since the term dominates the other terms. Let us consider the case .
Consider an event defined to be the event of
(S.5.560) |
Note that since , by part (II) of the proof of Theorem 1, we know that . And implies that
(S.5.561) |
where the second inequality comes from Assumption 3.
Also consider another event defined to be the intersection of the events in Lemma 29, with , which satisfies . Throughout the analysis in Case 2, we condition on , therefore all the arguments hold with probability at least .
Note that and . Hence by Lemma 7, we have thus
(S.5.562) |
Similar to the analysis in part (II) in the proof of Theorem 1, we have
(S.5.563)–(S.5.566)
Putting all pieces together,
(S.5.567) |
We can continue this analysis from to , and so on. Hence for any , we have
(S.5.568)–(S.5.573)
where the last inequality holds because is chosen to be the integer satisfying .
S.5.8.3 Proof of lemmas
S.5.9 Proof of Theorem 15
S.5.9.1 Lemmas
Recall
(S.5.574) |
Lemma 30.
When with some constant , we have
(S.5.575) |
Lemma 31.
Denote . Then
(S.5.576) |
Lemma 32 (The second variant of Theorem 5.1 in \citealpappchen2018robust).
Given a series of distributions , each of which is indexed by the same parameter . Consider independently for and . Denote the joint distribution of as . Then
(S.5.577) |
where .
Lemma 33.
Consider two data generating mechanisms:
(i) independently for and , where ;
(ii) With a preserved set , generate and independently for . And .
Denote the joint distributions of in (i) and (ii) as and , respectively. We claim that if
(S.5.578) |
then
(S.5.579) |
where for any .
Lemma 34.
When with some constant , we have
(S.5.580) |
S.5.9.2 Main proof of Theorem 15
S.5.9.3 Proof of lemmas
Proof of Lemma 30.
It’s easy to see that , where
(S.5.581)–(S.5.584)
(i) By fixing an and a , we want to show
(S.5.585) |
By Lemma 4, there exist a quadrant of and a -packing of under the Euclidean norm: , where with a small constant and when . For any , denote distribution as . Then
LHS: (S.5.586)–(S.5.587)
where the last inequality holds because it suffices to consider estimators satisfying almost surely. In addition, for any , , .
(ii) By fixing an and a , we want to show
(S.5.594) |
By Lemma 4, there exist a quadrant of and a -packing of under the Euclidean norm: , where with a small constant and when . WLOG, assume . Denote . Let for all and with . For any , denote distribution as . Then similar to the arguments in (i),
LHS: (S.5.596)–(S.5.597)
Then by Lemma 15,
(S.5.598)–(S.5.602)
when and . By Fano’s lemma (See Corollary 2.6 in \citealpapptsybakov2009introduction),
(S.5.603) |
when , and .
(iii) We want to show
(S.5.604) |
The argument is similar to (ii). The only two differences here are that the dimension of the parameter of interest equals 1, and Lemma 15 is replaced by Lemma 17.
(iv) We want to show
(S.5.605) |
The argument is similar to (ii).
Finally, we get the desired conclusion by combining (i)-(iv).
∎
Proof of Lemma 31.
Let and . Since , . Denote . For any , denote distribution as , and denote as . It suffices to show
(S.5.606) |
where .
Proof of Lemma 32.
The proof is similar to the proof of Theorem 5.1 in \citeappchen2018robust, so we omit it here. ∎
S.5.10 Proof of Theorem 14
S.5.11 Proof of Theorem 16
S.5.11.1 Lemmas
Lemma 35.
Assume and with some constants , . We have
(S.5.608) |
Lemma 36.
Denote . We have
(S.5.609) |
S.5.11.2 Main proof of Theorem 16
S.5.11.3 Proof of lemmas
Proof of Lemma 35.
We proceed with proof ideas similar to those used in the proof of Lemma 24. Recall the definitions and proof idea of Lemma 30. We have , where
(S.5.610)–(S.5.613)
Recall the mis-clustering error for GMM associated with parameter set of any classifier is . To help the analysis, following \citeappazizyan2013minimax and \citeappcai2019chime, we define a surrogate loss , where is the Bayes classifier. Suppose .
(i) We want to show
(S.5.614) |
Consider and space . And
(S.5.615) |
Let with some small constant . For any , denote distribution as . Consider a -packing of : . By Lemma 2, . Denote , where . Then by definition of KL divergence and Lemma 8.4 in \citeappcai2019chime,
(S.5.616)–(S.5.620)
For simplicity, we write with and as . By Lemma 8.5 in \citeappcai2019chime,
(S.5.621) |
where . The last inequality holds because and when . Then by Lemma 3.5 in \citealpappcai2019chime (Proposition 2 in \citeappazizyan2013minimax), for any classifier , and ,
(S.5.622) |
For any , consider a test . Therefore if there exists such that , then by (S.5.622), we must have . Let , then by Fano’s lemma (Corollary 6 in \citealpapptsybakov2009introduction)
(S.5.623)–(S.5.627)
(ii) We want to show
(S.5.628) |
Fix an and a . Suppose . We have
(S.5.629) |
Let with a small constant . For any , denote distribution as . Consider a -packing of . By Lemma 2, . Denote . WLOG, assume . Let for all and with . Then by following the same arguments in part (ii) of the proof of Lemma 30, we can show that the RHS of (S.5.629) is larger than or equal to when .
(iii) We want to show
(S.5.630) |
This can be similarly proved by following the arguments in part (i) with Lemmas 25 and 26.
(iv) We want to show
(S.5.631) |
The conclusion can be obtained immediately from (ii), by noticing that .
Finally, we get the desired conclusion by combining (i)-(iv). ∎
Proof of Lemma 36.
By Lemma 25, it suffices to prove
(S.5.632) |
For any , denote distribution as . For simplicity, we write with satisfying , and as . Consider as a loss function between and in Lemmas 32 and 33. Considering , by Lemma 15, note that
(S.5.633)–(S.5.634)
By Lemma 8.5 in \citeappcai2019chime, this implies for some constants
(S.5.635)–(S.5.638)
S.5.12 Proof of Theorem 17
Denote . WLOG, assume and for all . Hence . WLOG, consider for all (i.e., the tasks in are already well-aligned). Consider
(S.5.639)–(S.5.640)
It suffices to prove that
(S.5.641) |
In fact,
(S.5.642)–(S.5.643)
where
(S.5.644)–(S.5.646)
and
(S.5.647) |
Hence
(S.5.648)–(S.5.650)
when , which completes our proof.
S.5.13 Proof of Theorem 7
Define the contraction basin of one GMM as
(S.5.651)–(S.5.652)
And the joint contraction basin is defined as .
For and , define
(S.5.653) |
S.5.13.1 Lemmas
For GMM and any , define
(S.5.654) |
Denote and .
Lemma 37 (Contraction of multi-cluster GMM).
When with a small constant and with a large constant , there exist positive constants and such that, for any ,
(S.5.655) |
where with a constant .
Lemma 38 (Vectorized contraction of Rademacher complexity, Corollary 1 in \citealpappmaurer2016vector).
Suppose and are independent Rademacher variables. Let be a class of functions and suppose is -Lipschitz under the -norm, i.e., , where , . Then
(S.5.656) |
where is the -th component of .
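Written out in coordinates, the inequality of Corollary 1 in \citeappmaurer2016vector takes roughly the following shape (we paraphrase; see the reference for the precise statement): for maps $h_i:\mathbb{R}^{K}\to\mathbb{R}$ that are $L$-Lipschitz w.r.t. the Euclidean norm,
\[
\mathbb{E}\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\varepsilon_i\,h_i\bigl(f(x_i)\bigr)
\;\le\;\sqrt{2}\,L\,\mathbb{E}\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\sum_{k=1}^{K}\varepsilon_{ik}\,f_k(x_i),
\]
where $\varepsilon_i$ and $\varepsilon_{ik}$ are i.i.d. Rademacher variables and $f_k(x_i)$ denotes the $k$-th component of $f(x_i)$.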
S.5.13.2 Main proof of Theorem 7
The proof idea is almost the same as the idea used in the proof of Theorem 1. We still need to establish results similar to those presented in the lemmas associated with Theorem 1, and then go through the same arguments in the proof of Theorem 7. We only sketch the key steps and the differences here.
The biggest difference appears in the proofs of the lemmas associated with Theorem 1 in the context of multi-cluster GMMs. The original arguments in the proofs of Lemmas 11-14 involve the contraction inequality for Rademacher variables and univariate Lipschitz functions, which is no longer available. We replace this part with an argument based on a vectorized Rademacher contraction inequality \citepappmaurer2016vector.
First, we will show that
(S.5.657) |
for all and , with probability at least . Denote the LHS as . By changing one observation , denote the new as . Since is bounded for all , we know that . Then by bounded difference inequality, we have
(S.5.658) |
with probability at least . On the other hand, by symmetrization,
(S.5.659) |
Note that , where is a 1-Lipschitz function (w.r.t. -norm). By Lemma 38,
(S.5.660)–(S.5.661)
where . It follows that
(S.5.662)–(S.5.669)
where , is a -cover of with , and , , and are all sub-Gaussian processes. Then by the property of sub-Gaussian variables,
(S.5.670) |
Putting all the pieces together, we obtain with probability at least .
The second bound we want to show is
(S.5.671) |
Denote the LHS as . Again by bounded difference inequality,
(S.5.672) |
with probability at least . It remains to control . By symmetrization,
(S.5.673) |
Denote
(S.5.674)–(S.5.676)
which is C-Lipschitz w.r.t. as a -dimensional vector with a constant . Denote . A direct application of Lemma 38 implies that
(S.5.677)–(S.5.678)
By a similar argument involving covering number as before, we can show that
(S.5.679) |
Therefore with probability at least .
The third bound we want to show is
(S.5.680)–(S.5.681)
for all and , with probability at least . Denote the LHS as . Similar to the previous two proofs, we derive an upper bound for by controlling and , separately. The first part involving is similar to the proof of part (i) in Lemma 12 and the second part involving is similar to the proof of (S.5.657), so we omit the details.
S.5.13.3 Proof of lemmas
Proof of Lemma 37.
We will prove the contraction of first, and only sketch the parts that differ in the proof of the contraction of , because the proofs are quite similar.
Part 1: Contraction of :
First, note that and . Therefore,
(S.5.682)–(S.5.686)
We only show how to bound , i.e. the case when . For the other , the proof is the same by changing the reference level from to . Note that
(S.5.687)–(S.5.689)
where with , , , and . We will bound the three terms on the RHS separately. Note that when , we have , , and , hence , .
(i) Bounding : Note that
(S.5.690)–(S.5.691)
Hence
(S.5.692) |
Let . And notice that
(S.5.693) |
where
(S.5.694)–(S.5.698)
and . By the fact that and , we have
(S.5.699) |
implying that
(S.5.700) |
By Gaussian tail, we have
(S.5.701) |
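The Gaussian tail estimate used here is the elementary Chernoff bound: for $Z\sim\mathcal{N}(0,1)$ and any $t\ge 0$,
\[
\mathbb{P}(Z\ge t)\;\le\;e^{-t^{2}/2},
\qquad
\mathbb{P}(|Z|\ge t)\;\le\;2e^{-t^{2}/2} .
\]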
Denote event . Since
(S.5.702)–(S.5.703)
and
(S.5.704) |
we have
(S.5.705) |
Then since and , we have
(S.5.706)–(S.5.709)
Hence,
(S.5.710) |
Similarly, it can be shown that
(S.5.711) |
for any .
(ii) Bounding : Note that
(S.5.712) |
The analysis is almost the same as in (i), which leads to
(S.5.713) |
for any . We omit the proof here.
(iii) Bounding : Note that
(S.5.714) |
When :
(S.5.715)–(S.5.717)
Similar to the previous argument in (i), let and event , then
(S.5.718) |
Similar to (i), we have
(S.5.719) |
Moreover, , where and , hence
(S.5.720) |
Therefore, since and ,
(S.5.721) |
When : we can obtain
(S.5.722) |
similarly.
Combining (i)-(iii), we have
(S.5.723) |
Part 2: Contraction of :
By definition,
(S.5.724) |
implying that
(S.5.725)–(S.5.727)
where with , , and . We will bound the three terms on the RHS separately. Note that when , we have , , and .
For any with and any , similar to our previous arguments, we have
(S.5.728) |
which leads to
(S.5.729) |
for any . Similarly, we have
(S.5.730) |
for any . Therefore, . By part 1, we have . Hence by (S.5.724), we have .
Combining part 1 and part 2, we complete the proof.
∎
S.5.14 Proof of Theorem 8
Note that the excess risk
(S.5.731)–(S.5.734)
Let event . We claim that the margin condition holds for any a small constant (to be verified). If this is the case, then denote .
(S.5.735)–(S.5.736)
We have
(S.5.737)–(S.5.739)
where the last inequality comes from the fact that when , on . And notice that on , we must have because if , then
(S.5.740) |
which contradicts the definition of . Hence is empty. Therefore . Finally, by Lipschitzness,
(S.5.741)–(S.5.745)
Plugging back into (S.5.739), we have
(S.5.746) |
Let :
(S.5.747) |
Then plugging in the upper bound of in Theorem 7 completes the proof.
It remains to verify the margin condition for any a small constant . In fact,
(S.5.748)–(S.5.757)
when is less than some constant . Note that we used the fact that some constant , which implies that the Gaussian density is upper bounded by a constant. Hence the margin condition holds.
We want to point out that this multi-class extension of the binary margin condition has been widely used in the literature on multi-class classification. For example, see \citeappchen2006consistency and \citeappvigogna2022multiclass.
S.5.15 Proof of Theorem 9
The proof is almost the same as the proof of Theorem 3, by noticing that we can make the GMM parameters the same across -th task with to reduce the problem to the case , so we do not repeat it here.
S.5.16 Proof of Theorem 10
S.5.16.1 Lemmas
Lemma 39.
Consider and with for , for , , and . Then , for , and
(S.5.758) |
Lemma 40.
Consider and with for , , for , , , and , where . Then , , for , and
(S.5.759) |
Lemma 41.
Consider and with for , for , , and . Suppose and satisfy . Then for , for , , , , where with , and
(S.5.760) |
S.5.16.2 Main proof of Theorem 10
Given the three lemmas we presented, the proof is almost the same as the proof of Theorem 4. We do not repeat it here.
S.5.16.3 Proof of lemmas
Proof of Lemma 39.
Note that ’s are independent given . We have
(S.5.761)–(S.5.770)
∎
Proof of Lemma 40.
Note that ’s are independent given . We have
(S.5.771)–(S.5.776)
where we used the fact that some constant . ∎
Proof of Lemma 41.
Note that ’s are independent given . We have
(S.5.777)–(S.5.782)
The second-to-last inequality is due to the fact that some constant and Proposition 23 in \citeappazizyan2013minimax. The last inequality comes from Lemma 8.1 in \citeappcai2019chime. ∎
S.5.17 Proof of Theorem 11
WLOG, suppose that satisfies the “majority class” , if . Note that we can make this assumption because it suffices to recover with a permutation . And WLOG, suppose for all . Let us consider any with for all and . It suffices to prove that .
Recall that . Note that
(S.5.783)–(S.5.788)
We have
(S.5.789)–(S.5.791)
Hence
(S.5.792) |
Therefore,
(S.5.793)–(S.5.794)
Correspondingly, we can decompose in the same way as with
(S.5.795)–(S.5.796)
Note that
(S.5.797)–(S.5.799)
hence
(S.5.800) |
And
(S.5.801)–(S.5.805)
An important observation is that . Therefore,
(S.5.806) |
And
(S.5.807)–(S.5.808)
Furthermore, by triangle inequality,
(S.5.809)–(S.5.811)
Putting all pieces together,
(S.5.812)–(S.5.814)
Denote the majority class among and .
(i) Case 1: .
Since , we have for all ; otherwise, by our assumption, since satisfies , which is a contradiction. Therefore,
(S.5.815)–(S.5.819)
and
(S.5.820)–(S.5.822)
Also,
(S.5.823) |
Hence, since ,
(S.5.824)–(S.5.826)
(ii) Case 2: .
In this case, by our assumption, we must have . And
(S.5.827)–(S.5.831)
Moreover,
(S.5.832)–(S.5.835)
For satisfying , we have
(S.5.836)–(S.5.837)
For satisfying , we have . Define . Note that . Furthermore,
(S.5.838)–(S.5.839)
This implies that
(S.5.840)–(S.5.847)
S.5.18 Proof of Theorem 12
WLOG, consider the step in the for loop and the case that the identity mapping from to , and . Denote and , hence . WLOG, consider the identity mapping from to . Denote and with the identity mapping from to . It suffices to show that
(S.5.848) |
for any with the identity mapping from to . If this is the case, then is the identity mapping for all , which completes the proof.
We focus on the derivation of (S.5.848) in the remaining part of the proof.
(i) Case 1: .
(S.5.849)–(S.5.852)
Note that
(S.5.853)–(S.5.863)
These imply that
(S.5.864)–(S.5.866)
where we used the fact that .
(ii) Case 2: .
(S.5.867)–(S.5.870)
By previous results,
(S.5.871)–(S.5.875)
and
(S.5.876) |
Similar to case 1,
(S.5.877)–(S.5.878)
Therefore,
(S.5.879)–(S.5.881)
where we used the fact that .
\bibliographystyleapp{apalike}
\bibliographyapp{reference}