Robust Unsupervised Multi-task and Transfer Learning on Gaussian Mixture Models

Ye Tian (Department of Statistics, Columbia University), Haolei Weng (Department of Statistics and Probability, Michigan State University), Lucy Xia (Department of ISOM, School of Business and Management, Hong Kong University of Science and Technology), and Yang Feng (Department of Biostatistics, School of Global Public Health, New York University)
Abstract

Unsupervised learning has been widely used in many real-world applications. One of the simplest and most important unsupervised learning models is the Gaussian mixture model (GMM). In this work, we study the multi-task learning problem on GMMs, which aims to leverage potentially similar GMM parameter structures among tasks to obtain improved learning performance compared to single-task learning. We propose a multi-task GMM learning procedure based on the EM algorithm that effectively utilizes unknown similarities between related tasks and is robust against a fraction of outlier tasks from arbitrary distributions. The proposed procedure is shown to achieve the minimax optimal rate of convergence for both parameter estimation error and the excess mis-clustering error, in a wide range of regimes. Moreover, we generalize our approach to tackle the problem of transfer learning for GMMs, where similar theoretical results are derived. Additionally, iterative unsupervised multi-task and transfer learning methods may suffer from an initialization alignment problem, and two alignment algorithms are proposed to resolve the issue. Finally, we demonstrate the effectiveness of our methods through simulations and real data examples. To the best of our knowledge, this is the first work studying multi-task and transfer learning on GMMs with theoretical guarantees.


Keywords: Multi-task learning, transfer learning, unsupervised learning, Gaussian mixture models, robustness, minimax rate, EM algorithm

1 Introduction

1.1 Gaussian mixture models (GMMs)

Unsupervised learning, which learns patterns from unlabeled data, is a prevalent problem in statistics and machine learning. Clustering is one of the most important problems in unsupervised learning, where the goal is to group observations based on some metric of similarity. Researchers have developed numerous clustering methods, including k-means (Forgy, 1965), k-medians (Jain and Dubes, 1988), spectral clustering (Ng et al., 2001), and hierarchical clustering (Murtagh and Contreras, 2012), among others. Clustering problems have also been analyzed from the perspective of mixtures of several probability distributions (Scott and Symons, 1971). The mixture of Gaussian distributions is one of the simplest models in this category and has been widely applied in many real-world applications (Yang and Ahuja, 1998; Lee et al., 2012).

In the binary Gaussian mixture model (GMM) with a common covariance matrix, each observation Z\in\mathbb{R}^{p} comes from the following mixture of two Gaussian distributions:

Y=\begin{cases}1,&\text{with probability }1-w,\\ 2,&\text{with probability }w,\end{cases} \quad (1)
Z\,|\,Y=r\sim\mathcal{N}(\bm{\mu}_{r},\bm{\Sigma}),\quad r=1,2, \quad (2)

where w\in(0,1), \bm{\mu}_{1}\in\mathbb{R}^{p}, \bm{\mu}_{2}\in\mathbb{R}^{p}, and \bm{\Sigma}\succ 0 are parameters. This is the same setting as the linear discriminant analysis (LDA) problem in classification (Hastie et al., 2009), except that the label Y is unobserved in the clustering problem, while it is observed in the classification case. It has been shown that the Bayes classifier for the LDA problem is

\mathcal{C}(\bm{z})=\begin{cases}1,&\text{if }\bm{\beta}^{\top}\bm{z}-\delta\leq\log\left(\frac{1-w}{w}\right);\\ 2,&\text{otherwise},\end{cases} \quad (3)

where \bm{\beta}=\bm{\Sigma}^{-1}(\bm{\mu}_{2}-\bm{\mu}_{1})\in\mathbb{R}^{p} and \delta=\bm{\beta}^{\top}(\bm{\mu}_{1}+\bm{\mu}_{2})/2. Note that \bm{\beta} is usually referred to as the discriminant coefficient (Anderson, 1958; Efron, 1975). Naturally, this classifier is useful in clustering too: after learning w, \bm{\mu}_{1}, \bm{\mu}_{2}, and \bm{\beta}, we can plug their estimators into (3) to group any new observation Z^{\textup{new}}. Generally, we define the mis-clustering error rate of any given clustering method \mathcal{C} as

R(\mathcal{C})=\min_{\pi:\{1,2\}\rightarrow\{1,2\}}\mathbb{P}\big(\mathcal{C}(Z^{\textup{new}})\neq\pi(Y^{\textup{new}})\big), \quad (4)

where \pi is a permutation function, Y^{\textup{new}} is the label of a future observation Z^{\textup{new}}, and the probability is taken w.r.t. the joint distribution of (Z^{\textup{new}},Y^{\textup{new}}) based on the parameters w, \bm{\mu}_{1}, \bm{\mu}_{2}, and \bm{\Sigma}. Here the error is calculated up to a permutation due to the lack of label information. It is clear that in the ideal case where the parameters are known, \mathcal{C}(\cdot) in (3) achieves the optimal mis-clustering error. Multi-cluster Gaussian mixture models with R\geq 3 components can be described in a similar way; we provide the details in Section S.1 of the supplementary materials.
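To make the plug-in rule (3) and the error metric (4) concrete, here is a minimal Python sketch (our own illustration, with arbitrary parameter values) that simulates a binary GMM, applies the classifier (3) with the true parameters, and estimates the mis-clustering error (4) empirically over the two label permutations.

import numpy as np

rng = np.random.default_rng(0)
p, n, w = 5, 5000, 0.4                      # dimension, sample size, mixing proportion (illustrative)
mu1, mu2 = np.zeros(p), np.ones(p)          # component means (illustrative values)
Sigma = 0.5 * np.eye(p) + 0.5               # common covariance: equicorrelated, positive definite

# Sample (Y, Z) from the binary GMM in (1)-(2)
y = rng.binomial(1, w, size=n) + 1          # labels in {1, 2}, with P(Y = 2) = w
means = np.where((y == 1)[:, None], mu1, mu2)
z = means + rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# Bayes rule (3): assign cluster 1 iff beta^T z - delta <= log((1 - w) / w)
beta = np.linalg.solve(Sigma, mu2 - mu1)
delta = beta @ (mu1 + mu2) / 2
y_hat = np.where(z @ beta - delta <= np.log((1 - w) / w), 1, 2)

# Mis-clustering error (4): minimum over the two label permutations
err = min(np.mean(y_hat != y), np.mean((3 - y_hat) != y))
print(f"empirical mis-clustering error: {err:.3f}")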

There is a large volume of published studies on learning a GMM. The vast majority of approaches can be roughly divided into three categories. The first category is the method of moments, where the parameters are estimated through several moment equations (Pearson, 1894; Kalai et al., 2010; Hsu and Kakade, 2013; Ge et al., 2015). The second category is the spectral method, where the estimation is based on a spectral decomposition (Vempala and Wang, 2004; Hsu and Kakade, 2013; Jin et al., 2017). The last category is the likelihood-based method, with the popular expectation-maximization (EM) algorithm as a canonical example. The general form of the EM algorithm was formalized by Dempster et al. (1977) in the context of incomplete data, though earlier works (Hartley, 1958; Hasselblad, 1966; Baum et al., 1970; Sundberg, 1974) had studied EM-style algorithms in various concrete settings. Classical convergence results on the EM algorithm (Wu, 1983; Redner and Walker, 1984; Meng and Rubin, 1994; McLachlan and Krishnan, 2007) guarantee local convergence of the algorithm to fixed points of the sample likelihood. Recent advances in the analysis of the EM algorithm and its variants provide stronger guarantees by establishing geometric convergence rates of the algorithm to the underlying true parameters under mild initialization conditions. See, for example, Dasgupta and Schulman (2013); Wang et al. (2014); Xu et al. (2016); Balakrishnan et al. (2017); Yan et al. (2017); Cai et al. (2019); Kwon and Caramanis (2020); Zhao et al. (2020) for GMM-related works. In this paper, we propose modified versions of the EM algorithm, with similarly strong guarantees, to learn GMMs in the new context of multi-task and transfer learning.

1.2 Multi-task learning and transfer learning

Multi-tasking is the ability to pay attention to more than one task simultaneously. Moreover, we often find that the knowledge learned from one task can also be useful for other tasks. Multi-task learning (MTL) is a learning paradigm inspired by this human learning ability, which aims to learn multiple tasks jointly and improve performance by utilizing the similarity between these tasks (Zhang and Yang, 2021). There has been a large body of research on MTL, which can be classified into five categories (Zhang and Yang, 2021): the feature learning approach (Argyriou et al., 2008; Obozinski et al., 2006), the low-rank approach (Ando et al., 2005), the task clustering approach (Thrun and O’Sullivan, 1996), the task relation learning approach (Evgeniou and Pontil, 2004), and the decomposition approach (Jalali et al., 2010). The majority of existing works focus on the use of MTL in supervised learning problems, while the application of MTL in unsupervised learning, such as clustering, has received less attention. Zhang and Zhang (2011) developed an MTL clustering method based on a penalization framework, where the objective function consists of a local loss function and a pairwise task regularization term, both of which are related to the Bregman divergence. In Gu et al. (2011), a reproducing kernel Hilbert space (RKHS) was first established, and then multi-task kernel k-means clustering was applied based on that RKHS. Yang et al. (2014) proposed a spectral MTL clustering method with a novel \ell_{2,p}-norm, which can also produce a linear regression function to predict labels for out-of-sample data. Zhang et al. (2018) suggested a new method based on the similarity matrix of samples in each task, which can learn the within-task clustering structure and the task relatedness simultaneously. Marfoq et al. (2021) established a new federated multi-task EM algorithm to learn mixtures of distributions and provided some theory on the convergence guarantee, but the statistical properties of the estimators were not fully understood. Zhang and Chen (2022) proposed a distributed learning algorithm for GMMs based on transportation divergence when all GMMs are identical. In general, there are very few theoretical results on unsupervised MTL.

Transfer learning (TL) is another learning paradigm similar to multi-task learning but with a different objective. While MTL aims to learn all the tasks well with no priority for any specific task, the goal of TL is to improve the performance on the target task using information from the source tasks (Zhang and Yang, 2021). According to Pan and Yang (2009), most TL approaches can be classified into four categories: instance-based transfer (Dai et al., 2007), feature representation transfer (Dai et al., 2008), parameter transfer (Lawrence and Platt, 2004), and relational-knowledge transfer (Mihalkova et al., 2007). Similar to MTL, most TL methods focus on supervised learning. Some TL approaches have also been developed for the semi-supervised learning setting (Chattopadhyay et al., 2012; Li et al., 2013), where only part of the target or source data is labeled. There are far fewer discussions of unsupervised TL approaches (there are different definitions of unsupervised TL; semi-supervised TL is sometimes called unsupervised TL as well, and we follow the definition in Pan and Yang (2009) here), which focus on the case where both target and source data are unlabeled. Dai et al. (2008) developed a co-clustering approach to transfer information from a single source to the target, which relies on the loss in mutual information and requires the features to be discrete. Wang et al. (2008) proposed a TL discriminant analysis method, where the target data is allowed to be unlabeled, but some labeled source data is necessary. In Wang et al. (2021), a TL approach was developed to learn Gaussian mixture models with only one source by weighting the target and source likelihood functions. Zuo et al. (2018) proposed a TL method based on infinite Gaussian mixture models and active learning, but their approach needs sufficient labeled source data and a few labeled target samples.

There are some recent studies on TL and MTL under various statistical settings, including high-dimensional linear regression (Xu and Bastani, 2021; Li et al., 2022b; Zhang et al., 2022; Li et al., 2022a), high-dimensional generalized linear models (Bastani, 2021; Li et al., 2023; Tian and Feng, 2023), functional linear regression (Lin and Reimherr, 2022), high-dimensional graphical models (Li et al., 2022b), and reinforcement learning (Chen et al., 2022), among others. The recent work of Duan and Wang (2023) developed an adaptive and robust MTL framework with sharp statistical guarantees for a broad class of models. We discuss its connection to our work in Section 2.

1.3 Our contributions and paper structure

Our main contributions in this work can be summarized in the following:

  1. (i)

    We develop efficient polynomial-time iterative procedures to learn GMMs in both MTL and TL settings. These procedures can be viewed as adaptations of the standard EM algorithm for MTL and TL problems.

  2. (ii)

    The developed procedures come with provable statistical guarantees. Specifically, we derive the upper bounds of their estimation and excess mis-clustering error rates under mild conditions. For MTL, it is shown that when the tasks are close to each other, our method can achieve better upper bounds than those from the single-task learning; when the tasks are substantially different from each other, our method can still obtain competitive convergence rates compared to single-task learning. Similarly for TL, our method can achieve better upper bounds than those from fitting GMM only to target data when the target and sources are similar, and remains competitive otherwise. In addition, the derived upper bounds reveal the robustness of our methods against a fraction of outlier tasks (for MTL) or outlier sources (for TL) from arbitrary distributions. These guarantees certify our procedures as adaptive (to the unknown task relatedness) and robust (to contaminated data) learning approaches.

  3. (iii)

    We derive the minimax lower bounds for parameter estimation and excess mis-clustering errors. In various regimes, the upper bounds from our methods match the lower bounds (up to small order terms), showing that the proposed methods are (nearly) minimax rate optimal.

  4. (iv)

    Our MTL and TL approaches require the initial estimates for different tasks to be “well-aligned”, due to the non-identifiability of GMM. We propose two pre-processing alignment algorithms to provably resolve the alignment problem. Similar problems arise in many unsupervised MTL settings. However, to our knowledge, there is no formal discussion of the alignment issue in the existing literature on unsupervised MTL (Gu et al., 2011; Zhang and Zhang, 2011; Yang et al., 2014; Zhang et al., 2018; Dieuleveut et al., 2021; Marfoq et al., 2021). Therefore, our rigorous treatment of the alignment problem is an important step forward in this field.

The rest of the paper is organized as follows. In Section 2, we first discuss the multi-task learning problem for binary GMMs, by introducing the problem setting, our method, and the associated theory. The above-mentioned alignment problem is discussed in Section 2.4. We present a simulation study in Section 3 to validate our theory. Finally, in Section 4, we point out some interesting future research directions. Due to the space limit, the extension to multi-cluster GMMs, additional numerical results, a full treatment of the transfer learning problem, and all the proofs are delegated to the supplementary materials.

We summarize the notations used throughout the paper here for convenience. We use bold capital letters (e.g., \bm{\Sigma}) to denote matrices and bold small letters (e.g., \bm{x}, \bm{y}) to denote vectors. For a matrix \bm{A}=[a_{ij}]_{p\times q}\in\mathbb{R}^{p\times q}, its 2-norm or spectral norm is defined as \|\bm{A}\|_{2}=\max_{\bm{x}:\|\bm{x}\|_{2}=1}\|\bm{A}\bm{x}\|_{2}. If q=1, \bm{A} becomes a p-dimensional vector and \|\bm{A}\|_{2} equals its Euclidean norm. For symmetric \bm{A}, we define \lambda_{\max}(\bm{A}) and \lambda_{\min}(\bm{A}) as the maximum and minimum eigenvalues of \bm{A}, respectively. For two non-zero real sequences \{a_{n}\}_{n=1}^{\infty} and \{b_{n}\}_{n=1}^{\infty}, we use a_{n}\ll b_{n}, b_{n}\gg a_{n}, or a_{n}=o(b_{n}) to represent |a_{n}/b_{n}|\rightarrow 0 as n\rightarrow\infty. And a_{n}\lesssim b_{n}, b_{n}\gtrsim a_{n}, or a_{n}=\mathcal{O}(b_{n}) means \sup_{n}|a_{n}/b_{n}|<\infty. For two sequences of random variables \{x_{n}\}_{n=1}^{\infty} and \{y_{n}\}_{n=1}^{\infty}, the notation x_{n}=\mathcal{O}_{\mathbb{P}}(y_{n}) means that for any \epsilon>0, there exists a positive constant M such that \sup_{n}\mathbb{P}(|x_{n}/y_{n}|>M)\leq\epsilon. For two real numbers a and b, a\vee b and a\wedge b represent \max(a,b) and \min(a,b), respectively. For any positive integer K, both 1:K and [K] stand for the set \{1,2,\ldots,K\}. For any set S\subseteq[K], |S| denotes its cardinality, and S^{c} denotes its complement. Without further notice, c, C, C_{1}, C_{2}, \ldots represent some positive constants and can change from line to line.

2 Multi-task Learning

2.1 Problem setting

Suppose there are K tasks, for which we have n_{k} observations \{\bm{z}_{i}^{(k)}\}_{i=1}^{n_{k}} from the k-th task. Suppose there exists an unknown subset S\subseteq 1:K such that observations from each task in S independently follow a GMM, while samples from tasks outside S can be arbitrarily distributed. This means

y_{i}^{(k)}=\begin{cases}1,&\text{with probability }1-w^{(k)*};\\ 2,&\text{with probability }w^{(k)*};\end{cases}\qquad \bm{z}_{i}^{(k)}\,|\,y_{i}^{(k)}=r\sim\mathcal{N}(\bm{\mu}^{(k)*}_{r},\bm{\Sigma}^{(k)*}),\qquad r=1,2, \quad (5)

for all k\in S, i=1:n_{k}, and

\{\bm{z}^{(k)}_{i}\}_{i,k\in S^{c}}\sim\mathbb{Q}_{S}, \quad (6)

where \mathbb{Q}_{S} is some probability measure on (\mathbb{R}^{p})^{\otimes n_{S^{c}}} and n_{S^{c}}=\sum_{k\in S^{c}}n_{k}. In unsupervised learning, we have no access to the true labels \{y^{(k)}_{i}\}_{i,k}. To formalize the multi-task learning problem, we first introduce the parameter space for a single GMM:

\overline{\Theta}=\big\{\overline{\bm{\theta}}=(w,\bm{\mu}_{1},\bm{\mu}_{2},\bm{\Sigma}):\ \|\bm{\mu}_{1}\|_{2}\vee\|\bm{\mu}_{2}\|_{2}\leq M,\ w\in(c_{w},1-c_{w}),\ c_{\bm{\Sigma}}^{-1}\leq\lambda_{\min}(\bm{\Sigma})\leq\lambda_{\max}(\bm{\Sigma})\leq c_{\bm{\Sigma}}\big\}, \quad (7)

where M, c_{w}\in(0,1/2], and c_{\bm{\Sigma}} are some fixed positive constants. For simplicity, throughout the main text, we assume these constants are fixed; hence, we have suppressed the dependency on them in the notation \overline{\Theta}. The parameter space \overline{\Theta} is a standard formulation. Similar parameter spaces have been considered, for example, in Cai et al. (2019).

Our goal in multi-task learning is to leverage the potential similarity shared by different tasks in S to collectively learn them all. The tasks outside S can be arbitrarily distributed and can potentially be outlier tasks. This motivates us to define a joint parameter space for GMMs in S:

\overline{\Theta}_{S}(h)=\Big\{\{\overline{\bm{\theta}}^{(k)}\}_{k\in S}=\{(w^{(k)},\bm{\mu}^{(k)}_{1},\bm{\mu}^{(k)}_{2},\bm{\Sigma}^{(k)})\}_{k\in S}:\ \overline{\bm{\theta}}^{(k)}\in\overline{\Theta},\ \inf_{\overline{\bm{\beta}}}\max_{k\in S}\|\bm{\beta}^{(k)}-\overline{\bm{\beta}}\|_{2}\leq h\Big\}, \quad (9)

where \bm{\beta}^{(k)}=(\bm{\Sigma}^{(k)})^{-1}(\bm{\mu}^{(k)}_{2}-\bm{\mu}^{(k)}_{1}) is called the discriminant coefficient of the k-th task (recall Section 1.1). For convenience, we define \delta^{(k)}=(\bm{\beta}^{(k)})^{\top}(\bm{\mu}^{(k)}_{1}+\bm{\mu}^{(k)}_{2})/2, which together with \log((1-w^{(k)})/w^{(k)}) is part of the decision boundary. Note that this parameter space is defined only for GMMs of tasks in S. To model potentially corrupted or contaminated data, we do not impose any distributional constraints for tasks in S^{c}. Such a modeling framework is reminiscent of Huber’s \epsilon-contamination model (Huber, 1964). Similar formulations have been adopted in recent multi-task learning research such as Konstantinov et al. (2020) and Duan and Wang (2023).

For GMMs in \overline{\Theta}_{S}(h), we assume that they share similar discriminant coefficients. The similarity is formalized by assuming that all the discriminant coefficients in S are within Euclidean distance h from a “center”. Given that the discriminant coefficient has a major impact on the clustering performance (see the discriminant rule in (3)), the parameter space \overline{\Theta}_{S}(h) is tailored to characterize the task relatedness from the clustering perspective. A similar viewpoint that focuses on modeling the discriminant coefficient has appeared in the study of high-dimensional GMM clustering (Cai et al., 2019) and sparse linear discriminant analysis (Cai and Liu, 2011; Mai et al., 2012). With both S and h being unknown in practice, we aim to develop a multi-task learning procedure that is robust to the outlier tasks in S^{c} and achieves improved performance for tasks in S (compared to single-task learning), in terms of discriminant coefficient estimation and clustering, whenever h is small.

The parameter space does not require the mean vectors \{\bm{\mu}^{(k)}_{1},\bm{\mu}^{(k)}_{2}\}_{k\in S} or the covariance matrices \{\bm{\Sigma}^{(k)}\}_{k\in S} to be similar, although they are not free parameters due to the constraint on \{\bm{\beta}^{(k)}\}_{k\in S}. The mixture proportions \{w^{(k)}\}_{k\in S} do not need to be similar either. We thus avoid imposing restrictive conditions on those parameters. On the other hand, this implies that the estimation of the mixture proportions, mean vectors, and covariance matrices in multi-task learning may not be generally improvable over that in single-task learning, which is verified by the theoretical results in Section 2.3. While the current treatment in the paper does not consider similarity structure among \{\bm{\mu}^{(k)}_{1},\bm{\mu}^{(k)}_{2}\}_{k\in S}, \{\bm{\Sigma}^{(k)}\}_{k\in S}, or \{w^{(k)}\}_{k\in S}, our methods and theory can be readily adapted to handle such scenarios, if desired.

There are two main reasons why this MTL problem can be challenging. First, commonly used strategies like data pooling are fragile with respect to outlier tasks and can lead to arbitrarily inaccurate outcomes in the presence of even a small number of outliers. Moreover, since the distribution of data from outlier tasks can be adversarial to the learner, the idea of outlier task detection in the recent literature (Li et al., 2021; Tian and Feng, 2023) may not be applicable. Second, to address the non-convexity of the likelihood, we propose to exploit the similarity among tasks via a generalization of the EM algorithm. However, a clear theoretical understanding of such an iterative procedure requires a delicate analysis of the whole iterative process. In particular, as in the analysis of EM algorithms (Cai et al., 2019; Kwon and Caramanis, 2020), the estimates of the similar discriminant vectors \{\bm{\beta}^{(k)*}\}_{k\in S} and of other potentially dissimilar parameters are entangled in the iterations. It is highly non-trivial to separate the impact of estimating \{\bm{\beta}^{(k)*}\}_{k\in S} from that of estimating the other parameters in order to derive the desired statistical error rates. We manage to address this challenge through a localization technique, carefully shrinking the analysis radius of the estimators as the iteration proceeds.
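To make the data-generating setting (5)-(6) concrete, the following Python sketch (ours; all parameter values are hypothetical) simulates K tasks whose discriminant coefficients lie within Euclidean distance h of a common center, together with a few outlier tasks drawn from an arbitrary (here heavy-tailed) distribution.

import numpy as np

rng = np.random.default_rng(1)
K, p, n_k, h = 10, 5, 300, 0.2             # number of tasks, dimension, per-task size, similarity level
S = set(range(8))                          # tasks 0-7 follow GMMs; tasks 8-9 are outliers (epsilon = 0.2)
beta_center = np.ones(p)                   # common "center" of the discriminant coefficients

tasks = []
for k in range(K):
    if k in S:
        # GMM task: with Sigma = I, beta^(k) = mu2^(k) - mu1^(k); keep it within distance h of the center
        direction = rng.normal(size=p)
        beta_k = beta_center + h * rng.uniform() * direction / np.linalg.norm(direction)
        mu1, mu2 = -beta_k / 2, beta_k / 2
        w = rng.uniform(0.3, 0.7)
        y = rng.binomial(1, w, size=n_k)                       # 1 encodes cluster 2
        z = np.where(y[:, None] == 1, mu2, mu1) + rng.normal(size=(n_k, p))
    else:
        # Outlier task: arbitrary (here heavy-tailed) distribution with no GMM structure
        z = 5.0 * rng.standard_t(df=2, size=(n_k, p))
    tasks.append(z)

print(len(tasks), tasks[0].shape)                              # 10 tasks, each an (n_k, p) data matrix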

2.2 Method

We aim to tackle the problem of GMM estimation in the context of multi-task learning. The EM algorithm is commonly used to address the non-convexity of the log-likelihood function arising from the latent labels. In the standard EM algorithm, we “classify” the observations (i.e., update the posterior) in the E-step and update the parameter estimates in the M-step (Redner and Walker, 1984). For multi-task and transfer learning problems, the penalization framework is very popular: we solve an optimization problem based on a new objective function, which consists of a local loss function and a penalty term forcing the estimators of similar tasks to be close to each other. For examples, see Zhang and Zhang (2011); Zhang et al. (2015); Bastani (2021); Xu and Bastani (2021); Li et al. (2021); Duan and Wang (2023); Lin and Reimherr (2022); Li et al. (2023); Tian and Feng (2023). Thus motivated, our method combines the EM algorithm with the penalization framework.

In particular, we adapt the penalization framework of Duan and Wang (2023) and modify the updating formulas in the M-step accordingly. The proposed procedure is summarized in Algorithm 1. For simplicity, in Algorithm 1 we have used the notation

\gamma_{\bm{\theta}}(\bm{z})=\frac{w\exp(\bm{\beta}^{\top}\bm{z}-\delta)}{1-w+w\exp(\bm{\beta}^{\top}\bm{z}-\delta)},\quad\text{for }\bm{\theta}=(w,\bm{\beta},\delta). \quad (11)

Note that \gamma_{\bm{\theta}}(\bm{z}) is the posterior probability \mathbb{P}(Y=2\,|\,Z=\bm{z}) given the observation \bm{z}. The estimated posterior probability is calculated in every E-step given the updated parameter estimates.
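For concreteness, the posterior probability (11) can be computed as follows (a minimal sketch with illustrative inputs; the function name is ours).

import numpy as np

def gamma(z, w, beta, delta):
    """Posterior probability P(Y = 2 | Z = z) in (11) for theta = (w, beta, delta)."""
    s = z @ beta - delta                                 # discriminant score beta^T z - delta
    return w * np.exp(s) / (1.0 - w + w * np.exp(s))

# Illustrative call on a small batch of observations
z = np.random.default_rng(2).normal(size=(4, 3))
print(gamma(z, w=0.4, beta=np.ones(3), delta=0.0))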

Input: Initialization \{(\widehat{w}^{(k)[0]},\widehat{\bm{\beta}}^{(k)[0]},\widehat{\bm{\mu}}^{(k)[0]}_{1},\widehat{\bm{\mu}}^{(k)[0]}_{2})\}_{k=1}^{K}, maximum number of iteration rounds T, initial penalty parameter \lambda^{[0]}, tuning parameters C_{\lambda}>0 and \kappa\in(0,1)
1: Set \widehat{\bm{\theta}}^{(k)[0]}=(\widehat{w}^{(k)[0]},\widehat{\bm{\beta}}^{(k)[0]},\widehat{\delta}^{(k)[0]}) with \widehat{\delta}^{(k)[0]}=\frac{1}{2}(\widehat{\bm{\beta}}^{(k)[0]})^{\top}(\widehat{\bm{\mu}}^{(k)[0]}_{1}+\widehat{\bm{\mu}}^{(k)[0]}_{2}) for k=1:K
2: for t=1 to T do
3:   \lambda^{[t]}=\kappa\lambda^{[t-1]}+C_{\lambda}\sqrt{p+\log K} // Penalty parameter update
4:   for k=1 to K do // Local update for each task
5:     \widehat{w}^{(k)[t]}=\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}_{i}^{(k)})
6:     \widehat{\bm{\mu}}^{(k)[t]}_{1}=\frac{\sum_{i=1}^{n_{k}}[1-\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}_{i}^{(k)})]\bm{z}^{(k)}_{i}}{n_{k}(1-\widehat{w}^{(k)[t]})}, \quad \widehat{\bm{\mu}}^{(k)[t]}_{2}=\frac{\sum_{i=1}^{n_{k}}\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}_{i}^{(k)})\bm{z}^{(k)}_{i}}{n_{k}\widehat{w}^{(k)[t]}}
7:     \widehat{\bm{\Sigma}}^{(k)[t]}=\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\big\{[1-\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}_{i}^{(k)})]\,(\bm{z}^{(k)}_{i}-\widehat{\bm{\mu}}^{(k)[t]}_{1})(\bm{z}^{(k)}_{i}-\widehat{\bm{\mu}}^{(k)[t]}_{1})^{\top}+\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}_{i}^{(k)})\,(\bm{z}^{(k)}_{i}-\widehat{\bm{\mu}}^{(k)[t]}_{2})(\bm{z}^{(k)}_{i}-\widehat{\bm{\mu}}^{(k)[t]}_{2})^{\top}\big\}
8:   end for
9:   \big(\{\widehat{\bm{\beta}}^{(k)[t]}\}_{k=1}^{K},\overline{\bm{\beta}}^{[t]}\big)=\operatorname*{arg\,min}_{\bm{\beta}^{(1)},\ldots,\bm{\beta}^{(K)},\overline{\bm{\beta}}}\Big\{\sum_{k=1}^{K}n_{k}\big[\frac{1}{2}(\bm{\beta}^{(k)})^{\top}\widehat{\bm{\Sigma}}^{(k)[t]}\bm{\beta}^{(k)}-(\bm{\beta}^{(k)})^{\top}(\widehat{\bm{\mu}}_{2}^{(k)[t]}-\widehat{\bm{\mu}}_{1}^{(k)[t]})\big]+\sum_{k=1}^{K}\sqrt{n_{k}}\,\lambda^{[t]}\|\bm{\beta}^{(k)}-\overline{\bm{\beta}}\|_{2}\Big\} // Aggregation to learn \{\widehat{\bm{\beta}}^{(k)[t]}\}_{k=1}^{K}
10:   for k=1 to K do // Local update for each task
11:     \widehat{\delta}^{(k)[t]}=\frac{1}{2}(\widehat{\bm{\beta}}^{(k)[t]})^{\top}(\widehat{\bm{\mu}}^{(k)[t]}_{1}+\widehat{\bm{\mu}}^{(k)[t]}_{2})
12:     \widehat{\bm{\theta}}^{(k)[t]}=(\widehat{w}^{(k)[t]},\widehat{\bm{\beta}}^{(k)[t]},\widehat{\delta}^{(k)[t]})
13:   end for
14: end for
Output: \{(\widehat{\bm{\theta}}^{(k)[T]},\widehat{\bm{\mu}}^{(k)[T]}_{1},\widehat{\bm{\mu}}^{(k)[T]}_{2},\widehat{\bm{\Sigma}}^{(k)[T]})\}_{k=1}^{K} with \widehat{\bm{\theta}}^{(k)[T]}=(\widehat{w}^{(k)[T]},\widehat{\bm{\beta}}^{(k)[T]},\widehat{\delta}^{(k)[T]}), and \overline{\bm{\beta}}^{[T]}
Algorithm 1 MTL-GMM

Recall that the parameter space \overline{\Theta}_{S}(h) introduced in (9) does not encode similarity for the mixture proportions \{w^{(k)*}\}_{k\in S}, mean vectors \{\bm{\mu}^{(k)*}_{1},\bm{\mu}^{(k)*}_{2}\}_{k\in S}, or covariance matrices \{\bm{\Sigma}^{(k)*}\}_{k\in S}. Hence, their updates in Steps 5-7 are kept the same as in the standard EM algorithm. Regarding the update of the discriminant coefficients in Step 9, the quadratic loss function is motivated by the direct estimation of the discriminant coefficient in the high-dimensional GMM (Cai et al., 2019) and high-dimensional LDA literature (Cai and Liu, 2011; Witten and Tibshirani, 2011; Fan et al., 2012; Mai et al., 2012, 2019). The penalty term in Step 9 penalizes the contrasts of the \bm{\beta}^{(k)}'s to exploit the similarity structure among tasks. Having the “center” parameter \overline{\bm{\beta}} in the penalization induces robustness against outlier tasks. We refer to Duan and Wang (2023) for a systematic treatment of this penalization framework. It is straightforward to verify that when the tuning parameters \{\lambda^{[t]}\}_{t=1}^{T} are set to zero, Algorithm 1 reduces to the standard EM algorithm performed separately on the K tasks. That is, for each k=1:K, given the parameter estimate from the previous step \widehat{\bm{\theta}}^{(k)[t-1]}=(\widehat{w}^{(k)[t-1]},\widehat{\bm{\beta}}^{(k)[t-1]},\widehat{\delta}^{(k)[t-1]}), we update \widehat{w}^{(k)[t]}, \widehat{\bm{\mu}}^{(k)[t]}_{1}, \widehat{\bm{\mu}}^{(k)[t]}_{2}, \widehat{\bm{\Sigma}}^{(k)[t]}, and \widehat{\delta}^{(k)[t]} as in Algorithm 1, and update \widehat{\bm{\beta}}^{(k)[t]} via

\widehat{\bm{\beta}}^{(k)[t]}=(\widehat{\bm{\Sigma}}^{(k)[t]})^{-1}(\widehat{\bm{\mu}}^{(k)[t]}_{2}-\widehat{\bm{\mu}}^{(k)[t]}_{1}). \quad (12)

For the maximum number of iteration rounds T, our theory will show that T\gtrsim\log(\max_{k=1:K}n_{k}) is sufficient to reach the desired statistical error rates. In practice, we can terminate the iteration when the change of estimates within two successive rounds falls below some pre-set small tolerance level. We discuss the initialization in detail in Sections 2.3 and 2.4.
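To illustrate what Algorithm 1 reduces to when the penalty is switched off, here is a sketch (our own code, not the authors' implementation) of one unpenalized EM iteration for a single task, following Steps 5-7 and the update (12).

import numpy as np

def em_step(z, w, beta, delta):
    """One unpenalized EM iteration for a single task (Steps 5-7 and update (12))."""
    n = z.shape[0]
    # Posterior gamma_theta(z_i) as in (11), written in a numerically equivalent form
    g = 1.0 / (1.0 + (1.0 - w) / w * np.exp(-(z @ beta - delta)))
    w_new = g.mean()                                                  # Step 5
    mu1 = ((1.0 - g)[:, None] * z).sum(axis=0) / (n * (1.0 - w_new))  # Step 6
    mu2 = (g[:, None] * z).sum(axis=0) / (n * w_new)
    r1, r2 = z - mu1, z - mu2                                         # Step 7: responsibility-weighted covariance
    Sigma = (r1.T @ ((1.0 - g)[:, None] * r1) + r2.T @ (g[:, None] * r2)) / n
    beta_new = np.linalg.solve(Sigma, mu2 - mu1)                      # update (12): beta = Sigma^{-1}(mu2 - mu1)
    delta_new = beta_new @ (mu1 + mu2) / 2.0
    return w_new, beta_new, delta_new, mu1, mu2, Sigma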

2.3 Theory

In this section, we develop statistical theories for our proposed procedure MTL-GMM (see Algorithm 1). As mentioned in Section 2.1, we are interested in the performance of both parameter estimation and clustering, although the latter is the main focus and motivation. First, we impose conditions in the following assumption set.

Assumption 1.

Denote \Delta^{(k)}=\sqrt{(\bm{\mu}^{(k)*}_{1}-\bm{\mu}^{(k)*}_{2})^{\top}(\bm{\Sigma}^{(k)*})^{-1}(\bm{\mu}^{(k)*}_{1}-\bm{\mu}^{(k)*}_{2})} for k\in S. The quantity \Delta^{(k)} is the Mahalanobis distance between \bm{\mu}^{(k)*}_{1} and \bm{\mu}^{(k)*}_{2} with covariance matrix \bm{\Sigma}^{(k)*}, and can be viewed as the signal-to-noise ratio (SNR) of the k-th task (Anderson, 1958). Suppose the following conditions hold:

  1. (i)

    n_{S}=\sum_{k\in S}n_{k}\geq C_{1}|S|\max_{k=1:K}n_{k} with a constant C_{1}\in(0,1];

  2. (ii)

    \min_{k\in S}n_{k}\geq C_{2}(p+\log K) with some constant C_{2}>0;

  3. (iii)

    Either of the following two conditions holds with some constant C_{3}>0:

    1. (a)

      \max_{k\in S}\big(\|\widehat{\bm{\beta}}^{(k)[0]}-\bm{\beta}^{(k)*}\|_{2}\vee\|\widehat{\bm{\mu}}^{(k)[0]}_{1}-\bm{\mu}^{(k)*}_{1}\|_{2}\vee\|\widehat{\bm{\mu}}^{(k)[0]}_{2}-\bm{\mu}^{(k)*}_{2}\|_{2}\big)\leq C_{3}\min_{k\in S}\Delta^{(k)} and \max_{k\in S}|\widehat{w}^{(k)[0]}-w^{(k)*}|\leq c_{w}/2;

    2. (b)

      \max_{k\in S}\big(\|\widehat{\bm{\beta}}^{(k)[0]}+\bm{\beta}^{(k)*}\|_{2}\vee\|\widehat{\bm{\mu}}^{(k)[0]}_{1}-\bm{\mu}^{(k)*}_{2}\|_{2}\vee\|\widehat{\bm{\mu}}^{(k)[0]}_{2}-\bm{\mu}^{(k)*}_{1}\|_{2}\big)\leq C_{3}\min_{k\in S}\Delta^{(k)} and \max_{k\in S}|1-\widehat{w}^{(k)[0]}-w^{(k)*}|\leq c_{w}/2.

  4. (iv)

    \min_{k\in S}\Delta^{(k)}\geq C_{4} with some constant C_{4}>0.

Remark 1.

These are common and mild conditions on the sample sizes, the initialization, and the signal-to-noise ratios of the GMMs. Condition (i) requires the maximum sample size over all tasks not to be much larger than the average sample size of tasks in S. Similar conditions can be found in Duan and Wang (2023). Condition (ii) is a sample-size requirement for tasks in S. The usual condition for low-dimensional single-task learning is n_{k}\gtrsim p (Cai et al., 2019); the additional \log K term arises from the simultaneous control of performance on all tasks in S, where S can be as large as 1:K. Condition (iii) requires that the initialization is not too far away from the truth, which is commonly assumed either in the analysis of the EM algorithm (Redner and Walker, 1984; Balakrishnan et al., 2017; Cai et al., 2019) or in other iterative procedures like the local estimation used in semi-parametric models (Carroll et al., 1997; Li and Liang, 2008) and the adaptive Lasso (Zou, 2006). The two possible forms in this condition are due to the fact that the binary GMM is only identifiable up to label permutation. Condition (iv) requires that the signal strength of the GMM (in terms of the Mahalanobis distance) is strong enough, which is usually assumed in the literature on likelihood-based methods for GMMs (Dasgupta and Schulman, 2013; Azizyan et al., 2013; Balakrishnan et al., 2017; Cai et al., 2019).

We first establish the rate of convergence for estimation. Recalling the parameter space \overline{\Theta}_{S}(h) in (9), let us denote the true parameter by

\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}=\{(w^{(k)*},\bm{\mu}^{(k)*}_{1},\bm{\mu}^{(k)*}_{2},\bm{\Sigma}^{(k)*})\}_{k\in S}\in\overline{\Theta}_{S}(h).

To better present the results for parameters related to the optimal discriminant rule (3), we further denote

\bm{\theta}^{(k)*}=(w^{(k)*},\bm{\beta}^{(k)*},\delta^{(k)*}),\quad\forall k\in S,

where \bm{\beta}^{(k)*}=(\bm{\Sigma}^{(k)*})^{-1}(\bm{\mu}^{(k)*}_{2}-\bm{\mu}^{(k)*}_{1}) and \delta^{(k)*}=\frac{1}{2}(\bm{\beta}^{(k)*})^{\top}(\bm{\mu}^{(k)*}_{1}+\bm{\mu}^{(k)*}_{2}). Note that \bm{\theta}^{(k)*} is a function of \overline{\bm{\theta}}^{(k)*}. For the estimators returned by MTL-GMM (see Algorithm 1), we are particularly interested in the following two error metrics:

d(\widehat{\bm{\theta}}^{(k)[T]},\bm{\theta}^{(k)*}):=\min\big\{|\widehat{w}^{(k)[T]}-w^{(k)*}|\vee\|\widehat{\bm{\beta}}^{(k)[T]}-\bm{\beta}^{(k)*}\|_{2}\vee|\widehat{\delta}^{(k)[T]}-\delta^{(k)*}|,\ |1-\widehat{w}^{(k)[T]}-w^{(k)*}|\vee\|\widehat{\bm{\beta}}^{(k)[T]}+\bm{\beta}^{(k)*}\|_{2}\vee|\widehat{\delta}^{(k)[T]}+\delta^{(k)*}|\big\}, \quad (13)
\Big(\min_{\pi:[2]\rightarrow[2]}\max_{r=1:2}\|\widehat{\bm{\mu}}^{(k)[T]}_{r}-\bm{\mu}^{(k)*}_{\pi(r)}\|_{2}\Big)\vee\|\widehat{\bm{\Sigma}}^{(k)[T]}-\bm{\Sigma}^{(k)*}\|_{2}, \quad (15)

where \pi:[2]\rightarrow[2] is a permutation on \{1,2\}. Again, we take the minimum above because the binary GMM is identifiable only up to label permutation. The first error metric d(\widehat{\bm{\theta}}^{(k)[T]},\bm{\theta}^{(k)*}) involves the error for the discriminant coefficients and is closely related to the clustering performance. It reveals how well our method utilizes the similarity structure in multi-task learning. The second error metric concerns the mean vectors and the covariance matrix. As discussed in Section 2.1, we should not expect it to improve over single-task learning, as these parameters are not necessarily similar.
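The permutation-aware error metrics (13) and (15) can be evaluated numerically as in the following sketch (our own helper functions, shown only to make the definitions concrete).

import numpy as np

def d_metric(w_hat, beta_hat, delta_hat, w, beta, delta):
    """Error metric (13): minimum over the two label permutations of the max componentwise error."""
    err_id = max(abs(w_hat - w), np.linalg.norm(beta_hat - beta), abs(delta_hat - delta))
    err_fl = max(abs(1 - w_hat - w), np.linalg.norm(beta_hat + beta), abs(delta_hat + delta))
    return min(err_id, err_fl)

def mean_cov_metric(mu_hat, mu, Sigma_hat, Sigma):
    """Error metric (15): permutation-minimized mean error, maximized with the spectral covariance error."""
    mean_err = min(max(np.linalg.norm(mu_hat[0] - mu[0]), np.linalg.norm(mu_hat[1] - mu[1])),
                   max(np.linalg.norm(mu_hat[0] - mu[1]), np.linalg.norm(mu_hat[1] - mu[0])))
    return max(mean_err, np.linalg.norm(Sigma_hat - Sigma, ord=2))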

We are ready to present upper bounds for the estimation error of MTL-GMM. Recall that \overline{\Theta}_{S}(h) and \mathbb{Q}_{S} are the parameter space and the probability measure used in Section 2.1 to describe the data distributions for tasks in S and S^{c}, respectively.

Theorem 1.

(Upper bounds of the estimation error of GMM parameters for MTL-GMM) Suppose Assumption 1 holds for some S with |S|\geq s and \epsilon\coloneqq\frac{K-s}{K}<1/3. Let \lambda^{[0]}\geq C_{1}\max_{k=1:K}\sqrt{n_{k}}, C_{\lambda}\geq C_{1}, and \kappa>C_{2} with some constants C_{1}>0, C_{2}\in(0,1) (C_{1} and C_{2} depend on the constants M, c_{w}, c_{\bm{\Sigma}}, etc.). Then there exists a constant C_{3}>0 such that for any \{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}=\{(w^{(k)*},\bm{\mu}^{(k)*}_{1},\bm{\mu}^{(k)*}_{2},\bm{\Sigma}^{(k)*})\}_{k\in S}\in\overline{\Theta}_{S}(h) and any probability measure \mathbb{Q}_{S} on (\mathbb{R}^{p})^{\otimes n_{S^{c}}}, with probability at least 1-C_{3}K^{-1}, the following hold for all k\in S:

d(\widehat{\bm{\theta}}^{(k)[T]},\bm{\theta}^{(k)*})\lesssim\sqrt{\frac{p}{n_{S}}}+\sqrt{\frac{\log K}{n_{k}}}+h\wedge\sqrt{\frac{p+\log K}{n_{k}}}+\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}}+T^{2}(\kappa^{\prime})^{T}, \quad (16)
\Big(\min_{\pi:[2]\rightarrow[2]}\max_{r=1:2}\|\widehat{\bm{\mu}}^{(k)[T]}_{r}-\bm{\mu}^{(k)*}_{\pi(r)}\|_{2}\Big)\vee\|\widehat{\bm{\Sigma}}^{(k)[T]}-\bm{\Sigma}^{(k)*}\|_{2}\lesssim\sqrt{\frac{p+\log K}{n_{k}}}+T^{2}(\kappa^{\prime})^{T}, \quad (17)

where \kappa^{\prime}\in(0,1) is some constant and n_{S}=\sum_{k\in S}n_{k}. When T\geq C\log(\max_{k=1:K}n_{k}) with a large constant C>0, the last term on the right-hand side is dominated by the other terms in both inequalities.

The upper bound of \big(\min_{\pi:[2]\rightarrow[2]}\max_{r=1:2}\|\widehat{\bm{\mu}}^{(k)[T]}_{r}-\bm{\mu}^{(k)*}_{\pi(r)}\|_{2}\big)\vee\|\widehat{\bm{\Sigma}}^{(k)[T]}-\bm{\Sigma}^{(k)*}\|_{2} contains two parts. The first part is comparable to the single-task learning rate (Cai et al., 2019) (up to a \sqrt{\log K} term due to the simultaneous control over all tasks in S), and the second part characterizes the geometric convergence of the iterates. As expected, since \bm{\mu}^{(k)*}_{1}, \bm{\mu}^{(k)*}_{2}, \bm{\Sigma}^{(k)*} for k\in S are not necessarily similar, an improved error rate over single-task learning is generally impossible. The upper bound for d(\widehat{\bm{\theta}}^{(k)[T]},\bm{\theta}^{(k)*}) is directly related to the clustering performance of our method, so we provide a detailed discussion of it after presenting the clustering result in the next theorem.

As introduced in Section 1.1, using the estimate \widehat{\bm{\theta}}^{(k)[T]}=(\widehat{w}^{(k)[T]},\widehat{\bm{\beta}}^{(k)[T]},\widehat{\delta}^{(k)[T]}) from Algorithm 1, we can construct a classifier for task k as

\widehat{\mathcal{C}}^{(k)[T]}(\bm{z})=\begin{cases}1,&\text{if }(\widehat{\bm{\beta}}^{(k)[T]})^{\top}\bm{z}-\widehat{\delta}^{(k)[T]}\leq\log\left(\frac{1-\widehat{w}^{(k)[T]}}{\widehat{w}^{(k)[T]}}\right);\\ 2,&\text{otherwise}.\end{cases} \quad (18)

Recall that for a clustering method \mathcal{C}:\mathbb{R}^{p}\rightarrow\{1,2\}, its mis-clustering error rate under the GMM with parameter \overline{\bm{\theta}}=(w,\bm{\mu}_{1},\bm{\mu}_{2},\bm{\Sigma}) is

R_{\overline{\bm{\theta}}}(\mathcal{C})=\min_{\pi:[2]\rightarrow[2]}\mathbb{P}_{\overline{\bm{\theta}}}\big(\mathcal{C}(Z^{\textup{new}})\neq\pi(Y^{\textup{new}})\big), \quad (19)

where Z^{\textup{new}}\sim(1-w)\mathcal{N}(\bm{\mu}_{1},\bm{\Sigma})+w\mathcal{N}(\bm{\mu}_{2},\bm{\Sigma}) is a future observation associated with the label Y^{\textup{new}}, independent of \mathcal{C}; the probability \mathbb{P}_{\overline{\bm{\theta}}} is w.r.t. (Z^{\textup{new}},Y^{\textup{new}}), and the minimum is taken over the two permutation functions on \{1,2\}. Denote by \mathcal{C}_{\overline{\bm{\theta}}} the Bayes classifier that minimizes R_{\overline{\bm{\theta}}}(\mathcal{C}). In the following theorem, we obtain the upper bound of the excess mis-clustering error of \widehat{\mathcal{C}}^{(k)[T]} for k\in S.
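The excess mis-clustering error appearing in the next theorem can be approximated by Monte Carlo, as in the sketch below (our own illustrative helper; it assumes the true parameters are available for evaluation and that classify_hat maps observations to labels in {1, 2}).

import numpy as np

def excess_error(classify_hat, w, mu1, mu2, Sigma, n_mc=100_000, seed=0):
    """Monte Carlo estimate of R(C_hat) - R(Bayes) under the GMM (w, mu1, mu2, Sigma)."""
    rng = np.random.default_rng(seed)
    y = rng.binomial(1, w, size=n_mc) + 1                       # labels in {1, 2}
    z = np.where((y == 1)[:, None], mu1, mu2) + \
        rng.multivariate_normal(np.zeros(len(mu1)), Sigma, size=n_mc)

    beta = np.linalg.solve(Sigma, mu2 - mu1)                    # Bayes rule (3)
    delta = beta @ (mu1 + mu2) / 2
    y_bayes = np.where(z @ beta - delta <= np.log((1 - w) / w), 1, 2)

    def rate(pred):                                             # error rate (19), minimized over permutations
        return min(np.mean(pred != y), np.mean((3 - pred) != y))

    return rate(classify_hat(z)) - rate(y_bayes)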

Theorem 2.

(Upper bound of the excess mis-clustering error for MTL-GMM) Suppose the same conditions as in Theorem 1 hold. Then there exists a constant C_{1}>0 such that for any \{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S}(h) and any probability measure \mathbb{Q}_{S} on (\mathbb{R}^{p})^{\otimes n_{S^{c}}}, with probability at least 1-C_{1}K^{-1}, the following holds for all k\in S:

R_{\overline{\bm{\theta}}^{(k)*}}(\widehat{\mathcal{C}}^{(k)[T]})-R_{\overline{\bm{\theta}}^{(k)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(k)*}})\lesssim d^{2}(\widehat{\bm{\theta}}^{(k)[T]},\bm{\theta}^{(k)*})\lesssim\underbrace{\frac{p}{n_{S}}}_{\rm (I)}+\underbrace{\frac{\log K}{n_{k}}}_{\rm (II)}+\underbrace{h^{2}\wedge\frac{p+\log K}{n_{k}}}_{\rm (III)}+\underbrace{\epsilon^{2}\,\frac{p+\log K}{\max_{k=1:K}n_{k}}}_{\rm (IV)}+\underbrace{T^{4}(\kappa^{\prime})^{2T}}_{\rm (V)}, \quad (20)

with some \kappa^{\prime}\in(0,1). When T\geq C\log(\max_{k=1:K}n_{k}) with a large constant C>0, the last term on the right-hand side is dominated by the second term.

The upper bounds of d(\widehat{\bm{\theta}}^{(k)[T]},\bm{\theta}^{(k)*}) in Theorem 1 and of R_{\overline{\bm{\theta}}^{(k)*}}(\widehat{\mathcal{C}}^{(k)[T]})-R_{\overline{\bm{\theta}}^{(k)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(k)*}}) in Theorem 2 consist of five parts with a one-to-one correspondence, so it suffices to discuss the bound on R_{\overline{\bm{\theta}}^{(k)*}}(\widehat{\mathcal{C}}^{(k)[T]})-R_{\overline{\bm{\theta}}^{(k)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(k)*}}). Part (I) represents the “oracle rate”, which can be achieved when all tasks in S are identical; this is the best rate one can hope for. Part (II) is a dimension-free error caused by estimating the scalar parameters \delta^{(k)*} and w^{(k)*} that appear in the optimal discriminant rule. Part (III) involves h, which measures the degree of similarity among the tasks in S. When these tasks are very similar, h is small and contributes a small term to the upper bound. Nicely, even when h is large, this term becomes \frac{p+\log K}{n_{k}} and is still comparable to the minimax error rate of single-task learning \mathcal{O}_{\mathbb{P}}(p/n_{k}) (e.g., Theorems 4.1 and 4.2 in Cai et al. (2019)); the extra \log K term again comes from the simultaneous control over all tasks in S. Part (IV) quantifies the influence of the outlier tasks in S^{c}. When there are more outlier tasks, \epsilon increases and the bound becomes worse. On the other hand, as long as \epsilon is small enough for this term to be dominated by any other part, the error induced by the outlier tasks becomes negligible. Given that data from outlier tasks can be arbitrarily contaminated, we conclude that our method is robust against a fraction of outlier tasks from arbitrary sources. The term in Part (V) decreases geometrically in the iteration number T, implying that the iterates in Algorithm 1 converge geometrically to a ball whose radius is determined by the errors in Parts (I)-(IV).

Having explained each part of the upper bound, we now compare it with the convergence rate \mathcal{O}_{\mathbb{P}}(\frac{p+\log K}{n_{k}}) of single-task learning (including the \log K term here since we consider all tasks simultaneously) to reveal how our method performs. With a quick inspection, we can conclude the following:

  • The rate of the upper bound is never larger than \frac{p+\log K}{n_{k}}. So, in terms of the rate of convergence, our method MTL-GMM performs at least as well as single-task learning, regardless of the similarity level h and the outlier task fraction \epsilon.

  • When n_{S}\gg n_{k} (large total sample size for tasks in S), p increases with n_{k} (diverging dimension), h\ll\sqrt{\frac{p+\log K}{n_{k}}} (sufficient similarity between tasks in S), and \epsilon\ll\sqrt{(\max_{k=1:K}n_{k})/n_{k}} (small fraction of outlier tasks), MTL-GMM attains a faster excess mis-clustering error rate and improves over single-task learning; a numerical illustration is given right after this list.
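As a purely illustrative instantiation of this comparison (our own numbers, ignoring constants): take K=20 tasks, all in S (so \epsilon=0), with n_{k}=500, p=100, and h\approx 0. Then

\frac{p+\log K}{n_{k}}=\frac{100+\log 20}{500}\approx 0.21, \qquad\text{whereas}\qquad \frac{p}{n_{S}}+\frac{\log K}{n_{k}}+h^{2}\wedge\frac{p+\log K}{n_{k}}\approx\frac{100}{10000}+\frac{3}{500}+0\approx 0.016,

so the multi-task bound in Theorem 2 is roughly an order of magnitude smaller than the single-task rate in this regime.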

The preceding discussions on the upper bounds have demonstrated the superiority of our method. But can we do better? To further evaluate the upper bounds of our method, we next derive complementary minimax lower bounds for both estimation error and excess mis-clustering error. We will show that our method is (nearly) minimax rate optimal in a broad range of regimes.

Theorem 3.

(Lower bounds of the estimation error of GMM parameters in multi-task learning) Suppose \epsilon=\frac{K-s}{K}<1/3. Suppose there exists a subset S with |S|\geq s such that \min_{k\in S}n_{k}\geq C_{1}(p+\log K) and \min_{k\in S}\Delta^{(k)}\geq C_{2}, where C_{1},C_{2}>0 are some constants. Then

\inf_{\{\widehat{\bm{\theta}}^{(k)}\}_{k=1}^{K}}\sup_{S:|S|\geq s}\sup_{\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S}(h),\,\mathbb{Q}_{S}}\mathbb{P}\Bigg(\bigcup_{k\in S}\Bigg\{d(\widehat{\bm{\theta}}^{(k)},\bm{\theta}^{(k)*})\gtrsim\sqrt{\frac{p}{n_{S}}}+\sqrt{\frac{\log K}{n_{k}}}+h\wedge\sqrt{\frac{p+\log K}{n_{k}}}+\frac{\epsilon}{\sqrt{\max_{k=1:K}n_{k}}}\Bigg\}\Bigg)\geq\frac{1}{10}, \quad (22)
\inf_{\{\widehat{\bm{\mu}}^{(k)}_{1},\widehat{\bm{\mu}}^{(k)}_{2},\widehat{\bm{\Sigma}}^{(k)}\}_{k=1}^{K}}\sup_{S:|S|\geq s}\sup_{\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S}(h),\,\mathbb{Q}_{S}}\mathbb{P}\Bigg(\bigcup_{k\in S}\Bigg\{\Big(\min_{\pi:[2]\rightarrow[2]}\max_{r=1:2}\|\widehat{\bm{\mu}}^{(k)}_{r}-\bm{\mu}^{(k)*}_{\pi(r)}\|_{2}\Big)\vee\|\widehat{\bm{\Sigma}}^{(k)}-\bm{\Sigma}^{(k)*}\|_{2}\gtrsim\sqrt{\frac{p+\log K}{n_{k}}}\Bigg\}\Bigg)\geq\frac{1}{10}. \quad (24)
Theorem 4.

(Lower bound of the excess mis-clustering error in multi-task learning) Suppose the same conditions in Theorem 3 hold. Then

\inf_{\{\widehat{\mathcal{C}}^{(k)}\}_{k=1}^{K}}\sup_{S:|S|\geq s}\sup_{\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S}(h),\,\mathbb{Q}_{S}}\mathbb{P}\Bigg(\bigcup_{k\in S}\Bigg\{R_{\overline{\bm{\theta}}^{(k)*}}(\widehat{\mathcal{C}}^{(k)})-R_{\overline{\bm{\theta}}^{(k)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(k)*}})\gtrsim\frac{p}{n_{S}}+\frac{\log K}{n_{k}}+h^{2}\wedge\frac{p+\log K}{n_{k}}+\frac{\epsilon^{2}}{\max_{k=1:K}n_{k}}\Bigg\}\Bigg)\geq\frac{1}{10}. \quad (26)

Comparing the upper and lower bounds in Theorems 1-4, we make several remarks:

  • Regarding the estimation of the mean vectors \{\bm{\mu}^{(k)*}_{1},\bm{\mu}^{(k)*}_{2}\}_{k\in S} and covariance matrices \{\bm{\Sigma}^{(k)*}\}_{k\in S}, the upper and lower bounds match; hence, our method is minimax rate optimal.

  • For the estimation error d(\widehat{\bm{\theta}}^{(k)},\bm{\theta}^{(k)*}) and the excess mis-clustering error R_{\overline{\bm{\theta}}^{(k)*}}(\widehat{\mathcal{C}}^{(k)[T]})-R_{\overline{\bm{\theta}}^{(k)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(k)*}}) with k\in S, the first three terms in the upper and lower bounds match. Only the term involving \epsilon in the lower bound differs from that in the upper bound, by a factor of \sqrt{p+\log K} or p+\log K. As a result, in the classical low-dimensional regime where p is bounded, the upper and lower bounds match (up to a logarithmic factor). Therefore, our method is (nearly) minimax rate optimal for estimating \{\bm{\theta}^{(k)*}\}_{k\in S} and for clustering in such a classical regime.

  • When the dimension p diverges, there might exist a non-negligible gap between the upper and lower bounds for d(\widehat{\bm{\theta}}^{(k)},\bm{\theta}^{(k)*}) and R_{\overline{\bm{\theta}}^{(k)*}}(\widehat{\mathcal{C}}^{(k)[T]})-R_{\overline{\bm{\theta}}^{(k)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(k)*}}) with k\in S. Nevertheless, this only occurs when the fraction of outlier tasks \epsilon is above the threshold \sqrt{\frac{\max_{k=1:K}n_{k}}{p+\log K}\big(h^{2}\vee\frac{\log K}{n_{k}}\vee\frac{p}{n_{S}}\big)}. Below this threshold, our method remains minimax rate optimal even though p is unbounded.

  • Does the gap, when it exists, arise from the upper bound or the lower bound? We believe that it is the upper bound that sometimes fails to be sharp. As can be seen from the proof of Theorem 1, the term \epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}} is due to the estimation of the “center” parameters in Algorithm 1. Recent advances in robust statistics (Chen et al., 2018) have shown that estimators based on statistical depth functions, such as Tukey’s depth function (Tukey, 1975), can achieve the minimax optimal rate under Huber’s \epsilon-contamination model for location and covariance estimation. It might be possible to utilize depth functions to estimate the “center” parameters in our problem and remove the factor \sqrt{p} from the upper bound. We leave a rigorous development of optimal robustness as an interesting direction for future research. On the other hand, such a statistical improvement may come with expensive computation, as depth-function-based estimation typically requires solving a challenging non-convex optimization problem.

2.4 Initialization and cluster alignment

As specified by Condition (iii) in Assumption 1, our proposed learning procedure requires that, for each task in S, the initial values of the GMM parameter estimates lie within a distance of SNR order from the ground truth. This can be satisfied by the method of moments proposed in Ge et al. (2015). In practice, a natural initialization method is to run the standard EM algorithm or other common clustering methods like k-means on each task and use the corresponding estimates as the initial values. We adopted the standard EM algorithm in our numerical experiments, and the numerical results in Section 3 and the supplementary materials show that this practical initialization works quite well. However, in the context of multi-task learning, Condition (iii) further requires a correct alignment of those good initializations across tasks, owing to the non-identifiability of GMMs. We discuss the alignment issue in detail in Section 2.4.1 and propose two algorithms to resolve it in Section 2.4.2.

2.4.1 The alignment issue

Recall that Section 2.1 introduces the binary GMM with parameters (w^{(k)*},\bm{\mu}^{(k)*}_{1},\bm{\mu}^{(k)*}_{2},\bm{\Sigma}^{(k)*}) for each task k\in S. Because the two sets of parameter values \{(w,\bm{u},\bm{v},\bm{\Sigma}),(1-w,\bm{v},\bm{u},\bm{\Sigma})\} for (w^{(k)*},\bm{\mu}^{(k)*}_{1},\bm{\mu}^{(k)*}_{2},\bm{\Sigma}^{(k)*}) index the same distribution, a good initialization can only be close to the truth up to a permutation of the two cluster labels, and the permutations in the initializations of different tasks could differ. Therefore, in light of the joint parameter space \overline{\Theta}_{S}(h) defined in (9) and Condition (iii) in Assumption 1, given initializations from different tasks, we may need to permute their cluster labels to feed a well-aligned initialization into Algorithm 1.

We further elaborate on the alignment issue using Algorithm 1. The penalization in Step 9 aims to push the estimators \widehat{\bm{\beta}}^{(k)[t]} with different k towards each other, which is expected to improve the performance thanks to the similarity among the underlying true parameters \{\bm{\beta}^{(k)*}\}_{k\in S}. However, due to the potential permutation of the two cluster labels, the vanilla single-task initializations (without alignment) cannot guarantee that the estimators \{\widehat{\bm{\beta}}^{(k)[t]}\}_{k\in S} at each iteration are all estimating the corresponding \bm{\beta}^{(k)*}'s (some may instead estimate -\bm{\beta}^{(k)*}).

Figure 1: Examples of well-aligned (left) and badly-aligned (right) initializations.

Figure 1 illustrates the alignment issue in the case of two tasks. The left-hand-side situation is ideal, where \widehat{\bm{\beta}}^{(1)[0]} and \widehat{\bm{\beta}}^{(2)[0]} are estimates of \bm{\beta}^{(1)*} and \bm{\beta}^{(2)*} (which are similar). The right-hand-side situation is problematic because \widehat{\bm{\beta}}^{(1)[0]} and \widehat{\bm{\beta}}^{(2)[0]} are estimates of \bm{\beta}^{(1)*} and -\bm{\beta}^{(2)*} (which are not similar). Therefore, in practice, after obtaining the initializations from each task, it is necessary to align their cluster labels to ensure that estimators of similar parameters are correctly put together in the penalization framework of Algorithm 1. We formalize the problem and provide two solutions in the next subsection.

2.4.2 Two alignment algorithms

Suppose \{\widehat{\bm{\beta}}^{(k)[0]}\}_{k=1}^{K} are the initial estimates of the discriminant coefficients, with potentially bad alignment for k\in S. Note that a good initialization and alignment is not required (in fact, it is not even well defined) for the outlier tasks in S^{c}, because they can be from arbitrary distributions. However, since S is unknown, we have to address the alignment issue for tasks in S based on initial estimates from all the tasks. For binary GMMs, each alignment of \{\widehat{\bm{\beta}}^{(k)[0]}\}_{k=1}^{K} can be represented by a K-dimensional Rademacher vector \bm{r}\in\{\pm 1\}^{K}. Define the ideal alignment as r^{*}_{k}=\operatorname*{arg\,min}_{r_{k}=\pm 1}\|r_{k}\widehat{\bm{\beta}}^{(k)[0]}-\bm{\beta}^{(k)*}\|_{2}, k\in S. The goal is to recover the well-aligned initializers \{r_{k}^{*}\widehat{\bm{\beta}}^{(k)[0]}\}_{k\in S} from the initial estimates \{\widehat{\bm{\beta}}^{(k)[0]}\}_{k=1}^{K} (equivalently, to recover \{r^{*}_{k}\}_{k\in S}), which can then be fed into Algorithm 1. Once \{\widehat{\bm{\beta}}^{(k)[0]}\}_{k\in S} are well aligned, the other initial estimates in Algorithm 1 will automatically be well aligned.

In the following, we will introduce two alignment algorithms. The first one is the “exhaustive search” method (Algorithm 2), where we search among all possible alignments to find the best one. The second one is the “greedy search” method (Algorithm 3), where we flip the sign of 𝜷^(k)[0]\widehat{\bm{\beta}}^{(k)[0]} in a greedy way to recover {rk𝜷^(k)[0]}kS\{r_{k}^{*}\widehat{\bm{\beta}}^{(k)[0]}\}_{k\in S}. Both methods are proved to recover {rk}kS\{r_{k}^{*}\}_{k\in S} under mild conditions. The conditions required by the “exhaustive search” method are slightly weaker than those required by the “greedy search” method. As for computational complexity, the latter enjoys a linear time complexity 𝒪(K)\mathcal{O}(K), while the former suffers from an exponential time complexity 𝒪(2K)\mathcal{O}(2^{K}) due to optimization over all possible 2K2^{K} alignments.

To this end, for a given alignment 𝒓={rk}k=1K{±1}K\bm{r}=\{r_{k}\}_{k=1}^{K}\in\{\pm 1\}^{K} with the correspondingly aligned estimates {rk𝜷^(k)[0]}k=1K\{r_{k}\widehat{\bm{\beta}}^{(k)[0]}\}_{k=1}^{K}, define its alignment score as

score(𝒓)=1k1k2Krk1𝜷^(k1)[0]rk2𝜷^(k2)[0]2.\text{score}(\bm{r})=\sum_{1\leq k_{1}\neq k_{2}\leq K}\|r_{k_{1}}\widehat{\bm{\beta}}^{(k_{1})[0]}-r_{k_{2}}\widehat{\bm{\beta}}^{(k_{2})[0]}\|_{2}. (28)

The intuition is that as long as the initializations {𝜷^(k)[0]}kS\{\widehat{\bm{\beta}}^{(k)[0]}\}_{k\in S} are close to the ground truth, a smaller score indicates less difference among {rk𝜷^(k)[0]}kS\{r_{k}\widehat{\bm{\beta}}^{(k)[0]}\}_{k\in S}, which implies a better alignment. The score can thus be used to evaluate the quality of an alignment. Note that the score is defined in a symmetric way, that is, score(𝒓)=score(𝒓)\text{score}(\bm{r})=\text{score}(-\bm{r}). The exhaustive search algorithm is presented in Algorithm 2, where the scores of all alignments are calculated and an alignment that minimizes the score is output. Since the score is symmetric, at least two alignments attain the minimum score; the algorithm can arbitrarily choose and output one of them.

Input: Initialization {𝜷^(k)[0]}k=1K\{\widehat{\bm{\beta}}^{(k)[0]}\}_{k=1}^{K}
1 𝒓^argmin𝒓{±1}Kscore(𝒓)\widehat{\bm{r}}\leftarrow\operatorname*{arg\,min}_{\bm{r}\in\{\pm 1\}^{K}}\text{score}(\bm{r})
Output: 𝒓^\widehat{\bm{r}}
Algorithm 2 Exhaustive search for the alignment
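To make the procedure concrete, the following is a minimal R sketch of the alignment score (28) and of Algorithm 2 (R is the language used for our numerical studies in Section 3). The input beta0, a p x K matrix whose k-th column is the initial discriminant coefficient estimate of task k, is a hypothetical data structure introduced only for illustration.

# A minimal R sketch of the alignment score in (28) and of Algorithm 2.
alignment_score <- function(beta0, r) {
  K <- ncol(beta0)
  aligned <- sweep(beta0, 2, r, "*")          # flip the sign of column k by r[k]
  total <- 0
  for (k1 in 1:(K - 1)) {
    for (k2 in (k1 + 1):K) {
      total <- total + sqrt(sum((aligned[, k1] - aligned[, k2])^2))
    }
  }
  2 * total                                   # (28) sums over ordered pairs k1 != k2
}

exhaustive_alignment <- function(beta0) {
  K <- ncol(beta0)
  signs <- as.matrix(do.call(expand.grid, rep(list(c(1, -1)), K)))  # all 2^K alignments
  scores <- apply(signs, 1, function(r) alignment_score(beta0, r))
  as.numeric(signs[which.min(scores), ])      # one minimizer; its negation attains the same score
}

Because the search runs over all 2K2^{K} sign vectors, this sketch is only practical for a moderate number of tasks, consistent with the 𝒪(2K)\mathcal{O}(2^{K}) complexity discussed above.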

The following theorem reveals that the exhaustive search algorithm can successfully find the ideal alignment under mild conditions.

Theorem 5 (Alignment correctness for Algorithm 2).

Assume that

  1. (i)

    ϵ<13\epsilon<\frac{1}{3};

  2. (ii)

    minkS𝜷(k)24(1ϵ)13ϵh+2(2ϵ)13ϵmaxkS(𝜷^(k)[0]𝜷(k)2𝜷^(k)[0]+𝜷(k)2)\min_{k\in S}\|\bm{\beta}^{(k)*}\|_{2}\geq\frac{4(1-\epsilon)}{1-3\epsilon}h+\frac{2(2-\epsilon)}{1-3\epsilon}\max_{k\in S}\big{(}\|\widehat{\bm{\beta}}^{(k)[0]}-\bm{\beta}^{(k)*}\|_{2}\wedge\|\widehat{\bm{\beta}}^{(k)[0]}+\bm{\beta}^{(k)*}\|_{2}\big{)},

where ϵ=K|S|K\epsilon=\frac{K-|S|}{K} is the outlier task proportion introduced in Theorem 1, and hh is the similarity level of the discriminant coefficients defined in (LABEL:eq:_parameter_space_mtl). Then the output of Algorithm 2 satisfies

r^k=rk for all kS or r^k=rk for all kS\widehat{r}_{k}=r_{k}^{*}\text{ for all }k\in S\quad\text{ or }\quad\widehat{r}_{k}=-r_{k}^{*}\text{ for all }k\in S (29)
Remark 2.

The conditions imposed in Theorem 5 are no stronger than conditions required by Theorem 1. First of all, Condition (i) is also required in Theorem 1. Moreover, from the definition of hh in (LABEL:eq:_parameter_space_mtl), it is bounded by a constant. This together with Conditions (iii) and (iv) in Assumption 1 implies Condition (ii) in Theorem 5.

Remark 3.

With Theorem 5, we can relax the original Condition (iii) in Assumption 1 to the following condition:

For all kSk\in S, either of the following two conditions holds with a sufficiently small constant C3C_{3}:

  1. (a)

    𝜷^(k)[0]𝜷(k)2𝝁^1(k)[0]𝝁1(k)2𝝁^2(k)[0]𝝁2(k)2C3minkSΔ(k)\|\widehat{\bm{\beta}}^{(k)[0]}-\bm{\beta}^{(k)*}\|_{2}\vee\|\widehat{\bm{\mu}}^{(k)[0]}_{1}-\bm{\mu}^{(k)*}_{1}\|_{2}\vee\|\widehat{\bm{\mu}}^{(k)[0]}_{2}-\bm{\mu}^{(k)*}_{2}\|_{2}\leq C_{3}\min_{k\in S}\Delta^{(k)}, |w^(k)[0]w(k)|cw/2|\widehat{w}^{(k)[0]}-w^{(k)*}|\leq c_{w}/2;

  2. (b)

    𝜷^(k)[0]+𝜷(k)2𝝁^1(k)[0]𝝁2(k)2𝝁^2(k)[0]𝝁1(k)2C3minkSΔ(k)\|\widehat{\bm{\beta}}^{(k)[0]}+\bm{\beta}^{(k)*}\|_{2}\vee\|\widehat{\bm{\mu}}^{(k)[0]}_{1}-\bm{\mu}^{(k)*}_{2}\|_{2}\vee\|\widehat{\bm{\mu}}^{(k)[0]}_{2}-\bm{\mu}^{(k)*}_{1}\|_{2}\leq C_{3}\min_{k\in S}\Delta^{(k)}, |1w^(k)[0]w(k)|cw/2|1-\widehat{w}^{(k)[0]}-w^{(k)*}|\leq c_{w}/2.

In the relaxed version, the initialization for each task only needs to be good up to an arbitrary permutation, while in the original version, the initialization for each task needs to be good under the same permutation.

Next, we introduce the second alignment algorithm, the “greedy search” method, summarized in Algorithm 3. The main idea is to flip the sign of the discriminant coefficient estimates (equivalently, swap the two cluster labels) of the KK tasks in a sequential fashion and check whether the alignment score decreases. If it does, we keep the flipped alignment; otherwise, we revert to the alignment before the flip. Either way, we then proceed to the next task. A perhaps surprising fact about Algorithm 3 is that a single pass over all tasks suffices to recover the ideal alignment, which makes the algorithm computationally efficient.

Input: Initialization {𝜷^(k)[0]}k=1K\{\widehat{\bm{\beta}}^{(k)[0]}\}_{k=1}^{K}
1 𝒓^=(1,,1){±1}K\widehat{\bm{r}}=(1,\ldots,1)\in\{\pm 1\}^{K}
2 for k=1k=1 to KK do
3       𝒓~flip the sign of r^k\widetilde{\bm{r}}\leftarrow\text{flip the sign of }\widehat{r}_{k} in 𝒓^\widehat{\bm{r}}
4       if score(𝐫^)>score(𝐫~)\textup{score}(\widehat{\bm{r}})>\textup{score}(\widetilde{\bm{r}}) then
5             𝒓^𝒓~\widehat{\bm{r}}\leftarrow\widetilde{\bm{r}}
6            
7       end if
8      
9 end for
Output: 𝒓^\widehat{\bm{r}}
Algorithm 3 Greedy search for the alignment
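The following is a matching R sketch of the single-pass greedy search in Algorithm 3. It reuses the hypothetical alignment_score() function and the p x K input matrix beta0 from the sketch after Algorithm 2.

# A minimal R sketch of Algorithm 3: one greedy pass over the K tasks,
# keeping a sign flip whenever it strictly decreases the alignment score.
greedy_alignment <- function(beta0) {
  K <- ncol(beta0)
  r_hat <- rep(1, K)
  for (k in 1:K) {
    r_tilde <- r_hat
    r_tilde[k] <- -r_tilde[k]               # flip the sign for task k
    if (alignment_score(beta0, r_hat) > alignment_score(beta0, r_tilde)) {
      r_hat <- r_tilde                      # keep the flip if the score drops
    }
  }
  r_hat
}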

To help state the theory of the greedy search algorithm, we define the “mismatch proportion” of {𝜷^(k)[0]}kS\{\widehat{\bm{\beta}}^{(k)[0]}\}_{k\in S} as

pa=min{#{kS:rk=1},#{kS:rk=1}}/|S|p_{a}=\min\{\#\{k\in S:r_{k}^{*}=1\},\#\{k\in S:r_{k}^{*}=-1\}\}/|S| (30)

Intuitively, pap_{a} represents the level of mismatch between the initial alignment and the ideal one. It is straightforward to verify that pa[0,1/2]p_{a}\in[0,1/2]; pa=0p_{a}=0 means the initial alignment equals the ideal one, while pa=1/2p_{a}=1/2 (or |S|12|S|\frac{|S|-1}{2|S|} when |S||S| is odd) represents the “worst” alignment, where almost half of the tasks are badly aligned. The smaller pap_{a} is, the better aligned {𝜷^(k)[0]}kS\{\widehat{\bm{\beta}}^{(k)[0]}\}_{k\in S} is. Note that we only care about the alignment of tasks in SS.

The following theorem shows that the greedy search algorithm can succeed in finding the ideal alignment under mild conditions.

Theorem 6 (Alignment correctness for Algorithm 3).

Assume that

  1. (i)

    ϵ<12\epsilon<\frac{1}{2};

  2. (ii)

    minkS𝜷(k)22(1ϵ)2(1ϵ)(1pa)1h+2ϵ2(1ϵ)(1pa)1maxkS(𝜷^(k)[0]𝜷(k)2𝜷^(k)[0]+𝜷(k)2)\min_{k\in S}\|\bm{\beta}^{(k)*}\|_{2}\geq\frac{2(1-\epsilon)}{2(1-\epsilon)(1-p_{a})-1}h+\frac{2-\epsilon}{2(1-\epsilon)(1-p_{a})-1}\max_{k\in S}\big{(}\|\widehat{\bm{\beta}}^{(k)[0]}-\bm{\beta}^{(k)*}\|_{2}\wedge\|\widehat{\bm{\beta}}^{(k)[0]}+\bm{\beta}^{(k)*}\|_{2}\big{)};

  3. (iii)

    pa<12ϵ2(1ϵ)p_{a}<\frac{1-2\epsilon}{2(1-\epsilon)},

where ϵ\epsilon and hh are the same as in Theorem 5. Then the output of Algorithm 3 satisfies

r^k=rk for all kS or r^k=rk for all kS\widehat{r}_{k}=r_{k}^{*}\text{ for all }k\in S\quad\text{ or }\quad\widehat{r}_{k}=-r_{k}^{*}\text{ for all }k\in S (31)
Remark 4.

Conditions (i) and (ii) required by Theorem 6 are similar to the requirements in Theorem 5, which have been shown to be no stronger than conditions in Assumption 1 and Theorem 1 (See Remark 2). However, Condition (iii) is an additional requirement for the success of the greedy label-swapping algorithm. The intuition is that in the exhaustive search algorithm, we compare the scores of all alignments and only need to ensure the ideal alignment can defeat the badly-aligned ones in terms of the alignment score. In contrast, the success of the greedy search algorithm relies on the correct move at each step. We need to guarantee that the “better” alignment after the swap (which may still be badly aligned) can outperform the “worse” one before the swap. This is more difficult to satisfy. Hence, more conditions are needed for the success of Algorithm 3. Condition (iii) is one such condition to provide a reasonably good initial alignment to start the greedy search process. More details of the analysis can be found in the proofs of Theorems 5 and 6 in the supplements.

Remark 5.

In practice, Condition (iii) can fail to hold with non-zero probability. One remedy is to start from multiple random alignments, run the greedy search algorithm for each of them, and output the alignment that appears most frequently among the results. Nevertheless, this increases the computational burden. In our numerical studies, Algorithm 3 without random restarts works well.

One appealing feature of the two alignment algorithms is that they are robust against a fraction of outlier tasks from arbitrary distributions. According to the definition of the alignment score, this may appear impossible at first glance, because the score depends on the estimates from all tasks. However, it turns out that the impact of the outliers when comparing scores in Algorithms 2 and 3 can be bounded, via the triangle inequality for the Euclidean norm, by parameters and constants that are unrelated to the outlier tasks. The key idea is that the alignment of the outlier tasks in ScS^{c} does not matter in Theorems 5 and 6. More details can be found in the proofs of Theorems 5 and 6 in the supplementary materials.

In contrast with supervised MTL, the alignment issue commonly exists in unsupervised MTL. It generally occurs when aggregating information (up to latent label permutation) across different tasks. Alignment pre-processing is thus necessary and important. However, to our knowledge, there is no formal discussion regarding alignment in the existing literature of unsupervised MTL (Gu et al., , 2011; Zhang and Zhang, , 2011; Yang et al., , 2014; Zhang et al., , 2018; Dieuleveut et al., , 2021; Marfoq et al., , 2021). Our treatment of alignment in Section 2.4 is an important step forward in this field. Our algorithms can be potentially extended to other unsupervised MTL scenarios and we leave it for future studies.

3 Simulations

In this section, we present a simulation study of our multi-task learning procedure MTL-GMM, i.e., Algorithm 1. The tuning parameter κ(0,1)\kappa\in(0,1) is set as 1/31/3, and the value of CλC_{\lambda} is determined by 10-fold cross-validation based on the log-likelihood of the final fitted model. The candidates of CλC_{\lambda} are chosen in a data-driven way, which is described in detail in Section S.3.1.5 of the supplements. All the experiments in this section are implemented in R. The function Mclust in the R package mclust is called to fit a single GMM. We also conducted two additional simulation studies and two real-data studies. Due to the page limit, we include these in Section S.3 of the supplementary materials.

We consider a binary GMM setting. There are K=10K=10 tasks, each with sample size nk=100n_{k}=100 and dimension p=15p=15. When kSk\in S, we generate each w(k)w^{(k)*} from Unif(0.1,0.9)\text{Unif}(0.1,0.9) and 𝝁1(k)\bm{\mu}^{(k)*}_{1} from (2,2,𝟎p2)+h/2(𝚺(k))1𝒖(2,2,\bm{0}_{p-2})^{\top}+h/2\cdot(\bm{\Sigma}^{(k)*})^{-1}\bm{u}, where 𝒖Unif({𝒖p:𝒖2=1})\bm{u}\sim\text{Unif}(\{\bm{u}\in\mathbb{R}^{p}:\|\bm{u}\|_{2}=1\}), 𝚺(k)=(0.2|ij|)p×p\bm{\Sigma}^{(k)*}=(0.2^{|i-j|})_{p\times p}, and let 𝝁2(k)=𝝁1(k)\bm{\mu}^{(k)*}_{2}=-\bm{\mu}^{(k)*}_{1}. When kSk\notin S, the distributions still follow a GMM, but we generate each w(k)w^{(k)*} from Unif(0.2,0.4)\text{Unif}(0.2,0.4) and 𝝁1(k)\bm{\mu}^{(k)*}_{1} from Unif({𝒖p:𝒖2=0.5})\text{Unif}(\{\bm{u}\in\mathbb{R}^{p}:\|\bm{u}\|_{2}=0.5\}), and let 𝝁2(k)=𝝁1(k)\bm{\mu}^{(k)*}_{2}=-\bm{\mu}^{(k)*}_{1}, 𝚺(k)=(0.5|ij|)p×p\bm{\Sigma}^{(k)*}=(0.5^{|i-j|})_{p\times p}. In this setup, it is clear that hh quantifies the similarity among tasks in SS, and tasks in ScS^{c} have very distinct distributions and can be viewed as outlier tasks. For a given ϵ[0,1)\epsilon\in[0,1), the outlier task index set ScS^{c} in each replication is uniformly sampled from all subsets of 1:K1:K with cardinality KϵK\epsilon. A small R sketch of the data-generating process for a task in SS is given after the list below. We consider two cases:

  1. (i)

    No outlier tasks (ϵ=0\epsilon=0), and hh changes from 0 to 10 with increment 1;

  2. (ii)

    2 outlier tasks (ϵ=0.2\epsilon=0.2), and hh changes from 0 to 10 with increment 1;
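The following hedged R sketch illustrates the data-generating process described above for a single task kSk\in S; the function name and its arguments are ours, and MASS::mvrnorm is used to draw the Gaussian observations.

# Generate one task in S under Simulation 1 (defaults p = 15, n_k = 100).
library(MASS)

generate_task <- function(n_k = 100, p = 15, h = 5) {
  Sigma <- 0.2^abs(outer(1:p, 1:p, "-"))            # Sigma^(k)* = (0.2^|i-j|)_{p x p}
  w <- runif(1, 0.1, 0.9)                           # mixture proportion w^(k)*
  u <- rnorm(p); u <- u / sqrt(sum(u^2))            # uniform direction on the unit sphere
  mu1 <- c(2, 2, rep(0, p - 2)) + h / 2 * solve(Sigma, u)
  mu2 <- -mu1
  y <- rbinom(n_k, 1, w) + 1                        # latent labels: 2 w.p. w, 1 w.p. 1 - w
  z <- t(sapply(y, function(r) mvrnorm(1, if (r == 1) mu1 else mu2, Sigma)))
  list(z = z, y = y, w = w, mu1 = mu1, mu2 = mu2, Sigma = Sigma)
}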

We fit Single-task-GMM on each separate task, Pooled-GMM on the merged data of all tasks, and our MTL-GMM in Algorithm 1 coupled with the exhaustive search for the alignment in Algorithm 2. The performances of all three methods are evaluated by the estimation errors of w(k)w^{(k)*}, 𝝁1(k)\bm{\mu}^{(k)*}_{1}, 𝝁2(k)\bm{\mu}^{(k)*}_{2}, 𝜷(k)\bm{\beta}^{(k)*}, δ(k)\delta^{(k)*}, and 𝚺(k)\bm{\Sigma}^{(k)*}, as well as the empirical mis-clustering error calculated on a test data set of size 500, for tasks in SS. Due to the page limit, we only present the estimation error of 𝜷(k)\bm{\beta}^{(k)*} and the mis-clustering error here, and leave the others to Section S.3.1.1 of the supplements. These two errors are the maximum errors over tasks in SS. For each setting, the simulation is replicated 200 times, and the averages of the maximum errors together with the standard deviations are reported in Figure 2.
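For reference, the empirical mis-clustering error of one task can be computed with the following small R sketch, which minimizes over the two possible label permutations of a binary GMM; y_hat and y_test are hypothetical vectors of predicted and true test labels taking values in {1, 2}.

# Empirical mis-clustering error for a binary GMM, minimized over the two
# label permutations: swapping the predicted labels turns error rate err into 1 - err.
misclustering_error <- function(y_hat, y_test) {
  err <- mean(y_hat != y_test)
  min(err, 1 - err)
}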

Figure 2: The performance of different methods in Simulation 1 under different outlier proportions. The upper panel shows the performance without outlier tasks (ϵ=0\epsilon=0), and the lower panel shows the performance with 2 outlier tasks (ϵ=0.2\epsilon=0.2). hh changes from 0 to 10 with increment 1. Estimation error of {𝜷(k)}kS\{\bm{\beta}^{(k)*}\}_{k\in S} stands for maxkS(𝜷^(k)[T]𝜷(k)2𝜷^(k)[T]+𝜷(k)2)\max_{k\in S}(\|\widehat{\bm{\beta}}^{(k)[T]}-\bm{\beta}^{(k)*}\|_{2}\wedge\|\widehat{\bm{\beta}}^{(k)[T]}+\bm{\beta}^{(k)*}\|_{2}) and maximum mis-clustering error represents the maximum empirical mis-clustering error rate calculated on the test set of tasks in SS.

When there are no outlier tasks, it can be seen that MTL-GMM and Pooled-GMM are competitive when hh is small (i.e., the tasks are similar), and they outperform Single-task-GMM. As hh increases (i.e., the tasks become more heterogeneous), MTL-GMM starts to outperform Pooled-GMM by a large margin. Moreover, MTL-GMM is significantly better than Single-task-GMM in terms of both estimation and mis-clustering errors over a wide range of hh. These comparisons demonstrate that MTL-GMM not only effectively utilizes the unknown similarity structure among tasks, but also adapts to it. When outlier tasks exist, even when hh is very small, MTL-GMM still performs better than Pooled-GMM, showing the robustness of MTL-GMM against a fraction of outlier tasks.

4 Discussions

We would like to highlight several interesting open problems for potential future work:

  • What if only some clusters are similar among different tasks? This may be a more realistic situation in particular when there are more than 2 clusters in each task. Our current proposed algorithms may not work well because they do not take into account this extra layer of heterogeneity. Furthermore, in this situation, different tasks may have a different number of Gaussian clusters. Such a setting with various numbers of clusters has been considered in some literature on general unsupervised multi-task learning (Zhang and Zhang, , 2011; Yang et al., , 2014; Zhang et al., , 2018). It would be of great interest to develop multi-task and transfer learning methods with provable guarantees for GMMs under these more complicated settings.

  • How to accommodate heterogeneous covariance matrices for different Gaussian clusters within each task? This is related to the quadratic discriminant analysis (QDA) in supervised learning where the Bayes classifier has a leading quadratic term. It may require more delicate analysis for methodological and theoretical development. Some recent QDA literature might be helpful (Li and Shao, , 2015; Fan et al., , 2015; Hao et al., , 2018; Jiang et al., , 2018).

  • In this paper, we have focused on the pure unsupervised learning problem, where all the samples are unlabeled. It would be interesting to consider the semi-supervised learning setting, where labels in some tasks (or sources) are known. Li et al., 2022a discusses a similar problem under the linear regression setting, but how the labeled data can help the estimation and clustering in the context of GMMs remains unknown.

References

  • Anderson, (1958) Anderson, T. W. (1958). An introduction to multivariate statistical analysis: Wiley series in probability and mathematical statistics: Probability and mathematical statistics.
  • Ando et al., (2005) Ando, R. K., Zhang, T., and Bartlett, P. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(11).
  • Argyriou et al., (2008) Argyriou, A., Evgeniou, T., and Pontil, M. (2008). Convex multi-task feature learning. Machine learning, 73(3):243–272.
  • Azizyan et al., (2013) Azizyan, M., Singh, A., and Wasserman, L. (2013). Minimax theory for high-dimensional gaussian mixtures with sparse mean separation. Advances in Neural Information Processing Systems, 26.
  • Balakrishnan et al., (2017) Balakrishnan, S., Wainwright, M. J., and Yu, B. (2017). Statistical guarantees for the em algorithm: From population to sample-based analysis. The Annals of Statistics, 45(1):77–120.
  • Bastani, (2021) Bastani, H. (2021). Predicting with proxies: Transfer learning in high dimension. Management Science, 67(5):2964–2984.
  • Baum et al., (1970) Baum, L. E., Petrie, T., Soules, G., and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of markov chains. The Annals of Mathematical Statistics, 41(1):164–171.
  • Cai and Liu, (2011) Cai, T. and Liu, W. (2011). A direct estimation approach to sparse linear discriminant analysis. Journal of the American statistical association, 106(496):1566–1577.
  • Cai et al., (2019) Cai, T. T., Ma, J., and Zhang, L. (2019). Chime: Clustering of high-dimensional gaussian mixtures with em algorithm and its optimality. The Annals of Statistics, 47(3):1234–1267.
  • Carroll et al., (1997) Carroll, R. J., Fan, J., Gijbels, I., and Wand, M. P. (1997). Generalized partially linear single-index models. Journal of the American Statistical Association, 92(438):477–489.
  • Chattopadhyay et al., (2012) Chattopadhyay, R., Sun, Q., Fan, W., Davidson, I., Panchanathan, S., and Ye, J. (2012). Multisource domain adaptation and its application to early detection of fatigue. ACM Transactions on Knowledge Discovery from Data (TKDD), 6(4):1–26.
  • Chen et al., (2022) Chen, E. Y., Jordan, M. I., and Li, S. (2022). Transferred q-learning. arXiv preprint arXiv:2202.04709.
  • Chen et al., (2018) Chen, M., Gao, C., and Ren, Z. (2018). Robust covariance and scatter matrix estimation under huber’s contamination model. The Annals of Statistics, 46(5):1932–1960.
  • Dai et al., (2007) Dai, W., Yang, Q., Xue, G., and Yu, Y. (2007). Boosting for transfer learning. In ACM International Conference Proceeding Series, volume 227, page 193.
  • Dai et al., (2008) Dai, W., Yang, Q., Xue, G.-R., and Yu, Y. (2008). Self-taught clustering. In Proceedings of the 25th international conference on Machine learning, pages 200–207.
  • Dasgupta and Schulman, (2013) Dasgupta, S. and Schulman, L. (2013). A two-round variant of em for gaussian mixtures. arXiv preprint arXiv:1301.3850.
  • Dempster et al., (1977) Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22.
  • Dieuleveut et al., (2021) Dieuleveut, A., Fort, G., Moulines, E., and Robin, G. (2021). Federated-em with heterogeneity mitigation and variance reduction. Advances in Neural Information Processing Systems, 34:29553–29566.
  • Duan and Wang, (2023) Duan, Y. and Wang, K. (2023). Adaptive and robust multi-task learning. The Annals of Statistics, 51(5):2015–2039.
  • Efron, (1975) Efron, B. (1975). The efficiency of logistic regression compared to normal discriminant analysis. Journal of the American Statistical Association, 70(352):892–898.
  • Evgeniou and Pontil, (2004) Evgeniou, T. and Pontil, M. (2004). Regularized multi–task learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 109–117.
  • Fan et al., (2012) Fan, J., Feng, Y., and Tong, X. (2012). A road to classification in high dimensional space: the regularized optimal affine discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(4):745–771.
  • Fan et al., (2015) Fan, Y., Kong, Y., Li, D., and Zheng, Z. (2015). Innovated interaction screening for high-dimensional nonlinear classification. The Annals of Statistics, 43(3):1243–1272.
  • Forgy, (1965) Forgy, E. W. (1965). Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics, 21:768–769.
  • Ge et al., (2015) Ge, R., Huang, Q., and Kakade, S. M. (2015). Learning mixtures of gaussians in high dimensions. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 761–770.
  • Gu et al., (2011) Gu, Q., Li, Z., and Han, J. (2011). Learning a kernel for multi-task clustering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 25, pages 368–373.
  • Hao et al., (2018) Hao, N., Feng, Y., and Zhang, H. H. (2018). Model selection for high-dimensional quadratic regression via regularization. Journal of the American Statistical Association, 113(522):615–625.
  • Hartley, (1958) Hartley, H. O. (1958). Maximum likelihood estimation from incomplete data. Biometrics, 14(2):174–194.
  • Hasselblad, (1966) Hasselblad, V. (1966). Estimation of parameters for a mixture of normal distributions. Technometrics, 8(3):431–444.
  • Hastie et al., (2009) Hastie, T., Tibshirani, R., Friedman, J. H., and Friedman, J. H. (2009). The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer.
  • Hsu and Kakade, (2013) Hsu, D. and Kakade, S. M. (2013). Learning mixtures of spherical gaussians: moment methods and spectral decompositions. In Proceedings of the 4th conference on Innovations in Theoretical Computer Science, pages 11–20.
  • Huber, (1964) Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, pages 73–101.
  • Jain and Dubes, (1988) Jain, A. K. and Dubes, R. C. (1988). Algorithms for clustering data. Prentice-Hall, Inc.
  • Jalali et al., (2010) Jalali, A., Sanghavi, S., Ruan, C., and Ravikumar, P. (2010). A dirty model for multi-task learning. Advances in neural information processing systems, 23.
  • Jiang et al., (2018) Jiang, B., Wang, X., and Leng, C. (2018). A direct approach for sparse quadratic discriminant analysis. Journal of Machine Learning Research, 19(1):1098–1134.
  • Jin et al., (2017) Jin, J., Ke, Z. T., and Wang, W. (2017). Phase transitions for high dimensional clustering and related problems. The Annals of Statistics, 45(5):2151–2189.
  • Kalai et al., (2010) Kalai, A. T., Moitra, A., and Valiant, G. (2010). Efficiently learning mixtures of two gaussians. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 553–562.
  • Konstantinov et al., (2020) Konstantinov, N., Frantar, E., Alistarh, D., and Lampert, C. (2020). On the sample complexity of adversarial multi-source pac learning. In International Conference on Machine Learning, pages 5416–5425. PMLR.
  • Kwon and Caramanis, (2020) Kwon, J. and Caramanis, C. (2020). The em algorithm gives sample-optimality for learning mixtures of well-separated gaussians. In Conference on Learning Theory, pages 2425–2487. PMLR.
  • Lawrence and Platt, (2004) Lawrence, N. D. and Platt, J. C. (2004). Learning to learn with the informative vector machine. In Proceedings of the twenty-first international conference on Machine learning, page 65.
  • Lee et al., (2012) Lee, K., Guillemot, L., Yue, Y., Kramer, M., and Champion, D. (2012). Application of the gaussian mixture model in pulsar astronomy-pulsar classification and candidates ranking for the fermi 2fgl catalogue. Monthly Notices of the Royal Astronomical Society, 424(4):2832–2840.
  • Li and Shao, (2015) Li, Q. and Shao, J. (2015). Sparse quadratic discriminant analysis for high dimensional data. Statistica Sinica, pages 457–473.
  • Li and Liang, (2008) Li, R. and Liang, H. (2008). Variable selection in semiparametric regression modeling. The Annals of Statistics, 36(1):261–286.
  • Li et al., (2023) Li, S., Cai, T., and Duan, R. (2023). Targeting underrepresented populations in precision medicine: A federated transfer learning approach. The Annals of Applied Statistics, 17(4):2970–2992.
  • Li et al., (2021) Li, S., Cai, T. T., and Li, H. (2021). Transfer learning for high-dimensional linear regression: Prediction, estimation and minimax optimality. Journal of the Royal Statistical Society: Series B (Statistical Methodology), pages 1–25.
  • (46) Li, S., Cai, T. T., and Li, H. (2022a). Estimation and inference with proxy data and its genetic applications. arXiv preprint arXiv:2201.03727.
  • (47) Li, S., Cai, T. T., and Li, H. (2022b). Transfer learning in large-scale gaussian graphical models with false discovery rate control. Journal of the American Statistical Association, pages 1–13.
  • Li et al., (2013) Li, W., Duan, L., Xu, D., and Tsang, I. W. (2013). Learning with augmented features for supervised and semi-supervised heterogeneous domain adaptation. IEEE transactions on pattern analysis and machine intelligence, 36(6):1134–1148.
  • Lin and Reimherr, (2022) Lin, H. and Reimherr, M. (2022). On transfer learning in functional linear regression. arXiv preprint arXiv:2206.04277.
  • Mai et al., (2019) Mai, Q., Yang, Y., and Zou, H. (2019). Multiclass sparse discriminant analysis. Statistica Sinica, 29(1):97–111.
  • Mai et al., (2012) Mai, Q., Zou, H., and Yuan, M. (2012). A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99(1):29–42.
  • Marfoq et al., (2021) Marfoq, O., Neglia, G., Bellet, A., Kameni, L., and Vidal, R. (2021). Federated multi-task learning under a mixture of distributions. Advances in Neural Information Processing Systems, 34:15434–15447.
  • McLachlan and Krishnan, (2007) McLachlan, G. J. and Krishnan, T. (2007). The EM algorithm and extensions. John Wiley & Sons.
  • Meng and Rubin, (1994) Meng, X.-L. and Rubin, D. B. (1994). On the global and componentwise rates of convergence of the em algorithm. Linear Algebra and its Applications, 199:413–425.
  • Mihalkova et al., (2007) Mihalkova, L., Huynh, T., and Mooney, R. J. (2007). Mapping and revising markov logic networks for transfer learning. In Proceedings of the 22nd national conference on Artificial intelligence-Volume 1, pages 608–614.
  • Murtagh and Contreras, (2012) Murtagh, F. and Contreras, P. (2012). Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(1):86–97.
  • Ng et al., (2001) Ng, A., Jordan, M., and Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems, 14.
  • Obozinski et al., (2006) Obozinski, G., Taskar, B., and Jordan, M. (2006). Multi-task feature selection. Statistics Department, UC Berkeley, Tech. Rep, 2(2.2):2.
  • Pan and Yang, (2009) Pan, S. J. and Yang, Q. (2009). A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359.
  • Pearson, (1894) Pearson, K. (1894). Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A, 185:71–110.
  • Redner and Walker, (1984) Redner, R. A. and Walker, H. F. (1984). Mixture densities, maximum likelihood and the em algorithm. SIAM review, 26(2):195–239.
  • Scott and Symons, (1971) Scott, A. J. and Symons, M. J. (1971). Clustering methods based on likelihood ratio criteria. Biometrics, pages 387–397.
  • Sundberg, (1974) Sundberg, R. (1974). Maximum likelihood theory for incomplete data from an exponential family. Scandinavian Journal of Statistics, pages 49–58.
  • Thrun and O’Sullivan, (1996) Thrun, S. and O’Sullivan, J. (1996). Discovering structure in multiple learning tasks: The tc algorithm. In ICML, volume 96, pages 489–497.
  • Tian and Feng, (2023) Tian, Y. and Feng, Y. (2023). Transfer learning under high-dimensional generalized linear models. Journal of the American Statistical Association, 118(544):2684–2697.
  • Tukey, (1975) Tukey, J. W. (1975). Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Vancouver, 1975, volume 2, pages 523–531.
  • Vempala and Wang, (2004) Vempala, S. and Wang, G. (2004). A spectral algorithm for learning mixture models. Journal of Computer and System Sciences, 68(4):841–860.
  • Wang et al., (2021) Wang, R., Zhou, J., Jiang, H., Han, S., Wang, L., Wang, D., and Chen, Y. (2021). A general transfer learning-based gaussian mixture model for clustering. International Journal of Fuzzy Systems, 23(3):776–793.
  • Wang et al., (2014) Wang, Z., Gu, Q., Ning, Y., and Liu, H. (2014). High dimensional expectation-maximization algorithm: Statistical optimization and asymptotic normality. arXiv preprint arXiv:1412.8729.
  • Wang et al., (2008) Wang, Z., Song, Y., and Zhang, C. (2008). Transferred dimensionality reduction. In Joint European conference on machine learning and knowledge discovery in databases, pages 550–565. Springer.
  • Witten and Tibshirani, (2011) Witten, D. M. and Tibshirani, R. (2011). Penalized classification using fisher’s linear discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(5):753–772.
  • Wu, (1983) Wu, C. J. (1983). On the convergence properties of the em algorithm. The Annals of statistics, pages 95–103.
  • Xu et al., (2016) Xu, J., Hsu, D. J., and Maleki, A. (2016). Global analysis of expectation maximization for mixtures of two gaussians. Advances in Neural Information Processing Systems, 29.
  • Xu and Bastani, (2021) Xu, K. and Bastani, H. (2021). Learning across bandits in high dimension via robust statistics. arXiv preprint arXiv:2112.14233.
  • Yan et al., (2017) Yan, B., Yin, M., and Sarkar, P. (2017). Convergence of gradient em on multi-component mixture of gaussians. Advances in Neural Information Processing Systems, 30.
  • Yang and Ahuja, (1998) Yang, M.-H. and Ahuja, N. (1998). Gaussian mixture model for human skin color and its applications in image and video databases. In Storage and retrieval for image and video databases VII, volume 3656, pages 458–466. SPIE.
  • Yang et al., (2014) Yang, Y., Ma, Z., Yang, Y., Nie, F., and Shen, H. T. (2014). Multitask spectral clustering by exploring intertask correlation. IEEE transactions on cybernetics, 45(5):1083–1094.
  • Zhang and Zhang, (2011) Zhang, J. and Zhang, C. (2011). Multitask bregman clustering. Neurocomputing, 74(10):1720–1734.
  • Zhang and Chen, (2022) Zhang, Q. and Chen, J. (2022). Distributed learning of finite gaussian mixtures. Journal of Machine Learning Research, 23(99):1–40.
  • Zhang et al., (2022) Zhang, X., Blanchet, J., Ghosh, S., and Squillante, M. S. (2022). A class of geometric structures in transfer learning: Minimax bounds and optimality. In International Conference on Artificial Intelligence and Statistics, pages 3794–3820. PMLR.
  • Zhang et al., (2015) Zhang, X., Zhang, X., and Liu, H. (2015). Smart multitask bregman clustering and multitask kernel clustering. ACM Transactions on Knowledge Discovery from Data (TKDD), 10(1):1–29.
  • Zhang et al., (2018) Zhang, X., Zhang, X., Liu, H., and Luo, J. (2018). Multi-task clustering with model relation learning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 3132–3140.
  • Zhang and Yang, (2021) Zhang, Y. and Yang, Q. (2021). A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering.
  • Zhao et al., (2020) Zhao, R., Li, Y., and Sun, Y. (2020). Statistical convergence of the em algorithm on gaussian mixture models. Electronic Journal of Statistics, 14:632–660.
  • Zou, (2006) Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429.
  • Zuo et al., (2018) Zuo, H., Lu, J., Zhang, G., and Liu, F. (2018). Fuzzy transfer learning using an infinite gaussian mixture model and active learning. IEEE Transactions on Fuzzy Systems, 27(2):291–303.

Supplementary Materials of “Robust Unsupervised Multi-task and Transfer Learning on Gaussian Mixture Models”

S.1 Extension to Multi-cluster GMMs

In the main text, we have discussed the MTL problem for binary GMMs. In this section, we extend our methods and theory to Gaussian mixtures with RR clusters (R3R\geq 3).

We first generalize the problem setting introduced in Sections 1.1 and 2.1. There are KK tasks where we have nkn_{k} observations {𝒛i(k)}i=1nk\{\bm{z}_{i}^{(k)}\}_{i=1}^{n_{k}} from the kk-th task. An unknown subset S1:KS\subseteq 1:K denotes tasks whose samples follow multi-cluster GMMs, and ScS^{c} refers to outlier tasks that can have arbitrary distributions. Specifically, for all kS,i=1:nkk\in S,i=1:n_{k},

yi(k)=r with probability wr(k),𝒛i(k)|yi(k)=r𝒩(𝝁r(k),𝚺(k)),r=1:R,\displaystyle y^{(k)}_{i}=r\text{ \,\,with probability }w^{(k)*}_{r},\quad\quad\bm{z}_{i}^{(k)}|y_{i}^{(k)}=r\sim\mathcal{N}(\bm{\mu}^{(k)*}_{r},\bm{\Sigma}^{(k)*}),\quad\quad r=1:R, (S.1.32)

with r=1Rwr(k)=1\sum_{r=1}^{R}w^{(k)*}_{r}=1, and

{𝒛i(k)}i,kScS,\{\bm{z}^{(k)}_{i}\}_{i,k\in S^{c}}\sim\mathbb{Q}_{S}, (S.1.33)

where S\mathbb{Q}_{S} is some probability measure on (p)nSc(\mathbb{R}^{p})^{\otimes n_{S^{c}}} and nSc=kScnkn_{S^{c}}=\sum_{k\in S^{c}}n_{k}. We focus on the following joint parameter space

Θ¯S(h)={{𝜽¯(k)}kS={({wr(k)}r=1R,{𝝁r(k)}r=1R,𝚺(k))}kS:𝜽¯(k)Θ¯,maxr=2:Rinf𝜷maxkS𝜷r(k)𝜷2h},\overline{\Theta}_{S}(h)=\Big{\{}\{\overline{\bm{\theta}}^{(k)}\}_{k\in S}=\{(\{w^{(k)}_{r}\}_{r=1}^{R},\{\bm{\mu}^{(k)}_{r}\}_{r=1}^{R},\bm{\Sigma}^{(k)})\}_{k\in S}:\overline{\bm{\theta}}^{(k)}\in\overline{\Theta},\max_{r=2:R}\inf_{\bm{\beta}}\max_{k\in S}\|\bm{\beta}^{(k)}_{r}-\bm{\beta}\|_{2}\leq h\Big{\}}, (S.1.34)

where 𝜷r(k)=(𝚺(k))1(𝝁r(k)𝝁1(k))\bm{\beta}^{(k)}_{r}=(\bm{\Sigma}^{(k)})^{-1}(\bm{\mu}^{(k)}_{r}-\bm{\mu}^{(k)}_{1}) is the rr-th discriminant coefficient in the kk-th task, and Θ¯\overline{\Theta} is the parameter space for a single multi-cluster GMM,

Θ¯={𝜽¯=({wr}r=1R,{𝝁r}r=1R,𝚺):\displaystyle\overline{\Theta}=\bigg{\{}\overline{\bm{\theta}}=(\{w_{r}\}_{r=1}^{R},\{\bm{\mu}_{r}\}_{r=1}^{R},\bm{\Sigma}): maxr=1:R𝝁r2M,cwminr=1:Rwrmaxr=1:Rwr1cw,\displaystyle\max_{r=1:R}\|\bm{\mu}_{r}\|_{2}\leq M,c_{w}\leq\min_{r=1:R}w_{r}\leq\max_{r=1:R}w_{r}\leq 1-c_{w}, (S.1.35)
r=1Rwr=1,c𝚺1λmin(𝚺)λmax(𝚺)c𝚺}.\displaystyle\sum_{r=1}^{R}w_{r}=1,c_{\bm{\Sigma}}^{-1}\leq\lambda_{\min}(\bm{\Sigma})\leq\lambda_{\max}(\bm{\Sigma})\leq c_{\bm{\Sigma}}\bigg{\}}. (S.1.36)

Note that (S.1.34) and (S.1.36) are natural generalizations of (LABEL:eq:_parameter_space_mtl) and (8), respectively.

Under a multi-cluster GMM with parameter 𝜽¯=({wr}r=1R,{𝝁r}r=1R,𝚺)\overline{\bm{\theta}}=(\{w_{r}\}_{r=1}^{R},\{\bm{\mu}_{r}\}_{r=1}^{R},\bm{\Sigma}), compared with (3), the optimal discriminant rule now becomes

𝒞𝜽¯(𝒛)=argmaxr=1:R{(𝒛𝝁1+𝝁r2)𝜷r+log(wrw1)},\displaystyle\mathcal{C}_{\overline{\bm{\theta}}}(\bm{z})=\operatorname*{arg\,max}_{r=1:R}\bigg{\{}\bigg{(}\bm{z}-\frac{\bm{\mu}_{1}+\bm{\mu}_{r}}{2}\bigg{)}^{\top}\bm{\beta}_{r}+\log\bigg{(}\frac{w_{r}}{w_{1}}\bigg{)}\bigg{\}}, (S.1.37)

where 𝜷r=𝚺1(𝝁r𝝁1)\bm{\beta}_{r}=\bm{\Sigma}^{-1}(\bm{\mu}_{r}-\bm{\mu}_{1}). Once we have the parameter estimators, we plug them into the above rule to obtain the plug-in clustering method. Recall that for a clustering method 𝒞:p[R]\mathcal{C}:\mathbb{R}^{p}\rightarrow[R], its mis-clustering error is

R𝜽¯(𝒞)=minπ:[R][R]𝜽¯(𝒞(Znew)π(Ynew)).R_{\overline{\bm{\theta}}}(\mathcal{C})=\min_{\pi:[R]\rightarrow[R]}\mathbb{P}_{\overline{\bm{\theta}}}(\mathcal{C}(Z^{\textup{new}})\neq\pi(Y^{\textup{new}})). (S.1.38)

Here, Znewr=1Rwr𝒩(𝝁r,𝚺)Z^{\textup{new}}\sim\sum_{r=1}^{R}w_{r}\mathcal{N}(\bm{\mu}_{r},\bm{\Sigma}) is an independent future observation associated with the label YnewY^{\textup{new}}, the probability 𝜽¯\mathbb{P}_{\overline{\bm{\theta}}} is w.r.t. (Znew,Ynew)(Z^{\textup{new}},Y^{\textup{new}}), and the minimum is taken over all R!R! permutation functions on [R][R]. Since (S.1.37) is the optimal clustering method that minimizes R𝜽¯(𝒞)R_{\overline{\bm{\theta}}}(\mathcal{C}), the excess mis-clustering error of a given clustering method 𝒞\mathcal{C} is R𝜽¯(𝒞)R𝜽¯(𝒞𝜽¯)R_{\overline{\bm{\theta}}}(\mathcal{C})-R_{\overline{\bm{\theta}}}(\mathcal{C}_{\overline{\bm{\theta}}}). The rest of the section aims to extend the EM-stylized multi-task learning procedure and the two alignment algorithms in Section 2 to the general multi-cluster GMM setting, and to provide similar statistical guarantees in terms of estimation and excess mis-clustering errors. For simplicity, throughout this section, we assume the number of clusters RR to be bounded and known. We leave the case of diverging RR for future work.
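As a concrete illustration, the following R sketch evaluates the empirical analogue of (S.1.38) on a labeled test sample, minimizing over all R!R! label permutations; combinat::permn (assumed available) enumerates the permutations, and y_hat and y_true are hypothetical label vectors taking values in 1:R.

# Empirical mis-clustering error minimized over all R! label permutations,
# mirroring the definition in (S.1.38).
empirical_misclustering <- function(y_hat, y_true, R) {
  perms <- combinat::permn(1:R)                           # list of all R! permutations
  min(sapply(perms, function(perm) mean(y_hat != perm[y_true])))
}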

Since both the EM algorithm and the penalization framework work beyond the binary GMM, the methodological idea described in Section 2.2 can be directly adapted to extend Algorithm 1 to the multi-cluster case. We summarize the general procedure in Algorithm 4. As in Algorithm 1, we adopt the following notation for the posterior probabilities in Algorithm 4,

γ𝜽(r)(𝒛)=wrexp(𝜷r𝒛δr)r′=1Rwr′exp(𝜷r′𝒛δr′),for𝜽=({wr}r=2R,{𝜷r}r=2R,{δr}r=2R),\gamma^{(r)}_{\bm{\theta}}(\bm{z})=\frac{w_{r}\exp(\bm{\beta}_{r}^{\top}\bm{z}-\delta_{r})}{\sum_{r^{\prime}=1}^{R}w_{r^{\prime}}\exp(\bm{\beta}_{r^{\prime}}^{\top}\bm{z}-\delta_{r^{\prime}})},\quad{\rm for~{}}\bm{\theta}=(\{w_{r}\}_{r=2}^{R},\{\bm{\beta}_{r}\}_{r=2}^{R},\{\delta_{r}\}_{r=2}^{R}), (S.1.39)

where w1=1r=2Rwrw_{1}=1-\sum_{r=2}^{R}w_{r}, 𝜷1𝟎\bm{\beta}_{1}\coloneqq\bm{0}, and δ10\delta_{1}\coloneqq 0. Specifically, γ𝜽(r)(𝒛)\gamma^{(r)}_{\bm{\theta}}(\bm{z}) is the posterior probability (Y=r|Z=𝒛)\mathbb{P}(Y=r|Z=\bm{z}) given the observation 𝒛\bm{z}, when the true parameter of a multi-cluster GMM ({wr}r=1R,{𝝁r}r=1R,𝚺)(\{w^{*}_{r}\}_{r=1}^{R},\{\bm{\mu}^{*}_{r}\}_{r=1}^{R},\bm{\Sigma}^{*}) satisfies wr=wr,𝜷r=(𝚺)1(𝝁r𝝁1),δr=12𝜷r(𝝁1+𝝁r)w_{r}=w_{r}^{*},\bm{\beta}_{r}=(\bm{\Sigma}^{*})^{-1}(\bm{\mu}^{*}_{r}-\bm{\mu}^{*}_{1}),\delta_{r}=\frac{1}{2}\bm{\beta}_{r}^{\top}(\bm{\mu}^{*}_{1}+\bm{\mu}^{*}_{r}), for r=1:Rr=1:R.
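For concreteness, here is a minimal R sketch of the posterior probabilities in (S.1.39), computed in a numerically stable way; the list theta, with components w (length RR, summing to one), beta (a p x R matrix whose first column is zero), and delta (length RR with first entry zero), is a hypothetical container for the parameters.

# Posterior probabilities gamma_theta^(r)(z), r = 1:R, for a single observation z.
posterior_gamma <- function(theta, z) {
  s <- log(theta$w) + drop(crossprod(theta$beta, z)) - theta$delta  # log w_r + beta_r' z - delta_r
  s <- s - max(s)                                # stabilize before exponentiating
  exp(s) / sum(exp(s))
}

The E-step of Algorithm 4 evaluates these probabilities at the current iterate, and the plug-in clustering rule introduced below simply assigns an observation to the cluster with the largest posterior probability, i.e., which.max(posterior_gamma(theta_hat, z)) for a hypothetical estimate theta_hat.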

Input: Initialization {({w^r(k)[0]}r=1R,{𝜷^r(k)[0]}r=2R,{𝝁^r(k)[0]}r=1R)}k=1K\{(\{\widehat{w}^{(k)[0]}_{r}\}_{r=1}^{R},\{\widehat{\bm{\beta}}^{(k)[0]}_{r}\}_{r=2}^{R},\{\widehat{\bm{\mu}}^{(k)[0]}_{r}\}_{r=1}^{R})\}_{k=1}^{K}, maximum number of iteration rounds TT, initial penalty parameter λ[0]\lambda^{[0]}, tuning parameters Cλ>0C_{\lambda}>0, κ(0,1)\kappa\in(0,1)
1 𝜽^(k)[0]=({w^r(k)[0]}r=1R,{𝜷^r(k)[0]}r=2R,{δ^r(k)[0]}r=2R)\widehat{\bm{\theta}}^{(k)[0]}=(\{\widehat{w}^{(k)[0]}_{r}\}_{r=1}^{R},\{\widehat{\bm{\beta}}^{(k)[0]}_{r}\}_{r=2}^{R},\{\widehat{\delta}^{(k)[0]}_{r}\}_{r=2}^{R}), where δ^r(k)[0]=12(𝜷^r(k)[0])(𝝁^1(k)[0]+𝝁^r(k)[0])\widehat{\delta}^{(k)[0]}_{r}=\frac{1}{2}(\widehat{\bm{\beta}}^{(k)[0]}_{r})^{\top}(\widehat{\bm{\mu}}^{(k)[0]}_{1}+\widehat{\bm{\mu}}^{(k)[0]}_{r}), for k=1:Kk=1:K
2 for t=1t=1 to TT do
3       λ[t]=κλ[t1]+Cλp+logK\lambda^{[t]}=\kappa\lambda^{[t-1]}+C_{\lambda}\sqrt{p+\log K} // Update the penalty parameter
4       for k=1k=1 to KK do // Local update for each task
5             for r=1r=1 to RR do
6                   w^r(k)[t]=1nki=1nkγ𝜽^(k)[t1](r)(𝒛i(k))\widehat{w}^{(k)[t]}_{r}=\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\gamma^{(r)}_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}_{i}^{(k)})
7                   𝝁^r(k)[t]=i=1nkγ𝜽^(k)[t1](r)(𝒛i(k))𝒛i(k)nkw^r(k)[t]\widehat{\bm{\mu}}^{(k)[t]}_{r}=\frac{\sum_{i=1}^{n_{k}}\gamma^{(r)}_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}_{i}^{(k)})\bm{z}^{(k)}_{i}}{n_{k}\widehat{w}^{(k)[t]}_{r}}
8                  
9             end for
10            𝚺^(k)[t]=1nki=1nkr=1Rγ𝜽^(k)[t1](r)(𝒛i(k))(𝒛i(k)𝝁^r(k)[t])(𝒛i(k)𝝁^r(k)[t])\widehat{\bm{\Sigma}}^{(k)[t]}=\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\sum_{r=1}^{R}\gamma^{(r)}_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}_{i}^{(k)})\cdot(\bm{z}^{(k)}_{i}-\widehat{\bm{\mu}}^{(k)[t]}_{r})(\bm{z}^{(k)}_{i}-\widehat{\bm{\mu}}^{(k)[t]}_{r})^{\top}
11       end for
12      
13      for r=2r=2 to RR do
14             {𝜷^r(k)[t]}k=1K\{\widehat{\bm{\beta}}^{(k)[t]}_{r}\}_{k=1}^{K}, 𝜷¯r[t]=argmin𝜷(1),,𝜷(K),𝜷¯{k=1Knk[12(𝜷(k))𝚺^(k)[t]𝜷(k)(𝜷(k))(𝝁^r(k)[t]𝝁^1(k)[t])]+k=1Knkλ[t]𝜷(k)𝜷¯2}\overline{\bm{\beta}}_{r}^{[t]}=\operatorname*{arg\,min}\limits_{\bm{\beta}^{(1)},\ldots,\bm{\beta}^{(K)},\overline{\bm{\beta}}}\bigg{\{}\sum_{k=1}^{K}n_{k}\Big{[}\frac{1}{2}(\bm{\beta}^{(k)})^{\top}\widehat{\bm{\Sigma}}^{(k)[t]}\bm{\beta}^{(k)}-(\bm{\beta}^{(k)})^{\top}(\widehat{\bm{\mu}}_{r}^{(k)[t]}-\widehat{\bm{\mu}}_{1}^{(k)[t]})\Big{]}+\sum_{k=1}^{K}\sqrt{n_{k}}\lambda^{[t]}\cdot\|\bm{\beta}^{(k)}-\overline{\bm{\beta}}\|_{2}\bigg{\}} // Aggregation
15            
16       end for
17      
18      for k=1k=1 to KK do // Local update for each task
19             for r=2r=2 to RR do
20                   δ^r(k)[t]=12(𝜷^r(k)[t])(𝝁^1(k)[t]+𝝁^r(k)[t])\widehat{\delta}^{(k)[t]}_{r}=\frac{1}{2}(\widehat{\bm{\beta}}^{(k)[t]}_{r})^{\top}(\widehat{\bm{\mu}}^{(k)[t]}_{1}+\widehat{\bm{\mu}}^{(k)[t]}_{r})
21             end for
22            Let 𝜽^(k)[t]=({w^r(k)[t]}r=1R,{𝜷^r(k)[t]}r=2R,{δ^r(k)[t]}r=2R)\widehat{\bm{\theta}}^{(k)[t]}=(\{\widehat{w}^{(k)[t]}_{r}\}_{r=1}^{R},\{\widehat{\bm{\beta}}^{(k)[t]}_{r}\}_{r=2}^{R},\{\widehat{\delta}^{(k)[t]}_{r}\}_{r=2}^{R})
23       end for
24      
25 end for
Output: {(𝜽^(k)[T],{𝝁^r(k)[T]}r=1R,𝚺^(k)[T])}k=1K\{(\widehat{\bm{\theta}}^{(k)[T]},\{\widehat{\bm{\mu}}^{(k)[T]}_{r}\}_{r=1}^{R},\widehat{\bm{\Sigma}}^{(k)[T]})\}_{k=1}^{K} with 𝜽^(k)[T]=({w^r(k)[T]}r=1R,{𝜷^r(k)[T]}r=2R,{δ^r(k)[T]}r=2R)\widehat{\bm{\theta}}^{(k)[T]}=(\{\widehat{w}^{(k)[T]}_{r}\}_{r=1}^{R},\{\widehat{\bm{\beta}}^{(k)[T]}_{r}\}_{r=2}^{R},\allowbreak\{\widehat{\delta}^{(k)[T]}_{r}\}_{r=2}^{R})
Algorithm 4 MTL-GMM (Multi-cluster)
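To illustrate the local updates in Algorithm 4, the following hedged R sketch performs one pass of the per-task updates of the mixture proportions, cluster means, and common covariance matrix; the penalized aggregation step requires a dedicated solver and is not shown. Here z is a hypothetical n x p data matrix for one task and gamma an n x R matrix of posterior probabilities from the previous iterate.

# One local update for a single task, given the posterior weights gamma.
local_update <- function(z, gamma) {
  n <- nrow(z); p <- ncol(z); R <- ncol(gamma)
  w_hat <- colMeans(gamma)                         # \hat{w}_r = (1/n) sum_i gamma_ir
  mu_hat <- crossprod(gamma, z) / (n * w_hat)      # row r: sum_i gamma_ir z_i / (n \hat{w}_r)
  Sigma_hat <- matrix(0, p, p)
  for (r in 1:R) {
    resid <- sweep(z, 2, mu_hat[r, ])              # z_i - \hat{mu}_r
    Sigma_hat <- Sigma_hat + crossprod(resid * gamma[, r], resid)
  }
  list(w = w_hat, mu = mu_hat, Sigma = Sigma_hat / n)
}

For instance, gamma could be formed as t(apply(z, 1, function(zi) posterior_gamma(theta_hat, zi))) for a hypothetical current iterate theta_hat, using the posterior_gamma() sketch above.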

Having the estimates ({w^r(k)[T]}r=1R,{𝜷^r(k)[T]}r=2R,{𝝁^r(k)[T]}r=1R)(\{\widehat{w}^{(k)[T]}_{r}\}_{r=1}^{R},\{\widehat{\bm{\beta}}^{(k)[T]}_{r}\}_{r=2}^{R},\allowbreak\{\widehat{\bm{\mu}}^{(k)[T]}_{r}\}_{r=1}^{R}) from Algorithm 4, we can plug them into (S.1.37) to construct the clustering method, denoted by 𝒞^(k)[T](𝒛)\widehat{\mathcal{C}}^{(k)[T]}(\bm{z}). Equivalently,

𝒞^(k)[T](𝒛)=argmaxr=1:Rγ𝜽^(k)[T](r)(𝒛).\widehat{\mathcal{C}}^{(k)[T]}(\bm{z})=\operatorname*{arg\,max}_{r=1:R}\gamma^{(r)}_{\widehat{\bm{\theta}}^{(k)[T]}}(\bm{z}). (S.1.40)

S.1.1 Theory

We need the following assumption before stating the theory.

Assumption 2.

Denote Δrj(k)=(𝛍r(k)𝛍j(k))(𝚺(k))1(𝛍r(k)𝛍j(k))\Delta^{(k)}_{rj}=\sqrt{(\bm{\mu}^{(k)*}_{r}-\bm{\mu}^{(k)*}_{j})^{\top}(\bm{\Sigma}^{(k)*})^{-1}(\bm{\mu}^{(k)*}_{r}-\bm{\mu}^{(k)*}_{j})} for kSk\in S. Suppose the following conditions hold:

  1. (i)

    nS=kSnkC1|S|maxk=1:Knkn_{S}=\sum_{k\in S}n_{k}\geq C_{1}|S|\max_{k=1:K}n_{k} with a constant C1>0C_{1}>0;

  2. (ii)

    minkSnkC2(p+logK)\min_{k\in S}n_{k}\geq C_{2}(p+\log K) with some constant C2C_{2};

  3. (iii)

    There exists a permutation π:[R][R]\pi:[R]\rightarrow[R] such that

    1. (a)

      maxkS{[maxr=2:R𝜷^r(k)[0](𝚺(k))1(𝝁π(r)(k)𝝁π(1)(k))2](maxr=1:R𝝁^r(k)[0]𝝁π(r)(k)2)}C3minkSminrjΔrj(k)\max_{k\in S}\big{\{}\big{[}\max_{r=2:R}\|\widehat{\bm{\beta}}^{(k)[0]}_{r}-(\bm{\Sigma}^{(k)*})^{-1}(\bm{\mu}^{(k)*}_{\pi(r)}-\bm{\mu}^{(k)*}_{\pi(1)})\|_{2}\big{]}\vee\big{(}\max_{r=1:R}\|\widehat{\bm{\mu}}^{(k)[0]}_{r}-\bm{\mu}^{(k)*}_{\pi(r)}\|_{2}\big{)}\big{\}}\leq C_{3}\min_{k\in S}\min_{r\neq j}\Delta^{(k)}_{rj}, with some constant C3C_{3};

    2. (b)

      maxkSmaxr=2:R|w^r(k)[0]wπ(r)(k)|cw/2\max_{k\in S}\max_{r=2:R}|\widehat{w}^{(k)[0]}_{r}-w^{(k)*}_{\pi(r)}|\leq c_{w}/2.

  4. (iv)

    minkSminrjΔrj(k)C4>0\min_{k\in S}\min_{r\neq j}\Delta^{(k)}_{rj}\geq C_{4}>0 with some constant C4C_{4};

Remark 6.

The above conditions are analogues of those in Assumption 1. We refer to Remark 1 for a detailed explanation of each condition.

We first present the result for parameter estimation. We adopt error metrics similar to the ones in (14) and (15). Specifically, denote the true parameter by {𝜽¯(k)}kS={({wr(k)}r=1R,{𝝁r(k)}r=1R,𝚺(k))}kS\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}=\{(\{w^{(k)*}_{r}\}_{r=1}^{R},\{\bm{\mu}^{(k)*}_{r}\}_{r=1}^{R},\bm{\Sigma}^{(k)*})\}_{k\in S}, which belongs to the parameter space Θ¯S(h)\overline{\Theta}_{S}(h) in (S.1.34). For each kSk\in S, define the functional 𝜽(k)=({wr(k)}r=2R,{𝜷r(k)}r=2R,{δr(k)}r=2R)\bm{\theta}^{(k)*}=(\{w^{(k)*}_{r}\}_{r=2}^{R},\{\bm{\beta}^{(k)*}_{r}\}_{r=2}^{R},\{\delta^{(k)*}_{r}\}_{r=2}^{R}), where 𝜷r(k)=(𝚺(k))1(𝝁r(k)𝝁1(k)),δr(k)=12(𝜷r(k))(𝝁1(k)+𝝁r(k))\bm{\beta}^{(k)*}_{r}=(\bm{\Sigma}^{(k)*})^{-1}(\bm{\mu}^{(k)*}_{r}-\bm{\mu}^{(k)*}_{1}),\delta^{(k)*}_{r}=\frac{1}{2}(\bm{\beta}^{(k)*}_{r})^{\top}(\bm{\mu}^{(k)*}_{1}+\bm{\mu}^{(k)*}_{r}). For the estimators returned by Algorithm 4, we are interested in the following error metrics (as in the binary case, the minimum over label permutations is taken due to the non-identifiability of cluster labels in multi-cluster GMMs):

d(𝜽^(k)[T],𝜽(k))=minπ:[R][R]maxr=2:R{|w^r(k)[T]wπ(r)(k)|𝜷^r(k)[T](𝚺(k))1(𝝁π(r)(k)𝝁π(1)(k))2\displaystyle d(\widehat{\bm{\theta}}^{(k)[T]},\bm{\theta}^{(k)*})=\min_{\pi:[R]\rightarrow[R]}\max_{r=2:R}\Big{\{}|\widehat{w}^{(k)[T]}_{r}-w^{(k)*}_{\pi(r)}|\vee\|\widehat{\bm{\beta}}^{(k)[T]}_{r}-(\bm{\Sigma}^{(k)*})^{-1}(\bm{\mu}^{(k)*}_{\pi(r)}-\bm{\mu}^{(k)*}_{\pi(1)})\|_{2} (S.1.41)
|δ^r(k)[T](𝝁π(r)(k)+𝝁π(1)(k))(𝚺(k))1(𝝁π(r)(k)𝝁π(1)(k))/2|},\displaystyle\hskip 170.71652pt\vee|\widehat{\delta}^{(k)[T]}_{r}-(\bm{\mu}^{(k)*}_{\pi(r)}+\bm{\mu}^{(k)*}_{\pi(1)})^{\top}(\bm{\Sigma}^{(k)*})^{-1}(\bm{\mu}^{(k)*}_{\pi(r)}-\bm{\mu}^{(k)*}_{\pi(1)})/2|\Big{\}}, (S.1.42)
(minπ:[R][R]maxr=1:R𝝁^r(k)[T]𝝁π(r)(k)2)𝚺^(k)[T]𝚺(k)2.\displaystyle\Big{(}\min_{\pi:[R]\rightarrow[R]}\max_{r=1:R}\|\widehat{\bm{\mu}}^{(k)[T]}_{r}-\bm{\mu}^{(k)*}_{\pi(r)}\|_{2}\Big{)}\vee\|\widehat{\bm{\Sigma}}^{(k)[T]}-\bm{\Sigma}^{(k)*}\|_{2}. (S.1.43)
Theorem 7.

(Upper bounds of the estimation error of GMM parameters for multi-cluster MTL-GMM) Suppose Assumption 2 holds, |S|s|S|\geq s, and ϵ=KsK<1/3\epsilon=\frac{K-s}{K}<1/3. Let λ[0]C1maxk=1:Knk\lambda^{[0]}\geq C_{1}\max_{k=1:K}\sqrt{n_{k}}, CλC1C_{\lambda}\geq C_{1} and κ>C2\kappa>C_{2} with some constants C1>0,C2(0,1)C_{1}>0,C_{2}\in(0,1) (C1C_{1} and C2C_{2} depend on the constants MM, cwc_{w}, c𝚺c_{\bm{\Sigma}}, etc.). Then there exists a constant C3>0C_{3}>0, such that for any {𝛉¯(k)}kSΘ¯S(h)\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S}(h) and any probability measure S\mathbb{Q}_{S} on (p)nSc(\mathbb{R}^{p})^{\otimes n_{S^{c}}}, with probability 1C3K11-C_{3}K^{-1}, the following hold for all kSk\in S:

d(𝜽^(k)[T],𝜽(k))\displaystyle d(\widehat{\bm{\theta}}^{(k)[T]},\bm{\theta}^{(k)*}) pnS+logKnk+hp+logKnk+ϵp+logKmaxk=1:Knk+T2(κ)T,\displaystyle\lesssim\sqrt{\frac{p}{n_{S}}}+\sqrt{\frac{\log K}{n_{k}}}+h\wedge\sqrt{\frac{p+\log K}{n_{k}}}+\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}}+T^{2}(\kappa^{\prime})^{T}, (S.1.44)
(minπ:[R][R]maxr=1:R𝝁^r(k)[T]𝝁π(r)(k)2)𝚺^(k)[T]𝚺(k)2p+logKnk+T2(κ)T,\displaystyle\left(\min_{\pi:[R]\rightarrow[R]}\max_{r=1:R}\|\widehat{\bm{\mu}}^{(k)[T]}_{r}-\bm{\mu}^{(k)*}_{\pi(r)}\|_{2}\right)\vee\|\widehat{\bm{\Sigma}}^{(k)[T]}-\bm{\Sigma}^{(k)*}\|_{2}\lesssim\sqrt{\frac{p+\log K}{n_{k}}}+T^{2}(\kappa^{\prime})^{T}, (S.1.45)

where κ(0,1)\kappa^{\prime}\in(0,1) is some constant and nS=kSnkn_{S}=\sum_{k\in S}n_{k}. When TClog(maxk=1:Knk)T\geq C\log(\max_{k=1:K}n_{k}) with some large constant C>0C>0, the last term on the right-hand side will be dominated by other terms in both inequalities.

Recall the clustering method 𝒞^(k)[T]\widehat{\mathcal{C}}^{(k)[T]} defined in (S.1.40). The next theorem obtains the upper bound of the excess mis-clustering error of 𝒞^(k)[T]\widehat{\mathcal{C}}^{(k)[T]} for kSk\in S.

Theorem 8.

(Upper bound of the excess mis-clustering error for multi-cluster MTL-GMM) Suppose the same conditions in Theorem 7 hold. Then there exists a constant C1>0C_{1}>0 such that for any {𝛉¯(k)}kSΘ¯S(h)\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S}(h) and any probability measure S\mathbb{Q}_{S} on (p)nSc(\mathbb{R}^{p})^{\otimes n_{S^{c}}}, with probability at least 1C1K11-C_{1}K^{-1}, the following holds for all kSk\in S:

R𝜽¯(k)(𝒞^(k)[T])R𝜽¯(k)(𝒞𝜽¯(k))d2(𝜽^(k)[T],𝜽(k))logd1(𝜽^(k)[T],𝜽(k))\displaystyle R_{\overline{\bm{\theta}}^{(k)*}}(\widehat{\mathcal{C}}^{(k)[T]})-R_{\overline{\bm{\theta}}^{(k)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(k)*}})\lesssim d^{2}(\widehat{\bm{\theta}}^{(k)[T]},\bm{\theta}^{(k)*})\cdot\log d^{-1}(\widehat{\bm{\theta}}^{(k)[T]},\bm{\theta}^{(k)*}) (S.1.46)
[pnS+logKnk+h2p+logKnk+ϵ2p+logKmaxk=1:Knk+T4(κ)2T]log(nSpnklogK),\displaystyle\lesssim\bigg{[}\frac{p}{n_{S}}+\frac{\log K}{n_{k}}+h^{2}\wedge\frac{p+\log K}{n_{k}}+\epsilon^{2}\frac{p+\log K}{\max_{k=1:K}n_{k}}+T^{4}(\kappa^{\prime})^{2T}\bigg{]}\cdot\log\left(\frac{n_{S}}{p}\wedge\frac{n_{k}}{\log K}\right), (S.1.47)

where κ(0,1)\kappa^{\prime}\in(0,1) is some constant. When TClog(maxk=1:Knk)T\geq C\log(\max_{k=1:K}n_{k}) with a large constant C>0C>0, the term involving TT on the right-hand side will be dominated by other terms.

Comparing the upper bounds in Theorems 7 and 8 with those in Theorems 1 and 2, the only difference is an extra logarithmic term log(nSpnklogK)\log\big{(}\frac{n_{S}}{p}\wedge\frac{n_{k}}{\log K}\big{)} in Theorem 8, which we believe is a proof artifact. Similar logarithmic terms appear in other work on multi-cluster GMMs as well; see, for example, Yan et al., (2017) and Zhao et al., (2020). To understand the upper bounds in Theorems 7 and 8, we can follow the discussion after Theorems 1 and 2. We do not repeat it here.

The following lower bounds together with the derived upper bounds will show that our method is (nearly) minimax optimal in a wide range of regimes.

Theorem 9.

(Lower bounds of the estimation error of GMM parameters in multi-task learning) Suppose ϵ=KsK<1/3\epsilon=\frac{K-s}{K}<1/3. When there exists a subset SS with |S|s|S|\geq s such that minkSnkC1(p+logK)\min_{k\in S}n_{k}\geq C_{1}(p+\log K) and minkSminrjΔrj(k)C2\min_{k\in S}\min_{r\neq j}\Delta_{rj}^{(k)}\geq C_{2}, where C1,C2>0C_{1},C_{2}>0 are some constants, we have

inf{𝜽^(k)}k=1KsupS:|S|ssup{𝜽¯(k)}kSΘ¯S(h)S\displaystyle\inf_{\{\widehat{\bm{\theta}}^{(k)}\}_{k=1}^{K}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S}(h)\\ \mathbb{Q}_{S}\end{subarray}} (kS{d(𝜽^(k),𝜽(k))pnS+logKnk\displaystyle\mathbb{P}\Bigg{(}\bigcup_{k\in S}\Bigg{\{}d(\widehat{\bm{\theta}}^{(k)},\bm{\theta}^{(k)*})\gtrsim\sqrt{\frac{p}{n_{S}}}+\sqrt{\frac{\log K}{n_{k}}} (S.1.48)
+hp+logKnk+ϵmaxk=1:Knk})110,\displaystyle\quad\quad+h\wedge\sqrt{\frac{p+\log K}{n_{k}}}+\frac{\epsilon}{\sqrt{\max_{k=1:K}n_{k}}}\Bigg{\}}\Bigg{)}\geq\frac{1}{10}, (S.1.49)
inf{𝝁^r(k)}k=1:K,r=1:R{𝚺^(k)}k=1KsupS:|S|ssup{𝜽¯(k)}kSΘ¯S(h)S(kS{\displaystyle\inf_{\begin{subarray}{c}\{\widehat{\bm{\mu}}^{(k)}_{r}\}_{k=1:K,r=1:R}\\ \{\widehat{\bm{\Sigma}}^{(k)}\}_{k=1}^{K}\end{subarray}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S}(h)\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}\Bigg{(}\bigcup_{k\in S}\Bigg{\{} (minπ:[R][R]maxr=1:R𝝁^r(k)𝝁π(r)(k)2)\displaystyle\left(\min_{\pi:[R]\rightarrow[R]}\max_{r=1:R}\|\widehat{\bm{\mu}}^{(k)}_{r}-\bm{\mu}^{(k)*}_{\pi(r)}\|_{2}\right)\vee (S.1.50)
𝚺^(k)𝚺(k)2p+logKnk})110.\displaystyle\|\widehat{\bm{\Sigma}}^{(k)}-\bm{\Sigma}^{(k)*}\|_{2}\gtrsim\sqrt{\frac{p+\log K}{n_{k}}}\Bigg{\}}\Bigg{)}\geq\frac{1}{10}. (S.1.51)
Theorem 10.

(Lower bound of the excess mis-clustering error in multi-task learning) Suppose the same conditions in Theorem 9 hold. Then

inf{𝒞^(k)}k=1KsupS:|S|ssup{𝜽¯(k)}kSΘ¯S(h)S\displaystyle\inf_{\{\widehat{\mathcal{C}}^{(k)}\}_{k=1}^{K}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S}(h)\\ \mathbb{Q}_{S}\end{subarray}} (kS{R𝜽¯(k)(𝒞^(k))R𝜽¯(k)(𝒞𝜽¯(k))pnS+logKnk\displaystyle\mathbb{P}\Bigg{(}\bigcup_{k\in S}\Bigg{\{}R_{\overline{\bm{\theta}}^{(k)*}}(\widehat{\mathcal{C}}^{(k)})-R_{\overline{\bm{\theta}}^{(k)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(k)*}})\gtrsim\frac{p}{n_{S}}+\frac{\log K}{n_{k}} (S.1.52)
+h2p+logKnk+ϵ2maxk=1:Knk})110.\displaystyle\quad\quad\quad+h^{2}\wedge\frac{p+\log K}{n_{k}}+\frac{\epsilon^{2}}{\max_{k=1:K}n_{k}}\Bigg{\}}\Bigg{)}\geq\frac{1}{10}. (S.1.53)

The lower bounds in Theorems 9 and 10 are the same as those in Theorems 3 and 4. Therefore, the remarks on the comparison of upper and lower bounds presented after Theorem 4 carry over to the multi-cluster setting (up to the logarithmic term from Theorem 8). We do not repeat the details here.

S.1.2 Alignment

Similar to the binary case, the alignment issue arises in the multi-cluster case as well. In this section, we propose two alignment algorithms that extend Algorithms 2 and 3.

In the multi-cluster case, the alignment of each task can be represented as a permutation of [R][R]. Consider a series of permutations 𝝅={πk}k=1K\bm{\pi}=\{\pi_{k}\}_{k=1}^{K}, where each πk\pi_{k} is a permutation function on [R][R]. Define a score of 𝝅\bm{\pi} as

score(𝝅)=r=2Rkk(𝚺^(k)[0])1(𝝁^πk(r)(k)[0]𝝁^πk(1)(k)[0])(𝚺^(k)[0])1(𝝁^πk(r)(k)[0]𝝁^πk(1)(k)[0])2.\textup{score}(\bm{\pi})=\sum_{r=2}^{R}\sum_{k\neq k^{\prime}}\left\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}\big{(}\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(1)})-(\widehat{\bm{\Sigma}}^{(k^{\prime})[0]})^{-1}(\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{\pi_{k^{\prime}}(r)}-\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{\pi_{k^{\prime}}(1)}\big{)}\right\|_{2}. (S.1.54)

We want to recover the correct alignment πk=argminπk:[R][R]r=2R(𝚺^(k)[0])1(𝝁^πk(r)(k)[0]𝝁^πk(1)(k)[0])𝜷r(k)2\pi_{k}^{*}=\operatorname*{arg\,min}\limits_{\pi_{k}:[R]\rightarrow[R]}\sum_{r=2}^{R}\left\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}\big{(}\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(1)}\big{)}-\bm{\beta}^{(k)*}_{r}\right\|_{2} for each kSk\in S. We propose an exhaustive search algorithm, which is summarized in Algorithm 5.

Input: Initialization {{𝝁^r(k)[0]}r=1R,𝚺^(k)[0]}k=1K\{\{\widehat{\bm{\mu}}^{(k)[0]}_{r}\}_{r=1}^{R},\widehat{\bm{\Sigma}}^{(k)[0]}\}_{k=1}^{K}
1 𝝅^={π^k}k=1Kargmin𝝅score(𝝅)\widehat{\bm{\pi}}=\{\widehat{\pi}_{k}\}_{k=1}^{K}\leftarrow\operatorname*{arg\,min}_{\bm{\pi}}\text{score}(\bm{\pi})
Output: 𝝅^={π^k}k=1K\widehat{\bm{\pi}}=\{\widehat{\pi}_{k}\}_{k=1}^{K}
Algorithm 5 Exhaustive search for the alignment in multi-cluster GMMs
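To make the score (S.1.54) concrete, the following hedged R sketch evaluates it for a given list of per-task permutations; mu0 is a hypothetical list of p x R matrices of initial cluster-mean estimates, Sigma0 a list of p x p initial covariance estimates, and perms a list of length-R permutation vectors, one per task.

# Alignment score (S.1.54) for a given collection of permutations.
alignment_score_multi <- function(mu0, Sigma0, perms) {
  K <- length(mu0); R <- ncol(mu0[[1]])
  # discriminant coefficients (Sigma^(k))^{-1}(mu_{pi_k(r)} - mu_{pi_k(1)}), r = 2:R
  B <- lapply(1:K, function(k) {
    solve(Sigma0[[k]],
          mu0[[k]][, perms[[k]][2:R], drop = FALSE] - mu0[[k]][, perms[[k]][1]])
  })
  total <- 0
  for (k1 in 1:(K - 1)) for (k2 in (k1 + 1):K) {
    total <- total + sum(sqrt(colSums((B[[k1]] - B[[k2]])^2)))
  }
  2 * total                                   # (S.1.54) sums over ordered pairs (k, k')
}

Algorithm 5 then evaluates this score for every element of the Cartesian product of the R!R! permutations of each task (for instance enumerated by combinat::permn) and outputs a minimizer, which explains the exponential complexity discussed below.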

The following theorem shows that under certain conditions, the output from Algorithm 5 recovers the correct alignment up to a permutation.

Theorem 11 (Alignment correctness for Algorithm 5).

Assume that

  1. (i)

    maxkSmaxrjΔrj(k)/minkSminrjΔrj(k)D\max_{k\in S}\max_{r\neq j}\Delta^{(k)}_{rj}/\min_{k\in S}\min_{r\neq j}\Delta^{(k)}_{rj}\leq D with D1D\geq 1.

  2. (ii)

    ϵ<124Dc𝚺+1\epsilon<\frac{1}{24Dc_{\bm{\Sigma}}+1};

  3. (iii)

    minkSminrjΔrj(k)[4(1ϵ)c𝚺1/21(24Dc𝚺+1)ϵh+(4+20ϵ)c𝚺1/21(24Dc𝚺+1)ϵξ][13(1ϵ)c𝚺1/21(9Dc𝚺+1)ϵh+(134ϵ)c𝚺1/21(9Dc𝚺+1)ϵξ]\min_{k\in S}\min_{r\neq j}\Delta^{(k)}_{rj}\geq\left[\frac{4(1-\epsilon)c_{\bm{\Sigma}}^{1/2}}{1-(24Dc_{\bm{\Sigma}}+1)\epsilon}h+\frac{(4+20\epsilon)c_{\bm{\Sigma}}^{1/2}}{1-(24Dc_{\bm{\Sigma}}+1)\epsilon}\xi\right]\vee\left[\frac{13(1-\epsilon)c_{\bm{\Sigma}}^{1/2}}{1-(9Dc_{\bm{\Sigma}}+1)\epsilon}h+\frac{(13-4\epsilon)c_{\bm{\Sigma}}^{1/2}}{1-(9Dc_{\bm{\Sigma}}+1)\epsilon}\xi\right],

where ϵ=KsK\epsilon=\frac{K-s}{K} is the outlier task proportion introduced in Theorem 7, hh is the degree of discriminant coefficient similarity defined in (S.1.34), and

ξ\displaystyle\xi =maxkSminπ:[R][R]maxr=1:R(𝚺^(k)[0])1(𝝁^π(r)(k)[0]𝝁^π(1)(k)[0])𝜷r(k)2\displaystyle=\max_{k\in S}\min_{\pi:[R]\rightarrow[R]}\max_{r=1:R}\left\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}\big{(}\widehat{\bm{\mu}}^{(k)[0]}_{\pi(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi(1)}\big{)}-\bm{\beta}^{(k)*}_{r}\right\|_{2} (S.1.55)
=maxkSminπ:[R][R]maxr=1:R(𝚺^(k)[0])1(𝝁^π(r)(k)[0]𝝁^π(1)(k)[0])(𝚺(k))1(𝝁r(k)𝝁1(k))2.\displaystyle=\max_{k\in S}\min_{\pi:[R]\rightarrow[R]}\max_{r=1:R}\left\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}\big{(}\widehat{\bm{\mu}}^{(k)[0]}_{\pi(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi(1)}\big{)}-(\bm{\Sigma}^{(k)*})^{-1}\big{(}\bm{\mu}^{(k)*}_{r}-\bm{\mu}^{(k)*}_{1}\big{)}\right\|_{2}. (S.1.56)

Then there exists a permutation ι:[R][R]\iota:[R]\rightarrow[R], such that the output of Algorithm 5 satisfies

π^k=ιπk,\widehat{\pi}_{k}=\iota\circ\pi_{k}^{*}, (S.1.57)

for all kSk\in S.

The main drawback of Algorithm 5 is its computational cost. Its time complexity is \mathcal{O}((R!)^{K}\cdot RK^{2}) because it searches over all K-tuples of permutations, which is not practically feasible when R and K are large. Therefore, we propose the following greedy search algorithm to reduce the computational cost, summarized in Algorithm 6. Its main idea is similar to that of Algorithm 3 for the binary GMM, but the procedure is different. We define the score of the alignments \{\pi_{k^{\prime}}\}_{k^{\prime}=1}^{k} of the first k tasks as

score({πk}k=1k|{{𝝁^r(k)[0]}r=1R}k=1k,{𝚺^(k)[0]}k=1k)\displaystyle\textup{score}(\{\pi_{k^{\prime}}\}_{k^{\prime}=1}^{k}|\{\{\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{r}\}_{r=1}^{R}\}_{k^{\prime}=1}^{k},\{\widehat{\bm{\Sigma}}^{(k^{\prime})[0]}\}_{k^{\prime}=1}^{k}) (S.1.58)
=\sum_{r=2}^{R}\sum_{\tilde{k},k^{\prime}\leq k}\left\|(\widehat{\bm{\Sigma}}^{(\tilde{k})[0]})^{-1}\big(\widehat{\bm{\mu}}^{(\tilde{k})[0]}_{\pi_{\tilde{k}}(r)}-\widehat{\bm{\mu}}^{(\tilde{k})[0]}_{\pi_{\tilde{k}}(1)}\big)-(\widehat{\bm{\Sigma}}^{(k^{\prime})[0]})^{-1}\big(\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{\pi_{k^{\prime}}(r)}-\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{\pi_{k^{\prime}}(1)}\big)\right\|_{2}. (S.1.59)
Input: Initialization {{𝝁^r(k)[0]}r=1R,𝚺^(k)[0]}k=1K\{\{\widehat{\bm{\mu}}^{(k)[0]}_{r}\}_{r=1}^{R},\widehat{\bm{\Sigma}}^{(k)[0]}\}_{k=1}^{K}
1 for k=1k=1 to KK do
2       With {π^k}k=1k1\{\widehat{\pi}_{k^{\prime}}\}_{k^{\prime}=1}^{k-1} fixed (\emptyset when k=1k=1), set π^k=argminπ:[R][R]score({π^k}k=1k1π|{{𝝁^r(k)[0]}r=1R}k=1k,{𝚺^(k)[0]}k=1k)\widehat{\pi}_{k}=\operatorname*{arg\,min}\limits_{\pi:[R]\rightarrow[R]}\textup{score}(\{\widehat{\pi}_{k^{\prime}}\}_{k^{\prime}=1}^{k-1}\cup\pi|\{\{\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{r}\}_{r=1}^{R}\}_{k^{\prime}=1}^{k},\{\widehat{\bm{\Sigma}}^{(k^{\prime})[0]}\}_{k^{\prime}=1}^{k})
3      
4 end for
Output: 𝝅^={π^k}k=1K\widehat{\bm{\pi}}=\{\widehat{\pi}_{k}\}_{k=1}^{K}
Algorithm 6 Greedy search for the alignment in multi-cluster GMMs
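A minimal sketch of Algorithm 6 under the same assumed data layout as the previous sketch is given below. The random task-shuffling restarts correspond to the empirical enhancement described at the end of this subsection, and all names are ours rather than the authors'.

import itertools
import numpy as np

def coef(mu, Sigma, perm):
    # Discriminant coefficients Sigma^{-1} (mu_{perm(r)} - mu_{perm(1)}), r = 2..R.
    return np.linalg.solve(Sigma, (mu[list(perm[1:])] - mu[perm[0]]).T).T  # shape (R-1, p)

def greedy_alignment(mu0, Sigma0, n_restarts=200, seed=None):
    rng = np.random.default_rng(seed)
    R, K = mu0[0].shape[0], len(mu0)
    all_perms = list(itertools.permutations(range(R)))
    best, best_score = None, np.inf
    for _ in range(n_restarts):            # random shuffles of the task order (empirical enhancement)
        order = rng.permutation(K)
        perms, fixed, total = [None] * K, [], 0.0
        for k in order:
            # Choose the permutation for task k that best matches the tasks aligned so far;
            # earlier pairwise terms are constants, so this matches the greedy score update.
            scores = [sum(np.linalg.norm(coef(mu0[k], Sigma0[k], perm) - b, axis=1).sum()
                          for b in fixed) for perm in all_perms]
            j = int(np.argmin(scores))
            perms[k], total = all_perms[j], total + scores[j]
            fixed.append(coef(mu0[k], Sigma0[k], perms[k]))
        if total < best_score:              # keep the candidate with the smallest score
            best, best_score = perms, total
    return best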

The subsequent theorem demonstrates that, under slightly stronger assumptions than those required by Algorithm 5, the greedy search algorithm recovers the correct alignment up to a permutation with high probability. Importantly, this approach reduces the computational cost from \mathcal{O}((R!)^{K}\cdot RK^{2}) to \mathcal{O}(R!K\cdot RK^{2}).

Theorem 12.

Assume there are no outlier tasks in the first K0K_{0} tasks, and

  1. (i)

    maxkSmaxrjΔrj(k)/minkSminrjΔrj(k)D\max_{k\in S}\max_{r\neq j}\Delta^{(k)}_{rj}/\min_{k\in S}\min_{r\neq j}\Delta^{(k)}_{rj}\leq D with D1D\geq 1.

  2. (ii)

    ϵ<12Dc𝚺+1\epsilon<\frac{1}{2Dc_{\bm{\Sigma}}+1};

  3. (iii)

    K0>2Dc𝚺KϵK_{0}>2Dc_{\bm{\Sigma}}K\epsilon;

  4. (iv)

    minkSminrjΔrj(k)[4K0c𝚺1/2K0Dc𝚺Kϵh+(4K0+Kϵ)c𝚺1/2K0Dc𝚺Kϵξ][2K0c𝚺1/2K02Dc𝚺Kϵh+(2K0+2Kϵ)c𝚺1/2K02Dc𝚺Kϵξ]\min_{k\in S}\min_{r\neq j}\Delta^{(k)}_{rj}\geq\left[\frac{4K_{0}c_{\bm{\Sigma}}^{1/2}}{K_{0}-Dc_{\bm{\Sigma}}K\epsilon}h+\frac{(4K_{0}+K\epsilon)c_{\bm{\Sigma}}^{1/2}}{K_{0}-Dc_{\bm{\Sigma}}K\epsilon}\xi\right]\vee\left[\frac{2K_{0}c_{\bm{\Sigma}}^{1/2}}{K_{0}-2Dc_{\bm{\Sigma}}K\epsilon}h+\frac{(2K_{0}+2K\epsilon)c_{\bm{\Sigma}}^{1/2}}{K_{0}-2Dc_{\bm{\Sigma}}K\epsilon}\xi\right],

where ϵ=KsK\epsilon=\frac{K-s}{K} is the outlier task proportion and c𝚺c_{\bm{\Sigma}} appears in the condition that c𝚺1minkSλmin(𝚺(k))maxkSλmax(𝚺(k))c𝚺c_{\bm{\Sigma}}^{-1}\leq\min_{k\in S}\lambda_{\min}(\bm{\Sigma}^{(k)*})\leq\max_{k\in S}\lambda_{\max}(\bm{\Sigma}^{(k)*})\leq c_{\bm{\Sigma}}. Then there exists a permutation ι:[R][R]\iota:[R]\rightarrow[R], such that the output of Algorithm 6 satisfies

π^k=ιπk,\widehat{\pi}_{k}=\iota\circ\pi_{k}^{*}, (S.1.60)

for all kSk\in S.

Remark 7.

Conditions (ii)-(iv) are similar to the conditions in Theorem 6. The inclusion of Condition (i) aims to facilitate the analysis in the proof, and we conjecture that the obtained results persist even if this condition is omitted.

When R is very large, the computational burden becomes prohibitive, rendering even the \mathcal{O}(R!K\cdot RK^{2}) time complexity impractical. Addressing this computational challenge requires more efficient alignment algorithms, a pursuit that we defer to future investigations. In addition, one caveat of the greedy search algorithm is that it requires knowing K_{0} non-outlier tasks a priori, which may be unrealistic in practice. In our empirical examinations, we enhance the algorithm’s performance by randomly shuffling the K tasks in each run. Specifically, we execute Algorithm 6 200 times, yielding 200 alignment candidates, and the final alignment is the candidate that attains the minimum score.

S.2 Transfer Learning

S.2.1 Problem setting

In the main text and Section S.1, we discussed GMMs in the context of multi-task learning, where the goal is to learn all tasks jointly by utilizing the potential similarities shared by different tasks. In this section, we study binary GMMs in the transfer learning context, where the focus is on improving learning in one target task by transferring knowledge from related source tasks. Multi-cluster results can be obtained similarly to the MTL case, and we omit the details given the extensive length of the paper.

Suppose that there are (K+1) tasks in total, where the first task is called the target and the remaining K tasks are called sources. As in multi-task learning, we assume that there exists an unknown subset S\subseteq 1:K such that samples from each source in S independently follow a GMM, while samples from sources outside S can be arbitrarily distributed. This means,

yi(k)={1,with probability 1w(k);2,with probability w(k);\displaystyle y_{i}^{(k)}=\begin{cases}1,&\text{with probability }1-w^{(k)*};\\ 2,&\text{with probability }w^{(k)*};\end{cases} (S.2.61)
𝒛i(k)|yi(k)=j𝒩(𝝁j(k),𝚺(k)),j=1,2,\bm{z}_{i}^{(k)}|y_{i}^{(k)}=j\sim\mathcal{N}(\bm{\mu}^{(k)*}_{j},\bm{\Sigma}^{(k)*}),\,\,j=1,2, (S.2.62)

for all kSk\in S, i=1:nki=1:n_{k}, and

{𝒛i(k)}i,kScS,\{\bm{z}^{(k)}_{i}\}_{i,k\in S^{c}}\sim\mathbb{Q}_{S}, (S.2.63)

where S\mathbb{Q}_{S} is some probability measure on (p)nSc(\mathbb{R}^{p})^{\otimes n_{S^{c}}} and nSc=kScnkn_{S^{c}}=\sum_{k\in S^{c}}n_{k}.

For the target task, we observe a sample \{\bm{z}_{i}^{(0)}\}_{i=1}^{n_{0}} independently drawn from the following GMM:

yi(0)={1,with probability 1w(0);2,with probability w(0);\displaystyle y_{i}^{(0)}=\begin{cases}1,&\text{with probability }1-w^{(0)*};\\ 2,&\text{with probability }w^{(0)*};\end{cases} (S.2.64)
𝒛i(0)|yi(0)=j𝒩(𝝁j(0),𝚺(0)),j=1,2.\bm{z}_{i}^{(0)}|y_{i}^{(0)}=j\sim\mathcal{N}(\bm{\mu}^{(0)*}_{j},\bm{\Sigma}^{(0)*}),\,\,j=1,2. (S.2.65)

The objective of transfer learning is to use the source data to improve GMM learning on the target task. As in multi-task learning, we measure the learning performance by both the parameter estimation error and the excess mis-clustering error, but only on the target GMM. Toward this end, we define the joint parameter space for the GMM parameters of the target and the sources in S:

Θ¯S(h)={{𝜽¯(k)}k{0}S={(w(k),𝝁1(k),𝝁2(k),𝚺(k))}k{0}S:𝜽¯(k)Θ¯,maxkS𝜷(k)𝜷(0)2h},\overline{\Theta}_{S}^{\prime}(h)=\Big{\{}\{\overline{\bm{\theta}}^{(k)}\}_{k\in\{0\}\cup S}=\{(w^{(k)},\bm{\mu}^{(k)}_{1},\bm{\mu}^{(k)}_{2},\bm{\Sigma}^{(k)})\}_{k\in\{0\}\cup S}:\overline{\bm{\theta}}^{(k)}\in\overline{\Theta},\max_{k\in S}\|\bm{\beta}^{(k)}-\bm{\beta}^{(0)}\|_{2}\leq h\Big{\}}, (S.2.66)

where \overline{\Theta} is the single GMM parameter space introduced in (8), and \bm{\beta}^{(k)}=(\bm{\Sigma}^{(k)})^{-1}(\bm{\mu}^{(k)}_{2}-\bm{\mu}^{(k)}_{1}), k\in\{0\}\cup S. Compared with the multi-task learning parameter space \overline{\Theta}_{S}(h) defined in the main text, here the target discriminant coefficient \bm{\beta}^{(0)} serves as the “center” of the discriminant coefficients of the sources in S. The quantity h characterizes the closeness between the sources in S and the target.

S.2.2 Method

Like the MTL-GMM procedure developed in Section 2.2, we combine the EM algorithm with the penalization framework to develop a variant of the EM algorithm for transfer learning. The key idea is to first apply MTL-GMM to all the sources to obtain an estimate of the discriminant coefficient “center” as a good summary statistic of the K source data sets, and then shrink the target discriminant coefficient towards this center estimate in the EM iterations to exploit the relatedness between the sources and the target. See Section 3.3 of \citeappduan2023adaptive for more general discussions of this idea. Our proposed transfer learning procedure TL-GMM is summarized in Algorithm 7.

While the steps of TL-GMM look very similar to those of MTL-GMM, there exist two major differences between them. First, for each optimization problem in TL-GMM, the first part of the objective function only involves the target data {𝒛i(0)}i=1n0\{\bm{z}^{(0)}_{i}\}_{i=1}^{n_{0}}, while in MTL-GMM, it is a weighted average of all tasks. Second, in TL-GMM, the penalty is imposed on the distance between a discriminant coefficient estimator and a given center estimator produced by MTL-GMM from the source data. In contrast, the center is estimated simultaneously with other parameters through the penalization in MTL-GMM. In light of existing transfer learning approaches in the literature, TL-GMM can be considered as the “debiasing” step described in \citeappli2021transfer and \citeapptian2023transfer, which corrects potential bias of the center estimate using the target data.

The tuning parameters \{\lambda^{[t]}_{0}\}_{t=1}^{T_{0}} in Algorithm 7 control the amount of knowledge transferred from the sources. Setting them large enough forces the parameter estimates for the target task to be exactly equal to the center learned from the sources, while setting them to zero reduces TL-GMM to the standard EM algorithm on the target data alone.

Input: Initialization 𝜽^(0)[0]=(w^(0)[0],𝜷^(0)[0],δ^(0)[0])\widehat{\bm{\theta}}^{(0)[0]}=(\widehat{w}^{(0)[0]},\widehat{\bm{\beta}}^{(0)[0]},\widehat{\delta}^{(0)[0]}), output 𝜷¯[T]\overline{\bm{\beta}}^{[T]} from Algorithm 1, maximum number of iteration rounds T0T_{0}, initial penalty parameter λ0[0]\lambda^{[0]}_{0}, tuning parameters Cλ0>0C_{\lambda_{0}}>0 and κ0(0,1)\kappa_{0}\in(0,1)
1 for t=1t=1 to T0T_{0} do
2       λ0[t]=κ0λ0[t1]+Cλ0p+logK\lambda^{[t]}_{0}=\kappa_{0}\lambda^{[t-1]}_{0}+C_{\lambda_{0}}\sqrt{p+\log K} // Update the penalty parameter
3       w^(0)[t]=1n0i=1n0γ𝜽^(0)[t1](𝒛i(0))\widehat{w}^{(0)[t]}=\frac{1}{n_{0}}\sum_{i=1}^{n_{0}}\gamma_{\widehat{\bm{\theta}}^{(0)[t-1]}}(\bm{z}_{i}^{(0)})
4       𝝁^1(0)[t]=i=1n0[1γ𝜽^(0)[t1](𝒛i(0))]𝒛i(0)n0(1w^(0)[t])\widehat{\bm{\mu}}^{(0)[t]}_{1}=\frac{\sum_{i=1}^{n_{0}}[1-\gamma_{\widehat{\bm{\theta}}^{(0)[t-1]}}(\bm{z}_{i}^{(0)})]\bm{z}^{(0)}_{i}}{n_{0}(1-\widehat{w}^{(0)[t]})}, 𝝁^2(0)[t]=i=1n0γ𝜽^(0)[t1](𝒛i(0))𝒛i(0)n0w^(0)[t]\widehat{\bm{\mu}}^{(0)[t]}_{2}=\frac{\sum_{i=1}^{n_{0}}\gamma_{\widehat{\bm{\theta}}^{(0)[t-1]}}(\bm{z}_{i}^{(0)})\bm{z}^{(0)}_{i}}{n_{0}\widehat{w}^{(0)[t]}}
5       𝚺^(0)[t]=1n0i=1n0{[1γ𝜽^(0)[t1](𝒛i(0))](𝒛i(0)𝝁^1(0)[t])(𝒛i(0)𝝁^1(0)[t])\widehat{\bm{\Sigma}}^{(0)[t]}=\frac{1}{n_{0}}\sum_{i=1}^{n_{0}}\left\{[1-\gamma_{\widehat{\bm{\theta}}^{(0)[t-1]}}(\bm{z}_{i}^{(0)})]\cdot(\bm{z}^{(0)}_{i}-\widehat{\bm{\mu}}^{(0)[t]}_{1})(\bm{z}^{(0)}_{i}-\widehat{\bm{\mu}}^{(0)[t]}_{1})^{\top}\right. +γ𝜽^(0)[t1](𝒛i(0))(𝒛i(0)𝝁^2(0)[t])(𝒛i(0)𝝁^2(0)[t])}\left.\hskip 93.89418pt+\gamma_{\widehat{\bm{\theta}}^{(0)[t-1]}}(\bm{z}_{i}^{(0)})\cdot(\bm{z}^{(0)}_{i}-\widehat{\bm{\mu}}^{(0)[t]}_{2})(\bm{z}^{(0)}_{i}-\widehat{\bm{\mu}}^{(0)[t]}_{2})^{\top}\right\}
6       𝜷^(0)[t]=argmin𝜷(0){[12(𝜷(0))𝚺^(0)[t]𝜷(0)(𝜷(0))(𝝁^2(0)[t]𝝁^1(0)[t])]+λ0[t]n0𝜷(0)𝜷¯[T]2}\widehat{\bm{\beta}}^{(0)[t]}=\operatorname*{arg\,min}\limits_{\bm{\beta}^{(0)}}\left\{\left[\frac{1}{2}(\bm{\beta}^{(0)})^{\top}\widehat{\bm{\Sigma}}^{(0)[t]}\bm{\beta}^{(0)}-(\bm{\beta}^{(0)})^{\top}(\widehat{\bm{\mu}}_{2}^{(0)[t]}-\widehat{\bm{\mu}}_{1}^{(0)[t]})\right]+\frac{\lambda^{[t]}_{0}}{\sqrt{n_{0}}}\|\bm{\beta}^{(0)}-\overline{\bm{\beta}}^{[T]}\|_{2}\right\}
7       δ^(0)[t]=12(𝜷^(0)[t])(𝝁^1(0)[t]+𝝁^2(0)[t])\widehat{\delta}^{(0)[t]}=\frac{1}{2}(\widehat{\bm{\beta}}^{(0)[t]})^{\top}(\widehat{\bm{\mu}}^{(0)[t]}_{1}+\widehat{\bm{\mu}}^{(0)[t]}_{2})
8       Let 𝜽^(0)[t]=(w^(0)[t],𝜷^(0)[t],δ^(0)[t])\widehat{\bm{\theta}}^{(0)[t]}=(\widehat{w}^{(0)[t]},\widehat{\bm{\beta}}^{(0)[t]},\widehat{\delta}^{(0)[t]})
9 end for
Output: (𝜽^(0)[T0],𝝁^1(0)[T0],𝝁^2(0)[T0],𝚺^(0)[T0])(\widehat{\bm{\theta}}^{(0)[T_{0}]},\widehat{\bm{\mu}}^{(0)[T_{0}]}_{1},\widehat{\bm{\mu}}^{(0)[T_{0}]}_{2},\widehat{\bm{\Sigma}}^{(0)[T_{0}]}) with 𝜽^(0)[T0]=(w^(0)[T0],𝜷^(0)[T0],δ^(0)[T0])\widehat{\bm{\theta}}^{(0)[T_{0}]}=(\widehat{w}^{(0)[T_{0}]},\widehat{\bm{\beta}}^{(0)[T_{0}]},\widehat{\delta}^{(0)[T_{0}]})
Algorithm 7 TL-GMM
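Below is a minimal sketch of one EM pass of Algorithm 7 for the binary case, assuming the target data are stored in a numpy array Z of shape (n0, p), beta_bar is the center \overline{\bm{\beta}}^{[T]} from Algorithm 1, and lam plays the role of \lambda^{[t]}_{0} (the schedule \lambda^{[t]}_{0}=\kappa_{0}\lambda^{[t-1]}_{0}+C_{\lambda_{0}}\sqrt{p+\log K} would be maintained by the caller). The penalized \bm{\beta}^{(0)} update is solved here by proximal gradient descent with block soft-thresholding; this solver and all names are our illustrative choices, not necessarily the original implementation.

import numpy as np
from scipy.special import expit

def responsibilities(Z, w, beta, delta):
    # gamma_theta(z) = P(Y = 2 | z) under the working parameters (w, beta, delta).
    return expit(Z @ beta - delta + np.log(w / (1.0 - w)))

def tl_gmm_step(Z, w, beta, delta, beta_bar, lam, n_prox=200):
    """One EM pass of TL-GMM on the target data Z (n0 x p)."""
    n0, p = Z.shape
    g = responsibilities(Z, w, beta, delta)                     # E-step
    w_new = g.mean()
    mu1 = ((1 - g)[:, None] * Z).sum(0) / (n0 * (1 - w_new))
    mu2 = (g[:, None] * Z).sum(0) / (n0 * w_new)
    R1, R2 = Z - mu1, Z - mu2
    Sigma = (R1.T @ ((1 - g)[:, None] * R1) + R2.T @ (g[:, None] * R2)) / n0
    # Penalized update:
    #   min_b 0.5 b' Sigma b - b'(mu2 - mu1) + (lam / sqrt(n0)) * ||b - beta_bar||_2,
    # solved by proximal gradient on u = b - beta_bar.
    d, tau = mu2 - mu1, lam / np.sqrt(n0)
    L = np.linalg.eigvalsh(Sigma).max()                         # step size 1/L
    u = beta - beta_bar
    for _ in range(n_prox):
        v = u - (Sigma @ (u + beta_bar) - d) / L                # gradient step on the smooth part
        nv = np.linalg.norm(v)
        u = np.zeros(p) if nv <= tau / L else (1 - tau / (L * nv)) * v  # block soft-threshold
    beta_new = beta_bar + u
    delta_new = 0.5 * beta_new @ (mu1 + mu2)
    return w_new, beta_new, delta_new, mu1, mu2, Sigma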

S.2.3 Theory

In this section, we will establish the upper and lower bounds for the GMM parameter estimation error and the excess mis-clustering error on the target task. First, we impose the following assumption set.

Assumption 3.

Denote Δ(0)=(𝛍1(0)𝛍2(0))(𝚺(0))1(𝛍1(0)𝛍2(0))\Delta^{(0)}=\sqrt{(\bm{\mu}^{(0)*}_{1}-\bm{\mu}^{(0)*}_{2})^{\top}(\bm{\Sigma}^{(0)*})^{-1}(\bm{\mu}^{(0)*}_{1}-\bm{\mu}^{(0)*}_{2})}. Assume the following conditions hold:

  1. (i)

    C1[logKp+ϵ2(1+logKp)]maxkSnkn0C2(1+logKp)C_{1}\left[\frac{\log K}{p}+\epsilon^{2}\big{(}1+\frac{\log K}{p}\big{)}\right]\leq\frac{\max_{k\in S}n_{k}}{n_{0}}\leq C_{2}\big{(}1+\frac{\log K}{p}\big{)} with constants C1C_{1} and C2C_{2}, where ϵ=KsK\epsilon=\frac{K-s}{K}.

  2. (ii)

    n0C3pn_{0}\geq C_{3}p with some constant C3C_{3};

  3. (iii)

    Either of the following two conditions holds with some constant C4C_{4}:

    1. (a)

      𝜷^(0)[0]𝜷(0)2|δ^(0)[0]δ(0)|C4Δ(0)\|\widehat{\bm{\beta}}^{(0)[0]}-\bm{\beta}^{(0)*}\|_{2}\vee|\widehat{\delta}^{(0)[0]}-\delta^{(0)*}|\leq C_{4}\Delta^{(0)}, |w^(0)[0]w(0)|cw/2|\widehat{w}^{(0)[0]}-w^{(0)*}|\leq c_{w}/2;

    2. (b)

      𝜷^(0)[0]+𝜷(0)2|δ^(0)[0]+δ(0)|C4Δ(0)\|\widehat{\bm{\beta}}^{(0)[0]}+\bm{\beta}^{(0)*}\|_{2}\vee|\widehat{\delta}^{(0)[0]}+\delta^{(0)*}|\leq C_{4}\Delta^{(0)}, |1w^(0)[0]w(0)|cw/2|1-\widehat{w}^{(0)[0]}-w^{(0)*}|\leq c_{w}/2.

  4. (iv)

    Δ(0)C5>0\Delta^{(0)}\geq C_{5}>0 with some constant C5C_{5};

Remark 8.

Condition (i) requires the target sample size not to be much smaller than the maximum source sample size, which appears due to technical reasons in the proof. Conditions (ii)-(iv) can be seen as the counterparts of Conditions (ii)-(iv) in Assumption 1 for the target GMM.

We are now in a position to present the upper bounds on the estimation error of the GMM parameters for TL-GMM.

Theorem 13.

(Upper bounds of the estimation error of GMM parameters for TL-GMM) Suppose the conditions in Theorem 1 and Assumption 3 hold. Let λ0[0]C1maxk=1:Knk\lambda^{[0]}_{0}\geq C_{1}\max_{k=1:K}\sqrt{n_{k}}, Cλ0C1C_{\lambda_{0}}\geq C_{1}, κ0>C2\kappa_{0}>C_{2} with some specific constants C1>0,C2(0,1)C_{1}>0,C_{2}\in(0,1). Then there exists a constant C3>0C_{3}>0, such that for any {𝛉¯(k)}k{0}SΘ¯S(h)\{\overline{\bm{\theta}}^{(k)*}\}_{k\in\{0\}\cup S}\in\overline{\Theta}_{S}^{\prime}(h) and any probability measure S\mathbb{Q}_{S} on (p)nSc(\mathbb{R}^{p})^{\otimes n_{S^{c}}}, we have

d(𝜽^(0)[T0],𝜽(0))\displaystyle d(\widehat{\bm{\theta}}^{(0)[T_{0}]},\bm{\theta}^{(0)*}) pnS+n0+1n0+hpn0+(ϵp+logKmaxk=1:Knk)pn0\displaystyle\lesssim\sqrt{\frac{p}{n_{S}+n_{0}}}+\sqrt{\frac{1}{n_{0}}}+h\wedge\sqrt{\frac{p}{n_{0}}}+\bigg{(}\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}}\bigg{)}\wedge\sqrt{\frac{p}{n_{0}}} (S.2.67)
+logKmaxk=1:Knk+T0(κ0)T0,\displaystyle\quad\quad\quad\quad+\sqrt{\frac{\log K}{\max_{k=1:K}n_{k}}}+T_{0}(\kappa_{0}^{\prime})^{T_{0}}, (S.2.68)
minπ:[2][2]maxr=1:2𝝁^π(r)(0)[T0]𝝁r(0)2𝚺^(0)[T0]𝚺(0)2pn0+T0(κ0)T0,\min_{\pi:[2]\rightarrow[2]}\max_{r=1:2}\|\widehat{\bm{\mu}}^{(0)[T_{0}]}_{\pi(r)}-\bm{\mu}^{(0)*}_{r}\|_{2}\vee\|\widehat{\bm{\Sigma}}^{(0)[T_{0}]}-\bm{\Sigma}^{(0)*}\|_{2}\lesssim\sqrt{\frac{p}{n_{0}}}+T_{0}(\kappa_{0}^{\prime})^{T_{0}}, (S.2.69)

with probability at least 1C3K11-C_{3}K^{-1}, where κ0(0,1)\kappa^{\prime}_{0}\in(0,1) and nS=kSnkn_{S}=\sum_{k\in S}n_{k}. When T0Clogn0T_{0}\geq C\log n_{0} with a large constant C>0C>0, in both inequalities, the last term on the right-hand side will be dominated by other terms.

Next, we present the upper bound on the excess mis-clustering error of the target task for TL-GMM. Given the estimator \widehat{\bm{\theta}}^{(0)[T_{0}]} and the truth \bm{\theta}^{(0)*}, the clustering method \widehat{\mathcal{C}}^{(0)[T_{0}]} and its mis-clustering error R_{\overline{\bm{\theta}}^{(0)*}}(\widehat{\mathcal{C}}^{(0)[T_{0}]}) are defined in the same way as in (18) and (19).

Theorem 14.

(Upper bound of the target excess mis-clustering error for TL-GMM) Suppose the same conditions in Theorem 13 hold. Then there exists a constant C1>0C_{1}>0 such that for any {𝛉¯(k)}k{0}SΘ¯S(h)\{\overline{\bm{\theta}}^{(k)*}\}_{k\in\{0\}\cup S}\in\overline{\Theta}_{S}^{\prime}(h) and any probability measure S\mathbb{Q}_{S} on (p)nSc(\mathbb{R}^{p})^{\otimes n_{S^{c}}}, with probability at least 1C1K11-C_{1}K^{-1} the following holds:

\displaystyle R_{\overline{\bm{\theta}}^{(0)*}}(\widehat{\mathcal{C}}^{(0)[T_{0}]})-R_{\overline{\bm{\theta}}^{(0)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(0)*}})\lesssim\underbrace{\frac{p}{n_{S}+n_{0}}}_{\rm(I)}+\underbrace{\frac{1}{n_{0}}}_{\rm(II)}+\underbrace{h^{2}\wedge\frac{p}{n_{0}}}_{\rm(III)}+\underbrace{\epsilon^{2}\frac{p+\log K}{\max_{k=1:K}n_{k}}\wedge\frac{p}{n_{0}}}_{\rm(IV)} (S.2.70)
\quad\quad\quad\quad+\underbrace{\frac{\log K}{\max_{k=1:K}n_{k}}}_{\rm(V)}+\underbrace{T_{0}^{2}(\kappa_{0}^{\prime})^{2T_{0}}}_{\rm(VI)}, (S.2.71)

with some constant κ0(0,1)\kappa^{\prime}_{0}\in(0,1). When T0Clogn0T_{0}\geq C\log n_{0} with some large constant C>0C>0, the last term in the upper bound will be dominated by the second term.

Similar to the upper bounds of d(\widehat{\bm{\theta}}^{(k)[T]},\bm{\theta}^{(k)*}) and R_{\overline{\bm{\theta}}^{(k)*}}(\widehat{\mathcal{C}}^{(k)[T]})-R_{\overline{\bm{\theta}}^{(k)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(k)*}}) in Theorems 1 and 2, the upper bounds for d(\widehat{\bm{\theta}}^{(0)[T_{0}]},\bm{\theta}^{(0)*}) and R_{\overline{\bm{\theta}}^{(0)*}}(\widehat{\mathcal{C}}^{(0)[T_{0}]})-R_{\overline{\bm{\theta}}^{(0)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(0)*}}) consist of multiple parts with a one-to-one correspondence. We take the bound on R_{\overline{\bm{\theta}}^{(0)*}}(\widehat{\mathcal{C}}^{(0)[T_{0}]})-R_{\overline{\bm{\theta}}^{(0)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(0)*}}) in Theorem 14 as an example. Part (I) is the oracle rate \mathcal{O}_{\mathbb{P}}\big(\frac{p}{n_{S}+n_{0}}\big). Part (II) is the error caused by estimating the scalar parameters \delta^{(0)*} and w^{(0)*} in the decision boundary, and thus does not depend on the dimension p. Part (III) quantifies the contribution of related sources to the target task learning: the more related the sources in S are to the target (i.e., the smaller h is), the smaller Part (III) becomes. Part (IV) captures the impact of outlier sources on the estimation error. As \epsilon increases (i.e., the proportion of outlier sources increases), Part (IV) first increases and then flattens out; it never exceeds the minimax rate \mathcal{O}_{\mathbb{P}}(p/n_{0}) of single-task learning on the target task \citepappbalakrishnan2017statistical, cai2019chime. Therefore, our method is robust against a fraction of outlier sources with arbitrarily contaminated data. Part (V) is an extra term caused by estimating the center in MTL-GMM, which by Assumption 3.(i) is smaller than the single-task learning rate \mathcal{O}_{\mathbb{P}}(p/n_{0}). Part (VI) decreases geometrically in the iteration number T_{0} of Algorithm 7 and becomes negligible when T_{0} is set large enough.

Consider the general scenario T_{0}\gtrsim\log n_{0}. Then the upper bound on the excess mis-clustering error R_{\overline{\bm{\theta}}^{(0)*}}(\widehat{\mathcal{C}}^{(0)[T_{0}]})-R_{\overline{\bm{\theta}}^{(0)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(0)*}}) is guaranteed to be no worse than the optimal single-task learning rate \mathcal{O}_{\mathbb{P}}(p/n_{0}). More importantly, in the general regime where \epsilon\ll\sqrt{\frac{p\max_{k=1:K}n_{k}}{(p+\log K)n_{0}}} (small number of outlier sources), h\ll\sqrt{p/n_{0}} (enough similarity between sources and target), n_{S}\gg n_{0} (large total source sample size), and \max_{k\in S}n_{k}/n_{0}\gg\log K/p (large maximum source sample size), TL-GMM improves GMM learning on the target task by achieving a better estimation error rate. As for the upper bounds of \min_{\pi:[2]\rightarrow[2]}\max_{r=1:2}\|\widehat{\bm{\mu}}^{(0)[T_{0}]}_{\pi(r)}-\bm{\mu}^{(0)*}_{r}\|_{2} and \|\widehat{\bm{\Sigma}}^{(0)[T_{0}]}-\bm{\Sigma}^{(0)*}\|_{2}, when T_{0}\gtrsim\log n_{0} they attain the single-task learning rate \mathcal{O}_{\mathbb{P}}(\sqrt{p/n_{0}}). This is expected, since the mean vectors and covariance matrices of the sources are not necessarily similar to those of the target in the parameter space \overline{\Theta}_{S}^{\prime}(h).

The following result of minimax lower bounds shows that the upper bounds in Theorems 13 and 14 are optimal in a broad range of regimes.

Theorem 15.

(Lower bounds of the estimation error of GMM parameters in transfer learning) Suppose ϵ=KsK<1/3\epsilon=\frac{K-s}{K}<1/3. Suppose there exists a subset SS with |S|s|S|\geq s such that minkSnkC1(p+logK)\min_{k\in S}n_{k}\geq C_{1}(p+\log K), n0C1pn_{0}\geq C_{1}p and mink{0}SΔ(k)C2\min_{k\in\{0\}\cup S}\Delta^{(k)}\geq C_{2} with some constants C1,C2>0C_{1},C_{2}>0. Then we have

inf𝜽^(0)supS:|S|ssup{𝜽¯(k)}k{0}SΘ¯S(h)S\displaystyle\inf_{\widehat{\bm{\theta}}^{(0)}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in\{0\}\cup S}\in\overline{\Theta}_{S}^{\prime}(h)\\ \mathbb{Q}_{S}\end{subarray}} (d(𝜽^(0),𝜽(0))pnS+n0+1n0+hpn0\displaystyle\mathbb{P}\Bigg{(}d(\widehat{\bm{\theta}}^{(0)},\bm{\theta}^{(0)*})\gtrsim\sqrt{\frac{p}{n_{S}+n_{0}}}+\sqrt{\frac{1}{n_{0}}}+h\wedge\sqrt{\frac{p}{n_{0}}} (S.2.72)
+ϵmaxk=1:Knkpn0)110,\displaystyle\hskip 113.81102pt+\frac{\epsilon}{\sqrt{\max_{k=1:K}n_{k}}}\wedge\sqrt{\frac{p}{n_{0}}}\Bigg{)}\geq\frac{1}{10}, (S.2.73)
inf𝝁^1(0),𝝁^2(0)𝚺^(0)supS:|S|ssup{𝜽¯(k)}k{0}SΘ¯S(h)S(minπ:[2][2]maxr=1:2𝝁^π(r)(0)𝝁r(0)2𝚺^(0)𝚺(0)2pn0)110.\inf_{\begin{subarray}{c}\widehat{\bm{\mu}}^{(0)}_{1},\widehat{\bm{\mu}}^{(0)}_{2}\\ \widehat{\bm{\Sigma}}^{(0)}\end{subarray}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in\{0\}\cup S}\in\overline{\Theta}_{S}^{\prime}(h)\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}\Bigg{(}\min_{\pi:[2]\rightarrow[2]}\max_{r=1:2}\|\widehat{\bm{\mu}}^{(0)}_{\pi(r)}-\bm{\mu}^{(0)*}_{r}\|_{2}\vee\|\widehat{\bm{\Sigma}}^{(0)}-\bm{\Sigma}^{(0)*}\|_{2}\gtrsim\sqrt{\frac{p}{n_{0}}}\Bigg{)}\geq\frac{1}{10}. (S.2.74)
Theorem 16.

(Lower bound of the target excess mis-clustering error in transfer learning) Suppose the same conditions in Theorem 15 hold. Then we have

inf𝒞^(0)supS:|S|ssup{𝜽¯(k)}k{0}SΘ¯S(h)S\displaystyle\inf_{\widehat{\mathcal{C}}^{(0)}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in\{0\}\cup S}\in\overline{\Theta}_{S}^{\prime}(h)\\ \mathbb{Q}_{S}\end{subarray}} (R𝜽¯(0)(𝒞^(0))R𝜽¯(0)(𝒞𝜽¯(0))pnS+n0+1n0+h2pn0\displaystyle\mathbb{P}\Bigg{(}R_{\overline{\bm{\theta}}^{(0)*}}(\widehat{\mathcal{C}}^{(0)})-R_{\overline{\bm{\theta}}^{(0)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(0)*}})\gtrsim\frac{p}{n_{S}+n_{0}}+\frac{1}{n_{0}}+h^{2}\wedge\frac{p}{n_{0}} (S.2.75)
+ϵ2maxk=1:Knk1n0)110.\displaystyle\hskip 142.26378pt+\frac{\epsilon^{2}}{\max_{k=1:K}n_{k}}\wedge\frac{1}{n_{0}}\Bigg{)}\geq\frac{1}{10}. (S.2.76)

Comparing the upper and lower bounds in Theorems 13-16, several remarks are in order:

  • With T0logn0T_{0}\gtrsim\log n_{0}, our estimators 𝝁^1(0)[T0]\widehat{\bm{\mu}}^{(0)[T_{0}]}_{1}, 𝝁^2(0)[T0]\widehat{\bm{\mu}}^{(0)[T_{0}]}_{2}, 𝚺^(0)[T0]\widehat{\bm{\Sigma}}^{(0)[T_{0}]} achieve the minimax optimal rate for estimating the mean vectors 𝝁1(0)\bm{\mu}^{(0)*}_{1}, 𝝁2(0)\bm{\mu}^{(0)*}_{2} and the covariance matrix 𝚺(0)\bm{\Sigma}^{(0)*}.

  • Regarding the target excess mis-clustering error, with the choice T_{0}\gtrsim\log n_{0}, Part (VI) in the upper bound becomes negligible. We thus compare the other five terms in the upper bound with the corresponding terms in the lower bound.

    1. 1.

      Part (IV) in the upper bound differs from the corresponding term in the lower bound by a factor of p (up to \log K). Hence a gap can arise when the dimension p diverges. The reason is similar to that in the multi-task learning setting, and using statistical-depth-function-based “center” estimates might be able to close the gap. We refer to the paragraph after Theorem 4 for more details.

    2. 2.

      Part (V) in the upper bound does not appear in the lower bound. This term is due to the center estimate inherited from the upper bound of MTL-GMM. When \max_{k\in S}n_{k}/n_{0}\gtrsim\log K, this term is dominated by Part (II).

    3. 3.

      The other three terms from the upper bound match with the ones in the lower bound.

  • Based on the above comparisons, we can conclude that under the mild condition \max_{k\in S}n_{k}/n_{0}\gtrsim\log K, our method is minimax rate optimal for the estimation of \bm{\theta}^{(0)*} in the classical low-dimensional regime p=O(1). Even when p is unbounded, the gap between the upper and lower bounds appears only when the fourth or fifth term dominates the upper bound. As in the discussion after Theorem 4, similar restricted regimes in which our method might become sub-optimal can be derived.

S.2.4 Label alignment

As in multi-task learning, the alignment issue also exists in transfer learning. Referring to the parameter space \overline{\Theta}_{S}^{\prime}(h) and the initialization conditions in Assumptions 1 and 3, the success of Algorithm 7 requires correct alignments in two places. First, the center estimate \overline{\bm{\beta}}^{[T]} used as input to Algorithm 7 is obtained from Algorithm 1, which involves the alignment of the initial estimates of the sources. This alignment problem can be readily solved by Algorithm 2 or 3. Second, the initialization of the target problem \widehat{\bm{\beta}}^{(0)[0]} needs to be correctly aligned with the aforementioned center estimate. This is easy to address using the alignment score described in Section 2.4.2, as there are only two alignment options. We summarize the steps in Algorithm 8.

Like Algorithms 2 and 3, Algorithm 8 is able to find the correct alignment under mild conditions. Suppose \{\widehat{\bm{\beta}}^{(k)[0]}\}_{k=0}^{K} are the initialization values with potentially wrong alignments. Define the correct alignment as \bm{r}^{*}=(r_{0}^{*},r^{*}_{1},\ldots,r^{*}_{K}) with r^{*}_{k}=\operatorname*{arg\,min}_{r_{k}=\pm 1}\|r_{k}\widehat{\bm{\beta}}^{(k)[0]}-\bm{\beta}^{(k)*}\|_{2}. For any \bm{r}=\{r_{k}\}_{k=0}^{K}\in\{\pm 1\}^{K+1}, which encodes a candidate alignment \{r_{k}\widehat{\bm{\beta}}^{(k)[0]}\}_{k=0}^{K} of \{\widehat{\bm{\beta}}^{(k)[0]}\}_{k=0}^{K}, define its alignment score as

score(𝒓)=0k1k2Krk1𝜷^(k1)[0]rk2𝜷^(k2)[0]2.\text{score}(\bm{r})=\sum_{0\leq k_{1}\neq k_{2}\leq K}\|r_{k_{1}}\widehat{\bm{\beta}}^{(k_{1})[0]}-r_{k_{2}}\widehat{\bm{\beta}}^{(k_{2})[0]}\|_{2}. (S.2.77)
Input: Initialization {(𝜷^(k)[0])}k=0K\{(\widehat{\bm{\beta}}^{(k)[0]})\}_{k=0}^{K}, and 𝒓^\widehat{\bm{r}} from Algorithm 2 or 3
1 if score((1,𝐫^))>score((1,𝐫^))\textup{score}((-1,\widehat{\bm{r}}))>\textup{score}((1,\widehat{\bm{r}})) then
2       𝒓^=(1,𝒓^)\widehat{\bm{r}}^{\prime}=(1,\widehat{\bm{r}})
3      
4 else
5      𝒓^=(1,𝒓^)\widehat{\bm{r}}^{\prime}=(-1,\widehat{\bm{r}})
6 end if
Output: 𝒓^\widehat{\bm{r}}^{\prime}
Algorithm 8 Alignment for transfer learning
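A minimal sketch of Algorithm 8 is given below. It assumes the source initializations have already been aligned by Algorithm 2 or 3 (i.e., \widehat{\bm{r}} has been applied) and simply picks the sign of the target initialization with the smaller score in (S.2.77); the function name is ours.

import numpy as np

def align_target(beta0_target, aligned_source_betas):
    # aligned_source_betas: list of initial source discriminant coefficients with r_hat applied.
    def score(sign):
        B = [sign * beta0_target] + list(aligned_source_betas)
        return sum(np.linalg.norm(B[i] - B[j])
                   for i in range(len(B)) for j in range(len(B)) if i != j)
    # Same rule as Algorithm 8: keep +1 only if score((-1, r_hat)) > score((+1, r_hat)).
    return 1 if score(-1) > score(1) else -1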

As expected, under the conditions required by Algorithm 2 or 3 for the sources, together with similar conditions on the target, Algorithm 8 outputs the ideal alignment \widehat{\bm{r}}^{\prime} (equivalently, the good initialization \widehat{r}_{0}^{\prime}\widehat{\bm{\beta}}^{(0)[0]} for Algorithm 7).

Theorem 17 (Alignment correctness for Algorithm 8).

Assume that

  1. (i)

    ϵ<12\epsilon<\frac{1}{2};

  2. (ii)

    𝜷(0)2>2(1ϵ)12ϵh+2ϵ12ϵmaxk{0}S(𝜷^(k)[0]𝜷(k)2𝜷^(k)[0]+𝜷(k)2)\|\bm{\beta}^{(0)*}\|_{2}>\frac{2(1-\epsilon)}{1-2\epsilon}h+\frac{2-\epsilon}{1-2\epsilon}\max_{k\in\{0\}\cup S}\big{(}\|\widehat{\bm{\beta}}^{(k)[0]}-\bm{\beta}^{(k)*}\|_{2}\wedge\|\widehat{\bm{\beta}}^{(k)[0]}+\bm{\beta}^{(k)*}\|_{2}\big{)},

where ϵ=KsK\epsilon=\frac{K-s}{K} is the outlier source task proportion, and hh is the degree of discriminant coefficient relatedness defined in (S.2.66).

For 𝐫^\widehat{\bm{r}} in Algorithm 8: if it is from Algorithm 2, assume the conditions of Theorem 5 hold; if it is from Algorithm 3, assume the conditions of Theorem 6 hold. Then the output of Algorithm 8 satisfies

𝒓^k=rk for all k{0}S or 𝒓^k=rk for all k{0}S.\widehat{\bm{r}}^{\prime}_{k}=r_{k}^{*}\text{ for all }k\in\{0\}\cup S\quad\text{ or }\quad\widehat{\bm{r}}^{\prime}_{k}=-r_{k}^{*}\text{ for all }k\in\{0\}\cup S. (S.2.78)

S.3 Additional Numerical Studies

In this section, we present results from additional numerical studies, including supplementary results from the simulation study in Section 3 of the main text. Additionally, we provide results from two new MTL simulations, one TL simulation, explorations of different penalty parameters, and two real-data studies.

S.3.1 Simulations

S.3.1.1 Simulation 1 of MTL

In this subsection, we provide additional performance evaluations for the three methods (MTL-GMM, Pooled-GMM, and Single-task-GMM) in the simulation presented in the main text (referred to as Simulation 1). The results are displayed in Figures S.3 and S.4.

Figure S.3: The performance of different methods in Simulation 1.(i) of multi-task learning, with no outlier tasks (ϵ=0\epsilon=0), and hh changing from 0 to 10 with increment 1. Estimation error of {w(k)}kS\{w^{(k)*}\}_{k\in S} stands for maxkS(|w^(k)[T]w(k)||1w^(k)[T]w(k)|)\max_{k\in S}(|\widehat{w}^{(k)[T]}-w^{(k)*}|\wedge|1-\widehat{w}^{(k)[T]}-w^{(k)*}|). Estimation error of {𝝁1(k)}kS\{\bm{\mu}^{(k)*}_{1}\}_{k\in S} and {𝝁2(k)}kS\{\bm{\mu}^{(k)*}_{2}\}_{k\in S} stands for maxkSminπ:[2][2](𝝁^1(k)[T]𝝁π(1)(k)2𝝁^2(k)[T]𝝁π(2)(k)2)\max_{k\in S}\min_{\pi:[2]\rightarrow[2]}(\|\widehat{\bm{\mu}}^{(k)[T]}_{1}-\bm{\mu}^{(k)*}_{\pi(1)}\|_{2}\vee\|\widehat{\bm{\mu}}^{(k)[T]}_{2}-\bm{\mu}^{(k)*}_{\pi(2)}\|_{2}). Estimation error of {𝚺(k)}kS\{\bm{\Sigma}^{(k)*}\}_{k\in S} stands for maxkS𝚺^(k)[T]𝚺(k)2\max_{k\in S}\|\widehat{\bm{\Sigma}}^{(k)[T]}-\bm{\Sigma}^{(k)*}\|_{2}. Estimation error of {δ(k)}kS\{\delta^{(k)*}\}_{k\in S} stands for maxkS(|δ^(k)[T]δ(k)||δ^(k)[T]+δ(k)|)\max_{k\in S}(|\widehat{\delta}^{(k)[T]}-\delta^{(k)*}|\wedge|\widehat{\delta}^{(k)[T]}+\delta^{(k)*}|). Average mis-clustering error represents the average empirical mis-clustering error rate calculated on the test set of tasks in SS.
Figure S.4: The performance of different methods in Simulation 1.(ii) of multi-task learning, with 2 outlier tasks (ϵ=0.2\epsilon=0.2), and hh changing from 0 to 10 with increment 1. The meaning of each subfigure’s title is the same as in Figure S.3.

Referring to Figure S.3 for the case without outlier tasks, MTL-GMM outperforms Pooled-GMM in estimating w^{(k)*} across the whole range of h. This makes sense because Pooled-GMM does not take the heterogeneity of the w^{(k)*}'s into account. For the estimation of the other parameters (except \delta^{(k)*}) and for clustering, MTL-GMM and Pooled-GMM are competitive when h is small (i.e., the tasks are similar); it is not surprising that Pooled-GMM estimates \bm{\mu}^{(k)*}_{1}, \bm{\mu}^{(k)*}_{2}, \delta^{(k)*}, and \bm{\Sigma}^{(k)*} better than MTL-GMM when h is small in this example, since these parameters are similar across tasks (although MTL-GMM does not rely on this similarity), which makes pooling the data a good approach. As h increases (i.e., the tasks become more heterogeneous), MTL-GMM starts to outperform Pooled-GMM by a large margin. Moreover, MTL-GMM is significantly better than Single-task-GMM in terms of both estimation and mis-clustering errors over a wide range of h; they only become comparable when h is very large. These comparisons demonstrate that MTL-GMM not only effectively utilizes the unknown similarity structure among tasks, but also adapts to it.

The results for the case with two outlier tasks are shown in Figure S.4. It is clear that the comparison between MTL-GMM and Single-task-GMM is similar to the one in Figure S.3. What is new here is that even when hh is very small, MTL-GMM still performs much better than Pooled-GMM, showing the robustness of MTL-GMM against a fraction of outlier tasks. Note that in this simulation, δ(k)=0\delta^{(k)*}=0 for all k[K]k\in[K], which might explain the phenomenon where Pooled-GMM outperforms MTL-GMM in estimating δ(k)\delta^{(k)*}’s.

S.3.1.2 Simulation 2

The second simulation is a multi-cluster example built on Simulation 1. Consider a multi-task learning problem with K=10 tasks, where each task has sample size n_{k}=100 and dimension p=15, and follows a GMM with R=4 clusters. For all k\in[K], we generate (w^{(k)*}_{1},\ldots,w^{(k)*}_{R}) independently from \text{Dirichlet}(\bm{\alpha}) with \bm{\alpha}=5\cdot\bm{1}_{R}. When k\in S, we generate \bm{\mu}^{(k)*}_{r} from (\bm{0}_{2r-2},2,2,\bm{0}_{p-2r})^{\top}+h/2\cdot(\bm{\Sigma}^{(k)*})^{-1}\bm{u}, where \bm{u}\sim\text{Unif}(\{\bm{u}\in\mathbb{R}^{p}:\|\bm{u}\|_{2}=1\}) and \bm{\Sigma}^{(k)*}=(0.2^{|i-j|})_{p\times p}. When k\notin S, we generate the weights from the same Dirichlet distribution, set \bm{\Sigma}^{(k)*}=(0.5^{|i-j|})_{p\times p}, and draw \bm{\mu}^{(k)*}_{r} from \text{Unif}(\{\bm{u}\in\mathbb{R}^{p}:\|\bm{u}\|_{2}=0.5\}) for r=1:R. For a given \epsilon\in[0,1), in each replication the outlier task index set S^{c} is uniformly sampled from all subsets of 1:K with cardinality K\epsilon. A code sketch of this data-generating process for tasks in S is given after the case list below. We consider two cases:

  1. (i)

    No outlier tasks (ϵ=0\epsilon=0), and hh changes from 0 to 10 with increment 1;

  2. (ii)

    2 outlier tasks (ϵ=0.2\epsilon=0.2), and hh changes from 0 to 10 with increment 1.
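The following sketch generates one non-outlier task (k in S) under this design; all names are ours, and we read the text as drawing a single direction \bm{u} per task.

import numpy as np

def make_task_in_S(h, R=4, p=15, n=100, seed=None):
    rng = np.random.default_rng(seed)
    w = rng.dirichlet(5.0 * np.ones(R))                         # Dirichlet(5 * 1_R) weights
    Sigma = 0.2 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    u = rng.standard_normal(p)
    u /= np.linalg.norm(u)                                      # uniform direction on the sphere
    shift = (h / 2.0) * np.linalg.solve(Sigma, u)
    mu = np.zeros((R, p))
    for r in range(R):                                          # r = 1..R in the paper (0-indexed here)
        mu[r, 2 * r:2 * r + 2] = 2.0                            # (0_{2r-2}, 2, 2, 0_{p-2r})
        mu[r] += shift
    y = rng.choice(R, size=n, p=w)                              # latent cluster labels
    Z = mu[y] + rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    return Z, y, w, mu, Sigma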

Figure S.5: The performance of different methods in Simulation 2 under different outlier proportions. The upper panel shows the performance without outlier tasks (ϵ=0\epsilon=0), and the lower panel shows the performance with 2 outlier tasks (ϵ=0.2\epsilon=0.2). hh changes from 0 to 10 with increment 1. Estimation error of {𝜷(k)r}r[R],kS\{\bm{\beta}^{(k)*}_{r}\}_{r\in[R],k\in S} stands for maxkSminπ:[R][R]maxr=1:R𝜷^(k)[T]r(𝚺(k))1(𝝁(k)π(r)𝝁(k)π(1))2\max_{k\in S}\min_{\pi:[R]\rightarrow[R]}\max_{r=1:R}\|\widehat{\bm{\beta}}^{(k)[T]}_{r}-(\bm{\Sigma}^{(k)*})^{-1}(\bm{\mu}^{(k)*}_{\pi(r)}-\bm{\mu}^{(k)*}_{\pi(1)})\|_{2} and maximum mis-clustering error represents the maximum empirical mis-clustering error rate calculated on the test set of tasks in SS.

Algorithm 4 is run with the alignment Algorithm 6. Its results and those of the other benchmarks are reported in Figure S.5. The main message is the same as in Simulation 1: Pooled-GMM is sensitive to outlier tasks and suffers from negative transfer when h is large, while MTL-GMM is robust to outliers and adapts to the unknown similarity level h. Note that in this example, \{\bm{\mu}^{(k)*}_{r}\}_{k\in S} are similar and \{\bm{\Sigma}^{(k)*}\}_{k\in S} are identical; therefore, running the EM algorithm on the pooled data may be more effective than our MTL algorithm when h is small and there are no outliers. This could explain why MTL-GMM performs slightly worse than Pooled-GMM in terms of maximum mis-clustering error when h is small and \epsilon=0.

We also provide additional performance evaluations for the three methods in Simulation 2. The results are presented in Figures S.6 and S.7. The main takeaway is the same as in the previous simulation example: Pooled-GMM is sensitive to outlier tasks and suffers from negative transfer when hh is large, while MTL-GMM is robust to outliers and can adapt to the unknown similarity level hh. The results verify the theoretical findings in the multi-cluster case.

Figure S.6: The performance of different methods in Simulation 2.(i) of multi-task learning, with no outlier tasks (ϵ=0\epsilon=0), and hh changing from 0 to 10 with increment 1. Estimation error of {w(k)r}r[R],kS\{w^{(k)*}_{r}\}_{r\in[R],k\in S} stands for maxkSminπ:[R][R]maxr[R]|w^(k)[T]rw(k)π(r)|\max_{k\in S}\min_{\pi:[R]\rightarrow[R]}\max_{r\in[R]}|\widehat{w}^{(k)[T]}_{r}-w^{(k)*}_{\pi(r)}|. Estimation error of {𝝁(k)r}r[R],kS\{\bm{\mu}^{(k)*}_{r}\}_{r\in[R],k\in S} stands for maxkSminπ:[R][R]maxr[R]𝝁^(k)[T]r𝝁(k)π(r)2\max_{k\in S}\min_{\pi:[R]\rightarrow[R]}\max_{r\in[R]}\|\widehat{\bm{\mu}}^{(k)[T]}_{r}-\bm{\mu}^{(k)*}_{\pi(r)}\|_{2}. Estimation error of {𝚺(k)}kS\{\bm{\Sigma}^{(k)*}\}_{k\in S} stands for maxkS𝚺^(k)[T]𝚺(k)2\max_{k\in S}\|\widehat{\bm{\Sigma}}^{(k)[T]}-\bm{\Sigma}^{(k)*}\|_{2}. Estimation error of {δ(k)r}r[R],kS\{\delta^{(k)*}_{r}\}_{r\in[R],k\in S} stands for maxkSminπ:[R][R]maxr[R]|δ^(k)[T]r(𝝁(k)π(r)+𝝁(k)π(1))(𝚺(k))1(𝝁(k)π(r)𝝁(k)π(1))/2|\max_{k\in S}\min_{\pi:[R]\rightarrow[R]}\max_{r\in[R]}|\widehat{\delta}^{(k)[T]}_{r}-(\bm{\mu}^{(k)*}_{\pi(r)}+\bm{\mu}^{(k)*}_{\pi(1)})^{\top}(\bm{\Sigma}^{(k)*})^{-1}(\bm{\mu}^{(k)*}_{\pi(r)}-\bm{\mu}^{(k)*}_{\pi(1)})/2|. Average mis-clustering error represents the average empirical mis-clustering error rate calculated on the test set of tasks in SS.
Figure S.7: The performance of different methods in Simulation 2.(ii) of multi-task learning, with 2 outlier tasks (ϵ=0.2\epsilon=0.2), and hh changing from 0 to 10 with increment 1. The meaning of each subfigure’s title is the same as in Figure S.6.

S.3.1.3 Simulation 3 of MTL

In the third simulation of MTL, we consider a different similarity structure among tasks in SS and a different type of outlier tasks. For a multi-task learning problem with K=10K=10 tasks, set the sample size of each task equal to 100. Let 𝜷(1)=(2.5,0,0,0,0)\bm{\beta}^{(1)*}=(2.5,0,0,0,0), 𝚺(1)=(0.5|ij|)5×5\bm{\Sigma}^{(1)*}=(0.5^{|i-j|})_{5\times 5}, and 1S1\in S, i.e., the first task is not an outlier task. We generate each w(k)w^{(k)*} from Unif(0.1,0.9)\text{Unif}(0.1,0.9) for all kSk\in S. For kS\{1}k\in S\backslash\{1\}, we generate 𝚺(k)\bm{\Sigma}^{(k)*} as

𝚺(k)={(0.5|ij|)5×5,with probability 1/2,(a|ij|)5×5,with probability 1/2,\bm{\Sigma}^{(k)*}=\begin{cases}(0.5^{|i-j|})_{5\times 5},&\text{with probability }1/2,\\ (a^{|i-j|})_{5\times 5},&\text{with probability }1/2,\end{cases} (S.3.79)

and set 𝜷(k)=(𝚺(k))1𝚺(1)𝜷(1)\bm{\beta}^{(k)*}=(\bm{\Sigma}^{(k)*})^{-1}\bm{\Sigma}^{(1)*}\bm{\beta}^{(1)*}. Here, the value of aa is determined by max{a[0.5,1):𝜷(k)𝜷(1)2h}\max\{a\in[0.5,1):\|\bm{\beta}^{(k)*}-\bm{\beta}^{(1)*}\|_{2}\leq h\} for a given hh. Let 𝝁(k)2=𝚺(k)𝜷(k)\bm{\mu}^{(k)*}_{2}=\bm{\Sigma}^{(k)*}\bm{\beta}^{(k)*} and 𝝁(k)1=𝟎\bm{\mu}^{(k)*}_{1}=\bm{0}, kS\forall k\in S. In this generation process, 𝝁(k)2=𝝁(1)2=𝚺(1)𝜷(1)=(5/2,5/4,5/8,5/16,5/32)\bm{\mu}^{(k)*}_{2}=\bm{\mu}^{(1)*}_{2}=\bm{\Sigma}^{(1)*}\bm{\beta}^{(1)*}=(5/2,5/4,5/8,5/16,5/32)^{\top} for all kSk\in S. The covariance matrix of tasks in SS can differ. When kSk\notin S, we generate the data of task kk from two clusters with probability 1w(k)1-w^{(k)*} and w(k)w^{(k)*}, where w(k)Unif(0.1,0.9)w^{(k)*}\sim\text{Unif}(0.1,0.9). Samples from the second cluster follow N(𝝁(k)2,𝚺(k))N(\bm{\mu}^{(k)*}_{2},\bm{\Sigma}^{(k)*}), with 𝚺(k)\bm{\Sigma}^{(k)*} coming from (S.3.79), 𝜷(k)=(2.5,2.5,2.5,2.5,2.5)\bm{\beta}^{(k)*}=(-2.5,-2.5,-2.5,-2.5,-2.5)^{\top}, and 𝝁(k)2=𝚺(k)𝜷(k)\bm{\mu}^{(k)*}_{2}=\bm{\Sigma}^{(k)*}\bm{\beta}^{(k)*}. For each sample from the first cluster, each component is independently generated from a tt-distribution with degrees of freedom 44. In each replication, for given ϵ\epsilon, the outlier task index set ScS^{c} is uniformly sampled from all subsets of 2:K2:K with cardinality KϵK\epsilon (since task 1 has been fixed in SS). We consider two cases:

  1. (i)

    No outlier tasks (ϵ=0\epsilon=0), and hh changes from 0 to 10 with increment 1;

  2. (ii)

    2 outlier tasks (ϵ=0.2\epsilon=0.2), and hh changes from 0 to 10 with increment 1.

Figure S.8: The performance of different methods in Simulation 3.(i) of multi-task learning, with no outlier tasks (ϵ=0\epsilon=0), and hh changing from 0 to 10 with increment 1. Estimation error of {𝜷(k)}kS\{\bm{\beta}^{(k)*}\}_{k\in S} stands for maxkS(𝜷^(k)[T]𝜷(k)2𝜷^(k)[T]+𝜷(k)2)\max_{k\in S}(\|\widehat{\bm{\beta}}^{(k)[T]}-\bm{\beta}^{(k)*}\|_{2}\wedge\|\widehat{\bm{\beta}}^{(k)[T]}+\bm{\beta}^{(k)*}\|_{2}). Maximum mis-clustering error represents maximum empirical mis-clustering error rate calculated on test set of tasks in SS. The meaning of other subfigures’ titles is the same as in Figure S.3.
Figure S.9: The performance of different methods in Simulation 3.(ii) of multi-task learning, with 2 outlier tasks (ϵ=0.2\epsilon=0.2), and hh changing from 0 to 10 with increment 1. The meaning of each subfigure’s title is the same as in Figure S.8.

We implement the same three methods as in Simulation 1 and the results are reported in Figures S.8 and S.9. When there are no outlier tasks, both MTL-GMM and Pooled-GMM significantly outperform Single-task-GMM. Note that in this simulation, 𝝁(k)1=𝝁(k)1\bm{\mu}^{(k)*}_{1}=\bm{\mu}^{(k^{\prime})*}_{1} and 𝝁(k)2=𝝁(k)2\bm{\mu}^{(k)*}_{2}=\bm{\mu}^{(k^{\prime})*}_{2} for all kk[K]k\neq k^{\prime}\in[K], which might explain the phenomenon where Pooled-GMM outperforms MTL-GMM in estimating 𝝁(k)1\bm{\mu}^{(k)*}_{1} and 𝝁(k)2\bm{\mu}^{(k)*}_{2}’s. When there are two outlier tasks, Figure S.9 shows that Pooled-GMM performs much worse than Single-task-GMM on most of the estimation errors of GMM parameters as well as the mis-clustering error rate. In contrast, MTL-GMM greatly improves the performance of Single-task-GMM, showing the advantage of MTL-GMM when dealing with outlier tasks and heterogeneous covariance matrices.

S.3.1.4 Simulation of TL

Consider a transfer learning problem with K=10 source data sets, where all sources are from the same GMM. The setting is modified from Simulation 1 of MTL. The source and target sample sizes are all equal to 100. For each of the source and target tasks, w^{(k)*}\sim\textup{Unif}(0.1,0.9) and \bm{\mu}^{(k)*}_{1}=-\bm{\mu}^{(k)*}_{2}=(2,2,\bm{0}_{p-2})^{\top}+h/2\cdot(\bm{\Sigma}^{(k)*})^{-1}\bm{u}, where p=15, \bm{\Sigma}^{(k)*}=(0.2^{|i-j|})_{p\times p}, and \bm{u}\sim\text{Unif}(\{\bm{u}\in\mathbb{R}^{p}:\|\bm{u}\|_{2}=1\}). We consider the case where h changes from 0 to 10 with increment 1.

We compare five different methods: Target-GMM fitted on the target data only; MTL-GMM fitted on all the data; MTL-GMM-center, which fits MTL-GMM on the source data and outputs the estimated “center” \overline{\bm{\beta}}^{[T]} as the target estimate (MTL-GMM-center appears only in the comparison of the discriminant coefficient estimation error); Pooled-GMM, which fits a single merged GMM on all the data; and our TL-GMM. The performance is evaluated by the estimation errors of w^{(0)*}, \bm{\mu}^{(0)*}_{1}, \bm{\mu}^{(0)*}_{2}, \bm{\beta}^{(0)*}, \delta^{(0)*}, and \bm{\Sigma}^{(0)*}, as well as the mis-clustering error rate calculated on an independent target test set of size 500. Results are presented in Figure S.10.

Figure S.10: The performance of different methods in the simulation of transfer learning, with no outlier tasks (\epsilon=0) and h changing from 0 to 10 with increment 1. Estimation error of w^{(0)*} stands for |\widehat{w}^{(0)[T_{0}]}-w^{(0)*}|\wedge|1-\widehat{w}^{(0)[T_{0}]}-w^{(0)*}|. Estimation error of \bm{\mu}^{(0)*}_{1} and \bm{\mu}^{(0)*}_{2} stands for \min_{\pi:[2]\rightarrow[2]}\max_{r\in[2]}\|\widehat{\bm{\mu}}^{(0)[T_{0}]}_{r}-\bm{\mu}^{(0)*}_{\pi(r)}\|_{2}. Estimation error of \bm{\beta}^{(0)*} stands for \|\widehat{\bm{\beta}}^{(0)[T_{0}]}-\bm{\beta}^{(0)*}\|_{2}\wedge\|\widehat{\bm{\beta}}^{(0)[T_{0}]}+\bm{\beta}^{(0)*}\|_{2}. Estimation error of \bm{\Sigma}^{(0)*} stands for \|\widehat{\bm{\Sigma}}^{(0)[T_{0}]}-\bm{\Sigma}^{(0)*}\|_{2}. Estimation error of \delta^{(0)*} stands for |\widehat{\delta}^{(0)[T_{0}]}-\delta^{(0)*}|\wedge|\widehat{\delta}^{(0)[T_{0}]}+\delta^{(0)*}|. Mis-clustering error represents the empirical mis-clustering error rate calculated on the test set of the target data.

Figure S.10 shows that when h is small, the performances of MTL-GMM, MTL-GMM-center, Pooled-GMM, and TL-GMM are comparable, and all of them are much better than Target-GMM. This is expected, because the sources are very similar to the target and can be easily used to improve the target task learning. As h keeps increasing, the target and sources become increasingly different. This is the phase where the knowledge from the sources needs to be carefully transferred for possible learning improvement on the target task. As is clear from Figure S.10, MTL-GMM, MTL-GMM-center, and Pooled-GMM do not handle heterogeneous sources well and are thus outperformed by Target-GMM. By contrast, TL-GMM remains effective in transferring source knowledge to improve over Target-GMM; when h is very large so that the sources are no longer useful, TL-GMM is robust enough to still have competitive performance compared to Target-GMM.

Figure S.11: The performance of different methods in the simulation of transfer learning, with 2 outlier tasks (ϵ=0.2\epsilon=0.2) and hh changing from 0 to 10 with increment 1. The meaning of each subfigure’s title is the same as in Figure S.10.

Figure S.11 shows the results when there are two outlier tasks (\epsilon=0.2). It can be seen that TL-GMM remains robust to the outlier sources.

S.3.1.5 Tuning parameters CλC_{\lambda} and Cλ0C_{\lambda_{0}} in Algorithms 1, 4, and 7

The candidate values of C_{\lambda} and C_{\lambda_{0}} used in the 10-fold cross-validation are chosen in a data-driven way. For C_{\lambda} in Algorithm 1, we first determine the smallest C_{\lambda} value that makes all \bm{\beta}^{(k)} estimators identical, denoted C_{\max}. The C_{\lambda} candidates are then set to be a sequence from C_{\max}/50 to 2C_{\max}, equally spaced on the logarithmic scale. For C_{\lambda_{0}} in Algorithm 7, we first determine the smallest C_{\lambda_{0}} value that makes the \bm{\beta}^{(0)} estimator equal to \overline{\bm{\beta}}^{[T]}, denoted C_{\max}^{\prime}. The C_{\lambda_{0}} candidates are then set to be a sequence from C_{\max}^{\prime}/50 to 2C_{\max}^{\prime}, equally spaced on the logarithmic scale. A sketch of this grid construction is given below.
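A minimal sketch of the candidate-grid construction (the grid size of 10 is our illustrative choice, not specified in the text):

import numpy as np

def c_lambda_grid(c_max, num=10):
    # Log-equally-spaced candidates from c_max / 50 to 2 * c_max.
    return np.exp(np.linspace(np.log(c_max / 50.0), np.log(2.0 * c_max), num))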

We also run MTL-GMM with different CλC_{\lambda} values in Simulation 1 to test the impact of the penalty parameter. The results are presented in Figure S.12. The values 1.29, 2.15, 3.59, 5.99, and 10 are the last 5 elements in sequence from 0.1 to 10 with equal logarithm distance. It can be seen that with small CλC_{\lambda} values like 1.29 and 2.15, the performance of MTL-GMM is similar to that of Single-task-GMM, although MTL-GMM-2.15 improves Single-task-GMM a lot when hh is small. With large CλC_{\lambda} values like 5.99 and 10, MTL-GMM performs similarly to Pooled-GMM when hh is small while suffering from negative transfer when hh is large. However, as hh continues to increase, the performance of MTL-GMM with large CλC_{\lambda} values starts to improve and finally becomes similar to Single-task-GMM. This phenomenon is in accordance with the theory, as the theory predicts that MTL-GMM achieves the same rate as Single-task-GMM for large hh. The negative transfer effect of MTL-GMM with large CλC_{\lambda} could be caused by large unknown constants in the upper bound. Comparing Figure S.12 with figures in Sections 3 and S.3.1.1, we can see that cross-validation enhances the performance of MTL-GMM.

Figure S.12: The performance of different methods in Simulation 1 of multi-task learning, with no outlier tasks (ϵ=0\epsilon=0) and hh changing from 0 to 10 with increment 1. The meaning of each subfigure’s title is the same as in Figures 2 and S.3.

S.3.1.6 Tuning parameter κ\kappa and κ0\kappa_{0} in Algorithms 1, 4, and 7

We set \kappa=\kappa_{0}=1/3 in Algorithms 1, 4, and 7. We run MTL-GMM with different \kappa values in Simulation 1 to test the impact of \kappa on the performance, trying \kappa=0.1,0.3,0.5,0.7,0.9 in Algorithm 1. The results are presented in Figure S.13. The lines representing MTL-GMM with different \kappa values largely overlap with each other, which shows that the performance of MTL-GMM is very robust to the choice of \kappa. In practice, we take \kappa=1/3 for convenience.

Figure S.13: The performance of different methods in Simulation 1 of multi-task learning, with no outlier tasks (ϵ=0\epsilon=0) and hh changing from 0 to 10 with increment 1. The meaning of each subfigure’s title is the same as in Figures 2 and S.3.

S.3.2 Real-data analysis

S.3.2.1 Human activity recognition

The Human Activity Recognition (HAR) Using Smartphones data set contains data collected from 30 volunteers performing six activities (walking, walking upstairs, walking downstairs, sitting, standing, and laying) while wearing a smartphone \citepappanguita2013public. Each observation has 561 time- and frequency-domain variables. Each volunteer can be viewed as a task, and the sample size of each task varies from 281 to 409. The original data set is available at the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones.

Here, we first focus on two activities, standing and laying, and perform clustering without label information, to test our method in the binary case. This is a binary MTL clustering problem with 30 tasks, where the sample size of each task varies from 95 to 179. For each task, in each replication, we use 90% of the samples as training data and hold out the remaining 10% as test data.

We first run a principal component analysis (PCA) on the training data of each task and project both the training and test data onto the first 15 principal components. PCA has often been used for dimension reduction in pre-processing the HAR data set \citepappzeng2014convolutional, walse2016pca, aljarrah2019human, duan2023adaptive. We fit Single-task-GMM on each task separately, Pooled-GMM on merged data from 30 tasks, and our MTL-GMM with the greedy label swapping alignment algorithm. The performance of the three methods is evaluated by the mis-clustering error rate on the test data of all 30 tasks. The maximum and average mis-clustering errors among the 30 tasks are calculated in each replication. The mean and standard deviation of these two errors over 200 replications are reported on the left side of Table S.1. To better display the clustering performance on each task, we further generate the box plot of mis-clustering errors of 30 tasks (averaged over 200 replications) for each method in the left plot of Figure S.14. It is clear that MTL-GMM outperforms both Pooled-GMM and Single-task-GMM.

                           Binary                                      Multi-cluster
Method        Single-task    Pooled         MTL            Single-task    Pooled         MTL
Max. error    0.49 (0.02)    0.40 (0.12)    0.37 (0.09)    0.51 (0.04)    0.50 (0.04)    0.51 (0.04)
Avg. error    0.28 (0.02)    0.18 (0.18)    0.04 (0.04)    0.25 (0.01)    0.35 (0.03)    0.25 (0.01)
Table S.1: Maximum and average mis-clustering errors and standard deviations (numbers in the parentheses) in binary and multi-cluster HAR data sets.
Figure S.14: Box plots of mis-clustering errors of 30 tasks for each method for HAR data set. (Left: binary case; Right: multi-cluster case)

Next, we consider all six activities and compare the performance of the three approaches using the same sample-splitting strategy, to test our method in a multi-cluster scenario. Now the sample size of each task varies from 281 to 409. The maximum and average mis-clustering error rates and standard deviations over 200 replications are reported on the right side of Table S.1. We can see that Pooled-GMM might suffer from negative transfer with a worse performance than the other two methods, while MTL-GMM and Single-task-GMM have similar performances. The right plot in Figure S.14 reveals the same comparison results.

In summary, the HAR data set exhibits different levels of task similarity in the binary and multi-cluster cases: tasks in the binary case are sufficiently similar that Pooled-GMM achieves a large improvement over Single-task-GMM, while tasks in the multi-cluster case are much more heterogeneous, resulting in degraded performance of Pooled-GMM compared to Single-task-GMM. Nevertheless, our method MTL-GMM performs competitively with or better than the best of the two, regardless of the similarity level. These results lend further support to the effectiveness of our method.

S.3.2.2 Pen-based recognition of handwritten digits (PRHD)

The Pen-based Recognition of Handwritten Digits (PRHD) data set contains 250 samples from each of the 44 writers. Each of these writers was asked to write digits 0-9 on a pressure-sensitive tablet with an integrated LCD display and a cordless stylus. The xx and yy tablet coordinates and pressure level values of the pen were recorded. After some transformations, each observation has 16 features. The data set and more information about it are available at UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/81/pen+based+recognition+of+handwritten+digits.

Similar to the previous real-data example, we first focus on a binary clustering problem by clustering observations of digits 8 and 9. The number of observations varies between 47 and 48 among the 44 tasks, showing that this is a more balanced data set with a smaller sample size (per dimension) than the HAR data. For each task, in each replication, we use 90% of the samples as training data and hold 10% of the samples as test data. The maximum and average mis-clustering error rates and standard deviations over 200 replications are reported on the left side of Table S.2, and the box plots of mis-clustering errors of 44 tasks are shown in Figure S.15. We can see that Pooled-GMM and MTL-GMM perform similarly and are much better than Single-task-GMM.

                 |                 Binary                  |              Multi-cluster
Method           | Single-task   Pooled        MTL         | Single-task   Pooled        MTL
Max. error       | 0.32 (0.10)   0.03 (0.07)   0.03 (0.09) | 0.26 (0.07)   0.37 (0.06)   0.27 (0.07)
Avg. error       | 0.02 (0.01)   0.00 (0.00)   0.00 (0.02) | 0.03 (0.01)   0.12 (0.01)   0.03 (0.01)
Table S.2: Maximum and average mis-clustering errors and standard deviations (numbers in parentheses) in the binary and multi-cluster PRHD data sets.
Figure S.15: Box plots of mis-clustering errors of 44 tasks for each method for PRHD data set. (Left: binary case; Right: multi-cluster case)

Next, we consider the observations of digits 5-9, i.e., a 5-class clustering problem. The maximum and average mis-clustering error rates and standard deviations over 200 replications are reported on the right side of Table S.2, and the box plots of mis-clustering errors of 44 tasks are shown in Figure S.15. In this multi-cluster case, MTL-GMM and Single-task-GMM have similar performance, which is better than that of Pooled-GMM. As in the first real-data example, our method MTL-GMM adapts to the unknown similarity and is competitive with the best of the other two methods.

S.4 Technical Lemmas

S.4.1 General lemmas

Denote the unit ball p={𝒖p:𝒖21}\mathcal{B}^{p}=\{\bm{u}\in\mathbb{R}^{p}:\|\bm{u}\|_{2}\leq 1\} and the unit sphere 𝒮p1={𝒖p:𝒖2=1}\mathcal{S}^{p-1}=\{\bm{u}\in\mathbb{R}^{p}:\|\bm{u}\|_{2}=1\}.

Lemma 1 (Covering number of the unit ball under Euclidean norm, Example 5.8 in \citealpappwainwright2019high).

Denote the ϵ\epsilon-covering number of a unit ball p\mathcal{B}^{p} in p\mathbb{R}^{p} under Euclidean norm as N(ϵ,p,2)N(\epsilon,\mathcal{B}^{p},\|\cdot\|_{2}), where the centers of covering balls are required to be on the sphere. We have (1/ϵ)pN(ϵ,p,2)(1+2/ϵ)p(1/\epsilon)^{p}\leq N(\epsilon,\mathcal{B}^{p},\|\cdot\|_{2})\leq(1+2/\epsilon)^{p}.
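For intuition (and ignoring the restriction on the locations of the covering centers), both bounds in Lemma 1 follow from a standard volume-comparison argument, which we sketch briefly. If $\{\bm{x}_{j}\}_{j=1}^{N}$ is an $\epsilon$-cover of $\mathcal{B}^{p}$, then $\mathcal{B}^{p}\subseteq\bigcup_{j=1}^{N}\mathcal{B}(\bm{x}_{j},\epsilon)$, hence

\textup{vol}(\mathcal{B}^{p})\leq N\epsilon^{p}\,\textup{vol}(\mathcal{B}^{p}),

which gives $N\geq(1/\epsilon)^{p}$. Conversely, a maximal $\epsilon$-packing $\{\bm{y}_{j}\}_{j=1}^{M}$ of $\mathcal{B}^{p}$ is also an $\epsilon$-cover, and the balls $\mathcal{B}(\bm{y}_{j},\epsilon/2)$ are disjoint and contained in $(1+\epsilon/2)\mathcal{B}^{p}$, so

M(\epsilon/2)^{p}\,\textup{vol}(\mathcal{B}^{p})\leq(1+\epsilon/2)^{p}\,\textup{vol}(\mathcal{B}^{p}),

which yields $N(\epsilon,\mathcal{B}^{p},\|\cdot\|_{2})\leq M\leq(1+2/\epsilon)^{p}$.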

Lemma 2 (Packing number of the unit sphere under Euclidean norm).

Denote the ϵ\epsilon-packing number of the unit sphere 𝒮p1\mathcal{S}^{p-1} in p\mathbb{R}^{p} under Euclidean norm as M(ϵ,𝒮p1,2)M(\epsilon,\mathcal{S}^{p-1},\|\cdot\|_{2}). When p2p\geq 2, we have M(ϵ,𝒮p1,2)N(ϵ,p1,2)(1/ϵ)p1M(\epsilon,\mathcal{S}^{p-1},\|\cdot\|_{2})\geq N(\epsilon,\mathcal{B}^{p-1},\|\cdot\|_{2})\geq(1/\epsilon)^{p-1}.

Lemma 3 (Fano’s lemma, see Chapter 2 of \citealpapptsybakov2009introduction, Chapter 15 of \citealpappwainwright2019high).

Suppose (Θ,d)(\Theta,d) is a metric space and each θ\theta in this space is associated with a probability measure θ\mathbb{P}_{\theta}. If {θj}j=1N\{\theta_{j}\}_{j=1}^{N} is an ss-separated set (i.e. d(θj,θk)sd(\theta_{j},\theta_{k})\geq s for any jkj\neq k), and KL(θj,θk)αlogN\textup{KL}(\mathbb{P}_{\theta_{j}},\mathbb{P}_{\theta_{k}})\leq\alpha\log N, then

infθ^supθΘθ(d(θ^,θ)s/2)infψsupj=1:Nθj(ψj)1αlog2logN,\inf_{\widehat{\theta}}\sup_{\theta\in\Theta}\mathbb{P}_{\theta}(d(\widehat{\theta},\theta)\geq s/2)\geq\inf_{\psi}\sup_{j=1:N}\mathbb{P}_{\theta_{j}}(\psi\neq j)\geq 1-\alpha-\frac{\log 2}{\log N}, (S.4.80)

where ψ:Xψ(X){1,,N}\psi:X\mapsto\psi(X)\in\{1,\ldots,N\}.

Lemma 4 (Packing number of the unit sphere in a quadrant under Euclidean norm).

In p\mathbb{R}^{p}, we can use a vector 𝐯{±1}p\bm{v}\in\{\pm 1\}^{\otimes p} to indicate each quadrant 𝒬𝐯={[0,+)𝟙(vj=+1)+(,0)𝟙(vj=1)}p\mathcal{Q}_{\bm{v}}=\{[0,+\infty)\cdot\mathds{1}(v_{j}=+1)+(-\infty,0)\cdot\mathds{1}(v_{j}=-1)\}^{\otimes p}. Then when p2p\geq 2, there exists a quadrant 𝒬𝐯0\mathcal{Q}_{\bm{v}_{0}} such that M(ϵ,𝒮p1𝒬𝐯0,2)(12)p(1ϵ)p1M(\epsilon,\mathcal{S}^{p-1}\cap\mathcal{Q}_{\bm{v}_{0}},\|\cdot\|_{2})\geq(\frac{1}{2})^{p}(\frac{1}{\epsilon})^{p-1}.

Lemma 5.

For one-dimensional Gaussian mixture variable Z(1w)𝒩(μ1,σ2)+w𝒩(μ2,σ2)Z\sim(1-w)\mathcal{N}(\mu_{1},\sigma^{2})+w\mathcal{N}(\mu_{2},\sigma^{2}) with (1w)μ1+wμ2=0(1-w)\mu_{1}+w\mu_{2}=0, it is a σ2+14|μ1μ2|2\sqrt{\sigma^{2}+\frac{1}{4}|\mu_{1}-\mu_{2}|^{2}}-subGaussian variable. That means,

𝔼eλZexp{12λ2(σ2+14|μ1μ2|2)}.\mathbb{E}e^{\lambda Z}\leq\exp\left\{\frac{1}{2}\lambda^{2}\left(\sigma^{2}+\frac{1}{4}|\mu_{1}-\mu_{2}|^{2}\right)\right\}. (S.4.81)
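A quick Monte Carlo sanity check of (S.4.81) is sketched below; the parameter values, sample size, and variable names are arbitrary illustrative assumptions.

# Monte Carlo sanity check of the sub-Gaussian MGF bound in Lemma 5 (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
w, sigma, mu2 = 0.3, 1.0, 2.0
mu1 = -w * mu2 / (1 - w)               # enforce the centering (1 - w) * mu1 + w * mu2 = 0

n = 10**6
labels = rng.random(n) < w             # mixture indicator ~ Bernoulli(w)
Z = np.where(labels, rng.normal(mu2, sigma, n), rng.normal(mu1, sigma, n))

for lam in (0.5, 1.0, 2.0):
    mgf_hat = np.mean(np.exp(lam * Z))                                   # estimates E[exp(lam * Z)]
    bound = np.exp(0.5 * lam**2 * (sigma**2 + 0.25 * (mu1 - mu2)**2))    # RHS of (S.4.81)
    print(f"lambda = {lam}: MC estimate {mgf_hat:.3f} <= bound {bound:.3f}")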
Lemma 6 (\citealpappduan2023adaptive).

Let

(\{\widehat{\bm{\theta}}_{k}\}_{k=1}^{K},\widehat{\bm{\beta}})=\operatorname*{arg\,min}_{\bm{\theta}_{k},\bm{\beta}\in\mathbb{R}^{p}}\left\{\sum_{k=1}^{K}\left[\omega_{k}f^{(k)}(\bm{\theta}_{k})+\sqrt{\omega_{k}}\lambda\|\bm{\beta}-\bm{\theta}_{k}\|_{2}\right]\right\}. (S.4.82)

Suppose there exists S1:KS\subseteq 1:K such that the following conditions are satisfied:

  1. (i)

    For any $k\in S$, $f^{(k)}$ is $(\bm{\theta}^{*}_{k},M,\rho,L)$-regular, that is

    • $f^{(k)}$ is convex and twice differentiable;

    • $\rho\bm{I}\preceq\nabla^{2}f^{(k)}(\bm{\theta})\preceq L\bm{I}$ for all $\bm{\theta}\in\mathcal{B}(\bm{\theta}^{*}_{k},M)$;

    • $\|\nabla f^{(k)}(\bm{\theta}^{*}_{k})\|_{2}\leq\rho M/2$.

  2. (ii)

    $\min_{\bm{\theta}\in\mathbb{R}^{p}}\max_{k\in S}\{\|\bm{\theta}_{k}^{*}-\bm{\theta}\|_{2}\}\leq h$, $\sum_{k\in S^{c}}\sqrt{\omega_{k}}\leq\epsilon^{\prime}\sum_{k\in S}\omega_{k}/\max_{k\in S}\sqrt{\omega_{k}}$, with $\epsilon^{\prime}=\frac{|S^{c}|}{|S|}$.

Then we have the following conclusions:

  1. (i)

    $\|\widehat{\bm{\theta}}_{k}-\bm{\theta}_{k}^{*}\|_{2}\leq\frac{1}{\rho}(\|\nabla f^{(k)}(\bm{\theta}^{*}_{k})\|_{2}+\lambda/\sqrt{\omega_{k}})$ for all $k\in S$.

  2. (ii)

    If

    \frac{5\varrho\kappa_{w}\max_{k\in S}\{\sqrt{\omega_{k}}\|\nabla f^{(k)}(\bm{\theta}^{*}_{k})\|_{2}\}}{1-\varrho\epsilon^{\prime}}<\lambda<\frac{\rho M}{2}\min_{k\in S}\sqrt{\omega_{k}}, (S.4.83)

    where ϱ=L/ρ\varrho=L/\rho, ϱϵ<1\varrho\epsilon^{\prime}<1, and maxkSnkkSnknS|S|maxkSnknSκw\max_{k\in S}\sqrt{n_{k}}\cdot\frac{\sum_{k\in S}\sqrt{n_{k}}}{n_{S}}\leq\sqrt{\frac{|S|\max_{k\in S}n_{k}}{n_{S}}}\coloneqq\kappa_{w}, then

    \|\widehat{\bm{\theta}}_{k}-\bm{\theta}^{*}_{k}\|_{2}\leq\frac{\|\sum_{k\in S}\omega_{k}\nabla f^{(k)}(\bm{\theta}_{k}^{*})\|_{2}}{\rho\sum_{k\in S}\omega_{k}}+\frac{6}{1-\varrho\epsilon^{\prime}}\min\left\{3\varrho^{2}\kappa_{w}h,\frac{2\lambda}{5\rho\sqrt{\omega_{k}}}\right\}+\frac{\lambda\epsilon^{\prime}}{\rho\max_{k\in S}\sqrt{\omega_{k}}}. (S.4.84)

    Furthermore, if we also have

    λ15ϱκwLmaxkSωkh1ϱϵ,\lambda\geq\frac{15\varrho\kappa_{w}L\max_{k\in S}\sqrt{\omega_{k}}h}{1-\varrho\epsilon^{\prime}}, (S.4.85)

    then 𝜽^k=𝜷^\widehat{\bm{\theta}}_{k}=\widehat{\bm{\beta}} for all kSk\in S, and

    supkS𝜽^k𝜽k2kSωkf(k)(𝜽k)2ρkSωk+2ϱκwh+λϵρmaxkSωk.\sup_{k\in S}\|\widehat{\bm{\theta}}_{k}-\bm{\theta}^{*}_{k}\|_{2}\leq\frac{\|\sum_{k\in S}\omega_{k}\nabla f^{(k)}(\bm{\theta}_{k}^{*})\|_{2}}{\rho\sum_{k\in S}\omega_{k}}+2\varrho\kappa_{w}h+\frac{\lambda\epsilon^{\prime}}{\rho\max_{k\in S}\sqrt{\omega_{k}}}. (S.4.86)
Lemma 7.

Suppose

\widehat{\bm{\theta}}=\operatorname*{arg\,min}_{\bm{\theta}}\left\{f^{(0)}(\bm{\theta})+\frac{\lambda}{\sqrt{n_{0}}}\|\bm{\theta}-\overline{\bm{\theta}}\|_{2}\right\} (S.4.87)

with some $\overline{\bm{\theta}}\in\mathbb{R}^{p}$. Assume $f^{(0)}$ is convex and twice differentiable, and $\rho\bm{I}_{p}\preceq\nabla^{2}f^{(0)}(\bm{\theta})\preceq L\bm{I}_{p}$ for any $\bm{\theta}\in\mathbb{R}^{p}$. Then

  1. (i)

    $\|\widehat{\bm{\theta}}-\bm{\theta}^{*}\|_{2}\leq\frac{\|\nabla f^{(0)}(\bm{\theta}^{*})\|_{2}}{\rho}+\frac{\lambda}{\rho\sqrt{n_{0}}}$, for any $\bm{\theta}^{*}\in\mathbb{R}^{p}$ and $\lambda\geq 0$;

  2. (ii)

    𝜽^=𝜽¯\widehat{\bm{\theta}}=\overline{\bm{\theta}}, if λ2f(0)(𝜽)2n0\lambda\geq 2\|\nabla f^{(0)}(\bm{\theta}^{*})\|_{2}\sqrt{n_{0}} and 𝜽¯𝜽2(λ/n0f(0)(𝜽)2)/L\|\overline{\bm{\theta}}-\bm{\theta}^{*}\|_{2}\leq(\lambda/\sqrt{n_{0}}-\|\nabla f^{(0)}(\bm{\theta}^{*})\|_{2})/L.
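To make the effect of the Euclidean-norm penalty in Lemma 7 concrete (the same proximal operation underlies the per-task penalty in Lemma 6), here is a minimal proximal-gradient sketch for minimizing $f^{(0)}(\bm{\theta})+\frac{\lambda}{\sqrt{n_{0}}}\|\bm{\theta}-\overline{\bm{\theta}}\|_{2}$ with a smooth $f^{(0)}$. The quadratic loss, the step size, the iteration count, and the function names are illustrative assumptions, not the estimators analyzed in the paper.

# Proximal-gradient sketch for min_theta f0(theta) + (lam / sqrt(n0)) * ||theta - theta_bar||_2.
# The key ingredient is the proximal map of the (non-squared) Euclidean norm centered at
# theta_bar, which shrinks theta toward theta_bar and snaps exactly onto it when the
# penalty dominates, mirroring conclusion (ii) of Lemma 7.
import numpy as np

def prox_center(v, center, tau):
    """Prox of tau * ||. - center||_2 evaluated at v (block soft-thresholding)."""
    diff = v - center
    norm = np.linalg.norm(diff)
    if norm <= tau:
        return center.copy()
    return center + (1.0 - tau / norm) * diff

def penalized_estimate(grad_f0, theta_bar, lam, n0, step=0.1, n_iter=500):
    """Proximal gradient descent; grad_f0 is the gradient of the smooth loss f0."""
    theta = theta_bar.copy()                 # warm start at the transferred center theta_bar
    tau = step * lam / np.sqrt(n0)
    for _ in range(n_iter):
        theta = prox_center(theta - step * grad_f0(theta), theta_bar, tau)
    return theta

# Illustrative smooth loss f0(theta) = 0.5 * ||theta - a||_2^2, so grad_f0(theta) = theta - a.
a = np.array([1.0, 2.0, 3.0])
theta_bar = np.array([0.9, 2.1, 2.9])
theta_hat = penalized_estimate(lambda t: t - a, theta_bar, lam=5.0, n0=100.0)
# Here ||a - theta_bar||_2 < lam / sqrt(n0), so theta_hat collapses onto theta_bar,
# illustrating conclusion (ii) of Lemma 7.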

S.5 Proofs

S.5.1 Proofs of general lemmas

S.5.1.1 Proof of Lemma 2

The second inequality follows from Lemma 1, so it suffices to prove the first inequality. For any $\bm{x}=(x_{1},\ldots,x_{p-1})^{\top}\in\mathcal{B}^{p-1}$, define $x_{p}=\sqrt{1-\sum_{j=1}^{p-1}x_{j}^{2}}$. Then we can define a mapping

\bm{x}\in\mathcal{B}^{p-1}\mapsto\widetilde{\bm{x}}=(\widetilde{x}_{1},\ldots,\widetilde{x}_{p-1},\widetilde{x}_{p})\in\mathcal{S}^{p-1}, (S.5.88)

with $\widetilde{x}_{j}=x_{j}$ for $j\leq p-1$ and $\widetilde{x}_{p}=\pm x_{p}$. It is easy to see that for any $\bm{x},\bm{y}\in\mathcal{B}^{p-1}$, we have $\|\bm{x}-\bm{y}\|_{2}\leq\|\widetilde{\bm{x}}-\widetilde{\bm{y}}\|_{2}$. Therefore, if $\{\widetilde{\bm{x}}_{j}\}_{j=1}^{N}$ is an $\epsilon$-cover of $\mathcal{S}^{p-1}$ under the Euclidean norm, then $\{\bm{x}_{j}\}_{j=1}^{N}$ must be an $\epsilon$-cover of $\mathcal{B}^{p-1}$ under the Euclidean norm. Then

N(ϵ,p1,2)N(ϵ,𝒮p1,2)M(ϵ,𝒮p1,2).N(\epsilon,\mathcal{B}^{p-1},\|\cdot\|_{2})\leq N(\epsilon,\mathcal{S}^{p-1},\|\cdot\|_{2})\leq M(\epsilon,\mathcal{S}^{p-1},\|\cdot\|_{2}). (S.5.89)

S.5.1.2 Proof of Lemma 4

If $\{\bm{u}_{j}\}_{j=1}^{N}$ is an $\epsilon$-packing of $\mathcal{S}^{p-1}$ under the Euclidean norm, then $\{\bm{u}_{j}\}_{j=1}^{N}\cap\mathcal{Q}_{\bm{v}}$ must be an $\epsilon$-packing of $\mathcal{S}^{p-1}\cap\mathcal{Q}_{\bm{v}}$ under the Euclidean norm. Then by Lemma 2,

2pmax𝒗{±1}pM(ϵ,𝒮p1𝒬𝒗,2)\displaystyle 2^{p}\max_{\bm{v}\in\{\pm 1\}^{\otimes p}}M(\epsilon,\mathcal{S}^{p-1}\cap\mathcal{Q}_{\bm{v}},\|\cdot\|_{2}) 𝒗{±1}pM(ϵ,𝒮p1𝒬𝒗,2)\displaystyle\geq\sum_{\bm{v}\in\{\pm 1\}^{\otimes p}}M(\epsilon,\mathcal{S}^{p-1}\cap\mathcal{Q}_{\bm{v}},\|\cdot\|_{2}) (S.5.90)
M(ϵ,𝒮p1,2)\displaystyle\geq M(\epsilon,\mathcal{S}^{p-1},\|\cdot\|_{2}) (S.5.91)
(1ϵ)p1,\displaystyle\geq\left(\frac{1}{\epsilon}\right)^{p-1}, (S.5.92)

implying

\max_{\bm{v}\in\{\pm 1\}^{\otimes p}}M(\epsilon,\mathcal{S}^{p-1}\cap\mathcal{Q}_{\bm{v}},\|\cdot\|_{2})\geq\left(\frac{1}{2}\right)^{p}\left(\frac{1}{\epsilon}\right)^{p-1}. (S.5.93)

S.5.1.3 Proof of Lemma 5

Suppose $Z_{1}\sim\mathcal{N}(\mu_{1},\sigma^{2})$ and $Z_{2}\sim\mathcal{N}(\mu_{2},\sigma^{2})$ are independent. Then we can write $Z=(1-I)Z_{1}+IZ_{2}=(1-I)(Z_{1}-\mu_{1})+I(Z_{2}-\mu_{2})+[\mu_{1}(1-I)+\mu_{2}I]$, where $I\sim\text{Bernoulli}(w)$ is independent of $Z_{1}$ and $Z_{2}$. Then

𝔼eλZ\displaystyle\mathbb{E}e^{\lambda Z} 𝔼eλ(1I)(Z1μ1)+λI(Z2μ2)𝔼eλ[μ1(1I)+μ2I]\displaystyle\leq\mathbb{E}e^{\lambda(1-I)(Z_{1}-\mu_{1})+\lambda I(Z_{2}-\mu_{2})}\cdot\mathbb{E}e^{\lambda[\mu_{1}(1-I)+\mu_{2}I]} (S.5.94)
𝔼I[(1I)𝔼eλZ1+I𝔼eλZ2]𝔼eλ[μ1(1I)+μ2I]\displaystyle\leq\mathbb{E}_{I}\big{[}(1-I)\mathbb{E}e^{\lambda Z_{1}}+I\mathbb{E}e^{\lambda Z_{2}}\big{]}\cdot\mathbb{E}e^{\lambda[\mu_{1}(1-I)+\mu_{2}I]} (S.5.95)
exp{12σ2λ2+18(μ2μ1)2λ2},\displaystyle\leq\exp\left\{\frac{1}{2}\sigma^{2}\lambda^{2}+\frac{1}{8}(\mu_{2}-\mu_{1})^{2}\lambda^{2}\right\}, (S.5.96)

where the second inequality uses Jensen's inequality together with the independence among $Z_{1}$, $Z_{2}$, and $I$, and the last inequality combines the Gaussian moment generating function with Hoeffding's lemma applied to the bounded, mean-zero variable $\mu_{1}(1-I)+\mu_{2}I$. This completes the proof.
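For completeness, we record the Hoeffding step used in the last inequality (it is only implicit above). Since $(1-w)\mu_{1}+w\mu_{2}=0$, the variable $V=\mu_{1}(1-I)+\mu_{2}I$ has mean zero and takes values in an interval of length $|\mu_{2}-\mu_{1}|$, so Hoeffding's lemma gives

\mathbb{E}e^{\lambda V}\leq\exp\left\{\frac{\lambda^{2}(\mu_{2}-\mu_{1})^{2}}{8}\right\},

which, combined with the Gaussian moment generating function bound $\mathbb{E}e^{\lambda(Z_{r}-\mu_{r})}=e^{\lambda^{2}\sigma^{2}/2}$ for $r=1,2$, yields (S.5.96).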

S.5.1.4 Proof of Lemma 6

The result follows from Theorem A.2, Lemma B.1, and Claim B.1 in \citeappduan2023adaptive.

S.5.2 Proof of Theorem 1

Define the contraction basin of one GMM as

B_{\text{con}}(\bm{\theta}^{(k)*})=\bigg\{\bm{\theta}=\{w,\bm{\beta},\delta\}: w_{r}\in[c_{w}/2,1-c_{w}/2],\ \|\bm{\beta}-\bm{\beta}^{(k)*}\|_{2}\leq C_{b}\Delta,\ \delta=\frac{1}{2}\bm{\beta}^{\top}(\bm{\mu}_{1}+\bm{\mu}_{2}), (S.5.97)
\max_{r=1:2}\|\bm{\mu}_{r}-\bm{\mu}_{r}^{*}\|_{2}\leq C_{b}\Delta\bigg\}, (S.5.98)

which we abbreviate as $B_{\text{con}}$ in the following. Given the index set $S$, two joint contraction basins are defined as

BconJ,1({𝜽(k)}kS)\displaystyle B_{\text{con}}^{J,1}(\{\bm{\theta}^{(k)*}\}_{k\in S}) ={{𝜽(k)}kS={(w(k),𝜷(k),δ(k))}kS:𝜽(k)Bcon(𝜽(k))},\displaystyle=\left\{\{\bm{\theta}^{(k)}\}_{k\in S}=\{(w^{(k)},\bm{\beta}^{(k)},\delta^{(k)})\}_{k\in S}:\bm{\theta}^{(k)}\in B_{\text{con}}(\bm{\theta}^{(k)*})\right\}, (S.5.99)
BconJ,2({𝜽(k)}kS)\displaystyle B_{\text{con}}^{J,2}(\{\bm{\theta}^{(k)*}\}_{k\in S}) ={{𝜽(k)}kS={(w(k),𝜷(k),δ(k))}kS:𝜽(k)Bcon(𝜽(k)),𝜷(k)𝜷¯ for all k}.\displaystyle=\left\{\{\bm{\theta}^{(k)}\}_{k\in S}=\{(w^{(k)},\bm{\beta}^{(k)},\delta^{(k)})\}_{k\in S}:\bm{\theta}^{(k)}\in B_{\text{con}}(\bm{\theta}^{(k)*}),\bm{\beta}^{(k)}\equiv\overline{\bm{\beta}}\text{ for all }k\right\}. (S.5.100)

For simplicity, at some places, we will write them as BconJ,1B_{\text{con}}^{J,1} and BconJ,2B_{\text{con}}^{J,2}, respectively.

For 𝜽=(w,𝜷,δ)\bm{\theta}=(w,\bm{\beta},\delta) and 𝜽=(w,𝜷,δ)\bm{\theta}^{\prime}=(w^{\prime},\bm{\beta}^{\prime},\delta^{\prime}), define

d(𝜽,𝜽)=|ww|𝜷𝜷2|δδ|.d(\bm{\theta},\bm{\theta}^{\prime})=|w-w^{\prime}|\vee\|\bm{\beta}-\bm{\beta}^{\prime}\|_{2}\vee|\delta-\delta^{\prime}|. (S.5.101)

And denote the minimum SNR Δ=minkSΔ(k)\Delta=\min_{k\in S}\Delta^{(k)}.

S.5.2.1 Lemmas

For GMM 𝒛(1w)𝒩(𝝁1,𝚺)+w𝒩(𝝁2,𝚺)\bm{z}\sim(1-w^{*})\mathcal{N}(\bm{\mu}_{1}^{*},\bm{\Sigma}^{*})+w^{*}\mathcal{N}(\bm{\mu}_{2}^{*},\bm{\Sigma}^{*}) and any 𝜽=(w,𝜷,δ)\bm{\theta}=(w,\bm{\beta},\delta), define

\gamma_{\bm{\theta}}(\bm{z})=\frac{w\exp\{\bm{\beta}^{\top}\bm{z}-\delta\}}{1-w+w\exp\{\bm{\beta}^{\top}\bm{z}-\delta\}},\qquad w(\bm{\theta})=\mathbb{E}[\gamma_{\bm{\theta}}(\bm{z})], (S.5.102)
𝝁1(𝜽)=𝔼[(1γ𝜽(𝒛))𝒛]𝔼[1γ𝜽(𝒛)],\displaystyle\bm{\mu}_{1}(\bm{\theta})=\frac{\mathbb{E}[(1-\gamma_{\bm{\theta}}(\bm{z}))\bm{z}]}{\mathbb{E}[1-\gamma_{\bm{\theta}}(\bm{z})]}, 𝝁2(𝜽)=𝔼[γ𝜽(𝒛)𝒛]𝔼[γ𝜽(𝒛)].\displaystyle\quad\bm{\mu}_{2}(\bm{\theta})=\frac{\mathbb{E}[\gamma_{\bm{\theta}}(\bm{z})\bm{z}]}{\mathbb{E}[\gamma_{\bm{\theta}}(\bm{z})]}. (S.5.103)
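The sample analogues of these population quantities are exactly what one E-step/M-step pass computes on task $k$; a minimal sketch is given below. The function name and the omission of the covariance and $(\bm{\beta},\delta)$ updates are illustrative simplifications, not the full MTL-GMM procedure.

# One EM pass for a binary GMM with a common covariance, in the discriminant
# parametrization used above: beta = Sigma^{-1}(mu2 - mu1), delta = 0.5 * beta^T (mu1 + mu2).
# Sample averages replace the expectations defining w(theta), mu_1(theta), mu_2(theta).
import numpy as np
from scipy.special import expit

def em_step(Z, w, mu1, mu2, Sigma):
    """Z: (n, p) data of one task; returns updated (w, mu1, mu2)."""
    beta = np.linalg.solve(Sigma, mu2 - mu1)
    delta = 0.5 * beta @ (mu1 + mu2)
    # E-step: gamma_theta(z_i) = w e^{beta'z - delta} / (1 - w + w e^{beta'z - delta}),
    # evaluated in a numerically stable way through the logistic function.
    gamma = expit(Z @ beta - delta + np.log(w / (1.0 - w)))
    # M-step: sample versions of w(theta), mu_1(theta), mu_2(theta).
    w_new = gamma.mean()
    mu1_new = ((1.0 - gamma)[:, None] * Z).sum(axis=0) / (1.0 - gamma).sum()
    mu2_new = (gamma[:, None] * Z).sum(axis=0) / gamma.sum()
    return w_new, mu1_new, mu2_new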
Lemma 8 (Contraction of binary GMMs, a special case of Lemma 37 when R=2R=2).

When $C_{b}\leq cc_{\bm{\Sigma}}^{-1/2}$ with a small constant $c>0$ and $\Delta\geq C\log(c_{\bm{\Sigma}}Mc_{w}^{-1})$ with a large constant $C>0$, there exist positive constants $C^{\prime}$ and $C^{\prime\prime}$ such that, for any $\bm{\theta}\in B_{\textup{con}}(\bm{\theta}^{(k)*})$,

|w_{r}(\bm{\theta})-w_{r}^{*}|\leq C^{\prime}\exp\{-C^{\prime\prime}\Delta^{2}\}\cdot d(\bm{\theta},\bm{\theta}^{*}),\quad\|\bm{\mu}_{r}(\bm{\theta})-\bm{\mu}_{r}^{*}\|_{2}\leq C^{\prime}\exp\{-C^{\prime\prime}\Delta^{2}\}\cdot d(\bm{\theta},\bm{\theta}^{*}), (S.5.104)

where CΔexp{CΔ2}κ0<1C^{\prime}\Delta\exp\{-C^{\prime\prime}\Delta^{2}\}\leq\kappa_{0}<1 with a constant κ0\kappa_{0}.

Lemma 9.

When hCbΔh\leq C_{b}\Delta, BconJ,2({𝛉(k)}kS)B_{\text{con}}^{J,2}(\{\bm{\theta}^{(k)*}\}_{k\in S})\neq\emptyset.

Lemma 10 (Theorem 3 in \citealpappmaurer2021concentration).

Let f:𝒳nf:\mathcal{X}^{n}\rightarrow\mathbb{R} and X=(X1,,Xn)X=(X_{1},\ldots,X_{n}) be a vector of independent random variables with values in a space 𝒳\mathcal{X}. Then for any t>0t>0 we have

(f(X)𝔼f(X)>t)exp{t232ei=1nfi(X)ψ22},\mathbb{P}(f(X)-\mathbb{E}f(X)>t)\leq\exp\left\{-\frac{t^{2}}{32e\left\|\sum_{i=1}^{n}\|f_{i}(X)\|_{\psi_{2}}^{2}\right\|_{\infty}}\right\}, (S.5.105)

where $f_{i}(X)$, as a random function of $x$, is defined by $(f_{i}(X))(x)\coloneqq f(x_{1},\ldots,x_{i-1},X_{i},x_{i+1},\ldots,x_{n})-\mathbb{E}_{X_{i}}[f(x_{1},\ldots,x_{i-1},X_{i},x_{i+1},\ldots,x_{n})]$, the sub-Gaussian norm $\|Z\|_{\psi_{2}}\coloneqq\sup_{d\geq 1}\{\|Z\|_{d}/\sqrt{d}\}$, and $\|Z\|_{d}=(\mathbb{E}|Z|^{d})^{1/d}$.

Lemma 11.

Suppose Assumption 1 holds.

  1. (i)

    With probability at least 1CK21-C^{\prime}K^{-2},

    sup𝜽(k)Bcon𝜷(k)𝜷(k)2ξ(k)|1nki=1nkγ𝜽(k)(𝒛(k)i)𝔼[γ𝜽(k)(𝒛(k))]|ξ(k)pnk+logKnk,\sup_{\begin{subarray}{c}\bm{\theta}^{(k)}\in B_{\text{con}}\\ \|\bm{\beta}^{(k)}-\bm{\beta}^{(k)*}\|_{2}\leq\xi^{(k)}\end{subarray}}\left|\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})-\mathbb{E}[\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)})]\right|\lesssim\xi^{(k)}\sqrt{\frac{p}{n_{k}}}+\sqrt{\frac{\log K}{n_{k}}}, (S.5.106)

    for all kSk\in S.

  2. (ii)

    With probability at least 1CK2eCp1-C^{\prime}K^{-2}e^{-C^{\prime\prime}p},

    sup{𝜽(k)}kSBconJ,2sup|w~k|11nS|kSw~ki=1nk[γ𝜽(k)(𝒛(k)i)𝔼[γ𝜽(k)(𝒛(k))]]|p+KnS.\sup_{\{\bm{\theta}^{(k)}\}_{k\in S}\in B_{\text{con}}^{J,2}}\sup_{|\widetilde{w}_{k}|\leq 1}\frac{1}{n_{S}}\left|\sum_{k\in S}\widetilde{w}_{k}\sum_{i=1}^{n_{k}}\Big{[}\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})-\mathbb{E}[\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)})]\Big{]}\right|\lesssim\sqrt{\frac{p+K}{n_{S}}}. (S.5.107)
Lemma 12.

Suppose Assumption 1 holds.

  1. (i)

    With probability at least 1C(K2+K2eCp)1-C^{\prime}(K^{-2}+K^{-2}e^{-C^{\prime\prime}p}),

    sup𝜽(k)Bcon𝜷(k)𝜷(k)2ξ(k)|1nki=1nk[1γ𝜽(k)(𝒛(k)i)](𝒛(k)i)𝜷(k)𝔼[[1γ𝜽(k)(𝒛(k))](𝒛(k))𝜷(k)]|\displaystyle\sup_{\begin{subarray}{c}\bm{\theta}^{(k)}\in B_{\text{con}}\\ \|\bm{\beta}^{(k)}-\bm{\beta}^{(k)*}\|_{2}\leq\xi^{(k)}\end{subarray}}\left|\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\big{[}1-\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})\big{]}(\bm{z}^{(k)}_{i})^{\top}\bm{\beta}^{(k)*}-\mathbb{E}\big{[}[1-\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)})](\bm{z}^{(k)})^{\top}\bm{\beta}^{(k)*}\big{]}\right| (S.5.108)
    ξ(k)pnk+logKnk,\displaystyle\quad\lesssim\xi^{(k)}\sqrt{\frac{p}{n_{k}}}+\sqrt{\frac{\log K}{n_{k}}}, (S.5.109)

    for all kSk\in S.

  2. (ii)

    With probability at least 1CK2eCp1-C^{\prime}K^{-2}e^{-C^{\prime\prime}p},

    sup{𝜽(k)}kSBconJ,2sup|w~k|11nS|kSw~ki=1nk[[1γ𝜽(k)(𝒛(k)i)](𝒛(k)i)𝜷(k)𝔼[[1γ𝜽(k)(𝒛(k))](𝒛(k))𝜷(k)]]|\displaystyle\sup_{\{\bm{\theta}^{(k)}\}_{k\in S}\in B_{\text{con}}^{J,2}}\sup_{|\widetilde{w}_{k}|\leq 1}\frac{1}{n_{S}}\left|\sum_{k\in S}\widetilde{w}_{k}\sum_{i=1}^{n_{k}}\Big{[}\big{[}1-\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})\big{]}(\bm{z}^{(k)}_{i})^{\top}\bm{\beta}^{(k)*}-\mathbb{E}\big{[}[1-\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)})](\bm{z}^{(k)})^{\top}\bm{\beta}^{(k)*}\big{]}\Big{]}\right| (S.5.110)
    p+KnS.\displaystyle\quad\lesssim\sqrt{\frac{p+K}{n_{S}}}. (S.5.111)
Lemma 13.

Suppose Assumption 1 holds.

  1. (i)

    With probability at least 1C(K2+K2eCp)1-C^{\prime}(K^{-2}+K^{-2}e^{-C^{\prime\prime}p}),

    sup𝜽(k)Bcon1nki=1nkγ𝜽(k)(𝒛(k)i)𝒛(k)i𝔼[γ𝜽(k)(𝒛(k))𝒛(k)]2p+logKnk,\sup_{\bm{\theta}^{(k)}\in B_{\text{con}}}\left\|\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})\bm{z}^{(k)}_{i}-\mathbb{E}[\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)})\bm{z}^{(k)}]\right\|_{2}\lesssim\sqrt{\frac{p+\log K}{n_{k}}}, (S.5.112)

    for all kSk\in S.

  2. (ii)

    With probability at least 1CK2eCp1-C^{\prime}K^{-2}e^{-C^{\prime\prime}p},

    sup{𝜽(k)}kSBconJ,2sup|w~k|11nSkSw~ki=1nk[γ𝜽(k)(𝒛(k)i)𝒛(k)i𝔼[γ𝜽(k)(𝒛(k))𝒛(k)]]2p+KnS.\sup_{\{\bm{\theta}^{(k)}\}_{k\in S}\in B_{\text{con}}^{J,2}}\sup_{|\widetilde{w}_{k}|\leq 1}\frac{1}{n_{S}}\left\|\sum_{k\in S}\widetilde{w}_{k}\sum_{i=1}^{n_{k}}\Big{[}\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})\bm{z}^{(k)}_{i}-\mathbb{E}[\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)})\bm{z}^{(k)}]\Big{]}\right\|_{2}\lesssim\sqrt{\frac{p+K}{n_{S}}}. (S.5.113)
  3. (iii)

    With probability at least 1CK2eCp1-C^{\prime}K^{-2}e^{-C^{\prime\prime}p},

    sup{𝜽(k)}kSBconJ,2sup|w~k|11nSkSw~ki=1nk[γ𝜽(k)(𝒛(k)i)𝔼[γ𝜽(k)(𝒛(k))]]𝝁(k)12p+KnS.\sup_{\{\bm{\theta}^{(k)}\}_{k\in S}\in B_{\text{con}}^{J,2}}\sup_{|\widetilde{w}_{k}|\leq 1}\frac{1}{n_{S}}\left\|\sum_{k\in S}\widetilde{w}_{k}\sum_{i=1}^{n_{k}}\Big{[}\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})-\mathbb{E}[\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)})]\Big{]}\bm{\mu}^{(k)*}_{1}\right\|_{2}\lesssim\sqrt{\frac{p+K}{n_{S}}}. (S.5.114)
Lemma 14.

Suppose Assumption 1 holds.

  1. (i)

    With probability at least 1C(K2+K1eCp)1-C^{\prime}(K^{-2}+K^{-1}e^{-C^{\prime\prime}p}),

    1nki=1nk[𝒛(k)i(𝒛(k)i)𝔼[𝒛(k)i(𝒛(k)i)]]𝜷(k)2p+logKnk,\left\|\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\big{[}\bm{z}^{(k)}_{i}(\bm{z}^{(k)}_{i})^{\top}-\mathbb{E}[\bm{z}^{(k)}_{i}(\bm{z}^{(k)}_{i})^{\top}]\big{]}\bm{\beta}^{(k)*}\right\|_{2}\lesssim\sqrt{\frac{p+\log K}{n_{k}}}, (S.5.115)

    for all kSk\in S.

  2. (ii)

    With probability at least 1CK2eCp1-C^{\prime}K^{-2}e^{-C^{\prime\prime}p},

    1nSkSi=1nk[𝒛(k)i(𝒛(k)i)𝔼[𝒛(k)i(𝒛(k)i)]]𝜷(k)2pnS.\left\|\frac{1}{n_{S}}\sum_{k\in S}\sum_{i=1}^{n_{k}}\big{[}\bm{z}^{(k)}_{i}(\bm{z}^{(k)}_{i})^{\top}-\mathbb{E}[\bm{z}^{(k)}_{i}(\bm{z}^{(k)}_{i})^{\top}]\big{]}\bm{\beta}^{(k)*}\right\|_{2}\lesssim\sqrt{\frac{p}{n_{S}}}. (S.5.116)

S.5.2.2 Main proof of Theorem 1

The proof of Theorem 1 consists of two cases. In Case 1, we study the scenario $h\gtrsim\sqrt{\frac{p+\log K}{\max_{k\in S}n_{k}}}$, where we take a fixed contraction radius. In this case, proving a single-task estimation error rate of $\sqrt{\frac{K(p+\log K)}{n_{S}}}$ is sufficient, which is relatively straightforward. In Case 2, we explore the scenario $h\lesssim\sqrt{\frac{p+\log K}{\max_{k\in S}n_{k}}}$, the regime in which multi-task learning can outperform classical single-task learning. In this case, the classical finite-sample analysis of EM in \citeappbalakrishnan2017statistical and \citeappcai2019chime, which uses a fixed contraction radius as we did in Case 1, does not work. This is because the heterogeneous $\bm{\mu}^{(k)*}_{1}$ and $\bm{\mu}^{(k)*}_{2}$ lead to an error of order $\sqrt{\frac{K(p+\log K)}{n_{S}}}$ when estimating $w^{(k)*}$. This term ultimately affects the estimation errors of $\bm{\beta}^{(k)*}$ and $\delta^{(k)*}$, preventing us from proving the improvement of multi-task learning over single-task learning. To resolve this issue, we creatively use a “localization” strategy to adaptively shrink the contraction radius in each iteration. This strategy effectively eliminates the term $\sqrt{\frac{K(p+\log K)}{n_{S}}}$. By combining the two cases, we complete the proof.

WLOG, in Assumptions 1.(iii) and 1.(iv), we assume

  • maxkS(𝜷^(k)[0]𝜷(k)2𝝁^(k)[0]1𝝁(k)12𝝁^(k)[0]2𝝁(k)22)CminkSΔ(k)\max_{k\in S}\big{(}\|\widehat{\bm{\beta}}^{(k)[0]}-\bm{\beta}^{(k)*}\|_{2}\vee\|\widehat{\bm{\mu}}^{(k)[0]}_{1}-\bm{\mu}^{(k)*}_{1}\|_{2}\vee\|\widehat{\bm{\mu}}^{(k)[0]}_{2}-\bm{\mu}^{(k)*}_{2}\|_{2}\big{)}\leq C^{\prime}\min_{k\in S}\Delta^{(k)}, with a small constant C>0C^{\prime}>0;

  • maxkS|w^(k)[0]w(k)|cw/2\max_{k\in S}|\widehat{w}^{(k)[0]}-w^{(k)*}|\leq c_{w}/2.

(I) Case 1: Let us consider the case that $h\geq C\sqrt{\frac{p+\log K}{\max_{k\in S}n_{k}}}$. Consider an event $\mathcal{E}$ defined to be the intersection of the events in Lemmas 11.(i), 12.(i), 13.(i), and 14.(i), with $\xi^{(k)}$ taken to be a large constant $C$; this event satisfies $\mathbb{P}(\mathcal{E})\geq 1-C^{\prime}(K^{-2}+K^{-2}e^{-C^{\prime\prime}p})$. Throughout the analysis in Case 1, we condition on $\mathcal{E}$; therefore all the arguments hold with probability at least $1-C^{\prime}(K^{-2}+K^{-2}e^{-C^{\prime\prime}p})$.

Consider the case t=1t=1. Lemma 6 tells us that when λ[t]CmaxkS{nk𝚺^(k)[t]𝜷(k)(𝝁^(k)[t]2𝝁^(k)[t]1)2}\lambda^{[t]}\geq C\max_{k\in S}\{\sqrt{n_{k}}\|\widehat{\bm{\Sigma}}^{(k)[t]}\bm{\beta}^{(k)*}-(\widehat{\bm{\mu}}^{(k)[t]}_{2}-\widehat{\bm{\mu}}^{(k)[t]}_{1})\|_{2}\}, we have

𝜷^(k)[t]𝜷(k)2kSnknS[𝚺^(k)[t]𝜷(k)(𝝁^(k)[t]2𝝁^(k)[t]1)]2+hλ[t]nk+ϵλ[t]maxk=1:Knk.\|\widehat{\bm{\beta}}^{(k)[t]}-\bm{\beta}^{(k)*}\|_{2}\lesssim\left\|\sum_{k\in S}\frac{n_{k}}{n_{S}}[\widehat{\bm{\Sigma}}^{(k)[t]}\bm{\beta}^{(k)*}-(\widehat{\bm{\mu}}^{(k)[t]}_{2}-\widehat{\bm{\mu}}^{(k)[t]}_{1})]\right\|_{2}+h\wedge\frac{\lambda^{[t]}}{\sqrt{n_{k}}}+\epsilon\frac{\lambda^{[t]}}{\sqrt{\max_{k=1:K}n_{k}}}. (S.5.117)

If, in addition, $\lambda^{[t]}\geq C\max_{k\in S}\sqrt{n_{k}}h$, then (S.5.117) holds with $\widehat{\bm{\beta}}^{(k)[t]}=\overline{\bm{\beta}}^{[t]}$ for all $k\in S$. Note that

𝚺^(k)[t]𝜷(k)(𝝁^(k)[t]2𝝁^(k)[t]1)2(𝚺^(k)[t]𝚺(k))𝜷(k)2+𝝁^(k)[t]2𝝁^(k)[t]1𝝁(k)2+𝝁(k)12.\|\widehat{\bm{\Sigma}}^{(k)[t]}\bm{\beta}^{(k)*}-(\widehat{\bm{\mu}}^{(k)[t]}_{2}-\widehat{\bm{\mu}}^{(k)[t]}_{1})\|_{2}\leq\|(\widehat{\bm{\Sigma}}^{(k)[t]}-\bm{\Sigma}^{(k)*})\bm{\beta}^{(k)*}\|_{2}+\|\widehat{\bm{\mu}}^{(k)[t]}_{2}-\widehat{\bm{\mu}}^{(k)[t]}_{1}-\bm{\mu}^{(k)*}_{2}+\bm{\mu}^{(k)*}_{1}\|_{2}. (S.5.118)

And the first term on the RHS can be controlled as

\|(\widehat{\bm{\Sigma}}^{(k)[t]}-\bm{\Sigma}^{(k)*})\bm{\beta}^{(k)*}\|_{2} (S.5.119)
\leq\underbrace{\left\|\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\big[\bm{z}^{(k)}_{i}(\bm{z}^{(k)}_{i})^{\top}-\mathbb{E}[\bm{z}^{(k)}_{i}(\bm{z}^{(k)}_{i})^{\top}]\big]\bm{\beta}^{(k)*}\right\|_{2}}_{\text{(1)}} (S.5.120)
\quad+\underbrace{\left\|\big[(1-\widehat{w}^{(k)[t]})\widehat{\bm{\mu}}_{1}^{(k)[t]}(\widehat{\bm{\mu}}_{1}^{(k)[t]})^{\top}-(1-w^{(k)*})\bm{\mu}^{(k)*}_{1}(\bm{\mu}^{(k)*}_{1})^{\top}\big]\bm{\beta}^{(k)*}\right\|_{2}}_{\text{(2)}} (S.5.121)
\quad+\underbrace{\left\|\big[\widehat{w}^{(k)[t]}\widehat{\bm{\mu}}_{2}^{(k)[t]}(\widehat{\bm{\mu}}_{2}^{(k)[t]})^{\top}-w^{(k)*}\bm{\mu}^{(k)*}_{2}(\bm{\mu}^{(k)*}_{2})^{\top}\big]\bm{\beta}^{(k)*}\right\|_{2}}_{\text{(3)}}. (S.5.122)

Conditioned on \mathcal{E}, we have

\text{(1)}\lesssim\sqrt{\frac{p+\log K}{n_{k}}}. (S.5.123)

And

\text{(2)}\leq\underbrace{\left\|(1-\widehat{w}^{(k)[t]})(\widehat{\bm{\mu}}^{(k)[t]}_{1}-\bm{\mu}^{(k)*}_{1})\cdot(\widehat{\bm{\mu}}^{(k)[t]}_{1})^{\top}\bm{\beta}^{(k)*}\right\|_{2}}_{\text{(2.1)}} (S.5.124)
\quad+\underbrace{\left\|\big[(1-\widehat{w}^{(k)[t]})\bm{\mu}^{(k)*}_{1}(\widehat{\bm{\mu}}^{(k)[t]}_{1})^{\top}-(1-w^{(k)*})\bm{\mu}^{(k)*}_{1}(\bm{\mu}^{(k)*}_{1})^{\top}\big]\bm{\beta}^{(k)*}\right\|_{2}}_{\text{(2.2)}}, (S.5.125)

where

\text{(2.2)}\leq\left\|(\widehat{w}^{(k)[t]}-w^{(k)*})\bm{\mu}^{(k)*}_{1}(\widehat{\bm{\mu}}^{(k)[t]}_{1})^{\top}\bm{\beta}^{(k)*}\right\|_{2}+\left\|(1-\widehat{w}^{(k)[t]})\bm{\mu}^{(k)*}_{1}(\widehat{\bm{\mu}}^{(k)[t]}_{1}-\bm{\mu}^{(k)*}_{1})^{\top}\bm{\beta}^{(k)*}\right\|_{2}. (S.5.126)

Before we discuss how to control the terms on the RHS, let us first control $|\widehat{w}^{(k)[t]}-w^{(k)*}|$, since it will be used to bound the terms above. Note that by Lemma 8,

|w^(k)[t]w(k)|\displaystyle|\widehat{w}^{(k)[t]}-w^{(k)*}| |w(k)(𝜽^(k)[t1])w(k)|+|w^(k)[t]w(k)(𝜽^(k)[t1])|\displaystyle\leq|w^{(k)}(\widehat{\bm{\theta}}^{(k)[t-1]})-w^{(k)*}|+|\widehat{w}^{(k)[t]}-w^{(k)}(\widehat{\bm{\theta}}^{(k)[t-1]})| (S.5.127)
κ0d(𝜽^(k)[t1],𝜽(k))+|1nki=1nkγ𝜽^(k)[t1](𝒛(k)i)𝔼𝒛(k)[γ𝜽^(k)[t1](𝒛(k))]|\displaystyle\leq\kappa_{0}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\left|\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}^{(k)}_{i})-\mathbb{E}_{\bm{z}^{(k)}}[\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}^{(k)})]\right| (S.5.128)
κ0d(𝜽^(k)[t1],𝜽(k))+p+logKnk\displaystyle\lesssim\kappa_{0}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\sqrt{\frac{p+\log K}{n_{k}}} (S.5.129)
c,\displaystyle\leq c, (S.5.130)

where cc is a small constant. By Lemma 8 again,

𝝁^(k)[t]1𝝁(k)12\displaystyle\|\widehat{\bm{\mu}}^{(k)[t]}_{1}-\bm{\mu}^{(k)*}_{1}\|_{2} =1nki=1nk(1γ𝜽^(k)[t1](𝒛(k)i))𝒛(k)i1w^(k)[t]𝔼𝒛(k)[(1γ𝜽^(k)[t1](𝒛(k)))𝒛(k)]1w(k)(𝜽^(k)[t1])2\displaystyle=\left\|\frac{\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}(1-\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}^{(k)}_{i}))\bm{z}^{(k)}_{i}}{1-\widehat{w}^{(k)[t]}}-\frac{\mathbb{E}_{\bm{z}^{(k)}}[(1-\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}^{(k)}))\bm{z}^{(k)}]}{1-w^{(k)}(\widehat{\bm{\theta}}^{(k)[t-1]})}\right\|_{2} (S.5.131)
1nki=1nk(1γ𝜽^(k)[t1](𝒛(k)i))𝒛(k)i𝔼𝒛(k)[(1γ𝜽^(k)[t1](𝒛(k)))𝒛(k)]1w^(k)[t]2\displaystyle\leq\left\|\frac{\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}(1-\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}^{(k)}_{i}))\bm{z}^{(k)}_{i}-\mathbb{E}_{\bm{z}^{(k)}}[(1-\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}^{(k)}))\bm{z}^{(k)}]}{1-\widehat{w}^{(k)[t]}}\right\|_{2} (S.5.132)
+𝔼𝒛(k)[(1γ𝜽^(k)[t1](𝒛(k)))𝒛(k)](1w^(k)[t])(1w(k)(𝜽^(k)[t1]))(w^(k)[t]w(k)(𝜽^(k)[t1]))2\displaystyle\quad+\left\|\frac{\mathbb{E}_{\bm{z}^{(k)}}[(1-\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}^{(k)}))\bm{z}^{(k)}]}{(1-\widehat{w}^{(k)[t]})(1-w^{(k)}(\widehat{\bm{\theta}}^{(k)[t-1]}))}(\widehat{w}^{(k)[t]}-w^{(k)}(\widehat{\bm{\theta}}^{(k)[t-1]}))\right\|_{2} (S.5.133)
1nki=1nk(1γ𝜽^(k)[t1](𝒛(k)i))𝒛(k)i𝔼𝒛(k)[(1γ𝜽^(k)[t1](𝒛(k)))𝒛(k)]2\displaystyle\lesssim\left\|\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}(1-\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}^{(k)}_{i}))\bm{z}^{(k)}_{i}-\mathbb{E}_{\bm{z}^{(k)}}[(1-\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}^{(k)}))\bm{z}^{(k)}]\right\|_{2} (S.5.134)
+|w^(k)[t]w(k)|+κ0d(𝜽^(k)[t1],𝜽(k))\displaystyle\quad+|\widehat{w}^{(k)[t]}-w^{(k)*}|+\kappa_{0}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*}) (S.5.135)
κ0d(𝜽^(k)[t1],𝜽(k))+p+logKnk\displaystyle\lesssim\kappa_{0}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\sqrt{\frac{p+\log K}{n_{k}}} (S.5.136)
CbΔ.\displaystyle\leq C_{b}\Delta. (S.5.137)

Therefore, we can bound the RHS of (S.5.126) as

\text{(2.2)}\lesssim|\widehat{w}^{(k)[t]}-w^{(k)*}|+\|\widehat{\bm{\mu}}^{(k)[t]}_{1}-\bm{\mu}^{(k)*}_{1}\|_{2}\lesssim\kappa_{0}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\sqrt{\frac{p+\log K}{n_{k}}}. (S.5.138)

Similarly, we have

\text{(2.1)}\lesssim\|\widehat{\bm{\mu}}^{(k)[t]}_{1}-\bm{\mu}^{(k)*}_{1}\|_{2}\lesssim\kappa_{0}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\sqrt{\frac{p+\log K}{n_{k}}}. (S.5.139)

Combining (S.5.138) and (S.5.139), we have

\text{(2)}\lesssim\kappa_{0}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\sqrt{\frac{p+\log K}{n_{k}}}. (S.5.140)

The term (3) can be bounded in the same way, and we get

\text{(3)}\lesssim\kappa_{0}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\sqrt{\frac{p+\log K}{n_{k}}}. (S.5.141)

Hence

(𝚺^(k)[t]𝚺(k))𝜷(k)2κ0d(𝜽^(k)[t1],𝜽(k))+p+logKnk,\|(\widehat{\bm{\Sigma}}^{(k)[t]}-\bm{\Sigma}^{(k)*})\bm{\beta}^{(k)*}\|_{2}\lesssim\kappa_{0}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\sqrt{\frac{p+\log K}{n_{k}}}, (S.5.142)

And the second term on the RHS of (S.5.118) satisfies

𝝁^(k)[t]2𝝁^(k)[t]1𝝁(k)2+𝝁(k)12𝝁^(k)[t]2𝝁(k)22𝝁^(k)[t]1𝝁(k)12κ0d(𝜽^(k)[t1],𝜽(k))+p+logKnk.\|\widehat{\bm{\mu}}^{(k)[t]}_{2}-\widehat{\bm{\mu}}^{(k)[t]}_{1}-\bm{\mu}^{(k)*}_{2}+\bm{\mu}^{(k)*}_{1}\|_{2}\lesssim\|\widehat{\bm{\mu}}^{(k)[t]}_{2}-\bm{\mu}^{(k)*}_{2}\|_{2}\vee\|\widehat{\bm{\mu}}^{(k)[t]}_{1}-\bm{\mu}^{(k)*}_{1}\|_{2}\lesssim\kappa_{0}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\sqrt{\frac{p+\log K}{n_{k}}}. (S.5.143)

Altogether, we have

𝚺^(k)[t]𝜷(k)(𝝁^(k)[t]2𝝁^(k)[t]1)2κ0d(𝜽^(k)[t1],𝜽(k))+p+logKnk.\|\widehat{\bm{\Sigma}}^{(k)[t]}\bm{\beta}^{(k)*}-(\widehat{\bm{\mu}}^{(k)[t]}_{2}-\widehat{\bm{\mu}}^{(k)[t]}_{1})\|_{2}\lesssim\kappa_{0}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\sqrt{\frac{p+\log K}{n_{k}}}. (S.5.144)

This implies that λ[t]=Cλp+logK+κλ[0]CmaxkS{nk𝚺^(k)[t]𝜷(k)(𝝁^(k)[t]2𝝁^(k)[t]1)2}\lambda^{[t]}=C_{\lambda}\sqrt{p+\log K}+\kappa\lambda^{[0]}\geq C\max_{k\in S}\{\sqrt{n_{k}}\|\widehat{\bm{\Sigma}}^{(k)[t]}\bm{\beta}^{(k)*}-(\widehat{\bm{\mu}}^{(k)[t]}_{2}-\widehat{\bm{\mu}}^{(k)[t]}_{1})\|_{2}\}, therefore by (S.5.117),

𝜷^(k)[t]𝜷(k)2κ0kSnknSd(𝜽^(k)[t1],𝜽(k))+K(p+logK)nS+ϵλ[t]maxk=1:Knk,\|\widehat{\bm{\beta}}^{(k)[t]}-\bm{\beta}^{(k)*}\|_{2}\lesssim\kappa_{0}\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\sqrt{\frac{K(p+\log K)}{n_{S}}}+\epsilon\frac{\lambda^{[t]}}{\sqrt{\max_{k=1:K}n_{k}}}, (S.5.145)

And by (S.5.129),

kSnknS|w^(k)[t]w(k)|κ0kSnknSd(𝜽^(k)[t1],𝜽(k))+K(p+logK)nS.\sum_{k\in S}\frac{n_{k}}{n_{S}}|\widehat{w}^{(k)[t]}-w^{(k)*}|\lesssim\kappa_{0}\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\sqrt{\frac{K(p+\log K)}{n_{S}}}. (S.5.146)

Also,

|δ^(k)[t]δ(k)|\displaystyle|\widehat{\delta}^{(k)[t]}-\delta^{(k)*}| =12(𝜷^(k)[t])(𝝁^(k)[t]1+𝝁^(k)[t]2)(𝜷(k))(𝝁(k)1+𝝁(k)2)2\displaystyle=\frac{1}{2}\left\|(\widehat{\bm{\beta}}^{(k)[t]})^{\top}(\widehat{\bm{\mu}}^{(k)[t]}_{1}+\widehat{\bm{\mu}}^{(k)[t]}_{2})-(\bm{\beta}^{(k)*})^{\top}(\bm{\mu}^{(k)*}_{1}+\bm{\mu}^{(k)*}_{2})\right\|_{2} (S.5.147)
𝜷^(k)[t]𝜷(k)2+𝝁^(k)[t]1𝝁(k)12+𝝁^(k)[t]2𝝁(k)22\displaystyle\lesssim\|\widehat{\bm{\beta}}^{(k)[t]}-\bm{\beta}^{(k)*}\|_{2}+\|\widehat{\bm{\mu}}^{(k)[t]}_{1}-\bm{\mu}^{(k)*}_{1}\|_{2}+\|\widehat{\bm{\mu}}^{(k)[t]}_{2}-\bm{\mu}^{(k)*}_{2}\|_{2} (S.5.148)
κ0d(𝜽^(k)[t1],𝜽(k))+p+logKnk,\displaystyle\lesssim\kappa_{0}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\sqrt{\frac{p+\log K}{n_{k}}}, (S.5.149)

which entails that

kSnknS|δ^(k)[t]δ(k)|κ0kSnknSd(𝜽^(k)[t1],𝜽(k))+K(p+logK)nS.\sum_{k\in S}\frac{n_{k}}{n_{S}}|\widehat{\delta}^{(k)[t]}-\delta^{(k)*}|\lesssim\kappa_{0}\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\sqrt{\frac{K(p+\log K)}{n_{S}}}. (S.5.150)

Combining (S.5.145), (S.5.146), and (S.5.150), we have

kSnknSd(𝜽^(k)[t],𝜽(k))κ0kSnknSd(𝜽^(k)[t1],𝜽(k))+K(p+logK)nS+ϵλ[t]maxk=1:Knk.\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t]},\bm{\theta}^{(k)*})\lesssim\kappa_{0}\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\sqrt{\frac{K(p+\log K)}{n_{S}}}+\epsilon\frac{\lambda^{[t]}}{\sqrt{\max_{k=1:K}n_{k}}}. (S.5.151)

Also,

maxkS{nkd(𝜽^(k)[t],𝜽(k))}κ0maxkS{nkd(𝜽^(k)[t1],𝜽(k))}+λ[t].\max_{k\in S}\big{\{}\sqrt{n_{k}}d(\widehat{\bm{\theta}}^{(k)[t]},\bm{\theta}^{(k)*})\big{\}}\lesssim\kappa_{0}\max_{k\in S}\big{\{}\sqrt{n_{k}}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})\big{\}}+\lambda^{[t]}. (S.5.152)

If (S.5.145), (S.5.151), and (S.5.152) hold for all $t=1:t^{\prime}$, then the same analysis shows that (S.5.144) holds again for $t=t^{\prime}+1$. Hence

maxkS{nk𝚺^(k)[t+1]𝜷(k)(𝝁^(k)[t+1]2𝝁^(k)[t+1]1)2}κ0maxkS{nkd(𝜽^(k)[t],𝜽(k))}+p+logK.\max_{k\in S}\big{\{}\sqrt{n_{k}}\|\widehat{\bm{\Sigma}}^{(k)[t^{\prime}+1]}\bm{\beta}^{(k)*}-(\widehat{\bm{\mu}}^{(k)[t^{\prime}+1]}_{2}-\widehat{\bm{\mu}}^{(k)[t^{\prime}+1]}_{1})\|_{2}\big{\}}\lesssim\kappa_{0}\max_{k\in S}\big{\{}\sqrt{n_{k}}d(\widehat{\bm{\theta}}^{(k)[t^{\prime}]},\bm{\theta}^{(k)*})\big{\}}+\sqrt{p+\log K}. (S.5.153)

Then by (S.5.152) when t=tt=t^{\prime},

κ0maxkS{nkd(𝜽^(k)[t],𝜽(k))}+p+logK\displaystyle\kappa_{0}\max_{k\in S}\big{\{}\sqrt{n_{k}}d(\widehat{\bm{\theta}}^{(k)[t^{\prime}]},\bm{\theta}^{(k)*})\big{\}}+\sqrt{p+\log K} κ02maxkS{nkd(𝜽^(k)[t1],𝜽(k))}+p+logK+κ0λ[t]\displaystyle\lesssim\kappa_{0}^{2}\max_{k\in S}\big{\{}\sqrt{n_{k}}d(\widehat{\bm{\theta}}^{(k)[t^{\prime}-1]},\bm{\theta}^{(k)*})\big{\}}+\sqrt{p+\log K}+\kappa_{0}\lambda^{[t^{\prime}]} (S.5.154)
κ0λ[t]+p+logK\displaystyle\leq\kappa_{0}\lambda^{[t^{\prime}]}+\sqrt{p+\log K} (S.5.155)
λ[t+1],\displaystyle\leq\lambda^{[t^{\prime}+1]}, (S.5.156)

where we need κCκ0\kappa\geq C\kappa_{0} with a large constant C>0C>0. Recall that κ(0,1)\kappa\in(0,1) is one of the tuning parameters in the update formula of λ[t]\lambda^{[t]}. Therefore we can follow the same arguments as above to obtain (S.5.129), (S.5.143), (S.5.145), (S.5.149), (S.5.151), (S.5.152) for t=t+1t=t^{\prime}+1.

So far, we have shown that (S.5.129), (S.5.143), (S.5.145), (S.5.149), (S.5.151), (S.5.152) hold for any tt. By the update formula of λ[t]\lambda^{[t]}, when t1t\geq 1, we have

λ[t]=1κt1κCλp+logK+κt1λ[0].\lambda^{[t]}=\frac{1-\kappa^{t}}{1-\kappa}C_{\lambda}\sqrt{p+\log K}+\kappa^{t-1}\lambda^{[0]}. (S.5.157)

Therefore by (S.5.151),

kSnknSd(𝜽^(k)[t],𝜽(k))\displaystyle\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t]},\bm{\theta}^{(k)*}) Cκ0kSnknSd(𝜽^(k)[t1],𝜽(k))+CK(p+logK)nS+CKnSλ[t]\displaystyle\leq C\kappa_{0}\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+C^{\prime}\sqrt{\frac{K(p+\log K)}{n_{S}}}+C^{\prime}\sqrt{\frac{K}{n_{S}}}\lambda^{[t]} (S.5.158)
(Cκ0)tkSnknSd(𝜽^(k)[0],𝜽(k))+CK(p+logK)nS+CKnSt=1tλ[t](Cκ0)tt\displaystyle\leq(C\kappa_{0})^{t}\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[0]},\bm{\theta}^{(k)*})+C^{\prime}\sqrt{\frac{K(p+\log K)}{n_{S}}}+C^{\prime}\sqrt{\frac{K}{n_{S}}}\sum_{t^{\prime}=1}^{t}\lambda^{[t^{\prime}]}\cdot(C\kappa_{0})^{t-t^{\prime}} (S.5.159)
(Cκ0)tkSnknSd(𝜽^(k)[0],𝜽(k))+CK(p+logK)nS+CKnSt=1tλ[t]κtt\displaystyle\leq(C\kappa_{0})^{t}\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[0]},\bm{\theta}^{(k)*})+C^{\prime}\sqrt{\frac{K(p+\log K)}{n_{S}}}+C^{\prime}\sqrt{\frac{K}{n_{S}}}\sum_{t^{\prime}=1}^{t}\lambda^{[t^{\prime}]}\cdot\kappa^{t-t^{\prime}} (S.5.160)
Ctκt+CK(p+logK)nS.\displaystyle\leq C^{\prime\prime}t\kappa^{t}+C^{\prime\prime}\sqrt{\frac{K(p+\log K)}{n_{S}}}. (S.5.161)

Consider a new event \mathcal{E}^{\prime} defined to be the intersection of the events in Lemmas 11.(i), 12.(i), 13.(i), and 14.(i), with ξ(k)=CnkmaxkSnk\xi^{(k)}=C\sqrt{\frac{n_{k}}{\max_{k\in S}n_{k}}}, which satisfies ()1C(K2+K2eCp)\mathbb{P}(\mathcal{E}^{\prime})\geq 1-C^{\prime}(K^{-2}+K^{-2}e^{-C^{\prime\prime}p}). Throughout the following analysis in Case 1, we condition on \mathcal{E}\cap\mathcal{E}^{\prime}, therefore all the arguments hold with probability at least 1C(K2+K2eCp)1-C^{\prime}(K^{-2}+K^{-2}e^{-C^{\prime\prime}p}). When hCp+logKmaxkSnkh\geq C\sqrt{\frac{p+\log K}{\max_{k\in S}n_{k}}}, since nSKmaxkSnkn_{S}\gtrsim K\max_{k\in S}n_{k}, we have K(p+logK)nSp+logKmaxkSnkhp+logKnk\sqrt{\frac{K(p+\log K)}{n_{S}}}\lesssim\sqrt{\frac{p+\log K}{\max_{k\in S}n_{k}}}\lesssim h\wedge\sqrt{\frac{p+\log K}{n_{k}}}. Furthermore, when tClog(maxkSnkminkSnk)t\geq C^{\prime}\log\big{(}\frac{\max_{k\in S}n_{k}}{\min_{k\in S}n_{k}}\big{)} with a large C>0C^{\prime}>0, we have ξ(k)pnkhp+logKnk+ϵp+logKmaxk=1:Knk+logKnk\xi^{(k)}\sqrt{\frac{p}{n_{k}}}\lesssim h\wedge\sqrt{\frac{p+\log K}{n_{k}}}+\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}}+\sqrt{\frac{\log K}{n_{k}}} and tκt+CK(p+logK)nS+Cϵp+logKmaxk=1:Knk+ClogKnkξ(k)t\kappa^{t}+C\sqrt{\frac{K(p+\log K)}{n_{S}}}+C\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}}+C\sqrt{\frac{\log K}{n_{k}}}\leq\xi^{(k)}, where we used the fact nSKmaxkSnkn_{S}\gtrsim K\max_{k\in S}n_{k} again to get the second inequality.

Plugging (S.5.161) back into (S.5.145), we have

𝜷^(k)[t]𝜷(k)2\displaystyle\|\widehat{\bm{\beta}}^{(k)[t]}-\bm{\beta}^{(k)*}\|_{2} Ctκt+CK(p+logK)nS+Cξ(k)pnk+ClogKnk\displaystyle\leq C^{\prime\prime}t\kappa^{t}+C^{\prime\prime}\sqrt{\frac{K(p+\log K)}{n_{S}}}+C^{\prime\prime}\xi^{(k)}\sqrt{\frac{p}{n_{k}}}+C^{\prime\prime}\sqrt{\frac{\log K}{n_{k}}} (S.5.162)
Ctκt+CK(p+logK)nS+Chp+logKnk+Cϵp+logKnk+ClogKnk\displaystyle\leq Ct\kappa^{t}+C\sqrt{\frac{K(p+\log K)}{n_{S}}}+C\cdot h\wedge\sqrt{\frac{p+\log K}{n_{k}}}+C\epsilon\sqrt{\frac{p+\log K}{n_{k}}}+C\sqrt{\frac{\log K}{n_{k}}} (S.5.163)
Ctκt+Chp+logKnk+Cϵp+logKmaxk=1:Knk+ClogKnk\displaystyle\leq Ct\kappa^{t}+C\cdot h\wedge\sqrt{\frac{p+\log K}{n_{k}}}+C\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}}+C\sqrt{\frac{\log K}{n_{k}}} (S.5.164)

Then by (S.5.129),

|w^(k)[t]w(k)|\displaystyle|\widehat{w}^{(k)[t]}-w^{(k)*}| κ0d(𝜽^(k)[t1],𝜽(k))+ξ(k)pnk+logKnk\displaystyle\lesssim\kappa_{0}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\xi^{(k)}\sqrt{\frac{p}{n_{k}}}+\sqrt{\frac{\log K}{n_{k}}} (S.5.165)
κ0𝜷^(k)[t1]𝜷(k)2+κ0|w^(k)[t1]w(k)||δ^(k)[t1]δ(k)|+ξ(k)pnk+logKnk\displaystyle\lesssim\kappa_{0}\|\widehat{\bm{\beta}}^{(k)[t-1]}-\bm{\beta}^{(k)*}\|_{2}+\kappa_{0}|\widehat{w}^{(k)[t-1]}-w^{(k)*}|\vee|\widehat{\delta}^{(k)[t-1]}-\delta^{(k)*}|+\xi^{(k)}\sqrt{\frac{p}{n_{k}}}+\sqrt{\frac{\log K}{n_{k}}} (S.5.166)
Ctκt+κ0|w^(k)[t1]w(k)||δ^(k)[t1]δ(k)|+hp+logKnk\displaystyle\lesssim Ct\kappa^{t}+\kappa_{0}|\widehat{w}^{(k)[t-1]}-w^{(k)*}|\vee|\widehat{\delta}^{(k)[t-1]}-\delta^{(k)*}|+h\wedge\sqrt{\frac{p+\log K}{n_{k}}} (S.5.167)
+ϵp+logKmaxk=1:Knk+logKnk.\displaystyle\quad+\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}}+\sqrt{\frac{\log K}{n_{k}}}. (S.5.168)

Similarly, by (S.5.149),

|\widehat{\delta}^{(k)[t]}-\delta^{(k)*}| \lesssim Ct\kappa^{t}+\kappa_{0}|\widehat{w}^{(k)[t-1]}-w^{(k)*}|\vee|\widehat{\delta}^{(k)[t-1]}-\delta^{(k)*}| (S.5.169)
\quad+h\wedge\sqrt{\frac{p+\log K}{n_{k}}}+\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}}+\sqrt{\frac{\log K}{n_{k}}}. (S.5.170)

Therefore,

|w^(k)[t]w(k)||δ^(k)[t]δ(k)|\displaystyle|\widehat{w}^{(k)[t]}-w^{(k)*}|\vee|\widehat{\delta}^{(k)[t]}-\delta^{(k)*}| Ctκt+Cκ0|w^(k)[t1]w(k)||δ^(k)[t1]δ(k)|\displaystyle\leq Ct\kappa^{t}+C\kappa_{0}|\widehat{w}^{(k)[t-1]}-w^{(k)*}|\vee|\widehat{\delta}^{(k)[t-1]}-\delta^{(k)*}| (S.5.171)
+hp+logKnk+ϵp+logKnk+logKnk\displaystyle\quad+h\wedge\sqrt{\frac{p+\log K}{n_{k}}}+\epsilon\sqrt{\frac{p+\log K}{n_{k}}}+\sqrt{\frac{\log K}{n_{k}}} (S.5.172)
Ct2κt+Chp+logKnk+Cϵp+logKmaxk=1:Knk\displaystyle\leq C^{\prime\prime}t^{2}\kappa^{t}+C^{\prime\prime}\cdot h\wedge\sqrt{\frac{p+\log K}{n_{k}}}+C^{\prime\prime}\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}} (S.5.173)
+ClogKnk.\displaystyle\quad+C^{\prime\prime}\sqrt{\frac{\log K}{n_{k}}}. (S.5.174)

Combining this with (S.5.164), we obtain that

d(𝜽^(k)[t],𝜽(k))Ct2κt+Chp+logKnk+Cϵp+logKmaxk=1:Knk+ClogKnk.d(\widehat{\bm{\theta}}^{(k)[t]},\bm{\theta}^{(k)*})\leq C^{\prime\prime}t^{2}\kappa^{t}+C^{\prime\prime}\cdot h\wedge\sqrt{\frac{p+\log K}{n_{k}}}+C^{\prime\prime}\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}}+C^{\prime\prime}\sqrt{\frac{\log K}{n_{k}}}. (S.5.175)

Plugging this back into (S.5.139), we get

𝝁^(k)[t]1𝝁(k)12Ct2κt+Cp+logKnk.\|\widehat{\bm{\mu}}^{(k)[t]}_{1}-\bm{\mu}^{(k)*}_{1}\|_{2}\leq C^{\prime\prime}t^{2}\kappa^{t}+C^{\prime\prime}\sqrt{\frac{p+\log K}{n_{k}}}. (S.5.176)

The same bound holds for $\|\widehat{\bm{\mu}}^{(k)[t]}_{2}-\bm{\mu}^{(k)*}_{2}\|_{2}$ as well, and the corresponding bound for $\|\widehat{\bm{\Sigma}}^{(k)[t]}-\bm{\Sigma}^{(k)*}\|_{2}$ can be obtained in the same spirit as in (S.5.122).

(ii) Case 2: We now focus on the case that $h\leq C\sqrt{\frac{p+\log K}{\max_{k\in S}n_{k}}}$. As mentioned at the beginning of this proof, we need to adaptively shrink the radius of the contraction basin to prove the desired convergence rate. The analysis of Case 2 is divided into two stages. In the first stage, we use the same fixed contraction radius as in Case 1 and follow the same analysis until the iterative error $t^{2}\kappa^{t}$ has decreased to the order of the statistical error $\sqrt{\frac{K(p+\log K)}{n_{S}}}$. In the second stage, we apply the localization argument to shrink the contraction basin until we achieve the desired rate of convergence.
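As an illustration only (and not part of the proof), the following minimal Python sketch mimics the deterministic recursions behind the two stages; the contraction factors and noise floors are placeholder values standing in for $\kappa$, $\kappa_{0}^{\prime}$, $\sqrt{K(p+\log K)/n_{S}}$, and the per-task noise floor, respectively.

```python
# Minimal numerical sketch (illustration only) of the two-stage contraction
# scheme behind Case 2.  All constants are placeholders, not the paper's
# actual quantities.
kappa = 0.5         # stage-1 contraction factor (assumed < 1)
kappa0 = 0.4        # stage-2 contraction factor, plays the role of kappa_0'
pooled_err = 1e-3   # stands in for sqrt(K(p + log K) / n_S)
local_floor = 3e-4  # stands in for the per-task noise floor in (S.5.244)

# Stage 1: the iterative (optimization) error contracts geometrically; iterate
# until it is dominated by the pooled statistical error pooled_err.
opt_err, t0 = 1.0, 0
while opt_err > pooled_err:
    opt_err *= kappa
    t0 += 1
stage1_err = opt_err + pooled_err  # total error at the end of stage 1

# Stage 2 (localization): shrink the contraction radius xi_t via the recursion
# xi_t = kappa0 * xi_{t-1} + local_floor, which decays geometrically to the
# noise floor local_floor / (1 - kappa0).
xi = stage1_err
for _ in range(40):
    xi = kappa0 * xi + local_floor

print(f"stage 1 takes t0 = {t0} iterations; error ~ {stage1_err:.2e}")
print(f"after localization: error ~ {xi:.2e}, "
      f"noise floor ~ {local_floor / (1 - kappa0):.2e}")
```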

Similar to Case 1, we consider an event $\mathcal{E}$ defined to be the intersection of the events in Lemmas 11, 12, 13, and 14, with $\xi^{(k)}$ taken to be a large constant $C$; this event satisfies $\mathbb{P}(\mathcal{E})\geq 1-C^{\prime}(K^{-2}+K^{-2}e^{-C^{\prime\prime}p})$. Throughout the analysis in Case 2, we condition on $\mathcal{E}$, so all the arguments hold with probability at least $1-C^{\prime}(K^{-2}+K^{-2}e^{-C^{\prime\prime}p})$.

Consider t0t_{0} as the number of iterations in the first stage which satisfies t02κt0K(p+logK)nSt_{0}^{2}\kappa^{t_{0}}\asymp\sqrt{\frac{K(p+\log K)}{n_{S}}}. When t=1:t0t=1:t_{0}, we can go through the same analysis as in Case 1, and show that conditioned on \mathcal{E},

kSnknSd(𝜽^(k)[t],𝜽(k))Ctκt+CK(p+logK)nS,\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t]},\bm{\theta}^{(k)*})\lesssim C^{\prime\prime}t\kappa^{t}+C^{\prime\prime}\sqrt{\frac{K(p+\log K)}{n_{S}}}, (S.5.177)

and

d(𝜽^(k)[t],𝜽(k))Ct2κt+CK(p+logK)nS+Chp+logKnk+Cϵp+logKmaxk=1:Knk\displaystyle d(\widehat{\bm{\theta}}^{(k)[t]},\bm{\theta}^{(k)*})\leq C^{\prime\prime}t^{2}\kappa^{t}+C^{\prime\prime}\sqrt{\frac{K(p+\log K)}{n_{S}}}+C^{\prime\prime}\cdot h\wedge\sqrt{\frac{p+\log K}{n_{k}}}+C^{\prime\prime}\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}} (S.5.178)
+ClogKnk,\displaystyle\hskip 85.35826pt+C^{\prime\prime}\sqrt{\frac{\log K}{n_{k}}}, (S.5.179)
𝝁^(k)[t]1𝝁(k)12𝝁^(k)[t]2𝝁(k)22𝚺^(k)[t]𝚺(k)2Ct2κt+Cp+logKnk.\displaystyle\|\widehat{\bm{\mu}}^{(k)[t]}_{1}-\bm{\mu}^{(k)*}_{1}\|_{2}\vee\|\widehat{\bm{\mu}}^{(k)[t]}_{2}-\bm{\mu}^{(k)*}_{2}\|_{2}\vee\|\widehat{\bm{\Sigma}}^{(k)[t]}-\bm{\Sigma}^{(k)*}\|_{2}\leq C^{\prime\prime}t^{2}\kappa^{t}+C^{\prime\prime}\sqrt{\frac{p+\log K}{n_{k}}}. (S.5.180)

Since t02κt0K(p+logK)nSt_{0}^{2}\kappa^{t_{0}}\asymp\sqrt{\frac{K(p+\log K)}{n_{S}}}, the rates above are the desired rates. In the following, we will derive the results for the case tt0+1t\geq t_{0}+1.
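As a side remark (not needed for the argument, and stated only up to constants), solving $t_{0}^{2}\kappa^{t_{0}}\asymp\sqrt{K(p+\log K)/n_{S}}$ for $t_{0}$ gives, up to a $\log\log$ correction,

t_{0}\asymp\frac{1}{2\log(1/\kappa)}\log\Big(\frac{n_{S}}{K(p+\log K)}\Big),

so the first stage lasts only logarithmically many iterations.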

Define

ξ(k)t0\displaystyle\xi^{(k)}_{t_{0}} =CK(p+logK)nS+Chp+logKnk+Cϵp+logKmaxk=1:Knk+ClogKnk,\displaystyle=C^{\prime\prime}\sqrt{\frac{K(p+\log K)}{n_{S}}}+C^{\prime\prime}\cdot h\wedge\sqrt{\frac{p+\log K}{n_{k}}}+C^{\prime\prime}\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}}+C^{\prime\prime}\sqrt{\frac{\log K}{n_{k}}}, (S.5.181)
ξ¯t0=kSnknSξ(k)t0\displaystyle\overline{\xi}_{t_{0}}=\sum_{k\in S}\frac{n_{k}}{n_{S}}\xi^{(k)}_{t_{0}} CK(p+logK)nS+ChK(p+logK)nS+CϵK(p+logK)nS\displaystyle\leq C^{\prime\prime}\sqrt{\frac{K(p+\log K)}{n_{S}}}+C^{\prime\prime}\cdot h\wedge\sqrt{\frac{K(p+\log K)}{n_{S}}}+C^{\prime\prime}\epsilon\sqrt{\frac{K(p+\log K)}{n_{S}}} (S.5.182)
+CKlogKnS.\displaystyle\quad+C^{\prime\prime}\sqrt{\frac{K\log K}{n_{S}}}. (S.5.183)

Consider an event t0\mathcal{E}_{t_{0}} defined to be the intersection of the events in Lemmas 11, 12, 13, and 14, with ξ(k)=ξ(k)t0\xi^{(k)}=\xi^{(k)}_{t_{0}}, which satisfies (t0)1C(K2+K2eCp)\mathbb{P}(\mathcal{E}_{t_{0}})\geq 1-C^{\prime}(K^{-2}+K^{-2}e^{-C^{\prime\prime}p}). In the following, we condition on t0\mathcal{E}\cap\mathcal{E}_{t_{0}}, therefore all the arguments hold with probability at least 1CK11-C^{\prime}K^{-1}.

Let t=t0+1t=t_{0}+1. Since λ[t1]Cp+logKCmaxkSnkh\lambda^{[t-1]}\geq C\sqrt{p+\log K}\geq C\max_{k\in S}\sqrt{n_{k}}h, by Lemma 6, we also have 𝜷^(k)[t1]=𝜷¯[t1]\widehat{\bm{\beta}}^{(k)[t-1]}=\overline{\bm{\beta}}^{[t-1]} for all kSk\in S. Similar to (S.5.129),

|w^(k)[t]w(k)|\displaystyle|\widehat{w}^{(k)[t]}-w^{(k)*}| |w(k)(𝜽^(k)[t1])w(k)|+|w^(k)[t]w(k)(𝜽^(k)[t1])|\displaystyle\leq|w^{(k)}(\widehat{\bm{\theta}}^{(k)[t-1]})-w^{(k)*}|+|\widehat{w}^{(k)[t]}-w^{(k)}(\widehat{\bm{\theta}}^{(k)[t-1]})| (S.5.184)
κ0d(𝜽^(k)[t1],𝜽(k))+|1nki=1nkγ𝜽^(k)[t1](𝒛(k)i)𝔼𝒛(k)[γ𝜽^(k)[t1](𝒛(k))]|\displaystyle\leq\kappa_{0}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\left|\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}^{(k)}_{i})-\mathbb{E}_{\bm{z}^{(k)}}[\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}^{(k)})]\right| (S.5.185)
κ0d(𝜽^(k)[t1],𝜽(k))+Cξ(k)t1pnk+ClogKnk\displaystyle\leq\kappa_{0}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+C^{\prime\prime}\xi^{(k)}_{t-1}\sqrt{\frac{p}{n_{k}}}+C^{\prime\prime}\sqrt{\frac{\log K}{n_{k}}} (S.5.186)
κ0d(𝜽^(k)[t1],𝜽(k))+κ0ξ(k)t1+ClogKnk.\displaystyle\leq\kappa_{0}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\kappa_{0}\xi^{(k)}_{t-1}+C^{\prime\prime}\sqrt{\frac{\log K}{n_{k}}}. (S.5.187)

This implies that

kSnknS|w^(k)[t]w(k)|κ0kSnknSd(𝜽^(k)[t1],𝜽(k))+κ0kSnknSξ(k)t1+CKlogKnS.\sum_{k\in S}\frac{n_{k}}{n_{S}}|\widehat{w}^{(k)[t]}-w^{(k)*}|\leq\kappa_{0}\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\kappa_{0}\sum_{k\in S}\frac{n_{k}}{n_{S}}\xi^{(k)}_{t-1}+C^{\prime\prime}\sqrt{\frac{K\log K}{n_{S}}}. (S.5.188)

And by Lemma 6,

𝜷^(k)[t]𝜷(k)2\displaystyle\|\widehat{\bm{\beta}}^{(k)[t]}-\bm{\beta}^{(k)*}\|_{2} kSnknS[𝚺^(k)[t]𝜷(k)(𝝁^(k)[t]2𝝁^(k)[t]1)]2+hλ[t]nk+ϵλ[t]nk\displaystyle\lesssim\left\|\sum_{k\in S}\frac{n_{k}}{n_{S}}[\widehat{\bm{\Sigma}}^{(k)[t]}\bm{\beta}^{(k)*}-(\widehat{\bm{\mu}}^{(k)[t]}_{2}-\widehat{\bm{\mu}}^{(k)[t]}_{1})]\right\|_{2}+h\wedge\frac{\lambda^{[t]}}{\sqrt{n_{k}}}+\epsilon\frac{\lambda^{[t]}}{\sqrt{n_{k}}} (S.5.189)
kSnknS[𝚺^(k)[t]𝜷(k)(𝝁^(k)[t]2𝝁^(k)[t]1)]2+hp+logKnk+ϵp+logKmaxk=1:Knk,\displaystyle\lesssim\left\|\sum_{k\in S}\frac{n_{k}}{n_{S}}[\widehat{\bm{\Sigma}}^{(k)[t]}\bm{\beta}^{(k)*}-(\widehat{\bm{\mu}}^{(k)[t]}_{2}-\widehat{\bm{\mu}}^{(k)[t]}_{1})]\right\|_{2}+h\wedge\sqrt{\frac{p+\log K}{n_{k}}}+\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}}, (S.5.190)

where

kSnknS[𝚺^(k)[t]𝜷(k)(𝝁^(k)[t]2𝝁^(k)[t]1)]2\displaystyle\left\|\sum_{k\in S}\frac{n_{k}}{n_{S}}[\widehat{\bm{\Sigma}}^{(k)[t]}\bm{\beta}^{(k)*}-(\widehat{\bm{\mu}}^{(k)[t]}_{2}-\widehat{\bm{\mu}}^{(k)[t]}_{1})]\right\|_{2} kSnknS(𝚺^(k)[t]𝚺(k))𝜷(k)2\displaystyle\leq\left\|\sum_{k\in S}\frac{n_{k}}{n_{S}}(\widehat{\bm{\Sigma}}^{(k)[t]}-\bm{\Sigma}^{(k)*})\bm{\beta}^{(k)*}\right\|_{2} (S.5.191)
+kSnknS(𝝁^(k)[t]2𝝁^(k)[t]1𝝁(k)2+𝝁(k)1)2.\displaystyle\quad+\left\|\sum_{k\in S}\frac{n_{k}}{n_{S}}\big{(}\widehat{\bm{\mu}}^{(k)[t]}_{2}-\widehat{\bm{\mu}}^{(k)[t]}_{1}-\bm{\mu}^{(k)*}_{2}+\bm{\mu}^{(k)*}_{1}\big{)}\right\|_{2}. (S.5.192)

And the first term on the RHS can be controlled as

\left\|\sum_{k\in S}\frac{n_{k}}{n_{S}}(\widehat{\bm{\Sigma}}^{(k)[t]}-\bm{\Sigma}^{(k)*})\bm{\beta}^{(k)*}\right\|_{2} (S.5.193)
\leq\underbrace{\left\|\frac{1}{n_{S}}\sum_{k\in S}\sum_{i=1}^{n_{k}}\big{[}\bm{z}^{(k)}_{i}(\bm{z}^{(k)}_{i})^{\top}-\mathbb{E}[\bm{z}^{(k)}_{i}(\bm{z}^{(k)}_{i})^{\top}]\big{]}\bm{\beta}^{(k)*}\right\|_{2}}_{④} (S.5.194)
\quad+\underbrace{\left\|\sum_{k\in S}\frac{n_{k}}{n_{S}}\big{[}(1-\widehat{w}^{(k)[t]})\widehat{\bm{\mu}}_{1}^{(k)[t]}(\widehat{\bm{\mu}}_{1}^{(k)[t]})^{\top}-(1-w^{(k)*})\bm{\mu}^{(k)*}_{1}(\bm{\mu}^{(k)*}_{1})^{\top}\big{]}\bm{\beta}^{(k)*}\right\|_{2}}_{⑤} (S.5.195)
\quad+\underbrace{\left\|\sum_{k\in S}\frac{n_{k}}{n_{S}}\big{[}\widehat{w}^{(k)[t]}\widehat{\bm{\mu}}_{2}^{(k)[t]}(\widehat{\bm{\mu}}_{2}^{(k)[t]})^{\top}-w^{(k)*}\bm{\mu}^{(k)*}_{2}(\bm{\mu}^{(k)*}_{2})^{\top}\big{]}\bm{\beta}^{(k)*}\right\|_{2}}_{⑥}. (S.5.196)

Conditioned on \mathcal{E}, we have

④\lesssim\sqrt{\frac{p}{n_{S}}}. (S.5.197)

And

⑤\leq\underbrace{\left\|\sum_{k\in S}\frac{n_{k}}{n_{S}}(1-\widehat{w}^{(k)[t]})(\widehat{\bm{\mu}}^{(k)[t]}_{1}-\bm{\mu}^{(k)*}_{1})\cdot(\widehat{\bm{\mu}}^{(k)[t]}_{1})^{\top}\bm{\beta}^{(k)*}\right\|_{2}}_{⑤.1} (S.5.198)
\quad+\underbrace{\left\|\sum_{k\in S}\frac{n_{k}}{n_{S}}\big{[}(1-\widehat{w}^{(k)[t]})\bm{\mu}^{(k)*}_{1}(\widehat{\bm{\mu}}^{(k)[t]}_{1})^{\top}-(1-w^{(k)*})\bm{\mu}^{(k)*}_{1}(\bm{\mu}^{(k)*}_{1})^{\top}\big{]}\bm{\beta}^{(k)*}\right\|_{2}}_{⑤.2}, (S.5.199)

where

⑤.2\leq\left\|\sum_{k\in S}\frac{n_{k}}{n_{S}}(\widehat{w}^{(k)[t]}-w^{(k)*})\bm{\mu}^{(k)*}_{1}(\widehat{\bm{\mu}}^{(k)[t]}_{1})^{\top}\bm{\beta}^{(k)*}\right\|_{2}+\left\|\sum_{k\in S}\frac{n_{k}}{n_{S}}(1-\widehat{w}^{(k)[t]})\bm{\mu}^{(k)*}_{1}(\widehat{\bm{\mu}}^{(k)[t]}_{1}-\bm{\mu}^{(k)*}_{1})^{\top}\bm{\beta}^{(k)*}\right\|_{2}. (S.5.200)

For the first term, we have

kSnknS(w^(k)[t]w(k))𝝁(k)1(𝝁^(k)[t]1)𝜷(k)2\displaystyle\left\|\sum_{k\in S}\frac{n_{k}}{n_{S}}(\widehat{w}^{(k)[t]}-w^{(k)*})\bm{\mu}^{(k)*}_{1}(\widehat{\bm{\mu}}^{(k)[t]}_{1})^{\top}\bm{\beta}^{(k)*}\right\|_{2} (S.5.201)
kSnknS|w(k)(𝜽^(k)[t1])w(k)|\displaystyle\lesssim\sum_{k\in S}\frac{n_{k}}{n_{S}}|w^{(k)}(\widehat{\bm{\theta}}^{(k)[t-1]})-w^{(k)*}| (S.5.202)
+1nSkSi=1nk{γ𝜽^(k)[t1](𝒛(k)i)𝝁(k)1(𝝁^(k)[t]1)𝜷(k)𝔼𝒛(k)[γ𝜽^(k)[t1](𝒛(k))𝝁(k)1(𝝁^(k)[t]1)𝜷(k)]}2\displaystyle\quad+\frac{1}{n_{S}}\left\|\sum_{k\in S}\sum_{i=1}^{n_{k}}\Big{\{}\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}^{(k)}_{i})\bm{\mu}^{(k)*}_{1}(\widehat{\bm{\mu}}^{(k)[t]}_{1})^{\top}\bm{\beta}^{(k)*}-\mathbb{E}_{\bm{z}^{(k)}}\big{[}\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}^{(k)})\bm{\mu}^{(k)*}_{1}(\widehat{\bm{\mu}}^{(k)[t]}_{1})^{\top}\bm{\beta}^{(k)*}\big{]}\Big{\}}\right\|_{2} (S.5.203)
κ0kSnknSd(𝜽^(k)[t1],𝜽(k))\displaystyle\lesssim\kappa_{0}\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*}) (S.5.204)
+1nSsup|w~k|UkSi=1nkw~k{γ𝜽^(k)[t1](𝒛(k)i)𝝁(k)1𝔼𝒛(k)[γ𝜽^(k)[t1](𝒛(k))𝝁(k)1]}2,\displaystyle\quad+\frac{1}{n_{S}}\sup_{|\widetilde{w}_{k}|\leq U}\left\|\sum_{k\in S}\sum_{i=1}^{n_{k}}\widetilde{w}_{k}\Big{\{}\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}^{(k)}_{i})\bm{\mu}^{(k)*}_{1}-\mathbb{E}_{\bm{z}^{(k)}}\big{[}\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}^{(k)})\bm{\mu}^{(k)*}_{1}\big{]}\Big{\}}\right\|_{2}, (S.5.205)

where U>0U>0 is some constant such that U𝝁^(k)[t]1𝝁(k)12𝜷(k)2+𝝁(k)12𝜷(k)2|(𝝁^(k)[t]1)𝜷(k)|U\geq\|\widehat{\bm{\mu}}^{(k)[t]}_{1}-\bm{\mu}^{(k)*}_{1}\|_{2}\|\bm{\beta}^{(k)*}\|_{2}+\|\bm{\mu}^{(k)*}_{1}\|_{2}\|\bm{\beta}^{(k)*}\|_{2}\geq|(\widehat{\bm{\mu}}^{(k)[t]}_{1})^{\top}\bm{\beta}^{(k)*}| under event t0\mathcal{E}_{t_{0}}. Note that the last inequality holds because the expectation 𝔼𝒛(k)\mathbb{E}_{\bm{z}^{(k)}} is w.r.t. 𝒛(k)\bm{z}^{(k)} which is independent of 𝝁^(k)[t]\widehat{\bm{\mu}}^{(k)[t]}. By Lemma 13.(iii) and the definition of t0\mathcal{E}_{t_{0}}, the second term can be bounded as

1nSsup|w~k|UkSi=1nkw~k{γ𝜽^(k)[t1](𝒛(k)i)𝝁(k)1𝔼𝒛(k)[γ𝜽^(k)[t1](𝒛(k))𝝁(k)1]}2p+KnS.\displaystyle\frac{1}{n_{S}}\sup_{|\widetilde{w}_{k}|\leq U}\left\|\sum_{k\in S}\sum_{i=1}^{n_{k}}\widetilde{w}_{k}\Big{\{}\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}^{(k)}_{i})\bm{\mu}^{(k)*}_{1}-\mathbb{E}_{\bm{z}^{(k)}}\big{[}\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}^{(k)})\bm{\mu}^{(k)*}_{1}\big{]}\Big{\}}\right\|_{2}\lesssim\sqrt{\frac{p+K}{n_{S}}}. (S.5.206)

Therefore,

kSnknS(w^(k)[t]w(k))𝝁(k)1(𝝁^(k)[t]1)𝜷(k)2κ0kSnknSd(𝜽^(k)[t1],𝜽(k))+p+KnS.\left\|\sum_{k\in S}\frac{n_{k}}{n_{S}}(\widehat{w}^{(k)[t]}-w^{(k)*})\bm{\mu}^{(k)*}_{1}(\widehat{\bm{\mu}}^{(k)[t]}_{1})^{\top}\bm{\beta}^{(k)*}\right\|_{2}\lesssim\kappa_{0}\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\sqrt{\frac{p+K}{n_{S}}}. (S.5.207)

On the other hand, by simple calculations, we have

kSnknS(1w^(k)[t])𝝁(k)1(𝝁^(k)[t]1𝝁(k)1)𝜷(k)2\displaystyle\left\|\sum_{k\in S}\frac{n_{k}}{n_{S}}(1-\widehat{w}^{(k)[t]})\bm{\mu}^{(k)*}_{1}(\widehat{\bm{\mu}}^{(k)[t]}_{1}-\bm{\mu}^{(k)*}_{1})^{\top}\bm{\beta}^{(k)*}\right\|_{2} (S.5.208)
κ0kSnknSd(𝜽^(k)[t1],𝜽(k))+sup|w~k|UkSnknSw~k[w(k)(𝜽^(k)[t1])w^(k)[t]]𝝁(k)12\displaystyle\lesssim\kappa_{0}\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\sup_{|\widetilde{w}_{k}|\leq U}\left\|\sum_{k\in S}\frac{n_{k}}{n_{S}}\widetilde{w}_{k}[w^{(k)}(\widehat{\bm{\theta}}^{(k)[t-1]})-\widehat{w}^{(k)[t]}]\bm{\mu}^{(k)*}_{1}\right\|_{2} (S.5.209)
+1nSsup|w~k|UkSw~ki=1nk{γ𝜽^(k)[t1](𝒛(k)i)(𝒛(k)i)𝜷(k)𝔼𝒛(k)[γ𝜽^(k)[t1](𝒛(k))(𝒛(k))𝜷(k)]}𝝁(k)12\displaystyle\quad+\frac{1}{n_{S}}\sup_{|\widetilde{w}_{k}|\leq U^{\prime}}\left\|\sum_{k\in S}\widetilde{w}_{k}\sum_{i=1}^{n_{k}}\big{\{}\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}^{(k)}_{i})(\bm{z}^{(k)}_{i})^{\top}\bm{\beta}^{(k)*}-\mathbb{E}_{\bm{z}^{(k)}}[\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}^{(k)})(\bm{z}^{(k)})^{\top}\bm{\beta}^{(k)*}]\big{\}}\bm{\mu}^{(k)*}_{1}\right\|_{2} (S.5.210)
κ0kSnknSd(𝜽^(k)[t1],𝜽(k))+1nSsup|w~k|UkSw~ki=1nk[γ𝜽^(k)[t1](𝒛(k)i)𝔼𝒛(k)[γ𝜽^(k)[t1](𝒛(k))]]𝝁(k)12\displaystyle\lesssim\kappa_{0}\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\frac{1}{n_{S}}\sup_{|\widetilde{w}_{k}|\leq U}\left\|\sum_{k\in S}\widetilde{w}_{k}\sum_{i=1}^{n_{k}}\big{[}\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}^{(k)}_{i})-\mathbb{E}_{\bm{z}^{(k)}}[\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}^{(k)})]\big{]}\bm{\mu}^{(k)*}_{1}\right\|_{2} (S.5.211)
+1nSsup|w~k|Usupu21|kSw~ki=1nk{γ𝜽^(k)[t1](𝒛(k)i)(𝒛(k)i)𝜷(k)𝔼𝒛(k)[γ𝜽^(k)[t1](𝒛(k))(𝒛(k))𝜷(k)]}(𝝁(k)1)𝒖|\displaystyle\quad+\frac{1}{n_{S}}\sup_{|\widetilde{w}_{k}|\leq U^{\prime}}\sup_{\|u\|_{2}\leq 1}\left|\sum_{k\in S}\widetilde{w}_{k}\sum_{i=1}^{n_{k}}\big{\{}\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}^{(k)}_{i})(\bm{z}^{(k)}_{i})^{\top}\bm{\beta}^{(k)*}-\mathbb{E}_{\bm{z}^{(k)}}[\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}^{(k)})(\bm{z}^{(k)})^{\top}\bm{\beta}^{(k)*}]\big{\}}(\bm{\mu}^{(k)*}_{1})^{\top}\bm{u}\right| (S.5.212)
κ0kSnknSd(𝜽^(k)[t1],𝜽(k))+p+KnS\displaystyle\lesssim\kappa_{0}\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\sqrt{\frac{p+K}{n_{S}}} (S.5.213)
+1nSsup|w~k|U|kSw~ki=1nk{γ𝜽^(k)[t1](𝒛(k)i)(𝒛(k)i)𝜷(k)𝔼𝒛(k)[γ𝜽^(k)[t1](𝒛(k))(𝒛(k))𝜷(k)]}|\displaystyle\quad+\frac{1}{n_{S}}\sup_{|\widetilde{w}_{k}|\leq U^{\prime}}\left|\sum_{k\in S}\widetilde{w}_{k}\sum_{i=1}^{n_{k}}\big{\{}\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}^{(k)}_{i})(\bm{z}^{(k)}_{i})^{\top}\bm{\beta}^{(k)*}-\mathbb{E}_{\bm{z}^{(k)}}[\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}^{(k)})(\bm{z}^{(k)})^{\top}\bm{\beta}^{(k)*}]\big{\}}\right| (S.5.214)
κ0kSnknSd(𝜽^(k)[t1],𝜽(k))+p+KnS.\displaystyle\lesssim\kappa_{0}\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\sqrt{\frac{p+K}{n_{S}}}. (S.5.215)

Hence

⑤.2\lesssim\kappa_{0}\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\sqrt{\frac{p+K}{n_{S}}}. (S.5.216)

A similar discussion leads to the same bound for ⑤.1. Therefore,

⑤\lesssim\kappa_{0}\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\sqrt{\frac{p+K}{n_{S}}}. (S.5.217)

And the same bound holds for ⑥, which can be shown in the same spirit. Putting all the pieces together,

kSnknS(𝚺^(k)[t]𝚺(k))𝜷(k)2κ0kSnknSd(𝜽^(k)[t1],𝜽(k))+p+KnS.\left\|\sum_{k\in S}\frac{n_{k}}{n_{S}}(\widehat{\bm{\Sigma}}^{(k)[t]}-\bm{\Sigma}^{(k)*})\bm{\beta}^{(k)*}\right\|_{2}\lesssim\kappa_{0}\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\sqrt{\frac{p+K}{n_{S}}}. (S.5.218)

Therefore by Lemma 6 and (S.5.190), we have 𝜷^(k)[t1]=𝜷¯[t1]\widehat{\bm{\beta}}^{(k)[t-1]}=\overline{\bm{\beta}}^{[t-1]} for all kSk\in S, and

𝜷^(k)[t]𝜷(k)2\displaystyle\|\widehat{\bm{\beta}}^{(k)[t]}-\bm{\beta}^{(k)*}\|_{2} κ0kSnknSd(𝜽^(k)[t1],𝜽(k))+p+KnS+hp+logKnk+ϵp+logKmaxk=1:Knk\displaystyle\lesssim\kappa_{0}\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\sqrt{\frac{p+K}{n_{S}}}+h\wedge\sqrt{\frac{p+\log K}{n_{k}}}+\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}} (S.5.219)
Cκ0kSnknSξ(k)t1+p+KnS+hp+logKnk+ϵp+logKmaxk=1:Knk\displaystyle\leq C\kappa_{0}\cdot\sum_{k\in S}\frac{n_{k}}{n_{S}}\xi^{(k)}_{t-1}+\sqrt{\frac{p+K}{n_{S}}}+h\wedge\sqrt{\frac{p+\log K}{n_{k}}}+\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}} (S.5.220)
κ0ξ¯t1+CpnS+Chp+logKnk+Cϵp+logKmaxk=1:Knk+CKlogKnS\displaystyle\leq\kappa_{0}^{\prime}\overline{\xi}_{t-1}+C^{\prime}\sqrt{\frac{p}{n_{S}}}+C^{\prime}\cdot h\wedge\sqrt{\frac{p+\log K}{n_{k}}}+C^{\prime}\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}}+C^{\prime}\sqrt{\frac{K\log K}{n_{S}}} (S.5.221)
ξ(k)t.\displaystyle\coloneqq\xi^{(k)}_{t}. (S.5.222)

This entails that

\overline{\xi}_{t}=\sum_{k\in S}\frac{n_{k}}{n_{S}}\xi^{(k)}_{t}\leq\kappa_{0}^{\prime}\overline{\xi}_{t-1}+C^{\prime}\sqrt{\frac{p}{n_{S}}}+C^{\prime}\cdot h\wedge\sqrt{\frac{K(p+\log K)}{n_{S}}}+C^{\prime}\epsilon\sqrt{\frac{K(p+\log K)}{n_{S}}}+C^{\prime}\sqrt{\frac{K\log K}{n_{S}}}. (S.5.223)

This implies that

kSnknS𝜷^(k)[t]𝜷(k)2κ0kSnknSd(𝜽^(k)[t1],𝜽(k))+p+KnS+hK(p+logK)nS+ϵK(p+logK)nS.\sum_{k\in S}\frac{n_{k}}{n_{S}}\|\widehat{\bm{\beta}}^{(k)[t]}-\bm{\beta}^{(k)*}\|_{2}\lesssim\kappa_{0}\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\sqrt{\frac{p+K}{n_{S}}}+h\wedge\sqrt{\frac{K(p+\log K)}{n_{S}}}+\epsilon\sqrt{\frac{K(p+\log K)}{n_{S}}}. (S.5.224)

Also,

|δ^(k)[t]δ(k)|\displaystyle|\widehat{\delta}^{(k)[t]}-\delta^{(k)*}| =12(𝜷^(k)[t])(𝝁^(k)[t]1+𝝁^(k)[t]2)(𝜷(k))(𝝁(k)1+𝝁(k)2)2\displaystyle=\frac{1}{2}\left\|(\widehat{\bm{\beta}}^{(k)[t]})^{\top}(\widehat{\bm{\mu}}^{(k)[t]}_{1}+\widehat{\bm{\mu}}^{(k)[t]}_{2})-(\bm{\beta}^{(k)*})^{\top}(\bm{\mu}^{(k)*}_{1}+\bm{\mu}^{(k)*}_{2})\right\|_{2} (S.5.225)
𝜷^(k)[t]𝜷(k)2+(𝜷(k))(𝝁^(k)[t]1𝝁(k)1)2+(𝜷(k))(𝝁^(k)[t]2𝝁(k)2)2\displaystyle\lesssim\|\widehat{\bm{\beta}}^{(k)[t]}-\bm{\beta}^{(k)*}\|_{2}+\|(\bm{\beta}^{(k)*})^{\top}(\widehat{\bm{\mu}}^{(k)[t]}_{1}-\bm{\mu}^{(k)*}_{1})\|_{2}+\|(\bm{\beta}^{(k)*})^{\top}(\widehat{\bm{\mu}}^{(k)[t]}_{2}-\bm{\mu}^{(k)*}_{2})\|_{2} (S.5.226)
κ0kSnknSd(𝜽^(k)[t1],𝜽(k))+p+KnS+hp+logKnk+ϵp+logKnk\displaystyle\lesssim\kappa_{0}\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\sqrt{\frac{p+K}{n_{S}}}+h\wedge\sqrt{\frac{p+\log K}{n_{k}}}+\epsilon\sqrt{\frac{p+\log K}{n_{k}}} (S.5.227)
+1nki=1nkγ𝜽^(k)[t1](𝒛(k)i)(𝜷(k))𝒛(k)i𝔼𝒛(k)[γ𝜽^(k)[t1](𝒛(k))(𝜷(k))𝒛(k)]2\displaystyle\quad+\left\|\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}^{(k)}_{i})(\bm{\beta}^{(k)*})^{\top}\bm{z}^{(k)}_{i}-\mathbb{E}_{\bm{z}^{(k)}}\big{[}\gamma_{\widehat{\bm{\theta}}^{(k)[t-1]}}(\bm{z}^{(k)})(\bm{\beta}^{(k)*})^{\top}\bm{z}^{(k)}\big{]}\right\|_{2} (S.5.228)
κ0kSnknSd(𝜽^(k)[t1],𝜽(k))+p+KnS+hp+logKnk+ϵp+logKnk\displaystyle\lesssim\kappa_{0}\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\sqrt{\frac{p+K}{n_{S}}}+h\wedge\sqrt{\frac{p+\log K}{n_{k}}}+\epsilon\sqrt{\frac{p+\log K}{n_{k}}} (S.5.229)
+ξ(k)t1pnk+logKnk\displaystyle\quad+\xi^{(k)}_{t-1}\sqrt{\frac{p}{n_{k}}}+\sqrt{\frac{\log K}{n_{k}}} (S.5.230)
κ0kSnknSd(𝜽^(k)[t1],𝜽(k))+pnS+hp+logKnk+ϵp+logKmaxk=1:Knk\displaystyle\lesssim\kappa_{0}\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\sqrt{\frac{p}{n_{S}}}+h\wedge\sqrt{\frac{p+\log K}{n_{k}}}+\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}} (S.5.231)
+κ0ξ(k)t1+logKnk\displaystyle\quad+\kappa_{0}\xi^{(k)}_{t-1}+\sqrt{\frac{\log K}{n_{k}}} (S.5.232)

Therefore,

kSnknS|δ^(k)[t]δ(k)|\displaystyle\sum_{k\in S}\frac{n_{k}}{n_{S}}|\widehat{\delta}^{(k)[t]}-\delta^{(k)*}| κ0kSnknSd(𝜽^(k)[t1],𝜽(k))+pnS+hK(p+logK)nS\displaystyle\lesssim\kappa_{0}\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+\sqrt{\frac{p}{n_{S}}}+h\wedge\sqrt{\frac{K(p+\log K)}{n_{S}}} (S.5.233)
+ϵK(p+logK)nS+κ0kSnknSξ(k)t1+KlogKnS.\displaystyle\quad+\epsilon\sqrt{\frac{K(p+\log K)}{n_{S}}}+\kappa_{0}\sum_{k\in S}\frac{n_{k}}{n_{S}}\xi^{(k)}_{t-1}+\sqrt{\frac{K\log K}{n_{S}}}. (S.5.234)

Hence

kSnknSd(𝜽^(k)[t],𝜽(k))\displaystyle\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t]},\bm{\theta}^{(k)*}) Cκ0kSnknSd(𝜽^(k)[t1],𝜽(k))+CpnS+ChK(p+logK)nS\displaystyle\leq C\kappa_{0}\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+C\sqrt{\frac{p}{n_{S}}}+C\cdot h\wedge\sqrt{\frac{K(p+\log K)}{n_{S}}} (S.5.235)
+CϵK(p+logK)nS+Cκ0ξ¯t1+CKlogKnS\displaystyle\quad+C\epsilon\sqrt{\frac{K(p+\log K)}{n_{S}}}+C\kappa_{0}\overline{\xi}_{t-1}+C\sqrt{\frac{K\log K}{n_{S}}} (S.5.236)
κ0ξ¯t1+CpnS+ChK(p+logK)nS\displaystyle\leq\kappa_{0}^{\prime}\overline{\xi}_{t-1}+C\sqrt{\frac{p}{n_{S}}}+C\cdot h\wedge\sqrt{\frac{K(p+\log K)}{n_{S}}} (S.5.237)
+CϵK(p+logK)nS+CKlogKnS,\displaystyle\quad+C\epsilon\sqrt{\frac{K(p+\log K)}{n_{S}}}+C^{\prime}\sqrt{\frac{K\log K}{n_{S}}}, (S.5.238)

and

d(𝜽^(k)[t],𝜽(k))\displaystyle d(\widehat{\bm{\theta}}^{(k)[t]},\bm{\theta}^{(k)*}) Cκ0kSnknSd(𝜽^(k)[t1],𝜽(k))+CpnS+Chp+logKnk+Cϵp+logKmaxk=1:Knk\displaystyle\leq C\kappa_{0}\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+C\sqrt{\frac{p}{n_{S}}}+Ch\wedge\sqrt{\frac{p+\log K}{n_{k}}}+C\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}} (S.5.239)
+Cκ0ξ(k)t1+ClogKnk\displaystyle\quad+C\kappa_{0}\xi^{(k)}_{t-1}+C\sqrt{\frac{\log K}{n_{k}}} (S.5.240)
12κ0kSnknSd(𝜽^(k)[t1],𝜽(k))+CpnS+Chp+logKnk+Cϵp+logKmaxk=1:Knk\displaystyle\leq\frac{1}{2}\kappa_{0}^{\prime}\sum_{k\in S}\frac{n_{k}}{n_{S}}d(\widehat{\bm{\theta}}^{(k)[t-1]},\bm{\theta}^{(k)*})+C\sqrt{\frac{p}{n_{S}}}+Ch\wedge\sqrt{\frac{p+\log K}{n_{k}}}+C\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}} (S.5.241)
+12κ0ξ(k)t1+ClogKnk\displaystyle\quad+\frac{1}{2}\kappa_{0}^{\prime}\xi^{(k)}_{t-1}+C\sqrt{\frac{\log K}{n_{k}}} (S.5.242)
κ0ξ(k)t1+CpnS+Chp+logKnk+Cϵp+logKmaxk=1:Knk\displaystyle\leq\kappa_{0}^{\prime}\xi^{(k)}_{t-1}+C\sqrt{\frac{p}{n_{S}}}+Ch\wedge\sqrt{\frac{p+\log K}{n_{k}}}+C\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}} (S.5.243)
+ClogKnk,\displaystyle\quad+C\sqrt{\frac{\log K}{n_{k}}}, (S.5.244)

where κ0=2Cκ0(0,1)\kappa_{0}^{\prime}=2C\kappa_{0}\in(0,1).

Therefore, for $t=(t_{0}+1):(t_{0}+t_{0}^{\prime})$, where $t_{0}^{\prime}$ is chosen so that $(\kappa_{0}^{\prime})^{t_{0}^{\prime}}\asymp K^{-1/2}$ and hence $t_{0}^{\prime}\asymp\log K$, the updating formulas (S.5.222) and (S.5.223) hold for $\xi^{(k)}_{t}$. Unrolling them, we get

\overline{\xi}_{t_{0}+t^{\prime}}\leq(\kappa_{0}^{\prime})^{t^{\prime}}\overline{\xi}_{t_{0}}+C^{\prime}\sqrt{\frac{p}{n_{S}}}+C^{\prime}\cdot h\wedge\sqrt{\frac{K(p+\log K)}{n_{S}}}+C^{\prime}\epsilon\sqrt{\frac{K(p+\log K)}{n_{S}}}+C^{\prime}\sqrt{\frac{K\log K}{n_{S}}} (S.5.245)
C(κ0)tK(p+logK)nS+CpnS+ChK(p+logK)nS+CϵK(p+logK)nS\displaystyle\leq C(\kappa_{0}^{\prime})^{t^{\prime}}\sqrt{\frac{K(p+\log K)}{n_{S}}}+C^{\prime}\sqrt{\frac{p}{n_{S}}}+C^{\prime}\cdot h\wedge\sqrt{\frac{K(p+\log K)}{n_{S}}}+C^{\prime}\epsilon\sqrt{\frac{K(p+\log K)}{n_{S}}} (S.5.246)
+CKlogKnS,\displaystyle\quad+C^{\prime}\sqrt{\frac{K\log K}{n_{S}}}, (S.5.247)

and

\xi^{(k)}_{t_{0}+t^{\prime}}=\kappa_{0}^{\prime}\overline{\xi}_{t_{0}+t^{\prime}-1}+C^{\prime}\sqrt{\frac{p}{n_{S}}}+C^{\prime}\cdot h\wedge\sqrt{\frac{p+\log K}{n_{k}}}+C^{\prime}\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}}+C^{\prime}\sqrt{\frac{K\log K}{n_{S}}} (S.5.248)
C(κ0)tK(p+logK)nS+CpnS+Chp+logKnk+Cϵp+logKmaxk=1:Knk\displaystyle\leq C(\kappa_{0}^{\prime})^{t^{\prime}}\sqrt{\frac{K(p+\log K)}{n_{S}}}+C^{\prime}\sqrt{\frac{p}{n_{S}}}+C^{\prime}\cdot h\wedge\sqrt{\frac{p+\log K}{n_{k}}}+C^{\prime}\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}} (S.5.249)
+CKlogKnS,\displaystyle\quad+C^{\prime}\sqrt{\frac{K\log K}{n_{S}}}, (S.5.250)

with

ξ(k)t0+t0CpnS+Chp+logKnk+Cϵp+logKmaxk=1:Knk+CKlogKnS,\xi^{(k)}_{t_{0}+t_{0}^{\prime}}\leq C^{\prime}\sqrt{\frac{p}{n_{S}}}+C^{\prime}\cdot h\wedge\sqrt{\frac{p+\log K}{n_{k}}}+C^{\prime}\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}}+C^{\prime}\sqrt{\frac{K\log K}{n_{S}}}, (S.5.251)

where the last inequality is due to (κ0)t0K(p+logK)nSp+logKnS(\kappa_{0}^{\prime})^{t_{0}^{\prime}}\sqrt{\frac{K(p+\log K)}{n_{S}}}\asymp\sqrt{\frac{p+\log K}{n_{S}}}.
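For completeness, this is just the choice of $t_{0}^{\prime}$ spelled out:

(\kappa_{0}^{\prime})^{t_{0}^{\prime}}\sqrt{\frac{K(p+\log K)}{n_{S}}}\asymp K^{-1/2}\sqrt{\frac{K(p+\log K)}{n_{S}}}=\sqrt{\frac{p+\log K}{n_{S}}}\leq\sqrt{\frac{p}{n_{S}}}+\sqrt{\frac{K\log K}{n_{S}}},

which is why this term is absorbed into the first and last terms of (S.5.251).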

Consider an event series $\{\mathcal{E}_{t}\}_{t=t_{0}}^{t_{0}+t_{0}^{\prime}}$, each of which is defined to be the intersection of the events in Lemmas 11, 12, 13, and 14, with $\xi^{(k)}=\xi^{(k)}_{t}$; each satisfies $\mathbb{P}(\mathcal{E}_{t})\geq 1-C^{\prime}(K^{-2}+K^{-2}e^{-C^{\prime\prime}p})$, hence $\mathbb{P}(\bigcap_{t=t_{0}}^{t_{0}+t_{0}^{\prime}}\mathcal{E}_{t})\geq 1-C^{\prime}t_{0}^{\prime}(K^{-2}+K^{-2}e^{-C^{\prime\prime}p})\geq 1-C^{\prime\prime}(K^{-2}+K^{-2}e^{-C^{\prime\prime}p})\log K\geq 1-C^{\prime}K^{-1}$. In the following, we condition on $\mathcal{E}\cap(\bigcap_{t=t_{0}}^{t_{0}+t_{0}^{\prime}}\mathcal{E}_{t})$, so all the arguments hold with probability at least $1-C^{\prime}K^{-1}$. Then, for $t=(t_{0}+1):(t_{0}+t_{0}^{\prime})$, (S.5.244) holds, which leads to

d(𝜽^(k)[t0+t],𝜽(k))\displaystyle d(\widehat{\bm{\theta}}^{(k)[t_{0}+t^{\prime}]},\bm{\theta}^{(k)*}) κ0ξ(k)t0+t1+CpnS+Chp+logKnk+Cϵp+logKmaxk=1:Knk+ClogKnk\displaystyle\leq\kappa_{0}^{\prime}\xi^{(k)}_{t_{0}+t^{\prime}-1}+C\sqrt{\frac{p}{n_{S}}}+Ch\wedge\sqrt{\frac{p+\log K}{n_{k}}}+C\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}}+C\sqrt{\frac{\log K}{n_{k}}} (S.5.252)
C(κ0)tK(p+logK)nS+CpnS+Chp+logKnk+Cϵp+logKmaxk=1:Knk\displaystyle\leq C(\kappa_{0}^{\prime})^{t^{\prime}}\sqrt{\frac{K(p+\log K)}{n_{S}}}+C\sqrt{\frac{p}{n_{S}}}+Ch\wedge\sqrt{\frac{p+\log K}{n_{k}}}+C\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}} (S.5.253)
+ClogKnk\displaystyle\quad+C\sqrt{\frac{\log K}{n_{k}}} (S.5.254)
C(κ0)tt02κt0+CpnS+Chp+logKnk+Cϵp+logKmaxk=1:Knk\displaystyle\leq C^{\prime}(\kappa_{0}^{\prime})^{t^{\prime}}\cdot t_{0}^{2}\kappa^{t_{0}}+C\sqrt{\frac{p}{n_{S}}}+Ch\wedge\sqrt{\frac{p+\log K}{n_{k}}}+C\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}} (S.5.255)
+ClogKnk\displaystyle\quad+C\sqrt{\frac{\log K}{n_{k}}} (S.5.256)
(t0+t)2(κκ0)t0+t+CpnS+Chp+logKnk+Cϵp+logKmaxk=1:Knk\displaystyle\leq(t_{0}+t^{\prime})^{2}(\kappa\vee\kappa_{0}^{\prime})^{t_{0}+t^{\prime}}+C\sqrt{\frac{p}{n_{S}}}+Ch\wedge\sqrt{\frac{p+\log K}{n_{k}}}+C\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}} (S.5.257)
+ClogKnk,\displaystyle\quad+C\sqrt{\frac{\log K}{n_{k}}}, (S.5.258)

where t=1,,t0t^{\prime}=1,\ldots,t_{0}^{\prime}, which provides the desired rate for t=(t0+1):(t0+t0)t=(t_{0}+1):(t_{0}+t_{0}^{\prime}). When tt0t^{\prime}\geq t_{0}^{\prime}, by (S.5.251), we have

d(𝜽^(k)[t0+t],𝜽(k))\displaystyle d(\widehat{\bm{\theta}}^{(k)[t_{0}+t^{\prime}]},\bm{\theta}^{(k)*}) (κ0)tt0ξ(k)t0+t0+CpnS+Chp+logKnk+Cϵp+logKmaxk=1:Knk+ClogKnk\displaystyle\leq(\kappa_{0}^{\prime})^{t^{\prime}-t_{0}^{\prime}}\xi^{(k)}_{t_{0}+t_{0}^{\prime}}+C\sqrt{\frac{p}{n_{S}}}+Ch\wedge\sqrt{\frac{p+\log K}{n_{k}}}+C\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}}+C\sqrt{\frac{\log K}{n_{k}}} (S.5.259)
CpnS+Chp+logKnk+Cϵp+logKmaxk=1:Knk+ClogKnk,\displaystyle\leq C\sqrt{\frac{p}{n_{S}}}+Ch\wedge\sqrt{\frac{p+\log K}{n_{k}}}+C\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}}+C\sqrt{\frac{\log K}{n_{k}}}, (S.5.260)

which is the desired rate. This completes the proof of Theorem 1.

S.5.2.3 Proofs of lemmas

Proof of Lemma 11.

We prove part (i) first.

Denote $W=\sup_{\begin{subarray}{c}\bm{\theta}^{(k)}\in B_{\text{con}}\\ \|\bm{\beta}^{(k)}-\bm{\beta}^{(k)*}\|_{2}\leq\xi^{(k)}\end{subarray}}\left|\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})-\mathbb{E}[\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)})]\right|$. Since $\gamma_{\bm{\theta}^{(k)}}(\cdot)\in(0,1)$, changing a single $\bm{z}^{(k)}_{i}$ changes $W$ by at most $1/n_{k}$; hence, by the bounded difference inequality,

W𝔼W+ClogKnk,W\leq\mathbb{E}W+C\sqrt{\frac{\log K}{n_{k}}}, (S.5.261)

with probability at least 1CK21-C^{\prime}K^{-2}. By the generalized symmetrization inequality (Proposition 4.11 in \citeappwainwright2019high), with i.i.d. Rademacher variables {ϵ(k)i}i=1nk\{\epsilon^{(k)}_{i}\}_{i=1}^{n_{k}},

𝔼W\displaystyle\mathbb{E}W 2nk𝔼𝒛𝔼ϵ[sup𝜽(k)Bcon𝜷(k)𝜷(k)2ξ(k)|i=1nkγ𝜽(k)(𝒛(k)i)ϵ(k)i|]\displaystyle\leq\frac{2}{n_{k}}\mathbb{E}_{\bm{z}}\mathbb{E}_{\bm{\epsilon}}\left[\sup_{\begin{subarray}{c}\bm{\theta}^{(k)}\in B_{\text{con}}\\ \|\bm{\beta}^{(k)}-\bm{\beta}^{(k)*}\|_{2}\leq\xi^{(k)}\end{subarray}}\left|\sum_{i=1}^{n_{k}}\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})\epsilon^{(k)}_{i}\right|\right] (S.5.262)
2nk𝔼𝒛𝔼ϵ[sup𝜽(k)Bcon𝜷(k)𝜷(k)2ξ(k)|i=1nkw(k)exp{C𝜽(k)(𝒛(k)i)}1w(k)+w(k)exp{C𝜽(k)(𝒛(k)i)}ϵ(k)i|]\displaystyle\leq\frac{2}{n_{k}}\mathbb{E}_{\bm{z}}\mathbb{E}_{\bm{\epsilon}}\left[\sup_{\begin{subarray}{c}\bm{\theta}^{(k)}\in B_{\text{con}}\\ \|\bm{\beta}^{(k)}-\bm{\beta}^{(k)*}\|_{2}\leq\xi^{(k)}\end{subarray}}\left|\sum_{i=1}^{n_{k}}\frac{w^{(k)}\exp\{C_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})\}}{1-w^{(k)}+w^{(k)}\exp\{C_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})\}}\epsilon^{(k)}_{i}\right|\right] (S.5.263)
2nk𝔼𝒛𝔼ϵ[sup𝜽(k)Bcon𝜷(k)𝜷(k)2ξ(k)|i=1nk11+exp{C𝜽(k)(𝒛(k)i)log((w(k))11)}ϵ(k)i|],\displaystyle\leq\frac{2}{n_{k}}\mathbb{E}_{\bm{z}}\mathbb{E}_{\bm{\epsilon}}\left[\sup_{\begin{subarray}{c}\bm{\theta}^{(k)}\in B_{\text{con}}\\ \|\bm{\beta}^{(k)}-\bm{\beta}^{(k)*}\|_{2}\leq\xi^{(k)}\end{subarray}}\left|\sum_{i=1}^{n_{k}}\frac{1}{1+\exp\{C_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})-\log((w^{(k)})^{-1}-1)\}}\epsilon^{(k)}_{i}\right|\right], (S.5.264)

where $C_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})=(\bm{\beta}^{(k)})^{\top}\bm{z}^{(k)}_{i}-\delta^{(k)}$. Denote $\bm{\mu}^{(k)*}=(1-w^{(k)*})\bm{\mu}^{(k)*}_{1}+w^{(k)*}\bm{\mu}^{(k)*}_{2}=\mathbb{E}[\bm{z}^{(k)}_{i}]$. By the contraction inequality for Rademacher variables (Theorem 11.6 in \citealpappboucheron2013concentration),

RHS of (S.5.264) \leq\frac{C}{n_{k}}\mathbb{E}_{\bm{z}}\mathbb{E}_{\bm{\epsilon}}\left[\sup_{\begin{subarray}{c}\bm{\theta}^{(k)}\in B_{\text{con}}\\ \|\bm{\beta}^{(k)}-\bm{\beta}^{(k)*}\|_{2}\leq\xi^{(k)}\end{subarray}}\left|\sum_{i=1}^{n_{k}}\big{[}C_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})-\log((w^{(k)})^{-1}-1)\big{]}\epsilon^{(k)}_{i}\right|\right] (S.5.266)
Cnk𝔼𝒛𝔼ϵ[sup𝜽(k)Bcon𝜷(k)𝜷(k)2ξ(k)|i=1nk(𝜷(k))(𝒛(k)i𝝁(k))ϵ(k)i|]\displaystyle\leq\frac{C}{n_{k}}\mathbb{E}_{\bm{z}}\mathbb{E}_{\bm{\epsilon}}\left[\sup_{\begin{subarray}{c}\bm{\theta}^{(k)}\in B_{\text{con}}\\ \|\bm{\beta}^{(k)}-\bm{\beta}^{(k)*}\|_{2}\leq\xi^{(k)}\end{subarray}}\left|\sum_{i=1}^{n_{k}}(\bm{\beta}^{(k)})^{\top}(\bm{z}^{(k)}_{i}-\bm{\mu}^{(k)*})\cdot\epsilon^{(k)}_{i}\right|\right] (S.5.267)
+Cnk𝔼𝒛𝔼ϵ[sup𝜽(k)Bcon𝜷(k)𝜷(k)2ξ(k)|i=1nk(𝜷(k))𝝁(k)ϵ(k)i|]\displaystyle\quad+\frac{C}{n_{k}}\mathbb{E}_{\bm{z}}\mathbb{E}_{\bm{\epsilon}}\left[\sup_{\begin{subarray}{c}\bm{\theta}^{(k)}\in B_{\text{con}}\\ \|\bm{\beta}^{(k)}-\bm{\beta}^{(k)*}\|_{2}\leq\xi^{(k)}\end{subarray}}\left|\sum_{i=1}^{n_{k}}(\bm{\beta}^{(k)})^{\top}\bm{\mu}^{(k)*}\cdot\epsilon^{(k)}_{i}\right|\right] (S.5.268)
+Cnk𝔼ϵ[sup𝜽(k)Bcon𝜷(k)𝜷(k)2ξ(k)|log((w(k))11)||i=1nkϵ(k)i|]\displaystyle\quad+\frac{C}{n_{k}}\mathbb{E}_{\bm{\epsilon}}\left[\sup_{\begin{subarray}{c}\bm{\theta}^{(k)}\in B_{\text{con}}\\ \|\bm{\beta}^{(k)}-\bm{\beta}^{(k)*}\|_{2}\leq\xi^{(k)}\end{subarray}}\left|\log((w^{(k)})^{-1}-1)\right|\cdot\left|\sum_{i=1}^{n_{k}}\epsilon^{(k)}_{i}\right|\right] (S.5.269)
Cnk𝔼𝒛𝔼ϵ[sup𝜷(k)𝜷(k)2ξ(k)|i=1nk(𝜷(k)𝜷(k))(𝒛(k)i𝝁(k))ϵ(k)i|]\displaystyle\leq\frac{C}{n_{k}}\mathbb{E}_{\bm{z}}\mathbb{E}_{\bm{\epsilon}}\left[\sup_{\|\bm{\beta}^{(k)}-\bm{\beta}^{(k)*}\|_{2}\leq\xi^{(k)}}\left|\sum_{i=1}^{n_{k}}(\bm{\beta}^{(k)}-\bm{\beta}^{(k)*})^{\top}(\bm{z}^{(k)}_{i}-\bm{\mu}^{(k)*})\cdot\epsilon^{(k)}_{i}\right|\right] (S.5.270)
+Cnk𝔼𝒛𝔼ϵ|i=1nk(𝜷(k))(𝒛(k)i𝝁(k))ϵ(k)i|\displaystyle\quad+\frac{C}{n_{k}}\mathbb{E}_{\bm{z}}\mathbb{E}_{\bm{\epsilon}}\left|\sum_{i=1}^{n_{k}}(\bm{\beta}^{(k)*})^{\top}(\bm{z}^{(k)}_{i}-\bm{\mu}^{(k)*})\cdot\epsilon^{(k)}_{i}\right| (S.5.271)
+Cnk𝔼ϵ|i=1nkϵ(k)i|+Cnk𝔼𝒛𝔼ϵ[sup𝜽(k)Bcon𝜷(k)𝜷(k)2ξ(k)|i=1nk(𝜷(k))𝝁(k)ϵ(k)i|].\displaystyle\quad+\frac{C}{n_{k}}\mathbb{E}_{\bm{\epsilon}}\left|\sum_{i=1}^{n_{k}}\epsilon^{(k)}_{i}\right|+\frac{C}{n_{k}}\mathbb{E}_{\bm{z}}\mathbb{E}_{\bm{\epsilon}}\left[\sup_{\begin{subarray}{c}\bm{\theta}^{(k)}\in B_{\text{con}}\\ \|\bm{\beta}^{(k)}-\bm{\beta}^{(k)*}\|_{2}\leq\xi^{(k)}\end{subarray}}\left|\sum_{i=1}^{n_{k}}(\bm{\beta}^{(k)})^{\top}\bm{\mu}^{(k)*}\cdot\epsilon^{(k)}_{i}\right|\right]. (S.5.272)

Since $\{\epsilon^{(k)}_{i}\}_{i=1}^{n_{k}}$ and $\{(\bm{\beta}^{(k)*})^{\top}(\bm{z}^{(k)}_{i}-\bm{\mu}^{(k)*})\epsilon^{(k)}_{i}\}_{i=1}^{n_{k}}$ are i.i.d. sub-Gaussian variables, we know that

Cnk𝔼𝒛𝔼ϵ|i=1nk(𝜷(k))(𝒛(k)i𝝁(k))ϵ(k)i|+Cnk𝔼ϵ|i=1nkϵ(k)i|1nk.\frac{C}{n_{k}}\mathbb{E}_{\bm{z}}\mathbb{E}_{\bm{\epsilon}}\left|\sum_{i=1}^{n_{k}}(\bm{\beta}^{(k)*})^{\top}(\bm{z}^{(k)}_{i}-\bm{\mu}^{(k)*})\cdot\epsilon^{(k)}_{i}\right|+\frac{C}{n_{k}}\mathbb{E}_{\bm{\epsilon}}\left|\sum_{i=1}^{n_{k}}\epsilon^{(k)}_{i}\right|\lesssim\sqrt{\frac{1}{n_{k}}}. (S.5.273)
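For instance, a minimal worked bound for the second term in the display above (added only for readability) is

\frac{C}{n_{k}}\mathbb{E}_{\bm{\epsilon}}\left|\sum_{i=1}^{n_{k}}\epsilon^{(k)}_{i}\right|\leq\frac{C}{n_{k}}\left(\mathbb{E}_{\bm{\epsilon}}\Big(\sum_{i=1}^{n_{k}}\epsilon^{(k)}_{i}\Big)^{2}\right)^{1/2}=\frac{C}{n_{k}}\sqrt{n_{k}}=\frac{C}{\sqrt{n_{k}}},

and the first term is handled in the same way, since $(\bm{\beta}^{(k)*})^{\top}(\bm{z}^{(k)}_{i}-\bm{\mu}^{(k)*})\epsilon^{(k)}_{i}$ is centered with variance of constant order.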

Suppose {𝒖j}j=1N\{\bm{u}_{j}\}_{j=1}^{N} is a 1/21/2-cover of p{𝒖p:𝒖21}\mathcal{B}^{p}\coloneqq\{\bm{u}\in\mathbb{R}^{p}:\|\bm{u}\|_{2}\leq 1\} with N=5pN=5^{p} (see Example 5.8 in \citealpappwainwright2019high). Hence by standard arguments,

Cnk𝔼𝒛𝔼ϵ[sup𝜷(k)𝜷(k)2ξ(k)|i=1nk(𝜷(k)𝜷(k))(𝒛(k)i𝝁(k))ϵ(k)i|]\displaystyle\frac{C}{n_{k}}\mathbb{E}_{\bm{z}}\mathbb{E}_{\bm{\epsilon}}\left[\sup_{\|\bm{\beta}^{(k)}-\bm{\beta}^{(k)*}\|_{2}\leq\xi^{(k)}}\left|\sum_{i=1}^{n_{k}}(\bm{\beta}^{(k)}-\bm{\beta}^{(k)*})^{\top}(\bm{z}^{(k)}_{i}-\bm{\mu}^{(k)*})\cdot\epsilon^{(k)}_{i}\right|\right] (S.5.274)
ξ(k)nk𝔼𝒛𝔼ϵ[supj=1:N|i=1nk𝒖j(𝒛(k)i𝝁(k))ϵ(k)i|].\displaystyle\lesssim\frac{\xi^{(k)}}{n_{k}}\mathbb{E}_{\bm{z}}\mathbb{E}_{\bm{\epsilon}}\left[\sup_{j=1:N}\left|\sum_{i=1}^{n_{k}}\bm{u}_{j}^{\top}(\bm{z}^{(k)}_{i}-\bm{\mu}^{(k)*})\cdot\epsilon^{(k)}_{i}\right|\right]. (S.5.275)

Again, since $\{\bm{u}_{j}^{\top}(\bm{z}^{(k)}_{i}-\bm{\mu}^{(k)*})\epsilon^{(k)}_{i}\}_{i=1}^{n_{k}}$ are i.i.d. sub-Gaussian variables,

\frac{1}{n_{k}}\mathbb{E}_{\bm{z}}\mathbb{E}_{\bm{\epsilon}}\left[\sup_{j=1:N}\left|\sum_{i=1}^{n_{k}}\bm{u}_{j}^{\top}(\bm{z}^{(k)}_{i}-\bm{\mu}^{(k)*})\cdot\epsilon^{(k)}_{i}\right|\right]\lesssim\sqrt{\frac{\log N}{n_{k}}}\asymp\sqrt{\frac{p}{n_{k}}}. (S.5.276)
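The "standard arguments" here amount to the usual maximal inequality for finitely many sub-Gaussian variables: each inner sum $\sum_{i=1}^{n_{k}}\bm{u}_{j}^{\top}(\bm{z}^{(k)}_{i}-\bm{\mu}^{(k)*})\epsilon^{(k)}_{i}$ is sub-Gaussian with parameter of order $\sqrt{n_{k}}$, so

\mathbb{E}_{\bm{z}}\mathbb{E}_{\bm{\epsilon}}\left[\sup_{j=1:N}\left|\sum_{i=1}^{n_{k}}\bm{u}_{j}^{\top}(\bm{z}^{(k)}_{i}-\bm{\mu}^{(k)*})\epsilon^{(k)}_{i}\right|\right]\lesssim\sqrt{n_{k}\log N}\asymp\sqrt{n_{k}p},

and dividing by $n_{k}$ (with $\log N=p\log 5$) yields the display above.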

Putting all the pieces together,

𝔼Wξ(k)pnk+1nk.\mathbb{E}W\lesssim\xi^{(k)}\sqrt{\frac{p}{n_{k}}}+\sqrt{\frac{1}{n_{k}}}. (S.5.277)

Combining (S.5.261) and (S.5.277), we get the result in (i).

Next, we derive part (ii) using a similar analysis.

Denote W=sup{𝜽(k)}kSBconJ,2sup|w~k|11nS|kSw~ki=1nk[γ𝜽(k)(𝒛(k)i)𝔼[γ𝜽(k)(𝒛(k))]]|W^{\prime}=\sup_{\{\bm{\theta}^{(k)}\}_{k\in S}\in B_{\text{con}}^{J,2}}\sup_{|\widetilde{w}_{k}|\leq 1}\frac{1}{n_{S}}\left|\sum_{k\in S}\widetilde{w}_{k}\sum_{i=1}^{n_{k}}\Big{[}\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})-\mathbb{E}[\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)})]\Big{]}\right|.

By symmetrization and contraction arguments similar to those used in part (i), with i.i.d. Rademacher variables $\{\epsilon^{(k)}_{i}\}_{i=1}^{n_{k}}$, for any $\lambda>0$ we have

𝔼exp{λW}\displaystyle\mathbb{E}\exp\{\lambda W^{\prime}\} (S.5.278)
C𝔼𝒛𝔼ϵexp{2λnSsup{𝜽(k)}kSBconJ,2sup|w~k|1|kSw~ki=1nkγ𝜽(k)(𝒛(k)i)ϵ(k)i|}\displaystyle\leq C\mathbb{E}_{\bm{z}}\mathbb{E}_{\bm{\epsilon}}\exp\left\{\frac{2\lambda}{n_{S}}\sup_{\{\bm{\theta}^{(k)}\}_{k\in S}\in B_{\text{con}}^{J,2}}\sup_{|\widetilde{w}_{k}|\leq 1}\left|\sum_{k\in S}\widetilde{w}_{k}\sum_{i=1}^{n_{k}}\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})\epsilon^{(k)}_{i}\right|\right\} (S.5.279)
C𝔼𝒛𝔼ϵexp{4λnSsup{𝜽(k)}kSBconJ,2supw~k=±1/2|kSw~ki=1nkγ𝜽(k)(𝒛(k)i)ϵ(k)i|}\displaystyle\leq C\mathbb{E}_{\bm{z}}\mathbb{E}_{\bm{\epsilon}}\exp\left\{\frac{4\lambda}{n_{S}}\sup_{\{\bm{\theta}^{(k)}\}_{k\in S}\in B_{\text{con}}^{J,2}}\sup_{\widetilde{w}_{k}=\pm 1/2}\left|\sum_{k\in S}\widetilde{w}_{k}\sum_{i=1}^{n_{k}}\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})\epsilon^{(k)}_{i}\right|\right\} (S.5.280)
Cw~k=±1/2𝔼𝒛𝔼ϵexp{4λnSsup{𝜽(k)}kSBconJ,2|kSw~ki=1nkγ𝜽(k)(𝒛(k)i)ϵ(k)i|}\displaystyle\leq C\sum_{\widetilde{w}_{k}=\pm 1/2}\mathbb{E}_{\bm{z}}\mathbb{E}_{\bm{\epsilon}}\exp\left\{\frac{4\lambda}{n_{S}}\sup_{\{\bm{\theta}^{(k)}\}_{k\in S}\in B_{\text{con}}^{J,2}}\left|\sum_{k\in S}\widetilde{w}_{k}\sum_{i=1}^{n_{k}}\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})\epsilon^{(k)}_{i}\right|\right\} (S.5.281)
Cw~k=±1/2𝔼𝒛𝔼ϵexp{4λnSsup{𝜽(k)}kSBconJ,2|kSw~ki=1nk11+exp{C𝜽(k)(𝒛(k)i)log((w(k))11)}ϵ(k)i|}\displaystyle\leq C\sum_{\widetilde{w}_{k}=\pm 1/2}\mathbb{E}_{\bm{z}}\mathbb{E}_{\bm{\epsilon}}\exp\left\{\frac{4\lambda}{n_{S}}\sup_{\{\bm{\theta}^{(k)}\}_{k\in S}\in B_{\text{con}}^{J,2}}\left|\sum_{k\in S}\widetilde{w}_{k}\sum_{i=1}^{n_{k}}\frac{1}{1+\exp\{C_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})-\log((w^{(k)})^{-1}-1)\}}\epsilon^{(k)}_{i}\right|\right\} (S.5.282)
Cw~k=±1/2𝔼𝒛𝔼ϵexp{4λnSsup{𝜽(k)}kSBconJ,2|kSw~ki=1nk[C𝜽(k)(𝒛(k)i)log((w(k))11)]ϵ(k)i|},\displaystyle\leq C\sum_{\widetilde{w}_{k}=\pm 1/2}\mathbb{E}_{\bm{z}}\mathbb{E}_{\bm{\epsilon}}\exp\left\{\frac{4\lambda}{n_{S}}\sup_{\{\bm{\theta}^{(k)}\}_{k\in S}\in B_{\text{con}}^{J,2}}\left|\sum_{k\in S}\widetilde{w}_{k}\sum_{i=1}^{n_{k}}\big{[}C_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})-\log((w^{(k)})^{-1}-1)\big{]}\epsilon^{(k)}_{i}\right|\right\}, (S.5.283)

where $C_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})=(\bm{\beta}^{(k)})^{\top}\bm{z}^{(k)}_{i}-\delta^{(k)}$. Denote $\bm{\mu}^{(k)*}=(1-w^{(k)*})\bm{\mu}^{(k)*}_{1}+w^{(k)*}\bm{\mu}^{(k)*}_{2}=\mathbb{E}[\bm{z}^{(k)}_{i}]$. Suppose $\{\bm{u}_{j}\}_{j=1}^{N}$ is a $1/2$-cover of $\mathcal{B}^{p}\coloneqq\{\bm{u}\in\mathbb{R}^{p}:\|\bm{u}\|_{2}\leq 1\}$ with $N=5^{p}$. Then by the Cauchy-Schwarz inequality and standard arguments,

𝔼𝒛,ϵexp{4λnSsup{𝜽(k)}kSBconJ,2|kSw~ki=1nk[C𝜽(k)(𝒛(k)i)log((w(k))11)]ϵ(k)i|}\displaystyle\mathbb{E}_{\bm{z},\bm{\epsilon}}\exp\left\{\frac{4\lambda}{n_{S}}\sup_{\{\bm{\theta}^{(k)}\}_{k\in S}\in B_{\text{con}}^{J,2}}\left|\sum_{k\in S}\widetilde{w}_{k}\sum_{i=1}^{n_{k}}\big{[}C_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})-\log((w^{(k)})^{-1}-1)\big{]}\epsilon^{(k)}_{i}\right|\right\} (S.5.284)
[𝔼𝒛,ϵexp{CλnSsup𝜷2U|kSi=1nkw~k𝜷(𝒛(k)i𝝁(k))ϵ(k)i|}]1/3\displaystyle\lesssim\left[\mathbb{E}_{\bm{z},\bm{\epsilon}}\exp\left\{\frac{C\lambda}{n_{S}}\sup_{\|\bm{\beta}\|_{2}\leq U}\left|\sum_{k\in S}\sum_{i=1}^{n_{k}}\widetilde{w}_{k}\bm{\beta}^{\top}(\bm{z}^{(k)}_{i}-\bm{\mu}^{(k)*})\epsilon^{(k)}_{i}\right|\right\}\right]^{1/3} (S.5.285)
+[𝔼ϵexp{CλnSsup𝜷2U|kSi=1nkw~k𝜷𝝁(k)ϵ(k)i|}]1/3\displaystyle\quad+\left[\mathbb{E}_{\bm{\epsilon}}\exp\left\{\frac{C\lambda}{n_{S}}\sup_{\|\bm{\beta}\|_{2}\leq U}\left|\sum_{k\in S}\sum_{i=1}^{n_{k}}\widetilde{w}_{k}\bm{\beta}^{\top}\bm{\mu}^{(k)*}\epsilon^{(k)}_{i}\right|\right\}\right]^{1/3} (S.5.286)
+[𝔼ϵexp{CλnSsupcw/2w(k)1cw/2|kSi=1nkw~klog((w(k))11)ϵ(k)i|}]1/3\displaystyle\quad+\left[\mathbb{E}_{\bm{\epsilon}}\exp\left\{\frac{C\lambda}{n_{S}}\sup_{c_{w}/2\leq w^{(k)}\leq 1-c_{w}/2}\left|\sum_{k\in S}\sum_{i=1}^{n_{k}}\widetilde{w}_{k}\log((w^{(k)})^{-1}-1)\epsilon^{(k)}_{i}\right|\right\}\right]^{1/3} (S.5.287)
[𝔼𝒛,ϵexp{CλnSsupj=1:N|kSi=1nkw~k𝒖j(𝒛(k)i𝝁(k))ϵ(k)i|}]1/3\displaystyle\lesssim\left[\mathbb{E}_{\bm{z},\bm{\epsilon}}\exp\left\{\frac{C\lambda}{n_{S}}\sup_{j=1:N}\left|\sum_{k\in S}\sum_{i=1}^{n_{k}}\widetilde{w}_{k}\bm{u}_{j}^{\top}(\bm{z}^{(k)}_{i}-\bm{\mu}^{(k)*})\epsilon^{(k)}_{i}\right|\right\}\right]^{1/3} (S.5.288)
+[𝔼ϵexp{CλnSsupj=1:N|kSi=1nkw~k𝒖j𝝁(k)ϵ(k)i|}]1/3\displaystyle\quad+\left[\mathbb{E}_{\bm{\epsilon}}\exp\left\{\frac{C\lambda}{n_{S}}\sup_{j=1:N}\left|\sum_{k\in S}\sum_{i=1}^{n_{k}}\widetilde{w}_{k}\bm{u}_{j}^{\top}\bm{\mu}^{(k)*}\epsilon^{(k)}_{i}\right|\right\}\right]^{1/3} (S.5.289)
+[𝔼ϵexp{CλnSsupcw/2w(k)1cw/2|kSi=1nkw~klog((w(k))11)ϵ(k)i|}]1/3\displaystyle\quad+\left[\mathbb{E}_{\bm{\epsilon}}\exp\left\{\frac{C\lambda}{n_{S}}\sup_{c_{w}/2\leq w^{(k)}\leq 1-c_{w}/2}\left|\sum_{k\in S}\sum_{i=1}^{n_{k}}\widetilde{w}_{k}\log((w^{(k)})^{-1}-1)\epsilon^{(k)}_{i}\right|\right\}\right]^{1/3} (S.5.290)
j=1N[𝔼𝒛,ϵexp{CλnS|kSi=1nkw~k𝒖j(𝒛(k)i𝝁(k))ϵ(k)i|}]1/3[1]\displaystyle\lesssim\underbrace{\sum_{j=1}^{N}\left[\mathbb{E}_{\bm{z},\bm{\epsilon}}\exp\left\{\frac{C\lambda}{n_{S}}\left|\sum_{k\in S}\sum_{i=1}^{n_{k}}\widetilde{w}_{k}\bm{u}_{j}^{\top}(\bm{z}^{(k)}_{i}-\bm{\mu}^{(k)*})\epsilon^{(k)}_{i}\right|\right\}\right]^{1/3}}_{[1]} (S.5.291)
+j=1N[𝔼ϵexp{CλnS|kSi=1nkw~k𝒖j𝝁(k)ϵ(k)i|}]1/3[2]\displaystyle\quad+\underbrace{\sum_{j=1}^{N}\left[\mathbb{E}_{\bm{\epsilon}}\exp\left\{\frac{C\lambda}{n_{S}}\left|\sum_{k\in S}\sum_{i=1}^{n_{k}}\widetilde{w}_{k}\bm{u}_{j}^{\top}\bm{\mu}^{(k)*}\epsilon^{(k)}_{i}\right|\right\}\right]^{1/3}}_{[2]} (S.5.292)
+[𝔼ϵexp{CλnSkS|i=1nkw~kϵ(k)i|}]1/3[3]\displaystyle\quad+\underbrace{\left[\mathbb{E}_{\bm{\epsilon}}\exp\left\{\frac{C\lambda}{n_{S}}\sum_{k\in S}\left|\sum_{i=1}^{n_{k}}\widetilde{w}_{k}\epsilon^{(k)}_{i}\right|\right\}\right]^{1/3}}_{[3]} (S.5.293)

Since {w~k𝒖j(𝒛(k)i𝝁(k))ϵ(k)i}i,j\{\widetilde{w}_{k}\bm{u}_{j}^{\top}(\bm{z}^{(k)}_{i}-\bm{\mu}^{(k)*})\epsilon^{(k)}_{i}\}_{i,j}, {w~k(𝜷(k))𝝁(k)ϵ(k)i}i,k\{\widetilde{w}_{k}(\bm{\beta}^{(k)})^{\top}\bm{\mu}^{(k)*}\epsilon^{(k)}_{i}\}_{i,k}, and {w~kϵ(k)i}i=1nk\{\widetilde{w}_{k}\epsilon^{(k)}_{i}\}_{i=1}^{n_{k}} are independent sub-Gaussian variables, we can bound the three terms on the RHS as

[1]\displaystyle[1] 5pexp{Cλ2nS},\displaystyle\lesssim 5^{p}\cdot\exp\left\{\frac{C\lambda^{2}}{n_{S}}\right\}, (S.5.294)
[2]\displaystyle[2] 5pexp{Cλ2nS},\displaystyle\lesssim 5^{p}\cdot\exp\left\{\frac{C\lambda^{2}}{n_{S}}\right\}, (S.5.295)
[3]\displaystyle[3] [kS𝔼ϵexp{CλnS|i=1nkw~kϵ(k)i|}]1/3[kSexp{Cλ2nS2nk}]1/3exp{Cλ2nS}.\displaystyle\leq\left[\prod_{k\in S}\mathbb{E}_{\bm{\epsilon}}\exp\left\{\frac{C\lambda}{n_{S}}\left|\sum_{i=1}^{n_{k}}\widetilde{w}_{k}\epsilon^{(k)}_{i}\right|\right\}\right]^{1/3}\lesssim\left[\prod_{k\in S}\exp\left\{C\frac{\lambda^{2}}{n_{S}^{2}}n_{k}\right\}\right]^{1/3}\lesssim\exp\left\{\frac{C\lambda^{2}}{n_{S}}\right\}. (S.5.296)

Putting all pieces together,

𝔼exp{λW}C2K5pexp{Cλ2nS}=exp{Cλ2nS+C(K+p)}.\displaystyle\mathbb{E}\exp\{\lambda W^{\prime}\}\leq C^{\prime}2^{K}5^{p}\cdot\exp\left\{\frac{C\lambda^{2}}{n_{S}}\right\}=\exp\left\{C\frac{\lambda^{2}}{n_{S}}+C^{\prime\prime}(K+p)\right\}. (S.5.297)

Therefore, for any $\delta>0$,

(Wδ)eλδ𝔼exp{λW}exp{Cλ2nS+C(K+p)λδ}.\mathbb{P}(W^{\prime}\geq\delta)\leq e^{-\lambda\delta}\mathbb{E}\exp\{\lambda W^{\prime}\}\leq\exp\left\{C\frac{\lambda^{2}}{n_{S}}+C^{\prime\prime}(K+p)-\lambda\delta\right\}. (S.5.298)

Taking $\lambda=\frac{n_{S}}{2C}\delta$ and $\delta=4\sqrt{\frac{CC^{\prime\prime}(K+p)}{n_{S}}}$, we have

(Wδ)exp{nS4Cδ2+C(K+p)}=exp{3C(K+p)}CK2exp{3Cp},\mathbb{P}(W^{\prime}\geq\delta)\leq\exp\left\{-\frac{n_{S}}{4C}\delta^{2}+C^{\prime\prime}(K+p)\right\}=\exp\{-3C^{\prime\prime}(K+p)\}\leq C^{\prime}K^{-2}\exp\{-3C^{\prime\prime}p\}, (S.5.299)

which completes the proof. ∎

Proof of Lemma 12.

The proof of part (ii) is the same as the proof of part (ii) of Lemma 11, so we omit it. The only difference between the proofs of part (i) for the two lemmas is that here the bounded difference inequality is not available. Denote

W=sup𝜽(k)Bcon𝜷(k)𝜷(k)2ξ(k)|1nki=1nk[1γ𝜽(k)(𝒛(k)i)](𝒛(k)i)𝜷(k)𝔼[[1γ𝜽(k)(𝒛(k))](𝒛(k))𝜷(k)]|.W=\sup_{\begin{subarray}{c}\bm{\theta}^{(k)}\in B_{\text{con}}\\ \|\bm{\beta}^{(k)}-\bm{\beta}^{(k)*}\|_{2}\leq\xi^{(k)}\end{subarray}}\left|\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\big{[}1-\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})\big{]}(\bm{z}^{(k)}_{i})^{\top}\bm{\beta}^{(k)*}-\mathbb{E}\big{[}[1-\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)})](\bm{z}^{(k)})^{\top}\bm{\beta}^{(k)*}\big{]}\right|. (S.5.300)

We need to use Lemma 10 to upper bound $W-\mathbb{E}W$. Prior to that, we first verify the conditions required by the lemma. Fix $\bm{z}^{(k)}_{1},\ldots,\bm{z}^{(k)}_{i-1},\bm{z}^{(k)}_{i+1},\ldots,\bm{z}^{(k)}_{n_{k}}$, and define

g(k)i(𝒛(k)i)=W𝔼[W|𝒛(k)1,,𝒛(k)i1,𝒛(k)i+1,,𝒛(k)nk].g^{(k)}_{i}(\bm{z}^{(k)}_{i})=W-\mathbb{E}[W|\bm{z}^{(k)}_{1},\ldots,\bm{z}^{(k)}_{i-1},\bm{z}^{(k)}_{i+1},\ldots,\bm{z}^{(k)}_{n_{k}}]. (S.5.301)

By triangle inequality,

|g(k)i(𝒛(k)i)|\displaystyle\left|g^{(k)}_{i}(\bm{z}^{(k)}_{i})\right| (S.5.302)
=|sup𝜽(k)Bcon𝜷(k)𝜷(k)2ξ(k)|1nki=1nkγ𝜽(k)(𝒛(k)i)(𝒛(k)i)𝜷(k)𝔼[γ𝜽(k)(𝒛(k)i)(𝒛(k)i)𝜷(k)]|\displaystyle=\Bigg{|}\sup_{\begin{subarray}{c}\bm{\theta}^{(k)}\in B_{\text{con}}\\ \|\bm{\beta}^{(k)}-\bm{\beta}^{(k)*}\|_{2}\leq\xi^{(k)}\end{subarray}}\bigg{|}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})(\bm{z}^{(k)}_{i})^{\top}\bm{\beta}^{(k)*}-\mathbb{E}\big{[}\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})(\bm{z}^{(k)}_{i})^{\top}\bm{\beta}^{(k)*}\big{]}\bigg{|} (S.5.303)
𝔼[sup𝜽(k)Bcon𝜷(k)𝜷(k)2ξ(k)|1nki=1nkγ𝜽(k)(𝒛(k)i)(𝒛(k)i)𝜷(k)𝔼[γ𝜽(k)(𝒛(k)i)(𝒛(k)i)𝜷(k)]||{𝒛(k)i}ii]|\displaystyle\quad-\mathbb{E}\Bigg{[}\sup_{\begin{subarray}{c}\bm{\theta}^{(k)}\in B_{\text{con}}\\ \|\bm{\beta}^{(k)}-\bm{\beta}^{(k)*}\|_{2}\leq\xi^{(k)}\end{subarray}}\bigg{|}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})(\bm{z}^{(k)}_{i})^{\top}\bm{\beta}^{(k)*}-\mathbb{E}\big{[}\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})(\bm{z}^{(k)}_{i})^{\top}\bm{\beta}^{(k)*}\big{]}\bigg{|}\Big{|}\{\bm{z}^{(k)}_{i^{\prime}}\}_{i^{\prime}\neq i}\Bigg{]}\Bigg{|} (S.5.304)
1nksup𝜽(k)Bcon𝜷(k)𝜷(k)2ξ(k)|γ𝜽(k)(𝒛(k)i)(𝒛(k)i)𝜷(k)|W1+2nk𝔼|sup𝜽(k)Bcon𝜷(k)𝜷(k)2ξ(k)γ𝜽(k)(𝒛(k)i)(𝒛(k)i)𝜷(k)|W2.\displaystyle\leq\underbrace{\frac{1}{n_{k}}\sup_{\begin{subarray}{c}\bm{\theta}^{(k)}\in B_{\text{con}}\\ \|\bm{\beta}^{(k)}-\bm{\beta}^{(k)*}\|_{2}\leq\xi^{(k)}\end{subarray}}\bigg{|}\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})(\bm{z}^{(k)}_{i})^{\top}\bm{\beta}^{(k)*}\bigg{|}}_{W_{1}}+\underbrace{\frac{2}{n_{k}}\mathbb{E}\left|\sup_{\begin{subarray}{c}\bm{\theta}^{(k)}\in B_{\text{con}}\\ \|\bm{\beta}^{(k)}-\bm{\beta}^{(k)*}\|_{2}\leq\xi^{(k)}\end{subarray}}\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})(\bm{z}^{(k)}_{i})^{\top}\bm{\beta}^{(k)*}\right|}_{W_{2}}. (S.5.305)

Note that $[\mathbb{E}(W_{1}+W_{2})^{d}]^{1/d}\leq(\mathbb{E}W_{1}^{d})^{1/d}+(\mathbb{E}W_{2}^{d})^{1/d}$, where

(𝔼W1d)1/d\displaystyle(\mathbb{E}W_{1}^{d})^{1/d} 1nk[𝔼sup𝜽(k)Bcon[γ𝜽(k)(𝒛(k)i)]2d]1/2d[𝔼|(𝒛(k)i)𝜷(k)|2d]1/2d\displaystyle\leq\frac{1}{n_{k}}\left[\mathbb{E}\sup_{\bm{\theta}^{(k)}\in B_{\text{con}}}[\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})]^{2d}\right]^{1/2d}\left[\mathbb{E}\big{|}(\bm{z}^{(k)}_{i})^{\top}\bm{\beta}^{(k)*}\big{|}^{2d}\right]^{1/2d} (S.5.306)
1nkCd\displaystyle\leq\frac{1}{n_{k}}\cdot C\cdot\sqrt{d} (S.5.307)
(𝔼W2d)1/d\displaystyle(\mathbb{E}W_{2}^{d})^{1/d} =𝔼W1(𝔼W1d)1/d1nkCd.\displaystyle=\mathbb{E}W_{1}\leq(\mathbb{E}W_{1}^{d})^{1/d}\leq\frac{1}{n_{k}}\cdot C\cdot\sqrt{d}. (S.5.308)

Therefore $[\mathbb{E}(W_{1}+W_{2})^{d}]^{1/d}\leq\frac{C}{n_{k}}\sqrt{d}$. Hence, applying Lemma 10, we have

W𝔼W+ClogKnk,W\leq\mathbb{E}W+C\sqrt{\frac{\log K}{n_{k}}}, (S.5.309)

with probability at least 1CK21-C^{\prime}K^{-2}. ∎

Proof of Lemma 13.

For part (i), denote

W=sup𝜽(k)Bcon1nki=1nkγ𝜽(k)(𝒛(k)i)𝒛(k)i𝔼[γ𝜽(k)(𝒛(k))𝒛(k)]2.W=\sup_{\bm{\theta}^{(k)}\in B_{\text{con}}}\left\|\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})\bm{z}^{(k)}_{i}-\mathbb{E}[\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)})\bm{z}^{(k)}]\right\|_{2}. (S.5.310)

Suppose {𝒖j}j=1N\{\bm{u}_{j}\}_{j=1}^{N} is a 1/21/2-cover of p{𝒖p:𝒖21}\mathcal{B}^{p}\coloneqq\{\bm{u}\in\mathbb{R}^{p}:\|\bm{u}\|_{2}\leq 1\} with N=5pN=5^{p}. Define 𝝁(k)=(1w(k))𝝁(k)1+w(k)𝝁(k)2\bm{\mu}^{(k)*}=(1-w^{(k)*})\bm{\mu}^{(k)*}_{1}+w^{(k)*}\bm{\mu}^{(k)*}_{2}. Then by the generalized symmetrization inequality (Proposition 4.11 in \citealpappwainwright2019high), with i.i.d. Rademacher variables {ϵ(k)i}i=1nk\{\epsilon^{(k)}_{i}\}_{i=1}^{n_{k}}, for any λ\lambda\in\mathbb{R},

𝔼exp{λW}\displaystyle\mathbb{E}\exp\{\lambda W\} (S.5.311)
𝔼𝒛,ϵexp{Cλnksup𝜽(k)Bconi=1nkγ𝜽(k)(𝒛(k)i)𝒛(k)iϵ(k)i2}\displaystyle\lesssim\mathbb{E}_{\bm{z},\bm{\epsilon}}\exp\left\{\frac{C\lambda}{n_{k}}\sup_{\bm{\theta}^{(k)}\in B_{\text{con}}}\left\|\sum_{i=1}^{n_{k}}\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})\bm{z}^{(k)}_{i}\epsilon^{(k)}_{i}\right\|_{2}\right\} (S.5.312)
𝔼𝒛,ϵexp{Cλnksupj=1:Nsup𝜽(k)Bcon|i=1nkγ𝜽(k)(𝒛(k)i)(𝒛(k)i)𝒖jϵ(k)i|}\displaystyle\lesssim\mathbb{E}_{\bm{z},\bm{\epsilon}}\exp\left\{\frac{C\lambda}{n_{k}}\sup_{j=1:N}\sup_{\bm{\theta}^{(k)}\in B_{\text{con}}}\left|\sum_{i=1}^{n_{k}}\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})(\bm{z}^{(k)}_{i})^{\top}\bm{u}_{j}\cdot\epsilon^{(k)}_{i}\right|\right\} (S.5.313)
j=1N𝔼𝒛,ϵexp{Cλnksup𝜽(k)Bcon|i=1nkγ𝜽(k)(𝒛(k)i)(𝒛(k)i)𝒖jϵ(k)i|}\displaystyle\lesssim\sum_{j=1}^{N}\mathbb{E}_{\bm{z},\bm{\epsilon}}\exp\left\{\frac{C\lambda}{n_{k}}\sup_{\bm{\theta}^{(k)}\in B_{\text{con}}}\left|\sum_{i=1}^{n_{k}}\gamma_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})(\bm{z}^{(k)}_{i})^{\top}\bm{u}_{j}\cdot\epsilon^{(k)}_{i}\right|\right\} (S.5.314)
j=1N𝔼𝒛,ϵexp{Cλnksup𝜽(k)Bcon|i=1nk[C𝜽(k)(𝒛(k)i)log((w(k))11)](𝒛(k)i)𝒖jϵ(k)i|}\displaystyle\lesssim\sum_{j=1}^{N}\mathbb{E}_{\bm{z},\bm{\epsilon}}\exp\left\{\frac{C\lambda}{n_{k}}\sup_{\bm{\theta}^{(k)}\in B_{\text{con}}}\left|\sum_{i=1}^{n_{k}}\big{[}C_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})-\log((w^{(k)})^{-1}-1)\big{]}(\bm{z}^{(k)}_{i})^{\top}\bm{u}_{j}\cdot\epsilon^{(k)}_{i}\right|\right\} (S.5.315)
j=1N𝔼𝒛,ϵexp{Cλnksup𝜷(k)2U|i=1nk(𝜷(k))(𝒛(k)i𝝁(k))(𝒛(k)i)𝒖jϵ(k)i|}\displaystyle\lesssim\sum_{j=1}^{N}\mathbb{E}_{\bm{z},\bm{\epsilon}}\exp\left\{\frac{C\lambda}{n_{k}}\sup_{\|\bm{\beta}^{(k)}\|_{2}\leq U}\left|\sum_{i=1}^{n_{k}}(\bm{\beta}^{(k)})^{\top}(\bm{z}^{(k)}_{i}-\bm{\mu}^{(k)*})(\bm{z}^{(k)}_{i})^{\top}\bm{u}_{j}\cdot\epsilon^{(k)}_{i}\right|\right\} (S.5.316)
+j=1N𝔼𝒛,ϵexp{Cλnksup𝜷(k)2U|i=1nk[(𝜷(k))𝝁(k)δ(k)](𝒛(k)i)𝒖jϵ(k)i|}\displaystyle\quad+\sum_{j=1}^{N}\mathbb{E}_{\bm{z},\bm{\epsilon}}\exp\left\{\frac{C\lambda}{n_{k}}\sup_{\|\bm{\beta}^{(k)}\|_{2}\leq U}\left|\sum_{i=1}^{n_{k}}\big{[}(\bm{\beta}^{(k)})^{\top}\bm{\mu}^{(k)*}-\delta^{(k)}\big{]}(\bm{z}^{(k)}_{i})^{\top}\bm{u}_{j}\cdot\epsilon^{(k)}_{i}\right|\right\} (S.5.317)
+j=1N𝔼𝒛,ϵexp{Cλnksupcw/2w(k)1cw/2|i=1nklog((w(k))11)(𝒛(k)i)𝒖jϵ(k)i|}\displaystyle\quad+\sum_{j=1}^{N}\mathbb{E}_{\bm{z},\bm{\epsilon}}\exp\left\{\frac{C\lambda}{n_{k}}\sup_{c_{w}/2\leq w^{(k)}\leq 1-c_{w}/2}\left|\sum_{i=1}^{n_{k}}\log((w^{(k)})^{-1}-1)(\bm{z}^{(k)}_{i})^{\top}\bm{u}_{j}\cdot\epsilon^{(k)}_{i}\right|\right\} (S.5.318)
j=1Nj=1N𝔼𝒛,ϵexp{Cλnk|i=1nk𝒖j(𝒛(k)i𝝁(k))(𝒛(k)i)𝒖jϵ(k)i|}\displaystyle\lesssim\sum_{j=1}^{N}\sum_{j^{\prime}=1}^{N}\mathbb{E}_{\bm{z},\bm{\epsilon}}\exp\left\{\frac{C\lambda}{n_{k}}\left|\sum_{i=1}^{n_{k}}\bm{u}_{j^{\prime}}^{\top}(\bm{z}^{(k)}_{i}-\bm{\mu}^{(k)*})(\bm{z}^{(k)}_{i})^{\top}\bm{u}_{j}\cdot\epsilon^{(k)}_{i}\right|\right\} (S.5.319)
+j=1N𝔼𝒛,ϵexp{Cλnk|i=1nk(𝒛(k)i)𝒖jϵ(k)i|}.\displaystyle\quad+\sum_{j=1}^{N}\mathbb{E}_{\bm{z},\bm{\epsilon}}\exp\left\{\frac{C\lambda}{n_{k}}\left|\sum_{i=1}^{n_{k}}(\bm{z}^{(k)}_{i})^{\top}\bm{u}_{j}\cdot\epsilon^{(k)}_{i}\right|\right\}. (S.5.320)

Note that since {𝒖j(𝒛(k)i𝝁(k))(𝒛(k)i)𝒖jϵ(k)i}i=1nk\{\bm{u}_{j^{\prime}}^{\top}(\bm{z}^{(k)}_{i}-\bm{\mu}^{(k)*})(\bm{z}^{(k)}_{i})^{\top}\bm{u}_{j}\cdot\epsilon^{(k)}_{i}\}_{i=1}^{n_{k}} are i.i.d. sub-exponential variables and {(𝒛(k)i)𝒖jϵ(k)i}i=1nk\{(\bm{z}^{(k)}_{i})^{\top}\bm{u}_{j}\cdot\epsilon^{(k)}_{i}\}_{i=1}^{n_{k}} are i.i.d. sub-Gaussian variables, we have

𝔼𝒛,ϵexp{Cλnk|i=1nk𝒖j(𝒛(k)i𝝁(k))(𝒛(k)i)𝒖jϵ(k)i|}\displaystyle\mathbb{E}_{\bm{z},\bm{\epsilon}}\exp\left\{\frac{C\lambda}{n_{k}}\left|\sum_{i=1}^{n_{k}}\bm{u}_{j^{\prime}}^{\top}(\bm{z}^{(k)}_{i}-\bm{\mu}^{(k)*})(\bm{z}^{(k)}_{i})^{\top}\bm{u}_{j}\cdot\epsilon^{(k)}_{i}\right|\right\} exp{Cλ2nk},\displaystyle\lesssim\exp\left\{C\frac{\lambda^{2}}{n_{k}}\right\}, (S.5.321)
𝔼𝒛,ϵexp{Cλnk|i=1nk(𝒛(k)i)𝒖jϵ(k)i|}\displaystyle\mathbb{E}_{\bm{z},\bm{\epsilon}}\exp\left\{\frac{C\lambda}{n_{k}}\left|\sum_{i=1}^{n_{k}}(\bm{z}^{(k)}_{i})^{\top}\bm{u}_{j}\cdot\epsilon^{(k)}_{i}\right|\right\} exp{Cλ2nk},\displaystyle\lesssim\exp\left\{C\frac{\lambda^{2}}{n_{k}}\right\}, (S.5.322)

where the first inequality holds when $\lambda\leq C^{\prime\prime}n_{k}$ for a sufficiently small constant $C^{\prime\prime}$. Therefore,

𝔼exp{λW}exp{Cλ2nk+Cp},\mathbb{E}\exp\{\lambda W\}\lesssim\exp\left\{C\frac{\lambda^{2}}{n_{k}}+C^{\prime}p\right\}, (S.5.323)

when $\lambda\leq C^{\prime\prime}n_{k}$. The desired result follows from Chernoff's bound.
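To spell out the Chernoff step (a sketch of the routine calculation, using the same generic constants $C$, $C^{\prime}$, $C^{\prime\prime}$ as above rather than new ones): for any $\delta>0$ and admissible $\lambda\leq C^{\prime\prime}n_{k}$, $\mathbb{P}(W\geq\delta)\leq e^{-\lambda\delta}\mathbb{E}\exp\{\lambda W\}\lesssim\exp\{C\lambda^{2}/n_{k}+C^{\prime}p-\lambda\delta\}$. Taking $\lambda=\frac{n_{k}\delta}{2C}$, which is admissible once $\delta$ is bounded by a constant (e.g. when $n_{k}\gtrsim p$ and $\delta\asymp\sqrt{p/n_{k}}$), makes the exponent at most $-\frac{n_{k}\delta^{2}}{4C}+C^{\prime}p$, which is of order $-p$ once $\delta$ is a sufficiently large multiple of $\sqrt{p/n_{k}}$.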

The proofs of parts (ii) and (iii) are almost the same as the proof of part (ii) of Lemma 11, so we do not repeat them here. ∎

Proof of Lemma 14.

Note that

1nki=1nk[𝒛(k)i(𝒛(k)i)𝔼[𝒛(k)i(𝒛(k)i)]]𝜷(k)21nki=1nk[𝒛(k)i(𝒛(k)i)𝔼[𝒛(k)i(𝒛(k)i)]]2.\left\|\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\big{[}\bm{z}^{(k)}_{i}(\bm{z}^{(k)}_{i})^{\top}-\mathbb{E}[\bm{z}^{(k)}_{i}(\bm{z}^{(k)}_{i})^{\top}]\big{]}\bm{\beta}^{(k)*}\right\|_{2}\lesssim\left\|\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\big{[}\bm{z}^{(k)}_{i}(\bm{z}^{(k)}_{i})^{\top}-\mathbb{E}[\bm{z}^{(k)}_{i}(\bm{z}^{(k)}_{i})^{\top}]\big{]}\right\|_{2}. (S.5.324)

The bound on the RHS follows from Theorem 6.5 in \citeappwainwright2019high, and the bound in part (ii) can be proved in the same way. ∎
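As an illustration only (not part of the proof), the following minimal Python sketch checks the $\sqrt{p/n_{k}}$ operator-norm rate invoked above for the sample second-moment matrix of a symmetric two-component GMM; the dimension, sample sizes, and mixture parameters below are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 20
mu = np.ones(p) / np.sqrt(p)                  # placeholder component mean (unit norm)
second_moment = np.eye(p) + np.outer(mu, mu)  # E[z z^T] for z = s*mu + noise, s = +-1, noise ~ N(0, I_p)

for n in [200, 800, 3200]:
    signs = rng.choice([-1.0, 1.0], size=n)   # balanced mixture, w = 1/2
    z = signs[:, None] * mu + rng.standard_normal((n, p))
    emp = z.T @ z / n                                  # empirical second-moment matrix
    err = np.linalg.norm(emp - second_moment, 2)       # operator (spectral) norm deviation
    print(f"n={n:4d}  deviation={err:.3f}  sqrt(p/n)={np.sqrt(p / n):.3f}")
```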

S.5.3 Proof of Theorem 3

S.5.3.1 Lemmas

Recall the parameter space

Θ¯S(h)={{𝜽¯(k)}kS={(w(k),𝝁(k)1,𝝁(k)2,𝚺(k))}kS:𝜽¯(k)Θ¯,inf𝜷¯maxkS𝜷(k)𝜷¯2h},\displaystyle\overline{\Theta}_{S}(h)=\Big{\{}\{\overline{\bm{\theta}}^{(k)}\}_{k\in S}=\{(w^{(k)},\bm{\mu}^{(k)}_{1},\bm{\mu}^{(k)}_{2},\bm{\Sigma}^{(k)})\}_{k\in S}:\overline{\bm{\theta}}^{(k)}\in\overline{\Theta},\inf_{\overline{\bm{\beta}}}\max_{k\in S}\|\bm{\beta}^{(k)}-\overline{\bm{\beta}}\|_{2}\leq h\Big{\}}, (S.5.325)

where $\bm{\beta}^{(k)}=(\bm{\Sigma}^{(k)})^{-1}(\bm{\mu}^{(k)}_{2}-\bm{\mu}^{(k)}_{1})$ and $\delta^{(k)}=\frac{1}{2}(\bm{\beta}^{(k)})^{\top}(\bm{\mu}^{(k)}_{1}+\bm{\mu}^{(k)}_{2})$.

Lemma 15 (Lemma 8.4 in \citealpappcai2019chime).

For any $\bm{\mu}$, $\widetilde{\bm{\mu}}\in\mathbb{R}^{p}$ and $w\in(0,1)$, denote $\mathbb{P}_{\bm{\mu}}=(1-w)\mathcal{N}(\bm{\mu},\bm{I}_{p})+w\mathcal{N}(-\bm{\mu},\bm{I}_{p})$ and $\mathbb{P}_{\widetilde{\bm{\mu}}}=(1-w)\mathcal{N}(\widetilde{\bm{\mu}},\bm{I}_{p})+w\mathcal{N}(-\widetilde{\bm{\mu}},\bm{I}_{p})$. Then

KL(𝝁𝝁~)(4𝝁22+12log(w1w))2𝝁𝝁~22.\textup{KL}(\mathbb{P}_{\bm{\mu}}\|\mathbb{P}_{\widetilde{\bm{\mu}}})\leq\left(4\|\bm{\mu}\|_{2}^{2}+\frac{1}{2}\log\left(\frac{w}{1-w}\right)\right)\cdot 2\|\bm{\mu}-\widetilde{\bm{\mu}}\|_{2}^{2}. (S.5.326)
Lemma 16.

For any $\bm{\mu}$, $\bm{\mu}^{\prime}$, $\widetilde{\bm{\mu}}$, $\widetilde{\bm{\mu}}^{\prime}\in\mathbb{R}^{p}$ and $w\in(0,1)$, denote $\mathbb{P}_{\bm{\mu},\widetilde{\bm{\mu}}}=(1-w)\mathcal{N}(\bm{\mu},\bm{I}_{p})+w\mathcal{N}(\widetilde{\bm{\mu}},\bm{I}_{p})$ and $\mathbb{P}_{\bm{\mu}^{\prime},\widetilde{\bm{\mu}}^{\prime}}=(1-w)\mathcal{N}(\bm{\mu}^{\prime},\bm{I}_{p})+w\mathcal{N}(\widetilde{\bm{\mu}}^{\prime},\bm{I}_{p})$. Then

KL(𝝁,𝝁~𝝁,𝝁~)(1w)𝝁𝝁22+w𝝁~𝝁~22.\textup{KL}(\mathbb{P}_{\bm{\mu},\widetilde{\bm{\mu}}}\|\mathbb{P}_{\bm{\mu}^{\prime},\widetilde{\bm{\mu}}^{\prime}})\leq(1-w)\|\bm{\mu}-\bm{\mu}^{\prime}\|_{2}^{2}+w\|\widetilde{\bm{\mu}}-\widetilde{\bm{\mu}}^{\prime}\|_{2}^{2}. (S.5.327)
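For completeness, a one-line sketch of Lemma 16 (under the same notation; not necessarily the authors' original argument) follows from the joint convexity of the KL divergence in its two arguments:

$\textup{KL}(\mathbb{P}_{\bm{\mu},\widetilde{\bm{\mu}}}\|\mathbb{P}_{\bm{\mu}^{\prime},\widetilde{\bm{\mu}}^{\prime}})\leq(1-w)\,\textup{KL}(\mathcal{N}(\bm{\mu},\bm{I}_{p})\|\mathcal{N}(\bm{\mu}^{\prime},\bm{I}_{p}))+w\,\textup{KL}(\mathcal{N}(\widetilde{\bm{\mu}},\bm{I}_{p})\|\mathcal{N}(\widetilde{\bm{\mu}}^{\prime},\bm{I}_{p}))=\frac{1-w}{2}\|\bm{\mu}-\bm{\mu}^{\prime}\|_{2}^{2}+\frac{w}{2}\|\widetilde{\bm{\mu}}-\widetilde{\bm{\mu}}^{\prime}\|_{2}^{2}$,

which implies (S.5.327) (in fact with an extra factor of $\frac{1}{2}$).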
Lemma 17.

Denote the distribution $(1-w)\mathcal{N}(\bm{\mu},\bm{I}_{p})+w\mathcal{N}(-\bm{\mu},\bm{I}_{p})$ as $\mathbb{P}_{w}$ for any $w\in(c_{w},1-c_{w})$, where $\bm{\mu}\in\mathbb{R}^{p}$. Then

KL(ww)12cw2(ww)2.\textup{KL}(\mathbb{P}_{w}\|\mathbb{P}_{w^{\prime}})\leq\frac{1}{2c_{w}^{2}}(w-w^{\prime})^{2}. (S.5.328)
Lemma 18.

Denote the distribution $\frac{1}{2}\mathcal{N}((-1/2,\bm{0}_{p-1}^{\top})^{\top},\bm{I}_{p})+\frac{1}{2}\mathcal{N}((1/2+\widetilde{u},\bm{0}_{p-1}^{\top})^{\top},\bm{I}_{p})$ as $\mathbb{P}_{\widetilde{u}}$ for any $\widetilde{u}\in[-1,1]$. Then

KL(u~u~)12(u~u~)2.\textup{KL}(\mathbb{P}_{\widetilde{u}}\|\mathbb{P}_{\widetilde{u}^{\prime}})\leq\frac{1}{2}(\widetilde{u}-\widetilde{u}^{\prime})^{2}. (S.5.329)
Lemma 19.

When there exists a subset $S$ such that $\min_{k\in S}n_{k}\geq C(p+\log K)$ for some constant $C>0$, we have

inf{𝜽^(k)}k=1KsupS:|S|ssup{𝜽¯(k)}kSΘ¯SS\displaystyle\inf_{\{\widehat{\bm{\theta}}^{(k)}\}_{k=1}^{K}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S}\\ \mathbb{Q}_{S}\end{subarray}} (kS{d(𝜽^(k),𝜽(k))C1pnS+C2logKnk\displaystyle\mathbb{P}\Bigg{(}\bigcup_{k\in S}\bigg{\{}d(\widehat{\bm{\theta}}^{(k)},\bm{\theta}^{(k)*})\geq C_{1}\sqrt{\frac{p}{n_{S}}}+C_{2}\sqrt{\frac{\log K}{n_{k}}} (S.5.330)
+C3hp+logKnk})14.\displaystyle\quad\quad+C_{3}h\wedge\sqrt{\frac{p+\log K}{n_{k}}}\bigg{\}}\Bigg{)}\geq\frac{1}{4}. (S.5.331)
Lemma 20.

Denote $\widetilde{\epsilon}=\frac{K-s}{s}$. Then

inf{𝜽^(k)}k=1KsupS:|S|ssup{𝜽¯(k)}kSΘ¯SS\displaystyle\inf_{\{\widehat{\bm{\theta}}^{(k)}\}_{k=1}^{K}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S}\\ \mathbb{Q}_{S}\end{subarray}} (maxkSd(𝜽^(k),𝜽(k))C1ϵ~1maxk=1:Knk)110.\displaystyle\mathbb{P}\Bigg{(}\max_{k\in S}d(\widehat{\bm{\theta}}^{(k)},\bm{\theta}^{(k)*})\geq C_{1}\widetilde{\epsilon}\sqrt{\frac{1}{\max_{k=1:K}n_{k}}}\Bigg{)}\geq\frac{1}{10}. (S.5.332)
Lemma 21 (The first variant of Theorem 5.1 in \citealpappchen2018robust).

Given a family of distributions $\{\{\mathbb{P}_{\theta}^{(k)}\}_{k=1}^{K}:\theta\in\Theta\}$, each indexed by the same parameter $\theta\in\Theta$, consider $\bm{x}^{(k)}\sim(1-\widetilde{\epsilon})\mathbb{P}^{(k)}_{\theta}+\widetilde{\epsilon}\mathbb{Q}^{(k)}$ independently for $k=1:K$. Denote the joint distribution of $\{\bm{x}^{(k)}\}_{k=1}^{K}$ as $\mathbb{P}_{(\widetilde{\epsilon},\theta,\{\mathbb{Q}^{(k)}\}_{k=1}^{K})}$. Then

infθ^supθΘ{(k)}k=1K(ϵ~,θ,{(k)}k=1K)(θ^θCϖ(ϵ~,Θ))12,\inf_{\widehat{\theta}}\sup_{\begin{subarray}{c}\theta\in\Theta\\ \{\mathbb{Q}^{(k)}\}_{k=1}^{K}\end{subarray}}\mathbb{P}_{(\widetilde{\epsilon},\theta,\{\mathbb{Q}^{(k)}\}_{k=1}^{K})}\left(\|\widehat{\theta}-\theta\|\geq C\varpi(\widetilde{\epsilon},\Theta)\right)\geq\frac{1}{2}, (S.5.333)

where $\varpi(\widetilde{\epsilon},\Theta)\coloneqq\sup\big\{\|\theta_{1}-\theta_{2}\|:\max_{k=1:K}d_{\textup{TV}}\big(\mathbb{P}^{(k)}_{\theta_{1}},\mathbb{P}^{(k)}_{\theta_{2}}\big)\leq\widetilde{\epsilon}/(1-\widetilde{\epsilon})\big\}$.

Lemma 22.

Suppose $K-s\geq 1$. Consider the following two data-generating mechanisms:

  (i) $\bm{x}^{(k)}\sim(1-\widetilde{\epsilon}^{\prime})\mathbb{P}_{\theta}^{(k)}+\widetilde{\epsilon}^{\prime}\mathbb{Q}^{(k)}$ independently for $k=1:K$, where $\widetilde{\epsilon}^{\prime}=\frac{K-s}{K}$;

  (ii) With a given set $S\subseteq 1:K$, generate $\{\bm{x}^{(k)}\}_{k\in S^{c}}\sim\mathbb{Q}_{S}$ and $\bm{x}^{(k)}\sim\mathbb{P}_{\theta}^{(k)}$ independently for $k\in S$.

Denote the joint distributions of $\{\bm{x}^{(k)}\}_{k=1}^{K}$ under (i) and (ii) as $\mathbb{P}_{(\widetilde{\epsilon}^{\prime},\theta,\{\mathbb{Q}^{(k)}\}_{k=1}^{K})}$ and $\mathbb{P}_{(S,\theta,\mathbb{Q})}$, respectively. We claim that if

infθ^supθΘ{(k)}k=1K(Ks50K,θ,{(k)}k=1K)(θ^θCϖ(Ks50K,Θ))12,\inf_{\widehat{\theta}}\sup_{\begin{subarray}{c}\theta\in\Theta\\ \{\mathbb{Q}^{(k)}\}_{k=1}^{K}\end{subarray}}\mathbb{P}_{(\frac{K-s}{50K},\theta,\{\mathbb{Q}^{(k)}\}_{k=1}^{K})}\left(\|\widehat{\theta}-\theta\|\geq C\varpi\left(\frac{K-s}{50K},\Theta\right)\right)\geq\frac{1}{2}, (S.5.334)

then

infθ^supS:|S|ssupθΘS(S,θ,S)(θ^θCϖ(Ks50K,Θ))110,\inf_{\widehat{\theta}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\theta\in\Theta\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}_{(S,\theta,\mathbb{Q}_{S})}\left(\|\widehat{\theta}-\theta\|\geq C\varpi\left(\frac{K-s}{50K},\Theta\right)\right)\geq\frac{1}{10}, (S.5.335)

where $\varpi(\widetilde{\epsilon},\Theta)\coloneqq\sup\big\{\|\theta_{1}-\theta_{2}\|:\max_{k=1:K}\textup{KL}\big(\mathbb{P}^{(k)}_{\theta_{1}}\|\mathbb{P}^{(k)}_{\theta_{2}}\big)\leq[\widetilde{\epsilon}/(1-\widetilde{\epsilon})]^{2}\big\}$ for any $\widetilde{\epsilon}\in(0,1)$.

Lemma 23.

When there exists a subset $S$ such that $\min_{k\in S}n_{k}\geq C(p\vee\log K)$ for some constant $C>0$, we have

inf{𝚺^(k)}k=1KsupS:|S|ssup{𝜽¯(k)}kSΘ¯SS(kS{min{𝝁^(k)1𝝁(k)12𝝁^(k)2𝝁(k)22,\displaystyle\inf_{\{\widehat{\bm{\Sigma}}^{(k)}\}_{k=1}^{K}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S}\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}\Bigg{(}\bigcup_{k\in S}\bigg{\{}\min\big{\{}\|\widehat{\bm{\mu}}^{(k)}_{1}-\bm{\mu}^{(k)*}_{1}\|_{2}\vee\|\widehat{\bm{\mu}}^{(k)}_{2}-\bm{\mu}^{(k)*}_{2}\|_{2}, (S.5.336)
𝝁^(k)1𝝁(k)22𝝁^(k)2𝝁(k)12}𝚺^(k)𝚺(k)2Cp+logKnk})110.\displaystyle\quad\|\widehat{\bm{\mu}}^{(k)}_{1}-\bm{\mu}^{(k)*}_{2}\|_{2}\vee\|\widehat{\bm{\mu}}^{(k)}_{2}-\bm{\mu}^{(k)*}_{1}\|_{2}\big{\}}\vee\|\widehat{\bm{\Sigma}}^{(k)}-\bm{\Sigma}^{(k)*}\|_{2}\geq C\sqrt{\frac{p+\log K}{n_{k}}}\bigg{\}}\Bigg{)}\geq\frac{1}{10}. (S.5.337)

S.5.3.2 Main proof of Theorem 3

Combining the conclusions of Lemmas 19 and 20 gives the first lower bound; Lemma 23 implies the second one.

S.5.3.3 Proofs of lemmas

Proof of Lemma 17.

Denote g(w;z~)=log[(1w)z~+w]g(w;\widetilde{z})=\log\big{[}(1-w)\widetilde{z}+w\big{]}, g(w;z~)=1z~(1w)z~+wg^{\prime}(w;\widetilde{z})=\frac{1-\widetilde{z}}{(1-w)\widetilde{z}+w}, g(w;z~)=(1z~)2[(1w)z~+w]2g^{\prime\prime}(w;\widetilde{z})=-\frac{(1-\widetilde{z})^{2}}{[(1-w)\widetilde{z}+w]^{2}} and f(w;𝒛,𝝁)=1w(2π)p/2exp{12𝒛𝝁22}+w(2π)p/2exp{12𝒛+𝝁22}f(w;\bm{z},\bm{\mu})=\frac{1-w}{(2\pi)^{p/2}}\exp\{-\frac{1}{2}\|\bm{z}-\bm{\mu}\|_{2}^{2}\}+\frac{w}{(2\pi)^{p/2}}\exp\{-\frac{1}{2}\|\bm{z}+\bm{\mu}\|_{2}^{2}\}.

By Taylor expansion,

log[f(w;𝒛,𝝁)f(w;𝒛,𝝁)]=logf(w;𝒛,𝝁)w|w(ww)+122logf(w;𝒛,𝝁)w2|w0(ww)2,\log\left[\frac{f(w^{\prime};\bm{z},\bm{\mu})}{f(w;\bm{z},\bm{\mu})}\right]=\frac{\partial\log f(w;\bm{z},\bm{\mu})}{\partial w}\bigg{|}_{w}\cdot(w^{\prime}-w)+\frac{1}{2}\frac{\partial^{2}\log f(w;\bm{z},\bm{\mu})}{\partial w^{2}}\bigg{|}_{w_{0}}\cdot(w^{\prime}-w)^{2}, (S.5.338)

where $w_{0}=w_{0}(\bm{z},\bm{\mu})$ lies between $w$ and $w^{\prime}$. By the property of the score function,

logf(w;𝒛,𝝁)wdw=0.\int\frac{\partial\log f(w;\bm{z},\bm{\mu})}{\partial w}d\mathbb{P}_{w}=0. (S.5.339)

Besides,

2logf(w;𝒛,𝝁)w2=2log[f(w;𝒛,𝝁)/((2π)p/2exp{12𝒛+𝝁22})]w2=g(w;z~),\frac{\partial^{2}\log f(w;\bm{z},\bm{\mu})}{\partial w^{2}}=\frac{\partial^{2}\log\big{[}f(w;\bm{z},\bm{\mu})/\big{(}(2\pi)^{-p/2}\exp\{-\frac{1}{2}\|\bm{z}+\bm{\mu}\|_{2}^{2}\}\big{)}\big{]}}{\partial w^{2}}=g^{\prime\prime}(w;\widetilde{z}), (S.5.340)

where $\widetilde{z}=e^{-\bm{\mu}^{\top}\bm{z}}$. Note that

g(w;z~)=1(1w)2(z~1)2(z~+w/(1w))21cw2,-g^{\prime\prime}(w;\widetilde{z})=\frac{1}{(1-w)^{2}}\cdot\frac{(\widetilde{z}-1)^{2}}{(\widetilde{z}+w/(1-w))^{2}}\leq\frac{1}{c_{w}^{2}}, (S.5.341)

for any $\widetilde{z}>0$. Therefore,

KL(ww)\displaystyle\textup{KL}(\mathbb{P}_{w}\|\mathbb{P}_{w^{\prime}}) =log[f(w;𝒛,𝝁)f(w;𝒛,𝝁)]dw\displaystyle=-\int\log\left[\frac{f(w^{\prime};\bm{z},\bm{\mu})}{f(w;\bm{z},\bm{\mu})}\right]d\mathbb{P}_{w} (S.5.342)
=12(ww)2g(w0(𝒛,𝝁);z~)dw\displaystyle=-\frac{1}{2}(w^{\prime}-w)^{2}\cdot\int g^{\prime\prime}(w_{0}(\bm{z},\bm{\mu});\widetilde{z})d\mathbb{P}_{w} (S.5.343)
12cw2(ww)2,\displaystyle\leq\frac{1}{2c_{w}^{2}}(w^{\prime}-w)^{2}, (S.5.344)

which completes the proof. ∎
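As a purely numerical sanity check of Lemma 17 (not used in any proof), the following minimal sketch evaluates the KL divergence between two weight-perturbed mixtures in dimension $p=1$ by numerical integration and compares it with the bound $\frac{1}{2c_{w}^{2}}(w-w^{\prime})^{2}$; the mean, the value of $c_{w}$, and the weight pairs are arbitrary placeholders.

```python
import numpy as np
from scipy.stats import norm

mu, c_w = 1.3, 0.2                      # placeholder mean (p = 1) and weight lower bound
x = np.linspace(-12.0, 12.0, 40001)
dx = x[1] - x[0]

def mix_pdf(w):
    # density of (1 - w) N(mu, 1) + w N(-mu, 1)
    return (1 - w) * norm.pdf(x, loc=mu) + w * norm.pdf(x, loc=-mu)

for w, w2 in [(0.3, 0.4), (0.25, 0.7), (0.6, 0.35)]:   # all weights lie in (c_w, 1 - c_w)
    f, g = mix_pdf(w), mix_pdf(w2)
    kl = np.sum(f * np.log(f / g)) * dx                 # KL(P_w || P_{w'}) via a Riemann sum
    bound = (w - w2) ** 2 / (2 * c_w ** 2)
    print(f"w={w:.2f}, w'={w2:.2f}:  KL={kl:.4f}  bound={bound:.4f}")
```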

Proof of Lemma 18.

Recall that we denote the distribution $\frac{1}{2}\mathcal{N}((-1/2,\bm{0}_{p-1}^{\top})^{\top},\bm{I}_{p})+\frac{1}{2}\mathcal{N}((1/2+\widetilde{u},\bm{0}_{p-1}^{\top})^{\top},\bm{I}_{p})$ as $\mathbb{P}_{\widetilde{u}}$ for any $\widetilde{u}\in[-1,1]$. By the joint convexity of the KL divergence, we have

KL(u~u~)\displaystyle\textup{KL}(\mathbb{P}_{\widetilde{u}}\|\mathbb{P}_{\widetilde{u}^{\prime}}) 12KL(𝒩((1/2+u~,𝟎p1),𝑰p)𝒩((1/2+u~,𝟎p1),𝑰p))\displaystyle\leq\frac{1}{2}\textup{KL}(\mathcal{N}((1/2+\widetilde{u},\bm{0}_{p-1}^{\top})^{\top},\bm{I}_{p})\|\mathcal{N}((1/2+\widetilde{u}^{\prime},\bm{0}_{p-1}^{\top})^{\top},\bm{I}_{p})) (S.5.345)
=12KL(𝒩(1/2+u~,1)𝒩(1/2+u~,1))\displaystyle=\frac{1}{2}\textup{KL}(\mathcal{N}(1/2+\widetilde{u},1)\|\mathcal{N}(1/2+\widetilde{u}^{\prime},1)) (S.5.346)
=12(u~u~)2,\displaystyle=\frac{1}{2}(\widetilde{u}-\widetilde{u}^{\prime})^{2}, (S.5.347)

which completes the proof. ∎

Proof of Lemma 19.

WLOG, suppose Δ1\Delta\geq 1. It’s easy to see that given any SS, Θ¯SΘ¯S,wΘ¯S,𝜷Θ¯S,δ\overline{\Theta}_{S}\supseteq\overline{\Theta}_{S,w}\cup\overline{\Theta}_{S,\bm{\beta}}\cup\overline{\Theta}_{S,\delta}, where

Θ¯S,w\displaystyle\overline{\Theta}_{S,w} ={{𝜽¯(k)}kS:𝝁(k)1=𝟏p/p,𝝁(k)2=𝝁(k)1=𝝁~,𝚺(k)=𝑰p,w(k)(cw,1cw)},\displaystyle=\Big{\{}\{\overline{\bm{\theta}}^{(k)}\}_{k\in S}:\bm{\mu}^{(k)}_{1}=\bm{1}_{p}/\sqrt{p},\bm{\mu}^{(k)}_{2}=-\bm{\mu}^{(k)}_{1}=\widetilde{\bm{\mu}},\bm{\Sigma}^{(k)}=\bm{I}_{p},w^{(k)}\in(c_{w},1-c_{w})\Big{\}}, (S.5.348)
Θ¯S,𝜷\displaystyle\overline{\Theta}_{S,\bm{\beta}} ={{𝜽¯(k)}kS:𝚺(k)=𝑰p,w(k)=12,𝝁(k)12𝝁(k)22M,min𝜷maxkS𝜷(k)𝜷2h},\displaystyle=\Big{\{}\{\overline{\bm{\theta}}^{(k)}\}_{k\in S}:\bm{\Sigma}^{(k)}=\bm{I}_{p},w^{(k)}=\frac{1}{2},\|\bm{\mu}^{(k)}_{1}\|_{2}\vee\|\bm{\mu}^{(k)}_{2}\|_{2}\leq M,\min_{\bm{\beta}}\max_{k\in S}\|\bm{\beta}^{(k)}-\bm{\beta}\|_{2}\leq h\Big{\}}, (S.5.349)
Θ¯S,δ\displaystyle\overline{\Theta}_{S,\delta} ={{𝜽¯(k)}kS:𝚺(k)=𝑰p,w(k)=12,𝝁(k)12𝝁(k)22M,𝝁(k)1=12𝝁0,𝝁(k)2=12𝝁0+𝒖,\displaystyle=\Big{\{}\{\overline{\bm{\theta}}^{(k)}\}_{k\in S}:\bm{\Sigma}^{(k)}=\bm{I}_{p},w^{(k)}=\frac{1}{2},\|\bm{\mu}^{(k)}_{1}\|_{2}\vee\|\bm{\mu}^{(k)}_{2}\|_{2}\leq M,\bm{\mu}^{(k)}_{1}=-\frac{1}{2}\bm{\mu}_{0},\bm{\mu}^{(k)}_{2}=\frac{1}{2}\bm{\mu}_{0}+\bm{u}, (S.5.350)
𝒖21}.\displaystyle\hskip 85.35826pt\|\bm{u}\|_{2}\leq 1\Big{\}}. (S.5.351)

(i) Fixing an $S$ and a $\mathbb{Q}_{S}$, we want to show

inf{𝜷^(k)}k=1Ksup{𝜽¯(k)}kSΘ¯S,𝜷(kS{𝜷^(k)𝜷(k)2𝜷^(k)+𝜷(k)2CpnS})14\inf_{\{\widehat{\bm{\beta}}^{(k)}\}_{k=1}^{K}}\sup_{\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S,\bm{\beta}}}\mathbb{P}\Bigg{(}\bigcup_{k\in S}\bigg{\{}\|\widehat{\bm{\beta}}^{(k)}-\bm{\beta}^{(k)*}\|_{2}\wedge\|\widehat{\bm{\beta}}^{(k)}+\bm{\beta}^{(k)*}\|_{2}\geq C\sqrt{\frac{p}{n_{S}}}\bigg{\}}\Bigg{)}\geq\frac{1}{4} (S.5.352)

By Lemma 4, there exist a quadrant $\mathcal{Q}_{\bm{v}}$ of $\mathbb{R}^{p}$ and an $r/8$-packing $\{\widetilde{\bm{\mu}}_{j}\}_{j=1}^{N}$ of $(r\mathcal{S}^{p})\cap\mathcal{Q}_{\bm{v}}$ under the Euclidean norm, where $r=(c\sqrt{p/n_{S}})\wedge M\leq 1$ with a small constant $c>0$ and $N\geq(\frac{1}{2})^{p}8^{p-1}=\frac{1}{2}\times 4^{p-1}\geq 2^{p-1}$ when $p\geq 2$. For any $\bm{\mu}\in\mathbb{R}^{p}$, denote the distribution $\frac{1}{2}\mathcal{N}(\bm{\mu}_{0}+\bm{\mu},\bm{I}_{p})+\frac{1}{2}\mathcal{N}(-\bm{\mu}_{0}+\bm{\mu},\bm{I}_{p})$ as $\mathbb{P}_{\bm{\mu}}$, where $\bm{\mu}_{0}$ can be any vector in $\mathbb{R}^{p}$ with $\|\bm{\mu}_{0}\|_{2}\geq 1$. Then

LHS inf𝝁^sup𝝁(r𝒮p)𝒬𝒗(𝝁^𝝁2𝝁^+𝝁2CpnS)\displaystyle\geq\inf_{\widehat{\bm{\mu}}}\sup_{\bm{\mu}\in(r\mathcal{S}^{p})\cap\mathcal{Q}_{\bm{v}}}\mathbb{P}\Bigg{(}\|\widehat{\bm{\mu}}-\bm{\mu}\|_{2}\wedge\|\widehat{\bm{\mu}}+\bm{\mu}\|_{2}\geq C\sqrt{\frac{p}{n_{S}}}\Bigg{)} (S.5.353)
inf𝝁^sup𝝁(r𝒮p)𝒬𝒗(𝝁^𝝁2CpnS),\displaystyle\geq\inf_{\widehat{\bm{\mu}}}\sup_{\bm{\mu}\in(r\mathcal{S}^{p})\cap\mathcal{Q}_{\bm{v}}}\mathbb{P}\Bigg{(}\|\widehat{\bm{\mu}}-\bm{\mu}\|_{2}\geq C\sqrt{\frac{p}{n_{S}}}\Bigg{)}, (S.5.354)

where the last inequality holds because it suffices to consider estimators $\widehat{\bm{\mu}}$ satisfying $\widehat{\bm{\mu}}(X)\in(r\mathcal{S}^{p})\cap\mathcal{Q}_{\bm{v}}$ almost surely, and because $\|\bm{x}-\bm{y}\|_{2}\leq\|\bm{x}+\bm{y}\|_{2}$ for any $\bm{x}$, $\bm{y}\in\mathcal{Q}_{\bm{v}}$.

By Lemma 15,

KL(kS𝝁~jnkSkS𝝁~jnkS)\displaystyle\text{KL}\left(\prod_{k\in S}\mathbb{P}_{\widetilde{\bm{\mu}}_{j}}^{\otimes n_{k}}\cdot\mathbb{Q}_{S}\bigg{\|}\prod_{k\in S}\mathbb{P}_{\widetilde{\bm{\mu}}_{j^{\prime}}}^{\otimes n_{k}}\cdot\mathbb{Q}_{S}\right) =kSnkKL(𝝁~j𝝁~j)\displaystyle=\sum_{k\in S}n_{k}\text{KL}(\mathbb{P}_{\widetilde{\bm{\mu}}_{j}}\|\mathbb{P}_{\widetilde{\bm{\mu}}_{j^{\prime}}}) (S.5.355)
kSnk8𝝁~j22𝝁~j𝝁~j22\displaystyle\leq\sum_{k\in S}n_{k}\cdot 8\|\widetilde{\bm{\mu}}_{j}\|_{2}^{2}\|\widetilde{\bm{\mu}}_{j}-\widetilde{\bm{\mu}}_{j^{\prime}}\|_{2}^{2} (S.5.356)
32nSr2\displaystyle\leq 32n_{S}r^{2} (S.5.357)
32nSc22(p1)nS\displaystyle\leq 32n_{S}c^{2}\cdot\frac{2(p-1)}{n_{S}} (S.5.358)
64c2log2logN.\displaystyle\leq\frac{64c^{2}}{\log 2}\log N. (S.5.359)

By Lemma 3,

LHS of (S.5.354)1log2logN64c2log211p11414,\displaystyle\text{LHS of \eqref{eq: lower bdd eq mu 1}}\geq 1-\frac{\log 2}{\log N}-\frac{64c^{2}}{\log 2}\geq 1-\frac{1}{p-1}-\frac{1}{4}\geq\frac{1}{4}, (S.5.360)

when $C=c/2$, $p\geq 3$, and $c=\sqrt{\log 2}/16$.

(ii) Fixing an $S$ and a $\mathbb{Q}_{S}$, we want to show

inf{𝜷^(k)}k=1Ksup{𝜽¯(k)}kSΘ¯S(kS{𝜷^(k)𝜷(k)2𝜷^(k)+𝜷(k)2C[h(cpnk)]})14.\inf_{\{\widehat{\bm{\beta}}^{(k)}\}_{k=1}^{K}}\sup_{\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S}}\mathbb{P}\Bigg{(}\bigcup_{k\in S}\bigg{\{}\|\widehat{\bm{\beta}}^{(k)}-\bm{\beta}^{(k)*}\|_{2}\wedge\|\widehat{\bm{\beta}}^{(k)}+\bm{\beta}^{(k)*}\|_{2}\geq C\bigg{[}h\wedge\bigg{(}c\sqrt{\frac{p}{n_{k}}}\bigg{)}\bigg{]}\bigg{\}}\Bigg{)}\geq\frac{1}{4}. (S.5.361)

WLOG, suppose $1\in S$. We have

inf𝜷^(1)sup{𝜽¯(k)}kSΘ¯S(𝜷^(1)𝜷(1)2𝜷^(1)+𝜷(1)2C[h(cpn1)])14,\inf_{\widehat{\bm{\beta}}^{(1)}}\sup_{\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S}}\mathbb{P}\Bigg{(}\|\widehat{\bm{\beta}}^{(1)}-\bm{\beta}^{(1)*}\|_{2}\wedge\|\widehat{\bm{\beta}}^{(1)}+\bm{\beta}^{(1)*}\|_{2}\geq C\bigg{[}h\wedge\bigg{(}c\sqrt{\frac{p}{n_{1}}}\bigg{)}\bigg{]}\Bigg{)}\geq\frac{1}{4},\\ (S.5.362)

By Lemma 4, there exist a quadrant $\mathcal{Q}_{\bm{v}}$ of $\mathbb{R}^{p}$ and an $r/8$-packing $\{\widetilde{\bm{\vartheta}}_{j}\}_{j=1}^{N}$ of $(r\mathcal{S}^{p-1})\cap\mathcal{Q}_{\bm{v}}$ under the Euclidean norm, where $r=h_{\bm{\beta}}\wedge(c\sqrt{p/n_{1}})\wedge M\leq 1$ with a small constant $c>0$ and $N\geq(\frac{1}{2})^{p-1}8^{p-2}=\frac{1}{2}\times 4^{p-2}\geq 2^{p-2}$ when $p\geq 3$. WLOG, assume $M\geq 2$. Denote $\widetilde{\bm{\mu}}_{j}=(1,\widetilde{\bm{\vartheta}}_{j}^{\top})^{\top}\in\mathbb{R}^{p}$. Let $\bm{\mu}^{(k)*}_{1}=\widetilde{\bm{\mu}}=(1,\bm{0}_{p-1})^{\top}$ for all $k\in S\backslash\{1\}$, and let $\bm{\mu}^{(1)*}_{1}=\bm{\mu}=(1,\bm{\vartheta})$ with $\bm{\vartheta}\in(r\mathcal{S}^{p-1})\cap\mathcal{Q}_{\bm{v}}$. For any $\bm{\mu}\in\mathbb{R}^{p}$, denote the distribution $\frac{1}{2}\mathcal{N}(\bm{\mu},\bm{I}_{p})+\frac{1}{2}\mathcal{N}(-\bm{\mu},\bm{I}_{p})$ as $\mathbb{P}_{\bm{\mu}}$. Then, similar to the arguments in (i),

LHS inf𝝁^supϑ(r𝒮p1)𝒬𝒗𝝁=(1,ϑ)(𝝁^𝝁2𝝁^+𝝁2C[h(cpn1)])\displaystyle\geq\inf_{\widehat{\bm{\mu}}}\sup_{\begin{subarray}{c}\bm{\vartheta}\in(r\mathcal{S}^{p-1})\cap\mathcal{Q}_{\bm{v}}\\ \bm{\mu}=(1,\bm{\vartheta})^{\top}\end{subarray}}\mathbb{P}\Bigg{(}\|\widehat{\bm{\mu}}-\bm{\mu}\|_{2}\wedge\|\widehat{\bm{\mu}}+\bm{\mu}\|_{2}\geq C\bigg{[}h\wedge\bigg{(}c\sqrt{\frac{p}{n_{1}}}\bigg{)}\bigg{]}\Bigg{)} (S.5.363)
inf𝝁^supϑ(r𝒮p1)𝒬𝒗𝝁=(1,ϑ)(𝝁^𝝁2C[h(cpn1)]).\displaystyle\geq\inf_{\widehat{\bm{\mu}}}\sup_{\begin{subarray}{c}\bm{\vartheta}\in(r\mathcal{S}^{p-1})\cap\mathcal{Q}_{\bm{v}}\\ \bm{\mu}=(1,\bm{\vartheta})^{\top}\end{subarray}}\mathbb{P}\Bigg{(}\|\widehat{\bm{\mu}}-\bm{\mu}\|_{2}\geq C\bigg{[}h\wedge\bigg{(}c\sqrt{\frac{p}{n_{1}}}\bigg{)}\bigg{]}\Bigg{)}. (S.5.364)

Then by Lemma 15,

KL(kS\{1}𝝁~nk𝝁~jn1SkS\{1}𝝁~nk𝝁~jn1S)\displaystyle\text{KL}\left(\prod_{k\in S\backslash\{1\}}\mathbb{P}_{\widetilde{\bm{\mu}}}^{\otimes n_{k}}\cdot\mathbb{P}_{\widetilde{\bm{\mu}}_{j}}^{\otimes n_{1}}\cdot\mathbb{Q}_{S}\bigg{\|}\prod_{k\in S\backslash\{1\}}\mathbb{P}_{\widetilde{\bm{\mu}}}^{\otimes n_{k}}\cdot\mathbb{P}_{\widetilde{\bm{\mu}}_{j^{\prime}}}^{\otimes n_{1}}\cdot\mathbb{Q}_{S}\right) =n1KL(𝝁~j𝝁~j)\displaystyle=n_{1}\text{KL}(\mathbb{P}_{\widetilde{\bm{\mu}}_{j}}\|\mathbb{P}_{\widetilde{\bm{\mu}}_{j^{\prime}}}) (S.5.365)
n18𝝁~j22𝝁~j𝝁~j22\displaystyle\leq n_{1}\cdot 8\|\widetilde{\bm{\mu}}_{j}\|_{2}^{2}\|\widetilde{\bm{\mu}}_{j}-\widetilde{\bm{\mu}}_{j^{\prime}}\|_{2}^{2} (S.5.366)
32n1r2\displaystyle\leq 32n_{1}r^{2} (S.5.367)
32n1c23(p2)n1\displaystyle\leq 32n_{1}c^{2}\cdot\frac{3(p-2)}{n_{1}} (S.5.368)
96c2log2logN,\displaystyle\leq\frac{96c^{2}}{\log 2}\log N, (S.5.369)

when $n_{1}\geq(c^{2}\vee M^{-2})p$ and $p\geq 3$. By Fano's lemma (see Corollary 2.6 in \citealpapptsybakov2009introduction),

LHS of (S.5.362)1log2logN96c2log211p21414,\displaystyle\text{LHS of \eqref{eq: lower bdd eq mu 2}}\geq 1-\frac{\log 2}{\log N}-\frac{96c^{2}}{\log 2}\geq 1-\frac{1}{p-2}-\frac{1}{4}\geq\frac{1}{4}, (S.5.370)

when $C=1/2$, $p\geq 4$, and $c=\sqrt{(\log 2)/384}$.

(iii) Fixing an $S$ and a $\mathbb{Q}_{S}$, we want to show

inf{𝜽^(k)}k=1Ksup{𝜽¯(k)}kSΘ¯S(kS{𝜷^(k)𝜷(k)2𝜷^(k)+𝜷(k)2C[h(clogKnk)]})14.\inf_{\{\widehat{\bm{\theta}}^{(k)}\}_{k=1}^{K}}\sup_{\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S}}\mathbb{P}\Bigg{(}\bigcup_{k\in S}\bigg{\{}\|\widehat{\bm{\beta}}^{(k)}-\bm{\beta}^{(k)*}\|_{2}\wedge\|\widehat{\bm{\beta}}^{(k)}+\bm{\beta}^{(k)*}\|_{2}\geq C\bigg{[}h\wedge\bigg{(}c\sqrt{\frac{\log K}{n_{k}}}\bigg{)}\bigg{]}\bigg{\}}\Bigg{)}\geq\frac{1}{4}. (S.5.371)

Suppose 𝒗=𝟏p\bm{v}=\bm{1}_{p} and denote the associated quadrant 𝒬𝒗=+p\mathcal{Q}_{\bm{v}}=\mathbb{R}_{+}^{p}, ΥS={{𝝁(k)}kS:𝝁(k)+p,minμmaxkS𝝁(k)𝝁2h,𝝁(k)2M}\Upsilon_{S}=\{\{\bm{\mu}^{(k)}\}_{k\in S}:\bm{\mu}^{(k)}\in\mathbb{R}_{+}^{p},\min_{\mu}\max_{k\in S}\|\bm{\mu}^{(k)}-\bm{\mu}\|_{2}\leq h,\|\bm{\mu}^{(k)}\|_{2}\leq M\}. Let rk=h(clogK/nk)Mr_{k}=h\wedge(c\sqrt{\log K/n_{k}})\wedge M with a small constant c>0c>0 for kSk\in S. For any 𝑴={𝝁(k)}kS\bm{M}=\{\bm{\mu}^{(k)}\}_{k\in S}, where 𝝁(k)p\bm{\mu}^{(k)}\in\mathbb{R}^{p}, denote distribution kS[12𝒩(𝝁(k),𝑰p)+12𝒩(𝝁(k),𝑰p)]nk\prod_{k\in S}\big{[}\frac{1}{2}\mathcal{N}(\bm{\mu}^{(k)},\bm{I}_{p})+\frac{1}{2}\mathcal{N}(-\bm{\mu}^{(k)},\bm{I}_{p})\big{]}^{\otimes n_{k}} as 𝑴\mathbb{P}_{\bm{M}}, and the joint distribution of 𝑴\mathbb{P}_{\bm{M}} and S\mathbb{Q}_{S} as 𝑴S\mathbb{P}_{\bm{M}}\cdot\mathbb{Q}_{S}. And denote distribution (1w¯)𝒩(𝝁,𝑰p)+w¯𝒩(𝝁,𝑰p)(1-\overline{w})\mathcal{N}(\bm{\mu},\bm{I}_{p})+\overline{w}\mathcal{N}(-\bm{\mu},\bm{I}_{p}) as 𝝁\mathbb{P}_{\bm{\mu}} for any 𝝁p\bm{\mu}\in\mathbb{R}^{p}. Similar to the arguments in (i), since it suffices to consider the estimators {𝝁^(k)}kS\{\widehat{\bm{\mu}}^{(k)}\}_{k\in S} satisfying {𝝁^(k)}kSΥS\{\widehat{\bm{\mu}}^{(k)}\}_{k\in S}\in\Upsilon_{S} almost surely and 𝒙𝒚2𝒙+𝒚2\|\bm{x}-\bm{y}\|_{2}\leq\|\bm{x}+\bm{y}\|_{2} for any 𝒙\bm{x}, 𝒚+p\bm{y}\in\mathbb{R}_{+}^{p}, we have

LHS inf{𝝁^(k)}kSsup{𝝁(k)}kSΥS{𝝁(k)}kSS(kS{𝝁^(k)𝝁(k)2𝝁^(k)+𝝁(k)2\displaystyle\geq\inf_{\{\widehat{\bm{\mu}}^{(k)}\}_{k\in S}}\sup_{\{\bm{\mu}^{(k)}\}_{k\in S}\in\Upsilon_{S}}\mathbb{P}_{\{\bm{\mu}^{(k)}\}_{k\in S}}\cdot\mathbb{Q}_{S}\Bigg{(}\bigcup_{k\in S}\bigg{\{}\|\widehat{\bm{\mu}}^{(k)}-\bm{\mu}^{(k)}\|_{2}\wedge\|\widehat{\bm{\mu}}^{(k)}+\bm{\mu}^{(k)}\|_{2} (S.5.372)
C[h(clogKnk)]})\displaystyle\hskip 227.62204pt\geq C\bigg{[}h\wedge\bigg{(}c\sqrt{\frac{\log K}{n_{k}}}\bigg{)}\bigg{]}\bigg{\}}\Bigg{)} (S.5.373)
inf{𝝁^(k)}kSsup{𝝁(k)}kSΥS{𝝁(k)}kSS(kS{𝝁^(k)𝝁(k)2C[h(clogKnk)]}),\displaystyle\geq\inf_{\{\widehat{\bm{\mu}}^{(k)}\}_{k\in S}}\sup_{\{\bm{\mu}^{(k)}\}_{k\in S}\in\Upsilon_{S}}\mathbb{P}_{\{\bm{\mu}^{(k)}\}_{k\in S}}\cdot\mathbb{Q}_{S}\Bigg{(}\bigcup_{k\in S}\bigg{\{}\|\widehat{\bm{\mu}}^{(k)}-\bm{\mu}^{(k)}\|_{2}\geq C\bigg{[}h\wedge\bigg{(}c\sqrt{\frac{\log K}{n_{k}}}\bigg{)}\bigg{]}\bigg{\}}\Bigg{)}, (S.5.374)

Consider $\bm{M}^{(k)}=\{\bm{\mu}^{(j)}\}_{j\in S}$, where $\bm{\mu}^{(j)}=\frac{r_{j}}{\sqrt{p-3/4}}\cdot\bm{1}_{p}+\bm{\mu}_{0}$ for $j\neq k$ and $\bm{\mu}^{(k)}=\frac{r_{k}}{2\sqrt{p-3/4}}\cdot\bm{1}_{p}+\bm{\mu}_{0}$, with $\bm{\mu}_{0}=(1,\bm{0}_{p-1}^{\top})^{\top}$. Define two new “distances” (which are not distances in the strict sense, since the triangle inequality and definiteness fail) between $\bm{M}=\{\bm{\mu}^{(k)}\}_{k\in S}$ and $\bm{M}^{\prime}=\{\bm{\mu}^{\prime(k)}\}_{k\in S}$ as

d~(𝑴,𝑴)\displaystyle\widetilde{d}(\bm{M},\bm{M}^{\prime}) kS𝟙(𝝁(k)𝝁(k)2rk2p3/4),\displaystyle\coloneqq\sum_{k\in S}\mathds{1}\left(\|\bm{\mu}^{(k)}-\bm{\mu}^{\prime(k)}\|_{2}\geq\frac{r_{k}}{2\sqrt{p-3/4}}\right), (S.5.375)
d~(𝑴,𝑴)\displaystyle\widetilde{d}^{\prime}(\bm{M},\bm{M}^{\prime}) kS𝟙(𝝁(k)𝝁(k)2rk4p3/4).\displaystyle\coloneqq\sum_{k\in S}\mathds{1}\left(\|\bm{\mu}^{(k)}-\bm{\mu}^{\prime(k)}\|_{2}\geq\frac{r_{k}}{4\sqrt{p-3/4}}\right). (S.5.376)

Therefore $\widetilde{d}(\bm{M}^{(k)},\bm{M}^{(k^{\prime})})=2$ when $k\neq k^{\prime}$. For $\{\widehat{\bm{\mu}}^{(k)}\}_{k\in S}$, define $\psi^{*}=\operatorname*{arg\,min}_{k\in S}\widetilde{d}^{\prime}(\{\widehat{\bm{\mu}}^{(k)}\}_{k\in S},\bm{M}^{(k)})$. Because $\widetilde{d}(\bm{M}_{1},\bm{M}_{3})\leq\widetilde{d}^{\prime}(\bm{M}_{1},\bm{M}_{2})+\widetilde{d}^{\prime}(\bm{M}_{2},\bm{M}_{3})$ for any $\bm{M}_{1}$, $\bm{M}_{2}$, and $\bm{M}_{3}$ (if $\|\bm{\mu}_{1}^{(k)}-\bm{\mu}_{3}^{(k)}\|_{2}\geq\frac{r_{k}}{2\sqrt{p-3/4}}$, then at least one of $\|\bm{\mu}_{1}^{(k)}-\bm{\mu}_{2}^{(k)}\|_{2}$ and $\|\bm{\mu}_{2}^{(k)}-\bm{\mu}_{3}^{(k)}\|_{2}$ is at least $\frac{r_{k}}{4\sqrt{p-3/4}}$), it's easy to see that

inf{𝝁^(k)}kSsup{𝝁(k)}kSΥS{𝝁(k)}kSS(kS{𝝁^(k)𝝁(k)2rk4p3/4})\displaystyle\inf_{\{\widehat{\bm{\mu}}^{(k)}\}_{k\in S}}\sup_{\{\bm{\mu}^{(k)}\}_{k\in S}\in\Upsilon_{S}}\mathbb{P}_{\{\bm{\mu}^{(k)}\}_{k\in S}}\cdot\mathbb{Q}_{S}\Bigg{(}\bigcup_{k\in S}\bigg{\{}\|\widehat{\bm{\mu}}^{(k)}-\bm{\mu}^{(k)}\|_{2}\geq\frac{r_{k}}{4\sqrt{p-3/4}}\bigg{\}}\Bigg{)} (S.5.377)
inf{𝝁^(k)}kSsupkS𝑴(k)(d~({𝝁^(k)1}kS,𝑴(k))1)\displaystyle\geq\inf_{\{\widehat{\bm{\mu}}^{(k)}\}_{k\in S}}\sup_{k\in S}\mathbb{P}_{\bm{M}^{(k)}}\left(\widetilde{d}^{\prime}(\{\widehat{\bm{\mu}}^{(k)}_{1}\}_{k\in S},\bm{M}^{(k)})\geq 1\right) (S.5.378)
inf{𝝁^(k)}kSsupkS𝑴(k)(ψk)\displaystyle\geq\inf_{\{\widehat{\bm{\mu}}^{(k)}\}_{k\in S}}\sup_{k\in S}\mathbb{P}_{\bm{M}^{(k)}}\left(\psi^{*}\neq k\right) (S.5.379)
infψsupkS𝑴(k)(ψk).\displaystyle\geq\inf_{\psi}\sup_{k\in S}\mathbb{P}_{\bm{M}^{(k)}}\left(\psi\neq k\right). (S.5.380)

By Lemma 15,

KL(𝑴(k)S𝑴(k)S)\displaystyle\text{KL}\left(\mathbb{P}_{\bm{M}^{(k)}}\cdot\mathbb{Q}_{S}\big{\|}\mathbb{P}_{\bm{M}^{(k^{\prime})}}\cdot\mathbb{Q}_{S}\right) =nkKL(rkp3/4𝟙p+𝝁0nkrk2p3/4𝟙p+𝝁0nk)\displaystyle=n_{k}\text{KL}\left(\mathbb{P}_{\frac{r_{k}}{\sqrt{p-3/4}}\mathds{1}_{p}+\bm{\mu}_{0}}^{\otimes n_{k}}\|\mathbb{P}_{\frac{r_{k}}{2\sqrt{p-3/4}}\mathds{1}_{p}+\bm{\mu}_{0}}^{\otimes n_{k}}\right) (S.5.381)
+nkKL(rkp3/4𝟙p+𝝁0nkrk2p3/4𝟙p+𝝁0nk)\displaystyle\quad+n_{k^{\prime}}\text{KL}\left(\mathbb{P}_{\frac{r_{k^{\prime}}}{\sqrt{p-3/4}}\mathds{1}_{p}+\bm{\mu}_{0}}^{\otimes n_{k^{\prime}}}\|\mathbb{P}_{\frac{r_{k^{\prime}}}{2\sqrt{p-3/4}}\mathds{1}_{p}+\bm{\mu}_{0}}^{\otimes n_{k^{\prime}}}\right) (S.5.382)
nk8rkp3/4𝟙p+𝝁022rk2p3/4𝟙p22\displaystyle\leq n_{k}\cdot 8\left\|\frac{r_{k}}{\sqrt{p-3/4}}\mathds{1}_{p}+\bm{\mu}_{0}\right\|_{2}^{2}\left\|\frac{r_{k}}{2\sqrt{p-3/4}}\mathds{1}_{p}\right\|_{2}^{2} (S.5.383)
+nk8rkp3/4𝟙p+𝝁022rk2p3/4𝟙p22\displaystyle\quad+n_{k^{\prime}}\cdot 8\left\|\frac{r_{k^{\prime}}}{\sqrt{p-3/4}}\mathds{1}_{p}+\bm{\mu}_{0}\right\|_{2}^{2}\left\|\frac{r_{k^{\prime}}}{2\sqrt{p-3/4}}\mathds{1}_{p}\right\|_{2}^{2} (S.5.384)
nk82(2rk2+1)142rk2+nk82(2rk2+1)142rk2\displaystyle\leq n_{k}\cdot 8\cdot 2\cdot(2r_{k}^{2}+1)\cdot\frac{1}{4}\cdot 2r_{k}^{2}+n_{k^{\prime}}\cdot 8\cdot 2\cdot(2r_{k^{\prime}}^{2}+1)\cdot\frac{1}{4}\cdot 2r_{k^{\prime}}^{2} (S.5.385)
16c2logK,\displaystyle\leq 16c^{2}\log K, (S.5.386)

when p3p\geq 3. By Fano’s lemma (See Corollary 2.6 in \citealpapptsybakov2009introduction),

RHS of (S.5.380)1log2logK16c214,\displaystyle\text{RHS of \eqref{eq: lower bdd eq mu 4}}\geq 1-\frac{\log 2}{\log K}-16c^{2}\geq\frac{1}{4}, (S.5.387)

when $K\geq 3$, $c=\sqrt{1/160}$, and $\min_{k\in S}n_{k}\geq(c^{2}\vee M^{-2})\log K$.

(iv) We want to show

inf{𝜽^(k)}k=1KsupS:|S|ssup{𝜽¯(k)}kSΘ¯S,wS(kS{\displaystyle\inf_{\{\widehat{\bm{\theta}}^{(k)}\}_{k=1}^{K}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S,w}\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}\Bigg{(}\bigcup_{k\in S}\bigg{\{} |w^(k)w(k)||1w^(k)w(k)|\displaystyle|\widehat{w}^{(k)}-w^{(k)*}|\wedge|1-\widehat{w}^{(k)}-w^{(k)*}| (S.5.388)
ClogKnk})14.\displaystyle\geq C\sqrt{\frac{\log K}{n_{k}}}\bigg{\}}\Bigg{)}\geq\frac{1}{4}. (S.5.389)

The argument is similar to that of part (iii). The only two differences are that the parameter of interest $w$ is one-dimensional, and that Lemma 15 is replaced by Lemma 17.

(v) We want to show

inf{δ^(k)}k=1KsupS:|S|ssup{𝜽¯(k)}kSΘ¯S,δS(kS{\displaystyle\inf_{\{\widehat{\delta}^{(k)}\}_{k=1}^{K}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S,\delta}\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}\Bigg{(}\bigcup_{k\in S}\bigg{\{} |δ^(k)δ(k)||δ^(k)+δ(k)|\displaystyle|\widehat{\delta}^{(k)}-\delta^{(k)*}|\wedge|\widehat{\delta}^{(k)}+\delta^{(k)*}| (S.5.390)
ClogKnk})14,\displaystyle\geq C\sqrt{\frac{\log K}{n_{k}}}\bigg{\}}\Bigg{)}\geq\frac{1}{4}, (S.5.391)

The argument is similar to that of part (iii). The only two differences are that the parameter of interest $\delta$ is one-dimensional, and that Lemma 15 is replaced by Lemma 18.

Finally, we get the desired conclusion by combining (i)-(v). ∎

Proof of Lemma 20.

Let $\widetilde{\epsilon}=\frac{K-s}{s}$ and $\widetilde{\epsilon}^{\prime}=\frac{K-s}{K}$. Since $s/K\geq c>0$, $\widetilde{\epsilon}\lesssim\widetilde{\epsilon}^{\prime}$. Denote $\Upsilon_{S}=\{\{\bm{\mu}^{(k)}\}_{k\in S}:\bm{\mu}^{(k)}\in\mathbb{R}_{+}^{p},\min_{\bm{\mu}}\max_{k\in S}\|\bm{\mu}^{(k)}-\bm{\mu}\|_{2}\leq h_{\bm{\beta}}/2,\|\bm{\mu}^{(k)}\|_{2}\leq M\}$. For any $\bm{\mu}\in\mathbb{R}^{p}$, denote the distribution $\frac{1}{2}\mathcal{N}(\bm{\mu},\bm{I}_{p})+\frac{1}{2}\mathcal{N}(-\bm{\mu},\bm{I}_{p})$ as $\mathbb{P}_{\bm{\mu}}$, and denote $\prod_{k\in S}\mathbb{P}_{\bm{\mu}^{(k)}}^{\otimes n_{k}}$ as $\mathbb{P}_{\{\bm{\mu}^{(k)}\}_{k\in S}}$. Note that $\bm{\beta}^{(k)}=2\bm{\mu}^{(k)}$ for $\mathbb{P}_{\bm{\mu}^{(k)}}$ with $\{\bm{\mu}^{(k)}\}_{k\in S}\in\Upsilon_{S}$. Then it suffices to show

inf{𝝁^(k)}k=1KsupS:|S|ssup{𝝁(k)}kSΥSS(maxkS𝝁^(k)𝝁(k)2C1ϵ~1maxk=1:Knk)110,\inf_{\{\widehat{\bm{\mu}}^{(k)}\}_{k=1}^{K}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\bm{\mu}^{(k)}\}_{k\in S}\in\Upsilon_{S}\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}\Bigg{(}\max_{k\in S}\|\widehat{\bm{\mu}}^{(k)}-\bm{\mu}^{(k)}\|_{2}\geq C_{1}\widetilde{\epsilon}^{\prime}\sqrt{\frac{1}{\max_{k=1:K}n_{k}}}\Bigg{)}\geq\frac{1}{10}, (S.5.392)

where ={𝝁(k)}kSS\mathbb{P}=\mathbb{P}_{\{\bm{\mu}^{(k)}\}_{k\in S}}\cdot\mathbb{Q}_{S}. WLOG, assume M1M\geq 1. For any 𝝁~1\widetilde{\bm{\mu}}_{1}, 𝝁~2p\widetilde{\bm{\mu}}_{2}\in\mathbb{R}^{p} with 𝝁~12=𝝁~22=1\|\widetilde{\bm{\mu}}_{1}\|_{2}=\|\widetilde{\bm{\mu}}_{2}\|_{2}=1, by Lemma 15,

maxk=1:KKL(𝝁~1nk𝝁~2nk)maxk=1:Knk8𝝁~1𝝁~222.\max_{k=1:K}\text{KL}\big{(}\mathbb{P}_{\widetilde{\bm{\mu}}_{1}}^{\otimes n_{k}}\|\mathbb{P}_{\widetilde{\bm{\mu}}_{2}}^{\otimes n_{k}}\big{)}\leq\max_{k=1:K}n_{k}\cdot 8\|\widetilde{\bm{\mu}}_{1}-\widetilde{\bm{\mu}}_{2}\|_{2}^{2}. (S.5.393)

Requiring $8\max_{k=1:K}n_{k}\cdot\|\widetilde{\bm{\mu}}_{1}-\widetilde{\bm{\mu}}_{2}\|_{2}^{2}\leq(\frac{\widetilde{\epsilon}^{\prime}}{1-\widetilde{\epsilon}^{\prime}})^{2}$, we may take $\|\widetilde{\bm{\mu}}_{1}-\widetilde{\bm{\mu}}_{2}\|_{2}$ as large as $C\sqrt{\frac{1}{\max_{k=1:K}n_{k}}}\widetilde{\epsilon}^{\prime}$ for some constant $C>0$. Then (S.5.392) follows from Lemma 22. ∎

Proof of Lemma 21.

The proof is similar to the proof of Theorem 5.1 in \citeappchen2018robust, so we omit it here. ∎

Proof of Lemma 22.

It’s easy to see that

LHS of (S.5.335)infθ^supθΘS𝔼Ss[(S,θ,S)(θ^θCϖ(Ks50K,Θ))],\text{LHS of \eqref{eq: conclusion binomial lemma}}\geq\inf_{\widehat{\theta}}\sup_{\begin{subarray}{c}\theta\in\Theta\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{E}_{S\sim\mathbb{P}_{s}}\left[\mathbb{P}_{(S,\theta,\mathbb{Q}_{S})}\left(\|\widehat{\theta}-\theta\|\geq C\varpi\left(\frac{K-s}{50K},\Theta\right)\right)\right], (S.5.394)

where $\mathbb{P}_{s}$ can be any probability measure on all subsets of $1:K$ with $\mathbb{P}_{s}(|S|\geq s)=1$.

Consider the following specific choice $\widetilde{\mathbb{P}}_{s}$ of $\mathbb{P}_{s}$:

~s(S=S)=|Sc|Bin(K,Ks50K)(|S|=|S|)|Sc|Bin(K,Ks50K)(|Sc|41(Ks)50)1(K|S|),\widetilde{\mathbb{P}}_{s}(S=S^{\prime})=\frac{\mathbb{P}_{|S^{c}|\sim\text{Bin}(K,\frac{K-s}{50K})}(|S|=|S^{\prime}|)}{\mathbb{P}_{|S^{c}|\sim\text{Bin}(K,\frac{K-s}{50K})}(|S^{c}|\leq\frac{41(K-s)}{50})}\cdot\frac{1}{\binom{K}{|S^{\prime}|}}, (S.5.395)

for any SS^{\prime} with |(S)c|41(Ks)50|(S^{\prime})^{c}|\leq\frac{41(K-s)}{50}. Given SS, consider the distribution of {𝒙(k)}k=1K\{\bm{x}^{(k)}\}_{k=1}^{K} as

S=kS(k)θkS(k)\mathbb{P}^{S}=\prod_{k\in S}\mathbb{P}^{(k)}_{\theta}\cdot\prod_{k\notin S}\mathbb{Q}^{(k)} (S.5.396)

Then consider the distribution of {𝒙(k)}k=1K\{\bm{x}^{(k)}\}_{k=1}^{K} as

=S:|Sc|41(Ks)50~s(S)S.\mathbb{P}^{\prime}=\sum_{S:|S^{c}|\leq\frac{41(K-s)}{50}}\widetilde{\mathbb{P}}_{s}(S)\cdot\mathbb{P}^{S}. (S.5.397)

It’s easy to see that $\mathbb{P}^{\prime}$ is the same as $\mathbb{P}_{(\frac{K-s}{50K},\theta,\{\mathbb{Q}^{(k)}\}_{k=1}^{K})}$ conditioned on the event $\big\{S:|S^{c}|\leq\frac{41(K-s)}{50}\big\}$. Therefore,

infθ^supθΘS𝔼S~s[(S,θ,S)(θ^θCϖ(Ks50K,Θ))]\displaystyle\inf_{\widehat{\theta}}\sup_{\begin{subarray}{c}\theta\in\Theta\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{E}_{S\sim\widetilde{\mathbb{P}}_{s}}\left[\mathbb{P}_{(S,\theta,\mathbb{Q}_{S})}\left(\|\widehat{\theta}-\theta\|\geq C\varpi\left(\frac{K-s}{50K},\Theta\right)\right)\right] (S.5.398)
infθ^supθΘ{(k)}k=1K𝔼S~s[S(θ^θCϖ(Ks50K,Θ))]\displaystyle\geq\inf_{\widehat{\theta}}\sup_{\begin{subarray}{c}\theta\in\Theta\\ \{\mathbb{Q}^{(k)}\}_{k=1}^{K}\end{subarray}}\mathbb{E}_{S\sim\widetilde{\mathbb{P}}_{s}}\left[\mathbb{P}^{S}\left(\|\widehat{\theta}-\theta\|\geq C\varpi\left(\frac{K-s}{50K},\Theta\right)\right)\right] (S.5.399)
=infθ^supθΘ{(k)}k=1K(θ^θCϖ(Ks50K,Θ))\displaystyle=\inf_{\widehat{\theta}}\sup_{\begin{subarray}{c}\theta\in\Theta\\ \{\mathbb{Q}^{(k)}\}_{k=1}^{K}\end{subarray}}\mathbb{P}^{\prime}\left(\|\widehat{\theta}-\theta\|\geq C\varpi\left(\frac{K-s}{50K},\Theta\right)\right) (S.5.400)
infθ^supθΘ{(k)}k=1K(Ks50K,θ,{(k)}k=1K)(θ^θCϖ(Ks50K,Θ)||Sc|41(Ks)50)\displaystyle\geq\inf_{\widehat{\theta}}\sup_{\begin{subarray}{c}\theta\in\Theta\\ \{\mathbb{Q}^{(k)}\}_{k=1}^{K}\end{subarray}}\mathbb{P}_{(\frac{K-s}{50K},\theta,\{\mathbb{Q}^{(k)}\}_{k=1}^{K})}\left(\|\widehat{\theta}-\theta\|\geq C\varpi\left(\frac{K-s}{50K},\Theta\right)\bigg{|}|S^{c}|\leq\frac{41(K-s)}{50}\right) (S.5.401)
infθ^supθΘ{(k)}k=1K(Ks50K,θ,{(k)}k=1K)(θ^θCϖ(Ks50K,Θ))|Sc|Bin(K,Ks50K)(|Sc|>41(Ks)50)\displaystyle\geq\inf_{\widehat{\theta}}\sup_{\begin{subarray}{c}\theta\in\Theta\\ \{\mathbb{Q}^{(k)}\}_{k=1}^{K}\end{subarray}}\mathbb{P}_{(\frac{K-s}{50K},\theta,\{\mathbb{Q}^{(k)}\}_{k=1}^{K})}\left(\|\widehat{\theta}-\theta\|\geq C\varpi\left(\frac{K-s}{50K},\Theta\right)\right)-\mathbb{P}_{|S^{c}|\sim\text{Bin}(K,\frac{K-s}{50K})}\left(|S^{c}|>\frac{41(K-s)}{50}\right) (S.5.402)
12exp{12[45(Ks)]2KKs50K(1Ks50K)+1345(Ks)}\displaystyle\geq\frac{1}{2}-\exp\left\{-\frac{\frac{1}{2}[\frac{4}{5}(K-s)]^{2}}{K\cdot\frac{K-s}{50K}\big{(}1-\frac{K-s}{50K}\big{)}+\frac{1}{3}\cdot\frac{4}{5}(K-s)}\right\} (S.5.403)
12exp{12(45)2150+415}\displaystyle\geq\frac{1}{2}-\exp\left\{-\frac{\frac{1}{2}\cdot(\frac{4}{5})^{2}}{\frac{1}{50}+\frac{4}{15}}\right\} (S.5.404)
110,\displaystyle\geq\frac{1}{10}, (S.5.405)

where the third-to-last inequality follows from Bernstein's inequality, the application of Lemma 17, and the fact that $d_{\text{TV}}^{2}(\mathbb{P}_{\theta_{1}},\mathbb{P}_{\theta_{2}})\leq\text{KL}(\mathbb{P}_{\theta_{1}}\|\mathbb{P}_{\theta_{2}})$. ∎
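As a quick numerical check of the last two displays (using only $K-s\geq 1$, as assumed in the lemma): $\frac{\frac{1}{2}(\frac{4}{5})^{2}}{\frac{1}{50}+\frac{4}{15}}\approx\frac{0.32}{0.287}\approx 1.12$, so the subtracted exponential term is at most $e^{-1.12}\approx 0.33$, and indeed $\frac{1}{2}-0.33\geq\frac{1}{10}$.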

Proof of Lemma 23.

(i) We want to show

inf{𝚺^(k)}k=1KsupS:|S|ssup{𝜽¯(k)}kSΘ¯SS(kS{𝚺^(k)𝚺(k)2Cpnk})110.\inf_{\{\widehat{\bm{\Sigma}}^{(k)}\}_{k=1}^{K}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S}\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}\Bigg{(}\bigcup_{k\in S}\bigg{\{}\|\widehat{\bm{\Sigma}}^{(k)}-\bm{\Sigma}^{(k)*}\|_{2}\geq C\sqrt{\frac{p}{n_{k}}}\bigg{\}}\Bigg{)}\geq\frac{1}{10}. (S.5.406)

Fix $S$ and some $\mathbb{Q}_{S}$. WLOG, assume $1\in S$. Then it suffices to show

inf𝚺^(k)sup{𝜽¯(k)}kSΘ¯S(𝚺^(k)𝚺(k)2Cpnk)110.\inf_{\widehat{\bm{\Sigma}}^{(k)}}\sup_{\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S}}\mathbb{P}\Bigg{(}\|\widehat{\bm{\Sigma}}^{(k)}-\bm{\Sigma}^{(k)*}\|_{2}\geq C\sqrt{\frac{p}{n_{k}}}\Bigg{)}\geq\frac{1}{10}. (S.5.407)

Consider a special subset of Θ¯S\overline{\Theta}_{S} as

Θ¯S,𝚺={𝜽¯:w=1/2,𝝁1=𝝁2=0,𝚺=𝚺(𝜸),𝜸{0,1}p},\overline{\Theta}_{S,\bm{\Sigma}}=\{\overline{\bm{\theta}}:w=1/2,\bm{\mu}_{1}=\bm{\mu}_{2}=0,\bm{\Sigma}=\bm{\Sigma}(\bm{\gamma}),\bm{\gamma}\in\{0,1\}^{p}\}, (S.5.408)

where

𝚺(𝜸)=(γ1𝒆1γp𝒆1)τ+𝑰p,\bm{\Sigma}(\bm{\gamma})=\begin{pmatrix}\gamma_{1}\bm{e}_{1}^{\top}\\ \vdots\\ \gamma_{p}\bm{e}_{1}^{\top}\end{pmatrix}\cdot\tau+\bm{I}_{p}, (S.5.409)

and τ>0\tau>0 is a small constant which we will specify later. For any 𝜸{0,1}p\bm{\gamma}\in\{0,1\}^{p}, denote 𝒩(𝟎,𝚺(𝜸))\mathcal{N}(\bm{0},\bm{\Sigma}(\bm{\gamma})) as 𝜸\mathbb{P}_{\bm{\gamma}}. Therefore it suffices to show

inf𝚺^(1)sup𝜸{0,1}p(𝚺^(1)𝚺(1)2Cpn1)110.\inf_{\widehat{\bm{\Sigma}}^{(1)}}\sup_{\bm{\gamma}\in\{0,1\}^{p}}\mathbb{P}\Bigg{(}\|\widehat{\bm{\Sigma}}^{(1)}-\bm{\Sigma}^{(1)*}\|_{2}\geq C\sqrt{\frac{p}{n_{1}}}\Bigg{)}\geq\frac{1}{10}. (S.5.410)

Note that for any $\widehat{\bm{\Sigma}}^{(1)}$, we can define $\widehat{\bm{\gamma}}=\operatorname*{arg\,min}_{\bm{\gamma}\in\{0,1\}^{p}}\|\widehat{\bm{\Sigma}}^{(1)}-\bm{\Sigma}(\bm{\gamma})\|_{2}$. Then, by the triangle inequality and the definition of $\widehat{\bm{\gamma}}$, $\|\widehat{\bm{\Sigma}}^{(1)}-\bm{\Sigma}(\bm{\gamma})\|_{2}\geq\|\bm{\Sigma}(\widehat{\bm{\gamma}})-\bm{\Sigma}(\bm{\gamma})\|_{2}/2$. Therefore

LHS of (S.5.410)inf𝜸^{0,1}psup𝜸{0,1}p(𝚺(𝜸^)𝚺(𝜸)22Cpn1).\text{LHS of \eqref{eq: lemma 16 eq 1}}\geq\inf_{\widehat{\bm{\gamma}}\in\{0,1\}^{p}}\sup_{\bm{\gamma}\in\{0,1\}^{p}}\mathbb{P}\Bigg{(}\|\bm{\Sigma}(\widehat{\bm{\gamma}})-\bm{\Sigma}(\bm{\gamma})\|_{2}\geq 2C\sqrt{\frac{p}{n_{1}}}\Bigg{)}. (S.5.411)

Let $\tau=c\sqrt{1/n_{1}}$, where $c>0$ is a small constant. Since $\|\bm{\Sigma}(\widehat{\bm{\gamma}})-\bm{\Sigma}(\bm{\gamma})\|_{2}\leq\tau\sqrt{p}$ for any $\widehat{\bm{\gamma}}$ and $\bm{\gamma}\in\{0,1\}^{p}$, by Lemma D.2 in \citeappduan2023adaptive,

LHS of (S.5.411)inf𝜸^{0,1}psup𝜸{0,1}p𝔼𝚺(𝜸^)𝚺(𝜸)224C2pn1(c24C2)pn1.\text{LHS of \eqref{eq: lemma 16 eq 2}}\geq\frac{\inf_{\widehat{\bm{\gamma}}\in\{0,1\}^{p}}\sup_{\bm{\gamma}\in\{0,1\}^{p}}\mathbb{E}\|\bm{\Sigma}(\widehat{\bm{\gamma}})-\bm{\Sigma}(\bm{\gamma})\|_{2}^{2}-4C^{2}\cdot\frac{p}{n_{1}}}{(c^{2}-4C^{2})\frac{p}{n_{1}}}. (S.5.412)

Applying Assouad’s lemma (Theorem 2.12 in \citealpapptsybakov2009introduction or Lemma 2 in \citealpappcai2012optimal), we get

inf𝜸^{0,1}psup𝜸{0,1}p𝔼𝚺(𝜸^)𝚺(𝜸)22\displaystyle\inf_{\widehat{\bm{\gamma}}\in\{0,1\}^{p}}\sup_{\bm{\gamma}\in\{0,1\}^{p}}\mathbb{E}\|\bm{\Sigma}(\widehat{\bm{\gamma}})-\bm{\Sigma}(\bm{\gamma})\|_{2}^{2} p8minρH(𝜸,𝜸)1[𝚺(𝜸)𝚺(𝜸)22ρH(𝜸,𝜸)]\displaystyle\geq\frac{p}{8}\min_{\rho_{H}(\bm{\gamma},\bm{\gamma}^{\prime})\geq 1}\left[\frac{\|\bm{\Sigma}(\bm{\gamma})-\bm{\Sigma}(\bm{\gamma}^{\prime})\|_{2}^{2}}{\rho_{H}(\bm{\gamma},\bm{\gamma}^{\prime})}\right] (S.5.413)
[1maxρH(𝜸,𝜸)=1(KL(𝜸n1𝜸n1))1/2],\displaystyle\quad\cdot\left[1-\max_{\rho_{H}(\bm{\gamma},\bm{\gamma}^{\prime})=1}\left(\text{KL}(\mathbb{P}_{\bm{\gamma}}^{\otimes n_{1}}\|\mathbb{P}_{\bm{\gamma}^{\prime}}^{\otimes n_{1}})\right)^{1/2}\right], (S.5.414)

where $\rho_{H}$ is the Hamming distance. For the first term on the RHS, it’s easy to see that

𝚺(𝜸)𝚺(𝜸)22=τ2ρH(𝜸,𝜸),\|\bm{\Sigma}(\bm{\gamma})-\bm{\Sigma}(\bm{\gamma}^{\prime})\|_{2}^{2}=\tau^{2}\rho_{H}(\bm{\gamma},\bm{\gamma}^{\prime}), (S.5.415)

for any $\bm{\gamma}$ and $\bm{\gamma}^{\prime}\in\{0,1\}^{p}$. For the second term, by the form of the Gaussian density, we can show that if $\rho_{H}(\bm{\gamma},\bm{\gamma}^{\prime})=1$, then

KL(𝜸n1𝜸n1)\displaystyle\text{KL}(\mathbb{P}_{\bm{\gamma}}^{\otimes n_{1}}\|\mathbb{P}_{\bm{\gamma}^{\prime}}^{\otimes n_{1}}) =n1KL(𝜸𝜸)\displaystyle=n_{1}\text{KL}(\mathbb{P}_{\bm{\gamma}}\|\mathbb{P}_{\bm{\gamma}^{\prime}}) (S.5.416)
n112{log(|𝚺(𝜸)|/|𝚺(𝜸)|)Tr[(𝚺(𝜸)1𝚺(𝜸)1)𝚺(𝜸)]}\displaystyle\leq n_{1}\cdot\frac{1}{2}\left\{\log(|\bm{\Sigma}(\bm{\gamma}^{\prime})|/|\bm{\Sigma}(\bm{\gamma})|)-\text{Tr}\left[(\bm{\Sigma}(\bm{\gamma})^{-1}-\bm{\Sigma}(\bm{\gamma}^{\prime})^{-1})\bm{\Sigma}(\bm{\gamma})\right]\right\} (S.5.417)
n114τ2\displaystyle\leq n_{1}\cdot\frac{1}{4}\tau^{2} (S.5.418)
c24.\displaystyle\leq\frac{c^{2}}{4}. (S.5.419)

Plugging this back into (S.5.414) and combining with (S.5.412), we have

LHS of (S.5.411)c2p8n1(1c2)4C2pn1(c24C2)pn1110,\text{LHS of \eqref{eq: lemma 16 eq 2}}\geq\frac{c^{2}\cdot\frac{p}{8n_{1}}(1-\frac{c}{2})-4C^{2}\cdot\frac{p}{n_{1}}}{(c^{2}-4C^{2})\frac{p}{n_{1}}}\geq\frac{1}{10}, (S.5.420)

when $c=2/9$ and $C\leq c/\sqrt{324}$.

(ii) We want to show

inf{𝚺^(k)}k=1KsupS:|S|ssup{𝜽¯(k)}kSΘ¯SS(kS{𝚺^(k)𝚺(k)2ClogKnk})110.\inf_{\{\widehat{\bm{\Sigma}}^{(k)}\}_{k=1}^{K}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S}\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}\Bigg{(}\bigcup_{k\in S}\bigg{\{}\|\widehat{\bm{\Sigma}}^{(k)}-\bm{\Sigma}^{(k)*}\|_{2}\geq C\sqrt{\frac{\log K}{n_{k}}}\bigg{\}}\Bigg{)}\geq\frac{1}{10}. (S.5.421)

The proof idea is similar to that of part (iii) of the proof of Lemma 19, so we omit the details here. It suffices to consider $\bm{M}^{(k)}=\{\bm{\Sigma}^{(j)}\}_{j=1}^{K}$, where $\bm{\Sigma}^{(j)}=\bm{I}_{p}$ when $j\neq k$ and $\bm{\Sigma}^{(k)}=\bm{I}_{p}+\sqrt{\log K/n_{k}}\cdot\bm{e}_{1}\bm{e}_{1}^{\top}$. ∎

S.5.4 Proof of Theorem 2

We claim that with probability at least $1-CK^{-1}$,

R𝜽¯(k)(𝒞^(k)[t])R𝜽¯(k)(𝒞𝜽¯(k))d2(𝜽^(k)[t],𝜽(k)).R_{\overline{\bm{\theta}}^{(k)*}}(\widehat{\mathcal{C}}^{(k)[t]})-R_{\overline{\bm{\theta}}^{(k)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(k)*}})\lesssim d^{2}(\widehat{\bm{\theta}}^{(k)[t]},\bm{\theta}^{(k)*}). (S.5.422)

Then the conclusion immediately follows from Theorem 1. Hence it suffices to verify the claim. For convenience, we write 𝒞^(k)[t]=𝒞𝜽^(k)[t]\widehat{\mathcal{C}}^{(k)[t]}=\mathcal{C}_{\widehat{\bm{\theta}}^{(k)[t]}} simply as 𝒞𝜽^(k)\mathcal{C}_{\widehat{\bm{\theta}}^{(k)}} and 𝜽^(k)[t]\widehat{\bm{\theta}}^{(k)[t]} as 𝜽^(k)\widehat{\bm{\theta}}^{(k)}.

By simple calculations, we have

R𝜽¯(k)(𝒞𝜽^(k))\displaystyle R_{\overline{\bm{\theta}}^{(k)*}}(\mathcal{C}_{\widehat{\bm{\theta}}^{(k)}}) =(1w(k))Φ(log(1w^(k)w^(k))δ^(k)+(𝜷^(k))𝝁(k)1(𝜷^(k))𝚺(k)𝜷^(k))\displaystyle=(1-w^{(k)*})\Phi\left(\frac{-\log(\frac{1-\widehat{w}^{(k)}}{\widehat{w}^{(k)}})-\widehat{\delta}^{(k)}+(\widehat{\bm{\beta}}^{(k)})^{\top}\bm{\mu}^{(k)*}_{1}}{\sqrt{(\widehat{\bm{\beta}}^{(k)})^{\top}\bm{\Sigma}^{(k)*}\widehat{\bm{\beta}}^{(k)}}}\right) (S.5.423)
+w(k)Φ(log(1w^(k)w^(k))+δ^(k)(𝜷^(k))𝝁(k)2(𝜷^(k))𝚺(k)𝜷^(k)),\displaystyle\quad+w^{(k)*}\Phi\left(\frac{\log(\frac{1-\widehat{w}^{(k)}}{\widehat{w}^{(k)}})+\widehat{\delta}^{(k)}-(\widehat{\bm{\beta}}^{(k)})^{\top}\bm{\mu}^{(k)*}_{2}}{\sqrt{(\widehat{\bm{\beta}}^{(k)})^{\top}\bm{\Sigma}^{(k)*}\widehat{\bm{\beta}}^{(k)}}}\right), (S.5.424)
R𝜽¯(k)(𝒞𝜽(k))\displaystyle R_{\overline{\bm{\theta}}^{(k)*}}(\mathcal{C}_{\bm{\theta}^{(k)*}}) =(1w(k))Φ(log(1w(k)w(k))δ(k)+(𝜷(k))𝝁(k)1(𝜷(k))𝚺(k)𝜷(k))\displaystyle=(1-w^{(k)*})\Phi\left(\frac{-\log(\frac{1-w^{(k)*}}{w^{(k)*}})-\delta^{(k)*}+(\bm{\beta}^{(k)*})^{\top}\bm{\mu}^{(k)*}_{1}}{\sqrt{(\bm{\beta}^{(k)*})^{\top}\bm{\Sigma}^{(k)*}\bm{\beta}^{(k)*}}}\right) (S.5.425)
+w(k)Φ(log(1w(k)w(k))+δ(k)(𝜷(k))𝝁(k)2(𝜷(k))𝚺(k)𝜷(k)).\displaystyle\quad+w^{(k)*}\Phi\left(\frac{\log(\frac{1-w^{(k)*}}{w^{(k)*}})+\delta^{(k)*}-(\bm{\beta}^{(k)*})^{\top}\bm{\mu}^{(k)*}_{2}}{\sqrt{(\bm{\beta}^{(k)*})^{\top}\bm{\Sigma}^{(k)*}\bm{\beta}^{(k)*}}}\right). (S.5.426)

Then by Taylor expansion,

R𝜽¯(k)(𝒞𝜽^(k))R𝜽¯(k)(𝒞𝜽(k))(1w(k))Φ(log(1w(k)w(k))δ(k)+(𝜷(k))𝝁(k)1(𝜷(k))𝚺(k)𝜷(k))\displaystyle R_{\overline{\bm{\theta}}^{(k)*}}(\mathcal{C}_{\widehat{\bm{\theta}}^{(k)}})-R_{\overline{\bm{\theta}}^{(k)*}}(\mathcal{C}_{\bm{\theta}^{(k)*}})\leq(1-w^{(k)*})\Phi^{\prime}\left(\frac{-\log(\frac{1-w^{(k)*}}{w^{(k)*}})-\delta^{(k)*}+(\bm{\beta}^{(k)*})^{\top}\bm{\mu}^{(k)*}_{1}}{\sqrt{(\bm{\beta}^{(k)*})^{\top}\bm{\Sigma}^{(k)*}\bm{\beta}^{(k)*}}}\right) (S.5.427)
[log(1w^(k)w^(k))δ^(k)+(𝜷^(k))𝝁(k)1(𝜷^(k))𝚺(k)𝜷^(k)log(1w(k)w(k))δ(k)+(𝜷(k))𝝁(k)1(𝜷(k))𝚺(k)𝜷(k)]\displaystyle\quad\cdot\Bigg{[}\frac{-\log(\frac{1-\widehat{w}^{(k)}}{\widehat{w}^{(k)}})-\widehat{\delta}^{(k)}+(\widehat{\bm{\beta}}^{(k)})^{\top}\bm{\mu}^{(k)*}_{1}}{\sqrt{(\widehat{\bm{\beta}}^{(k)})^{\top}\bm{\Sigma}^{(k)*}\widehat{\bm{\beta}}^{(k)}}}-\frac{-\log(\frac{1-w^{(k)*}}{w^{(k)*}})-\delta^{(k)*}+(\bm{\beta}^{(k)*})^{\top}\bm{\mu}^{(k)*}_{1}}{\sqrt{(\bm{\beta}^{(k)*})^{\top}\bm{\Sigma}^{(k)*}\bm{\beta}^{(k)*}}}\Bigg{]} (S.5.428)
+w(k)Φ(log(1w(k)w(k))+δ(k)(𝜷(k))𝝁(k)2(𝜷(k))𝚺(k)𝜷(k))\displaystyle\quad+w^{(k)*}\Phi^{\prime}\left(\frac{\log(\frac{1-w^{(k)*}}{w^{(k)*}})+\delta^{(k)*}-(\bm{\beta}^{(k)*})^{\top}\bm{\mu}^{(k)*}_{2}}{\sqrt{(\bm{\beta}^{(k)*})^{\top}\bm{\Sigma}^{(k)*}\bm{\beta}^{(k)*}}}\right) (S.5.429)
[log(1w^(k)w^(k))+δ^(k)(𝜷^(k))𝝁(k)2(𝜷^(k))𝚺(k)𝜷^(k)log(1w(k)w(k))+δ(k)(𝜷(k))𝝁(k)2(𝜷(k))𝚺(k)𝜷(k)]\displaystyle\cdot\Bigg{[}\frac{\log(\frac{1-\widehat{w}^{(k)}}{\widehat{w}^{(k)}})+\widehat{\delta}^{(k)}-(\widehat{\bm{\beta}}^{(k)})^{\top}\bm{\mu}^{(k)*}_{2}}{\sqrt{(\widehat{\bm{\beta}}^{(k)})^{\top}\bm{\Sigma}^{(k)*}\widehat{\bm{\beta}}^{(k)}}}-\frac{\log(\frac{1-w^{(k)*}}{w^{(k)*}})+\delta^{(k)*}-(\bm{\beta}^{(k)*})^{\top}\bm{\mu}^{(k)*}_{2}}{\sqrt{(\bm{\beta}^{(k)*})^{\top}\bm{\Sigma}^{(k)*}\bm{\beta}^{(k)*}}}\Bigg{]} (S.5.430)
+C[log(1w^(k)w^(k))δ^(k)+(𝜷^(k))𝝁(k)1(𝜷^(k))𝚺(k)𝜷^(k)log(1w(k)w(k))δ(k)+(𝜷(k))𝝁(k)1(𝜷(k))𝚺(k)𝜷(k)]2\displaystyle\quad+C\Bigg{[}\frac{-\log(\frac{1-\widehat{w}^{(k)}}{\widehat{w}^{(k)}})-\widehat{\delta}^{(k)}+(\widehat{\bm{\beta}}^{(k)})^{\top}\bm{\mu}^{(k)*}_{1}}{\sqrt{(\widehat{\bm{\beta}}^{(k)})^{\top}\bm{\Sigma}^{(k)*}\widehat{\bm{\beta}}^{(k)}}}-\frac{-\log(\frac{1-w^{(k)*}}{w^{(k)*}})-\delta^{(k)*}+(\bm{\beta}^{(k)*})^{\top}\bm{\mu}^{(k)*}_{1}}{\sqrt{(\bm{\beta}^{(k)*})^{\top}\bm{\Sigma}^{(k)*}\bm{\beta}^{(k)*}}}\Bigg{]}^{2} (S.5.431)
+C[log(1w^(k)w^(k))+δ^(k)(𝜷^(k))𝝁(k)2(𝜷^(k))𝚺(k)𝜷^(k)log(1w(k)w(k))+δ(k)(𝜷(k))𝝁(k)2(𝜷(k))𝚺(k)𝜷(k)]2.\displaystyle\quad+C\Bigg{[}\frac{\log(\frac{1-\widehat{w}^{(k)}}{\widehat{w}^{(k)}})+\widehat{\delta}^{(k)}-(\widehat{\bm{\beta}}^{(k)})^{\top}\bm{\mu}^{(k)*}_{2}}{\sqrt{(\widehat{\bm{\beta}}^{(k)})^{\top}\bm{\Sigma}^{(k)*}\widehat{\bm{\beta}}^{(k)}}}-\frac{\log(\frac{1-w^{(k)*}}{w^{(k)*}})+\delta^{(k)*}-(\bm{\beta}^{(k)*})^{\top}\bm{\mu}^{(k)*}_{2}}{\sqrt{(\bm{\beta}^{(k)*})^{\top}\bm{\Sigma}^{(k)*}\bm{\beta}^{(k)*}}}\Bigg{]}^{2}. (S.5.432)

Denote 𝒜=(1w(k))Φ(log(1w(k)w(k))δ(k)+(𝜷(k))𝝁(k)1(𝜷(k))𝚺(k)𝜷(k))[log(1w^(k)w^(k))δ^(k)+(𝜷^(k))𝝁(k)1(𝜷^(k))𝚺(k)𝜷^(k)log(1w(k)w(k))δ(k)+(𝜷(k))𝝁(k)1(𝜷(k))𝚺(k)𝜷(k)]+w(k)Φ(log(1w(k)w(k))+δ(k)(𝜷(k))𝝁(k)2(𝜷(k))𝚺(k)𝜷(k))[log(1w^(k)w^(k))+δ^(k)(𝜷^(k))𝝁(k)2(𝜷^(k))𝚺(k)𝜷^(k)log(1w(k)w(k))+δ(k)(𝜷(k))𝝁(k)2(𝜷(k))𝚺(k)𝜷(k)]\mathscr{A}=(1-w^{(k)*})\Phi^{\prime}\left(\frac{-\log(\frac{1-w^{(k)*}}{w^{(k)*}})-\delta^{(k)*}+(\bm{\beta}^{(k)*})^{\top}\bm{\mu}^{(k)*}_{1}}{\sqrt{(\bm{\beta}^{(k)*})^{\top}\bm{\Sigma}^{(k)*}\bm{\beta}^{(k)*}}}\right)\cdot\Bigg{[}\frac{-\log(\frac{1-\widehat{w}^{(k)}}{\widehat{w}^{(k)}})-\widehat{\delta}^{(k)}+(\widehat{\bm{\beta}}^{(k)})^{\top}\bm{\mu}^{(k)*}_{1}}{\sqrt{(\widehat{\bm{\beta}}^{(k)})^{\top}\bm{\Sigma}^{(k)*}\widehat{\bm{\beta}}^{(k)}}}-\frac{-\log(\frac{1-w^{(k)*}}{w^{(k)*}})-\delta^{(k)*}+(\bm{\beta}^{(k)*})^{\top}\bm{\mu}^{(k)*}_{1}}{\sqrt{(\bm{\beta}^{(k)*})^{\top}\bm{\Sigma}^{(k)*}\bm{\beta}^{(k)*}}}\Bigg{]}+w^{(k)*}\Phi^{\prime}\left(\frac{\log(\frac{1-w^{(k)*}}{w^{(k)*}})+\delta^{(k)*}-(\bm{\beta}^{(k)*})^{\top}\bm{\mu}^{(k)*}_{2}}{\sqrt{(\bm{\beta}^{(k)*})^{\top}\bm{\Sigma}^{(k)*}\bm{\beta}^{(k)*}}}\right)\cdot\Bigg{[}\frac{\log(\frac{1-\widehat{w}^{(k)}}{\widehat{w}^{(k)}})+\widehat{\delta}^{(k)}-(\widehat{\bm{\beta}}^{(k)})^{\top}\bm{\mu}^{(k)*}_{2}}{\sqrt{(\widehat{\bm{\beta}}^{(k)})^{\top}\bm{\Sigma}^{(k)*}\widehat{\bm{\beta}}^{(k)}}}-\frac{\log(\frac{1-w^{(k)*}}{w^{(k)*}})+\delta^{(k)*}-(\bm{\beta}^{(k)*})^{\top}\bm{\mu}^{(k)*}_{2}}{\sqrt{(\bm{\beta}^{(k)*})^{\top}\bm{\Sigma}^{(k)*}\bm{\beta}^{(k)*}}}\Bigg{]} and =C[log(1w^(k)w^(k))δ^(k)+(𝜷^(k))𝝁(k)1(𝜷^(k))𝚺(k)𝜷^(k)log(1w(k)w(k))δ(k)+(𝜷(k))𝝁(k)1(𝜷(k))𝚺(k)𝜷(k)]2+C[log(1w^(k)w^(k))+δ^(k)(𝜷^(k))𝝁(k)2(𝜷^(k))𝚺(k)𝜷^(k)log(1w(k)w(k))+δ(k)(𝜷(k))𝝁(k)2(𝜷(k))𝚺(k)𝜷(k)]2\mathscr{B}=C\Bigg{[}\frac{-\log(\frac{1-\widehat{w}^{(k)}}{\widehat{w}^{(k)}})-\widehat{\delta}^{(k)}+(\widehat{\bm{\beta}}^{(k)})^{\top}\bm{\mu}^{(k)*}_{1}}{\sqrt{(\widehat{\bm{\beta}}^{(k)})^{\top}\bm{\Sigma}^{(k)*}\widehat{\bm{\beta}}^{(k)}}}-\frac{-\log(\frac{1-w^{(k)*}}{w^{(k)*}})-\delta^{(k)*}+(\bm{\beta}^{(k)*})^{\top}\bm{\mu}^{(k)*}_{1}}{\sqrt{(\bm{\beta}^{(k)*})^{\top}\bm{\Sigma}^{(k)*}\bm{\beta}^{(k)*}}}\Bigg{]}^{2}+C\Bigg{[}\frac{\log(\frac{1-\widehat{w}^{(k)}}{\widehat{w}^{(k)}})+\widehat{\delta}^{(k)}-(\widehat{\bm{\beta}}^{(k)})^{\top}\bm{\mu}^{(k)*}_{2}}{\sqrt{(\widehat{\bm{\beta}}^{(k)})^{\top}\bm{\Sigma}^{(k)*}\widehat{\bm{\beta}}^{(k)}}}-\frac{\log(\frac{1-w^{(k)*}}{w^{(k)*}})+\delta^{(k)*}-(\bm{\beta}^{(k)*})^{\top}\bm{\mu}^{(k)*}_{2}}{\sqrt{(\bm{\beta}^{(k)*})^{\top}\bm{\Sigma}^{(k)*}\bm{\beta}^{(k)*}}}\Bigg{]}^{2}. By plugging in the density formula of standard Gaussian distribution, it is easy to see that

𝒜\displaystyle\mathscr{A} (1w(k))w(k)exp{[log(1w(k)w(k))+12(𝜷(k))𝚺(k)𝜷(k)]22(𝜷(k))𝚺(k)𝜷(k)+12log(1w(k)w(k))}\displaystyle\lesssim\sqrt{(1-w^{(k)*})w^{(k)*}}\cdot\exp\left\{-\frac{\big{[}\log(\frac{1-w^{(k)*}}{w^{(k)*}})+\frac{1}{2}(\bm{\beta}^{(k)*})^{\top}\bm{\Sigma}^{(k)*}\bm{\beta}^{(k)*}\big{]}^{2}}{2(\bm{\beta}^{(k)*})^{\top}\bm{\Sigma}^{(k)*}\bm{\beta}^{(k)*}}+\frac{1}{2}\log\bigg{(}\frac{1-w^{(k)*}}{w^{(k)*}}\bigg{)}\right\} (S.5.433)
[log(1w^(k)w^(k))δ^(k)+(𝜷^(k))𝝁(k)1(𝜷^(k))𝚺(k)𝜷^(k)log(1w(k)w(k))δ(k)+(𝜷(k))𝝁(k)1(𝜷(k))𝚺(k)𝜷(k)]\displaystyle\quad\cdot\Bigg{[}\frac{-\log(\frac{1-\widehat{w}^{(k)}}{\widehat{w}^{(k)}})-\widehat{\delta}^{(k)}+(\widehat{\bm{\beta}}^{(k)})^{\top}\bm{\mu}^{(k)*}_{1}}{\sqrt{(\widehat{\bm{\beta}}^{(k)})^{\top}\bm{\Sigma}^{(k)*}\widehat{\bm{\beta}}^{(k)}}}-\frac{-\log(\frac{1-w^{(k)*}}{w^{(k)*}})-\delta^{(k)*}+(\bm{\beta}^{(k)*})^{\top}\bm{\mu}^{(k)*}_{1}}{\sqrt{(\bm{\beta}^{(k)*})^{\top}\bm{\Sigma}^{(k)*}\bm{\beta}^{(k)*}}}\Bigg{]} (S.5.434)
+(1w(k))w(k)exp{[log(1w(k)w(k))12(𝜷(k))𝚺(k)𝜷(k)]22(𝜷(k))𝚺(k)𝜷(k)+12log(w(k)1w(k))}\displaystyle\quad+\sqrt{(1-w^{(k)*})w^{(k)*}}\cdot\exp\left\{-\frac{\big{[}\log(\frac{1-w^{(k)*}}{w^{(k)*}})-\frac{1}{2}(\bm{\beta}^{(k)*})^{\top}\bm{\Sigma}^{(k)*}\bm{\beta}^{(k)*}\big{]}^{2}}{2(\bm{\beta}^{(k)*})^{\top}\bm{\Sigma}^{(k)*}\bm{\beta}^{(k)*}}+\frac{1}{2}\log\bigg{(}\frac{w^{(k)*}}{1-w^{(k)*}}\bigg{)}\right\} (S.5.435)
[log(1w^(k)w^(k))+δ^(k)(𝜷^(k))𝝁(k)2(𝜷^(k))𝚺(k)𝜷^(k)log(1w(k)w(k))+δ(k)(𝜷(k))𝝁(k)2(𝜷(k))𝚺(k)𝜷(k)]\displaystyle\quad\cdot\Bigg{[}\frac{\log(\frac{1-\widehat{w}^{(k)}}{\widehat{w}^{(k)}})+\widehat{\delta}^{(k)}-(\widehat{\bm{\beta}}^{(k)})^{\top}\bm{\mu}^{(k)*}_{2}}{\sqrt{(\widehat{\bm{\beta}}^{(k)})^{\top}\bm{\Sigma}^{(k)*}\widehat{\bm{\beta}}^{(k)}}}-\frac{\log(\frac{1-w^{(k)*}}{w^{(k)*}})+\delta^{(k)*}-(\bm{\beta}^{(k)*})^{\top}\bm{\mu}^{(k)*}_{2}}{\sqrt{(\bm{\beta}^{(k)*})^{\top}\bm{\Sigma}^{(k)*}\bm{\beta}^{(k)*}}}\Bigg{]} (S.5.436)
=(1w(k))w(k)exp{18[(𝜷(k))𝚺(k)𝜷(k)]212log2(1w(k)w(k))(𝜷(k))𝚺(k)𝜷(k)}\displaystyle=\sqrt{(1-w^{(k)*})w^{(k)*}}\cdot\exp\left\{-\frac{1}{8}[(\bm{\beta}^{(k)*})^{\top}\bm{\Sigma}^{(k)*}\bm{\beta}^{(k)*}]^{2}-\frac{1}{2}\cdot\frac{\log^{2}(\frac{1-w^{(k)*}}{w^{(k)*}})}{(\bm{\beta}^{(k)*})^{\top}\bm{\Sigma}^{(k)*}\bm{\beta}^{(k)*}}\right\} (S.5.437)
|(𝜷^(k))(𝝁(k)1𝝁(k)2)(𝜷^(k))𝚺(k)𝜷^(k)(𝜷(k))(𝝁(k)1𝝁(k)2)(𝜷(k))𝚺(k)𝜷(k)|\displaystyle\quad\cdot\left|\frac{(\widehat{\bm{\beta}}^{(k)})^{\top}(\bm{\mu}^{(k)*}_{1}-\bm{\mu}^{(k)*}_{2})}{\sqrt{(\widehat{\bm{\beta}}^{(k)})^{\top}\bm{\Sigma}^{(k)*}\widehat{\bm{\beta}}^{(k)}}}-\frac{(\bm{\beta}^{(k)*})^{\top}(\bm{\mu}^{(k)*}_{1}-\bm{\mu}^{(k)*}_{2})}{\sqrt{(\bm{\beta}^{(k)*})^{\top}\bm{\Sigma}^{(k)*}\bm{\beta}^{(k)*}}}\right| (S.5.438)
|(𝝃^(k))𝝃(k)𝝃^(k)2𝝃(k)2|\displaystyle\lesssim\left|\frac{(\widehat{\bm{\xi}}^{(k)})^{\top}\bm{\xi}^{(k)*}}{\|\widehat{\bm{\xi}}^{(k)}\|_{2}}-\|\bm{\xi}^{(k)*}\|_{2}\right| (S.5.439)
𝝃^(k)𝝃(k)22,\displaystyle\lesssim\|\widehat{\bm{\xi}}^{(k)}-\bm{\xi}^{(k)*}\|_{2}^{2}, (S.5.440)

with probability at least 1CK11-C^{\prime}K^{-1}, where 𝝃^(k)=(𝚺(k))1/2𝜷^(k)\widehat{\bm{\xi}}^{(k)}=(\bm{\Sigma}^{(k)*})^{1/2}\widehat{\bm{\beta}}^{(k)} and 𝝃(k)=(𝚺(k))1/2𝜷(k)\bm{\xi}^{(k)*}=(\bm{\Sigma}^{(k)*})^{1/2}\bm{\beta}^{(k)*}, so 𝝃^(k)𝝃(k)22𝜷^(k)𝜷(k)22\|\widehat{\bm{\xi}}^{(k)}-\bm{\xi}^{(k)*}\|_{2}^{2}\lesssim\|\widehat{\bm{\beta}}^{(k)}-\bm{\beta}^{(k)*}\|_{2}^{2}. Only the last inequality in (S.5.440) holds with high probability and the others are deterministic. It comes from the fact that 𝝃^(k)𝝃(k)2c𝝃(k)2\|\widehat{\bm{\xi}}^{(k)}-\bm{\xi}^{(k)*}\|_{2}\leq c\leq\|\bm{\xi}^{(k)*}\|_{2} for some c>0c>0 with probability at least 1CK11-C^{\prime}K^{-1} and a direct application of Lemma 8.1 in \citeappcai2019chime. On the other hand, it is easy to see that d2(𝜽^(k),𝜽(k))\mathscr{B}\lesssim d^{2}(\widehat{\bm{\theta}}^{(k)},\bm{\theta}^{(k)*}). Combining these two facts leads to (S.5.422).
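As a numerical sanity check on the closed-form risk expressions (S.5.423)–(S.5.426), the following minimal Python sketch (with hypothetical parameter values, and assuming the standard discriminant parametrization \bm{\beta}=\bm{\Sigma}^{-1}(\bm{\mu}_{2}-\bm{\mu}_{1}) and \delta=\bm{\beta}^{\top}(\bm{\mu}_{1}+\bm{\mu}_{2})/2, which is consistent with the expressions above but stated here as an assumption) evaluates the plug-in and Bayes mis-clustering errors; numpy and scipy are assumed available.

```python
import numpy as np
from scipy.stats import norm

def misclustering_risk(w_hat, beta_hat, delta_hat, w_star, mu1, mu2, Sigma):
    # Closed-form risk of the linear rule "assign cluster 2 iff
    # beta_hat' z - delta_hat >= log((1 - w_hat) / w_hat)" under the true GMM
    # (1 - w_star) N(mu1, Sigma) + w_star N(mu2, Sigma); this mirrors the
    # two Phi terms in (S.5.423)-(S.5.424).
    s = np.sqrt(beta_hat @ Sigma @ beta_hat)
    a = np.log((1 - w_hat) / w_hat)
    err1 = norm.cdf((-a - delta_hat + beta_hat @ mu1) / s)  # error on cluster 1
    err2 = norm.cdf((a + delta_hat - beta_hat @ mu2) / s)   # error on cluster 2
    return (1 - w_star) * err1 + w_star * err2

p = 5
rng = np.random.default_rng(0)
mu1, mu2, Sigma, w_star = -np.ones(p) / 2, np.ones(p) / 2, np.eye(p), 0.4
beta_star = np.linalg.solve(Sigma, mu2 - mu1)   # assumed discriminant direction
delta_star = beta_star @ (mu1 + mu2) / 2        # assumed intercept
bayes = misclustering_risk(w_star, beta_star, delta_star, w_star, mu1, mu2, Sigma)
# Perturb the parameters; the excess risk should scale like the squared
# perturbation, in line with the claim (S.5.422).
beta_hat = beta_star + 0.1 * rng.standard_normal(p)
plugin = misclustering_risk(w_star + 0.05, beta_hat, delta_star + 0.1,
                            w_star, mu1, mu2, Sigma)
print(bayes, plugin - bayes)
```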

S.5.5 Proof of Theorem 4

S.5.5.1 Lemmas

Recall that for the GMM associated with parameter set \overline{\bm{\theta}}=(w,\bm{\mu}_{1},\bm{\mu}_{2},\bm{\Sigma}), we define the mis-clustering error rate of any classifier \mathcal{C} as R_{\overline{\bm{\theta}}}(\mathcal{C})=\min_{\pi:\{1,2\}\rightarrow\{1,2\}}\mathbb{P}_{\overline{\bm{\theta}}}(\mathcal{C}(Z^{\textup{new}})\neq\pi(Y^{\textup{new}})), where \mathbb{P}_{\overline{\bm{\theta}}} denotes the distribution of (Z^{\textup{new}},Y^{\textup{new}}), i.e., (1-w)\mathcal{N}(\bm{\mu}_{1},\bm{\Sigma})+w\mathcal{N}(\bm{\mu}_{2},\bm{\Sigma}). Denote by \mathcal{C}_{\overline{\bm{\theta}}} the Bayes classifier corresponding to \overline{\bm{\theta}}. Define the surrogate loss L_{\overline{\bm{\theta}}}(\mathcal{C})=\min_{\pi:\{1,2\}\rightarrow\{1,2\}}\mathbb{P}_{\overline{\bm{\theta}}}(\mathcal{C}(Z^{\textup{new}})\neq\pi(\mathcal{C}_{\overline{\bm{\theta}}}(Z^{\textup{new}}))).

Lemma 24.

Assume there exists a subset S such that \min_{k\in S}n_{k}\geq C(p\vee\log K) and \min_{k\in S}\Delta^{(k)}\geq\sigma^{2}>0 for some constant C>0. We have

inf{𝒞^(k)}k=1KsupS:|S|ssup{𝜽¯(k)}kSΘ¯SS\displaystyle\inf_{\{\widehat{\mathcal{C}}^{(k)}\}_{k=1}^{K}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S}\\ \mathbb{Q}_{S}\end{subarray}} (kS{R𝜽¯(k)(𝒞^(k))R𝜽¯(k)(𝒞𝜽¯(k))C1pnS+C2logKnk\displaystyle\mathbb{P}\Bigg{(}\bigcup_{k\in S}\bigg{\{}R_{\overline{\bm{\theta}}^{(k)*}}(\widehat{\mathcal{C}}^{(k)})-R_{\overline{\bm{\theta}}^{(k)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(k)*}})\geq C_{1}\frac{p}{n_{S}}+C_{2}\frac{\log K}{n_{k}} (S.5.441)
+C3h2p+logKnk+C4ϵ2maxkSnk})110.\displaystyle\quad\quad+C_{3}h^{2}\wedge\frac{p+\log K}{n_{k}}+C_{4}\frac{\epsilon^{2}}{\max_{k\in S}n_{k}}\bigg{\}}\Bigg{)}\geq\frac{1}{10}. (S.5.442)
Lemma 25.

Suppose \overline{\bm{\theta}}=(w,\bm{\mu}_{1},\bm{\mu}_{2},\bm{\beta},\bm{\Sigma}) satisfies \Delta^{2}\coloneqq(\bm{\mu}_{1}-\bm{\mu}_{2})^{\top}\bm{\Sigma}^{-1}(\bm{\mu}_{1}-\bm{\mu}_{2})\geq\sigma^{2}>0 for some constant \sigma^{2}>0, and w,w^{\prime}\in(c_{w},1-c_{w}). Then there exists c>0 such that

cL𝜽¯2(𝒞)R𝜽¯(𝒞)R𝜽¯(𝒞𝜽¯),cL_{\overline{\bm{\theta}}}^{2}(\mathcal{C})\leq R_{\overline{\bm{\theta}}}(\mathcal{C})-R_{\overline{\bm{\theta}}}(\mathcal{C}_{\overline{\bm{\theta}}}), (S.5.443)

for any classifier \mathcal{C}, where R_{\overline{\bm{\theta}}}(\mathcal{C})\coloneqq\min_{\pi:\{1,2\}\rightarrow\{1,2\}}\mathbb{P}_{\overline{\bm{\theta}}}(\mathcal{C}(\bm{z})\neq\pi(y)), L_{\overline{\bm{\theta}}}(\mathcal{C})\coloneqq\min_{\pi:\{1,2\}\rightarrow\{1,2\}}\mathbb{P}_{\overline{\bm{\theta}}}(\mathcal{C}(\bm{z})\neq\pi(\mathcal{C}_{\overline{\bm{\theta}}}(\bm{z}))), and \mathcal{C}_{\overline{\bm{\theta}}} is the corresponding Bayes classifier.

Lemma 26.

Consider \overline{\bm{\theta}}=(w,\bm{\mu}_{1},\bm{\mu}_{2},\bm{\beta},\bm{\Sigma}) and \overline{\bm{\theta}}^{\prime}=(w^{\prime},\bm{\mu}_{1},\bm{\mu}_{2},\bm{\beta},\bm{\Sigma}) satisfying \Delta^{2}\coloneqq(\bm{\mu}_{1}-\bm{\mu}_{2})^{\top}\bm{\Sigma}^{-1}(\bm{\mu}_{1}-\bm{\mu}_{2})\geq\sigma^{2}>0 for some constant \sigma^{2}>0. We have

c|ww|L𝜽¯(𝒞𝜽¯)c|ww|,c|w-w^{\prime}|\leq L_{\overline{\bm{\theta}}}(\mathcal{C}_{\overline{\bm{\theta}}^{\prime}})\leq c^{\prime}|w-w^{\prime}|, (S.5.444)

for some constants cc, c>0c^{\prime}>0.

Lemma 27.

Consider \overline{\bm{\theta}}=(w,\bm{\mu}_{1},\bm{\mu}_{2},\bm{\Sigma}) and \overline{\bm{\theta}}^{\prime}=(w,\bm{\mu}_{1}^{\prime},\bm{\mu}_{2}^{\prime},\bm{\Sigma}) satisfying w=1/2, \bm{\mu}_{1}=-\bm{\mu}_{0}/2+\bm{u}, \bm{\mu}_{2}=\bm{\mu}_{0}/2+\bm{u}, \bm{\mu}_{1}^{\prime}=-\bm{\mu}_{0}/2+\bm{u}^{\prime}, \bm{\mu}_{2}^{\prime}=\bm{\mu}_{0}/2+\bm{u}^{\prime}, \bm{\Sigma}=\bm{I}_{p}, \bm{\mu}_{0}=(1,\bm{0}_{p-1}^{\top})^{\top}, \bm{u}=(\widetilde{u},\bm{0}_{p-1}^{\top})^{\top}, and \bm{u}^{\prime}=(\widetilde{u}^{\prime},\bm{0}_{p-1}^{\top})^{\top}. We have

c|u~u~|L𝜽¯(𝒞𝜽¯)c|u~u~|,c|\widetilde{u}-\widetilde{u}^{\prime}|\leq L_{\overline{\bm{\theta}}}(\mathcal{C}_{\overline{\bm{\theta}}^{\prime}})\leq c^{\prime}|\widetilde{u}-\widetilde{u}^{\prime}|, (S.5.445)

for some constants cc, c>0c^{\prime}>0.

Lemma 28.

Denote ϵ~=Kss\widetilde{\epsilon}=\frac{K-s}{s}. We have

inf{𝒞^(k)}k=1KsupS:|S|ssup{𝜽¯(k)}kSΘ¯SS(maxkS[R𝜽¯(k)(𝒞^(k))R𝜽¯(k)(𝒞𝜽¯(k))]C1ϵ~2maxk=1:Knk)110.\inf_{\{\widehat{\mathcal{C}}^{(k)}\}_{k=1}^{K}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S}\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}\Bigg{(}\max_{k\in S}\left[R_{\overline{\bm{\theta}}^{(k)*}}(\widehat{\mathcal{C}}^{(k)})-R_{\overline{\bm{\theta}}^{(k)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(k)*}})\right]\geq C_{1}\frac{\widetilde{\epsilon}^{\prime 2}}{\max_{k=1:K}n_{k}}\Bigg{)}\geq\frac{1}{10}. (S.5.446)

S.5.5.2 Main proof of Theorem 4

Combining the conclusions of Lemmas 24 and 28 yields the lower bound.

S.5.5.3 Proof of lemmas

Proof of Lemma 24.

Recall the definitions and the proof idea of Lemma 19. We have \overline{\Theta}_{S}\supseteq\overline{\Theta}_{S,w}\cup\overline{\Theta}_{S,\delta}\cup\overline{\Theta}_{S,\bm{\beta}}, where

Θ¯S,w\displaystyle\overline{\Theta}_{S,w} ={{𝜽¯(k)}kS:𝝁(k)1=𝟏p/p,𝝁(k)2=𝝁(k)1=𝝁~,𝚺(k)=𝑰p,w(k)(cw,1cw)},\displaystyle=\Big{\{}\{\overline{\bm{\theta}}^{(k)}\}_{k\in S}:\bm{\mu}^{(k)}_{1}=\bm{1}_{p}/\sqrt{p},\bm{\mu}^{(k)}_{2}=-\bm{\mu}^{(k)}_{1}=\widetilde{\bm{\mu}},\bm{\Sigma}^{(k)}=\bm{I}_{p},w^{(k)}\in(c_{w},1-c_{w})\Big{\}}, (S.5.447)
Θ¯S,𝜷\displaystyle\overline{\Theta}_{S,\bm{\beta}} ={{𝜽¯(k)}kS:𝚺(k)=𝑰p,w(k)=12,𝝁(k)12𝝁(k)22M,𝝁(k)2=𝝁(k)1,\displaystyle=\Big{\{}\{\overline{\bm{\theta}}^{(k)}\}_{k\in S}:\bm{\Sigma}^{(k)}=\bm{I}_{p},w^{(k)}=\frac{1}{2},\|\bm{\mu}^{(k)}_{1}\|_{2}\vee\|\bm{\mu}^{(k)}_{2}\|_{2}\leq M,\bm{\mu}^{(k)}_{2}=-\bm{\mu}^{(k)}_{1}, (S.5.448)
min𝜷maxkS𝜷(k)𝜷2h},\displaystyle\hskip 85.35826pt\min_{\bm{\beta}}\max_{k\in S}\|\bm{\beta}^{(k)}-\bm{\beta}\|_{2}\leq h\Big{\}}, (S.5.449)
Θ¯S,δ\displaystyle\overline{\Theta}_{S,\delta} ={{𝜽¯(k)}kS:𝚺(k)=𝑰p,w(k)=12,𝝁(k)12𝝁(k)22M,𝝁(k)1=12𝝁0,𝝁(k)2=12𝝁0+𝒖,\displaystyle=\Big{\{}\{\overline{\bm{\theta}}^{(k)}\}_{k\in S}:\bm{\Sigma}^{(k)}=\bm{I}_{p},w^{(k)}=\frac{1}{2},\|\bm{\mu}^{(k)}_{1}\|_{2}\vee\|\bm{\mu}^{(k)}_{2}\|_{2}\leq M,\bm{\mu}^{(k)}_{1}=-\frac{1}{2}\bm{\mu}_{0},\bm{\mu}^{(k)}_{2}=\frac{1}{2}\bm{\mu}_{0}+\bm{u}, (S.5.450)
𝒖21}.\displaystyle\hskip 85.35826pt\|\bm{u}\|_{2}\leq 1\Big{\}}. (S.5.451)

Recall that the mis-clustering error of any classifier \mathcal{C} for the GMM associated with parameter set \overline{\bm{\theta}} is R_{\overline{\bm{\theta}}}(\mathcal{C})=\min_{\pi:\{1,2\}\rightarrow\{1,2\}}\mathbb{P}_{\overline{\bm{\theta}}}(\mathcal{C}(Z)\neq\pi(Y)). To facilitate the analysis, following \citeappazizyan2013minimax and \citeappcai2019chime, we define the surrogate loss L_{\overline{\bm{\theta}}}(\mathcal{C})=\min_{\pi:\{1,2\}\rightarrow\{1,2\}}\mathbb{P}_{\overline{\bm{\theta}}}(\mathcal{C}(Z)\neq\pi(\mathcal{C}_{\overline{\bm{\theta}}}(Z))), where \mathcal{C}_{\overline{\bm{\theta}}} is the Bayes classifier. Suppose \sigma=\sqrt{0.005}.

(i) We want to show

inf{𝒞^(k)}k=1KsupS:|S|ssup{𝜽¯(k)}kSΘ¯SS(kS{R𝜽¯(k)(𝒞^(k))R𝜽¯(k)(𝒞𝜽¯(k))CpnS})14.\inf_{\{\widehat{\mathcal{C}}^{(k)}\}_{k=1}^{K}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S}\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}\Bigg{(}\bigcup_{k\in S}\bigg{\{}R_{\overline{\bm{\theta}}^{(k)*}}(\widehat{\mathcal{C}}^{(k)})-R_{\overline{\bm{\theta}}^{(k)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(k)*}})\geq C\sqrt{\frac{p}{n_{S}}}\bigg{\}}\Bigg{)}\geq\frac{1}{4}. (S.5.452)

Consider S=1:K and the parameter space \overline{\Theta}_{0}=\{\{\overline{\bm{\theta}}^{(k)}\}_{k=1}^{K}:\bm{\Sigma}^{(k)}=\bm{I}_{p},w^{(k)}=1/2,\bm{\mu}^{(k)}_{1}=\bm{\mu}_{1},\bm{\mu}^{(k)}_{2}=\bm{\mu}_{2},\|\bm{\mu}_{1}\|_{2}\vee\|\bm{\mu}_{2}\|_{2}\leq M\}. Then

LHS of (S.5.452)inf𝒞^(1)sup{𝜽¯(k)}k=1KΘ¯0(R𝜽¯(1)(𝒞^(1))R𝜽¯(1)(𝒞𝜽¯(1))CpnS).\text{LHS of \eqref{eq: lemma 19 eq 1}}\geq\inf_{\widehat{\mathcal{C}}^{(1)}}\sup_{\{\overline{\bm{\theta}}^{(k)*}\}_{k=1}^{K}\in\overline{\Theta}_{0}}\mathbb{P}\Bigg{(}R_{\overline{\bm{\theta}}^{(1)*}}(\widehat{\mathcal{C}}^{(1)})-R_{\overline{\bm{\theta}}^{(1)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(1)*}})\geq C\sqrt{\frac{p}{n_{S}}}\Bigg{)}. (S.5.453)

Let r=c\sqrt{p/n_{S}}\leq 0.001 with some small constant c>0. For any \bm{\mu}\in\mathbb{R}^{p}, denote the distribution \frac{1}{2}\mathcal{N}(\bm{\mu},\bm{I}_{p})+\frac{1}{2}\mathcal{N}(-\bm{\mu},\bm{I}_{p}) by \mathbb{P}_{\bm{\mu}}. Consider an r/4-packing of r\mathcal{S}^{p-1}: \{\widetilde{\bm{v}}_{j}\}_{j=1}^{N}. By Lemma 2, N\geq 4^{p-1}. Denote \widetilde{\bm{\mu}}_{j}=(\sigma,\widetilde{\bm{v}}_{j}^{\top})^{\top}\in\mathbb{R}^{p}, where \sigma=\sqrt{0.005}. Then by the definition of KL divergence and Lemma 8.4 in \citeappcai2019chime,

KL(kS𝝁~jnkSkS𝝁~jnkS)\displaystyle\text{KL}\left(\prod_{k\in S}\mathbb{P}_{\widetilde{\bm{\mu}}_{j}}^{\otimes n_{k}}\cdot\mathbb{Q}_{S}\bigg{\|}\prod_{k\in S}\mathbb{P}_{\widetilde{\bm{\mu}}_{j^{\prime}}}^{\otimes n_{k}}\cdot\mathbb{Q}_{S}\right) =kSnkKL(𝝁~j𝝁~j)\displaystyle=\sum_{k\in S}n_{k}\text{KL}(\mathbb{P}_{\widetilde{\bm{\mu}}_{j}}\|\mathbb{P}_{\widetilde{\bm{\mu}}_{j^{\prime}}}) (S.5.454)
nS8(1+σ2)𝝁~j𝝁~j22\displaystyle\leq n_{S}\cdot 8(1+\sigma^{2})\|\widetilde{\bm{\mu}}_{j}-\widetilde{\bm{\mu}}_{j^{\prime}}\|_{2}^{2} (S.5.455)
32(1+σ2)nSr2\displaystyle\leq 32(1+\sigma^{2})n_{S}r^{2} (S.5.456)
32(1+σ2)nSc22(p1)nS\displaystyle\leq 32(1+\sigma^{2})n_{S}\cdot c^{2}\frac{2(p-1)}{n_{S}} (S.5.457)
32(1+σ2)c2log2logN.\displaystyle\leq\frac{32(1+\sigma^{2})c^{2}}{\log 2}\log N. (S.5.458)

For simplicity, we write L𝜽¯L_{\overline{\bm{\theta}}} with 𝜽¯Θ¯0\overline{\bm{\theta}}\in\overline{\Theta}_{0} and 𝝁1=𝝁2=𝝁\bm{\mu}_{1}=-\bm{\mu}_{2}=\bm{\mu} as L𝝁L_{\bm{\mu}}. By Lemma 8.5 in \citeappcai2019chime,

L𝝁~i(𝒞𝝁~j)12g(σ2+r22)𝝁~i𝝁~j2𝝁~i2120.15r/4σ2+r22r,L_{\widetilde{\bm{\mu}}_{i}}(\mathcal{C}_{\widetilde{\bm{\mu}}_{j}})\geq\frac{1}{\sqrt{2}}g\left(\frac{\sqrt{\sigma^{2}+r^{2}}}{2}\right)\frac{\|\widetilde{\bm{\mu}}_{i}-\widetilde{\bm{\mu}}_{j}\|_{2}}{\|\widetilde{\bm{\mu}}_{i}\|_{2}}\geq\frac{1}{\sqrt{2}}\cdot 0.15\cdot\frac{r/4}{\sqrt{\sigma^{2}+r^{2}}}\geq 2r, (S.5.459)

where g(x)=ϕ(x)[ϕ(x)xΦ(x)]g(x)=\phi(x)[\phi(x)-x\Phi(x)]. The last inequality holds because σ2+r22σ\sqrt{\sigma^{2}+r^{2}}\geq\sqrt{2}\sigma and g(σ2+r2/2)0.15g(\sqrt{\sigma^{2}+r^{2}}/2)\geq 0.15 when r2σ2=0.001r^{2}\leq\sigma^{2}=0.001. Then by Lemma 3.5 in \citeappcai2019chime (Proposition 2 in \citealpappazizyan2013minimax), for any classifier 𝒞\mathcal{C}, and iji\neq j,

L𝝁~i(𝒞)+L𝝁~j(𝒞)L𝝁~i(𝒞𝝁~j)KL(𝝁~i𝝁~j)/22rr=cpnS.L_{\widetilde{\bm{\mu}}_{i}}(\mathcal{C})+L_{\widetilde{\bm{\mu}}_{j}}(\mathcal{C})\geq L_{\widetilde{\bm{\mu}}_{i}}(\mathcal{C}_{\widetilde{\bm{\mu}}_{j}})-\sqrt{\text{KL}(\mathbb{P}_{\widetilde{\bm{\mu}}_{i}}\|\mathbb{P}_{\widetilde{\bm{\mu}}_{j}})/2}\geq 2r-r=c\sqrt{\frac{p}{n_{S}}}. (S.5.460)

For any \widehat{\mathcal{C}}^{(1)}, consider the test \psi^{*}=\operatorname*{arg\,min}_{j=1:N}L_{\widetilde{\bm{\mu}}_{j}}(\widehat{\mathcal{C}}^{(1)}). Then, if there exists j_{0} such that L_{\widetilde{\bm{\mu}}_{j_{0}}}(\widehat{\mathcal{C}}^{(1)})<\frac{c}{2}\sqrt{\frac{p}{n_{S}}}, by (S.5.460) we must have \psi^{*}=j_{0}. Let C_{1}\leq c/2. Then by Fano’s lemma (Corollary 6 in \citealpapptsybakov2009introduction),

inf𝒞^(1)sup{𝜽¯(k)}k=1KΘ¯0(L𝜽¯(1)(𝒞^(1))C1pnS)\displaystyle\inf_{\widehat{\mathcal{C}}^{(1)}}\sup_{\{\overline{\bm{\theta}}^{(k)*}\}_{k=1}^{K}\in\overline{\Theta}_{0}}\mathbb{P}\Bigg{(}L_{\overline{\bm{\theta}}^{(1)*}}(\widehat{\mathcal{C}}^{(1)})\geq C_{1}\sqrt{\frac{p}{n_{S}}}\Bigg{)} inf𝒞^(1)supj=1:N(L𝝁~(j)(𝒞^(1))C1pnS)\displaystyle\geq\inf_{\widehat{\mathcal{C}}^{(1)}}\sup_{j=1:N}\mathbb{P}\Bigg{(}L_{\widetilde{\bm{\mu}}^{(j)}}(\widehat{\mathcal{C}}^{(1)})\geq C_{1}\sqrt{\frac{p}{n_{S}}}\Bigg{)} (S.5.461)
inf𝒞^(1)supj=1:N(ψj)\displaystyle\geq\inf_{\widehat{\mathcal{C}}^{(1)}}\sup_{j=1:N}\mathbb{P}\Bigg{(}\psi^{*}\neq j\Bigg{)} (S.5.462)
infψsupj=1:N(ψj)\displaystyle\geq\inf_{\psi}\sup_{j=1:N}\mathbb{P}\Bigg{(}\psi\neq j\Bigg{)} (S.5.463)
1log2logN32(1+σ2)c2log2\displaystyle\geq 1-\frac{\log 2}{\log N}-\frac{32(1+\sigma^{2})c^{2}}{\log 2} (S.5.464)
14,\displaystyle\geq\frac{1}{4}, (S.5.465)

when p\geq 2 and c=\sqrt{\frac{\log 2}{128(1+\sigma^{2})}}. Then apply Lemma 25 to obtain (S.5.452).

(ii) We want to show

inf{𝒞^(k)}k=1KsupS:|S|ssup{𝜽¯(k)}kSΘ¯S,𝜷S(kS{R𝜽¯(k)(𝒞^(k))R𝜽¯(k)(𝒞𝜽¯(k))Chpnk})14.\inf_{\{\widehat{\mathcal{C}}^{(k)}\}_{k=1}^{K}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S,\bm{\beta}}\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}\Bigg{(}\bigcup_{k\in S}\bigg{\{}R_{\overline{\bm{\theta}}^{(k)*}}(\widehat{\mathcal{C}}^{(k)})-R_{\overline{\bm{\theta}}^{(k)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(k)*}})\geq Ch\wedge\sqrt{\frac{p}{n_{k}}}\bigg{\}}\Bigg{)}\geq\frac{1}{4}. (S.5.466)

Fix an S and a \mathbb{Q}_{S}, and suppose 1\in S. We have

LHS of (S.5.466)inf𝒞^(1)sup{𝜽¯(k)}kSΘ¯S,𝜷S(R𝜽¯(1)(𝒞^(1))R𝜽¯(1)(𝒞𝜽¯(1))Chpnk).\text{LHS of \eqref{eq: lemma 19 eq 3}}\geq\inf_{\widehat{\mathcal{C}}^{(1)}}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S,\bm{\beta}}\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}\bigg{(}R_{\overline{\bm{\theta}}^{(1)*}}(\widehat{\mathcal{C}}^{(1)})-R_{\overline{\bm{\theta}}^{(1)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(1)*}})\geq Ch\wedge\sqrt{\frac{p}{n_{k}}}\Bigg{)}. (S.5.467)

Let r=h\wedge(c\sqrt{p/n_{1}})\wedge M with a small constant c>0. For any \bm{\mu}\in\mathbb{R}^{p}, denote the distribution \frac{1}{2}\mathcal{N}(\bm{\mu},\bm{I}_{p})+\frac{1}{2}\mathcal{N}(-\bm{\mu},\bm{I}_{p}) by \mathbb{P}_{\bm{\mu}}. Consider an r/4-packing \{\widetilde{\bm{v}}_{j}\}_{j=1}^{N} of r\mathcal{S}^{p-1}. By Lemma 2, N\geq 4^{p-1}. Denote \widetilde{\bm{\mu}}_{j}=(\sigma,\widetilde{\bm{v}}_{j}^{\top})^{\top}\in\mathbb{R}^{p}. WLOG, assume M\geq 2. Let \bm{\mu}^{(k)*}_{1}=\widetilde{\bm{\mu}}=(\sigma,\bm{0}_{p-1})^{\top} for all k\in S\backslash\{1\}. Then by following the same arguments as in part (i) above and part (ii) of the proof of Lemma 19, we can show that the RHS of (S.5.467) is at least 1/4 when p\geq 3.

(iii) We want to show

inf{𝒞^(k)}k=1KsupS:|S|ssup{𝜽¯(k)}kSΘ¯S,𝜷S(kS{R𝜽¯(k)(𝒞^(k))R𝜽¯(k)(𝒞𝜽¯(k))ChlogKnk})14.\inf_{\{\widehat{\mathcal{C}}^{(k)}\}_{k=1}^{K}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S,\bm{\beta}}\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}\Bigg{(}\bigcup_{k\in S}\bigg{\{}R_{\overline{\bm{\theta}}^{(k)*}}(\widehat{\mathcal{C}}^{(k)})-R_{\overline{\bm{\theta}}^{(k)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(k)*}})\geq Ch\wedge\sqrt{\frac{\log K}{n_{k}}}\bigg{\}}\Bigg{)}\geq\frac{1}{4}. (S.5.468)

This can be proved by following ideas similar to those used in step (iii) of the proof of Lemma 19, so we omit the details here.

(iv) We want to show

inf{𝒞^(k)}k=1KsupS:|S|ssup{𝜽¯(k)}kSΘ¯S,wS(kS{R𝜽¯(k)(𝒞^(k))R𝜽¯(k)(𝒞𝜽¯(k))ClogKnk})14.\inf_{\{\widehat{\mathcal{C}}^{(k)}\}_{k=1}^{K}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S,w}\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}\Bigg{(}\bigcup_{k\in S}\bigg{\{}R_{\overline{\bm{\theta}}^{(k)*}}(\widehat{\mathcal{C}}^{(k)})-R_{\overline{\bm{\theta}}^{(k)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(k)*}})\geq C\sqrt{\frac{\log K}{n_{k}}}\bigg{\}}\Bigg{)}\geq\frac{1}{4}. (S.5.469)

This can be similarly proved by following the arguments in part (i) with Lemmas 25 and 26.

(v) We want to show

inf{𝒞^(k)}k=1KsupS:|S|ssup{𝜽¯(k)}kSΘ¯S,δS(kS{R𝜽¯(k)(𝒞^(k))R𝜽¯(k)(𝒞𝜽¯(k))ClogKnk})14.\inf_{\{\widehat{\mathcal{C}}^{(k)}\}_{k=1}^{K}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S,\delta}\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}\Bigg{(}\bigcup_{k\in S}\bigg{\{}R_{\overline{\bm{\theta}}^{(k)*}}(\widehat{\mathcal{C}}^{(k)})-R_{\overline{\bm{\theta}}^{(k)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(k)*}})\geq C\sqrt{\frac{\log K}{n_{k}}}\bigg{\}}\Bigg{)}\geq\frac{1}{4}. (S.5.470)

This can be similarly proved by following the arguments in part (i) with Lemmas 25 and 27.

Finally, we get the desired conclusion by combining (i)-(v). ∎

Proof of Lemma 25.

We follow a proof idea similar to that of Lemma 3.4 in \citeappcai2019chime. Let \phi_{1} and \phi_{2} be the densities of \mathcal{N}(\bm{\mu}_{1},\bm{\Sigma}) and \mathcal{N}(\bm{\mu}_{2},\bm{\Sigma}), respectively. Denote \eta_{\overline{\bm{\theta}}}(\bm{z})=\frac{(1-w)\phi_{1}(\bm{z})}{(1-w)\phi_{1}(\bm{z})+w\phi_{2}(\bm{z})} and S_{\mathcal{C}}=\{\bm{z}\in\mathbb{R}^{p}:\mathcal{C}(\bm{z})=1\} for any classifier \mathcal{C}. Note that S_{\mathcal{C}_{\overline{\bm{\theta}}}}=\{\bm{z}\in\mathbb{R}^{p}:(1-w)\phi_{1}(\bm{z})\geq w\phi_{2}(\bm{z})\}. The permutation does not actually matter in the proof, so WLOG we drop the permutations in the definitions of the misclassification error and the surrogate loss by taking \pi to be the identity map. If \pi is not the identity in the definition of R_{\overline{\bm{\theta}}}(\mathcal{C}), for example, we can define S_{\mathcal{C}}=\{\bm{z}\in\mathbb{R}^{p}:\mathcal{C}(\bm{z})=2\} instead, and all the following steps still go through.

By definition,

𝜽¯(𝒞(𝒛)y)\displaystyle\mathbb{P}_{\overline{\bm{\theta}}}(\mathcal{C}(\bm{z})\neq y) =(1w)S𝒞cϕ1d𝒛+wS𝒞ϕ2d𝒛,\displaystyle=(1-w)\int_{S_{\mathcal{C}}^{c}}\phi_{1}d\bm{z}+w\int_{S_{\mathcal{C}}}\phi_{2}d\bm{z}, (S.5.471)
𝜽¯(𝒞𝜽¯(𝒛)y)\displaystyle\mathbb{P}_{\overline{\bm{\theta}}}(\mathcal{C}_{\overline{\bm{\theta}}}(\bm{z})\neq y) =(1w)S𝒞𝜽¯cϕ1d𝒛+wS𝒞𝜽¯ϕ2d𝒛,\displaystyle=(1-w)\int_{S_{\mathcal{C}_{\overline{\bm{\theta}}}}^{c}}\phi_{1}d\bm{z}+w\int_{S_{\mathcal{C}_{\overline{\bm{\theta}}}}}\phi_{2}d\bm{z}, (S.5.472)

which leads to

𝜽¯(𝒞(𝒛)y)𝜽¯(𝒞𝜽¯(𝒛)y)\displaystyle\mathbb{P}_{\overline{\bm{\theta}}}(\mathcal{C}(\bm{z})\neq y)-\mathbb{P}_{\overline{\bm{\theta}}}(\mathcal{C}_{\overline{\bm{\theta}}}(\bm{z})\neq y) =S𝒞𝜽¯\S𝒞[(1w)ϕ1wϕ2]d𝒛+S𝒞𝜽¯c\S𝒞c[wϕ2(1w)ϕ1]d𝒛\displaystyle=\int_{S_{\mathcal{C}_{\overline{\bm{\theta}}}}\backslash S_{\mathcal{C}}}[(1-w)\phi_{1}-w\phi_{2}]d\bm{z}+\int_{S_{\mathcal{C}_{\overline{\bm{\theta}}}}^{c}\backslash S_{\mathcal{C}}^{c}}[w\phi_{2}-(1-w)\phi_{1}]d\bm{z} (S.5.473)
=S𝒞𝜽¯S𝒞|(1w)ϕ1wϕ2|d𝒛\displaystyle=\int_{S_{\mathcal{C}_{\overline{\bm{\theta}}}}\triangle S_{\mathcal{C}}}|(1-w)\phi_{1}-w\phi_{2}|d\bm{z} (S.5.474)
=𝔼𝒛(1w)ϕ1+wϕ2[|2η𝜽¯(𝒛)1|𝟙(S𝒞𝜽¯S𝒞)]\displaystyle=\mathbb{E}_{\bm{z}\sim(1-w)\phi_{1}+w\phi_{2}}\big{[}\left|2\eta_{\overline{\bm{\theta}}}(\bm{z})-1\right|\mathds{1}(S_{\mathcal{C}_{\overline{\bm{\theta}}}}\triangle S_{\mathcal{C}})\big{]} (S.5.475)
2t𝜽¯(S𝒞𝜽¯S𝒞,|2η𝜽¯(𝒛)1|>2t)\displaystyle\geq 2t\cdot\mathbb{P}_{\overline{\bm{\theta}}}\left(S_{\mathcal{C}_{\overline{\bm{\theta}}}}\triangle S_{\mathcal{C}},\left|2\eta_{\overline{\bm{\theta}}}(\bm{z})-1\right|>2t\right) (S.5.476)
=2t[𝜽¯(S𝒞𝜽¯S𝒞)𝜽¯(|2η𝜽¯(𝒛)1|2t)]\displaystyle=2t\left[\mathbb{P}_{\overline{\bm{\theta}}}(S_{\mathcal{C}_{\overline{\bm{\theta}}}}\triangle S_{\mathcal{C}})-\mathbb{P}_{\overline{\bm{\theta}}}(\left|2\eta_{\overline{\bm{\theta}}}(\bm{z})-1\right|\leq 2t)\right] (S.5.477)
2t[𝜽¯(S𝒞𝜽¯S𝒞)ct]\displaystyle\geq 2t\left[\mathbb{P}_{\overline{\bm{\theta}}}(S_{\mathcal{C}_{\overline{\bm{\theta}}}}\triangle S_{\mathcal{C}})-ct\right] (S.5.478)
12c𝜽¯2(S𝒞𝜽¯S𝒞),\displaystyle\geq\frac{1}{2c}\mathbb{P}_{\overline{\bm{\theta}}}^{2}(S_{\mathcal{C}_{\overline{\bm{\theta}}}}\triangle S_{\mathcal{C}}), (S.5.479)

where we let t=\frac{1}{2c}\mathbb{P}_{\overline{\bm{\theta}}}(S_{\mathcal{C}_{\overline{\bm{\theta}}}}\triangle S_{\mathcal{C}}) with c=1+\frac{8}{\sqrt{2\pi}\sigma}. This completes the proof, once we verify the second-to-last inequality; it relies on the fact that

𝜽¯(|η𝜽¯(𝒛)1/2|t)ct,\mathbb{P}_{\overline{\bm{\theta}}}(\left|\eta_{\overline{\bm{\theta}}}(\bm{z})-1/2\right|\leq t)\leq ct, (S.5.480)

holds for all t1/(2c)t\leq 1/(2c). This is because

𝜽¯(|η𝜽¯(𝒛)1/2|t)\displaystyle\mathbb{P}_{\overline{\bm{\theta}}}(\left|\eta_{\overline{\bm{\theta}}}(\bm{z})-1/2\right|\leq t) (S.5.481)
=𝜽¯(log(w1w)+log(12t1+2t)log(ϕ1ϕ2(𝒛))log(w1w)+log(1+2t12t))\displaystyle=\mathbb{P}_{\overline{\bm{\theta}}}\left(\log\left(\frac{w}{1-w}\right)+\log\left(\frac{1-2t}{1+2t}\right)\leq\log\left(\frac{\phi_{1}}{\phi_{2}}(\bm{z})\right)\leq\log\left(\frac{w}{1-w}\right)+\log\left(\frac{1+2t}{1-2t}\right)\right) (S.5.482)
=𝜽¯(log(w1w)+log(12t1+2t)(𝝁1𝝁2)𝚺1(𝒛𝝁1+𝝁22)\displaystyle=\mathbb{P}_{\overline{\bm{\theta}}}\bigg{(}\log\left(\frac{w}{1-w}\right)+\log\left(\frac{1-2t}{1+2t}\right)\leq(\bm{\mu}_{1}-\bm{\mu}_{2})^{\top}\bm{\Sigma}^{-1}\left(\bm{z}-\frac{\bm{\mu}_{1}+\bm{\mu}_{2}}{2}\right) (S.5.483)
log(w1w)+log(1+2t12t))\displaystyle\hskip 187.78836pt\leq\log\left(\frac{w}{1-w}\right)+\log\left(\frac{1+2t}{1-2t}\right)\bigg{)} (S.5.484)
=12𝜽¯(log(w1w)+log(12t1+2t)𝒩(Δ2/2,Δ2)log(w1w)+log(1+2t12t))\displaystyle=\frac{1}{2}\mathbb{P}_{\overline{\bm{\theta}}}\bigg{(}\log\left(\frac{w}{1-w}\right)+\log\left(\frac{1-2t}{1+2t}\right)\leq\mathcal{N}(\Delta^{2}/2,\Delta^{2})\leq\log\left(\frac{w}{1-w}\right)+\log\left(\frac{1+2t}{1-2t}\right)\bigg{)} (S.5.485)
+12𝜽¯(log(w1w)+log(12t1+2t)𝒩(Δ2/2,Δ2)log(w1w)+log(1+2t12t))\displaystyle\quad+\frac{1}{2}\mathbb{P}_{\overline{\bm{\theta}}}\bigg{(}\log\left(\frac{w}{1-w}\right)+\log\left(\frac{1-2t}{1+2t}\right)\leq\mathcal{N}(-\Delta^{2}/2,\Delta^{2})\leq\log\left(\frac{w}{1-w}\right)+\log\left(\frac{1+2t}{1-2t}\right)\bigg{)} (S.5.486)
12πσ[log(1+2t12t)log(12t1+2t)]\displaystyle\leq\frac{1}{\sqrt{2\pi}\sigma}\cdot\left[\log\left(\frac{1+2t}{1-2t}\right)-\log\left(\frac{1-2t}{1+2t}\right)\right] (S.5.487)
12πσ8t12t\displaystyle\leq\frac{1}{\sqrt{2\pi}\sigma}\cdot\frac{8t}{1-2t} (S.5.488)
ct,\displaystyle\leq ct, (S.5.489)

when t\leq 1/(2c). Note that (S.5.489) implies that a binary GMM under the separation assumption \Delta\gtrsim 1 satisfies Tsybakov’s margin condition with margin parameter 1; for the notion of Tsybakov’s margin condition, see \citeappaudibert2007fast. We will prove a more general result showing that a multi-cluster GMM under the separation assumption also satisfies Tsybakov’s margin condition with margin parameter 1, which turns out to be useful in proving the upper and lower bounds on the misclassification error. ∎
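As an illustrative numerical check (not part of the proof) of the margin behavior in (S.5.489), the following small Monte Carlo sketch uses hypothetical parameters with w=1/2, \bm{\mu}_{1}=-\bm{\mu}_{2}=\bm{\mu}, and \bm{\Sigma}=\bm{I}_{p}, for which \log(\phi_{1}/\phi_{2})(\bm{z})=2\bm{\mu}^{\top}\bm{z}; the estimated probability \mathbb{P}(|\eta_{\overline{\bm{\theta}}}(\bm{z})-1/2|\leq t) should grow roughly linearly in t.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 3, 200_000
mu = np.array([1.0, 0.0, 0.0])          # hypothetical mean; Delta = 2*||mu|| = 2
w = 0.5
from_cluster2 = rng.random(n) < w       # latent labels
Z = np.where(from_cluster2[:, None], -mu, mu) + rng.standard_normal((n, p))
# eta(z) = phi_1(z) / (phi_1(z) + phi_2(z)) with phi_1 = N(mu, I), phi_2 = N(-mu, I),
# so eta(z) = sigmoid(2 mu' z) when w = 1/2.
eta = 1.0 / (1.0 + np.exp(-2.0 * Z @ mu))
for t in (0.02, 0.05, 0.10):
    print(t, np.mean(np.abs(eta - 0.5) <= t))  # roughly proportional to t
```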

Proof of Lemma 26.

WLOG, suppose w\geq w^{\prime}. Similarly to (S.5.486), it is easy to see that

\displaystyle L_{\overline{\bm{\theta}}}(\mathcal{C}_{\overline{\bm{\theta}}^{\prime}})=(1-w)\mathbb{P}_{\overline{\bm{\theta}}}\left(\mathcal{C}_{\overline{\bm{\theta}}}(\bm{z})\neq\mathcal{C}_{\overline{\bm{\theta}}^{\prime}}(\bm{z})\,|\,y=1\right)+w\mathbb{P}_{\overline{\bm{\theta}}}\left(\mathcal{C}_{\overline{\bm{\theta}}}(\bm{z})\neq\mathcal{C}_{\overline{\bm{\theta}}^{\prime}}(\bm{z})\,|\,y=2\right) (S.5.490)
=(1w)(log(w1w)𝒩(Δ2/2,Δ2)log(w1w))\displaystyle=(1-w)\mathbb{P}\left(\log\left(\frac{w^{\prime}}{1-w^{\prime}}\right)\leq\mathcal{N}(\Delta^{2}/2,\Delta^{2})\leq\log\left(\frac{w}{1-w}\right)\right) (S.5.491)
+w(log(w1w)𝒩(Δ2/2,Δ2)log(w1w))\displaystyle\quad+w\mathbb{P}\left(\log\left(\frac{w^{\prime}}{1-w^{\prime}}\right)\leq\mathcal{N}(-\Delta^{2}/2,\Delta^{2})\leq\log\left(\frac{w}{1-w}\right)\right) (S.5.492)
12πσ[log(w1w)log(w1w)]\displaystyle\leq\frac{1}{\sqrt{2\pi}\sigma}\left[\log\left(\frac{w}{1-w}\right)-\log\left(\frac{w^{\prime}}{1-w^{\prime}}\right)\right] (S.5.493)
=12πcw(1cw)σ|ww|.\displaystyle=\frac{1}{\sqrt{2\pi}c_{w}(1-c_{w})\sigma}\cdot|w-w^{\prime}|. (S.5.494)

On the other hand,

(S.5.492)\displaystyle\eqref{eq: lemma 26 eq} 12πMc𝚺exp{12σ2[log(1cwcw)+12M2c𝚺]2}[log(w1w)log(w1w)]\displaystyle\geq\frac{1}{\sqrt{2\pi}Mc_{\bm{\Sigma}}}\cdot\exp\left\{-\frac{1}{2\sigma^{2}}\left[\log\left(\frac{1-c_{w}}{c_{w}}\right)+\frac{1}{2}M^{2}c_{\bm{\Sigma}}\right]^{2}\right\}\left[\log\left(\frac{w}{1-w}\right)-\log\left(\frac{w^{\prime}}{1-w^{\prime}}\right)\right] (S.5.495)
12πMc𝚺cw(1cw)exp{12σ2[log(1cwcw)+12M2c𝚺]2}|ww|,\displaystyle\geq\frac{1}{\sqrt{2\pi}Mc_{\bm{\Sigma}}c_{w}(1-c_{w})}\cdot\exp\left\{-\frac{1}{2\sigma^{2}}\left[\log\left(\frac{1-c_{w}}{c_{w}}\right)+\frac{1}{2}M^{2}c_{\bm{\Sigma}}\right]^{2}\right\}|w-w^{\prime}|, (S.5.496)

which completes the proof. ∎

Proof of Lemma 28.

By Lemma 25, it suffices to prove

inf{𝒞^(k)}k=1KsupS:|S|ssup{𝜽¯(k)}kSΘ¯SS(maxkSL𝜽¯(k)(𝒞^(k))C1ϵ~2maxk=1:Knk)110.\inf_{\{\widehat{\mathcal{C}}^{(k)}\}_{k=1}^{K}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in S}\in\overline{\Theta}_{S}\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}\Bigg{(}\max_{k\in S}L_{\overline{\bm{\theta}}^{(k)*}}(\widehat{\mathcal{C}}^{(k)})\geq C_{1}\frac{\widetilde{\epsilon}^{\prime 2}}{\max_{k=1:K}n_{k}}\Bigg{)}\geq\frac{1}{10}. (S.5.497)

For any \bm{\mu}\in\mathbb{R}^{p}, denote the distribution \frac{1}{2}\mathcal{N}(\bm{\mu},\bm{I}_{p})+\frac{1}{2}\mathcal{N}(-\bm{\mu},\bm{I}_{p}) by \mathbb{P}_{\bm{\mu}}. For simplicity, we write L_{\overline{\bm{\theta}}} with \overline{\bm{\theta}} satisfying \bm{\mu}_{1}=-\bm{\mu}_{2}=\bm{\mu}, w=1/2, and \bm{\Sigma}=\bm{I}_{p} as L_{\bm{\mu}}. We use L_{\bm{\mu}}(\mathcal{C}_{\bm{\mu}^{\prime}}) as the loss function between \bm{\mu} and \bm{\mu}^{\prime} in Lemmas 21 and 22. For \|\bm{\mu}\|_{2}=\|\bm{\mu}^{\prime}\|_{2}=1, by Lemma 15, note that

maxk=1:KKL(𝝁nk𝝁nk)8maxk=1:Knk𝝁𝝁22.\max_{k=1:K}\text{KL}(\mathbb{P}_{\bm{\mu}}^{\otimes n_{k}}\|\mathbb{P}_{\bm{\mu}^{\prime}}^{\otimes n_{k}})\leq 8\max_{k=1:K}n_{k}\cdot\|\bm{\mu}-\bm{\mu}^{\prime}\|_{2}^{2}. (S.5.498)

By Lemma 8.5 in \citeappcai2019chime, this implies that, for some constants c,C>0,

sup{L𝝁(𝒞𝝁):maxk=1:KKL(𝝁nk𝝁nk)(ϵ~/(1ϵ~))2}\displaystyle\sup\left\{L_{\bm{\mu}}(\mathcal{C}_{\bm{\mu}^{\prime}}):\max_{k=1:K}\text{KL}(\mathbb{P}_{\bm{\mu}}^{\otimes n_{k}}\|\mathbb{P}_{\bm{\mu}^{\prime}}^{\otimes n_{k}})\leq(\widetilde{\epsilon}^{\prime}/(1-\widetilde{\epsilon}))^{2}\right\} (S.5.499)
sup{c𝝁𝝁2:maxk=1:KKL(𝝁nk𝝁nk)(ϵ~/(1ϵ~))2}\displaystyle\geq\sup\left\{c\|\bm{\mu}-\bm{\mu}^{\prime}\|_{2}:\max_{k=1:K}\text{KL}(\mathbb{P}_{\bm{\mu}}^{\otimes n_{k}}\|\mathbb{P}_{\bm{\mu}^{\prime}}^{\otimes n_{k}})\leq(\widetilde{\epsilon}^{\prime}/(1-\widetilde{\epsilon}))^{2}\right\} (S.5.500)
sup{c𝝁𝝁2:8maxk=1:Knk𝝁𝝁22(ϵ~/(1ϵ~))2}\displaystyle\geq\sup\left\{c\|\bm{\mu}-\bm{\mu}^{\prime}\|_{2}:8\max_{k=1:K}n_{k}\cdot\|\bm{\mu}-\bm{\mu}^{\prime}\|_{2}^{2}\leq(\widetilde{\epsilon}^{\prime}/(1-\widetilde{\epsilon}))^{2}\right\} (S.5.501)
=Cϵ~maxk=1:Knk.\displaystyle=C\cdot\frac{\widetilde{\epsilon}^{\prime}}{\sqrt{\max_{k=1:K}n_{k}}}. (S.5.502)

Then apply Lemmas 21 and 22 to get the desired bound. ∎

S.5.6 Proof of Theorem 5

Denote ξ=maxkSminrk=±1rk𝜷^(k)[0]𝜷(k)2=maxkS(𝜷^(k)[0]𝜷(k)2𝜷^(k)[0]+𝜷(k)2)\xi=\max_{k\in S}\min_{r_{k}=\pm 1}\|r_{k}\widehat{\bm{\beta}}^{(k)[0]}-\bm{\beta}^{(k)*}\|_{2}=\max_{k\in S}(\|\widehat{\bm{\beta}}^{(k)[0]}-\bm{\beta}^{(k)*}\|_{2}\wedge\|\widehat{\bm{\beta}}^{(k)[0]}+\bm{\beta}^{(k)*}\|_{2}). WLOG, assume S={1,,s}S=\{1,\ldots,s\} and rk=1r^{*}_{k}=1 for all kSk\in S. Hence ξ=maxkS𝜷^(k)[0]𝜷(k)2\xi=\max_{k\in S}\|\widehat{\bm{\beta}}^{(k)[0]}-\bm{\beta}^{(k)*}\|_{2}. For any k=1,,sk^{\prime}=1,\ldots,s, define

𝒓\displaystyle\bm{r} =(r1,,rk=1,rk+1,,rs=1,rs+1,,rKoutlier tasks),\displaystyle=(\underbrace{r_{1},\ldots,r_{k^{\prime}}}_{=-1},\underbrace{r_{k^{\prime}+1},\ldots,r_{s}}_{=1},\underbrace{r_{s+1},\ldots,r_{K}}_{\text{outlier tasks}}), (S.5.503)
𝒓\displaystyle\bm{r}^{\prime} =(1,1,,1,1,1,,1,1,rs+1,,rKoutlier tasks),\displaystyle=(1,1,\ldots,1,1,1,\ldots,1,1,\underbrace{r_{s+1},\ldots,r_{K}}_{\text{outlier tasks}}), (S.5.504)
𝒓\displaystyle\bm{r}^{\prime\prime} =(1,,1,1,,1,rs+1,,rKoutlier tasks).\displaystyle=(-1,\ldots,-1,-1,\ldots,-1,\underbrace{r_{s+1},\ldots,r_{K}}_{\text{outlier tasks}}). (S.5.505)

WLOG, it suffices to prove that

score(𝒓)score(𝒓)>0 when ks/2,\displaystyle\text{score}(\bm{r})-\text{score}(\bm{r}^{\prime})>0\quad\text{ when }k^{\prime}\leq\lfloor s/2\rfloor, (S.5.506)
score(𝒓)score(𝒓)>0 when k>s/2.\displaystyle\text{score}(\bm{r})-\text{score}(\bm{r}^{\prime\prime})>0\quad\text{ when }k^{\prime}>\lfloor s/2\rfloor. (S.5.507)

In fact, if this holds, then we must have

r^k=1 for all kS or r^k=1 for all kS.\widehat{r}_{k}=1\text{ for all }k\in S\text{\quad or \quad}\widehat{r}_{k}=-1\text{ for all }k\in S. (S.5.508)

Otherwise, according to (S.5.506), if \#\{k\in S:\widehat{r}_{k}=-1\}\leq\lfloor s/2\rfloor, then by replacing the first s entries of \widehat{\bm{r}} with 1, we obtain a different alignment whose score is smaller than that of \widehat{\bm{r}}, which contradicts the definition of \widehat{\bm{r}}. If \#\{k\in S:\widehat{r}_{k}=-1\}>\lfloor s/2\rfloor, then based on (S.5.507), by replacing the first s entries of \widehat{\bm{r}} with -1, we again obtain a different alignment whose score is smaller than that of \widehat{\bm{r}}, which contradicts the definition of \widehat{\bm{r}}.
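For intuition, here is a minimal Python sketch of a score-based sign alignment of the kind analyzed in this proof. The pairwise form of the score below, \text{score}(\bm{r})=\sum_{k_{1},k_{2}}\|r_{k_{1}}\widehat{\bm{\beta}}^{(k_{1})[0]}-r_{k_{2}}\widehat{\bm{\beta}}^{(k_{2})[0]}\|_{2}, is an assumption inferred from the decomposition (S.5.509)–(S.5.514) below, and the exhaustive minimization is purely illustrative rather than a verbatim transcription of the alignment algorithm.

```python
import itertools
import numpy as np

def score(r, betas):
    # Pairwise alignment score: sum over pairs (k1, k2) of
    # || r[k1] * beta[k1] - r[k2] * beta[k2] ||_2, matching the terms
    # that appear in the decomposition (S.5.509)-(S.5.514).
    K = len(betas)
    return sum(np.linalg.norm(r[k1] * betas[k1] - r[k2] * betas[k2])
               for k1 in range(K) for k2 in range(K))

def align_exhaustive(betas):
    # Minimize the score over all sign vectors r in {+1, -1}^K
    # (feasible only for small K; shown purely for illustration).
    K = len(betas)
    best = min(itertools.product([1, -1], repeat=K), key=lambda r: score(r, betas))
    return np.array(best)

# Toy example: tasks in S share a common direction up to unknown sign flips;
# the last task plays the role of an outlier.
rng = np.random.default_rng(1)
base = rng.standard_normal(4)
betas = [s * base + 0.05 * rng.standard_normal(4) for s in (1, -1, 1, -1, 1)]
betas.append(rng.standard_normal(4))  # outlier task
r_hat = align_exhaustive(betas)
print(r_hat)  # on the first five tasks, r_hat recovers (1,-1,1,-1,1) up to a global sign
```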

In the following, we prove (S.5.506). The proof of (S.5.507) is almost the same, so we do not repeat it. Under the conditions we assume, it can be shown that

score(𝒓)score(𝒓)\displaystyle\text{score}(\bm{r})-\text{score}(\bm{r}^{\prime}) =k1=1kk2=1k𝜷^(k1)[0]𝜷^(k2)[0]2+2k1=1kk2=k+1s𝜷^(k1)[0]+𝜷^(k2)[0]2\displaystyle=\sum_{k_{1}=1}^{k^{\prime}}\sum_{k_{2}=1}^{k^{\prime}}\|\widehat{\bm{\beta}}^{(k_{1})[0]}-\widehat{\bm{\beta}}^{(k_{2})[0]}\|_{2}+2\sum_{k_{1}=1}^{k^{\prime}}\sum_{k_{2}=k^{\prime}+1}^{s}\|\widehat{\bm{\beta}}^{(k_{1})[0]}+\widehat{\bm{\beta}}^{(k_{2})[0]}\|_{2} (S.5.509)
+2k1=1kk2=s+1K𝜷^(k1)[0]+rk2𝜷^(k2)[0]2\displaystyle\quad+2\sum_{k_{1}=1}^{k^{\prime}}\sum_{k_{2}=s+1}^{K}\|\widehat{\bm{\beta}}^{(k_{1})[0]}+r_{k_{2}}\widehat{\bm{\beta}}^{(k_{2})[0]}\|_{2} (S.5.510)
k1=1kk2=1k𝜷^(k1)[0]𝜷^(k2)[0]22k1=1kk2=k+1s𝜷^(k1)[0]𝜷^(k2)[0]2\displaystyle\quad-\sum_{k_{1}=1}^{k^{\prime}}\sum_{k_{2}=1}^{k^{\prime}}\|\widehat{\bm{\beta}}^{(k_{1})[0]}-\widehat{\bm{\beta}}^{(k_{2})[0]}\|_{2}-2\sum_{k_{1}=1}^{k^{\prime}}\sum_{k_{2}=k^{\prime}+1}^{s}\|\widehat{\bm{\beta}}^{(k_{1})[0]}-\widehat{\bm{\beta}}^{(k_{2})[0]}\|_{2} (S.5.511)
2k1=1kk2=s+1K𝜷^(k1)[0]+rk2𝜷^(k2)[0]2\displaystyle\quad-2\sum_{k_{1}=1}^{k^{\prime}}\sum_{k_{2}=s+1}^{K}\|-\widehat{\bm{\beta}}^{(k_{1})[0]}+r_{k_{2}}\widehat{\bm{\beta}}^{(k_{2})[0]}\|_{2} (S.5.512)
=2k1=1kk2=k+1s𝜷^(k1)[0]+𝜷^(k2)[0]2(1)+2k1=1kk2=s+1K𝜷^(k1)[0]+rk2𝜷^(k2)[0]2(2)\displaystyle=2\underbrace{\sum_{k_{1}=1}^{k^{\prime}}\sum_{k_{2}=k^{\prime}+1}^{s}\|\widehat{\bm{\beta}}^{(k_{1})[0]}+\widehat{\bm{\beta}}^{(k_{2})[0]}\|_{2}}_{(1)}+2\underbrace{\sum_{k_{1}=1}^{k^{\prime}}\sum_{k_{2}=s+1}^{K}\|\widehat{\bm{\beta}}^{(k_{1})[0]}+r_{k_{2}}\widehat{\bm{\beta}}^{(k_{2})[0]}\|_{2}}_{(2)} (S.5.513)
2k1=1kk2=k+1s𝜷^(k1)[0]𝜷^(k2)[0]2(1)2k1=1kk2=s+1K𝜷^(k1)[0]+rk2𝜷^(k2)[0]2(2).\displaystyle\quad-2\underbrace{\sum_{k_{1}=1}^{k^{\prime}}\sum_{k_{2}=k^{\prime}+1}^{s}\|\widehat{\bm{\beta}}^{(k_{1})[0]}-\widehat{\bm{\beta}}^{(k_{2})[0]}\|_{2}}_{(1)^{\prime}}-2\underbrace{\sum_{k_{1}=1}^{k^{\prime}}\sum_{k_{2}=s+1}^{K}\|-\widehat{\bm{\beta}}^{(k_{1})[0]}+r_{k_{2}}\widehat{\bm{\beta}}^{(k_{2})[0]}\|_{2}}_{(2)^{\prime}}. (S.5.514)

Moreover,

(1)(1)\displaystyle(1)-(1)^{\prime} =k1=1kk2=k+1s(𝜷^(k1)[0]+𝜷^(k2)[0]2𝜷^(k1)[0]𝜷^(k2)[0]2)\displaystyle=\sum_{k_{1}=1}^{k^{\prime}}\sum_{k_{2}=k^{\prime}+1}^{s}(\|\widehat{\bm{\beta}}^{(k_{1})[0]}+\widehat{\bm{\beta}}^{(k_{2})[0]}\|_{2}-\|\widehat{\bm{\beta}}^{(k_{1})[0]}-\widehat{\bm{\beta}}^{(k_{2})[0]}\|_{2}) (S.5.515)
k1=1kk2=k+1s(𝜷(k1)+𝜷(k2)2𝜷(k1)𝜷(k2)24ξ)\displaystyle\geq\sum_{k_{1}=1}^{k^{\prime}}\sum_{k_{2}=k^{\prime}+1}^{s}(\|\bm{\beta}^{(k_{1})*}+\bm{\beta}^{(k_{2})*}\|_{2}-\|\bm{\beta}^{(k_{1})*}-\bm{\beta}^{(k_{2})*}\|_{2}-4\xi) (S.5.516)
k1=1kk2=k+1s(2𝜷(k1)22𝜷(k1)𝜷(k2)24ξ)\displaystyle\geq\sum_{k_{1}=1}^{k^{\prime}}\sum_{k_{2}=k^{\prime}+1}^{s}(2\|\bm{\beta}^{(k_{1})*}\|_{2}-2\|\bm{\beta}^{(k_{1})*}-\bm{\beta}^{(k_{2})*}\|_{2}-4\xi) (S.5.517)
2(sk)k1=1k𝜷(k1)24k(sk)h𝜷4k(sk)ξ,\displaystyle\geq 2(s-k^{\prime})\sum_{k_{1}=1}^{k^{\prime}}\|\bm{\beta}^{(k_{1})*}\|_{2}-4k^{\prime}(s-k^{\prime})h_{\bm{\beta}}-4k^{\prime}(s-k^{\prime})\xi, (S.5.518)
(2)(2)k1=1kk2=s+1K2𝜷^(k1)[0]22(Ks)k1=1k𝜷(k1)22k(Ks)ξ.(2)-(2)^{\prime}\geq-\sum_{k_{1}=1}^{k^{\prime}}\sum_{k_{2}=s+1}^{K}2\|\widehat{\bm{\beta}}^{(k_{1})[0]}\|_{2}\geq-2(K-s)\sum_{k_{1}=1}^{k^{\prime}}\|\bm{\beta}^{(k_{1})*}\|_{2}-2k^{\prime}(K-s)\xi. (S.5.519)

Combining all these pieces,

score(𝒓)score(𝒓)\displaystyle\text{score}(\bm{r})-\text{score}(\bm{r}^{\prime}) (S.5.520)
\displaystyle\geq 2(2s-k^{\prime}-K)\sum_{k_{1}=1}^{k^{\prime}}\|\bm{\beta}^{(k_{1})*}\|_{2}-4k^{\prime}(s-k^{\prime})h_{\bm{\beta}}-2k^{\prime}(K-s)\xi-4k^{\prime}(s-k^{\prime})\xi (S.5.521)
\displaystyle\geq 2k^{\prime}\left[(2s-k^{\prime}-K)\min_{k\in S}\|\bm{\beta}^{(k)*}\|_{2}-2(s-k^{\prime})h_{\bm{\beta}}-(K-s)\xi-2(s-k^{\prime})\xi\right] (S.5.522)
\displaystyle>2k^{\prime}\left[\bigg{(}\frac{3}{2}s-K\bigg{)}\min_{k\in S}\|\bm{\beta}^{(k)*}\|_{2}-2sh_{\bm{\beta}}-(K-s)\xi-2s\xi\right] (S.5.523)
0,\displaystyle\geq 0, (S.5.524)

where (S.5.523) holds because 1\leq k^{\prime}\leq\lfloor s/2\rfloor, and (S.5.524) is due to condition (ii).

S.5.7 Proof of Theorem 6

Denote ξ=maxkSminrk=±1rk𝜷^(k)[0]𝜷(k)2=maxkS(𝜷^(k)[0]𝜷(k)2𝜷^(k)[0]+𝜷(k)2)\xi=\max_{k\in S}\min_{r_{k}=\pm 1}\|r_{k}\widehat{\bm{\beta}}^{(k)[0]}-\bm{\beta}^{(k)*}\|_{2}=\max_{k\in S}(\|\widehat{\bm{\beta}}^{(k)[0]}-\bm{\beta}^{(k)*}\|_{2}\wedge\|\widehat{\bm{\beta}}^{(k)[0]}+\bm{\beta}^{(k)*}\|_{2}). WLOG, assume S={1,,s}S=\{1,\ldots,s\} and rk=1r^{*}_{k}=1 for all kSk\in S. Hence ξ=maxkS𝜷^(k)[0]𝜷(k)2\xi=\max_{k\in S}\|\widehat{\bm{\beta}}^{(k)[0]}-\bm{\beta}^{(k)*}\|_{2}. For any k=1,,spak^{\prime}=1,\ldots,sp_{a}, define

𝒓\displaystyle\bm{r} =(r1,,rk=1,rk+1,,rs=1,rs+1,,rKoutlier tasks),\displaystyle=(\underbrace{r_{1},\ldots,{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}r_{k^{\prime}}}}_{=-1},\underbrace{r_{k^{\prime}+1},\ldots,r_{s}}_{=1},\underbrace{r_{s+1},\ldots,r_{K}}_{\text{outlier tasks}}), (S.5.525)
𝒓\displaystyle\bm{r}^{\prime} =(r1,,rk1=1,rk,rk+1,,rs=1,rs+1,,rKoutlier tasks).\displaystyle=(\underbrace{r_{1},\ldots,r_{k^{\prime}-1}}_{=-1},\underbrace{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}r_{k^{\prime}}^{\prime}},r_{k^{\prime}+1},\ldots,r_{s}}_{=1},\underbrace{r_{s+1},\ldots,r_{K}}_{\text{outlier tasks}}). (S.5.526)

By the definition of p_{a}, we must have \#\{k\in S:\widehat{r}_{k}=-1\}=sp_{a} or \#\{k\in S:\widehat{r}_{k}=1\}=sp_{a}. Suppose \#\{k\in S:\widehat{r}_{k}=-1\}=sp_{a} and that

score(𝒓)score(𝒓)>0,\text{score}(\bm{r})-\text{score}(\bm{r}^{\prime})>0, (S.5.527)

then for each such k^{\prime}\in S visited in the for loop of Algorithm 3, the algorithm will flip the sign of \widehat{r}_{k^{\prime}} and thereby decrease the mis-alignment proportion p_{a}. Hence, after the for loop, the mis-alignment proportion p_{a} becomes zero, which means the correct alignment is achieved. The case \#\{k\in S:\widehat{r}_{k}=1\}=sp_{a} can be discussed similarly.
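A minimal Python sketch of the single-pass sign-flipping step analyzed here (one pass over the tasks, flipping \widehat{r}_{k^{\prime}} whenever the flip strictly decreases the score; the pairwise score is the same assumed form as in the sketch for Theorem 5, and this is illustrative rather than a verbatim transcription of Algorithm 3):

```python
import numpy as np

def score(r, betas):
    # Assumed pairwise alignment score (same form as in the Theorem 5 sketch).
    K = len(betas)
    return sum(np.linalg.norm(r[k1] * betas[k1] - r[k2] * betas[k2])
               for k1 in range(K) for k2 in range(K))

def align_one_pass(betas, r_init):
    # One pass over all tasks: flip r[k] whenever the flip strictly
    # decreases the score, mirroring the for-loop argument in this proof.
    r = np.array(r_init, dtype=float)
    for k in range(len(betas)):
        flipped = r.copy()
        flipped[k] = -flipped[k]
        if score(flipped, betas) < score(r, betas):
            r = flipped
    return r
```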

Now we derive (S.5.527). Similarly to the decomposition in (S.5.514), we have

score(𝒓)score(𝒓)\displaystyle\text{score}(\bm{r})-\text{score}(\bm{r}^{\prime}) =2k=k+1s𝜷^(k)[0]+𝜷^(k)[0]2(1)+2k=1k1𝜷^(k)[0]𝜷^(k)[0]2(2)\displaystyle=2\underbrace{\sum_{k=k^{\prime}+1}^{s}\|\widehat{\bm{\beta}}^{(k^{\prime})[0]}+\widehat{\bm{\beta}}^{(k)[0]}\|_{2}}_{(1)}+2\underbrace{\sum_{k=1}^{k^{\prime}-1}\|\widehat{\bm{\beta}}^{(k^{\prime})[0]}-\widehat{\bm{\beta}}^{(k)[0]}\|_{2}}_{(2)} (S.5.528)
+2k=s+1K𝜷^(k)[0]rk𝜷^(k)[0]2(3)\displaystyle\quad+2\underbrace{\sum_{k=s+1}^{K}\|-\widehat{\bm{\beta}}^{(k^{\prime})[0]}-r_{k}\widehat{\bm{\beta}}^{(k)[0]}\|_{2}}_{(3)} (S.5.529)
2k=k+1s𝜷^(k)[0]𝜷^(k)[0]2(1)2k=1k1𝜷^(k)[0]+𝜷^(k)[0]2(2)\displaystyle\quad-2\underbrace{\sum_{k=k^{\prime}+1}^{s}\|\widehat{\bm{\beta}}^{(k^{\prime})[0]}-\widehat{\bm{\beta}}^{(k)[0]}\|_{2}}_{(1)^{\prime}}-2\underbrace{\sum_{k=1}^{k^{\prime}-1}\|\widehat{\bm{\beta}}^{(k^{\prime})[0]}+\widehat{\bm{\beta}}^{(k)[0]}\|_{2}}_{(2)^{\prime}} (S.5.530)
2k=s+1K𝜷^(k)[0]rk𝜷^(k)[0]2(3).\displaystyle\quad-2\underbrace{\sum_{k=s+1}^{K}\|\widehat{\bm{\beta}}^{(k^{\prime})[0]}-r_{k}\widehat{\bm{\beta}}^{(k)[0]}\|_{2}}_{(3)^{\prime}}. (S.5.531)

Note that

(1)(1)\displaystyle(1)-(1)^{\prime} =k=k+1s(𝜷^(k)[0]+𝜷^(k)[0]2𝜷^(k)[0]𝜷^(k)[0]2)\displaystyle=\sum_{k=k^{\prime}+1}^{s}(\|\widehat{\bm{\beta}}^{(k^{\prime})[0]}+\widehat{\bm{\beta}}^{(k)[0]}\|_{2}-\|\widehat{\bm{\beta}}^{(k^{\prime})[0]}-\widehat{\bm{\beta}}^{(k)[0]}\|_{2}) (S.5.532)
k=k+1s(𝜷(k)+𝜷(k)2𝜷(k)𝜷(k)24ξ)\displaystyle\geq\sum_{k=k^{\prime}+1}^{s}(\|\bm{\beta}^{(k^{\prime})*}+\bm{\beta}^{(k)*}\|_{2}-\|\bm{\beta}^{(k^{\prime})*}-\bm{\beta}^{(k)*}\|_{2}-4\xi) (S.5.533)
k=k+1s(2𝜷(k)22𝜷(k)𝜷(k)24ξ)\displaystyle\geq\sum_{k=k^{\prime}+1}^{s}(2\|\bm{\beta}^{(k^{\prime})*}\|_{2}-2\|\bm{\beta}^{(k^{\prime})*}-\bm{\beta}^{(k)*}\|_{2}-4\xi) (S.5.534)
(sk)(2𝜷(k)24h𝜷4ξ),\displaystyle\geq(s-k^{\prime})(2\|\bm{\beta}^{(k^{\prime})*}\|_{2}-4h_{\bm{\beta}}-4\xi), (S.5.535)
(2)(2)k=1k12𝜷^(k)[0]22(k1)𝜷(k)22(k1)ξ,(2)-(2)^{\prime}\geq-\sum_{k=1}^{k^{\prime}-1}2\|\widehat{\bm{\beta}}^{(k^{\prime})[0]}\|_{2}\geq-2(k^{\prime}-1)\|\bm{\beta}^{(k^{\prime})*}\|_{2}-2(k^{\prime}-1)\xi, (S.5.536)
(3)(3)k=s+1K2𝜷^(k)[0]22(Ks)𝜷(k)22(Ks)ξ.(3)-(3)^{\prime}\geq-\sum_{k=s+1}^{K}2\|\widehat{\bm{\beta}}^{(k^{\prime})[0]}\|_{2}\geq-2(K-s)\|\bm{\beta}^{(k^{\prime})*}\|_{2}-2(K-s)\xi. (S.5.537)

Putting all pieces together,

score(𝒓)score(𝒓)\displaystyle\text{score}(\bm{r})-\text{score}(\bm{r}^{\prime}) (S.5.538)
2[(2s2kK+1)𝜷(k)22(sk)h𝜷(sk+K1)ξ]\displaystyle\geq 2\left[(2s-2k^{\prime}-K+1)\|\bm{\beta}^{(k^{\prime})*}\|_{2}-2(s-k^{\prime})h_{\bm{\beta}}-(s-k^{\prime}+K-1)\xi\right] (S.5.539)
>2[(2s2spaK)𝜷(k)22sh𝜷2(s+K)ξ]\displaystyle>2\left[(2s-2sp_{a}-K)\|\bm{\beta}^{(k^{\prime})*}\|_{2}-2sh_{\bm{\beta}}-2(s+K)\xi\right] (S.5.540)
0.\displaystyle\geq 0. (S.5.541)

Here, (S.5.540) holds because 1\leq k^{\prime}\leq sp_{a}, and (S.5.541) is due to condition (iii).

S.5.8 Proof of Theorem 13

S.5.8.1 Lemmas

Define the contraction basin of one GMM as

B_{\text{con}}(\bm{\theta}^{(k)*})=\{\bm{\theta}=\{w,\bm{\beta},\delta\}:w\in[c_{w}/2,1-c_{w}/2],\|\bm{\beta}-\bm{\beta}^{(k)*}\|_{2}\leq C_{b}\Delta,|\delta-\delta^{(k)*}|\leq C_{b}\Delta\}, (S.5.542)

which we abbreviate as B_{\text{con}} in the following.

For GMM 𝒛(1w)𝒩(𝝁1,𝚺)+w𝒩(𝝁2,𝚺)\bm{z}\sim(1-w^{*})\mathcal{N}(\bm{\mu}_{1}^{*},\bm{\Sigma}^{*})+w^{*}\mathcal{N}(\bm{\mu}_{2}^{*},\bm{\Sigma}^{*}) and any 𝜽=(w,𝜷,δ)\bm{\theta}=(w,\bm{\beta},\delta), define

\displaystyle\gamma_{\bm{\theta}}(\bm{z})=\frac{w\exp\{\bm{\beta}^{\top}\bm{z}-\delta\}}{1-w+w\exp\{\bm{\beta}^{\top}\bm{z}-\delta\}},\quad w(\bm{\theta})=\mathbb{E}[\gamma_{\bm{\theta}}(\bm{z})], (S.5.543)
𝝁1(𝜽)=𝔼[(1γ𝜽(𝒛))𝒛]𝔼[1γ𝜽(𝒛)],\displaystyle\bm{\mu}_{1}(\bm{\theta})=\frac{\mathbb{E}[(1-\gamma_{\bm{\theta}}(\bm{z}))\bm{z}]}{\mathbb{E}[1-\gamma_{\bm{\theta}}(\bm{z})]}, 𝝁2(𝜽)=𝔼[γ𝜽(𝒛)𝒛]𝔼[γ𝜽(𝒛)].\displaystyle\quad\bm{\mu}_{2}(\bm{\theta})=\frac{\mathbb{E}[\gamma_{\bm{\theta}}(\bm{z})\bm{z}]}{\mathbb{E}[\gamma_{\bm{\theta}}(\bm{z})]}. (S.5.544)
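For concreteness, a minimal Python sketch of the sample analogues of \gamma_{\bm{\theta}}, w(\bm{\theta}), \bm{\mu}_{1}(\bm{\theta}), and \bm{\mu}_{2}(\bm{\theta}) defined above (one update of the weight and the two cluster means; the \bm{\beta} and \delta updates used by the full procedure are omitted here, and the snippet is illustrative only):

```python
import numpy as np

def responsibilities(Z, w, beta, delta):
    # gamma_theta(z) = w * exp(beta' z - delta) / (1 - w + w * exp(beta' z - delta)),
    # the working posterior probability that z belongs to the second cluster.
    t = np.exp(Z @ beta - delta)
    return w * t / (1.0 - w + w * t)

def weight_and_means_update(Z, w, beta, delta):
    # Sample analogues of w(theta), mu_1(theta), mu_2(theta).
    g = responsibilities(Z, w, beta, delta)
    w_new = g.mean()
    mu1_new = ((1.0 - g)[:, None] * Z).sum(axis=0) / (1.0 - g).sum()
    mu2_new = (g[:, None] * Z).sum(axis=0) / g.sum()
    return w_new, mu1_new, mu2_new
```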
Lemma 29.

Suppose Assumption 3 holds.

(i) With probability at least 1-\tau,

    sup𝜽(0)Bcon𝜷(0)𝜷(0)2ξ(0)|1n0i=1n0γ𝜽(0)(𝒛(0)i)𝔼[γ𝜽(0)(𝒛(0))]|ξ(0)pn0+log(1/τ)n0.\sup_{\begin{subarray}{c}\bm{\theta}^{(0)}\in B_{\text{con}}\\ \|\bm{\beta}^{(0)}-\bm{\beta}^{(0)*}\|_{2}\leq\xi^{(0)}\end{subarray}}\left|\frac{1}{n_{0}}\sum_{i=1}^{n_{0}}\gamma_{\bm{\theta}^{(0)}}(\bm{z}^{(0)}_{i})-\mathbb{E}[\gamma_{\bm{\theta}^{(0)}}(\bm{z}^{(0)})]\right|\lesssim\xi^{(0)}\sqrt{\frac{p}{n_{0}}}+\sqrt{\frac{\log(1/\tau)}{n_{0}}}. (S.5.545)
(ii) With probability at least 1-\tau,

\sup_{\begin{subarray}{c}\bm{\theta}^{(0)}\in B_{\text{con}}\\ \|\bm{\beta}^{(0)}-\bm{\beta}^{(0)*}\|_{2}\leq\xi^{(0)}\end{subarray}}\left|\frac{1}{n_{0}}\sum_{i=1}^{n_{0}}\gamma_{\bm{\theta}^{(0)}}(\bm{z}^{(0)}_{i})(\bm{z}^{(0)}_{i})^{\top}\bm{\beta}^{(0)*}-\mathbb{E}[\gamma_{\bm{\theta}^{(0)}}(\bm{z}^{(0)})(\bm{z}^{(0)})^{\top}\bm{\beta}^{(0)*}]\right|\lesssim\xi^{(0)}\sqrt{\frac{p}{n_{0}}}+\sqrt{\frac{\log(1/\tau)}{n_{0}}}. (S.5.546)
(iii) With probability at least 1-\tau,

\sup_{\begin{subarray}{c}\bm{\theta}^{(0)}\in B_{\text{con}}\\ \|\bm{\beta}^{(0)}-\bm{\beta}^{(0)*}\|_{2}\leq\xi^{(0)}\end{subarray}}\left\|\frac{1}{n_{0}}\sum_{i=1}^{n_{0}}\gamma_{\bm{\theta}^{(0)}}(\bm{z}^{(0)}_{i})\bm{z}^{(0)}_{i}-\mathbb{E}[\gamma_{\bm{\theta}^{(0)}}(\bm{z}^{(0)})\bm{z}^{(0)}]\right\|_{2}\lesssim\xi^{(0)}\sqrt{\frac{p}{n_{0}}}+\sqrt{\frac{\log(1/\tau)}{n_{0}}}. (S.5.547)

S.5.8.2 Main proof of Theorem 13

WLOG, in Assumptions 3.(iii) and 3.(iv), we assume

  • 𝜷^(0)[0]𝜷(0)2|δ^(0)[0]δ(0)|C4Δ(0)\|\widehat{\bm{\beta}}^{(0)[0]}-\bm{\beta}^{(0)*}\|_{2}\vee|\widehat{\delta}^{(0)[0]}-\delta^{(0)*}|\leq C_{4}\Delta^{(0)}, with a sufficiently small constant C4C_{4};

  • |w^(0)[0]w(0)|cw/2|\widehat{w}^{(0)[0]}-w^{(0)*}|\leq c_{w}/2.

(I) Case 1: We first consider the case that h\geq C\sqrt{\frac{p}{n_{0}}}. Consider the event \mathcal{E} defined as the intersection of the events in Lemma 29, with \xi^{(0)} taken to be a large constant C, which satisfies \mathbb{P}(\mathcal{E})\geq 1-\tau. Throughout the analysis in Case 1, we condition on \mathcal{E}; therefore, all the arguments hold with probability at least 1-\tau.

Similar to our analysis in the proof of Theorem 1, conditioned on \mathcal{E}, we have

|w^(0)[t]w(0)|\displaystyle|\widehat{w}^{(0)[t]}-w^{(0)*}| κ0d(𝜽^(0)[t1],𝜽(0))+pn0,\displaystyle\lesssim\kappa_{0}^{\prime\prime}d(\widehat{\bm{\theta}}^{(0)[t-1]},\bm{\theta}^{(0)*})+\sqrt{\frac{p}{n_{0}}}, (S.5.548)
maxr=1:2𝝁^(0)[t]r𝝁(0)r2\displaystyle\max_{r=1:2}\|\widehat{\bm{\mu}}^{(0)[t]}_{r}-\bm{\mu}^{(0)*}_{r}\|_{2} κ0d(𝜽^(0)[t1],𝜽(0))+pn0,\displaystyle\lesssim\kappa_{0}^{\prime\prime}d(\widehat{\bm{\theta}}^{(0)[t-1]},\bm{\theta}^{(0)*})+\sqrt{\frac{p}{n_{0}}}, (S.5.549)
(𝚺^(0)[t]𝚺(0))𝜷(0)2\displaystyle\|(\widehat{\bm{\Sigma}}^{(0)[t]}-\bm{\Sigma}^{(0)*})\bm{\beta}^{(0)*}\|_{2} κ0d(𝜽^(0)[t1],𝜽(0))+pn0.\displaystyle\lesssim\kappa_{0}^{\prime\prime}d(\widehat{\bm{\theta}}^{(0)[t-1]},\bm{\theta}^{(0)*})+\sqrt{\frac{p}{n_{0}}}. (S.5.550)

Hence

(𝚺^(0)[t])𝜷(0)(𝝁^(0)[t1]2𝝁^(0)[t1]1)2\displaystyle\|(\widehat{\bm{\Sigma}}^{(0)[t]})\bm{\beta}^{(0)*}-(\widehat{\bm{\mu}}^{(0)[t-1]}_{2}-\widehat{\bm{\mu}}^{(0)[t-1]}_{1})\|_{2} (𝚺^(0)[t]𝚺(0))𝜷(0)2+maxr=1:2𝝁^(0)[t]r𝝁(0)r2\displaystyle\lesssim\|(\widehat{\bm{\Sigma}}^{(0)[t]}-\bm{\Sigma}^{(0)*})\bm{\beta}^{(0)*}\|_{2}+\max_{r=1:2}\|\widehat{\bm{\mu}}^{(0)[t]}_{r}-\bm{\mu}^{(0)*}_{r}\|_{2} (S.5.551)
κ0d(𝜽^(0)[t1],𝜽(0))+pn0.\displaystyle\lesssim\kappa_{0}^{\prime\prime}d(\widehat{\bm{\theta}}^{(0)[t-1]},\bm{\theta}^{(0)*})+\sqrt{\frac{p}{n_{0}}}. (S.5.552)

By Lemma 7, we have

𝜷^(0)[t]𝜷(0)2\displaystyle\|\widehat{\bm{\beta}}^{(0)[t]}-\bm{\beta}^{(0)*}\|_{2} (𝚺^(0)[t])𝜷(0)(𝝁^(0)[t1]2𝝁^(0)[t1]1)2+λ0[t]n0\displaystyle\lesssim\|(\widehat{\bm{\Sigma}}^{(0)[t]})\bm{\beta}^{(0)*}-(\widehat{\bm{\mu}}^{(0)[t-1]}_{2}-\widehat{\bm{\mu}}^{(0)[t-1]}_{1})\|_{2}+\frac{\lambda_{0}^{[t]}}{\sqrt{n_{0}}} (S.5.553)
κ0d(𝜽^(0)[t1],𝜽(0))+pn0+λ0[t]n0.\displaystyle\lesssim\kappa_{0}^{\prime\prime}d(\widehat{\bm{\theta}}^{(0)[t-1]},\bm{\theta}^{(0)*})+\sqrt{\frac{p}{n_{0}}}+\frac{\lambda_{0}^{[t]}}{\sqrt{n_{0}}}. (S.5.554)

Combining these results, we have

d(𝜽^(0)[t],𝜽(0))Cκ0d(𝜽^(0)[t1],𝜽(0))+Cpn0+Cλ0[t]n0.d(\widehat{\bm{\theta}}^{(0)[t]},\bm{\theta}^{(0)*})\leq C\kappa_{0}^{\prime\prime}d(\widehat{\bm{\theta}}^{(0)[t-1]},\bm{\theta}^{(0)*})+C^{\prime}\sqrt{\frac{p}{n_{0}}}+C^{\prime}\frac{\lambda_{0}^{[t]}}{\sqrt{n_{0}}}. (S.5.555)

By the construction of \lambda_{0}^{[t]}, we have

λ0[t]=1κ0t1κ0Cλ0p+κ0tλ0[0],\lambda_{0}^{[t]}=\frac{1-\kappa_{0}^{t}}{1-\kappa_{0}}C_{\lambda_{0}}\sqrt{p}+\kappa_{0}^{t}\lambda_{0}^{[0]}, (S.5.556)

which, combined with (S.5.555), implies that

d(𝜽^(0)[t],𝜽(0))\displaystyle d(\widehat{\bm{\theta}}^{(0)[t]},\bm{\theta}^{(0)*}) (Cκ0)td(𝜽^(0)[0],𝜽(0))+Cpn0+Ct=1tλ0[t]n0(Cκ0)tt\displaystyle\leq(C\kappa_{0}^{\prime\prime})^{t}d(\widehat{\bm{\theta}}^{(0)[0]},\bm{\theta}^{(0)*})+C^{\prime\prime}\sqrt{\frac{p}{n_{0}}}+C^{\prime}\sum_{t^{\prime}=1}^{t}\frac{\lambda_{0}^{[t^{\prime}]}}{\sqrt{n_{0}}}(C\kappa_{0}^{\prime\prime})^{t-t^{\prime}} (S.5.557)
(κ0)td(𝜽^(0)[0],𝜽(0))+Cpn0+Ct(κ0)t\displaystyle\leq(\kappa_{0}^{\prime})^{t}d(\widehat{\bm{\theta}}^{(0)[0]},\bm{\theta}^{(0)*})+C^{\prime\prime\prime}\sqrt{\frac{p}{n_{0}}}+C^{\prime\prime\prime}t(\kappa_{0}^{\prime})^{t} (S.5.558)
Ct(κ0)t+Cpn0,\displaystyle\leq Ct(\kappa_{0}^{\prime})^{t}+C^{\prime\prime\prime}\sqrt{\frac{p}{n_{0}}}, (S.5.559)

which is the desired rate. The bounds on \max_{r=1:2}\|\widehat{\bm{\mu}}^{(0)[t]}_{r}-\bm{\mu}^{(0)*}_{r}\|_{2} and \|\widehat{\bm{\Sigma}}^{(0)[t]}-\bm{\Sigma}^{(0)*}\|_{2} can be derived similarly to (S.5.549) and (S.5.550).
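Regarding the closed form (S.5.556) used above: assuming the penalty update takes the form \lambda_{0}^{[t]}=C_{\lambda_{0}}\sqrt{p}+\kappa_{0}\lambda_{0}^{[t-1]} (the construction referenced above), unrolling the recursion gives

\lambda_{0}^{[t]}=C_{\lambda_{0}}\sqrt{p}\sum_{t^{\prime}=0}^{t-1}\kappa_{0}^{t^{\prime}}+\kappa_{0}^{t}\lambda_{0}^{[0]}=\frac{1-\kappa_{0}^{t}}{1-\kappa_{0}}C_{\lambda_{0}}\sqrt{p}+\kappa_{0}^{t}\lambda_{0}^{[0]},

which matches (S.5.556).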

(II) Case 2: Next, we consider the case that h\leq C\sqrt{\frac{p}{n_{0}}}. According to Assumption 3, we have \sqrt{\frac{p}{n_{0}}}\lesssim\sqrt{\frac{p+\log K}{\max_{k\in S}n_{k}}}. It is easy to see that the analysis in part (I) does not depend on the condition h\geq C\sqrt{\frac{p}{n_{0}}}. Hence we have already proved the desired bounds on \max_{r=1:2}\|\widehat{\bm{\mu}}^{(0)[t]}_{r}-\bm{\mu}^{(0)*}_{r}\|_{2} and \|\widehat{\bm{\Sigma}}^{(0)[t]}-\bm{\Sigma}^{(0)*}\|_{2}. Let t_{0} be an integer such that t_{0}\kappa_{0}^{t_{0}}\asymp\sqrt{\frac{p}{n_{0}}}. When 1\leq t\leq t_{0}, the bound in part (I) is the desired bound, since the term t\kappa_{0}^{t} dominates the other terms. Let us consider the case t=t_{0}+1.

Consider the event \mathcal{E}^{\prime} defined as the event that

𝜷¯[T]𝜷(k)2h+logKnk+pnS+ϵp+logKmaxk=1:Knk,kargminkSnk.\|\overline{\bm{\beta}}^{[T]}-\bm{\beta}^{(k^{\prime})*}\|_{2}\lesssim h+\sqrt{\frac{\log K}{n_{k^{\prime}}}}+\sqrt{\frac{p}{n_{S}}}+\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}},\quad k^{\prime}\in\operatorname*{arg\,min}_{k\in S}n_{k}. (S.5.560)

Note that since h\leq C\sqrt{\frac{p+\log K}{\max_{k\in S}n_{k}}}, by part (II) of the proof of Theorem 1, we know that \mathbb{P}(\mathcal{E}^{\prime})\geq 1-C(K^{-1}+\exp\{-C^{\prime}p\}). Moreover, \mathcal{E}^{\prime} implies that

𝜷¯[T]𝜷(0)2h+logKmaxkSnk+pnS+ϵp+logKmaxk=1:Knkpn0,\|\overline{\bm{\beta}}^{[T]}-\bm{\beta}^{(0)*}\|_{2}\lesssim h+\sqrt{\frac{\log K}{\max_{k\in S}n_{k}}}+\sqrt{\frac{p}{n_{S}}}+\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}}\lesssim\sqrt{\frac{p}{n_{0}}}, (S.5.561)

where the second inequality comes from Assumption 3.

Also consider the event \mathcal{E}^{\prime\prime} defined as the intersection of the events in Lemma 29 with \xi=C\Big{[}h+\sqrt{\frac{\log K}{\max_{k\in S}n_{k}}}+\sqrt{\frac{p}{n_{S}}}+\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}}\Big{]}, which satisfies \mathbb{P}(\mathcal{E}^{\prime\prime})\geq 1-\tau. Throughout the remaining analysis, we condition on \mathcal{E}\cap\mathcal{E}^{\prime}\cap\mathcal{E}^{\prime\prime}; therefore, all the arguments hold with probability at least 1-\tau-C(K^{-1}+\exp\{-C^{\prime}p\}).

Note that \lambda_{0}^{[t]}\geq C\sqrt{p}\geq C^{\prime}\|\overline{\bm{\beta}}^{[T]}-\bm{\beta}^{(0)*}\|_{2} and \lambda_{0}^{[t]}\geq C\sqrt{p}\geq C^{\prime}\sqrt{n_{0}}\|(\widehat{\bm{\Sigma}}^{(0)[t]})\bm{\beta}^{(0)*}-(\widehat{\bm{\mu}}^{(0)[t-1]}_{2}-\widehat{\bm{\mu}}^{(0)[t-1]}_{1})\|_{2}. Hence, by Lemma 7, we have \widehat{\bm{\beta}}^{(0)[t]}=\overline{\bm{\beta}}^{[T]}, and thus

𝜷^(0)[t]𝜷(0)2h+logKmaxkSnk+pnS+ϵp+logKmaxk=1:Knk.\|\widehat{\bm{\beta}}^{(0)[t]}-\bm{\beta}^{(0)*}\|_{2}\lesssim h+\sqrt{\frac{\log K}{\max_{k\in S}n_{k}}}+\sqrt{\frac{p}{n_{S}}}+\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}}. (S.5.562)

Similarly to the analysis in part (II) of the proof of Theorem 1, we have

|w^(0)[t]w(0)|\displaystyle|\widehat{w}^{(0)[t]}-w^{(0)*}| κ0d(𝜽^(0)[t1],𝜽(0))+ξpn0+1n0\displaystyle\lesssim\kappa_{0}^{\prime\prime}d(\widehat{\bm{\theta}}^{(0)[t-1]},\bm{\theta}^{(0)*})+\xi\sqrt{\frac{p}{n_{0}}}+\sqrt{\frac{1}{n_{0}}} (S.5.563)
κ0d(𝜽^(0)[t1],𝜽(0))+ξ+1n0,\displaystyle\lesssim\kappa_{0}^{\prime\prime}d(\widehat{\bm{\theta}}^{(0)[t-1]},\bm{\theta}^{(0)*})+\xi+\sqrt{\frac{1}{n_{0}}}, (S.5.564)
|δ^(0)[t]δ(0)|\displaystyle|\widehat{\delta}^{(0)[t]}-\delta^{(0)*}| κ0d(𝜽^(0)[t1],𝜽(0))+ξ+ξpn0+1n0,\displaystyle\lesssim\kappa_{0}^{\prime\prime}d(\widehat{\bm{\theta}}^{(0)[t-1]},\bm{\theta}^{(0)*})+\xi+\xi\sqrt{\frac{p}{n_{0}}}+\sqrt{\frac{1}{n_{0}}}, (S.5.565)
κ0d(𝜽^(0)[t1],𝜽(0))+ξ+1n0.\displaystyle\lesssim\kappa_{0}^{\prime\prime}d(\widehat{\bm{\theta}}^{(0)[t-1]},\bm{\theta}^{(0)*})+\xi+\sqrt{\frac{1}{n_{0}}}. (S.5.566)

Putting all pieces together,

d(𝜽^(0)[t],𝜽(0))Cκ0d(𝜽^(0)[t1],𝜽(0))+h+logKmaxkSnk+pnS+ϵp+logKmaxk=1:Knk+1n0.d(\widehat{\bm{\theta}}^{(0)[t]},\bm{\theta}^{(0)*})\leq C\kappa_{0}^{\prime\prime}d(\widehat{\bm{\theta}}^{(0)[t-1]},\bm{\theta}^{(0)*})+h+\sqrt{\frac{\log K}{\max_{k\in S}n_{k}}}+\sqrt{\frac{p}{n_{S}}}+\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}}+\sqrt{\frac{1}{n_{0}}}. (S.5.567)

We can continue this analysis from t=t0+1t=t_{0}+1 to t0+2t_{0}+2, and so on. Hence for any t1t^{\prime}\geq 1, we have

d(𝜽^(0)[t0+t],𝜽(0))\displaystyle d(\widehat{\bm{\theta}}^{(0)[t_{0}+t^{\prime}]},\bm{\theta}^{(0)*}) (Cκ0)td(𝜽^(0)[t0],𝜽(0))+Ch+ClogKmaxkSnk+CpnS\displaystyle\leq(C\kappa_{0}^{\prime\prime})^{t^{\prime}}d(\widehat{\bm{\theta}}^{(0)[t_{0}]},\bm{\theta}^{(0)*})+C^{\prime}h+C^{\prime}\sqrt{\frac{\log K}{\max_{k\in S}n_{k}}}+C^{\prime}\sqrt{\frac{p}{n_{S}}} (S.5.568)
+Cϵp+logKmaxk=1:Knk+C1n0\displaystyle\quad+C^{\prime}\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}}+C^{\prime}\sqrt{\frac{1}{n_{0}}} (S.5.569)
(κ0)td(𝜽^(0)[t0],𝜽(0))+Ch+ClogKmaxkSnk+CpnS\displaystyle\leq(\kappa_{0}^{\prime})^{t^{\prime}}d(\widehat{\bm{\theta}}^{(0)[t_{0}]},\bm{\theta}^{(0)*})+C^{\prime}h+C^{\prime}\sqrt{\frac{\log K}{\max_{k\in S}n_{k}}}+C^{\prime}\sqrt{\frac{p}{n_{S}}} (S.5.570)
+Cϵp+logKmaxk=1:Knk+C1n0\displaystyle\quad+C^{\prime}\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}}+C^{\prime}\sqrt{\frac{1}{n_{0}}} (S.5.571)
(t+t0)(κ0)t+t0+Ch+ClogKmaxkSnk+CpnS\displaystyle\leq(t^{\prime}+t_{0})(\kappa_{0}^{\prime})^{t^{\prime}+t_{0}}+C^{\prime}h+C^{\prime}\sqrt{\frac{\log K}{\max_{k\in S}n_{k}}}+C^{\prime}\sqrt{\frac{p}{n_{S}}} (S.5.572)
+Cϵp+logKmaxk=1:Knk+C1n0,\displaystyle\quad+C^{\prime}\epsilon\sqrt{\frac{p+\log K}{\max_{k=1:K}n_{k}}}+C^{\prime}\sqrt{\frac{1}{n_{0}}}, (S.5.573)

where the last inequality holds because t0t_{0} is chosen to be the integer satisfying t0(κ0)t0pn0d(𝜽^(0)[t0],𝜽(0))t_{0}(\kappa_{0})^{t_{0}}\asymp\sqrt{\frac{p}{n_{0}}}\gtrsim d(\widehat{\bm{\theta}}^{(0)[t_{0}]},\bm{\theta}^{(0)*}).

S.5.8.3 Proof of lemmas

Proof of Lemma 29.

The proof is almost the same as the proofs of lemmas in Theorem 1, so we do not repeat it here. ∎

S.5.9 Proof of Theorem 15

S.5.9.1 Lemmas

Recall

Θ¯S(h)={{𝜽¯(k)}k{0}S={(w(k),𝝁(k)1,𝝁(k)2,𝚺(k))}k{0}S:𝜽(k)Θ¯,maxkS𝜷(k)𝜷(0)2h}.\displaystyle\overline{\Theta}_{S}^{\prime}(h)=\Big{\{}\{\overline{\bm{\theta}}^{(k)}\}_{k\in\{0\}\cup S}=\{(w^{(k)},\bm{\mu}^{(k)}_{1},\bm{\mu}^{(k)}_{2},\bm{\Sigma}^{(k)})\}_{k\in\{0\}\cup S}:\bm{\theta}^{(k)}\in\overline{\Theta},\max_{k\in S}\|\bm{\beta}^{(k)}-\bm{\beta}^{(0)}\|_{2}\leq h\Big{\}}. (S.5.574)
Lemma 30.

When n0Cpn_{0}\geq Cp with some constant C>0C>0, we have

inf𝜽^(0)supS:|S|ssup{𝜽¯(k)}k{0}SΘ¯SS\displaystyle\inf_{\widehat{\bm{\theta}}^{(0)}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in\{0\}\cup S}\in\overline{\Theta}_{S}^{\prime}\\ \mathbb{Q}_{S}\end{subarray}} (d(𝜽^(0),𝜽(0))C1pnS+n0+C1hpn0+C11n0)110.\displaystyle\mathbb{P}\Bigg{(}d(\widehat{\bm{\theta}}^{(0)},\bm{\theta}^{(0)*})\geq C_{1}\sqrt{\frac{p}{n_{S}+n_{0}}}+C_{1}h\wedge\sqrt{\frac{p}{n_{0}}}+C_{1}\sqrt{\frac{1}{n_{0}}}\Bigg{)}\geq\frac{1}{10}. (S.5.575)
Lemma 31.

Denote ϵ~=Kss\widetilde{\epsilon}=\frac{K-s}{s}. Then

inf𝜽^(0)supS:|S|ssup{𝜽¯(k)}k{0}SΘ¯SS\displaystyle\inf_{\widehat{\bm{\theta}}^{(0)}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in\{0\}\cup S}\in\overline{\Theta}_{S}^{\prime}\\ \mathbb{Q}_{S}\end{subarray}} (d(𝜽^(0),𝜽(0))(C1ϵ~maxk=1:Knk)(C21n0))110.\displaystyle\mathbb{P}\Bigg{(}d(\widehat{\bm{\theta}}^{(0)},\bm{\theta}^{(0)*})\geq\bigg{(}C_{1}\frac{\widetilde{\epsilon}}{\sqrt{\max_{k=1:K}n_{k}}}\bigg{)}\wedge\bigg{(}C_{2}\sqrt{\frac{1}{n_{0}}}\bigg{)}\Bigg{)}\geq\frac{1}{10}. (S.5.576)
Lemma 32 (The second variant of Theorem 5.1 in \citealpappchen2018robust).

Consider a family of distributions \{\{\mathbb{P}_{\theta}^{(k)}\}_{k=0}^{K}:\theta\in\Theta\}, where all the distributions are indexed by the same parameter \theta\in\Theta. Suppose \bm{x}^{(k)}\sim(1-\widetilde{\epsilon})\mathbb{P}^{(k)}_{\theta}+\widetilde{\epsilon}\mathbb{Q}^{(k)} independently for k=1:K and \bm{x}^{(0)}\sim\mathbb{P}^{(0)}_{\theta}. Denote the joint distribution of \{\bm{x}^{(k)}\}_{k=0}^{K} as \mathbb{P}_{(\widetilde{\epsilon},\theta,\{\mathbb{Q}^{(k)}\}_{k=1}^{K})}. Then

infθ^supθΘ{(k)}k=1K(ϵ~,θ,{(k)}k=1K)(θ^θCϖ(ϵ~,Θ))920,\inf_{\widehat{\theta}}\sup_{\begin{subarray}{c}\theta\in\Theta\\ \{\mathbb{Q}^{(k)}\}_{k=1}^{K}\end{subarray}}\mathbb{P}_{(\widetilde{\epsilon},\theta,\{\mathbb{Q}^{(k)}\}_{k=1}^{K})}\left(\|\widehat{\theta}-\theta\|\geq C\varpi^{\prime}(\widetilde{\epsilon},\Theta)\right)\geq\frac{9}{20}, (S.5.577)

where ϖ(ϵ~,Θ)sup{θ1θ2:maxk=1:KdTV((k)θ1,(k)θ2)ϵ~/(1ϵ~),dTV((0)θ1,(0)θ2)1/20}\varpi^{\prime}(\widetilde{\epsilon},\Theta)\coloneqq\sup\big{\{}\|\theta_{1}-\theta_{2}\|:\max_{k=1:K}d_{\textup{TV}}\big{(}\mathbb{P}^{(k)}_{\theta_{1}},\mathbb{P}^{(k)}_{\theta_{2}}\big{)}\leq\widetilde{\epsilon}/(1-\widetilde{\epsilon}),d_{\textup{TV}}\big{(}\mathbb{P}^{(0)}_{\theta_{1}},\mathbb{P}^{(0)}_{\theta_{2}}\big{)}\leq 1/20\big{\}}.

Lemma 33.

Consider two data generating mechanisms:

(i) \bm{x}^{(k)}\sim(1-\widetilde{\epsilon}^{\prime})\mathbb{P}_{\theta}^{(k)}+\widetilde{\epsilon}^{\prime}\mathbb{Q}^{(k)} independently for k=1:K and \bm{x}^{(0)}\sim\mathbb{P}^{(0)}_{\theta}, where \widetilde{\epsilon}^{\prime}=\frac{K-s}{K};

(ii) with a given set S\subseteq 1:K, generate \{\bm{x}^{(k)}\}_{k\in S^{c}}\sim\mathbb{Q}_{S} and \bm{x}^{(k)}\sim\mathbb{P}_{\theta}^{(k)} independently for k\in S, and \bm{x}^{(0)}\sim\mathbb{P}^{(0)}_{\theta}.

Denote the joint distributions of {𝐱(k)}k=0K\{\bm{x}^{(k)}\}_{k=0}^{K} in (i) and (ii) as (ϵ~,θ,{(k)}k=1K)\mathbb{P}_{(\widetilde{\epsilon},\theta,\{\mathbb{Q}^{(k)}\}_{k=1}^{K})} and (S,θ,)\mathbb{P}_{(S,\theta,\mathbb{Q})}, respectively. We claim that if

infθ^supθΘ{(k)}k=1K(Ks50K,θ,{(k)}k=1K)(θ^θCϖ(Ks50K,Θ))920,\inf_{\widehat{\theta}}\sup_{\begin{subarray}{c}\theta\in\Theta\\ \{\mathbb{Q}^{(k)}\}_{k=1}^{K}\end{subarray}}\mathbb{P}_{(\frac{K-s}{50K},\theta,\{\mathbb{Q}^{(k)}\}_{k=1}^{K})}\left(\|\widehat{\theta}-\theta\|\geq C\varpi^{\prime}\left(\frac{K-s}{50K},\Theta\right)\right)\geq\frac{9}{20}, (S.5.578)

then

infθ^supS:|S|ssupθΘS(S,θ,S)(θ^θCϖ(Ks50K,Θ))110,\inf_{\widehat{\theta}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\theta\in\Theta\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}_{(S,\theta,\mathbb{Q}_{S})}\left(\|\widehat{\theta}-\theta\|\geq C\varpi^{\prime}\left(\frac{K-s}{50K},\Theta\right)\right)\geq\frac{1}{10}, (S.5.579)

where ϖ(ϵ~,Θ)sup{θ1θ2:maxk=1:KKL((k)θ1(k)θ2)[ϵ~/(1ϵ~)]2,KL((0)θ1(0)θ2)1/400}\varpi^{\prime}(\widetilde{\epsilon},\Theta)\coloneqq\sup\big{\{}\|\theta_{1}-\theta_{2}\|:\max_{k=1:K}\textup{KL}\big{(}\mathbb{P}^{(k)}_{\theta_{1}}\|\mathbb{P}^{(k)}_{\theta_{2}}\big{)}\leq[\widetilde{\epsilon}/(1-\widetilde{\epsilon})]^{2},\textup{KL}\big{(}\mathbb{P}^{(0)}_{\theta_{1}}\|\mathbb{P}^{(0)}_{\theta_{2}}\big{)}\leq 1/400\big{\}} for any ϵ~(0,1)\widetilde{\epsilon}\in(0,1).

Lemma 34.

When n0Cpn_{0}\geq Cp with some constant C>0C>0, we have

inf𝚺^(0)supS:|S|ssup{𝜽¯(k)}k{0}SΘ¯SS(𝚺^(0)𝚺(0)2Cpn0)110.\inf_{\widehat{\bm{\Sigma}}^{(0)}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in\{0\}\cup S}\in\overline{\Theta}_{S}^{\prime}\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}\Bigg{(}\|\widehat{\bm{\Sigma}}^{(0)}-\bm{\Sigma}^{(0)*}\|_{2}\geq C\sqrt{\frac{p}{n_{0}}}\Bigg{)}\geq\frac{1}{10}. (S.5.580)

S.5.9.2 Main proof of Theorem 15

S.5.9.3 Proof of lemmas

Proof of Lemma 30.

It’s easy to see that Θ¯SΘ¯S,wΘ¯S,𝜷Θ¯S,δ\overline{\Theta}_{S}^{\prime}\supseteq\overline{\Theta}_{S,w}^{\prime}\cup\overline{\Theta}_{S,\bm{\beta}}^{\prime}\cup\overline{\Theta}_{S,\delta}^{\prime}, where

Θ¯S,w\displaystyle\overline{\Theta}_{S,w}^{\prime} ={{𝜽¯(k)}k{0}S:𝝁(k)1=𝟏p/p,𝝁(k)2=𝝁(k)1,𝚺(k)=𝑰p,w(k)(cw,1cw)},\displaystyle=\Big{\{}\{\overline{\bm{\theta}}^{(k)}\}_{k\in\{0\}\cup S}:\bm{\mu}^{(k)}_{1}=\bm{1}_{p}/\sqrt{p},\bm{\mu}^{(k)}_{2}=-\bm{\mu}^{(k)}_{1},\bm{\Sigma}^{(k)}=\bm{I}_{p},w^{(k)}\in(c_{w},1-c_{w})\Big{\}}, (S.5.581)
Θ¯S,𝜷\displaystyle\overline{\Theta}_{S,\bm{\beta}}^{\prime} ={{𝜽¯(k)}k{0}S:𝚺(k)=𝑰p,w(k)=12,𝝁(k)12𝝁(k)22M,maxkS𝜷(k)𝜷(0)2h},\displaystyle=\Big{\{}\{\overline{\bm{\theta}}^{(k)}\}_{k\in\{0\}\cup S}:\bm{\Sigma}^{(k)}=\bm{I}_{p},w^{(k)}=\frac{1}{2},\|\bm{\mu}^{(k)}_{1}\|_{2}\vee\|\bm{\mu}^{(k)}_{2}\|_{2}\leq M,\max_{k\in S}\|\bm{\beta}^{(k)}-\bm{\beta}^{(0)}\|_{2}\leq h\Big{\}}, (S.5.582)
Θ¯S,δ\displaystyle\overline{\Theta}_{S,\delta}^{\prime} ={{𝜽¯(k)}k{0}S:𝚺(k)=𝑰p,w(k)=12,𝝁(k)12𝝁(k)22M,𝝁(k)1=12𝝁0,\displaystyle=\Big{\{}\{\overline{\bm{\theta}}^{(k)}\}_{k\in\{0\}\cup S}:\bm{\Sigma}^{(k)}=\bm{I}_{p},w^{(k)}=\frac{1}{2},\|\bm{\mu}^{(k)}_{1}\|_{2}\vee\|\bm{\mu}^{(k)}_{2}\|_{2}\leq M,\bm{\mu}^{(k)}_{1}=-\frac{1}{2}\bm{\mu}_{0}, (S.5.583)
𝝁(k)2=12𝝁0+𝒖,𝒖21}.\displaystyle\hskip 85.35826pt\bm{\mu}^{(k)}_{2}=\frac{1}{2}\bm{\mu}_{0}+\bm{u},\|\bm{u}\|_{2}\leq 1\Big{\}}. (S.5.584)

(i) Fix an S and a \mathbb{Q}_{S}. We want to show

inf𝜽^(0)sup{𝜽¯(k)}k{0}SΘ¯S,𝜷(𝜷^(0)𝜷(0)2𝜷^(0)+𝜷(0)2CpnS+n0)14.\displaystyle\inf_{\widehat{\bm{\theta}}^{(0)}}\sup_{\{\overline{\bm{\theta}}^{(k)*}\}_{k\in\{0\}\cup S}\in\overline{\Theta}_{S,\bm{\beta}}^{\prime}}\mathbb{P}\Bigg{(}\|\widehat{\bm{\beta}}^{(0)}-\bm{\beta}^{(0)*}\|_{2}\vee\|\widehat{\bm{\beta}}^{(0)}+\bm{\beta}^{(0)*}\|_{2}\geq C\sqrt{\frac{p}{n_{S}+n_{0}}}\Bigg{)}\geq\frac{1}{4}. (S.5.585)

By Lemma 4, \exists a quadrant 𝒬𝒗\mathcal{Q}_{\bm{v}} of p\mathbb{R}^{p} and a r/8r/8-packing of (r𝒮p)𝒬𝒗(r\mathcal{S}^{p})\cap\mathcal{Q}_{\bm{v}} under Euclidean norm: {𝝁~j}j=1N\{\widetilde{\bm{\mu}}_{j}\}_{j=1}^{N}, where r=(cp/(nS+n0))M1r=(c\sqrt{p/(n_{S}+n_{0})})\wedge M\leq 1 with a small constant c>0c>0 and N(12)p8p1=12×4p12p1N\geq(\frac{1}{2})^{p}8^{p-1}=\frac{1}{2}\times 4^{p-1}\geq 2^{p-1} when p2p\geq 2. For any 𝝁p\bm{\mu}\in\mathbb{R}^{p}, denote distribution 12𝒩(𝝁,𝑰p)+12𝒩(𝝁,𝑰p)\frac{1}{2}\mathcal{N}(\bm{\mu},\bm{I}_{p})+\frac{1}{2}\mathcal{N}(-\bm{\mu},\bm{I}_{p}) as 𝝁\mathbb{P}_{\bm{\mu}}. Then

LHS inf𝝁^sup𝝁(r𝒮p)𝒬𝒗(𝝁^𝝁2𝝁^+𝝁2CpnS)\displaystyle\geq\inf_{\widehat{\bm{\mu}}}\sup_{\bm{\mu}\in(r\mathcal{S}^{p})\cap\mathcal{Q}_{\bm{v}}}\mathbb{P}\Bigg{(}\|\widehat{\bm{\mu}}-\bm{\mu}\|_{2}\wedge\|\widehat{\bm{\mu}}+\bm{\mu}\|_{2}\geq C\sqrt{\frac{p}{n_{S}}}\Bigg{)} (S.5.586)
inf𝝁^sup𝝁(r𝒮p)𝒬𝒗(𝝁^𝝁2CpnS),\displaystyle\geq\inf_{\widehat{\bm{\mu}}}\sup_{\bm{\mu}\in(r\mathcal{S}^{p})\cap\mathcal{Q}_{\bm{v}}}\mathbb{P}\Bigg{(}\|\widehat{\bm{\mu}}-\bm{\mu}\|_{2}\geq C\sqrt{\frac{p}{n_{S}}}\Bigg{)}, (S.5.587)

where the last inequality holds because it suffices to consider estimator 𝝁^\widehat{\bm{\mu}} satisfying 𝝁^(X)(r𝒮p)𝒬𝒗\widehat{\bm{\mu}}(X)\in(r\mathcal{S}^{p})\cap\mathcal{Q}_{\bm{v}} almost surely. In addition, for any 𝒙\bm{x}, 𝒚𝒬𝒗\bm{y}\in\mathcal{Q}_{\bm{v}}, 𝒙𝒚2𝒙+𝒚2\|\bm{x}-\bm{y}\|_{2}\leq\|\bm{x}+\bm{y}\|_{2}.

By Lemma 15,

KL(k{0}S𝝁~jnkSk{0}S𝝁~jnkS)\displaystyle\text{KL}\left(\prod_{k\in\{0\}\cup S}\mathbb{P}_{\widetilde{\bm{\mu}}_{j}}^{\otimes n_{k}}\cdot\mathbb{Q}_{S}\bigg{\|}\prod_{k\in\{0\}\cup S}\mathbb{P}_{\widetilde{\bm{\mu}}_{j^{\prime}}}^{\otimes n_{k}}\cdot\mathbb{Q}_{S}\right) =k{0}SnkKL(𝝁~j𝝁~j)\displaystyle=\sum_{k\in\{0\}\cup S}n_{k}\text{KL}(\mathbb{P}_{\widetilde{\bm{\mu}}_{j}}\|\mathbb{P}_{\widetilde{\bm{\mu}}_{j^{\prime}}}) (S.5.588)
k{0}Snk8𝝁~j22𝝁~j𝝁~j22\displaystyle\leq\sum_{k\in\{0\}\cup S}n_{k}\cdot 8\|\widetilde{\bm{\mu}}_{j}\|_{2}^{2}\|\widetilde{\bm{\mu}}_{j}-\widetilde{\bm{\mu}}_{j^{\prime}}\|_{2}^{2} (S.5.589)
32(nS+n0)r2\displaystyle\leq 32(n_{S}+n_{0})r^{2} (S.5.590)
32nSc22(p1)nS+n0\displaystyle\leq 32n_{S}c^{2}\cdot\frac{2(p-1)}{n_{S}+n_{0}} (S.5.591)
64c2log2logN.\displaystyle\leq\frac{64c^{2}}{\log 2}\log N. (S.5.592)

By Lemma 3,

LHS of (S.5.587)1log2logN64c2log211p11414,\displaystyle\text{LHS of \eqref{eq: lower bdd eq mu 1 transfer}}\geq 1-\frac{\log 2}{\log N}-\frac{64c^{2}}{\log 2}\geq 1-\frac{1}{p-1}-\frac{1}{4}\geq\frac{1}{4}, (S.5.593)

when C=c/2C=c/2, p3p\geq 3 and c=log2/16c=\sqrt{\log 2}/16.

(ii) Fix an S and a \mathbb{Q}_{S}. We want to show

\displaystyle\inf_{\widehat{\bm{\theta}}^{(0)}}\sup_{\{\overline{\bm{\theta}}^{(k)*}\}_{k\in\{0\}\cup S}\in\overline{\Theta}_{S}^{\prime}}\mathbb{P}\Bigg{(}\|\widehat{\bm{\beta}}^{(0)}-\bm{\beta}^{(0)*}\|_{2}\vee\|\widehat{\bm{\beta}}^{(0)}+\bm{\beta}^{(0)*}\|_{2}\geq C\bigg{[}h\wedge\bigg{(}c\sqrt{\frac{p}{n_{0}}}\bigg{)}\bigg{]}\Bigg{)}\geq\frac{1}{4}. (S.5.594)

By Lemma 4, \exists a quadrant 𝒬𝒗\mathcal{Q}_{\bm{v}} of p\mathbb{R}^{p} and a r/8r/8-packing of (r𝒮p1)𝒬𝒗(r\mathcal{S}^{p-1})\cap\mathcal{Q}_{\bm{v}} under Euclidean norm: {ϑ~j}j=1N\{\widetilde{\bm{\vartheta}}_{j}\}_{j=1}^{N}, where r=h(cp/n0)M1r=h\wedge(c\sqrt{p/n_{0}})\wedge M\leq 1 with a small constant c>0c>0 and N(12)p18p2=12×4p22p2N\geq(\frac{1}{2})^{p-1}8^{p-2}=\frac{1}{2}\times 4^{p-2}\geq 2^{p-2} when p3p\geq 3. WLOG, assume M2M\geq 2. Denote 𝝁~j=(1,ϑ~j)p\widetilde{\bm{\mu}}_{j}=(1,\widetilde{\bm{\vartheta}}_{j}^{\top})^{\top}\in\mathbb{R}^{p}. Let 𝝁(k)1=𝝁~=(1,𝟎p1)\bm{\mu}^{(k)*}_{1}=\widetilde{\bm{\mu}}=(1,\bm{0}_{p-1})^{\top} for all kSk\in S and 𝝁(0)1=𝝁=(1,ϑ)\bm{\mu}^{(0)*}_{1}=\bm{\mu}=(1,\bm{\vartheta}) with ϑ(r𝒮p1)𝒬𝒗\bm{\vartheta}\in(r\mathcal{S}^{p-1})\cap\mathcal{Q}_{\bm{v}}. For any 𝝁p\bm{\mu}\in\mathbb{R}^{p}, denote distribution 12𝒩(𝝁,𝑰p)+12𝒩(𝝁,𝑰p)\frac{1}{2}\mathcal{N}(\bm{\mu},\bm{I}_{p})+\frac{1}{2}\mathcal{N}(-\bm{\mu},\bm{I}_{p}) as 𝝁\mathbb{P}_{\bm{\mu}}. Then similar to the arguments in (i),

LHS inf𝝁^supϑ(r𝒮p1)𝒬𝒗𝝁=(1,ϑ)(𝝁^𝝁2𝝁^+𝝁2C[h(cpn0)])\displaystyle\geq\inf_{\widehat{\bm{\mu}}}\sup_{\begin{subarray}{c}\bm{\vartheta}\in(r\mathcal{S}^{p-1})\cap\mathcal{Q}_{\bm{v}}\\ \bm{\mu}=(1,\bm{\vartheta})^{\top}\end{subarray}}\mathbb{P}\Bigg{(}\|\widehat{\bm{\mu}}-\bm{\mu}\|_{2}\wedge\|\widehat{\bm{\mu}}+\bm{\mu}\|_{2}\geq C\bigg{[}h\wedge\bigg{(}c\sqrt{\frac{p}{n_{0}}}\bigg{)}\bigg{]}\Bigg{)} (S.5.596)
inf𝝁^supϑ(r𝒮p1)𝒬𝒗𝝁=(1,ϑ)(𝝁^𝝁2C[h(cpn0)]).\displaystyle\geq\inf_{\widehat{\bm{\mu}}}\sup_{\begin{subarray}{c}\bm{\vartheta}\in(r\mathcal{S}^{p-1})\cap\mathcal{Q}_{\bm{v}}\\ \bm{\mu}=(1,\bm{\vartheta})^{\top}\end{subarray}}\mathbb{P}\Bigg{(}\|\widehat{\bm{\mu}}-\bm{\mu}\|_{2}\geq C\bigg{[}h\wedge\bigg{(}c\sqrt{\frac{p}{n_{0}}}\bigg{)}\bigg{]}\Bigg{)}. (S.5.597)

Then by Lemma 15,

KL(𝝁~jn0kS𝝁~nkS𝝁~jn0kS𝝁~nkS)\displaystyle\text{KL}\left(\mathbb{P}_{\widetilde{\bm{\mu}}_{j}}^{\otimes n_{0}}\cdot\prod_{k\in S}\mathbb{P}_{\widetilde{\bm{\mu}}}^{\otimes n_{k}}\cdot\mathbb{Q}_{S}\bigg{\|}\mathbb{P}_{\widetilde{\bm{\mu}}_{j^{\prime}}}^{\otimes n_{0}}\cdot\prod_{k\in S}\mathbb{P}_{\widetilde{\bm{\mu}}}^{\otimes n_{k}}\cdot\mathbb{Q}_{S}\right) =n0KL(𝝁~j𝝁~j)\displaystyle=n_{0}\text{KL}(\mathbb{P}_{\widetilde{\bm{\mu}}_{j}}\|\mathbb{P}_{\widetilde{\bm{\mu}}_{j^{\prime}}}) (S.5.598)
n08𝝁~j22𝝁~j𝝁~j22\displaystyle\leq n_{0}\cdot 8\|\widetilde{\bm{\mu}}_{j}\|_{2}^{2}\|\widetilde{\bm{\mu}}_{j}-\widetilde{\bm{\mu}}_{j^{\prime}}\|_{2}^{2} (S.5.599)
32n0r2\displaystyle\leq 32n_{0}r^{2} (S.5.600)
32n0c23(p2)n0\displaystyle\leq 32n_{0}c^{2}\cdot\frac{3(p-2)}{n_{0}} (S.5.601)
96c2log2logN,\displaystyle\leq\frac{96c^{2}}{\log 2}\log N, (S.5.602)

when n0(c2M2)pn_{0}\geq(c^{2}\vee M^{-2})p and p3p\geq 3. By Fano’s lemma (See Corollary 2.6 in \citealpapptsybakov2009introduction),

\displaystyle\text{LHS of (S.5.597)}\geq 1-\frac{\log 2}{\log N}-\frac{96c^{2}}{\log 2}\geq 1-\frac{1}{p-2}-\frac{1}{4}\geq\frac{1}{4}, (S.5.603)

when C=1/2C=1/2, p4p\geq 4 and c=(log2)/384c=\sqrt{(\log 2)/384}.

(iii) We want to show

inf𝜽^(0)supS:|S|ssup{𝜽¯(k)}k{0}SΘ¯S,wS(|w^(0)w(0)|C1n0)14.\inf_{\widehat{\bm{\theta}}^{(0)}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in\{0\}\cup S}\in\overline{\Theta}_{S,w}^{\prime}\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}\bigg{(}|\widehat{w}^{(0)}-w^{(0)*}|\geq C\sqrt{\frac{1}{n_{0}}}\Bigg{)}\geq\frac{1}{4}. (S.5.604)

The argument is similar to (ii). The only two differences here are that the dimension of the parameter of interest w equals 1, and that Lemma 15 is replaced by Lemma 17.

(iv) We want to show

inf𝜽^(0)supS:|S|ssup{𝜽¯(k)}k{0}SΘ¯S,δS(|δ^(0)δ(0)|C1n0)14\inf_{\widehat{\bm{\theta}}^{(0)}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in\{0\}\cup S}\in\overline{\Theta}_{S,\delta}^{\prime}\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}\bigg{(}|\widehat{\delta}^{(0)}-\delta^{(0)*}|\geq C\sqrt{\frac{1}{n_{0}}}\Bigg{)}\geq\frac{1}{4} (S.5.605)

The argument is similar to (ii).

Finally, we get the desired conclusion by combining (i)-(iv). ∎

Proof of Lemma 31.

Let \widetilde{\epsilon}=\frac{K-s}{s} and \widetilde{\epsilon}^{\prime}=\frac{K-s}{K}. Since s/K\geq c>0, we have \widetilde{\epsilon}\lesssim\widetilde{\epsilon}^{\prime}. Denote \Upsilon_{S}=\{\{\bm{\mu}^{(k)}\}_{k\in\{0\}\cup S}:\bm{\mu}^{(k)}\in\mathbb{R}_{+}^{p},\max_{k\in S}\|\bm{\mu}^{(k)}-\bm{\mu}^{(0)}\|_{2}\leq h/2,\|\bm{\mu}^{(k)}\|_{2}\leq M\}. For any \bm{\mu}\in\mathbb{R}^{p}, denote the distribution \frac{1}{2}\mathcal{N}(\bm{\mu},\bm{I}_{p})+\frac{1}{2}\mathcal{N}(-\bm{\mu},\bm{I}_{p}) by \mathbb{P}_{\bm{\mu}}, and denote \prod_{k\in S}\mathbb{P}_{\bm{\mu}^{(k)}}^{\otimes n_{k}} by \mathbb{P}_{\{\bm{\mu}^{(k)}\}_{k\in S}}. It suffices to show

inf𝜽^(0)supS:|S|ssup{𝜽¯(k)}k{0}SΘ¯SS(𝝁^(0)𝝁(0)2(C1ϵ~1maxk=1:Knk)(C21n0))110.\inf_{\widehat{\bm{\theta}}^{(0)}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in\{0\}\cup S}\in\overline{\Theta}_{S}\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}\Bigg{(}\|\widehat{\bm{\mu}}^{(0)}-\bm{\mu}^{(0)*}\|_{2}\geq\bigg{(}C_{1}\widetilde{\epsilon}^{\prime}\sqrt{\frac{1}{\max_{k=1:K}n_{k}}}\bigg{)}\wedge\bigg{(}C_{2}\sqrt{\frac{1}{n_{0}}}\bigg{)}\Bigg{)}\geq\frac{1}{10}. (S.5.606)

where =𝝁(0)n0{𝝁(k)}kSS\mathbb{P}=\mathbb{P}_{\bm{\mu}^{(0)}}^{\otimes n_{0}}\cdot\mathbb{P}_{\{\bm{\mu}^{(k)}\}_{k\in S}}\cdot\mathbb{Q}_{S}.

For any \bm{\mu}\in\mathbb{R}^{p}, denote the distribution \frac{1}{2}\mathcal{N}(\bm{\mu},\bm{I}_{p})+\frac{1}{2}\mathcal{N}(-\bm{\mu},\bm{I}_{p}) by \mathbb{P}_{\bm{\mu}}. WLOG, assume M\geq 1. For any \widetilde{\bm{\mu}}_{1}, \widetilde{\bm{\mu}}_{2}\in\mathbb{R}^{p} with \|\widetilde{\bm{\mu}}_{1}\|_{2}=\|\widetilde{\bm{\mu}}_{2}\|_{2}=1, by Lemma 15,

maxk=1:KKL(𝝁~1nk𝝁~2nk)maxk=1:Knk8𝝁~1𝝁~222.\max_{k=1:K}\text{KL}\big{(}\mathbb{P}_{\widetilde{\bm{\mu}}_{1}}^{\otimes n_{k}}\|\mathbb{P}_{\widetilde{\bm{\mu}}_{2}}^{\otimes n_{k}}\big{)}\leq\max_{k=1:K}n_{k}\cdot 8\|\widetilde{\bm{\mu}}_{1}-\widetilde{\bm{\mu}}_{2}\|_{2}^{2}. (S.5.607)

for any k=1:K. If 8\max_{k=1:K}n_{k}\cdot\|\widetilde{\bm{\mu}}_{1}-\widetilde{\bm{\mu}}_{2}\|_{2}^{2}\leq(\frac{\widetilde{\epsilon}^{\prime}}{1-\widetilde{\epsilon}^{\prime}})^{2}, then \|\widetilde{\bm{\mu}}_{1}-\widetilde{\bm{\mu}}_{2}\|_{2}\leq C\sqrt{\frac{1}{\max_{k=1:K}n_{k}}}\widetilde{\epsilon}^{\prime} for some constant C>0. On the other hand, if 8n_{0}\cdot\|\widetilde{\bm{\mu}}_{1}-\widetilde{\bm{\mu}}_{2}\|_{2}^{2}\leq 1/100, then \textup{KL}\big{(}\mathbb{P}_{\widetilde{\bm{\mu}}_{1}}^{\otimes n_{0}}\|\mathbb{P}_{\widetilde{\bm{\mu}}_{2}}^{\otimes n_{0}}\big{)}\leq 1/100 and \|\widetilde{\bm{\mu}}_{1}-\widetilde{\bm{\mu}}_{2}\|_{2}\leq\sqrt{\frac{1}{800}}\sqrt{\frac{1}{n_{0}}}. Then (S.5.606) follows by Lemma 33. ∎

Proof of Lemma 32.

The proof is similar to the proof of Theorem 5.1 in \citeappchen2018robust, so we omit it here. ∎

Proof of Lemma 34.

This can be similarly shown by Assouad’s Lemma as in the proof of Lemma 23. We omit the proof here. ∎

S.5.10 Proof of Theorem 14

The result follows from (S.5.422) and Theorem 13.

S.5.11 Proof of Theorem 16

S.5.11.1 Lemmas

Lemma 35.

Assume n0Cpn_{0}\geq Cp and Δ(0)C>0\Delta^{(0)}\geq C^{\prime}>0 with some constants CC, C>0C^{\prime}>0. We have

inf𝒞^(0)supS:|S|ssup{𝜽¯(k)}k{0}SΘ¯SS\displaystyle\inf_{\widehat{\mathcal{C}}^{(0)}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in\{0\}\cup S}\in\overline{\Theta}_{S}^{\prime}\\ \mathbb{Q}_{S}\end{subarray}} (R𝜽¯(0)(𝒞^(0))R𝜽¯(0)(𝒞𝜽¯(0))C1pnS+n0+C2h2pn0+1n0)110.\displaystyle\mathbb{P}\Bigg{(}R_{\overline{\bm{\theta}}^{(0)*}}(\widehat{\mathcal{C}}^{(0)})-R_{\overline{\bm{\theta}}^{(0)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(0)*}})\geq C_{1}\frac{p}{n_{S}+n_{0}}+C_{2}h^{2}\wedge\frac{p}{n_{0}}+\frac{1}{n_{0}}\Bigg{)}\geq\frac{1}{10}. (S.5.608)
Lemma 36.

Denote ϵ~=Kss\widetilde{\epsilon}=\frac{K-s}{s}. We have

inf𝒞^(0)supS:|S|ssup{𝜽¯(k)}k{0}SΘ¯S(T)S(R𝜽¯(0)(𝒞^(0))R𝜽¯(0)(𝒞𝜽¯(0))(C1ϵ~2maxk=1:Knk)(C21n0))110.\inf_{\widehat{\mathcal{C}}^{(0)}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in\{0\}\cup S}\in\overline{\Theta}_{S}^{(T)}\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}\Bigg{(}R_{\overline{\bm{\theta}}^{(0)*}}(\widehat{\mathcal{C}}^{(0)})-R_{\overline{\bm{\theta}}^{(0)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(0)*}})\geq\bigg{(}C_{1}\frac{\widetilde{\epsilon}^{\prime 2}}{\max_{k=1:K}n_{k}}\bigg{)}\wedge\bigg{(}C_{2}\sqrt{\frac{1}{n_{0}}}\bigg{)}\Bigg{)}\geq\frac{1}{10}. (S.5.609)

S.5.11.2 Main proof of Theorem 16

Combine Lemmas 35 and 36 to finish the proof.

S.5.11.3 Proof of lemmas

Proof of Lemma 35.

We proceed with proof ideas similar to those used in the proof of Lemma 30, and recall the definitions introduced there. We have \overline{\Theta}_{S}^{\prime}\supseteq\overline{\Theta}_{S,w}^{\prime}\cup\overline{\Theta}_{S,\bm{\beta}}^{\prime}\cup\overline{\Theta}_{S,\delta}^{\prime}, where

Θ¯S,w\displaystyle\overline{\Theta}_{S,w}^{\prime} ={{𝜽¯(k)}k{0}S:𝝁(k)1=𝟏p/p,𝝁(k)2=𝝁(k)1,𝚺(k)=𝑰p,w(k)(cw,1cw)},\displaystyle=\Big{\{}\{\overline{\bm{\theta}}^{(k)}\}_{k\in\{0\}\cup S}:\bm{\mu}^{(k)}_{1}=\bm{1}_{p}/\sqrt{p},\bm{\mu}^{(k)}_{2}=-\bm{\mu}^{(k)}_{1},\bm{\Sigma}^{(k)}=\bm{I}_{p},w^{(k)}\in(c_{w},1-c_{w})\Big{\}}, (S.5.610)
Θ¯S,𝜷\displaystyle\overline{\Theta}_{S,\bm{\beta}}^{\prime} ={{𝜽¯(k)}k{0}S:𝚺(k)=𝑰p,w(k)=12,𝝁(k)12𝝁(k)22M,maxkS𝜷(k)𝜷(0)2h},\displaystyle=\Big{\{}\{\overline{\bm{\theta}}^{(k)}\}_{k\in\{0\}\cup S}:\bm{\Sigma}^{(k)}=\bm{I}_{p},w^{(k)}=\frac{1}{2},\|\bm{\mu}^{(k)}_{1}\|_{2}\vee\|\bm{\mu}^{(k)}_{2}\|_{2}\leq M,\max_{k\in S}\|\bm{\beta}^{(k)}-\bm{\beta}^{(0)}\|_{2}\leq h\Big{\}}, (S.5.611)
Θ¯S,δ\displaystyle\overline{\Theta}_{S,\delta}^{\prime} ={{𝜽¯(k)}k{0}S:𝚺(k)=𝑰p,w(k)=12,𝝁(k)12𝝁(k)22M,𝝁(k)1=12𝝁0,\displaystyle=\Big{\{}\{\overline{\bm{\theta}}^{(k)}\}_{k\in\{0\}\cup S}:\bm{\Sigma}^{(k)}=\bm{I}_{p},w^{(k)}=\frac{1}{2},\|\bm{\mu}^{(k)}_{1}\|_{2}\vee\|\bm{\mu}^{(k)}_{2}\|_{2}\leq M,\bm{\mu}^{(k)}_{1}=-\frac{1}{2}\bm{\mu}_{0}, (S.5.612)
𝝁(k)2=12𝝁0+𝒖,𝒖21}.\displaystyle\hskip 85.35826pt\bm{\mu}^{(k)}_{2}=\frac{1}{2}\bm{\mu}_{0}+\bm{u},\|\bm{u}\|_{2}\leq 1\Big{\}}. (S.5.613)

Recall that the mis-clustering error of any classifier \mathcal{C} under the GMM with parameter set \overline{\bm{\theta}} is R_{\overline{\bm{\theta}}}(\mathcal{C})=\min_{\pi:\{1,2\}\rightarrow\{1,2\}}\mathbb{P}_{\overline{\bm{\theta}}}(\mathcal{C}(Z)\neq\pi(Y)). To facilitate the analysis, following \citeappazizyan2013minimax and \citeappcai2019chime, we define the surrogate loss L_{\overline{\bm{\theta}}}(\mathcal{C})=\min_{\pi:\{1,2\}\rightarrow\{1,2\}}\mathbb{P}_{\overline{\bm{\theta}}}(\mathcal{C}(Z)\neq\pi(\mathcal{C}_{\overline{\bm{\theta}}}(Z))), where \mathcal{C}_{\overline{\bm{\theta}}} is the Bayes classifier. Suppose \sigma=\sqrt{0.005}.

(i) We want to show

inf𝒞^(0)supS:|S|ssup{𝜽¯(k)}k{0}SΘ¯SS(R𝜽¯(0)(𝒞^(0))R𝜽¯(0)(𝒞𝜽¯(0))CpnS+n0)14.\inf_{\widehat{\mathcal{C}}^{(0)}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in\{0\}\cup S}\in\overline{\Theta}_{S}^{\prime}\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}\Bigg{(}R_{\overline{\bm{\theta}}^{(0)*}}(\widehat{\mathcal{C}}^{(0)})-R_{\overline{\bm{\theta}}^{(0)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(0)*}})\geq C\sqrt{\frac{p}{n_{S}+n_{0}}}\Bigg{)}\geq\frac{1}{4}. (S.5.614)

Consider S=1:KS=1:K and space Θ¯0={{𝜽¯(k)}k=0K:𝚺(k)=𝑰p,w(k)=1/2,𝝁(k)1=𝝁1,𝝁(k)2=𝝁2,𝝁12𝝁22M}\overline{\Theta}_{0}^{\prime}=\{\{\overline{\bm{\theta}}^{(k)}\}_{k=0}^{K}:\bm{\Sigma}^{(k)}=\bm{I}_{p},w^{(k)}=1/2,\bm{\mu}^{(k)}_{1}=\bm{\mu}_{1},\bm{\mu}^{(k)}_{2}=\bm{\mu}_{2},\|\bm{\mu}_{1}\|_{2}\vee\|\bm{\mu}_{2}\|_{2}\leq M\}. And

\text{LHS of (S.5.614)}\geq\inf_{\widehat{\mathcal{C}}^{(0)}}\sup_{\{\overline{\bm{\theta}}^{(k)*}\}_{k=0}^{K}\in\overline{\Theta}_{0}^{\prime}}\mathbb{P}\Bigg{(}R_{\overline{\bm{\theta}}^{(0)*}}(\widehat{\mathcal{C}}^{(0)})-R_{\overline{\bm{\theta}}^{(0)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(0)*}})\geq C\sqrt{\frac{p}{n_{S}+n_{0}}}\Bigg{)}. (S.5.615)

Let r=cp/(nS+n0)0.001r=c\sqrt{p/(n_{S}+n_{0})}\leq 0.001 with some small constant c>0c>0. For any 𝝁p\bm{\mu}\in\mathbb{R}^{p}, denote distribution 12𝒩(𝝁,𝑰p)+12𝒩(𝝁,𝑰p)\frac{1}{2}\mathcal{N}(\bm{\mu},\bm{I}_{p})+\frac{1}{2}\mathcal{N}(-\bm{\mu},\bm{I}_{p}) as 𝝁\mathbb{P}_{\bm{\mu}}. Consider a r/4r/4-packing of r𝒮p1r\mathcal{S}^{p-1}: {𝒗~j}j=1N\{\widetilde{\bm{v}}_{j}\}_{j=1}^{N}. By Lemma 2, N4p1N\geq 4^{p-1}. Denote 𝝁~j=(σ,𝒗~j)p\widetilde{\bm{\mu}}_{j}=(\sigma,\widetilde{\bm{v}}_{j}^{\top})^{\top}\in\mathbb{R}^{p}, where σ=0.005\sigma=\sqrt{0.005}. Then by definition of KL divergence and Lemma 8.4 in \citeappcai2019chime,

KL(k{0}S𝝁~jnkSk{0}S𝝁~jnkS)\displaystyle\text{KL}\left(\prod_{k\in\{0\}\cup S}\mathbb{P}_{\widetilde{\bm{\mu}}_{j}}^{\otimes n_{k}}\cdot\mathbb{Q}_{S}\bigg{\|}\prod_{k\in\{0\}\cup S}\mathbb{P}_{\widetilde{\bm{\mu}}_{j^{\prime}}}^{\otimes n_{k}}\cdot\mathbb{Q}_{S}\right) =k{0}SnkKL(𝝁~j𝝁~j)\displaystyle=\sum_{k\in\{0\}\cup S}n_{k}\text{KL}(\mathbb{P}_{\widetilde{\bm{\mu}}_{j}}\|\mathbb{P}_{\widetilde{\bm{\mu}}_{j^{\prime}}}) (S.5.616)
(nS+n0)8(1+σ2)𝝁~j𝝁~j22\displaystyle\leq(n_{S}+n_{0})\cdot 8(1+\sigma^{2})\|\widetilde{\bm{\mu}}_{j}-\widetilde{\bm{\mu}}_{j^{\prime}}\|_{2}^{2} (S.5.617)
32(1+σ2)(nS+n0)r2\displaystyle\leq 32(1+\sigma^{2})(n_{S}+n_{0})r^{2} (S.5.618)
32(1+σ2)(nS+n0)c22(p1)nS+n0\displaystyle\leq 32(1+\sigma^{2})(n_{S}+n_{0})\cdot c^{2}\frac{2(p-1)}{n_{S}+n_{0}} (S.5.619)
32(1+σ2)c2log2logN.\displaystyle\leq\frac{32(1+\sigma^{2})c^{2}}{\log 2}\log N. (S.5.620)

For simplicity, we write L𝜽¯L_{\overline{\bm{\theta}}} with 𝜽¯Θ¯0\overline{\bm{\theta}}\in\overline{\Theta}_{0} and 𝝁1=𝝁2=𝝁\bm{\mu}_{1}=-\bm{\mu}_{2}=\bm{\mu} as L𝝁L_{\bm{\mu}}. By Lemma 8.5 in \citeappcai2019chime,

L𝝁~i(𝒞𝝁~j)12g(σ2+r22)𝝁~i𝝁~j2𝝁~i2120.15r/4σ2+r22r,L_{\widetilde{\bm{\mu}}_{i}}(\mathcal{C}_{\widetilde{\bm{\mu}}_{j}})\geq\frac{1}{\sqrt{2}}g\left(\frac{\sqrt{\sigma^{2}+r^{2}}}{2}\right)\frac{\|\widetilde{\bm{\mu}}_{i}-\widetilde{\bm{\mu}}_{j}\|_{2}}{\|\widetilde{\bm{\mu}}_{i}\|_{2}}\geq\frac{1}{\sqrt{2}}\cdot 0.15\cdot\frac{r/4}{\sqrt{\sigma^{2}+r^{2}}}\geq 2r, (S.5.621)

where g(x)=\phi(x)[\phi(x)-x\Phi(x)]. The last inequality holds because \sqrt{\sigma^{2}+r^{2}}\leq\sqrt{2}\sigma and g(\sqrt{\sigma^{2}+r^{2}}/2)\geq 0.15 when r^{2}\leq\sigma^{2}=0.001.

L𝝁~i(𝒞)+L𝝁~j(𝒞)L𝝁~i(𝒞𝝁~j)KL(𝝁~i𝝁~j)/22rr=cpnS+n0.L_{\widetilde{\bm{\mu}}_{i}}(\mathcal{C})+L_{\widetilde{\bm{\mu}}_{j}}(\mathcal{C})\geq L_{\widetilde{\bm{\mu}}_{i}}(\mathcal{C}_{\widetilde{\bm{\mu}}_{j}})-\sqrt{\text{KL}(\mathbb{P}_{\widetilde{\bm{\mu}}_{i}}\|\mathbb{P}_{\widetilde{\bm{\mu}}_{j}})/2}\geq 2r-r=c\sqrt{\frac{p}{n_{S}+n_{0}}}. (S.5.622)

For any 𝒞^(0)\widehat{\mathcal{C}}^{(0)}, consider a test ψ=argminj=1:NL𝝁~j(𝒞^(0))\psi^{*}=\operatorname*{arg\,min}_{j=1:N}L_{\widetilde{\bm{\mu}}_{j}}(\widehat{\mathcal{C}}^{(0)}). Therefore if there exists j0j_{0} such that L𝝁~j0(𝒞^(0))<c2pnS+n0L_{\widetilde{\bm{\mu}}_{j_{0}}}(\widehat{\mathcal{C}}^{(0)})<\frac{c}{2}\sqrt{\frac{p}{n_{S}+n_{0}}}, then by (S.5.622), we must have ψ=j0\psi^{*}=j_{0}. Let C1c/2C_{1}\leq c/2, then by Fano’s lemma (Corollary 6 in \citealpapptsybakov2009introduction)

inf𝒞^(0)sup{𝜽¯(k)}k=0KΘ¯0(L𝜽¯(0)(𝒞^(0))C1pnS+n0)\displaystyle\inf_{\widehat{\mathcal{C}}^{(0)}}\sup_{\{\overline{\bm{\theta}}^{(k)*}\}_{k=0}^{K}\in\overline{\Theta}_{0}^{\prime}}\mathbb{P}\Bigg{(}L_{\overline{\bm{\theta}}^{(0)*}}(\widehat{\mathcal{C}}^{(0)})\geq C_{1}\sqrt{\frac{p}{n_{S}+n_{0}}}\Bigg{)} inf𝒞^(0)supj=1:N(L𝝁~(j)(𝒞^(0))C1pnS+n0)\displaystyle\geq\inf_{\widehat{\mathcal{C}}^{(0)}}\sup_{j=1:N}\mathbb{P}\Bigg{(}L_{\widetilde{\bm{\mu}}^{(j)}}(\widehat{\mathcal{C}}^{(0)})\geq C_{1}\sqrt{\frac{p}{n_{S}+n_{0}}}\Bigg{)} (S.5.623)
inf𝒞^(0)supj=1:N(ψj)\displaystyle\geq\inf_{\widehat{\mathcal{C}}^{(0)}}\sup_{j=1:N}\mathbb{P}\Bigg{(}\psi^{*}\neq j\Bigg{)} (S.5.624)
infψsupj=1:N(ψj)\displaystyle\geq\inf_{\psi}\sup_{j=1:N}\mathbb{P}\Bigg{(}\psi\neq j\Bigg{)} (S.5.625)
1log2logN32(1+σ2)c2log2\displaystyle\geq 1-\frac{\log 2}{\log N}-\frac{32(1+\sigma^{2})c^{2}}{\log 2} (S.5.626)
14,\displaystyle\geq\frac{1}{4}, (S.5.627)

when p\geq 2 and c=\sqrt{\frac{\log 2}{128(1+\sigma^{2})}}. Then apply Lemma 25 to get (S.5.614).

(ii) We want to show

inf𝒞^(0)supS:|S|ssup{𝜽¯(k)}k{0}SΘ¯S,𝜷S(R𝜽¯(0)(𝒞^(0))R𝜽¯(0)(𝒞𝜽¯(0))C(hpn0))14.\inf_{\widehat{\mathcal{C}}^{(0)}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in\{0\}\cup S}\in\overline{\Theta}_{S,\bm{\beta}}^{\prime}\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}\Bigg{(}R_{\overline{\bm{\theta}}^{(0)*}}(\widehat{\mathcal{C}}^{(0)})-R_{\overline{\bm{\theta}}^{(0)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(0)*}})\geq C\left(h\wedge\sqrt{\frac{p}{n_{0}}}\right)\Bigg{)}\geq\frac{1}{4}. (S.5.628)

Fix an S and a \mathbb{Q}_{S}, and suppose 1\in S. We have

LHS of (S.5.628)inf𝒞^(0)sup{𝜽¯(k)}k{0}SΘ¯S,𝜷S(R𝜽¯(0)(𝒞^(0))R𝜽¯(0)(𝒞𝜽¯(0))C(hpn0)).\text{LHS of \eqref{eq: lemma 36 eq 3}}\geq\inf_{\widehat{\mathcal{C}}^{(0)}}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in\{0\}\cup S}\in\overline{\Theta}_{S,\bm{\beta}}^{\prime}\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}\bigg{(}R_{\overline{\bm{\theta}}^{(0)*}}(\widehat{\mathcal{C}}^{(0)})-R_{\overline{\bm{\theta}}^{(0)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(0)*}})\geq C\left(h\wedge\sqrt{\frac{p}{n_{0}}}\right)\Bigg{)}. (S.5.629)

Let r=h(cp/n0)Mr=h\wedge(c\sqrt{p/n_{0}})\wedge M with a small constant c>0c>0. For any 𝝁p\bm{\mu}\in\mathbb{R}^{p}, denote distribution 12𝒩(𝝁,𝑰p)+12𝒩(𝝁,𝑰p)\frac{1}{2}\mathcal{N}(\bm{\mu},\bm{I}_{p})+\frac{1}{2}\mathcal{N}(-\bm{\mu},\bm{I}_{p}) as 𝝁\mathbb{P}_{\bm{\mu}}. Consider a r/4r/4-packing of r𝒮p1r\mathcal{S}^{p-1}. By Lemma 2, N4p1N\geq 4^{p-1}. Denote 𝝁~j=(σ,𝒗~j)p\widetilde{\bm{\mu}}_{j}=(\sigma,\widetilde{\bm{v}}_{j}^{\top})^{\top}\in\mathbb{R}^{p}. WLOG, assume M2M\geq 2. Let 𝝁(k)1=𝝁~=(σ,𝟎p1)\bm{\mu}^{(k)*}_{1}=\widetilde{\bm{\mu}}=(\sigma,\bm{0}_{p-1})^{\top} for all kSk\in S and 𝝁(0)1=𝝁=(1,ϑ)\bm{\mu}^{(0)*}_{1}=\bm{\mu}=(1,\bm{\vartheta}) with ϑ(r𝒮p1)𝒬𝒗\bm{\vartheta}\in(r\mathcal{S}^{p-1})\cap\mathcal{Q}_{\bm{v}}. Then by following the same arguments in part (ii) of the proof of Lemma 30, we can show that the RHS of (S.5.629) is larger than or equal to 1/41/4 when p3p\geq 3.

(iii) We want to show

inf𝒞^(0)supS:|S|ssup{𝜽¯(k)}k{0}SΘ¯SS(R𝜽¯(0)(𝒞^(0))R𝜽¯(0)(𝒞𝜽¯(0))Chw21n0)14.\inf_{\widehat{\mathcal{C}}^{(0)}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in\{0\}\cup S}\in\overline{\Theta}_{S}^{\prime}\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}\Bigg{(}R_{\overline{\bm{\theta}}^{(0)*}}(\widehat{\mathcal{C}}^{(0)})-R_{\overline{\bm{\theta}}^{(0)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(0)*}})\geq Ch_{w}^{2}\wedge\frac{1}{n_{0}}\Bigg{)}\geq\frac{1}{4}. (S.5.630)

This can be similarly proved by following the arguments in part (i) with Lemmas 25 and 26.

(iv) We want to show

\inf_{\widehat{\mathcal{C}}^{(0)}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in\{0\}\cup S}\in\overline{\Theta}_{S}^{\prime}\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}\Bigg{(}R_{\overline{\bm{\theta}}^{(0)*}}(\widehat{\mathcal{C}}^{(0)})-R_{\overline{\bm{\theta}}^{(0)*}}(\mathcal{C}_{\overline{\bm{\theta}}^{(0)*}})\geq Ch_{\bm{\beta}}^{2}\wedge\frac{p}{n_{0}}\Bigg{)}\geq\frac{1}{4}. (S.5.631)

The conclusion can be obtained immediately from (ii), by noticing that Θ¯S,𝜷Θ¯S,𝝁\overline{\Theta}_{S,\bm{\beta}}^{\prime}\supseteq\overline{\Theta}_{S,\bm{\mu}}^{\prime}.

Finally, we get the desired conclusion by combining (i)-(iv). ∎

Proof of Lemma 36.

By Lemma 25, it suffices to prove

inf𝒞^(0)supS:|S|ssup{𝜽¯(k)}k{0}SΘ¯SS(L𝜽¯(0)(𝒞^(0))C1ϵ~2maxk=1:Knk1n0)110.\inf_{\widehat{\mathcal{C}}^{(0)}}\sup_{S:|S|\geq s}\sup_{\begin{subarray}{c}\{\overline{\bm{\theta}}^{(k)*}\}_{k\in\{0\}\cup S}\in\overline{\Theta}_{S}^{\prime}\\ \mathbb{Q}_{S}\end{subarray}}\mathbb{P}\Bigg{(}L_{\overline{\bm{\theta}}^{(0)*}}(\widehat{\mathcal{C}}^{(0)})\geq C_{1}\frac{\widetilde{\epsilon}^{\prime 2}}{\max_{k=1:K}n_{k}}\wedge\frac{1}{n_{0}}\Bigg{)}\geq\frac{1}{10}. (S.5.632)

For any 𝝁p\bm{\mu}\in\mathbb{R}^{p}, denote distribution 12𝒩(𝝁,𝑰p)+12𝒩(𝝁,𝑰p)\frac{1}{2}\mathcal{N}(\bm{\mu},\bm{I}_{p})+\frac{1}{2}\mathcal{N}(-\bm{\mu},\bm{I}_{p}) as 𝝁\mathbb{P}_{\bm{\mu}}. For simplicity, we write L𝜽¯L_{\overline{\bm{\theta}}} with 𝜽¯\overline{\bm{\theta}} satisfying 𝝁1=𝝁2=𝝁\bm{\mu}_{1}=-\bm{\mu}_{2}=\bm{\mu}, w=1/2w=1/2 and 𝚺=𝑰p\bm{\Sigma}=\bm{I}_{p} as L𝝁L_{\bm{\mu}}. Consider L𝝁(𝒞𝝁)L_{\bm{\mu}}(\mathcal{C}_{\bm{\mu}^{\prime}}) as a loss function between 𝝁\bm{\mu} and 𝝁\bm{\mu}^{\prime} in Lemmas 32 and 33. Considering 𝝁2=𝝁2=1\|\bm{\mu}\|_{2}=\|\bm{\mu}^{\prime}\|_{2}=1, by Lemma 15, note that

maxk=1:KKL(𝝁nk𝝁nk)\displaystyle\max_{k=1:K}\text{KL}(\mathbb{P}_{\bm{\mu}}^{\otimes n_{k}}\|\mathbb{P}_{\bm{\mu}^{\prime}}^{\otimes n_{k}}) 8maxk=1:Knk𝝁𝝁22,\displaystyle\leq 8\max_{k=1:K}n_{k}\cdot\|\bm{\mu}-\bm{\mu}^{\prime}\|_{2}^{2}, (S.5.633)
KL(𝝁n0𝝁n0)\displaystyle\text{KL}(\mathbb{P}_{\bm{\mu}}^{\otimes n_{0}}\|\mathbb{P}_{\bm{\mu}^{\prime}}^{\otimes n_{0}}) 8n0𝝁𝝁22.\displaystyle\leq 8n_{0}\cdot\|\bm{\mu}-\bm{\mu}^{\prime}\|_{2}^{2}. (S.5.634)

By Lemma 8.5 in \citeappcai2019chime, this implies for some constants c,C>0c,C>0

sup{L𝝁(𝒞𝝁):maxk=1:KKL(𝝁nk𝝁nk)(ϵ~/(1ϵ~))2,KL(𝝁n0𝝁n0)1/100}\displaystyle\sup\left\{L_{\bm{\mu}}(\mathcal{C}_{\bm{\mu}^{\prime}}):\max_{k=1:K}\text{KL}(\mathbb{P}_{\bm{\mu}}^{\otimes n_{k}}\|\mathbb{P}_{\bm{\mu}^{\prime}}^{\otimes n_{k}})\leq(\widetilde{\epsilon}^{\prime}/(1-\widetilde{\epsilon}))^{2},\text{KL}(\mathbb{P}_{\bm{\mu}}^{\otimes n_{0}}\|\mathbb{P}_{\bm{\mu}^{\prime}}^{\otimes n_{0}})\leq 1/100\right\} (S.5.635)
sup{c𝝁𝝁2:maxk=1:KKL(𝝁nk𝝁nk)(ϵ~/(1ϵ~))2,KL(𝝁n0𝝁n0)1/100}\displaystyle\geq\sup\left\{c\|\bm{\mu}-\bm{\mu}^{\prime}\|_{2}:\max_{k=1:K}\text{KL}(\mathbb{P}_{\bm{\mu}}^{\otimes n_{k}}\|\mathbb{P}_{\bm{\mu}^{\prime}}^{\otimes n_{k}})\leq(\widetilde{\epsilon}^{\prime}/(1-\widetilde{\epsilon}))^{2},\text{KL}(\mathbb{P}_{\bm{\mu}}^{\otimes n_{0}}\|\mathbb{P}_{\bm{\mu}^{\prime}}^{\otimes n_{0}})\leq 1/100\right\} (S.5.636)
sup{c𝝁𝝁2:8maxk=1:Knk𝝁𝝁22(ϵ~/(1ϵ~))2,8n0𝝁𝝁221/800}\displaystyle\geq\sup\left\{c\|\bm{\mu}-\bm{\mu}^{\prime}\|_{2}:8\max_{k=1:K}n_{k}\cdot\|\bm{\mu}-\bm{\mu}^{\prime}\|_{2}^{2}\leq(\widetilde{\epsilon}^{\prime}/(1-\widetilde{\epsilon}))^{2},8n_{0}\cdot\|\bm{\mu}-\bm{\mu}^{\prime}\|_{2}^{2}\leq 1/800\right\} (S.5.637)
=Cϵ~maxk=1:Knk1n0.\displaystyle=C\cdot\frac{\widetilde{\epsilon}^{\prime}}{\sqrt{\max_{k=1:K}n_{k}}}\wedge\sqrt{\frac{1}{n_{0}}}. (S.5.638)

Then apply Lemmas 32 and 33 to get the desired bound. ∎

S.5.12 Proof of Theorem 17

Denote ξ=maxk{0}Sminrk=±1rk𝜷^(k)[0]𝜷(k)2=maxk{0}S(𝜷^(k)[0]𝜷(k)2𝜷^(k)[0]+𝜷(k)2)\xi=\max_{k\in\{0\}\cup S}\min_{r_{k}=\pm 1}\|r_{k}\widehat{\bm{\beta}}^{(k)[0]}-\bm{\beta}^{(k)*}\|_{2}=\max_{k\in\{0\}\cup S}(\|\widehat{\bm{\beta}}^{(k)[0]}-\bm{\beta}^{(k)*}\|_{2}\wedge\|\widehat{\bm{\beta}}^{(k)[0]}+\bm{\beta}^{(k)*}\|_{2}). WLOG, assume S={1,,s}S=\{1,\ldots,s\} and rk=1r^{*}_{k}=1 for all k{0}Sk\in\{0\}\cup S. Hence ξ=maxk{0}S𝜷^(k)[0]𝜷(k)2\xi=\max_{k\in\{0\}\cup S}\|\widehat{\bm{\beta}}^{(k)[0]}-\bm{\beta}^{(k)*}\|_{2}. WLOG, consider r^k=1\widehat{r}_{k}=1 for all kSk\in S (i.e., the tasks in SS are already well-aligned). Consider

(1,𝒓^)\displaystyle(1,\widehat{\bm{r}}) =(1target,1,,1,1S,rs+1,,rKoutlier tasks),\displaystyle=(\underbrace{1}_{\text{target}},\underbrace{1,\ldots,1,1}_{S},\underbrace{r_{s+1},\ldots,r_{K}}_{\text{outlier tasks}}), (S.5.639)
(1,𝒓^)\displaystyle(-1,\widehat{\bm{r}}) =(1target,1,,1,1S,rs+1,,rKoutlier tasks).\displaystyle=(\underbrace{-1}_{\text{target}},\underbrace{1,\ldots,1,1}_{S},\underbrace{r_{s+1},\ldots,r_{K}}_{\text{outlier tasks}}). (S.5.640)

It suffices to prove that

score((1,𝒓^))score((1,𝒓^))>0.\displaystyle\text{score}((-1,\widehat{\bm{r}}))-\text{score}((1,\widehat{\bm{r}}))>0. (S.5.641)

In fact,

score((1,𝒓^))score((1,𝒓^))\displaystyle\text{score}((-1,\widehat{\bm{r}}))-\text{score}((1,\widehat{\bm{r}})) =2k=1s𝜷^(0)[0]+𝜷^(k)[0]2[1]+2k=s+1K𝜷^(0)[0]+rk𝜷^(k)[0]2[2]\displaystyle=2\underbrace{\sum_{k=1}^{s}\|\widehat{\bm{\beta}}^{(0)[0]}+\widehat{\bm{\beta}}^{(k)[0]}\|_{2}}_{[1]}+2\underbrace{\sum_{k=s+1}^{K}\|\widehat{\bm{\beta}}^{(0)[0]}+r_{k}\widehat{\bm{\beta}}^{(k)[0]}\|_{2}}_{[2]} (S.5.642)
2k=1s𝜷^(0)[0]𝜷^(k)[0]2[1]2k=s+1K𝜷^(0)[0]+rk𝜷^(k)[0]2[2],\displaystyle\quad-2\underbrace{\sum_{k=1}^{s}\|\widehat{\bm{\beta}}^{(0)[0]}-\widehat{\bm{\beta}}^{(k)[0]}\|_{2}}_{[1]^{\prime}}-2\underbrace{\sum_{k=s+1}^{K}\|-\widehat{\bm{\beta}}^{(0)[0]}+r_{k}\widehat{\bm{\beta}}^{(k)[0]}\|_{2}}_{[2]^{\prime}}, (S.5.643)

where

[1][1]\displaystyle[1]-[1]^{\prime} k=1s(𝜷(0)+𝜷(k)2𝜷(0)𝜷(k)24ξ)\displaystyle\geq\sum_{k=1}^{s}(\|\bm{\beta}^{(0)*}+\bm{\beta}^{(k)*}\|_{2}-\|\bm{\beta}^{(0)*}-\bm{\beta}^{(k)*}\|_{2}-4\xi) (S.5.644)
k=1s(2𝜷(0)22𝜷(0)𝜷(k)24ξ)\displaystyle\geq\sum_{k=1}^{s}(2\|\bm{\beta}^{(0)*}\|_{2}-2\|\bm{\beta}^{(0)*}-\bm{\beta}^{(k)*}\|_{2}-4\xi) (S.5.645)
s(2𝜷(0)24h4ξ),\displaystyle\geq s(2\|\bm{\beta}^{(0)*}\|_{2}-4h-4\xi), (S.5.646)

and

[2]-[2]^{\prime}\geq-2\sum_{k=s+1}^{K}\|\widehat{\bm{\beta}}^{(0)[0]}\|_{2}\geq-2(K-s)(\|\bm{\beta}^{(0)*}\|_{2}+\xi). (S.5.647)

Hence

score((1,𝒓^))score((1,𝒓^))\displaystyle\text{score}((-1,\widehat{\bm{r}}))-\text{score}((1,\widehat{\bm{r}})) =2([1][1])+2([2][2])\displaystyle=2([1]-[1]^{\prime})+2([2]-[2]^{\prime}) (S.5.648)
4[(2sK)𝜷(0)22sh(K+s)ξ]\displaystyle\geq 4[(2s-K)\|\bm{\beta}^{(0)*}\|_{2}-2sh-(K+s)\xi] (S.5.649)
>0,\displaystyle>0, (S.5.650)

when 𝜷(0)2>2(1ϵ)12ϵh+2ϵ12ϵξ\|\bm{\beta}^{(0)*}\|_{2}>\frac{2(1-\epsilon)}{1-2\epsilon}h+\frac{2-\epsilon}{1-2\epsilon}\xi, which completes our proof.
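The comparison above only requires flipping the sign of the target task while keeping \widehat{\bm{r}} fixed. As a toy illustration, the Python sketch below assumes that the score of a sign vector \bm{r} is the sum of the pairwise distances \|r_{k}\widehat{\bm{\beta}}^{(k)[0]}-r_{k^{\prime}}\widehat{\bm{\beta}}^{(k^{\prime})[0]}\|_{2} over ordered pairs of tasks, which is consistent with the score difference displayed in (S.5.642)-(S.5.643); the actual alignment algorithm is defined in the main text, so this is only a sketch under that assumption.

```python
# Toy sketch of sign alignment: minimize an assumed pairwise score
#   score(r) = sum_{k != k'} || r_k * beta_hat_k - r_k' * beta_hat_k' ||_2
# over sign vectors r in {-1, +1}^{K+1} by brute force (small K only).
import itertools
import numpy as np

rng = np.random.default_rng(1)
p, K, s, h = 10, 6, 4, 0.1                     # first s tasks are non-outliers
beta0 = rng.normal(size=p)
beta0 /= np.linalg.norm(beta0)

betas = [beta0.copy()]                          # index 0: target task
for k in range(1, K + 1):
    b = beta0 + h * rng.normal(size=p) if k <= s else rng.normal(size=p)
    betas.append(rng.choice([-1.0, 1.0]) * b)   # initialization may flip the sign
betas = np.array(betas)

def score(r):
    diffs = r[:, None, None] * betas[:, None, :] - r[None, :, None] * betas[None, :, :]
    return np.linalg.norm(diffs, axis=-1).sum()

best = min(itertools.product([-1.0, 1.0], repeat=K + 1),
           key=lambda r: score(np.array(r)))
aligned = np.array(best)[:, None] * betas
# After alignment the target and the non-outlier tasks typically share the same
# relative sign, so these inner products are typically positive.
print(np.round([aligned[0] @ aligned[k] for k in range(1, s + 1)], 2))
```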

S.5.13 Proof of Theorem 7

Define the contraction basin of one GMM as

\displaystyle B_{\text{con}}(\bm{\theta}^{(k)*}) =\{\bm{\theta}=\{\{w_{r}\}_{r=2}^{R},\{\bm{\beta}_{r}\}_{r=2}^{R},\{\delta_{r}\}_{r=2}^{R}\}:w_{r}\in(c_{w}/2,1-c_{w}/2), (S.5.651)
\displaystyle\quad\|\bm{\beta}_{r}-\bm{\beta}_{r}^{*}\|_{2}\leq C_{b}\Delta,|\delta_{r}-\delta_{r}^{*}|\leq C_{b}\Delta\}. (S.5.652)

And the joint contraction basin is defined as B_{\text{con}}(\{\bm{\theta}^{(k)*}\}_{k\in S})=\{\{\bm{\theta}^{(k)}\}_{k\in S}:\bm{\theta}^{(k)}\in B_{\text{con}}(\bm{\theta}^{(k)*})\text{ for all }k\in S\}.

For 𝜽=({wr}r=2R,{𝜷r}r=2R,{δr}r=2R)\bm{\theta}=(\{w_{r}\}_{r=2}^{R},\{\bm{\beta}_{r}\}_{r=2}^{R},\{\delta_{r}\}_{r=2}^{R}) and 𝜽=({wr}r=2R,{𝜷r}r=2R,{δr}r=2R)\bm{\theta}^{\prime}=(\{w_{r}^{\prime}\}_{r=2}^{R},\{\bm{\beta}_{r}^{\prime}\}_{r=2}^{R},\{\delta_{r}^{\prime}\}_{r=2}^{R}), define

d(𝜽,𝜽)=maxr=2:R{|wrwr|𝜷r𝜷r2|δrδr|}.d(\bm{\theta},\bm{\theta}^{\prime})=\max_{r=2:R}\{|w_{r}-w_{r}^{\prime}|\vee\|\bm{\beta}_{r}-\bm{\beta}_{r}^{\prime}\|_{2}\vee|\delta_{r}-\delta_{r}^{\prime}|\}. (S.5.653)

S.5.13.1 Lemmas

For GMM 𝒛r=1Rwr𝒩(𝝁r,𝚺)\bm{z}\sim\sum_{r=1}^{R}w_{r}^{*}\mathcal{N}(\bm{\mu}_{r}^{*},\bm{\Sigma}^{*}) and any 𝜽\bm{\theta}, define

γ(r)𝜽(𝒛)=wrexp{𝜷r𝒛δr}w1+r=2Rwrexp{𝜷r𝒛δr},r=2:R,γ(1)𝜽(𝒛)=w1w1+r=2Rwrexp{𝜷r𝒛δr}.\gamma^{(r)}_{\bm{\theta}}(\bm{z})=\frac{w_{r}\exp\{\bm{\beta}_{r}^{\top}\bm{z}-\delta_{r}\}}{w_{1}+\sum_{r=2}^{R}w_{r}\exp\{\bm{\beta}_{r}^{\top}\bm{z}-\delta_{r}\}},r=2:R,\quad\gamma^{(1)}_{\bm{\theta}}(\bm{z})=\frac{w_{1}}{w_{1}+\sum_{r=2}^{R}w_{r}\exp\{\bm{\beta}_{r}^{\top}\bm{z}-\delta_{r}\}}. (S.5.654)

Denote wr(𝜽)=𝔼[γ(r)𝜽(𝒛)]w_{r}(\bm{\theta})=\mathbb{E}[\gamma^{(r)}_{\bm{\theta}}(\bm{z})] and 𝝁r(𝜽)=𝔼[γ(r)𝜽(𝒛)𝒛]𝔼[γ(r)𝜽(𝒛)]\bm{\mu}_{r}(\bm{\theta})=\frac{\mathbb{E}[\gamma^{(r)}_{\bm{\theta}}(\bm{z})\bm{z}]}{\mathbb{E}[\gamma^{(r)}_{\bm{\theta}}(\bm{z})]}.
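As a concrete illustration of these definitions, the Python sketch below computes the responsibilities \gamma^{(r)}_{\bm{\theta}}(\bm{z}) in their softmax form and approximates the population quantities w_{r}(\bm{\theta}) and \bm{\mu}_{r}(\bm{\theta}) by Monte Carlo with \bm{\Sigma}^{*}=\bm{I}_{p}. The separation level, the perturbation of \bm{\theta}, and the identity-covariance relation \bm{\beta}_{r}=\bm{\mu}_{r}-\bm{\mu}_{1}, \delta_{r}=(\|\bm{\mu}_{r}\|_{2}^{2}-\|\bm{\mu}_{1}\|_{2}^{2})/2 are assumptions made for this illustration only. One such update already pulls the weights and means much closer to the truth, which is the behavior quantified in Lemma 37 below.

```python
# Minimal Monte Carlo sketch (not the paper's implementation) of the responsibilities
# gamma_theta^{(r)}(z) and of the population updates w_r(theta) = E[gamma] and
# mu_r(theta) = E[gamma z] / E[gamma], for Sigma = I_p.
import numpy as np

rng = np.random.default_rng(0)
p, R, n = 5, 3, 200_000
w_star = np.array([0.3, 0.3, 0.4])
mu_star = np.stack([np.zeros(p), 4.0 * np.eye(p)[0], 4.0 * np.eye(p)[1]])  # well separated

labels = rng.choice(R, size=n, p=w_star)            # draw z from sum_r w_r^* N(mu_r^*, I_p)
z = mu_star[labels] + rng.normal(size=(n, p))

def responsibilities(z, w, mu):
    # assumed identity-covariance parameterization: beta_r = mu_r - mu_1 and
    # delta_r = (||mu_r||^2 - ||mu_1||^2) / 2, so gamma^{(r)} is a softmax in z
    beta = mu - mu[0]
    delta = 0.5 * ((mu ** 2).sum(axis=1) - (mu[0] ** 2).sum())
    logits = np.log(w)[None, :] + z @ beta.T - delta[None, :]
    logits -= logits.max(axis=1, keepdims=True)
    g = np.exp(logits)
    return g / g.sum(axis=1, keepdims=True)

# start from a perturbed theta and perform one population-style update
w = w_star + np.array([0.05, -0.05, 0.0])
mu = mu_star + 0.3 * rng.normal(size=mu_star.shape)
gam = responsibilities(z, w, mu)
w_new = gam.mean(axis=0)                             # approximates w_r(theta)
mu_new = (gam.T @ z) / gam.sum(axis=0)[:, None]      # approximates mu_r(theta)
print("max |w_new - w*| :", np.abs(w_new - w_star).max())
print("max ||mu_new - mu*||:", np.linalg.norm(mu_new - mu_star, axis=1).max())
```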

Lemma 37 (Contraction of multi-cluster GMM).

When C_{b}\leq cc_{\bm{\Sigma}}^{-1/2} with a small constant c>0 and \Delta\geq C\log(c_{\bm{\Sigma}}Mc_{w}^{-1}) with a large constant C>0, there exist positive constants C^{\prime} and C^{\prime\prime} such that, for any \bm{\theta}\in B_{\textup{con}}(\bm{\theta}^{(k)*}),

|wr(𝜽)wr|Cexp{CΔ2}d(𝜽,𝜽),𝝁r(𝜽)𝝁r2Cexp{CΔ2}d(𝜽,𝜽),|w_{r}(\bm{\theta})-w_{r}^{*}|\leq C^{\prime}\exp\{-C^{\prime\prime}\Delta^{2}\}\cdot d(\bm{\theta},\bm{\theta}^{*}),\quad\|\bm{\mu}_{r}(\bm{\theta})-\bm{\mu}_{r}^{*}\|_{2}\leq C^{\prime}\exp\{-C^{\prime\prime}\Delta^{2}\}\cdot d(\bm{\theta},\bm{\theta}^{*}), (S.5.655)

where Cexp{CΔ2}κ0<1C^{\prime}\exp\{-C^{\prime\prime}\Delta^{2}\}\leq\kappa_{0}<1 with a constant κ0\kappa_{0}.

Lemma 38 (Vectorized contraction of Rademacher complexity, Corollary 1 in \citealpappmaurer2016vector).

Suppose {ϵir}i[n],r[R]\{\epsilon_{ir}\}_{i\in[n],r\in[R]} and {ϵi}i=1n\{\epsilon_{i}\}_{i=1}^{n} are independent Rademacher variables. Let \mathcal{F} be a class of functions f:d𝒮Rf:\mathbb{R}^{d}\rightarrow\mathcal{S}\subseteq\mathbb{R}^{R} and h:𝒮h:\mathcal{S}\rightarrow\mathbb{R} is LL-Lipschitz under 2\ell_{2}-norm, i.e., |h(𝐲)h(𝐲)|L𝐲𝐲2|h(\bm{y})-h(\bm{y}^{\prime})|\leq L\|\bm{y}-\bm{y}^{\prime}\|_{2}, where 𝐲=(y1,,yR)\bm{y}=(y_{1},\ldots,y_{R})^{\top}, 𝐲=(y1,,yR)𝒮\bm{y}^{\prime}=(y_{1}^{\prime},\ldots,y_{R}^{\prime})^{\top}\in\mathcal{S}. Then

𝔼supfi=1nϵih(f(xi))2L𝔼supfi=1nr=1Rϵirfr(xi),\mathbb{E}\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\epsilon_{i}h(f(x_{i}))\leq\sqrt{2}L\mathbb{E}\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\sum_{r=1}^{R}\epsilon_{ir}f_{r}(x_{i}), (S.5.656)

where fr(xi)f_{r}(x_{i}) is the rr-th component of f(xi)𝒮Rf(x_{i})\in\mathcal{S}\subseteq\mathbb{R}^{R}.
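To see what Lemma 38 buys in the arguments below, the following toy Monte Carlo sketch (Python; the finite function class, sample sizes, and constants are assumptions for illustration only) compares the two Rademacher averages in (S.5.656) for a small class of linear maps and the map h(\bm{y})=\exp\{y_{1}\}/(1+\sum_{r}\exp\{y_{r}\}), which is 1-Lipschitz under the \ell_{2}-norm and has the same softmax form as the function \varphi used later in the proof.

```python
# Toy Monte Carlo comparison of the two sides of (S.5.656) for a finite class of
# linear maps f_W(x) = W x and h(y) = exp(y_1) / (1 + sum_r exp(y_r)).
import numpy as np

rng = np.random.default_rng(0)
n, p, R, n_funcs, n_mc = 40, 4, 3, 15, 2000

X = rng.normal(size=(n, p))                                # fixed points x_1, ..., x_n
Ws = rng.normal(size=(n_funcs, R, p)) / np.sqrt(p)         # finite class {f_W}
FX = np.einsum("frp,np->fnr", Ws, X)                       # f_W(x_i), shape (n_funcs, n, R)
HX = np.exp(FX[..., 0]) / (1.0 + np.exp(FX).sum(axis=-1))  # h(f_W(x_i)), shape (n_funcs, n)

eps1 = rng.choice([-1.0, 1.0], size=(n_mc, n))             # Rademacher draws, left-hand side
eps2 = rng.choice([-1.0, 1.0], size=(n_mc, n, R))          # Rademacher draws, right-hand side
lhs = np.mean(np.max(eps1 @ HX.T, axis=1))                 # E sup_f sum_i eps_i h(f(x_i))
rhs = np.sqrt(2) * np.mean(np.max(np.einsum("mnr,fnr->mf", eps2, FX), axis=1))
print(f"E sup_f sum_i eps_i h(f(x_i)) ~ {lhs:.2f}  <=  {rhs:.2f} ~ sqrt(2) E sup_f sum_ir eps_ir f_r(x_i)")
```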

S.5.13.2 Main proof of Theorem 7

The proof idea is almost the same as the idea used in the proof of Theorem 1. We still need to establish results similar to those presented in the lemmas associated with Theorem 1, and then go through the same arguments as in the proof of Theorem 1. We only sketch the key steps and the differences here.

The biggest difference appears in the proofs of the lemmas associated with Theorem 1 under the context of multi-cluster GMMs. The original arguments in the proofs of Lemmas 11-14 rely on the contraction inequality for Rademacher variables and univariate Lipschitz functions, which is no longer applicable here. We replace this part with an argument based on a vectorized Rademacher contraction inequality \citepappmaurer2016vector.

First, we will show that

sup𝜽(k)Bcon𝜷(k)r𝜷(k)r2ξ(k)|1nki=1nkγ(r)𝜽(k)(𝒛(k)i)𝔼[γ(r)𝜽(k)(𝒛(k))]|ξ(k)pnk+logKnk,\sup_{\begin{subarray}{c}\bm{\theta}^{(k)}\in B_{\text{con}}\\ \|\bm{\beta}^{(k)}_{r}-\bm{\beta}^{(k)*}_{r}\|_{2}\leq\xi^{(k)}\end{subarray}}\left|\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\gamma^{(r)}_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})-\mathbb{E}[\gamma^{(r)}_{\bm{\theta}^{(k)}}(\bm{z}^{(k)})]\right|\lesssim\xi^{(k)}\sqrt{\frac{p}{n_{k}}}+\sqrt{\frac{\log K}{n_{k}}}, (S.5.657)

for all kSk\in S and r1:Rr\in 1:R, with probability at least 1CK21-CK^{-2}. Denote the LHS as WW. By changing one observation 𝒛(k)i\bm{z}^{(k)}_{i}, denote the new WW as WW^{\prime}. Since γ(r)𝜽(𝒛)\gamma^{(r)}_{\bm{\theta}}(\bm{z}) is bounded for all 𝒛p\bm{z}\in\mathbb{R}^{p}, we know that |WW|1/nk|W-W^{\prime}|\leq 1/n_{k}. Then by bounded difference inequality, we have

W𝔼W+ClogKnk,W\leq\mathbb{E}W+C\sqrt{\frac{\log K}{n_{k}}}, (S.5.658)

with probability at least 1CK21-C^{\prime}K^{-2}. On the other hand, by symmetrization,

𝔼W2nk𝔼𝒛𝔼ϵsup𝜽(k)Bcon𝜷(k)r𝜷(k)r2ξ(k)|i=1nkϵ(k)iγ(r)𝜽(k)(𝒛(k)i)|.\mathbb{E}W\leq\frac{2}{n_{k}}\mathbb{E}_{\bm{z}}\mathbb{E}_{\bm{\epsilon}}\sup_{\begin{subarray}{c}\bm{\theta}^{(k)}\in B_{\text{con}}\\ \|\bm{\beta}^{(k)}_{r}-\bm{\beta}^{(k)*}_{r}\|_{2}\leq\xi^{(k)}\end{subarray}}\left|\sum_{i=1}^{n_{k}}\epsilon^{(k)}_{i}\gamma^{(r)}_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})\right|. (S.5.659)

Note that γ(r)𝜽(k)(𝒛)=w(k)rexp{(𝜷(k)r)𝒛δ(k)r}w(k)1+r=2Rw(k)rexp{(𝜷(k)r)𝒛δ(k)r}=exp{(𝜷(k)r)𝒛δ(k)r+logw(k)rlogw(k)1}1+r=2Rexp{(𝜷(k)r)𝒛δ(k)r+logw(k)rlogw(k)1}=φ({(𝜷(k)r)𝒛δ(k)r+logw(k)rlogw(k)1}r=2R)\gamma^{(r)}_{\bm{\theta}^{(k)}}(\bm{z})=\frac{w^{(k)}_{r}\cdot\exp\{(\bm{\beta}^{(k)}_{r})^{\top}\bm{z}-\delta^{(k)}_{r}\}}{w^{(k)}_{1}+\sum_{r=2}^{R}w^{(k)}_{r}\exp\{(\bm{\beta}^{(k)}_{r})^{\top}\bm{z}-\delta^{(k)}_{r}\}}=\frac{\exp\{(\bm{\beta}^{(k)}_{r})^{\top}\bm{z}-\delta^{(k)}_{r}+\log w^{(k)}_{r}-\log w^{(k)}_{1}\}}{1+\sum_{r=2}^{R}\exp\{(\bm{\beta}^{(k)}_{r})^{\top}\bm{z}-\delta^{(k)}_{r}+\log w^{(k)}_{r}-\log w^{(k)}_{1}\}}=\varphi(\{(\bm{\beta}^{(k)}_{r})^{\top}\bm{z}-\delta^{(k)}_{r}+\log w^{(k)}_{r}-\log w^{(k)}_{1}\}_{r=2}^{R}), where φ(𝒙)=exp{xr}1+r=2Rexp{xr}\varphi(\bm{x})=\frac{\exp\{x_{r}\}}{1+\sum_{r=2}^{R}\exp\{x_{r}\}} is a 1-Lipschitz function (w.r.t. 2\ell_{2}-norm). By Lemma 38,

2nk𝔼𝒛𝔼ϵsup𝜽(k)Bcon𝜷(k)r𝜷(k)r2ξ(k)|i=1nkϵ(k)iγ(r)𝜽(k)(𝒛(k)i)|\displaystyle\frac{2}{n_{k}}\mathbb{E}_{\bm{z}}\mathbb{E}_{\bm{\epsilon}}\sup_{\begin{subarray}{c}\bm{\theta}^{(k)}\in B_{\text{con}}\\ \|\bm{\beta}^{(k)}_{r}-\bm{\beta}^{(k)*}_{r}\|_{2}\leq\xi^{(k)}\end{subarray}}\left|\sum_{i=1}^{n_{k}}\epsilon^{(k)}_{i}\gamma^{(r)}_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})\right| 1nk𝔼𝒛𝔼ϵsup𝜽(k)Bcon𝜷(k)r𝜷(k)r2ξ(k)|i=1nkr=2Rϵ(k)irg(k)ir|\displaystyle\lesssim\frac{1}{n_{k}}\mathbb{E}_{\bm{z}}\mathbb{E}_{\bm{\epsilon}}\sup_{\begin{subarray}{c}\bm{\theta}^{(k)}\in B_{\text{con}}\\ \|\bm{\beta}^{(k)}_{r}-\bm{\beta}^{(k)*}_{r}\|_{2}\leq\xi^{(k)}\end{subarray}}\left|\sum_{i=1}^{n_{k}}\sum_{r=2}^{R}\epsilon^{(k)}_{ir}g^{(k)}_{ir}\right| (S.5.660)
1nkr=2R𝔼𝒛𝔼ϵsup𝜽(k)Bcon𝜷(k)r𝜷(k)r2ξ(k)|i=1nkϵ(k)irg(k)ir|,\displaystyle\lesssim\frac{1}{n_{k}}\sum_{r=2}^{R}\mathbb{E}_{\bm{z}}\mathbb{E}_{\bm{\epsilon}}\sup_{\begin{subarray}{c}\bm{\theta}^{(k)}\in B_{\text{con}}\\ \|\bm{\beta}^{(k)}_{r}-\bm{\beta}^{(k)*}_{r}\|_{2}\leq\xi^{(k)}\end{subarray}}\left|\sum_{i=1}^{n_{k}}\epsilon^{(k)}_{ir}g^{(k)}_{ir}\right|, (S.5.661)

where g(k)ir(𝜷(k)r)𝒛(k)iδ(k)r+logw(k)rlogw(k)1g^{(k)}_{ir}\coloneqq(\bm{\beta}^{(k)}_{r})^{\top}\bm{z}^{(k)}_{i}-\delta^{(k)}_{r}+\log w^{(k)}_{r}-\log w^{(k)}_{1}. It follows that

1nk𝔼𝒛𝔼ϵsup𝜽(k)Bcon𝜷(k)r𝜷(k)r2ξ(k)|i=1nkϵ(k)irg(k)ir|\displaystyle\frac{1}{n_{k}}\mathbb{E}_{\bm{z}}\mathbb{E}_{\bm{\epsilon}}\sup_{\begin{subarray}{c}\bm{\theta}^{(k)}\in B_{\text{con}}\\ \|\bm{\beta}^{(k)}_{r}-\bm{\beta}^{(k)*}_{r}\|_{2}\leq\xi^{(k)}\end{subarray}}\left|\sum_{i=1}^{n_{k}}\epsilon^{(k)}_{ir}g^{(k)}_{ir}\right| (S.5.662)
1nk𝔼𝒛,ϵsup𝜽(k)Bcon𝜷(k)r𝜷(k)r2ξ(k)|i=1nkϵ(k)ir(𝜷(k)r)𝒛(k)i|\displaystyle\leq\frac{1}{n_{k}}\mathbb{E}_{\bm{z},\bm{\epsilon}}\sup_{\begin{subarray}{c}\bm{\theta}^{(k)}\in B_{\text{con}}\\ \|\bm{\beta}^{(k)}_{r}-\bm{\beta}^{(k)*}_{r}\|_{2}\leq\xi^{(k)}\end{subarray}}\left|\sum_{i=1}^{n_{k}}\epsilon^{(k)}_{ir}(\bm{\beta}^{(k)}_{r})^{\top}\bm{z}^{(k)}_{i}\right| (S.5.663)
+1nk𝔼ϵsup𝜽(k)Bcon𝜷(k)r𝜷(k)r2ξ(k)|i=1nkϵ(k)ir(δ(k)rlogw(k)r+logw(k)1)|\displaystyle\quad+\frac{1}{n_{k}}\mathbb{E}_{\bm{\epsilon}}\sup_{\begin{subarray}{c}\bm{\theta}^{(k)}\in B_{\text{con}}\\ \|\bm{\beta}^{(k)}_{r}-\bm{\beta}^{(k)*}_{r}\|_{2}\leq\xi^{(k)}\end{subarray}}\left|\sum_{i=1}^{n_{k}}\epsilon^{(k)}_{ir}(\delta^{(k)}_{r}-\log w^{(k)}_{r}+\log w^{(k)}_{1})\right| (S.5.664)
1nk𝔼𝒛,ϵsup𝜷(k)r𝜷(k)r2ξ(k)|i=1nkϵ(k)ir(𝜷(k)r)(𝒛(k)i𝝁(k))|\displaystyle\leq\frac{1}{n_{k}}\mathbb{E}_{\bm{z},\bm{\epsilon}}\sup_{\|\bm{\beta}^{(k)}_{r}-\bm{\beta}^{(k)*}_{r}\|_{2}\leq\xi^{(k)}}\left|\sum_{i=1}^{n_{k}}\epsilon^{(k)}_{ir}(\bm{\beta}^{(k)}_{r})^{\top}(\bm{z}^{(k)}_{i}-\bm{\mu}^{(k)*})\right| (S.5.665)
+1nk𝔼𝒛,ϵsup𝜷(k)r𝜷(k)r2ξ(k)|i=1nkϵ(k)ir(𝜷(k)r)𝝁(k)|\displaystyle\quad+\frac{1}{n_{k}}\mathbb{E}_{\bm{z},\bm{\epsilon}}\sup_{\|\bm{\beta}^{(k)}_{r}-\bm{\beta}^{(k)*}_{r}\|_{2}\leq\xi^{(k)}}\left|\sum_{i=1}^{n_{k}}\epsilon^{(k)}_{ir}(\bm{\beta}^{(k)}_{r})^{\top}\bm{\mu}^{(k)*}\right| (S.5.666)
+1nk𝔼ϵsup|δ(k)r|Ucw/2w(k)r1cw/2|i=1nkϵ(k)ir(δ(k)rlogw(k)r+logw(k)1)|\displaystyle\quad+\frac{1}{n_{k}}\mathbb{E}_{\bm{\epsilon}}\sup_{\begin{subarray}{c}|\delta^{(k)}_{r}|\leq U\\ c_{w}/2\leq w^{(k)}_{r}\leq 1-c_{w}/2\end{subarray}}\left|\sum_{i=1}^{n_{k}}\epsilon^{(k)}_{ir}(\delta^{(k)}_{r}-\log w^{(k)}_{r}+\log w^{(k)}_{1})\right| (S.5.667)
ξ(k)nk𝔼𝒛,ϵsupj=1:N|i=1nkϵ(k)ir𝒖j(𝒛(k)i𝝁(k))|+1nk𝔼𝒛,ϵ|i=1nkϵ(k)ir(𝜷(k)r)(𝒛(k)i𝝁(k))|\displaystyle\leq\frac{\xi^{(k)}}{n_{k}}\mathbb{E}_{\bm{z},\bm{\epsilon}}\sup_{j=1:N}\left|\sum_{i=1}^{n_{k}}\epsilon^{(k)}_{ir}\bm{u}_{j}^{\top}(\bm{z}^{(k)}_{i}-\bm{\mu}^{(k)*})\right|+\frac{1}{n_{k}}\mathbb{E}_{\bm{z},\bm{\epsilon}}\left|\sum_{i=1}^{n_{k}}\epsilon^{(k)}_{ir}(\bm{\beta}^{(k)*}_{r})^{\top}(\bm{z}^{(k)}_{i}-\bm{\mu}^{(k)*})\right| (S.5.668)
+Cnk𝔼ϵ|i=1nkϵ(k)ir|\displaystyle\quad+\frac{C}{n_{k}}\mathbb{E}_{\bm{\epsilon}}\left|\sum_{i=1}^{n_{k}}\epsilon^{(k)}_{ir}\right| (S.5.669)

where \bm{\mu}^{(k)*}\coloneqq\sum_{r=1}^{R}w^{(k)*}_{r}\bm{\mu}^{(k)*}_{r}, \{\bm{u}_{j}\}_{j=1}^{N} is a 1/2-cover of \mathcal{S}^{p-1} with N=5^{p}, and \{\epsilon^{(k)}_{ir}\bm{u}_{j}^{\top}(\bm{z}^{(k)}_{i}-\bm{\mu}^{(k)*})\}_{i=1}^{n_{k}}, \{\epsilon^{(k)}_{ir}(\bm{\beta}^{(k)*}_{r})^{\top}(\bm{z}^{(k)}_{i}-\bm{\mu}^{(k)*})\}_{i=1}^{n_{k}}, and \{\epsilon^{(k)}_{ir}\}_{i=1}^{n_{k}} are all sub-Gaussian processes. Then by the property of sub-Gaussian variables,

RHS of (S.5.669)ξ(k)pnk+logKnk.\textup{RHS of }\eqref{eq: proof of thm multi-cluster eq 1}\lesssim\xi^{(k)}\sqrt{\frac{p}{n_{k}}}+\sqrt{\frac{\log K}{n_{k}}}. (S.5.670)

Putting all the pieces together, we obtain Wξ(k)pnk+logKnkW\lesssim\xi^{(k)}\sqrt{\frac{p}{n_{k}}}+\sqrt{\frac{\log K}{n_{k}}} with probability at least 1CK21-CK^{-2}.

The second bound we want to show is

sup{𝜽(k)}kSBconJ,2sup|w~k|11nS|kSw~ki=1nk[γ(r)𝜽(k)(𝒛(k)i)𝔼[γ(r)𝜽(k)(𝒛(k))]]|p+KnS.\sup_{\{\bm{\theta}^{(k)}\}_{k\in S}\in B_{\text{con}}^{J,2}}\sup_{|\widetilde{w}_{k}|\leq 1}\frac{1}{n_{S}}\left|\sum_{k\in S}\widetilde{w}_{k}\sum_{i=1}^{n_{k}}\Big{[}\gamma^{(r)}_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})-\mathbb{E}[\gamma^{(r)}_{\bm{\theta}^{(k)}}(\bm{z}^{(k)})]\Big{]}\right|\lesssim\sqrt{\frac{p+K}{n_{S}}}. (S.5.671)

Denote the LHS as WW^{\prime}. Again by bounded difference inequality,

W𝔼W+CpnS,W^{\prime}\leq\mathbb{E}W^{\prime}+C\sqrt{\frac{p}{n_{S}}}, (S.5.672)

with probability at least 1Cexp{Cp}1-C^{\prime}\exp\{-C^{\prime\prime}p\}. It remains to control 𝔼W\mathbb{E}W^{\prime}. By symmetrization,

𝔼W2nS𝔼sup{𝜽(k)}kSBconJ,2sup|w~k|1|kSi=1nkw~kγ(r)𝜽(k)(𝒛(k)i)|.\mathbb{E}W^{\prime}\leq\frac{2}{n_{S}}\mathbb{E}\sup_{\{\bm{\theta}^{(k)}\}_{k\in S}\in B_{\text{con}}^{J,2}}\sup_{|\widetilde{w}_{k}|\leq 1}\left|\sum_{k\in S}\sum_{i=1}^{n_{k}}\widetilde{w}_{k}\gamma^{(r)}_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})\right|. (S.5.673)

Denote

φ(w~,{(𝜷(k)r)𝒛δ(k)r+logw(k)rlogw(k)1}r=2R)\displaystyle\varphi(\widetilde{w},\{(\bm{\beta}^{(k)}_{r})^{\top}\bm{z}-\delta^{(k)}_{r}+\log w^{(k)}_{r}-\log w^{(k)}_{1}\}_{r=2}^{R}) (S.5.674)
=w~kγ(r)𝜽(k)(𝒛(k)i)\displaystyle=\widetilde{w}_{k}\gamma^{(r)}_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i}) (S.5.675)
=w~kexp{(𝜷(k)r)𝒛δ(k)r+logw(k)rlogw(k)1}1+r=2Rexp{(𝜷(k)r)𝒛δ(k)r+logw(k)rlogw(k)1},\displaystyle=\widetilde{w}_{k}\cdot\frac{\exp\{(\bm{\beta}^{(k)}_{r})^{\top}\bm{z}-\delta^{(k)}_{r}+\log w^{(k)}_{r}-\log w^{(k)}_{1}\}}{1+\sum_{r=2}^{R}\exp\{(\bm{\beta}^{(k)}_{r})^{\top}\bm{z}-\delta^{(k)}_{r}+\log w^{(k)}_{r}-\log w^{(k)}_{1}\}}, (S.5.676)

which is C-Lipschitz w.r.t. (\widetilde{w},\{(\bm{\beta}^{(k)}_{r})^{\top}\bm{z}-\delta^{(k)}_{r}+\log w^{(k)}_{r}-\log w^{(k)}_{1}\}_{r=2}^{R}), viewed as an R-dimensional vector, for some constant C. Denote g^{(k)}_{ir}=(\bm{\beta}^{(k)}_{r})^{\top}\bm{z}^{(k)}_{i}-\delta^{(k)}_{r}+\log w^{(k)}_{r}-\log w^{(k)}_{1}. A direct application of Lemma 38 implies that

1nS𝔼sup{𝜽(k)}kSBconJ,2sup|w~k|1|kSi=1nkw~kγ(r)𝜽(k)(𝒛(k)i)|\displaystyle\frac{1}{n_{S}}\mathbb{E}\sup_{\{\bm{\theta}^{(k)}\}_{k\in S}\in B_{\text{con}}^{J,2}}\sup_{|\widetilde{w}_{k}|\leq 1}\left|\sum_{k\in S}\sum_{i=1}^{n_{k}}\widetilde{w}_{k}\gamma^{(r)}_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})\right| (S.5.677)
1nS𝔼sup|w~k|1|kSi=1nkw~kϵ(k)i1|+1nSr=2R𝔼sup{𝜽(k)}kSBconJ,2|kSi=1nkg(k)irϵ(k)ir|.\displaystyle\lesssim\frac{1}{n_{S}}\mathbb{E}\sup_{|\widetilde{w}_{k}|\leq 1}\left|\sum_{k\in S}\sum_{i=1}^{n_{k}}\widetilde{w}_{k}\epsilon^{(k)}_{i1}\right|+\frac{1}{n_{S}}\sum_{r=2}^{R}\mathbb{E}\sup_{\{\bm{\theta}^{(k)}\}_{k\in S}\in B_{\text{con}}^{J,2}}\left|\sum_{k\in S}\sum_{i=1}^{n_{k}}g^{(k)}_{ir}\epsilon^{(k)}_{ir}\right|. (S.5.678)

By a similar argument involving covering number as before, we can show that

1nS𝔼sup|w~k|1|kSi=1nkw~kϵ(k)i1|+1nSr=2R𝔼sup{𝜽(k)}kSBconJ,2|kSi=1nkg(k)irϵ(k)ir|p+KnS.\frac{1}{n_{S}}\mathbb{E}\sup_{|\widetilde{w}_{k}|\leq 1}\left|\sum_{k\in S}\sum_{i=1}^{n_{k}}\widetilde{w}_{k}\epsilon^{(k)}_{i1}\right|+\frac{1}{n_{S}}\sum_{r=2}^{R}\mathbb{E}\sup_{\{\bm{\theta}^{(k)}\}_{k\in S}\in B_{\text{con}}^{J,2}}\left|\sum_{k\in S}\sum_{i=1}^{n_{k}}g^{(k)}_{ir}\epsilon^{(k)}_{ir}\right|\lesssim\sqrt{\frac{p+K}{n_{S}}}. (S.5.679)

Therefore $W^{\prime}\lesssim\sqrt{\frac{p+K}{n_{S}}}$ with probability at least $1-C^{\prime}\exp\{-C^{\prime\prime}p\}$.

The third bound we want to show is

sup𝜽(k)Bcon𝜷(k)𝜷(k)2ξ(k)|1nki=1nk[1γ(r)𝜽(k)(𝒛(k)i)](𝒛(k)i)𝜷(k)𝔼[[1γ(r)𝜽(k)(𝒛(k))](𝒛(k))𝜷(k)]|\displaystyle\sup_{\begin{subarray}{c}\bm{\theta}^{(k)}\in B_{\text{con}}\\ \|\bm{\beta}^{(k)}-\bm{\beta}^{(k)*}\|_{2}\leq\xi^{(k)}\end{subarray}}\left|\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\big{[}1-\gamma^{(r)}_{\bm{\theta}^{(k)}}(\bm{z}^{(k)}_{i})\big{]}(\bm{z}^{(k)}_{i})^{\top}\bm{\beta}^{(k)*}-\mathbb{E}\big{[}[1-\gamma^{(r)}_{\bm{\theta}^{(k)}}(\bm{z}^{(k)})](\bm{z}^{(k)})^{\top}\bm{\beta}^{(k)*}\big{]}\right| (S.5.680)
ξ(k)pnk+logKnk,\displaystyle\quad\lesssim\xi^{(k)}\sqrt{\frac{p}{n_{k}}}+\sqrt{\frac{\log K}{n_{k}}}, (S.5.681)

for all $k\in S$ and $r=1:R$, with probability at least $1-C^{\prime}(K^{-2}+K^{-1}e^{-C^{\prime\prime}p})$. Denote the LHS as $W^{\prime\prime}$. Similar to the previous two proofs, we derive an upper bound for $W^{\prime\prime}$ by controlling $W^{\prime\prime}-\mathbb{E}W^{\prime\prime}$ and $\mathbb{E}W^{\prime\prime}$ separately. The first part, involving $W^{\prime\prime}-\mathbb{E}W^{\prime\prime}$, is similar to the proof of part (i) in Lemma 12, and the second part, involving $\mathbb{E}W^{\prime\prime}$, is similar to the proof of (S.5.657), so we omit the details.

The arguments to derive these three bounds can be used to derive other results similar to the lemmas used in the proof of Theorem 1. With these lemmas in hand, the remaining proof is almost the same as the proof of Theorem 1.

S.5.13.3 Proof of lemmas

Proof of Lemma 37.

We first prove the contraction of $w_{r}$, and then only sketch the parts that differ in the proof of the contraction of $\bm{\mu}_{r}$, since the two arguments are quite similar.

Part 1: Contraction of |wr(𝜽)wr||w_{r}(\bm{\theta})-w_{r}^{*}|:

First, note that wr(𝜽)=wrw_{r}(\bm{\theta}^{*})=w^{*}_{r} and 𝝁r(𝜽)=𝝁r\bm{\mu}_{r}(\bm{\theta}^{*})=\bm{\mu}_{r}^{*}. Therefore,

\displaystyle|w_{r}(\bm{\theta})-w_{r}^{*}| =\left|\mathbb{E}[\gamma^{(r)}_{\bm{\theta}}(\bm{z})-\gamma^{(r)}_{\bm{\theta}^{*}}(\bm{z})]\right| (S.5.682)
\displaystyle\leq\sum_{\widetilde{r}=1}^{R}w^{(k)*}_{\widetilde{r}}\left|\mathbb{E}[\gamma^{(r)}_{\bm{\theta}}(\bm{z})-\gamma^{(r)}_{\bm{\theta}^{*}}(\bm{z})|y=\widetilde{r}]\right| (S.5.683)
\displaystyle\leq\sum_{\widetilde{r}=1}^{R}w^{(k)*}_{\widetilde{r}}\sum_{r^{\prime}=2}^{R}\left|\mathbb{E}\left[\frac{\partial\gamma^{(r)}_{\bm{\theta}}(\bm{z})}{\partial w_{r^{\prime}}}\bigg{|}_{\bm{\theta}=\widetilde{\bm{\theta}}_{t}}\bigg{|}y=\widetilde{r}\right]\right|\cdot|w_{r^{\prime}}-w_{r^{\prime}}^{*}| (S.5.684)
\displaystyle\quad+\sum_{\widetilde{r}=1}^{R}w^{(k)*}_{\widetilde{r}}\sum_{r^{\prime}=2}^{R}\left|\mathbb{E}\left[\frac{\partial\gamma^{(r)}_{\bm{\theta}}(\bm{z})}{\partial\delta_{r^{\prime}}}\bigg{|}_{\bm{\theta}=\widetilde{\bm{\theta}}_{t}}\bigg{|}y=\widetilde{r}\right]\right|\cdot|\delta_{r^{\prime}}-\delta_{r^{\prime}}^{*}| (S.5.685)
\displaystyle\quad+\sum_{\widetilde{r}=1}^{R}w^{(k)*}_{\widetilde{r}}\sum_{r^{\prime}=2}^{R}\left|\mathbb{E}\left[\frac{\partial\gamma^{(r)}_{\bm{\theta}}(\bm{z})}{\partial\bm{\beta}_{r^{\prime}}}\bigg{|}_{\bm{\theta}=\widetilde{\bm{\theta}}_{t}}\bigg{|}y=\widetilde{r}\right]^{\top}(\bm{\beta}_{r^{\prime}}-\bm{\beta}_{r^{\prime}}^{*})\right|. (S.5.686)

We only show how to bound $\left|\mathbb{E}[\gamma^{(r)}_{\bm{\theta}}(\bm{z})-\gamma^{(r)}_{\bm{\theta}^{*}}(\bm{z})|y=1]\right|$, i.e., the case $\widetilde{r}=1$. For the other cases $\widetilde{r}=2:R$, the proof is the same after changing the reference level from $y=1$ to $y=\widetilde{r}$. Note that

\displaystyle\left|\mathbb{E}[\gamma^{(r)}_{\bm{\theta}}(\bm{z})-\gamma^{(r)}_{\bm{\theta}^{*}}(\bm{z})|y=1]\right| \leq\sum_{r^{\prime}=2}^{R}\left|\mathbb{E}\left[\frac{\partial\gamma^{(r)}_{\bm{\theta}}(\bm{z})}{\partial w_{r^{\prime}}}\bigg{|}_{\bm{\theta}=\widetilde{\bm{\theta}}_{t}}\bigg{|}y=1\right]\right|\cdot|w_{r^{\prime}}-w_{r^{\prime}}^{*}| (S.5.687)
\displaystyle\quad+\sum_{r^{\prime}=2}^{R}\left|\mathbb{E}\left[\frac{\partial\gamma^{(r)}_{\bm{\theta}}(\bm{z})}{\partial\delta_{r^{\prime}}}\bigg{|}_{\bm{\theta}=\widetilde{\bm{\theta}}_{t}}\bigg{|}y=1\right]\right|\cdot|\delta_{r^{\prime}}-\delta_{r^{\prime}}^{*}| (S.5.688)
\displaystyle\quad+\sum_{r^{\prime}=2}^{R}\left|\mathbb{E}\left[\frac{\partial\gamma^{(r)}_{\bm{\theta}}(\bm{z})}{\partial\bm{\beta}_{r^{\prime}}}\bigg{|}_{\bm{\theta}=\widetilde{\bm{\theta}}_{t}}\bigg{|}y=1\right]^{\top}(\bm{\beta}_{r^{\prime}}-\bm{\beta}_{r^{\prime}}^{*})\right|, (S.5.689)

where $\widetilde{\bm{\theta}}_{t}=(\{\widetilde{w}_{r}\}_{r=2}^{R},\{\widetilde{\bm{\beta}}_{r}\}_{r=2}^{R},\{\widetilde{\delta}_{r}\}_{r=2}^{R})$ with $\widetilde{w}_{r}=tw_{r}+(1-t)w_{r}^{*}$, $\widetilde{\bm{\beta}}_{r}=t\bm{\beta}_{r}+(1-t)\bm{\beta}_{r}^{*}$, $\widetilde{\delta}_{r}=t\delta_{r}+(1-t)\delta_{r}^{*}$, and $\delta_{r}=\frac{1}{2}\bm{\beta}_{r}^{\top}(\bm{\mu}_{r}+\bm{\mu}_{1})$. We will bound the three terms on the RHS separately. Note that when $\bm{\theta}\in B_{\textup{con}}(\bm{\theta}^{*})$, we have $w_{r}\in[c_{w}/2,1-c_{w}]$, $\|\bm{\beta}_{r}-\bm{\beta}_{r}^{*}\|_{2}\leq C_{b}\Delta$, and $\max_{r=1:R}\|\bm{\mu}_{r}-\bm{\mu}_{r}^{*}\|_{2}\leq C_{b}\Delta$; hence $\widetilde{w}_{r}\in[c_{w}/2,1-c_{w}]$ and $\|\widetilde{\bm{\beta}}_{r}-\bm{\beta}^{*}_{r}\|_{2}\leq tC_{b}\Delta$.

(i) Bounding |𝔼[γ(r)𝜽(𝒛)wr|𝜽=𝜽~t|y=1]||\mathbb{E}[\frac{\partial\gamma^{(r)}_{\bm{\theta}}(\bm{z})}{\partial w_{r^{\prime}}}|_{\bm{\theta}=\widetilde{\bm{\theta}}_{t}}|y=1]|: Note that

γ(r)𝜽(𝒛)wr\displaystyle\frac{\partial\gamma^{(r)}_{\bm{\theta}}(\bm{z})}{\partial w_{r^{\prime}}} =exp{𝜷~r𝒛δr}w~1+r=2Rw~rexp{𝜷~r𝒛δr}w~rexp{𝜷~r𝒛δr}(exp{𝜷~r𝒛δr}1)(w~1+r=2Rw~rexp{𝜷~r𝒛δr})2\displaystyle=\frac{\exp\{\widetilde{\bm{\beta}}_{r}^{\top}\bm{z}-\delta_{r}\}}{\widetilde{w}_{1}+\sum_{r=2}^{R}\widetilde{w}_{r}\exp\{\widetilde{\bm{\beta}}_{r}^{\top}\bm{z}-\delta_{r}\}}-\frac{\widetilde{w}_{r}\exp\{\widetilde{\bm{\beta}}_{r}^{\top}\bm{z}-\delta_{r}\}(\exp\{\widetilde{\bm{\beta}}_{r}^{\top}\bm{z}-\delta_{r}\}-1)}{(\widetilde{w}_{1}+\sum_{r=2}^{R}\widetilde{w}_{r}\exp\{\widetilde{\bm{\beta}}_{r}^{\top}\bm{z}-\delta_{r}\})^{2}} (S.5.690)
={exp{𝜷~r𝒛δr}(w~1+w~r+rrw~rexp{𝜷~r𝒛δr})(w~1+r=2Rw~rexp{𝜷~r𝒛δr})2,r=r,w~rexp{𝜷~r𝒛δr}(exp{𝜷~r𝒛δr}1)(w~1+r=2Rw~rexp{𝜷~r𝒛δr})2,rr.\displaystyle=\begin{cases}\frac{\exp\{\widetilde{\bm{\beta}}_{r}^{\top}\bm{z}-\delta_{r}\}(\widetilde{w}_{1}+\widetilde{w}_{r}+\sum_{r^{\prime}\neq r}\widetilde{w}_{r^{\prime}}\exp\{\widetilde{\bm{\beta}}_{r^{\prime}}^{\top}\bm{z}-\delta_{r^{\prime}}\})}{(\widetilde{w}_{1}+\sum_{r=2}^{R}\widetilde{w}_{r}\exp\{\widetilde{\bm{\beta}}_{r}^{\top}\bm{z}-\delta_{r}\})^{2}},\quad&r^{\prime}=r,\\ -\frac{\widetilde{w}_{r}\cdot\exp\{\widetilde{\bm{\beta}}_{r}^{\top}\bm{z}-\delta_{r}\}(\exp\{\widetilde{\bm{\beta}}_{r^{\prime}}^{\top}\bm{z}-\delta_{r^{\prime}}\}-1)}{(\widetilde{w}_{1}+\sum_{r=2}^{R}\widetilde{w}_{r}\exp\{\widetilde{\bm{\beta}}_{r}^{\top}\bm{z}-\delta_{r}\})^{2}},\quad&r^{\prime}\neq r.\end{cases} (S.5.691)

Hence

𝔼[γ(r)𝜽(𝒛)wr|y=1]=𝔼[exp{𝜷~r𝒛(1)δr}(w~1+w~r+rrw~rexp{𝜷~r𝒛(1)δr})(w~1+r=2Rw~rexp{𝜷~r𝒛(1)δr})2]().\mathbb{E}\left[\frac{\partial\gamma^{(r)}_{\bm{\theta}}(\bm{z})}{\partial w_{r}}\bigg{|}y=1\right]=\underbrace{\mathbb{E}\left[\frac{\exp\{\widetilde{\bm{\beta}}_{r}^{\top}\bm{z}^{(1)}-\delta_{r}\}(\widetilde{w}_{1}+\widetilde{w}_{r}+\sum_{r^{\prime}\neq r}\widetilde{w}_{r^{\prime}}\exp\{\widetilde{\bm{\beta}}_{r^{\prime}}^{\top}\bm{z}^{(1)}-\delta_{r^{\prime}}\})}{(\widetilde{w}_{1}+\sum_{r=2}^{R}\widetilde{w}_{r}\exp\{\widetilde{\bm{\beta}}_{r}^{\top}\bm{z}^{(1)}-\delta_{r}\})^{2}}\right]}_{(*)}. (S.5.692)

Let z~r=𝜷~r(𝒛(1)𝝁1)𝒩(0,𝜷~r𝚺𝜷~r)\widetilde{z}_{r^{\prime}}=\widetilde{\bm{\beta}}^{\top}_{r^{\prime}}(\bm{z}^{(1)}-\bm{\mu}_{1}^{*})\ \sim\mathcal{N}(0,\widetilde{\bm{\beta}}_{r^{\prime}}^{\top}\bm{\Sigma}^{*}\widetilde{\bm{\beta}}_{r^{\prime}}). And notice that

𝜷~r𝝁1δ~r=t(𝜷r𝝁1δr)+(1t)[(𝜷r)𝝁1δr],\widetilde{\bm{\beta}}^{\top}_{r^{\prime}}\bm{\mu}_{1}^{*}-\widetilde{\delta}_{r^{\prime}}=t(\bm{\beta}_{r^{\prime}}^{\top}\bm{\mu}_{1}^{*}-\delta_{r^{\prime}})+(1-t)[(\bm{\beta}_{r^{\prime}}^{*})^{\top}\bm{\mu}_{1}^{*}-\delta_{r^{\prime}}^{*}], (S.5.693)

where

𝜷r𝝁1δr\displaystyle\bm{\beta}_{r^{\prime}}^{\top}\bm{\mu}_{1}^{*}-\delta_{r^{\prime}} =[𝜷r+(𝜷r𝜷r)][12(𝝁1𝝁r)+12(𝝁r𝝁r)+12(𝝁1𝝁1)]\displaystyle=[\bm{\beta}_{r^{\prime}}^{*}+(\bm{\beta}_{r^{\prime}}-\bm{\beta}^{*}_{r^{\prime}})]^{\top}\Big{[}\frac{1}{2}(\bm{\mu}_{1}^{*}-\bm{\mu}_{r^{\prime}}^{*})+\frac{1}{2}(\bm{\mu}_{r^{\prime}}^{*}-\bm{\mu}_{r^{\prime}})+\frac{1}{2}(\bm{\mu}_{1}^{*}-\bm{\mu}_{1})\Big{]} (S.5.694)
=12(𝝁r𝝁1)(𝚺)1(𝝁r𝝁1)Ar2+12(𝜷r)(𝚺)1/2(𝚺)1/2[(𝝁r𝝁r)+(𝝁1𝝁1)]\displaystyle=-\frac{1}{2}\underbrace{(\bm{\mu}_{r^{\prime}}^{*}-\bm{\mu}_{1}^{*})^{\top}(\bm{\Sigma}^{*})^{-1}(\bm{\mu}_{r^{\prime}}^{*}-\bm{\mu}_{1}^{*})}_{A_{r^{\prime}}^{2}}+\frac{1}{2}(\bm{\beta}_{r^{\prime}}^{*})^{\top}(\bm{\Sigma}^{*})^{1/2}(\bm{\Sigma}^{*})^{-1/2}[(\bm{\mu}_{r^{\prime}}^{*}-\bm{\mu}_{r^{\prime}})+(\bm{\mu}_{1}^{*}-\bm{\mu}_{1})] (S.5.695)
+12(𝜷r𝜷r)(𝚺)1/2(𝚺)1/2(𝝁1𝝁r)\displaystyle\quad+\frac{1}{2}(\bm{\beta}_{r^{\prime}}-\bm{\beta}^{*}_{r^{\prime}})^{\top}(\bm{\Sigma}^{*})^{1/2}(\bm{\Sigma}^{*})^{-1/2}(\bm{\mu}_{1}^{*}-\bm{\mu}_{r^{\prime}}^{*}) (S.5.696)
+12(𝜷r𝜷r)[(𝝁r𝝁r)+(𝝁1𝝁1)],\displaystyle\quad+\frac{1}{2}(\bm{\beta}_{r^{\prime}}-\bm{\beta}^{*}_{r^{\prime}})^{\top}[(\bm{\mu}_{r^{\prime}}^{*}-\bm{\mu}_{r^{\prime}})+(\bm{\mu}_{1}^{*}-\bm{\mu}_{1})], (S.5.697)
(𝜷r)𝝁1δr\displaystyle(\bm{\beta}_{r^{\prime}}^{*})^{\top}\bm{\mu}_{1}^{*}-\delta_{r^{\prime}}^{*} =12Ar,\displaystyle=-\frac{1}{2}A_{r^{\prime}}, (S.5.698)

and $A_{r^{\prime}}=\sqrt{(\bm{\mu}_{r^{\prime}}^{*}-\bm{\mu}_{1}^{*})^{\top}(\bm{\Sigma}^{*})^{-1}(\bm{\mu}_{r^{\prime}}^{*}-\bm{\mu}_{1}^{*})}=\sqrt{(\bm{\beta}_{r^{\prime}}^{*})^{\top}\bm{\Sigma}^{*}\bm{\beta}_{r^{\prime}}^{*}}$. By the fact that $\max_{r=1:R}\|\bm{\mu}_{r}-\bm{\mu}_{r}^{*}\|_{2}\leq C_{b}\Delta$ and $\max_{r=1:R}\|\bm{\beta}_{r}-\bm{\beta}_{r}^{*}\|_{2}\leq C_{b}\Delta$, we have

12Ar22c𝚺1/2CbΔArCb2Δ2𝜷r𝝁1δr12Ar2+2c𝚺1/2CbΔAr+Cb2Δ2,-\frac{1}{2}A_{r^{\prime}}^{2}-2c_{\bm{\Sigma}}^{1/2}C_{b}\Delta A_{r^{\prime}}-C_{b}^{2}\Delta^{2}\leq\bm{\beta}_{r^{\prime}}^{\top}\bm{\mu}_{1}^{*}-\delta_{r^{\prime}}\leq-\frac{1}{2}A_{r^{\prime}}^{2}+2c_{\bm{\Sigma}}^{1/2}C_{b}\Delta A_{r^{\prime}}+C_{b}^{2}\Delta^{2}, (S.5.699)

implying that

12Ar22c𝚺1/2CbΔArCb2Δ2𝜷~r𝝁1δ~r12Ar2+2c𝚺1/2CbΔAr+Cb2Δ2.-\frac{1}{2}A_{r^{\prime}}^{2}-2c_{\bm{\Sigma}}^{1/2}C_{b}\Delta A_{r^{\prime}}-C_{b}^{2}\Delta^{2}\leq\widetilde{\bm{\beta}}^{\top}_{r^{\prime}}\bm{\mu}_{1}^{*}-\widetilde{\delta}_{r^{\prime}}\leq-\frac{1}{2}A_{r^{\prime}}^{2}+2c_{\bm{\Sigma}}^{1/2}C_{b}\Delta A_{r^{\prime}}+C_{b}^{2}\Delta^{2}. (S.5.700)

By the Gaussian tail bound, we have

(r=2R{|z~r|14𝜷~r𝚺𝜷~r})1CRexp{132𝜷~r𝚺𝜷~r}.\mathbb{P}\left(\bigcap_{r^{\prime}=2}^{R}\Big{\{}|\widetilde{z}_{r^{\prime}}|\leq\frac{1}{4}\widetilde{\bm{\beta}}_{r^{\prime}}^{\top}\bm{\Sigma}^{*}\widetilde{\bm{\beta}}_{r^{\prime}}\Big{\}}\right)\geq 1-CR\exp\Big{\{}-\frac{1}{32}\widetilde{\bm{\beta}}_{r^{\prime}}^{\top}\bm{\Sigma}^{*}\widetilde{\bm{\beta}}_{r^{\prime}}\Big{\}}. (S.5.701)

Denote event =r=1R{|z~r|14c𝚺Cb2Δ2+12Cbc𝚺1/2ΔAr+14Ar2}\mathcal{E}=\bigcap_{r^{\prime}=1}^{R}\big{\{}|\widetilde{z}_{r^{\prime}}|\leq\frac{1}{4}c_{\bm{\Sigma}}C_{b}^{2}\Delta^{2}+\frac{1}{2}C_{b}c_{\bm{\Sigma}}^{1/2}\Delta A_{r^{\prime}}+\frac{1}{4}A_{r^{\prime}}^{2}\big{\}}. Since

14𝜷~r𝚺𝜷~r\displaystyle\frac{1}{4}\widetilde{\bm{\beta}}_{r^{\prime}}^{\top}\bm{\Sigma}^{*}\widetilde{\bm{\beta}}_{r^{\prime}} =14(𝜷~r𝜷r)𝚺(𝜷~r𝜷r)+12(𝜷~r𝜷r)(𝚺)1/2(𝚺)1/2𝜷r+14(𝜷r)𝚺𝜷r\displaystyle=\frac{1}{4}(\widetilde{\bm{\beta}}_{r^{\prime}}-\bm{\beta}_{r^{\prime}}^{*})^{\top}\bm{\Sigma}^{*}(\widetilde{\bm{\beta}}_{r^{\prime}}-\bm{\beta}_{r^{\prime}}^{*})+\frac{1}{2}(\widetilde{\bm{\beta}}_{r^{\prime}}-\bm{\beta}_{r^{\prime}}^{*})^{\top}(\bm{\Sigma}^{*})^{1/2}(\bm{\Sigma}^{*})^{1/2}\bm{\beta}_{r^{\prime}}^{*}+\frac{1}{4}(\bm{\beta}_{r^{\prime}}^{*})^{\top}\bm{\Sigma}^{*}\bm{\beta}_{r^{\prime}}^{*} (S.5.702)
14c𝚺Cb2Δ2+12Cbc𝚺1/2ΔAr+14Ar2,\displaystyle\leq\frac{1}{4}c_{\bm{\Sigma}}C_{b}^{2}\Delta^{2}+\frac{1}{2}C_{b}c_{\bm{\Sigma}}^{1/2}\Delta A_{r^{\prime}}+\frac{1}{4}A_{r^{\prime}}^{2}, (S.5.703)

and

𝜷~r𝚺𝜷~rAr2c𝚺Cb2Δ22Cbc𝚺1/2ΔAr(1c𝚺Cb22Cbc𝚺1/2)Δ212Δ2,\widetilde{\bm{\beta}}_{r^{\prime}}^{\top}\bm{\Sigma}^{*}\widetilde{\bm{\beta}}_{r^{\prime}}\geq A_{r^{\prime}}^{2}-c_{\bm{\Sigma}}C_{b}^{2}\Delta^{2}-2C_{b}c_{\bm{\Sigma}}^{1/2}\Delta A_{r^{\prime}}\geq(1-c_{\bm{\Sigma}}C_{b}^{2}-2C_{b}c_{\bm{\Sigma}}^{1/2})\Delta^{2}\geq\frac{1}{2}\Delta^{2}, (S.5.704)

we have

()1CRexp{164Δ2}.\mathbb{P}(\mathcal{E})\geq 1-CR\exp\Big{\{}-\frac{1}{64}\Delta^{2}\Big{\}}. (S.5.705)

Then since minr=1:RArΔ5c𝚺1/2CbΔ\min_{r^{\prime}=1:R}A_{r^{\prime}}\geq\Delta\geq 5c_{\bm{\Sigma}}^{1/2}C_{b}\Delta and Cbc𝚺1/240(2c𝚺+8)1/2C_{b}\leq\frac{c_{\bm{\Sigma}}^{-1/2}}{40}\wedge(2c_{\bm{\Sigma}}+8)^{-1/2}, we have

()\displaystyle(*) (S.5.706)
𝔼[exp{14Ar2+52c𝚺1/2CbΔAr+(14c𝚺+1)Cb2Δ2}w~21\displaystyle\leq\mathbb{E}\Bigg{[}\frac{\exp\{-\frac{1}{4}A_{r}^{2}+\frac{5}{2}c_{\bm{\Sigma}}^{1/2}C_{b}\Delta A_{r}+(\frac{1}{4}c_{\bm{\Sigma}}+1)C_{b}^{2}\Delta^{2}\}}{\widetilde{w}^{2}_{1}} (S.5.707)
(w~1+w~r+rrw~rexp{14Ar2+52c𝚺1/2CbΔAr+(14c𝚺+1)Cb2Δ2})|]+(c)\displaystyle\quad\quad\quad\cdot\bigg{(}\widetilde{w}_{1}+\widetilde{w}_{r}+\sum_{r^{\prime}\neq r}\widetilde{w}_{r^{\prime}}\exp\Big{\{}-\frac{1}{4}A_{r}^{2}+\frac{5}{2}c_{\bm{\Sigma}}^{1/2}C_{b}\Delta A_{r}+\Big{(}\frac{1}{4}c_{\bm{\Sigma}}+1\Big{)}C_{b}^{2}\Delta^{2}\Big{\}}\bigg{)}\bigg{|}\mathcal{E}\Bigg{]}+\mathbb{P}(\mathcal{E}^{c}) (S.5.708)
cw2exp{CΔ2}.\displaystyle\lesssim c_{w}^{-2}\exp\{-C\Delta^{2}\}. (S.5.709)

Hence,

|𝔼[γ(r)𝜽(𝒛)wr|𝜽=𝜽~t|y=1]|cw2exp{CΔ2}.\left|\mathbb{E}\left[\frac{\partial\gamma^{(r)}_{\bm{\theta}}(\bm{z})}{\partial w_{r}}\Big{|}_{\bm{\theta}=\widetilde{\bm{\theta}}_{t}}\bigg{|}y=1\right]\right|\lesssim c_{w}^{-2}\exp\{-C\Delta^{2}\}. (S.5.710)

Similarly, it can be shown that

\left|\mathbb{E}\left[\frac{\partial\gamma^{(r)}_{\bm{\theta}}(\bm{z})}{\partial w_{r^{\prime}}}\Big{|}_{\bm{\theta}=\widetilde{\bm{\theta}}_{t}}\bigg{|}y=1\right]\right|\lesssim c_{w}^{-2}\exp\{-C\Delta^{2}\} (S.5.711)

for any r=2:Rr^{\prime}=2:R.

(ii) Bounding |𝔼[γ(r)𝜽(𝒛)δr|𝜽=𝜽~t|y=1]||\mathbb{E}[\frac{\partial\gamma^{(r)}_{\bm{\theta}}(\bm{z})}{\partial\delta_{r^{\prime}}}|_{\bm{\theta}=\widetilde{\bm{\theta}}_{t}}|y=1]|: Note that

γ(r)𝜽(𝒛)δr={wrexp{𝜷r𝒛δr}rrwrexp{𝜷r𝒛δr}(w1+r=2Rwrexp{𝜷r𝒛δr})2,r=r,wrexp{𝜷r𝒛δr}wrexp{𝜷r𝒛δr}(w1+r=2Rwrexp{𝜷r𝒛δr})2,rr.\frac{\partial\gamma^{(r)}_{\bm{\theta}}(\bm{z})}{\partial\delta_{r^{\prime}}}=\begin{cases}\frac{-w_{r}\cdot\exp\{\bm{\beta}_{r}^{\top}\bm{z}-\delta_{r}\}\cdot\sum_{r^{\prime}\neq r}w_{r^{\prime}}\exp\{\bm{\beta}_{r^{\prime}}^{\top}\bm{z}-\delta_{r^{\prime}}\}}{(w_{1}+\sum_{r=2}^{R}w_{r}\exp\{\bm{\beta}_{r}^{\top}\bm{z}-\delta_{r}\})^{2}},\quad&r^{\prime}=r,\\ -\frac{w_{r}\cdot\exp\{\bm{\beta}_{r}^{\top}\bm{z}-\delta_{r}\}\cdot w_{r^{\prime}}\cdot\exp\{\bm{\beta}_{r^{\prime}}^{\top}\bm{z}-\delta_{r^{\prime}}\}}{(w_{1}+\sum_{r=2}^{R}w_{r}\exp\{\bm{\beta}_{r}^{\top}\bm{z}-\delta_{r}\})^{2}},\quad&r^{\prime}\neq r.\end{cases} (S.5.712)

The analysis is almost the same as in (i), which leads to

\left|\mathbb{E}\left[\frac{\partial\gamma^{(r)}_{\bm{\theta}}(\bm{z})}{\partial\delta_{r^{\prime}}}\Big{|}_{\bm{\theta}=\widetilde{\bm{\theta}}_{t}}\bigg{|}y=1\right]\right|\lesssim c_{w}^{-2}\exp\{-C\Delta^{2}\}, (S.5.713)

for any r=2:Rr^{\prime}=2:R. We omit the proof here.

(iii) Bounding |𝔼[γ(r)𝜽(𝒛)𝜷r|𝜽=𝜽~t|y=1](𝜷r𝜷r)||\mathbb{E}[\frac{\partial\gamma^{(r)}_{\bm{\theta}}(\bm{z})}{\partial\bm{\beta}_{r^{\prime}}}|_{\bm{\theta}=\widetilde{\bm{\theta}}_{t}}|y=1]^{\top}(\bm{\beta}_{r^{\prime}}-\bm{\beta}_{r}^{*})|: Note that

γ(r)𝜽(𝒛)𝜷r={wrexp{𝜷r𝒛δr}rrwrexp{𝜷r𝒛δr}𝒛(w1+r=2Rwrexp{𝜷r𝒛δr})2,r=r,wrexp{𝜷r𝒛δr}wrexp{𝜷r𝒛δr}𝒛(w1+r=2Rwrexp{𝜷r𝒛δr})2,rr.\frac{\partial\gamma^{(r)}_{\bm{\theta}}(\bm{z})}{\partial\bm{\beta}_{r^{\prime}}}=\begin{cases}\frac{-w_{r}\cdot\exp\{\bm{\beta}_{r}^{\top}\bm{z}-\delta_{r}\}\cdot\sum_{r^{\prime}\neq r}w_{r^{\prime}}\exp\{\bm{\beta}_{r^{\prime}}^{\top}\bm{z}-\delta_{r^{\prime}}\}\bm{z}}{(w_{1}+\sum_{r=2}^{R}w_{r}\exp\{\bm{\beta}_{r}^{\top}\bm{z}-\delta_{r}\})^{2}},\quad&r^{\prime}=r,\\ -\frac{w_{r}\cdot\exp\{\bm{\beta}_{r}^{\top}\bm{z}-\delta_{r}\}\cdot w_{r^{\prime}}\cdot\exp\{\bm{\beta}_{r^{\prime}}^{\top}\bm{z}-\delta_{r^{\prime}}\}\bm{z}}{(w_{1}+\sum_{r=2}^{R}w_{r}\exp\{\bm{\beta}_{r}^{\top}\bm{z}-\delta_{r}\})^{2}},\quad&r^{\prime}\neq r.\end{cases} (S.5.714)

\bullet When r=rr^{\prime}=r:

𝔼[(γ(r)𝜽(𝒛)𝜷r|𝜽=𝜽~t)(𝜷r𝜷r)|y=1]\displaystyle\mathbb{E}\left[\left(\frac{\partial\gamma^{(r)}_{\bm{\theta}}(\bm{z})}{\partial\bm{\beta}_{r^{\prime}}}\bigg{|}_{\bm{\theta}=\widetilde{\bm{\theta}}_{t}}\right)^{\top}(\bm{\beta}_{r}-\bm{\beta}_{r}^{*})\bigg{|}y=1\right] (S.5.715)
=𝔼[w~rexp{𝜷~r𝒛(1)δ~r}rrw~rexp{𝜷~r𝒛(1)δ~r}(𝒛(1))(𝜷r𝜷r)(w~1+r=2Rw~rexp{𝜷~r𝒛(1)δ~r})2]\displaystyle=\mathbb{E}\left[\frac{\widetilde{w}_{r}\exp\{\widetilde{\bm{\beta}}_{r}^{\top}\bm{z}^{(1)}-\widetilde{\delta}_{r}\}\sum_{r^{\prime}\neq r}\widetilde{w}_{r^{\prime}}\cdot\exp\{\widetilde{\bm{\beta}}_{r^{\prime}}^{\top}\bm{z}^{(1)}-\widetilde{\delta}_{r^{\prime}}\}\cdot(\bm{z}^{(1)})^{\top}(\bm{\beta}_{r}-\bm{\beta}_{r}^{*})}{(\widetilde{w}_{1}+\sum_{r=2}^{R}\widetilde{w}_{r}\exp\{\widetilde{\bm{\beta}}_{r}^{\top}\bm{z}^{(1)}-\widetilde{\delta}_{r}\})^{2}}\right] (S.5.716)
𝔼[w~rexp{𝜷~r𝒛(1)δ~r}rrw~rexp{𝜷~r𝒛(1)δ~r}(w~1+r=2Rw~rexp{𝜷~r𝒛(1)δ~r})2]2(1)𝔼[(𝒛(1))(𝜷r𝜷r)]2(2).\displaystyle\leq\underbrace{\sqrt{\mathbb{E}\left[\frac{\widetilde{w}_{r}\exp\{\widetilde{\bm{\beta}}_{r}^{\top}\bm{z}^{(1)}-\widetilde{\delta}_{r}\}\sum_{r^{\prime}\neq r}\widetilde{w}_{r^{\prime}}\cdot\exp\{\widetilde{\bm{\beta}}_{r^{\prime}}^{\top}\bm{z}^{(1)}-\widetilde{\delta}_{r^{\prime}}\}}{(\widetilde{w}_{1}+\sum_{r=2}^{R}\widetilde{w}_{r}\exp\{\widetilde{\bm{\beta}}_{r}^{\top}\bm{z}^{(1)}-\widetilde{\delta}_{r}\})^{2}}\right]^{2}}}_{(1)}\cdot\underbrace{\sqrt{\mathbb{E}[(\bm{z}^{(1)})^{\top}(\bm{\beta}_{r}-\bm{\beta}_{r}^{*})]^{2}}}_{(2)}. (S.5.717)

Similar to the previous argument in (i), let z~r=𝜷~r(𝒛(r~)𝝁1)𝒩(0,𝜷~r𝚺𝜷~r)\widetilde{z}_{r^{\prime}}=\widetilde{\bm{\beta}}^{\top}_{r^{\prime}}(\bm{z}^{(\widetilde{r})}-\bm{\mu}_{1}^{*})\sim\mathcal{N}(0,\widetilde{\bm{\beta}}^{\top}_{r^{\prime}}\bm{\Sigma}^{*}\widetilde{\bm{\beta}}_{r^{\prime}}) and event =r=1R{|z~r|14c𝚺Cb2Δ2+12Cbc𝚺1/2ΔAr+14Ar2}\mathcal{E}=\bigcap_{r^{\prime}=1}^{R}\big{\{}|\widetilde{z}_{r^{\prime}}|\leq\frac{1}{4}c_{\bm{\Sigma}}C_{b}^{2}\Delta^{2}+\frac{1}{2}C_{b}c_{\bm{\Sigma}}^{1/2}\Delta A_{r^{\prime}}+\frac{1}{4}A_{r^{\prime}}^{2}\big{\}}, then

()1CRexp{164Δ2}.\mathbb{P}(\mathcal{E})\geq 1-CR\exp\Big{\{}-\frac{1}{64}\Delta^{2}\Big{\}}. (S.5.718)

Similar to (i), we have

(1)𝔼[(w~rexp{𝜷~r𝒛(1)δ~r}rrw~rexp{𝜷~r𝒛(1)δ~r}(w~1+r=2Rw~rexp{𝜷~r𝒛(1)δ~r})2)2|]+(c)exp{CΔ2}.(1)\lesssim\sqrt{\mathbb{E}\left[\left(\frac{\widetilde{w}_{r}\exp\{\widetilde{\bm{\beta}}_{r}^{\top}\bm{z}^{(1)}-\widetilde{\delta}_{r}\}\sum_{r^{\prime}\neq r}\widetilde{w}_{r^{\prime}}\cdot\exp\{\widetilde{\bm{\beta}}_{r^{\prime}}^{\top}\bm{z}^{(1)}-\widetilde{\delta}_{r^{\prime}}\}}{(\widetilde{w}_{1}+\sum_{r=2}^{R}\widetilde{w}_{r}\exp\{\widetilde{\bm{\beta}}_{r}^{\top}\bm{z}^{(1)}-\widetilde{\delta}_{r}\})^{2}}\right)^{2}\bigg{|}\mathcal{E}\right]+\mathbb{P}(\mathcal{E}^{c})}\lesssim\exp\{-C\Delta^{2}\}. (S.5.719)

Moreover, (𝒛(1))(𝜷r𝜷r)=(𝒛(1)𝝁1)(𝜷r𝜷r)+(𝝁1)(𝜷r𝜷r)(\bm{z}^{(1)})^{\top}(\bm{\beta}_{r}-\bm{\beta}_{r}^{*})=(\bm{z}^{(1)}-\bm{\mu}_{1}^{*})^{\top}(\bm{\beta}_{r}-\bm{\beta}_{r}^{*})+(\bm{\mu}_{1}^{*})^{\top}(\bm{\beta}_{r}-\bm{\beta}_{r}^{*}), where (𝒛(1)𝝁1)(𝜷r𝜷r)𝒩(0,(𝜷r𝜷r)𝚺(𝜷r𝜷r))(\bm{z}^{(1)}-\bm{\mu}_{1}^{*})^{\top}(\bm{\beta}_{r}-\bm{\beta}_{r}^{*})\sim\mathcal{N}(0,(\bm{\beta}_{r}-\bm{\beta}_{r}^{*})^{\top}\bm{\Sigma}^{*}(\bm{\beta}_{r}-\bm{\beta}_{r}^{*})) and |(𝝁1)(𝜷r𝜷r)|M𝜷r𝜷r2|(\bm{\mu}_{1}^{*})^{\top}(\bm{\beta}_{r}-\bm{\beta}_{r}^{*})|\leq M\|\bm{\beta}_{r}-\bm{\beta}_{r}^{*}\|_{2}, hence

(2)(𝜷r𝜷r)𝚺(𝜷r𝜷r)+M𝜷r𝜷r2(c𝚺1/2+M)𝜷r𝜷r2(c𝚺1/2+M)CbΔ.(2)\lesssim\sqrt{(\bm{\beta}_{r}-\bm{\beta}_{r}^{*})^{\top}\bm{\Sigma}^{*}(\bm{\beta}_{r}-\bm{\beta}_{r}^{*})}+M\|\bm{\beta}_{r}-\bm{\beta}_{r}^{*}\|_{2}\lesssim(c_{\bm{\Sigma}}^{1/2}+M)\|\bm{\beta}_{r}-\bm{\beta}_{r}^{*}\|_{2}\lesssim(c_{\bm{\Sigma}}^{1/2}+M)C_{b}\Delta. (S.5.720)

Therefore, since Δ2Mc𝚺1/2\Delta\leq 2Mc_{\bm{\Sigma}}^{1/2} and Δlog1/2(c𝚺Mcw1)\Delta\gtrsim\log^{1/2}(c_{\bm{\Sigma}}Mc_{w}^{-1}),

𝔼[(γ(r)𝜽(𝒛)𝜷r)(𝜷r𝜷r)|y=1]cw2exp{CΔ2}(c𝚺1/2+M)Mc𝚺1/2exp{CΔ2}.\mathbb{E}\left[\left(\frac{\partial\gamma^{(r)}_{\bm{\theta}}(\bm{z})}{\partial\bm{\beta}_{r^{\prime}}}\right)^{\top}(\bm{\beta}_{r}-\bm{\beta}_{r}^{*})\bigg{|}y=1\right]\lesssim c_{w}^{-2}\exp\{-C^{\prime}\Delta^{2}\}\cdot(c_{\bm{\Sigma}}^{1/2}+M)Mc_{\bm{\Sigma}}^{1/2}\lesssim\exp\{-C^{\prime}\Delta^{2}\}. (S.5.721)

$\bullet$ When $r^{\prime}\neq r$: by an analogous argument, we can obtain

𝔼[(γ(r)𝜽(𝒛)𝜷r)(𝜷r𝜷r)|y=1]exp{CΔ2}.\mathbb{E}\left[\left(\frac{\partial\gamma^{(r)}_{\bm{\theta}}(\bm{z})}{\partial\bm{\beta}_{r^{\prime}}}\right)^{\top}(\bm{\beta}_{r}-\bm{\beta}_{r}^{*})\bigg{|}y=1\right]\lesssim\exp\{-C^{\prime}\Delta^{2}\}. (S.5.722)


Combining (i)-(iii), we have

|wr(𝜽)wr|exp{CΔ2}r=2R(|wrwr|+|δrδr|+𝜷r𝜷r2).|w_{r}(\bm{\theta})-w_{r}^{*}|\lesssim\exp\{-C^{\prime\prime}\Delta^{2}\}\cdot\sum_{r=2}^{R}(|w_{r}-w_{r}^{*}|+|\delta_{r}-\delta_{r}^{*}|+\|\bm{\beta}_{r}-\bm{\beta}_{r}^{*}\|_{2}). (S.5.723)

Part 2: Contraction of 𝝁r(𝜽)𝝁r2\|\bm{\mu}_{r}(\bm{\theta})-\bm{\mu}_{r}^{*}\|_{2}:

By definition,

𝝁r(𝜽)𝝁r2𝔼[γ(r)𝜽(𝒛)𝒛]2wr(𝜽)wr|wrwr(𝜽)|+𝔼[(γ(r)𝜽(𝒛)γ(r)𝜽(𝒛))𝒛]2wr,\|\bm{\mu}_{r}(\bm{\theta})-\bm{\mu}_{r}^{*}\|_{2}\leq\frac{\|\mathbb{E}[\gamma^{(r)}_{\bm{\theta}}(\bm{z})\bm{z}]\|_{2}}{w_{r}(\bm{\theta})w_{r}^{*}}\cdot|w_{r}^{*}-w_{r}(\bm{\theta})|+\frac{\|\mathbb{E}[(\gamma^{(r)}_{\bm{\theta}}(\bm{z})-\gamma^{(r)}_{\bm{\theta}^{*}}(\bm{z}))\bm{z}]\|_{2}}{w_{r}^{*}}, (S.5.724)

implying that

\displaystyle\|\mathbb{E}[(\gamma^{(r)}_{\bm{\theta}}(\bm{z})-\gamma^{(r)}_{\bm{\theta}^{*}}(\bm{z}))\bm{z}]\|_{2} \leq\sum_{r^{\prime}=2}^{R}\left\|\mathbb{E}\bigg{[}\frac{\partial\gamma^{(r)}_{\bm{\theta}}(\bm{z})}{\partial w_{r^{\prime}}}\bigg{|}_{\bm{\theta}=\widetilde{\bm{\theta}}_{t}}\bm{z}\bigg{]}\right\|_{2}\cdot|w_{r^{\prime}}-w_{r^{\prime}}^{*}| (S.5.725)
\displaystyle\quad+\sum_{r^{\prime}=2}^{R}\left\|\mathbb{E}\bigg{[}\frac{\partial\gamma^{(r)}_{\bm{\theta}}(\bm{z})}{\partial\delta_{r^{\prime}}}\bigg{|}_{\bm{\theta}=\widetilde{\bm{\theta}}_{t}}\bm{z}\bigg{]}\right\|_{2}\cdot|\delta_{r^{\prime}}-\delta_{r^{\prime}}^{*}| (S.5.726)
\displaystyle\quad+\sum_{r^{\prime}=2}^{R}\left\|\mathbb{E}\bigg{[}\frac{\partial\gamma^{(r)}_{\bm{\theta}}(\bm{z})}{\partial\bm{\beta}_{r^{\prime}}}\bigg{|}_{\bm{\theta}=\widetilde{\bm{\theta}}_{t}}\bm{z}\bigg{]}\right\|_{2}\cdot\|\bm{\beta}_{r^{\prime}}-\bm{\beta}_{r^{\prime}}^{*}\|_{2}, (S.5.727)

where 𝜽~t=({w~r}r=2R,{𝜷~r}r=2R,{δ~r}r=2R)\widetilde{\bm{\theta}}_{t}=(\{\widetilde{w}_{r}\}_{r=2}^{R},\{\widetilde{\bm{\beta}}_{r}\}_{r=2}^{R},\{\widetilde{\delta}_{r}\}_{r=2}^{R}) with w~r=twr+(1t)wr\widetilde{w}_{r}=tw_{r}+(1-t)w_{r}^{*}, 𝜷~r=t𝜷r+(1t)𝜷r\widetilde{\bm{\beta}}_{r}=t\bm{\beta}_{r}+(1-t)\bm{\beta}_{r}^{*}, and δ~r=tδr+(1t)δr\widetilde{\delta}_{r}=t\delta_{r}+(1-t)\delta_{r}^{*}. We will bound the three terms on the RHS separately. Note that when 𝜽Bcon(𝜽)\bm{\theta}\in B_{\textup{con}}(\bm{\theta}^{*}), we have w~r(cw/2,1cw)\widetilde{w}_{r}\in(c_{w}/2,1-c_{w}), 𝜷~r𝜷r2CbΔ\|\widetilde{\bm{\beta}}_{r}-\bm{\beta}^{*}_{r}\|_{2}\leq C_{b}\Delta, and |δ~rδr|CbΔ|\widetilde{\delta}_{r}-\delta_{r}|\leq C_{b}\Delta.

For any 𝒖p\bm{u}\in\mathbb{R}^{p} with 𝒖21\|\bm{u}\|_{2}\leq 1 and any r~1:R\widetilde{r}\in 1:R, similar to our previous arguments, we have

|𝔼[γ(r)𝜽(𝒛)wr|𝜽=𝜽~t𝒛𝒖|y=r~]|𝔼[γ(r)𝜽(𝒛)wr]2𝔼[(𝒛(r~))𝒖]2exp{CΔ2},\left|\mathbb{E}\bigg{[}\frac{\partial\gamma^{(r)}_{\bm{\theta}}(\bm{z})}{\partial w_{r^{\prime}}}\bigg{|}_{\bm{\theta}=\widetilde{\bm{\theta}}_{t}}\bm{z}^{\top}\bm{u}\bigg{|}y=\widetilde{r}\bigg{]}\right|\leq\sqrt{\mathbb{E}\left[\frac{\partial\gamma^{(r)}_{\bm{\theta}}(\bm{z})}{\partial w_{r^{\prime}}}\right]^{2}}\cdot\sqrt{\mathbb{E}[(\bm{z}^{(\widetilde{r})})^{\top}\bm{u}]^{2}}\lesssim\exp\{-C^{\prime\prime}\Delta^{2}\}, (S.5.728)

which leads to

𝔼[γ(r)𝜽(𝒛)wr𝒛]2exp{CΔ2},\left\|\mathbb{E}\bigg{[}\frac{\partial\gamma^{(r)}_{\bm{\theta}}(\bm{z})}{\partial w_{r^{\prime}}}\bm{z}\bigg{]}\right\|_{2}\lesssim\exp\{-C^{\prime\prime}\Delta^{2}\}, (S.5.729)

for any r2:Rr^{\prime}\in 2:R. Similarly, we have

\left\|\mathbb{E}\bigg{[}\frac{\partial\gamma^{(r)}_{\bm{\theta}}(\bm{z})}{\partial\delta_{r^{\prime}}}\bm{z}\bigg{]}\right\|_{2},\left\|\mathbb{E}\bigg{[}\frac{\partial\gamma^{(r)}_{\bm{\theta}}(\bm{z})}{\partial\bm{\beta}_{r^{\prime}}}\bm{z}\bigg{]}\right\|_{2}\lesssim\exp\{-C^{\prime\prime}\Delta^{2}\} (S.5.730)

for any $r^{\prime}\in 2:R$. Therefore, $\|\mathbb{E}[(\gamma^{(r)}_{\bm{\theta}}(\bm{z})-\gamma^{(r)}_{\bm{\theta}^{*}}(\bm{z}))\bm{z}]\|_{2}\lesssim\exp\{-C^{\prime\prime}\Delta^{2}\}\cdot d(\bm{\theta},\bm{\theta}^{*})$. By part 1, we have $\frac{\|\mathbb{E}[\gamma^{(r)}_{\bm{\theta}}(\bm{z})\bm{z}]\|_{2}}{w_{r}(\bm{\theta})w_{r}^{*}}\cdot|w_{r}^{*}-w_{r}(\bm{\theta})|\lesssim\exp\{-C^{\prime\prime}\Delta^{2}\}\cdot d(\bm{\theta},\bm{\theta}^{*})$. Hence by (S.5.724), we have $\|\bm{\mu}_{r}(\bm{\theta})-\bm{\mu}_{r}^{*}\|_{2}\lesssim\exp\{-C^{\prime\prime}\Delta^{2}\}\cdot d(\bm{\theta},\bm{\theta}^{*})$.

Combining part 1 and part 2, we complete the proof.

S.5.14 Proof of Theorem 8

Note that the excess risk

R𝜽¯(k)(𝒞𝜽^(k))R𝜽¯(k)(𝒞𝜽(k))\displaystyle R_{\overline{\bm{\theta}}^{(k)*}}(\mathcal{C}_{\widehat{\bm{\theta}}^{(k)}})-R_{\overline{\bm{\theta}}^{(k)*}}(\mathcal{C}_{\bm{\theta}^{(k)*}}) (S.5.731)
=(y(k)𝒞𝜽^(k)(𝒛(k)))(y(k)𝒞𝜽(k)(𝒛(k)))\displaystyle=\mathbb{P}(y^{(k)}\neq\mathcal{C}_{\widehat{\bm{\theta}}^{(k)}}(\bm{z}^{(k)}))-\mathbb{P}(y^{(k)}\neq\mathcal{C}_{\bm{\theta}^{(k)*}}(\bm{z}^{(k)})) (S.5.732)
=𝒞𝜽^(k)𝒞𝜽(k)[(y(k)=𝒞𝜽(k)(𝒛)|𝒛(k)=𝒛)(y(k)=𝒞𝜽^(k)(𝒛)|𝒛(k)=𝒛)]d𝜽(k)(𝒛)\displaystyle=\int_{\mathcal{C}_{\widehat{\bm{\theta}}^{(k)}}\neq\mathcal{C}_{\bm{\theta}^{(k)*}}}\left[\mathbb{P}(y^{(k)}=\mathcal{C}_{\bm{\theta}^{(k)*}}(\bm{z})|\bm{z}^{(k)}=\bm{z})-\mathbb{P}(y^{(k)}=\mathcal{C}_{\widehat{\bm{\theta}}^{(k)}}(\bm{z})|\bm{z}^{(k)}=\bm{z})\right]d\mathbb{P}_{\bm{\theta}^{(k)*}}(\bm{z}) (S.5.733)
=𝒞𝜽^(k)𝒞𝜽(k)[maxr=1:R(y(k)=r|𝒛(k)=𝒛)(y(k)=𝒞𝜽^(k)(𝒛)|𝒛(k)=𝒛)]d𝜽(k)(𝒛).\displaystyle=\int_{\mathcal{C}_{\widehat{\bm{\theta}}^{(k)}}\neq\mathcal{C}_{\bm{\theta}^{(k)*}}}\left[\max_{r=1:R}\mathbb{P}(y^{(k)}=r|\bm{z}^{(k)}=\bm{z})-\mathbb{P}(y^{(k)}=\mathcal{C}_{\widehat{\bm{\theta}}^{(k)}}(\bm{z})|\bm{z}^{(k)}=\bm{z})\right]d\mathbb{P}_{\bm{\theta}^{(k)*}}(\bm{z}). (S.5.734)

Let the event $\mathcal{E}=\big{\{}\bm{z}:\max_{r}\mathbb{P}(y^{(k)}=r|\bm{z}^{(k)}=\bm{z})-\max_{j}\{\mathbb{P}(y^{(k)}=j|\bm{z}^{(k)}=\bm{z}):\mathbb{P}(y^{(k)}=j|\bm{z}^{(k)}=\bm{z})<\max_{r}\mathbb{P}(y^{(k)}=r|\bm{z}^{(k)}=\bm{z})\}\leq t\big{\}}$. We claim that the margin condition $\mathbb{P}(\mathcal{E})\lesssim t$ holds for all $t$ smaller than a small constant $c$ (verified at the end of the proof). Assuming this for now, denote $\widetilde{\mathcal{E}}=\big{\{}\max_{r}|\eta^{(r)}_{\widehat{\bm{\theta}}^{(k)}}(\bm{z}^{(k)})-\eta^{(r)}_{\bm{\theta}^{(k)*}}(\bm{z}^{(k)})|\leq t/2\big{\}}$ and define

r\displaystyle r^{*} =argmaxrη(r)𝜽(k)(𝒛),\displaystyle=\operatorname*{arg\,max}_{r}\eta^{(r)}_{\bm{\theta}^{(k)*}}(\bm{z}), (S.5.735)
r^\displaystyle\widehat{r} =argmaxrη(r)𝜽^(k)(𝒛).\displaystyle=\operatorname*{arg\,max}_{r}\eta^{(r)}_{\widehat{\bm{\theta}}^{(k)}}(\bm{z}). (S.5.736)

We have

RHS of (S.5.734)\displaystyle\textup{RHS of }\eqref{eq: proof mtl classification multi-cluster eq 1} (S.5.737)
rr^,~[η(r)𝜽(k)(𝒛)η(r^)𝜽(k)(𝒛)]d𝜽(k)(𝒛)+rr^c,~[η(r)𝜽(k)(𝒛)η(r^)𝜽^(k)(𝒛)]d𝜽(k)(𝒛)+(~c)\displaystyle\leq\int_{\begin{subarray}{c}r^{*}\neq\widehat{r}\\ \mathcal{E},\widetilde{\mathcal{E}}\end{subarray}}\big{[}\eta^{(r^{*})}_{\bm{\theta}^{(k)*}}(\bm{z})-\eta^{(\widehat{r})}_{\bm{\theta}^{(k)*}}(\bm{z})\big{]}d\mathbb{P}_{\bm{\theta}^{(k)*}}(\bm{z})+\int_{\begin{subarray}{c}r^{*}\neq\widehat{r}\\ \mathcal{E}^{c},\widetilde{\mathcal{E}}\end{subarray}}\big{[}\eta^{(r^{*})}_{\bm{\theta}^{(k)*}}(\bm{z})-\eta^{(\widehat{r})}_{\widehat{\bm{\theta}}^{(k)}}(\bm{z})\big{]}d\mathbb{P}_{\bm{\theta}^{(k)*}}(\bm{z})+\mathbb{P}(\widetilde{\mathcal{E}}^{c}) (S.5.738)
t()+(~c),\displaystyle\leq t\mathbb{P}(\mathcal{E})+\mathbb{P}(\widetilde{\mathcal{E}}^{c}), (S.5.739)

where the last inequality comes from the fact that on $\mathcal{E}$, when $r^{*}\neq\widehat{r}$, we have $\eta^{(r^{*})}_{\bm{\theta}^{(k)*}}(\bm{z})-\eta^{(\widehat{r})}_{\bm{\theta}^{(k)*}}(\bm{z})\leq t$. Notice also that on $\mathcal{E}^{c}\cap\widetilde{\mathcal{E}}$ we must have $\widehat{r}=r^{*}$: if $\widehat{r}\neq r^{*}$, then

\eta^{(\widehat{r})}_{\widehat{\bm{\theta}}^{(k)}}(\bm{z})-\eta^{(r^{*})}_{\widehat{\bm{\theta}}^{(k)}}(\bm{z})\leq\eta^{(\widehat{r})}_{\bm{\theta}^{(k)*}}(\bm{z})-\eta^{(r^{*})}_{\bm{\theta}^{(k)*}}(\bm{z})+\frac{t}{2}+\frac{t}{2}<-t+t=0, (S.5.740)

which contradicts the definition of $\widehat{r}$. Hence $\{r^{*}\neq\widehat{r}\}\cap\mathcal{E}^{c}\cap\widetilde{\mathcal{E}}$ is empty, and therefore $\int_{\begin{subarray}{c}r^{*}\neq\widehat{r}\\ \mathcal{E}^{c},\widetilde{\mathcal{E}}\end{subarray}}\big{[}\eta^{(r^{*})}_{\bm{\theta}^{(k)*}}(\bm{z})-\eta^{(\widehat{r})}_{\widehat{\bm{\theta}}^{(k)}}(\bm{z})\big{]}d\mathbb{P}_{\bm{\theta}^{(k)*}}(\bm{z})=0$. Finally, by Lipschitzness,

(~c)\displaystyle\mathbb{P}(\widetilde{\mathcal{E}}^{c}) =(maxr|η(r)𝜽^(k)(𝒛(k))η(r)𝜽(k)(𝒛(k))|>t/2)\displaystyle=\mathbb{P}\left(\max_{r}|\eta^{(r)}_{\widehat{\bm{\theta}}^{(k)}}(\bm{z}^{(k)})-\eta^{(r)}_{\bm{\theta}^{(k)*}}(\bm{z}^{(k)})|>t/2\right) (S.5.741)
r=1R(|η(r)𝜽^(k)(𝒛(k))η(r)𝜽(k)(𝒛(k))|>t/2)\displaystyle\leq\sum_{r=1}^{R}\mathbb{P}\left(|\eta^{(r)}_{\widehat{\bm{\theta}}^{(k)}}(\bm{z}^{(k)})-\eta^{(r)}_{\bm{\theta}^{(k)*}}(\bm{z}^{(k)})|>t/2\right) (S.5.742)
(|(𝜷^(k)𝜷(k))𝒛(k)δ^(k)+δ(k)logw^(k)r+logw^(1)r+logw(k)rlogw(k)1|>Ct)\displaystyle\lesssim\mathbb{P}(|(\widehat{\bm{\beta}}^{(k)}-\bm{\beta}^{(k)*})^{\top}\bm{z}^{(k)}-\widehat{\delta}^{(k)}+\delta^{(k)*}-\log\widehat{w}^{(k)}_{r}+\log\widehat{w}^{(1)}_{r}+\log w^{(k)*}_{r}-\log w^{(k)*}_{1}|>Ct) (S.5.743)
(|(𝜷^(k)𝜷(k))(𝒛(k)𝝁(k))|>Ct)\displaystyle\lesssim\mathbb{P}(|(\widehat{\bm{\beta}}^{(k)}-\bm{\beta}^{(k)*})^{\top}(\bm{z}^{(k)}-\bm{\mu}^{(k)})|>C^{\prime}t) (S.5.744)
exp{Ct2𝜷^(k)𝜷(k)22}.\displaystyle\lesssim\exp\left\{-\frac{Ct^{2}}{\|\widehat{\bm{\beta}}^{(k)}-\bm{\beta}^{(k)*}\|_{2}^{2}}\right\}. (S.5.745)

Plugging back into (S.5.739), we have

R𝜽¯(k)(𝒞𝜽^(k))R𝜽¯(k)(𝒞𝜽(k))t2+exp{Ct2𝜷^(k)𝜷(k)22}.R_{\overline{\bm{\theta}}^{(k)*}}(\mathcal{C}_{\widehat{\bm{\theta}}^{(k)}})-R_{\overline{\bm{\theta}}^{(k)*}}(\mathcal{C}_{\bm{\theta}^{(k)*}})\lesssim t^{2}+\exp\left\{-\frac{Ct^{2}}{\|\widehat{\bm{\beta}}^{(k)}-\bm{\beta}^{(k)*}\|_{2}^{2}}\right\}. (S.5.746)

Let td(𝜽^(k),𝜽(k))logd1(𝜽^(k),𝜽(k))t\asymp d(\widehat{\bm{\theta}}^{(k)},\bm{\theta}^{(k)*})\sqrt{\log d^{-1}(\widehat{\bm{\theta}}^{(k)},\bm{\theta}^{(k)*})}:

R𝜽¯(k)(𝒞𝜽^(k))R𝜽¯(k)(𝒞𝜽(k))d2(𝜽^(k),𝜽(k))logd1(𝜽^(k),𝜽(k))d2(𝜽^(k),𝜽(k))log(nSp+lognS).R_{\overline{\bm{\theta}}^{(k)*}}(\mathcal{C}_{\widehat{\bm{\theta}}^{(k)}})-R_{\overline{\bm{\theta}}^{(k)*}}(\mathcal{C}_{\bm{\theta}^{(k)*}})\lesssim d^{2}(\widehat{\bm{\theta}}^{(k)},\bm{\theta}^{(k)*})\log d^{-1}(\widehat{\bm{\theta}}^{(k)},\bm{\theta}^{(k)*})\lesssim d^{2}(\widehat{\bm{\theta}}^{(k)},\bm{\theta}^{(k)*})\log\left(\frac{n_{S}}{p+\log n_{S}}\right). (S.5.747)
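
To see why this choice of $t$ balances the two terms in (S.5.746), here is a short check (a sketch only; we write $d=d(\widehat{\bm{\theta}}^{(k)},\bm{\theta}^{(k)*})$, assume $d$ is small, and assume, consistently with how the distance $d$ is defined, that $\|\widehat{\bm{\beta}}^{(k)}-\bm{\beta}^{(k)*}\|_{2}\lesssim d$). Taking $t=C_{0}d\sqrt{\log d^{-1}}$ with a sufficiently large constant $C_{0}$,

t^{2}\asymp d^{2}\log d^{-1},\qquad\exp\left\{-\frac{Ct^{2}}{\|\widehat{\bm{\beta}}^{(k)}-\bm{\beta}^{(k)*}\|_{2}^{2}}\right\}\leq\exp\{-C^{\prime}\log d^{-1}\}=d^{\,C^{\prime}}\lesssim d^{2}\log d^{-1},

so both terms in (S.5.746) are of order at most $d^{2}\log d^{-1}$, matching the first bound in (S.5.747).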

Then plugging in the upper bound of d(𝜽^(k),𝜽(k))d(\widehat{\bm{\theta}}^{(k)},\bm{\theta}^{(k)*}) in Theorem 7 completes the proof.

It remains to verify the margin condition $\mathbb{P}(\mathcal{E})\lesssim t$ for all $t$ smaller than a small constant $c$. In fact,

()\displaystyle\mathbb{P}(\mathcal{E}) =r=1Rjr(argmaxr(y=r|𝒛(k))=r,argmaxrr(y=r|𝒛(k))=j,\displaystyle=\sum_{r=1}^{R}\sum_{j\neq r}\mathbb{P}\bigg{(}\operatorname*{arg\,max}_{r^{\prime}}\mathbb{P}(y=r^{\prime}|\bm{z}^{(k)})=r,\operatorname*{arg\,max}_{r^{\prime}\neq r}\mathbb{P}(y=r^{\prime}|\bm{z}^{(k)})=j, (S.5.748)
(y=r|𝒛(k))(y=j|𝒛(k))t)\displaystyle\hskip 79.6678pt\mathbb{P}(y=r|\bm{z}^{(k)})-\mathbb{P}(y=j|\bm{z}^{(k)})\leq t\bigg{)} (S.5.749)
r=1Rjr(argmaxr(y=r|𝒛(k))=r,argmaxrr(y=r|𝒛(k))=j,\displaystyle\leq\sum_{r=1}^{R}\sum_{j\neq r}\mathbb{P}\bigg{(}\operatorname*{arg\,max}_{r^{\prime}}\mathbb{P}(y=r^{\prime}|\bm{z}^{(k)})=r,\operatorname*{arg\,max}_{r^{\prime}\neq r}\mathbb{P}(y=r^{\prime}|\bm{z}^{(k)})=j, (S.5.750)
1(y=j|𝒛(k))(y=r|𝒛(k))t(y=r|𝒛(k))Rt)\displaystyle\hskip 79.6678pt1-\frac{\mathbb{P}(y=j|\bm{z}^{(k)})}{\mathbb{P}(y=r|\bm{z}^{(k)})}\leq\frac{t}{\mathbb{P}(y=r|\bm{z}^{(k)})}\leq Rt\bigg{)} (S.5.751)
r=1Rjr(1Rt(y=j|𝒛(k))(y=r|𝒛(k))\displaystyle\leq\sum_{r=1}^{R}\sum_{j\neq r}\mathbb{P}\bigg{(}1-Rt\leq\frac{\mathbb{P}(y=j|\bm{z}^{(k)})}{\mathbb{P}(y=r|\bm{z}^{(k)})} (S.5.752)
=exp{(𝜷(k)j𝜷(k)r)𝒛(k)δj(k)+δr(k)+logw(k)jlogw(k)r})\displaystyle\hskip 79.6678pt=\exp\{(\bm{\beta}^{(k)*}_{j}-\bm{\beta}^{(k)*}_{r})^{\top}\bm{z}^{(k)}-\delta_{j}^{(k)*}+\delta_{r}^{(k)*}+\log w^{(k)*}_{j}-\log w^{(k)*}_{r}\}\bigg{)} (S.5.753)
r=1Rjrr=1R(log(1Rt)𝒩((𝜷(k)j𝜷(k)r)𝝁(k)rδj(k)+δr(k)+logw(k)jlogw(k)r,\displaystyle\lesssim\sum_{r=1}^{R}\sum_{j\neq r}\sum_{r^{\prime}=1}^{R}\mathbb{P}\bigg{(}\log(1-Rt)\leq\mathcal{N}\big{(}(\bm{\beta}^{(k)*}_{j}-\bm{\beta}^{(k)*}_{r})^{\top}\bm{\mu}^{(k)*}_{r}-\delta_{j}^{(k)*}+\delta_{r}^{(k)*}+\log w^{(k)*}_{j}-\log w^{(k)*}_{r}, (S.5.754)
(𝜷(k)j𝜷(k)r)𝚺(k)(𝜷(k)j𝜷(k)r))0)\displaystyle\hskip 91.04872pt(\bm{\beta}^{(k)*}_{j}-\bm{\beta}^{(k)*}_{r})^{\top}\bm{\Sigma}^{(k)*}(\bm{\beta}^{(k)*}_{j}-\bm{\beta}^{(k)*}_{r})\big{)}\leq 0\bigg{)} (S.5.755)
log(1Rt)\displaystyle\lesssim-\log(1-Rt) (S.5.756)
t,\displaystyle\lesssim t, (S.5.757)

when $t>0$ is less than some constant $c>0$. Note that we used the fact that $(\bm{\beta}^{(k)*}_{j}-\bm{\beta}^{(k)*}_{r})^{\top}\bm{\Sigma}^{(k)*}(\bm{\beta}^{(k)*}_{j}-\bm{\beta}^{(k)*}_{r})\geq\Delta^{2}\geq$ some constant $C$, which implies that the Gaussian density is upper bounded by a constant. Hence the margin condition holds.

We point out that this multi-class extension of the margin condition for the binary case has been widely used in the multi-class classification literature; see, for example, \citeappchen2006consistency and \citeappvigogna2022multiclass.

S.5.15 Proof of Theorem 9

The proof is almost the same as the proof of Theorem 3, after noticing that we can take the GMM parameters of the mixture components $r\geq 3$ to be the same across tasks, which reduces the problem to the case $R=2$; we therefore do not repeat it here.

S.5.16 Proof of Theorem 10

S.5.16.1 Lemmas

Lemma 39.

Consider $\overline{\bm{\theta}}=\{\{w_{r}\}_{r=2}^{R},\{\bm{\beta}_{r}\}_{r=2}^{R},\{\delta_{r}\}_{r=2}^{R},\{\bm{\mu}_{r}\}_{r=1}^{R},\bm{\Sigma}\}$ and $\overline{\bm{\theta}}^{\prime}=\{\{w_{r}^{\prime}\}_{r=2}^{R},\{\bm{\beta}_{r}^{\prime}\}_{r=2}^{R},\{\delta_{r}^{\prime}\}_{r=2}^{R},\{\bm{\mu}_{r}^{\prime}\}_{r=1}^{R},\bm{\Sigma}^{\prime}\}$ with $w_{r}=w_{r}^{\prime}$ for $r\geq 3$, $\bm{\mu}_{r}=\bm{\mu}_{r}^{\prime}=\bm{e}_{r}$ for $r\geq 2$, $\bm{\mu}_{1}=\bm{\mu}_{1}^{\prime}=\bm{0}$, and $\bm{\Sigma}=\bm{\Sigma}^{\prime}=\bm{I}_{p}$. Then $\bm{\beta}_{r}=\bm{\beta}_{r}^{\prime}=\bm{e}_{r}$, $\delta_{r}=\delta_{r}^{\prime}=\bm{\beta}_{r}^{\top}\big(\frac{\bm{\mu}_{1}+\bm{\mu}_{r}}{2}\big)=\frac{1}{2}$ for $r\geq 2$, and

𝜽¯(𝒞𝜽¯𝒞𝜽¯)|w2w2|.\mathbb{P}_{\overline{\bm{\theta}}}(\mathcal{C}_{\overline{\bm{\theta}}}\neq\mathcal{C}_{\overline{\bm{\theta}}^{\prime}})\gtrsim|w_{2}-w_{2}^{\prime}|. (S.5.758)
Lemma 40.

Consider $\overline{\bm{\theta}}=\{\{w_{r}\}_{r=2}^{R},\{\bm{\beta}_{r}\}_{r=2}^{R},\{\delta_{r}\}_{r=2}^{R},\{\bm{\mu}_{r}\}_{r=1}^{R},\bm{\Sigma}\}$ and $\overline{\bm{\theta}}^{\prime}=\{\{w_{r}^{\prime}\}_{r=2}^{R},\{\bm{\beta}_{r}^{\prime}\}_{r=2}^{R},\{\delta_{r}^{\prime}\}_{r=2}^{R},\{\bm{\mu}_{r}^{\prime}\}_{r=1}^{R},\bm{\Sigma}^{\prime}\}$ with $w_{r}=w_{r}^{\prime}=\frac{1}{R}$ for $r=1:R$, $\bm{\mu}_{r}=(u+1)\bm{e}_{r}$, $\bm{\mu}_{r}^{\prime}=\bm{e}_{r}$ for $r\geq 2$, $\bm{\mu}_{1}=u\bm{e}_{r}$, $\bm{\mu}_{1}^{\prime}=\bm{0}$, and $\bm{\Sigma}=\bm{\Sigma}^{\prime}=\bm{I}_{p}$, where $0<u\leq C$. Then $\bm{\beta}_{r}=\bm{\beta}_{r}^{\prime}=\bm{e}_{r}$, $\delta_{r}=\bm{e}_{r}^{\top}\big(\frac{\bm{\mu}_{1}+\bm{\mu}_{r}}{2}\big)=u+\frac{1}{2}$, $\delta_{r}^{\prime}=\bm{e}_{r}^{\top}\big(\frac{\bm{\mu}_{1}^{\prime}+\bm{\mu}_{r}^{\prime}}{2}\big)=\frac{1}{2}$ for $r\geq 2$, and

𝜽¯(𝒞𝜽¯𝒞𝜽¯)u.\mathbb{P}_{\overline{\bm{\theta}}}(\mathcal{C}_{\overline{\bm{\theta}}}\neq\mathcal{C}_{\overline{\bm{\theta}}^{\prime}})\gtrsim u. (S.5.759)
Lemma 41.

Consider $\overline{\bm{\theta}}=\{\{w_{r}\}_{r=2}^{R},\{\bm{\beta}_{r}\}_{r=2}^{R},\{\delta_{r}\}_{r=2}^{R},\{\bm{\mu}_{r}\}_{r=1}^{R},\bm{\Sigma}\}$ and $\overline{\bm{\theta}}^{\prime}=\{\{w_{r}^{\prime}\}_{r=2}^{R},\{\bm{\beta}_{r}^{\prime}\}_{r=2}^{R},\{\delta_{r}^{\prime}\}_{r=2}^{R},\{\bm{\mu}_{r}^{\prime}\}_{r=1}^{R},\bm{\Sigma}^{\prime}\}$ with $w_{r}=w_{r}^{\prime}=\frac{1}{R}$ for $r=1:R$, $\bm{\mu}_{r}=\bm{\mu}_{r}^{\prime}=\bm{e}_{r}$ for $r\geq 3$, $\bm{\mu}_{1}=\bm{\mu}_{1}^{\prime}=\bm{0}$, and $\bm{\Sigma}=\bm{\Sigma}^{\prime}=\bm{I}_{p}$. Suppose $\bm{\mu}_{2}$ and $\bm{\mu}_{2}^{\prime}$ satisfy $(\bm{\mu}_{2})_{3:R}=(\bm{\mu}_{2}^{\prime})_{3:R}=\bm{0}$. Then $\bm{\beta}_{r}=\bm{\beta}_{r}^{\prime}=\bm{e}_{r}$ for $r\geq 3$, $\delta_{r}=\delta_{r}^{\prime}$ for $r\geq 3$, $\bm{\beta}_{2}=\bm{\mu}_{2}$, $\bm{\beta}_{2}^{\prime}=\bm{\mu}_{2}^{\prime}$, $\delta_{2}=\frac{1}{2}\|\bm{\mu}_{2}\|_{2}^{2}$, $\delta_{2}^{\prime}=\frac{1}{2}\|\bm{\mu}_{2}^{\prime}\|_{2}^{2}$, where $\|\bm{\mu}_{2}\|_{2}=\|\bm{\mu}_{2}^{\prime}\|_{2}=1$ with $\bm{\mu}_{2}^{\top}\bm{\mu}_{2}^{\prime}>\frac{\sqrt{2}}{2}$, and

𝜽¯(𝒞𝜽¯𝒞𝜽¯)𝝁2𝝁22.\mathbb{P}_{\overline{\bm{\theta}}}(\mathcal{C}_{\overline{\bm{\theta}}}\neq\mathcal{C}_{\overline{\bm{\theta}}^{\prime}})\gtrsim\|\bm{\mu}_{2}-\bm{\mu}_{2}^{\prime}\|_{2}. (S.5.760)

S.5.16.2 Main proof of Theorem 10

Given the three lemmas we presented, the proof is almost the same as the proof of Theorem 4. We do not repeat it here.

S.5.16.3 Proof of lemmas

Proof of Lemma 39.

Note that zjz_{j}’s are independent given y=3y=3. We have

𝜽¯(𝒞𝜽¯𝒞𝜽¯)\displaystyle\mathbb{P}_{\overline{\bm{\theta}}}(\mathcal{C}_{\overline{\bm{\theta}}}\neq\mathcal{C}_{\overline{\bm{\theta}}^{\prime}}) 𝜽¯(𝒞𝜽¯(𝒛)=2,𝒞𝜽¯(𝒛)2)\displaystyle\geq\mathbb{P}_{\overline{\bm{\theta}}}(\mathcal{C}_{\overline{\bm{\theta}}}(\bm{z})=2,\mathcal{C}_{\overline{\bm{\theta}}^{\prime}}(\bm{z})\neq 2) (S.5.761)
𝜽¯(z212logw1+logw20,z212logw1+logw20\displaystyle\geq\mathbb{P}_{\overline{\bm{\theta}}}\Big{(}z_{2}-\frac{1}{2}-\log w_{1}+\log w_{2}\geq 0,z_{2}-\frac{1}{2}-\log w_{1}^{\prime}+\log w_{2}^{\prime}\leq 0 (S.5.762)
zr12logw1+logwr0 for all r3)\displaystyle\quad\quad\quad z_{r}-\frac{1}{2}-\log w_{1}+\log w_{r}\leq 0\text{ for all }r\geq 3\Big{)} (S.5.763)
w3𝜽¯(z212logw1+logw20,z212logw1+logw20\displaystyle\geq w_{3}\mathbb{P}_{\overline{\bm{\theta}}}\Big{(}z_{2}-\frac{1}{2}-\log w_{1}+\log w_{2}\geq 0,z_{2}-\frac{1}{2}-\log w_{1}^{\prime}+\log w_{2}^{\prime}\leq 0 (S.5.764)
zr12logw1+logwr0 for all r3|y=3)\displaystyle\quad\quad\quad z_{r}-\frac{1}{2}-\log w_{1}+\log w_{r}\leq 0\text{ for all }r\geq 3\Big{|}y=3\Big{)} (S.5.765)
𝜽¯(12+logw1logw2z212+logw1logw2|y=3)\displaystyle\gtrsim\mathbb{P}_{\overline{\bm{\theta}}}\bigg{(}\frac{1}{2}+\log w_{1}-\log w_{2}\leq z_{2}\leq\frac{1}{2}+\log w_{1}^{\prime}-\log w_{2}^{\prime}\Big{|}y=3\bigg{)} (S.5.766)
r=3R(zr12logw1+logwr0)\displaystyle\quad\cdot\prod_{r=3}^{R}\mathbb{P}\Big{(}z_{r}-\frac{1}{2}-\log w_{1}+\log w_{r}\leq 0\Big{)} (S.5.767)
|logw1logw2logw1+logw2|\displaystyle\gtrsim\left|\log w_{1}-\log w_{2}-\log w_{1}^{\prime}+\log w_{2}^{\prime}\right| (S.5.768)
=|log(1w2)logw2log(1w2)+logw2|\displaystyle=\left|\log(1-w_{2})-\log w_{2}-\log(1-w_{2}^{\prime})+\log w_{2}^{\prime}\right| (S.5.769)
|w2w2|.\displaystyle\gtrsim|w_{2}-w_{2}^{\prime}|. (S.5.770)
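
The last inequality above can be verified via the mean value theorem (a short check): writing $g(w)\coloneqq\log(1-w)-\log w$, we have $|g^{\prime}(w)|=\frac{1}{w(1-w)}\geq 4$ for all $w\in(0,1)$, so for some $\widetilde{w}$ between $w_{2}$ and $w_{2}^{\prime}$,

\left|\log(1-w_{2})-\log w_{2}-\log(1-w_{2}^{\prime})+\log w_{2}^{\prime}\right|=|g(w_{2})-g(w_{2}^{\prime})|=\frac{|w_{2}-w_{2}^{\prime}|}{\widetilde{w}(1-\widetilde{w})}\geq 4|w_{2}-w_{2}^{\prime}|.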

Proof of Lemma 40.

Note that zjz_{j}’s are independent given y=3y=3. We have

𝜽¯(𝒞𝜽¯𝒞𝜽¯)\displaystyle\mathbb{P}_{\overline{\bm{\theta}}}(\mathcal{C}_{\overline{\bm{\theta}}}\neq\mathcal{C}_{\overline{\bm{\theta}}^{\prime}}) 𝜽¯(𝒞𝜽¯(𝒛)=2,𝒞𝜽¯(𝒛)2)\displaystyle\geq\mathbb{P}_{\overline{\bm{\theta}}}(\mathcal{C}_{\overline{\bm{\theta}}}(\bm{z})=2,\mathcal{C}_{\overline{\bm{\theta}}^{\prime}}(\bm{z})\neq 2) (S.5.771)
𝜽¯(z212u0,z212>0,zr12u0,zr120 for all r3)\displaystyle\geq\mathbb{P}_{\overline{\bm{\theta}}}\Big{(}z_{2}-\frac{1}{2}-u\geq 0,z_{2}-\frac{1}{2}>0,z_{r}-\frac{1}{2}-u\leq 0,z_{r}-\frac{1}{2}\leq 0\text{ for all }r\geq 3\Big{)} (S.5.772)
𝜽¯(12z212+u,zr12 for all r3)\displaystyle\geq\mathbb{P}_{\overline{\bm{\theta}}}\Big{(}\frac{1}{2}\leq z_{2}\leq\frac{1}{2}+u,z_{r}\leq\frac{1}{2}\text{ for all }r\geq 3\Big{)} (S.5.773)
w3𝜽¯(12z212+u,zr12 for all r3|y=3)\displaystyle\gtrsim w_{3}\mathbb{P}_{\overline{\bm{\theta}}}\bigg{(}\frac{1}{2}\leq z_{2}\leq\frac{1}{2}+u,z_{r}\leq\frac{1}{2}\text{ for all }r\geq 3\Big{|}y=3\bigg{)} (S.5.774)
w3𝜽¯(12z212+u|y=3)r=3R(zr12)\displaystyle\gtrsim w_{3}\mathbb{P}_{\overline{\bm{\theta}}}\bigg{(}\frac{1}{2}\leq z_{2}\leq\frac{1}{2}+u\Big{|}y=3\bigg{)}\cdot\prod_{r=3}^{R}\mathbb{P}\Big{(}z_{r}\leq\frac{1}{2}\Big{)} (S.5.775)
u,\displaystyle\gtrsim u, (S.5.776)

where we used the fact that (zr12|y=3)\mathbb{P}\big{(}z_{r}\leq\frac{1}{2}|y=3\big{)}\geq some constant CC. ∎

Proof of Lemma 41.

Note that zjz_{j}’s are independent given y=3y=3. We have

𝜽¯(𝒞𝜽¯𝒞𝜽¯)\displaystyle\mathbb{P}_{\overline{\bm{\theta}}}(\mathcal{C}_{\overline{\bm{\theta}}}\neq\mathcal{C}_{\overline{\bm{\theta}}^{\prime}}) 𝜽¯(𝒞𝜽¯(𝒛)=2,𝒞𝜽¯(𝒛)2)\displaystyle\geq\mathbb{P}_{\overline{\bm{\theta}}}(\mathcal{C}_{\overline{\bm{\theta}}}(\bm{z})=2,\mathcal{C}_{\overline{\bm{\theta}}^{\prime}}(\bm{z})\neq 2) (S.5.777)
𝜽¯(𝝁2𝒛12𝝁2220,(𝝁2)𝒛12𝝁2220,zr12 for all r3)\displaystyle\geq\mathbb{P}_{\overline{\bm{\theta}}}\Big{(}\bm{\mu}_{2}^{\top}\bm{z}-\frac{1}{2}\|\bm{\mu}_{2}\|_{2}^{2}\geq 0,(\bm{\mu}_{2}^{\prime})^{\top}\bm{z}-\frac{1}{2}\|\bm{\mu}_{2}^{\prime}\|_{2}^{2}\leq 0,z_{r}\leq\frac{1}{2}\text{ for all }r\geq 3\Big{)} (S.5.778)
w3𝜽¯(𝝁2𝒛120,(𝝁2)𝒛120,zr12 for all r3|y=3)\displaystyle\geq w_{3}\mathbb{P}_{\overline{\bm{\theta}}}\Big{(}\bm{\mu}_{2}^{\top}\bm{z}-\frac{1}{2}\geq 0,(\bm{\mu}_{2}^{\prime})^{\top}\bm{z}-\frac{1}{2}\leq 0,z_{r}\leq\frac{1}{2}\text{ for all }r\geq 3\Big{|}y=3\Big{)} (S.5.779)
w3𝜽¯(𝝁2𝒛12,(𝝁2)𝒛12|y=3)r=3R(zr12|y=3)\displaystyle\gtrsim w_{3}\mathbb{P}_{\overline{\bm{\theta}}}\Big{(}\bm{\mu}_{2}^{\top}\bm{z}\geq\frac{1}{2},(\bm{\mu}_{2}^{\prime})^{\top}\bm{z}\leq\frac{1}{2}\Big{|}y=3\Big{)}\cdot\prod_{r=3}^{R}\mathbb{P}\Big{(}z_{r}\leq\frac{1}{2}\Big{|}y=3\Big{)} (S.5.780)
1|𝝁2𝝁2|2𝝁224|𝝁2𝝁2|𝝁222\displaystyle\gtrsim\sqrt{1-\frac{|\bm{\mu}_{2}^{\top}\bm{\mu}_{2}^{\prime}|^{2}}{\|\bm{\mu}_{2}\|_{2}^{4}}}\cdot\frac{|\bm{\mu}_{2}^{\top}\bm{\mu}_{2}^{\prime}|}{\|\bm{\mu}_{2}\|_{2}^{2}} (S.5.781)
𝝁2𝝁22\displaystyle\gtrsim\|\bm{\mu}_{2}-\bm{\mu}_{2}^{\prime}\|_{2} (S.5.782)

The second last inequality is due to the fact that (zr12|y=3)\mathbb{P}\big{(}z_{r}\leq\frac{1}{2}|y=3\big{)}\geq some constant CC and Proposition 23 in \citeappazizyan2013minimax. The last inequality comes from Lemma 8.1 in \citeappcai2019chime. ∎

S.5.17 Proof of Theorem 11

Without loss of generality, suppose that $\pi_{k}^{*}$ satisfies $\pi_{k}^{*}(r)=\widetilde{r}$, the "majority class", whenever $\#\{k\in S:\pi_{k}(r)=\widetilde{r}\}\geq\frac{1}{2}|S|$. We can make this assumption because it suffices to recover $\{\iota(\pi^{*}_{k})\}_{k\in S}$ up to a permutation $\iota$. Also without loss of generality, suppose $\pi^{*}_{k}(r)=r$ for all $k\in S$. Now consider any $\pi=\{\pi_{k}\}_{k=1}^{K}$ with $\pi_{k}(r)=\pi_{k}^{*}(r)$ for all $k\in S^{c}$ and $\pi\neq\pi^{*}$. It suffices to prove that $\text{score}(\pi)>\text{score}(\pi^{*})$.
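
To make the quantity analyzed below concrete, here is a minimal sketch (in Python/NumPy, with function and variable names of our own choosing rather than the paper's) of how the alignment score $\text{score}(\pi)$ appearing in this proof could be computed and minimized by brute force over label permutations. It assumes the initial estimates $\widehat{\bm{\mu}}^{(k)[0]}_{r}$ and $\widehat{\bm{\Sigma}}^{(k)[0]}$ are available as arrays; the exhaustive search over $(R!)^{K}$ candidates is for illustration only and is not meant to reproduce the exact algorithm in the paper.

```python
import numpy as np
from itertools import permutations, product

def discriminant_dirs(Sigma_hat, mu_hat, perm):
    """(Sigma_hat)^{-1}(mu_{perm(r)} - mu_{perm(1)}) for r = 2, ..., R (0-indexed perm)."""
    base = mu_hat[perm[0]]
    return np.stack([np.linalg.solve(Sigma_hat, mu_hat[perm[r]] - base)
                     for r in range(1, len(perm))])

def alignment_score(Sigma_hats, mu_hats, perms):
    """Sum over task pairs (k, k') of l2 distances between the matched directions."""
    dirs = [discriminant_dirs(S, m, p) for S, m, p in zip(Sigma_hats, mu_hats, perms)]
    K = len(dirs)
    return sum(np.linalg.norm(dirs[k] - dirs[kp], axis=1).sum()
               for k in range(K) for kp in range(K) if k != kp)

def align_labels(Sigma_hats, mu_hats):
    """Brute-force search for label permutations minimizing the alignment score."""
    K, R = len(mu_hats), mu_hats[0].shape[0]
    best, best_score = None, np.inf
    for cand in product(permutations(range(R)), repeat=K):
        s = alignment_score(Sigma_hats, mu_hats, cand)
        if s < best_score:
            best, best_score = cand, s
    return best, best_score
```

In practice one would fix the permutation of a reference task to remove the global relabeling ambiguity $\iota$ mentioned above, rather than searching over all $(R!)^{K}$ combinations.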

Recall that ξ=maxkSminπmaxr[R](𝚺^(k))1(𝝁^(k)[0]πk(r)𝝁^(k)[0]πk(1))𝜷(k)r2\xi=\max_{k\in S}\min_{\pi}\max_{r\in[R]}\|(\widehat{\bm{\Sigma}}^{(k)})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(1)})-\bm{\beta}^{(k)*}_{r}\|_{2}. Note that

score(π)score(π)\displaystyle\text{score}(\pi)-\text{score}(\pi^{*}) =kkSπk(1)=πk(1)r=2R(𝚺^(k)[0])1(𝝁^(k)[0]πk(r)𝝁^(k)[0]πk(1))(𝚺^(k)[0])1(𝝁^(k)[0]πk(r)𝝁^(k)[0]πk(1))2[1]\displaystyle=\underbrace{\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)=\pi_{k^{\prime}}(1)\end{subarray}}\sum_{r=2}^{R}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(1)})-(\widehat{\bm{\Sigma}}^{(k^{\prime})[0]})^{-1}(\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{\pi_{k^{\prime}}(r)}-\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{\pi_{k^{\prime}}(1)})\|_{2}}_{[1]} (S.5.783)
+kkSπk(1)πk(1)r=2R(𝚺^(k)[0])1(𝝁^(k)[0]πk(r)𝝁^(k)[0]πk(1))(𝚺^(k)[0])1(𝝁^(k)[0]πk(r)𝝁^(k)[0]πk(1))2[2]\displaystyle\quad+\underbrace{\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)\neq\pi_{k^{\prime}}(1)\end{subarray}}\sum_{r=2}^{R}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(1)})-(\widehat{\bm{\Sigma}}^{(k^{\prime})[0]})^{-1}(\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{\pi_{k^{\prime}}(r)}-\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{\pi_{k^{\prime}}(1)})\|_{2}}_{[2]} (S.5.784)
+2kSkScr=2R(𝚺^(k)[0])1(𝝁^(k)[0]πk(r)𝝁^(k)[0]πk(1))(𝚺^(k)[0])1(𝝁^(k)[0]πk(r)𝝁^(k)[0]πk(1))2[3]\displaystyle\quad+\underbrace{2\sum_{\begin{subarray}{c}k\in S\\ k^{\prime}\in S^{c}\end{subarray}}\sum_{r=2}^{R}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(1)})-(\widehat{\bm{\Sigma}}^{(k^{\prime})[0]})^{-1}(\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{\pi_{k^{\prime}}(r)}-\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{\pi_{k^{\prime}}(1)})\|_{2}}_{[3]} (S.5.785)
kkSπk(1)=πk(1)r=2R(𝚺^(k)[0])1(𝝁^(k)[0]r𝝁^(k)[0]1)(𝚺^(k)[0])1(𝝁^(k)[0]r𝝁^(k)[0]1)2[1]\displaystyle-\underbrace{\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)=\pi_{k^{\prime}}(1)\end{subarray}}\sum_{r=2}^{R}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{r}-\widehat{\bm{\mu}}^{(k)[0]}_{1})-(\widehat{\bm{\Sigma}}^{(k^{\prime})[0]})^{-1}(\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{r}-\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{1})\|_{2}}_{[1]^{\prime}} (S.5.786)
kkSπk(1)πk(1)r=2R(𝚺^(k)[0])1(𝝁^(k)[0]r𝝁^(k)[0]1)(𝚺^(k)[0])1(𝝁^(k)[0]r𝝁^(k)[0]1)2[2]\displaystyle-\underbrace{\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)\neq\pi_{k^{\prime}}(1)\end{subarray}}\sum_{r=2}^{R}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{r}-\widehat{\bm{\mu}}^{(k)[0]}_{1})-(\widehat{\bm{\Sigma}}^{(k^{\prime})[0]})^{-1}(\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{r}-\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{1})\|_{2}}_{[2]^{\prime}} (S.5.787)
2kSkScr=2R(𝚺^(k)[0])1(𝝁^(k)[0]r𝝁^(k)[0]1)(𝚺^(k)[0])1(𝝁^(k)[0]πk(r)𝝁^(k)[0]πk(1))2[3].\displaystyle-\underbrace{2\sum_{\begin{subarray}{c}k\in S\\ k^{\prime}\in S^{c}\end{subarray}}\sum_{r=2}^{R}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{r}-\widehat{\bm{\mu}}^{(k)[0]}_{1})-(\widehat{\bm{\Sigma}}^{(k^{\prime})[0]})^{-1}(\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{\pi_{k^{\prime}}(r)}-\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{\pi_{k^{\prime}}(1)})\|_{2}}_{[3]^{\prime}}. (S.5.788)

We have

[2]\displaystyle[2] kkSπk(1)πk(1)r=2R((𝚺^(k)[0])1(𝝁^(k)[0]πk(r)𝝁^(k)[0]πk(1))(𝚺^(k)[0])1(𝝁^(k)[0]πk(r)𝝁^(k)[0]πk(1))22hξ)\displaystyle\geq\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)\neq\pi_{k^{\prime}}(1)\end{subarray}}\sum_{r=2}^{R}\left(\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(1)})-(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k^{\prime}}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k^{\prime}}(1)})\|_{2}-2h-\xi\right) (S.5.789)
kkSπk(1)πk(1)R((𝚺^(k)[0])1(𝝁^(k)[0]πk(1)𝝁^(k)[0]πk(1))22hξ)\displaystyle\geq\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)\neq\pi_{k^{\prime}}(1)\end{subarray}}R\left(\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(1)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k^{\prime}}(1)})\|_{2}-2h-\xi\right) (S.5.790)
kkSπk(1)πk(1)R(c𝚺1/2Δ2h2ξ).\displaystyle\geq\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)\neq\pi_{k^{\prime}}(1)\end{subarray}}R(c_{\bm{\Sigma}}^{-1/2}\Delta-2h-2\xi). (S.5.791)

Hence

[2][2]kkSπk(1)πk(1)[R(c𝚺1/2Δ2h2ξ)R(2ξ+2h)]=kkSπk(1)πk(1)R(c𝚺1/2Δ4h4ξ).[2]-[2]^{\prime}\geq\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)\neq\pi_{k^{\prime}}(1)\end{subarray}}[R(c_{\bm{\Sigma}}^{-1/2}\Delta-2h-2\xi)-R(2\xi+2h)]=\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)\neq\pi_{k^{\prime}}(1)\end{subarray}}R(c_{\bm{\Sigma}}^{-1/2}\Delta-4h-4\xi). (S.5.792)

Therefore,

[1]\displaystyle[1] =kkSπk(1)=πk(1)r=2Rπk(r)=πk(r)(𝚺^(k)[0])1(𝝁^(k)[0]πk(r)𝝁^(k)[0]πk(1))(𝚺^(k)[0])1(𝝁^(k)[0]πk(r)𝝁^(k)[0]πk(1))2[1]1\displaystyle=\underbrace{\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)=\pi_{k^{\prime}}(1)\end{subarray}}\sum_{r=2}^{R}\sum_{\pi_{k}(r)=\pi_{k^{\prime}}(r)}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(1)})-(\widehat{\bm{\Sigma}}^{(k^{\prime})[0]})^{-1}(\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{\pi_{k^{\prime}}(r)}-\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{\pi_{k^{\prime}}(1)})\|_{2}}_{[1]_{1}} (S.5.793)
+kkSπk(1)=πk(1)r=2Rπk(r)πk(r)(𝚺^(k)[0])1(𝝁^(k)[0]πk(r)𝝁^(k)[0]πk(1))(𝚺^(k)[0])1(𝝁^(k)[0]πk(r)𝝁^(k)[0]πk(1))2[1]2.\displaystyle\quad+\underbrace{\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)=\pi_{k^{\prime}}(1)\end{subarray}}\sum_{r=2}^{R}\sum_{\pi_{k}(r)\neq\pi_{k^{\prime}}(r)}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(1)})-(\widehat{\bm{\Sigma}}^{(k^{\prime})[0]})^{-1}(\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{\pi_{k^{\prime}}(r)}-\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{\pi_{k^{\prime}}(1)})\|_{2}}_{[1]_{2}}. (S.5.794)

Correspondingly, we can decompose [1][1]^{\prime} in the same way as [1]=[1]1+[1]2[1]^{\prime}=[1]^{\prime}_{1}+[1]^{\prime}_{2} with

\displaystyle[1]_{1}^{\prime} =\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)=\pi_{k^{\prime}}(1)\end{subarray}}\sum_{r=2}^{R}\sum_{\pi_{k}(r)=\pi_{k^{\prime}}(r)}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{r}-\widehat{\bm{\mu}}^{(k)[0]}_{1})-(\widehat{\bm{\Sigma}}^{(k^{\prime})[0]})^{-1}(\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{r}-\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{1})\|_{2}, (S.5.795)
\displaystyle[1]_{2}^{\prime} =\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)=\pi_{k^{\prime}}(1)\end{subarray}}\sum_{r=2}^{R}\sum_{\pi_{k}(r)\neq\pi_{k^{\prime}}(r)}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{r}-\widehat{\bm{\mu}}^{(k)[0]}_{1})-(\widehat{\bm{\Sigma}}^{(k^{\prime})[0]})^{-1}(\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{r}-\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{1})\|_{2}. (S.5.796)

Note that

\displaystyle[1]_{2} \geq\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)=\pi_{k^{\prime}}(1)\end{subarray}}\sum_{r=2}^{R}\sum_{\pi_{k}(r)\neq\pi_{k^{\prime}}(r)}\left(\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(1)})-(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k^{\prime}}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k^{\prime}}(1)})\|_{2}-2h-\xi\right) (S.5.797)
=kkSπk(1)=πk(1)r=2Rπk(r)πk(r)((𝚺^(k)[0])1(𝝁^(k)[0]πk(r)𝝁^(k)[0]πk(r))22hξ)\displaystyle=\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)=\pi_{k^{\prime}}(1)\end{subarray}}\sum_{r=2}^{R}\sum_{\pi_{k}(r)\neq\pi_{k^{\prime}}(r)}\left(\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k^{\prime}}(r)})\|_{2}-2h-\xi\right) (S.5.798)
kkSπk(1)=πk(1)r=2Rπk(r)πk(r)(c𝚺1/2Δ2h2ξ),\displaystyle\geq\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)=\pi_{k^{\prime}}(1)\end{subarray}}\sum_{r=2}^{R}\sum_{\pi_{k}(r)\neq\pi_{k^{\prime}}(r)}\left(c_{\bm{\Sigma}}^{-1/2}\Delta-2h-2\xi\right), (S.5.799)

hence

[1]2[1]2kkSπk(1)=πk(1)r=2Rπk(r)πk(r)(c𝚺1/2Δ4h4ξ).[1]_{2}-[1]^{\prime}_{2}\geq\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)=\pi_{k^{\prime}}(1)\end{subarray}}\sum_{r=2}^{R}\sum_{\pi_{k}(r)\neq\pi_{k^{\prime}}(r)}\left(c_{\bm{\Sigma}}^{-1/2}\Delta-4h-4\xi\right). (S.5.800)

And

[1]1\displaystyle[1]_{1} =kkSπk(1)=πk(1)r=2Rπk(r)=πk(r)(𝚺^(k)[0])1(𝝁^(k)[0]πk(r)𝝁^(k)[0]πk(1))(𝚺^(k)[0])1(𝝁^(k)[0]πk(r)𝝁^(k)[0]πk(1))2\displaystyle=\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)=\pi_{k^{\prime}}(1)\end{subarray}}\sum_{r=2}^{R}\sum_{\pi_{k}(r)=\pi_{k^{\prime}}(r)}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(1)})-(\widehat{\bm{\Sigma}}^{(k^{\prime})[0]})^{-1}(\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{\pi_{k^{\prime}}(r)}-\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{\pi_{k^{\prime}}(1)})\|_{2} (S.5.801)
=kkSπk(1)=πk(1)=1r=2Rπk(r)=πk(r)(𝚺^(k)[0])1(𝝁^(k)[0]πk(r)𝝁^(k)[0]πk(1))(𝚺^(k)[0])1(𝝁^(k)[0]πk(r)𝝁^(k)[0]πk(1))2[1]11\displaystyle=\underbrace{\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)=\pi_{k^{\prime}}(1)=1\end{subarray}}\sum_{r=2}^{R}\sum_{\pi_{k}(r)=\pi_{k^{\prime}}(r)}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(1)})-(\widehat{\bm{\Sigma}}^{(k^{\prime})[0]})^{-1}(\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{\pi_{k^{\prime}}(r)}-\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{\pi_{k^{\prime}}(1)})\|_{2}}_{[1]_{11}} (S.5.802)
+kkSπk(1)=πk(1)1r=2Rπk(r)=πk(r)(𝚺^(k)[0])1(𝝁^(k)[0]πk(r)𝝁^(k)[0]πk(1))(𝚺^(k)[0])1(𝝁^(k)[0]πk(r)𝝁^(k)[0]πk(1))2[1]12.\displaystyle\quad+\underbrace{\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)=\pi_{k^{\prime}}(1)\neq 1\end{subarray}}\sum_{r=2}^{R}\sum_{\pi_{k}(r)=\pi_{k^{\prime}}(r)}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(1)})-(\widehat{\bm{\Sigma}}^{(k^{\prime})[0]})^{-1}(\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{\pi_{k^{\prime}}(r)}-\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{\pi_{k^{\prime}}(1)})\|_{2}}_{[1]_{12}}. (S.5.803)
\displaystyle[1]_{12} \geq\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)=\pi_{k^{\prime}}(1)\neq 1\end{subarray}}\sum_{r=2}^{R}\sum_{\pi_{k}(r)=\pi_{k^{\prime}}(r)}(\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(1)})-(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k^{\prime}}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k^{\prime}}(1)})\|_{2}-2h-\xi) (S.5.804)
kkSπk(1)=πk(1)1r=2Rπk(r)=πk(r)(2h+ξ).\displaystyle\geq-\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)=\pi_{k^{\prime}}(1)\neq 1\end{subarray}}\sum_{r=2}^{R}\sum_{\pi_{k}(r)=\pi_{k^{\prime}}(r)}(2h+\xi). (S.5.805)

An important observation is that $[1]_{11}=[1]_{11}^{\prime}$. Therefore,

\begin{align}
[1]_{1}-[1]^{\prime}_{1}=[1]_{12}-[1]_{12}^{\prime}\geq-\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)=\pi_{k^{\prime}}(1)\neq 1\end{subarray}}\sum_{r=2}^{R}\sum_{\pi_{k}(r)=\pi_{k^{\prime}}(r)}(4h+3\xi). \tag{S.5.806}
\end{align}

And

\begin{align}
[1]-[1]^{\prime}&=[1]_{2}-[1]^{\prime}_{2}+[1]_{1}-[1]_{1}^{\prime} \tag{S.5.807}\\
&\geq\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)=\pi_{k^{\prime}}(1)\end{subarray}}\sum_{r=2}^{R}\sum_{\pi_{k}(r)\neq\pi_{k^{\prime}}(r)}(c_{\bm{\Sigma}}^{-1/2}\Delta-4h-4\xi)-\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)=\pi_{k^{\prime}}(1)\neq 1\end{subarray}}\sum_{r=2}^{R}\sum_{\pi_{k}(r)=\pi_{k^{\prime}}(r)}(4h+3\xi). \tag{S.5.808}
\end{align}

Furthermore, by the triangle inequality,

\begin{align}
[3]-[3]^{\prime}&=2\sum_{\begin{subarray}{c}k\in S\\ k^{\prime}\in S^{c}\end{subarray}}\sum_{r=2}^{R}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(1)})-(\widehat{\bm{\Sigma}}^{(k^{\prime})[0]})^{-1}(\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{\pi_{k^{\prime}}(r)}-\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{\pi_{k^{\prime}}(1)})\|_{2} \tag{S.5.809}\\
&\quad-2\sum_{\begin{subarray}{c}k\in S\\ k^{\prime}\in S^{c}\end{subarray}}\sum_{r=2}^{R}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{r}-\widehat{\bm{\mu}}^{(k)[0]}_{1})-(\widehat{\bm{\Sigma}}^{(k^{\prime})[0]})^{-1}(\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{\pi_{k^{\prime}}(r)}-\widehat{\bm{\mu}}^{(k^{\prime})[0]}_{\pi_{k^{\prime}}(1)})\|_{2} \tag{S.5.810}\\
&\geq-2\sum_{\begin{subarray}{c}k\in S\\ k^{\prime}\in S^{c}\end{subarray}}\sum_{r=2}^{R}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(1)}-\widehat{\bm{\mu}}^{(k)[0]}_{r}+\widehat{\bm{\mu}}^{(k)[0]}_{1})\|_{2}. \tag{S.5.811}
\end{align}

Putting all pieces together,

\begin{align}
\text{score}(\pi)-\text{score}(\pi^{*})&\geq\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)\neq\pi_{k^{\prime}}(1)\end{subarray}}R(c_{\bm{\Sigma}}^{-1/2}\Delta-4h-4\xi)+\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)=\pi_{k^{\prime}}(1)\end{subarray}}\sum_{r=2}^{R}\sum_{\pi_{k}(r)\neq\pi_{k^{\prime}}(r)}(c_{\bm{\Sigma}}^{-1/2}\Delta-4h-4\xi) \tag{S.5.812}\\
&\quad-\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)=\pi_{k^{\prime}}(1)\neq 1\end{subarray}}\sum_{r=2}^{R}\sum_{\pi_{k}(r)=\pi_{k^{\prime}}(r)}(4h+3\xi) \tag{S.5.813}\\
&\quad-2\sum_{k\in S}\sum_{r=2}^{R}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(1)}-\widehat{\bm{\mu}}^{(k)[0]}_{r}+\widehat{\bm{\mu}}^{(k)[0]}_{1})\|_{2}\cdot|S^{c}|. \tag{S.5.814}
\end{align}

Denote by $m_{r}$ the majority class among $\{\pi_{k}(r)\}_{k\in S}$, and let $S^{(r)}_{\widetilde{r}}=\{k\in S:\pi_{k}(r)=\widetilde{r}\}$.

(i) Case 1: $|S^{(1)}_{1}|\leq\frac{2}{3}|S|$.

Since $\pi^{*}_{k}(1)=1$, we have $|S^{(r)}_{1}|\leq\frac{2}{3}|S|$ for all $r\in[R]$; otherwise, by our assumption, $m_{1}=r_{0}$ for the $r_{0}$ satisfying $|S^{(r_{0})}_{1}|>\frac{2}{3}|S|\geq|S^{(1)}_{1}|$, which is a contradiction. Therefore,

\begin{align}
\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)\neq\pi_{k^{\prime}}(1)\end{subarray}}1&=2\sum_{r\neq r^{\prime}}|S^{(r)}_{1}|\cdot|S^{(r^{\prime})}_{1}| \tag{S.5.815}\\
&=\Big(\sum_{r=1}^{R}|S_{1}^{(r)}|\Big)^{2}-\sum_{r=1}^{R}|S_{1}^{(r)}|^{2} \tag{S.5.816}\\
&=|S|^{2}-\sum_{r=1}^{R}|S_{1}^{(r)}|^{2} \tag{S.5.817}\\
&\geq|S|^{2}-\Big(\frac{4}{9}|S|^{2}+\frac{1}{9}|S|^{2}\Big) \tag{S.5.818}\\
&=\frac{4}{9}|S|^{2}, \tag{S.5.819}
\end{align}

where the last inequality holds because, subject to $\sum_{r=1}^{R}|S_{1}^{(r)}|=|S|$ and $\max_{r\in[R]}|S_{1}^{(r)}|\leq\frac{2}{3}|S|$, the convex function $\sum_{r=1}^{R}|S_{1}^{(r)}|^{2}$ is maximized at an extreme point such as $\big(\frac{2}{3}|S|,\frac{1}{3}|S|,0,\ldots,0\big)$. Similarly,

\begin{align}
\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)=\pi_{k^{\prime}}(1)\neq 1\end{subarray}}\sum_{r=2}^{R}\sum_{\pi_{k}(r)=\pi_{k^{\prime}}(r)}1&\leq R\cdot 2\binom{|S|-|S^{(1)}_{1}|}{2} \tag{S.5.820}\\
&\leq R(|S|-|S^{(1)}_{1}|)^{2} \tag{S.5.821}\\
&\leq R|S|^{2}. \tag{S.5.822}
\end{align}

Also,

\begin{align}
\sum_{k\in S}\sum_{r=2}^{R}1\leq 2R|S|. \tag{S.5.823}
\end{align}

Hence, since $\frac{4}{9}|S|-4D|S^{c}|=\frac{4}{9}(1-\epsilon)K-4D\epsilon K>0$,

\begin{align}
\text{score}(\pi)-\text{score}(\pi^{*})&\geq\frac{4}{9}|S|^{2}R(c_{\bm{\Sigma}}^{-1/2}\Delta-4h-4\xi)-|S|^{2}R(4h+4\xi)-2R|S||S^{c}|(2Dc_{\bm{\Sigma}}^{1/2}\Delta+2\xi) \tag{S.5.824}\\
&=|S|R\Big[\Big(\frac{4}{9}c_{\bm{\Sigma}}^{-1/2}|S|-4Dc_{\bm{\Sigma}}^{1/2}|S^{c}|\Big)\Delta-\frac{52}{9}|S|h-\Big(\frac{52}{9}|S|+4|S^{c}|\Big)\xi\Big] \tag{S.5.825}\\
&>0. \tag{S.5.826}
\end{align}

(ii) Case 2: $|S^{(1)}_{1}|>\frac{2}{3}|S|$.

In this case, by our assumption, we must have $m_{1}=1$. Moreover,

\begin{align}
\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)\neq\pi_{k^{\prime}}(1)\end{subarray}}1&\geq 2|S^{(1)}_{1}|(|S|-|S^{(1)}_{1}|), \tag{S.5.827}\\
\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)=\pi_{k^{\prime}}(1)\neq 1\end{subarray}}\sum_{r=2}^{R}\sum_{\pi_{k}(r)=\pi_{k^{\prime}}(r)}1&\leq R\sum_{k,k^{\prime}}\mathds{1}(k\neq k^{\prime}\in S,\pi_{k}(1)=\pi_{k^{\prime}}(1)\neq 1) \tag{S.5.828}\\
&\leq R\cdot 2\binom{|S|-|S^{(1)}_{1}|}{2} \tag{S.5.829}\\
&\leq R(|S|-|S^{(1)}_{1}|)^{2} \tag{S.5.830}\\
&\leq R\cdot\frac{1}{2}|S^{(1)}_{1}|(|S|-|S^{(1)}_{1}|). \tag{S.5.831}
\end{align}

Moreover,

\begin{align}
&\sum_{k\in S}\sum_{r=2}^{R}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(1)}-\widehat{\bm{\mu}}^{(k)[0]}_{r}+\widehat{\bm{\mu}}^{(k)[0]}_{1})\|_{2}\cdot|S^{c}| \tag{S.5.832}\\
&\leq\sum_{\begin{subarray}{c}k\in S\\ \pi_{k}(1)\neq 1\end{subarray}}\sum_{r=2}^{R}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(1)}-\widehat{\bm{\mu}}^{(k)[0]}_{r}+\widehat{\bm{\mu}}^{(k)[0]}_{1})\|_{2}\cdot|S^{c}| \tag{S.5.833}\\
&\quad+\sum_{k\in S}\sum_{\begin{subarray}{c}\pi_{k}(1)=1\\ \pi_{k}(r)\neq r\end{subarray}}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{r})\|_{2}\cdot|S^{c}| \tag{S.5.834}\\
&\leq(|S|-|S^{(1)}_{1}|)R\cdot(2c_{\bm{\Sigma}}^{1/2}\Delta+2\xi)|S^{c}|+\sum_{k\in S}\sum_{\begin{subarray}{c}\pi_{k}(1)=1\\ \pi_{k}(r)\neq r\end{subarray}}(Dc_{\bm{\Sigma}}^{1/2}\Delta+\xi)|S^{c}|. \tag{S.5.835}
\end{align}

For $r$ satisfying $|S^{(r)}_{r}|\leq\frac{1}{2}|S|$, we have

\begin{align}
\sum_{k\in S}\sum_{\begin{subarray}{c}\pi_{k}(1)=1\\ \pi_{k}(r)\neq r\end{subarray}}1&\leq 2|S^{(1)}_{1}|, \tag{S.5.836}\\
\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)=\pi_{k^{\prime}}(1)\end{subarray}}\sum_{\pi_{k}(r)\neq\pi_{k^{\prime}}(r)}1&\geq 2\cdot\frac{1}{2}|S|\Big(\frac{2}{3}|S|-\frac{1}{2}|S|\Big)=\frac{1}{6}|S|^{2}. \tag{S.5.837}
\end{align}

For $r$ satisfying $|S^{(r)}_{r}|>\frac{1}{2}|S|$, we have $m_{r}=r$. Define $\widetilde{S}^{(r)}_{r}=\{k\in S:\pi_{k}(1)=1,\pi_{k}(r)=r\}$. Note that $|\widetilde{S}^{(r)}_{r}|\geq\frac{2}{3}|S|-\frac{1}{2}|S|=\frac{1}{6}|S|$. Furthermore,

\begin{align}
\sum_{k\in S}\sum_{\begin{subarray}{c}\pi_{k}(1)=1\\ \pi_{k}(r)\neq r\end{subarray}}1&\leq|S^{(1)}_{1}|-|\widetilde{S}^{(r)}_{r}|, \tag{S.5.838}\\
\sum_{\begin{subarray}{c}k\neq k^{\prime}\in S\\ \pi_{k}(1)=\pi_{k^{\prime}}(1)\end{subarray}}\sum_{\pi_{k}(r)\neq\pi_{k^{\prime}}(r)}1&\geq 2|\widetilde{S}^{(r)}_{r}|(|S^{(1)}_{1}|-|\widetilde{S}^{(r)}_{r}|)\geq\frac{1}{3}|S|(|S^{(1)}_{1}|-|\widetilde{S}^{(r)}_{r}|). \tag{S.5.839}
\end{align}

This implies that

\begin{align}
&\text{score}(\pi)-\text{score}(\pi^{*}) \tag{S.5.840}\\
&\geq 2|S_{1}^{(1)}|(|S|-|S_{1}^{(1)}|)R(c_{\bm{\Sigma}}^{-1/2}\Delta-4h-4\xi)-R\cdot\frac{1}{2}|S^{(1)}_{1}|(|S|-|S^{(1)}_{1}|)\cdot(4h+3\xi) \tag{S.5.841}\\
&\quad+\sum_{r:|S^{(r)}_{r}|\leq|S|/2}\Big[\frac{1}{6}|S|^{2}(c_{\bm{\Sigma}}^{-1/2}\Delta-4h-4\xi)-2|S_{1}^{(1)}|\cdot 2|S^{c}|(c_{\bm{\Sigma}}^{1/2}D\Delta+\xi)\Big] \tag{S.5.842}\\
&\quad+\sum_{r:|S^{(r)}_{r}|>|S|/2}\Big[\frac{1}{3}|S|(|S_{1}^{(1)}|-|\widetilde{S}^{(r)}_{r}|)(c_{\bm{\Sigma}}^{-1/2}\Delta-4h-4\xi)-(|S^{(1)}_{1}|-|\widetilde{S}^{(r)}_{r}|)\cdot|S^{c}|\cdot 2(c_{\bm{\Sigma}}^{1/2}D\Delta+\xi)\Big] \tag{S.5.843}\\
&\geq|S^{(1)}_{1}|(|S|-|S^{(1)}_{1}|)R\Big(2c_{\bm{\Sigma}}^{-1/2}\Delta-10h-\frac{19}{2}\xi\Big) \tag{S.5.844}\\
&\quad+\sum_{r:|S^{(r)}_{r}|\leq|S|/2}|S|\bigg[\Big(\frac{1}{6}c_{\bm{\Sigma}}^{-1/2}|S|-4Dc_{\bm{\Sigma}}^{1/2}|S^{c}|\Big)\Delta-\frac{2}{3}|S|h-\Big(\frac{2}{3}|S|+4|S^{c}|\Big)\xi\bigg] \tag{S.5.845}\\
&\quad+\sum_{r:|S^{(r)}_{r}|>|S|/2}\bigg[\Big(\frac{1}{3}c_{\bm{\Sigma}}^{-1/2}|S|-2Dc_{\bm{\Sigma}}^{1/2}|S^{c}|\Big)\Delta-\frac{4}{3}|S|h-\Big(\frac{4}{3}|S|+2|S^{c}|\Big)\xi\bigg](|S^{(1)}_{1}|-|\widetilde{S}^{(r)}_{r}|) \tag{S.5.846}\\
&>0. \tag{S.5.847}
\end{align}
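To make the quantity compared in this argument concrete, the following minimal Python sketch evaluates a pairwise alignment score of the above form for a given collection of label permutations and searches over all permutations exhaustively. It is illustrative only: the names (`Sigma_hat`, `mu_hat`, `perms`) are hypothetical, the exhaustive search is practical only for small $K$ and $R$, and the authoritative definition of $\text{score}(\pi)$ is the one given in the main text.

```python
import itertools
import numpy as np

def discrepancy_score(Sigma_hat, mu_hat, perms):
    """Pairwise alignment score across K tasks (illustrative sketch).

    Sigma_hat : list of K (p, p) covariance estimates.
    mu_hat    : list of K (R, p) arrays of estimated cluster means.
    perms     : list of K permutations of {0, ..., R-1} (0-indexed labels).
    """
    K = len(mu_hat)
    # Discriminant directions Sigma^{-1}(mu_{pi(r)} - mu_{pi(1)}) per task.
    beta = []
    for k in range(K):
        pk = perms[k]
        ref = mu_hat[k][pk[0]]
        beta.append(np.linalg.solve(
            Sigma_hat[k], (mu_hat[k][list(pk[1:])] - ref).T).T)  # (R-1, p)
    total = 0.0
    for k in range(K):
        for kp in range(K):
            if k != kp:  # ordered pairs, matching the sum over k != k'
                total += np.linalg.norm(beta[k] - beta[kp], axis=1).sum()
    return total

def exhaustive_alignment(Sigma_hat, mu_hat):
    """Return (score, permutations) minimizing the discrepancy score."""
    K = len(mu_hat)
    R = mu_hat[0].shape[0]
    best = None
    for cand in itertools.product(itertools.permutations(range(R)), repeat=K):
        s = discrepancy_score(Sigma_hat, mu_hat, list(cand))
        if best is None or s < best[0]:
            best = (s, cand)
    return best
```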

S.5.18 Proof of Theorem 12

WLOG, consider the step $\widetilde{K}\in[K]$ in the for loop and the case that $\iota$ is the identity mapping from $[K]$ to $[K]$, and $\widetilde{K}\in S$. Denote $\widetilde{S}=S\cap[\widetilde{K}]$ and $\widetilde{S}^{c}=S^{c}\cap[\widetilde{K}]$; hence $[\widetilde{K}]=\widetilde{S}\cup\widetilde{S}^{c}$. WLOG, consider $\pi_{1}=\pi_{2}=\cdots=\pi_{\widetilde{K}-1}=$ the identity mapping from $[R]$ to $[R]$. Denote $\bm{\pi}=\{\pi_{k}\}_{k=1}^{\widetilde{K}}$ and $\widetilde{\bm{\pi}}=\{\pi_{k}\}_{k=1}^{\widetilde{K}-1}\cup\{\widetilde{\pi}_{\widetilde{K}}\}$ with $\widetilde{\pi}_{\widetilde{K}}=$ the identity mapping from $[R]$ to $[R]$. It suffices to show that

\begin{align}
\text{score}(\bm{\pi})>\text{score}(\widetilde{\bm{\pi}}), \tag{S.5.848}
\end{align}

for any $\bm{\pi}$ with $\pi_{\widetilde{K}}\neq\widetilde{\pi}_{\widetilde{K}}=$ the identity mapping from $[R]$ to $[R]$. If this is the case, then $\widehat{\bm{\pi}}=\{\widehat{\pi}_{k}\}_{k=1}^{K}$ satisfies $\widehat{\pi}_{k}=$ identity for all $k\in S$, which completes the proof.

We focus on the derivation of (S.5.848) in the remaining part of the proof.
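As a computational aside before the case analysis, the sketch below illustrates a single step of such a sequential (for-loop) alignment: task $\widetilde{K}$ is aligned to the already-processed tasks by minimizing the summed discrepancy over its candidate label permutations, which is the comparison that (S.5.848) formalizes. All names are hypothetical, and the code is a sketch rather than the paper's exact procedure.

```python
import itertools
import numpy as np

def align_next_task(Sigma_hat, mu_hat, perms_so_far, k_new):
    """Greedy alignment step for task `k_new` (illustrative sketch only).

    Chooses the permutation of task `k_new` minimizing the summed discrepancy
    between its discriminant directions and those of tasks 0, ..., k_new - 1,
    which are assumed to be aligned already via `perms_so_far`.
    """
    R = mu_hat[k_new].shape[0]

    def directions(k, pk):
        # Sigma^{-1}(mu_{pk(r)} - mu_{pk(1)}) for the R - 1 non-reference labels.
        ref = mu_hat[k][pk[0]]
        return np.linalg.solve(Sigma_hat[k],
                               (mu_hat[k][list(pk[1:])] - ref).T).T

    prev = [directions(k, perms_so_far[k]) for k in range(k_new)]
    best_perm, best_score = None, np.inf
    for pk in itertools.permutations(range(R)):
        cur = directions(k_new, pk)
        score = sum(np.linalg.norm(cur - b, axis=1).sum() for b in prev)
        if score < best_score:
            best_perm, best_score = pk, score
    return best_perm
```

Under the separation and similarity conditions of Theorem 12, the argument below shows that this minimization selects the identity permutation whenever $\widetilde{K}\in S$.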

(i) Case 1: $\pi_{\widetilde{K}}(1)=1$.

\begin{align}
\text{score}(\bm{\pi})-\text{score}(\widetilde{\bm{\pi}})&=\underbrace{\sum_{r=2}^{R}\sum_{k\in\widetilde{S}}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{r}-\widehat{\bm{\mu}}^{(k)[0]}_{1})-(\widehat{\bm{\Sigma}}^{(\widetilde{K})[0]})^{-1}(\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{\pi_{\widetilde{K}}(r)}-\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{1})\|_{2}}_{[1]} \tag{S.5.849}\\
&\quad+\underbrace{\sum_{r=2}^{R}\sum_{k\in\widetilde{S}^{c}}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(1)})-(\widehat{\bm{\Sigma}}^{(\widetilde{K})[0]})^{-1}(\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{\pi_{\widetilde{K}}(r)}-\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{1})\|_{2}}_{[2]} \tag{S.5.850}\\
&\quad-\underbrace{\sum_{r=2}^{R}\sum_{k\in\widetilde{S}}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{r}-\widehat{\bm{\mu}}^{(k)[0]}_{1})-(\widehat{\bm{\Sigma}}^{(\widetilde{K})[0]})^{-1}(\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{r}-\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{1})\|_{2}}_{[1]^{\prime}} \tag{S.5.851}\\
&\quad-\underbrace{\sum_{r=2}^{R}\sum_{k\in\widetilde{S}^{c}}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(1)})-(\widehat{\bm{\Sigma}}^{(\widetilde{K})[0]})^{-1}(\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{r}-\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{1})\|_{2}}_{[2]^{\prime}}. \tag{S.5.852}
\end{align}

Note that

\begin{align}
[1]-[1]^{\prime}&=\#\{2\leq r\leq R:\pi_{\widetilde{K}}(r)\neq r\}\cdot \tag{S.5.853}\\
&\quad\sum_{k\in\widetilde{S}}\Big[\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{r}-\widehat{\bm{\mu}}^{(k)[0]}_{1})-(\widehat{\bm{\Sigma}}^{(\widetilde{K})[0]})^{-1}(\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{\pi_{\widetilde{K}}(r)}-\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{1})\|_{2} \tag{S.5.854}\\
&\quad\quad\quad-\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{r}-\widehat{\bm{\mu}}^{(k)[0]}_{1})-(\widehat{\bm{\Sigma}}^{(\widetilde{K})[0]})^{-1}(\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{r}-\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{1})\|_{2}\Big] \tag{S.5.855}\\
&\geq\#\{2\leq r\leq R:\pi_{\widetilde{K}}(r)\neq r\}\cdot\sum_{k\in\widetilde{S}}\Big[\|(\bm{\Sigma}^{(k)*})^{-1}(\bm{\mu}^{(k)*}_{r}-\bm{\mu}^{(k)*}_{\pi_{\widetilde{K}}(r)})\|_{2}-2\xi-2h-2\xi-2h\Big] \tag{S.5.856}\\
&\geq\#\{2\leq r\leq R:\pi_{\widetilde{K}}(r)\neq r\}\cdot|\widetilde{S}|\cdot\big(\Delta c_{\bm{\Sigma}}^{-1/2}-4\xi-4h\big). \tag{S.5.857}
\end{align}
\begin{align}
[2]-[2]^{\prime}&=\#\{2\leq r\leq R:\pi_{\widetilde{K}}(r)\neq r\}\cdot \tag{S.5.858}\\
&\quad\sum_{k\in\widetilde{S}^{c}}\Big[\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(1)})-(\widehat{\bm{\Sigma}}^{(\widetilde{K})[0]})^{-1}(\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{\pi_{\widetilde{K}}(r)}-\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{1})\|_{2} \tag{S.5.859}\\
&\quad\quad\quad-\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(1)})-(\widehat{\bm{\Sigma}}^{(\widetilde{K})[0]})^{-1}(\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{r}-\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{1})\|_{2}\Big] \tag{S.5.860}\\
&\geq-\#\{2\leq r\leq R:\pi_{\widetilde{K}}(r)\neq r\}\cdot\sum_{k\in\widetilde{S}^{c}}\|(\widehat{\bm{\Sigma}}^{(\widetilde{K})[0]})^{-1}(\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{\pi_{\widetilde{K}}(r)}-\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{r})\|_{2} \tag{S.5.861}\\
&\geq-\#\{2\leq r\leq R:\pi_{\widetilde{K}}(r)\neq r\}\cdot|\widetilde{S}^{c}|\cdot\big[\|(\bm{\Sigma}^{(\widetilde{K})*})^{-1}(\bm{\mu}^{(\widetilde{K})*}_{\pi_{\widetilde{K}}(r)}-\bm{\mu}^{(\widetilde{K})*}_{r})\|_{2}+\xi\big] \tag{S.5.862}\\
&\geq-\#\{2\leq r\leq R:\pi_{\widetilde{K}}(r)\neq r\}\cdot|\widetilde{S}^{c}|\cdot(Dc_{\bm{\Sigma}}^{1/2}\Delta+\xi). \tag{S.5.863}
\end{align}

These imply that

\begin{align}
\text{score}(\bm{\pi})-\text{score}(\widetilde{\bm{\pi}})&=[1]-[1]^{\prime}+[2]-[2]^{\prime} \tag{S.5.864}\\
&\geq\#\{2\leq r\leq R:\pi_{\widetilde{K}}(r)\neq r\}\cdot\big[(|\widetilde{S}|c_{\bm{\Sigma}}^{-1/2}-|\widetilde{S}^{c}|Dc_{\bm{\Sigma}}^{1/2})\Delta-(4|\widetilde{S}|+|\widetilde{S}^{c}|)\xi-4|\widetilde{S}|h\big] \tag{S.5.865}\\
&>0, \tag{S.5.866}
\end{align}

where we used the fact that $|\widetilde{S}|/|\widetilde{S}^{c}|\leq\frac{K_{0}}{K\epsilon}$.

(ii) Case 2: $\pi_{\widetilde{K}}(1)\neq 1$.

\begin{align}
\text{score}(\bm{\pi})-\text{score}(\widetilde{\bm{\pi}})&=\underbrace{\sum_{r=2}^{R}\sum_{k\in\widetilde{S}}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{r}-\widehat{\bm{\mu}}^{(k)[0]}_{1})-(\widehat{\bm{\Sigma}}^{(\widetilde{K})[0]})^{-1}(\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{\pi_{\widetilde{K}}(r)}-\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{\pi_{\widetilde{K}}(1)})\|_{2}}_{[1]} \tag{S.5.867}\\
&\quad+\underbrace{\sum_{r=2}^{R}\sum_{k\in\widetilde{S}^{c}}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(1)})-(\widehat{\bm{\Sigma}}^{(\widetilde{K})[0]})^{-1}(\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{\pi_{\widetilde{K}}(r)}-\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{\pi_{\widetilde{K}}(1)})\|_{2}}_{[2]} \tag{S.5.868}\\
&\quad-\underbrace{\sum_{r=2}^{R}\sum_{k\in\widetilde{S}}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{r}-\widehat{\bm{\mu}}^{(k)[0]}_{1})-(\widehat{\bm{\Sigma}}^{(\widetilde{K})[0]})^{-1}(\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{r}-\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{1})\|_{2}}_{[1]^{\prime}} \tag{S.5.869}\\
&\quad-\underbrace{\sum_{r=2}^{R}\sum_{k\in\widetilde{S}^{c}}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{k}(1)})-(\widehat{\bm{\Sigma}}^{(\widetilde{K})[0]})^{-1}(\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{r}-\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{1})\|_{2}}_{[2]^{\prime}}. \tag{S.5.870}
\end{align}

By previous results,

\begin{align}
[1]&=\sum_{k\in\widetilde{S}}\sum_{r=2}^{R}\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{r}-\widehat{\bm{\mu}}^{(k)[0]}_{1})-(\widehat{\bm{\Sigma}}^{(\widetilde{K})[0]})^{-1}(\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{\pi_{\widetilde{K}}(r)}-\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{\pi_{\widetilde{K}}(1)})\|_{2} \tag{S.5.871}\\
&\geq\sum_{k\in\widetilde{S}}\sum_{r=2}^{R}\left(\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{r}-\widehat{\bm{\mu}}^{(k)[0]}_{1})-(\widehat{\bm{\Sigma}}^{(\widetilde{K})[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{\widetilde{K}}(r)}-\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{\widetilde{K}}(1)})\|_{2}-2h-\xi\right) \tag{S.5.872}\\
&\geq|\widetilde{S}|R\cdot\left(\|(\widehat{\bm{\Sigma}}^{(k)[0]})^{-1}(\widehat{\bm{\mu}}^{(k)[0]}_{\pi_{\widetilde{K}}(1)}-\widehat{\bm{\mu}}^{(k)[0]}_{1})\|_{2}-2h-\xi\right) \tag{S.5.873}\\
&\geq|\widetilde{S}|R\cdot\left(\|(\bm{\Sigma}^{(k)*})^{-1}(\bm{\mu}^{(k)*}_{\pi_{\widetilde{K}}(1)}-\bm{\mu}^{(k)*}_{1})\|_{2}-\xi-2h-\xi\right) \tag{S.5.874}\\
&\geq|\widetilde{S}|R\cdot(c_{\bm{\Sigma}}^{-1/2}\Delta-2\xi-2h), \tag{S.5.875}
\end{align}

and

\begin{align}
-[1]^{\prime}\geq-|\widetilde{S}|R\cdot(2\xi+2h). \tag{S.5.876}
\end{align}

Similarly to Case 1,

\begin{align}
[2]-[2]^{\prime}&\geq-\sum_{r=2}^{R}\sum_{k\in\widetilde{S}^{c}}\|(\widehat{\bm{\Sigma}}^{(\widetilde{K})[0]})^{-1}(\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{\pi_{\widetilde{K}}(r)}-\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{\pi_{\widetilde{K}}(1)})-(\widehat{\bm{\Sigma}}^{(\widetilde{K})[0]})^{-1}(\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{r}-\widehat{\bm{\mu}}^{(\widetilde{K})[0]}_{1})\|_{2} \tag{S.5.877}\\
&\geq-R|\widetilde{S}^{c}|\cdot(2Dc_{\bm{\Sigma}}^{1/2}\Delta+2\xi). \tag{S.5.878}
\end{align}

Therefore,

\begin{align}
\text{score}(\bm{\pi})-\text{score}(\widetilde{\bm{\pi}})&=[1]-[1]^{\prime}+[2]-[2]^{\prime} \tag{S.5.879}\\
&\geq R\big[(|\widetilde{S}|c_{\bm{\Sigma}}^{-1/2}-|\widetilde{S}^{c}|\cdot 2Dc_{\bm{\Sigma}}^{1/2})\Delta-(2|\widetilde{S}|+2|\widetilde{S}^{c}|)\xi-2|\widetilde{S}|h\big] \tag{S.5.880}\\
&>0, \tag{S.5.881}
\end{align}

where we used the fact that $|\widetilde{S}|/|\widetilde{S}^{c}|\leq\frac{K_{0}}{K\epsilon}$.
