
Optimal Clustering in Anisotropic Gaussian Mixture Models

Xin Chen, University of Washington; Anderson Y. Zhang, University of Pennsylvania
Abstract

We study the clustering task under anisotropic Gaussian Mixture Models where the covariance matrices from different clusters are unknown and are not necessarily the identity matrix. We characterize the dependence of signal-to-noise ratios on the cluster centers and covariance matrices and obtain the minimax lower bound for the clustering problem. In addition, we propose a computationally feasible procedure and prove that it achieves the optimal rate within a few iterations. The proposed procedure is a hard EM type algorithm, and it can also be seen as a variant of Lloyd's algorithm adjusted to the anisotropic covariance matrices.

1 Introduction

Clustering is a fundamentally important task in statistics and machine learning [7, 2]. The most popular and most studied model for clustering is the Gaussian Mixture Model (GMM) [18, 20], which can be written as

\[
Y_j = \theta^*_{z^*_j} + \epsilon_j, \text{ where } \epsilon_j \stackrel{\text{ind}}{\sim} \mathcal{N}(0, \Sigma^*_{z^*_j}), \ \forall j \in [n].
\]

Here $Y=(Y_1,\ldots,Y_n)$ are the observations, with $n$ being the sample size. Let $k$ be the number of clusters, which is assumed to be known. Denote by $\{\theta^*_a\}_{a\in[k]}$ the unknown centers and by $\{\Sigma^*_a\}_{a\in[k]}$ the unknown covariance matrices of the $k$ clusters. Let $z^*\in[k]^n$ be the cluster structure such that for each index $j\in[n]$, the value of $z^*_j$ indicates which cluster the $j$th data point belongs to. The goal is to recover $z^*$ from $Y$. For any estimator $\hat{z}$, its clustering performance is measured by a misclustering error rate $h(\hat{z},z^*)$, which will be introduced later in (4).
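To make the data-generating process concrete, here is a minimal sampling sketch; the helper `sample_gmm` and its 0-indexed labels are our own illustration, not part of the paper.

```python
import numpy as np

def sample_gmm(thetas, Sigmas, z, rng=None):
    """Draw Y_j ~ N(theta_{z_j}, Sigma_{z_j}) for each j.

    thetas: (k, d) array of cluster centers theta_a^*.
    Sigmas: (k, d, d) array of cluster covariances Sigma_a^*.
    z:      (n,) array of cluster labels, 0-indexed here for convenience.
    """
    rng = np.random.default_rng() if rng is None else rng
    return np.stack([rng.multivariate_normal(thetas[a], Sigmas[a]) for a in z])
```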

There has been increasing interest in the theoretical and algorithmic analysis of clustering under GMMs. When the GMM is isotropic (that is, all the covariance matrices $\{\Sigma^*_a\}_{a\in[k]}$ are equal to the same identity matrix), [15] obtains the minimax rate for clustering, which takes the form $\exp(-(1+o(1))(\min_{a\neq b}\|\theta^*_a-\theta^*_b\|)^2/8)$ under the loss $h(\hat{z},z^*)$. Various methods have been studied in the isotropic setting. $k$-means clustering [16] might be the most natural choice, but it is NP-hard [4]. As a local approach to optimizing the $k$-means objective, Lloyd's algorithm [13] is one of the most popular clustering algorithms and has achieved many successes in different disciplines [24]. [15, 8] establish computational and statistical guarantees for Lloyd's algorithm by showing it achieves the optimal rates after a few iterations, provided with some decent initialization. Another popular approach to clustering, especially for high-dimensional data, is spectral clustering [22, 19, 21], which is an umbrella term for clustering after a dimension reduction through a spectral decomposition. [14, 17, 1] prove that spectral clustering also achieves optimality under the isotropic GMM. Another line of work considers semidefinite programming (SDP) as a convex relaxation of the $k$-means objective. Its statistical properties have been studied in [6, 9].

In spite of all the exciting results, most of the existing literature focuses on isotropic GMMs, and clustering under the anisotropic case, where the covariance matrices are not necessarily the identity matrix, is not well-understood. The results of some papers [15, 6] hold under sub-Gaussian mixture models, where the errors $\epsilon_j$ are assumed to follow some sub-Gaussian distribution with variance proxy $\sigma^2$. It may seem that their results already cover the anisotropic case, as $\{\mathcal{N}(0,\Sigma^*_a)\}_{a\in[k]}$ are indeed sub-Gaussian distributions. However, from a minimax point of view, among all the sub-Gaussian distributions with variance proxy $\sigma^2$, the least favorable case (the case where clustering is the most difficult) is when the errors are $\mathcal{N}(0,\sigma^2 I_d)$. Therefore, the minimax rate for clustering under the sub-Gaussian mixture model is essentially the one under isotropic GMMs, and methods such as Lloyd's algorithm, which require no covariance matrix information, can be rate-optimal. As a result, the aforementioned results are all for isotropic GMMs.

A few papers have explored the direction of clustering under anisotropic GMMs. [3] gives a polynomial-time clustering algorithm that provably works well when the Gaussian distributions are well separated by hyperplanes. Their idea is further developed in [11], which allows the Gaussians to overlap with each other but only for two-cluster cases. A recent paper [23] proposes another method for clustering under a balanced mixture of two elliptical distributions. They give a provable upper bound on their clustering performance with respect to an excess risk. Nevertheless, it remains unknown what the fundamental limit of clustering under anisotropic GMMs is and whether there is any polynomial-time procedure that achieves it.

In this paper, we investigate the optimal rates of the clustering task under two anisotropic GMMs. Model 1 is when the covariance matrices are all equal to each other (i.e., homogeneous) and are equal to some unknown matrix $\Sigma^*$. Model 2 is more flexible, where the covariance matrices are unknown and are not necessarily equal to each other (i.e., heterogeneous). The contribution of this paper is two-fold, summarized as follows.

Our first contribution is on the fundamental limits. We obtain the minimax lower bound for clustering under the anisotropic GMMs with respect to the loss $h(\hat{z},z^*)$. We show it takes the form

\[
\inf_{\hat{z}}\sup_{z^*\in[k]^n}\mathbb{E}h(\hat{z},z^*)\geq\exp\left(-(1+o(1))\frac{(\text{signal-to-noise ratio})^2}{8}\right),
\]

where the signal-to-noise ratio under Model 1 is equal to $\min_{a,b\in[k]:a\neq b}\|(\theta^*_a-\theta^*_b)^T\Sigma^{*-\frac{1}{2}}\|$ and the one for Model 2 is more complicated. For both models, we can see the minimax rates depend not only on the centers but also on the covariance matrices. This is different from the isotropic case, whose signal-to-noise ratio is $\min_{a\neq b}\|\theta^*_a-\theta^*_b\|$. Our results precisely capture the role the covariance matrices play in the clustering problem: they impact the fundamental limits of clustering through a complicated interaction with the centers, especially in Model 2. The minimax lower bounds are obtained by establishing connections with Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA).

Our second and more important contribution is on the computational side. We propose a computationally feasible and rate-optimal algorithm for the anisotropic GMM. Popular methods including Lloyd's algorithm and spectral clustering no longer work well, as they are developed under the isotropic case and only consider the distances among the centers [3]. We study an adjusted Lloyd's algorithm which estimates the covariance matrices in each iteration and recovers the clusters using the covariance matrix information. It can also be seen as a hard EM algorithm [5]. As an iterative algorithm, we give a statistical and computational guarantee, and guidance to practitioners, by showing that it attains the minimax lower bound within $\log n$ iterations. That is, letting $z^{(t)}$ be the output of the algorithm after $t$ iterations, we have, with high probability,

\[
h(z^{(t)},z^*)\leq\exp\left(-(1+o(1))\frac{(\text{signal-to-noise ratio})^2}{8}\right)
\]

for all $t\geq\log n$. The algorithm can be initialized by popular methods such as spectral clustering or the vanilla Lloyd's algorithm. In numerical studies, we show the proposed algorithm improves greatly over the two aforementioned methods under anisotropic GMMs and matches the optimal exponent given in the minimax lower bound.

Paper Organization.

The rest of the paper is organized as follows. In Section 2, we study Model 1, where the covariance matrices are unknown but homogeneous. In Section 3, we consider Model 2, where the covariance matrices are unknown and heterogeneous. For both cases, we obtain the minimax lower bound for clustering and study the adjusted Lloyd's algorithm. In Section 4, we provide a numerical comparison with other popular methods. The proofs of the theorems in Section 2 are given in Section 5, and the proofs for Section 3 are included in Section 6. All the technical lemmas are included in Section 7.

Notation.

Let $[m]=\{1,2,\ldots,m\}$ for any positive integer $m$. For any set $S$, we denote $|S|$ for its cardinality. For any matrix $X\in\mathbb{R}^{d\times d}$, we denote $\lambda_1(X)$ to be its smallest eigenvalue and $\lambda_d(X)$ to be its largest eigenvalue. In addition, we denote $\|X\|$ to be its operator norm. For any two vectors $u,v$ of the same dimension, we denote $\langle u,v\rangle=u^Tv$ to be their inner product. For any positive integer $d$, we denote $I_d$ to be the $d\times d$ identity matrix. We denote $\mathcal{N}(\mu,\Sigma)$ to be the normal distribution with mean $\mu$ and covariance matrix $\Sigma$. We denote $\mathbb{I}\{\cdot\}$ to be the indicator function. Given two positive sequences $a_n,b_n$, we denote $a_n=o(b_n)$ if $a_n/b_n=o(1)$ when $n\rightarrow\infty$. We write $a_n\lesssim b_n$ if there exists a constant $C>0$ independent of $n$ such that $a_n\leq Cb_n$ for all $n$.

2 GMM with Unknown but Homogeneous Covariance Matrices

2.1 Model

We first consider a GMM where the covariance matrices of different clusters are unknown but are assumed to be equal to each other. The data generating process can be displayed as follows:

\[
\text{Model 1:}\quad Y_j = \theta^*_{z^*_j} + \epsilon_j, \text{ where } \epsilon_j \stackrel{\text{ind}}{\sim} \mathcal{N}(0,\Sigma^*), \ \forall j\in[n]. \tag{1}
\]

It is called the Stretched Mixture Model in [23], as the density of $Y_j$ is elliptical. Throughout the paper, we call it Model 1 for simplicity and to distinguish it from a more complicated model that will be introduced in Section 3. The goal is to recover the underlying cluster assignment vector $z^*$ from $Y$.

Signal-to-noise Ratio.

Define the signal-to-noise ratio

\[
\textsf{SNR}=\min_{a,b\in[k]:a\neq b}\|(\theta^*_a-\theta^*_b)^T\Sigma^{*-\frac{1}{2}}\|, \tag{2}
\]

which is a function of all the centers $\{\theta^*_a\}_{a\in[k]}$ and the covariance matrix $\Sigma^*$. As we will show later in Theorem 2.1, SNR captures the difficulty of the clustering problem and determines the minimax rate. We defer the geometric interpretation of SNR until after Lemma 2.1.

A quantity closely related to SNR is the minimum distance among the centers. Define $\Delta$ as

\[
\Delta=\min_{a,b\in[k]:a\neq b}\|\theta^*_a-\theta^*_b\|. \tag{3}
\]

Then we can see SNR and $\Delta$ are of the same order if all eigenvalues of the covariance matrix $\Sigma^*$ are assumed to be constants. If $\Sigma^*$ is further assumed to be the identity matrix, then SNR is equal to $\Delta$. As a result, in [15, 8, 14], where isotropic GMMs are studied, $\Delta$ plays the role of the signal-to-noise ratio and appears in the minimax rates.
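As a quick numerical companion to (2) and (3), the following sketch computes both quantities for given centers and covariance; the helper `snr_and_delta` is our own naming, and it uses the identity $\|(\theta^*_a-\theta^*_b)^T\Sigma^{*-\frac{1}{2}}\|^2=(\theta^*_a-\theta^*_b)^T\Sigma^{*-1}(\theta^*_a-\theta^*_b)$.

```python
import itertools
import numpy as np

def snr_and_delta(thetas, Sigma):
    """Compute SNR from (2) and Delta from (3).

    thetas: (k, d) array of centers; Sigma: (d, d) covariance matrix.
    """
    snr, delta = np.inf, np.inf
    for a, b in itertools.combinations(range(len(thetas)), 2):
        v = thetas[a] - thetas[b]
        # v^T Sigma^{-1} v equals the squared Mahalanobis separation in (2).
        snr = min(snr, np.sqrt(v @ np.linalg.solve(Sigma, v)))
        delta = min(delta, np.linalg.norm(v))
    return snr, delta
```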

Loss Function.

To measure the clustering performance, we consider the misclustering error rate defined as follows. For any $z,z^*\in[k]^n$, we define

\[
h(z,z^*)=\min_{\psi\in\Psi}\frac{1}{n}\sum_{j=1}^n\mathbb{I}\{\psi(z_j)\neq z^*_j\}, \tag{4}
\]

where $\Psi=\{\psi:\psi\text{ is a bijection from }[k]\text{ to }[k]\}$. Here the minimum is over all the permutations of $[k]$ due to the identifiability issue of the labels $1,2,\ldots,k$. Another loss that will be used is $\ell(z,z^*)$, defined as

\[
\ell(z,z^*)=\sum_{j=1}^n\|\theta^*_{z_j}-\theta^*_{z^*_j}\|^2. \tag{5}
\]

It also measures the clustering performance of $z$, taking into account the distances among the true centers. It is related to $h(z,z^*)$ through $h(z,z^*)\leq\ell(z,z^*)/(n\Delta^2)$ and hence provides more information than $h(z,z^*)$. We will mainly use $\ell(z,z^*)$ in the technical analysis but will eventually present the results using $h(z,z^*)$, which is more interpretable.
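In practice, the minimum over all $k!$ bijections in (4) need not be enumerated: it is an assignment problem on the $k\times k$ confusion matrix and can be solved by the Hungarian algorithm. A minimal sketch (the helper `misclustering_rate` and its 0-indexed labels are our own illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def misclustering_rate(z_hat, z_star, k):
    """h(z_hat, z_star) from (4): minimize errors over label permutations."""
    confusion = np.zeros((k, k), dtype=int)
    for a, b in zip(z_hat, z_star):
        confusion[a, b] += 1
    # Maximizing matched counts over permutations = minimizing errors.
    row, col = linear_sum_assignment(confusion, maximize=True)
    return 1.0 - confusion[row, col].sum() / len(z_star)
```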

2.2 Minimax Lower Bound

We first establish the minimax lower bound for the clustering problem under Model 1.

Theorem 2.1.

Under the assumption $\frac{\textsf{SNR}}{\sqrt{\log k}}\rightarrow\infty$, we have

\[
\inf_{\hat{z}}\sup_{z^*\in[k]^n}\mathbb{E}h(\hat{z},z^*)\geq\exp\left(-(1+o(1))\frac{\textsf{SNR}^2}{8}\right). \tag{6}
\]

If $\textsf{SNR}=O(1)$ instead, we have $\inf_{\hat{z}}\sup_{z^*\in[k]^n}\mathbb{E}h(\hat{z},z^*)\geq c$ for some constant $c>0$.

Theorem 2.1 allows the number of clusters $k$ to grow with $n$ and shows that $\textsf{SNR}\rightarrow\infty$ is a necessary condition for consistent clustering if $k$ is a constant. Theorem 2.1 holds for arbitrary $\{\theta^*_a\}_{a\in[k]}$ and $\Sigma^*$, and the minimax lower bound depends on them through SNR. The parameter space is only for $z^*$, while $\{\theta^*_a\}_{a\in[k]}$ and $\Sigma^*$ are fixed. Hence, (6) can be interpreted as a pointwise result, and it captures precisely the explicit dependence of the minimaxity on $\{\theta^*_a\}_{a\in[k]}$ and $\Sigma^*$.

Theorem 2.1 is closely related to Linear Discriminant Analysis (LDA). If there are only two clusters, and if the centers and the covariance matrix are known, then estimating each $z^*_j$ is exactly the task of LDA: we want to figure out which normal distribution an observation $Y_j$ is generated from, where the two normal distributions have different means but the same covariance matrix. In fact, this is also how Theorem 2.1 is proved: we first reduce the estimation problem of $z^*$ into a two-point hypothesis testing problem for each individual $z^*_j$, the error of which is given in Lemma 2.1 by the analysis of LDA, and then aggregate all the testing errors together.

In the following lemma, we give a sharp and explicit formula for the testing error of the LDA. Here we have two normal distributions $\mathcal{N}(\theta^*_1,\Sigma^*)$ and $\mathcal{N}(\theta^*_2,\Sigma^*)$ and an observation $X$ that is generated from one of them. We are interested in estimating which distribution it is from. By the Neyman-Pearson lemma, it is known that the likelihood ratio test $\mathbb{I}\{2(\theta^*_2-\theta^*_1)^T(\Sigma^*)^{-1}X\geq\theta_2^{*T}(\Sigma^*)^{-1}\theta^*_2-\theta_1^{*T}(\Sigma^*)^{-1}\theta^*_1\}$ is the optimal testing procedure. Then, using the Gaussian tail probability, we are able to obtain the optimal testing error, the lower bound of which is given in Lemma 2.1.

Lemma 2.1 (Testing Error for Linear Discriminant Analysis).

Consider two hypotheses $\mathbb{H}_0:X\sim\mathcal{N}(\theta^*_1,\Sigma^*)$ and $\mathbb{H}_1:X\sim\mathcal{N}(\theta^*_2,\Sigma^*)$. Define a testing procedure

\[
\phi=\mathbb{I}\left\{2(\theta^*_2-\theta^*_1)^T(\Sigma^*)^{-1}X\geq\theta_2^{*T}(\Sigma^*)^{-1}\theta^*_2-\theta_1^{*T}(\Sigma^*)^{-1}\theta^*_1\right\}.
\]

Then we have $\inf_{\hat{\phi}}(\mathbb{P}_{\mathbb{H}_0}(\hat{\phi}=1)+\mathbb{P}_{\mathbb{H}_1}(\hat{\phi}=0))=\mathbb{P}_{\mathbb{H}_0}(\phi=1)+\mathbb{P}_{\mathbb{H}_1}(\phi=0)$. If $\|(\theta^*_2-\theta^*_1)^T(\Sigma^*)^{-\frac{1}{2}}\|\rightarrow\infty$, we have

\[
\inf_{\hat{\phi}}(\mathbb{P}_{\mathbb{H}_0}(\hat{\phi}=1)+\mathbb{P}_{\mathbb{H}_1}(\hat{\phi}=0))\geq\exp\left(-(1+o(1))\frac{\|(\theta^*_2-\theta^*_1)^T(\Sigma^*)^{-\frac{1}{2}}\|^2}{8}\right).
\]

Otherwise, $\inf_{\hat{\phi}}(\mathbb{P}_{\mathbb{H}_0}(\hat{\phi}=1)+\mathbb{P}_{\mathbb{H}_1}(\hat{\phi}=0))\geq c$ for some constant $c>0$.
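The optimal test $\phi$ of Lemma 2.1 is a one-line computation; a minimal sketch (the helper `lda_test` is our own naming):

```python
import numpy as np

def lda_test(X, theta1, theta2, Sigma):
    """Likelihood ratio test phi from Lemma 2.1: returns 1 if H1 is selected."""
    w = np.linalg.solve(Sigma, theta2 - theta1)   # Sigma^{-1}(theta2 - theta1)
    # theta2^T Sigma^{-1} theta2 - theta1^T Sigma^{-1} theta1
    # equals (theta2 + theta1) @ w by symmetry of Sigma^{-1}.
    threshold = (theta2 + theta1) @ w
    return int(2 * X @ w >= threshold)
```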

Figure 1: A geometric interpretation of SNR.

With the help of Lemma 2.1, we have a geometric interpretation of SNR. In the left panel of Figure 1, we have two normal distributions $\mathcal{N}(\theta^*_1,\Sigma^*)$ and $\mathcal{N}(\theta^*_2,\Sigma^*)$ for $X$ to be generated from. The black line represents the optimal testing procedure $\phi$ displayed in Lemma 2.1, which divides the space into two half-spaces. To calculate the testing error, we can make a transformation $X'=(\Sigma^*)^{-\frac{1}{2}}(X-\theta^*_1)$ so that the two normal distributions become isotropic: $\mathcal{N}(0,I_d)$ and $\mathcal{N}((\Sigma^*)^{-\frac{1}{2}}(\theta^*_2-\theta^*_1),I_d)$, as displayed in the right panel. Then the distance between the two centers is $\|(\Sigma^*)^{-\frac{1}{2}}(\theta^*_2-\theta^*_1)\|$, and the distance between a center and the black line is half of it. Then $\mathbb{P}_{\mathbb{H}_0}(\phi=1)$ is the probability that $\mathcal{N}(0,I_d)$ falls in the gray area, which is equal to $\exp(-(1+o(1))\|(\Sigma^*)^{-\frac{1}{2}}(\theta^*_2-\theta^*_1)\|^2/8)$ by the Gaussian tail probability. As a result, $\|(\Sigma^*)^{-\frac{1}{2}}(\theta^*_2-\theta^*_1)\|$ is the effective distance between the two centers of $\mathcal{N}(\theta^*_1,\Sigma^*)$ and $\mathcal{N}(\theta^*_2,\Sigma^*)$ for the clustering problem, considering the geometry of the covariance matrix. Since we have multiple clusters, SNR defined in (2) can be interpreted as the minimum effective distance among the centers $\{\theta^*_a\}_{a\in[k]}$ considering the anisotropic structure of $\Sigma^*$, and it captures the intrinsic difficulty of the clustering problem.

2.3 Rate-Optimal Adaptive Procedure

In this section, we propose a computationally feasible and rate-optimal procedure for clustering under Model 1. Summarized in Algorithm 1, the proposed algorithm is a variant of Lloyd's algorithm. Starting from some initialization, it iteratively updates the estimates of the centers $\{\theta^*_a\}_{a\in[k]}$ (in (7)), the covariance matrix $\Sigma^*$ (in (8)), and the cluster assignment vector $z^*$ (in (9)). It differs from Lloyd's algorithm in that Lloyd's algorithm is designed for isotropic GMMs and has no covariance matrix update (8); moreover, instead of (9), Lloyd's algorithm updates the estimate of $z^*_j$ by $\mathop{\rm argmin}_{a\in[k]}(Y_j-\theta_a^{(t)})^T(Y_j-\theta_a^{(t)})$. To distinguish them from each other, we refer to the classical Lloyd's algorithm as the vanilla Lloyd's algorithm, and name Algorithm 1 the adjusted Lloyd's algorithm, as it is adjusted to the unknown and anisotropic covariance matrix.

Algorithm 1 can also be interpreted as a hard EM algorithm. If we apply Expectation Maximization (EM) to Model 1, we have an M step for estimating the parameters $\{\theta^*_a\}_{a\in[k]}$ and $\Sigma^*$ and an E step for estimating $z^*$. It turns out the updates on the parameters (7) - (8) are exactly the same as the updates of EM (M step). However, the update on $z^*$ in Algorithm 1 is different from that in the EM. Instead of taking a conditional expectation (E step), we also take a maximization in (9). As a result, Algorithm 1 consists solely of M steps for both the parameters and $z^*$, which is known as a hard EM algorithm.

Input: Data $Y$, number of clusters $k$, an initialization $z^{(0)}$, number of iterations $T$
Output: $z^{(T)}$
for $t=1,\ldots,T$ do
    Update the centers:
\[
\theta_a^{(t)}=\frac{\sum_{j\in[n]}Y_j\mathbb{I}\{z^{(t-1)}_j=a\}}{\sum_{j\in[n]}\mathbb{I}\{z^{(t-1)}_j=a\}},\quad\forall a\in[k]. \tag{7}
\]
    Update the covariance matrix:
\[
\Sigma^{(t)}=\frac{\sum_{a\in[k]}\sum_{j\in[n]}(Y_j-\theta_a^{(t)})(Y_j-\theta_a^{(t)})^T\mathbb{I}\{z^{(t-1)}_j=a\}}{n}. \tag{8}
\]
    Update the cluster estimates:
\[
z^{(t)}_j=\mathop{\rm argmin}_{a\in[k]}(Y_j-\theta_a^{(t)})^T(\Sigma^{(t)})^{-1}(Y_j-\theta_a^{(t)}),\quad j\in[n]. \tag{9}
\]
end for
Algorithm 1: Adjusted Lloyd's Algorithm for Model 1 (1).
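For concreteness, here is a minimal NumPy sketch of Algorithm 1; it is our own illustration (0-indexed labels, no handling of empty clusters), not the authors' reference implementation.

```python
import numpy as np

def adjusted_lloyd_model1(Y, k, z0, T):
    """Adjusted Lloyd's algorithm for Model 1; Y is (n, d), z0 in {0,...,k-1}^n."""
    n, _ = Y.shape
    z = np.asarray(z0).copy()
    for _ in range(T):
        # (7): update each center as the average of its current cluster.
        thetas = np.stack([Y[z == a].mean(axis=0) for a in range(k)])
        # (8): update the shared covariance matrix from the pooled residuals.
        R = Y - thetas[z]
        Sigma = (R.T @ R) / n
        # (9): reassign each point to the closest center in Mahalanobis distance.
        Sinv = np.linalg.inv(Sigma)
        diff = Y[:, None, :] - thetas[None, :, :]          # shape (n, k, d)
        dist = np.einsum('jad,de,jae->ja', diff, Sinv, diff)
        z = dist.argmin(axis=1)
    return z
```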

In Theorem 2.2, we give a computational and statistical guarantee for the proposed Algorithm 1. We show that starting from a decent initialization, within $\log n$ iterations, Algorithm 1 achieves an error rate $\exp(-(1+o(1))\textsf{SNR}^2/8)$, which matches the minimax lower bound given in Theorem 2.1. As a result, Algorithm 1 is a rate-optimal procedure. In addition, the algorithm is fully adaptive to the unknown $\{\theta^*_a\}_{a\in[k]}$ and $\Sigma^*$. The only information assumed to be known is the number of clusters $k$, which is commonly assumed to be known in the clustering literature [15, 8, 14]. The theorem also shows that the number of iterations needed to achieve the optimal rate is at most $\log n$, which provides implementation guidance to practitioners.

Theorem 2.2.

Assume $kd=O(\sqrt{n})$ and $\min_{a\in[k]}\sum_{j=1}^n\mathbb{I}\{z^*_j=a\}\geq\frac{\alpha n}{k}$ for some constant $\alpha>0$. Assume $\frac{\textsf{SNR}}{k}\rightarrow\infty$ and $\lambda_d(\Sigma^*)/\lambda_1(\Sigma^*)=O(1)$. For Algorithm 1, suppose $z^{(0)}$ satisfies $\ell(z^{(0)},z^*)=o(n/k)$ with probability at least $1-\eta$. Then with probability at least $1-\eta-n^{-1}-\exp(-\textsf{SNR})$, we have

\[
h(z^{(t)},z^*)\leq\exp\left(-(1+o(1))\frac{\textsf{SNR}^2}{8}\right),\quad\text{for all }t\geq\log n.
\]

We have a few remarks on the assumptions of Theorem 2.2. We allow the number of clusters $k$ to grow with $n$. When $k$ is a constant, the assumption $\textsf{SNR}\rightarrow\infty$ is the necessary condition for a consistent recovery of $z^*$ according to the minimax lower bound presented in Theorem 2.1. The assumption on $\Sigma^*$ ensures the covariance matrix is well-conditioned. The dimensionality $d$ is assumed to be at most $O(\sqrt{n})$, an assumption that is stronger than that in [15, 8, 14], which only need $d=O(n)$. This is because, compared to these papers, we need to estimate the covariance matrix $\Sigma^*$ and to control the estimation error $\|\Sigma^{(t)}-\Sigma^*\|$.

The requirement on the initialization, $\ell(z^{(0)},z^*)=o(n/k)$, can be fulfilled by simple procedures. A popular choice is the vanilla Lloyd's algorithm, the performance of which is studied in [15, 8]. Since the $\epsilon_j$ are sub-Gaussian random vectors with variance proxy $\lambda_{\max}$, the largest eigenvalue of $\Sigma^*$, [8] implies the vanilla Lloyd's algorithm output $\hat{z}$ satisfies $\ell(\hat{z},z^*)\leq n\exp(-(1+o(1))\Delta^2/(8\lambda_{\max}))$ with probability at least $1-\exp(-\Delta)-n^{-1}$, under the assumption that $\textsf{SNR}/k\rightarrow\infty$. Note that [8] is for isotropic GMMs, but its results can be extended to sub-Gaussian mixture models with a nearly identical proof. Then we have $\ell(\hat{z},z^*)=o(n/k)$, as $\Delta^2/\lambda_{\max}$ and $\textsf{SNR}^2$ are of the same order under the assumption $\textsf{SNR}/k\rightarrow\infty$. As a result, we immediately have the following corollary.

Corollary 2.1.

Assume $kd=O(\sqrt{n})$ and $\min_{a\in[k]}\sum_{j=1}^n\mathbb{I}\{z^*_j=a\}\geq\frac{\alpha n}{k}$ for some constant $\alpha>0$. Assume $\frac{\textsf{SNR}}{k}\rightarrow\infty$ and $\lambda_d(\Sigma^*)/\lambda_1(\Sigma^*)=O(1)$. Using the vanilla Lloyd's algorithm as the initialization $z^{(0)}$ in Algorithm 1, we have with probability at least $1-n^{-1}-\exp(-\textsf{SNR})-\exp(-\Delta)$,

\[
h(z^{(t)},z^*)\leq\exp\left(-(1+o(1))\frac{\textsf{SNR}^2}{8}\right),\quad\text{for all }t\geq\log n.
\]

3 GMM with Unknown and Heterogeneous Covariance Matrices

3.1 Model

In this section, we study the GMM where the covariance matrix of each cluster is unknown and the matrices are not necessarily equal to each other. The data generation process can be displayed as follows:

\[
\text{Model 2:}\quad Y_j = \theta^*_{z^*_j} + \epsilon_j, \text{ where } \epsilon_j \stackrel{\text{ind}}{\sim} \mathcal{N}(0,\Sigma^*_{z^*_j}), \ \forall j\in[n]. \tag{10}
\]

We call it Model 2 throughout the paper to distinguish it from Model 1 studied in Section 2. The difference between (10) and (1) is that we now have $\{\Sigma^*_a\}_{a\in[k]}$ instead of a shared $\Sigma^*$. We consider the same loss functions as in (4) and (5).

Signal-to-noise Ratio.

The signal-to-noise ratio for Model 2 is defined as follows. We use the notation $\textsf{SNR}'$ to distinguish it from SNR for Model 1. Compared to SNR, $\textsf{SNR}'$ is much more complicated and does not have an explicit formula. We first define a set $\mathcal{B}_{a,b}\subset\mathbb{R}^d$ for any $a,b\in[k]$ such that $a\neq b$:

\[
\mathcal{B}_{a,b}=\Bigg\{x\in\mathbb{R}^d:\ x^T\Sigma_a^{*\frac{1}{2}}\Sigma_b^{*-1}(\theta^*_a-\theta^*_b)+\frac{1}{2}x^T\left(\Sigma_a^{*\frac{1}{2}}\Sigma_b^{*-1}\Sigma_a^{*\frac{1}{2}}-I_d\right)x\leq-\frac{1}{2}(\theta^*_a-\theta^*_b)^T\Sigma_b^{*-1}(\theta^*_a-\theta^*_b)+\frac{1}{2}\log|\Sigma_a^*|-\frac{1}{2}\log|\Sigma_b^*|\Bigg\}.
\]

We then define $\textsf{SNR}'_{a,b}=2\min_{x\in\mathcal{B}_{a,b}}\|x\|$ and

\[
\textsf{SNR}'=\min_{a,b\in[k]:a\neq b}\textsf{SNR}'_{a,b}. \tag{11}
\]

The form of $\textsf{SNR}'$ is closely connected to the testing error of Quadratic Discriminant Analysis (QDA), which we will give in Lemma 3.1. We defer the interpretation of $\textsf{SNR}'$ (especially from a geometric point of view) until after Lemma 3.1. Here let us consider a few special cases where we are able to simplify $\textsf{SNR}'$: (1) When $\Sigma^*_a=\Sigma^*$ for all $a\in[k]$, by simple algebra, we have $\textsf{SNR}'_{a,b}=\|(\theta^*_a-\theta^*_b)^T\Sigma^{*-\frac{1}{2}}\|$ for any $a,b\in[k]$ such that $a\neq b$. Hence, $\textsf{SNR}'=\textsf{SNR}$ and Model 2 is reduced to Model 1. (2) When $\Sigma^*_a=\sigma_a^2I_d$ for each $a\in[k]$, where $\sigma_1,\ldots,\sigma_k>0$ are large constants, we have $\textsf{SNR}'_{a,b}$ and $\textsf{SNR}'_{b,a}$ both close to $2\|\theta^*_a-\theta^*_b\|/(\sigma_a+\sigma_b)$. From these examples, we can see $\textsf{SNR}'$ is determined by both the centers $\{\theta^*_a\}_{a\in[k]}$ and the covariance matrices $\{\Sigma^*_a\}_{a\in[k]}$.
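Since $\textsf{SNR}'_{a,b}$ has no closed form in general, one can evaluate it numerically as the constrained minimization $2\min_{x\in\mathcal{B}_{a,b}}\|x\|$. The sketch below is our own illustration; the feasible set is non-convex in general, so the multi-start heuristic only approximates the minimum.

```python
import numpy as np
from scipy.linalg import sqrtm
from scipy.optimize import minimize

def snr_prime_ab(theta_a, theta_b, Sigma_a, Sigma_b, n_starts=20, seed=0):
    """Approximate SNR'_{a,b} = 2 min_{x in B_{a,b}} ||x|| by multi-start SLSQP."""
    d = len(theta_a)
    A = np.real(sqrtm(Sigma_a))                      # Sigma_a^{*1/2}
    Binv = np.linalg.inv(Sigma_b)
    v = A @ Binv @ (theta_a - theta_b)               # linear term of the constraint
    Q = A @ Binv @ A - np.eye(d)                     # quadratic term of the constraint
    c = (-0.5 * (theta_a - theta_b) @ Binv @ (theta_a - theta_b)
         + 0.5 * np.linalg.slogdet(Sigma_a)[1]
         - 0.5 * np.linalg.slogdet(Sigma_b)[1])      # right-hand side of B_{a,b}
    cons = {'type': 'ineq', 'fun': lambda x: c - (x @ v + 0.5 * x @ Q @ x)}
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_starts):
        res = minimize(lambda x: x @ x, rng.normal(size=d), constraints=[cons])
        if res.success:
            best = min(best, np.sqrt(res.fun))
    return 2 * best
```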

3.2 Minimax Lower Bound

We first establish the minimax lower bound for the clustering problem under Model 2.

Theorem 3.1.

Under the assumption $\frac{\textsf{SNR}'}{\sqrt{\log k}}\rightarrow\infty$, we have

\[
\inf_{\hat{z}}\sup_{z^*\in[k]^n}\mathbb{E}h(\hat{z},z^*)\geq\exp\left(-(1+o(1))\frac{\textsf{SNR}'^2}{8}\right).
\]

If $\textsf{SNR}'=O(1)$ instead, we have $\inf_{\hat{z}}\sup_{z^*\in[k]^n}\mathbb{E}h(\hat{z},z^*)\geq c$ for some constant $c>0$.

Although the statement of Theorem 3.1 looks similar to that of Theorem 2.1, the two minimax lower bounds are different from each other due to the discrepancy between how the centers and the covariance matrices enter $\textsf{SNR}'$ and SNR. By the same argument as in Section 2.2, the minimax lower bound established in Theorem 3.1 is closely related to Quadratic Discriminant Analysis (QDA) between two normal distributions with different means and different covariance matrices.

Lemma 3.1 (Testing Error for Quadratic Discriminant Analysis).

Consider two hypotheses $\mathbb{H}_0:X\sim\mathcal{N}(\theta^*_1,\Sigma^*_1)$ and $\mathbb{H}_1:X\sim\mathcal{N}(\theta^*_2,\Sigma^*_2)$. Define a testing procedure

\[
\phi=\mathbb{I}\left\{\log|\Sigma^*_1|+(X-\theta^*_1)^T\Sigma_1^{*-1}(X-\theta^*_1)\geq\log|\Sigma^*_2|+(X-\theta^*_2)^T\Sigma_2^{*-1}(X-\theta^*_2)\right\}.
\]

Then we have $\inf_{\hat{\phi}}(\mathbb{P}_{\mathbb{H}_0}(\hat{\phi}=1)+\mathbb{P}_{\mathbb{H}_1}(\hat{\phi}=0))=\mathbb{P}_{\mathbb{H}_0}(\phi=1)+\mathbb{P}_{\mathbb{H}_1}(\phi=0)$. If $\min\{\textsf{SNR}'_{1,2},\textsf{SNR}'_{2,1}\}\rightarrow\infty$, we have

\[
\inf_{\hat{\phi}}(\mathbb{P}_{\mathbb{H}_0}(\hat{\phi}=1)+\mathbb{P}_{\mathbb{H}_1}(\hat{\phi}=0))\geq\exp\left(-(1+o(1))\frac{\min\{\textsf{SNR}'_{1,2},\textsf{SNR}'_{2,1}\}^2}{8}\right).
\]

Otherwise, $\inf_{\hat{\phi}}(\mathbb{P}_{\mathbb{H}_0}(\hat{\phi}=1)+\mathbb{P}_{\mathbb{H}_1}(\hat{\phi}=0))\geq c$ for some constant $c>0$.
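The QDA likelihood ratio test of Lemma 3.1 can be implemented directly; a minimal sketch (the helper `qda_test` is our own naming):

```python
import numpy as np

def qda_test(X, theta1, theta2, Sigma1, Sigma2):
    """Likelihood ratio test phi from Lemma 3.1: returns 1 if H1 is selected."""
    def neg_loglik(theta, Sigma):
        # log|Sigma| + (X - theta)^T Sigma^{-1} (X - theta), up to constants.
        r = X - theta
        return np.linalg.slogdet(Sigma)[1] + r @ np.linalg.solve(Sigma, r)
    return int(neg_loglik(theta1, Sigma1) >= neg_loglik(theta2, Sigma2))
```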

Figure 2: A geometric interpretation of $\textsf{SNR}'$.

From Lemma 3.1, we can obtain a geometric interpretation of $\textsf{SNR}'$. In the left panel of Figure 2, we have two normal distributions $\mathcal{N}(\theta^*_1,\Sigma^*_1)$ and $\mathcal{N}(\theta^*_2,\Sigma^*_2)$ from which $X$ can be generated. The black curve represents the optimal testing procedure $\phi$ displayed in Lemma 3.1. Since $\Sigma^*_1$ is not necessarily equal to $\Sigma^*_2$, the black curve is not necessarily a straight line. If $\mathbb{H}_0$ is true, then $X$ is incorrectly classified exactly when it falls in the gray area, which happens with probability $\mathbb{P}_{\mathbb{H}_0}(\phi=1)$. To calculate it, we can make a transformation $X'=(\Sigma^*_1)^{-\frac{1}{2}}(X-\theta^*_1)$. Then, as displayed in the right panel of Figure 2, the two distributions become $\mathcal{N}(0,I_d)$ and $\mathcal{N}((\Sigma^*_1)^{-\frac{1}{2}}(\theta^*_2-\theta^*_1),(\Sigma^*_1)^{-\frac{1}{2}}\Sigma^*_2(\Sigma^*_1)^{-\frac{1}{2}})$, and the optimal testing procedure becomes $\mathbb{I}\{X'\in\mathcal{B}_{1,2}\}$. As a result, in the right panel of Figure 2, $\mathcal{B}_{1,2}$ is the region colored gray, and the black curve is its boundary. Then $\mathbb{P}_{\mathbb{H}_0}(\phi=1)$ is equal to $\mathbb{P}(\mathcal{N}(0,I_d)\in\mathcal{B}_{1,2})$, which can be shown to be determined by the minimum distance between the center of $\mathcal{N}(0,I_d)$ and the set $\mathcal{B}_{1,2}$. Denoting this minimum distance by $\textsf{SNR}'_{1,2}/2$, by Lemmas 7.10 and 7.11 we can show $\mathbb{P}(\mathcal{N}(0,I_d)\in\mathcal{B}_{1,2})=\exp(-(1+o(1))\textsf{SNR}_{1,2}^{\prime 2}/8)$. As a result, $\textsf{SNR}'$ can be interpreted as the minimum effective distance among the centers $\{\theta^*_a\}_{a\in[k]}$ considering the anisotropic and heterogeneous structure of $\{\Sigma^*_a\}_{a\in[k]}$, and it captures the intrinsic difficulty of the clustering problem under Model 2.

3.3 Optimal Adaptive Procedure

In this section, we propose a computationally feasible and rate-optimal procedure for clustering under Model 2. Similar to Algorithm 1, the proposed Algorithm 2 can be seen as a variant of Lloyd's algorithm that is adjusted to the unknown and heterogeneous covariance matrices. It can also be interpreted as a hard EM algorithm under Model 2. Algorithm 2 differs from Algorithm 1 in (13) and (14), as now there are $k$ covariance matrices to be estimated.

Input: Data $Y$, number of clusters $k$, an initialization $z^{(0)}$, number of iterations $T$
Output: $z^{(T)}$
for $t=1,\ldots,T$ do
    Update the centers:
\[
\theta_a^{(t)}=\frac{\sum_{j\in[n]}Y_j\mathbb{I}\{z^{(t-1)}_j=a\}}{\sum_{j\in[n]}\mathbb{I}\{z^{(t-1)}_j=a\}},\quad\forall a\in[k]. \tag{12}
\]
    Update the covariance matrices:
\[
\Sigma_a^{(t)}=\frac{\sum_{j\in[n]}(Y_j-\theta_a^{(t)})(Y_j-\theta_a^{(t)})^T\mathbb{I}\{z^{(t-1)}_j=a\}}{\sum_{j\in[n]}\mathbb{I}\{z^{(t-1)}_j=a\}},\quad\forall a\in[k]. \tag{13}
\]
    Update the cluster estimates:
\[
z^{(t)}_j=\mathop{\rm argmin}_{a\in[k]}\left[(Y_j-\theta_a^{(t)})^T(\Sigma^{(t)}_a)^{-1}(Y_j-\theta_a^{(t)})+\log|\Sigma^{(t)}_a|\right],\quad j\in[n]. \tag{14}
\]
end for
Algorithm 2: Adjusted Lloyd's Algorithm for Model 2 (10).
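A minimal NumPy sketch of Algorithm 2, analogous to the earlier sketch of Algorithm 1 (again our own illustration; 0-indexed labels, empty clusters not handled):

```python
import numpy as np

def adjusted_lloyd_model2(Y, k, z0, T):
    """Adjusted Lloyd's algorithm for Model 2 with per-cluster covariances."""
    z = np.asarray(z0).copy()
    for _ in range(T):
        dist = np.empty((len(Y), k))
        for a in range(k):
            Ya = Y[z == a]
            theta_a = Ya.mean(axis=0)                     # (12)
            Ra = Ya - theta_a
            Sigma_a = (Ra.T @ Ra) / len(Ya)               # (13)
            diff = Y - theta_a
            maha = np.einsum('jd,jd->j', diff @ np.linalg.inv(Sigma_a), diff)
            dist[:, a] = maha + np.linalg.slogdet(Sigma_a)[1]   # (14)
        z = dist.argmin(axis=1)
    return z
```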

In Theorem 3.2, we give a computational and statistical guarantee for the proposed Algorithm 2. We show that provided with some decent initialization, Algorithm 2 is able to achieve the minimax lower bound within $\log n$ iterations. The assumptions needed in Theorem 3.2 are similar to those in Theorem 2.2, except that we require stronger assumptions on $k$ and the dimensionality $d$, since now we have $k$ (instead of one) covariance matrices to estimate. In addition, by $\max_{a,b\in[k]}\lambda_d(\Sigma^*_a)/\lambda_1(\Sigma^*_b)=O(1)$ we not only assume each of the $k$ covariance matrices is well-conditioned, but also assume they are comparable to each other.

Theorem 3.2.

Assume $k,d=O(1)$ and $\min_{a\in[k]}\sum_{j=1}^n\mathbb{I}\{z^*_j=a\}\geq\frac{\alpha n}{k}$ for some constant $\alpha>0$. Assume $\textsf{SNR}'\rightarrow\infty$ and $\max_{a,b\in[k]}\lambda_d(\Sigma^*_a)/\lambda_1(\Sigma^*_b)=O(1)$. For Algorithm 2, suppose $z^{(0)}$ satisfies $\ell(z^{(0)},z^*)=o(n/k)$ with probability at least $1-\eta$. Then with probability at least $1-\eta-n^{-1}-\exp(-\textsf{SNR}')$, we have

\[
h(z^{(t)},z^*)\leq\exp\left(-(1+o(1))\frac{\textsf{SNR}'^2}{8}\right),\quad\text{for all }t\geq\log n.
\]

The vanilla Lloyd’s algorithm can be used as the initialization for Algorithm 2. Under the assumption that λd(Σ)/λ1(Σ)=O(1)\lambda_{d}(\Sigma^{*})/\lambda_{1}(\Sigma^{*})=O(1), Model 2 is also a sub-Gaussian mixture model. By the same argument as in Section 2.3 we have the following corollary.

Corollary 3.1.

Assume $k,d=O(1)$ and $\min_{a\in[k]}\sum_{j=1}^n\mathbb{I}\{z^*_j=a\}\geq\frac{\alpha n}{k}$ for some constant $\alpha>0$. Assume $\textsf{SNR}'\rightarrow\infty$ and $\max_{a,b\in[k]}\lambda_d(\Sigma^*_a)/\lambda_1(\Sigma^*_b)=O(1)$. Using the vanilla Lloyd's algorithm as the initialization $z^{(0)}$ in Algorithm 2, we have with probability at least $1-n^{-1}-\exp(-\textsf{SNR}')-\exp(-\Delta)$,

\[
h(z^{(t)},z^*)\leq\exp\left(-(1+o(1))\frac{\textsf{SNR}'^2}{8}\right),\quad\text{for all }t\geq\log n.
\]

4 Numerical Studies

In this section, we compare the performance of the proposed methods with other popular clustering methods on synthetic datasets under different settings.

Model 1.

The first simulation is designed for the GMM with unknown but homogeneous covariance matrices (i.e., Model 1). We independently generate $n=1200$ samples with dimension $d=50$ from $k=30$ clusters. Each cluster has 40 samples. We set $\Sigma^*=U^T\Lambda U$, where $\Lambda$ is a $50\times 50$ diagonal matrix with diagonal elements equally spaced from 0.5 to 8 and $U$ is a randomly generated orthogonal matrix. The centers $\{\theta^*_a\}_{a\in[k]}$ are orthogonal to each other with $\|\theta^*_1\|=\ldots=\|\theta^*_{30}\|=9$. We consider four popular clustering methods: (1) the spectral clustering method [14] (denoted as "spectral"), (2) the vanilla Lloyd's algorithm [15] (denoted as "vanilla Lloyd"), (3) the proposed Algorithm 1 initialized by spectral clustering (denoted as "spectral + Alg 1"), and (4) Algorithm 1 initialized by the vanilla Lloyd's algorithm (denoted as "vanilla Lloyd + Alg 1"). The comparison is presented in the left panel of Figure 3.
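For reproducibility, here is a sketch of this simulation setup with the values stated above; the use of `scipy.stats.ortho_group` for the random orthogonal matrix and the specific seed are our own choices.

```python
import numpy as np
from scipy.stats import ortho_group

rng = np.random.default_rng(0)
n, d, k = 1200, 50, 30
Lambda = np.diag(np.linspace(0.5, 8, d))      # eigenvalues equally spaced in [0.5, 8]
U = ortho_group.rvs(d, random_state=0)        # random orthogonal matrix
Sigma = U.T @ Lambda @ U                      # Sigma* = U^T Lambda U
thetas = 9 * np.eye(d)[:k]                    # 30 mutually orthogonal centers, norm 9
z_star = np.repeat(np.arange(k), n // k)      # 40 samples per cluster
Y = thetas[z_star] + rng.multivariate_normal(np.zeros(d), Sigma, size=n)
```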

In the plot, the $x$-axis is the number of iterations and the $y$-axis is the logarithm of the misclustering error rate, i.e., $\log(h)$. Each curve plotted is an average over 100 independent trials. We can see the proposed Algorithm 1 outperforms spectral clustering and the vanilla Lloyd's algorithm significantly. What is more, the dashed line represents the optimal exponent $-\textsf{SNR}^2/8$ of the minimax lower bound given in Section 2.2, and we can see Algorithm 1 achieves it after 3 iterations. This justifies the conclusion established in Theorem 2.2 that Algorithm 1 is rate-optimal.

Model 2.

We also compare the performance of four methods (spectral, vanilla Lloyd, spectral + Alg 2, and vanilla Lloyd + Alg 2) for the GMM with unknown and heterogeneous covariance matrices (i.e., Model 2). In this case, we take $n=1200$, $k=3$, and $d=5$. We set $\Sigma^*_1=I_d$; $\Sigma^*_2=\Lambda_2$, a $5\times 5$ diagonal matrix with elements equally spaced from 0.5 to 8; and $\Sigma^*_3=U^T\Lambda_3U$, where $\Lambda_3$ is a diagonal matrix with elements selected uniformly from 0.5 to 2 and $U$ is a randomly generated orthogonal matrix. To simplify the calculation of $\textsf{SNR}'$, we take $\theta^*_1$ as a randomly selected unit vector, $\theta^*_2=\theta^*_1+5e_1$, with $e_1$ denoting the vector with a 1 in the first coordinate and 0's elsewhere, and $\theta^*_3=\theta^*_2+v_1$, with $v_1$ randomly selected satisfying $\|v_1\|=10$. The comparison is presented in the right panel of Figure 3, where each curve plotted is an average over 100 independent trials.

From the plot, we can clearly see the proposed Algorithm 2 improves greatly over spectral clustering and the vanilla Lloyd's algorithm. The dashed line represents the optimal exponent $-\textsf{SNR}'^2/8$ of the minimax lower bound given in Section 3.2, which is achieved by Algorithm 2. Hence, this numerically justifies Theorem 3.2 that Algorithm 2 is rate-optimal for clustering under Model 2.

Figure 3: Left: Performance of Algorithm 1 compared with other methods under Model 1. Right: Performance of Algorithm 2 compared with other methods under Model 2.

5 Proofs in Section 2

5.1 Proofs for The Lower Bound

Proof of Lemma 2.1.

Note that $\phi$ is the likelihood ratio test. Hence, by the Neyman-Pearson lemma, it is the optimal procedure. Let $\epsilon\sim\mathcal{N}(0,\Sigma^*)$ and let $\epsilon_0\sim\mathcal{N}(0,1)$. By the Gaussian tail probability, we have

\[
\begin{aligned}
\mathbb{P}_{\mathbb{H}_0}(\phi=1)+\mathbb{P}_{\mathbb{H}_1}(\phi=0)&=\mathbb{P}\left(2(\theta^*_2-\theta^*_1)^T(\Sigma^*)^{-1}(\theta^*_1+\epsilon)\geq\theta_2^{*T}(\Sigma^*)^{-1}\theta^*_2-\theta_1^{*T}(\Sigma^*)^{-1}\theta^*_1\right)\\
&\quad+\mathbb{P}\left(2(\theta^*_2-\theta^*_1)^T(\Sigma^*)^{-1}(\theta^*_2+\epsilon)<\theta_2^{*T}(\Sigma^*)^{-1}\theta^*_2-\theta_1^{*T}(\Sigma^*)^{-1}\theta^*_1\right)\\
&=2\mathbb{P}\left(2(\theta^*_2-\theta^*_1)^T(\Sigma^*)^{-1}(\theta^*_1+\epsilon)\geq\theta_2^{*T}(\Sigma^*)^{-1}\theta^*_2-\theta_1^{*T}(\Sigma^*)^{-1}\theta^*_1\right)\\
&=2\mathbb{P}\left(\epsilon_0>\frac{1}{2}\|(\theta^*_2-\theta^*_1)^T(\Sigma^*)^{-\frac{1}{2}}\|\right)\\
&\geq C\min\left\{1,\frac{1}{\|(\theta^*_2-\theta^*_1)^T(\Sigma^*)^{-\frac{1}{2}}\|}\exp\left(-\frac{\|(\theta^*_2-\theta^*_1)^T(\Sigma^*)^{-\frac{1}{2}}\|^2}{8}\right)\right\},
\end{aligned}
\]

for some constant C>0C>0. The proof is complete. ∎
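The last inequality uses the standard Gaussian tail (Mills ratio) bounds, which we record here for completeness: for $\epsilon_0\sim\mathcal{N}(0,1)$ and $t>0$,
\[
\frac{t}{t^2+1}\cdot\frac{e^{-t^2/2}}{\sqrt{2\pi}}\leq\mathbb{P}(\epsilon_0>t)\leq\frac{1}{t}\cdot\frac{e^{-t^2/2}}{\sqrt{2\pi}},
\]
applied with $t=\frac{1}{2}\|(\theta^*_2-\theta^*_1)^T(\Sigma^*)^{-\frac{1}{2}}\|$, so that $e^{-t^2/2}=\exp(-\|(\theta^*_2-\theta^*_1)^T(\Sigma^*)^{-\frac{1}{2}}\|^2/8)$.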

Proof of Theorem 2.1.

We adopt the idea from [15]. Without loss of generality, assume the minimum in (2) is achieved at $a=1,b=2$, so that $\textsf{SNR}=\|(\theta^*_1-\theta^*_2)^T(\Sigma^*)^{-\frac{1}{2}}\|$. Consider an arbitrary $\bar{z}\in[k]^n$ such that $|\{i\in[n]:\bar{z}_i=a\}|\geq\lceil\frac{n}{k}-\frac{n}{8k^2}\rceil$ for any $a\in[k]$. Then for each $a\in[k]$, we can choose a subset of $\{i\in[n]:\bar{z}_i=a\}$ with cardinality $\lceil\frac{n}{k}-\frac{n}{8k^2}\rceil$, denoted by $T_a$. Let $T=\cup_{a\in[k]}T_a$. Then we can define a parameter space

\[
\mathcal{Z}=\left\{z\in[k]^n:z_i=\bar{z}_i\text{ for all }i\in T\text{ and }z_i\in\{1,2\}\text{ if }i\in T^c\right\}.
\]

Notice that for any $z\neq\tilde{z}\in\mathcal{Z}$, we have $\frac{1}{n}\sum_{i=1}^n\mathbb{I}\{z_i\neq\tilde{z}_i\}\leq\frac{k}{n}\frac{n}{8k^2}=\frac{1}{8k}$ and $\frac{1}{n}\sum_{i=1}^n\mathbb{I}\{\psi(z_i)\neq\tilde{z}_i\}\geq\frac{1}{n}(\frac{n}{2k}-\frac{n}{8k^2})\geq\frac{1}{4k}$ for any non-identity permutation $\psi$ on $[k]$. Thus we can conclude

\[
h(z,\tilde{z})=\frac{1}{n}\sum_{i=1}^n\mathbb{I}\{z_i\neq\tilde{z}_i\},\quad\text{for all }z,\tilde{z}\in\mathcal{Z}.
\]

We notice that

\[
\begin{aligned}
\inf_{\hat{z}}\sup_{z^*\in[k]^n}\mathbb{E}h(\hat{z},z^*)&\geq\inf_{\hat{z}}\sup_{z^*\in\mathcal{Z}}\mathbb{E}h(\hat{z},z^*)\\
&\geq\inf_{\hat{z}}\frac{1}{|\mathcal{Z}|}\sum_{z^*\in\mathcal{Z}}\mathbb{E}h(\hat{z},z^*)\\
&\geq\frac{1}{n}\sum_{i\in T^c}\inf_{\hat{z}_i}\frac{1}{|\mathcal{Z}|}\sum_{z^*\in\mathcal{Z}}\mathbb{P}_{z^*}(\hat{z}_i\neq z^*_i).
\end{aligned}
\]

Now consider a fixed $i\in T^c$. Define $\mathcal{Z}_a=\{z\in\mathcal{Z}:z_i=a\}$ for $a=1,2$. Then we can see $\mathcal{Z}=\mathcal{Z}_1\cup\mathcal{Z}_2$ and $\mathcal{Z}_1\cap\mathcal{Z}_2=\emptyset$. What is more, there exists a one-to-one mapping $f(\cdot)$ between $\mathcal{Z}_1$ and $\mathcal{Z}_2$ such that for any $z\in\mathcal{Z}_1$, we have $f(z)\in\mathcal{Z}_2$ with $[f(z)]_j=z_j$ for any $j\neq i$ and $[f(z)]_i=2$. Hence, we can reduce the problem to a two-point testing problem and then apply Lemma 2.1. We first consider the case $\textsf{SNR}\rightarrow\infty$. We have

\[
\begin{aligned}
\inf_{\hat{z}_i}\frac{1}{|\mathcal{Z}|}\sum_{z^*\in\mathcal{Z}}\mathbb{P}_{z^*}(\hat{z}_i\neq z^*_i)&=\inf_{\hat{z}_i}\frac{1}{|\mathcal{Z}|}\sum_{z^*\in\mathcal{Z}_1}\left(\mathbb{P}_{z^*}(\hat{z}_i\neq 1)+\mathbb{P}_{f(z^*)}(\hat{z}_i\neq 2)\right)\\
&\geq\frac{1}{|\mathcal{Z}|}\sum_{z^*\in\mathcal{Z}_1}\inf_{\hat{z}_i}\left(\mathbb{P}_{z^*}(\hat{z}_i\neq 1)+\mathbb{P}_{f(z^*)}(\hat{z}_i\neq 2)\right)\\
&\geq\frac{|\mathcal{Z}_1|}{|\mathcal{Z}|}\exp\left(-(1+\eta)\frac{\textsf{SNR}^2}{8}\right)\\
&\geq\frac{1}{2}\exp\left(-(1+\eta)\frac{\textsf{SNR}^2}{8}\right),
\end{aligned}
\]

for some $\eta=o(1)$. Here the second inequality is due to Lemma 2.1. Then,

\[
\inf_{\hat{z}}\sup_{z^*\in[k]^n}\mathbb{E}h(\hat{z},z^*)\geq\frac{|T^c|}{2n}\exp\left(-(1+\eta)\frac{\textsf{SNR}^2}{8}\right)=\frac{1}{16k}\exp\left(-(1+\eta)\frac{\textsf{SNR}^2}{8}\right)=\exp\left(-(1+\eta')\frac{\textsf{SNR}^2}{8}\right),
\]

for some other $\eta'=o(1)$, where we use $\textsf{SNR}^2/\log k\rightarrow\infty$.

The proof for the case $\textsf{SNR}=O(1)$ is similar and hence is omitted here. ∎

5.2 Proofs for The Upper Bound

In this section, we prove Theorem 2.2 using the framework developed in [8] for analyzing iterative algorithms. The key idea in establishing statistical guarantees for the proposed iterative algorithm (i.e., Algorithm 1) is to perform a "one-step" analysis. That is, assume we have an estimate $z$ of $z^*$. Then we can apply (7), (8), and (9) to $z$ to obtain $\{\hat{\theta}_a(z)\}_{a\in[k]}$, $\hat{\Sigma}(z)$, and $\hat{z}(z)$ sequentially, which all depend on $z$. Then $\hat{z}(z)$ can be seen as a refined estimate of $z^*$. We first build the connection between $\ell(z,z^*)$ and $\ell(\hat{z}(z),z^*)$ in Lemma 5.1. To establish the connection, we decompose the loss $\ell(\hat{z}(z),z^*)$ into several errors according to the difference in their behaviors. Then we give conditions (Conditions 5.2.1 - 5.2.3) under which these errors are either negligible or well controlled by $\ell(z,z^*)$. With Lemma 5.1 established, in Lemma 5.2 we show the connection can be extended to multiple iterations, under two more conditions (Conditions 5.2.4 - 5.2.5). Last, we show all these conditions hold with high probability, and hence prove Theorem 2.2.

In the statement of Theorem 2.2, the covariance matrix $\Sigma^*$ is assumed to satisfy $\lambda_d(\Sigma^*)/\lambda_1(\Sigma^*)=O(1)$. Without loss of generality, we can replace this by assuming $\Sigma^*$ satisfies

\[
\lambda_{\min}\leq\lambda_1(\Sigma^*)\leq\lambda_d(\Sigma^*)\leq\lambda_{\max}, \tag{15}
\]

where $\lambda_{\min},\lambda_{\max}>0$ are two constants. This is due to the following simple argument using the scaling properties of normal distributions. Let $\{Y_j\}$ be some dataset generated according to Model 1 with parameters $\{\theta^*_a\}_{a\in[k]}$, $\Sigma^*$, and $z^*$. The assumption $\lambda_d(\Sigma^*)/\lambda_1(\Sigma^*)=O(1)$ is equivalent to assuming there exist some constants $\lambda_{\min},\lambda_{\max}>0$ and some quantity $\sigma>0$ that may depend on $n$ such that $\lambda_{\min}\sigma^2\leq\lambda_1(\Sigma^*)\leq\lambda_d(\Sigma^*)\leq\lambda_{\max}\sigma^2$. Then, performing a scaling transformation, we obtain another dataset $Y'_j=Y_j/\sigma$. Note that: 1) $\{Y'_j\}$ can be seen as generated from Model 1 with parameters $\{\theta^*_a/\sigma\}_{a\in[k]}$, $\Sigma^*/\sigma^2$, and $z^*$; 2) clustering on $\{Y_j\}$ is equivalent to clustering on $\{Y'_j\}$; 3) by the definition in (2), the SNRs associated with the data generating processes of $\{Y'_j\}$ and $\{Y_j\}$ are exactly equal to each other; and 4) we have $\lambda_{\min}\leq\lambda_1(\Sigma^*/\sigma^2)\leq\lambda_d(\Sigma^*/\sigma^2)\leq\lambda_{\max}$. Hence, in this section, we will assume (15) holds without any loss of generality.

In the proof, we will mainly use the loss $\ell(\cdot,\cdot)$ for convenience. Recall $\Delta$, defined in (3) as the minimum distance among centers. We have

\[
h(z,z^*)\leq\frac{\ell(z,z^*)}{n\Delta^2}. \tag{16}
\]

The algorithmic guarantees Lemma 5.1 and Lemma 5.2 are established with respect to the $\ell(\cdot,\cdot)$ loss, but eventually we will use (16) to convert them into a result with respect to $h(\cdot,\cdot)$ in the proof of Theorem 2.2.

Error Decomposition for the One-step Analysis:

Consider an arbitrary $z\in[k]^n$. Apply (7), (8), and (9) to $z$ to obtain $\{\hat{\theta}_a(z)\}_{a\in[k]}$, $\hat{\Sigma}(z)$, and $\hat{z}(z)$:

\[
\begin{aligned}
\hat{\theta}_a(z)&=\frac{\sum_{j\in[n]}Y_j\mathbb{I}\{z_j=a\}}{\sum_{j\in[n]}\mathbb{I}\{z_j=a\}},\quad a\in[k],\\
\hat{\Sigma}(z)&=\frac{\sum_{a\in[k]}\sum_{j\in[n]}(Y_j-\hat{\theta}_a(z))(Y_j-\hat{\theta}_a(z))^T\mathbb{I}\{z_j=a\}}{n},\\
\hat{z}_j(z)&=\mathop{\rm argmin}_{a\in[k]}(Y_j-\hat{\theta}_a(z))^T(\hat{\Sigma}(z))^{-1}(Y_j-\hat{\theta}_a(z)),\quad j\in[n].
\end{aligned}
\]

For simplicity, we write $\hat{z}$ as shorthand for $\hat{z}(z)$. Let $j\in[n]$ be an arbitrary index with $z^*_j=a$. According to (9), $z^*_j$ will be incorrectly estimated after one iteration, i.e., in $\hat{z}_j$, if $a\neq\mathop{\rm argmin}_{a'\in[k]}(Y_j-\hat{\theta}_{a'}(z))^T(\hat{\Sigma}(z))^{-1}(Y_j-\hat{\theta}_{a'}(z))$. That is, it is important to analyze the event

\[
\langle Y_j-\hat{\theta}_b(z),(\hat{\Sigma}(z))^{-1}(Y_j-\hat{\theta}_b(z))\rangle\leq\langle Y_j-\hat{\theta}_a(z),(\hat{\Sigma}(z))^{-1}(Y_j-\hat{\theta}_a(z))\rangle, \tag{17}
\]

for any $b\in[k]\setminus\{a\}$. Note that $Y_j=\theta^*_a+\epsilon_j$. After some rearrangements, we can see (17) is equivalent to

\[
\langle\epsilon_j,(\hat{\Sigma}(z^*))^{-1}(\hat{\theta}_a(z^*)-\hat{\theta}_b(z^*))\rangle\leq-\frac{1}{2}\langle\theta^*_a-\theta^*_b,(\Sigma^*)^{-1}(\theta^*_a-\theta^*_b)\rangle+F_j(a,b,z)+G_j(a,b,z)+H_j(a,b,z),
\]

where

\[
\begin{aligned}
F_j(a,b,z)&=\langle\epsilon_j,(\hat{\Sigma}(z))^{-1}(\hat{\theta}_b(z)-\hat{\theta}_b(z^*))\rangle-\langle\epsilon_j,(\hat{\Sigma}(z))^{-1}(\hat{\theta}_a(z)-\hat{\theta}_a(z^*))\rangle\\
&\quad+\langle\epsilon_j,((\hat{\Sigma}(z))^{-1}-(\hat{\Sigma}(z^*))^{-1})(\hat{\theta}_b(z^*)-\hat{\theta}_a(z^*))\rangle,
\end{aligned}
\]
\[
\begin{aligned}
G_j(a,b,z)&=\left(\frac{1}{2}\langle\theta^*_a-\hat{\theta}_a(z),(\hat{\Sigma}(z))^{-1}(\theta^*_a-\hat{\theta}_a(z))\rangle-\frac{1}{2}\langle\theta^*_a-\hat{\theta}_a(z^*),(\hat{\Sigma}(z))^{-1}(\theta^*_a-\hat{\theta}_a(z^*))\rangle\right)\\
&\quad+\left(\frac{1}{2}\langle\theta^*_a-\hat{\theta}_a(z^*),(\hat{\Sigma}(z))^{-1}(\theta^*_a-\hat{\theta}_a(z^*))\rangle-\frac{1}{2}\langle\theta^*_a-\hat{\theta}_a(z^*),(\hat{\Sigma}(z^*))^{-1}(\theta^*_a-\hat{\theta}_a(z^*))\rangle\right)\\
&\quad+\left(-\frac{1}{2}\langle\theta^*_a-\hat{\theta}_b(z),(\hat{\Sigma}(z))^{-1}(\theta^*_a-\hat{\theta}_b(z))\rangle+\frac{1}{2}\langle\theta^*_a-\hat{\theta}_b(z^*),(\hat{\Sigma}(z))^{-1}(\theta^*_a-\hat{\theta}_b(z^*))\rangle\right)\\
&\quad+\left(-\frac{1}{2}\langle\theta^*_a-\hat{\theta}_b(z^*),(\hat{\Sigma}(z))^{-1}(\theta^*_a-\hat{\theta}_b(z^*))\rangle+\frac{1}{2}\langle\theta^*_a-\hat{\theta}_b(z^*),(\hat{\Sigma}(z^*))^{-1}(\theta^*_a-\hat{\theta}_b(z^*))\rangle\right),
\end{aligned}
\]

and

\[
\begin{aligned}
H_j(a,b,z)&=\left(-\frac{1}{2}\langle\theta^*_a-\hat{\theta}_b(z^*),(\hat{\Sigma}(z^*))^{-1}(\theta^*_a-\hat{\theta}_b(z^*))\rangle+\frac{1}{2}\langle\theta^*_a-\theta^*_b,(\hat{\Sigma}(z^*))^{-1}(\theta^*_a-\theta^*_b)\rangle\right)\\
&\quad+\left(-\frac{1}{2}\langle\theta^*_a-\theta^*_b,(\hat{\Sigma}(z^*))^{-1}(\theta^*_a-\theta^*_b)\rangle+\frac{1}{2}\langle\theta^*_a-\theta^*_b,(\Sigma^*)^{-1}(\theta^*_a-\theta^*_b)\rangle\right)\\
&\quad+\left(\frac{1}{2}\langle\theta^*_a-\hat{\theta}_a(z^*),(\hat{\Sigma}(z^*))^{-1}(\theta^*_a-\hat{\theta}_a(z^*))\rangle\right).
\end{aligned}
\]

Here $\langle\epsilon_j,(\hat{\Sigma}(z^*))^{-1}(\hat{\theta}_a(z^*)-\hat{\theta}_b(z^*))\rangle\leq-\frac{1}{2}\langle\theta^*_a-\theta^*_b,(\Sigma^*)^{-1}(\theta^*_a-\theta^*_b)\rangle$ is the main term that leads to the optimal rate. Among all the remaining terms, $F_j(a,b,z)$ includes all terms involving $\epsilon_j$, $G_j(a,b,z)$ includes all terms related to $z$, and $H_j(a,b,z)$ consists of terms that only involve $z^*$. Readers can refer to [8] for more information about the decomposition.

Conditions and Guarantees for One-step Analysis.

To establish the guarantee for the one-step analysis, we first give several conditions on the error terms $F_j(a,b,z)$, $G_j(a,b,z)$, and $H_j(a,b,z)$.

Condition 5.2.1.

Assume that

\[
\max_{\{z:\ell(z,z^*)\leq\tau\}}\max_{j\in[n]}\max_{b\in[k]\setminus\{z^*_j\}}\frac{|H_j(z^*_j,b,z)|}{\langle\theta^*_{z^*_j}-\theta^*_b,(\Sigma^*)^{-1}(\theta^*_{z^*_j}-\theta^*_b)\rangle}\leq\frac{\delta}{4}
\]

holds with probability at least $1-\eta_1$ for some $\tau,\delta,\eta_1>0$.

Condition 5.2.2.

Assume that

\[
\max_{\{z:\ell(z,z^*)\leq\tau\}}\sum_{j=1}^n\max_{b\in[k]\setminus\{z^*_j\}}\frac{F_j(z^*_j,b,z)^2\|\theta^*_{z^*_j}-\theta^*_b\|^2}{\langle\theta^*_{z^*_j}-\theta^*_b,(\Sigma^*)^{-1}(\theta^*_{z^*_j}-\theta^*_b)\rangle^2\,\ell(z,z^*)}\leq\frac{\delta^2}{256}
\]

holds with probability at least $1-\eta_2$ for some $\tau,\delta,\eta_2>0$.

Condition 5.2.3.

Assume that

\[
\max_{\{z:\ell(z,z^*)\leq\tau\}}\max_{j\in[n]}\max_{b\in[k]\setminus\{z^*_j\}}\frac{|G_j(z^*_j,b,z)|}{\langle\theta^*_{z^*_j}-\theta^*_b,(\Sigma^*)^{-1}(\theta^*_{z^*_j}-\theta^*_b)\rangle}\leq\frac{\delta}{8}
\]

holds with probability at least $1-\eta_{3}$ for some $\tau,\delta,\eta_{3}>0$.

We next define a quantity that we refer to as the ideal error:

\displaystyle\xi_{\text{ideal}}(\delta)=\sum_{j=1}^{n}\sum_{b\in[k]\setminus\{z_{j}^{*}\}}\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|^{2}\mathbb{I}\{\langle\epsilon_{j},(\hat{\Sigma}(z^{*}))^{-1}(\hat{\theta}_{z_{j}^{*}}(z^{*})-\hat{\theta}_{b}(z^{*}))\rangle\leq-\frac{1-\delta}{2}\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle\}.
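Since $\xi_{\text{ideal}}(\delta)$ involves only quantities that are computable once the true labels are known, it can be evaluated directly in simulations. The following Python sketch is a simulation-only illustration: the function and variable names are ours, and we take $\hat{\Sigma}(z^{*})$ to be the pooled residual covariance, consistent with the identity displayed later in this section.

```python
import numpy as np

def xi_ideal(Y, z_star, theta_star, Sigma_star, delta):
    """Evaluate xi_ideal(delta) in a simulation where the true labels z_star,
    centers theta_star (k x d), and covariance Sigma_star are known."""
    n, d = Y.shape
    k = theta_star.shape[0]
    eps = Y - theta_star[z_star]                      # noise vectors epsilon_j
    # oracle estimates computed from the true labels z_star
    theta_hat = np.stack([Y[z_star == a].mean(axis=0) for a in range(k)])
    resid = Y - theta_hat[z_star]
    Sigma_hat_inv = np.linalg.inv(resid.T @ resid / n)   # (Sigma_hat(z*))^{-1}
    Sigma_inv = np.linalg.inv(Sigma_star)
    total = 0.0
    for j in range(n):
        a = z_star[j]
        for b in range(k):
            if b == a:
                continue
            diff = theta_star[a] - theta_star[b]
            lhs = eps[j] @ Sigma_hat_inv @ (theta_hat[a] - theta_hat[b])
            if lhs <= -(1 - delta) / 2 * diff @ Sigma_inv @ diff:
                total += diff @ diff                  # ||theta_a* - theta_b*||^2
    return total
```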
Lemma 5.1.

Assume Conditions 5.2.1 - 5.2.3 hold for some $\tau,\delta,\eta_{1},\eta_{2},\eta_{3}>0$. We then have

((z^,z)2ξideal(δ)+14(z,z) for any z[k]n such that (z,z)τ)1η,\displaystyle\mathbb{P}\left(\ell(\hat{z},z^{*})\leq 2\xi_{\text{ideal}}(\delta)+\frac{1}{4}\ell(z,z^{*})\text{ for any $z\in[k]^{n}$ such that $\ell(z,z^{*})\leq\tau$}\right)\geq 1-\eta,

where η=i=13ηi\eta=\sum_{i=1}^{3}\eta_{i}.

Proof.

Fix any $j\in[n]$ and write $a=z_{j}^{*}$ for short. We notice that

𝕀{z^j=b}\displaystyle\mathbb{I}\left\{\hat{z}_{j}=b\right\} 𝕀{Yjθ^b(z),(Σ^(z))1(Yjθ^b(z))Yjθ^a(z),(Σ^(z))1(Yjθ^a(z))}\displaystyle\leq{\mathbb{I}\left\{{\langle Y_{j}-\hat{\theta}_{b}(z),(\hat{\Sigma}(z))^{-1}(Y_{j}-\hat{\theta}_{b}(z))\rangle\leq\langle Y_{j}-\hat{\theta}_{a}(z),(\hat{\Sigma}(z))^{-1}(Y_{j}-\hat{\theta}_{a}(z))\rangle}\right\}}
=\displaystyle=~{} 𝕀{ϵj,(Σ^(z))1(θ^zj(z)θ^b(z))\displaystyle\mathbb{I}\bigg{\{}\langle\epsilon_{j},(\hat{\Sigma}(z^{*}))^{-1}(\hat{\theta}_{z_{j}^{*}}(z^{*})-\hat{\theta}_{b}(z^{*}))\rangle
12θzjθb,(Σ)1(θzjθb)+Fj(zj,b,z)+Gj(zj,b,z)+Hj(zj,b,z)}\displaystyle\leq-\frac{1}{2}\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle+F_{j}(z_{j}^{*},b,z)+G_{j}(z_{j}^{*},b,z)+H_{j}(z_{j}^{*},b,z)\bigg{\}}
\displaystyle\leq~{} 𝕀{ϵj,(Σ^(z))1(θ^zj(z)θ^b(z))1δ2θzjθb,(Σ)1(θzjθb)}\displaystyle\mathbb{I}\left\{\langle\epsilon_{j},(\hat{\Sigma}(z^{*}))^{-1}(\hat{\theta}_{z_{j}^{*}}(z^{*})-\hat{\theta}_{b}(z^{*}))\rangle\leq-\frac{1-\delta}{2}\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle\right\}
+𝕀{δ2θzjθb,(Σ)1(θzjθb)Fj(zj,b,z)+Gj(zj,b,z)+Hj(zj,b,z)}\displaystyle+\mathbb{I}\left\{\frac{\delta}{2}\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle\leq F_{j}(z_{j}^{*},b,z)+G_{j}(z_{j}^{*},b,z)+H_{j}(z_{j}^{*},b,z)\right\}
\displaystyle\leq~{} 𝕀{ϵj,(Σ^(z))1(θ^zj(z)θ^b(z))1δ2θzjθb,(Σ)1(θzjθb)}\displaystyle\mathbb{I}\left\{\langle\epsilon_{j},(\hat{\Sigma}(z^{*}))^{-1}(\hat{\theta}_{z_{j}^{*}}(z^{*})-\hat{\theta}_{b}(z^{*}))\rangle\leq-\frac{1-\delta}{2}\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle\right\}
+𝕀{δ8θzjθb,(Σ)1(θzjθb)Fj(zj,b,z)}\displaystyle+\mathbb{I}\left\{\frac{\delta}{8}\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle\leq F_{j}(z_{j}^{*},b,z)\right\}
\displaystyle\leq~{} 𝕀{ϵj,(Σ^(z))1(θ^zj(z)θ^b(z))1δ2θzjθb,(Σ)1(θzjθb)}\displaystyle\mathbb{I}\left\{\langle\epsilon_{j},(\hat{\Sigma}(z^{*}))^{-1}(\hat{\theta}_{z_{j}^{*}}(z^{*})-\hat{\theta}_{b}(z^{*}))\rangle\leq-\frac{1-\delta}{2}\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle\right\}
+64Fj(zj,b,z)2δ2θzjθb,(Σ)1(θzjθb)2,\displaystyle+\frac{64F_{j}(z_{j}^{*},b,z)^{2}}{\delta^{2}\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle^{2}},

where the third inequality comes from Conditions 5.2.1 and 5.2.3. Thus, we have

(z^,z)\displaystyle\ell(\hat{z},z^{*})
\displaystyle=~\sum_{j=1}^{n}\sum_{b\in[k]\setminus\{z_{j}^{*}\}}\left\|\theta_{b}^{*}-\theta_{z_{j}^{*}}^{*}\right\|^{2}\mathbb{I}\left\{\hat{z}_{j}=b\right\}
\displaystyle\leq~\xi_{\text{ideal}}(\delta)+\sum_{j=1}^{n}\sum_{b\in[k]\setminus\{z_{j}^{*}\}}\left\|\theta_{b}^{*}-\theta_{z_{j}^{*}}^{*}\right\|^{2}\mathbb{I}\left\{\hat{z}_{j}=b\right\}\frac{64F_{j}(z_{j}^{*},b,z)^{2}}{\delta^{2}\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle^{2}}
\displaystyle\leq~{} ξideal(δ)+(z,z)4,\displaystyle\xi_{\text{ideal}}(\delta)+\frac{\ell(z,z^{*})}{4},

which implies Lemma 5.1. Here the last inequality uses Condition 5.2.2. ∎

Conditions and Guarantees for Multiple Iterations.

In the above we established a statistical guarantee for the one-step analysis. Now we extend the result to multiple iterations. That is, starting from some initialization $z^{(0)}$, we characterize how the losses $\ell(z^{(0)},z^{*})$, $\ell(z^{(1)},z^{*})$, $\ell(z^{(2)},z^{*})$, …, decay. We impose a condition on $\xi_{\text{ideal}}(\delta)$ and a condition on $z^{(0)}$.

Condition 5.2.4.

Assume that

ξideal(δ)3τ8\displaystyle\xi_{\text{ideal}}(\delta)\leq\frac{3\tau}{8}

holds with probability at least $1-\eta_{4}$ for some $\tau,\delta,\eta_{4}>0$.

Finally, we need a condition on the initialization.

Condition 5.2.5.

Assume that

(z(0),z)τ\displaystyle\ell(z^{(0)},z^{*})\leq\tau

holds with probability at least $1-\eta_{5}$ for some $\tau,\eta_{5}>0$.

With these conditions satisfied, we can give a lemma that shows the linear convergence guarantee for our algorithm.

Lemma 5.2.

Assume Conditions 5.2.1 - 5.2.5 hold for some $\tau,\delta,\eta_{1},\eta_{2},\eta_{3},\eta_{4},\eta_{5}>0$. We then have

(z(t),z)2ξideal(δ)+14(z(t1),z)\displaystyle\ell(z^{(t)},z^{*})\leq 2\xi_{\text{ideal}}(\delta)+\frac{1}{4}\ell(z^{(t-1)},z^{*})

for all t1t\geq 1, with probability at least 1η1-\eta, where η=i=15ηi\eta=\sum_{i=1}^{5}\eta_{i}.

Proof.

By Conditions 5.2.4 and 5.2.5 and a mathematical induction argument, we have $\ell(z^{(t)},z^{*})\leq\tau$ for any $t\geq 0$: the base case is Condition 5.2.5, and if $\ell(z^{(t-1)},z^{*})\leq\tau$, then Lemma 5.1 and Condition 5.2.4 give $\ell(z^{(t)},z^{*})\leq 2\cdot\frac{3\tau}{8}+\frac{\tau}{4}=\tau$. Thus, Lemma 5.2 follows directly from Lemma 5.1. ∎
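For later use, note that unrolling the recursion in Lemma 5.2 makes the geometric decay explicit: iterating the bound $t$ times and using Condition 5.2.5 gives

$$\ell(z^{(t)},z^{*})\leq 2\xi_{\text{ideal}}(\delta)\sum_{s=0}^{t-1}4^{-s}+4^{-t}\ell(z^{(0)},z^{*})\leq\frac{8}{3}\xi_{\text{ideal}}(\delta)+4^{-t}\tau,$$

so after $t$ iterations the loss is at most a constant multiple of $\xi_{\text{ideal}}(\delta)$ plus a geometrically vanishing remainder; this is the form used in the proof of Theorem 2.2 below.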

High-probability Results for the Conditions and the Proof of the Main Theorem.

Recall the definition of $\Delta$ in (3). Recall that in (15) we assume $\lambda_{\min}\leq\lambda_{1}(\Sigma^{*})\leq\lambda_{d}(\Sigma^{*})\leq\lambda_{\max}$ for two constants $\lambda_{\min},\lambda_{\max}>0$. Hence $\Delta$ is of the same order as SNR. Specifically, we have

1λmaxΔSNR1λminΔ.\displaystyle\frac{1}{\lambda_{\max}}\Delta\leq\textsf{SNR}\leq\frac{1}{\lambda_{\min}}\Delta. (18)

Hence the assumption $\textsf{SNR}/k\rightarrow\infty$ in the statement of Theorem 2.2 is equivalent to $\Delta/k\rightarrow\infty$. Next, we give two lemmas to verify the Conditions. The first lemma shows that $\delta$ can be taken to be some $o(1)$ term, and the second lemma shows that for any $\delta=o(1)$, $\xi_{\text{ideal}}(\delta)$ is upper bounded by the desired minimax rate multiplied by the sample size $n$.
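As a quick numerical sanity check of this order equivalence (a sketch under the assumption that SNR is, up to the paper's normalization, the minimum Mahalanobis separation $\min_{a\neq b}\langle\theta_{a}^{*}-\theta_{b}^{*},(\Sigma^{*})^{-1}(\theta_{a}^{*}-\theta_{b}^{*})\rangle^{1/2}$; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 3
theta = 10 * rng.normal(size=(k, d))          # well-separated centers
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)                   # a positive definite covariance
lam = np.linalg.eigvalsh(Sigma)
Sigma_inv = np.linalg.inv(Sigma)

pairs = [(a, b) for a in range(k) for b in range(k) if a < b]
Delta = min(np.linalg.norm(theta[a] - theta[b]) for a, b in pairs)
sep = min(np.sqrt((theta[a] - theta[b]) @ Sigma_inv @ (theta[a] - theta[b]))
          for a, b in pairs)

# the Mahalanobis separation is sandwiched between Delta / sqrt(lambda_max)
# and Delta / sqrt(lambda_min), so it is of the same order as Delta
assert Delta / np.sqrt(lam.max()) <= sep <= Delta / np.sqrt(lam.min())
```

The assertion holds for any positive definite $\Sigma^{*}$, which is the sandwich argument behind (18), up to the normalization of SNR.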

Lemma 5.3.

Under the same conditions as in Theorem 2.2, for any constant C>0C^{\prime}>0, there exists some constant C>0C>0 only depending on α\alpha and CC^{\prime} such that

max{z:(z,z)τ}maxj[n]maxb[k]{zj}|Hj(zj,b,z)|θzjθb,(Σ)1(θzjθb)\displaystyle\max_{\{z:\ell(z,z^{\ast})\leq\tau\}}\max_{j\in[n]}\max_{b\in[k]\setminus\{z_{j}^{*}\}}\frac{|H_{j}(z_{j}^{*},b,z)|}{\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma^{\ast})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle} Ck(d+logn)n\displaystyle\leq C\sqrt{\frac{k(d+\log n)}{n}} (19)
max{z:(z,z)τ}j=1nmaxb[k]{zj}Fj(zj,b,z)2θzjθb2θzjθb,(Σ)1(θzjθb)2(z,z)\displaystyle\max_{\{z:\ell(z,z^{\ast})\leq\tau\}}\sum_{j=1}^{n}\max_{b\in[k]\setminus\{z_{j}^{*}\}}\frac{F_{j}(z_{j}^{*},b,z)^{2}\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|^{2}}{\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma^{\ast})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle^{2}\ell(z,z^{\ast})} Ck3(τn+1Δ2+d2nΔ2)\displaystyle\leq Ck^{3}\left(\frac{\tau}{n}+\frac{1}{\Delta^{2}}+\frac{d^{2}}{n\Delta^{2}}\right) (20)
max{z:(z,z)τ}maxj[n]maxb[k]{zj}|Gj(zj,b,z)|θzjθb,(Σ)1(θzjθb)\displaystyle\max_{\{z:\ell(z,z^{\ast})\leq\tau\}}\max_{j\in[n]}\max_{b\in[k]\setminus\{z_{j}^{*}\}}\frac{|G_{j}(z_{j}^{*},b,z)|}{\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma^{\ast})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle} Ck(τn+1Δτn+dτnΔ)\displaystyle\leq Ck\left(\frac{\tau}{n}+\frac{1}{\Delta}\sqrt{\frac{\tau}{n}}+\frac{d\sqrt{\tau}}{n\Delta}\right) (21)

with probability at least 1nC1-n^{-C^{\prime}}.

Proof.

Under the conditions of Theorem 2.2, the inequalities (33)-(38) hold with probability at least $1-n^{-C^{\prime}}$. In the remaining proof, we will work on the event that these inequalities hold. Denote $\hat{\Sigma}_{a}(z)=\frac{\sum_{j\in[n]}(Y_{j}-\hat{\theta}_{a}(z))(Y_{j}-\hat{\theta}_{a}(z))^{T}\mathbb{I}\left\{z_{j}=a\right\}}{\sum_{j\in[n]}\mathbb{I}\left\{z_{j}=a\right\}}$ and $\Sigma^{*}_{a}=\Sigma^{*}$ for any $a\in[k]$. Then we have the identity

Σ^(z)Σ=a=1kj=1n𝕀{zj=a}n(Σ^a(z)Σa).\displaystyle\hat{\Sigma}(z^{*})-\Sigma^{*}=\sum_{a=1}^{k}\frac{\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}}{n}(\hat{\Sigma}_{a}(z^{*})-\Sigma_{a}^{*}).

Hence, we can use the results from Lemma 7.7 and Lemma 7.8.
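This identity is purely algebraic: the pooled covariance $\hat{\Sigma}(z^{*})$ is the cluster-size-weighted average of the within-cluster covariances $\hat{\Sigma}_{a}(z^{*})$. A quick numerical check may be reassuring (the variable names are ours, with random labels standing in for $z^{*}$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 200, 4, 3
z = rng.integers(0, k, size=n)                # plays the role of z*
Y = rng.normal(size=(n, d))

theta_hat = np.stack([Y[z == a].mean(axis=0) for a in range(k)])

def cov_within(a):
    """Within-cluster covariance Sigma_hat_a(z)."""
    R = Y[z == a] - theta_hat[a]
    return R.T @ R / (z == a).sum()

# pooled covariance Sigma_hat(z): residuals around the fitted cluster means
R = Y - theta_hat[z]
Sigma_pooled = R.T @ R / n

weighted = sum((z == a).mean() * cov_within(a) for a in range(k))
assert np.allclose(Sigma_pooled, weighted)    # the weighted-average identity
```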

By (43) and (44), we have

Σ^(z)Σ\displaystyle\|\hat{\Sigma}(z^{*})-\Sigma^{*}\|\lesssim~{} k(d+logn)n,\displaystyle\sqrt{\frac{k(d+\log n)}{n}},

and

Σ^(z)Σ^(z)=\displaystyle\|\hat{\Sigma}(z)-\hat{\Sigma}(z^{*})\|=~{} a=1kj=1n𝕀{zj=a}nΣ^a(z)a=1kj=1n𝕀{zj=a}nΣ^a(z)\displaystyle\left\|\sum_{a=1}^{k}\frac{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\}}{n}\hat{\Sigma}_{a}(z)-\sum_{a=1}^{k}\frac{\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}}{n}\hat{\Sigma}_{a}(z^{*})\right\|
\displaystyle\lesssim~\left\|\sum_{a=1}^{k}\frac{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\}}{n}(\hat{\Sigma}_{a}(z)-\hat{\Sigma}_{a}(z^{*}))\right\|+\left\|\sum_{a=1}^{k}\frac{\sum_{j=1}^{n}(\mathbb{I}\{z_{j}=a\}-\mathbb{I}\{z_{j}^{*}=a\})}{n}\hat{\Sigma}_{a}(z^{*})\right\|
\displaystyle\lesssim~{} kn(z,z)nΔ+kn(z,z)+kdnΔ(z,z)+knΔ2(z,z)\displaystyle\frac{k\sqrt{n\ell(z,z^{*})}}{n\Delta}+\frac{k}{n}\ell(z,z^{*})+\frac{kd}{n\Delta}\sqrt{\ell(z,z^{*})}+\frac{k}{n\Delta^{2}}\ell(z,z^{*})
\displaystyle\lesssim~{} kn(z,z)nΔ+kn(z,z)+kdnΔ(z,z).\displaystyle\frac{k\sqrt{n\ell(z,z^{*})}}{n\Delta}+\frac{k}{n}\ell(z,z^{*})+\frac{kd}{n\Delta}\sqrt{\ell(z,z^{*})}.

By the assumption that kd=O(n)kd=O(\sqrt{n}), Δk\frac{\Delta}{k}\rightarrow\infty and τ=o(n/k)\tau=o(n/k), we have Σ^(z)Σ,Σ^(z)Σ^(z)=o(1)\|\hat{\Sigma}(z^{*})-\Sigma^{*}\|,\|\hat{\Sigma}(z)-\hat{\Sigma}(z^{*})\|=o(1), which implies (Σ^(z))1,(Σ^(z))11\|(\hat{\Sigma}(z^{*}))^{-1}\|,\|(\hat{\Sigma}(z))^{-1}\|\lesssim 1. Thus, we have

(Σ^(z))1(Σ)1(Σ^(z))1Σ^(z)Σ(Σ)1k(d+logn)n,\displaystyle\|(\hat{\Sigma}(z^{*}))^{-1}-(\Sigma^{*})^{-1}\|\leq\|(\hat{\Sigma}(z^{*}))^{-1}\|\|\hat{\Sigma}(z^{*})-\Sigma^{*}\|\|(\Sigma^{*})^{-1}\|\lesssim\sqrt{\frac{k(d+\log n)}{n}}, (22)

and similarly

(Σ^(z))1(Σ^(z))1kn(z,z)+kn(z,z)nΔ+kdnΔ(z,z).\displaystyle\|(\hat{\Sigma}(z))^{-1}-(\hat{\Sigma}(z^{*}))^{-1}\|\lesssim\frac{k}{n}\ell(z,z^{*})+\frac{k\sqrt{n\ell(z,z^{*})}}{n\Delta}+\frac{kd}{n\Delta}\sqrt{\ell(z,z^{*})}. (23)

Now we start to prove (19)-(21). Let Fj(a,b,z)=Fj(1)(a,b,z)+Fj(2)(a,b,z)+Fj(3)(a,b,z)F_{j}(a,b,z)=F_{j}^{(1)}(a,b,z)+F_{j}^{(2)}(a,b,z)+F_{j}^{(3)}(a,b,z) where

Fj(1)(a,b,z)\displaystyle F_{j}^{(1)}(a,b,z) :=ϵj,(Σ^(z))1(θ^b(z)θ^b(z))ϵj,(Σ^(z))1(θ^a(z)θ^a(z)),\displaystyle:=\langle\epsilon_{j},(\hat{\Sigma}(z))^{-1}(\hat{\theta}_{b}(z)-\hat{\theta}_{b}(z^{*}))\rangle-\langle\epsilon_{j},(\hat{\Sigma}(z))^{-1}(\hat{\theta}_{a}(z)-\hat{\theta}_{a}(z^{*}))\rangle,
Fj(2)(a,b,z)\displaystyle F_{j}^{(2)}(a,b,z) :=ϵj,((Σ^(z))1(Σ^(z))1)(θaθb),\displaystyle:=-\langle\epsilon_{j},((\hat{\Sigma}(z))^{-1}-(\hat{\Sigma}(z^{*}))^{-1})(\theta_{a}^{*}-\theta_{b}^{*})\rangle,
Fj(3)(a,b,z)\displaystyle F_{j}^{(3)}(a,b,z) :=ϵj,((Σ^(z))1(Σ^(z))1)(θbθ^b(z))+ϵj,((Σ^(z))1(Σ^(z))1)(θaθ^a(z)).\displaystyle:=-\langle\epsilon_{j},((\hat{\Sigma}(z))^{-1}-(\hat{\Sigma}(z^{*}))^{-1})(\theta_{b}^{*}-\hat{\theta}_{b}(z^{*}))\rangle+\langle\epsilon_{j},((\hat{\Sigma}(z))^{-1}-(\hat{\Sigma}(z^{*}))^{-1})(\theta_{a}^{*}-\hat{\theta}_{a}(z^{*}))\rangle.

Notice that

j=1nmaxb[k]{zj}Fj(2)(zj,b,z)2θzjθb2θzjθb,(Σ)1(θzjθb)2(z,z)\displaystyle\sum_{j=1}^{n}\max_{b\in[k]\setminus\{z_{j}^{*}\}}\frac{F_{j}^{(2)}(z_{j}^{*},b,z)^{2}\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|^{2}}{\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma^{\ast})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle^{2}\ell(z,z^{\ast})}
\displaystyle\lesssim~{} j=1nb=1k|ϵj,((Σ^(z))1(Σ^(z))1)(θzjθb)|2θzjθb2(z,z)\displaystyle\sum_{j=1}^{n}\sum_{b=1}^{k}\frac{\bigg{|}\langle\epsilon_{j},((\hat{\Sigma}(z))^{-1}-(\hat{\Sigma}(z^{*}))^{-1})(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle\bigg{|}^{2}}{\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|^{2}\ell(z,z^{*})}
\displaystyle\leq~{} b=1ka[k]{b}j=1n𝕀{zj=a}|ϵj,((Σ^(z))1(Σ^(z))1)(θaθb)|2θaθb2(z,z)\displaystyle\sum_{b=1}^{k}\sum_{a\in[k]\setminus\{b\}}\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}\frac{\bigg{|}\langle\epsilon_{j},((\hat{\Sigma}(z))^{-1}-(\hat{\Sigma}(z^{*}))^{-1})(\theta_{a}^{*}-\theta_{b}^{*})\rangle\bigg{|}^{2}}{\|\theta_{a}^{*}-\theta_{b}^{*}\|^{2}\ell(z,z^{*})}
\displaystyle\leq~{} b=1ka[k]{b}((Σ^(z))1(Σ^(z))1)(θaθb)2θaθb2(z,z)j=1n𝕀{zj=a}ϵjϵjT\displaystyle\sum_{b=1}^{k}\sum_{a\in[k]\setminus\{b\}}\frac{\|((\hat{\Sigma}(z))^{-1}-(\hat{\Sigma}(z^{*}))^{-1})(\theta_{a}^{*}-\theta_{b}^{*})\|^{2}}{\|\theta_{a}^{*}-\theta_{b}^{*}\|^{2}\ell(z,z^{*})}\bigg{\|}\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}\epsilon_{j}\epsilon_{j}^{T}\bigg{\|}
\displaystyle\lesssim~{} k3(τn+1Δ2+d2nΔ2),\displaystyle k^{3}(\frac{\tau}{n}+\frac{1}{\Delta^{2}}+\frac{d^{2}}{n\Delta^{2}}),

where we use (34), (23), and the facts that $\ell(z,z^{*})\leq\tau$ and $kd=O(\sqrt{n})$ for the last inequality. From (41) we have $\max_{a\in[k]}\|\theta_{a}^{*}-\hat{\theta}_{a}(z^{*})\|=o(1)$ under the assumption $kd=O(\sqrt{n})$. By an analysis similar to that for $F_{j}^{(2)}(a,b,z)$, we have

j=1nmaxb[k]{zj}Fj(3)(zj,b,z)2θzjθb2θzjθb,(Σ)1(θzjθb)2(z,z)k3(τn+1Δ2+d2nΔ2).\displaystyle\sum_{j=1}^{n}\max_{b\in[k]\setminus\{z_{j}^{*}\}}\frac{F_{j}^{(3)}(z_{j}^{*},b,z)^{2}\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|^{2}}{\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma^{\ast})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle^{2}\ell(z,z^{\ast})}\lesssim k^{3}(\frac{\tau}{n}+\frac{1}{\Delta^{2}}+\frac{d^{2}}{n\Delta^{2}}).

Similarly, we have

j=1nmaxb[k]{zj}Fj(1)(zj,b,z)2θzjθb2θzjθb,(Σ)1(θzjθb)2(z,z)\displaystyle\sum_{j=1}^{n}\max_{b\in[k]\setminus\{z_{j}^{*}\}}\frac{F_{j}^{(1)}(z_{j}^{*},b,z)^{2}\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|^{2}}{\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma^{\ast})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle^{2}\ell(z,z^{\ast})}
\displaystyle\lesssim~{} b=1ka[k]{b}(Σ^(z))1(θ^a(z)θ^a(z))2θaθb2(z,z)j=1n𝕀{zj=a}ϵjϵjT\displaystyle\sum_{b=1}^{k}\sum_{a\in[k]\setminus\{b\}}\frac{\|(\hat{\Sigma}(z))^{-1}(\hat{\theta}_{a}(z)-\hat{\theta}_{a}(z^{*}))\|^{2}}{\|\theta_{a}^{*}-\theta_{b}^{*}\|^{2}\ell(z,z^{*})}\bigg{\|}\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}\epsilon_{j}\epsilon_{j}^{T}\bigg{\|}
\displaystyle\lesssim~{} k3Δ4,\displaystyle\frac{k^{3}}{\Delta^{4}},

where we use (42) and the fact that $(\hat{\Sigma}(z))^{-1}$ has bounded operator norm. Combining these bounds, we obtain (20).

Next, for (19), by (41) we have

|θaθ^b(z),(Σ^(z))1(θaθ^b(z))+θaθb,(Σ^(z))1(θaθb)|\displaystyle|-\langle\theta_{a}^{*}-\hat{\theta}_{b}(z^{*}),(\hat{\Sigma}(z^{*}))^{-1}(\theta_{a}^{*}-\hat{\theta}_{b}(z^{*}))\rangle+\langle\theta_{a}^{*}-\theta_{b}^{*},(\hat{\Sigma}(z^{*}))^{-1}(\theta_{a}^{*}-\theta_{b}^{*})\rangle|
\displaystyle\leq~{} |θbθ^b(z),(Σ^(z))1(θbθ^b(z))|+2|θbθ^b(z),(Σ^(z))1(θaθb)|\displaystyle|\langle\theta_{b}^{*}-\hat{\theta}_{b}(z^{*}),(\hat{\Sigma}(z^{*}))^{-1}(\theta_{b}^{*}-\hat{\theta}_{b}(z^{*}))\rangle|+2|\langle\theta_{b}^{*}-\hat{\theta}_{b}(z^{*}),(\hat{\Sigma}(z^{*}))^{-1}(\theta_{a}^{*}-\theta_{b}^{*})\rangle|
\displaystyle\lesssim~{} k(d+logn)n+k(d+logn)nθaθb,\displaystyle\frac{k(d+\log n)}{n}+\sqrt{\frac{k(d+\log n)}{n}}\|\theta_{a}^{*}-\theta_{b}^{*}\|,

and

θaθ^a(z),(Σ^(z))1(θaθ^a(z))k(d+logn)n.\displaystyle\langle\theta_{a}^{*}-\hat{\theta}_{a}(z^{*}),(\hat{\Sigma}(z^{*}))^{-1}(\theta_{a}^{*}-\hat{\theta}_{a}(z^{*}))\rangle\lesssim\frac{k(d+\log n)}{n}.

By (22) we have

θaθb,(Σ^(z))1(θaθb)+θaθb,(Σ)1(θaθb)k(d+logn)nθaθb2.\displaystyle-\langle\theta_{a}^{*}-\theta_{b}^{*},(\hat{\Sigma}(z^{*}))^{-1}(\theta_{a}^{*}-\theta_{b}^{*})\rangle+\langle\theta_{a}^{*}-\theta_{b}^{*},(\Sigma^{*})^{-1}(\theta_{a}^{*}-\theta_{b}^{*})\rangle\lesssim\sqrt{\frac{k(d+\log n)}{n}}\|\theta_{a}^{*}-\theta_{b}^{*}\|^{2}.

Using the results above we can get (19).

Finally we are going to establish (21). Recall the definition of Gj(a,b,z)G_{j}(a,b,z) which has four terms. For the third and fourth terms, we have

θaθ^b(z),(Σ^(z))1(θaθ^b(z))+θaθ^b(z),(Σ^(z))1(θaθ^b(z))\displaystyle-\langle\theta_{a}^{*}-\hat{\theta}_{b}(z),(\hat{\Sigma}(z))^{-1}(\theta_{a}^{*}-\hat{\theta}_{b}(z))\rangle+\langle\theta_{a}^{*}-\hat{\theta}_{b}(z^{*}),(\hat{\Sigma}(z))^{-1}(\theta_{a}^{*}-\hat{\theta}_{b}(z^{*}))\rangle
\displaystyle\lesssim~{} θ^b(z)θ^b(z)2+θ^b(z)θ^b(z)θaθb,\displaystyle\|\hat{\theta}_{b}(z)-\hat{\theta}_{b}(z^{*})\|^{2}+\|\hat{\theta}_{b}(z)-\hat{\theta}_{b}(z^{*})\|\|\theta_{a}^{*}-\theta_{b}^{*}\|,

and

θaθ^b(z),(Σ^(z))1(θaθ^b(z))+θaθ^b(z),(Σ^(z))1(θaθ^b(z))\displaystyle-\langle\theta_{a}^{*}-\hat{\theta}_{b}(z^{*}),(\hat{\Sigma}(z))^{-1}(\theta_{a}^{*}-\hat{\theta}_{b}(z^{*}))\rangle+\langle\theta_{a}^{*}-\hat{\theta}_{b}(z^{*}),(\hat{\Sigma}(z^{*}))^{-1}(\theta_{a}^{*}-\hat{\theta}_{b}(z^{*}))\rangle
\displaystyle\lesssim~{} θaθb2(Σ^(z))1(Σ^(z))1.\displaystyle\|\theta_{a}^{*}-\theta_{b}^{*}\|^{2}\|(\hat{\Sigma}(z))^{-1}-(\hat{\Sigma}(z^{*}))^{-1}\|.

We can easily verify that the remaining two terms are dominated by the two terms above. Then, by using (42) and (23), we have

|Gj(zj,b,z)|θzjθb,(Σ)1(θzjθb)\displaystyle\frac{|G_{j}(z_{j}^{*},b,z)|}{\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma^{\ast})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle}
θ^b(z)θ^b(z)2+θ^b(z)θ^b(z)θzjθb+θzjθb2(Σ^(z))1(Σ^(z))1θzjθb2\displaystyle\lesssim\frac{\|\hat{\theta}_{b}(z)-\hat{\theta}_{b}(z^{*})\|^{2}+\|\hat{\theta}_{b}(z)-\hat{\theta}_{b}(z^{*})\|\|\theta^{*}_{z^{*}_{j}}-\theta^{*}_{b}\|+\|\theta^{*}_{z^{*}_{j}}-\theta^{*}_{b}\|^{2}\|(\hat{\Sigma}(z))^{-1}-(\hat{\Sigma}(z^{*}))^{-1}\|}{\|\theta^{*}_{z^{*}_{j}}-\theta^{*}_{b}\|^{2}}
\displaystyle\lesssim\frac{k\tau}{n}+\frac{k}{\Delta}\sqrt{\frac{\tau}{n}}+\frac{kd\sqrt{\tau}}{n\Delta}.

This establishes (21) and completes the proof of Lemma 5.3. ∎

Lemma 5.4.

Under the same conditions as in Theorem 2.2, for any sequence $\delta_{n}=o(1)$, we have

\displaystyle\xi_{\text{ideal}}(\delta_{n})\leq n\exp\left(-(1+o(1))\frac{\textsf{SNR}^{2}}{8}\right)

with probability at least 1nCexp(SNR)1-n^{-C^{\prime}}-\exp(-\textsf{SNR}).

Proof.

Under the conditions of Theorem 2.2, the inequalities (33)-(38) hold with probability at least $1-n^{-C^{\prime}}$. In the remaining proof, we will work on the event that these inequalities hold. Recalling the definition of $\xi_{\text{ideal}}$, we can write

ξideal(δ)\displaystyle\xi_{\text{ideal}}(\delta) =j=1nb[k]{zj}θzjθb2𝕀{ϵj,(Σ^(z))1(θ^zj(z)θ^b(z))1δ2θzjθb,(Σ)1(θzjθb)}\displaystyle=\sum_{j=1}^{n}\sum_{b\in[k]\setminus\{z_{j}^{*}\}}\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|^{2}\mathbb{I}\left\{\langle\epsilon_{j},(\hat{\Sigma}(z^{*}))^{-1}(\hat{\theta}_{z_{j}^{*}}(z^{*})-\hat{\theta}_{b}(z^{*}))\rangle\leq-\frac{1-\delta}{2}\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle\right\}
j=1nb[k]{zj}θzjθb2𝕀{ϵj,(Σ)1(θzjθb)1δδ¯2θzjθb,(Σ)1(θzjθb)}\displaystyle\leq\sum_{j=1}^{n}\sum_{b\in[k]\setminus\{z_{j}^{*}\}}\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|^{2}\mathbb{I}\left\{\langle\epsilon_{j},(\Sigma^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle\leq-\frac{1-\delta-\bar{\delta}}{2}\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle\right\}
+j=1nb[k]{zj}θzjθb2𝕀{ϵj,((Σ^(z))1(Σ)1)(θzjθb)δ¯6θzjθb,(Σ)1(θzjθb)}\displaystyle+\sum_{j=1}^{n}\sum_{b\in[k]\setminus\{z_{j}^{*}\}}\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|^{2}\mathbb{I}\left\{\langle\epsilon_{j},((\hat{\Sigma}(z^{*}))^{-1}-(\Sigma^{*})^{-1})(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle\leq-\frac{\bar{\delta}}{6}\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle\right\}
+j=1nb[k]{zj}θzjθb2𝕀{ϵj,(Σ^(z))1(θ^zj(z)θzj)δ¯6θzjθb,(Σ)1(θzjθb)}\displaystyle+\sum_{j=1}^{n}\sum_{b\in[k]\setminus\{z_{j}^{*}\}}\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|^{2}\mathbb{I}\left\{\langle\epsilon_{j},(\hat{\Sigma}(z^{*}))^{-1}(\hat{\theta}_{z_{j}^{*}}(z^{*})-\theta_{z_{j}^{*}}^{*})\rangle\leq-\frac{\bar{\delta}}{6}\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle\right\}
+j=1nb[k]{zj}θzjθb2𝕀{ϵj,(Σ^(z))1(θ^b(z)θb)δ¯6θzjθb,(Σ)1(θzjθb)}\displaystyle+\sum_{j=1}^{n}\sum_{b\in[k]\setminus\{z_{j}^{*}\}}\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|^{2}\mathbb{I}\left\{-\langle\epsilon_{j},(\hat{\Sigma}(z^{*}))^{-1}(\hat{\theta}_{b}(z^{*})-\theta_{b}^{*})\rangle\leq-\frac{\bar{\delta}}{6}\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle\right\}
=:M1+M2+M3+M4.\displaystyle=:M_{1}+M_{2}+M_{3}+M_{4}.

where $\bar{\delta}=\bar{\delta}_{n}$ is some sequence to be chosen later. We bound the four terms respectively. Write $\epsilon_{j}=(\Sigma^{*})^{1/2}w_{j}$, where $w_{j}\stackrel{iid}{\sim}\mathcal{N}(0,I_{d})$. By (22), we know

M2\displaystyle M_{2} j=1nb[k]{zj}θzjθb2𝕀{δ¯6λmaxθzjθb2λmaxwj(Σ^(z))1(Σ)1θzjθb}\displaystyle\leq\sum_{j=1}^{n}\sum_{b\in[k]\setminus\{z_{j}^{*}\}}\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|^{2}\mathbb{I}\left\{\frac{\bar{\delta}}{6\lambda_{\max}}\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|^{2}\leq\lambda_{\max}\|w_{j}\|\|(\hat{\Sigma}(z^{*}))^{-1}-(\Sigma^{*})^{-1}\|\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|\right\}
j=1nb[k]{zj}θzjθb2𝕀{Cδ¯θzjθbnd+lognwj}\displaystyle\leq\sum_{j=1}^{n}\sum_{b\in[k]\setminus\{z_{j}^{*}\}}\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|^{2}\mathbb{I}\left\{C\bar{\delta}\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|\sqrt{\frac{n}{d+\log n}}\leq\|w_{j}\|\right\}
j=1nb[k]{zj}θzjθb2𝕀{Cδ¯2θzjθb2nd+logn2dwj22d},\displaystyle\leq\sum_{j=1}^{n}\sum_{b\in[k]\setminus\{z_{j}^{*}\}}\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|^{2}\mathbb{I}\left\{C\bar{\delta}^{2}\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|^{2}\frac{n}{d+\log n}-2d\leq\|w_{j}\|^{2}-2d\right\},

where $C$ is a constant which may differ line by line. Recall that $kd=O(\sqrt{n})$, $\min_{a\neq b}\|\theta^{*}_{a}-\theta^{*}_{b}\|\rightarrow\infty$, and $\Delta/k\rightarrow\infty$ by assumption. Choose $\bar{\delta}$ such that $n^{-\frac{1}{4}}=o(\bar{\delta})$. Using the $\chi^{2}$ tail probability in Lemma 7.1, we have

𝔼M2\displaystyle\mathbb{E}M_{2} j=1nb[k]{zj}θzjθb2exp(Cδ¯2θzjθb2n)nexp((1+o(1))SNR28).\displaystyle\leq\sum_{j=1}^{n}\sum_{b\in[k]\setminus\{z_{j}^{*}\}}\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|^{2}\exp\left(-C\bar{\delta}^{2}\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|^{2}\sqrt{n}\right)\leq n\exp\left(-(1+o(1))\frac{\textsf{SNR}^{2}}{8}\right).

We can obtain similar bounds on M3M_{3} and M4M_{4} by using (41). For M1M_{1}, the Gaussian tail bound leads to the inequality

{ϵj,(Σ)1(θaθb)1δδ¯2θaθb,(Σ)1(θaθb)}\displaystyle\mathbb{P}\bigg{\{}\langle\epsilon_{j},(\Sigma^{*})^{-1}(\theta_{a}^{*}-\theta_{b}^{*})\rangle\leq-\frac{1-\delta-\bar{\delta}}{2}\langle\theta_{a}^{*}-\theta_{b}^{*},(\Sigma^{*})^{-1}(\theta_{a}^{*}-\theta_{b}^{*})\rangle\bigg{\}}
=\displaystyle=~{} {wj,(Σ)1/2(θaθb)1δδ¯2θaθb,(Σ)1(θaθb)}\displaystyle\mathbb{P}\bigg{\{}\langle w_{j},(\Sigma^{*})^{-1/2}(\theta_{a}^{*}-\theta_{b}^{*})\rangle\leq-\frac{1-\delta-\bar{\delta}}{2}\langle\theta_{a}^{*}-\theta_{b}^{*},(\Sigma^{*})^{-1}(\theta_{a}^{*}-\theta_{b}^{*})\rangle\bigg{\}}
\displaystyle\leq~{} exp((1δδ¯)28θaθb,(Σ)1(θaθb)).\displaystyle\exp\left(-\frac{(1-\delta-\bar{\delta})^{2}}{8}\langle\theta_{a}^{*}-\theta_{b}^{*},(\Sigma^{*})^{-1}(\theta_{a}^{*}-\theta_{b}^{*})\rangle\right).

Thus,

\displaystyle\mathbb{E}M_{1}\leq\sum_{j=1}^{n}\sum_{b\in[k]\setminus\{z_{j}^{*}\}}\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|^{2}\exp\left(-\frac{(1-\delta-\bar{\delta})^{2}}{8}\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle\right)
nexp((1+o(1))SNR28).\displaystyle\leq n\exp\left(-(1+o(1))\frac{\textsf{SNR}^{2}}{8}\right).

Overall, we have $\mathbb{E}\xi_{\text{ideal}}\lesssim n\exp\left(-(1+o(1))\frac{\textsf{SNR}^{2}}{8}\right)$. By Markov's inequality, we have

(ξideal(δn)𝔼ξidealexp(SNR))exp(SNR).\displaystyle\mathbb{P}\left(\xi_{\text{ideal}}(\delta_{n})\geq\mathbb{E}\xi_{\text{ideal}}\exp(\textsf{SNR})\right)\leq\exp(-\textsf{SNR}).

In other words, with probability at least 1exp(SNR)1-\exp(-\textsf{SNR}), we have

\displaystyle\xi_{\text{ideal}}(\delta_{n})\leq\mathbb{E}\xi_{\text{ideal}}(\delta_{n})\exp(\textsf{SNR})\leq n\exp\left(-(1+o(1))\frac{\textsf{SNR}^{2}}{8}\right).

This completes the proof of Lemma 5.4. ∎

Proof of Theorem 2.2.

By Lemmas 5.3 and 5.4, together with the assumption on the initialization $z^{(0)}$, Conditions 5.2.1 - 5.2.5 are satisfied with probability at least $1-\eta-n^{-1}-\exp(-\textsf{SNR})$. Then applying Lemma 5.2, we have

(z(t),z)nexp((1+o(1))SNR28)+14(z(t1),z),for all t1.\displaystyle\ell(z^{(t)},z^{*})\leq n\exp\left(-(1+o(1))\frac{\textsf{SNR}^{2}}{8}\right)+\frac{1}{4}\ell(z^{(t-1)},z^{*}),\quad\text{for all }t\geq 1.

By (16) and the fact that there exists a constant $C$ such that $\Delta\leq C\,\textsf{SNR}$, we can conclude

h(z(t),z)exp((1+o(1))SNR28)+4t,for all t1.\displaystyle h(z^{(t)},z^{*})\leq\exp\left(-(1+o(1))\frac{\textsf{SNR}^{2}}{8}\right)+4^{-t},\quad\text{for all }t\geq 1.

Notice that $h(z,z^{*})$ takes values in the set $\{j/n:j\in[n]\cup\{0\}\}$, so the term $4^{-t}$ in the above inequality is negligible as long as $4^{-t}=o(n^{-1})$; in particular, $t\geq\log n$ gives $4^{-t}\leq 4^{-\log n}=n^{-\log 4}=o(n^{-1})$. Thus, we can claim

\displaystyle h(z^{(t)},z^{*})\leq\exp\left(-(1+o(1))\frac{\textsf{SNR}^{2}}{8}\right),\quad\text{for all }t\geq\log n.

This completes the proof of Theorem 2.2. ∎

6 Proofs in Section 3

6.1 Proofs for The Lower Bound

Proof of Lemma 3.1.

The Neyman-Pearson lemma tells us the likelihood ratio test ϕ\phi is the optimal procedure. Following the proof of Lemma 2.1, we have

0(ϕ=1)+1(ϕ=0)\displaystyle\mathbb{P}_{\mathbb{H}_{0}}\left(\phi=1\right)+\mathbb{P}_{\mathbb{H}_{1}}\left(\phi=0\right) =(ϵ1,2)+(ϵ2,1)\displaystyle=\mathbb{P}\left(\epsilon\in\mathcal{B}_{1,2}\right)+\mathbb{P}\left(\epsilon\in\mathcal{B}_{2,1}\right)
exp(1+o(1)8SNR1,22)+exp(1+o(1)8SNR2,12),\displaystyle\geq\exp\left(-\frac{1+o(1)}{8}\textsf{SNR}_{1,2}^{{}^{\prime}2}\right)+\exp\left(-\frac{1+o(1)}{8}\textsf{SNR}_{2,1}^{{}^{\prime}2}\right),

where the last inequality is by Lemma 7.11. ∎

Proof of Theorem 3.1.

The proof is identical to the proof of Theorem 2.1 and is omitted here. ∎

6.2 Proofs for The Upper Bound

We adopt a proof idea similar to that in Section 5.2. We first present an error decomposition for the one-step analysis of Algorithm 2. In Lemma 6.1, we show the loss decays after a one-step iteration under Conditions 6.2.1 - 6.2.6. Then in Lemma 6.2 we extend the result to multiple iterations, under two extra Conditions 6.2.7 - 6.2.8. Lastly, we show all the conditions are satisfied with high probability and thus prove Theorem 3.2.

In the statement of Theorem 3.2, we assume $\max_{a,b\in[k]}\lambda_{d}(\Sigma^{*}_{a})/\lambda_{1}(\Sigma^{*}_{b})=O(1)$ for the covariance matrices $\{\Sigma_{a}^{*}\}_{a\in[k]}$. Without loss of generality, we can replace it by assuming the covariance matrices satisfy

λminmina[k]λ1(Σa)maxa[k]λd(Σa)λmax\displaystyle\lambda_{\min}\leq\min_{a\in[k]}\lambda_{1}(\Sigma^{*}_{a})\leq\max_{a\in[k]}\lambda_{d}(\Sigma^{*}_{a})\leq\lambda_{\max} (24)

where $\lambda_{\min},\lambda_{\max}>0$ are two constants. This is due to the scaling properties of the normal distributions. The reasoning is the same as that of (15) for Model 1 and hence is omitted here. In the remainder of this section, we will assume (24) holds for the covariance matrices.

Error Decomposition for the One-step Analysis.

Consider an arbitrary z[k]nz\in[k]^{n}. Apply (12), (13), and (14) on zz to obtain {θ^a(z)}a[k]\{\hat{\theta}_{a}(z)\}_{a\in[k]}, {Σ^a(z)}a[k]\{\hat{\Sigma}_{a}(z)\}_{a\in[k]}, and z^(z)\hat{z}(z):

θ^a(z)\displaystyle\hat{\theta}_{a}(z) =j[n]Yj𝕀{zj=a}j[n]𝕀{zj=a},\displaystyle=\frac{\sum_{j\in[n]}Y_{j}{\mathbb{I}\left\{{z_{j}=a}\right\}}}{\sum_{j\in[n]}{\mathbb{I}\left\{{z_{j}=a}\right\}}},
Σ^a(z)\displaystyle\hat{\Sigma}_{a}(z) =j[n](Yjθ^a(z))(Yjθ^a(z))T𝕀{zj=a}j[n]𝕀{zj=a},\displaystyle=\frac{\sum_{j\in[n]}(Y_{j}-\hat{\theta}_{a}(z))(Y_{j}-\hat{\theta}_{a}(z))^{T}{\mathbb{I}\left\{{z_{j}=a}\right\}}}{\sum_{j\in[n]}{\mathbb{I}\left\{{z_{j}=a}\right\}}},
\displaystyle\hat{z}_{j}(z)=\mathop{\rm argmin}_{a\in[k]}(Y_{j}-\hat{\theta}_{a}(z))^{T}(\hat{\Sigma}_{a}(z))^{-1}(Y_{j}-\hat{\theta}_{a}(z))+\log|\hat{\Sigma}_{a}(z)|,\quad j\in[n].
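In code, one iteration of the updates (12)-(14) may be sketched as follows (a minimal Python sketch, assuming every cluster is nonempty and every covariance estimate is invertible; the function and variable names are ours):

```python
import numpy as np

def one_step(Y, z, k):
    """One iteration of (12)-(14): re-estimate the centers and per-cluster
    covariances from the current labels z, then reassign every point by the
    quadratic discriminant rule."""
    n, d = Y.shape
    theta = np.stack([Y[z == a].mean(axis=0) for a in range(k)])   # (12)
    scores = np.empty((n, k))
    for a in range(k):
        R_a = Y[z == a] - theta[a]
        Sigma_a = R_a.T @ R_a / (z == a).sum()                     # (13)
        Sigma_a_inv = np.linalg.inv(Sigma_a)
        _, logdet = np.linalg.slogdet(Sigma_a)
        R = Y - theta[a]
        # (Y_j - theta_a)^T Sigma_a^{-1} (Y_j - theta_a) + log|Sigma_a|
        scores[:, a] = np.einsum('ij,jk,ik->i', R, Sigma_a_inv, R) + logdet
    return scores.argmin(axis=1)                                   # (14)
```

Iterating this map for order $\log n$ rounds from a decent initialization is exactly the scheme analyzed in the rest of this section.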

For simplicity we write $\hat{z}$ for $\hat{z}(z)$. Let $j\in[n]$ be an arbitrary index with $z_{j}^{*}=a$. According to (9), $z^{*}_{j}$ will be incorrectly estimated after one iteration in $\hat{z}$ if $a\neq\mathop{\rm argmin}_{a^{\prime}\in[k]}(Y_{j}-\hat{\theta}_{a^{\prime}}(z))^{T}(\hat{\Sigma}_{a^{\prime}}(z))^{-1}(Y_{j}-\hat{\theta}_{a^{\prime}}(z))+\log|\hat{\Sigma}_{a^{\prime}}(z)|$. That is, it is important to analyze the event

Yjθ^b(z),(Σ^b(z))1(Yjθ^b(z))+log|Σ^b(z)|Yjθ^a(z),(Σ^a(z))1(Yjθ^a(z))+log|Σ^a(z)|,\displaystyle\langle Y_{j}-\hat{\theta}_{b}(z),(\hat{\Sigma}_{b}(z))^{-1}(Y_{j}-\hat{\theta}_{b}(z))\rangle+\log|\hat{\Sigma}_{b}(z)|\leq\langle Y_{j}-\hat{\theta}_{a}(z),(\hat{\Sigma}_{a}(z))^{-1}(Y_{j}-\hat{\theta}_{a}(z))\rangle+\log|\hat{\Sigma}_{a}(z)|, (25)

for any $b\in[k]\setminus\{a\}$. After some rearrangements, we can see that (25) is equivalent to

ϵj,(Σ^b(z))1(θaθ^b(z))ϵj,(Σ^a(z))1(θaθ^a(z))\displaystyle\langle\epsilon_{j},(\hat{\Sigma}_{b}(z^{*}))^{-1}(\theta_{a}^{*}-\hat{\theta}_{b}(z^{*}))\rangle-\langle\epsilon_{j},(\hat{\Sigma}_{a}(z^{*}))^{-1}(\theta_{a}^{*}-\hat{\theta}_{a}(z^{*}))\rangle
+\displaystyle+~{} 12ϵj,((Σ^b(z))1(Σ^a(z))1)ϵj12log|Σa|+12log|Σb|\displaystyle\frac{1}{2}\langle\epsilon_{j},((\hat{\Sigma}_{b}(z^{*}))^{-1}-(\hat{\Sigma}_{a}(z^{*}))^{-1})\epsilon_{j}\rangle-\frac{1}{2}\log|\Sigma_{a}^{*}|+\frac{1}{2}\log|\Sigma_{b}^{*}|
\displaystyle\leq-~{} 12θaθb,(Σb)1(θaθb)\displaystyle\frac{1}{2}\langle\theta_{a}^{*}-\theta_{b}^{*},(\Sigma_{b}^{*})^{-1}(\theta_{a}^{*}-\theta_{b}^{*})\rangle
+\displaystyle+~{} Fj(a,b,z)+Qj(a,b,z)+Gj(a,b,z)+Hj(a,b,z)+Kj(a,b,z)+Lj(a,b,z),\displaystyle F_{j}(a,b,z)+Q_{j}(a,b,z)+G_{j}(a,b,z)+H_{j}(a,b,z)+K_{j}(a,b,z)+L_{j}(a,b,z),

where

Fj(a,b,z)\displaystyle F_{j}(a,b,z) =ϵj,(Σ^b(z))1(θ^b(z)θ^b(z))ϵj,(Σ^a(z))1(θ^a(z)θ^a(z))\displaystyle=\langle\epsilon_{j},(\hat{\Sigma}_{b}(z))^{-1}(\hat{\theta}_{b}(z)-\hat{\theta}_{b}(z^{*}))\rangle-\langle\epsilon_{j},(\hat{\Sigma}_{a}(z))^{-1}(\hat{\theta}_{a}(z)-\hat{\theta}_{a}(z^{*}))\rangle
ϵj,((Σ^b(z))1(Σ^b(z))1)(θaθ^b(z))\displaystyle-\langle\epsilon_{j},((\hat{\Sigma}_{b}(z))^{-1}-(\hat{\Sigma}_{b}(z^{*}))^{-1})(\theta_{a}^{*}-\hat{\theta}_{b}(z^{*}))\rangle
+ϵj,((Σ^a(z))1(Σ^a(z))1)(θaθ^a(z)),\displaystyle+\langle\epsilon_{j},((\hat{\Sigma}_{a}(z))^{-1}-(\hat{\Sigma}_{a}(z^{*}))^{-1})(\theta_{a}^{*}-\hat{\theta}_{a}(z^{*}))\rangle,
Qj(a,b,z)=12ϵj,((Σ^b(z))1(Σ^b(z))1)ϵj+12ϵj,((Σ^a(z))1(Σ^a(z))1)ϵj,\displaystyle Q_{j}(a,b,z)=-\frac{1}{2}\langle\epsilon_{j},((\hat{\Sigma}_{b}(z))^{-1}-(\hat{\Sigma}_{b}(z^{*}))^{-1})\epsilon_{j}\rangle+\frac{1}{2}\langle\epsilon_{j},((\hat{\Sigma}_{a}(z))^{-1}-(\hat{\Sigma}_{a}(z^{*}))^{-1})\epsilon_{j}\rangle,
Gj(a,b,z)\displaystyle G_{j}(a,b,z) =12θaθ^a(z),(Σ^a(z))1(θaθ^a(z))12θaθ^a(z),(Σ^a(z))1(θaθ^a(z))\displaystyle=\frac{1}{2}\langle\theta_{a}^{*}-\hat{\theta}_{a}(z),(\hat{\Sigma}_{a}(z))^{-1}(\theta_{a}^{*}-\hat{\theta}_{a}(z))\rangle-\frac{1}{2}\langle\theta_{a}^{*}-\hat{\theta}_{a}(z^{*}),(\hat{\Sigma}_{a}(z))^{-1}(\theta_{a}^{*}-\hat{\theta}_{a}(z^{*}))\rangle
+12θaθ^a(z),(Σ^a(z))1(θaθ^a(z))12θaθ^a(z),(Σ^a(z))1(θaθ^a(z))\displaystyle+\frac{1}{2}\langle\theta_{a}^{*}-\hat{\theta}_{a}(z^{*}),(\hat{\Sigma}_{a}(z))^{-1}(\theta_{a}^{*}-\hat{\theta}_{a}(z^{*}))\rangle-\frac{1}{2}\langle\theta_{a}^{*}-\hat{\theta}_{a}(z^{*}),(\hat{\Sigma}_{a}(z^{*}))^{-1}(\theta_{a}^{*}-\hat{\theta}_{a}(z^{*}))\rangle
12θaθ^b(z),(Σ^b(z))1(θaθ^b(z))+12θaθ^b(z),(Σ^b(z))1(θaθ^b(z))\displaystyle-\frac{1}{2}\langle\theta_{a}^{*}-\hat{\theta}_{b}(z),(\hat{\Sigma}_{b}(z))^{-1}(\theta_{a}^{*}-\hat{\theta}_{b}(z))\rangle+\frac{1}{2}\langle\theta_{a}^{*}-\hat{\theta}_{b}(z^{*}),(\hat{\Sigma}_{b}(z))^{-1}(\theta_{a}^{*}-\hat{\theta}_{b}(z^{*}))\rangle
12θaθ^b(z),(Σ^b(z))1(θaθ^b(z))+12θaθ^b(z),(Σ^b(z))1(θaθ^b(z)),\displaystyle-\frac{1}{2}\langle\theta_{a}^{*}-\hat{\theta}_{b}(z^{*}),(\hat{\Sigma}_{b}(z))^{-1}(\theta_{a}^{*}-\hat{\theta}_{b}(z^{*}))\rangle+\frac{1}{2}\langle\theta_{a}^{*}-\hat{\theta}_{b}(z^{*}),(\hat{\Sigma}_{b}(z^{*}))^{-1}(\theta_{a}^{*}-\hat{\theta}_{b}(z^{*}))\rangle,
Hj(a,b,z)=\displaystyle H_{j}(a,b,z)= 12θaθ^b(z),(Σ^b(z))1(θaθ^b(z))+12θaθb,(Σ^b(z))1(θaθb)\displaystyle-\frac{1}{2}\langle\theta_{a}^{*}-\hat{\theta}_{b}(z^{*}),(\hat{\Sigma}_{b}(z^{*}))^{-1}(\theta_{a}^{*}-\hat{\theta}_{b}(z^{*}))\rangle+\frac{1}{2}\langle\theta_{a}^{*}-\theta_{b}^{*},(\hat{\Sigma}_{b}(z^{*}))^{-1}(\theta_{a}^{*}-\theta_{b}^{*})\rangle
12θaθb,(Σ^b(z))1(θaθb)+12θaθb,(Σb)1(θaθb)\displaystyle-\frac{1}{2}\langle\theta_{a}^{*}-\theta_{b}^{*},(\hat{\Sigma}_{b}(z^{*}))^{-1}(\theta_{a}^{*}-\theta_{b}^{*})\rangle+\frac{1}{2}\langle\theta_{a}^{*}-\theta_{b}^{*},(\Sigma_{b}^{*})^{-1}(\theta_{a}^{*}-\theta_{b}^{*})\rangle
+12θaθ^a(z),(Σ^a(z))1(θaθ^a(z)),\displaystyle+\frac{1}{2}\langle\theta_{a}^{*}-\hat{\theta}_{a}(z^{*}),(\hat{\Sigma}_{a}(z^{*}))^{-1}(\theta_{a}^{*}-\hat{\theta}_{a}(z^{*}))\rangle,
Kj(a,b,z):=12(log|Σ^a(z)|log|Σ^a(z)|)12(log|Σ^b(z)|log|Σ^b(z)|),\displaystyle K_{j}(a,b,z):=\frac{1}{2}(\log|\hat{\Sigma}_{a}(z)|-\log|\hat{\Sigma}_{a}(z^{*})|)-\frac{1}{2}(\log|\hat{\Sigma}_{b}(z)|-\log|\hat{\Sigma}_{b}(z^{*})|),

and

Lj(a,b,z):=12(log|Σ^a(z)|log|Σa|)12(log|Σ^b(z)|log|Σb|).\displaystyle L_{j}(a,b,z):=\frac{1}{2}(\log|\hat{\Sigma}_{a}(z^{*})|-\log|\Sigma_{a}^{*}|)-\frac{1}{2}(\log|\hat{\Sigma}_{b}(z^{*})|-\log|\Sigma_{b}^{*}|).

Among these terms, $F_{j},G_{j},H_{j}$ are nearly identical to their counterparts in Section 5.2, with $\hat{\Sigma}(z)$ replaced by $\hat{\Sigma}_{a}(z)$ or $\hat{\Sigma}_{b}(z)$. There are three extra terms not appearing in Section 5.2: $Q_{j}$ is a quadratic term in $\epsilon_{j}$, and $K_{j},L_{j}$ are terms involving matrix determinants.

Conditions and Guarantees for One-step Analysis.

To establish the guarantee for the one-step analysis, we first give several conditions on the error terms.

Condition 6.2.1.

Assume that

max{z:(z,z)τ}maxj[n]maxb[k]{zj}|Hj(zj,b,z)|θzjθb,(Σb)1(θzjθb)δ12\displaystyle\max_{\{z:\ell(z,z^{\ast})\leq\tau\}}\max_{j\in[n]}\max_{b\in[k]\setminus\{z_{j}^{*}\}}\frac{|H_{j}(z_{j}^{*},b,z)|}{\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma_{b}^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle}\leq\frac{\delta}{12}

holds with probability at least $1-\eta_{1}$ for some $\tau,\delta,\eta_{1}>0$.

Condition 6.2.2.

Assume that

max{z:(z,z)τ}j=1nmaxb[k]{zj}Fj(zj,b,z)2θzjθb2θzjθb,(Σb)1(θzjθb)2(z,z)δ2288\displaystyle\max_{\{z:\ell(z,z^{\ast})\leq\tau\}}\sum_{j=1}^{n}\max_{b\in[k]\setminus\{z_{j}^{*}\}}\frac{F_{j}(z_{j}^{*},b,z)^{2}\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|^{2}}{\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma_{b}^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle^{2}\ell(z,z^{\ast})}\leq\frac{\delta^{2}}{288}

holds with probability at least $1-\eta_{2}$ for some $\tau,\delta,\eta_{2}>0$.

Condition 6.2.3.

Assume that

max{z:(z,z)τ}maxj[n]maxb[k]{zj}|Gj(zj,b,z)|θzjθb,(Σb)1(θzjθb)δ12\displaystyle\max_{\{z:\ell(z,z^{\ast})\leq\tau\}}\max_{j\in[n]}\max_{b\in[k]\setminus\{z_{j}^{*}\}}\frac{|G_{j}(z_{j}^{*},b,z)|}{\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma_{b}^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle}\leq\frac{\delta}{12}

holds with probability at least $1-\eta_{3}$ for some $\tau,\delta,\eta_{3}>0$.

Condition 6.2.4.

Assume that

max{z:(z,z)τ}j=1nmaxb[k]{zj}Qj(zj,b,z)2θzjθb2θzjθb,(Σb)1(θzjθb)2(z,z)δ2288\displaystyle\max_{\{z:\ell(z,z^{\ast})\leq\tau\}}\sum_{j=1}^{n}\max_{b\in[k]\setminus\{z_{j}^{*}\}}\frac{Q_{j}(z_{j}^{*},b,z)^{2}\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|^{2}}{\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma_{b}^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle^{2}\ell(z,z^{\ast})}\leq\frac{\delta^{2}}{288}

holds with probability at least $1-\eta_{4}$ for some $\tau,\delta,\eta_{4}>0$.

Condition 6.2.5.

Assume that

max{z:(z,z)τ}j=1nmaxb[k]{zj}Kj(zj,b,z)2θzjθb2θzjθb,(Σb)1(θzjθb)2(z,z)δ2288\displaystyle\max_{\{z:\ell(z,z^{\ast})\leq\tau\}}\sum_{j=1}^{n}\max_{b\in[k]\setminus\{z_{j}^{*}\}}\frac{K_{j}(z_{j}^{*},b,z)^{2}\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|^{2}}{\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma_{b}^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle^{2}\ell(z,z^{\ast})}\leq\frac{\delta^{2}}{288}

holds with probability at least $1-\eta_{5}$ for some $\tau,\delta,\eta_{5}>0$.

Condition 6.2.6.

Assume that

max{z:(z,z)τ}maxj[n]maxb[k]{zj}|Lj(zj,b,z)|θzjθb,(Σb)1(θzjθb)δ12\displaystyle\max_{\{z:\ell(z,z^{\ast})\leq\tau\}}\max_{j\in[n]}\max_{b\in[k]\setminus\{z_{j}^{*}\}}\frac{|L_{j}(z_{j}^{*},b,z)|}{\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma_{b}^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle}\leq\frac{\delta}{12}

holds with probability at least $1-\eta_{6}$ for some $\tau,\delta,\eta_{6}>0$.

We next define a quantity that we refer to as the ideal error:

\displaystyle\xi_{\text{ideal}}(\delta)=\sum_{j=1}^{n}\sum_{b\in[k]\setminus\{z_{j}^{*}\}}\|\theta^{*}_{z_{j}^{*}}-\theta_{b}^{*}\|^{2}\mathbb{I}\{\langle\epsilon_{j},(\hat{\Sigma}_{b}(z^{*}))^{-1}(\theta_{z_{j}^{*}}^{*}-\hat{\theta}_{b}(z^{*}))\rangle-\langle\epsilon_{j},(\hat{\Sigma}_{z_{j}^{*}}(z^{*}))^{-1}(\theta_{z_{j}^{*}}^{*}-\hat{\theta}_{z_{j}^{*}}(z^{*}))\rangle
\displaystyle+\frac{1}{2}\langle\epsilon_{j},((\hat{\Sigma}_{b}(z^{*}))^{-1}-(\hat{\Sigma}_{z_{j}^{*}}(z^{*}))^{-1})\epsilon_{j}\rangle-\frac{1}{2}\log|\Sigma_{z_{j}^{*}}^{*}|+\frac{1}{2}\log|\Sigma_{b}^{*}|\leq-\frac{1-\delta}{2}\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma_{b}^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle\}.
Lemma 6.1.

Assume Conditions 6.2.1 - 6.2.6 hold for some $\tau,\delta,\eta_{1},\ldots,\eta_{6}>0$. We then have

((z^,z)2ξideal(δ)+14(z,z) for any z[k]n such that (z,z)τ)1η,\displaystyle\mathbb{P}\left(\ell(\hat{z},z^{*})\leq 2\xi_{\text{ideal}}(\delta)+\frac{1}{4}\ell(z,z^{*})\text{ for any $z\in[k]^{n}$ such that $\ell(z,z^{*})\leq\tau$}\right)\geq 1-\eta,

where η=i=16ηi\eta=\sum_{i=1}^{6}\eta_{i}.

Proof.

The proof of this lemma is quite similar to the proof of Lemma 5.1. The additional terms $Q_{j}$ and $K_{j}$ can be dealt with in the same way as $F_{j}$, while $L_{j}$ can be dealt with in the same way as $H_{j}$. We omit the details here. ∎

Conditions and Guarantees for Multiple Iterations.

In the above we established a statistical guarantee for the one-step analysis. Now we extend the result to multiple iterations. That is, starting from some initialization $z^{(0)}$, we characterize how the losses $\ell(z^{(0)},z^{*})$, $\ell(z^{(1)},z^{*})$, $\ell(z^{(2)},z^{*})$, …, decay. We impose a condition on $\xi_{\text{ideal}}(\delta)$ and a condition on $z^{(0)}$.

Condition 6.2.7.

Assume that

ξideal(δ)τ2\displaystyle\xi_{\text{ideal}}(\delta)\leq\frac{\tau}{2}

holds with probability at least $1-\eta_{7}$ for some $\tau,\delta,\eta_{7}>0$.

Finally, we need a condition on the initialization.

Condition 6.2.8.

Assume that

(z(0),z)τ\displaystyle\ell(z^{(0)},z^{*})\leq\tau

holds with probability at least $1-\eta_{8}$ for some $\tau,\eta_{8}>0$.

With these conditions satisfied, we can give a lemma that shows the linear convergence guarantee for our algorithm.

Lemma 6.2.

Assume Conditions 6.2.1 - 6.2.8 hold for some $\tau,\delta,\eta_{1},\ldots,\eta_{8}>0$. We then have

(z(t),z)ξideal(δ)+12(z(t1),z)\displaystyle\ell(z^{(t)},z^{*})\leq\xi_{\text{ideal}}(\delta)+\frac{1}{2}\ell(z^{(t-1)},z^{*})

for all t1t\geq 1, with probability at least 1η1-\eta, where η=i=18ηi\eta=\sum_{i=1}^{8}\eta_{i}.

Proof.

The proof of this lemma is the same as the proof of Lemma 5.2. ∎

High-probability Results for the Conditions and the Proof of the Main Theorem.

Recall the definition of $\Delta$ in (3). Lemma 6.3 shows $\textsf{SNR}^{\prime}$ is of the same order as $\Delta$, which will play a role similar to that of (18) in Section 5.2. It immediately implies that the assumption $\textsf{SNR}^{\prime}\rightarrow\infty$ in the statement of Theorem 3.2 is equivalent to $\Delta\rightarrow\infty$. The proof of Lemma 6.3 is deferred to Section 7.

Lemma 6.3.

Assume $\textsf{SNR}^{\prime}\to\infty$ and $d=O(1)$. Further assume there exist constants $\lambda_{\min},\lambda_{\max}>0$ such that $\lambda_{\min}\leq\lambda_{1}(\Sigma_{a}^{*})\leq\lambda_{d}(\Sigma_{a}^{*})\leq\lambda_{\max}$ for any $a\in[k]$. Then, there exist constants $C_{1},C_{2}>0$ only depending on $\lambda_{\min},\lambda_{\max},d$ such that

C1θaθbSNRa,bC2θaθb,\displaystyle C_{1}\left\|{\theta_{a}^{*}-\theta_{b}^{*}}\right\|\leq\textsf{SNR}^{\prime}_{a,b}\leq C_{2}\left\|{\theta_{a}^{*}-\theta_{b}^{*}}\right\|,

for any $a\neq b\in[k]$. As a result, $\textsf{SNR}^{\prime}$ is of the same order as $\Delta$.

Lemma 6.4 and Lemma 6.5 are counterparts of Lemmas 5.3 and 5.4 in Section 5.2.

Lemma 6.4.

Under the same conditions as in Theorem 3.2, for any constant C>0C^{\prime}>0, there exists some constant C>0C>0 only depending on α,C,λmin,λmax\alpha,C^{\prime},\lambda_{\min},\lambda_{\max} such that

max{z:(z,z)τ}maxj[n]maxb[k]{zj}|Hj(zj,b,z)|θzjθb,(Σb)1(θzjθb)\displaystyle\max_{\{z:\ell(z,z^{\ast})\leq\tau\}}\max_{j\in[n]}\max_{b\in[k]\setminus\{z_{j}^{*}\}}\frac{|H_{j}(z_{j}^{*},b,z)|}{\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma_{b}^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle} Ck(d+logn)n\displaystyle\leq C\sqrt{\frac{k(d+\log n)}{n}} (26)
max{z:(z,z)τ}j=1nmaxb[k]{zj}Fj(zj,b,z)2θzjθb2θzjθb,(Σb)1(θzjθb)2(z,z)\displaystyle\max_{\{z:\ell(z,z^{\ast})\leq\tau\}}\sum_{j=1}^{n}\max_{b\in[k]\setminus\{z_{j}^{*}\}}\frac{F_{j}(z_{j}^{*},b,z)^{2}\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|^{2}}{\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma_{b}^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle^{2}\ell(z,z^{\ast})} Ck3(τn+1Δ2+d2nΔ2)\displaystyle\leq Ck^{3}\left(\frac{\tau}{n}+\frac{1}{\Delta^{2}}+\frac{d^{2}}{n\Delta^{2}}\right) (27)
max{z:(z,z)τ}maxj[n]maxb[k]{zj}|Gj(zj,b,z)|θzjθb,(Σb)1(θzjθb)\displaystyle\max_{\{z:\ell(z,z^{\ast})\leq\tau\}}\max_{j\in[n]}\max_{b\in[k]\setminus\{z_{j}^{*}\}}\frac{|G_{j}(z_{j}^{*},b,z)|}{\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma_{b}^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle} Ck(τn+1Δτn+dτnΔ)\displaystyle\leq Ck\left(\frac{\tau}{n}+\frac{1}{\Delta}\sqrt{\frac{\tau}{n}}+\frac{d\sqrt{\tau}}{n\Delta}\right) (28)
max{z:(z,z)τ}j=1nmaxb[k]{zj}Qj(zj,b,z)2θzjθb2θzjθb,(Σb)1(θzjθb)2(z,z)\displaystyle\max_{\{z:\ell(z,z^{\ast})\leq\tau\}}\sum_{j=1}^{n}\max_{b\in[k]\setminus\{z_{j}^{*}\}}\frac{Q_{j}(z_{j}^{*},b,z)^{2}\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|^{2}}{\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma_{b}^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle^{2}\ell(z,z^{\ast})} Ck3d2Δ2(τn+1Δ2+d2nΔ2)\displaystyle\leq C\frac{k^{3}d^{2}}{\Delta^{2}}\left(\frac{\tau}{n}+\frac{1}{\Delta^{2}}+\frac{d^{2}}{n\Delta^{2}}\right) (29)
max{z:(z,z)τ}j=1nmaxb[k]{zj}Kj(zj,b,z)2θzjθb2θzjθb,(Σb)1(θzjθb)2(z,z)\displaystyle\max_{\{z:\ell(z,z^{\ast})\leq\tau\}}\sum_{j=1}^{n}\max_{b\in[k]\setminus\{z_{j}^{*}\}}\frac{K_{j}(z_{j}^{*},b,z)^{2}\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|^{2}}{\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma_{b}^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle^{2}\ell(z,z^{\ast})} Ck3d2Δ2(τn+1Δ2+d2nΔ2)\displaystyle\leq C\frac{k^{3}d^{2}}{\Delta^{2}}\left(\frac{\tau}{n}+\frac{1}{\Delta^{2}}+\frac{d^{2}}{n\Delta^{2}}\right) (30)
max{z:(z,z)τ}maxj[n]maxb[k]{zj}|Lj(zj,b,z)|θzjθb,(Σb)1(θzjθb)\displaystyle\max_{\{z:\ell(z,z^{\ast})\leq\tau\}}\max_{j\in[n]}\max_{b\in[k]\setminus\{z_{j}^{*}\}}\frac{|L_{j}(z_{j}^{*},b,z)|}{\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma_{b}^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle} CdΔ2k(d+logn)n\displaystyle\leq C\frac{d}{\Delta^{2}}\sqrt{\frac{k(d+\log n)}{n}} (31)

with probability at least 1nC4nd1-n^{-C^{\prime}}-\frac{4}{nd}.

Proof.

Under the conditions of Theorem 3.2, the inequalities (33)-(38) hold with probability at least $1-n^{-C^{\prime}}$. In the remaining proof, we will work on the event that these inequalities hold. Hence, we can use the results from Lemmas 7.7 and 7.8. Using the same arguments as in the proof of Lemma 5.3, we can get (26), (27), and (28).

As for (29), we first use Lemma 7.9 to obtain $\sum_{j=1}^{n}\|\epsilon_{j}\|^{4}\leq 3nd$ with probability at least $1-4/(nd)$. Then, we have

j=1nmaxb[k]{zj}Qj(zj,b,z)2θzjθb2θzjθb,(Σb)1(θzjθb)2(z,z)\displaystyle\sum_{j=1}^{n}\max_{b\in[k]\setminus\{z_{j}^{*}\}}\frac{Q_{j}(z_{j}^{*},b,z)^{2}\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|^{2}}{\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma_{b}^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle^{2}\ell(z,z^{\ast})}\lesssim~{} j=1nb=1kQj(zj,b,z)2Δ2(z,z)\displaystyle\sum_{j=1}^{n}\sum_{b=1}^{k}\frac{Q_{j}(z_{j}^{*},b,z)^{2}}{\Delta^{2}\ell(z,z^{*})}
\displaystyle\leq~{} kj=1nϵj4maxa[k](Σ^a(z))1(Σ^a(z))12Δ2(z,z)\displaystyle k\sum_{j=1}^{n}\|\epsilon_{j}\|^{4}\frac{\max_{a\in[k]}\|(\hat{\Sigma}_{a}(z))^{-1}-(\hat{\Sigma}_{a}(z^{*}))^{-1}\|^{2}}{\Delta^{2}\ell(z,z^{*})}
\displaystyle\lesssim~{} k3d2Δ2(τn+1Δ2+d2nΔ2),\displaystyle\frac{k^{3}d^{2}}{\Delta^{2}}\left(\frac{\tau}{n}+\frac{1}{\Delta^{2}}+\frac{d^{2}}{n\Delta^{2}}\right),

where the last inequality is due to (53) and the fact that (z,z)τ\ell(z,z^{*})\leq\tau.

Next for (30), notice that by (43), (44), and SNR\textsf{SNR}^{\prime}\to\infty, we have for any 1id1\leq i\leq d, λmin2λi(Σ^a(z))2λmax\frac{\lambda_{\min}}{2}\leq\lambda_{i}(\hat{\Sigma}_{a}(z^{*}))\leq 2\lambda_{\max} and

|log(1+maxa[k]Σ^a(z)Σ^a(z)2λi(Σ^a(z)))||log(1maxa[k]Σ^a(z)Σ^a(z)2λi(Σ^a(z)))|.\displaystyle\left|\log(1+\max_{a\in[k]}\frac{\|\hat{\Sigma}_{a}(z)-\hat{\Sigma}_{a}(z^{*})\|_{2}}{\lambda_{i}(\hat{\Sigma}_{a}(z^{*}))})\right|\leq\left|\log(1-\max_{a\in[k]}\frac{\|\hat{\Sigma}_{a}(z)-\hat{\Sigma}_{a}(z^{*})\|_{2}}{\lambda_{i}(\hat{\Sigma}_{a}(z^{*}))})\right|.

Thus by Lemma 7.6, we know

maxa[k]|log|Σ^a(z)|log|Σ^a(z)||\displaystyle\max_{a\in[k]}\left|\log|\hat{\Sigma}_{a}(z)|-\log|\hat{\Sigma}_{a}(z^{*})|\right|
=\displaystyle=~{} maxa[k]|log|Σ^a(z)||Σ^a(z)||\displaystyle\max_{a\in[k]}\left|\log\frac{|\hat{\Sigma}_{a}(z)|}{|\hat{\Sigma}_{a}(z^{*})|}\right|
\displaystyle\leq~{} |i=1dlog(1maxa[k]Σ^a(z)Σ^a(z)2λi(Σ^a(z)))|\displaystyle\left|\sum_{i=1}^{d}\log(1-\frac{\max_{a\in[k]}\|\hat{\Sigma}_{a}(z)-\hat{\Sigma}_{a}(z^{*})\|_{2}}{\lambda_{i}(\hat{\Sigma}_{a}(z^{*}))})\right|
\displaystyle\leq i=1dlog(1+maxa[k]Σ^a(z)Σ^a(z)2λi(Σ^a(z))+maxa[k]Σ^a(z)Σ^a(z)22λi2(Σ^a(z))1maxa[k]Σ^a(z)Σ^a(z)2λi(Σ^a(z)))\displaystyle\sum_{i=1}^{d}\log\left(1+\max_{a\in[k]}\frac{\|\hat{\Sigma}_{a}(z)-\hat{\Sigma}_{a}(z^{*})\|_{2}}{\lambda_{i}(\hat{\Sigma}_{a}(z^{*}))}+\frac{\max_{a\in[k]}\frac{\|\hat{\Sigma}_{a}(z)-\hat{\Sigma}_{a}(z^{*})\|_{2}^{2}}{\lambda_{i}^{2}(\hat{\Sigma}_{a}(z^{*}))}}{1-\max_{a\in[k]}\frac{\|\hat{\Sigma}_{a}(z)-\hat{\Sigma}_{a}(z^{*})\|_{2}}{\lambda_{i}(\hat{\Sigma}_{a}(z^{*}))}}\right)
\displaystyle\lesssim dΣ^a(z)Σ^a(z)2,\displaystyle d\|\hat{\Sigma}_{a}(z)-\hat{\Sigma}_{a}(z^{*})\|_{2}, (32)

where the last inequality is due to the facts that $\lambda_{i}(\hat{\Sigma}_{a}(z^{*}))$ is of constant order, that $\|\hat{\Sigma}_{a}(z)-\hat{\Sigma}_{a}(z^{*})\|_{2}=o(1)$, and the inequality $\log(1+x)\leq x$ for any $x>0$. (32) yields the inequality

j=1nmaxb[k]{zj}Kj(zj,b,z)2θzjθb2θzjθb,(Σb)1(θzjθb)2(z,z)\displaystyle\sum_{j=1}^{n}\max_{b\in[k]\setminus\{z_{j}^{*}\}}\frac{K_{j}(z_{j}^{*},b,z)^{2}\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\|^{2}}{\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma_{b}^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle^{2}\ell(z,z^{\ast})}\lesssim~{} j=1nd2maxa[k]Σ^a(z)Σ^a(z)2Δ2(z,z)\displaystyle\sum_{j=1}^{n}\frac{d^{2}\max_{a\in[k]}\|\hat{\Sigma}_{a}(z)-\hat{\Sigma}_{a}(z^{*})\|^{2}}{\Delta^{2}\ell(z,z^{*})}
\displaystyle\lesssim~{} k2d2Δ2(τn+1Δ2+d2nΔ2).\displaystyle\frac{k^{2}d^{2}}{\Delta^{2}}\left(\frac{\tau}{n}+\frac{1}{\Delta^{2}}+\frac{d^{2}}{n\Delta^{2}}\right).
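The log-determinant perturbation bound (32) can also be checked numerically. The computation above in fact gives the slightly sharper intermediate form $|\log|A|-\log|B||\leq d\|A-B\|/(\lambda_{\min}(A)-\|A-B\|)$ for a symmetric positive definite $A$ and a symmetric perturbation $B$ with $\|A-B\|<\lambda_{\min}(A)$; the following sketch (our own construction, purely illustrative) verifies it:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
G = rng.normal(size=(d, d))
A = (G + G.T) / 2 + d * np.eye(d)             # a well-conditioned SPD matrix
E = rng.normal(size=(d, d))
B = A + 1e-3 * (E + E.T) / 2                  # small symmetric perturbation

gap = abs(np.linalg.slogdet(A)[1] - np.linalg.slogdet(B)[1])
lam_min = np.linalg.eigvalsh(A).min()
op_norm = np.linalg.norm(B - A, 2)
# each eigenvalue ratio contributes at most -log(1 - ||A-B|| / lam_min),
# which gives |log|A| - log|B|| <= d * ||A-B|| / (lam_min - ||A-B||)
assert gap <= d * op_norm / (lam_min - op_norm)
```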

Finally, for (31), by (43) and an argument similar to that for (32), we can get

maxa[k]|log|Σ^a(z)|log|Σa||dk(d+logn)n\displaystyle\max_{a\in[k]}\left|\log|\hat{\Sigma}_{a}(z^{*})|-\log|\Sigma_{a}^{*}|\right|\lesssim d\sqrt{\frac{k(d+\log n)}{n}}

which implies (31). We complete the proof. ∎

Lemma 6.5.

Under the same conditions as in Theorem 3.2, for any sequence $\delta_{n}=o(1)$, we have

\displaystyle\xi_{\text{ideal}}(\delta_{n})\leq n\exp\left(-(1+o(1))\frac{\textsf{SNR}^{\prime 2}}{8}\right)

with probability at least 1nCexp(SNR)1-n^{-C^{\prime}}-\exp(-\textsf{SNR}^{\prime}).

Proof.

Under the conditions of Theorem 3.2, the inequalities (33)-(38) hold with probability at least $1-n^{-C^{\prime}}$. In the remaining proof, we will work on the event that these inequalities hold. Similarly to the proof of Lemma 5.4, we have a decomposition $\xi_{\text{ideal}}\leq\sum_{i=1}^{6}M_{i}$ where

M1:=j=1nb[k]{zj}θzjθb2𝕀{\displaystyle M_{1}:=\sum_{j=1}^{n}\sum_{b\in[k]\setminus\{z_{j}^{*}\}}\left\|\theta^{*}_{z_{j}^{*}}-\theta_{b}^{*}\right\|^{2}\mathbb{I}\bigg{\{}\langle ϵj,(Σb)1(θzjθb)+12ϵj,((Σb)1(Σzj)1)ϵj\displaystyle\epsilon_{j},(\Sigma_{b}^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle+\frac{1}{2}\langle\epsilon_{j},((\Sigma_{b}^{*})^{-1}-(\Sigma_{z_{j}^{*}}^{*})^{-1})\epsilon_{j}\rangle
12log|Σzj|+12log|Σb|1δδ¯2θzjθb,(Σb)1(θzjθb)}\displaystyle-\frac{1}{2}\log|\Sigma_{z_{j}^{*}}^{*}|+\frac{1}{2}\log|\Sigma_{b}^{*}|\leq-\frac{1-\delta-\bar{\delta}}{2}\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma_{b}^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle\bigg{\}}

is the main term and

M2:=j=1nb[k]{zj}θzjθb2𝕀{\displaystyle M_{2}:=\sum_{j=1}^{n}\sum_{b\in[k]\setminus\{z_{j}^{*}\}}\left\|\theta^{*}_{z_{j}^{*}}-\theta_{b}^{*}\right\|^{2}\mathbb{I}\bigg{\{}\langle ϵj,((Σ^b(z))1(Σb)1)(θzjθb)δ¯10θzjθb,(Σb)1(θzjθb)}\displaystyle\epsilon_{j},((\hat{\Sigma}_{b}(z^{*}))^{-1}-(\Sigma_{b}^{*})^{-1})(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle\leq-\frac{\bar{\delta}}{10}\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma_{b}^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle\bigg{\}}
M3:=j=1nb[k]{zj}θzjθb2𝕀{ϵj,(Σ^zj(z))1(θzjθ^zj(z))δ¯10θzjθb,(Σb)1(θzjθb)}\displaystyle M_{3}:=\sum_{j=1}^{n}\sum_{b\in[k]\setminus\{z_{j}^{*}\}}\left\|\theta^{*}_{z_{j}^{*}}-\theta_{b}^{*}\right\|^{2}\mathbb{I}\left\{-\langle\epsilon_{j},(\hat{\Sigma}_{z_{j}^{*}}(z^{*}))^{-1}(\theta_{z_{j}^{*}}^{*}-\hat{\theta}_{z_{j}^{*}}(z^{*}))\rangle\leq-\frac{\bar{\delta}}{10}\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma_{b}^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle\right\}
M4:=j=1nb[k]{zj}θzjθb2𝕀{ϵj,(Σ^b(z))1(θ^b(z)θb)δ¯10θzjθb,(Σb)1(θzjθb)}\displaystyle M_{4}:=\sum_{j=1}^{n}\sum_{b\in[k]\setminus\{z_{j}^{*}\}}\left\|\theta^{*}_{z_{j}^{*}}-\theta_{b}^{*}\right\|^{2}\mathbb{I}\left\{-\langle\epsilon_{j},(\hat{\Sigma}_{b}(z^{*}))^{-1}(\hat{\theta}_{b}(z^{*})-\theta_{b}^{*})\rangle\leq-\frac{\bar{\delta}}{10}\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma_{b}^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle\right\}
M5:=j=1nb[k]{zj}θzjθb2𝕀{12ϵj,((Σ^b(z))1(Σb)1)ϵjδ¯10θzjθb,(Σb)1(θzjθb)}\displaystyle M_{5}:=\sum_{j=1}^{n}\sum_{b\in[k]\setminus\{z_{j}^{*}\}}\left\|\theta^{*}_{z_{j}^{*}}-\theta_{b}^{*}\right\|^{2}\mathbb{I}\left\{\frac{1}{2}\langle\epsilon_{j},((\hat{\Sigma}_{b}(z^{*}))^{-1}-(\Sigma_{b}^{*})^{-1})\epsilon_{j}\rangle\leq-\frac{\bar{\delta}}{10}\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma_{b}^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle\right\}
M6:=j=1nb[k]{zj}θzjθb2𝕀{12ϵj,((Σ^zj(z))1(Σzj)1)ϵjδ¯10θzjθb,(Σb)1(θzjθb)}.\displaystyle M_{6}:=\sum_{j=1}^{n}\sum_{b\in[k]\setminus\{z_{j}^{*}\}}\left\|\theta^{*}_{z_{j}^{*}}-\theta_{b}^{*}\right\|^{2}\mathbb{I}\left\{-\frac{1}{2}\langle\epsilon_{j},((\hat{\Sigma}_{z_{j}^{*}}(z^{*}))^{-1}-(\Sigma_{z_{j}^{*}}^{*})^{-1})\epsilon_{j}\rangle\leq-\frac{\bar{\delta}}{10}\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma_{b}^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle\right\}.

Using the same arguments as in the proof of Lemma 5.4, we can choose some $\bar{\delta}=\bar{\delta}_{n}=o(1)$ which converges to zero slowly enough that

𝔼Minexp((1+o(1))SNR22)for i=2,3,4.\displaystyle\mathbb{E}M_{i}\leq n\exp\left(-(1+o(1))\frac{\textsf{SNR}^{{}^{\prime}2}}{2}\right)\quad\text{for }i=2,3,4.

As for M5M_{5}, by (43) we have

M5j=1nb[k]{zj}θzjθb2𝕀{Cδ¯θzjθb2wj2lognn},\displaystyle M_{5}\leq\sum_{j=1}^{n}\sum_{b\in[k]\setminus\{z_{j}^{*}\}}\left\|\theta^{*}_{z_{j}^{*}}-\theta_{b}^{*}\right\|^{2}\mathbb{I}\left\{C\bar{\delta}\left\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\right\|^{2}\leq\|w_{j}\|^{2}\sqrt{\frac{\log n}{n}}\right\},

where $C$ is a constant and $w_{j}\stackrel{iid}{\sim}\mathcal{N}(0,I_{d})$. Since there exists some constant $C^{\prime}$ such that $\textsf{SNR}^{\prime}\leq C^{\prime}\Delta$, we can choose an appropriate $\bar{\delta}=o(1)$ such that

𝔼M5\displaystyle\mathbb{E}M_{5} j=1nb[k]{zj}θzjθb2{Cδ¯θzjθb2nlognwj2}\displaystyle\leq\sum_{j=1}^{n}\sum_{b\in[k]\setminus\{z_{j}^{*}\}}\left\|\theta^{*}_{z_{j}^{*}}-\theta_{b}^{*}\right\|^{2}\mathbb{P}\left\{C\bar{\delta}\left\|\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*}\right\|^{2}\sqrt{\frac{n}{\log n}}\leq\|w_{j}\|^{2}\right\}
nexp((1+o(1))SNR28).\displaystyle\leq n\exp\left(-(1+o(1))\frac{\textsf{SNR}^{{}^{\prime}2}}{8}\right).

$M_{6}$ can be handled in essentially the same way as $M_{5}$. Finally, for $M_{1}$, using Lemma 7.10, we have

(ϵj,(Σb)1(θzjθb)+12ϵj,((Σb)1(Σzj)1)ϵj\displaystyle\mathbb{P}\bigg{(}\langle\epsilon_{j},(\Sigma_{b}^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle+\frac{1}{2}\langle\epsilon_{j},((\Sigma_{b}^{*})^{-1}-(\Sigma_{z_{j}^{*}}^{*})^{-1})\epsilon_{j}\rangle
12log|Σzj|+12log|Σb|1δδ¯2θzjθb,(Σb)1(θzjθb))\displaystyle\quad\quad-\frac{1}{2}\log|\Sigma_{z_{j}^{*}}^{*}|+\frac{1}{2}\log|\Sigma_{b}^{*}|\leq-\frac{1-\delta-\bar{\delta}}{2}\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma_{b}^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle\bigg{)}
=\displaystyle= (wj,(Σzj)12(Σb)1(θzjθb)+12wj,((Σzj)12(Σb)1(Σzj)12Id)wj\displaystyle\mathbb{P}\bigg{(}\langle w_{j},(\Sigma^{*}_{z^{*}_{j}})^{\frac{1}{2}}(\Sigma_{b}^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle+\frac{1}{2}\langle w_{j},((\Sigma^{*}_{z^{*}_{j}})^{\frac{1}{2}}(\Sigma_{b}^{*})^{-1}(\Sigma^{*}_{z^{*}_{j}})^{\frac{1}{2}}-I_{d})w_{j}\rangle
12log|Σzj|+12log|Σb|1δδ¯2θzjθb,(Σb)1(θzjθb))\displaystyle\quad\quad-\frac{1}{2}\log|\Sigma_{z_{j}^{*}}^{*}|+\frac{1}{2}\log|\Sigma_{b}^{*}|\leq-\frac{1-\delta-\bar{\delta}}{2}\langle\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*},(\Sigma_{b}^{*})^{-1}(\theta_{z_{j}^{*}}^{*}-\theta_{b}^{*})\rangle\bigg{)}
\displaystyle\leq\exp\left(-(1-o(1))\frac{\textsf{SNR}^{\prime 2}_{z^{*}_{j},b}}{8}\right).

Then we have

𝔼M1nexp((1+o(1))SNR28).\displaystyle\mathbb{E}M_{1}\leq n\exp\left(-(1+o(1))\frac{\textsf{SNR}^{{}^{\prime}2}}{8}\right).

Using the Markov’s inequality we complete the proof of Lemma 6.5. ∎

Proof of Theorem 3.2.

By Lemmas 6.2-6.5, the result follows from the same arguments as in the proof of Theorem 2.2; the details are omitted here. ∎

7 Technical Lemmas

This section collects the technical lemmas used in the preceding proofs.

Lemma 7.1.

For any x>0x>0, we have

(χd2d+2dx+2x)ex,\displaystyle\mathbb{P}(\chi_{d}^{2}\geq d+2\sqrt{dx}+2x)\leq e^{-x},
(χd2d2dx)ex.\displaystyle\mathbb{P}(\chi_{d}^{2}\leq d-2\sqrt{dx})\leq e^{-x}.
Proof.

These results are Lemma 1 of [12]. ∎
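For intuition, the two tail bounds above are easy to sanity-check by simulation. The following minimal sketch is illustrative only (the values of $d$, $x$, and the sample size are arbitrary choices, and numpy is assumed to be available):

```python
import numpy as np

# Monte Carlo sanity check of the chi-square tail bounds in Lemma 7.1.
# d, x, and the number of samples are arbitrary illustrative choices.
rng = np.random.default_rng(0)
d, x, n_samples = 10, 1.0, 1_000_000

chi2 = rng.chisquare(df=d, size=n_samples)

upper = np.mean(chi2 >= d + 2 * np.sqrt(d * x) + 2 * x)  # P(chi2_d >= d + 2*sqrt(dx) + 2x)
lower = np.mean(chi2 <= d - 2 * np.sqrt(d * x))          # P(chi2_d <= d - 2*sqrt(dx))

print(f"upper tail {upper:.3e} vs bound e^-x = {np.exp(-x):.3e}")
print(f"lower tail {lower:.3e} vs bound e^-x = {np.exp(-x):.3e}")
```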

Lemma 7.2.

For any z[k]nz^{*}\in[k]^{n} and k[n]k\in[n], consider independent vectors ϵj𝒩(0,Σzj)\epsilon_{j}\sim\mathcal{N}(0,\Sigma^{*}_{z^{*}_{j}}) for any j[n]j\in[n]. Assume there exists a constant λmax>0\lambda_{\max}>0 such that Σaλmax\|\Sigma_{a}^{*}\|\leq\lambda_{\max} for any a[k]a\in[k]. Then, for any constant C>0C^{\prime}>0, there exists some constant C>0C>0 only depending on C,λmaxC^{\prime},\lambda_{\max} such that

maxa[k]j=1n𝕀{zj=a}ϵjj=1n𝕀{zj=a}\displaystyle\max_{a\in[k]}\left\|\frac{\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}\epsilon_{j}}{\sqrt{\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}}}\right\| Cd+logn,\displaystyle\leq C\sqrt{d+\log n}, (33)
maxa[k]1d+j=1n𝕀{zj=a}j=1n𝕀{zj=a}ϵjϵjT\displaystyle\max_{a\in[k]}\frac{1}{d+\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}}\left\|\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}\epsilon_{j}\epsilon_{j}^{T}\right\| C,\displaystyle\leq C, (34)
maxT[n]1|T|jTϵj\displaystyle\max_{T\subset[n]}\left\|\frac{1}{\sqrt{|T|}}\sum_{j\in T}\epsilon_{j}\right\| Cd+n,\displaystyle\leq C\sqrt{d+n}, (35)
maxa[k]maxT{j:zj=a}1|T|(d+j=1n𝕀{zj=a})jTϵj\displaystyle\max_{a\in[k]}\max_{T\subset\{j:z_{j}^{*}=a\}}\left\|\frac{1}{\sqrt{|T|(d+\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\})}}\sum_{j\in T}\epsilon_{j}\right\| C,\displaystyle\leq C, (36)

with probability at least 1nC1-n^{-C^{\prime}}. We have used the convention that 0/0=00/0=0.

Proof.

Note that $\epsilon_{j}$ is sub-Gaussian with parameter $\lambda_{\max}$, which is a constant. The inequalities (33) and (35) are Lemmas A.4 and A.1 in [15], respectively. The inequality (34) is a slight extension of Lemma A.2 in [15]; the extension follows from a standard union bound argument. The proof of (36) is identical to that of (35). ∎

Lemma 7.3.

Consider the same assumptions as in Lemma 7.2. Assume additionally that $\min_{a\in[k]}\sum_{j=1}^{n}\mathbb{I}\{z^{*}_{j}=a\}\geq\frac{\alpha n}{k}$ for some constant $\alpha>0$ and that $\frac{k(d+\log n)}{n}=o(1)$. Then, for any constant $C'>0$, there exists some constant $C>0$ depending only on $\alpha,C',\lambda_{\max}$ such that

maxa[k]1j=1n𝕀{zj=a}j=1n𝕀{zj=a}ϵjϵjTΣa\displaystyle\max_{a\in[k]}\left\|\frac{1}{\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}}\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}\epsilon_{j}\epsilon_{j}^{T}-\Sigma_{a}^{*}\right\| Ck(d+logn)n,\displaystyle\leq C\sqrt{\frac{k(d+\log n)}{n}}, (37)

with probability at least 1nC1-n^{-C^{\prime}}.

Proof.

Note that we have ϵj=Σ12zjηj\epsilon_{j}=\Sigma^{*\frac{1}{2}}_{z^{*}_{j}}\eta_{j} where ηjiid𝒩(0,Id)\eta_{j}\stackrel{{\scriptstyle iid}}{{\sim}}\mathcal{N}(0,I_{d}) for any j[n]j\in[n]. Since maxaΣaλmax\max_{a}\|\Sigma^{*}_{a}\|\leq\lambda_{\max}, we have

maxa[k]1j=1n𝕀{zj=a}j=1n𝕀{zj=a}ϵjϵjTΣaλmaxmaxa[k]1j=1n𝕀{zj=a}j=1n𝕀{zj=a}ηjηjTId.\displaystyle\max_{a\in[k]}\left\|\frac{1}{\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}}\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}\epsilon_{j}\epsilon_{j}^{T}-\Sigma_{a}^{*}\right\|\leq\lambda_{\max}\max_{a\in[k]}\left\|\frac{1}{\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}}\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}\eta_{j}\eta_{j}^{T}-I_{d}\right\|.

Define

Qa=1j=1n𝕀{zj=a}j=1n𝕀{zj=a}ηjηjTId.\displaystyle Q_{a}=\frac{1}{\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}}\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}\eta_{j}\eta_{j}^{T}-I_{d}.

Take $S^{d-1}=\{y\in\mathbb{R}^{d}:\|y\|=1\}$ and let $N_{\epsilon}=\{v_{1},\cdots,v_{|N_{\epsilon}|}\}$ be an $\epsilon$-covering of $S^{d-1}$. In particular, we pick $\epsilon<\frac{1}{4}$, so that $|N_{\epsilon}|\leq 9^{d}$. By the definition of the $\epsilon$-covering, we have

Qa112ϵmaxi=1,,|Nϵ||viTQavi|2maxi=1,,|Nϵ||viTQavi|.\displaystyle\left\|Q_{a}\right\|\leq\frac{1}{1-2\epsilon}\max_{i=1,\cdots,|N_{\epsilon}|}|v_{i}^{T}Q_{a}v_{i}|\leq 2\max_{i=1,\cdots,|N_{\epsilon}|}|v_{i}^{T}Q_{a}v_{i}|.

For any vNϵv\in N_{\epsilon},

vTQav=1j=1n𝕀{zj=a}j=1n𝕀{zj=a}(vTηjηjTv1).\displaystyle v^{T}Q_{a}v=\frac{1}{\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}}\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}(v^{T}\eta_{j}\eta_{j}^{T}v-1).

Denote na=j=1n𝕀{zj=a}n_{a}=\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}. Then j=1n𝕀{zj=a}vTηjηjTvχ2na\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}v^{T}\eta_{j}\eta_{j}^{T}v\sim\chi^{2}_{n_{a}}. Using Lemma 7.1, we have

P(maxa[k]Qat)\displaystyle P(\max_{a\in[k]}\|Q_{a}\|\geq t) a=1kP(Qat)\displaystyle\leq\sum_{a=1}^{k}P(\|Q_{a}\|\geq t)
a=1ki=1|Nϵ|P(|viTQavi|t/2)\displaystyle\leq\sum_{a=1}^{k}\sum_{i=1}^{|N_{\epsilon}|}P(|v_{i}^{T}Q_{a}v_{i}|\geq t/2)
a=1k2exp{na8min{t,t2}+dlog9}.\displaystyle\leq\sum_{a=1}^{k}2\exp\Biggl{\{}-\frac{n_{a}}{8}\min\{t,t^{2}\}+d\log 9\Biggr{\}}.

Since k(d+logn)n=o(1)\frac{k(d+\log n)}{n}=o(1) and naαn/kn_{a}\geq\alpha n/k where α\alpha is a constant, we can take t=Ck(d+logn)nt=C^{\prime\prime}\sqrt{\frac{k(d+\log n)}{n}} for some large constant CC^{\prime\prime} and the proof is complete. ∎
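To see the $\sqrt{k(d+\log n)/n}$ scaling of (37) concretely, one can simulate a single cluster with identity covariance and track the operator-norm deviation of its sample covariance. The sketch below is illustrative only; the sizes are arbitrary, and the single-cluster setting corresponds to $k=1$ and $\Sigma^{*}=I_{d}$:

```python
import numpy as np

# Operator-norm deviation of a sample covariance from the identity,
# illustrating the sqrt(d/n)-type concentration behind (37) when k = 1.
rng = np.random.default_rng(1)
d = 20
for n in [1_000, 10_000, 100_000]:
    eta = rng.standard_normal((n, d))   # eta_j ~ N(0, I_d), j = 1, ..., n
    Q = eta.T @ eta / n - np.eye(d)     # the matrix Q_a from the proof above
    dev = np.linalg.norm(Q, ord=2)      # operator norm ||Q_a||
    print(f"n = {n:>7}: ||Q|| = {dev:.4f},  sqrt(d/n) = {np.sqrt(d / n):.4f}")
```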

Lemma 7.4.

Consider the same assumptions as in Lemma 7.2. Then, for any s=o(n)s=o(n) and for any constant C>0C^{\prime}>0, there exists some constant C>0C>0 only depending on C,λmaxC^{\prime},\lambda_{\max} such that

maxT[n]:|T|s1|T|logn|T|+min{1,|T|}djTϵjϵjTC,\displaystyle\max_{T\subset[n]:|T|\leq s}\frac{1}{|T|\log\frac{n}{|T|}+\min\{1,\sqrt{\left|T\right|}\}d}\left\|\sum_{j\in T}\epsilon_{j}\epsilon_{j}^{T}\right\|\leq C, (38)

with probability at least 1nC1-n^{-C^{\prime}}. We have used the convention that 0/0=00/0=0.

Proof.

Consider any a[s]a\in[s] and a fixed T[n]T\subset[n] such that |T|=a\left|T\right|=a. Similar to the proof of Lemma 7.3, we can take Sd1={yd:y=1}S^{d-1}=\{y\in\mathbb{R}^{d}:\|y\|=1\} and its ϵ\epsilon-covering NϵN_{\epsilon} with ϵ<14\epsilon<\frac{1}{4} and |Nϵ|9d|N_{\epsilon}|\leq 9^{d}. Then we have

jTϵjϵjT=supw=1jT(wTϵj)22maxwNϵjT(wTϵj)2.\displaystyle\|\sum_{j\in T}\epsilon_{j}\epsilon_{j}^{T}\|=\sup_{\left\|{w}\right\|=1}\sum_{j\in T}(w^{T}\epsilon_{j})^{2}\leq 2\max_{w\in N_{\epsilon}}\sum_{j\in T}(w^{T}\epsilon_{j})^{2}.

Note that wTϵj/λmaxw^{T}\epsilon_{j}/\sqrt{\lambda_{\max}} is a sub-Gaussian random variable with parameter 1. By [10], for any fixed wNϵw\in N_{\epsilon}, we have

(jT(wTϵj)2λmax(a+2at+2t))exp(t).\displaystyle\mathbb{P}\left(\sum_{j\in T}(w^{T}\epsilon_{j})^{2}\geq\lambda_{\max}\left(a+2\sqrt{at}+2t\right)\right)\leq\exp\left(-t\right).

Since $a=o(n)$, there exists a constant $C_{0}$ such that $2a\leq C_{0}a\log\frac{n}{a}$. We can take $t=\tilde{C}(a\log\frac{n}{a}+d)$ with $\tilde{C}=\frac{C}{16}-\frac{C_{0}}{4}$; then $a+2\sqrt{at}+2t\leq\frac{C}{4}(a\log\frac{n}{a}+d)$. Thus,

(jT(wTϵj)2C4(alogna+d))exp(C~(alogna+d)).\displaystyle\mathbb{P}\left(\sum_{j\in T}(w^{T}\epsilon_{j})^{2}\geq\frac{C}{4}(a\log\frac{n}{a}+d)\right)\leq\exp\bigg{(}-\tilde{C}(a\log\frac{n}{a}+d)\bigg{)}.

Hence, we have

(jTϵjϵjTC2(alogna+d))9dexp(C~(alogna+d)).\displaystyle\mathbb{P}\left(\|\sum_{j\in T}\epsilon_{j}\epsilon_{j}^{T}\|\geq\frac{C}{2}(a\log\frac{n}{a}+d)\right)\leq 9^{d}\exp\bigg{(}-\tilde{C}(a\log\frac{n}{a}+d)\bigg{)}.

As a result,

{maxT[n],1|T|s1|T|logn|T|+djTϵjϵjTC}\displaystyle\mathbb{P}\bigg{\{}\max_{T\subset[n],1\leq|T|\leq s}\frac{1}{|T|\log\frac{n}{|T|}+d}\|\sum_{j\in T}\epsilon_{j}\epsilon_{j}^{T}\|\geq C\bigg{\}}\leq a=1s{max|T|=ajTϵjϵjTC(alogna+d)}\displaystyle\sum_{a=1}^{s}\mathbb{P}\biggl{\{}\max_{|T|=a}\|\sum_{j\in T}\epsilon_{j}\epsilon_{j}^{T}\|\geq C(a\log\frac{n}{a}+d)\biggr{\}}
\displaystyle\leq a=1s(na)max|T|=a{jTϵjϵjTC(alogna+d)}\displaystyle\sum_{a=1}^{s}\binom{n}{a}\max_{|T|=a}\mathbb{P}\biggl{\{}\|\sum_{j\in T}\epsilon_{j}\epsilon_{j}^{T}\|\geq C(a\log\frac{n}{a}+d)\biggr{\}}
a=1s(na)9dexp(C~(alogna+d)).\displaystyle\leq\sum_{a=1}^{s}\binom{n}{a}9^{d}\exp\bigg{(}-\tilde{C}(a\log\frac{n}{a}+d)\bigg{)}.

Since $a\log\frac{n}{a}$ is increasing in $a$ for $a\in[1,s]$ and $a\log\frac{n}{a}\geq\log n\geq\log s$, the choice $\tilde{C}=3+C'$, that is, $C=16C'+4C_{0}+48$, yields the desired result.

Finally, to allow $|T|=0$, note that $\min\{1,\sqrt{|T|}\}d=d$ whenever $|T|\geq 1$, while the case $|T|=0$ is trivial by the convention $0/0=0$. The proof is complete. ∎

Lemma 7.5.

For any $z^{*}\in[k]^{n}$ and $k\in[n]$, assume $\min_{a\in[k]}\sum_{j=1}^{n}\mathbb{I}\{z^{*}_{j}=a\}\geq\frac{\alpha n}{k}$ and $\ell(z,z^{*})=o(\frac{n\Delta^{2}}{k})$. Then

maxa[k]j=1n𝕀{zj=a}j=1n𝕀{zj=a}2.\displaystyle\max_{a\in[k]}\frac{\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}}{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\}}\leq 2. (39)
Proof.

For any $z\in[k]^{n}$ such that $\ell(z,z^{*})=o(\frac{n\Delta^{2}}{k})$ and any $a\in[k]$, we have

j=1n𝕀{zj=a}\displaystyle\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\} j=1n𝕀{zj=a}j=1n𝕀{zjzj}\displaystyle\geq\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}-\sum_{j=1}^{n}\mathbb{I}\{z_{j}\neq z_{j}^{*}\}
j=1n𝕀{zj=a}(z,z)Δ2\displaystyle\geq\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}-\frac{\ell(z,z^{*})}{\Delta^{2}}
αn2k,\displaystyle\geq\frac{\alpha n}{2k}, (40)

which implies

j=1n𝕀{zj=a}j=1n𝕀{zj=a}\displaystyle\frac{\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}}{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\}} j=1n𝕀{zj=a}+j=1n𝕀{zjzj}j=1n𝕀{zj=a}\displaystyle\leq\frac{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\}+\sum_{j=1}^{n}\mathbb{I}\{z_{j}\neq z_{j}^{*}\}}{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\}}
1+αn/2kj=1n𝕀{zj=a}\displaystyle\leq 1+\frac{\alpha n/2k}{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\}}
2.\displaystyle\leq 2.

Thus, we obtain (39). ∎

The next lemma is Weyl's theorem; we omit its proof.

Lemma 7.6 (Weyl’s Theorem).

Let $A$ and $B$ be any two $d\times d$ symmetric real matrices. Then for any $1\leq i\leq d$, we have

λi(A+B)λd(A)+λi(B).\displaystyle\lambda_{i}(A+B)\leq\lambda_{d}(A)+\lambda_{i}(B).
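As a quick numerical illustration (not part of the original statement), Weyl's inequality can be checked on random symmetric matrices, with eigenvalues sorted in increasing order as in the lemma:

```python
import numpy as np

# Check lambda_i(A + B) <= lambda_d(A) + lambda_i(B) for random symmetric A, B.
rng = np.random.default_rng(2)
d = 6
A = rng.standard_normal((d, d)); A = (A + A.T) / 2
B = rng.standard_normal((d, d)); B = (B + B.T) / 2

lam_A = np.linalg.eigvalsh(A)        # ascending eigenvalues of A
lam_B = np.linalg.eigvalsh(B)
lam_AB = np.linalg.eigvalsh(A + B)   # ascending eigenvalues of A + B

assert np.all(lam_AB <= lam_A[-1] + lam_B + 1e-10)  # lambda_d(A) = lam_A[-1]
print("Weyl's inequality verified for this draw.")
```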

In the following lemma, we analyze the estimation errors of $\{\Sigma^{*}_{a}\}_{a\in[k]}$ under the anisotropic GMMs. For any $z\in[k]^{n}$ and any $a\in[k]$, recall the definitions

θ^a(z)\displaystyle\hat{\theta}_{a}(z) =j[n]Yj𝕀{zj=a}j[n]𝕀{zj=a},\displaystyle=\frac{\sum_{j\in[n]}Y_{j}{\mathbb{I}\left\{{z_{j}=a}\right\}}}{\sum_{j\in[n]}{\mathbb{I}\left\{{z_{j}=a}\right\}}},
Σ^a(z)\displaystyle\hat{\Sigma}_{a}(z) =j[n](Yjθ^a(z))(Yjθ^a(z))T𝕀{zj=a}j[n]𝕀{zj=a}.\displaystyle=\frac{\sum_{j\in[n]}(Y_{j}-\hat{\theta}_{a}(z))(Y_{j}-\hat{\theta}_{a}(z))^{T}{\mathbb{I}\left\{{z_{j}=a}\right\}}}{\sum_{j\in[n]}{\mathbb{I}\left\{{z_{j}=a}\right\}}}.
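In code, these plug-in estimators amount to per-cluster sample means and covariances. The sketch below is a minimal illustration only (it uses zero-indexed labels $0,\ldots,k-1$ in place of $[k]$, and the data-generating choices in the usage example are arbitrary):

```python
import numpy as np

def plugin_estimates(Y, z, k):
    """Per-cluster plug-in estimates theta_hat_a(z) and Sigma_hat_a(z).

    Y: (n, d) data matrix; z: length-n integer labels in {0, ..., k-1}.
    """
    theta_hat, Sigma_hat = [], []
    for a in range(k):
        Ya = Y[z == a]                               # points assigned to cluster a by z
        center = Ya.mean(axis=0)                     # theta_hat_a(z)
        resid = Ya - center
        Sigma_hat.append(resid.T @ resid / len(Ya))  # Sigma_hat_a(z)
        theta_hat.append(center)
    return theta_hat, Sigma_hat

# Usage: two well-separated planted clusters.
rng = np.random.default_rng(3)
Y = np.vstack([rng.standard_normal((50, 2)),
               rng.standard_normal((50, 2)) + 5.0])
z = np.repeat([0, 1], 50)
theta_hat, Sigma_hat = plugin_estimates(Y, z, k=2)
print(theta_hat[1])  # approximately (5, 5)
```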
Lemma 7.7.

For any $z^{*}\in[k]^{n}$ and $k\in[n]$, consider independent vectors $Y_{j}=\theta^{*}_{z^{*}_{j}}+\epsilon_{j}$ where $\epsilon_{j}\sim\mathcal{N}(0,\Sigma^{*}_{z^{*}_{j}})$ for any $j\in[n]$. Assume there exist constants $\lambda_{\min},\lambda_{\max}>0$ such that $\lambda_{\min}\leq\lambda_{1}(\Sigma_{a}^{*})\leq\lambda_{d}(\Sigma_{a}^{*})\leq\lambda_{\max}$ for any $a\in[k]$, and a constant $\alpha>0$ such that $\min_{a\in[k]}\sum_{j=1}^{n}\mathbb{I}\{z^{*}_{j}=a\}\geq\frac{\alpha n}{k}$. Assume $\frac{k(d+\log n)}{n}=o(1)$ and $\frac{\Delta}{k}\rightarrow\infty$. Assume (33)-(38) hold. Then for any $\tau=o(n)$ and for any constant $C'>0$, there exists some constant $C>0$ depending only on $\alpha,\lambda_{\max},C'$ such that

maxa[k]θ^a(z)θa\displaystyle\max_{a\in[k]}\left\|\hat{\theta}_{a}(z^{*})-\theta_{a}^{*}\right\| Ck(d+logn)n,\displaystyle\leq C\sqrt{\frac{k(d+\log n)}{n}}, (41)
maxa[k]θ^a(z)θ^a(z)\displaystyle\max_{a\in[k]}\left\|\hat{\theta}_{a}(z)-\hat{\theta}_{a}(z^{*})\right\| C(knΔ(z,z)+kd+nnΔ(z,z)),\displaystyle\leq C\left(\frac{k}{n\Delta}\ell(z,z^{*})+\frac{k\sqrt{d+n}}{n\Delta}\sqrt{\ell(z,z^{*})}\right), (42)
maxa[k]Σ^a(z)Σa\displaystyle\max_{a\in[k]}\left\|\hat{\Sigma}_{a}(z^{*})-\Sigma_{a}^{*}\right\| Ck(d+logn)n,\displaystyle\leq C\sqrt{\frac{k(d+\log n)}{n}}, (43)
maxa[k]Σ^a(z)Σ^a(z)\displaystyle\max_{a\in[k]}\left\|\hat{\Sigma}_{a}(z)-\hat{\Sigma}_{a}(z^{*})\right\| C(kn(z,z)+kn(z,z)nΔ+kdnΔ(z,z)),\displaystyle\leq C\left(\frac{k}{n}\ell(z,z^{*})+\frac{k\sqrt{n\ell(z,z^{*})}}{n\Delta}+\frac{kd}{n\Delta}\sqrt{\ell(z,z^{*})}\right), (44)

for all zz such that (z,z)τ\ell(z,z^{*})\leq\tau.

Proof.

Using (33) we obtain (41). By the same argument as for (118) in [8], we can obtain (42). By (33), (37), and (41), we can obtain (43). In the remainder of the proof, we establish (44).

Since k(d+logn)n=o(1)\frac{k(d+\log n)}{n}=o(1), we have Σ^a(z)1\|\hat{\Sigma}_{a}(z^{*})\|\lesssim 1 for any a[k]a\in[k]. The difference Σ^a(z)Σ^a(z)\hat{\Sigma}_{a}(z)-\hat{\Sigma}_{a}(z^{*}) will be decomposed into several terms. We notice that

Σ^a(z)Σ^a(z)S1+S2,\displaystyle\left\|\hat{\Sigma}_{a}(z)-\hat{\Sigma}_{a}(z^{*})\right\|\leq S_{1}+S_{2}, (45)

where

S1:=1𝕀{zj=a}j=1n𝕀{zj=a}((Yjθ^a(z))(Yjθ^a(z))T(Yjθ^a(z))(Yjθ^a(z))T),\displaystyle S_{1}:=\bigg{\|}\frac{1}{\sum\mathbb{I}\{z_{j}=a\}}\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\}\bigg{(}(Y_{j}-\hat{\theta}_{a}(z))(Y_{j}-\hat{\theta}_{a}(z))^{T}-(Y_{j}-\hat{\theta}_{a}(z^{*}))(Y_{j}-\hat{\theta}_{a}(z^{*}))^{T}\bigg{)}\bigg{\|},

and

S2:=\displaystyle S_{2}:=\bigg{\|} (1𝕀{zj=a}1𝕀{zj=a})j=1n𝕀{zj=a}(Yjθ^a(z))(Yjθ^a(z))T.\displaystyle\left(\frac{1}{\sum\mathbb{I}\{z_{j}=a\}}-\frac{1}{\sum\mathbb{I}\{z_{j}^{*}=a\}}\right)\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}(Y_{j}-\hat{\theta}_{a}(z^{*}))(Y_{j}-\hat{\theta}_{a}(z^{*}))^{T}\bigg{\|}.

Also, we notice that

S1L1+L2+L3,\displaystyle S_{1}\leq L_{1}+L_{2}+L_{3}, (46)

where

L1\displaystyle L_{1} :=1j=1n𝕀{zj=a}j=1n𝕀{zj=zj=a}((Yjθ^a(z))(Yjθ^a(z))T(Yjθ^a(z))(Yjθ^a(z))T),\displaystyle:=\bigg{\|}\frac{1}{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\}}\sum_{j=1}^{n}\mathbb{I}\{z_{j}=z_{j}^{*}=a\}\bigg{(}(Y_{j}-\hat{\theta}_{a}(z))(Y_{j}-\hat{\theta}_{a}(z))^{T}-(Y_{j}-\hat{\theta}_{a}(z^{*}))(Y_{j}-\hat{\theta}_{a}(z^{*}))^{T}\bigg{)}\bigg{\|},
L2\displaystyle L_{2} :=1j=1n𝕀{zj=a}j=1n𝕀{zj=a,zja}(Yjθ^a(z))(Yjθ^a(z))T,\displaystyle:=\bigg{\|}\frac{1}{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\}}\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a,z_{j}^{*}\neq a\}(Y_{j}-\hat{\theta}_{a}(z))(Y_{j}-\hat{\theta}_{a}(z))^{T}\bigg{\|},
L3\displaystyle L_{3} :=1j=1n𝕀{zj=a}j=1n𝕀{zja,zj=a}(Yjθ^a(z))(Yjθ^a(z))T.\displaystyle:=\bigg{\|}\frac{1}{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\}}\sum_{j=1}^{n}\mathbb{I}\{z_{j}\neq a,z_{j}^{*}=a\}(Y_{j}-\hat{\theta}_{a}(z^{*}))(Y_{j}-\hat{\theta}_{a}(z^{*}))^{T}\bigg{\|}.

For L1L_{1}, we have

L1\displaystyle L_{1} 1j=1n𝕀{zj=a}j=1n𝕀{zj=zj=a}(θ^a(z)θ^a(z))(θ^a(z)θ^a(z))T\displaystyle\leq\bigg{\|}\frac{1}{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\}}\sum_{j=1}^{n}\mathbb{I}\{z_{j}=z_{j}^{*}=a\}(\hat{\theta}_{a}(z)-\hat{\theta}_{a}(z^{*}))(\hat{\theta}_{a}(z)-\hat{\theta}_{a}(z^{*}))^{T}\bigg{\|}
+21j=1n𝕀{zj=a}j=1n𝕀{zj=zj=a}(Yjθ^a(z))(θ^a(z)θ^a(z))T\displaystyle\quad+2\bigg{\|}\frac{1}{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\}}\sum_{j=1}^{n}\mathbb{I}\{z_{j}=z_{j}^{*}=a\}(Y_{j}-\hat{\theta}_{a}(z^{*}))(\hat{\theta}_{a}(z)-\hat{\theta}_{a}(z^{*}))^{T}\bigg{\|}
θ^a(z)θ^a(z)2j=1n𝕀{zj=a}j=1n𝕀{zj=a}+θaθ^a(z)θ^a(z)θ^a(z)j=1n𝕀{zj=a}j=1n𝕀{zj=a}\displaystyle\lesssim\left\|\hat{\theta}_{a}(z)-\hat{\theta}_{a}(z^{*})\right\|^{2}\frac{\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}}{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\}}+\left\|\theta_{a}^{*}-\hat{\theta}_{a}(z^{*})\right\|\left\|\hat{\theta}_{a}(z)-\hat{\theta}_{a}(z^{*})\right\|\frac{\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}}{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\}}
+θ^a(z)θ^a(z)1j=1n𝕀{zj=a}j=1n𝕀{zj=zj=a}ϵj.\displaystyle\quad+\left\|\hat{\theta}_{a}(z)-\hat{\theta}_{a}(z^{*})\right\|\left\|\frac{1}{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\}}\sum_{j=1}^{n}\mathbb{I}\{z_{j}=z_{j}^{*}=a\}\epsilon_{j}\right\|. (47)

By (36), (39), (40), we have uniformly for any a[k]a\in[k],

1j=1n𝕀{zj=a}j=1n𝕀{zj=zj=a}ϵj\displaystyle\left\|\frac{1}{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\}}\sum_{j=1}^{n}\mathbb{I}\{z_{j}=z_{j}^{*}=a\}\epsilon_{j}\right\| j=1n𝕀{zj=zj=a}j=1n𝕀{zj=a}d+j=1n𝕀{zj=a}\displaystyle\lesssim\frac{\sqrt{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=z_{j}^{*}=a\}}}{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\}}\sqrt{d+\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}}
1.\displaystyle\lesssim 1. (48)

Since maxa[k]θ^a(z)θa=o(1)\max_{a\in[k]}\left\|\hat{\theta}_{a}(z^{*})-\theta_{a}^{*}\right\|=o(1), by (39), (42), (41), (47), and (48), we have uniformly for any a[k]a\in[k],

L1θ^a(z)θ^a(z)knΔ(z,z)+kd+nnΔ(z,z).\displaystyle L_{1}\lesssim\left\|\hat{\theta}_{a}(z)-\hat{\theta}_{a}(z^{*})\right\|\lesssim\frac{k}{n\Delta}\ell(z,z^{*})+\frac{k\sqrt{d+n}}{n\Delta}\sqrt{\ell(z,z^{*})}. (49)

To bound L2L_{2}, we first give the following simple fact. For any positive integer mm and any {uj}j[m],{vj}j[m]d\{u_{j}\}_{j\in[m]},\{v_{j}\}_{j\in[m]}\in\mathbb{R}^{d}, we have j[m](uj+vj)(uj+vj)T2j[m]ujujT+2j[m]vjvjT\|\sum_{j\in[m]}(u_{j}+v_{j})(u_{j}+v_{j})^{T}\|\leq 2\|\sum_{j\in[m]}u_{j}u_{j}^{T}\|+2\|\sum_{j\in[m]}v_{j}v_{j}^{T}\|. Hence, for L2L_{2}, we have the following decomposition

L22R1+2R2,\displaystyle L_{2}\leq 2R_{1}+2R_{2}, (50)

where

R1\displaystyle R_{1} :=1j=1n𝕀{zj=a}j=1n𝕀{zj=a,zja}(Yjθa)(Yjθa)T,\displaystyle:=\bigg{\|}\frac{1}{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\}}\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a,z_{j}^{*}\neq a\}(Y_{j}-\theta_{a}^{*})(Y_{j}-\theta_{a}^{*})^{T}\bigg{\|},
R2\displaystyle R_{2} :=1j=1n𝕀{zj=a}j=1n𝕀{zj=a,zja}(θaθ^a(z))(θaθ^a(z))T.\displaystyle:=\bigg{\|}\frac{1}{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\}}\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a,z_{j}^{*}\neq a\}(\theta_{a}^{*}-\hat{\theta}_{a}(z))(\theta_{a}^{*}-\hat{\theta}_{a}(z))^{T}\bigg{\|}.

Since maxa[k]j=1n𝕀{zj=a,zja}(z,z)Δ2\max_{a\in[k]}\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a,z_{j}^{*}\neq a\}\leq\frac{\ell(z,z^{*})}{\Delta^{2}}, we have

R2\displaystyle R_{2} θaθ^a(z)2j=1n𝕀{zj=a,zja}j=1n𝕀{zj=a}\displaystyle\leq\left\|\theta_{a}^{*}-\hat{\theta}_{a}(z)\right\|^{2}\frac{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a,z_{j}^{*}\neq a\}}{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\}}
(θ^a(z)θ^a(z)2+θ^a(z)θa2)k(z,z)nΔ2.\displaystyle\lesssim\left(\left\|\hat{\theta}_{a}(z)-\hat{\theta}_{a}(z^{*})\right\|^{2}+\left\|\hat{\theta}_{a}(z^{*})-\theta_{a}^{*}\right\|^{2}\right)\frac{k\ell(z,z^{*})}{n\Delta^{2}}. (51)

By (38) and the fact that $\max_{a\in[k]}\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a,z_{j}^{*}\neq a\}\leq\frac{\ell(z,z^{*})}{\Delta^{2}}$, we also have

R1\displaystyle R_{1} 21j=1n𝕀{zj=a}j=1n𝕀{zj=a,zja}(θzjθzj)(θzjθzj)T\displaystyle\leq 2\bigg{\|}\frac{1}{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\}}\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a,z_{j}^{*}\neq a\}(\theta^{*}_{z^{*}_{j}}-\theta^{*}_{z_{j}})(\theta^{*}_{z^{*}_{j}}-\theta^{*}_{z_{j}})^{T}\bigg{\|}
+21j=1n𝕀{zj=a}j=1n𝕀{zj=a,zja}ϵjϵjT\displaystyle\quad+2\bigg{\|}\frac{1}{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\}}\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a,z_{j}^{*}\neq a\}\epsilon_{j}\epsilon_{j}^{T}\bigg{\|}
2j=1n𝕀{zj=a,zja}θzjθzj2j=1n𝕀{zj=a}+21j=1n𝕀{zj=a}j=1n𝕀{zj=a,zja}ϵjϵjT\displaystyle\leq 2\frac{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a,z_{j}^{*}\neq a\}\|\theta^{*}_{z^{*}_{j}}-\theta^{*}_{z_{j}}\|^{2}}{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\}}+2\bigg{\|}\frac{1}{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\}}\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a,z_{j}^{*}\neq a\}\epsilon_{j}\epsilon_{j}^{T}\bigg{\|}
k(z,z)n+(z,z)Δ2lognΔ2(z,z)+d(z,z)Δ2n/k.\displaystyle\lesssim\frac{k\ell(z,z^{*})}{n}+\frac{\frac{\ell(z,z^{*})}{\Delta^{2}}\log\frac{n\Delta^{2}}{\ell(z,z^{*})}+d\sqrt{\frac{\ell(z,z^{*})}{\Delta^{2}}}}{n/k}.

We are going to simplify the above bounds for R1,R2R_{1},R_{2}. Under the assumption that k(d+logn)n=o(1)\frac{k(d+\log n)}{n}=o(1), Δ/k\Delta/k\rightarrow\infty, and (z,z)τ=o(n)\ell(z,z^{*})\leq\tau=o(n), we have maxa[k]θ^a(z)θ^a(z)=o(1)\max_{a\in[k]}\|\hat{\theta}_{a}(z)-\hat{\theta}_{a}(z^{*})\|=o(1), maxa[k]θ^a(z)θa=o(1)\max_{a\in[k]}\|\hat{\theta}_{a}(z^{*})-\theta_{a}^{*}\|=o(1), and k(z,z)nΔ2=o(1)\frac{k\ell(z,z^{*})}{n\Delta^{2}}=o(1). Hence R2k(z,z)nΔ2R_{2}\lesssim\frac{k\ell(z,z^{*})}{n\Delta^{2}}. Also we have

\frac{k\ell(z,z^{*})}{n\Delta^{2}}\log\frac{n\Delta^{2}}{\ell(z,z^{*})}=\frac{k\sqrt{\ell(z,z^{*})}}{n\Delta}\sqrt{\frac{\ell(z,z^{*})}{\Delta^{2}}\left(\log\frac{n\Delta^{2}}{\ell(z,z^{*})}\right)^{2}}\leq\frac{k\sqrt{n\ell(z,z^{*})}}{n\Delta},

where in the last inequality we use the fact that $x(\log(n/x))^{2}$ is an increasing function of $x$ when $0<x=o(n)$. Then,

L2kn(z,z)nΔ+kn(z,z)+kdnΔ(z,z).\displaystyle L_{2}\lesssim\frac{k\sqrt{n\ell(z,z^{*})}}{n\Delta}+\frac{k}{n}\ell(z,z^{*})+\frac{kd}{n\Delta}\sqrt{\ell(z,z^{*})}.

Since L3L_{3} is similar to L2L_{2}, by (46) we have uniformly for any a[k]a\in[k]

S1kn(z,z)nΔ+kn(z,z)+kdnΔ(z,z).\displaystyle S_{1}\lesssim\frac{k\sqrt{n\ell(z,z^{*})}}{n\Delta}+\frac{k}{n}\ell(z,z^{*})+\frac{kd}{n\Delta}\sqrt{\ell(z,z^{*})}. (52)

To bound S2S_{2}, by (70) in [8], we have uniformly for any a[k]a\in[k],

S_{2}=\frac{\left|\sum_{j=1}^{n}\mathbb{I}\{z_{j}^{*}=a\}-\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\}\right|}{\sum_{j=1}^{n}\mathbb{I}\{z_{j}=a\}}\left\|\hat{\Sigma}_{a}(z^{*})\right\|\lesssim\frac{k}{n}\frac{\ell(z,z^{*})}{\Delta^{2}},

where we use (43). Since $\frac{k}{n}\frac{\ell(z,z^{*})}{\Delta^{2}}\lesssim\frac{k\sqrt{n\ell(z,z^{*})}}{n\Delta}$, by (45) and the fact that $\ell(z,z^{*})\leq\tau=o(n)$, we have

maxa[k]Σ^a(z)Σ^a(z)kn(z,z)nΔ+kn(z,z)+kdnΔ(z,z).\displaystyle\max_{a\in[k]}\left\|\hat{\Sigma}_{a}(z)-\hat{\Sigma}_{a}(z^{*})\right\|\lesssim\frac{k\sqrt{n\ell(z,z^{*})}}{n\Delta}+\frac{k}{n}\ell(z,z^{*})+\frac{kd}{n\Delta}\sqrt{\ell(z,z^{*})}.

Lemma 7.8.

Under the same assumptions as in Lemma 7.7, if we additionally assume $kd=O(\sqrt{n})$ and $\tau=o(n/k)$, there exists some constant $C>0$ depending only on $\alpha,\lambda_{\min},\lambda_{\max},C'$ such that

maxa[k](Σ^a(z))1(Σ^a(z))1\displaystyle\max_{a\in[k]}\left\|(\hat{\Sigma}_{a}(z))^{-1}-(\hat{\Sigma}_{a}(z^{*}))^{-1}\right\| C(kn(z,z)+kn(z,z)nΔ+kdnΔ(z,z)).\displaystyle\leq C\left(\frac{k}{n}\ell(z,z^{*})+\frac{k\sqrt{n\ell(z,z^{*})}}{n\Delta}+\frac{kd}{n\Delta}\sqrt{\ell(z,z^{*})}\right). (53)
Proof.

By (43) we have maxa[k]Σ^a(z),maxa[k](Σ^a(z))11\max_{a\in[k]}\|\hat{\Sigma}_{a}(z^{*})\|,\max_{a\in[k]}\|(\hat{\Sigma}_{a}(z^{*}))^{-1}\|\lesssim 1. By (44) we also have maxa[k]Σ^a(z),maxa[k](Σ^a(z))11\max_{a\in[k]}\|\hat{\Sigma}_{a}(z)\|,\max_{a\in[k]}\|(\hat{\Sigma}_{a}(z))^{-1}\|\lesssim 1. Hence,

maxa[k](Σ^a(z))1(Σ^a(z))1\displaystyle\max_{a\in[k]}\left\|(\hat{\Sigma}_{a}(z))^{-1}-(\hat{\Sigma}_{a}(z^{*}))^{-1}\right\|\leq maxa[k](Σ^a(z))1Σ^a(z)Σ^a(z)(Σ^a(z))1\displaystyle\max_{a\in[k]}\left\|(\hat{\Sigma}_{a}(z^{*}))^{-1}\right\|\left\|\hat{\Sigma}_{a}(z)-\hat{\Sigma}_{a}(z^{*})\right\|\left\|(\hat{\Sigma}_{a}(z))^{-1}\right\|
\displaystyle\lesssim kn(z,z)nΔ+kn(z,z)+kdnΔ(z,z).\displaystyle\frac{k\sqrt{n\ell(z,z^{*})}}{n\Delta}+\frac{k}{n}\ell(z,z^{*})+\frac{kd}{n\Delta}\sqrt{\ell(z,z^{*})}. (54)

Lemma 7.9.

Let Wiiidχ2dW_{i}\stackrel{{\scriptstyle iid}}{{\sim}}\chi^{2}_{d} for any i[n]i\in[n] where n,dn,d are positive integers. Then we have

(i=1nWi23nd2)4nd.\displaystyle\mathbb{P}\left(\sum_{i=1}^{n}W_{i}^{2}\geq 3nd^{2}\right)\leq\frac{4}{nd}.
Proof.

We have $\mathbb{E}\sum_{i=1}^{n}W_{i}^{2}=nd(d+2)$ and $\mathbb{E}\sum_{i=1}^{n}W_{i}^{4}=nd(d+2)(d+4)(d+6)$, so that $\text{Var}\left(\sum_{i=1}^{n}W_{i}^{2}\right)=8nd(d+2)(d+3)$. The desired result then follows from Chebyshev's inequality. ∎
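The moment identities used above can be double-checked by simulation; a minimal sketch (with arbitrary $d$ and sample size):

```python
import numpy as np

# Monte Carlo check of E[W^2] = d(d+2) and Var(W^2) = 8d(d+2)(d+3) for W ~ chi^2_d.
rng = np.random.default_rng(4)
d, n_samples = 5, 2_000_000
W = rng.chisquare(df=d, size=n_samples)

print(np.mean(W**2), d * (d + 2))                # both approximately 35
print(np.var(W**2), 8 * d * (d + 2) * (d + 3))   # both approximately 2240
```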

Proof of Lemma 6.3.

Consider any ab[k]a\neq b\in[k]. We are going to prove

λmax+λmax+λmin(λmin+λmax)2λmaxλmin+λmaxθaθbSNRa,bλmin1/2θaθb+32d+dlogλmaxλmin.\displaystyle\frac{-\sqrt{\lambda_{\max}}+\sqrt{\lambda_{\max}+\frac{\lambda_{\min}(\lambda_{\min}+\lambda_{\max})}{2\lambda_{\max}}}}{\lambda_{\min}+\lambda_{\max}}\left\|{\theta_{a}^{*}-\theta_{b}^{*}}\right\|\leq\textsf{SNR}^{\prime}_{a,b}\leq\lambda_{\min}^{-1/2}\left\|{\theta_{a}^{*}-\theta_{b}^{*}}\right\|+\sqrt{\frac{3}{2}d}+\sqrt{d\log\frac{\lambda_{\max}}{\lambda_{\min}}}. (55)

We first prove the upper bound. Denote Ξa,b=θaθb\Xi_{a,b}=\theta_{a}^{*}-\theta_{b}^{*}. Since we have assumed SNR\textsf{SNR}^{\prime}\to\infty, we have that 0a,b0\notin\mathcal{B}_{a,b}. Note that we have an equivalent expression of a,b\mathcal{B}_{a,b}:

a,b={(Σa)12Ξa,b+yd:\displaystyle\mathcal{B}_{a,b}=\bigg{\{}-(\Sigma_{a}^{*})^{-\frac{1}{2}}\Xi_{a,b}+y\in\mathbb{R}^{d}: 2yT(Σa)12Ξa,b+yT(Σa12Σb1Σa12I)y\displaystyle 2y^{T}(\Sigma_{a}^{*})^{-\frac{1}{2}}\Xi_{a,b}+y^{T}\left(\Sigma_{a}^{*\frac{1}{2}}\Sigma_{b}^{*-1}\Sigma_{a}^{*\frac{1}{2}}-I\right)y
log|Σa|+log|Σb|Ξa,bT(Σa)1Ξa,b0}.\displaystyle-\log|\Sigma_{a}^{*}|+\log|\Sigma_{b}^{*}|-\Xi_{a,b}^{T}(\Sigma_{a}^{*})^{-1}\Xi_{a,b}\leq 0\bigg{\}}.

We consider the following scenarios.

(1). If $\lambda_{1}\left(\Sigma_{a}^{*\frac{1}{2}}\Sigma_{b}^{*-1}\Sigma_{a}^{*\frac{1}{2}}-I\right)\geq 0$, we have $|\Sigma_{a}^{*}|\geq|\Sigma_{b}^{*}|$. Taking $y=0$, we see that $-(\Sigma_{a}^{*})^{-\frac{1}{2}}\Xi_{a,b}\in\mathcal{B}_{a,b}$. It follows that $\textsf{SNR}'_{a,b}\leq\left\|-(\Sigma_{a}^{*})^{-\frac{1}{2}}\Xi_{a,b}\right\|\leq\lambda_{\min}^{-1/2}\left\|\Xi_{a,b}\right\|$.

(2). If λ1(Σa12Σb1Σa12I)1\lambda_{1}\left(\Sigma_{a}^{*\frac{1}{2}}\Sigma_{b}^{*-1}\Sigma_{a}^{*\frac{1}{2}}-I\right)\leq-1, let A:=Σa12Σb1Σa12IA:=\Sigma_{a}^{*\frac{1}{2}}\Sigma_{b}^{*-1}\Sigma_{a}^{*\frac{1}{2}}-I and assume UTAU=VU^{T}AU=V, where UU is an orthogonal matrix and V:=diag{v1,,vd}V:=\text{diag}\left\{{v_{1},\cdots,v_{d}}\right\} is a diagonal matrix with diagonal elements v1v2vdv_{1}\leq v_{2}\leq\cdots\leq v_{d} and v11v_{1}\leq-1. We can rewrite y=Uzy=Uz with z=(z1,,zd)Tz=(z_{1},\cdots,z_{d})^{T} and UT(Σa)12Ξa,b=(τ1,,τd)TU^{T}(\Sigma_{a}^{*})^{-\frac{1}{2}}\Xi_{a,b}=(\tau_{1},\cdots,\tau_{d})^{T}. Since log|Σa|+log|Σb|dlogλmaxλmin-\log|\Sigma_{a}^{*}|+\log|\Sigma_{b}^{*}|\leq d\log\frac{\lambda_{\max}}{\lambda_{\min}}, we can take z1= sign(τ1)dlogλmaxλminz_{1}=-\text{ sign}\left(\tau_{1}\right)\sqrt{d\log\frac{\lambda_{\max}}{\lambda_{\min}}} and zi=0z_{i}=0 for i2i\geq 2. Then, we have

2yT(Σa)12Ξa,b+yT(Σa12Σb1Σa12I)ylog|Σa|+log|Σb|Ξa,bT(Σa)1Ξa,b\displaystyle 2y^{T}(\Sigma_{a}^{*})^{-\frac{1}{2}}\Xi_{a,b}+y^{T}\left(\Sigma_{a}^{*\frac{1}{2}}\Sigma_{b}^{*-1}\Sigma_{a}^{*\frac{1}{2}}-I\right)y-\log|\Sigma_{a}^{*}|+\log|\Sigma_{b}^{*}|-\Xi_{a,b}^{T}(\Sigma_{a}^{*})^{-1}\Xi_{a,b}
=\displaystyle=~{} 2z1τ1+v1z12log|Σa|+log|Σb|Ξa,bT(Σa)1Ξa,b\displaystyle 2z_{1}\tau_{1}+v_{1}z_{1}^{2}-\log|\Sigma_{a}^{*}|+\log|\Sigma_{b}^{*}|-\Xi_{a,b}^{T}(\Sigma_{a}^{*})^{-1}\Xi_{a,b}
\displaystyle\leq~{} 2dlogλmaxλmin|τ1|dlogλmaxλminlog|Σa|+log|Σb|Ξa,bT(Σa)1Ξa,b\displaystyle-2\sqrt{d\log\frac{\lambda_{\max}}{\lambda_{\min}}}|\tau_{1}|-d\log\frac{\lambda_{\max}}{\lambda_{\min}}-\log|\Sigma_{a}^{*}|+\log|\Sigma_{b}^{*}|-\Xi_{a,b}^{T}(\Sigma_{a}^{*})^{-1}\Xi_{a,b}
\displaystyle\leq~{} 0.\displaystyle 0.

It means $-(\Sigma_{a}^{*})^{-\frac{1}{2}}\Xi_{a,b}+y\in\mathcal{B}_{a,b}$ and hence $\textsf{SNR}'_{a,b}\leq\left\|-(\Sigma_{a}^{*})^{-\frac{1}{2}}\Xi_{a,b}\right\|+\left\|y\right\|$. Thus we have $\textsf{SNR}'_{a,b}\leq\left\|-(\Sigma_{a}^{*})^{-\frac{1}{2}}\Xi_{a,b}\right\|+\sqrt{d\log\frac{\lambda_{\max}}{\lambda_{\min}}}\leq\lambda_{\min}^{-1/2}\left\|\Xi_{a,b}\right\|+\sqrt{d\log\frac{\lambda_{\max}}{\lambda_{\min}}}$.

(3). If 1<λ1(Σa12Σb1Σa12I)<0-1<\lambda_{1}\left(\Sigma_{a}^{*\frac{1}{2}}\Sigma_{b}^{*-1}\Sigma_{a}^{*\frac{1}{2}}-I\right)<0, we still use the notations in scenario (2). Notice that Σa12Σb1Σa12=A+I\Sigma_{a}^{*\frac{1}{2}}\Sigma_{b}^{*-1}\Sigma_{a}^{*\frac{1}{2}}=A+I, we have

log|Σa||Σb|\displaystyle\log\frac{|\Sigma_{a}^{*}|}{|\Sigma_{b}^{*}|} =log(1+v1)(1+vd)\displaystyle=\log(1+v_{1})\cdot\cdots\cdot(1+v_{d})
dlog(1+v1)\displaystyle\geq d\log(1+v_{1})
32dv1.\displaystyle\geq\frac{3}{2}dv_{1}.

Now we take z1= sign(τ1)32dz_{1}=-\text{ sign}\left(\tau_{1}\right)\sqrt{\frac{3}{2}d} and zi=0z_{i}=0 for i2i\geq 2, then we have

2yT(Σa)12Ξa,b+yT(Σa12Σb1Σa12I)ylog|Σa|+log|Σb|Ξa,bT(Σa)1Ξa,b\displaystyle 2y^{T}(\Sigma_{a}^{*})^{-\frac{1}{2}}\Xi_{a,b}+y^{T}\left(\Sigma_{a}^{*\frac{1}{2}}\Sigma_{b}^{*-1}\Sigma_{a}^{*\frac{1}{2}}-I\right)y-\log|\Sigma_{a}^{*}|+\log|\Sigma_{b}^{*}|-\Xi_{a,b}^{T}(\Sigma_{a}^{*})^{-1}\Xi_{a,b}
=\displaystyle=~{} 2z1τ1+v1z12log|Σa|+log|Σb|Ξa,bT(Σa)1Ξa,b\displaystyle 2z_{1}\tau_{1}+v_{1}z_{1}^{2}-\log|\Sigma_{a}^{*}|+\log|\Sigma_{b}^{*}|-\Xi_{a,b}^{T}(\Sigma_{a}^{*})^{-1}\Xi_{a,b}
\displaystyle\leq~{} 2|τ1|32dΞa,bT(Σa)1Ξa,b\displaystyle-2|\tau_{1}|\sqrt{\frac{3}{2}d}-\Xi_{a,b}^{T}(\Sigma_{a}^{*})^{-1}\Xi_{a,b}
\displaystyle\leq~{} 0.\displaystyle 0.

Thus we have SNRa,b(Σa)12Ξa,b+32dλmin1/2Ξa,b+32d\textsf{SNR}^{\prime}_{a,b}\leq\left\|-(\Sigma_{a}^{*})^{-\frac{1}{2}}\Xi_{a,b}\right\|+\sqrt{\frac{3}{2}d}\leq\lambda_{\min}^{-1/2}\left\|{\Xi_{a,b}}\right\|+\sqrt{\frac{3}{2}d}.

Overall, we have SNRa,bλmin1/2Ξa,b+32d+dlogλmaxλmin\textsf{SNR}^{\prime}_{a,b}\leq\lambda_{\min}^{-1/2}\left\|{\Xi_{a,b}}\right\|+\sqrt{\frac{3}{2}d}+\sqrt{d\log\frac{\lambda_{\max}}{\lambda_{\min}}} for all the three scenarios.

To prove the lower bound, we have

xTΣa12Σb1(θaθb)+12xT(Σa12Σb1Σa12I)x\displaystyle x^{T}\Sigma_{a}^{*\frac{1}{2}}\Sigma_{b}^{*-1}(\theta^{*}_{a}-\theta^{*}_{b})+\frac{1}{2}x^{T}\left(\Sigma_{a}^{*\frac{1}{2}}\Sigma_{b}^{*-1}\Sigma_{a}^{*\frac{1}{2}}-I\right)x Σa12Σb1xΞa,b12x2Σa12Σb1Σa12I\displaystyle\geq-\left\|{\Sigma_{a}^{*\frac{1}{2}}\Sigma_{b}^{*-1}}\right\|\left\|{x}\right\|\left\|{\Xi_{a,b}}\right\|-\frac{1}{2}\left\|{x}\right\|^{2}\left\|{\Sigma_{a}^{*\frac{1}{2}}\Sigma_{b}^{*-1}\Sigma_{a}^{*\frac{1}{2}}-I}\right\|
λmaxλminxΞa,b12(λmaxλmin+1)x2.\displaystyle\geq-\frac{\sqrt{\lambda_{\max}}}{\lambda_{\min}}\left\|{x}\right\|\left\|{\Xi_{a,b}}\right\|-\frac{1}{2}\left(\frac{\lambda_{\max}}{\lambda_{\min}}+1\right)\left\|{x}\right\|^{2}.

By the upper bound we know for any ab[k]a\neq b\in[k], Ξa,b\left\|{\Xi_{a,b}}\right\|\to\infty when SNR\textsf{SNR}^{\prime}\to\infty. Thus, we have

λmaxλminxΞa,b+12(λmaxλmin+1)x2\displaystyle\frac{\sqrt{\lambda_{\max}}}{\lambda_{\min}}\left\|{x}\right\|\left\|{\Xi_{a,b}}\right\|+\frac{1}{2}\left(\frac{\lambda_{\max}}{\lambda_{\min}}+1\right)\left\|{x}\right\|^{2} 12Ξa,bTΣb1Ξa,blog|Σa|+log|Σb|\displaystyle\geq\frac{1}{2}\Xi_{a,b}^{T}\Sigma_{b}^{*-1}\Xi_{a,b}-\log|\Sigma_{a}^{*}|+\log|\Sigma_{b}^{*}|
14λmaxΞa,b2.\displaystyle\geq\frac{1}{4\lambda_{\max}}\left\|{\Xi_{a,b}}\right\|^{2}.

Hence,

xλmax+λmax+λmin(λmin+λmax)2λmaxλmin+λmaxΞa,b.\displaystyle\left\|{x}\right\|\geq\frac{-\sqrt{\lambda_{\max}}+\sqrt{\lambda_{\max}+\frac{\lambda_{\min}(\lambda_{\min}+\lambda_{\max})}{2\lambda_{\max}}}}{\lambda_{\min}+\lambda_{\max}}\left\|{\Xi_{a,b}}\right\|.

In the following lemmas, we are going to establish connections between testing errors and {SNRa,b}ab\{\textsf{SNR}^{\prime}_{a,b}\}_{a\neq b}. Consider any a,b[k]a,b\in[k] such that aba\neq b. Let η𝒩(0,Id)\eta\sim\mathcal{N}(0,I_{d}), Ξa,b=θaθb\Xi_{a,b}=\theta_{a}^{*}-\theta_{b}^{*}, and Δa,b=Ξa,b\Delta_{a,b}=\left\|{\Xi_{a,b}}\right\|. Define

a,b(δ)={xd:xTΣa12(Σb)1Ξa,b\displaystyle\mathcal{B}_{a,b}(\delta)=\Bigg{\{}x\in\mathbb{R}^{d}:x^{T}\Sigma_{a}^{*\frac{1}{2}}(\Sigma_{b}^{*})^{-1}\Xi_{a,b} +12xT(Σa12(Σb)1Σa12Id)x\displaystyle+\frac{1}{2}x^{T}\left(\Sigma_{a}^{*\frac{1}{2}}(\Sigma_{b}^{*})^{-1}\Sigma_{a}^{*\frac{1}{2}}-I_{d}\right)x
1δ2Ξa,bT(Σb)1Ξa,b+12log|Σa|12log|Σb|},\displaystyle\leq-\frac{1-\delta}{2}\Xi_{a,b}^{T}(\Sigma_{b}^{*})^{-1}\Xi_{a,b}+\frac{1}{2}\log\left|\Sigma_{a}^{*}\right|-\frac{1}{2}\log\left|\Sigma_{b}^{*}\right|\Bigg{\}},

for any δ\delta\in\mathbb{R}. In addition, we define

SNRa,b(δ)=minxa,b(δ)2x,\displaystyle\textsf{SNR}_{a,b}^{\prime}(\delta)=\min_{x\in\mathcal{B}_{a,b}(\delta)}2\left\|{x}\right\|,
and Pa,b(δ)=(ηa,b(δ)).\displaystyle P_{a,b}(\delta)=\mathbb{P}\left(\eta\in\mathcal{B}_{a,b}(\delta)\right).

Recall the definitions of $\mathcal{B}_{a,b}$ and $\textsf{SNR}'_{a,b}$ in Section 3. They are special cases of $\mathcal{B}_{a,b}(\delta)$ and $\textsf{SNR}'_{a,b}(\delta)$ with $\delta=0$. That is, we have $\mathcal{B}_{a,b}=\mathcal{B}_{a,b}(0)$ and $\textsf{SNR}'_{a,b}=\textsf{SNR}'_{a,b}(0)$.
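While $\textsf{SNR}'_{a,b}(\delta)$ is defined through a nonconvex optimization, the error probability $P_{a,b}(\delta)$ is straightforward to approximate by Monte Carlo: sample $\eta\sim\mathcal{N}(0,I_{d})$ and evaluate the defining quadratic inequality. The following sketch is illustrative only; the centers and covariances are arbitrary choices, not taken from the paper:

```python
import numpy as np

# Monte Carlo estimate of P_{a,b}(delta) = P(eta in B_{a,b}(delta)).
# theta_a, theta_b, Sigma_a, Sigma_b are arbitrary illustrative parameters.
rng = np.random.default_rng(5)
d, delta, n_samples = 2, 0.0, 2_000_000

theta_a, theta_b = np.array([0.0, 0.0]), np.array([4.0, 0.0])
Sigma_a, Sigma_b = np.diag([1.0, 2.0]), np.diag([2.0, 1.0])

Xi = theta_a - theta_b
Sb_inv = np.linalg.inv(Sigma_b)
Sa_half = np.diag(np.sqrt(np.diag(Sigma_a)))  # Sigma_a^{1/2} (diagonal case)

eta = rng.standard_normal((n_samples, d))
lin = eta @ (Sa_half @ Sb_inv @ Xi)                 # x^T Sigma_a^{1/2} Sigma_b^{-1} Xi
M = Sa_half @ Sb_inv @ Sa_half - np.eye(d)
quad = 0.5 * np.einsum("ij,jk,ik->i", eta, M, eta)  # (1/2) x^T (...) x
rhs = (-(1 - delta) / 2 * Xi @ Sb_inv @ Xi
       + 0.5 * np.log(np.linalg.det(Sigma_a))
       - 0.5 * np.log(np.linalg.det(Sigma_b)))

print("P_ab(delta) ~", np.mean(lin + quad <= rhs))
```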

Lemma 7.10.

Assume d=O(1)d=O(1) and λminλ1(Σa),λ1(Σb)λd(Σa),λd(Σb)λmax\lambda_{\min}\leq\lambda_{1}(\Sigma_{a}^{*}),\lambda_{1}(\Sigma_{b}^{*})\leq\lambda_{d}(\Sigma_{a}^{*}),\lambda_{d}(\Sigma_{b}^{*})\leq\lambda_{\max} where λmin,λmax>0\lambda_{\min},\lambda_{\max}>0 are constants. Under the condition SNRa,b\textsf{SNR}^{\prime}_{a,b}\rightarrow\infty, for any positive sequence δ=o(1)\delta=o(1), there exists a δ~=o(1)\tilde{\delta}=o(1) that depends on δ,d,Δa,b,λmin,λmax\delta,d,\Delta_{a,b},\lambda_{\min},\lambda_{\max} such that

P_{a,b}(\delta)\leq\exp\left(-\frac{1-\tilde{\delta}}{8}\textsf{SNR}_{a,b}^{\prime 2}\right).
Proof.

For convenience and conciseness, we will write $\theta_{a},\theta_{b},\Sigma_{a},\Sigma_{b}$ instead of $\theta_{a}^{*},\theta_{b}^{*},\Sigma_{a}^{*},\Sigma_{b}^{*}$ throughout the proof. By Lemma 6.3, $\textsf{SNR}'_{a,b}$ is of the same order as $\Delta_{a,b}$, which means $\Delta_{a,b}\rightarrow\infty$.

Suppose for the moment that we have established $\textsf{SNR}'_{a,b}(\delta)\geq(1-o(1))\textsf{SNR}'_{a,b}$. Then by Lemma 6.3, $\textsf{SNR}'_{a,b}(\delta)$ is of the same order as $\Delta_{a,b}$, which is far larger than $d$ by assumption. Since $\left\|\eta\right\|^{2}\sim\chi^{2}_{d}$, using Lemma 7.1, we have

P_{a,b}(\delta)\leq P\left(\left\|\eta\right\|^{2}\geq\frac{\textsf{SNR}_{a,b}^{\prime 2}(\delta)}{4}\right)\leq\exp\left(-\left(1-O\left(\frac{d}{\Delta_{a,b}^{2}}\right)\right)\frac{\textsf{SNR}_{a,b}^{\prime 2}(\delta)}{8}\right)
\leq\exp\left(-\left(1-O\left(\frac{d}{\Delta_{a,b}^{2}}\right)\right)\frac{(1-o(1))\textsf{SNR}_{a,b}^{\prime 2}}{8}\right),

which is the desired result. Hence, the proof of this lemma is all about establishing SNRa,b(δ)(1o(1))SNRa,b\textsf{SNR}_{a,b}^{\prime}(\delta)\geq(1-o(1))\textsf{SNR}_{a,b}^{\prime}.

To prove it, we first simplify $\textsf{SNR}'_{a,b}(\delta)$. With some abuse of notation, denote $\lambda_{1}\leq\ldots\leq\lambda_{d}$ to be the eigenvalues of $\Sigma_{a}^{\frac{1}{2}}(\Sigma_{b})^{-1}\Sigma_{a}^{\frac{1}{2}}-I_{d}$ so that its eigendecomposition can be written as $\Sigma_{a}^{\frac{1}{2}}(\Sigma_{b})^{-1}\Sigma_{a}^{\frac{1}{2}}-I_{d}=\sum_{i=1}^{d}\lambda_{i}u_{i}u_{i}^{T}$, where $\{u_{i}\}$ are orthonormal vectors. Denote $U=(u_{1},\ldots,u_{d})$ and

v=UTΣa12Σb1Ξa,b\displaystyle v=U^{T}\Sigma_{a}^{\frac{1}{2}}\Sigma_{b}^{-1}\Xi_{a,b}

and

a,b(δ)\displaystyle\mathcal{B}^{\prime}_{a,b}(\delta) ={yd:iyivi+12iλiyi21δ2Ξa,bTΣb1Ξa,b+12log|Σa||Σb|}.\displaystyle=\left\{y\in\mathbb{R}^{d}:\sum_{i}y_{i}v_{i}+\frac{1}{2}\sum_{i}\lambda_{i}y_{i}^{2}\leq-\frac{1-\delta}{2}\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}+\frac{1}{2}\log\frac{|\Sigma_{a}|}{|\Sigma_{b}|}\right\}.

Then $\mathcal{B}'_{a,b}(\delta)$ can be seen as a reflection-rotation of $\mathcal{B}_{a,b}(\delta)$ under the transformation $y=U^{T}x$. Hence we have $\textsf{SNR}'_{a,b}(\delta)=\min_{y\in\mathcal{B}'_{a,b}(\delta)}2\left\|y\right\|$ for any $\delta$. Moreover, let $\bar{\mathcal{B}}'_{a,b}(\delta)$ be its boundary, i.e.,

¯a,b(δ)\displaystyle\bar{\mathcal{B}}^{\prime}_{a,b}(\delta) ={yd:iyivi+12iλiyi2=1δ2Ξa,bTΣb1Ξa,b+12log|Σa||Σb|}.\displaystyle=\left\{y\in\mathbb{R}^{d}:\sum_{i}y_{i}v_{i}+\frac{1}{2}\sum_{i}\lambda_{i}y_{i}^{2}=-\frac{1-\delta}{2}\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}+\frac{1}{2}\log\frac{|\Sigma_{a}|}{|\Sigma_{b}|}\right\}.

Since $0\notin\mathcal{B}'_{a,b}(\delta)$, we have $\textsf{SNR}'_{a,b}(\delta)=2\min_{y\in\bar{\mathcal{B}}'_{a,b}(\delta)}\left\|y\right\|$. As a result, we only need to work with $\bar{\mathcal{B}}'_{a,b}(\delta)$ instead of $\mathcal{B}_{a,b}(\delta)$. Denote $\bar{\mathcal{B}}'_{a,b}$ to be $\bar{\mathcal{B}}'_{a,b}(0)$ for simplicity.

We then give an equivalent expression for $\textsf{SNR}'_{a,b}(\delta)$. From (55), we have the upper bound $\textsf{SNR}'_{a,b}\leq 2\lambda_{\min}^{-1/2}\Delta_{a,b}$, where we use $\Delta_{a,b}\gg\lambda_{\min},\lambda_{\max},d$. The same upper bound holds for $\textsf{SNR}'_{a,b}(\delta)$ for any $\delta=o(1)$, by the same proof. Define $S=\{y\in\mathbb{R}^{d}:\|y\|\leq 2\lambda_{\min}^{-1/2}\Delta_{a,b}\}$. We then have

SNRa,b(δ)=2miny¯a,b(δ)Sy.\displaystyle\textsf{SNR}^{\prime}_{a,b}(\delta)=2\min_{y\in\bar{\mathcal{B}}^{\prime}_{a,b}(\delta)\cap S}\left\|{y}\right\|.

Let $g:\bar{\mathcal{B}}'_{a,b}(\delta)\rightarrow\bar{\mathcal{B}}'_{a,b}(0)$ be any mapping. By the triangle inequality, we have $\left\|y\right\|\geq\left\|g(y)\right\|-\left\|y-g(y)\right\|$, and hence

21SNRa,b(δ)\displaystyle 2^{-1}\textsf{SNR}^{\prime}_{a,b}(\delta) =miny¯a,b(δ)Sy\displaystyle=\min_{y\in\bar{\mathcal{B}}^{\prime}_{a,b}(\delta)\cap S}\left\|{y}\right\|
miny¯a,b(δ)S(g(y)yg(y))\displaystyle\geq\min_{y\in\bar{\mathcal{B}}^{\prime}_{a,b}(\delta)\cap S}\left(\left\|{g(y)}\right\|-\left\|{y-g(y)}\right\|\right)
miny¯a,b(δ)Sg(y)maxy¯a,b(δ)Syg(y)\displaystyle\geq\min_{y\in\bar{\mathcal{B}}^{\prime}_{a,b}(\delta)\cap S}\left\|{g(y)}\right\|-\max_{y\in\bar{\mathcal{B}}^{\prime}_{a,b}(\delta)\cap S}\left\|{y-g(y)}\right\|
miny¯a,b(0)ymaxy¯a,b(δ)Syg(y)\displaystyle\geq\min_{y\in\bar{\mathcal{B}}^{\prime}_{a,b}(0)}\left\|{y}\right\|-\max_{y\in\bar{\mathcal{B}}^{\prime}_{a,b}(\delta)\cap S}\left\|{y-g(y)}\right\|
SNRa,bmaxy¯a,b(δ)Syg(y).\displaystyle\geq\textsf{SNR}^{\prime}_{a,b}-\max_{y\in\bar{\mathcal{B}}^{\prime}_{a,b}(\delta)\cap S}\left\|{y-g(y)}\right\|. (56)

As a result, if we are able to find some gg such that maxy¯a,b(δ)Syg(y)=o(1)SNRa,b\max_{y\in\bar{\mathcal{B}}^{\prime}_{a,b}(\delta)\cap S}\left\|{y-g(y)}\right\|=o(1)\textsf{SNR}^{\prime}_{a,b}, we will immediately have SNRa,b(δ)(1o(1))SNRa,b\textsf{SNR}^{\prime}_{a,b}(\delta)\geq(1-o(1))\textsf{SNR}^{\prime}_{a,b} and the proof will be complete.

Let wdw\in\mathbb{R}^{d} be some vector. Define g(y)=y+wargmint:y+tw¯a,b|t|g(y)=y+w\mathop{\rm argmin}_{t\in\mathbb{R}:y+tw\in\bar{\mathcal{B}}^{\prime}_{a,b}}|t|. If g(y)g(y) is a well-defined mapping, we have

maxy¯a,b(δ)Syg(y)\displaystyle\max_{y\in\bar{\mathcal{B}}^{\prime}_{a,b}(\delta)\cap S}\left\|{y-g(y)}\right\| =maxy¯a,b(δ)Smint:y+tw¯a,bw|t|,\displaystyle=\max_{y\in\bar{\mathcal{B}}^{\prime}_{a,b}(\delta)\cap S}\min_{t\in\mathbb{R}:y+tw\in\bar{\mathcal{B}}^{\prime}_{a,b}}\left\|{w}\right\|\left|t\right|, (57)

which can be used to derive an upper bound. However, to make $g(y)$ well-defined, we need that for any $y\in\bar{\mathcal{B}}'_{a,b}(\delta)\cap S$, there exists some $t\in\mathbb{R}$ such that $y+tw\in\bar{\mathcal{B}}'_{a,b}$. This means we have the following two equations:

iyivi+12iλiyi2=1δ2Ξa,bTΣb1Ξa,b+12log|Σa||Σb|,\displaystyle\sum_{i}y_{i}v_{i}+\frac{1}{2}\sum_{i}\lambda_{i}y_{i}^{2}=-\frac{1-\delta}{2}\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}+\frac{1}{2}\log\frac{|\Sigma_{a}|}{|\Sigma_{b}|}, (58)
and i(yi+twi)vi+12iλi(yi+twi)2=12Ξa,bTΣb1Ξa,b+12log|Σa||Σb|.\displaystyle\sum_{i}(y_{i}+tw_{i})v_{i}+\frac{1}{2}\sum_{i}\lambda_{i}(y_{i}+tw_{i})^{2}=-\frac{1}{2}\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}+\frac{1}{2}\log\frac{|\Sigma_{a}|}{|\Sigma_{b}|}.

It is equivalent to require tt to satisfy

ti(wivi+λiyiwi)+t22iλiwi2=δ2Ξa,bTΣb1Ξa,b.\displaystyle t\sum_{i}\left(w_{i}v_{i}+\lambda_{i}y_{i}w_{i}\right)+\frac{t^{2}}{2}\sum_{i}\lambda_{i}w_{i}^{2}=-\frac{\delta}{2}\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}. (59)

Hence, it suffices to find a suitable vector $w$ such that for any $y\in\bar{\mathcal{B}}'_{a,b}(\delta)\cap S$ there exists a $t$ satisfying (59); from this we can obtain the desired upper bound on (57).

In the following, we consider four different scenarios according to the spectrum $\{\lambda_{i}\}$. For each scenario, we construct a $w$ with decent bounds for (57). Denote $\delta'=\sqrt{\delta}$.

Scenario 1: $\left|\lambda_{1}\right|,\left|\lambda_{d}\right|\leq\delta'$. We choose $w=v/\|v\|$. Note that $\left\|v\right\|$ is of the same order as $\Delta_{a,b}$ and $\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}$ is of the same order as $\Delta_{a,b}^{2}$. We have

ti(wivi+λiyiwi)+t22iλiwi2\displaystyle t\sum_{i}\left(w_{i}v_{i}+\lambda_{i}y_{i}w_{i}\right)+\frac{t^{2}}{2}\sum_{i}\lambda_{i}w_{i}^{2} tv+|t|yiλi2wi2+t22iλiwi2\displaystyle\leq t\left\|{v}\right\|+\left|t\right|\left\|{y}\right\|\sqrt{\sum_{i}\lambda_{i}^{2}w_{i}^{2}}+\frac{t^{2}}{2}\sum_{i}\lambda_{i}w_{i}^{2}
tv+|t|δy+t2δ2\displaystyle\leq t\left\|{v}\right\|+\left|t\right|\delta^{\prime}\left\|{y}\right\|+\frac{t^{2}\delta^{\prime}}{2}
tv+2|t|δλmin1/2Δa,b+t2δ2,\displaystyle\leq t\left\|{v}\right\|+2\left|t\right|\delta^{\prime}\lambda_{\min}^{-1/2}\Delta_{a,b}+\frac{t^{2}\delta^{\prime}}{2},

where in the last inequality we use ySy\in S. Define t0=δ1/2Δa,bt_{0}=-\delta^{1/2}\Delta_{a,b}. Then we have

t0i(wivi+λiyiwi)+t022iλiwi2\displaystyle t_{0}\sum_{i}\left(w_{i}v_{i}+\lambda_{i}y_{i}w_{i}\right)+\frac{t_{0}^{2}}{2}\sum_{i}\lambda_{i}w_{i}^{2} δ12Δa,b2+δ12δΔa,b2+δδΔa,b2δΔa,b2δ2Ξa,bTΣb1Ξa,b.\displaystyle\lesssim-\delta^{\frac{1}{2}}\Delta_{a,b}^{2}+\delta^{\frac{1}{2}}\delta^{\prime}\Delta_{a,b}^{2}+\delta\delta^{\prime}\Delta_{a,b}^{2}\ll-\delta\Delta_{a,b}^{2}\lesssim-\frac{\delta}{2}\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}.

Hence for any $y\in\bar{\mathcal{B}}'_{a,b}(\delta)\cap S$ there exists a $t\in(t_{0},0)$ such that (59) is satisfied. Thus, $\left|t_{0}\right|=\delta^{1/2}\Delta_{a,b}$ is an upper bound for (57).

Scenario 2: $\lambda_{1}<-\delta'$. We choose $w=e_{1}$, the first standard basis vector of $\mathbb{R}^{d}$. Then, (59) can be written as

λ1t2+2(v1+λ1y1)t+δΞa,bTΣb1Ξa,b=0.\displaystyle\lambda_{1}t^{2}+2(v_{1}+\lambda_{1}y_{1})t+\delta\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}=0.

Since $\lambda_{1}<0$, the discriminant is positive and the above equation has two distinct real solutions $t_{1},t_{2}\in\mathbb{R}$ with $|t_{1}t_{2}|=\delta\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}/(-\lambda_{1})$. Since $\min\{|t_{1}|,|t_{2}|\}\leq\sqrt{|t_{1}t_{2}|}$, we obtain

min{|t1|,|t2|}δΞa,bTΣb1Ξa,bλ1δΞa,bTΣb1Ξa,bδδ14Δa,b.\displaystyle\min\{\left|t_{1}\right|,\left|t_{2}\right|\}\leq\sqrt{\frac{\delta\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}}{-\lambda_{1}}}\leq\sqrt{\frac{\delta\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}}{\delta^{\prime}}}\lesssim\delta^{\frac{1}{4}}\Delta_{a,b}.

Hence, an upper bound for (57) is O(δ14Δa,b)O(\delta^{\frac{1}{4}}\Delta_{a,b}).

Scenario 3: λ1δ\lambda_{1}\geq-\delta^{\prime} and there exists a j[d]j\in[d] such that λjδ\lambda_{j}\leq\delta^{\prime} and |vj|δΔa,b\left|v_{j}\right|\geq\sqrt{\delta^{\prime}}\Delta_{a,b}. We choose w=ejw=e_{j}. Then (59) can be written as

λjt2+2(vj+λjyj)t+δΞa,bTΣb1Ξa,b=0.\displaystyle\lambda_{j}t^{2}+2(v_{j}+\lambda_{j}y_{j})t+\delta\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}=0.

Note that for any ySy\in S, we have |vj+λjyj||vj||λjyj|δΔa,bδ(2λmin1/2Δa,b)δΔa,b/2\left|v_{j}+\lambda_{j}y_{j}\right|\geq\left|v_{j}\right|-\left|\lambda_{j}y_{j}\right|\geq\sqrt{\delta^{\prime}}\Delta_{a,b}-\delta^{\prime}(2\lambda_{\min}^{-1/2}\Delta_{a,b})\geq\sqrt{\delta^{\prime}}\Delta_{a,b}/2. Denote t0=sign(vj+λjyj)δΔa,bt_{0}=-\text{sign}(v_{j}+\lambda_{j}y_{j})\sqrt{\delta^{\prime}}\Delta_{a,b}. Then we have

\lambda_{j}t_{0}^{2}+2(v_{j}+\lambda_{j}y_{j})t_{0}+\delta\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}=\lambda_{j}\delta'\Delta_{a,b}^{2}-2\left|v_{j}+\lambda_{j}y_{j}\right|\sqrt{\delta'}\Delta_{a,b}+\delta\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}
\leq-\left(\delta'-\delta'^{2}\right)\Delta_{a,b}^{2}+\delta O(\Delta_{a,b}^{2})
\leq 0.

As a result, there exists some t(t0,0)t\in(t_{0},0) satisfying (59). Hence, |t0|=δ1/2Δa,b\left|t_{0}\right|=\delta^{{}^{\prime}1/2}\Delta_{a,b} is an upper bound for (57).

Scenario 4: λ1δ\lambda_{1}\geq-\delta^{\prime} and |vj|<δΔa,b\left|v_{j}\right|<\sqrt{\delta^{\prime}}\Delta_{a,b} for all j[d]j\in[d] such that λjδ\lambda_{j}\leq\delta^{\prime}. This scenario is slightly more complicated as we need ww to be dependent on yy. Denote it as w(y)w(y). Then (56) still holds and (57) can be changed into

maxy¯a,b(δ)Syg(y)=maxy¯a,b(δ)Smint:y+tw¯a,bw(y)|t|.\displaystyle\max_{y\in\bar{\mathcal{B}}^{\prime}_{a,b}(\delta)\cap S}\left\|{y-g(y)}\right\|=\max_{y\in\bar{\mathcal{B}}^{\prime}_{a,b}(\delta)\cap S}\min_{t\in\mathbb{R}:y+tw\in\bar{\mathcal{B}}^{\prime}_{a,b}}\left\|{w(y)}\right\|\left|t\right|. (60)

Denote $m\in[d]$ to be the integer such that $\lambda_{j}\leq\delta'$ for all $j\leq m$ and $\lambda_{j}>\delta'$ for all $j>m$. We may assume $m<d$; otherwise this scenario reduces to Scenario 1. Define

[w(y)]i=(yi+viλi)𝕀{i>m}.\displaystyle[w(y)]_{i}=-\left(y_{i}+\frac{v_{i}}{\lambda_{i}}\right){\mathbb{I}\left\{{i>m}\right\}}.

for any i[d]i\in[d]. Instead of using (59), we will analyze it slightly differently.

For y¯a,b(δ)y\in\bar{\mathcal{B}}^{\prime}_{a,b}(\delta), (58) can be rewritten as

i>mλi(yi+viλi)2=i>mvi2λi(1δ)Ξa,bTΣb1Ξa,b+log|Σa||Σb|(2imyivi+imλiyi2).\displaystyle\sum_{i>m}\lambda_{i}\left(y_{i}+\frac{v_{i}}{\lambda_{i}}\right)^{2}=\sum_{i>m}\frac{v_{i}^{2}}{\lambda_{i}}-(1-\delta)\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}+\log\frac{|\Sigma_{a}|}{|\Sigma_{b}|}-\left(2\sum_{i\leq m}y_{i}v_{i}+\sum_{i\leq m}\lambda_{i}y_{i}^{2}\right). (61)

On the other hand, if g(y)g(y) is well-defined, we need g(y)¯a,bg(y)\in\bar{\mathcal{B}}^{\prime}_{a,b} which means

i>mλi([g(y)]i+viλi)2=i>mvi2λiΞa,bTΣb1Ξa,b+log|Σa||Σb|(2imyivi+imλiyi2).\displaystyle\sum_{i>m}\lambda_{i}\left([g(y)]_{i}+\frac{v_{i}}{\lambda_{i}}\right)^{2}=\sum_{i>m}\frac{v_{i}^{2}}{\lambda_{i}}-\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}+\log\frac{|\Sigma_{a}|}{|\Sigma_{b}|}-\left(2\sum_{i\leq m}y_{i}v_{i}+\sum_{i\leq m}\lambda_{i}y_{i}^{2}\right).

Note that we have (yi+vi/λi)(1t)=[g(y)]i+vi/λi(y_{i}+v_{i}/\lambda_{i})(1-t)=[g(y)]_{i}+v_{i}/\lambda_{i} for i>mi>m and [g(y)]i=yi[g(y)]_{i}=y_{i} for imi\leq m. Then the above display can be written as

(1t)2i>mλi(yi+viλi)2=i>mvi2λiΞa,bTΣb1Ξa,b+log|Σa||Σb|(2imyivi+imλiyi2).\displaystyle(1-t)^{2}\sum_{i>m}\lambda_{i}\left(y_{i}+\frac{v_{i}}{\lambda_{i}}\right)^{2}=\sum_{i>m}\frac{v_{i}^{2}}{\lambda_{i}}-\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}+\log\frac{|\Sigma_{a}|}{|\Sigma_{b}|}-\left(2\sum_{i\leq m}y_{i}v_{i}+\sum_{i\leq m}\lambda_{i}y_{i}^{2}\right).

Substituting (61) into the above equation leads to

(1t)2δΞa,bTΣb1Ξa,b=(2tt2)(i>mvi2λiΞa,bTΣb1Ξa,b+log|Σa||Σb|(2imyivi+imλiyi2)).\displaystyle(1-t)^{2}\delta\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}=(2t-t^{2})\left(\sum_{i>m}\frac{v_{i}^{2}}{\lambda_{i}}-\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}+\log\frac{|\Sigma_{a}|}{|\Sigma_{b}|}-\left(2\sum_{i\leq m}y_{i}v_{i}+\sum_{i\leq m}\lambda_{i}y_{i}^{2}\right)\right). (62)

It is sufficient to find some 0<t0<10<t_{0}<1 such that

(1t0)2t0(2t0)δΞa,bTΣb1Ξa,bi>mvi2λiΞa,bTΣb1Ξa,b+log|Σa||Σb|(2imyivi+imλiyi2),\displaystyle\frac{(1-t_{0})^{2}}{t_{0}(2-t_{0})}\delta\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}\leq\sum_{i>m}\frac{v_{i}^{2}}{\lambda_{i}}-\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}+\log\frac{|\Sigma_{a}|}{|\Sigma_{b}|}-\left(2\sum_{i\leq m}y_{i}v_{i}+\sum_{i\leq m}\lambda_{i}y_{i}^{2}\right), (63)

then there exists some $0<t<t_{0}$ satisfying (62).

We now give a lower bound for the right-hand side of (63). In particular, we need to lower bound $\sum_{i>m}\frac{v_{i}^{2}}{\lambda_{i}}-\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}$. Denote $\tilde{y}=U^{T}(-\Sigma_{a}^{-1/2}\Xi_{a,b})$. Then using the definition of $v$ and $\{\lambda_{i}\}$, we have

2i[k]y~ivi+i[k]λiy~i2\displaystyle 2\sum_{i\in[k]}\tilde{y}_{i}v_{i}+\sum_{i\in[k]}\lambda_{i}\tilde{y}_{i}^{2} =2(Σa12Ξa,b)TΣa12Σb1Ξa,b+(Σa12Ξa,b)T(Σa12Σb1Σa12Id)(Σa12Ξa,b)\displaystyle=2(-\Sigma_{a}^{-\frac{1}{2}}\Xi_{a,b})^{T}\Sigma_{a}^{\frac{1}{2}}\Sigma_{b}^{-1}\Xi_{a,b}+(-\Sigma_{a}^{-\frac{1}{2}}\Xi_{a,b})^{T}\left(\Sigma_{a}^{\frac{1}{2}}\Sigma_{b}^{-1}\Sigma_{a}^{\frac{1}{2}}-I_{d}\right)(-\Sigma_{a}^{-\frac{1}{2}}\Xi_{a,b})
=Ξa,bTΣb1Ξa,bΞa,bTΣa1Ξa,b.\displaystyle=-\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}-\Xi_{a,b}^{T}\Sigma_{a}^{-1}\Xi_{a,b}.

Then we have

i>mvi2λiΞa,bTΣb1Ξa,b\displaystyle\sum_{i>m}\frac{v_{i}^{2}}{\lambda_{i}}-\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b} =Ξa,bTΣa1Ξa,b+i>mλi(y~i+viλi)2+(2imy~ivi+imλiy~i2)\displaystyle=\Xi_{a,b}^{T}\Sigma_{a}^{-1}\Xi_{a,b}+\sum_{i>m}\lambda_{i}\left(\tilde{y}_{i}+\frac{v_{i}}{\lambda_{i}}\right)^{2}+\left(2\sum_{i\leq m}\tilde{y}_{i}v_{i}+\sum_{i\leq m}\lambda_{i}\tilde{y}_{i}^{2}\right)
Ξa,bTΣa1Ξa,b+(2imy~ivi+imλiy~i2).\displaystyle\geq\Xi_{a,b}^{T}\Sigma_{a}^{-1}\Xi_{a,b}+\left(2\sum_{i\leq m}\tilde{y}_{i}v_{i}+\sum_{i\leq m}\lambda_{i}\tilde{y}_{i}^{2}\right). (64)

Hence, the right hand side of (63) can be lower bounded by

Ξa,bTΣa1Ξa,b+log|Σa||Σb|+(2imy~ivi+imλiy~i2)(2imyivi+imλiyi2)\displaystyle\geq\Xi_{a,b}^{T}\Sigma_{a}^{-1}\Xi_{a,b}+\log\frac{|\Sigma_{a}|}{|\Sigma_{b}|}+\left(2\sum_{i\leq m}\tilde{y}_{i}v_{i}+\sum_{i\leq m}\lambda_{i}\tilde{y}_{i}^{2}\right)-\left(2\sum_{i\leq m}y_{i}v_{i}+\sum_{i\leq m}\lambda_{i}y_{i}^{2}\right)
Ξa,bTΣa1Ξa,b+log|Σa||Σb|8(δdλmin12Δa,b2+δλmin1Δa,b2)\displaystyle\geq\Xi_{a,b}^{T}\Sigma_{a}^{-1}\Xi_{a,b}+\log\frac{|\Sigma_{a}|}{|\Sigma_{b}|}-8\left(\sqrt{\delta^{\prime}}\sqrt{d}\lambda_{\min}^{-\frac{1}{2}}\Delta_{a,b}^{2}+\delta^{\prime}\lambda_{\min}^{-1}\Delta_{a,b}^{2}\right)
Ξa,bTΣa1Ξa,b+log|Σa||Σb|16δdλmin12Δa,b2,\displaystyle\geq\Xi_{a,b}^{T}\Sigma_{a}^{-1}\Xi_{a,b}+\log\frac{|\Sigma_{a}|}{|\Sigma_{b}|}-16\sqrt{\delta^{\prime}d}\lambda_{\min}^{-\frac{1}{2}}\Delta_{a,b}^{2}, (65)

where we use both $\tilde{y},y\in S$ and the assumption that $\left|v_{i}\right|\leq\sqrt{\delta'}\Delta_{a,b}$ and $\left|\lambda_{i}\right|\leq\delta'$ for any $i\leq m$. Then a sufficient condition for (63) is that $t_{0}$ satisfies

(1t0)2t0(2t0)δΞa,bTΣb1Ξa,bΞa,bTΣa1Ξa,b+log|Σa||Σb|+16δdλmin12Δa,b2.\displaystyle\frac{(1-t_{0})^{2}}{t_{0}(2-t_{0})}\delta\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}\leq\Xi_{a,b}^{T}\Sigma_{a}^{-1}\Xi_{a,b}+\log\frac{|\Sigma_{a}|}{|\Sigma_{b}|}+16\sqrt{\delta^{\prime}d}\lambda_{\min}^{-\frac{1}{2}}\Delta_{a,b}^{2}.

Since $\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}$ is of the same order as $\Delta_{a,b}^{2}$ and $\log\frac{|\Sigma_{a}|}{|\Sigma_{b}|}\lesssim d=O(1)$, this can be achieved by $t_{0}=\sqrt{\delta}$.

As a result, from (60) we have

maxy¯a,b(δ)Syg(y)\displaystyle\max_{y\in\bar{\mathcal{B}}^{\prime}_{a,b}(\delta)\cap S}\left\|{y-g(y)}\right\| maxy¯a,b(δ)Sw(y)|t0||t0|maxy¯a,b(δ)S2y2+v2δ\displaystyle\leq\max_{y\in\bar{\mathcal{B}}^{\prime}_{a,b}(\delta)\cap S}\left\|{w(y)}\right\|\left|t_{0}\right|\leq\left|t_{0}\right|\max_{y\in\bar{\mathcal{B}}^{\prime}_{a,b}(\delta)\cap S}2\sqrt{\left\|{y}\right\|^{2}+\frac{\left\|{v}\right\|^{2}}{\delta^{\prime}}}
δ(Δa,b+δ12Δa,b)2δ14Δa,b.\displaystyle\lesssim\sqrt{\delta}\left(\Delta_{a,b}+\delta^{\prime-\frac{1}{2}}\Delta_{a,b}\right)\leq 2\delta^{\frac{1}{4}}\Delta_{a,b}.

In each of the above four scenarios, we have $\max_{y\in\bar{\mathcal{B}}'_{a,b}(\delta)\cap S}\left\|y-g(y)\right\|\lesssim\delta^{\frac{1}{4}}\Delta_{a,b}$, which is $o(1)\textsf{SNR}'_{a,b}$. By the argument preceding the discussion of the four scenarios, we have $\textsf{SNR}'_{a,b}(\delta)\geq(1-o(1))\textsf{SNR}'_{a,b}$ and the proof is complete. ∎

Lemma 7.11.

Assume $d=O(1)$ and $\lambda_{\min}\leq\lambda_{1}(\Sigma_{a}^{*}),\lambda_{1}(\Sigma_{b}^{*})\leq\lambda_{d}(\Sigma_{a}^{*}),\lambda_{d}(\Sigma_{b}^{*})\leq\lambda_{\max}$ where $\lambda_{\min},\lambda_{\max}>0$ are constants. Under the condition $\textsf{SNR}'_{a,b}\rightarrow\infty$, there exists a $\tilde{\delta}=o(1)$ that depends on $d,\Delta_{a,b},\lambda_{\min},\lambda_{\max}$ such that

P_{a,b}(0)\geq\exp\left(-\frac{1+\tilde{\delta}}{8}\textsf{SNR}_{a,b}^{\prime 2}\right).
Proof.

For convenience and conciseness, we will write $\theta_{a},\theta_{b},\Sigma_{a},\Sigma_{b}$ instead of $\theta_{a}^{*},\theta_{b}^{*},\Sigma_{a}^{*},\Sigma_{b}^{*}$ throughout the proof. From Lemma 6.3, we know $\textsf{SNR}'_{a,b}$ is of the same order as $\Delta_{a,b}$, which means $\Delta_{a,b}\rightarrow\infty$. As in the proof of Lemma 7.10, denote $\lambda_{1}\leq\ldots\leq\lambda_{d}$ to be the eigenvalues of $\Sigma_{a}^{\frac{1}{2}}(\Sigma_{b})^{-1}\Sigma_{a}^{\frac{1}{2}}-I_{d}$ so that its eigendecomposition can be written as $\Sigma_{a}^{\frac{1}{2}}(\Sigma_{b})^{-1}\Sigma_{a}^{\frac{1}{2}}-I_{d}=\sum_{i=1}^{d}\lambda_{i}u_{i}u_{i}^{T}$, where $\{u_{i}\}$ are orthonormal vectors. Denote $U=(u_{1},\ldots,u_{d})$ and $v=U^{T}\Sigma_{a}^{\frac{1}{2}}\Sigma_{b}^{-1}\Xi_{a,b}$. Then denote

a,b\displaystyle\mathcal{B}^{\prime}_{a,b} ={yd:iyivi+12iλiyi212Ξa,bTΣb1Ξa,b+12log|Σa||Σb|},\displaystyle=\left\{y\in\mathbb{R}^{d}:\sum_{i}y_{i}v_{i}+\frac{1}{2}\sum_{i}\lambda_{i}y_{i}^{2}\leq-\frac{1}{2}\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}+\frac{1}{2}\log\frac{|\Sigma_{a}|}{|\Sigma_{b}|}\right\},

and its boundary

¯a,b\displaystyle\bar{\mathcal{B}}^{\prime}_{a,b} ={yd:iyivi+12iλiyi2=12Ξa,bTΣb1Ξa,b+12log|Σa||Σb|}.\displaystyle=\left\{y\in\mathbb{R}^{d}:\sum_{i}y_{i}v_{i}+\frac{1}{2}\sum_{i}\lambda_{i}y_{i}^{2}=-\frac{1}{2}\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}+\frac{1}{2}\log\frac{|\Sigma_{a}|}{|\Sigma_{b}|}\right\}.

By the same argument as in the proof of Lemma 7.10, $\mathcal{B}'_{a,b}$ can be seen as a reflection-rotation of $\mathcal{B}_{a,b}$ under the transformation $y=U^{T}x$. Hence we have $\textsf{SNR}'_{a,b}=\min_{y\in\mathcal{B}'_{a,b}}2\left\|y\right\|$ and we can work with $\mathcal{B}'_{a,b}$ instead of $\mathcal{B}_{a,b}$. Denote $\bar{y}\in\mathcal{B}'_{a,b}$ to be a point such that $2\left\|\bar{y}\right\|=\textsf{SNR}'_{a,b}$. From the proof of Lemma 7.10 we also know $\bar{y}\in S$, where $S=\{y\in\mathbb{R}^{d}:\|y\|\leq 2\lambda_{\min}^{-1/2}\Delta_{a,b}\}$. In addition, we know $\bar{y}\in\bar{\mathcal{B}}'_{a,b}$.

We first give the main idea of the remaining proof. Denote $p(y)$ to be the density function of $y\sim\mathcal{N}(0,I_{d})$. We will construct a set $T\subset\mathbb{R}^{d}$ around $\bar{y}$ such that for any $y\in T$ we have $y\in\mathcal{B}'_{a,b}$ and $\left\|y-\bar{y}\right\|=o(\Delta_{a,b})$. Then we have

P_{a,b}(0)\geq\left|T\right|\inf_{y\in T}p(y)=\left|T\right|\frac{1}{(2\pi)^{\frac{d}{2}}}\exp\left(-\frac{1}{2}\max_{y\in T}\left\|y\right\|^{2}\right)
=|T|1(2π)d2exp((1+o(1))SNRa,b28).\displaystyle=\left|T\right|\frac{1}{(2\pi)^{\frac{d}{2}}}\exp\left(-(1+o(1))\frac{\textsf{SNR}_{a,b}^{{}^{\prime}2}}{8}\right). (66)

Hence if $\log\left|T\right|=o(\textsf{SNR}'_{a,b})$, the proof will be complete, and it remains to construct such a $T$. We consider the same four scenarios as in the proof of Lemma 7.10. Let $\delta=o(1)$ be some positive sequence going to 0 very slowly and denote $\delta'=\sqrt{\delta}$.

Scenario 1: $\left|\lambda_{1}\right|,\left|\lambda_{d}\right|\leq\delta^{\prime}$. Define $w=v/\left\|v\right\|$ and define $T$ as follows:

T={y=y¯+s:(IdwwT)sδ|wTs|,wTs[δΔa,b,0]}.\displaystyle T=\left\{y=\bar{y}+s:\left\|{(I_{d}-ww^{T})s}\right\|\leq\delta^{\prime}\left|w^{T}s\right|,w^{T}s\in[-\delta\Delta_{a,b},0]\right\}.

Since y¯¯a,b\bar{y}\in\bar{\mathcal{B}}^{\prime}_{a,b}, we have

iy¯ivi+12iλiy¯i2=12Ξa,bTΣb1Ξa,b+12log|Σa||Σb|.\displaystyle\sum_{i}\bar{y}_{i}v_{i}+\frac{1}{2}\sum_{i}\lambda_{i}\bar{y}_{i}^{2}=-\frac{1}{2}\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}+\frac{1}{2}\log\frac{|\Sigma_{a}|}{|\Sigma_{b}|}. (67)

It is obvious that $\max_{y\in T}\left\|y-\bar{y}\right\|\leq 2\delta\Delta_{a,b}$. Hence we only need to show that any $y\in T$ satisfies $y\in\mathcal{B}^{\prime}_{a,b}$, i.e.,

i(y¯i+si)vi+12iλi(y¯i+si)212Ξa,bTΣb1Ξa,b+12log|Σa||Σb|.\displaystyle\sum_{i}(\bar{y}_{i}+s_{i})v_{i}+\frac{1}{2}\sum_{i}\lambda_{i}(\bar{y}_{i}+s_{i})^{2}\leq-\frac{1}{2}\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}+\frac{1}{2}\log\frac{|\Sigma_{a}|}{|\Sigma_{b}|}. (68)

From the above two displays, we need to show

2isivi+iλisi2+2iλiy¯isi0.\displaystyle 2\sum_{i}s_{i}v_{i}+\sum_{i}\lambda_{i}s_{i}^{2}+2\sum_{i}\lambda_{i}\bar{y}_{i}s_{i}\leq 0. (69)
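To see the equivalence, subtract (67) from (68): the terms not involving $s$ cancel, and

$$\sum_{i}(\bar{y}_{i}+s_{i})v_{i}+\frac{1}{2}\sum_{i}\lambda_{i}(\bar{y}_{i}+s_{i})^{2}-\sum_{i}\bar{y}_{i}v_{i}-\frac{1}{2}\sum_{i}\lambda_{i}\bar{y}_{i}^{2}=\sum_{i}s_{i}v_{i}+\sum_{i}\lambda_{i}\bar{y}_{i}s_{i}+\frac{1}{2}\sum_{i}\lambda_{i}s_{i}^{2},$$

so (68) holds exactly when twice this quantity is nonpositive, which is (69).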

Note that any such $s$ satisfies $\left\|s\right\|\leq 2\left|w^{T}s\right|$. Hence, using $\left|\lambda_{i}\right|\leq\delta^{\prime}$ for all $i$ and $w^{T}s\leq 0$,

$$2\sum_{i}s_{i}v_{i}+\sum_{i}\lambda_{i}s_{i}^{2}+2\sum_{i}\lambda_{i}\bar{y}_{i}s_{i}\leq 2\left\|v\right\|w^{T}s+\delta^{\prime}\left\|s\right\|^{2}+2\delta^{\prime}\left\|\bar{y}\right\|\left\|s\right\|$$
$$\leq 2\left\|v\right\|w^{T}s+4\delta^{\prime}\left|w^{T}s\right|^{2}+4\delta^{\prime}\left\|\bar{y}\right\|\left|w^{T}s\right|$$
$$=\left|w^{T}s\right|\left(-2\left\|v\right\|+4\delta^{\prime}\left|w^{T}s\right|+4\delta^{\prime}\left\|\bar{y}\right\|\right)$$
$$\leq 0,$$

where in the last inequality we use $\left|w^{T}s\right|\leq\delta\Delta_{a,b}$ and the fact that $\left\|v\right\|,\left\|\bar{y}\right\|$ are of order $\Delta_{a,b}$ while $\delta^{\prime}\rightarrow 0$. Hence, for any $y\in T$, we have shown $y\in\mathcal{B}^{\prime}_{a,b}$. From Lemma 7.12, we have $\left|T\right|\geq\exp\left(d\log\frac{\delta\delta^{\prime}\Delta_{a,b}}{4}-\frac{d}{2}\log d\right)$. Since $d=O(1)$, $\Delta_{a,b}\rightarrow\infty$, and $\delta$ goes to 0 slowly, (66) leads to $P_{1,2}(0)\geq\exp\left(-(1+o(1))\frac{\textsf{SNR}_{a,b}^{\prime 2}}{8}\right)$.
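Concretely, Lemma 7.12 applies after translating by $\bar{y}$ and rotating $w$ to the first standard basis vector (both maps preserve volume): $T$ then contains the set of Lemma 7.12 with $r=\delta^{\prime}$ and $t=\delta\Delta_{a,b}/2$, which yields the stated bound $\left|T\right|\geq\exp\left(d\log\frac{rt}{2}-\frac{d}{2}\log d\right)=\exp\left(d\log\frac{\delta\delta^{\prime}\Delta_{a,b}}{4}-\frac{d}{2}\log d\right)$.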

Scenario 2: $\lambda_{1}<-\delta^{\prime}$. Denote by $e_{1}$ the first standard basis vector of $\mathbb{R}^{d}$. Define $T$ as

T={y=y¯+s:(Ide1e1T)sδ|e1Ts|,sign(v1+λ1y¯1)e1Ts[2δ14Δa,b,δ14Δa,b]}.\displaystyle T=\left\{y=\bar{y}+s:\left\|{(I_{d}-e_{1}e_{1}^{T})s}\right\|\leq\delta\left|e_{1}^{T}s\right|,\text{sign}(v_{1}+\lambda_{1}\bar{y}_{1})e_{1}^{T}s\in[-2\delta^{\frac{1}{4}}\Delta_{a,b},-\delta^{\frac{1}{4}}\Delta_{a,b}]\right\}.

Here for the sign function we define $\text{sign}(0)=1$. It is obvious that $\max_{y\in T}\left\|y-\bar{y}\right\|\leq 4\delta^{\frac{1}{4}}\Delta_{a,b}=o(\Delta_{a,b})$. Hence we only need to establish (69) to show that any $y\in T$ satisfies $y\in\mathcal{B}^{\prime}_{a,b}$. Note that

$$2\sum_{i}s_{i}v_{i}+\sum_{i}\lambda_{i}s_{i}^{2}+2\sum_{i}\lambda_{i}\bar{y}_{i}s_{i}=2s_{1}(v_{1}+\lambda_{1}\bar{y}_{1})+\lambda_{1}s_{1}^{2}+2\sum_{i\geq 2}s_{i}(v_{i}+\lambda_{i}\bar{y}_{i})+\sum_{i\geq 2}\lambda_{i}s_{i}^{2}$$
$$\leq 2s_{1}(v_{1}+\lambda_{1}\bar{y}_{1})+\lambda_{1}s_{1}^{2}+2\left\|(I-e_{1}e_{1}^{T})s\right\|\left(\left\|v\right\|+\max_{j}\left|\lambda_{j}\right|\left\|\bar{y}\right\|\right)+\max_{j}\left|\lambda_{j}\right|\left\|(I-e_{1}e_{1}^{T})s\right\|^{2}$$
$$\leq\left(\lambda_{1}+\delta^{2}\max_{j}\left|\lambda_{j}\right|\right)s_{1}^{2}-2\left|v_{1}+\lambda_{1}\bar{y}_{1}\right|\left|s_{1}\right|+2\left(\left\|v\right\|+\max_{j}\left|\lambda_{j}\right|\left\|\bar{y}\right\|\right)\delta\left|s_{1}\right|$$
$$\leq-\frac{\delta^{\prime}}{2}s_{1}^{2}+O(\Delta_{a,b})\delta\left|s_{1}\right|,$$

where we use $\max_{j}\left|\lambda_{j}\right|=O(1)$, that $\left\|v\right\|,\left\|\bar{y}\right\|$ are of order $\Delta_{a,b}$, and that $\lambda_{1}<-\delta^{\prime}$ dominates $\delta^{2}\max_{j}\left|\lambda_{j}\right|$. The right hand side is negative when $\left|s_{1}\right|\in[\delta^{\frac{1}{4}}\Delta_{a,b},2\delta^{\frac{1}{4}}\Delta_{a,b}]$, as spelled out below. From Lemma 7.12, we have $\left|T\right|\geq\exp\left(d\log\frac{\delta^{\frac{5}{4}}\Delta_{a,b}}{4}-\frac{d}{2}\log d\right)$. Then (66) leads to the desired result.
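To spell out the sign check: since $\delta^{\prime}=\sqrt{\delta}$ and $\left|s_{1}\right|\geq\delta^{\frac{1}{4}}\Delta_{a,b}$,

$$-\frac{\delta^{\prime}}{2}s_{1}^{2}+O(\Delta_{a,b})\delta\left|s_{1}\right|\leq\left|s_{1}\right|\Delta_{a,b}\left(-\frac{\delta^{\frac{3}{4}}}{2}+O(\delta)\right)<0$$

eventually, because $\delta^{\frac{3}{4}}\gg\delta$ as $\delta\rightarrow 0$.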

Scenario 3: $\lambda_{1}\geq-\delta^{\prime}$ and there exists a $j\in[d]$ such that $\lambda_{j}\leq\delta^{\prime}$ and $\left|v_{j}\right|\geq\sqrt{\delta^{\prime}}\Delta_{a,b}$. Denote by $e_{j}$ the $j$th standard basis vector of $\mathbb{R}^{d}$. Define $T$ as

T={y=y¯+s:(IdejejT)sδ|ejTs|,sign(vj+λjy¯j)ejTs[δΔa,b,0]}.\displaystyle T=\left\{y=\bar{y}+s:\left\|{(I_{d}-e_{j}e_{j}^{T})s}\right\|\leq\delta^{\prime}\left|e_{j}^{T}s\right|,\text{sign}(v_{j}+\lambda_{j}\bar{y}_{j})e_{j}^{T}s\in[-\delta\Delta_{a,b},0]\right\}.

Again we set $\text{sign}(0)=1$, and it is obvious that $\max_{y\in T}\left\|y-\bar{y}\right\|\leq 2\delta\Delta_{a,b}$. Now we are going to verify (69), i.e., to show $\lambda_{j}s_{j}^{2}+2s_{j}(v_{j}+\lambda_{j}\bar{y}_{j})\leq-\sum_{i\neq j}\lambda_{i}s_{i}^{2}-2\sum_{i\neq j}s_{i}(v_{i}+\lambda_{i}\bar{y}_{i})$. On the one hand, we have

$$\lambda_{j}s_{j}^{2}+2s_{j}(v_{j}+\lambda_{j}\bar{y}_{j})=\lambda_{j}s_{j}^{2}-2\left|s_{j}\right|\left|v_{j}+\lambda_{j}\bar{y}_{j}\right|$$
$$\leq\delta^{\prime}\delta\Delta_{a,b}\left|s_{j}\right|-2\left(\sqrt{\delta^{\prime}}\Delta_{a,b}-\delta^{\prime}O(\Delta_{a,b})\right)\left|s_{j}\right|$$
$$\leq-\sqrt{\delta^{\prime}}\Delta_{a,b}\left|s_{j}\right|,$$

where we use $\left|\lambda_{j}\right|\leq\delta^{\prime}$ together with $\left|s_{j}\right|\leq\delta\Delta_{a,b}$, and $\left|v_{j}+\lambda_{j}\bar{y}_{j}\right|\geq\left|v_{j}\right|-\delta^{\prime}\left\|\bar{y}\right\|$.

On the other hand, we have

ijλisi22ijsi(vi+λiy¯i)\displaystyle-\sum_{i\neq j}\lambda_{i}s_{i}^{2}-2\sum_{i\neq j}s_{i}(v_{i}+\lambda_{i}\bar{y}_{i}) maxj|λj|(IejejT)s22(IejejT)s(v+maxj|λj|y¯)\displaystyle\geq-\max_{j}\left|\lambda_{j}\right|\left\|{(I-e_{j}e_{j}^{T})s}\right\|^{2}-2\left\|{(I-e_{j}e_{j}^{T})s}\right\|\left(\left\|{v}\right\|+\max_{j}\left|\lambda_{j}\right|\left\|{\bar{y}}\right\|\right)
$$\geq-\max_{j}\left|\lambda_{j}\right|\delta^{\prime 2}\left|s_{j}\right|^{2}-2\delta^{\prime}\left|s_{j}\right|\left(\left\|v\right\|+\max_{j}\left|\lambda_{j}\right|\left\|\bar{y}\right\|\right)$$
2δ(δΔa,bmaxj|λj|+v+maxj|λj|y¯)|sj|\displaystyle\geq-2\delta^{\prime}\left(\delta\Delta_{a,b}\max_{j}\left|\lambda_{j}\right|+\left\|{v}\right\|+\max_{j}\left|\lambda_{j}\right|\left\|{\bar{y}}\right\|\right)\left|s_{j}\right|
2δO(Δa,b)|sj|\displaystyle\geq-2\delta^{\prime}O(\Delta_{a,b})\left|s_{j}\right|
δΔa,b|sj|,\displaystyle\geq-\sqrt{\delta^{\prime}}\Delta_{a,b}\left|s_{j}\right|,

where we use $\max_{j}\left|\lambda_{j}\right|=O(1)$ and that $\left\|v\right\|,\left\|\bar{y}\right\|$ are of order $\Delta_{a,b}$. Hence (69) is established. From Lemma 7.12, we have $\left|T\right|\geq\exp\left(d\log\frac{\delta\delta^{\prime}\Delta_{a,b}}{4}-\frac{d}{2}\log d\right)$. Then (66) leads to the desired result.

Scenario 4: $\lambda_{1}\geq-\delta^{\prime}$ and $\left|v_{j}\right|<\sqrt{\delta^{\prime}}\Delta_{a,b}$ for all $j\in[d]$ such that $\lambda_{j}\leq\delta^{\prime}$. Denote by $m$ the integer such that $\lambda_{j}\leq\delta^{\prime}$ for all $j\leq m$ and $\lambda_{j}>\delta^{\prime}$ for all $j>m$. We must have $m<d$; otherwise this scenario reduces to Scenario 1.

Define $w\in\mathbb{R}^{d}$ to be the unit vector such that

$$w_{i}=\begin{cases}\frac{\lambda_{i}\bar{y}_{i}+v_{i}}{\sqrt{\sum_{j>m}(\lambda_{j}\bar{y}_{j}+v_{j})^{2}}},&\text{ for all }i>m,\\ 0,&\text{ otherwise.}\end{cases}$$

Define

T={y=y¯+s:(IdwwT)sδ|wTs|,wTs[δΔa,b,0]}.\displaystyle T=\left\{y=\bar{y}+s:\left\|{(I_{d}-ww^{T})s}\right\|\leq\delta^{\prime}\left|w^{T}s\right|,w^{T}s\in[-\delta\Delta_{a,b},0]\right\}.
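Since $w_{i}=0$ for all $i\leq m$, any $y=\bar{y}+s\in T$ satisfies $\sqrt{\sum_{i\leq m}s_{i}^{2}}\leq\left\|(I_{d}-ww^{T})s\right\|\leq\delta^{\prime}\left|w^{T}s\right|$ and $\left\|s\right\|^{2}=\left|w^{T}s\right|^{2}+\left\|(I_{d}-ww^{T})s\right\|^{2}\leq 2\left|w^{T}s\right|^{2}$; both facts are used repeatedly below.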

Now we are going to verify (69), i.e., to show $2\sum_{i>m}s_{i}(v_{i}+\lambda_{i}\bar{y}_{i})+\sum_{i}\lambda_{i}s_{i}^{2}+2\sum_{i\leq m}s_{i}(v_{i}+\lambda_{i}\bar{y}_{i})\leq 0$. On the one hand, we have

$$2\sum_{i>m}s_{i}(v_{i}+\lambda_{i}\bar{y}_{i})=2\sqrt{\sum_{j>m}(v_{j}+\lambda_{j}\bar{y}_{j})^{2}}\sum_{i>m}s_{i}w_{i}=-2\sqrt{\sum_{j>m}(v_{j}+\lambda_{j}\bar{y}_{j})^{2}}\left|w^{T}s\right|$$
$$\leq-2\sqrt{\delta^{\prime}}\sqrt{\sum_{j>m}\lambda_{j}\left(\bar{y}_{j}+\frac{v_{j}}{\lambda_{j}}\right)^{2}}\left|w^{T}s\right|,$$

where the last step uses $(v_{j}+\lambda_{j}\bar{y}_{j})^{2}=\lambda_{j}^{2}(\bar{y}_{j}+v_{j}/\lambda_{j})^{2}\geq\delta^{\prime}\lambda_{j}(\bar{y}_{j}+v_{j}/\lambda_{j})^{2}$ for all $j>m$.

We are going to give a lower bound for $\sum_{j>m}\lambda_{j}\left(\bar{y}_{j}+\frac{v_{j}}{\lambda_{j}}\right)^{2}$. Note that (67) can be written as

i>mλi(y¯i+viλi)2\displaystyle\sum_{i>m}\lambda_{i}\left(\bar{y}_{i}+\frac{v_{i}}{\lambda_{i}}\right)^{2} =i>mvi2λiΞa,bTΣb1Ξa,b+log|Σa||Σb|(2imy¯ivi+imλiy¯i2).\displaystyle=\sum_{i>m}\frac{v_{i}^{2}}{\lambda_{i}}-\Xi_{a,b}^{T}\Sigma_{b}^{-1}\Xi_{a,b}+\log\frac{|\Sigma_{a}|}{|\Sigma_{b}|}-\left(2\sum_{i\leq m}\bar{y}_{i}v_{i}+\sum_{i\leq m}\lambda_{i}\bar{y}_{i}^{2}\right).

Denote y~=UT(Σa1/2Ξa,b)\tilde{y}=U^{T}(-\Sigma_{a}^{-1/2}\Xi_{a,b}). Using (64), we have

i>mλi(y¯i+viλi)2\displaystyle\sum_{i>m}\lambda_{i}\left(\bar{y}_{i}+\frac{v_{i}}{\lambda_{i}}\right)^{2} Ξa,bTΣa1Ξa,b+(2imy~ivi+imλiy~i2)+log|Σa||Σb|(2imy¯ivi+imλiy¯i2)\displaystyle\geq\Xi_{a,b}^{T}\Sigma_{a}^{-1}\Xi_{a,b}+\left(2\sum_{i\leq m}\tilde{y}_{i}v_{i}+\sum_{i\leq m}\lambda_{i}\tilde{y}_{i}^{2}\right)+\log\frac{|\Sigma_{a}|}{|\Sigma_{b}|}-\left(2\sum_{i\leq m}\bar{y}_{i}v_{i}+\sum_{i\leq m}\lambda_{i}\bar{y}_{i}^{2}\right)
$$\geq\Xi_{a,b}^{T}\Sigma_{a}^{-1}\Xi_{a,b}+\log\frac{|\Sigma_{a}|}{|\Sigma_{b}|}-16\sqrt{\delta^{\prime}d}\lambda_{\min}^{-\frac{1}{2}}\Delta_{a,b}^{2}$$
$$\geq C\Delta_{a,b}^{2},$$

for some constant $C>0$. Here the second inequality is by the same argument as (65), and the last inequality uses the facts that $\Xi_{a,b}^{T}\Sigma_{a}^{-1}\Xi_{a,b}$ is of order $\Delta_{a,b}^{2}$, that $\log\frac{|\Sigma_{a}|}{|\Sigma_{b}|}=O(1)$ under the eigenvalue assumptions, and that $d=O(1)$ with $\delta^{\prime}\rightarrow 0$, so the error term $16\sqrt{\delta^{\prime}d}\lambda_{\min}^{-\frac{1}{2}}\Delta_{a,b}^{2}=o(\Delta_{a,b}^{2})$. Hence,

$$2\sum_{i>m}s_{i}(v_{i}+\lambda_{i}\bar{y}_{i})\leq-2\sqrt{C\delta^{\prime}}\Delta_{a,b}\left|w^{T}s\right|.$$

On the other hand, we have

iλisi2+2imsi(vi+λiy¯i)\displaystyle\sum_{i}\lambda_{i}s_{i}^{2}+2\sum_{i\leq m}s_{i}(v_{i}+\lambda_{i}\bar{y}_{i}) maxj|λj|s2+2imsi2(v+maxj|λj|y¯)\displaystyle\leq\max_{j}\left|\lambda_{j}\right|\left\|{s}\right\|^{2}+2\sqrt{\sum_{i\leq m}s_{i}^{2}}\left(\left\|{v}\right\|+\max_{j}\left|\lambda_{j}\right|\left\|{\bar{y}}\right\|\right)
2maxj|λj||wTs|2+2(v+maxj|λj|y¯)(IwwT)s\displaystyle\leq 2\max_{j}\left|\lambda_{j}\right|\left|w^{T}s\right|^{2}+2\left(\left\|{v}\right\|+\max_{j}\left|\lambda_{j}\right|\left\|{\bar{y}}\right\|\right)\left\|{(I-ww^{T})s}\right\|
2(maxj|λj|δΔa,b+δ(v+maxj|λj|y¯))|wTs|\displaystyle\leq 2\left(\max_{j}\left|\lambda_{j}\right|\delta\Delta_{a,b}+\delta^{\prime}\left(\left\|{v}\right\|+\max_{j}\left|\lambda_{j}\right|\left\|{\bar{y}}\right\|\right)\right)\left|w^{T}s\right|
O(δΔa,b)|wTs|,\displaystyle\leq O(\delta^{\prime}\Delta_{a,b})\left|w^{T}s\right|,

where we use the constraints on $s$ implied by $y\in T$, as recorded after the definition of $T$. Summing the above two displays, (69) follows since $\sqrt{C\delta^{\prime}}\Delta_{a,b}$ dominates $O(\delta^{\prime}\Delta_{a,b})$ as $\delta^{\prime}\rightarrow 0$. From Lemma 7.12, we have $\left|T\right|\geq\exp\left(d\log\frac{\delta\delta^{\prime}\Delta_{a,b}}{4}-\frac{d}{2}\log d\right)$. Then (66) leads to the desired result. ∎

Lemma 7.12.

Consider any positive integer dd and any 0<r<10<r<1, t>0t>0. Define a set T={yd:(i2yi2)1/2r|y1|,y1[2t,t]}T=\left\{y\in\mathbb{R}^{d}:\left(\sum_{i\geq 2}y_{i}^{2}\right)^{1/2}\leq r\left|y_{1}\right|,y_{1}\in[-2t,-t]\right\}. Then we have

|T|exp(dlogrt2d2logd).\displaystyle\left|T\right|\geq\exp\left(d\log\frac{rt}{2}-\frac{d}{2}\log d\right).
Proof.

Define a $d$-dimensional ball $B=\left\{y\in\mathbb{R}^{d}:(y_{1}+1.5t)^{2}+\sum_{i\geq 2}y_{i}^{2}\leq(rt/2)^{2}\right\}$. We can easily verify that $B\subset T$. First, for all $y\in B$, we have $\left|y_{1}+1.5t\right|\leq rt/2<t/2$ as $r\in(0,1)$, and hence $y_{1}\in[-2t,-t]$. Then, we have $\left(\sum_{i\geq 2}y_{i}^{2}\right)^{1/2}\leq rt/2\leq r\left|y_{1}\right|$ since $\left|y_{1}\right|\geq t$. As a result, by the expression of the volume of a $d$-dimensional ball, we have

|T||B|=πd2Γ(d2+1)(rt2)d1dd2(rt2)d=exp(dlogrt2d2logd),\displaystyle\left|T\right|\geq\left|B\right|=\frac{\pi^{\frac{d}{2}}}{\Gamma(\frac{d}{2}+1)}\left(\frac{rt}{2}\right)^{d}\geq\frac{1}{d^{\frac{d}{2}}}\left(\frac{rt}{2}\right)^{d}=\exp\left(d\log\frac{rt}{2}-\frac{d}{2}\log d\right),

where Γ()\Gamma(\cdot) is the Gamma function. ∎
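The elementary inequality $\pi^{d/2}/\Gamma(\frac{d}{2}+1)\geq d^{-d/2}$ used above is easy to check numerically; the short script below (illustrative only, with arbitrary values of $r$ and $t$) compares the exact log-volume of $B$ with the claimed lower bound:

```python
# Numerical sanity check of the volume bound in Lemma 7.12:
#   vol(B) = pi^(d/2) / Gamma(d/2 + 1) * (r*t/2)^d  >=  exp(d*log(r*t/2) - (d/2)*log(d)).
import math

r, t = 0.5, 2.0                      # arbitrary illustrative constants, 0 < r < 1, t > 0
for d in [1, 2, 3, 5, 10, 20, 50]:
    log_ball = (d / 2) * math.log(math.pi) - math.lgamma(d / 2 + 1) + d * math.log(r * t / 2)
    log_bound = d * math.log(r * t / 2) - (d / 2) * math.log(d)
    assert log_ball >= log_bound     # the bound holds for every d
    print(f"d = {d:2d}: log vol(B) = {log_ball:8.3f} >= bound = {log_bound:8.3f}")
```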
