
Feature screening for clustering analysis

Changhu Wang  
School of Mathematical Sciences, Peking University
and
Zihao Chen  
School of Mathematical Sciences, Peking University
and
Ruibin Xi
School of Mathematical Sciences, Peking University
Center for Statistical Sciences, Peking University
The authors gratefully acknowledge the National Key Basic Research Project of China (2020YFE0204000), the National Natural Science Foundation of China (11971039), and the Sino-Russian Mathematics Center.
Abstract

In this paper, we consider feature screening for ultrahigh dimensional clustering analyses. Based on the observation that the marginal distribution of any given feature is a mixture of its conditional distributions in different clusters, we propose to screen clustering features by independently evaluating the homogeneity of each feature's mixture distribution. Important cluster-relevant features have heterogeneous components in their mixture distributions, whereas unimportant features have homogeneous components. The well-known EM-test statistic is used to evaluate the homogeneity. Under general parametric settings, we establish tail probability bounds of the EM-test statistic for homogeneous and heterogeneous features, and further show that the proposed screening procedure can achieve the sure independent screening property and even consistency in selection. The limiting distribution of the EM-test statistic is also derived for general parametric distributions. The proposed method is computationally efficient, can accurately screen for important cluster-relevant features, and helps to significantly improve clustering, as demonstrated in our extensive simulation and real data analyses.


Keywords: Clustering analyses; feature screening; homogeneity test.

1 Introduction

High dimensional data is prevalent in a wide range of research fields and applications, such as biological studies, financial studies and image data analyses. In high dimensional data, the number of features p is very large and can be much larger than the number of samples n (p\gg n). One of the most important tasks of high dimensional data analyses is to cluster the samples and uncover unknown groups and structures in the data. In real applications, cluster-relevant features often constitute only a small proportion of the p features, and the other features are cluster-irrelevant. Incorporating the irrelevant features in clustering analyses can blur the differences between clusters, significantly reduce the clustering accuracy, and make clustering computationally more demanding, especially when p is large. If one can accurately distinguish cluster-relevant features from cluster-irrelevant features, clustering analyses could be significantly improved in terms of both clustering accuracy and computational efficiency (see Figure 1 for an example).

Figure 1: An example of simulated data generated as in Section 4. The data has 5 clusters and 5000 features. The first 30 features are cluster-relevant and the other 4970 features are cluster-irrelevant. We perform dimension reduction using the uniform manifold approximation and projection (UMAP). The left plot is the UMAP plot using all features and the right plot is the UMAP plot using the features selected by the proposed method. The points in the plots are colored by their true cluster labels. The large number of cluster-irrelevant features makes the clusters difficult to distinguish. After feature screening, the clusters are much easier to distinguish.

We consider the feature screening problem in clustering analyses of high dimensional data. Suppose that {\bf x}_{i}=(x_{i1},\dots,x_{ip})^{\rm T}\in\mathbb{R}^{p} (i=1,\dots,n) are n independent observations. The n samples are from G clusters and their cluster labels are unknown. We assume that only s of the p features (often, s\ll p) contain cluster label information and all other features are independent of the clusters. We aim to develop a computationally efficient statistical method that can effectively screen out the cluster-irrelevant features, while retaining all or almost all cluster-relevant features. Traditional clustering algorithms such as the k-means algorithm can then be applied to the retained features to obtain the sample clusters. We are most interested in high dimensional count data, although the method developed here can also be applied to continuous data.

One motivation of this work is single-cell RNA sequencing (scRNA-seq) analysis (Kiselev et al., 2019). In recent scRNA-seq studies, gene expressions in single cells are profiled for over 10,000 genes, and the raw expression values are small counts (< 20 for the majority of genes). The unknown cell types of single cells are often assigned based on clustering analyses of gene expressions. However, only marker genes differentially expressed among different cell types are useful for cell type identification. To address the high-dimensional clustering problem, scRNA-seq studies often use only the so-called highly variable genes for clustering analysis, based on the assumption that genes with larger expression variances are more likely to be marker genes. Though this strategy has been widely adopted, selecting highly variable genes can include many non-marker genes and exclude many marker genes, thus leading to inaccurate clustering (Andrews and Hemberg, 2019).

The supervised screening problem has been extensively studied and many methods have been developed, such as Fan and Lv (2008), Zhu et al. (2011) and Li et al. (2012), among many others (Liu et al., 2015). These methods are particularly suitable for ultra-high dimensional supervised learning problems, which bring many statistical and computational challenges to traditional variable selection methods. Given the response variable, supervised screening methods can measure each predictor's association with the response independently. The predictors are then ranked by their association strengths and the top predictors are retained. With a proper threshold, these screening methods can correctly select all important features with a high probability, or even correctly distinguish the important and unimportant features with a high probability; these are known as the sure independent screening property (Fan and Lv, 2008) and the consistency in selection property (Li et al., 2012), respectively.

Unsupervised feature screening is more challenging because there is no response variable. Variable selection methods for clustering analyses have been developed (Witten and Tibshirani, 2010; Fop and Murphy, 2018; Liu et al., 2022). However, similar to the supervised variable selection methods, when the dimensionality is ultrahigh, their performance is challenged in terms of both statistical accuracy and computational efficiency. To address ultra-high dimensional problems, the pioneering work by Chan and Hall (2010) developed a feature screening method by testing the unimodality of each feature's distribution. Features with unimodal distributions are cluster-irrelevant and should be screened out. More recently, Jin and Wang (2016) developed an innovative method called IF-PCA for ultra-high dimensional clustering analysis. IF-PCA first screens for cluster-relevant features using the Kolmogorov-Smirnov (KS) test and then applies the k-means algorithm for clustering. Liu et al. (2022) proposed a non-parametric feature screening method called SC-FS, which performs feature screening by correlating each feature with a pre-clustering label. However, the few available screening methods are either developed for continuous data or require pre-clustering of the data. The continuous methods are not suitable for count data, and in ultrahigh dimensional settings, the pre-clustering can be very inaccurate, making methods that rely on the pre-clustering results inaccurate as well.

In this paper, we develop a general parametric feature screening method that can be applied to both continuous and count data. Marginally, all features can be viewed as following mixture distributions. However, the mixture components of a cluster-relevant feature are not all the same (heterogeneous distribution), while those of a cluster-irrelevant feature are all the same (homogeneous distribution). Therefore, we can test whether a feature is cluster-relevant without the cluster labels. Observe that multi-modal distributions are mixture distributions of unimodal distributions. Thus, our method essentially uses the same characteristic as Chan and Hall (2010) for feature screening of clustering analyses.

We propose to use the EM-test, a well-known homogeneity test for mixture models, for feature screening. The EM-test was originally developed to overcome the critical problems of likelihood ratio tests for homogeneity (Hartigan, 1985; Chernoff and Lander, 1995). Limiting distributions under homogeneity were previously available for mixture models of one-parameter distributions or of two components (Li et al., 2009; Niu et al., 2011; Li and Chen, 2010). In this paper, in addition to the limiting distribution, we establish theoretical properties of the EM-test for feature screening of clustering analyses under general settings of mixture models. The mixture models are allowed to be mixtures of multi-parameter distributions and/or of multiple components. The major theoretical results include the following.

  • Under the homogeneous model, the EM-test statistic is bounded with a high probability; more specifically, the probability that the EM-test statistic exceeds any t>0 decays to zero at a polynomial rate with respect to t.

  • Under the heterogeneous model, the EM-test statistic diverges to infinity with probability approaching one at an exponential rate.

  • When the dimensionality p goes to infinity exponentially with the sample size n (more precisely, as \exp(n^{\beta}) for 0<\beta<1/2), the screening procedure based on the EM-test achieves the sure independent screening property (Fan and Lv, 2008). If p goes to infinity at any polynomial order of n, we can even achieve consistency in selection (Li et al., 2012).

We perform extensive simulation studies and find that the EM-test can accurately screen for cluster-relevant features. After feature screening, clustering accuracy can also be significantly improved. In an application to scRNA-seq data, we find that the EM-test renders more accurate single-cell clustering and enables the detection of a rare cell type that is difficult to detect with other methods.

The rest of the paper is organized as follows. Section 2 introduces the model setup and the EM-test, and defines basic notations. Section 3 gives the bounds of the tail probabilities under the homogeneous and heterogeneous models, and further establishes the sure independent screening and model selection consistency properties. The limiting distribution of the EM-test statistic is also presented in Section 3. Simulation and real data analyses are presented in Sections 4 and 5, respectively. Finally, in Section 6, we discuss the limitations of this research and future research directions for high dimensional clustering feature screening. The proofs of the results are presented in the Supplementary material.

2 Model setup and the EM-test

In this section, we present the statistical model setup for the feature screening of clustering analyses and introduce the screening procedure based on the EM-test statistic.

2.1 Model setup for feature screening of clustering analyses

Suppose that we have n independent observations {\bf x}_{i}=(x_{i1},\dots,x_{ip})^{\rm T}\in\mathbb{R}^{p} (i=1,\dots,n) from G clusters, and let {\bm{\alpha}}=(\alpha_{1},\dots,\alpha_{G}) be the proportions of the clusters (\sum_{g=1}^{G}\alpha_{g}=1, \alpha_{g}>0, g=1,\dots,G). We denote the unknown cluster labels as g_{i}\in\{1,\dots,G\} (i=1,\dots,n). Assume that given the cluster label g, the conditional distribution F_{j}(x|g) of x_{ij} is from a known identifiable parametric distribution family \mathcal{P}=\{f(x;\bm{\theta}):\bm{\theta}\in\Theta\subset\mathbb{R}^{d}\}, where f(x;\bm{\theta}) is the density function with respect to a \sigma-finite measure \mu on \mathbb{R}, parameterized by \bm{\theta}, and \Theta\subset\mathbb{R}^{d} is a convex compact parameter space. Note that for count data, the measure \mu can be taken as the counting measure on the nonnegative integers; for continuous data, \mu is the Lebesgue measure on \mathbb{R}. Thus, our method and theory apply to both count and continuous data. Define \Xi=\Theta^{G} as the product space of \Theta.

In high dimensional clustering problems, only a small portion of the p features contain information about the cluster labels and the majority of them are irrelevant to the sample clusters. Our goal is to screen out the cluster-irrelevant features to facilitate downstream clustering analysis. Intuitively, if the jth random variable x_{j}\in\mathbb{R} is unrelated to the cluster label g, the conditional distribution F_{j}(x|g) of x_{j} given the cluster label g should be independent of g, that is, F_{j}(x|g=1)=\cdots=F_{j}(x|g=G). If, on the other hand, the jth random variable x_{j}\in\mathbb{R} is a cluster-relevant feature, there are at least two labels g\neq g^{\prime} such that F_{j}(x|g)\neq F_{j}(x|g^{\prime}).

Let f(x;\bm{\theta}_{jg}) be the density function of the conditional distribution F_{j}(x|g). Since the labels are unknown, the random variable x_{j} follows a mixture distribution \varphi(x;\bm{\xi}_{j},{\bm{\alpha}})=\sum_{g=1}^{G}\alpha_{g}f(x;\bm{\theta}_{jg}), where \bm{\xi}_{j}=(\bm{\theta}_{j1}^{\rm T},\dots,\bm{\theta}_{jG}^{\rm T})^{\rm T}\in\Xi=\Theta^{G}. Define the interior of the (G-1)-dimensional probability simplex as \mathbb{S}^{G-1}=\{{\bm{\alpha}}\in\mathbb{R}^{G}:\sum_{g=1}^{G}\alpha_{g}=1,\alpha_{g}>0,\mbox{ for }g=1,\dots,G\} and the G-mixture distribution family as

\mathcal{P}^{G}=\left\{\sum_{g=1}^{G}\alpha_{g}f(x;\bm{\theta}_{g}):{\bm{\alpha}}\in\mathbb{S}^{G-1},\bm{\theta}_{g}\in\Theta,g=1,\dots,G\right\}.

We have \varphi(x;\bm{\xi}_{j},{\bm{\alpha}})\in\mathcal{P}^{G}. In this paper, we assume that \mathcal{P}^{G} is an identifiable finite mixture family; in other words, \mathcal{P} is a linearly independent set over the field of real numbers (Yakowitz and Spragins, 1968). For a cluster-irrelevant feature j, \bm{\theta}_{j1}=\cdots=\bm{\theta}_{jG} and thus \varphi(x;\bm{\xi}_{j},{\bm{\alpha}})\in\mathcal{P}. For a cluster-relevant feature j, there are at least two labels g\neq g^{\prime} such that \bm{\theta}_{jg}\neq\bm{\theta}_{jg^{\prime}}, and hence \varphi(x;\bm{\xi}_{j},{\bm{\alpha}})\in\mathcal{P}^{G}\backslash\mathcal{P}. Therefore, we can consider the following hypothesis testing problems to screen for the cluster-relevant features:

\mathbb{H}_{j0}:\varphi(x;\bm{\xi}_{j},{\bm{\alpha}})\in\mathcal{P}\ \mbox{ v.s. }\ \mathbb{H}_{j1}:\varphi(x;\bm{\xi}_{j},{\bm{\alpha}})\in\mathcal{P}^{G}\backslash\mathcal{P}. (1)

We call the models under the null hypotheses \mathbb{H}_{j0} homogeneous models, and those under the alternative hypotheses \mathbb{H}_{j1} heterogeneous models. In real applications, the number of clusters G is often unknown. However, we can often obtain a rough estimate of G and choose G to be larger than the true number of clusters. In such cases, the null and alternative hypotheses still hold for the cluster-irrelevant and cluster-relevant features, respectively. Simulation shows that the choice of G has little influence on the performance of the EM-test, especially when G is chosen to be larger than the true number of clusters (Supplementary Section D).

2.2 The EM-test statistic and the screening procedure

We use the EM-test statistic for feature screening of clustering analyses. Theoretical results of the EM-test statistic are developed for the hypothesis testing problem (1) under general settings with multiple parameters (d\geq 1), multiple components (G\geq 2) and both continuous and count data. Let {\bf x}=(x_{1},\dots,x_{n}) be a random sample of size n from a G-mixture model

\varphi(x;\bm{\xi},{\bm{\alpha}})=\sum_{g=1}^{G}\alpha_{g}f(x;\bm{\theta}_{g}), (2)

where \bm{\theta}_{g}\in\Theta (g=1,\ldots,G), \bm{\xi}=(\bm{\theta}_{1},\ldots,\bm{\theta}_{G}) and {\bm{\alpha}}=(\alpha_{1},\dots,\alpha_{G}). Let l_{n}(\bm{\xi},{\bm{\alpha}})=\sum_{i=1}^{n}\log\varphi(x_{i};\bm{\xi},{\bm{\alpha}}) be the log-likelihood function, and define the penalized log-likelihood function as

pl_{n}(\bm{\xi},{\bm{\alpha}})=\sum_{i=1}^{n}\log\varphi(x_{i};\bm{\xi},{\bm{\alpha}})+p({\bm{\alpha}}), (3)

where p({\bm{\alpha}})=\lambda\left(\sum_{g=1}^{G}\log\alpha_{g}+G\log G\right) is a penalty function and \lambda>0 is a penalty parameter, which is always set to 0.00001 in the simulation and real data analyses of this paper. Simulation shows that the EM-test is robust to the choice of the penalty parameter \lambda, and \lambda=0.00001 gives very similar results to other choices of \lambda (Supplementary Table S3). Roughly speaking, the EM-test statistic is the difference between the maximum penalized log-likelihoods of the heterogeneous and homogeneous models, where the maximum penalized log-likelihood under the heterogeneous model is obtained using the EM algorithm. More specifically, we use the following procedure to calculate the EM-test statistic.

Suppose that \hat{\bm{\xi}}_{0}=(\hat{\bm{\theta}}_{0},\ldots,\hat{\bm{\theta}}_{0}) is the estimator that maximizes the penalized log-likelihood function (3) under the homogeneous model. Under the heterogeneous model, given any initial value {\bm{\alpha}}^{(0)}\in\mathbb{S}^{G-1}, we first compute

\bm{\xi}^{(0)}=\mathop{\arg\max}_{\bm{\xi}\in\Xi}\sum_{i=1}^{n}\log\varphi\left(x_{i};\bm{\xi},{\bm{\alpha}}^{(0)}\right)+p\left({\bm{\alpha}}^{(0)}\right). (4)

Assume that {\bm{\alpha}}^{(k)} and \bm{\xi}^{(k)} are the estimators at the k-th iteration of the EM algorithm. The E-step updates the posterior probability that the i-th sample comes from the g-th component by

w_{gi}^{(k)}=\frac{\alpha^{(k)}_{g}f\left(x_{i};\bm{\theta}_{g}^{(k)}\right)}{\varphi\left(x_{i};\bm{\xi}^{(k)},{\bm{\alpha}}^{(k)}\right)}. (5)

At the (k+1)-th iteration, the M-step updates {\bm{\alpha}} and \bm{\xi} such that

{\bm{\alpha}}^{(k+1)}=\mathop{\arg\max}_{{\bm{\alpha}}\in\mathbb{S}^{G-1}}\sum_{g=1}^{G}\sum_{i=1}^{n}w_{gi}^{(k)}\log(\alpha_{g})+p({\bm{\alpha}}),\ \mbox{and} (6)
\bm{\xi}^{(k+1)}=\mathop{\arg\max}_{\bm{\xi}\in\Xi}\sum_{g=1}^{G}\sum_{i=1}^{n}w_{gi}^{(k)}\log f(x_{i};\bm{\theta}_{g}). (7)

Let K>0 be the maximum number of EM updates. We define M_{n}^{(K)}({\bm{\alpha}}^{(0)})=2\{pl_{n}(\bm{\xi}^{(K)},{\bm{\alpha}}^{(K)})-pl_{n}(\hat{\bm{\xi}}_{0},{\bm{\alpha}}_{0})\}, where {\bm{\alpha}}_{0}=(1/G,\dots,1/G)^{\rm T}. To improve the performance, we choose a set of initial values \{{\bm{\alpha}}_{1},\dots,{\bm{\alpha}}_{T}\} and define the EM-test statistic as {\rm EM}_{n}^{(K)}=\max\{M_{n}^{(K)}({\bm{\alpha}}_{t}),t=1,\dots,T\}. Intuitively, under the homogeneous model, \bm{\xi}^{(K)} and \hat{\bm{\xi}}_{0} are both close to \bm{\xi}_{0} and hence close to each other, while under the heterogeneous model, \bm{\xi}^{(K)} and \hat{\bm{\xi}}_{0} are far away from each other. Hence, we reject the null hypothesis in (1) when {\rm EM}_{n}^{(K)} is large. In this paper, we always assume that K\geq 3 is a fixed number. In simulation and real data analyses, we set K=100. Simulation shows that the EM-test is robust to the choice of K. When K is chosen too large, the EM-test tends to have slightly more false positives, and K=100 is a reasonable choice (Supplementary Section D).
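To make the procedure concrete, the following Python sketch computes {\rm EM}_{n}^{(K)} for a Poisson mixture (d=1), one of the one-parameter families covered by our theory. This is a minimal illustration, not the paper's implementation: the function name em_test_poisson, the heuristic spread of the initial component means (used in place of the exact maximization in (4)), and the default single initial value {\bm{\alpha}}^{(0)}=(1/G,\dots,1/G) are our own illustrative choices.

import numpy as np
from scipy.special import gammaln, logsumexp

def em_test_poisson(x, G=2, K=100, lam=1e-5, alpha_inits=None):
    # Sketch of EM_n^(K) for a Poisson mixture (d = 1); illustrative only.
    x = np.asarray(x, dtype=float)
    n = len(x)

    def pois_logpmf(xx, mu):
        return xx * np.log(mu) - mu - gammaln(xx + 1.0)

    def penalty(alpha):
        # p(alpha) = lambda * (sum_g log(alpha_g) + G log G); zero at alpha_0.
        return lam * (np.sum(np.log(alpha)) + G * np.log(G))

    def pen_loglik(alpha, mu):
        # Penalized log-likelihood (3) of the G-component mixture.
        comp = pois_logpmf(x[:, None], mu[None, :]) + np.log(alpha)[None, :]
        return logsumexp(comp, axis=1).sum() + penalty(alpha)

    # Homogeneous fit: the Poisson MLE is the sample mean.
    mu0 = max(x.mean(), 1e-8)
    pl0 = pen_loglik(np.full(G, 1.0 / G), np.full(G, mu0))

    if alpha_inits is None:
        alpha_inits = [np.full(G, 1.0 / G)]

    em_stat = -np.inf
    for alpha in alpha_inits:
        alpha = np.asarray(alpha, dtype=float)
        # Heuristic stand-in for (4): spread component means around mu0.
        mu = mu0 * np.linspace(0.5, 1.5, G)
        for _ in range(K):
            # E-step (5): posterior weights w_gi.
            comp = pois_logpmf(x[:, None], mu[None, :]) + np.log(alpha)[None, :]
            w = np.exp(comp - logsumexp(comp, axis=1, keepdims=True))
            # M-step (6): closed-form penalized update of alpha.
            alpha = (w.sum(axis=0) + lam) / (n + G * lam)
            # M-step (7): weighted Poisson MLE of each component mean.
            mu = np.maximum((w * x[:, None]).sum(axis=0) / w.sum(axis=0), 1e-8)
        em_stat = max(em_stat, 2.0 * (pen_loglik(alpha, mu) - pl0))
    return em_stat

For multi-parameter families such as the negative binomial (d=2), the M-step (7) has no closed form, and a numerical weighted-likelihood maximization would take the place of the weighted-mean update.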

With the EM-test statistic, we can use the following procedure to screen for the cluster-relevant features (see the sketch below). Let {\rm EM}_{nj}^{(K)} be the EM-test statistic corresponding to the jth hypothesis testing problem (1). Given a threshold t_{n}>0, if {\rm EM}_{nj}^{(K)}<t_{n}, we screen out the jth feature; otherwise, we retain the jth feature as a cluster-relevant feature. Theoretical results in Section 3 show that if we choose t_{n}=n^{\vartheta} (0<\vartheta<1), this feature screening procedure has the sure independent screening property or even the consistency in selection property.
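A minimal sketch of this screening rule, reusing the illustrative em_test_poisson above (the function name screen_features and the default \vartheta=0.35, taken from the empirical suggestion in Section 3, are our assumptions):

import numpy as np

def screen_features(X, G=2, K=100, vartheta=0.35):
    # X: (n, p) count matrix; retain feature j iff EM_nj^(K) >= n^vartheta.
    n, p = X.shape
    t_n = n ** vartheta
    stats = np.array([em_test_poisson(X[:, j], G=G, K=K) for j in range(p)])
    return np.where(stats >= t_n)[0], stats

Because each feature is tested independently, the p tests are trivially parallelizable.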

2.3 Notations

We use {\rm diam}(\Xi) to represent the Euclidean diameter of \Xi. Denote by |\cdot| the absolute value of a real number or the cardinality of a set. For two sequences of random variables \{a_{n}\}_{n=1}^{\infty} and \{b_{n}\}_{n=1}^{\infty}, we write a_{n}=o_{p}(|b_{n}|) if a_{n}/|b_{n}|\rightarrow 0 in probability, and a_{n}=O_{p}(|b_{n}|) if there exists a positive constant C such that a_{n}\leq C|b_{n}| in probability. For real numbers a and b, let a\wedge b=\min\{a,b\} and a\vee b=\max\{a,b\}. Define \|{\bf a}\|_{2}=\sqrt{\sum_{i=1}^{n}a_{i}^{2}} as the L_{2}-norm of the vector {\bf a}=(a_{1},\dots,a_{n})^{\rm T}\in\mathbb{R}^{n}, and {\rm vech}({\bf A})=(a_{11},a_{22},\ldots,a_{dd},a_{12},\ldots,a_{1d},\ldots,a_{(d-1)d})^{\rm T}\in\mathbb{R}^{d(d+1)/2} as the vectorization of the symmetric d-dimensional matrix {\bf A}=(a_{ij}). We use {\bf A}\succeq{\bf 0} to represent that the matrix {\bf A} is positive semi-definite. For a random variable X, we define its sub-exponential norm as \|X\|_{\psi_{1}}=\inf\{t>0:\mathbb{E}\left(\exp(|X|/t)\right)\leq 2\}. If X is a random variable from a homogeneous model \mathbb{H}_{0} and g(x) is a function, we define the L_{m}-norm of g(X) as \|g(X)\|_{L^{m}}=\left({\mathbb{E}}\left[g^{m}(X)\right]\right)^{1/m}.

We denote {\bm{\alpha}}^{*}=(\alpha_{1}^{*},\dots,\alpha_{G}^{*}) as the true proportions of the G clusters and \bm{\xi}^{*}_{j}=(\bm{\theta}^{*}_{j1},\ldots,\bm{\theta}^{*}_{jG}) as the true parameters of the mixture model corresponding to the jth feature. We assume that {\bm{\alpha}}^{*} is fixed and \min_{g}\alpha_{g}^{*}>0. Throughout, \delta>0 is a fixed, very small constant. If the jth feature is from the homogeneous model \mathbb{H}_{j0} (cluster-irrelevant), we write \bm{\xi}^{*}_{j}=\bm{\xi}_{j0}=(\bm{\theta}_{j0},\ldots,\bm{\theta}_{j0}) for its true parameters. When appropriate, we drop the subscript j and use \bm{\xi}^{*} as the true parameters of a general mixture model and \bm{\xi}_{0}=(\bm{\theta}_{0},\ldots,\bm{\theta}_{0}) as the true parameter of some general homogeneous model \mathbb{H}_{0}. We always assume that \bm{\theta}_{0} is an interior point of \Theta and use {\bm{\alpha}}_{0}=(1/G,\dots,1/G)^{\rm T}. Let

Y_{ih}=\frac{1}{f(x_{i},\bm{\theta}_{0})}\frac{\partial f(x_{i},\bm{\theta}_{0})}{\partial\theta_{h}},\quad Z_{ih}=\frac{1}{2f(x_{i},\bm{\theta}_{0})}\frac{\partial^{2}f(x_{i},\bm{\theta}_{0})}{\partial\theta_{h}^{2}},
U_{ih\ell}=\frac{1}{f(x_{i},\bm{\theta}_{0})}\frac{\partial^{2}f(x_{i},\bm{\theta}_{0})}{\partial\theta_{h}\partial\theta_{\ell}}\ (h<\ell),\quad {\bf b}_{1i}=(Y_{i1},\ldots,Y_{id})^{\rm T}, (8)
{\bf b}_{2i}=(Z_{i1},\ldots,Z_{id},U_{i12},\ldots,U_{i(d-1)d})^{\rm T},\quad\mbox{and}\quad{\bf b}_{i}=\left({\bf b}_{1i}^{\rm T},{\bf b}_{2i}^{\rm T}\right)^{\rm T}.

3 Theoretical results

In this section, we investigate the theoretical properties of the screening procedure. Without loss of generality, we assume that there is only one initial value {\bm{\alpha}}^{(0)} (i.e., T=1) and \min_{g=1,\ldots,G}27^{-1}\alpha_{g}^{(0)}\geq\delta>0.

We first present the main theoretical result of the paper, namely the feature screening property of the EM-test statistic. Define the set of cluster-irrelevant features as S_{0}=\{j:1\leq j\leq p,\bm{\xi}^{*}_{j}=(\bm{\theta}_{j0},\dots,\bm{\theta}_{j0})\} and the set of cluster-relevant features as S_{1}=\{1,\dots,p\}\backslash S_{0}. Denote s=|S_{1}| as the number of cluster-relevant features. For a small fixed \gamma>0, we define

\Xi_{1}=\left\{\bm{\xi}\in\Xi:\max_{g\neq g^{\prime}}\|\bm{\theta}_{g}-\bm{\theta}_{g^{\prime}}\|_{2}\geq\gamma\right\} (9)

as the parameter set of heterogeneous models with a minimum component difference \gamma. When j\in S_{1}, we assume \bm{\xi}^{*}_{j}\in\Xi_{1}. Given a threshold t_{n}>0, we define

\hat{S}_{1}(t_{n})=\{1\leq j\leq p:{\rm EM}_{nj}^{(K)}\geq t_{n}\} (10)

as an estimator of S_{1}. Under Conditions (C1)–(C7), which will be specified in Sections 3.1 and 3.2, the following theorem guarantees that the screening procedure based on the EM-test statistic can effectively filter cluster-irrelevant features while retaining all cluster-relevant features with a high probability.

Theorem 1.

Assume that for any cluster-relevant feature j\in S_{1}, \bm{\xi}^{*}_{j}\in\Xi_{1}, where \Xi_{1} is defined as in (9). Under Conditions (C1)–(C7), given a fixed K and choosing the threshold t_{n}=n^{\vartheta} (0<\vartheta<1), when n is sufficiently large, we have

{\mathbb{P}}\left(S_{1}\subset\hat{S}_{1}(t_{n})\right)\geq 1-s\exp\left(-C_{3}n^{1/2}+C_{4}n^{\vartheta-1/2}\right),\mbox{ and }
{\mathbb{P}}\left(S_{1}=\hat{S}_{1}(t_{n})\right)\geq 1-(p-s)\left((C_{1}n)^{-m/4}+(C_{2}n)^{-\vartheta m}\right)-s\exp\left(-C_{3}n^{1/2}+C_{4}n^{\vartheta-1/2}\right),

where C_{1},C_{2},C_{3} and C_{4} are four constants depending on K, G, d, \delta, {\rm diam}(\Xi) and the constants specified in Conditions (C3)–(C7), s=|S_{1}|, and m is the integer in Condition (C3).

If p does not go to infinity too fast, Theorem 1 implies that we can achieve the sure independent screening property or model selection consistency in high dimensional settings.

  • If p={O}(\exp(n^{\beta})) with 0<\beta<1/2, we have {\mathbb{P}}\left(S_{1}\subset\hat{S}_{1}(t_{n})\right)\rightarrow 1 as n\rightarrow\infty. Thus, the feature screening method based on the EM-test statistic has the sure independent screening property.

  • If p={O}(n^{\kappa}) with 0<\kappa<m/\max\{4,\vartheta^{-1}\}, we have {\mathbb{P}}\left(S_{1}=\hat{S}_{1}(t_{n})\right)\rightarrow 1 as n\rightarrow\infty; in other words, we achieve model selection consistency. The condition \kappa<m/\max\{4,\vartheta^{-1}\} is a very lenient condition. For most common parametric distribution families, m in Condition (C3) can be taken as any positive integer, and thus \kappa can be any positive number.

Empirical studies show that choosing \vartheta\in[0.3,0.35] strikes a good balance between the type I and type II errors (Supplementary Section D), and hence we suggest choosing \vartheta\in[0.3,0.35] in real applications. Note that in Theorem 1, for notational simplicity, we assume that different features are in the same parametric family. A similar screening property can be proved even if different features are in different parametric families (e.g., some are continuous variables and some are count variables), as long as these features satisfy conditions similar to the ones in Theorem 1. In addition, Theorem 1 does not assume that different features are independent; even if the features are dependent, the same screening properties hold.

The proof of Theorem 1 is based on the tail probability bounds of the EM-test statistic under the null and alternative hypotheses. Under \mathbb{H}_{0}, we show that the EM-test statistic is bounded with a high probability, and under \mathbb{H}_{1}, the EM-test statistic diverges to infinity with a high probability. We present these tail probability bounds in Sections 3.1 and 3.2. The proof of Theorem 1 is given in the Supplementary material, where we also show that many commonly used distributions, such as many exponential family distributions and the negative binomial distributions, satisfy the conditions in Theorem 1, and thus the screening properties hold for these distributions.

3.1 The probability bound of the EM-test statistic under \mathbb{H}_{0}

We need the following regularity conditions before presenting the tail probability bounds of the EM-test statistic under \mathbb{H}_{0}.

  • (C1)

    For every \bm{\theta}_{0}\in\Theta and every sufficiently small ball V\subset\Theta around \bm{\theta}_{0}, we assume that the function \sup_{\bm{\theta}\in V}f^{1/2}(x;\bm{\theta})f^{1/2}(x;\bm{\theta}_{0}) is measurable and \int\sup_{\bm{\theta}\in V}f^{1/2}(x;\bm{\theta})f^{1/2}(x;\bm{\theta}_{0})\mu({\rm d}x)<\infty. In addition, for every sufficiently small ball U\subset\Xi around \bm{\xi}_{0}=(\bm{\theta}_{0},\cdots,\bm{\theta}_{0}) and {\bm{\alpha}}\in\mathbb{S}^{G-1}, we assume that the function \sup_{\bm{\xi}\in U}\log\varphi(x;\bm{\xi},{\bm{\alpha}}) is measurable and {\mathbb{E}}(\sup_{\bm{\xi}\in U}\log\{1+\varphi(x;\bm{\xi},{\bm{\alpha}})\})<+\infty.

  • (C2)

    The density function f(x;\bm{\theta}) has a common support for \bm{\theta}\in\Theta and continuous 5th order partial derivatives with respect to \bm{\theta}.

  • (C3)

    Let m>0 be an integer and M>0 be a constant. There are a function g(x,\bm{\theta})>0 with \sup_{\bm{\theta}\in\Theta}\|g(x;\bm{\theta})\|_{L^{8m}}\leq M, a function r(x)>0 with \int r^{2}(x)\mu({\rm d}x)\leq M and a constant \tau>0 such that, for all \bm{\theta}_{0}\in\Theta, h=1,\dots,5 and j_{1},\ldots,j_{h}\in\{1,\dots,d\},

    \sup_{\|\bm{\theta}-\bm{\theta}_{0}\|_{2}\leq\tau}\left|\frac{\partial^{h}f(x;\bm{\theta})}{\partial\theta_{j_{1}}\cdots\partial\theta_{j_{h}}}\Big/f(x;\bm{\theta}_{0})\right|\leq g(x;\bm{\theta}_{0}),\mbox{ and }
    \sup_{\bm{\theta}\in\Theta}f(x;\bm{\theta})+\sup_{\bm{\theta}\in\Theta}\frac{1}{f(x;\bm{\theta})}\left\|\frac{\partial f(x;\bm{\theta})}{\partial\bm{\theta}}\right\|_{2}^{2}\leq r^{2}(x).
  • (C4)

    The minimum eigenvalue \lambda_{\rm min}({\bf B}(\bm{\theta}_{0})) of the covariance matrix {\bf B}(\bm{\theta}_{0})={\rm cov}({\bf b}_{i}) satisfies \lambda_{\rm min}:=\inf_{\bm{\theta}_{0}\in\Theta}\lambda_{\rm min}({\bf B}(\bm{\theta}_{0}))>0.

Condition (C1) is the Wald consistency condition, which can be found in Van der Vaart (2000). It also ensures the continuity of the Hellinger distance that we will define below. Condition (C2) guarantees the smoothness of f(x;\bm{\theta}). Condition (C3) is a technical condition on the partial derivatives. It guarantees that there is a dominating function for the remainder term in the Taylor expansion, and thus allows us to give polynomial tail probability bounds for the higher-order infinitesimal terms of the EM-test statistic under \mathbb{H}_{0}. Condition (C4) is the strong identifiability condition (Chen, 1995; Nguyen, 2013). Most of the commonly used one-parameter distributions, such as the Poisson distribution and the exponential distribution, satisfy Conditions (C3) and (C4). Many multi-parameter distributions, including the negative binomial distribution and the gamma distribution, also satisfy Conditions (C3) and (C4).

Theorem 2.

Assume that x_{1},\dots,x_{n} are independent samples from the homogeneous distribution f(x;\bm{\theta}_{0}). Under Conditions (C1)–(C4), for any t>0, when n is sufficiently large, we have

{\mathbb{P}}\left({\rm EM}_{n}^{(K)}\leq t+Cn^{-1/16}\log^{3/2}n\right)\geq 1-(C_{1}n)^{-m/4}-(C_{2}t)^{-m},

where C, C_{1} and C_{2} are three positive constants depending on \tau, K, G, d, \lambda_{\rm min}, m, M, \delta and {\rm diam}(\Xi).

Observe that Cn^{-1/16}\log^{3/2}n approaches zero as n\rightarrow\infty. Therefore, roughly speaking, Theorem 2 shows that when n is sufficiently large, under \mathbb{H}_{0}, the probability that the EM-test statistic exceeds t decays at a polynomial rate in t. To prove Theorem 2, we first derive the tail probability bound for the mixture parameter estimators \bm{\xi}^{(k)} by analyzing the empirical processes indexed by \bm{\xi} (Wong and Shen, 1995). Then, we analyze the Taylor expansion of {\rm EM}_{n}^{(K)} and bound each term in the expansion using concentration inequalities (Wainwright, 2019). Details of the proof are given in the Supplementary material.

3.2 The probability bound of the EM-test statistic under \mathbb{H}_{1}

Our next goal is to show that, under \mathbb{H}_{1}, the EM-test statistic diverges to infinity with a high probability. Recall that {\bm{\alpha}}^{*} is the true proportion parameter and that under \mathbb{H}_{1}, \bm{\xi}^{*}\in\Xi_{1}, where \Xi_{1} is defined in (9). We define {\bm{\theta}}^{\dagger}_{0}=\mathop{\arg\max}_{\bm{\theta}\in\Theta}{\mathbb{E}}_{{\bm{\alpha}}^{*},\bm{\xi}^{*}}\left[\log f(x;\bm{\theta})\right] as the parameter of the homogeneous model that is closest to the true heterogeneous model in terms of the Kullback-Leibler divergence. Denote \bm{\xi}_{0}^{\dagger}=\left({\bm{\theta}}^{\dagger}_{0},\dots,{\bm{\theta}}^{\dagger}_{0}\right). Similarly, given an initial value {\bm{\alpha}}^{(0)}, we can find the heterogeneous model with proportion parameter {\bm{\alpha}}^{(0)} that is closest to the true heterogeneous model and denote its parameter as {\bm{\xi}}^{\dagger}=\mathop{\arg\max}_{\bm{\xi}\in\Xi}{\mathbb{E}}_{{\bm{\alpha}}^{*},\bm{\xi}^{*}}\left[\log\varphi\left(x;\bm{\xi},{\bm{\alpha}}^{(0)}\right)\right]. Note that {\bm{\xi}}^{\dagger} and {\bm{\theta}}^{\dagger}_{0} depend on the true values {\bm{\alpha}}^{*} and \bm{\xi}^{*}.

Define R(x;\bm{\xi}^{*})=\log\varphi\left(x;{\bm{\xi}}^{\dagger},{\bm{\alpha}}^{(0)}\right)-\log f\left(x;{\bm{\theta}}_{0}^{\dagger}\right) as the difference between the two "working" log-likelihoods \log\varphi\left(x;{\bm{\xi}}^{\dagger},{\bm{\alpha}}^{(0)}\right) and \log f\left(x;{\bm{\theta}}_{0}^{\dagger}\right). If the initial value {\bm{\alpha}}^{(0)} is close to the true proportion {\bm{\alpha}}^{*}, the expectation of R(x;\bm{\xi}^{*}) is bounded away from zero. Then the one-step EM-test statistic, and thus the EM-test statistic, is large, and we correctly reject the null hypothesis with a high probability. Furthermore, denoting D(\bm{\theta})={\mathbb{E}}_{{\bm{\alpha}}^{*},\bm{\xi}^{*}}[\log f(x;\bm{\theta})], we define a mean-zero empirical process indexed by \bm{\theta}\in\Theta as

Z_{\bm{\theta}}(\bm{\xi}^{*})=n^{-1/2}\sum_{i=1}^{n}\left\{\log f(x_{i};\bm{\theta})-\log f\left(x_{i};{\bm{\theta}}^{\dagger}_{0}\right)-\left[D(\bm{\theta})-D\left({\bm{\theta}}^{\dagger}_{0}\right)\right]\right\}.

We need the following conditions under \mathbb{H}_{1}.

  • (C5)

    The initial value {\bm{\alpha}}^{(0)} fulfills \varrho=\inf_{\bm{\xi}^{*}\in\Xi_{1}}{\mathbb{E}}_{{\bm{\alpha}}^{*},\bm{\xi}^{*}}\left[R(x;\bm{\xi}^{*})\right]>0.

  • (C6)

    There exists a constant M_{\psi_{1}} such that \sup_{\bm{\xi}^{*}\in\Xi_{1}}\left\|R(x;\bm{\xi}^{*})-{\mathbb{E}}_{{\bm{\alpha}}^{*},\bm{\xi}^{*}}\left[R(x;\bm{\xi}^{*})\right]\right\|_{\psi_{1}}\leq M_{\psi_{1}}.

  • (C7)

    \{Z_{\bm{\theta}}(\bm{\xi}^{*}):\bm{\theta}\in\Theta\} is a \psi_{1}-process such that for any \bm{\theta},\bm{\theta}^{\prime}\in\Theta, \sup_{\bm{\xi}^{*}\in\Xi_{1}}\|Z_{\bm{\theta}}(\bm{\xi}^{*})-Z_{\bm{\theta}^{\prime}}(\bm{\xi}^{*})\|_{\psi_{1}}\leq C_{\rho}\|\bm{\theta}-\bm{\theta}^{\prime}\|_{2}, where C_{\rho}>0 is a constant.

Condition (C5) is a key assumption. Because the EM algorithm cannot guarantee convergence to the global maximum, we need to choose an initial value {\bm{\alpha}}^{(0)} such that the theoretically best heterogeneous model achievable in one EM update step is uniformly closer to the true heterogeneous model than the best homogeneous model. Note that we always have \varrho\geq 0, but it is hard to give a necessary and sufficient condition on the choice of {\bm{\alpha}}^{(0)} such that \varrho>0. However, we can show that if \left\|{\bm{\alpha}}^{*}-{\bm{\alpha}}^{(0)}\right\|_{2}\leq\tau(\gamma) for some constant \tau(\gamma), Condition (C5) holds (see Section C.1 in the Supplementary material for more discussion). Conditions (C6) and (C7) are two weaker conditions and hold for many commonly used distribution families. Under these conditions, we obtain the following tail probability bound of the EM-test statistic under \mathbb{H}_{1}.

Theorem 3.

Assume that x_{1},\dots,x_{n} are independent samples from the heterogeneous model distribution \varphi(x;{\bm{\alpha}}^{*},\bm{\xi}^{*}) with \bm{\xi}^{*}\in\Xi_{1}. Under Conditions (C5)–(C7), for any t, we have

{\mathbb{P}}\left({\rm EM}_{n}^{(K)}\geq 2^{-1}n\varrho-n^{1/2}C_{J}\left[J(D)+t\right]-p_{0}\right)\geq 1-2\exp\left[-C^{\prime}\min\left(\frac{n\varrho^{2}}{4M_{\psi_{1}}^{2}},\frac{n\varrho}{2M_{\psi_{1}}}\right)\right]-2\exp\left(\frac{-t}{D}\right),

where D, C_{J}, J(D) and C^{\prime} are four constants defined in Lemmas 17 and 18 in the Supplementary material, and p_{0}=\lambda G\log(\delta G).

Corollary 1.

From Theorem 3, for t_{n}=n^{\vartheta} with 0<\vartheta<1, there are two constants C_{3},C_{4} such that

{\mathbb{P}}\left({\rm EM}_{n}^{(K)}\geq t_{n}\right)\geq 1-\exp\left(-C_{3}n^{1/2}+C_{4}n^{\vartheta-1/2}\right).

Theorem 3 and Corollary 1 show that under \mathbb{H}_{1}, by selecting a suitable threshold, a cluster-relevant feature is retained with a high probability.

3.3 The limiting distribution of the EM-test statistic under \mathbb{H}_{0}

In many applications, giving a valid p-value for each retained feature is also crucial. In this section, we give the limiting distribution of the EM-test statistic under \mathbb{H}_{0}. To derive the limiting distribution, we only need the following weaker conditions in place of Conditions (C3) and (C4).

  • (WC3)

    For all h=1,\dots,5 and \theta_{j_{1}},\ldots,\theta_{j_{h}}, there exist a function g(x;\bm{\theta}_{0})\geq 0 and a constant \tau>0 such that

    \sup_{\|\bm{\theta}-\bm{\theta}_{0}\|\leq\tau}\left|\frac{\partial^{h}f(x,\bm{\theta})}{\partial\theta_{j_{1}}\cdots\partial\theta_{j_{h}}}\Big/f(x,\bm{\theta}_{0})\right|\leq g(x;\bm{\theta}_{0})

    and \left\|g(x;\bm{\theta}_{0})\right\|_{L^{3}}<\infty.

  • (WC4)

    The covariance matrix {\bf B}(\bm{\theta}_{0})={\rm cov}({\bf b}_{i}) of {\bf b}_{i} is positive definite.

Let r=\min(G-1,d) and

\mathcal{V}=\left\{{\rm vech}({\bf V}):{\bf V}\in\mathbb{R}^{d\times d}\mbox{ is symmetric},\ {\rm rank}({\bf V})\leq r,\ {\bf V}\succeq 0\right\}. (11)

For j,k=1,2, let {\bf B}_{jk}={\mathbb{E}}_{\bm{\theta}_{0}}(\{{\bf b}_{ji}-{\mathbb{E}}({\bf b}_{ji})\}\{{\bf b}_{ki}-{\mathbb{E}}({\bf b}_{ki})\}^{\rm T}) and \tilde{{\bf b}}_{2i}={\bf b}_{2i}-{\bf B}_{21}{\bf B}_{11}^{-1}{\bf b}_{1i}. The covariance matrix of \tilde{{\bf b}}_{2i} is \widetilde{{\bf B}}_{22}={\bf B}_{22}-{\bf B}_{21}{\bf B}_{11}^{-1}{\bf B}_{12}.

Theorem 4.

Assume that x_{1},\cdots,x_{n} are independent samples from the homogeneous distribution f(x;\bm{\theta}_{0}). If G\geq 2 and one of the {\bm{\alpha}}_{t} (t=1,\dots,T) is {\bm{\alpha}}_{0}, then under Conditions (C1)–(C2) and (WC3)–(WC4), as n\rightarrow\infty, we have

{\rm EM}_{n}^{(K)}\overset{d}{\longrightarrow}\sup_{{\bf v}\in\mathcal{V}}2{\bf v}^{\rm T}{\bf w}-{\bf v}^{\rm T}\widetilde{{\bf B}}_{22}{\bf v},

where {\bf w}=(w_{1},\ldots,w_{d(d+1)/2})^{\rm T} is a zero-mean multivariate normal random vector with covariance matrix \widetilde{{\bf B}}_{22} and \mathcal{V} is as in (11).

If d=1, the limiting distribution in Theorem 4 is 0.5\chi^{2}_{1}+0.5\chi^{2}_{0}, the same as the one in Li et al. (2009), while if G=2, it is the distribution in Niu et al. (2011). When G>d, we have r=d and the limiting distribution is independent of the component number G. Generally, it is computationally difficult to calculate the limiting distribution in Theorem 4. When G>d, the feasible domain \mathcal{V}=\left\{{\rm vech}({\bf V}):{\bf V}\in\mathbb{R}^{d\times d},{\bf V}\succeq 0\right\} is the positive semi-definite matrix cone, and computing the limiting distribution in Theorem 4 becomes a classic conic quadratic program that can be solved using the algorithms reviewed in Vandenberghe (2010); these algorithms, however, are still computationally expensive. Fortunately, it can be easily shown that \sup_{{\bf v}\in\mathcal{V}}2{\bf v}^{\rm T}{\bf w}-{\bf v}^{\rm T}\widetilde{{\bf B}}_{22}{\bf v}\leq{\bf w}^{\rm T}\widetilde{{\bf B}}_{22}^{-1}{\bf w}, and thus {\rm EM}_{n}^{(K)} is stochastically less than \chi^{2}(d(d+1)/2). Therefore, we can always use \chi^{2}(d(d+1)/2) as the limiting distribution. The resulting test is conservative, but our empirical studies show that it still has a high power.
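As an illustration of this conservative calibration, the following sketch converts an EM-test statistic into an upper-bound p-value using the \chi^{2}(d(d+1)/2) tail (the function name conservative_pvalue is our choice; scipy supplies the \chi^{2} survival function):

from scipy.stats import chi2

def conservative_pvalue(em_stat, d):
    # EM_n^(K) is asymptotically stochastically dominated by
    # chi^2(d(d+1)/2), so this tail probability never understates
    # the true p-value (a conservative test).
    return chi2.sf(em_stat, df=d * (d + 1) / 2)

For a negative binomial model (d=2), this gives the \chi^{2}(3) calibration used in Section 4.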

4 Simulation

In this section, we use simulation to assess the performance of the EM-test statistic for feature screening and clustering of high dimensional count data. We compare the EM-test with other feature screening and feature selection methods. The feature screening methods include the Dip-test (Chan and Hall, 2010), the KS-test (Jin and Wang, 2016), COSCI (Banerjee et al., 2017), SC-FS (Liu et al., 2022) and a baseline method based on the goodness-of-fit test. The Dip-test screens features by investigating the unimodality of the data distribution. For the baseline method, we use the Chi-square test to test the fit of the data to the null distribution. The feature selection method is the sparse k-means method (abbreviated as Skmeans) proposed in Witten and Tibshirani (2010). The Dip-test, KS-test and COSCI are methods for continuous data. When applying them to the simulated count data, we first log-transform the data (\log(x+1)) to make them more like continuous data.

4.1 Simulation setup

In the simulations, we set the number of clusters as G=5 and the proportions of the clusters as {\bm{\alpha}}=(\alpha_{1},\dots,\alpha_{5})=(0.5,0.125,0.125,0.125,0.125). The sample size is set as n=1000. The dimension is set as p=500, 5000 or 20,000. Skmeans and COSCI are not evaluated for p=20,000 because they are computationally too expensive for this ultra-high dimensional setting. The number of cluster-relevant features is fixed at s=20. We always set the first 20 features as cluster-relevant and all other features as cluster-irrelevant.

More specifically, for the ith sample, we first randomly assign it to a cluster g with probability \alpha_{g}. Then, if the jth feature is cluster-relevant (j=1,\dots,20), we randomly sample x_{ij} from {\rm NB}(\mu_{gj},r_{j}); if it is cluster-irrelevant (j=21,\dots,p), we randomly sample x_{ij} from {\rm NB}(\mu_{j},r_{j}). The mean and over-dispersion parameters of the negative binomial distributions are randomly generated (see below).
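A sketch of this data-generating scheme is given below. The uniform parameter ranges are our illustrative stand-ins; the actual generation of \mu_{gj}, \mu_{j} and r_{j} is described in the Supplementary material:

import numpy as np

rng = np.random.default_rng(0)

def simulate_counts(n=1000, p=5000, s=20, G=5,
                    alpha=(0.5, 0.125, 0.125, 0.125, 0.125)):
    # Illustrative stand-ins for the randomly generated parameters.
    mu_rel = rng.uniform(1.0, 10.0, size=(G, s))   # cluster-specific means
    mu_irr = rng.uniform(1.0, 10.0, size=p - s)    # shared means
    r = rng.uniform(1.0, 5.0, size=p)              # over-dispersions

    g = rng.choice(G, size=n, p=alpha)             # latent cluster labels
    X = np.empty((n, p), dtype=np.int64)
    # NB(mu, r) with mean mu and size r corresponds to numpy's
    # negative_binomial(n=r, p=r / (r + mu)).
    for j in range(s):
        mu = mu_rel[g, j]
        X[:, j] = rng.negative_binomial(r[j], r[j] / (r[j] + mu))
    for j in range(s, p):
        X[:, j] = rng.negative_binomial(r[j], r[j] / (r[j] + mu_irr[j - s]))
    return X, g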

We consider simulation setups with two noise levels (low or high) and three cluster signal strength levels (low, medium and high). The over-dispersion parameters r_{j} represent the noise level of the data, and the differences of the mean parameters \mu_{gj} between clusters represent the cluster signal strength. The details of generating r_{j} and \mu_{gj} are given in the Supplementary material. Thus, in total, we have 18 different simulation setups (3 dimension setups \times 2 noise levels \times 3 signal levels). In each simulation setup, we generate 100 datasets.

To evaluate the performance of the EM-test on continuous data, we also generate simulation data based on normal distributions; the simulation setups are detailed in the Supplementary material, and the EM-test for normal models is used for these continuous data. To investigate the robustness of the EM-test to model mis-specification, we further generate count data based on the Poisson-truncated-normal and binomial-Gamma distributions, to which the EM-test for the negative binomial model is applied (Supplementary Section D). In the following, we only discuss the negative binomial simulations. For the continuous data simulation, we find that the EM-test performs similarly to other available methods under easier simulation setups and outperforms other methods under more difficult setups (Supplementary Section D). From the mis-specified count data simulations, we find that the EM-test is robust to model mis-specification and outperforms other methods (Supplementary Section D).

4.2 Performance on feature screening

We first evaluate the accuracy of feature screening. For different screening methods, we first rank the features by their corresponding test statistics, p-values or feature weights. Following previous research on feature screening (Zhu et al., 2011; Li et al., 2012), let \mathcal{S} be the minimum number of features needed to include all cluster-relevant features in the ranking. Table 1 shows the mean and the standard deviation of \mathcal{S} over the 100 replications. Table 1 does not include COSCI because it only reports the set of selected features but does not provide an ordering of all features. In the low dimensional cases (p=500), all count-data methods work well. The EM-test only needs 20 or slightly more than 20 features to include all cluster-relevant features. In higher dimensions (p=5000 or 20,000), the EM-test outperforms other methods, often by a large margin. For example, in the medium signal, high noise and p=20,000 case, the EM-test needs around 21 features to include all cluster-relevant features, while other methods need over a thousand features. SC-FS works well in lower dimensional cases, but its performance deteriorates in higher dimensional cases, especially those with lower signal-to-noise ratios. This is because SC-FS needs a pre-clustering label for feature screening; when p is large, the pre-clustering results can be very inaccurate, leading to the inferior performance of SC-FS in these settings. The continuous methods (KS-test and Dip-test) do not perform well for these count data.

Table 1: The mean of the minimum model size \mathcal{S} over 100 replications. The numbers in parentheses are the standard deviations of \mathcal{S} over the 100 replications.
p EM-test Chi-square SC-FS Skmeans KS-test Dip-test
Case 1: High signal and low noise
500 20.1 (0.4) 20.9 (3.0) 20.0 (0.0) 20.0 (0.0) 499.4 (3.1) 499.2 (2.7)
5000 20.7 (0.8) 28.0 (24.8) 212.0 (622.7) 1048.0 (1226.5) 4994.6 (12.6) 4988.0 (31.3)
20,000 22.7 (1.7) 47.8 (47.9) 14675.7 (4233.6) NA 19968.2 (90.6) 19968.2 (80.6)
Case 2: High signal and high noise
500 20.0 (0.1) 24.6 (11.4) 20.0 (0.0) 20.0 (0.0) 499.1 (3.1) 498.4 (4.1)
5000 20.0 (0.2) 75.2 (86.5) 871.7 (1276.0) 554.8 (1047.3) 4994.4 (12.8) 4981.0 (48.8)
20,000 20.2 (0.4) 335.5 (689.1) 16921.6 (3076.0) NA 19977.3 (43.2) 19955.4 (104.7)
Case 3: Medium signal and low noise
500 20.2 (0.6) 41.9 (29.4) 20.0 (0.0) 20.0 (0.0) 498.8 (3.7) 499.0 (2.8)
5000 22.1 (13.2) 276.6 (348.1) 1140.1 (1319.5) 762.5 (1204.2) 4990.4 (23.4) 4985.3 (37.0)
20,000 23.1 (2.6) 901.5 (1200.0) 16747.7 (2854.5) NA 19956.4 (96.6) 19952.5 (119.7)
Case 4: Medium signal and high noise
500 20.4 (2.1) 80.2 (70.1) 20.0 (0.0) 20.0 (0.0) 498.1 (5.9) 497.5 (6.9)
5000 20.8 (2.5) 581.9 (581.6) 2411.3 (1474.8) 1814.1 (1438.5) 4990.4 (22.9) 4976.9 (48.8)
20,000 45.2 (147.1) 2103.1 (1799.7) 17306.9 (2630.9) NA 19938.5 (103.2) 19917.7 (193.7)
Case 5: Low signal and low noise
500 33.3 (27.0) 202.3 (100.1) 20.0 (0.1) 20.0 (0.0) 497.5 (5.6) 498.4 (3.6)
5000 129.8 (225.3) 1795.7 (844.8) 3349.3 (1203.5) 2589.0 (1042.7) 4971.4 (56.5) 4974.6 (62.8)
20,000 755.2 (1635.7) 7964.7 (4051.5) 18283.2 (1422.1) NA 19885.6 (197.8) 19915.5 (182.9)
Case 6: Low signal and high noise
500 58.9 (56.7) 251.9 (97.1) 20.1 (0.8) 20.0 (0.0) 496.3 (7.6) 495.8 (8.3)
5000 342.7 (496.9) 2221.6 (965.0) 4182.7 (783.2) 3175.8 (1029.2) 4979.0 (42.4) 4971.7 (54.0)
20,000 1425.6 (2429.9) 9993.0 (3870.1) 18696.1 (1263.5) NA 19847.3 (250.0) 19903.6 (187.2)

The minimum model size \mathcal{S} measures the feature rankings given by different methods. However, in clustering analysis, simply having \mathcal{S} close to s is inadequate because we need a criterion to determine which features to retain. Therefore, we further compare the numbers of correctly retained cluster-relevant features (denoted as \mathcal{R}) and falsely retained cluster-irrelevant features (denoted as \mathcal{F}) for different methods. For SC-FS, COSCI and Skmeans, we use their default parameters to select the cluster-relevant features. For the EM-test, we select the features by the adjusted p-values (EM-adjust, adjusted p-value < 0.01) and by the threshold n^{0.35} (EM-0.35). The p-values are calculated using the \chi^{2}(3)-distribution because, compared with the limiting distribution in Theorem 4, the \chi^{2}(3)-distribution is computationally more efficient and achieves good sensitivity and false discovery rate (FDR) control (Supplementary Section D). The Benjamini-Hochberg (BH) procedure (Benjamini and Hochberg, 1995a) is used to adjust the p-values. We choose the threshold n^{0.35} because the EM-test has a good balance between retaining cluster-relevant features and excluding cluster-irrelevant features at this cutoff (Supplementary Section D). For the Chi-square goodness-of-fit test, KS-test and Dip-test, we use the BH-adjusted p-values (< 0.01) to screen for the cluster-relevant features.
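The two EM-test selection rules can be sketched as follows (the function name select_features is ours; the \chi^{2}(3) reference follows the negative binomial case with d=2, so d(d+1)/2=3):

import numpy as np
from scipy.stats import chi2

def select_features(em_stats, n, vartheta=0.35, fdr=0.01):
    # em_stats: EM-test statistics, one per feature.
    p = len(em_stats)
    pvals = chi2.sf(em_stats, df=3)

    # Benjamini-Hochberg adjusted p-values.
    order = np.argsort(pvals)
    ranked = pvals[order] * p / np.arange(1, p + 1)
    adj = np.empty(p)
    adj[order] = np.minimum(np.minimum.accumulate(ranked[::-1])[::-1], 1.0)
    em_adjust = np.where(adj < fdr)[0]               # EM-adjust rule

    em_035 = np.where(em_stats >= n ** vartheta)[0]  # EM-0.35 rule
    return em_adjust, em_035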

Table 2 shows the numbers of correctly retained and falsely retained features for different methods. Again, we find that the EM-test methods (EM-adjust, EM-0.35) outperform other methods in most settings, especially when p is larger and different clusters are more similar to each other. In most cases, Skmeans is able to select all cluster-relevant features, but it also falsely selects many cluster-irrelevant features. SC-FS is conservative. In the low dimensional settings (p=500), SC-FS correctly selects all cluster-relevant features with almost no false positives. In higher dimensions, SC-FS still has almost zero false positives, but its power is low. For example, in the p=5000, medium signal and high noise case, SC-FS only reports 2 cluster-relevant features, while EM-adjust reports all 20 features with almost zero false positives in the same case. The performances of the two versions of the EM-test are slightly different: EM-adjust is more conservative than EM-0.35. In the more difficult settings (with large p and low signal-to-noise ratio), EM-adjust is still able to control the false positives, but detects fewer cluster-relevant features. In most cases, EM-0.35 can detect most cluster-relevant features, but it also reports some cluster-irrelevant features in the more difficult simulation settings. The KS-test and Dip-test report many false positives and select almost all features as cluster-relevant. COSCI is very conservative in this simulation and could not select any features.

Table 2: The mean numbers of correctly retained features (\mathcal{R}) and falsely retained features (\mathcal{F}) for different methods over 100 replications. The numbers in parentheses are the standard deviations of \mathcal{R} and \mathcal{F} over the 100 replications.
p EM-adjust EM-0.35 Chi-square SC-FS Skmeans KS-test Dip-test COSCI
Case 1: High signal and low noise
500 \mathcal{R} 20 (0.1) 20 (0.1) 20 (0.7) 20 (0.9) 20 (0.0) 20.0 (0.0) 20.0 (0.0) 0.0 (0.0)
\mathcal{F} 0 (0.1) 1 (1.1) 2 (1.6) 0 (0.0) 480 (0.0) 480.0 (0.0) 480.0 (0.0) 0.0 (0.0)
5000 \mathcal{R} 20 (0.0) 20 (0.0) 19 (1.2) 16 (4.2) 19 (0.8) 20.0 (0.0) 20.0 (0.0) 0.0 (0.0)
\mathcal{F} 0 (0.1) 10 (3.3) 2 (1.4) 0 (0.0) 239 (700.9) 4980.0 (0.0) 4980.0 (0.0) 0.0 (0.0)
20,000 \mathcal{R} 20 (0.0) 20 (0.0) 18 (1.3) 0 (0.4) NA 20.0 (0.0) 20.0 (0.0) NA
\mathcal{F} 0 (0.2) 40 (5.6) 2 (1.7) 0 (0.0) NA 19980.0 (0.0) 19980.0 (0.0) NA
Case 2: High signal and high noise
500 \mathcal{R} 20 (0.1) 20 (0.0) 18 (1.2) 20 (0.7) 20 (0.0) 20.0 (0.0) 20.0 (0.0) 0.0 (0.0)
\mathcal{F} 0 (0.2) 2 (1.4) 2 (1.5) 0 (0.0) 480 (0.0) 480.0 (0.0) 480.0 (0.0) 0.0 (0.0)
5000 \mathcal{R} 20 (0.0) 20 (0.0) 17 (1.9) 11 (5.4) 19 (2.0) 20.0 (0.0) 20.0 (0.0) 0.0 (0.0)
\mathcal{F} 0 (0.1) 18 (4.1) 2 (1.4) 0 (0.1) 367 (739.3) 4980.0 (0.0) 4980.0 (0.0) 0.0 (0.0)
20,000 \mathcal{R} 20 (0.2) 20 (0.0) 15 (1.9) 0 (0.1) NA 20.0 (0.0) 20.0 (0.0) NA
\mathcal{F} 0 (0.3) 75 (7.7) 2 (1.9) 0 (0.0) NA 19980.0 (0.0) 19980.0 (0.0) NA
Case 3: Medium signal and low noise
500 \mathcal{R} 20 (0.3) 20 (0.1) 16 (2.4) 20 (0.5) 20 (0.0) 20.0 (0.0) 20.0 (0.0) 0.0 (0.0)
\mathcal{F} 0 (0.1) 1 (1.1) 2 (1.6) 0 (0.0) 480 (0.0) 480.0 (0.0) 480.0 (0.0) 0.0 (0.0)
5000 \mathcal{R} 20 (0.4) 20 (0.1) 12 (2.3) 8 (5.0) 19 (3.0) 20.0 (0.0) 20.0 (0.0) 0.0 (0.0)
\mathcal{F} 0 (0.1) 10 (3.2) 2 (1.4) 0 (0.0) 447 (852.1) 4980.0 (0.0) 4980.0 (0.0) 0.0 (0.0)
20,000 \mathcal{R} 20 (0.6) 20 (0.0) 10 (2.5) 0 (0.0) NA 20.0 (0.0) 20.0 (0.0) NA
\mathcal{F} 0 (0.2) 41 (5.7) 1 (1.3) 0 (0.0) NA 19980.0 (0.0) 19980.0 (0.0) NA
Case 4: Medium signal and high noise
500 \mathcal{R} 20 (0.5) 20 (0.1) 13 (2.3) 20 (0.4) 20 (0.0) 20.0 (0.0) 20.0 (0.0) 0.0 (0.0)
\mathcal{F} 0 (0.1) 2 (1.4) 2 (1.4) 0 (0.0) 480 (0.0) 480.0 (0.0) 480.0 (0.0) 0.0 (0.0)
5000 \mathcal{R} 19 (0.9) 20 (0.1) 9 (2.5) 3 (3.2) 16 (6.5) 20.0 (0.0) 20.0 (0.0) 0.0 (0.0)
\mathcal{F} 0 (0.2) 18 (4.1) 1 (1.1) 0 (0.1) 1004 (1417.9) 4980.0 (0.0) 4980.0 (0.0) 0.0 (0.0)
20,000 \mathcal{R} 19 (1.2) 20 (0.2) 7 (2.7) 0 (0.0) NA 20.0 (0.0) 20.0 (0.0) NA
\mathcal{F} 0 (0.3) 75 (7.5) 1 (1.3) 0 (0.0) NA 19980.0 (0.0) 19980.0 (0.0) NA
Case 5: Low signal and low noise
500 \mathcal{R} 16 (2.0) 19 (1.0) 4 (2.5) 20 (0.6) 20 (0.0) 20.0 (0.0) 20.0 (0.0) 0.0 (0.0)
\mathcal{F} 0 (0.1) 1 (1.1) 1 (1.2) 0 (0.0) 470 (67.5) 480.0 (0.0) 480.0 (0.0) 0.0 (0.0)
5000 \mathcal{R} 14 (2.0) 19 (1.0) 2 (1.6) 1 (1.6) 11 (8.0) 20.0 (0.0) 20.0 (0.0) 0.0 (0.0)
\mathcal{F} 0 (0.1) 10 (3.3) 1 (0.7) 0 (0.0) 1458 (1811.1) 4980.0 (0.0) 4980.0 (0.0) 0.0 (0.0)
20,000 \mathcal{R} 12 (2.3) 19 (1.1) 1 (1.0) 0 (0.0) NA 20.0 (0.0) 20.0 (0.0) NA
\mathcal{F} 0 (0.1) 41 (5.6) 0 (0.6) 0 (0.0) NA 19980.0 (0.0) 19980.0 (0.0) NA
Case 6: Low signal and high noise
500 \mathcal{R} 15 (2.1) 18 (1.2) 3 (2.0) 20 (0.6) 20 (0.0) 20.0 (0.0) 20.0 (0.0) 0.0 (0.0)
\mathcal{F} 0 (0.1) 2 (1.4) 1 (1.0) 0 (0.1) 480 (0.0) 480.0 (0.0) 480.0 (0.0) 0.0 (0.0)
5000 \mathcal{R} 12 (2.3) 18 (1.3) 1 (1.1) 0 (0.3) 10 (8.7) 20.0 (0.0) 20.0 (0.0) 0.0 (0.0)
\mathcal{F} 0 (0.1) 18 (4.2) 0 (0.4) 0 (0.0) 2150 (2158.9) 4980.0 (0.0) 4980.0 (0.0) 0.0 (0.0)
20,000 \mathcal{R} 10 (2.5) 18 (1.4) 1 (1.0) 0 (0.0) NA 20.0 (0.0) 20.0 (0.0) NA
\mathcal{F} 0 (0.3) 75 (7.4) 0 (0.7) 0 (0.0) NA 19980.0 (0.0) 19980.0 (0.0) NA

4.3 Feature screening improves clustering analysis

In this subsection, we assess the influence of feature screening on clustering analyses. For each simulation, we first use the feature screening methods to select potential cluster-relevant features and then use the k-means algorithm to cluster the samples. For the feature selection method Skmeans, we directly use its clustering results. The number of clusters in the k-means and Skmeans algorithms is set to 5. The parameters of the feature screening/selection methods are set as in Section 4.2. For comparison, we also include the k-means clustering results using all features (called No-Screening) and the oracle clustering results using only the true cluster-relevant features. A minimal sketch of this screen-then-cluster pipeline is given below.
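The following sketch is our illustration of the screen-then-cluster evaluation, not the paper's implementation: the per-feature statistic `screen_statistic` (a crude overdispersion proxy) and the top-$k$ retention rule are placeholder assumptions standing in for the EM-test and its threshold.

```python
# A minimal sketch of the screen-then-cluster pipeline (assumptions: a generic
# per-feature statistic and a simple top-k cutoff replace the EM-test screening).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def screen_statistic(x):
    """Placeholder per-feature statistic; larger values suggest a mixture.
    Here: a crude variance-to-mean (overdispersion) proxy, for illustration only."""
    return x.var() / (x.mean() + 1e-8)

def screen_then_cluster(X, labels_true, n_keep=30, n_clusters=5, seed=0):
    # Score every feature marginally, then retain the top-scoring ones.
    stats = np.array([screen_statistic(X[:, j]) for j in range(X.shape[1])])
    keep = np.argsort(stats)[-n_keep:]
    # Cluster on the retained features and compare with the true labels via ARI.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels_hat = km.fit_predict(X[:, keep])
    return adjusted_rand_score(labels_true, labels_hat), keep
```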

Table 3 shows the adjusted Rand index (ARI) between the clustering results given by the different methods and the true clusters. Generally speaking, methods that accurately select more cluster-relevant features while excluding more cluster-irrelevant features (Table 2) tend to perform better in the clustering. All of the count-data feature screening methods and the feature selection method can improve the clustering accuracy, often by a large amount, in comparison with the baseline method No-Screening, indicating that feature screening is an essential step for clustering analysis of high dimensional data. The two versions of the EM-test method have similar performances and consistently perform better than the other methods. When the dimension is small ($p=500$) or the difference between clusters is large (high signal and low noise), EM-test, Chi-square, SC-FS and Skmeans have similar performance. When the difference between clusters is smaller and the dimension $p$ is larger, the advantage of the EM-test over the other methods is more apparent. In addition, clustering based on the EM-test screening achieves an accuracy similar to that of the oracle clustering in most settings. The performances of KS-test and Dip-test are similar to No-Screening because they select almost all features. The ARIs of COSCI are zero because COSCI could not select any features for count data.

Table 3: The means and standard deviations (in parentheses) of ARIs over 100 replications. The values in the table are the actual values $\times 100$.
$p$ No-Screening Oracle EM-adjust EM-0.35 Chi-square SC-FS Skmeans KS-test Dip-test COSCI
Case 1: High signal and low noise
500 94 (11.9) 98 (1.1) 98 (1.5) 98 (2.7) 98 (2.9) 98 (4.1) 98 (0.8) 94 (1.4) 94 (1.4) 0 (0.0)
5000 10 (3.6) 98 (2.8) 98 (0.9) 98 (1.5) 97 (1.4) 95 (22.7) 96 (21.5) 12 (3.2) 12 (3.1) 0 (0.0)
20,000 0 (0.3) 98 (4.5) 98 (2.9) 98 (1.3) 97 (3.0) 0 (1.1) NA 1 (0.3) 1 (0.3) NA
Case 2: High signal and high noise
500 88 (11.2) 94 (2.5) 94 (2.8) 94 (3.0) 93 (2.1) 94 (4.7) 94 (1.4) 88 (2.5) 88 (2.5) 0 (0.0)
5000 4 (2.0) 94 (1.7) 94 (1.7) 94 (1.7) 91 (3.9) 57 (28.5) 49 (26.3) 6 (2.0) 6 (1.9) 0 (0.0)
20,000 0 (0.2) 94 (2.9) 94 (1.9) 93 (1.5) 89 (4.2) 0 (1.0) NA 1 (0.2) 1 (0.2) NA
Case 3: Medium signal and low noise
500 88 (6.8) 96 (1.7) 96 (1.8) 96 (1.9) 92 (5.1) 96 (3.0) 96 (1.3) 88 (2.5) 88 (2.5) 0 (0.0)
5000 3 (1.7) 96 (1.1) 96 (1.2) 96 (1.1) 86 (8.0) 33 (27.7) 30 (24.4) 5 (1.7) 5 (1.8) 0 (0.0)
20,000 0 (0.2) 96 (3.2) 96 (1.9) 95 (1.3) 80 (11.4) 0 (0.0) NA 1 (0.2) 1 (0.2) NA
Case 4: Medium signal and high noise
500 78 (5.5) 90 (1.8) 90 (2.0) 90 (1.7) 80 (7.4) 90 (2.1) 90 (1.7) 77 (4.0) 77 (4.1) 0 (0.0)
5000 2 (1.0) 91 (2.1) 90 (2.5) 90 (2.1) 68 (13.0) 12 (14.5) 13 (16.2) 3 (1.1) 3 (1.0) 0 (0.0)
20,000 0 (0.2) 90 (1.9) 89 (3.0) 89 (1.8) 54 (17.2) 0 (0.0) NA 0 (0.2) 0 (0.2) NA
Case 5: Low signal and low noise
500 59 (11.1) 90 (1.8) 85 (7.6) 89 (2.6) 32 (19.4) 90 (2.8) 89 (2.1) 60 (9.6) 60 (9.7) 0 (0.0)
5000 1 (0.7) 90 (2.1) 79 (7.1) 88 (2.6) 12 (13.7) 0 (8.9) 9 (9.9) 2 (0.6) 2 (0.7) 0 (0.0)
20,000 0 (0.2) 89 (1.8) 75 (8.7) 87 (3.0) 0 (8.9) 0 (0.0) NA 0 (0.2) 0 (0.2) NA
Case 6: Low signal and high noise
500 38 (8.7) 82 (2.5) 73 (6.5) 80 (3.6) 17 (14.9) 82 (3.1) 81 (3.1) 42 (7.7) 42 (7.7) 0 (0.0)
5000 1 (0.4) 83 (2.4) 67 (8.3) 79 (3.9) 10 (8.9) 0 (1.8) 0 (6.5) 1 (0.4) 1 (0.4) 0 (0.0)
20,000 0 (0.2) 82 (2.6) 60 (11.7) 75 (8.9) 0 (7.6) 0 (0.0) NA 0 (0.2) 0 (0.2) NA

We also compare the computational time of the feature screening/selection methods (Supplementary Section D). SC-FS, KS-test and Dip-test are the most computationally efficient methods. The EM-test is also computationally efficient and allows the analysis of tens of thousands of features on a typical desktop computer. Skmeans is computationally very demanding, partly because it has to select the best tuning parameter using permutation.

5 Application to scRNA-seq data

In this section, we consider an application to the scRNA-seq data from Heming et al. (2021). The scRNA-seq data contain single cells from 31 patients, including eight coronavirus disease 2019 patients with acute or long-term neurological sequelae (Neuro-COVID), five viral encephalitis (VE) patients, nine multiple sclerosis (MS) patients and nine idiopathic intracranial hypertension (IIH) patients. Here, we focus on monocytes, granulocytes and dendritic cells. After quality control, we have in total 11,697 cells and 33,538 candidate genes. The scRNA-seq data are usually modeled by the negative binomial distribution (Chen et al., 2018). We thus apply the EM-test based on the negative binomial distribution to screen for important genes. At the FDR threshold 0.01, the EM-test selects 2754 genes. With these genes, we perform clustering and annotation analyses and identify 9 cell subtypes. Details of the feature screening, clustering and annotation are given in the Supplementary material. A sketch of a per-gene negative binomial EM-test is given below.
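The following sketch is our illustration of such a per-gene EM-test under a negative binomial model with $G=2$ components and $K$ EM iterations. The optimizer-based NB fits, the median-split initialization, and the omission of the mixing-proportion penalty $p({\bm{\alpha}})$ are simplifying assumptions, not the authors' implementation; calibration of p-values via the limiting distribution is also not shown.

```python
# A minimal per-gene EM-test sketch under a negative binomial (NB) model.
# Assumptions: G = 2, no penalty on the mixing proportions, crude initialization.
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp
from scipy.stats import nbinom

def nb_loglik(params, x, w=None):
    """Weighted NB log-likelihood; params = (log size r, logit success prob p)."""
    r, p = np.exp(params[0]), 1.0 / (1.0 + np.exp(-params[1]))
    ll = nbinom.logpmf(x, r, p)
    return np.sum(ll if w is None else w * ll)

def fit_nb(x, w=None):
    """(Weighted) NB maximum likelihood via a derivative-free optimizer."""
    res = minimize(lambda t: -nb_loglik(t, x, w), x0=np.zeros(2), method="Nelder-Mead")
    return res.x

def em_test_statistic(x, K=3):
    x = np.asarray(x, dtype=float)
    theta0 = fit_nb(x)                                   # homogeneous fit
    ll0 = nb_loglik(theta0, x)
    alpha = np.array([0.5, 0.5])                         # initial mixing proportions
    med = np.median(x)                                   # crude split; ties ignored
    thetas = [fit_nb(x[x <= med]), fit_nb(x[x > med])]
    for _ in range(K):
        comp = np.stack([np.log(a) + nbinom.logpmf(x, np.exp(t[0]),
                                                   1.0 / (1.0 + np.exp(-t[1])))
                         for a, t in zip(alpha, thetas)])
        w = np.exp(comp - logsumexp(comp, axis=0))       # E-step responsibilities
        alpha = w.sum(axis=1) / len(x)                   # M-step: proportions
        thetas = [fit_nb(x, w[g]) for g in range(2)]     # M-step: component fits
    ll1 = np.sum(logsumexp(
        np.stack([np.log(a) + nbinom.logpmf(x, np.exp(t[0]),
                                            1.0 / (1.0 + np.exp(-t[1])))
                  for a, t in zip(alpha, thetas)]), axis=0))
    return 2.0 * (ll1 - ll0)                             # EM-test statistic
```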

We also apply the Chi-square test and the KS-test and use their selected genes to cluster the single cells. The Chi-square test is the goodness-of-fit test of the negative binomial distribution. At the FDR threshold 0.01, it reports 158 genes. The KS-test is applied to the data normalized with a procedure that is commonly used in scRNA-seq data analyses (Butler et al., 2018). The normalization makes the data better approximated by normal distributions. Following IF-PCA (Jin and Wang, 2016), the threshold of the KS-test is chosen as the higher-criticism threshold, which gives 15,732 important genes. For comparison, we also consider the baseline No-Screening method (all genes are included). We then use the same clustering procedure as above to cluster the single cells using the genes selected by the Chi-square test or the KS-test.

We first evaluate the clustering results using the Calinski-Harabasz index (Caliński and Harabasz, 1974) and the Silhouette index (Rousseeuw, 1987). More specifically, we perform dimension reduction with the genes selected by each method, and then calculate the Calinski-Harabasz and Silhouette indexes of the clustering results in the respective lower dimensional spaces. Principal component analysis (PCA) and uniform manifold approximation and projection (UMAP) (McInnes and Healy, 2018) are used for dimension reduction (to 40 dimensions for PCA and 2 dimensions for UMAP). As shown in Table 4, the EM-test has the largest Calinski-Harabasz and Silhouette indexes, indicating that the genes selected by the EM-test provide the most distinct clustering results in the reduced feature spaces. The EM-test is also the only method whose Calinski-Harabasz and Silhouette indexes are larger than those of the No-Screening method, indicating that the EM-test is more effective in selecting cluster-relevant features. A sketch of this evaluation is given after Table 4.

Table 4: The Silhouette and Calinski-Harabasz index of the clustering results based on genes selected by different methods on their respective lower dimensional spaces.
Method EM-test Chi-square KS-test No-Screening
Number of selected features 2754 158 15,732 33,538
Dimension Reduction by UMAP
Silhouette index 0.27 0.04 0.21 0.23
Calinski-Harabasz index 8713 2981 7674 7895
Dimension Reduction by PCA
Silhouette index 0.11 0.02 0.08 0.10
Calinski-Harabasz index 1048 326 1033 988
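For concreteness, the sketch below shows how such internal validity indexes can be computed on a PCA embedding of the selected genes; the 40-dimensional PCA follows the text, while the function and variable names are our own illustrative choices.

```python
# A minimal sketch: Silhouette and Calinski-Harabasz indexes on a PCA embedding.
# Assumes a cells-by-genes matrix `X_selected` (screened genes only) and
# precomputed cluster labels `labels`.
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, calinski_harabasz_score

def embedding_indexes(X_selected, labels, n_components=40, seed=0):
    emb = PCA(n_components=n_components, random_state=seed).fit_transform(X_selected)
    return (silhouette_score(emb, labels),
            calinski_harabasz_score(emb, labels))
```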

Clustering with the EM-test selected genes gives 9 single cell clusters, including the 6 cell subtypes reported in Heming et al. (2021). The three additional cell subtypes are monocyte subtypes, which we name mono_IFN monocyte, IL7R$^{+}$ monocyte1 and IL7R$^{+}$ monocyte2 (Fig. 2). The two clusters of IL7R$^{+}$ monocytes are often observed at inflammation sites (Al-Mossawi et al., 2019). The mono_IFN monocytes highly express many interferon-related genes (Fig. 2E), suggesting that these cells might play important roles in immune responses to viral infection (Heming et al., 2021). Most of the mono_IFN monocytes are from VE patients (89%), and these cells are depleted in Neuro-COVID patients (1%) compared with VE patients (15%) (Fig. 2F). This indicates that there might be an attenuated interferon response in Neuro-COVID patients. This attenuated interferon response was discovered in Heming et al. (2021) by differential gene expression analysis. However, Heming et al. (2021) did not find the mono_IFN monocytes, possibly because of their feature screening procedure. Here, we successfully identify the mono_IFN monocytes and their marker genes, which can facilitate further downstream analyses of this type of cell. None of the other methods can detect the mono_IFN monocytes: these cells either scatter widely in the method's UMAP plot or form only a small portion of larger cell clusters detected by the other methods (Figure 2B-D). Therefore, we conclude that the EM-test enables more accurate cell type identification via its precise cluster-relevant gene screening and leads to the discovery of the potential novel cell subtype mono_IFN monocyte.

Figure 2: Analysis of cerebrospinal fluid (CSF) samples from Neuro-COVID, viral encephalitis, multiple sclerosis and idiopathic intracranial hypertension patients. (A) Clustering and UMAP based on genes selected by the EM-test. (B) The location of mono_IFN cells on the UMAP derived from genes selected by the Chi-square test. (C) The location of the cluster derived from genes selected by the KS-test that contains the mono_IFN cells, on the UMAP plot of the EM-test selected genes. (D) The location of the cluster derived from all candidate genes that contains the mono_IFN cells, on the UMAP plot of the EM-test selected genes. (E) Expression of several interferon-related genes and markers of different cell types. (F) Percentages of mono_IFN cells in VE, Neuro-COVID, IIH and MS patients.

6 Discussion

In this paper, we propose a general parametric feature screening method for clustering based on the EM-test. We establish tail probability bounds for the EM-test statistic and show that the proposed screening method achieves the sure independent screening property, and even consistency in feature selection, when $p$ does not go to infinity too fast. The limiting distribution of the EM-test statistic under general settings is also obtained. The conditions in this paper are generally mild, and many commonly used parametric families satisfy all of them, so our method can be widely applied. The most stringent condition is the strong identifiability condition (C4). Although many exponential family distributions satisfy this condition, normal distributions with unknown means and variances do not (normal distributions with known variances do). However, we find that this problem is closely related to a well-known truncated moment problem, and we can in fact establish the tail probability bound for normal distributions without Condition (C4). This is beyond the scope of this paper and will be discussed in future research.

One limitation of the proposed method is that the EM-test is a marginal screening method. Jointly important features may be marginally unimportant and thus could be missed by the EM-test. This problem does not occur if the features are independent. In clustering analysis, this problem can also be avoided under conditions other than independence. For example, clustering methods like k-means rely on the variables' means. If cluster-relevant features are assumed to have different means in different clusters, a scenario considered in Cai et al. (2019) and many other clustering works (Jin and Wang, 2016; Löffler et al., 2019), jointly important cluster-relevant features are always marginally important and the problem does not occur. On the other hand, marginally important features may be jointly unimportant and could be falsely retained by marginal screening methods like the EM-test. However, if most of the important features are retained, the inclusion of a few false positives may not have a significant impact on the clustering accuracy. For example, in the simulation scenario with low signal, low noise and $p=20{,}000$, EM-0.35 retains almost all of the 20 important features, but also reports around 40 false positives. In comparison, EM-adjust has almost no false positives but only retains around 12 important features (Table 2). Nevertheless, in terms of clustering accuracy, the ARI of EM-0.35 is considerably higher than that of EM-adjust (0.87 versus 0.74).

The current method can be improved in several respects. One important type of data is binary data. Since a mixture of binary distributions is still a binary distribution, the current method cannot screen for cluster-relevant binary features, and further studies on feature screening for binary data are needed. A potential way to address this problem is to first aggregate the binary variables and then perform screening on the aggregated variables (see the sketch below). Another important direction for improving the current method is to develop non-parametric or semi-parametric screening methods for clustering analyses. Non-parametric or semi-parametric screening methods would allow more robust feature screening and thus potentially have wider applications.
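As a toy illustration of the aggregation idea (assuming, purely for illustration, that related binary features come in known contiguous blocks, which is a simplification we introduce here):

```python
# A toy sketch: sum binary features within predefined blocks so that a count-based
# screening statistic (e.g., an EM-test for counts) can be applied to the sums.
import numpy as np

def aggregate_binary(X, block_size=10):
    """X: n-by-p 0/1 matrix; returns an n-by-(p // block_size) count matrix."""
    n, p = X.shape
    p_trim = (p // block_size) * block_size          # drop the incomplete tail block
    return X[:, :p_trim].reshape(n, -1, block_size).sum(axis=2)
```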

7 Supplementary material

All proofs of the theoretical results are given in the Supplementary material. Additional simulation results and details for the application are also shown in the Supplementary material.

8 Acknowledgments

This work was supported by the National Key Basic Research Project of China (2020YFE0204000), the National Natural Science Foundation of China (11971039), and Sino-Russian Mathematics Center.

Appendix A Proofs of the non-asymptotic results

In this section, we prove Theorems 1–3.

A.1 Sketch of the proof of Theorem 2

In this section, we sketch the proof of Theorem 2. We first recall some important definitions from the manuscript. Let

Y_{ih}=\frac{1}{f(x_{i},\bm{\theta}_{0})}\frac{\partial f(x_{i},\bm{\theta}_{0})}{\partial\theta_{h}},\quad Z_{ih}=\frac{1}{2f(x_{i},\bm{\theta}_{0})}\frac{\partial^{2}f(x_{i},\bm{\theta}_{0})}{\partial\theta_{h}^{2}},
U_{ih\ell}=\frac{1}{f(x_{i},\bm{\theta}_{0})}\frac{\partial^{2}f(x_{i},\bm{\theta}_{0})}{\partial\theta_{h}\partial\theta_{\ell}}~(h<\ell),\quad {\bf b}_{1i}=(Y_{i1},\ldots,Y_{id})^{\rm T}, (12)
{\bf b}_{2i}=(Z_{i1},\ldots,Z_{id},U_{i12},\ldots,U_{i(d-1)d})^{\rm T},\quad\mbox{and}\quad{\bf b}_{i}=\left({\bf b}_{1i}^{\rm T},{\bf b}_{2i}^{\rm T}\right)^{\rm T}.

Given $h,\ell\in\{1,\dots,d\}$, let

m_{1h}({\bm{\alpha}},\bm{\xi}_{1},\bm{\xi}_{2})=\sum_{g=1}^{G}\alpha_{g}\left(\theta_{1gh}-\theta_{2gh}\right),\quad m_{2h}({\bm{\alpha}},\bm{\xi}_{1},\bm{\xi}_{2})=\sum_{g=1}^{G}\alpha_{g}\left(\theta_{1gh}-\theta_{2gh}\right)^{2},
\mbox{and}\quad s_{h\ell}({\bm{\alpha}},\bm{\xi}_{1},\bm{\xi}_{2})=\sum_{g=1}^{G}\alpha_{g}\left(\theta_{1gh}-\theta_{2gh}\right)\left(\theta_{1g\ell}-\theta_{2g\ell}\right)~(h<\ell),

where $\bm{\xi}_{j}=(\bm{\theta}_{j1},\dots,\bm{\theta}_{jG})$ and $\bm{\theta}_{jg}=(\theta_{jg1},\dots,\theta_{jgd})$ ($j=1,2$, $g=1,\dots,G$). We define two vector functions as

{\bf m}_{1}({\bm{\alpha}},\bm{\xi}_{1},\bm{\xi}_{2})=(m_{11},\dots,m_{1d})^{\rm T}, (13)
{\bf m}({\bm{\alpha}},\bm{\xi}_{1},\bm{\xi}_{2})=(m_{11},\dots,m_{1d},m_{21},\dots,m_{2d},s_{12},\dots,s_{(d-1)d})^{\rm T}, (14)

and abbreviate ${\bf m}_{1}({\bm{\alpha}},\bm{\xi},\bm{\xi}_{0})$ and ${\bf m}({\bm{\alpha}},\bm{\xi},\bm{\xi}_{0})$ as ${\bf m}_{1}({\bm{\alpha}},\bm{\xi})$ and ${\bf m}({\bm{\alpha}},\bm{\xi})$, respectively.

The basic idea of the proof is to alternately bound ${\bm{\alpha}}^{(k)}$ and ${\bm{\xi}}^{(k)}$ ($k=0,\dots,K$), and then use Taylor's expansion to bound the EM-test statistic. More specifically, given an initial value ${\bm{\alpha}}^{(0)}$, the one-step EM update ${\bm{\xi}}^{(0)}$ maximizes the log-likelihood $l_{n}(\bm{\xi},{\bm{\alpha}}^{(0)})=\sum_{i=1}^{n}\log\varphi(x_{i};\bm{\xi},{\bm{\alpha}}^{(0)})$. Observe that the homogeneous distribution $f(x;\bm{\theta}_{0})$ can also be written as $\varphi(x;\bm{\xi}_{0},{\bm{\alpha}}^{(0)})$, and that all elements of ${\bm{\alpha}}^{(0)}$ are bounded away from zero, i.e. $\min_{g=1,\ldots,G}\alpha_{g}^{(0)}>\delta>0$. The one-step update ${\bm{\xi}}^{(0)}$ is then a consistent estimate of the true parameter $\bm{\xi}_{0}$, and we can bound the tail probability of $\|{\bm{\xi}}^{(0)}-\bm{\xi}_{0}\|_{2}^{2}$. In turn, since ${\bm{\xi}}^{(0)}$ is close to $\bm{\xi}_{0}$, the EM update ${\bm{\alpha}}^{(1)}$ is also bounded away from zero, i.e. $\min_{g=1,\ldots,G}\alpha_{g}^{(1)}>\delta$. We repeat this argument $K$ times and obtain a tail probability bound for $\|{\bm{\xi}}^{(K)}-\bm{\xi}_{0}\|_{2}^{2}$. Finally, we use Taylor's expansion to represent the EM-test statistic in terms of ${\bm{\xi}}^{(K)}-\bm{\xi}_{0}$, and obtain the tail probability bound for the EM-test statistic. In the following, we present the critical lemmas needed in this process; the alternating scheme is summarized schematically below.
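Schematically (our summary of the lemmas below, not a numbered display of the paper), one round of the alternation is

{\bm{\alpha}}^{(k)}\in\mathbb{S}_{\delta}\ \overset{\text{Lemma 1}}{\Longrightarrow}\ \left\|{\bm{\xi}}^{(k)}-\bm{\xi}_{0}\right\|_{2}^{2}<\frac{\epsilon}{L_{2}}\ \overset{\text{Lemma 2}}{\Longrightarrow}\ \min_{g}\alpha_{g}^{(k+1)}\geq\left(1-\frac{2}{K}\right)\min_{g}\alpha_{g}^{(k)},

with each implication failing only on an event of exponentially small probability; iterating from $k=0$ to $K-1$ and applying Lemma 1 once more yields the tail bound on $\|{\bm{\xi}}^{(K)}-\bm{\xi}_{0}\|_{2}^{2}$ used in Theorem 2.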

The following Lemma 1 guarantees that if ${\bm{\alpha}}^{(k)}$ is bounded away from zero, the EM update ${\bm{\xi}}^{(k)}$ will be close to $\bm{\xi}_{0}$, and we can obtain a tail probability bound for $\|{\bm{\xi}}^{(k)}-\bm{\xi}_{0}\|_{2}^{2}$. More specifically, we define

\mathbb{S}_{\delta}=\left\{{\bm{\alpha}}:{\bm{\alpha}}\in\mathbb{S}^{G-1},\ \min_{g=1,\dots,G}\alpha_{g}\geq\delta>0\right\}. (15)

Clearly, we have ${\bm{\alpha}}^{(0)}\in\mathbb{S}_{\delta}$. In the proof of Theorem 2, we show that ${\bm{\alpha}}^{(k)}\in\mathbb{S}_{\delta}$ with high probability. Denote

\mathcal{S}^{(k)}_{\epsilon}=\left\{\left\|{\bf{m}}\left({\bm{\alpha}}^{(k)},{\bm{\xi}}^{(k)}\right)\right\|_{2}<\frac{\epsilon}{L_{1}},\ \left\|{\bm{\xi}}^{(k)}-\bm{\xi}_{0}\right\|_{2}^{2}<\frac{\epsilon}{L_{2}}\right\}, (16)

where $L_{1},L_{2}>0$ are two constants that will be specified in Lemma 4 and ${\bf{m}}\left({\bm{\alpha}}^{(k)},{\bm{\xi}}^{(k)}\right)$ is as defined in (14). We have the following lemma.

Lemma 1.

Assume that $x_{1},\dots,x_{n}$ are independent samples from the homogeneous distribution $f(x;\bm{\theta}_{0})$. Let $c_{1}=1/24$, $c_{2}=(4/27)(1/1926)$, $p_{0}=\lambda G\log(\delta G)$ and $c>0$ be a constant depending on $\delta,d,G,M$ and ${\rm diam}(\Xi)$. Under Conditions (C1)–(C4), for any $\epsilon>0$ and sufficiently large $n$ such that $\epsilon\geq\max\left(cn^{-1/2}\log n,\ c_{1}^{-1/2}(-p_{0})^{1/2}n^{-1/2}\right)$, we have

{\mathbb{P}}\left(\mathcal{S}^{(k)}_{\epsilon}\cap\left\{{\bm{\alpha}}^{(k)}\in\mathbb{S}_{\delta}\right\}\right)\geq{\mathbb{P}}\left({\bm{\alpha}}^{(k)}\in\mathbb{S}_{\delta}\right)-5\exp\left(-c_{2}n\epsilon^{2}\right).

Define

\mathcal{E}^{(k+1)}=\left\{\min_{g=1,\dots,G}\alpha_{g}^{(k+1)}\geq\min_{g=1,\dots,G}\alpha_{g}^{(k)}\left(1-\frac{2}{K}\right)\right\}. (17)

The following lemma shows that if ${\bm{\xi}}^{(k)}$ is close to the true value $\bm{\xi}_{0}$, then $\min_{g}\alpha_{g}^{(k+1)}$ is bounded below by $\min_{g}\alpha_{g}^{(k)}$ up to a fixed factor with high probability. Let $\Delta_{K}$ be the constant defined in Lemma 12.

Lemma 2.

Assume that $x_{1},\dots,x_{n}$ are independent samples from the homogeneous distribution $f(x;\bm{\theta}_{0})$. When $L_{2}\Delta_{K}^{2}\geq\epsilon$, for any measurable set $\mathcal{B}$, we have

{\mathbb{P}}\left(\mathcal{E}^{(k+1)}\cap\mathcal{S}^{(k)}_{\epsilon}\cap\mathcal{B}\right)\geq{\mathbb{P}}\left(\mathcal{S}^{(k)}_{\epsilon}\cap\mathcal{B}\right)-2\exp\left(\frac{-2n}{K^{2}}\right).

Theorem 2 gives an upper bound for the EM-test statistic under $\mathbb{H}_{0}$. By the likelihood non-decreasing property of the EM algorithm, if ${\rm EM}_{n}^{(K)}$ with $K\geq 3$ is bounded, then ${\rm EM}_{n}^{(K)}$ with $K<3$ is bounded. Thus, without loss of generality, we may assume $K\geq 3$; in other words, the assumption $K\geq 3$ in the manuscript can clearly be relaxed to $K>0$. Since ${\bm{\alpha}}^{(0)}\in\mathbb{S}_{\delta}$, we can alternately apply Lemma 1 and Lemma 2, and get

\min_{g=1,\ldots,G}\alpha_{g}^{(K)}\geq\min_{g=1,\ldots,G}\alpha_{g}^{(0)}\left(1-\frac{2}{K}\right)^{K}\geq 27^{-1}\min_{g=1,\ldots,G}\alpha_{g}^{(0)}\geq\delta~(K\geq 3), (18)

with high probability. Applying Lemma 1 once more, we obtain the tail probability bounds for $\left\|{\bm{\xi}}^{(K)}-\bm{\xi}_{0}\right\|_{2}$ and $\left\|{\bf{m}}\left({\bm{\alpha}}^{(K)},{\bm{\xi}}^{(K)}\right)\right\|_{2}$. Combining these results, we can simultaneously bound $\left\|{\bm{\xi}}^{(K)}-\bm{\xi}_{0}\right\|_{2}$ and $\min_{g=1,\ldots,G}\alpha_{g}^{(K)}$.
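We record a short verification of the middle inequality in (18) (an elementary step we spell out here): for $K\geq 3$, using $\log(1-x)\geq-x/(1-x)$ for $x\in(0,1)$,

\frac{{\rm d}}{{\rm d}K}\left\{K\log\left(1-\frac{2}{K}\right)\right\}=\log\left(1-\frac{2}{K}\right)+\frac{2}{K-2}\geq-\frac{2}{K-2}+\frac{2}{K-2}=0,

so $(1-2/K)^{K}$ is non-decreasing in $K$ and $(1-2/K)^{K}\geq(1-2/3)^{3}=27^{-1}$.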

Lemma 3.

Assume that $x_{1},\dots,x_{n}$ are independent samples from the homogeneous distribution $f(x;\bm{\theta}_{0})$. Let $c_{1}=1/24$, $c_{2}=(4/27)(1/1926)$, $p_{0}=\lambda G\log(\delta G)$ and $c>0$ be a constant depending on $\delta,d,G,M$ and ${\rm diam}(\Xi)$. Under Conditions (C1)–(C4), for any $\epsilon>0$ and sufficiently large $n$ such that $L_{2}\Delta_{K}^{2}\geq\epsilon\geq\max\left(cn^{-1/2}\log n,\ c_{1}^{-1/2}(-p_{0})^{1/2}n^{-1/2}\right)$, we have

{\mathbb{P}}\left(\mathcal{S}_{\epsilon}^{(K)}\cap\left\{{\bm{\alpha}}^{(K)}\in\mathbb{S}_{\delta}\right\}\right)\geq 1-5(K+1)\exp\left(-c_{2}n\epsilon^{2}\right)-2K\exp\left(\frac{-2n}{K^{2}}\right).

Lemma 3 shows that when $n$ is sufficiently large, the tail probability of $\bm{\xi}^{(K)}$ being away from $\bm{\xi}_{0}$ decays exponentially to zero. Moreover, the convergence rate of $\bm{\xi}^{(K)}$ is $O_{p}\left(n^{-1/4}\log^{1/2}n\right)$. Based on this result, we can prove Theorem 2.

It is difficult to prove Lemma 1 directly by bounding the Euclidean distance between $\bm{\xi}^{(k)}$ and $\bm{\xi}_{0}$, because the Fisher information matrix is not positive definite under the homogeneous model $\mathbb{H}_{0}$. However, results in Wong and Shen (1995) imply that the Hellinger distance between $\bm{\xi}^{(k)}$ and $\bm{\xi}_{0}$ can be bounded with high probability. The following Lemma 4 shows that the Euclidean distance between $\bm{\xi}^{(k)}$ and $\bm{\xi}_{0}$ is dominated by their Hellinger distance. Thus, to bound $\|{\bm{\xi}}^{(k)}-\bm{\xi}_{0}\|_{2}$ and prove Lemma 1, it suffices to bound the Hellinger distance between $\bm{\xi}^{(k)}$ and $\bm{\xi}_{0}$. Before presenting Lemma 4, we introduce some notation.

Define

\mathcal{P}^{G}_{\mathbb{S}_{\delta}}=\left\{\sum_{g=1}^{G}\alpha_{g}f(x;\bm{\theta}_{g}):\bm{\theta}_{g}\in\Theta~(g=1,\dots,G),\ {\bm{\alpha}}\in\mathbb{S}_{\delta}\right\}.

For any two densities $p_{1},p_{2}$ with respect to a measure $\mu$, we define their Hellinger distance as

H(p_{1},p_{2})=\left\{2^{-1}\int\left(p_{1}^{1/2}-p_{2}^{1/2}\right)^{2}{\rm d}\mu\right\}^{1/2}.
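As a quick numerical illustration of this definition (our own example, not part of the proof), the Hellinger distance between two unit-variance normal densities with means $0$ and $1$ is $\{1-\exp(-1/8)\}^{1/2}\approx 0.343$, which the following quadrature sketch reproduces.

```python
# A minimal Riemann-sum check of the Hellinger distance between two densities.
import numpy as np
from scipy.stats import norm

def hellinger(p_vals, q_vals, dx):
    # H(p, q) = { (1/2) * integral of (sqrt(p) - sqrt(q))^2 } ^ {1/2}
    diff = np.sqrt(p_vals) - np.sqrt(q_vals)
    return np.sqrt(0.5 * np.sum(diff ** 2) * dx)

grid = np.linspace(-12.0, 13.0, 250001)
dx = grid[1] - grid[0]
print(hellinger(norm.pdf(grid, 0, 1), norm.pdf(grid, 1, 1), dx))  # ~0.3428
```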

When $\varphi_{1},\varphi_{2}\in\mathcal{P}^{G}_{\mathbb{S}_{\delta}}$, we use $(\bm{\xi}_{1},{\bm{\alpha}}_{1})$ and $(\bm{\xi}_{2},{\bm{\alpha}}_{2})$ to represent $\varphi_{1}$ and $\varphi_{2}$, respectively, and write their Hellinger distance as

H({\bm{\alpha}}_{1},{\bm{\alpha}}_{2},\bm{\xi}_{1},\bm{\xi}_{2})=\left[2^{-1}\int\left\{\varphi^{1/2}(x;\bm{\xi}_{1},{\bm{\alpha}}_{1})-\varphi^{1/2}(x;\bm{\xi}_{2},{\bm{\alpha}}_{2})\right\}^{2}\mu({\rm d}x)\right]^{1/2}.

When $\bm{\xi}_{0}=(\bm{\theta}_{0},\ldots,\bm{\theta}_{0})$, the Hellinger distance $H({\bm{\alpha}}_{1},{\bm{\alpha}}_{2},\bm{\xi}_{1},\bm{\xi}_{0})$ can be written as

H({\bm{\alpha}}_{1},{\bm{\alpha}}_{2},\bm{\xi}_{1},\bm{\xi}_{0})=\left[2^{-1}\int\left\{\varphi^{1/2}(x;\bm{\xi}_{1},{\bm{\alpha}}_{1})-f^{1/2}(x;\bm{\theta}_{0})\right\}^{2}\mu({\rm d}x)\right]^{1/2}.

Note that $H({\bm{\alpha}}_{1},{\bm{\alpha}}_{2},\bm{\xi}_{1},\bm{\xi}_{0})$ does not depend on ${\bm{\alpha}}_{2}$, and we write it as $H({\bm{\alpha}}_{1},\bm{\xi}_{1},\bm{\xi}_{0})$. Let ${\rm diam}_{{\bf m}}(\Xi)=\sup_{\bm{\xi}_{1},\bm{\xi}_{2}\in\Xi,\ {\bm{\alpha}}\in\mathbb{S}_{\delta}}\left\|{\bf m}({\bm{\alpha}},\bm{\xi}_{1},\bm{\xi}_{2})\right\|_{2}^{2}$. Since $\Xi$ is a compact set, we have ${\rm diam}_{{\bf m}}(\Xi)<\infty$. For any $\delta^{\prime}>0$, let

\mathcal{D}(\delta^{\prime})=\left\{({\bm{\alpha}},\bm{\theta}_{0},\bm{\xi}):{\bm{\alpha}}\in\mathbb{S}_{\delta},\ \bm{\theta}_{0}\in\Theta,\ \bm{\xi}\in\Xi,\ \sum_{g=1}^{G}\|\bm{\theta}_{g}-\bm{\theta}_{0}\|_{2}\geq\delta^{\prime}\right\}. (19)

We have the following lemma, which provides the connection between the Hellinger distance and the Euclidean distance.

Lemma 4.

Under Conditions (C1)–(C4) and $\mathbb{H}_{0}$, there exists $\Delta_{1}>0$ such that for any $\bm{\theta}_{0}\in\Theta$ and ${\bm{\alpha}}\in\mathbb{S}_{\delta}$, when $\sum_{g=1}^{G}\|\bm{\theta}_{g}-\bm{\theta}_{0}\|_{2}<\Delta_{1}$, we have

H^{2}({\bm{\alpha}},\bm{\xi},\bm{\xi}_{0})\geq 32^{-1}\lambda_{\rm min}\|{\bf m}({\bm{\alpha}},\bm{\xi})\|_{2}^{2},

where $\bm{\xi}=(\bm{\theta}_{1},\cdots,\bm{\theta}_{G})$ and $\bm{\xi}_{0}=(\bm{\theta}_{0},\cdots,\bm{\theta}_{0})$. Furthermore, if $\omega=\inf_{\mathcal{D}(\Delta_{1})}H({\bm{\alpha}},\bm{\xi},\bm{\xi}_{0})>0$, then for any $\bm{\theta}_{0}\in\Theta$, $\bm{\xi}\in\Xi$ and ${\bm{\alpha}}\in\mathbb{S}_{\delta}$, we have

H({\bm{\alpha}},\bm{\xi},\bm{\xi}_{0})\geq L_{1}\|{\bf m}({\bm{\alpha}},\bm{\xi})\|_{2}\geq L_{2}\|\bm{\xi}-\bm{\xi}_{0}\|_{2}^{2},

where $L_{1}=\sqrt{\min\left(32^{-1}\lambda_{\rm min},\ \frac{\omega^{2}}{{\rm diam}_{{\bf m}}(\Xi)}\right)}$ and $L_{2}=d^{-1/2}L_{1}\delta$.

Lemma 4 shows that there exists a constant $L_{2}$ such that $H({\bm{\alpha}},\bm{\xi},\bm{\xi}_{0})\geq L_{2}\|\bm{\xi}-\bm{\xi}_{0}\|_{2}^{2}$, provided that $\min_{g}\alpha_{g}\geq\delta>0$. It demonstrates that, to bound the Euclidean distance between $\bm{\xi}^{(k)}$ and $\bm{\xi}_{0}$, we only need to bound their Hellinger distance. Note that this lemma depends on the additional condition $\omega>0$, which is guaranteed by the assumption that $\mathcal{P}^{G}$ is an identifiable finite mixture together with the compactness of $\Theta$. In fact, by the identifiability, if $\bm{\xi}\neq\bm{\xi}_{0}$, then $H({\bm{\alpha}},\bm{\xi},\bm{\xi}_{0})>0$. Since $H({\bm{\alpha}},\bm{\xi},\bm{\xi}_{0})$ is uniformly continuous on the compact set $\mathcal{D}(\Delta_{1})$, we have $\omega>0$. The continuity of $H({\bm{\alpha}},\bm{\xi},\bm{\xi}_{0})$ is clear from Condition (C1).

A.2 Proofs of Lemmas 1–4

In this section, we give the proofs of Lemmas 4, 1, 2 and 3.

A.2.1 Proof of Lemma 4

Before proving Lemma 4, we state the following lemma, which is used repeatedly in the proofs.

Lemma 5.

Let $\bm{\xi}_{1},\bm{\xi}_{2}\in\Xi$, where $\bm{\xi}_{1}=(\bm{\theta}_{11},\dots,\bm{\theta}_{1G})$ and $\bm{\xi}_{2}=(\bm{\theta}_{21},\dots,\bm{\theta}_{2G})$. Then, for any integer $k\geq 1$, $1\leq g\leq G$ and $j_{1},\dots,j_{k}\in\{1,\dots,d\}$, we have

\prod_{s=1}^{k}\left|\theta_{1gj_{s}}-\theta_{2gj_{s}}\right|\leq\left(\sqrt{d}\right)^{k}\left\|\bm{\theta}_{1g}-\bm{\theta}_{2g}\right\|_{2}^{k},

where $\bm{\theta}_{ig}=(\theta_{ig1},\dots,\theta_{igd})$ ($i=1,2$, $g=1,\dots,G$).

Proof of Lemma 5.

It is clear that

\prod_{s=1}^{k}\left|\theta_{1gj_{s}}-\theta_{2gj_{s}}\right|\leq\left(\sum_{j=1}^{d}\left|\theta_{1gj}-\theta_{2gj}\right|\right)^{k}\leq\left(\sqrt{d}\right)^{k}\left\|\bm{\theta}_{1g}-\bm{\theta}_{2g}\right\|_{2}^{k},

where the last inequality follows from Cauchy's inequality. ∎

Proof of Lemma 4.

For notational simplicity, we abbreviate ${\bf m}({\bm{\alpha}},\bm{\xi},\bm{\xi}_{0})$ defined in (14) as ${\bf m}$, where $\bm{\xi}_{0}=(\bm{\theta}_{0},\dots,\bm{\theta}_{0})\in\Xi$. Note that

H^{2}({\bm{\alpha}},\bm{\xi},\bm{\xi}_{0})=1-\int\left\{\sum_{g=1}^{G}\alpha_{g}f(x,\bm{\theta}_{g})f(x,\bm{\theta}_{0})\right\}^{1/2}\mu({\rm d}x)
=1-\int\left\{\frac{\sum_{g=1}^{G}\alpha_{g}f(x,\bm{\theta}_{g})}{f(x,\bm{\theta}_{0})}\right\}^{1/2}f(x,\bm{\theta}_{0})\mu({\rm d}x)
=\int\left(1-\left\{\frac{\sum_{g=1}^{G}\alpha_{g}\left(f(x,\bm{\theta}_{g})-f(x,\bm{\theta}_{0})\right)}{f(x,\bm{\theta}_{0})}+1\right\}^{1/2}\right)f(x,\bm{\theta}_{0})\mu({\rm d}x).

Let $\delta(x)=\frac{1}{f(x,\bm{\theta}_{0})}\sum_{g=1}^{G}\alpha_{g}\left(f(x,\bm{\theta}_{g})-f(x,\bm{\theta}_{0})\right)$. We can rewrite $H^{2}({\bm{\alpha}},\bm{\xi},\bm{\xi}_{0})$ as

H^{2}({\bm{\alpha}},\bm{\xi},\bm{\xi}_{0})=\int\left(1-\sqrt{\delta(x)+1}\right)f(x,\bm{\theta}_{0})\mu({\rm d}x).

Applying the inequality

\sqrt{x+1}-1\leq x/2-x^{2}/8+x^{3}/16,

and ${\mathbb{E}}[\delta(x)]=0$, we have

H^{2}({\bm{\alpha}},\bm{\xi},\bm{\xi}_{0})\geq-{\mathbb{E}}[\delta(x)/2]+{\mathbb{E}}\left[\delta^{2}(x)/8\right]-{\mathbb{E}}\left[\delta^{3}(x)/16\right]={\mathbb{E}}\left[\delta^{2}(x)/8\right]-{\mathbb{E}}\left[\delta^{3}(x)/16\right]. (20)
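For completeness, we verify the inequality used above (an elementary step we add here): writing $P(x)=1+x/2-x^{2}/8+x^{3}/16$, direct expansion gives

P^{2}(x)-(1+x)=\frac{5x^{4}}{64}-\frac{x^{5}}{64}+\frac{x^{6}}{256}=\frac{x^{4}}{256}\left\{(x-2)^{2}+16\right\}\geq 0,

and $P$ is increasing (its derivative $1/2-x/4+3x^{2}/16$ has negative discriminant) with $P(-1)=5/16>0$, so $\sqrt{1+x}\leq P(x)$ for all $x\geq-1$; it is applied with $x=\delta(\cdot)\geq-1$.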

Step 1. We first consider the quadratic term. Define

{\bf b}(x)=(Y_{1}(x),\dots,Y_{d}(x),Z_{1}(x),\dots,Z_{d}(x),U_{12}(x),\dots,U_{(d-1)d}(x)),

where $Y_{h}(x)$, $Z_{h}(x)$ ($h=1,\dots,d$) and $U_{h\ell}(x)$ ($1\leq h<\ell\leq d$) are defined as in (12) without the subscript $i$. By Taylor's expansion, we have

\delta(x)=\sum_{g=1}^{G}\alpha_{g}\frac{f(x;\bm{\theta}_{g})-f(x;\bm{\theta}_{0})}{f(x;\bm{\theta}_{0})}
=\sum_{h=1}^{d}\sum_{g=1}^{G}\alpha_{g}(\theta_{gh}-\theta_{0h})Y_{h}(x)+\sum_{h=1}^{d}\sum_{g=1}^{G}\alpha_{g}(\theta_{gh}-\theta_{0h})^{2}Z_{h}(x)
\quad+\sum_{h<\ell}\sum_{g=1}^{G}\alpha_{g}(\theta_{gh}-\theta_{0h})(\theta_{g\ell}-\theta_{0\ell})U_{h\ell}(x)+\varepsilon(x)
={\bf m}^{\rm T}{\bf b}(x)+\varepsilon(x).

Here $\varepsilon(x)$ is the remainder term and can be written exactly as

\varepsilon(x)=\sum_{j_{1}=1}^{d}\cdots\sum_{j_{3}=1}^{d}\sum_{g=1}^{G}\alpha_{g}\prod_{s=1}^{3}\left(\theta_{gj_{s}}-\theta_{0j_{s}}\right)\left(\frac{\partial^{3}f(x,\bm{\zeta}_{g}(x))}{\partial\theta_{j_{1}}\partial\theta_{j_{2}}\partial\theta_{j_{3}}}\right)\bigg{/}\left(3!f(x,\bm{\theta}_{0})\right),

where $\bm{\zeta}_{g}(x)$ lies between $\bm{\theta}_{g}$ and $\bm{\theta}_{0}$. Since $\delta(x)={\bf m}^{\rm T}{\bf b}(x)+\varepsilon(x)$, we have

\delta^{2}(x)=\overbrace{{\bf m}^{\rm T}{\bf b}(x){\bf b}(x)^{\rm T}{\bf m}}^{\text{denote as I}}+\overbrace{2{\bf m}^{\rm T}{\bf b}(x)\varepsilon(x)}^{\text{denote as II}}+\overbrace{\varepsilon^{2}(x)}^{\text{denote as III}}.

For the first term I, we have

{\mathbb{E}}\left({\bf m}^{\rm T}{\bf b}(x){\bf b}(x)^{\rm T}{\bf m}\right)={\bf m}^{\rm T}{\bf B}(\bm{\theta}_{0}){\bf m}\geq\lambda_{\rm min}\|{\bf m}\|_{2}^{2},

where ${\bf B}(\bm{\theta}_{0})={\mathbb{E}}\left({\bf b}(x){\bf b}(x)^{\rm T}\right)$. For the second term II, by Cauchy's inequality, we have

2{\mathbb{E}}\left({\bf m}^{\rm T}{\bf b}(x)\varepsilon(x)\right)\geq-2{\mathbb{E}}\left(\left|{\bf m}^{\rm T}{\bf b}(x)\right||\varepsilon(x)|\right)\geq-2\|{\bf m}\|_{2}{\mathbb{E}}\left(\|{\bf b}(x)\|_{2}|\varepsilon(x)|\right).

Hence, we aim to bound ${\mathbb{E}}(\|{\bf b}(x)\|_{2}|\varepsilon(x)|)$. For any fixed $j_{1},j_{2},j_{3}$ and $g$, by Lemma 5, we have

\alpha_{g}\prod_{s=1}^{3}\left|\theta_{gj_{s}}-\theta_{0j_{s}}\right|\leq\alpha_{g}d^{3/2}\|\bm{\theta}_{g}-\bm{\theta}_{0}\|_{2}^{3}\leq d^{2}\|\bm{\theta}_{g}-\bm{\theta}_{0}\|_{2}\|{\bf m}\|_{2}. (21)

The second inequality of (21) follows from $\sqrt{d}\|{\bf m}\|_{2}\geq\alpha_{g}\|\bm{\theta}_{g}-\bm{\theta}_{0}\|_{2}^{2}$, because

\sqrt{d}\|{\bf m}\|_{2}\geq\sqrt{d}\left(\sum_{h=1}^{d}\left\{\sum_{g=1}^{G}\alpha_{g}(\theta_{gh}-\theta_{0h})^{2}\right\}^{2}\right)^{1/2}\geq\sum_{h=1}^{d}\sum_{g=1}^{G}\alpha_{g}(\theta_{gh}-\theta_{0h})^{2}. (22)

Recall that $g(x,\bm{\theta}_{0})$ is the function defined in Condition (C3). Then, we have

{\mathbb{E}}\left(\|{\bf b}(x)\|_{2}\left|\sum_{g=1}^{G}\alpha_{g}\prod_{s=1}^{3}(\theta_{gj_{s}}-\theta_{0j_{s}})\left(\frac{\partial^{3}f(x,\bm{\zeta}_{g}(x))}{\partial\theta_{j_{1}}\partial\theta_{j_{2}}\partial\theta_{j_{3}}}\right)\bigg{/}\left(3!f(x,\bm{\theta}_{0})\right)\right|\right)
\leq{\mathbb{E}}\left(g(x,\bm{\theta}_{0})\|{\bf b}(x)\|_{2}\right)d^{2}\sum_{g=1}^{G}\|\bm{\theta}_{g}-\bm{\theta}_{0}\|_{2}\|{\bf m}\|_{2}\ \ \ \ (\mbox{by (21)})
\leq d^{2}\sum_{g=1}^{G}\|\bm{\theta}_{g}-\bm{\theta}_{0}\|_{2}\|{\bf m}\|_{2}(d+1)M^{2},

where

{\mathbb{E}}\left(g(x,\bm{\theta}_{0})\|{\bf b}(x)\|_{2}\right)\leq\left({\mathbb{E}}\left(g^{2}(x,\bm{\theta}_{0})\right){\mathbb{E}}\left(\|{\bf b}(x)\|_{2}^{2}\right)\right)^{1/2}\leq(d+1)M^{2}

follows from Cauchy's inequality. Summing over the $d^{3}$ choices of $(j_{1},j_{2},j_{3})$, it follows that

{\mathbb{E}}\left(\|{\bf b}(x)\|_{2}|\varepsilon(x)|\right)\leq d^{5}(d+1)\sum_{g=1}^{G}\|\bm{\theta}_{g}-\bm{\theta}_{0}\|_{2}\|{\bf m}\|_{2}M^{2}.

Therefore, there exists a constant $\Delta_{11}>0$ such that when $\sum_{g=1}^{G}\|\bm{\theta}_{g}-\bm{\theta}_{0}\|_{2}\leq\Delta_{11}$,

2{\mathbb{E}}\left({\bf m}^{\rm T}{\bf b}(x)\varepsilon(x)\right)\geq-2^{-1}\lambda_{\rm min}\|{\bf m}\|_{2}^{2}.

Since ${\mathbb{E}}\left(\varepsilon^{2}(x)\right)\geq 0$, when $\sum_{g=1}^{G}\|\bm{\theta}_{g}-\bm{\theta}_{0}\|_{2}\leq\Delta_{11}$, we have

{\mathbb{E}}\left(\delta^{2}(x)\right)\geq 2^{-1}\lambda_{\rm min}\|{\bf m}\|_{2}^{2}.

Step 2. Next, we prove that there exists $\Delta_{12}>0$ such that when $\sum_{g=1}^{G}\|\bm{\theta}_{g}-\bm{\theta}_{0}\|_{2}\leq\Delta_{12}$, ${\mathbb{E}}\left(\left|\delta^{3}(x)\right|\right)\leq 2^{-1}\lambda_{\rm min}\|{\bf m}\|_{2}^{2}$. We have

\delta^{3}(x)=\left({\bf m}^{\rm T}{\bf b}(x)\right)^{3}+3\left({\bf m}^{\rm T}{\bf b}(x)\right)^{2}\varepsilon(x)+3{\bf m}^{\rm T}{\bf b}(x)\varepsilon^{2}(x)+\varepsilon^{3}(x).

Note that

|\varepsilon(x)|\leq\left|\sum_{j_{1}=1}^{d}\cdots\sum_{j_{3}=1}^{d}\sum_{g=1}^{G}\alpha_{g}\prod_{s=1}^{3}(\theta_{gj_{s}}-\theta_{0j_{s}})\right|g(x;\bm{\theta}_{0})
\leq d^{5}\|{\bf m}\|_{2}\sum_{g=1}^{G}\|\bm{\theta}_{g}-\bm{\theta}_{0}\|_{2}\,g(x;\bm{\theta}_{0})\ \ \ \ (\mbox{applying (21)})
=\|{\bf m}\|_{2}\widetilde{\varepsilon}(x),

where

\widetilde{\varepsilon}(x)=d^{5}\sum_{g=1}^{G}\|\bm{\theta}_{g}-\bm{\theta}_{0}\|_{2}\,g(x,\bm{\theta}_{0}).

Then, we have

\left|\delta^{3}(x)\right|=\left|\left({\bf m}^{\rm T}{\bf b}(x)\right)^{3}+3\left({\bf m}^{\rm T}{\bf b}(x)\right)^{2}\varepsilon(x)+3{\bf m}^{\rm T}{\bf b}(x)\varepsilon^{2}(x)+\varepsilon^{3}(x)\right|
\leq\|{\bf m}\|_{2}^{3}\left(\|{\bf b}(x)\|_{2}^{3}+3\|{\bf b}(x)\|_{2}^{2}|\widetilde{\varepsilon}(x)|+3\|{\bf b}(x)\|_{2}|\widetilde{\varepsilon}(x)|^{2}+|\widetilde{\varepsilon}(x)|^{3}\right).

By Condition (C3), the random variable

\|{\bf b}(x)\|_{2}^{3}+3\|{\bf b}(x)\|_{2}^{2}|\widetilde{\varepsilon}(x)|+3\|{\bf b}(x)\|_{2}|\widetilde{\varepsilon}(x)|^{2}+|\widetilde{\varepsilon}(x)|^{3}

is integrable. Then, similarly to the proof in Step 1, there exists a constant $\Delta_{12}>0$ such that when $\sum_{g=1}^{G}\|\bm{\theta}_{g}-\bm{\theta}_{0}\|_{2}\leq\Delta_{12}$, we have ${\mathbb{E}}\left[\left|\delta^{3}(x)\right|\right]\leq 2^{-1}\lambda_{\rm min}\|{\bf m}\|_{2}^{2}$. Taking $\Delta_{1}=\min(\Delta_{11},\Delta_{12})$, by (20), for any $\bm{\theta}_{0}\in\Theta$, when $\sum_{g=1}^{G}\|\bm{\theta}_{g}-\bm{\theta}_{0}\|_{2}\leq\Delta_{1}$, we have

H^{2}({\bm{\alpha}},\bm{\xi},\bm{\xi}_{0})\geq 32^{-1}\lambda_{\rm min}\|{\bf m}\|_{2}^{2}.

By the definition of $\omega$ and ${\rm diam}_{{\bf m}}(\Xi)$, for any $\bm{\xi}\in\Xi$, $\bm{\theta}_{0}\in\Theta$ and ${\bm{\alpha}}\in\mathbb{S}_{\delta}$, we have

H^{2}({\bm{\alpha}},\bm{\xi},\bm{\xi}_{0})\geq\min\left\{\frac{\omega^{2}}{{\rm diam}_{{\bf m}}(\Xi)},32^{-1}\lambda_{\rm min}\right\}\|{\bf m}\|_{2}^{2}.

Write $L_{1}=\sqrt{\min\left\{\frac{\omega^{2}}{{\rm diam}_{{\bf m}}(\Xi)},32^{-1}\lambda_{\rm min}\right\}}$. Finally, by (22), we have $\sqrt{d}\|{\bf m}\|_{2}\geq\alpha_{\rm min}\sum_{g=1}^{G}\|\bm{\theta}_{g}-\bm{\theta}_{0}\|_{2}^{2}\geq\alpha_{\rm min}\|\bm{\xi}-\bm{\xi}_{0}\|_{2}^{2}$, which together with ${\bm{\alpha}}\in\mathbb{S}_{\delta}$ yields

L_{1}\|{\bf m}\|_{2}\geq\frac{L_{1}\alpha_{\rm min}}{\sqrt{d}}\|\bm{\xi}-\bm{\xi}_{0}\|_{2}^{2}\geq\frac{L_{1}\delta}{\sqrt{d}}\|\bm{\xi}-\bm{\xi}_{0}\|_{2}^{2},

which finishes the proof of Lemma 4. ∎

A.2.2 Proof of Lemma 1

To prove Lemma 1, it remains to construct a link between the log-likelihood ratio and the Hellinger distance between the estimator and the true value. The following lemma shows that $\bm{\xi}^{(k)}$ is concentrated around $\bm{\xi}_{0}$ in the Hellinger distance; in other words, Lemma 6 links the log-likelihood ratio to the Hellinger distance between $\left(\bm{\xi}^{(k)},{\bm{\alpha}}^{(k)}\right)$ and $\left(\bm{\xi}_{0},{\bm{\alpha}}_{0}\right)$.

Lemma 6.

Let $c_{1}=1/24$, $c_{2}=(4/27)(1/1926)$, $p_{0}=\lambda G\log(\delta G)$ and $c>0$ be a constant depending on $\delta,d,G,M$ and ${\rm diam}(\Xi)$. Under Conditions (C1)–(C4) and $\mathbb{H}_{0}$, for any $\epsilon\geq cn^{-1/2}\log n$ and $-n^{-1}p_{0}\leq c_{1}\epsilon^{2}$, we have

{\mathbb{P}}\left(\left\{H\left({\bm{\alpha}}^{(k)},{\bm{\xi}}^{(k)},\bm{\xi}_{0}\right)\leq\epsilon\right\}\cap\left\{{\bm{\alpha}}^{(k)}\in\mathbb{S}_{\delta}\right\}\right)\geq{\mathbb{P}}\left({\bm{\alpha}}^{(k)}\in\mathbb{S}_{\delta}\right)-5\exp\left(-c_{2}n\epsilon^{2}\right).

Based on Lemma 6 and Lemma 4, we can prove Lemma 1.

Proof of Lemma 1.

By Lemma 6, when $\epsilon\geq cn^{-1/2}\log n$ and $c_{1}\epsilon^{2}\geq(-p_{0})n^{-1}$, i.e.

\epsilon\geq\max\left(cn^{-1/2}\log n,\ c_{1}^{-1/2}(-p_{0})^{1/2}n^{-1/2}\right),

we have

{\mathbb{P}}\left(\left\{H\left({\bm{\alpha}}^{(k)},{\bm{\xi}}^{(k)},\bm{\xi}_{0}\right)\leq\epsilon\right\}\cap\left\{{\bm{\alpha}}^{(k)}\in\mathbb{S}_{\delta}\right\}\right)\geq{\mathbb{P}}\left({\bm{\alpha}}^{(k)}\in\mathbb{S}_{\delta}\right)-5\exp\left(-c_{2}n\epsilon^{2}\right).

By Lemma 4, we have

\left\{H\left({\bm{\alpha}}^{(k)},{\bm{\xi}}^{(k)},\bm{\xi}_{0}\right)\leq\epsilon\right\}\cap\left\{{\bm{\alpha}}^{(k)}\in\mathbb{S}_{\delta}\right\}\subset\mathcal{S}^{(k)}_{\epsilon}\cap\left\{{\bm{\alpha}}^{(k)}\in\mathbb{S}_{\delta}\right\}.

It follows that

{\mathbb{P}}\left(\mathcal{S}^{(k)}_{\epsilon}\cap\left\{{\bm{\alpha}}^{(k)}\in\mathbb{S}_{\delta}\right\}\right)\geq{\mathbb{P}}\left({\bm{\alpha}}^{(k)}\in\mathbb{S}_{\delta}\right)-5\exp\left(-c_{2}n\epsilon^{2}\right),

and thus we complete the proof. ∎

We now aim to prove Lemma 6. We use the Hellinger distance entropy to measure the size of the parameter space $\Xi$. For any $u>0$, we call a finite set $\{(f_{j}^{L},f_{j}^{U}),\ j=1,\dots,N\}$ a (Hellinger) $u$-bracketing of a distribution family $\mathcal{F}$ if $H(f_{j}^{L},f_{j}^{U})\leq u$ for each $j$ and, for any $p\in\mathcal{F}$, there is a $j$ such that $f_{j}^{L}\leq p\leq f_{j}^{U}$. Define the Hellinger distance entropy of $\mathcal{P}^{G}_{\mathbb{S}_{\delta}}$ as $\mathcal{H}(u,\mathcal{P}^{G}_{\mathbb{S}_{\delta}})=$ the logarithm of the cardinality of the smallest $u$-bracketing. To bound the Hellinger distance entropy $\mathcal{H}(u,\mathcal{P}^{G}_{\mathbb{S}_{\delta}})$, we need the following Lipschitz property of $\varphi^{1/2}(x;\bm{\xi},{\bm{\alpha}})$.

Lemma 7.

Under Condition (C3), if $\bm{\xi}_{1},\bm{\xi}_{2}\in\Xi$ and ${\bm{\alpha}}_{1},{\bm{\alpha}}_{2}\in\mathbb{S}_{\delta}$, then

\left|\varphi^{1/2}(x;\bm{\xi}_{1},{\bm{\alpha}}_{1})-\varphi^{1/2}(x;\bm{\xi}_{2},{\bm{\alpha}}_{2})\right|\leq ar(x)\left\|(\bm{\xi}_{1},{\bm{\alpha}}_{1})-(\bm{\xi}_{2},{\bm{\alpha}}_{2})\right\|_{2},

where $r(x)$ is the function in Condition (C3) and $a=\sqrt{\frac{G}{4\delta}}$.

With Lemma 7, computing the Hellinger distance entropy can be reduced to computing the Euclidean distance entropy. The following lemma gives an upper bound of $\mathcal{H}(u,\mathcal{P}^{G}_{\mathbb{S}_{\delta}})$ based on Lemma 7.

Lemma 8.

Under Conditions (C1)–(C3), we have

\mathcal{H}(u,\mathcal{P}^{G}_{\mathbb{S}_{\delta}})\leq G(d+1)\log\left(1+\frac{2aM{\rm diam}(\Xi\times\mathbb{S}_{\delta})}{u}\right),

where $a=\sqrt{\frac{G}{4\delta}}$ and ${\rm diam}(\Xi\times\mathbb{S}_{\delta})$ is the Euclidean diameter of $\Xi\times\mathbb{S}_{\delta}$.

We remark here that ${\rm diam}(\Xi\times\mathbb{S}_{\delta})$ depends only on ${\rm diam}(\Xi)$, $G$ and $d$, because the elements of ${\bm{\alpha}}$ with ${\bm{\alpha}}\in\mathbb{S}_{\delta}$ are bounded by 1. The following lemma from Wong and Shen (1995) gives a uniform exponential bound for the likelihood ratio.

Lemma 9.

Take $c_{1}=1/24$, $c_{2}=(4/27)(1/1926)$, $c_{3}=10$ and $c_{4}=(2/3)^{5/2}/512$. For any $\epsilon>0$, if

\int_{\epsilon^{2}/2^{8}}^{\sqrt{2}\epsilon}\mathcal{H}^{1/2}\left(u/c_{3},\mathcal{P}^{G}_{\mathbb{S}_{\delta}}\right){\rm d}u\leq c_{4}n^{1/2}\epsilon^{2}, (23)

then

{\mathbb{P}}^{*}\left(\sup_{H({\bm{\alpha}},\bm{\xi}_{1},\bm{\xi}_{0})\geq\epsilon,\ \bm{\xi}_{1}\in\Xi,\ {\bm{\alpha}}\in\mathbb{S}_{\delta}}\prod_{i=1}^{n}\varphi\left(x_{i};\bm{\xi}_{1},{\bm{\alpha}}\right)/\varphi\left(x_{i};\bm{\xi}_{0},{\bm{\alpha}}\right)\geq\exp\left(-c_{1}n\epsilon^{2}\right)\right)\leq 4\exp\left(-c_{2}n\epsilon^{2}\right),

where ${\mathbb{P}}^{*}$ denotes the outer probability measure corresponding to the measure at $(\bm{\xi}_{0},{\bm{\alpha}}_{0})$.

The following lemma claims that (23) holds whenever $n\geq 2$ and $\epsilon\geq cn^{-1/2}\log n$, where $c$ is a constant depending only on $\delta,d,G,M$ and ${\rm diam}(\Xi)$.

Lemma 10.

Under Conditions (C1)–(C3), there exists a constant $c$ depending on $\delta,d,G,M$ and ${\rm diam}(\Xi)$ such that when $n\geq 2$, for any $\epsilon\geq cn^{-1/2}\log n$, (23) holds.

In fact, if we used the local Hellinger distance entropy, we could remove the $\log n$ factor and obtain a stronger result; however, the rate $cn^{-1/2}\log n$ is sufficient for this paper. Based on Lemma 9 and Lemma 10, we now prove Lemma 6.

Proof of Lemma 6.

Since $\bm{\xi}^{(0)}=\arg\max_{\bm{\xi}\in\Xi}pl_{n}(\bm{\xi},{\bm{\alpha}}^{(0)})$, we have

pl_{n}\left(\bm{\xi}^{(0)},{\bm{\alpha}}^{(0)}\right)\geq pl_{n}\left(\bm{\xi}_{0},{\bm{\alpha}}^{(0)}\right),

and thus

l_{n}\left(\bm{\xi}^{(0)},{\bm{\alpha}}^{(0)}\right)\geq l_{n}\left(\bm{\xi}_{0},{\bm{\alpha}}^{(0)}\right)=l_{n}\left(\bm{\xi}_{0},{\bm{\alpha}}_{0}\right).

By the likelihood non-decreasing property of the EM algorithm, for any $1\leq k\leq K$, we have

pl_{n}\left(\bm{\xi}^{(k)},{\bm{\alpha}}^{(k)}\right)\geq pl_{n}\left(\bm{\xi}^{(0)},{\bm{\alpha}}^{(0)}\right).

Since $p({\bm{\alpha}}^{(0)})\geq p_{0}$ and the penalty satisfies $p({\bm{\alpha}}^{(k)})\leq 0$, we conclude that

l_{n}\left(\bm{\xi}^{(k)},{\bm{\alpha}}^{(k)}\right)-l_{n}\left(\bm{\xi}^{(0)},{\bm{\alpha}}^{(0)}\right)\geq p({\bm{\alpha}}^{(0)})-p({\bm{\alpha}}^{(k)})\geq p({\bm{\alpha}}^{(0)})\geq p_{0}. (24)

Next, by Lemmas 10 and 9, for any $\epsilon\geq cn^{-1/2}\log n$ and $-n^{-1}p_{0}\leq c_{1}\epsilon^{2}$, we have

{\mathbb{P}}^{*}\left(\sup_{H({\bm{\alpha}},\bm{\xi}_{1},\bm{\xi}_{0})\geq\epsilon,\ \bm{\xi}_{1}\in\Xi,\ {\bm{\alpha}}\in\mathbb{S}_{\delta}}\prod_{i=1}^{n}\varphi\left(x_{i};\bm{\xi}_{1},{\bm{\alpha}}\right)/\varphi\left(x_{i};\bm{\xi}_{0},{\bm{\alpha}}\right)\geq\exp\left(-c_{1}n\epsilon^{2}\right)\right)\leq 4\exp\left(-c_{2}n\epsilon^{2}\right),

which, since $p_{0}\geq-c_{1}n\epsilon^{2}$, implies

{\mathbb{P}}^{*}\left(\sup_{H({\bm{\alpha}},\bm{\xi}_{1},\bm{\xi}_{0})\geq\epsilon,\ \bm{\xi}_{1}\in\Xi,\ {\bm{\alpha}}\in\mathbb{S}_{\delta}}l_{n}(\bm{\xi}_{1},{\bm{\alpha}})-l_{n}(\bm{\xi}_{0},{\bm{\alpha}}_{0})\geq p_{0}\right)\leq 4\exp\left(-c_{2}n\epsilon^{2}\right).

Write

\mathcal{G}=\left\{\sup_{H({\bm{\alpha}},\bm{\xi}_{1},\bm{\xi}_{0})\geq\epsilon,\ \bm{\xi}_{1}\in\Xi,\ {\bm{\alpha}}\in\mathbb{S}_{\delta}}l_{n}(\bm{\xi}_{1},{\bm{\alpha}})-l_{n}(\bm{\xi}_{0},{\bm{\alpha}}_{0})\geq p_{0}\right\}.

By (24) and the fact that $\bm{\xi}^{(k)}\in\Xi$, we have

\left\{H({\bm{\alpha}}^{(k)},\bm{\xi}^{(k)},\bm{\xi}_{0})\geq\epsilon\right\}\cup\left\{{\bm{\alpha}}^{(k)}\not\in\mathbb{S}_{\delta}\right\}\subset\mathcal{G}\cup\left\{{\bm{\alpha}}^{(k)}\not\in\mathbb{S}_{\delta}\right\},

because if ${\bm{\alpha}}^{(k)}\in\mathbb{S}_{\delta}$, combining (24) with $\left\{H({\bm{\alpha}}^{(k)},\bm{\xi}^{(k)},\bm{\xi}_{0})\geq\epsilon\right\}$ implies $\mathcal{G}$. Thus, we conclude that

{\mathbb{P}}\left(\left\{H\left({\bm{\alpha}}^{(k)},{\bm{\xi}}^{(k)},\bm{\xi}_{0}\right)\geq\epsilon\right\}\cup\left\{{\bm{\alpha}}^{(k)}\not\in\mathbb{S}_{\delta}\right\}\right)\leq{\mathbb{P}}^{*}(\mathcal{G})+{\mathbb{P}}\left({\bm{\alpha}}^{(k)}\not\in\mathbb{S}_{\delta}\right)
\leq 5\exp\left(-c_{2}n\epsilon^{2}\right)+1-{\mathbb{P}}\left({\bm{\alpha}}^{(k)}\in\mathbb{S}_{\delta}\right),

which proves this lemma. ∎

At the end of this section, we give the proofs of Lemmas 7, 8 and 10. Before presenting these proofs, we state a bound on the covering numbers of the Euclidean ball, which can be found in Vershynin (2018) (Corollary 4.2.13). Let $\mathcal{N}(\varepsilon,K)$ be the smallest number of closed Euclidean balls with centers in $K$ and radius $\varepsilon$ whose union covers $K$.

Lemma 11.

The covering numbers of the unit Euclidean ball $B_{2}^{p}$ satisfy, for any $\varepsilon>0$,

\left(\frac{1}{\varepsilon}\right)^{p}\leq\mathcal{N}\left(\varepsilon,B_{2}^{p}\right)\leq\left(\frac{2}{\varepsilon}+1\right)^{p}.
Proof of Lemma 7.

Since $\varphi^{1/2}(x;\bm{\xi},{\bm{\alpha}})=\left(\sum_{g=1}^{G}\alpha_{g}f(x;\bm{\theta}_{g})\right)^{1/2}$, the gradient of $\varphi^{1/2}$ can be written as

\nabla\varphi^{1/2}(x;\bm{\xi},{\bm{\alpha}})=2^{-1}\varphi^{-1/2}(x;\bm{\xi},{\bm{\alpha}})\left(\alpha_{1}\nabla_{\bm{\theta}_{1}}f(x;\bm{\theta}_{1}),\ldots,\alpha_{G}\nabla_{\bm{\theta}_{G}}f(x;\bm{\theta}_{G}),f(x;\bm{\theta}_{1}),\ldots,f(x;\bm{\theta}_{G})\right).

By the mean value theorem and Cauchy's inequality, we have

\left|\varphi^{1/2}(x;\bm{\xi}_{1},{\bm{\alpha}}_{1})-\varphi^{1/2}(x;\bm{\xi}_{2},{\bm{\alpha}}_{2})\right|\leq\left\|\nabla\varphi^{1/2}(x;\check{\bm{\xi}}(x),\check{{\bm{\alpha}}}(x))\right\|_{2}\left\|(\bm{\xi}_{1},{\bm{\alpha}}_{1})-(\bm{\xi}_{2},{\bm{\alpha}}_{2})\right\|_{2},

where $\check{\bm{\xi}}(x)=(\check{\bm{\theta}}_{1}(x),\dots,\check{\bm{\theta}}_{G}(x))$ lies between $\bm{\xi}_{1}$ and $\bm{\xi}_{2}$, and $\check{{\bm{\alpha}}}(x)$ lies between ${\bm{\alpha}}_{1}$ and ${\bm{\alpha}}_{2}$. Since $\check{\bm{\theta}}_{g}(x)\in\Theta$ ($g=1,\dots,G$) and $\check{{\bm{\alpha}}}(x)\in\mathbb{S}_{\delta}$, it follows that

\left\|\nabla\varphi^{1/2}(x;\check{\bm{\xi}}(x),\check{{\bm{\alpha}}}(x))\right\|_{2}^{2}=\sum_{g=1}^{G}\frac{\check{\alpha}_{g}^{2}\left\|\nabla_{\bm{\theta}_{g}}f(x;\check{\bm{\theta}}_{g}(x))\right\|_{2}^{2}+f^{2}(x;\check{\bm{\theta}}_{g}(x))}{4\varphi(x,\check{\bm{\xi}}(x),\check{{\bm{\alpha}}}(x))}
\leq\sum_{g=1}^{G}\frac{\left\|\nabla_{\bm{\theta}_{g}}f(x;\check{\bm{\theta}}_{g}(x))\right\|_{2}^{2}+f^{2}(x;\check{\bm{\theta}}_{g}(x))}{4\check{\alpha}_{g}f(x;\check{\bm{\theta}}_{g}(x))}
\leq\frac{G}{4\delta}r^{2}(x),

where $r(x)$ is defined in Condition (C3). Thus, we have

\left|\varphi^{1/2}(x;\bm{\xi}_{1},{\bm{\alpha}}_{1})-\varphi^{1/2}(x;\bm{\xi}_{2},{\bm{\alpha}}_{2})\right|\leq\sqrt{\frac{G}{4\delta}}r(x)\left\|(\bm{\xi}_{1},{\bm{\alpha}}_{1})-(\bm{\xi}_{2},{\bm{\alpha}}_{2})\right\|_{2},

which proves this lemma. ∎

Proof of Lemma 8.

Let $a=\sqrt{\frac{G}{4\delta}}$. We use brackets of the type

\left[\left\{\left(\varphi^{1/2}(x;\bm{\xi},{\bm{\alpha}})-ar(x)\epsilon\right)_{+}\right\}^{2},\left(\varphi^{1/2}(x;\bm{\xi},{\bm{\alpha}})+ar(x)\epsilon\right)^{2}\right],

for $(\bm{\xi},{\bm{\alpha}})$ ranging over a suitably chosen subset of $\Xi\times\mathbb{S}_{\delta}$. First, these brackets are of size no greater than $aM\epsilon$, because

\left[2^{-1}\int\left(\varphi^{1/2}(x;\bm{\xi},{\bm{\alpha}})+ar(x)\epsilon-\left(\varphi^{1/2}(x;\bm{\xi},{\bm{\alpha}})-ar(x)\epsilon\right)_{+}\right)^{2}{\rm d}x\right]^{1/2}
\leq\left[2^{-1}\int\left(\varphi^{1/2}(x;\bm{\xi},{\bm{\alpha}})+ar(x)\epsilon-\left(\varphi^{1/2}(x;\bm{\xi},{\bm{\alpha}})-ar(x)\epsilon\right)\right)^{2}{\rm d}x\right]^{1/2}
\leq\left[2^{-1}\int 2a^{2}r^{2}(x)\epsilon^{2}{\rm d}x\right]^{1/2}\leq a\epsilon M,

where $\left[\int r^{2}(x){\rm d}x\right]^{1/2}\leq M$ is from Condition (C3). If $(\bm{\xi},{\bm{\alpha}})$ ranges over a grid of mesh-width $\epsilon$ over $\Xi\times\mathbb{S}_{\delta}$, then the brackets cover $\mathcal{P}^{G}_{\mathbb{S}_{\delta}}$. This is because, by Lemma 7,

\left\{\left(\varphi^{1/2}(x;\bm{\xi},{\bm{\alpha}})-ar(x)\epsilon\right)_{+}\right\}^{2}\leq\varphi(x;\bm{\xi}_{1},{\bm{\alpha}}_{1})\leq\left(\varphi^{1/2}(x;\bm{\xi},{\bm{\alpha}})+ar(x)\epsilon\right)^{2},

provided that $\left\|(\bm{\xi}_{1},{\bm{\alpha}}_{1})-(\bm{\xi},{\bm{\alpha}})\right\|_{2}\leq\epsilon$. Therefore, the smallest number of brackets of size $\epsilon$ whose union covers $\mathcal{P}^{G}_{\mathbb{S}_{\delta}}$ is no more than the smallest number of balls with radius $(aM)^{-1}\epsilon$ whose union covers $\Xi\times\mathbb{S}_{\delta}$. Since $\Xi\times\mathbb{S}_{\delta}$ is a compact set, by Lemma 11, we have

\mathcal{H}(u,\mathcal{P}^{G}_{\mathbb{S}_{\delta}})\leq G(d+1)\log\left(1+\frac{2aM{\rm diam}(\Xi\times\mathbb{S}_{\delta})}{u}\right),

which proves the lemma. ∎

Proof of Lemma 10.

Clearly, when 2ϵϵ2/28\sqrt{2}\epsilon\leq\epsilon^{2}/2^{8}, i.e., ϵ282\epsilon\geq 2^{8}\sqrt{2}, (23) holds. We now assume ϵ282\epsilon\leq 2^{8}\sqrt{2} and thus 2ϵ29\sqrt{2}\epsilon\leq 2^{9}. Let a1=2aMdiam(Ξ×𝕊δ)a_{1}=2aM{\rm diam}(\Xi\times\mathbb{S}_{\delta}). Then, by Lemma 8, we have

(u,𝒫𝕊δG)G(d+1)log(1+a1(e1)29u)G(d+1)log(2(a1(e1)29)u),\mathcal{H}(u,\mathcal{P}^{G}_{\mathbb{S}_{\delta}})\leq G(d+1){\rm log}\left(1+\frac{a_{1}\vee(e-1)2^{9}}{u}\right)\leq G(d+1){\rm log}\left(\frac{2(a_{1}\vee(e-1)2^{9})}{u}\right),

when u29u\leq 2^{9}. Let a2=2(a1(e1)29)a_{2}=2(a_{1}\vee(e-1)2^{9}) and a3=G(d+1)a_{3}=G(d+1). Thus, we have

ϵ2/282ϵ1/2(u/c3,𝒫𝕊δG)𝑑u\displaystyle\int_{\epsilon^{2}/2^{8}}^{\sqrt{2}\epsilon}\mathcal{H}^{1/2}\left(u/c_{3},\mathcal{P}^{G}_{\mathbb{S}_{\delta}}\right)du a3ϵ2/282ϵ{log(a2u)}1/2𝑑u\displaystyle\leq a_{3}\int_{\epsilon^{2}/2^{8}}^{\sqrt{2}\epsilon}\left\{{\rm log}\left(\frac{a_{2}}{u}\right)\right\}^{1/2}du
a3ϵ2/282ϵlog(a2u)𝑑u,(since log(a2/u)1)\displaystyle\leq a_{3}\int_{\epsilon^{2}/2^{8}}^{\sqrt{2}\epsilon}{\rm log}\left(\frac{a_{2}}{u}\right)du,\ \ \ \ (\mbox{since }{\rm log}\left({a_{2}}/{u}\right)\geq 1)
=a3a2ϵ2/(a228)2ϵ/a2log(t)dt,(let t=u/a2)\displaystyle=a_{3}a_{2}\int_{\epsilon^{2}/(a_{2}2^{8})}^{\sqrt{2}\epsilon/a_{2}}-{\rm log}\left(t\right)dt,\ \ \ \ (\mbox{let }t=u/a_{2})
=(ttlogt|ϵ2/(a228)2ϵ/a2)a3a2.\displaystyle=\left(t-t\hbox{log}t\bigg{|}_{\epsilon^{2}/(a_{2}2^{8})}^{\sqrt{2}\epsilon/a_{2}}\right)a_{3}a_{2}.

When ϵ282\epsilon\leq 2^{8}\sqrt{2}, we have

ϵ2a228217(e1)218<e.\frac{\epsilon^{2}}{a_{2}2^{8}}\leq\frac{2^{17}}{(e-1)2^{18}}<e.

Write ϕ(t)=ttlogt\phi(t)=t-t\hbox{log}t. Using the fact that ϕ(t)0\phi(t)\geq 0 when tet\leq e, we have ϕ(ϵ2/(a228))>0\phi\left(\epsilon^{2}/(a_{2}2^{8})\right)>0. It follows that

(ttlogt|ϵ2/(a228)2ϵ/a2)a3a2a32ϵ(a32ϵ)log{2ϵ/a2}.\left(t-t\hbox{log}t\bigg{|}_{\epsilon^{2}/(a_{2}2^{8})}^{\sqrt{2}\epsilon/a_{2}}\right)a_{3}a_{2}\leq{a_{3}\sqrt{2}\epsilon}-\left({a_{3}\sqrt{2}\epsilon}\right)\hbox{log}\{{\sqrt{2}\epsilon/a_{2}}\}.

Therefore, we conclude that there exist two constants c′>0 and c′′ such that

ϵ2/282ϵ1/2(u/c3,𝒫𝕊δG)𝑑uϵ(c′′clogϵ).\int_{\epsilon^{2}/2^{8}}^{\sqrt{2}\epsilon}\mathcal{H}^{1/2}\left(u/c_{3},\mathcal{P}^{G}_{\mathbb{S}_{\delta}}\right)du\leq\epsilon\left(c^{\prime\prime}-c^{\prime}\hbox{log}\ \epsilon\right).

In order to ensure that (23) holds, we only need that

c′′clogϵc4n1/2ϵ.c^{\prime\prime}-c^{\prime}\hbox{log}\ \epsilon\leq c_{4}n^{1/2}\epsilon. (25)

It is clear that we can choose a sufficiently large c>0c>0 such that when n2n\geq 2, for any ϵn1/2logn\epsilon\geq n^{-1/2}{\rm log}n, (25) holds. The proof is complete. ∎

A.2.3 Proofs of Lemma 3

To prove Lemma 3, we first prove Lemma 2. Recall that

(k+1)={ming=1,,Gαg(k+1)ming=1,,Gαg(k)(12K)}.\mathcal{E}^{(k+1)}=\left\{\min_{g=1,\dots,G}{\alpha}_{g}^{(k+1)}\geq\min_{g=1,\dots,G}{\alpha}_{g}^{(k)}\left(1-\frac{2}{K}\right)\right\}. (26)

The following lemma gives the definition of ΔK\Delta_{K}.

Lemma 12.

For all 𝜽₀∈Θ, there exists a constant Δ_K>0 such that

𝔼(inf𝜽𝜽02ΔKf(x;𝜽)sup𝜽𝜽02ΔKf(x;𝜽))11K.{\mathbb{E}}\left(\frac{\inf_{\|\bm{\theta}-\bm{\theta}_{0}\|_{2}\leq\Delta_{K}}f(x;\bm{\theta})}{\sup_{\|\bm{\theta}-\bm{\theta}_{0}\|_{2}\leq\Delta_{K}}f(x;\bm{\theta})}\right)\geq 1-\frac{1}{K}.
Proof of Lemma 12.

Observe that for any h>0h>0, we have

0inf𝜽0Θinf𝜽𝜽02hf(x;𝜽)sup𝜽𝜽02hf(x;𝜽)1.0\leq\inf_{\bm{\theta}_{0}\in\Theta}\frac{\inf_{\|\bm{\theta}-\bm{\theta}_{0}\|_{2}\leq h}f(x;\bm{\theta})}{\sup_{\|\bm{\theta}-\bm{\theta}_{0}\|_{2}\leq h}f(x;\bm{\theta})}\leq 1.

By the dominated convergence theorem, the compactness of Θ\Theta and the continuity of f(x;𝜽)f(x;\bm{\theta}), we have

limh0𝔼(inf𝜽0Θinf𝜽𝜽02hf(x;𝜽)sup𝜽𝜽02hf(x;𝜽))=𝔼(limh0inf𝜽0Θinf𝜽𝜽02hf(x;𝜽)sup𝜽𝜽02hf(x;𝜽))=1.\lim_{h\rightarrow 0}{\mathbb{E}}\left(\inf_{\bm{\theta}_{0}\in\Theta}\frac{\inf_{\|\bm{\theta}-\bm{\theta}_{0}\|_{2}\leq h}f(x;\bm{\theta})}{\sup_{\|\bm{\theta}-\bm{\theta}_{0}\|_{2}\leq h}f(x;\bm{\theta})}\right)={\mathbb{E}}\left(\lim_{h\rightarrow 0}\inf_{\bm{\theta}_{0}\in\Theta}\frac{\inf_{\|\bm{\theta}-\bm{\theta}_{0}\|_{2}\leq h}f(x;\bm{\theta})}{\sup_{\|\bm{\theta}-\bm{\theta}_{0}\|_{2}\leq h}f(x;\bm{\theta})}\right)=1.

Therefore, there is ΔK>0\Delta_{K}>0 such that the inequality in Lemma 12 holds. ∎

Before proving Lemma 2, we state the following Hoeffding inequality, which can be found in Vershynin (2018) (Theorem 2.2.6).

Lemma 13.

Let X1,,XNX_{1},\ldots,X_{N} be independent random variables. Assume that Xi[mi,Mi]X_{i}\in\left[m_{i},M_{i}\right] for every ii. Then, for any t>0t>0, we have

{i=1N(Xi𝔼Xi)t}exp(2t2i=1N(Mimi)2).\mathbb{P}\left\{\sum_{i=1}^{N}\left(X_{i}-\mathbb{E}X_{i}\right)\geq t\right\}\leq\exp\left(-\frac{2t^{2}}{\sum_{i=1}^{N}\left(M_{i}-m_{i}\right)^{2}}\right).
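As a quick numerical illustration of Lemma 13 (it plays no role in the proofs), the following minimal Python sketch compares the empirical tail probability of a centered sum of bounded variables with the Hoeffding bound; the Uniform(0,1) distribution, the sample size and the threshold are arbitrary illustrative choices.

```python
import numpy as np

# Minimal Monte Carlo sketch of Hoeffding's inequality (Lemma 13) for
# i.i.d. Uniform(0,1) variables, so that m_i = 0 and M_i = 1 for every i.
rng = np.random.default_rng(0)
n, reps, t = 200, 20_000, 10.0

x = rng.uniform(0.0, 1.0, size=(reps, n))
dev = x.sum(axis=1) - n * 0.5            # sum of (X_i - E[X_i]) per replicate
emp = (dev >= t).mean()                  # empirical tail probability
bound = np.exp(-2.0 * t**2 / n)          # Hoeffding bound: sum (M_i - m_i)^2 = n

print(f"empirical tail = {emp:.5f}, Hoeffding bound = {bound:.5f}")
```

The bound is conservative here (roughly exp(−1)≈0.37 against a much smaller empirical tail), which is consistent with its use in the proofs, where only an upper bound on the tail probability is needed.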
Proof of Lemma 2.

Recall that p(𝜶)=λ(g=1Glog(αg)+GlogG)p({\bm{\alpha}})=\lambda\left(\sum_{g=1}^{G}\hbox{log}(\alpha_{g})+G\hbox{log}G\right) and

wgi(k)=αg(k)f(xi;𝜽g(k))φ(xi;𝝃(k),𝜶(k)).w_{gi}^{(k)}=\frac{\alpha^{(k)}_{g}f\left(x_{i};\bm{\theta}_{g}^{(k)}\right)}{\varphi\left(x_{i};\bm{\xi}^{(k)},{\bm{\alpha}}^{(k)}\right)}. (27)

Then, the update of 𝜶{\bm{\alpha}} can be written as

αg(k+1)=i=1nwgi(k)+λn+Gλ,{\alpha}_{g}^{(k+1)}=\frac{\sum_{i=1}^{n}{w}_{gi}^{(k)}+\lambda}{n+G\lambda},

which is a convex combination of n^{−1}∑_{i=1}^{n} w_{gi}^{(k)} and G^{−1}, shrinking n^{−1}∑_{i=1}^{n} w_{gi}^{(k)} towards G^{−1} (see the identity below). Thus, we conclude that

ming=1,,Gαg(k+1)ming=1,,Gn1i=1nwgi(k).\min_{g=1,\ldots,G}{\alpha}_{g}^{(k+1)}\geq\min_{g=1,\ldots,G}n^{-1}{\sum_{i=1}^{n}{w}_{gi}^{(k)}}. (28)
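To see (28), note that the update above can be written out as the convex combination

{\alpha}_{g}^{(k+1)}=\frac{n}{n+G\lambda}\cdot\left(n^{-1}\sum_{i=1}^{n}{w}_{gi}^{(k)}\right)+\frac{G\lambda}{n+G\lambda}\cdot G^{-1},

and, since ∑_{g=1}^{G} w_{gi}^{(k)}=1 for every i, we have ∑_{g=1}^{G} n^{−1}∑_{i=1}^{n} w_{gi}^{(k)}=1 and hence G^{−1} ≥ min_{g} n^{−1}∑_{i=1}^{n} w_{gi}^{(k)}. Replacing both terms of the convex combination by this minimum gives (28).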

Thus, we only need to bound ming=1,,Gn1i=1nwgi(k)\min_{g=1,\dots,G}{n^{-1}\sum_{i=1}^{n}{w}_{gi}^{(k)}}. Let

Ti=inf𝜽𝜽02ΔKf(xi;𝜽)sup𝜽𝜽02ΔKf(xi;𝜽),T_{i}=\frac{\inf_{\|\bm{\theta}-\bm{\theta}_{0}\|_{2}\leq\Delta_{K}}f(x_{i};\bm{\theta})}{\sup_{\|\bm{\theta}-\bm{\theta}_{0}\|_{2}\leq\Delta_{K}}f(x_{i};\bm{\theta})},

and 𝒦=𝒮ϵ(k)\mathcal{K}=\mathcal{S}^{(k)}_{\epsilon}\cap\mathcal{B}, where ΔK\Delta_{K} is as defined in Lemma 12, and Sϵ(k)S_{\epsilon}^{(k)} is defined in (16). Since L2ΔK2ϵL_{2}\Delta_{K}^{2}\geq\epsilon, on 𝒦\mathcal{K}, we have 𝝃(k)𝝃02ΔK\left\|{\bm{\xi}}^{(k)}-\bm{\xi}_{0}\right\|_{2}\leq\Delta_{K}. It follows that on 𝒦\mathcal{K}, wgi(k)αg(k)Ti{w}_{gi}^{(k)}\geq{\alpha}_{g}^{(k)}T_{i}. Thus, we conclude that

{n1i=1nTi12K1}𝒦\displaystyle\quad\left\{n^{-1}{\sum_{i=1}^{n}T_{i}}\geq 1-2K^{-1}\right\}\cap\mathcal{K}
{ming=1,,Gn1i=1nwgi(k)ming=1,,Gαg(k)(12K1)}𝒦\displaystyle\subset\left\{\min_{g=1,\ldots,G}n^{-1}{\sum_{i=1}^{n}{w}_{gi}^{(k)}}\geq\min_{g=1,\ldots,G}{\alpha}_{g}^{(k)}\left(1-2K^{-1}\right)\right\}\cap\mathcal{K}
(k+1)𝒦,\displaystyle\subset\mathcal{E}^{(k+1)}\cap\mathcal{K}, (29)

where (k+1)\mathcal{E}^{(k+1)} is defined in (26).

Then, it suffices to bound the probability (n1i=1nTi12K1).{\mathbb{P}}\left(n^{-1}{\sum_{i=1}^{n}T_{i}}\geq 1-2K^{-1}\right). Note that 0Ti10\leq T_{i}\leq 1. Hence, by Lemma 13, we have

(n1|i=1nTi𝔼(Ti)|K1)2exp(2n/K2).{\mathbb{P}}\left(n^{-1}\left|{\sum_{i=1}^{n}T_{i}-{\mathbb{E}}(T_{i})}\right|\geq K^{-1}\right)\leq 2\exp\left({-2n}/{K^{2}}\right).

By Lemma 12, we have 1𝔼(Ti)1K1,1\geq{\mathbb{E}}(T_{i})\geq 1-K^{-1}, and thus

(n1i=1nTi12K1)12exp(2n/K2).{\mathbb{P}}\left(n^{-1}{\sum_{i=1}^{n}T_{i}}\geq 1-2K^{-1}\right)\geq 1-2\exp\left({-2n}/{K^{2}}\right).

Applying (28) and (29), we have

((k+1)𝒦)\displaystyle{\mathbb{P}}\left(\mathcal{E}^{(k+1)}\cap\mathcal{K}\right) (n1i=1nTi12K1)+(𝒦)1\displaystyle\geq\mathbb{P}\left(n^{-1}{\sum_{i=1}^{n}T_{i}}\geq 1-2K^{-1}\right)+\mathbb{P}\left(\mathcal{K}\right)-1
(𝒦)2exp(2n/K2),\displaystyle\geq\mathbb{P}\left(\mathcal{K}\right)-2\exp\left({-2n}/{K^{2}}\right),

and Lemma 2 is proved. ∎

Finally, combining Lemma 1 with Lemma 2, we can prove Lemma 3.

Proof of Lemma 3.

Recall the definitions of 𝕊_δ, 𝒮_ϵ^{(k)} and ℰ^{(k+1)} in (15), (16) and (26). For 0≤k≤K, we define

(k)={ming=1,,Gαg(k)ming=1,,G(12K)kαg(0)}.\mathcal{B}^{(k)}=\left\{\min_{g=1,\ldots,G}\alpha^{(k)}_{g}\geq\min_{g=1,\ldots,G}\left(1-\frac{2}{K}\right)^{k}{\alpha}_{g}^{(0)}\right\}. (30)

It is clear that for any 0kK0\leq k\leq K, (k){𝜶(k)𝕊δ}\mathcal{B}^{(k)}\subset\left\{{\bm{\alpha}}^{(k)}\in\mathbb{S}_{\delta}\right\} because (12/K)k271(1-2/K)^{k}\geq 27^{-1} for K3K\geq 3. We aim to prove a stronger result

(𝒮ϵ(k)(k))15(k+1)exp(c2nϵ2)2kexp(2nK2),0kK.{\mathbb{P}}\left(\mathcal{S}_{\epsilon}^{(k)}\cap\mathcal{B}^{(k)}\right)\geq 1-5(k+1)\exp\left(-c_{2}n\epsilon^{2}\right)-2k\exp\left(\frac{-2n}{K^{2}}\right),0\leq k\leq K. (31)

We use mathematical induction to prove (31). We first give the proof for the case k=0k=0. It is clear that 𝒮ϵ(0)(0)=𝒮ϵ(0).\mathcal{S}_{\epsilon}^{(0)}\cap\mathcal{B}^{(0)}=\mathcal{S}_{\epsilon}^{(0)}. Since 𝜶(0)𝕊δ{\bm{\alpha}}^{(0)}\in\mathbb{S}_{\delta}, by Lemma 1, we have

(𝒮ϵ(0)(0))15exp(c2nϵ2).{\mathbb{P}}\left(\mathcal{S}_{\epsilon}^{(0)}\cap\mathcal{B}^{(0)}\right)\geq 1-5\exp\left(-c_{2}n\epsilon^{2}\right). (32)

Assume the result holds for some k<K; we now prove it for k+1. On 𝒮_ϵ^{(k)}∩ℬ^{(k)}, since L₂Δ_K²≥ϵ, we have ∥𝝃^{(k)}−𝝃₀∥₂≤Δ_K, i.e., 𝒮_ϵ^{(k)}∩ℬ^{(k)}⊂{∥𝝃^{(k)}−𝝃₀∥₂≤Δ_K}. By the inductive hypothesis, we have

(𝒮ϵ(k)(k))15(k+1)exp(c2nϵ2)2kexp(2nK2).{\mathbb{P}}\left(\mathcal{S}_{\epsilon}^{(k)}\cap\mathcal{B}^{(k)}\right)\geq 1-5(k+1)\exp\left(-c_{2}n\epsilon^{2}\right)-2k\exp\left(\frac{-2n}{K^{2}}\right).

Thus, by Lemma 2, we have

((k+1)𝒮ϵ(k)(k))15(k+1)exp(c2nϵ2)2(k+1)exp(2nK2).{\mathbb{P}}\left(\mathcal{E}^{(k+1)}\cap\mathcal{S}_{\epsilon}^{(k)}\cap\mathcal{B}^{(k)}\right)\geq 1-5(k+1)\exp\left(-c_{2}n\epsilon^{2}\right)-2(k+1)\exp\left(\frac{-2n}{K^{2}}\right).

Note that

(k+1)(k)(k+1),\mathcal{E}^{(k+1)}\cap\mathcal{B}^{(k)}\subset\mathcal{B}^{(k+1)},

because

ming=1,,Gαg(k+1)ming=1,,Gαg(k)(12K)ming=1,,G(12K)k+1αg(0).\min_{g=1,\dots,G}{\alpha}_{g}^{(k+1)}\geq\min_{g=1,\dots,G}{\alpha}_{g}^{(k)}\left(1-\frac{2}{K}\right)\geq\min_{g=1,\ldots,G}\left(1-\frac{2}{K}\right)^{k+1}{\alpha}_{g}^{(0)}.

Thus, we conclude that

({𝜶(k+1)𝕊δ})((k+1))((k+1)𝒮ϵ(k)(k)).{\mathbb{P}}\left(\left\{{\bm{\alpha}}^{(k+1)}\in\mathbb{S}_{\delta}\right\}\right)\geq{\mathbb{P}}\left(\mathcal{B}^{(k+1)}\right)\geq{\mathbb{P}}\left(\mathcal{E}^{(k+1)}\cap\mathcal{S}_{\epsilon}^{(k)}\cap\mathcal{B}^{(k)}\right).

By Lemma 1, we have

(𝒮ϵ(k+1){𝜶(k+1)𝕊δ})\displaystyle{\mathbb{P}}\left(\mathcal{S}^{(k+1)}_{\epsilon}\cap\left\{{\bm{\alpha}}^{(k+1)}\in\mathbb{S}_{\delta}\right\}\right) ({𝜶(k+1)𝕊δ})5exp(c2nϵ2)\displaystyle\geq{\mathbb{P}}\left(\left\{{\bm{\alpha}}^{(k+1)}\in\mathbb{S}_{\delta}\right\}\right)-5\exp\left(-c_{2}n\epsilon^{2}\right)
15(k+2)exp(c2nϵ2)2(k+1)exp(2nK2),\displaystyle\geq 1-5(k+2)\exp\left(-c_{2}n\epsilon^{2}\right)-2(k+1)\exp\left(\frac{-2n}{K^{2}}\right),

and thus we complete the proof. ∎

A.3 Proofs of Theorem 2

In order to derive a tail probability bound for the EM-test statistic, we need the following lemmas.

Lemma 14 (Rosenthal’s inequality).

Suppose that {X_i}_{i=1}^{n} are independent mean-zero random variables satisfying the moment bound ∥X_i∥_{L^{2m}}≤C, 1≤i≤n, for some fixed integer m≥1. Then, we have

{|i=1nXi|nt}2Rm(Cnt)2m,for all t>0,{\mathbb{P}}\left\{\left|\sum_{i=1}^{n}X_{i}\right|\geq nt\right\}\leq 2R_{m}\left(\frac{C}{\sqrt{n}t}\right)^{2m},\ \ \mbox{for all }t>0,

where R_m is a universal constant depending only on m. Further, without the mean-zero assumption (i.e., when 𝔼(X_i) may be nonzero), we have

{|i=1n[Xi𝔼(Xi)]|nt}2Rm(2Cnt)2m,for all t>0.{\mathbb{P}}\left\{\left|\sum_{i=1}^{n}\left[X_{i}-{\mathbb{E}}(X_{i})\right]\right|\geq nt\right\}\leq 2R_{m}\left(\frac{2C}{\sqrt{n}t}\right)^{2m},\ \ \mbox{for all }t>0.
Lemma 15.

Let p,q∈ℕ satisfy p+q≤3. Let g(x;𝜽₀) and m be the same as in Condition (C3). Write R_{pq}(x_i)=g^{p}(x_i;𝜽₀)∥𝐛_i∥₂^{q}, where

𝐛i2=(j=1dYij2+j=1dZij2+j1=1dj2>j1dUij1j22)1/2.\left\|{\bf b}_{i}\right\|_{2}=\left(\sum_{j=1}^{d}Y_{ij}^{2}+\sum_{j=1}^{d}Z_{ij}^{2}+\sum_{j_{1}=1}^{d}\sum_{j_{2}>j_{1}}^{d}U_{ij_{1}j_{2}}^{2}\right)^{1/2}.

Then, under 0\mathbb{H}_{0} and Condition (C3), we have

{|i=1nRpq(xi)|n(1+(d+1)qMp+q)}2Rm(2(d+1)qMp+qn)2m.{\mathbb{P}}\left\{\left|\sum_{i=1}^{n}R_{pq}(x_{i})\right|\geq n(1+(d+1)^{q}M^{p+q})\right\}\leq 2R_{m}\left(\frac{2(d+1)^{q}M^{p+q}}{\sqrt{n}}\right)^{2m}.
Lemma 16.

Let k{3,4}k\in\{3,4\} and j1,,jk{1,,d}j_{1},\dots,j_{k}\in\{1,\dots,d\}. Define

Di(j1,,jk)=(kf(xi;𝜽0)θj1θjk)/(k!f(xi;𝜽0)).D_{i}(j_{1},\dots,j_{k})=\left(\frac{\partial^{k}f(x_{i};\bm{\theta}_{0})}{\partial\theta_{j_{1}}\cdots\partial\theta_{j_{k}}}\right)\bigg{/}(k!f(x_{i};\bm{\theta}_{0})).

Then, under 0\mathbb{H}_{0} and Condition (C3), we have

{\mathbb{P}}\left\{\sum_{j_{1}=1}^{d}\cdots\sum_{j_{k}=1}^{d}\left|\sum_{i=1}^{n}D_{i}(j_{1},\dots,j_{k})\right|<d^{k}n^{5/8}\right\}\geq 1-2d^{k}R_{m}\left(\frac{M}{n^{1/8}}\right)^{2m}.
Proof of Theorem 2.

Recall that EMn(K)=max{Mn(K)(𝜶t),t=1,,T}.{\rm EM}_{n}^{(K)}=\max\left\{M_{n}^{(K)}({\bm{\alpha}}_{t}),t=1,\dots,T\right\}. Without loss of generality, we assume T=1T=1 and EMn(K)=Mn(K)(𝜶(0)).{\rm EM}_{n}^{(K)}=M_{n}^{(K)}\left({\bm{\alpha}}^{(0)}\right). Considering that

Mn(K)(𝜶(0))=2[pln(𝝃(K),𝜶(K))pln(𝝃0,𝜶0)+pln(𝝃0,𝜶0)pln(𝝃^0,𝜶0)],M_{n}^{(K)}\left({\bm{\alpha}}^{(0)}\right)=2\left[pl_{n}\left({\bm{\xi}}^{(K)},{{\bm{\alpha}}}^{(K)}\right)-pl_{n}(\bm{\xi}_{0},{\bm{\alpha}}_{0})+pl_{n}(\bm{\xi}_{0},{\bm{\alpha}}_{0})-pl_{n}\left(\hat{\bm{\xi}}_{0},{\bm{\alpha}}_{0}\right)\right],

we let R_{1n}=2[pl_n(𝝃^{(K)},𝜶^{(K)})−pl_n(𝝃₀,𝜶₀)] and R_{0n}=2[pl_n(𝝃₀,𝜶₀)−pl_n(𝝃̂₀,𝜶₀)]. Since R_{0n}≤0, we have R_{1n}≥M_n^{(K)}(𝜶^{(0)}). Hence, we only consider the R_{1n} term. For notational simplicity, we write 𝜶̄,𝝃̄ in place of 𝜶^{(K)},𝝃^{(K)}.
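Before proceeding, to make the quantities in this proof concrete, here is a minimal, self-contained Python sketch (not the authors' implementation) of the penalized EM iteration and of M_n^{(K)}(𝜶^{(0)}) for a univariate normal location mixture with unit variance, with T=1 as assumed above. The penalty p(𝜶)=λ(∑_g log α_g + G log G) and the 𝜶-update in (27)–(28) follow the text; the number of components G, the penalty weight λ, the number of iterations K and the initial values are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def pen_loglik(x, theta, alpha, lam):
    """Penalized log-likelihood pl_n for a normal location mixture."""
    dens = norm.pdf(x[:, None], loc=theta[None, :])        # n x G densities
    ll = np.log(dens @ alpha).sum()
    return ll + lam * (np.log(alpha).sum() + len(alpha) * np.log(len(alpha)))

def em_test_stat(x, alpha0, K=3, lam=1.0):
    """M_n^(K)(alpha0): K penalized-EM steps vs. the homogeneous fit."""
    G, n = len(alpha0), len(x)
    alpha = alpha0.copy()
    theta = x.mean() + np.linspace(-1.0, 1.0, G)           # illustrative start
    for _ in range(K):
        w = alpha * norm.pdf(x[:, None], loc=theta[None, :])   # E-step weights
        w /= w.sum(axis=1, keepdims=True)
        theta = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)   # M-step: means
        alpha = (w.sum(axis=0) + lam) / (n + G * lam)          # alpha update (27)
    # homogeneous fit: one normal mean, penalty at alpha_0 = (1/G, ..., 1/G) is 0
    hom = pen_loglik(x, np.full(G, x.mean()), np.full(G, 1.0 / G), lam)
    return 2.0 * (pen_loglik(x, theta, alpha, lam) - hom)

rng = np.random.default_rng(1)
x = rng.normal(size=500)              # homogeneous data: statistic should be small
print(em_test_stat(x, np.array([0.5, 0.5])))
```

Under ℍ₀, as in the data generated here, the statistic stays stochastically bounded; Theorem 2 quantifies exactly this tail behavior.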

Next, we focus on the R1nR_{1n} term. Since p(𝜶)p({\bm{\alpha}}) is maximized at 𝜶0{\bm{\alpha}}_{0}, we have

R1n\displaystyle R_{1n} 2{ln(𝝃¯,𝜶¯)ln(𝝃0,𝜶0)}\displaystyle\leq 2\left\{l_{n}(\bar{\bm{\xi}},\bar{{\bm{\alpha}}})-l_{n}(\bm{\xi}_{0},{\bm{\alpha}}_{0})\right\}
=2i=1nlog(1+g=1Gα¯g(f(xi;𝜽¯g)f(xi;𝜽0)1))\displaystyle=2\sum_{i=1}^{n}\hbox{log}\left(1+\sum_{g=1}^{G}\bar{\alpha}_{g}\left(\frac{f(x_{i};\bar{\bm{\theta}}_{g})}{f(x_{i};{\bm{\theta}}_{0})}-1\right)\right)
=i=1n2log(1+δi),\displaystyle=\sum_{i=1}^{n}2\hbox{log}(1+\delta_{i}),

where δi=g=1Gα¯g(f(xi;𝜽¯g)f(xi;𝜽0)1)\delta_{i}=\sum_{g=1}^{G}\bar{\alpha}_{g}\left(\frac{f(x_{i};\bar{\bm{\theta}}_{g})}{f(x_{i};{\bm{\theta}}_{0})}-1\right). Applying the inequality log(1+x)xx2/2+x3/3\hbox{log}(1+x)\leq x-x^{2}/2+x^{3}/3, we have

R1ni=1n2log(1+δi)2i=1nδii=1nδi2+(2/3)i=1nδi3,R_{1n}\leq\sum_{i=1}^{n}2\hbox{log}\left(1+\delta_{i}\right)\leq 2\sum_{i=1}^{n}\delta_{i}-\sum_{i=1}^{n}\delta_{i}^{2}+(2/3)\sum_{i=1}^{n}\delta_{i}^{3}, (33)

Let

𝐦¯=𝐦(𝜶¯,𝝃¯,𝝃0),\bar{{\bf m}}={\bf m}(\bar{{\bm{\alpha}}},\bar{\bm{\xi}},\bm{\xi}_{0}),

where 𝐦{\bf m} is defined in (14). Define

εin=δi𝐦¯T𝐛i.\varepsilon_{in}=\delta_{i}-\bar{{\bf m}}^{{\rm T}}{\bf b}_{i}. (34)

Plugging (34) into (33), we have

2i=1nδi=2i=1n𝐦¯T𝐛i+2i=1nεin.2\sum_{i=1}^{n}\delta_{i}=2\sum_{i=1}^{n}\bar{{\bf m}}^{{\rm T}}{\bf b}_{i}+2\sum_{i=1}^{n}\varepsilon_{in}.

and

i=1nδi2i=1n𝐦¯T𝐛i𝐛iT𝐦¯2i=1n𝐦¯T𝐛iεin,-\sum_{i=1}^{n}\delta^{2}_{i}\leq-\sum_{i=1}^{n}\bar{{\bf m}}^{{\rm T}}{\bf b}_{i}{\bf b}_{i}^{\rm T}\bar{{\bf m}}-2\sum_{i=1}^{n}\bar{{\bf m}}^{{\rm T}}{\bf b}_{i}\varepsilon_{in},

because i=1nεin20\sum_{i=1}^{n}\varepsilon_{in}^{2}\geq 0. Therefore, (33) can be rewritten as

R1n\displaystyle R_{1n} 2i=1n𝐦¯T𝐛ii=1n𝐦¯T𝐛i𝐛iT𝐦¯+2i=1nεin\displaystyle\leq 2\sum_{i=1}^{n}\bar{{\bf m}}^{{\rm T}}{\bf b}_{i}-\sum_{i=1}^{n}\bar{{\bf m}}^{{\rm T}}{\bf b}_{i}{\bf b}_{i}^{\rm T}\bar{{\bf m}}+2\sum_{i=1}^{n}\varepsilon_{in} (35)
2i=1n𝐦¯T𝐛iεin+(2/3)i=1nδi3.\displaystyle\quad-2\sum_{i=1}^{n}\bar{{\bf m}}^{{\rm T}}{\bf b}_{i}\varepsilon_{in}+(2/3)\sum_{i=1}^{n}\delta_{i}^{3}.

Our next goal is to control the term 2∑_{i=1}^{n}ε_{in}−2∑_{i=1}^{n}𝐦̄^{T}𝐛_iε_{in}+(2/3)∑_{i=1}^{n}δ_i³, which we do in three steps.

Step 1: In the first step, we bound the term 2∑_{i=1}^{n}ε_{in}. By a fifth-order Taylor expansion with Lagrange remainder, ε_{in} can be written exactly as

εin\displaystyle\varepsilon_{in} =j1=1dj3=1dg=1Gα¯gs=13(θ¯gjsθ0js)(3f(xi;𝜽0)θj1θj2θj3)/(3!f(xi;𝜽0))\displaystyle=\sum_{j_{1}=1}^{d}\cdots\sum_{j_{3}=1}^{d}\sum_{g=1}^{G}\bar{\alpha}_{g}\prod_{s=1}^{3}\left(\bar{\theta}_{gj_{s}}-{\theta}_{0j_{s}}\right)\left(\frac{\partial^{3}f(x_{i};\bm{\theta}_{0})}{\partial\theta_{j_{1}}\partial\theta_{j_{2}}\partial\theta_{j_{3}}}\right)\bigg{/}(3!f(x_{i};\bm{\theta}_{0}))
+j1=1dj4=1dg=1Gα¯gs=14(θ¯gjsθ0js)(4f(xi;𝜽0)θj1θj2θj3θj4)/(4!f(xi;𝜽0))\displaystyle+\sum_{j_{1}=1}^{d}\cdots\sum_{j_{4}=1}^{d}\sum_{g=1}^{G}\bar{\alpha}_{g}\prod_{s=1}^{4}\left(\bar{\theta}_{gj_{s}}-{\theta}_{0j_{s}}\right)\left(\frac{\partial^{4}f(x_{i};\bm{\theta}_{0})}{\partial\theta_{j_{1}}\partial\theta_{j_{2}}\partial\theta_{j_{3}}\partial\theta_{j_{4}}}\right)\bigg{/}(4!f(x_{i};\bm{\theta}_{0}))
+j1=1dj5=1dg=1Gα¯gs=15(θ¯gjsθ0js)(5f(xi;𝜻g(xi))θj1θj2θj3θj4θj5)/(5!f(xi;𝜽0))\displaystyle+\sum_{j_{1}=1}^{d}\cdots\sum_{j_{5}=1}^{d}\sum_{g=1}^{G}\bar{\alpha}_{g}\prod_{s=1}^{5}\left(\bar{\theta}_{gj_{s}}-{\theta}_{0j_{s}}\right)\left(\frac{\partial^{5}f(x_{i};{\bm{\zeta}}_{g}(x_{i}))}{\partial\theta_{j_{1}}\partial\theta_{j_{2}}\partial\theta_{j_{3}}\partial\theta_{j_{4}}\partial\theta_{j_{5}}}\right)\bigg{/}(5!f(x_{i};\bm{\theta}_{0}))
=I+II+III,\displaystyle={\rm I+II+III},

where 𝜻g(xi){\bm{\zeta}}_{g}(x_{i}) lies between 𝜽¯g\bar{\bm{\theta}}_{g} and 𝜽0\bm{\theta}_{0}. Take ϵ=cn11/24lognL2Δ2\epsilon=cn^{-11/24}\hbox{log}n\wedge L_{2}\Delta^{2}, where Δ=ΔKτ\Delta=\Delta_{K}\wedge\tau. Let

𝒜1={𝝃¯𝝃022<ϵL2,𝐦¯2<ϵL1}.\mathcal{A}_{1}=\left\{\left\|\bar{\bm{\xi}}-\bm{\xi}_{0}\right\|_{2}^{2}<\frac{\epsilon}{L_{2}},\left\|\bar{\bf{m}}\right\|_{2}<\frac{\epsilon}{L_{1}}\right\}.

By Lemma 3, when nn is large enough such that

L2Δ2max(cn1/2logn,c11/2p0n1/2),L_{2}\Delta^{2}\geq\max\left(cn^{-1/2}{\rm log}n,c_{1}^{-1/2}\sqrt{-p_{0}}n^{-1/2}\right),

we have

(𝒜1)15(K+1)exp(c2nϵ2)2Kexp(2nK2).{\mathbb{P}}(\mathcal{A}_{1})\geq 1-5(K+1)\exp\left(-c_{2}n\epsilon^{2}\right)-2K\exp\left(\frac{-2n}{K^{2}}\right). (36)

On 𝒜1\mathcal{A}_{1}, we have

𝝃¯𝝃02<(cL2)n11/48log1/2n and 𝐦¯2<cL1n11/24logn.\left\|\bar{\bm{\xi}}-\bm{\xi}_{0}\right\|_{2}<\left(\sqrt{\frac{{c}}{L_{2}}}\right)n^{-11/48}\hbox{log}^{1/2}n\mbox{ and }\left\|\bar{\bf{m}}\right\|_{2}<\frac{c}{L_{1}}n^{-11/24}\hbox{log}n.

For fixed j1,j2,j3j_{1},j_{2},j_{3}, by Lemma 5, we have

g=1Gα¯gs=13|(θ¯gjsθ0js)|g=1Gα¯g(d)3𝜽¯g𝜽023(d)3𝝃¯𝝃023.\sum_{g=1}^{G}\bar{\alpha}_{g}\prod_{s=1}^{3}\left|\left(\bar{\theta}_{gj_{s}}-{\theta}_{0j_{s}}\right)\right|\leq\sum_{g=1}^{G}\bar{\alpha}_{g}\left(\sqrt{d}\right)^{3}\left\|\bar{\bm{\theta}}_{g}-{\bm{\theta}}_{0}\right\|_{2}^{3}\leq(\sqrt{d})^{3}\left\|\bar{\bm{\xi}}-\bm{\xi}_{0}\right\|_{2}^{3}.

Let

𝒜2={j1=1dj3=1d|i=1n(3f(xi;𝜽0)θj1θj2θj3)/(3!f(xi;𝜽0))|<d3n5/8}.\mathcal{A}_{2}=\left\{\sum_{j_{1}=1}^{d}\cdots\sum_{j_{3}=1}^{d}\left|\sum_{i=1}^{n}\left(\frac{\partial^{3}f(x_{i};\bm{\theta}_{0})}{\partial\theta_{j_{1}}\partial\theta_{j_{2}}\partial\theta_{j_{3}}}\right)\bigg{/}(3!f(x_{i};\bm{\theta}_{0}))\right|<d^{3}n^{5/8}\right\}.

Then, on 𝒜1𝒜2\mathcal{A}_{1}\cap\mathcal{A}_{2}, we have

|I|<d9/2(cL2)3n1/16log3/2n.{\rm|I|}<d^{9/2}\left(\sqrt{\frac{{c}}{L_{2}}}\right)^{3}n^{-1/16}\hbox{log}^{3/2}n.

By Lemma 16, it follows that

P(𝒜2)12d3Rm(Mn1/8)2m.P(\mathcal{A}_{2})\geq 1-2d^{3}R_{m}\left(\frac{M}{n^{1/8}}\right)^{2m}.

Similarly, we let

\mathcal{A}_{3}=\left\{\sum_{j_{1}=1}^{d}\cdots\sum_{j_{4}=1}^{d}\left|\sum_{i=1}^{n}\left(\frac{\partial^{4}f(x_{i};\bm{\theta}_{0})}{\partial\theta_{j_{1}}\partial\theta_{j_{2}}\partial\theta_{j_{3}}\partial\theta_{j_{4}}}\right)\bigg{/}(4!f(x_{i};\bm{\theta}_{0}))\right|<d^{4}n^{5/8}\right\}.

On 𝒜1𝒜3\mathcal{A}_{1}\cap\mathcal{A}_{3} we have

|II|<d6(cL2)4n7/24log2n.{\rm|II|}<d^{6}\left(\sqrt{\frac{{c}}{L_{2}}}\right)^{4}n^{-7/24}\hbox{log}^{2}n.

By Lemma 16, it follows that

P(𝒜3)12d4Rm(Mn1/8)2m.P(\mathcal{A}_{3})\geq 1-2d^{4}R_{m}\left(\frac{M}{n^{1/8}}\right)^{2m}.

Finally, let g(x;𝜽0)g(x;\bm{\theta}_{0}) be the function defined in Condition (C3). Then, for fixed j1,,j5j_{1},\dots,j_{5},

sup𝜽𝜽02τ|(5f(x;𝜽)θj1θj2θj3θj4θj5)/(5!f(x;𝜽0))|g(x;𝜽0).\sup_{\|\bm{\theta}-\bm{\theta}_{0}\|_{2}\leq\tau}\left|\left(\frac{\partial^{5}f(x;{\bm{\theta}})}{\partial\theta_{j_{1}}\partial\theta_{j_{2}}\partial\theta_{j_{3}}\partial\theta_{j_{4}}\partial\theta_{j_{5}}}\right)/(5!f(x;\bm{\theta}_{0}))\right|\leq g(x;\bm{\theta}_{0}).

By Lemma 15, we have

(|i=1ng(xi)|n(1+M))2Rm(4M2n)m.{\mathbb{P}}\left(\left|\sum_{i=1}^{n}g(x_{i})\right|\geq n(1+M)\right)\leq 2R_{m}\left(\frac{4M^{2}}{n}\right)^{m}.

Let

𝒜4={|i=1nd5g(xi)|<d5n(1+M)}.\mathcal{A}_{4}=\left\{\left|\sum_{i=1}^{n}d^{5}g(x_{i})\right|<d^{5}n(1+M)\right\}.

Then, we have on 𝒜1𝒜4\mathcal{A}_{1}\cap\mathcal{A}_{4}

|III|<d15/2(cL2)5(1+M)n7/48log5/2n,{\rm|III|}<d^{15/2}\left(\sqrt{\frac{{c}}{L_{2}}}\right)^{5}(1+M)n^{-7/48}\hbox{log}^{5/2}n,

and the probability is at least

(𝒜4)12Rm(4M2n)m.{\mathbb{P}}(\mathcal{A}_{4})\geq 1-2R_{m}\left(\frac{4M^{2}}{n}\right)^{m}.

In summary, on 𝒜1𝒜2𝒜3𝒜4\mathcal{A}_{1}\cap\mathcal{A}_{2}\cap\mathcal{A}_{3}\cap\mathcal{A}_{4}, we have

|i=1nεin|\displaystyle\left|\sum_{i=1}^{n}\varepsilon_{in}\right| <d9/2(cL2)3n1/16log3/2n+d6(cL2)4n7/24log2n\displaystyle<d^{9/2}\left(\sqrt{\frac{{c}}{L_{2}}}\right)^{3}n^{-1/16}\hbox{log}^{3/2}n+d^{6}\left(\sqrt{\frac{{c}}{L_{2}}}\right)^{4}n^{-7/24}\hbox{log}^{2}n (37)
+d15/2(cL2)5(1+M)n7/48log5/2n,\displaystyle\quad+d^{15/2}\left(\sqrt{\frac{{c}}{L_{2}}}\right)^{5}(1+M)n^{-7/48}\hbox{log}^{5/2}n,

and

(𝒜2𝒜3𝒜4)12(d3+d4)Rm(Mn1/8)2m2Rm(4M2n)m.{\mathbb{P}}(\mathcal{A}_{2}\cap\mathcal{A}_{3}\cap\mathcal{A}_{4})\geq 1-2(d^{3}+d^{4})R_{m}\left(\frac{M}{n^{1/8}}\right)^{2m}-2R_{m}\left(\frac{4M^{2}}{n}\right)^{m}. (38)

Step 2: Next, we aim to bound |2∑_{i=1}^{n}𝐦̄^{T}𝐛_iε_{in}|. By a third-order Taylor expansion with Lagrange remainder, ε_{in} can be written exactly as

εin=j1=1dj3=1dg=1Gα¯gs=13(θ¯gjsθ0js)(3f(xi;𝜻g(xi))θj1θj2θj3)/(3!f(xi;𝜽0)),\varepsilon_{in}=\sum_{j_{1}=1}^{d}\cdots\sum_{j_{3}=1}^{d}\sum_{g=1}^{G}\bar{\alpha}_{g}\prod_{s=1}^{3}\left(\bar{\theta}_{gj_{s}}-{\theta}_{0j_{s}}\right)\left(\frac{\partial^{3}f(x_{i};{\bm{\zeta}}_{g}(x_{i}))}{\partial\theta_{j_{1}}\partial\theta_{j_{2}}\partial\theta_{j_{3}}}\right)/(3!f(x_{i};\bm{\theta}_{0})),

where 𝜻_g(x_i) lies between 𝜽̄_g and 𝜽₀. Then we have

|εin|\displaystyle\left|\varepsilon_{in}\right| |j1=1dj3=1dg=1Gα¯gs=13(θ¯gjsθ0js)|g(xi;𝜽0)\displaystyle\leq\left|\sum_{j_{1}=1}^{d}\cdots\sum_{j_{3}=1}^{d}\sum_{g=1}^{G}\bar{\alpha}_{g}\prod_{s=1}^{3}\left(\bar{\theta}_{gj_{s}}-{\theta}_{0j_{s}}\right)\right|g(x_{i};\bm{\theta}_{0})
(d)3𝝃¯𝝃023d3g(xi;𝜽0).\displaystyle\leq(\sqrt{d})^{3}\left\|\bar{\bm{\xi}}-\bm{\xi}_{0}\right\|_{2}^{3}d^{3}g(x_{i};\bm{\theta}_{0}). (39)

Applying the inequality (39) and Cauchy's inequality, we have

|2𝐦¯T𝐛iεin|2𝐦¯2𝐛i2(d)3𝝃¯𝝃023d3g(xi;𝜽0),\left|2\bar{{\bf m}}^{{\rm T}}{\bf b}_{i}\varepsilon_{in}\right|\leq 2\left\|\bar{{\bf m}}\right\|_{2}\left\|{\bf b}_{i}\right\|_{2}\left(\sqrt{d}\right)^{3}\left\|\bar{\bm{\xi}}-\bm{\xi}_{0}\right\|_{2}^{3}d^{3}g(x_{i};\bm{\theta}_{0}),

where

𝐛i2=(j=1dYij2+j=1dZij2+j1=1dj2>j1dUij1j22)1/2.\left\|{\bf b}_{i}\right\|_{2}=\left(\sum_{j=1}^{d}Y_{ij}^{2}+\sum_{j=1}^{d}Z_{ij}^{2}+\sum_{j_{1}=1}^{d}\sum_{j_{2}>j_{1}}^{d}U_{ij_{1}j_{2}}^{2}\right)^{1/2}.

Therefore, we only need to consider the term i=1n𝐛i2g(xi;𝜽0).\sum_{i=1}^{n}\left\|{\bf b}_{i}\right\|_{2}g(x_{i};\bm{\theta}_{0}). By Lemma 15, we have

(|i=1n𝐛i2g(xi;𝜽0)|n(1+(d+1)M2))2Rm(4(d+1)2M4n)m.{\mathbb{P}}\left(\left|\sum_{i=1}^{n}\left\|{\bf b}_{i}\right\|_{2}g(x_{i};\bm{\theta}_{0})\right|\geq n\left(1+(d+1)M^{2}\right)\right)\leq 2R_{m}\left(\frac{4(d+1)^{2}M^{4}}{n}\right)^{m}.

Let

𝒜5={|i=1n𝐛i2g(xi;𝜽0)|<n(1+(d+1)M2)}.\mathcal{A}_{5}=\left\{\left|\sum_{i=1}^{n}\left\|{\bf b}_{i}\right\|_{2}g(x_{i};\bm{\theta}_{0})\right|<n\left(1+(d+1)M^{2}\right)\right\}.

Therefore, on 𝒜1𝒜5\mathcal{A}_{1}\cap\mathcal{A}_{5}, using the fact that L2L1L_{2}\leq L_{1}, we have

|i=1n2𝐦¯T𝐛iεin|<2d9/2(1+(d+1)M2)(cL2)5/2n7/48log5/2n,\left|\sum_{i=1}^{n}2\bar{{\bf m}}^{{\rm T}}{\bf b}_{i}\varepsilon_{in}\right|<2d^{9/2}\left(1+(d+1)M^{2}\right)\left({\frac{{c}}{L_{2}}}\right)^{5/2}n^{-7/48}\hbox{log}^{5/2}n, (40)

and

(𝒜5)12Rm(4(d+1)2M4n)m.{\mathbb{P}}(\mathcal{A}_{5})\geq 1-2R_{m}\left(\frac{4(d+1)^{2}M^{4}}{n}\right)^{m}. (41)

Step 3: Finally, we aim to bound

i=1nδi3=i=1n{(𝐦¯T𝐛i)3+3(𝐦¯T𝐛i)2εin+3(𝐦¯T𝐛i)εin2+εin3}.\sum_{i=1}^{n}\delta_{i}^{3}=\sum_{i=1}^{n}\left\{\left(\bar{{\bf m}}^{{\rm T}}{\bf b}_{i}\right)^{3}+3\left(\bar{{\bf m}}^{{\rm T}}{\bf b}_{i}\right)^{2}\varepsilon_{in}+3\left(\bar{{\bf m}}^{{\rm T}}{\bf b}_{i}\right)\varepsilon_{in}^{2}+\varepsilon_{in}^{3}\right\}. (42)

We first deal with i=1n|(𝐦¯T𝐛i)qεinp|,\sum_{i=1}^{n}\left|\left(\bar{{\bf m}}^{{\rm T}}{\bf b}_{i}\right)^{q}\varepsilon_{in}^{p}\right|, where p,qp,q\in\mathbb{N} and p+q=3p+q=3. Similar to the proof in Step 2, we have

|εin|d9/2𝝃¯𝝃023g(xi;𝜽0).\left|\varepsilon_{in}\right|\leq d^{9/2}\left\|\bar{\bm{\xi}}-\bm{\xi}_{0}\right\|_{2}^{3}g(x_{i};\bm{\theta}_{0}).

Then, we have

i=1n|(𝐦¯T𝐛i)qεinp|d9p/2𝝃¯𝝃023p𝐦¯2qi=1n𝐛i2qgp(xi;𝜽0).\sum_{i=1}^{n}\left|\left(\bar{{\bf m}}^{{\rm T}}{\bf b}_{i}\right)^{q}\varepsilon_{in}^{p}\right|\leq d^{9p/2}\left\|\bar{\bm{\xi}}-\bm{\xi}_{0}\right\|_{2}^{3p}\|\bar{{\bf m}}\|_{2}^{q}\sum_{i=1}^{n}\|{\bf b}_{i}\|_{2}^{q}g^{p}(x_{i};\bm{\theta}_{0}).

Let

pq={|i=1n𝐛i2qgp(xi;𝜽0)|<n(1+(d+1)qM3)}.\mathcal{B}_{pq}=\left\{\left|\sum_{i=1}^{n}\|{\bf b}_{i}\|_{2}^{q}g^{p}(x_{i};\bm{\theta}_{0})\right|<n(1+(d+1)^{q}M^{3})\right\}.

Then, on 𝒜1pq\mathcal{A}_{1}\cap\mathcal{B}_{pq}, we have

i=1n|(𝐦¯T𝐛i)qεinp|<d9p/2(1+(d+1)qM3)(cL2)3p+2q2n4811(3p+2q)48log3p+2q2n,\sum_{i=1}^{n}\left|\left(\bar{{\bf m}}^{{\rm T}}{\bf b}_{i}\right)^{q}\varepsilon_{in}^{p}\right|<d^{9p/2}\left(1+(d+1)^{q}M^{3}\right)\left({\frac{{c}}{L_{2}}}\right)^{\frac{3p+2q}{2}}n^{\frac{48-11(3p+2q)}{48}}\hbox{log}^{\frac{3p+2q}{2}}n,

and by Lemma 15,

(pq)12Rm(2(d+1)qM3n)2m.{\mathbb{P}}(\mathcal{B}_{pq})\geq 1-2R_{m}\left(\frac{2(d+1)^{q}M^{3}}{\sqrt{n}}\right)^{2m}.

Hence, we let

𝒜6=p+q=3pq.\mathcal{A}_{6}=\bigcap_{p+q=3}\mathcal{B}_{pq}.

Then, on 𝒜1𝒜6\mathcal{A}_{1}\cap\mathcal{A}_{6},

|i=1nδi3|\displaystyle\left|\sum_{i=1}^{n}\delta_{i}^{3}\right| <p+q=3(3p)d9p/2(1+(d+1)qM3)(cL2)3p+2q2n4811(3p+2q)48log3p+2q2n\displaystyle<\sum_{p+q=3}\tbinom{3}{p}d^{9p/2}\left(1+(d+1)^{q}M^{3}\right)\left({\frac{{c}}{L_{2}}}\right)^{\frac{3p+2q}{2}}n^{\frac{48-11(3p+2q)}{48}}\hbox{log}^{\frac{3p+2q}{2}}n
C~n38log3n,\displaystyle\leq\tilde{C}n^{-\frac{3}{8}}\hbox{log}^{3}n, (43)

where C~>0\tilde{C}>0 is a constant and by Lemma 15,

(𝒜6)1(q=032Rm(2(d+1)qM3n)2m).{\mathbb{P}}(\mathcal{A}_{6})\geq 1-\left(\sum_{q=0}^{3}2R_{m}\left(\frac{2(d+1)^{q}M^{3}}{\sqrt{n}}\right)^{2m}\right). (44)

By (37, 38, 40, 41, 43, 44), there are two constants C and C₁ depending on m,M,d,L₂,c such that

(|2i=1nεin2i=1n𝐦¯T𝐛iεin+23i=1nδi3|Cn1/16log3/2n)(C1n)m/4.{\mathbb{P}}\left(\left|2\sum_{i=1}^{n}\varepsilon_{in}-2\sum_{i=1}^{n}\bar{{\bf m}}^{{\rm T}}{\bf b}_{i}\varepsilon_{in}+\frac{2}{3}\sum_{i=1}^{n}\delta_{i}^{3}\right|\geq Cn^{-1/16}\hbox{log}^{3/2}n\right)\leq{(C_{1}n)^{-m/4}}. (45)

The above inequality (45) shows that |2∑_{i=1}^{n}ε_{in}−2∑_{i=1}^{n}𝐦̄^{T}𝐛_iε_{in}+(2/3)∑_{i=1}^{n}δ_i³| is sufficiently small with high probability.

We next bound

i=1n𝐦¯T𝐛i𝐛iT𝐦¯=𝐦¯Ti=1n[𝐛i𝐛iT]𝐦¯=n𝐦¯Ti=1n[𝐛i𝐛iT]n𝐦¯.\sum_{i=1}^{n}\bar{{\bf m}}^{{\rm T}}{\bf b}_{i}{\bf b}_{i}^{\rm T}\bar{{\bf m}}=\bar{{\bf m}}^{{\rm T}}\sum_{i=1}^{n}\left[{\bf b}_{i}{\bf b}_{i}^{\rm T}\right]\bar{{\bf m}}=n\bar{{\bf m}}^{\rm T}\sum_{i=1}^{n}\frac{\left[{\bf b}_{i}{\bf b}_{i}^{\rm T}\right]}{n}\bar{{\bf m}}.

To this end, consider the event

𝒜7={i=1n[𝐛i𝐛iT]n𝐁F<λmin/2}.\mathcal{A}_{7}=\left\{\left\|\sum_{i=1}^{n}\frac{\left[{\bf b}_{i}{\bf b}_{i}^{\rm T}\right]}{n}-{\bf B}\right\|_{F}<\lambda_{\rm min}/2\right\}.

By the matrix inequality 𝐀𝐁F|λmin(𝐀)λmin(𝐁)|\|{\bf A}-{\bf B}\|_{F}\geq|\lambda_{\rm min}({\bf A})-\lambda_{\rm min}({\bf B})|, on 𝒜7\mathcal{A}_{7}, we have

λmin(i=1n[𝐛i𝐛iT]n)>λmin/2.\lambda_{\rm min}\left(\sum_{i=1}^{n}\frac{\left[{\bf b}_{i}{\bf b}_{i}^{\rm T}\right]}{n}\right)>\lambda_{\rm min}/2.

It follows that on 𝒜7\mathcal{A}_{7}, we have

n𝐦¯Ti=1n[𝐛i𝐛iT]n𝐦¯λmin2n𝐦¯T𝐦¯.n\bar{{\bf m}}^{{\rm T}}\sum_{i=1}^{n}\frac{\left[{\bf b}_{i}{\bf b}_{i}^{\rm T}\right]}{n}\bar{{\bf m}}\geq\frac{\lambda_{\rm min}}{2}n\bar{{\bf m}}^{{\rm T}}\bar{{\bf m}}. (46)

Next we bound the probability of 𝒜7\mathcal{A}_{7}. Since

𝒜7c{There exists one pair k,l such that |(i=1n[𝐛i𝐛iT]n)klBkl|λmin/(2d2)},\mathcal{A}_{7}^{c}\subset\left\{\mbox{There exists one pair }k,l\mbox{ such that }\left|\left(\sum_{i=1}^{n}\frac{\left[{\bf b}_{i}{\bf b}_{i}^{\rm T}\right]}{n}\right)_{kl}-B_{kl}\right|\geq\lambda_{\rm min}/(2d^{2})\right\},

we have

(𝒜7c)k=1dl=1d(|(i=1n[𝐛i𝐛iT]n)klBkl|λmin/(2d2)).{\mathbb{P}}(\mathcal{A}_{7}^{c})\leq\sum_{k=1}^{d}\sum_{l=1}^{d}{\mathbb{P}}\left(\left|\left(\sum_{i=1}^{n}\frac{\left[{\bf b}_{i}{\bf b}_{i}^{\rm T}\right]}{n}\right)_{kl}-B_{kl}\right|\geq\lambda_{\rm min}/(2d^{2})\right).

For fixed k,l, let T_i=[𝐛_i𝐛_i^{T}]_{kl}. By Cauchy's inequality and Condition (C3), we have ∥T_i∥_{L^{2m}}≤M². Since 𝔼([𝐛_i𝐛_i^{T}])_{kl}=B_{kl}, by Lemma 14, we have

(|i=1n(TiBkl)|nλmin/(2d2))2Rm(4M4n)m(2d2λmin)2m.{\mathbb{P}}\left(\left|\sum_{i=1}^{n}\left(T_{i}-B_{kl}\right)\right|\geq n\lambda_{\rm min}/\left(2d^{2}\right)\right)\leq 2R_{m}\left(\frac{4M^{4}}{n}\right)^{m}\left(\frac{2d^{2}}{{\lambda_{\rm min}}}\right)^{2m}.

Thus, we have

(𝒜7)12d2Rm(4M4n)m(2d2λmin)2m.{\mathbb{P}}(\mathcal{A}_{7})\geq 1-2d^{2}R_{m}\left(\frac{4M^{4}}{n}\right)^{m}\left(\frac{2d^{2}}{{\lambda_{\rm min}}}\right)^{2m}. (47)

Combining (35, 45) with (46, 47), we have

(R1n2i=1n𝐦¯T𝐛iλmin2n𝐦¯T𝐦¯+Cn1/16log3/2n)1(C1n)m/4,{\mathbb{P}}\left(R_{1n}\leq 2\sum_{i=1}^{n}\bar{{\bf m}}^{{\rm T}}{\bf b}_{i}-\frac{\lambda_{\rm min}}{2}n\bar{{\bf m}}^{{\rm T}}\bar{{\bf m}}+Cn^{-1/16}\hbox{log}^{3/2}n\right)\geq 1-{(C_{1}n)^{-m/4}}, (48)

where C and C₁ are two further constants. It is clear that

R1n\displaystyle R_{1n} 2𝐦¯Ti=1n𝐛iλmin2n𝐦¯T𝐦¯+Cn1/16log3/2n\displaystyle\leq 2\bar{{\bf m}}^{{\rm T}}\sum_{i=1}^{n}{\bf b}_{i}-\frac{\lambda_{\rm min}}{2}n\bar{{\bf m}}^{{\rm T}}\bar{{\bf m}}+Cn^{-1/16}\hbox{log}^{3/2}n
2(i=1n𝐛i)T(i=1n𝐛i)λminn+Cn1/16log3/2n\displaystyle\leq\frac{2(\sum_{i=1}^{n}{\bf b}_{i})^{\rm T}(\sum_{i=1}^{n}{\bf b}_{i})}{\lambda_{\rm min}n}+Cn^{-1/16}\hbox{log}^{3/2}n
=2λminnj=1d(d+3)/2(i=1nbij)2+Cn1/16log3/2n.\displaystyle=\frac{2}{\lambda_{\rm min}n}\sum_{j=1}^{d(d+3)/2}\left(\sum_{i=1}^{n}b_{ij}\right)^{2}+Cn^{-1/16}\hbox{log}^{3/2}n.

Therefore, let

Tn=2λminnj=1d(d+3)/2(i=1nbij)2.T_{n}=\frac{2}{\lambda_{\rm min}n}\sum_{j=1}^{d(d+3)/2}\left(\sum_{i=1}^{n}b_{ij}\right)^{2}.

For any t>0t>0, we have

{Tnt}\displaystyle\{T_{n}\geq t\} ={j=1d(d+3)/2(i=1nbij)2/nλmin2t}\displaystyle=\left\{\sum_{j=1}^{d(d+3)/2}\left(\sum_{i=1}^{n}b_{ij}\right)^{2}\bigg{/}n\geq\frac{\lambda_{\rm min}}{2}t\right\}
{There exists j such that (i=1nbij)2/nλmind(d+3)t}.\displaystyle\subset\left\{\mbox{There exists }j\mbox{ such that }\left(\sum_{i=1}^{n}b_{ij}\right)^{2}\bigg{/}n\geq\frac{\lambda_{\rm min}}{d(d+3)}t\right\}.

It follows that

{\mathbb{P}}(T_{n}\geq t)\leq\sum_{j=1}^{d(d+3)/2}{\mathbb{P}}\left(\frac{\left|\sum_{i=1}^{n}b_{ij}\right|}{\sqrt{n}}\geq\left(\frac{\lambda_{\rm min}}{d(d+3)}t\right)^{1/2}\right).

For any fixed jj, by Lemma 14, we have

{\mathbb{P}}\left(\frac{\left|\sum_{i=1}^{n}b_{ij}\right|}{\sqrt{n}}\geq\left(\frac{\lambda_{\rm min}}{d(d+3)}t\right)^{1/2}\right)\leq 2R_{m}\left(\frac{M^{2}d(d+3)}{\lambda_{\rm min}}\right)^{m}t^{-m}.

Thus, we have

(Tnt)d(d+3)Rm(M2d(d+3)λmin)mtm.{\mathbb{P}}(T_{n}\geq t)\leq d(d+3)R_{m}\left(\frac{M^{2}d(d+3)}{\lambda_{\rm min}}\right)^{m}t^{-m}. (49)

Combining (48) with (49), we conclude that

(R1nt+Cn1/16log3/2n)1(C1n)m/4(C2t)m,{\mathbb{P}}(R_{1n}\leq t+Cn^{-1/16}\hbox{log}^{3/2}n)\geq 1-{(C_{1}n)^{-m/4}}-{(C_{2}t)}^{-m},

where C,C1,C2C,C_{1},C_{2} are three constants. It follows that

(EMn(K)t+Cn1/16log3/2n)1(C1n)m/4(C2t)m,{\mathbb{P}}({\rm EM}_{n}^{(K)}\leq t+Cn^{-1/16}\hbox{log}^{3/2}n)\geq 1-{(C_{1}n)^{-m/4}}-{(C_{2}t)}^{-m},

and thus we prove the theorem. ∎

Proof of Lemma 14.

By Exercise 2.20 in Wainwright (2019), under the stated conditions, there is a universal constant RmR_{m} such that

𝔼[(i=1nXi)2m]Rm{i=1n𝔼[Xi2m]+(i=1n𝔼[Xi2])m}.\mathbb{E}\left[\left(\sum_{i=1}^{n}X_{i}\right)^{2m}\right]\leq R_{m}\left\{\sum_{i=1}^{n}\mathbb{E}\left[X_{i}^{2m}\right]+\left(\sum_{i=1}^{n}\mathbb{E}\left[X_{i}^{2}\right]\right)^{m}\right\}.

By Lyapunov’s inequality, we have XiL2XiL2mC\|X_{i}\|_{L^{2}}\leq\|X_{i}\|_{L^{2m}}\leq C. By Markov’s inequality, we have

{|i=1nXi|nδ}\displaystyle{\mathbb{P}}\left\{\left|\sum_{i=1}^{n}X_{i}\right|\geq n\delta\right\} 𝔼(|i=1nXi|2m)(nδ)2m\displaystyle\leq\frac{{\mathbb{E}}\left(|\sum_{i=1}^{n}X_{i}|^{2m}\right)}{\left(n\delta\right)^{2m}}
Rm(nδ)2m{i=1n𝔼[Xi2m]+(i=1n𝔼[Xi2])m}\displaystyle\leq\frac{R_{m}}{\left(n\delta\right)^{2m}}\left\{\sum_{i=1}^{n}\mathbb{E}\left[X_{i}^{2m}\right]+\left(\sum_{i=1}^{n}\mathbb{E}\left[X_{i}^{2}\right]\right)^{m}\right\}
Rm(nδ)2m(nC2m+nmC2m)\displaystyle\leq\frac{R_{m}}{\left(n\delta\right)^{2m}}\left(nC^{2m}+n^{m}C^{2m}\right)
2Rm(nδ)2mnmC2m=2Rm(Cnδ)2m.\displaystyle\leq\frac{2R_{m}}{\left(n\delta\right)^{2m}}n^{m}C^{2m}=2R_{m}\left(\frac{C}{\sqrt{n}\delta}\right)^{2m}.

The second conclusion follows by applying the first conclusion to X_i−𝔼(X_i), whose L^{2m} norm is at most 2C by the triangle and Lyapunov inequalities, and thus we complete the proof. ∎

Proof of Lemma 15.

We first prove that when p+q3p+q\leq 3, we have

𝐛i2qgp(xi;𝜽0)L2m(d+1)qMp+q.\left\|\left\|{\bf b}_{i}\right\|_{2}^{q}g^{p}(x_{i};\bm{\theta}_{0})\right\|_{L^{2m}}\leq(d+1)^{q}M^{p+q}. (50)

When p,q<3p,q<3, by Cauchy’s inequality, we have

𝐛i2qgp(xi;𝜽0)L2m𝐛i2qL4mgp(xi;𝜽0)L4m.\left\|\left\|{\bf b}_{i}\right\|_{2}^{q}g^{p}(x_{i};\bm{\theta}_{0})\right\|_{L^{2m}}\leq\left\|\left\|{\bf b}_{i}\right\|_{2}^{q}\right\|_{L^{4m}}\|g^{p}(x_{i};\bm{\theta}_{0})\|_{L^{4m}}.

By the triangle inequality and Condition (C3), we conclude that

𝐛i2qL4m\displaystyle\left\|\left\|{\bf b}_{i}\right\|_{2}^{q}\right\|_{L^{4m}} =(j=1dYij2+j=1dZij2+j1=1dj2>j1dUij1j22)q/2L4m\displaystyle=\left\|\left(\sum_{j=1}^{d}Y_{ij}^{2}+\sum_{j=1}^{d}Z_{ij}^{2}+\sum_{j_{1}=1}^{d}\sum_{j_{2}>j_{1}}^{d}U_{ij_{1}j_{2}}^{2}\right)^{q/2}\right\|_{L^{4m}}
=(j=1dYij2+j=1dZij2+j1=1dj2>j1dUij1j22L2mq)q/2\displaystyle=\left(\left\|\sum_{j=1}^{d}Y_{ij}^{2}+\sum_{j=1}^{d}Z_{ij}^{2}+\sum_{j_{1}=1}^{d}\sum_{j_{2}>j_{1}}^{d}U_{ij_{1}j_{2}}^{2}\right\|_{L^{2mq}}\right)^{q/2}
(j=1dYij2L2mq+j=1dZij2L2mq+j1=1dj2>j1dUij1j22L2mq)q/2\displaystyle\leq\left(\sum_{j=1}^{d}\left\|Y_{ij}^{2}\right\|_{L^{2mq}}+\sum_{j=1}^{d}\left\|Z_{ij}^{2}\right\|_{L^{2mq}}+\sum_{j_{1}=1}^{d}\sum_{j_{2}>j_{1}}^{d}\left\|U_{ij_{1}j_{2}}^{2}\right\|_{L^{2mq}}\right)^{q/2}
\displaystyle=\left(\sum_{j=1}^{d}\left\|Y_{ij}\right\|_{L^{4mq}}^{2}+\sum_{j=1}^{d}\left\|Z_{ij}\right\|_{L^{4mq}}^{2}+\sum_{j_{1}=1}^{d}\sum_{j_{2}>j_{1}}^{d}\left\|U_{ij_{1}j_{2}}\right\|_{L^{4mq}}^{2}\right)^{q/2}
(d+1)qMq,\displaystyle\leq(d+1)^{q}M^{q}, (51)

where the last inequality is from the fact q2q\leq 2 and d(d+3)/2(d+1)2d(d+3)/2\leq(d+1)^{2}. By p2p\leq 2 and Condition (C3), similarly, we conclude that

gp(xi;𝜽0)L4m=(g(xi;𝜽0)L4mp)pMp.\|g^{p}(x_{i};\bm{\theta}_{0})\|_{L^{4m}}=(\|g(x_{i};\bm{\theta}_{0})\|_{L^{4mp}})^{p}\leq M^{p}.

Thus, we prove (50) when p,q<3. When p=0,q=3, analysis similar to that in (51) shows that

𝐛i2qL2m(d+1)qMq.\left\|\left\|{\bf b}_{i}\right\|_{2}^{q}\right\|_{L^{2m}}\leq(d+1)^{q}M^{q}.

Similarly, when p=3,q=0p=3,q=0 we have

gp(xi;𝜽0)L2m=(g(xi;𝜽0)L2mp)pMp.\|g^{p}(x_{i};\bm{\theta}_{0})\|_{L^{2m}}=(\|g(x_{i};\bm{\theta}_{0})\|_{L^{2mp}})^{p}\leq M^{p}.

Thus, we prove that for any p+q3p+q\leq 3, (50) holds.

Next, by Lemma 14, we have

{|i=1n{Rpq(xi)𝔼(Rpq(xi))}|n}2Rm(2(d+1)qMp+qn)2m.{\mathbb{P}}\left\{\left|\sum_{i=1}^{n}\left\{R_{pq}(x_{i})-{\mathbb{E}}(R_{pq}(x_{i}))\right\}\right|\geq n\right\}\leq 2R_{m}\left(\frac{2(d+1)^{q}M^{p+q}}{\sqrt{n}}\right)^{2m}.

By Lyapunov’s inequality, we have

𝐛i2qgp(xi;𝜽0)L1𝐛i2qgp(xi)L2m(d+1)qMp+q.\left\|\left\|{\bf b}_{i}\right\|_{2}^{q}g^{p}(x_{i};\bm{\theta}_{0})\right\|_{L^{1}}\leq\left\|\left\|{\bf b}_{i}\right\|_{2}^{q}g^{p}(x_{i})\right\|_{L^{2m}}\leq(d+1)^{q}M^{p+q}.

It follows that

{|i=1nRpq(xi)|n(1+(d+1)qMp+q)}2Rm(2(d+1)qMp+qn)2m,{\mathbb{P}}\left\{\left|\sum_{i=1}^{n}R_{pq}(x_{i})\right|\geq n(1+(d+1)^{q}M^{p+q})\right\}\leq 2R_{m}\left(\frac{2(d+1)^{q}M^{p+q}}{\sqrt{n}}\right)^{2m},

and thus we complete the proof. ∎

Proof of Lemma 16.

Note that 𝔼[Di(j1,,jk)]=0.{\mathbb{E}}\left[D_{i}(j_{1},\dots,j_{k})\right]=0. By Condition (C3), we have

Di(j1,,jk)L2mM.\left\|D_{i}(j_{1},\dots,j_{k})\right\|_{L^{2m}}\leq M.

By Lemma 14, we have

{|i=1nDi(j1,,jk)|<nt}12Rm(Mnt)2m,for all t>0.{\mathbb{P}}\left\{\left|\sum_{i=1}^{n}D_{i}(j_{1},\dots,j_{k})\right|<nt\right\}\geq 1-2R_{m}\left(\frac{M}{\sqrt{n}t}\right)^{2m},\ \ \mbox{for all }t>0.

Taking t=n3/8t=n^{-3/8}, we obtain the tail probability bound as

{|i=1nDi(j1,,jk)|n5/8}2Rm(Mn1/8)2m.{\mathbb{P}}\left\{\left|\sum_{i=1}^{n}D_{i}(j_{1},\dots,j_{k})\right|\geq n^{5/8}\right\}\leq 2R_{m}\left(\frac{M}{n^{1/8}}\right)^{2m}.

Observe that

{j1=1djk=1d|i=1nDi(j1,,jk)|dkn5/8}j1=1djk=1d{|i=1nDi(j1,,jk)|n5/8}\left\{\sum_{j_{1}=1}^{d}\cdots\sum_{j_{k}=1}^{d}\left|\sum_{i=1}^{n}D_{i}(j_{1},\dots,j_{k})\right|\geq d^{k}n^{5/8}\right\}\subset\bigcup_{j_{1}=1}^{d}\cdots\bigcup_{j_{k}=1}^{d}\left\{\left|\sum_{i=1}^{n}D_{i}(j_{1},\dots,j_{k})\right|\geq n^{5/8}\right\}

It follows that

{j1=1djk=1d|i=1nDi(j1,,jk)|<dkn5/8}12dkRm(Mn1/8)2m,{\mathbb{P}}\left\{\sum_{j_{1}=1}^{d}\cdots\sum_{j_{k}=1}^{d}\left|\sum_{i=1}^{n}D_{i}(j_{1},\dots,j_{k})\right|<d^{k}n^{5/8}\right\}\geq 1-2d^{k}R_{m}\left(\frac{M}{n^{1/8}}\right)^{2m},

and thus we complete the proof. ∎

A.4 Proofs of Theorem 3

We abbreviate pln(𝝃^0,𝜶0)pl_{n}\left(\hat{\bm{\xi}}_{0},{\bm{\alpha}}_{0}\right) and ln(𝝃^0,𝜶0)l_{n}\left(\hat{\bm{\xi}}_{0},{\bm{\alpha}}_{0}\right) to pln(𝝃^0)pl_{n}\left(\hat{\bm{\xi}}_{0}\right) and ln(𝝃^0)l_{n}\left(\hat{\bm{\xi}}_{0}\right), respectively. Recall that

𝜽0=argmax𝜽Θ𝔼𝜶,𝝃[logf(x;𝜽)],{\bm{\theta}}^{\dagger}_{0}=\mathop{\arg\max}_{\bm{\theta}\in{\Theta}}{\mathbb{E}}_{{\bm{\alpha}}^{*},\bm{\xi}^{*}}\left[\hbox{log}f(x;\bm{\theta})\right], (52)

and

𝝃=argmax𝝃Ξ𝔼𝜶,𝝃[logφ(x;𝝃,𝜶(0))].{\bm{\xi}}^{\dagger}=\mathop{\arg\max}_{\bm{\xi}\in\Xi}{\mathbb{E}}_{{\bm{\alpha}}^{*},\bm{\xi}^{*}}\left[\hbox{log}\ \varphi\left(x;\bm{\xi},{\bm{\alpha}}^{(0)}\right)\right]. (53)

We briefly describe the proof of Theorem 3. Observe that the EM-test statistic is no smaller than the penalized log-likelihood ratio pl_n(𝝃^{(0)},𝜶^{(0)})−pl_n(𝝃̂₀,𝜶₀), which can be decomposed as a sum of three parts: pl_n(𝝃^{(0)},𝜶^{(0)})−pl_n(𝝃^†,𝜶^{(0)}), pl_n(𝝃^†,𝜶^{(0)})−pl_n(𝝃₀^†,𝜶₀) and pl_n(𝝃₀^†,𝜶₀)−pl_n(𝝃̂₀,𝜶₀). All three parts can be bounded. The first part is non-negative. The second part can be written as

pln(𝝃,𝜶(0))pln(𝝃0,𝜶0)=i=1nR(xi;𝝃)+p(𝜶(0))p(𝜶0),pl_{n}\left({\bm{\xi}}^{\dagger},{\bm{\alpha}}^{(0)}\right)-pl_{n}\left({\bm{\xi}}_{0}^{\dagger},{\bm{\alpha}}_{0}\right)=\sum_{i=1}^{n}R(x_{i};\bm{\xi}^{*})+p({\bm{\alpha}}^{(0)})-p({\bm{\alpha}}_{0}),

and can be bounded using the Bernstein inequality. For the third part, since D(𝜽)D(𝜽0)D(\bm{\theta})\leq D\left({\bm{\theta}}^{\dagger}_{0}\right) for all 𝜽Θ\bm{\theta}\in\Theta, we have

pl_{n}\left({\bm{\xi}}_{0}^{\dagger},{\bm{\alpha}}_{0}\right)-pl_{n}\left(\hat{\bm{\xi}}_{0},{\bm{\alpha}}_{0}\right)=\sum_{i=1}^{n}\left\{\hbox{log}\ f\left(x_{i};{\bm{\theta}}^{\dagger}_{0}\right)-\hbox{log}\ f(x_{i};\hat{\bm{\theta}}_{0})\right\}\geq-n^{1/2}\sup_{\bm{\theta}\in\Theta}Z_{\bm{\theta}}(\bm{\xi}^{*}).

Thus, the third part can be bounded by analyzing the supremum of the empirical process {Z𝜽(𝝃),𝜽Θ}\{Z_{\bm{\theta}}(\bm{\xi}^{*}),\bm{\theta}\in\Theta\} using the generalized Dudley inequality.

We first give two technical lemmas.

Lemma 17.

Under Conditions (C5) and (C6), for every t≥0, we have

(ln(𝝃,𝜶(0))ln(𝝃0)nϱt)12exp[Cmin(t2nMψ12,tMψ1)],{\mathbb{P}}\left(l_{n}\left({\bm{\xi}}^{\dagger},{\bm{\alpha}}^{\left(0\right)}\right)-l_{n}\left({\bm{\xi}}_{0}^{\dagger}\right)\geq n\varrho-t\right)\geq 1-2\exp\left[-C^{\prime}\min\left(\frac{t^{2}}{nM_{\psi_{1}}^{2}},\frac{t}{M_{\psi_{1}}}\right)\right],

where CC^{\prime} is a constant and 𝛏\bm{\xi}^{\dagger} and 𝛏0=(𝛉0,,𝛉0)\bm{\xi}_{0}^{\dagger}=\left(\bm{\theta}_{0}^{\dagger},\dots,\bm{\theta}_{0}^{\dagger}\right) are defined in (53) and (52).

Let ρ(𝜽,𝜽)=Cρ𝜽𝜽2\rho(\bm{\theta},\bm{\theta}^{\prime})=C_{\rho}\|\bm{\theta}-\bm{\theta}^{\prime}\|_{2}, where CρC_{\rho} is in Condition (C7). Let 𝒩(u,Θ,ρ)\mathcal{N}(u,\Theta,\rho) be the covering number, which is the smallest number of closed balls with centers in Θ\Theta and radius uu whose union covers Θ\Theta. Next we define the generalized Dudley integral as

J(D)=\int_{0}^{D}\hbox{log}\left(1+\mathcal{N}(u,\Theta,\rho)\right){\rm d}u,

where D=sup_{𝜽,𝜽′∈Θ}∥𝜽−𝜽′∥₂ is the Euclidean diameter of Θ. Since Θ is a compact set and ρ(𝜽,𝜽′)=C_ρ∥𝜽−𝜽′∥₂, we have J(D)<∞ (see the bound below). By the chaining method, we then have the following generalized Dudley inequality; the proof of the classical Dudley inequality can be found in Vershynin (2018), and the proof of this version follows the same chaining arguments and can be found in Wainwright (2019) (Theorem 5.36), so we omit it.
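To see why J(D)<∞, one can use the standard volumetric covering bound; as a sketch, assume Θ is contained in a Euclidean ball of radius R in ℝ^d. Then

\mathcal{N}(u,\Theta,\rho)=\mathcal{N}\left(u/C_{\rho},\Theta,\|\cdot\|_{2}\right)\leq\left(1+\frac{2C_{\rho}R}{u}\right)^{d},

so log(1+𝒩(u,Θ,ρ)) grows only logarithmically as u↓0 and the integral defining J(D) converges.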

Lemma 18.

Under Condition (C7), for any tt,

(sup𝜽,𝜽Θ|Z𝜽(𝝃)Z𝜽(𝝃)|CJ[J(D)+t])2exp(tD),{\mathbb{P}}\left(\sup_{\bm{\theta},\bm{\theta}^{\prime}\in\Theta}|Z_{\bm{\theta}}(\bm{\xi}^{*})-Z_{\bm{\theta}^{\prime}}(\bm{\xi}^{*})|\geq C_{J}[J(D)+t]\right)\leq 2\exp\left(\frac{-t}{D}\right),

where CJC_{J} is a constant, DD is the diameter and J(D)J(D) is the generalized Dudley integral.

Proof of Theorem 3.

We first aim to bound the probability

(pln(𝝃(0),𝜶(0))pln(𝝃^0)21nϱn1/2CJ[J(D)+t]p0).{\mathbb{P}}\left(pl_{n}\left(\bm{\xi}^{(0)},{\bm{\alpha}}^{(0)}\right)-pl_{n}\left(\hat{\bm{\xi}}_{0}\right)\geq 2^{-1}n\varrho-n^{1/2}C_{J}\left[J\left(D\right)+t\right]-p_{0}\right).

Since pln(𝝃(0),𝜶(0))=ln(𝝃(0),𝜶(0))+p(𝜶(0))pl_{n}\left(\bm{\xi}^{(0)},{\bm{\alpha}}^{(0)}\right)=l_{n}\left(\bm{\xi}^{(0)},{\bm{\alpha}}^{(0)}\right)+p\left({\bm{\alpha}}^{(0)}\right) and p(𝜶(0))p0p\left({\bm{\alpha}}^{(0)}\right)\geq p_{0}, we have

pln(𝝃(0),𝜶(0))pln(𝝃^0)ln(𝝃(0),𝜶(0))ln(𝝃^0)+p0.pl_{n}\left(\bm{\xi}^{(0)},{\bm{\alpha}}^{(0)}\right)-pl_{n}\left(\hat{\bm{\xi}}_{0}\right)\geq l_{n}\left(\bm{\xi}^{(0)},{\bm{\alpha}}^{(0)}\right)-l_{n}\left(\hat{\bm{\xi}}_{0}\right)+p_{0}.

Thus, we only need to control

(ln(𝝃(0),𝜶(0))ln(𝝃^0)21nϱn1/2CJ[J(D)+t]).{\mathbb{P}}\left(l_{n}\left(\bm{\xi}^{(0)},{\bm{\alpha}}^{(0)}\right)-l_{n}\left(\hat{\bm{\xi}}_{0}\right)\geq 2^{-1}n\varrho-n^{1/2}C_{J}\left[J\left(D\right)+t\right]\right).

Since

ln(𝝃(0),𝜶(0))ln(𝝃^0)\displaystyle l_{n}\left(\bm{\xi}^{(0)},{\bm{\alpha}}^{(0)}\right)-l_{n}\left(\hat{\bm{\xi}}_{0}\right) =ln(𝝃(0),𝜶(0))ln(𝝃,𝜶(0))\displaystyle=l_{n}\left(\bm{\xi}^{(0)},{\bm{\alpha}}^{(0)}\right)-l_{n}\left({\bm{\xi}}^{\dagger},{\bm{\alpha}}^{(0)}\right)
+ln(𝝃,𝜶(0))ln(𝝃0)+ln(𝝃0)ln(𝝃^0)\displaystyle+l_{n}\left({\bm{\xi}}^{\dagger},{\bm{\alpha}}^{(0)}\right)-l_{n}\left({\bm{\xi}}_{0}^{\dagger}\right)+l_{n}\left({\bm{\xi}}_{0}^{\dagger}\right)-l_{n}\left(\hat{\bm{\xi}}_{0}\right)

and l_n(𝝃^{(0)},𝜶^{(0)})−l_n(𝝃^†,𝜶^{(0)})≥0, we have

(ln(𝝃(0),𝜶(0))ln(𝝃^0)21nϱn1/2CJ[J(D)+t])\displaystyle\quad{\mathbb{P}}\left(l_{n}\left(\bm{\xi}^{(0)},{\bm{\alpha}}^{\left(0\right)}\right)-l_{n}\left(\hat{\bm{\xi}}_{0}\right)\geq 2^{-1}n\varrho-n^{1/2}C_{J}\left[J\left(D\right)+t\right]\right)
(ln(𝝃,𝜶(0))ln(𝝃0)+ln(𝝃0)ln(𝝃^0)21nϱn1/2CJ[J(D)+t])\displaystyle\geq{\mathbb{P}}\left(l_{n}\left({\bm{\xi}}^{\dagger},{\bm{\alpha}}^{\left(0\right)}\right)-l_{n}\left({\bm{\xi}}_{0}^{\dagger}\right)+l_{n}\left({\bm{\xi}}_{0}^{\dagger}\right)-l_{n}\left(\hat{\bm{\xi}}_{0}\right)\geq 2^{-1}n\varrho-n^{1/2}C_{J}\left[J\left(D\right)+t\right]\right)
1(ln(𝝃,𝜶(0))ln(𝝃0)<21nϱ)(ln(𝝃0)ln(𝝃^0)<n1/2CJ[J(D)+t]).\displaystyle\geq 1-{\mathbb{P}}\left(l_{n}\left({\bm{\xi}}^{\dagger},{\bm{\alpha}}^{\left(0\right)}\right)-l_{n}\left({\bm{\xi}}_{0}^{\dagger}\right)<2^{-1}n\varrho\right)-{\mathbb{P}}\left(l_{n}\left({\bm{\xi}}_{0}^{\dagger}\right)-l_{n}\left(\hat{\bm{\xi}}_{0}\right)<-n^{1/2}C_{J}\left[J\left(D\right)+t\right]\right).

Thus, we divide the remaining proof into two steps.

Step 1. We aim to bound (ln(𝝃,𝜶(0))ln(𝝃0)21nϱ).{\mathbb{P}}\left(l_{n}\left({\bm{\xi}}^{\dagger},{\bm{\alpha}}^{\left(0\right)}\right)-l_{n}\left({\bm{\xi}}_{0}^{\dagger}\right)\geq 2^{-1}n\varrho\right). Applying Lemma 17 and taking t=nϱ/2t=n\varrho/2 yield

{\mathbb{P}}\left(l_{n}\left({\bm{\xi}}^{\dagger},{\bm{\alpha}}^{\left(0\right)}\right)-l_{n}\left({\bm{\xi}}_{0}^{\dagger}\right)\geq 2^{-1}n\varrho\right)\geq 1-2\exp\left[-C^{\prime}\min\left(\frac{n\varrho^{2}}{4M_{\psi_{1}}^{2}},\frac{n\varrho}{2M_{\psi_{1}}}\right)\right]. (54)

Step 2. We aim to bound ℙ(l_n(𝝃₀^†)−l_n(𝝃̂₀)<−n^{1/2}C_J[J(D)+t]). We can write l_n(𝝃₀^†) as l_n(𝜽₀^†) because 𝝃₀^†=(𝜽₀^†,…,𝜽₀^†). The major difficulty in controlling the second term is the randomness of 𝜽̂₀. To deal with it, we note that

(ln(𝜽0)ln(𝜽^0)n1/2CJ[J(D)+t])\displaystyle\quad{\mathbb{P}}\left(l_{n}\left({\bm{\theta}}_{0}^{\dagger}\right)-l_{n}\left(\hat{\bm{\theta}}_{0}\right)\geq-n^{1/2}C_{J}\left[J\left(D\right)+t\right]\right)
(inf𝜽Θ{ln(𝜽0)ln(𝜽)}n1/2CJ[J(D)+t]).\displaystyle\geq{\mathbb{P}}\left(\inf_{\bm{\theta}\in\Theta}\left\{l_{n}\left({\bm{\theta}}_{0}^{\dagger}\right)-l_{n}\left(\bm{\theta}\right)\right\}\geq-n^{1/2}C_{J}\left[J\left(D\right)+t\right]\right).

Thus, we turn to controlling the probability of {inf_{𝜽∈𝚯}{l_n(𝜽₀^†)−l_n(𝜽)}≥−n^{1/2}C_J[J(D)+t]}, which is equivalent to controlling the probability of {sup_{𝜽∈𝚯}{l_n(𝜽)−l_n(𝜽₀^†)}≤n^{1/2}C_J[J(D)+t]}.

Let 𝜽=𝜽0\bm{\theta}^{\prime}={\bm{\theta}}_{0}^{\dagger}. It follows that Z𝜽(𝝃)=0Z_{\bm{\theta}^{\prime}}(\bm{\xi}^{*})=0. By Lemma 18, we have

(sup𝜽Θ|Z𝜽(𝝃)|CJ[J(D)+t])2exp(tD).{\mathbb{P}}\left(\sup_{\bm{\theta}\in\Theta}|Z_{\bm{\theta}}(\bm{\xi}^{*})|\geq C_{J}[J(D)+t]\right)\leq 2\exp\left(\frac{-t}{D}\right).

Plugging Z𝜽Z_{\bm{\theta}} into the above inequality, we have

(sup𝜽Θ|n1/2{ln(𝜽)ln(𝜽0)(D(𝜽)D(𝜽0))}|CJ[J(D)+t])2exp(tD),{\mathbb{P}}\left(\sup_{\bm{\theta}\in\Theta}\left|n^{-1/2}\left\{l_{n}(\bm{\theta})-l_{n}({\bm{\theta}}_{0}^{\dagger})-\left(D(\bm{\theta})-D\left({\bm{\theta}}_{0}^{\dagger}\right)\right)\right\}\right|\geq C_{J}[J(D)+t]\right)\leq 2\exp\left(\frac{-t}{D}\right),

where D(𝜽)=𝔼𝜶,𝝃[logf(x;𝜽)]D(\bm{\theta})={\mathbb{E}}_{{\bm{\alpha}}^{*},\bm{\xi}^{*}}[\hbox{log}f(x;\bm{\theta})]. Since D(𝜽)D(𝜽0)0D(\bm{\theta})-D({\bm{\theta}}_{0}^{\dagger})\leq 0, we have

(sup𝜽Θn1/2{ln(𝜽)ln(𝜽0)}CJ[J(D)+t])2exp(tD).{\mathbb{P}}\left(\sup_{\bm{\theta}\in\Theta}n^{-1/2}\left\{l_{n}(\bm{\theta})-l_{n}\left({\bm{\theta}}_{0}^{\dagger}\right)\right\}\geq C_{J}[J(D)+t]\right)\leq 2\exp\left(\frac{-t}{D}\right).

That is

(sup𝜽Θ{ln(𝜽)ln(𝜽0)}n1/2CJ[J(D)+t])12exp(tD).{\mathbb{P}}\left(\sup_{\bm{\theta}\in\Theta}\left\{l_{n}(\bm{\theta})-l_{n}\left({\bm{\theta}}_{0}^{\dagger}\right)\right\}\leq n^{1/2}C_{J}[J(D)+t]\right)\geq 1-2\exp\left(\frac{-t}{D}\right). (55)

Combining (54), (55) with the likelihood non-decreasing property of EM yields Theorem 3. ∎

From Theorem 3, we can prove Corollary 1.

Proof of Corollary 1.

Write

t=\frac{2^{-1}n^{1/2}\varrho-n^{\vartheta-1/2}-p_{0}n^{-1/2}}{C_{J}}-J(D).

Then, we have

21nϱn1/2CJ[J(D)+t]p0=nϑ.2^{-1}n\varrho-n^{1/2}C_{J}\left[J\left(D\right)+t\right]-p_{0}=n^{\vartheta}.

By Theorem 3, we have

(EMn(K)nϑ)\displaystyle{\mathbb{P}}\left({\rm EM}_{n}^{\left(K\right)}\geq n^{\vartheta}\right) 12exp[Cmin(nϱ24Mψ12,nϱ2Mψ1)]\displaystyle\geq 1-2\exp\left[-C^{\prime}\min\left(\frac{n\varrho^{2}}{4M_{\psi_{1}}^{2}},\frac{n\varrho}{2M_{\psi_{1}}}\right)\right]
\displaystyle-2\exp\left(D^{-1}\frac{-2^{-1}n^{1/2}\varrho+n^{\vartheta-1/2}+p_{0}n^{-1/2}}{C_{J}}+D^{-1}J(D)\right).

Therefore, we can find two constants C3,C4>0C_{3},C_{4}>0 such that

(EMn(K)tn)1exp(C3n1/2+C4nϑ1/2),{\mathbb{P}}\left({\rm EM}_{n}^{\left(K\right)}\geq t_{n}\right)\geq 1-\exp\left(-C_{3}n^{1/2}+C_{4}n^{\vartheta-1/2}\right),

and complete the proof. ∎

Proof of Lemma 17.

Recall that

R(xi;𝝃)=log(φ(xi;𝝃,𝜶(0)))log(f(xi;𝜽0)).R(x_{i};\bm{\xi}^{*})=\hbox{log}\left(\varphi\left(x_{i};{\bm{\xi}}^{\dagger},{\bm{\alpha}}^{(0)}\right)\right)-\hbox{log}\left(f\left(x_{i};{\bm{\theta}}_{0}^{\dagger}\right)\right).

Since

ln(𝝃,𝜶(0))ln(𝝃0)=i=1n{log(φ(xi;𝝃,𝜶(0)))logf(xi;𝜽0)},l_{n}\left({\bm{\xi}}^{\dagger},{\bm{\alpha}}^{\left(0\right)}\right)-l_{n}\left({\bm{\xi}}_{0}^{\dagger}\right)=\sum_{i=1}^{n}\left\{\hbox{log}\left(\varphi\left(x_{i};{\bm{\xi}}^{\dagger},{\bm{\alpha}}^{(0)}\right)\right)-\hbox{log}f\left(x_{i};{\bm{\theta}}_{0}^{\dagger}\right)\right\},

we have

ln(𝝃,𝜶(0))ln(𝝃0)=i=1nR(xi;𝝃).l_{n}\left({\bm{\xi}}^{\dagger},{\bm{\alpha}}^{\left(0\right)}\right)-l_{n}\left({\bm{\xi}}_{0}^{\dagger}\right)=\sum_{i=1}^{n}R(x_{i};\bm{\xi}^{*}).

By Bernstein’s inequality in Vershynin (2018) (Theorem 2.8.1) and Condition (C6), for every t0t\geq 0, we have

{\mathbb{P}}\left(\left|\sum_{i=1}^{n}\left\{R(x_{i};\bm{\xi}^{*})-{\mathbb{E}}\left(R(x_{i};\bm{\xi}^{*})\right)\right\}\right|\leq t\right)\geq 1-2\exp\left[-C^{\prime}\min\left(\frac{t^{2}}{nM_{\psi_{1}}^{2}},\frac{t}{M_{\psi_{1}}}\right)\right],

where C>0C^{\prime}>0 is a universal constant. By Condition (C5), we conclude that 𝔼(R(xi;𝝃))ϱ{\mathbb{E}}\left(R(x_{i};\bm{\xi}^{*})\right)\geq\varrho. It follows that

(i=1nR(xi;𝝃)nϱt)12exp[Cmin(t2nMψ12,tMψ1)].{\mathbb{P}}\left(\sum_{i=1}^{n}R(x_{i};\bm{\xi}^{*})\geq n\varrho-t\right)\geq 1-2\exp\left[-C^{\prime}\min\left(\frac{t^{2}}{nM_{\psi_{1}}^{2}},\frac{t}{M_{\psi_{1}}}\right)\right].

That is

(ln(𝝃,𝜶(0))ln(𝝃0)nϱt)12exp[Cmin(t2nMψ12,tMψ1)],{\mathbb{P}}\left(l_{n}\left({\bm{\xi}}^{\dagger},{\bm{\alpha}}^{\left(0\right)}\right)-l_{n}\left({\bm{\xi}}_{0}^{\dagger}\right)\geq n\varrho-t\right)\geq 1-2\exp\left[-C^{\prime}\min\left(\frac{t^{2}}{nM_{\psi_{1}}^{2}},\frac{t}{M_{\psi_{1}}}\right)\right],

which proves the lemma. ∎

A.5 Proofs of Theorem 1

Proof of Theorem 1.

Observe that

{S1S^1(tn)}={EMnj(K)tn, for all jS1}.\{S_{1}\subset\hat{S}_{1}(t_{n})\}=\left\{{\rm EM}_{nj}^{(K)}\geq t_{n},\mbox{ for all }j\in S_{1}\right\}.

We have

(S1S^1(tn))1jS1(EMnj(K)<tn).{\mathbb{P}}\left(S_{1}\subset\hat{S}_{1}(t_{n})\right)\geq 1-\sum_{j\in S_{1}}{\mathbb{P}}\left({\rm EM}_{nj}^{(K)}<t_{n}\right).

By Corollary 1, we have

(S1S^1(tn))1sexp(C3n1/2+C4nϑ1/2),{\mathbb{P}}\left(S_{1}\subset\hat{S}_{1}(t_{n})\right)\geq 1-s\exp\left(-C_{3}n^{1/2}+C_{4}n^{\vartheta-1/2}\right), (56)

and the first inequality is proved. Next, observing that

{S1=S^1(tn)}={S1S^1(tn)}{EMnj(K)<tn, for all jS0},\left\{S_{1}=\hat{S}_{1}(t_{n})\right\}=\left\{S_{1}\subset\hat{S}_{1}(t_{n})\right\}\cap\left\{{\rm EM}_{nj}^{(K)}<t_{n},\mbox{ for all }j\in S_{0}\right\},

using Theorem 2, we have

({EMnj(K)<tn, for all jS0})1(ps)((C1n)m/4+(C2n)ϑm).{\mathbb{P}}\left(\left\{{\rm EM}_{nj}^{(K)}<t_{n},\mbox{ for all }j\in S_{0}\right\}\right)\geq 1-(p-s)\left((C_{1}n)^{-m/4}+(C_{2}n)^{-\vartheta m}\right). (57)

Combining (57) with (56) yields the second result. ∎
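Operationally, the screening rule analyzed in Theorem 1 is simple. The following Python sketch computes Ŝ₁(t_n)={j: EM_{nj}^{(K)}≥t_n} feature by feature; it is illustrative only, reuses the em_test_stat function from the sketch in Appendix A.3 above, and the threshold t_n=n^{ϑ} with ϑ=1/4 and the simulated data are arbitrary choices rather than the paper's recommended tuning.

```python
import numpy as np

# Feature screening sketch for Theorem 1: keep feature j iff EM_nj^(K) >= t_n.
# Assumes em_test_stat from the earlier sketch is in scope; the threshold
# exponent 0.25 and the data-generating setup are illustrative.
rng = np.random.default_rng(2)
n, p = 400, 50
X = rng.normal(size=(n, p))
X[: n // 2, :5] += 3.0                    # features 0-4 are cluster-relevant

t_n = n ** 0.25
stats = np.array([em_test_stat(X[:, j], np.array([0.5, 0.5])) for j in range(p)])
S1_hat = np.flatnonzero(stats >= t_n)     # estimated set of relevant features
print("selected features:", S1_hat)
```

The union bounds (56) and (57) in the proof correspond exactly to the two failure modes of this rule: missing a relevant feature and retaining an irrelevant one.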

Appendix B Proofs of the asymptotic results

We first derive an upper bound for ∥𝝃^{(k)}−𝝃₀∥₂. Namely, we have the following result.

Theorem 5.

Assume that x1,,xnx_{1},\dots,x_{n} are independent samples from the homogeneous distribution f(x;𝛉0)f(x;\bm{\theta}_{0}). Under Condition (C1)–(C2) and (WC3)–(WC4), given any initial value 𝛂(0)𝕊G1{\bm{\alpha}}^{(0)}\in\mathbb{S}^{G-1}, for any fixed K>0K>0 and 1kK1\leq k\leq K, we have 𝛂(k)𝛂(0)2=op(1),𝛏(k)𝛏02=Op(n1/4)\left\|{\bm{\alpha}}^{(k)}-{\bm{\alpha}}^{(0)}\right\|_{2}=o_{p}(1),\ \left\|\bm{\xi}^{(k)}-\bm{\xi}_{0}\right\|_{2}=O_{p}\left(n^{-1/4}\right) and g=1Gαg(k)(𝛉g(k)𝛉0)2=Op(n1/2).\left\|\sum_{g=1}^{G}{\alpha}_{g}^{(k)}\left({\bm{\theta}}_{g}^{(k)}-\bm{\theta}_{0}\right)\right\|_{2}=O_{p}\left(n^{-1/2}\right).

Theorem 5 says that under the homogeneous model ℍ₀ the convergence rate of 𝝃^{(k)} is only O_p(n^{−1/4}), rather than the usual O_p(n^{−1/2}) rate. The reason is that under ℍ₀ the heterogeneous model is unidentifiable and, in consequence, the Fisher information matrix is not positive definite. However, the weighted average ∑_{g=1}^{G}α_g^{(k)}𝜽_g^{(k)} remains a √n-consistent estimator of 𝜽₀, i.e., ∥∑_{g=1}^{G}α_g^{(k)}(𝜽_g^{(k)}−𝜽₀)∥₂=O_p(n^{−1/2}).
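For intuition (a heuristic sketch, not part of the proof), consider a scalar parameter, G=2, 𝜶^{(k)}=(1/2,1/2) and the symmetric perturbation 𝜽₁^{(k)}=𝜽₀+u, 𝜽₂^{(k)}=𝜽₀−u. Then ∑_g α_g^{(k)}(𝜽_g^{(k)}−𝜽₀)=0 while ∥𝝃^{(k)}−𝝃₀∥₂=√2|u|, and a Taylor expansion gives

\frac{1}{2}f(x;\theta_{0}+u)+\frac{1}{2}f(x;\theta_{0}-u)=f(x;\theta_{0})+\frac{u^{2}}{2}\frac{\partial^{2}f(x;\theta_{0})}{\partial\theta^{2}}+O(u^{4}),

so the mixture density is perturbed only through u². Estimating u² at the usual n^{−1/2} rate pins down |u| only at the rate n^{−1/4}, matching the rates in Theorem 5.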

B.1 Proofs of Theorem 5

In this subsection, we give the proof of Theorem 5. To prove Theorem 5, we only need to prove the following lemmas.

Lemma 19 (Consistency).

Assume that x1,,xnx_{1},\dots,x_{n} are independent samples from the homogeneous distribution f(x;𝛉0)f(x;\bm{\theta}_{0}). Let (𝛏¯,𝛂¯)\left(\bar{\bm{\xi}},\bar{{\bm{\alpha}}}\right) be an estimator of the parameters in the heterogeneous model φ(x;𝛏,𝛂)=g=1Gαgf(x;𝛉g)\varphi(x;\bm{\xi},{\bm{\alpha}})=\sum_{g=1}^{G}\alpha_{g}f(x;\bm{\theta}_{g}) such that ηα¯g1\eta\leq\bar{\alpha}_{g}\leq 1 for some η(0,0.5]\eta\in(0,0.5]. Assume that there exists a constant cc such that for any nn

ln(𝝃¯,𝜶¯)ln(𝝃0,𝜶0)c>.l_{n}\left(\bar{\bm{\xi}},\bar{{\bm{\alpha}}}\right)-l_{n}\left({\bm{\xi}}_{0},{{\bm{\alpha}}_{0}}\right)\geq c>-\infty.

Then, under Condition (C1)–(C2) and (WC3)–(WC4), we have 𝛏¯𝛏02=op(1)\left\|\bar{\bm{\xi}}-{\bm{\xi}}_{0}\right\|_{2}=o_{p}(1).

Lemma 20 (Convergence rate).

Assume that x1,,xnx_{1},\dots,x_{n} are independent samples from the homogeneous distribution f(x;𝛉0)f(x;\bm{\theta}_{0}). Let (𝛏¯,𝛂¯)\left(\bar{\bm{\xi}},\bar{{\bm{\alpha}}}\right) be an estimator of the parameters in the heterogeneous model φ(x;𝛏,𝛂)=g=1Gαgf(x;𝛉g)\varphi(x;\bm{\xi},{\bm{\alpha}})=\sum_{g=1}^{G}\alpha_{g}f(x;\bm{\theta}_{g}) such that ηα¯g1\eta\leq\bar{\alpha}_{g}\leq 1 for some η(0,0.5]\eta\in(0,0.5]. Assume that there exists a constant cc, such that for any nn,

pln(𝝃¯,𝜶¯)pln(𝝃0,𝜶0)c>.pl_{n}\left(\bar{\bm{\xi}},\bar{{\bm{\alpha}}}\right)-pl_{n}\left({\bm{\xi}}_{0},{{\bm{\alpha}}_{0}}\right)\geq c>-\infty.

Then, under Condition (C1)–(C2) and (WC3)–(WC4) we have

𝝃¯𝝃02=Op(n1/4), and 𝐦¯12=g=1Gα¯g(𝜽¯g𝜽0)2=Op(n1/2),\left\|\bar{\bm{\xi}}-{\bm{\xi}}_{0}\right\|_{2}=O_{p}(n^{-1/4}),\mbox{ and }\|\bar{{\bf m}}_{1}\|_{2}=\left\|\sum_{g=1}^{G}\bar{\alpha}_{g}(\bar{\bm{\theta}}_{g}-\bm{\theta}_{0})\right\|_{2}=O_{p}(n^{-1/2}),

where 𝐦¯1=𝐦1(𝛂¯,𝛏¯,𝛏0)\bar{{\bf m}}_{1}={\bf m}_{1}(\bar{{\bm{\alpha}}},\bar{\bm{\xi}},\bm{\xi}_{0}) and 𝐦1{\bf m}_{1} is defined in (13).

Given an estimator \left(\bar{\bm{\xi}},\bar{{\bm{\alpha}}}\right), define \bar{w}_{gi}={\bar{\alpha}_{g}f(x_{i};\bar{\bm{\theta}}_{g})}/{\varphi\left(x_{i};\bar{{\bm{\alpha}}},\bar{\bm{\xi}}\right)} and let \bar{\alpha}^{(1)}_{g}=\left(\sum_{i=1}^{n}\bar{w}_{gi}+\lambda\right)/\left(n+G\lambda\right) be the one-step EM update of {\bm{\alpha}}. The following lemma states that under \mathbb{H}_{0}, the EM update of {\bm{\alpha}} does not change much.
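For concreteness, here is a minimal sketch (ours; the function and variable names are hypothetical) of this one-step penalized update, assuming the component densities f(x_{i};\bar{\bm{\theta}}_{g}) have been precomputed:

import numpy as np

def one_step_alpha_update(f_vals, alpha, lam):
    """One penalized EM update of the mixing proportions.

    f_vals : (n, G) array with entries f(x_i; theta_g),
    alpha  : current mixing proportions (length G),
    lam    : penalty parameter lambda.
    """
    num = alpha[None, :] * f_vals                  # alpha_g * f(x_i; theta_g)
    w = num / num.sum(axis=1, keepdims=True)       # posterior weights w_gi
    n, G = f_vals.shape
    return (w.sum(axis=0) + lam) / (n + G * lam)   # updated alpha_g^(1)

Note that the update stays on the simplex and is bounded below entrywise by \lambda/(n+G\lambda), which is what keeps the iterates away from the boundary.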

Lemma 21.

Assume 𝛂¯𝛂(0)=op(1)\bar{{\bm{\alpha}}}-{\bm{\alpha}}^{(0)}=o_{p}(1). Then, under the same conditions as in Lemma 20, we have 𝛂¯(1)𝛂(0)=op(1)\bar{{\bm{\alpha}}}^{(1)}-{\bm{\alpha}}^{(0)}=o_{p}(1).

Combining Lemma 20, 21 and the likelihood non-decreasing property of the EM algorithm, we prove Theorem S1.

Proof of Lemma 19.

Since \Xi is compact, the conclusion follows easily from Wald's classical consistency theorem (Van der Vaart, 2000). ∎

Proof of Lemma 20.

Let R1n(𝝃¯,𝜶¯)=2{pln(𝝃¯,𝜶¯)pln(𝝃0,𝜶0)}R_{1n}(\bar{\bm{\xi}},\bar{{\bm{\alpha}}})=2\left\{pl_{n}(\bar{\bm{\xi}},\bar{{\bm{\alpha}}})-pl_{n}(\bm{\xi}_{0},{\bm{\alpha}}_{0})\right\}, where 𝝃¯=(𝜽¯1,,𝜽¯G)\bar{\bm{\xi}}=(\bar{\bm{\theta}}_{1},\ldots,\bar{\bm{\theta}}_{G}). Since p(𝜶)p({\bm{\alpha}}) is maximized at 𝜶0{\bm{\alpha}}_{0}, we have

R1n\displaystyle R_{1n} 2{ln(𝝃¯,𝜶¯)ln(𝝃0,𝜶0)}\displaystyle\leq 2\left\{l_{n}(\bar{\bm{\xi}},\bar{{\bm{\alpha}}})-l_{n}(\bm{\xi}_{0},{\bm{\alpha}}_{0})\right\}
=2i=1nlog(1+g=1Gα¯g(f(xi;𝜽¯g)f(xi;𝜽0)1))\displaystyle=2\sum_{i=1}^{n}\hbox{log}\left(1+\sum_{g=1}^{G}\bar{\alpha}_{g}\left(\frac{f(x_{i};\bar{\bm{\theta}}_{g})}{f(x_{i};{\bm{\theta}}_{0})}-1\right)\right)
=i=1n2log(1+δi),\displaystyle=\sum_{i=1}^{n}2\hbox{log}(1+\delta_{i}), (58)

where δi=g=1Gα¯g(f(xi;𝜽¯g)f(xi;𝜽0)1)\delta_{i}=\sum_{g=1}^{G}\bar{\alpha}_{g}\left(\frac{f(x_{i};\bar{\bm{\theta}}_{g})}{f(x_{i};{\bm{\theta}}_{0})}-1\right). Applying the inequality log(1+x)xx2/2+x3/3\hbox{log}(1+x)\leq x-x^{2}/2+x^{3}/3, we have

R1n2i=1nδii=1nδi2+(2/3)i=1nδi3.R_{1n}\leq 2\sum_{i=1}^{n}\delta_{i}-\sum_{i=1}^{n}\delta_{i}^{2}+(2/3)\sum_{i=1}^{n}\delta_{i}^{3}. (59)

We first deal with i=1nδi\sum_{i=1}^{n}\delta_{i} in (59). By Taylor’s expansion of f(xi;𝜽¯g)f(x_{i};\bar{\bm{\theta}}_{g}) at 𝜽0{\bm{\theta}}_{0}, we have

δi\displaystyle\delta_{i} =g=1Gα¯gf(xi;𝜽¯g)f(xi;𝜽0)f(xi;𝜽0)\displaystyle=\sum_{g=1}^{G}\bar{\alpha}_{g}\frac{f(x_{i};\bar{\bm{\theta}}_{g})-f(x_{i};{\bm{\theta}}_{0})}{f(x_{i};{\bm{\theta}}_{0})}
=h=1dg=1Gα¯g(θ¯ghθ0h)Yih+h=1dg=1Gα¯g(θ¯ghθ0h)2Zih\displaystyle=\sum_{h=1}^{d}\sum_{g=1}^{G}\bar{\alpha}_{g}(\bar{\theta}_{gh}-{\theta}_{0h})Y_{ih}+\sum_{h=1}^{d}\sum_{g=1}^{G}\bar{\alpha}_{g}(\bar{\theta}_{gh}-{\theta}_{0h})^{2}Z_{ih}
+h<dg=1Gα¯g(θ¯ghθ0h)(θ¯gθ0)Uih+εin,\displaystyle+\sum_{h<\ell}^{d}\sum_{g=1}^{G}\bar{\alpha}_{g}(\bar{\theta}_{gh}-{\theta}_{0h})(\bar{\theta}_{g\ell}-{\theta}_{0\ell})U_{ih\ell}+\varepsilon_{in}, (60)

where YihY_{ih}, ZihZ_{ih} and UihU_{ih\ell} are defined in (A.1) and εin\varepsilon_{in} is the remainder term. Let

\bar{{\bf m}}={\bf m}(\bar{{\bm{\alpha}}},\bar{\bm{\xi}},\bm{\xi}_{0})\mbox{ and }\bar{{\bf m}}_{1}={\bf m}_{1}(\bar{{\bm{\alpha}}},\bar{\bm{\xi}},\bm{\xi}_{0}),

where 𝐦{\bf m} and 𝐦1{\bf m}_{1} are defined in (14) and (13). Then, we write (B.1) as

δi=𝐦¯T𝐛i+εin,\delta_{i}=\bar{{\bf m}}^{\rm T}{\bf b}_{i}+\varepsilon_{in}, (61)

where 𝐛i{\bf b}_{i} is defined in (A.1).

Step 1. Controlling the remainder term εin\bm{\varepsilon}_{in}. We aim to prove

i=1nεin=op(1)+op(n)(𝐦¯22).\sum_{i=1}^{n}\varepsilon_{in}=o_{p}(1)+o_{p}(n)\left(\|\bar{{\bf m}}\|_{2}^{2}\right). (62)

In order to show this, we note that i=1nεin\sum_{i=1}^{n}\varepsilon_{in} can be written as

i=1nεin\displaystyle\sum_{i=1}^{n}\varepsilon_{in} =i=1nj1=1dj3=1dg=1Gα¯gs=13(θ¯gjsθ0js)(3f(xi;𝜽0)θj1θj2θj3)/(3!f(xi;𝜽0))\displaystyle=\sum_{i=1}^{n}\sum_{j_{1}=1}^{d}\cdots\sum_{j_{3}=1}^{d}\sum_{g=1}^{G}\bar{\alpha}_{g}\prod_{s=1}^{3}(\bar{\theta}_{gj_{s}}-{\theta}_{0j_{s}})\left(\frac{\partial^{3}f(x_{i};\bm{\theta}_{0})}{\partial\theta_{j_{1}}\partial\theta_{j_{2}}\partial\theta_{j_{3}}}\right)\bigg{/}(3!f(x_{i};\bm{\theta}_{0}))
+i=1nj1=1dj4=1dg=1Gα¯gs=14(θ¯gjsθ0js)(4f(xi;𝜽0)θj1θj2θj3θj4)/(4!f(xi;𝜽0))\displaystyle+\sum_{i=1}^{n}\sum_{j_{1}=1}^{d}\cdots\sum_{j_{4}=1}^{d}\sum_{g=1}^{G}\bar{\alpha}_{g}\prod_{s=1}^{4}(\bar{\theta}_{gj_{s}}-{\theta}_{0j_{s}})\left(\frac{\partial^{4}f(x_{i};\bm{\theta}_{0})}{\partial\theta_{j_{1}}\partial\theta_{j_{2}}\partial\theta_{j_{3}}\partial\theta_{j_{4}}}\right)\bigg{/}(4!f(x_{i};\bm{\theta}_{0}))
+i=1nj1=1dj5=1dg=1Gα¯gs=15(θ¯gjsθ0js)(5f(xi;𝜻g(xi))θj1θj2θj3θj4θj5)/(5!f(xi;𝜽0))\displaystyle+\sum_{i=1}^{n}\sum_{j_{1}=1}^{d}\cdots\sum_{j_{5}=1}^{d}\sum_{g=1}^{G}\bar{\alpha}_{g}\prod_{s=1}^{5}(\bar{\theta}_{gj_{s}}-{\theta}_{0j_{s}})\left(\frac{\partial^{5}f(x_{i};{\bm{\zeta}}_{g}(x_{i}))}{\partial\theta_{j_{1}}\partial\theta_{j_{2}}\partial\theta_{j_{3}}\partial\theta_{j_{4}}\partial\theta_{j_{5}}}\right)\bigg{/}(5!f(x_{i};\bm{\theta}_{0}))
=I+II+III,\displaystyle={\rm I+II+III},

where 𝜻g(xi){\bm{\zeta}}_{g}(x_{i}) lies between 𝜽¯g\bar{\bm{\theta}}_{g} and 𝜽0\bm{\theta}_{0}.

For I, note that the product term \prod_{s=1}^{3}(\bar{\theta}_{gj_{s}}-{\theta}_{0j_{s}}) does not involve the index i, so the summation over i can be taken innermost. Further, for any fixed j_{1},j_{2},j_{3}, by the Cauchy–Schwarz inequality, we have

s=13|θ¯gjsθ0js|(j=1d|θ¯gjθ0j|)3(d)3𝜽¯g𝜽023.\prod_{s=1}^{3}\left|\bar{\theta}_{gj_{s}}-{\theta}_{0j_{s}}\right|\leq\left(\sum_{j=1}^{d}\left|\bar{\theta}_{gj}-{\theta}_{0j}\right|\right)^{3}\leq\left(\sqrt{d}\right)^{3}\left\|\bar{\bm{\theta}}_{g}-{\bm{\theta}}_{0}\right\|_{2}^{3}. (63)

Hence,

|I|\displaystyle|{\rm I}| j1=1dj3=1d|g=1Gα¯gs=13(θ¯gjsθ0js)(i=1n(3f(xi;𝜽0)θj1θj2θj3)/(3!f(xi;𝜽0)))|\displaystyle\leq\sum_{j_{1}=1}^{d}\cdots\sum_{j_{3}=1}^{d}\left|\sum_{g=1}^{G}\bar{\alpha}_{g}\prod_{s=1}^{3}(\bar{\theta}_{gj_{s}}-{\theta}_{0j_{s}})\left(\sum_{i=1}^{n}\left(\frac{\partial^{3}f(x_{i};\bm{\theta}_{0})}{\partial\theta_{j_{1}}\partial\theta_{j_{2}}\partial\theta_{j_{3}}}\right)\bigg{/}(3!f(x_{i};\bm{\theta}_{0}))\right)\right|
j1=1dj3=1dg=1Gα¯g|s=13(θ¯gjsθ0js)||(i=1n(3f(xi;𝜽0)θj1θj2θj3)/(3!f(xi;𝜽0)))|\displaystyle\leq\sum_{j_{1}=1}^{d}\cdots\sum_{j_{3}=1}^{d}\sum_{g=1}^{G}\bar{\alpha}_{g}\left|\prod_{s=1}^{3}(\bar{\theta}_{gj_{s}}-{\theta}_{0j_{s}})\right|\left|\left(\sum_{i=1}^{n}\left(\frac{\partial^{3}f(x_{i};\bm{\theta}_{0})}{\partial\theta_{j_{1}}\partial\theta_{j_{2}}\partial\theta_{j_{3}}}\right)\bigg{/}(3!f(x_{i};\bm{\theta}_{0}))\right)\right|
j1=1dj3=1d|(g=1Gα¯g(d)3𝜽¯g𝜽023)(i=1n(3f(xi;𝜽0)θj1θj2θj3)/(3!f(xi;𝜽0)))|.\displaystyle\leq\sum_{j_{1}=1}^{d}\cdots\sum_{j_{3}=1}^{d}\left|\left(\sum_{g=1}^{G}\bar{\alpha}_{g}\left(\sqrt{d}\right)^{3}\|\bar{\bm{\theta}}_{g}-{\bm{\theta}}_{0}\|_{2}^{3}\right)\left(\sum_{i=1}^{n}\left(\frac{\partial^{3}f(x_{i};\bm{\theta}_{0})}{\partial\theta_{j_{1}}\partial\theta_{j_{2}}\partial\theta_{j_{3}}}\right)\bigg{/}(3!f(x_{i};\bm{\theta}_{0}))\right)\right|.

Also for any fixed j1,j2,j3j_{1},j_{2},j_{3}, let Di(j1,j2,j3)=(3f(xi;𝜽0)θj1θj2θj3)/(3!f(xi;𝜽0))D_{i}(j_{1},j_{2},j_{3})=\left(\frac{\partial^{3}f(x_{i};\bm{\theta}_{0})}{\partial\theta_{j_{1}}\partial\theta_{j_{2}}\partial\theta_{j_{3}}}\right)\big{/}(3!f(x_{i};\bm{\theta}_{0})). By Condition (WC3), we have 𝔼(Di(j1,j2,j3))=0{\mathbb{E}}\left(D_{i}(j_{1},j_{2},j_{3})\right)=0 and Var(Di(j1,j2,j3))<\hbox{Var}\left(D_{i}(j_{1},j_{2},j_{3})\right)<\infty. Applying the Central Limit Theorem, we have i=1nDi(j1,j2,j3)=Op(n1/2)\sum_{i=1}^{n}D_{i}(j_{1},j_{2},j_{3})=O_{p}\left(n^{1/2}\right). Hence, we conclude that

|I|=Op(n1/2)(g=1Gα¯g(d)3𝜽¯g𝜽023).|{\rm I}|=O_{p}\left(n^{1/2}\right)\left(\sum_{g=1}^{G}\bar{\alpha}_{g}\left(\sqrt{d}\right)^{3}\|\bar{\bm{\theta}}_{g}-{\bm{\theta}}_{0}\|_{2}^{3}\right). (64)

Similarly, for II, we have

|II|=Op(n1/2)(g=1Gα¯g(d)4𝜽¯g𝜽024).|{\rm II}|=O_{p}\left(n^{1/2}\right)\left(\sum_{g=1}^{G}\bar{\alpha}_{g}\left(\sqrt{d}\right)^{4}\|\bar{\bm{\theta}}_{g}-{\bm{\theta}}_{0}\|_{2}^{4}\right). (65)

For III, from Condition (WC3), for any j1,,j5j_{1},\ldots,j_{5}, we have

sup𝜽𝜽0τ|5f(x,𝜽)θj1θj5/f(x,𝜽0)|g(x;𝜽0).\sup_{\|\bm{\theta}-\bm{\theta}_{0}\|\leq\tau}\left|\frac{\partial^{5}f(x,\bm{\theta})}{\partial\theta_{j_{1}}\cdots\partial\theta_{j_{5}}}\bigg{/}f(x,\bm{\theta}_{0})\right|\leq g(x;\bm{\theta}_{0}).

By the law of large numbers, we have

i=1ng(xi;𝜽0)=Op(n).\sum_{i=1}^{n}g(x_{i};\bm{\theta}_{0})=O_{p}(n).

Using the consistency of 𝜽¯g\bar{\bm{\theta}}_{g} from Lemma 19, we have

|III|=Op(n)(g=1Gα¯g(d)5𝜽¯g𝜽025).|{\rm III}|=O_{p}(n)\left(\sum_{g=1}^{G}\bar{\alpha}_{g}\left(\sqrt{d}\right)^{5}\|\bar{\bm{\theta}}_{g}-{\bm{\theta}}_{0}\|_{2}^{5}\right). (66)

Since 𝜽¯g𝜽02=op(1)\|\bar{\bm{\theta}}_{g}-{\bm{\theta}}_{0}\|_{2}=o_{p}(1), we get

\left|\sum_{i=1}^{n}\varepsilon_{in}\right|\leq\left|o_{p}\left(n^{1/2}\right)\right|\sum_{g=1}^{G}\bar{\alpha}_{g}\|\bar{\bm{\theta}}_{g}-{\bm{\theta}}_{0}\|_{2}^{2}+|o_{p}(n)|\sum_{g=1}^{G}\bar{\alpha}_{g}\|\bar{\bm{\theta}}_{g}-{\bm{\theta}}_{0}\|_{2}^{4}.

On the one hand, we have

|op(n)|g=1Gα¯g𝜽¯g𝜽024|op(n)|𝐦¯22.|o_{p}(n)|\sum_{g=1}^{G}\bar{\alpha}_{g}\|\bar{\bm{\theta}}_{g}-{\bm{\theta}}_{0}\|_{2}^{4}\leq|o_{p}(n)|\cdot\|\bar{{\bf m}}\|_{2}^{2}. (67)

On the other hand, we have

|op(n1/2)|g=1Gα¯g𝜽¯g𝜽022|op(n1/2)|𝐦¯2|op(1)|+|op(n)|𝐦¯22,\left|o_{p}\left(n^{1/2}\right)\right|\sum_{g=1}^{G}\bar{\alpha}_{g}\|\bar{\bm{\theta}}_{g}-{\bm{\theta}}_{0}\|_{2}^{2}\leq\left|o_{p}\left(n^{1/2}\right)\right|\|\bar{{\bf m}}\|_{2}\leq|o_{p}(1)|+|o_{p}(n)|\cdot\|\bar{{\bf m}}\|_{2}^{2}, (68)

where the last inequality is from

|op(n1/2)|𝐦¯2=|op(1)|n1/2𝐦¯2|op(1)|(1+n𝐦¯222)=|op(1)|+|op(n)|𝐦¯22.\left|o_{p}\left(n^{1/2}\right)\right|\|\bar{{\bf m}}\|_{2}=\left|o_{p}(1)\right|n^{1/2}\|\bar{{\bf m}}\|_{2}\leq\left|o_{p}(1)\right|\left(\frac{1+n\|\bar{{\bf m}}\|_{2}^{2}}{2}\right)=|o_{p}(1)|+|o_{p}(n)|\cdot\|\bar{{\bf m}}\|_{2}^{2}.

Combining (67) with (68), we get (62).

Step 2. Obtaining the convergence rate. From (62), we have

i=1nδi=i=1n(𝐦¯T𝐛i+εin)=i=1n𝐦¯T𝐛i+op(1)+op(n)𝐦¯22.\sum_{i=1}^{n}\delta_{i}=\sum_{i=1}^{n}\left(\bar{{\bf m}}^{\rm T}{\bf b}_{i}+\varepsilon_{in}\right)=\sum_{i=1}^{n}\bar{{\bf m}}^{\rm T}{\bf b}_{i}+o_{p}(1)+o_{p}(n)\|\bar{{\bf m}}\|_{2}^{2}. (69)

Similarly, we can prove

i=1nδi2=i=1n(𝐦¯T𝐛i+εin)2=i=1n(𝐦¯T𝐛i)2+op(1)+op(n)𝐦¯22,\sum_{i=1}^{n}\delta_{i}^{2}=\sum_{i=1}^{n}\left(\bar{{\bf m}}^{\rm T}{\bf b}_{i}+\varepsilon_{in}\right)^{2}=\sum_{i=1}^{n}\left(\bar{{\bf m}}^{\rm T}{\bf b}_{i}\right)^{2}+o_{p}(1)+o_{p}(n)\|\bar{{\bf m}}\|_{2}^{2}, (70)

and

i=1nδi3=i=1n(𝐦¯T𝐛i+εin)3=i=1n(𝐦¯T𝐛i)3+op(1)+op(n)𝐦¯22.\sum_{i=1}^{n}\delta_{i}^{3}=\sum_{i=1}^{n}\left(\bar{{\bf m}}^{\rm T}{\bf b}_{i}+\varepsilon_{in}\right)^{3}=\sum_{i=1}^{n}\left(\bar{{\bf m}}^{\rm T}{\bf b}_{i}\right)^{3}+o_{p}(1)+o_{p}(n)\|\bar{{\bf m}}\|_{2}^{2}. (71)

In fact, for (70), we have

i=1n(𝐦¯T𝐛i+εin)2=i=1n(𝐦¯T𝐛i)2+i=1n(εin2+𝐦¯T𝐛iεin).\sum_{i=1}^{n}\left(\bar{{\bf m}}^{\rm T}{\bf b}_{i}+\varepsilon_{in}\right)^{2}=\sum_{i=1}^{n}\left(\bar{{\bf m}}^{\rm T}{\bf b}_{i}\right)^{2}+\sum_{i=1}^{n}\left(\varepsilon_{in}^{2}+\bar{{\bf m}}^{\rm T}{\bf b}_{i}\varepsilon_{in}\right).

By Taylor’s expansion, we have

εin=j1=1dj3=1dg=1Gα¯gs=13(θ¯gjsθ0js)(3f(xi;𝜻g(xi))θj1θj2θj3)/(3!f(xi;𝜽0)),\varepsilon_{in}=\sum_{j_{1}=1}^{d}\cdots\sum_{j_{3}=1}^{d}\sum_{g=1}^{G}\bar{\alpha}_{g}\prod_{s=1}^{3}(\bar{\theta}_{gj_{s}}-{\theta}_{0j_{s}})\left(\frac{\partial^{3}f(x_{i};{\bm{\zeta}}_{g}(x_{i}))}{\partial\theta_{j_{1}}\partial\theta_{j_{2}}\partial\theta_{j_{3}}}\right)\bigg{/}(3!f(x_{i};\bm{\theta}_{0})), (72)

where 𝜻g(xi){\bm{\zeta}}_{g}(x_{i}) lies between 𝜽¯g\bar{\bm{\theta}}_{g} and 𝜽0\bm{\theta}_{0}. Note that here we only need to represent the remainder term εin\varepsilon_{in} in terms of the third derivatives. Again, from Condition (WC3), for any fixed j1,,j3j_{1},\ldots,j_{3}, we have

sup𝜽𝜽02τ|(3f(x;𝜽)θj1θj2θj3)/(3!f(x;𝜽0))|g(x;𝜽0).\sup_{\|\bm{\theta}-\bm{\theta}_{0}\|_{2}\leq\tau}\left|\left(\frac{\partial^{3}f(x;\bm{\theta})}{\partial\theta_{j_{1}}\partial\theta_{j_{2}}\partial\theta_{j_{3}}}\right)\bigg{/}(3!f(x;\bm{\theta}_{0}))\right|\leq g(x;\bm{\theta}_{0}).

Note that \left|\bar{{\bf m}}^{\rm T}{\bf b}_{i}\right|\leq\|\bar{{\bf m}}\|_{2}\|{\bf b}_{i}\|_{2}. Since \bar{\bm{\xi}}-\bm{\xi}_{0}=o_{p}(1), together with inequality (63), Condition (WC3) and the law of large numbers, \sum_{i=1}^{n}\left|\bar{{\bf m}}^{\rm T}{\bf b}_{i}\varepsilon_{in}\right| can be bounded by

i=1n|𝐦¯T𝐛iεin|i=1n𝐦¯2𝐛i2|εin|\displaystyle\quad\sum_{i=1}^{n}\left|\bar{{\bf m}}^{\rm T}{\bf b}_{i}\varepsilon_{in}\right|\leq\sum_{i=1}^{n}\|\bar{{\bf m}}\|_{2}\|{\bf b}_{i}\|_{2}|\varepsilon_{in}|
𝐦¯2i=1n𝐛i2g=1G(d)3𝜽¯g𝜽023j1=1dj3=1d|(3f(xi;𝜻g(xi))θj1θj2θj3)/(3!f(xi;𝜽0))|\displaystyle\leq\|\bar{{\bf m}}\|_{2}\sum_{i=1}^{n}\|{\bf b}_{i}\|_{2}\sum_{g=1}^{G}\left(\sqrt{d}\right)^{3}\|\bar{\bm{\theta}}_{g}-{\bm{\theta}}_{0}\|_{2}^{3}\sum_{j_{1}=1}^{d}\cdots\sum_{j_{3}=1}^{d}\left|\left(\frac{\partial^{3}f(x_{i};{\bm{\zeta}}_{g}(x_{i}))}{\partial\theta_{j_{1}}\partial\theta_{j_{2}}\partial\theta_{j_{3}}}\right)\bigg{/}(3!f(x_{i};\bm{\theta}_{0}))\right|
𝐦¯2g=1G(d)3𝜽¯g𝜽023d3i=1n𝐛i2|g(xi;𝜽0)|\displaystyle\leq\|\bar{{\bf m}}\|_{2}\sum_{g=1}^{G}\left(\sqrt{d}\right)^{3}\|\bar{\bm{\theta}}_{g}-{\bm{\theta}}_{0}\|_{2}^{3}d^{3}\sum_{i=1}^{n}\|{\bf b}_{i}\|_{2}|g(x_{i};\bm{\theta}_{0})|
=op(1)Op(n)𝐦¯22=op(n)𝐦¯22.\displaystyle=o_{p}(1)O_{p}(n)\|\bar{{\bf m}}\|_{2}^{2}=o_{p}(n)\|\bar{{\bf m}}\|_{2}^{2}.

For the \sum_{i=1}^{n}\varepsilon_{in}^{2} term, when \|\bar{\bm{\xi}}-\bm{\xi}_{0}\|_{2}\leq\tau, we have

εin2(d3g=1G(d)3𝜽¯g𝜽023)2g2(xi;𝜽0).\varepsilon_{in}^{2}\leq\left(d^{3}\sum_{g=1}^{G}\left(\sqrt{d}\right)^{3}\|\bar{\bm{\theta}}_{g}-{\bm{\theta}}_{0}\|_{2}^{3}\right)^{2}g^{2}(x_{i};\bm{\theta}_{0}). (73)

Since \|\bar{\bm{\xi}}-\bm{\xi}_{0}\|_{2}\overset{p}{\longrightarrow}0, we have \sum_{i=1}^{n}\varepsilon_{in}^{2}=o_{p}(n)\|\bar{{\bf m}}\|_{2}^{2}. Thus, we prove (70). Similarly, we can prove (71).

Finally, by the law of large numbers, we have

i=1n(𝐦¯T𝐛i)2=n𝐦¯T𝐁𝐦¯(1+op(1)),\sum_{i=1}^{n}\left(\bar{{\bf m}}^{\rm T}{\bf b}_{i}\right)^{2}=n\bar{{\bf m}}^{\rm T}{\bf B}\bar{{\bf m}}(1+o_{p}(1)), (74)

and

|i=1n(𝐦¯T𝐛i)3|i=1n𝐛i23𝐦¯23Op(n)𝐦¯23=op(n)𝐦¯22,\left|\sum_{i=1}^{n}\left(\bar{{\bf m}}^{\rm T}{\bf b}_{i}\right)^{3}\right|\leq\sum_{i=1}^{n}\|{\bf b}_{i}\|_{2}^{3}\|\bar{{\bf m}}\|_{2}^{3}\leq O_{p}(n)\|\bar{{\bf m}}\|^{3}_{2}=o_{p}(n)\|\bar{{\bf m}}\|^{2}_{2}, (75)

where i=1n𝐛i23=Op(n)\sum_{i=1}^{n}\|{\bf b}_{i}\|_{2}^{3}=O_{p}(n) is from Condition (WC3). Since

n𝐦¯T𝐁𝐦¯λmin(𝐁)n𝐦¯22=Op(n)𝐦¯22,n\bar{{\bf m}}^{\rm T}{\bf B}\bar{{\bf m}}\geq\lambda_{\rm min}({\bf B})n\|\bar{{\bf m}}\|^{2}_{2}=O_{p}(n)\|\bar{{\bf m}}\|^{2}_{2},

we conclude that op(n)𝐦¯22=op(1)n𝐦¯T𝐁𝐦¯o_{p}(n)\|\bar{{\bf m}}\|^{2}_{2}=o_{p}(1)n\bar{{\bf m}}^{\rm T}{\bf B}\bar{{\bf m}}. Combining (69) – (71) with (74) – (75), we get

R1n(𝝃¯,𝜶¯)2𝐦¯Ti=1n𝐛in𝐦¯T𝐁𝐦¯(1+op(1))+op(1).R_{1n}(\bar{\bm{\xi}},\bar{{\bm{\alpha}}})\leq 2\bar{{\bf m}}^{\rm T}\sum_{i=1}^{n}{\bf b}_{i}-n\bar{{\bf m}}^{\rm T}{\bf B}\bar{{\bf m}}(1+o_{p}(1))+o_{p}(1). (76)

Since we know R1n2cR_{1n}\geq 2c, the inequality (76) implies that

2cn2𝐦¯2i=1n𝐛in2(λmin(𝐁)+op(1))𝐦¯22+op(1/n).\frac{2c}{n}\leq 2\|\bar{{\bf m}}\|_{2}\left\|\frac{\sum_{i=1}^{n}{\bf b}_{i}}{n}\right\|_{2}-(\lambda_{\rm min}({\bf B})+o_{p}(1))\|\bar{{\bf m}}\|_{2}^{2}+o_{p}(1/n).

Applying the inequality

λmin(𝐁)2𝐦¯22+2λmin(𝐁)i=1n𝐛in222𝐦¯2i=1n𝐛in2,\frac{\lambda_{\rm min}({\bf B})}{2}\|\bar{{\bf m}}\|_{2}^{2}+\frac{2}{\lambda_{\rm min}({\bf B})}\left\|\frac{\sum_{i=1}^{n}{\bf b}_{i}}{n}\right\|_{2}^{2}\geq 2\|\bar{{\bf m}}\|_{2}\left\|\frac{\sum_{i=1}^{n}{\bf b}_{i}}{n}\right\|_{2}, (77)

we conclude that

(λmin(𝐁)2+op(1))𝐦¯222λmin(𝐁)i=1n𝐛in222cn+op(1/n).\left(\frac{\lambda_{\rm min}({\bf B})}{2}+o_{p}(1)\right)\|\bar{{\bf m}}\|_{2}^{2}\leq\frac{2}{\lambda_{\rm min}({\bf B})}\left\|\frac{\sum_{i=1}^{n}{\bf b}_{i}}{n}\right\|_{2}^{2}-\frac{2c}{n}+o_{p}(1/n). (78)

Using the fact that \left\|\frac{\sum_{i=1}^{n}{\bf b}_{i}}{n}\right\|_{2}^{2}=O_{p}(1/n), (78) implies that \|\bar{{\bf m}}\|_{2}=O_{p}\left(n^{-1/2}\right), and thus \|\bar{{\bf m}}_{1}\|_{2}=O_{p}(n^{-1/2}). Since \eta\leq\bar{\alpha}_{g}\leq 1, we have \|\bar{\bm{\theta}}_{g}-{\bm{\theta}}_{0}\|_{2}=O_{p}\left(n^{-1/4}\right) (g=1,\dots,G), and thus we complete the proof. ∎

Proof of Lemma 21.

The first step is to show

n1i=1nw¯giα¯g=op(1).n^{-1}\sum_{i=1}^{n}\bar{w}_{gi}-\bar{\alpha}_{g}=o_{p}(1). (79)

By the definition of w¯gi\bar{w}_{gi}, we have

w¯giα¯g=α¯gf(xi;𝜽¯g)φ(xi;𝝃¯,α¯)α¯g.\bar{w}_{gi}-\bar{\alpha}_{g}=\frac{\bar{\alpha}_{g}f(x_{i};\bar{\bm{\theta}}_{g})}{\varphi(x_{i};\bar{\bm{\xi}},\bar{\alpha})}-\bar{\alpha}_{g}.

Let δgi=f(xi;𝜽¯g)f(xi;𝜽0)1\delta_{gi}=\frac{f(x_{i};\bar{\bm{\theta}}_{g})}{f(x_{i};{\bm{\theta}}_{0})}-1 and δi=φ(xi;𝝃¯,α¯)f(xi;𝜽0)1\delta_{i}=\frac{\varphi(x_{i};\bar{\bm{\xi}},\bar{\alpha})}{f(x_{i};{\bm{\theta}}_{0})}-1. We can rewrite w¯giα¯g\bar{w}_{gi}-\bar{\alpha}_{g} as

w¯giα¯g=α¯g1+δgi1+δiα¯g=α¯gδgiδi1+δi.\bar{w}_{gi}-\bar{\alpha}_{g}=\bar{\alpha}_{g}\frac{1+\delta_{gi}}{1+\delta_{i}}-\bar{\alpha}_{g}=\bar{\alpha}_{g}\frac{\delta_{gi}-\delta_{i}}{1+\delta_{i}}.

Thus, we only need to prove

n^{-1}\sum_{i=1}^{n}\bar{w}_{gi}-\bar{\alpha}_{g}=\bar{\alpha}_{g}n^{-1}\sum_{i=1}^{n}\frac{\delta_{gi}-\delta_{i}}{1+\delta_{i}}=o_{p}(1). (80)

To prove (80), we first prove maxi|δi|=op(1).\max_{i}|\delta_{i}|=o_{p}(1). As in the proof of Lemma 20, (61) gives

δi=𝐦¯T𝐛i+εin,\delta_{i}=\bar{{\bf m}}^{\rm T}{\bf b}_{i}+\varepsilon_{in},

and (73) gives

\left|\varepsilon_{in}\right|\leq\left(d^{3}\sum_{g=1}^{G}\left(\sqrt{d}\right)^{3}\|\bar{\bm{\theta}}_{g}-{\bm{\theta}}_{0}\|_{2}^{3}\right)g(x_{i};\bm{\theta}_{0}).

Then, it remains to show \max_{i}\left|\bar{{\bf m}}^{\rm T}{\bf b}_{i}\right|=o_{p}(1) and \max_{i}\|\bar{\bm{\theta}}_{g}-{\bm{\theta}}_{0}\|_{2}^{3}g(x_{i};\bm{\theta}_{0})=o_{p}(1). Since \|\bar{{\bf m}}\|_{2}=O_{p}\left(n^{-1/2}\right), we have n^{3/8}\|\bar{{\bf m}}\|_{2}=o_{p}(1). In order to show \max_{i}\left|\bar{{\bf m}}^{\rm T}{\bf b}_{i}\right|=o_{p}(1), we only need to prove

maxin3/8𝐛i2=op(1).\max_{i}n^{-3/8}\|{\bf b}_{i}\|_{2}=o_{p}(1).

For any ϵ>0\epsilon>0, we have

(maxin3/8𝐛i2ϵ)i=1n(n3/8𝐛i2ϵ).\mathbb{P}\left(\max_{i}n^{-3/8}\|{\bf b}_{i}\|_{2}\geq\epsilon\right)\leq\sum_{i=1}^{n}\mathbb{P}\left(n^{-3/8}\|{\bf b}_{i}\|_{2}\geq\epsilon\right).

By Chebyshev’s inequality, we have

(n3/8𝐛i2ϵ)𝔼(𝐛i23)n9/8ϵ3.\mathbb{P}\left(n^{-3/8}\|{\bf b}_{i}\|_{2}\geq\epsilon\right)\leq\frac{{\mathbb{E}}\left(\|{\bf b}_{i}\|_{2}^{3}\right)}{n^{9/8}\epsilon^{3}}.

Thus, we have

(maxin3/8𝐛i2ϵ)𝔼(𝐛123)n1/8ϵ3.\mathbb{P}\left(\max_{i}n^{-3/8}\|{\bf b}_{i}\|_{2}\geq\epsilon\right)\leq\frac{{\mathbb{E}}\left(\|{\bf b}_{1}\|_{2}^{3}\right)}{n^{1/8}\epsilon^{3}}.

It follows that \max_{i}\left|\bar{{\bf m}}^{\rm T}{\bf b}_{i}\right|=o_{p}(1). Similarly, we can prove \max_{i}\|\bar{\bm{\theta}}_{g}-{\bm{\theta}}_{0}\|_{2}^{3}g(x_{i};\bm{\theta}_{0})=o_{p}(1), and thus \max_{i}|\delta_{i}|=o_{p}(1). In order to show (80), by the fact that \max_{i}|\delta_{i}|=o_{p}(1), we only need to show that

n1i=1n|δgi|=n1i=1n|f(xi;𝜽¯g)f(xi;𝜽0)f(xi;𝜽0)|=op(1),n^{-1}\sum_{i=1}^{n}\left|\delta_{gi}\right|=n^{-1}\sum_{i=1}^{n}\left|\frac{f(x_{i};\bar{\bm{\theta}}_{g})-f(x_{i};{\bm{\theta}}_{0})}{f(x_{i};{\bm{\theta}}_{0})}\right|=o_{p}(1),

which is similar to the proof of (69) without the summation over g. More specifically, by Lagrange's mean value theorem, we have

\frac{f(x_{i};\bar{\bm{\theta}}_{g})-f(x_{i};{\bm{\theta}}_{0})}{f(x_{i};{\bm{\theta}}_{0})}=\sum_{j=1}^{d}(\bar{\theta}_{gj}-{\theta}_{0j})\left(\frac{\partial f(x_{i};{\bm{\zeta}}_{g}(x_{i}))}{\partial\theta_{j}}\right)\bigg{/}f(x_{i};\bm{\theta}_{0}),

where {\bm{\zeta}}_{g}(x_{i}) lies between \bar{\bm{\theta}}_{g} and \bm{\theta}_{0}. By Condition (WC3), when \|\bar{\bm{\theta}}_{g}-\bm{\theta}_{0}\|_{2}\leq\tau, we have

|(f(xi;𝜻g(xi))θj)/f(xi;𝜽0)|g(xi;𝜽0),\left|\left(\frac{\partial f(x_{i};{\bm{\zeta}}_{g}(x_{i}))}{\partial\theta_{j}}\right)\bigg{/}f(x_{i};\bm{\theta}_{0})\right|\leq g(x_{i};\bm{\theta}_{0}),

and {\mathbb{E}}(g(x_{i};\bm{\theta}_{0}))<\infty. Since \|\bar{\bm{\theta}}_{g}-\bm{\theta}_{0}\|_{2}=o_{p}(1), we prove that n^{-1}\sum_{i=1}^{n}\left|\delta_{gi}\right|=o_{p}(1).

Then, it suffices to prove that n1i=1nw¯giα¯g(1)=op(1)n^{-1}\sum_{i=1}^{n}\bar{w}_{gi}-\bar{\alpha}^{(1)}_{g}=o_{p}(1). Note that

n^{-1}\sum_{i=1}^{n}\bar{w}_{gi}-\bar{\alpha}^{(1)}_{g}=\frac{\sum_{i=1}^{n}\bar{w}_{gi}}{n}-\frac{\sum_{i=1}^{n}\bar{w}_{gi}+\lambda}{n+G\lambda}=\frac{G\lambda\sum_{i=1}^{n}\bar{w}_{gi}-n\lambda}{n(n+G\lambda)}.

By (80), we have n1i=1nw¯gi=α¯g+op(1).n^{-1}\sum_{i=1}^{n}\bar{w}_{gi}=\bar{\alpha}_{g}+o_{p}(1). It implies that

\frac{G\lambda\sum_{i=1}^{n}\bar{w}_{gi}-n\lambda}{n(n+G\lambda)}=o_{p}(1),

which proves the lemma. ∎

B.2 Proof of Theorem 4

In this subsection, we give the detailed proof of Theorem 4. Let r=min(G1,d)r=\min(G-1,d). Recall that

\mathcal{V}=\left\{{\rm vech}({\bf V}):{\bf V}\in\mathbb{R}^{d\times d}\mbox{ is symmetric},\ {\rm rank}({\bf V})\leq r,\ {\bf V}\succeq 0\right\}. (81)

We require the following lemma.

Lemma 22.

For any fixed 𝛂ΔG1{\bm{\alpha}}\in\Delta^{G-1}, define

𝒱𝜶={𝐯:𝐯=vech(𝐀𝐀T),g=1Gαg𝐀g=𝟎,𝐀d×G}.\mathcal{V}_{{\bm{\alpha}}}=\left\{{\bf v}:{\bf v}={\rm vech}({\bf A}{\bf A}^{\rm T}),\sum_{g=1}^{G}{{\alpha}_{g}}{{\bf A}}_{\cdot g}={\bf 0},{\bf A}\in\mathbb{R}^{d\times G}\right\}.

Then, we have 𝒱𝛂𝒱\mathcal{V}_{{\bm{\alpha}}}\equiv\mathcal{V}, where 𝒱\mathcal{V} is defined in (81).

Proof of Theorem 4.

Let

R0n(𝝃^0,𝜶0)=2{pln(𝝃^0,𝜶0)pln(𝝃0,𝜶0)},R_{0n}(\hat{\bm{\xi}}_{0},{\bm{\alpha}}_{0})=2\left\{pl_{n}\left(\hat{\bm{\xi}}_{0},{\bm{\alpha}}_{0}\right)-pl_{n}({\bm{\xi}}_{0},{\bm{\alpha}}_{0})\right\},

and

R1n(𝝃(k),𝜶(k))=2{pln(𝝃(k),𝜶(k))pln(𝝃0,𝜶0)}.R_{1n}\left({\bm{\xi}}^{(k)},{{\bm{\alpha}}}^{(k)}\right)=2\left\{pl_{n}\left({\bm{\xi}}^{(k)},{{\bm{\alpha}}}^{(k)}\right)-pl_{n}(\bm{\xi}_{0},{\bm{\alpha}}_{0})\right\}.

Firstly, we claim that

R0n(𝝃^0,𝜶0)=(i=1n𝐛1i)T(n𝐁11)1(i=1n𝐛1i)+op(1).R_{0n}(\hat{\bm{\xi}}_{0},{\bm{\alpha}}_{0})=\left(\sum_{i=1}^{n}{\bf b}_{1i}\right)^{\rm T}\left(n{\bf B}_{11}\right)^{-1}\left(\sum_{i=1}^{n}{\bf b}_{1i}\right)+o_{p}(1).

In fact, under Condition (C1)–(C2) and Condition (WC3)–(WC4), this is the classical expansion of the log-likelihood ratio. By the likelihood non-decreasing property of the EM algorithm, we have

pln(𝝃(k),𝜶(k))pln(𝝃(0),𝜶(0))pln(𝝃0,𝜶0)+p0,pl_{n}\left({\bm{\xi}}^{(k)},{{\bm{\alpha}}}^{(k)}\right)\geq pl_{n}\left({\bm{\xi}}^{(0)},{{\bm{\alpha}}}^{(0)}\right)\geq pl_{n}({\bm{\xi}}_{0},{{\bm{\alpha}}}_{0})+p_{0},

where p0=λGlog(δG)p_{0}=\lambda G\hbox{log}(\delta G) is a constant. Hence, by Theorem S1 and (76), for any fixed kk, we have

R1n(𝝃(k),𝜶(k))2(𝐦(k))Ti=1n𝐛in(𝐦(k))T𝐁(𝐦(k)){1+op(1)}+op(1),R_{1n}\left({\bm{\xi}}^{(k)},{{\bm{\alpha}}}^{(k)}\right)\leq 2\left({\bf m}^{(k)}\right)^{\rm T}\sum_{i=1}^{n}{\bf b}_{i}-n\left({\bf m}^{(k)}\right)^{\rm T}{\bf B}\left({\bf m}^{(k)}\right)\{1+o_{p}(1)\}+o_{p}(1),

where 𝐦(k)=𝐦(𝜶(k),𝝃(k),𝝃0){\bf m}^{(k)}={\bf m}({\bm{\alpha}}^{(k)},\bm{\xi}^{(k)},\bm{\xi}_{0}) and 𝐦{\bf m} is defined in (14). In this proof, for simplicity of notation, from now on we omit the superscript kk and abbreviate 𝜶(k),𝝃(k),𝐦(k){\bm{\alpha}}^{(k)},\bm{\xi}^{(k)},{\bf m}^{(k)} and 𝐦1(k)=𝐦1(𝜶(k),𝝃(k),𝝃0){\bf m}_{1}^{(k)}={\bf m}_{1}({\bm{\alpha}}^{(k)},\bm{\xi}^{(k)},\bm{\xi}_{0}) to 𝜶^,𝝃^,𝐦^\hat{{\bm{\alpha}}},\hat{\bm{\xi}},\hat{{\bf m}} and 𝐦^1\hat{{\bf m}}_{1}, where 𝐦1{\bf m}_{1} is defined in (13). Since n𝐦^T𝐁𝐦^=Op(1)n\hat{{\bf m}}^{\rm T}{\bf B}\hat{{\bf m}}=O_{p}(1), we have

R1n(𝝃^,𝜶^)2𝐦^Ti=1n𝐛in𝐦^T𝐁𝐦^+op(1).R_{1n}\left(\hat{\bm{\xi}},\hat{{\bm{\alpha}}}\right)\leq 2\ \hat{{\bf m}}^{\rm T}\sum_{i=1}^{n}{\bf b}_{i}-n\hat{{\bf m}}^{\rm T}{\bf B}\hat{{\bf m}}+o_{p}(1).

Let \tilde{v}_{hg}=\sqrt{\hat{\alpha}_{g}}(\hat{\theta}_{gh}-{\theta}_{0h}) and \widetilde{{\bf V}}=[\tilde{v}_{hg}]\in\mathbb{R}^{d\times G}, and write \widetilde{{\bf V}}=(\widetilde{{\bf V}}_{\cdot 1},\dots,\widetilde{{\bf V}}_{\cdot G}) for its columns. This gives \sum_{g=1}^{G}\sqrt{\hat{\alpha}_{g}}\widetilde{{\bf V}}_{\cdot g}=\hat{{\bf m}}_{1}. Hence, we have

𝐕~1=𝐦^1g=2Gα^g𝐕~gα^1.\widetilde{{\bf V}}_{\cdot 1}=\frac{\hat{{\bf m}}_{1}-\sum_{g=2}^{G}\sqrt{\hat{\alpha}_{g}}\widetilde{{\bf V}}_{\cdot g}}{\sqrt{\hat{\alpha}_{1}}}.

Based on this equation, we define 𝐕^\hat{{\bf V}} as

𝐕^1=g=2Gα^g𝐕~gα^1 and 𝐕^g=𝐕~g,g1.\hat{{\bf V}}_{\cdot 1}=\frac{-\sum_{g=2}^{G}\sqrt{\hat{\alpha}_{g}}\widetilde{{\bf V}}_{\cdot g}}{\sqrt{\hat{\alpha}_{1}}}\mbox{ and }\hat{{\bf V}}_{\cdot g}=\widetilde{{\bf V}}_{\cdot g},g\neq 1.

It follows that g=1Gα^g𝐕^g=𝟎\sum_{g=1}^{G}\sqrt{\hat{\alpha}_{g}}\hat{{\bf V}}_{\cdot g}={\bf 0}. Let 𝐯^=vech(𝐕^𝐕^T)\hat{\bf v}={\rm vech}\left(\hat{{\bf V}}\hat{{\bf V}}^{\rm T}\right) and 𝐯~=vech(𝐕~𝐕~T){\bf\tilde{v}}={\rm vech}\left(\widetilde{{\bf V}}\widetilde{{\bf V}}^{\rm T}\right). By Theorem S1, we have v~hg=op(1)\tilde{v}_{hg}=o_{p}(1) and 𝐦^12=Op(n1/2)\left\|\hat{{\bf m}}_{1}\right\|_{2}=O_{p}(n^{-1/2}). Therefore, we have 𝐯~=𝐯^+op(n1/2){\bf\tilde{v}}=\hat{\bf v}+o_{p}(n^{-1/2}). Let 𝐭^=(𝐦^1T,𝐯^T)T\hat{{\bf t}}=\left(\hat{{\bf m}}_{1}^{\rm T},\hat{\bf v}^{\rm T}\right)^{\rm T}. Since 𝐦^=(𝐦^1T,𝐯~T)T\hat{{\bf m}}=\left(\hat{{\bf m}}_{1}^{\rm T},{\bf\tilde{v}}^{\rm T}\right)^{\rm T}, we have 𝐦^=𝐭^+op(n1/2)\hat{{\bf m}}=\hat{{\bf t}}+o_{p}\left(n^{-1/2}\right). It follows that

R1n(𝝃^,𝜶^)2𝐭^Ti=1n𝐛in𝐭^T𝐁𝐭^+op(1).R_{1n}\left(\hat{\bm{\xi}},\hat{{\bm{\alpha}}}\right)\leq 2\hat{{\bf t}}^{\rm T}\sum_{i=1}^{n}{\bf b}_{i}-n\hat{{\bf t}}^{\rm T}{\bf B}\hat{{\bf t}}+o_{p}\left(1\right).

Let 𝐦~1=𝐦^1+𝐁111𝐁12𝐯^\tilde{{\bf m}}_{1}=\hat{{\bf m}}_{1}+{\bf B}_{11}^{-1}{\bf B}_{12}\hat{\bf{\bf v}} and 𝐁~22=𝐁22𝐁21𝐁111𝐁12\widetilde{{\bf B}}_{22}={{\bf B}}_{22}-{\bf B}_{21}{\bf B}_{11}^{-1}{\bf B}_{12}. It is clear that

𝐭^T𝐁𝐭^=𝐦~1T𝐁11𝐦~1+𝐯^T𝐁~22𝐯^,\hat{{\bf t}}^{\rm T}{\bf B}\hat{{\bf t}}=\tilde{{\bf m}}_{1}^{\rm T}{\bf B}_{11}\tilde{{\bf m}}_{1}+\hat{{\bf v}}^{\rm T}\widetilde{{\bf B}}_{22}\hat{\bf v},

and

𝐭^Ti=1n𝐛i=𝐦~1Ti=1n𝐛1i+𝐯^T(i=1n𝐛~2i),\hat{{\bf t}}^{\rm T}\sum_{i=1}^{n}{\bf b}_{i}=\tilde{{\bf m}}_{1}^{\rm T}\sum_{i=1}^{n}{\bf b}_{1i}+\hat{{\bf v}}^{\rm T}\left(\sum_{i=1}^{n}\tilde{{\bf b}}_{2i}\right),

where 𝐛~2i=𝐛2i𝐁21𝐁111𝐛1i\tilde{{\bf b}}_{2i}={\bf b}_{2i}-{\bf B}_{21}{\bf B}_{11}^{-1}{\bf b}_{1i}. Then, we have

R1n(𝝃^,𝜶^)\displaystyle R_{1n}\left(\hat{\bm{\xi}},\hat{{\bm{\alpha}}}\right) 2𝐦~1Ti=1n𝐛1in𝐦~1T𝐁11𝐦~1+2𝐯^Ti=1n𝐛~2in𝐯^T𝐁~22𝐯^+op(1)\displaystyle\leq 2\tilde{{\bf m}}_{1}^{\rm T}\sum_{i=1}^{n}{\bf b}_{1i}-n\tilde{{\bf m}}_{1}^{\rm T}{\bf B}_{11}\tilde{{\bf m}}_{1}+2\hat{\bf{\bf v}}^{\rm T}\sum_{i=1}^{n}\tilde{{\bf b}}_{2i}-n\hat{{\bf v}}^{\rm T}\widetilde{{\bf B}}_{22}\hat{\bf v}+o_{p}\left(1\right)
(i=1n𝐛1i)T(n𝐁11)1(i=1n𝐛1i)+2𝐯^Ti=1n𝐛~2in𝐯^T𝐁~22𝐯^+op(1).\displaystyle\leq\left(\sum_{i=1}^{n}{\bf b}_{1i}\right)^{\rm T}\left(n{\bf B}_{11}\right)^{-1}\left(\sum_{i=1}^{n}{\bf b}_{1i}\right)+2\hat{{\bf v}}^{\rm T}\sum_{i=1}^{n}\tilde{{\bf b}}_{2i}-n\hat{{\bf v}}^{\rm T}\widetilde{{\bf B}}_{22}\hat{\bf v}+o_{p}\left(1\right). (82)

Subtracting R0n(𝝃^0,𝜶0)R_{0n}(\hat{\bm{\xi}}_{0},{\bm{\alpha}}_{0}) from R1n(𝝃^,𝜶^)R_{1n}\left(\hat{\bm{\xi}},\hat{{\bm{\alpha}}}\right), we have

R1n(𝝃^,𝜶^)R0n(𝝃^0,𝜶0)2𝐯^Ti=1n𝐛~2in𝐯^T𝐁~22𝐯^+op(1).R_{1n}\left(\hat{\bm{\xi}},\hat{{\bm{\alpha}}}\right)-R_{0n}(\hat{\bm{\xi}}_{0},{\bm{\alpha}}_{0})\leq 2\hat{{\bf v}}^{\rm T}\sum_{i=1}^{n}\tilde{{\bf b}}_{2i}-n\hat{{\bf v}}^{\rm T}\widetilde{{\bf B}}_{22}\hat{\bf v}+o_{p}\left(1\right).

Let 𝒱𝜶^={𝐯:𝐯=vech(𝐕𝐕T),g=1Gα^g𝐕g=0}\mathcal{V}_{\hat{{\bm{\alpha}}}}=\left\{{\bf v}:{\bf v}={\rm vech}\left({\bf V}{\bf V}^{\rm T}\right),\ \sum_{g=1}^{G}\sqrt{\hat{\alpha}_{g}}{{\bf V}}_{\cdot g}=0\right\}. Since 𝐯^𝒱𝜶^\hat{\bf v}\in\mathcal{V}_{\hat{{\bm{\alpha}}}}, we have

R1n(𝝃^,𝜶^)R0n(𝝃^0,𝜶0)sup𝐯𝒱𝜶^2𝐯Ti=1n𝐛~2in𝐯T𝐁~22𝐯+op(1).R_{1n}\left(\hat{\bm{\xi}},\hat{{\bm{\alpha}}}\right)-R_{0n}(\hat{\bm{\xi}}_{0},{\bm{\alpha}}_{0})\leq\sup_{{\bf v}\in\mathcal{V}_{\hat{{\bm{\alpha}}}}}2{\bf v}^{\rm T}\sum_{i=1}^{n}\tilde{{\bf b}}_{2i}-n{\bf v}^{\rm T}\widetilde{{\bf B}}_{22}{\bf v}+o_{p}(1).

By Lemma 22, we have 𝒱𝜶^𝒱\mathcal{V}_{\hat{{\bm{\alpha}}}}\equiv\mathcal{V}. Based on this fact, we can rewrite the above inequality as

R1n(𝝃^,𝜶^)R0n(𝝃^0,𝜶0)sup𝐯𝒱2𝐯Ti=1n𝐛~2in𝐯T𝐁~22𝐯+op(1).R_{1n}\left(\hat{\bm{\xi}},\hat{{\bm{\alpha}}}\right)-R_{0n}(\hat{\bm{\xi}}_{0},{\bm{\alpha}}_{0})\leq\sup_{{\bf v}\in\mathcal{V}}2{\bf v}^{\rm T}\sum_{i=1}^{n}\tilde{{\bf b}}_{2i}-n{\bf v}^{\rm T}\widetilde{{\bf B}}_{22}{\bf v}+o_{p}(1).

Hence,

EMn(k)sup𝐯𝒱2𝐯Ti=1n𝐛~2in𝐯T𝐁~22𝐯+op(1).{\rm EM}_{n}^{(k)}\leq\sup_{{\bf v}\in\mathcal{V}}2{\bf v}^{\rm T}\sum_{i=1}^{n}\tilde{{\bf b}}_{2i}-n{\bf v}^{\rm T}\widetilde{{\bf B}}_{22}{\bf v}+o_{p}(1).

On the other hand, let

𝐯^=argmax𝐯𝒱2𝐯Ti=1n𝐛~2in𝐯T𝐁~22𝐯.\hat{{\bf v}}^{\flat}=\mathop{\arg\max}_{\begin{subarray}{c}{\bf v}\in\mathcal{V}\end{subarray}}2{\bf v}^{\rm T}\sum_{i=1}^{n}\tilde{{\bf b}}_{2i}-n{\bf v}^{\rm T}\widetilde{{\bf B}}_{22}{\bf v}.

Since 𝟎𝒱{\bf 0}\in\mathcal{V}, it follows that

02𝐯^Ti=1n𝐛~2in𝐯^T𝐁~22𝐯^2𝐯^2i=1n𝐛~2i2nλmin(𝐁~22)𝐯^22.0\leq 2\hat{{\bf v}}^{\flat{\rm T}}\sum_{i=1}^{n}\tilde{{\bf b}}_{2i}-n\hat{{\bf v}}^{\flat{\rm T}}\widetilde{{\bf B}}_{22}\hat{{\bf v}}^{\flat}\leq 2\|\hat{{\bf v}}^{\flat}\|_{2}\|\sum_{i=1}^{n}\tilde{{\bf b}}_{2i}\|_{2}-n\lambda_{\rm min}(\widetilde{{\bf B}}_{22})\|\hat{{\bf v}}^{\flat}\|_{2}^{2}. (83)

From (83), it is straightforward to show that 𝐯^2=Op(n1/2)\|\hat{{\bf v}}^{\flat}\|_{2}=O_{p}(n^{-1/2}). Let 𝐕^=(𝐕^1,,𝐕^G)\hat{{\bf V}}^{\flat}=(\hat{{\bf V}}^{\flat}_{\cdot 1},\dots,\hat{{\bf V}}^{\flat}_{\cdot G}) be a matrix such that 𝐯^=vech(𝐕^𝐕^T)\hat{{\bf v}}^{\flat}={\rm vech}\left(\hat{{\bf V}}^{\flat}\hat{{\bf V}}^{\flat{\rm T}}\right). It follows that 𝐕^g2=Op(n1/4),(g=1,,G)\|\hat{{\bf V}}^{\flat}_{\cdot g}\|_{2}=O_{p}(n^{-1/4}),(g=1,\dots,G). Let

𝐦~1=𝐁111i=1n𝐛1in,{\tilde{{\bf m}}}^{\flat}_{1}={\bf B}_{11}^{-1}\frac{\sum_{i=1}^{n}{\bf b}_{1i}}{n},

which maximizes 2\tilde{{\bf m}}_{1}^{\rm T}\sum_{i=1}^{n}{\bf b}_{1i}-n\tilde{{\bf m}}_{1}^{\rm T}{\bf B}_{11}\tilde{{\bf m}}_{1}. Therefore, we define \hat{{\bf m}}^{\flat}_{1}={\tilde{{\bf m}}}^{\flat}_{1}-{\bf B}_{11}^{-1}{\bf B}_{12}\hat{{\bf v}}^{\flat}. Finally, we define {{{\bf V}}}^{\flat}=[{{v}}^{\flat}_{hg}] as

𝐕1=𝐦^1g=2G1/G𝐕^g1/G and 𝐕g=𝐕^g,g1.{{{\bf V}}}_{\cdot 1}^{\flat}=\frac{\hat{{\bf m}}_{1}^{\flat}-\sum_{g=2}^{G}\sqrt{1/G}\hat{{\bf V}}^{\flat}_{\cdot g}}{\sqrt{1/G}}\mbox{ and }{{{\bf V}}}^{\flat}_{\cdot g}=\hat{{\bf V}}^{\flat}_{\cdot g},g\neq 1.

Let θgh=Gvhg+θ0h{{\theta}}^{\flat}_{gh}=\sqrt{G}{{v}}^{\flat}_{hg}+{\theta}_{0h} and 𝜽g=(θg1,,θgd)T{{\bm{\theta}}}^{\flat}_{g}=\left({{\theta}}^{\flat}_{g1},\dots,{{\theta}}^{\flat}_{gd}\right)^{\rm T}. It follows that 𝜽g𝜽02=Op(n1/4)\|{{\bm{\theta}}}^{\flat}_{g}-{\bm{\theta}}_{0}\|_{2}=O_{p}\left(n^{-1/4}\right). Let 𝝃=(𝜽1,,𝜽G){{\bm{\xi}}}^{\flat}=\left({{\bm{\theta}}}^{\flat}_{1},\dots,{{\bm{\theta}}}^{\flat}_{G}\right). Then, we have

EMn(k)R1n(𝝃,𝜶0)R0n(𝝃^0,𝜶0)=sup𝐯𝒱2𝐯Ti=1n𝐛~2in𝐯T𝐁~22𝐯+op(1).{\rm EM}_{n}^{(k)}\geq R_{1n}\left({{\bm{\xi}}}^{\flat},{\bm{\alpha}}_{0}\right)-R_{0n}(\hat{\bm{\xi}}_{0},{\bm{\alpha}}_{0})=\sup_{{\bf v}\in\mathcal{V}}2{\bf v}^{\rm T}\sum_{i=1}^{n}\tilde{{\bf b}}_{2i}-n{\bf v}^{\rm T}\widetilde{{\bf B}}_{22}{\bf v}+o_{p}(1).

Note that 𝐯𝒱{\bf v}\in\mathcal{V} if and only if n𝐯𝒱\sqrt{n}{\bf v}\in\mathcal{V}. Hence, we have

EMn(k)=sup𝐯𝒱2𝐯Ti=1n𝐛~2i/n𝐯T𝐁~22𝐯+op(1).{\rm EM}_{n}^{(k)}=\sup_{{\bf v}\in\mathcal{V}}2{\bf v}^{\rm T}\sum_{i=1}^{n}\tilde{{\bf b}}_{2i}/\sqrt{n}-{\bf v}^{\rm T}\widetilde{{\bf B}}_{22}{\bf v}+o_{p}(1).

By the central limit theorem, \sum_{i=1}^{n}\tilde{{\bf b}}_{2i}/\sqrt{n} converges in distribution to {\rm N}(0,\widetilde{{\bf B}}_{22}). Thus, under \mathbb{H}_{0}, for any fixed k, as n\rightarrow\infty, we have

EMn(k)𝑑sup𝐯𝒱2𝐯T𝐰𝐯T𝐁~22𝐯,{\rm EM}_{n}^{(k)}\overset{d}{\longrightarrow}\sup_{{\bf v}\in\mathcal{V}}2{\bf v}^{\rm T}{\bf{w}}-{\bf v}^{\rm T}\widetilde{{\bf B}}_{22}{\bf v},

where {\bf{w}}=(w_{1},\ldots,w_{d(d+1)/2})^{\rm T} is a zero-mean multivariate normal random vector with covariance matrix \widetilde{{\bf B}}_{22}, which proves the theorem. ∎
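As an illustrative remark of ours (not part of the proof): when d=1, the set \mathcal{V} reduces to the half line [0,\infty), the supremum can be evaluated in closed form, and the familiar mixture limit for homogeneity tests is recovered:

% Special case d = 1: \mathcal{V} = [0, \infty), and the supremum is attained
% at v = \max(w, 0)/\widetilde{B}_{22}, so that
\sup_{v\geq 0}\left\{2vw-v^{2}\widetilde{B}_{22}\right\}
  =\frac{\{\max(w,0)\}^{2}}{\widetilde{B}_{22}}
  \overset{d}{=}\tfrac{1}{2}\chi^{2}_{0}+\tfrac{1}{2}\chi^{2}_{1},
  \qquad w\sim{\rm N}\left(0,\widetilde{B}_{22}\right).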

Proof of Lemma 22.

It is clear that 𝒱𝜶𝒱\mathcal{V}_{{\bm{\alpha}}}\subset\mathcal{V}. Hence, we aim to prove 𝒱𝒱𝜶\mathcal{V}\subset\mathcal{V}_{{\bm{\alpha}}}. Without loss of generality, we assume αG0\alpha_{G}\neq 0. Let 𝐌=(𝐀1,,𝐀(G1)){\bf M}=({\bf A}_{\cdot 1},\dots,{\bf A}_{\cdot(G-1)}). Then, 𝐀{\bf A} can be rewritten as 𝐀=(𝐌,𝐀G){\bf A}=({\bf M},{\bf A}_{\cdot G}). Since g=1Gαg𝐀g=0\sum_{g=1}^{G}{{\alpha}_{g}}{{\bf A}}_{\cdot g}=0, we have 𝐀G=g=1G1αg𝐀g/αG{\bf A}_{\cdot G}=-\sum_{g=1}^{G-1}{{\alpha}_{g}}{{\bf A}}_{\cdot g}/{{\alpha}_{G}}. Let

𝜷=(α1/αG,,αG1/αG)T.\bm{\beta}=-({{\alpha}_{1}}/{{\alpha}_{G}},\ldots,{{\alpha}_{G-1}}/{{\alpha}_{G}})^{\rm T}.

Then, 𝐀G{\bf A}_{\cdot G} can be rewritten as 𝐀G=𝐌𝜷{\bf A}_{\cdot G}={\bf M}\bm{\beta}. It follows that 𝐀=(𝐌,𝐌𝜷){\bf A}=({\bf M},{\bf M}\bm{\beta}) and

𝐀𝐀T=𝐌𝐌T+𝐌𝜷𝜷T𝐌T=𝐌(𝐈+𝜷𝜷T)𝐌T.{\bf A}{\bf A}^{\rm T}={\bf M}{\bf M}^{\rm T}+{\bf M}\bm{\beta}\bm{\beta}^{\rm T}{\bf M}^{\rm T}={\bf M}({\bf I}+\bm{\beta}\bm{\beta}^{\rm T}){\bf M}^{\rm T}.

Since the minimum eigenvalue of {\bf I}+\bm{\beta}\bm{\beta}^{\rm T} is greater than or equal to 1, {\bf I}+\bm{\beta}\bm{\beta}^{\rm T} is positive definite. Hence, there exists a full rank matrix {\bf Q} such that {\bf I}+\bm{\beta}\bm{\beta}^{\rm T}={\bf Q}{\bf Q}^{\rm T}. Then {\bf A}{\bf A}^{\rm T}={\bf M}{\bf Q}{\bf Q}^{\rm T}{\bf M}^{\rm T}. Therefore, for any {\bf V}\in\mathcal{V}, we aim to prove that there exists {\bf M} such that {\bf V}={\bf M}{\bf Q}{\bf Q}^{\rm T}{\bf M}^{\rm T}. When d\geq G-1, since {\bf V} is a positive semi-definite matrix and {\rm rank}({\bf V})\leq r, by the spectral decomposition theorem, there exist a diagonal matrix {\bf D}=\hbox{diag}(\lambda_{1},\dots,\lambda_{G-1}), \lambda_{1}\geq\cdots\geq\lambda_{G-1}\geq 0, and a matrix {\bf P}\in\mathbb{R}^{d\times(G-1)} with orthonormal columns such that {\bf V}={\bf P}{\bf D}{\bf P}^{\rm T}. Taking {\bf M}={\bf P}{\bf D}^{1/2}{\bf Q}^{-1} yields the lemma. When d<G-1, similarly, we have {\bf V}={\bf P}{\bf D}{\bf P}^{\rm T}, where {\bf D}=\hbox{diag}(\lambda_{1},\dots,\lambda_{d}) and {\bf P}\in\mathbb{R}^{d\times d}. Write

{\bf P}^{\dagger}=({\bf P},{\bf 0})\in\mathbb{R}^{d\times(G-1)}\mbox{ and }{\bf D}^{\dagger}=\hbox{diag}(\lambda_{1},\dots,\lambda_{d},0,\dots,0)\in\mathbb{R}^{(G-1)\times(G-1)}.

Taking {\bf M}={\bf P}^{\dagger}({\bf D}^{\dagger})^{1/2}{\bf Q}^{-1} yields the lemma. ∎
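The construction above is explicit enough to verify numerically. The following sketch (ours; the dimensions and seed are arbitrary) builds {\bf A} from a given low-rank positive semi-definite {\bf V} exactly as in the proof and checks both {\bf A}{\bf A}^{\rm T}={\bf V} and the weighted zero-column-sum constraint:

import numpy as np

rng = np.random.default_rng(2)
d, G = 4, 3
alpha = np.array([0.2, 0.3, 0.5])

# A random PSD target with rank <= G - 1.
W = rng.normal(size=(d, G - 1))
V = W @ W.T

beta = -alpha[:-1] / alpha[-1]                    # beta from the proof
Q = np.linalg.cholesky(np.eye(G - 1) + np.outer(beta, beta))  # QQ^T = I + beta beta^T
lam, P = np.linalg.eigh(V)                        # ascending eigenvalues
P, lam = P[:, -(G - 1):], np.clip(lam[-(G - 1):], 0.0, None)  # top G-1 eigenpairs
M = P @ np.diag(np.sqrt(lam)) @ np.linalg.inv(Q)  # M = P D^{1/2} Q^{-1}
A = np.column_stack([M, M @ beta])                # A = (M, M beta)

print(np.allclose(A @ A.T, V))        # A A^T reproduces V
print(np.allclose(A @ alpha, 0.0))    # sum_g alpha_g A_{.g} = 0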

Appendix C Examples

In this section, we give examples of distributions that satisfy Condition (C1)–(C7). Condition (C1)–(C2) and Condition (C6)–(C7) are easy to meet. Section C.1 shows that Condition (C5) holds when the initial value {\bm{\alpha}}^{(0)} is close to the true value {\bm{\alpha}}^{*}. We mainly discuss Condition (C3) and Condition (C4).

Example 1 (Exponential families).

Assume that x is from a canonical exponential family with density f(x;\bm{\theta})=\exp\left\{\bm{\theta}^{\rm T}T(x)-\xi(\bm{\theta})\right\}h(x), where \bm{\theta}\in\Theta is a d-dimensional vector, \Theta is a convex compact subset of the natural parameter space and \xi(\bm{\theta}) is a smooth function. This family contains most of the commonly used distributions, such as the Poisson distribution and the exponential distribution. For Condition (C3), we show that m can be taken as any positive integer. When the components of \left(T(x),{\rm vech}\left(T(x)T(x)^{\rm T}\right)\right) are linearly independent, we show that the covariance matrix {\bf B} is positive definite, and thus Condition (C4) is fulfilled. As an example, consider the Poisson distribution. In this case, we have T(x)=x. Since x and x^{2} are linearly independent, the covariance matrix {\bf B} is positive definite. Similarly, for the gamma distribution, we have T(x)=(\hbox{log}\ x,x). Using the same argument, we can verify Condition (C4). In addition, many exponential family distributions, such as the Poisson distribution and the gamma distribution, satisfy the assumption that \mathcal{P}^{G} is an identifiable finite mixture (Yakowitz and Spragins, 1968; Barndorff-Nielsen, 1965).
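As a quick numerical illustration of the Poisson case (a sketch of ours, not part of the formal verification): positive definiteness of the covariance of (x,x^{2}) is exactly the linear-independence requirement above, and it can be checked directly:

import numpy as np

rng = np.random.default_rng(1)
x = rng.poisson(lam=3.0, size=200000).astype(float)
M = np.column_stack([x, x ** 2])        # the components (T(x), vech(T(x)T(x)^T))
cov = np.cov(M, rowvar=False)
print(np.linalg.eigvalsh(cov))          # both eigenvalues strictly positive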

Example 2 (Negative binomial model).

The negative binomial distribution xNB(μ,r)x\sim{\rm NB}(\mu,r) has a probability mass function

(x=k)=Γ(r+k)k!Γ(r)(rr+μ)r(μr+μ)k, for k=0,1,2,,{\mathbb{P}}(x=k)=\frac{\Gamma(r+k)}{k!\Gamma(r)}\left(\frac{r}{r+\mu}\right)^{r}\left(\frac{\mu}{r+\mu}\right)^{k},\mbox{ for }k=0,1,2,\ldots, (84)

where μ\mu is the mean parameter and rr is the size parameter. In such a case, we let Θ={(μ,r):0<δ1μ,rδ2<}\Theta=\{(\mu,r):0<\delta_{1}\leq\mu,r\leq\delta_{2}<\infty\} be a compact set, where δ1\delta_{1} and δ2\delta_{2} are two constants. Similarly, we can show that mm in Condition (C3) can be taken as any positive integer and the negative binomial distribution satisfies Condition (C4) and the identifiability assumption (Yakowitz and Spragins, 1968).
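As a quick check of the parametrization (a sketch of ours, not from the paper): the pmf in (84) coincides with the standard negative binomial pmf with number-of-successes parameter r and success probability r/(r+\mu), as implemented, e.g., in scipy:

from math import exp, lgamma, log

from scipy.stats import nbinom

def nb_pmf(k, mu, r):
    """Direct evaluation of the pmf in (84)."""
    return exp(lgamma(r + k) - lgamma(k + 1) - lgamma(r)
               + r * log(r / (r + mu)) + k * log(mu / (r + mu)))

mu, r = 3.0, 8.0
print(all(abs(nb_pmf(k, mu, r) - nbinom.pmf(k, r, r / (r + mu))) < 1e-12
          for k in range(20)))    # True: the two parameterizations agree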

Our next goal is to verify Condition (C1)–(C7) for the above two examples. In this section, we use C,C^{\prime}>0 as generic constants, which may change from occurrence to occurrence. The following lemma gives an upper bound for the sums of independent sub-exponential random variables.

Lemma 23.

Let X1,,XnX_{1},\dots,X_{n} be independent mean-zero sub-exponential random variables. Then, we have

i=1nXiψ12Ci=1nXiψ12,\left\|\sum_{i=1}^{n}X_{i}\right\|^{2}_{\psi_{1}}\leq C\sum_{i=1}^{n}\|X_{i}\|_{\psi_{1}}^{2},

where CC is a constant.

Proof of Lemma 23.

Since XiX_{i} is a mean-zero sub-exponential random variable, there exists c>0c>0 such that

𝔼(exp(λXi))exp(cXiψ12λ2),|λ|1cXiψ1.{\mathbb{E}}\left(\exp(\lambda X_{i})\right)\leq\exp(c\|X_{i}\|_{\psi_{1}}^{2}\lambda^{2}),|\lambda|\leq\frac{1}{\sqrt{c}\|X_{i}\|_{\psi_{1}}}.

By independence, we have

𝔼(exp(λi=1nXi))exp(cλ2i=1nXiψ12),|λ|1ci=1nXiψ12.{\mathbb{E}}\left(\exp\left(\lambda\sum_{i=1}^{n}X_{i}\right)\right)\leq\exp\left(c\lambda^{2}\sum_{i=1}^{n}\|X_{i}\|_{\psi_{1}}^{2}\right),|\lambda|\leq\frac{1}{\sqrt{c\sum_{i=1}^{n}\|X_{i}\|_{\psi_{1}}^{2}}}.

It follows that there exists a constant C>0C>0 such that

i=1nXiψ12Ci=1nXiψ12,\left\|\sum_{i=1}^{n}X_{i}\right\|^{2}_{\psi_{1}}\leq C\sum_{i=1}^{n}\|X_{i}\|_{\psi_{1}}^{2},

which proves the lemma. ∎

C.1 A sufficient condition for Condition (C5)

In this subsection, we will give a sufficient condition for Condition (C5). Recall that

Ξ1={𝝃:maxgg𝜽g𝜽g2γ,𝝃Ξ}.\Xi_{1}=\left\{\bm{\xi}:\max_{g\neq g^{\prime}}\|\bm{\theta}_{g}-\bm{\theta}_{g^{\prime}}\|_{2}\geq\gamma,\bm{\xi}\in\Xi\right\}. (85)
Lemma 24.

Suppose that 𝔼𝛂1,𝛏1logφ(x;𝛏2,𝛂2){\mathbb{E}}_{{\bm{\alpha}}_{1},\bm{\xi}_{1}}{\rm log}\ \varphi\left(x;\bm{\xi}_{2},{\bm{\alpha}}_{2}\right) is continuous with respect to 𝛂1,𝛏1,𝛂2,𝛏2{\bm{\alpha}}_{1},\bm{\xi}_{1},{\bm{\alpha}}_{2},\bm{\xi}_{2} and 𝔼𝛂1,𝛏1logf(x;𝛉0){\mathbb{E}}_{{\bm{\alpha}}_{1},\bm{\xi}_{1}}{\rm log}\ f(x;\bm{\theta}_{0}) is continuous with respect to 𝛂1,𝛏1,𝛉0{\bm{\alpha}}_{1},\bm{\xi}_{1},\bm{\theta}_{0}. For any γ>0\gamma>0, there exists a constant τ(γ)>0\tau(\gamma)>0 such that if 𝛂(0)𝛂2τ(γ)\left\|{\bm{\alpha}}^{(0)}-{\bm{\alpha}}^{*}\right\|_{2}\leq\tau(\gamma), then we have

inf𝝃Ξ1{sup𝝃Ξ𝔼𝜶,𝝃logφ(x;𝝃,𝜶(0))sup𝜽0Θ𝔼𝜶,𝝃logf(x;𝜽0)}=ϱ>0.\inf_{\bm{\xi}^{*}\in\Xi_{1}}\left\{\sup_{\bm{\xi}\in\Xi}{\mathbb{E}}_{{\bm{\alpha}}^{*},\bm{\xi}^{*}}{\rm log}\ \varphi\left(x;\bm{\xi},{\bm{\alpha}}^{(0)}\right)-\sup_{\bm{\theta}_{0}\in\Theta}{\mathbb{E}}_{{\bm{\alpha}}^{*},\bm{\xi}^{*}}{\rm log}\ f(x;\bm{\theta}_{0})\right\}=\varrho>0.
Proof of Lemma 24.

We first prove that

\inf_{\bm{\xi}^{*}\in\Xi_{1}}\left\{{\mathbb{E}}_{{\bm{\alpha}}^{*},\bm{\xi}^{*}}{\rm log}\ \varphi\left(x;\bm{\xi}^{*},{\bm{\alpha}}^{*}\right)-\sup_{\bm{\theta}_{0}\in\Theta}{\mathbb{E}}_{{\bm{\alpha}}^{*},\bm{\xi}^{*}}{\rm log}\ f(x;\bm{\theta}_{0})\right\}:=\widetilde{\varrho}>0. (86)

Note that the left-hand side of (86) can be written as

inf𝝃Ξ1inf𝜽0Θ{𝔼𝜶,𝝃logφ(x;𝝃,𝜶)𝔼𝜶,𝝃logf(x;𝜽0)}.\inf_{\bm{\xi}^{*}\in\Xi_{1}}\inf_{\bm{\theta}_{0}\in\Theta}\left\{{\mathbb{E}}_{{\bm{\alpha}}^{*},\bm{\xi}^{*}}{\rm log}\ \varphi\left(x;\bm{\xi}^{*},{\bm{\alpha}}^{*}\right)-{\mathbb{E}}_{{\bm{\alpha}}^{*},\bm{\xi}^{*}}{\rm log}\ f(x;\bm{\theta}_{0})\right\}.

Since 𝒫G\mathcal{P}^{G} is an identifiable finite mixture, applying Jensen’s inequality, we have for any 𝝃Ξ1\bm{\xi}^{*}\in\Xi_{1} and 𝜽0Θ\bm{\theta}_{0}\in\Theta,

𝔼𝜶,𝝃logφ(x;𝝃,𝜶)𝔼𝜶,𝝃logf(x;𝜽0)>0.{\mathbb{E}}_{{\bm{\alpha}}^{*},\bm{\xi}^{*}}{\rm log}\ \varphi\left(x;\bm{\xi}^{*},{\bm{\alpha}}^{*}\right)-{\mathbb{E}}_{{\bm{\alpha}}^{*},\bm{\xi}^{*}}{\rm log}\ f(x;\bm{\theta}_{0})>0.

By the continuity and the compactness of Ξ1,Θ\Xi_{1},\Theta, we prove (86). Then, we aim to prove that there exists a constant τ(γ)>0\tau(\gamma)>0 such that if 𝜶(0)𝜶2τ(γ)\left\|{\bm{\alpha}}^{(0)}-{\bm{\alpha}}^{*}\right\|_{2}\leq\tau(\gamma), then

inf𝝃Ξ1{sup𝝃Ξ𝔼𝜶,𝝃logφ(x;𝝃,𝜶(0))𝔼𝜶,𝝃logφ(x;𝝃,𝜶)}ϱ~/2.\inf_{\bm{\xi}^{*}\in\Xi_{1}}\left\{\sup_{\bm{\xi}\in\Xi}{\mathbb{E}}_{{\bm{\alpha}}^{*},\bm{\xi}^{*}}{\rm log}\ \varphi\left(x;\bm{\xi},{\bm{\alpha}}^{(0)}\right)-{\mathbb{E}}_{{\bm{\alpha}}^{*},\bm{\xi}^{*}}{\rm log}\ \varphi\left(x;\bm{\xi}^{*},{\bm{\alpha}}^{*}\right)\right\}\geq-\widetilde{\varrho}/2. (87)

We only need to prove

inf𝝃Ξ1{𝔼𝜶,𝝃logφ(x;𝝃,𝜶(0))𝔼𝜶,𝝃logφ(x;𝝃,𝜶)}ϱ~/2.\inf_{\bm{\xi}^{*}\in\Xi_{1}}\left\{{\mathbb{E}}_{{\bm{\alpha}}^{*},\bm{\xi}^{*}}{\rm log}\ \varphi\left(x;\bm{\xi}^{*},{\bm{\alpha}}^{(0)}\right)-{\mathbb{E}}_{{\bm{\alpha}}^{*},\bm{\xi}^{*}}{\rm log}\ \varphi\left(x;\bm{\xi}^{*},{\bm{\alpha}}^{*}\right)\right\}\geq-\widetilde{\varrho}/2.

Using the fact that 𝔼𝜶1,𝝃1logφ(x;𝝃2,𝜶2){\mathbb{E}}_{{\bm{\alpha}}_{1},\bm{\xi}_{1}}{\rm log}\ \varphi\left(x;\bm{\xi}_{2},{\bm{\alpha}}_{2}\right) is uniformly continuous on a compact set, for ϱ~/2\widetilde{\varrho}/2, there exists a constant τ(γ)>0\tau(\gamma)>0 such that if 𝜶(0)𝜶2τ(γ)\left\|{\bm{\alpha}}^{(0)}-{\bm{\alpha}}^{*}\right\|_{2}\leq\tau(\gamma), then for any 𝝃Ξ1\bm{\xi}^{*}\in\Xi_{1}, we have

|𝔼𝜶,𝝃logφ(x;𝝃,𝜶(0))𝔼𝜶,𝝃logφ(x;𝝃,𝜶)|ϱ~/2,\left|{\mathbb{E}}_{{\bm{\alpha}}^{*},\bm{\xi}^{*}}{\rm log}\ \varphi\left(x;\bm{\xi}^{*},{\bm{\alpha}}^{(0)}\right)-{\mathbb{E}}_{{\bm{\alpha}}^{*},\bm{\xi}^{*}}{\rm log}\ \varphi\left(x;\bm{\xi}^{*},{\bm{\alpha}}^{*}\right)\right|\leq\widetilde{\varrho}/2,

which proves (87). Combining (86) with (87) yields the result. ∎

C.2 Exponential families

It is clear that Condition (C1) and (C2) hold. A sufficient condition for Condition (C5) is given in Section C.1. The remainder of Section C.2 is devoted to verifying Condition (C3)–(C4) and Condition (C6)–(C7).

C.2.1 Condition (C3)

We first prove that for any 0<r5,j1,,jr{1,,d}0<r\leq 5,j_{1},\dots,j_{r}\in\{1,\dots,d\} and m>0m>0,

sup𝜽Θ|rθj1θjrlogf(x;𝜽)|Lm<.\left\|\sup_{\bm{\theta}\in\Theta}\left|\frac{\partial^{r}}{\partial\theta_{j_{1}}\cdots\partial\theta_{j_{r}}}\hbox{log}f(x;\bm{\theta})\right|\right\|_{L^{m}}<\infty. (88)

Note that logf(x;𝜽)=𝜽TT(x)ξ(𝜽)+logh(x)\hbox{log}f(x;\bm{\theta})=\bm{\theta}^{\rm T}T(x)-\xi(\bm{\theta})+\hbox{log}h(x), 𝜽Θd\bm{\theta}\in\Theta\subset\mathbb{R}^{d} and Θ\Theta is a compact set. Then

\frac{\partial^{r}}{\partial\theta_{j_{1}}\cdots\partial\theta_{j_{r}}}\hbox{log}f(x;\bm{\theta})=T_{j_{1}}(x){\mathbb{I}}(r=1)-\frac{\partial^{r}}{\partial\theta_{j_{1}}\cdots\partial\theta_{j_{r}}}\xi(\bm{\theta}),

where 𝕀(){\mathbb{I}(\cdot)} is the indicator function. Since T(x)Lm<\left\|T_{\ell}(x)\right\|_{L^{m}}<\infty and rθj1θjrξ(𝜽)\frac{\partial^{r}}{\partial\theta_{j_{1}}\cdots\partial\theta_{j_{r}}}\xi(\bm{\theta}) are bounded in Θ\Theta, the conclusion follows. Our next goal is to prove for any m>0,0<r5m>0,0<r\leq 5,

sup𝜽Θ|rθj1θjrf(x;𝜽)/f(x;𝜽)|Lm<.\left\|\sup_{\bm{\theta}\in\Theta}\left|\frac{\partial^{r}}{\partial\theta_{j_{1}}\cdots\partial\theta_{j_{r}}}f(x;\bm{\theta})\bigg{/}f(x;\bm{\theta})\right|\right\|_{L^{m}}<\infty.

In fact, it is a direct consequence of (88). To prove this, when r=2r=2, we can write 2θj1θj2f(x;𝜽)/f(x;𝜽)\frac{\partial^{2}}{\partial\theta_{j_{1}}\partial\theta_{j_{2}}}f(x;\bm{\theta})\big{/}f(x;\bm{\theta}) as

2θj1θj2f(x;𝜽)/f(x;𝜽)=2θj1θj2logf(x;𝜽)+θj1logf(x;𝜽)θj2logf(x;𝜽).\frac{\partial^{2}}{\partial\theta_{j_{1}}\partial\theta_{j_{2}}}f(x;\bm{\theta})\bigg{/}f(x;\bm{\theta})=\frac{\partial^{2}}{\partial\theta_{j_{1}}\partial\theta_{j_{2}}}\hbox{log}f(x;\bm{\theta})+\frac{\partial}{\partial\theta_{j_{1}}}\hbox{log}f(x;\bm{\theta})\frac{\partial}{\partial\theta_{j_{2}}}\hbox{log}f(x;\bm{\theta}).

By (88), we prove the result. The same reasoning applies to the case 3r53\leq r\leq 5. In order to verify Condition (C3), we note that

sup𝜽𝜽02τ|rf(x;𝜽)θj1θjr/f(x;𝜽0)|\displaystyle\quad\sup_{\|\bm{\theta}-\bm{\theta}_{0}\|_{2}\leq\tau}\left|\frac{\partial^{r}f(x;\bm{\theta})}{\partial\theta_{j_{1}}\cdots\partial\theta_{j_{r}}}\bigg{/}f(x;\bm{\theta}_{0})\right|
sup𝜽𝜽02τ|rf(x;𝜽)θj1θjr/f(x;𝜽)|sup𝜽𝜽02τ|f(x;𝜽)/f(x;𝜽0)|.\displaystyle\leq\sup_{\|\bm{\theta}-\bm{\theta}_{0}\|_{2}\leq\tau}\left|\frac{\partial^{r}f(x;\bm{\theta})}{\partial\theta_{j_{1}}\cdots\partial\theta_{j_{r}}}\bigg{/}f(x;\bm{\theta})\right|\sup_{\|\bm{\theta}-\bm{\theta}_{0}\|_{2}\leq\tau}\left|f(x;\bm{\theta})/f(x;\bm{\theta}_{0})\right|.

It remains to consider the function sup𝜽𝜽02τ|f(x;𝜽)/f(x;𝜽0)|.\sup_{\|\bm{\theta}-\bm{\theta}_{0}\|_{2}\leq\tau}\left|f(x;\bm{\theta})/f(x;\bm{\theta}_{0})\right|. Let CC be a constant such that

sup𝜽Θ𝜽ξ(𝜽)2C.\sup_{\bm{\theta}\in\Theta}\left\|\frac{\partial}{\partial\bm{\theta}}\xi(\bm{\theta})\right\|_{2}\leq C.

By Lagrange’s theorem, if 𝜽𝜽02τ,\|\bm{\theta}-\bm{\theta}_{0}\|_{2}\leq\tau, we have

f(x;𝜽)/f(x;𝜽0)\displaystyle f(x;\bm{\theta})/f(x;\bm{\theta}_{0}) =exp{[𝜽𝜽0]TT(x)(ξ(𝜽)ξ(𝜽0))}\displaystyle=\exp\left\{[\bm{\theta}-\bm{\theta}_{0}]^{\rm T}T(x)-(\xi(\bm{\theta})-\xi(\bm{\theta}_{0}))\right\}
exp{τT(x)1+Cτ}.\displaystyle\leq\exp\left\{\tau\|T(x)\|_{1}+C\tau\right\}. (89)

Since \bm{\theta}_{0} is an interior point of the natural parameter space, for any m>0, by (89), there exists a \tau>0 such that

sup𝜽𝜽02τ|f(x;𝜽)/f(x;𝜽0)|Lm<.\left\|\sup_{\|\bm{\theta}-\bm{\theta}_{0}\|_{2}\leq\tau}\left|f(x;\bm{\theta})/f(x;\bm{\theta}_{0})\right|\right\|_{L^{m}}<\infty. (90)

Further, since the bound (89) is independent of \bm{\theta}_{0}, this gives that for any 0<r\leq 5, j_{1},\dots,j_{r}\in\{1,\dots,d\} and m>0, there exists a \tau>0 such that

sup𝜽0Θsup𝜽𝜽02τ|rf(x;𝜽)θj1θjr/f(x;𝜽0)|Lm<.\sup_{\bm{\theta}_{0}\in\Theta}\left\|\sup_{\|\bm{\theta}-\bm{\theta}_{0}\|_{2}\leq\tau}\left|\frac{\partial^{r}f(x;\bm{\theta})}{\partial\theta_{j_{1}}\cdots\partial\theta_{j_{r}}}\bigg{/}f(x;\bm{\theta}_{0})\right|\right\|_{L^{m}}<\infty.

Our next goal is to prove r(x)r(x) exists. We first prove that

sup𝜽Θf(x;𝜽)μ(dx)<.\int\sup_{\bm{\theta}\in\Theta}{{f(x;\bm{\theta})}}\mu({\rm d}x)<\infty. (91)

For any 𝜽0\bm{\theta}_{0}, (90) shows that there is a τ>0\tau>0 such that

sup𝜽𝜽02τf(x;𝜽)μ(dx)<.\int\sup_{\|\bm{\theta}-\bm{\theta}_{0}\|_{2}\leq\tau}f(x;\bm{\theta})\mu({\rm d}x)<\infty.

Let U(\bm{\theta},\tau)=\{\bm{\theta}^{\prime}:\|\bm{\theta}^{\prime}-\bm{\theta}\|_{2}<\tau\}. Since \bm{\Theta} is a compact set, by the Heine–Borel theorem, the open cover \{U(\bm{\theta},\tau(\bm{\theta})),\bm{\theta}\in\bm{\Theta}\} has a finite subcover \{U(\bm{\theta}_{j},\tau(\bm{\theta}_{j}))\}_{j=1}^{J}. It follows that

sup𝜽Θf(x;𝜽)j=1Jsup𝜽𝜽j2τ(𝜽j)f(x;𝜽),\sup_{\bm{\theta}\in\Theta}{{f(x;\bm{\theta})}}\leq\sum_{j=1}^{J}\sup_{\|\bm{\theta}-\bm{\theta}_{j}\|_{2}\leq\tau(\bm{\theta}_{j})}f(x;\bm{\theta}),

and thus we have

sup𝜽Θf(x;𝜽)μ(dx)<.\int\sup_{\bm{\theta}\in\Theta}{{f(x;\bm{\theta})}}\mu({\rm d}x)<\infty.

It remains to show that

sup𝜽Θ1f(x;𝜽)f(x;𝜽)𝜽22μ(dx)<.\int\sup_{\bm{\theta}\in\Theta}\frac{1}{{f(x;\bm{\theta})}}\left\|\frac{\partial f(x;\bm{\theta})}{\partial\bm{\theta}}\right\|_{2}^{2}\mu({\rm d}x)<\infty.

To this end, we only need to prove that for any \ell,

sup𝜽Θ1f(x;𝜽)|f(x;𝜽)θ|2μ(dx)<.\int\sup_{\bm{\theta}\in\Theta}\frac{1}{{f(x;\bm{\theta})}}\left|\frac{\partial f(x;\bm{\theta})}{\partial\theta_{\ell}}\right|^{2}\mu({\rm d}x)<\infty. (92)

Note that

1f(x;𝜽)|f(x;𝜽)θ|2=f(x;𝜽)|logf(x;𝜽)θ|2.\frac{1}{{f(x;\bm{\theta})}}\left|\frac{\partial f(x;\bm{\theta})}{\partial\theta_{\ell}}\right|^{2}=f(x;\bm{\theta})\left|\frac{\partial\hbox{log}f(x;\bm{\theta})}{\partial\theta_{\ell}}\right|^{2}.

Since \frac{\partial\hbox{log}f(x;\bm{\theta})}{\partial\theta_{\ell}}=T_{\ell}(x)-\frac{\partial\xi(\bm{\theta})}{\partial\theta_{\ell}} and there exists a constant C such that \sup_{\bm{\theta}\in\Theta}\left|\frac{\partial\xi(\bm{\theta})}{\partial\theta_{\ell}}\right|\leq C, it follows that

sup𝜽Θ1f(x;𝜽)|f(x;𝜽)θ|2(|T(x)|+C)2sup𝜽Θf(x;𝜽).\sup_{\bm{\theta}\in\Theta}\frac{1}{{f(x;\bm{\theta})}}\left|\frac{\partial f(x;\bm{\theta})}{\partial\theta_{\ell}}\right|^{2}\leq(|T_{\ell}(x)|+C)^{2}\sup_{\bm{\theta}\in\Theta}f(x;\bm{\theta}).

Similarly, for any \bm{\theta}_{0}, by (89), there exists a \tau>0 such that

sup𝜽𝜽02τf(x;𝜽)(|T(x)|+C)2μ(dx)<.\int\sup_{\|\bm{\theta}-\bm{\theta}_{0}\|_{2}\leq\tau}f(x;\bm{\theta})(|T_{\ell}(x)|+C)^{2}\mu({\rm d}x)<\infty. (93)

With (93), applying the Heine-Borel theorem and using the same argument as in the proof of (91), we can easily prove (92). Therefore, we verify Condition (C3).

C.2.2 Condition (C4)

We claim that if the components of \left(T(x),{\rm vech}\left(T(x)T(x)^{\rm T}\right)\right) are linearly independent, then Condition (C4) is fulfilled. Observe that

\frac{\partial f(x;\bm{\theta})}{\partial\bm{\theta}}\bigg{/}f(x;\bm{\theta})=T(x)-\frac{\partial}{\partial\bm{\theta}}\xi(\bm{\theta}),

and

\frac{\partial^{2}f(x;\bm{\theta})}{\partial\bm{\theta}\partial\bm{\theta}^{\rm T}}\bigg{/}f(x;\bm{\theta})=\left(T(x)-\frac{\partial}{\partial\bm{\theta}}\xi(\bm{\theta})\right)\left(T(x)-\frac{\partial}{\partial\bm{\theta}}\xi(\bm{\theta})\right)^{\rm T}-\frac{\partial^{2}}{\partial\bm{\theta}\partial\bm{\theta}^{\rm T}}\xi(\bm{\theta}).

Hence, to prove that {\bf B}(\bm{\theta}_{0})={\rm Cov}({\bf b}) is positive definite, we only need to show that the covariance matrix of

(f(x;𝜽)𝜽/f(x;𝜽),vech(2f(x;𝜽)𝜽𝜽T/f(x;𝜽)))\left(\frac{\partial f(x;\bm{\theta})}{\partial\bm{\theta}}\bigg{/}f(x;\bm{\theta}),{\rm vech}\left(\frac{\partial^{2}f(x;\bm{\theta})}{\partial\bm{\theta}\bm{\theta}^{\rm T}}\bigg{/}f(x;\bm{\theta})\right)\right)

is positive definite. Note that 𝜽ξ(𝜽)\frac{\partial}{\partial\bm{\theta}}\xi(\bm{\theta}) and 2𝜽𝜽Tξ(𝜽)\frac{\partial^{2}}{\partial\bm{\theta}\bm{\theta}^{\rm T}}\xi(\bm{\theta}) are independent of xx. Thus, it suffices to show

\left(T(x),{\rm vech}\left(\left(T(x)-\frac{\partial}{\partial\bm{\theta}}\xi(\bm{\theta})\right)\left(T(x)-\frac{\partial}{\partial\bm{\theta}}\xi(\bm{\theta})\right)^{\rm T}\right)\right)

has linearly independent components. This is equivalent to the linear independence of the components of \left(T(x),{\rm vech}\left(T(x)T(x)^{\rm T}\right)\right), and thus {\bf B}(\bm{\theta}_{0}) is positive definite. Finally, using the fact that \lambda_{\rm min}({\bf B}(\bm{\theta}_{0})) is continuous and \Theta is compact, we verify Condition (C4).

C.2.3 Condition (C6)

To verify Condition (C6), we only need to show that there exists Mψ1M_{\psi_{1}} such that

log(g=1G𝜶g(0)f(x;𝜽g))log(f(x;𝜽0))ψ1Mψ1.\left\|\hbox{log}\left(\sum_{g=1}^{G}{\bm{\alpha}}_{g}^{\left(0\right)}f\left(x;{\bm{\theta}}^{\dagger}_{g}\right)\right)-\hbox{log}\left(f\left(x;{\bm{\theta}}_{0}^{\dagger}\right)\right)\right\|_{\psi_{1}}\leq M_{\psi_{1}}.

Applying (89), it follows that

g=1G𝜶g(0)f(x;𝜽g)/f(x;𝜽0)exp{Cdiam(Θ)T(x)1+Cdiam(Θ)}.\sum_{g=1}^{G}{\bm{\alpha}}_{g}^{\left(0\right)}f\left(x;{\bm{\theta}}^{\dagger}_{g}\right)\big{/}f\left(x;{\bm{\theta}}_{0}^{\dagger}\right)\leq\exp\left\{C{\rm diam(\Theta)}\|T(x)\|_{1}+C{\rm diam(\Theta)}\right\}.

Hence, there exists a t>0 such that

𝔼[(g=1G𝜶g(0)f(x;𝜽g)/f(x;𝜽0))t]<,{\mathbb{E}}\left[\left(\sum_{g=1}^{G}{\bm{\alpha}}_{g}^{\left(0\right)}f\left(x;{\bm{\theta}}^{\dagger}_{g}\right)\big{/}f\left(x;{\bm{\theta}}_{0}^{\dagger}\right)\right)^{t}\right]<\infty,

and thus we verify Condition (C6).

C.2.4 Condition (C7)

Finally, we aim to verify Condition (C7). Note that

Z_{\bm{\theta}}(\bm{\xi}^{*})-Z_{\bm{\theta}^{\prime}}(\bm{\xi}^{*})=n^{-1/2}\sum_{i=1}^{n}\left\{\hbox{log}f(x_{i};\bm{\theta})-\hbox{log}f(x_{i};\bm{\theta}^{\prime})-(D(\bm{\theta})-D(\bm{\theta}^{\prime}))\right\}.

Then, by Lemma 23, we have

CZ𝜽(𝝃)Z𝜽(𝝃)ψ1\displaystyle C\left\|Z_{\bm{\theta}}(\bm{\xi}^{*})-Z_{\bm{\theta}^{\prime}}(\bm{\xi}^{*})\right\|_{\psi_{1}} [𝜽𝜽]TT(x)ψ1+ξ(𝜽)ξ(𝜽)ψ1+D(𝜽)D(𝜽)ψ1\displaystyle\leq\left\|[\bm{\theta}-\bm{\theta}^{\prime}]^{\rm T}T(x)\right\|_{\psi_{1}}+\left\|\xi(\bm{\theta})-\xi(\bm{\theta}^{\prime})\right\|_{\psi_{1}}+\left\|D(\bm{\theta})-D(\bm{\theta}^{\prime})\right\|_{\psi_{1}}
\displaystyle\leq\left\|\sum_{\ell=1}^{d}[\theta_{\ell}-\theta^{\prime}_{\ell}]T_{\ell}(x)\right\|_{\psi_{1}}+C^{\prime}\left\|\bm{\theta}-\bm{\theta}^{\prime}\right\|_{2}
C′′𝜽𝜽2=1dT(x)ψ1+C𝜽𝜽2,\displaystyle\leq C^{\prime\prime}\left\|\bm{\theta}-\bm{\theta}^{\prime}\right\|_{2}\sum_{\ell=1}^{d}\left\|T_{\ell}(x)\right\|_{\psi_{1}}+C^{\prime}\left\|\bm{\theta}-\bm{\theta}^{\prime}\right\|_{2},

where the second inequality follows from the fact that \xi(\bm{\theta}) and D(\bm{\theta}) have continuous derivatives. Since \Theta is compact, T_{\ell}(x) is sub-exponential under any f(x;\bm{\theta}) with \bm{\theta}\in\Theta, so there is a constant C>0 such that \sum_{\ell=1}^{d}\left\|T_{\ell}(x)\right\|_{\psi_{1}}\leq C, and thus we verify Condition (C7).

C.3 Negative binomial model

The same proof remains valid for the negative binomial example, and thus we omit it.

Appendix D Details for the simulation data generation

For the low-noise and high-noise scenarios, we independently generate $r_{j}$ from the uniform distributions ${\rm U}(10,11)$ and ${\rm U}(5,6)$, respectively. For the cluster-relevant features ($j=1,\dots,20$), the mean parameters $\mu_{gj}$ are set as either $\exp(u_{j})$ or $\exp(u_{j})+D_{j}$, where $u_{j}$ is generated from ${\rm U}(\log 2,\log 5)$ and $D_{j}$ controls the signal strength (the differences between clusters). We generate $D_{j}$ from ${\rm U}(5,6)$, ${\rm U}(7,8)$ or ${\rm U}(9,10)$ for the low, medium and high signal strength settings, respectively. For the first 5 features ($1\leq j\leq 5$), we set $\mu_{2j}=\exp(u_{j})+D_{j}$ and $\mu_{gj}=\exp(u_{j})$ ($g\neq 2$). Similarly, for $5k+1\leq j\leq 5k+5$ ($k=1,2,3$), we set $\mu_{k+2,j}=\exp(u_{j})+D_{j}$ and $\mu_{gj}=\exp(u_{j})$ ($g\neq k+2$). For all cluster-irrelevant features ($j=21,\dots,p$), we set $\mu_{j}=\exp(u_{j})$, where $u_{j}$ is again generated from ${\rm U}(\log 2,\log 5)$.
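To make the block structure of the mean parameters concrete, a minimal Python sketch of one replicate's parameter generation is given below (illustrative only, not the exact code used for the paper; the seed and the medium-signal, high-noise choices are ours).

```python
import numpy as np

rng = np.random.default_rng(0)   # illustrative seed
G, p, s = 5, 5000, 20            # clusters, features, cluster-relevant features

r = rng.uniform(5, 6, size=p)                   # dispersion r_j (high noise)
u = rng.uniform(np.log(2), np.log(5), size=p)
D = rng.uniform(7, 8, size=s)                   # medium signal strength

# mu[g, j]: mean of feature j in cluster g; irrelevant features are constant in g
mu = np.tile(np.exp(u), (G, 1))
for k in range(4):                              # blocks {1-5}, {6-10}, {11-15}, {16-20}
    block = np.arange(5 * k, 5 * k + 5)         # 0-based feature indices
    mu[k + 1, block] += D[block]                # only cluster k+2 (1-based) is shifted
```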

Appendix E Additional simulation results

In this section, we present additional simulation results.

E.1 Simulations for EM-test with mis-specified group number G

Table 5: The means and standard deviations (in parentheses) of ARIs over 100 replications by EM-test with mis-specified G. The values in the table are shown as the actual values × 100. Simulations are generated from the negative binomial model (Section 4.1 in the main manuscript). The true number of clusters is 5. EM-adjust means that the features are selected by the adjusted p-values and EM-0.35 means that we choose the threshold as $n^{0.35}$.
EM-adjust EM-0.35
G=5 (True) G=2 G=8 G=5 (True) G=2 G=8
Case 1: High signal and low noise
p=500 98 (1.5) 98 (1.5) 98 (1.5) 97 (2.7) 98 (1.3) 97 (2.7)
p=5000 98 (0.9) 98 (0.9) 98 (0.9) 97 (1.5) 97 (1.6) 97 (1.5)
p=20,000 97 (2.9) 97 (2.9) 97 (2.9) 97 (1.3) 98 (0.9) 98 (0.9)
Case 2: High signal and high noise
p=500 94 (2.8) 94 (2.8) 94 (2.8) 94 (3.0) 94 (1.3) 94 (2.9)
p=5000 94 (1.7) 94 (1.7) 94 (1.7) 94 (1.7) 94 (1.6) 94 (2.0)
p=20,000 94 (1.9) 93 (2.9) 94 (1.9) 93 (1.5) 93 (1.4) 93 (1.5)
Case 3: Medium signal and low noise
p=500 95 (1.8) 95 (1.8) 95 (1.8) 95 (1.9) 95 (1.9) 95 (2.0)
p=5000 95 (1.2) 95 (1.2) 95 (1.2) 96 (1.1) 96 (1.1) 95 (1.1)
p=20,000 95 (1.9) 95 (1.9) 95 (1.9) 95 (1.3) 95 (1.1) 95 (1.3)
Case 4: Medium signal and high noise
p=500 90 (2.0) 90 (2.0) 90 (2.0) 90 (1.7) 90 (1.7) 90 (1.7)
p=5000 90 (2.5) 90 (2.6) 90 (2.5) 90 (2.1) 90 (2.2) 90 (2.1)
p=20,000 88 (3.0) 88 (3.0) 88 (2.9) 89 (1.8) 89 (1.7) 89 (1.8)
Case 5: Low signal and low noise
p=500 84 (7.6) 83 (7.7) 84 (7.6) 88 (2.6) 88 (2.8) 88 (2.6)
p=5000 78 (7.1) 77 (7.3) 79 (7.1) 88 (2.6) 88 (2.6) 88 (2.6)
p=20,000 74 (8.7) 73 (9.7) 75 (8.4) 87 (3.0) 87 (2.9) 87 (3.1)
Case 6: Low signal and high noise
p=500 73 (6.5) 71 (6.7) 73 (6.4) 80 (3.6) 79 (4.5) 80 (3.5)
p=5000 66 (8.3) 65 (9.0) 66 (8.2) 79 (3.9) 79 (4.0) 79 (3.9)
p=20,000 59 (11.7) 57 (12.3) 59 (11.9) 72 (8.9) 75 (6.8) 72 (8.8)
Table 6: The mean numbers of correctly retained features (\mathcal{R}) and falsely retained features (\mathcal{F}) by EM-test with mis-specified G. The true number of clusters is 5. Simulations are generated from the negative binomial model (Section 4.1 in the main manuscript). EM-adjust means that the features are selected by the adjusted p-values and EM-0.35 means that we choose the threshold as $n^{0.35}$.
EM-adjust EM-0.35
G=5 (True) G=2 G=8 G=5 (True) G=2 G=8
Case 1: High signal and low noise
p=500 \mathcal{R} 20.0 (0.1) 20.0 (0.1) 20.0 (0.1) 20.0 (0.1) 20.0 (0.1) 20.0 (0.1)
\mathcal{F} 0.0 (0.1) 0.0 (0.1) 0.0 (0.1) 1.2 (1.1) 0.5 (0.7) 1.3 (1.2)
p=5000 \mathcal{R} 20.0 (0.0) 20.0 (0.0) 20.0 (0.0) 20.0 (0.0) 20.0 (0.0) 20.0 (0.0)
\mathcal{F} 0.0 (0.1) 0.0 (0.1) 0.0 (0.1) 10.2 (3.3) 6.3 (2.5) 11.5 (3.5)
p=20,000 \mathcal{R} 20.0 (0.0) 20.0 (0.0) 20.0 (0.0) 20.0 (0.0) 20.0 (0.0) 20.0 (0.0)
\mathcal{F} 0.0 (0.2) 0.0 (0.1) 0.0 (0.2) 40.5 (5.6) 26.1 (4.4) 46.7 (5.9)
Case 2: High signal and high noise
p=500 \mathcal{R} 20.0 (0.1) 20.0 (0.1) 20.0 (0.1) 20.0 (0.0) 20.0 (0.0) 20.0 (0.0)
\mathcal{F} 0.1 (0.2) 0.0 (0.1) 0.1 (0.3) 2.0 (1.4) 1.1 (1.0) 2.1 (1.5)
p=5000 \mathcal{R} 20.0 (0.0) 20.0 (0.0) 20.0 (0.0) 20.0 (0.0) 20.0 (0.0) 20.0 (0.0)
\mathcal{F} 0.0 (0.1) 0.0 (0.0) 0.0 (0.2) 17.9 (4.1) 12.1 (3.5) 20.1 (4.5)
p=20,000 \mathcal{R} 20.0 (0.2) 20.0 (0.2) 20.0 (0.2) 20.0 (0.0) 20.0 (0.0) 20.0 (0.0)
\mathcal{F} 0.1 (0.3) 0.0 (0.2) 0.1 (0.3) 74.9 (7.7) 49.1 (6.1) 84.2 (7.9)
Case 3: Medium signal and low noise
p=500 \mathcal{R} 19.9 (0.3) 19.9 (0.3) 19.9 (0.3) 20.0 (0.1) 20.0 (0.1) 19.9 (0.2)
\mathcal{F} 0.0 (0.1) 0.0 (0.1) 0.0 (0.1) 1.2 (1.1) 0.5 (0.7) 1.2 (1.2)
p=5000 \mathcal{R} 19.8 (0.4) 19.8 (0.4) 19.8 (0.4) 20.0 (0.1) 20.0 (0.1) 20.0 (0.1)
\mathcal{F} 0.0 (0.1) 0.0 (0.1) 0.0 (0.1) 10.1 (3.2) 6.4 (2.6) 11.6 (3.6)
p=20,000 \mathcal{R} 19.6 (0.6) 19.6 (0.6) 19.6 (0.6) 20.0 (0.0) 20.0 (0.0) 20.0 (0.0)
\mathcal{F} 0.1 (0.2) 0.0 (0.1) 0.1 (0.2) 40.8 (5.7) 26.4 (4.5) 46.9 (6.0)
Case 4: Medium signal and high noise
p=500 \mathcal{R} 19.8 (0.5) 19.7 (0.5) 19.8 (0.5) 20.0 (0.1) 20.0 (0.2) 20.0 (0.1)
\mathcal{F} 0.0 (0.1) 0.0 (0.1) 0.1 (0.3) 2.0 (1.4) 1.3 (1.1) 2.2 (1.5)
p=5000 \mathcal{R} 19.2 (0.9) 19.1 (0.9) 19.2 (0.9) 20.0 (0.1) 20.0 (0.1) 20.0 (0.1)
\mathcal{F} 0.0 (0.2) 0.0 (0.0) 0.0 (0.2) 17.9 (4.1) 12.1 (3.5) 20.0 (4.4)
p=20,000 \mathcal{R} 18.7 (1.2) 18.6 (1.2) 18.8 (1.1) 20.0 (0.2) 19.9 (0.2) 19.9 (0.2)
\mathcal{F} 0.1 (0.3) 0.1 (0.2) 0.1 (0.3) 74.7 (7.5) 48.9 (6.0) 83.7 (8.2)
Case 5: Low signal and low noise
p=500 \mathcal{R} 16.5 (2.0) 15.9 (1.9) 16.6 (2.0) 18.9 (1.0) 18.7 (1.1) 19.0 (0.9)
\mathcal{F} 0.0 (0.1) 0.0 (0.1) 0.0 (0.1) 1.1 (1.1) 0.7 (0.7) 1.3 (1.1)
p=5000 \mathcal{R} 13.7 (2.0) 13.3 (2.1) 13.8 (2.1) 19.0 (1.0) 18.9 (1.0) 19.0 (1.0)
\mathcal{F} 0.0 (0.1) 0.0 (0.0) 0.0 (0.1) 10.1 (3.3) 6.3 (2.6) 11.6 (3.6)
p=20,000 \mathcal{R} 12.1 (2.3) 11.8 (2.4) 12.3 (2.3) 18.9 (1.1) 18.7 (1.1) 18.9 (1.1)
\mathcal{F} 0.0 (0.1) 0.0 (0.0) 0.0 (0.2) 40.5 (5.6) 26.4 (4.6) 46.9 (5.8)
Case 6: Low signal and high noise
p=500 \mathcal{R} 14.7 (2.1) 14.0 (2.0) 14.9 (2.1) 18.4 (1.2) 17.9 (1.5) 18.5 (1.1)
\mathcal{F} 0.0 (0.1) 0.0 (0.1) 0.0 (0.2) 1.9 (1.4) 1.3 (1.0) 2.1 (1.5)
p=5000 \mathcal{R} 12.0 (2.3) 11.6 (2.4) 12.0 (2.2) 18.4 (1.3) 18.1 (1.3) 18.4 (1.2)
\mathcal{F} 0.0 (0.1) 0.0 (0.0) 0.0 (0.1) 18.1 (4.2) 12.1 (3.5) 20.0 (4.5)
p=20,000 \mathcal{R} 10.0 (2.5) 9.6 (2.5) 10.1 (2.5) 18.1 (1.4) 17.8 (1.4) 18.1 (1.4)
\mathcal{F} 0.1 (0.3) 0.0 (0.2) 0.1 (0.3) 74.9 (7.4) 49.0 (5.9) 84.0 (7.9)

E.2 Simulations for EM-test with different penalties λ

Table 7: The mean numbers of correctly retained features (\mathcal{R}) and falsely retained features (\mathcal{F}) over 100 replications by EM-test with different penalties λ. Simulations are generated from the negative binomial model (Section 4.1 in the main manuscript).
p=500 p=5000 p=20,000
EM-adjust EM-0.35 EM-adjust EM-0.35 EM-adjust EM-0.35
Medium signal and high noise
\lambda=10^{-7} \mathcal{R} 19.8 (0.5) 20.0 (0.1) 19.2 (0.9) 20.0 (0.1) 18.7 (1.2) 20.0 (0.2)
\mathcal{F} 0.0 (0.1) 2.0 (1.4) 0.0 (0.2) 17.9 (4.1) 0.1 (0.3) 74.7 (7.5)
\lambda=10^{-5} \mathcal{R} 19.8 (0.5) 20.0 (0.1) 19.2 (0.9) 20.0 (0.1) 18.7 (1.2) 20.0 (0.2)
\mathcal{F} 0.0 (0.1) 2.0 (1.4) 0.0 (0.2) 17.9 (4.1) 0.1 (0.3) 74.7 (7.5)
\lambda=10^{-3} \mathcal{R} 19.8 (0.5) 20.0 (0.1) 19.2 (0.9) 20.0 (0.1) 18.7 (1.2) 20.0 (0.2)
\mathcal{F} 0.0 (0.1) 2.0 (1.4) 0.0 (0.2) 17.8 (4.1) 0.1 (0.3) 74.6 (7.5)
\lambda=10^{-1} \mathcal{R} 19.7 (0.5) 20.0 (0.1) 19.2 (0.9) 20.0 (0.1) 18.7 (1.2) 20.0 (0.2)
\mathcal{F} 0.0 (0.1) 1.8 (1.3) 0.0 (0.2) 17.5 (4.1) 0.1 (0.3) 73.2 (7.4)
\lambda=1 \mathcal{R} 19.7 (0.5) 20.0 (0.1) 19.2 (0.9) 20.0 (0.1) 18.7 (1.2) 20.0 (0.2)
\mathcal{F} 0.0 (0.1) 1.6 (1.3) 0.0 (0.1) 15.8 (3.9) 0.1 (0.3) 65.8 (7.5)
\lambda=10 \mathcal{R} 19.7 (0.5) 20.0 (0.1) 19.2 (0.9) 20.0 (0.1) 18.7 (1.2) 20.0 (0.2)
\mathcal{F} 0.0 (0.1) 1.5 (1.3) 0.0 (0.1) 14.3 (3.7) 0.1 (0.2) 58.9 (7.0)
\lambda=100 \mathcal{R} 19.7 (0.5) 20.0 (0.1) 19.2 (0.9) 20.0 (0.1) 18.7 (1.2) 20.0 (0.2)
\mathcal{F} 0.0 (0.1) 1.5 (1.3) 0.0 (0.1) 14.0 (3.7) 0.1 (0.2) 58.1 (6.8)

E.3 Simulations for EM-test with different iteration steps K

Table 8: The mean numbers of correctly retained features (\mathcal{R}) and falsely retained features (\mathcal{F}) over 100 replications by EM-test with different iteration steps K. Simulations are generated from the negative binomial model (Section 4.1 in the main manuscript).
p=500 p=5000 p=20,000
EM-adjust EM-0.35 EM-adjust EM-0.35 EM-adjust EM-0.35
Medium signal and high noise
K=1 \mathcal{R} 19.6 (0.6) 20.0 (0.2) 4.2 (3.4) 12.5 (3.5) 0.0 (0.0) 2.8 (2.1)
\mathcal{F} 0.0 (0.0) 0.0 (0.0) 0.0 (0.0) 0.0 (0.0) 0.0 (0.0) 0.0 (0.1)
K=3 \mathcal{R} 19.7 (0.5) 20.0 (0.2) 15.0 (2.8) 19.0 (1.3) 6.8 (2.5) 15.5 (1.8)
\mathcal{F} 0.0 (0.0) 0.0 (0.0) 0.0 (0.0) 0.1 (0.2) 0.0 (0.0) 0.2 (0.5)
K=5 \mathcal{R} 19.7 (0.5) 20.0 (0.2) 18.1 (1.7) 19.8 (0.5) 14.7 (2.1) 19.1 (0.9)
\mathcal{F} 0.0 (0.0) 0.1 (0.2) 0.0 (0.0) 0.5 (0.6) 0.0 (0.0) 1.7 (1.4)
K=10 \mathcal{R} 19.7 (0.5) 20.0 (0.2) 18.9 (1.0) 20.0 (0.2) 18.0 (1.4) 19.9 (0.3)
\mathcal{F} 0.0 (0.0) 0.2 (0.4) 0.0 (0.0) 2.1 (1.4) 0.0 (0.1) 8.6 (3.1)
K=20 \mathcal{R} 19.7 (0.5) 20.0 (0.1) 19.1 (1.0) 20.0 (0.1) 18.5 (1.2) 19.9 (0.2)
\mathcal{F} 0.0 (0.0) 0.6 (0.7) 0.0 (0.0) 5.8 (2.5) 0.1 (0.2) 23.3 (4.4)
K=50 \mathcal{R} 19.7 (0.5) 20.0 (0.1) 19.1 (0.9) 20.0 (0.1) 18.6 (1.2) 19.9 (0.2)
\mathcal{F} 0.0 (0.1) 1.1 (1.0) 0.0 (0.0) 10.8 (3.5) 0.1 (0.2) 42.8 (6.2)
K=100 \mathcal{R} 19.7 (0.5) 20.0 (0.1) 19.2 (0.9) 20.0 (0.1) 18.6 (1.2) 19.9 (0.2)
\mathcal{F} 0.0 (0.1) 1.6 (1.3) 0.0 (0.0) 13.6 (3.6) 0.1 (0.2) 55.4 (7.1)
K=200 \mathcal{R} 19.8 (0.5) 20.0 (0.1) 19.2 (0.9) 20.0 (0.1) 18.7 (1.2) 20.0 (0.2)
\mathcal{F} 0.0 (0.1) 2.0 (1.4) 0.0 (0.2) 17.9 (4.1) 0.1 (0.3) 74.7 (7.5)

E.4 Simulations for EM-test with different thresholds ϑ

Table 9: The mean numbers of correctly retained features (\mathcal{R}) and falsely retained features (\mathcal{F}) by the EM-test with different thresholds over 100 replications. Simulations are generated from the negative binomial model (Section 4.1 in the main manuscript). The numbers in parentheses are the standard deviations of \mathcal{R} and \mathcal{F} over 100 replications. EM-0.2 means that we choose the threshold as $n^{0.2}$; EM-0.25 through EM-0.45 are defined similarly.
EM-0.2 EM-0.25 EM-0.3 EM-0.35 EM-0.4 EM-0.45
Case 1: High signal and low noise
p=500 \mathcal{R} 20 (0.1) 20 (0.1) 20 (0.1) 20 (0.1) 20 (0.1) 20 (0.1)
\mathcal{F} 45 (6.4) 19 (4.3) 6 (2.3) 1 (1.1) 0 (0.3) 0 (0.1)
p=5000 \mathcal{R} 20 (0.0) 20 (0.0) 20 (0.0) 20 (0.0) 20 (0.0) 20 (0.0)
\mathcal{F} 459 (20.7) 189 (12.2) 57 (7.4) 10 (3.3) 1 (1.0) 0 (0.1)
p=20,000 \mathcal{R} 20 (0.0) 20 (0.0) 20 (0.0) 20 (0.0) 20 (0.0) 20 (0.0)
\mathcal{F} 1833 (38.3) 757 (24.7) 223 (12.9) 40 (5.6) 4 (2.0) 0 (0.4)
Case 2: High signal and high noise
p=500 \mathcal{R} 20 (0.0) 20 (0.0) 20 (0.0) 20 (0.0) 20 (0.0) 20 (0.2)
\mathcal{F} 63 (7.5) 29 (5.6) 9 (2.9) 2 (1.4) 0 (0.4) 0 (0.0)
p=5000 \mathcal{R} 20 (0.0) 20 (0.0) 20 (0.0) 20 (0.0) 20 (0.0) 20 (0.0)
\mathcal{F} 658 (23.4) 290 (16.0) 90 (9.4) 18 (4.1) 2 (1.3) 0 (0.1)
p=20,000 \mathcal{R} 20 (0.0) 20 (0.0) 20 (0.0) 20 (0.0) 20 (0.0) 20 (0.0)
\mathcal{F} 2637 (52.6) 1177 (36.3) 375 (21.5) 75 (7.7) 8 (3.1) 0 (0.6)
Case 3: Medium signal and low noise
p=500 \mathcal{R} 20 (0.1) 20 (0.1) 20 (0.1) 20 (0.1) 20 (0.2) 20 (0.4)
\mathcal{F} 45 (6.7) 19 (4.2) 6 (2.4) 1 (1.1) 0 (0.2) 0 (0.1)
p=5000 \mathcal{R} 20 (0.0) 20 (0.0) 20 (0.1) 20 (0.1) 20 (0.1) 20 (0.4)
\mathcal{F} 459 (20.4) 189 (12.7) 57 (7.1) 10 (3.2) 1 (1.0) 0 (0.1)
p=20,000 \mathcal{R} 20 (0.0) 20 (0.0) 20 (0.0) 20 (0.0) 20 (0.2) 20 (0.4)
\mathcal{F} 1834 (41.0) 757 (25.6) 223 (13.0) 41 (5.7) 4 (2.0) 0 (0.4)
Case 4: Medium signal and high noise
p=500 \mathcal{R} 20 (0.0) 20 (0.0) 20 (0.1) 20 (0.1) 20 (0.4) 19 (0.8)
\mathcal{F} 63 (7.6) 28 (5.5) 9 (2.9) 2 (1.4) 0 (0.4) 0 (0.0)
p=5000 \mathcal{R} 20 (0.0) 20 (0.0) 20 (0.0) 20 (0.1) 20 (0.4) 19 (0.9)
\mathcal{F} 659 (23.0) 291 (16.2) 90 (9.1) 18 (4.1) 2 (1.3) 0 (0.2)
p=20,000 \mathcal{R} 20 (0.0) 20 (0.1) 20 (0.1) 20 (0.2) 20 (0.4) 19 (0.9)
\mathcal{F} 2637 (51.5) 1177 (34.9) 375 (21.3) 75 (7.5) 8 (3.0) 0 (0.6)
Case 5: Low signal and low noise
p=500 \mathcal{R} 20 (0.3) 20 (0.5) 19 (0.7) 19 (1.0) 18 (1.5) 15 (2.0)
\mathcal{F} 45 (6.9) 19 (4.1) 6 (2.3) 1 (1.1) 0 (0.2) 0 (0.1)
p=5000 \mathcal{R} 20 (0.3) 20 (0.3) 20 (0.5) 19 (1.0) 17 (1.6) 14 (1.9)
\mathcal{F} 459 (20.7) 189 (11.8) 57 (7.4) 10 (3.3) 1 (1.0) 0 (0.1)
p=20,000 \mathcal{R} 20 (0.3) 20 (0.4) 20 (0.6) 19 (1.1) 17 (1.4) 14 (1.9)
\mathcal{F} 1833 (38.3) 758 (25.2) 223 (13.3) 41 (5.6) 4 (1.9) 0 (0.4)
Case 6: Low signal and high noise
p=500 \mathcal{R} 20 (0.4) 20 (0.6) 19 (0.8) 18 (1.2) 16 (1.7) 13 (2.0)
\mathcal{F} 65 (7.8) 29 (5.4) 10 (3.2) 2 (1.4) 0 (0.4) 0 (0.0)
p=5000 \mathcal{R} 20 (0.4) 20 (0.6) 19 (0.8) 18 (1.3) 16 (1.7) 13 (2.1)
\mathcal{F} 660 (24.4) 291 (16.3) 91 (9.2) 18 (4.2) 2 (1.2) 0 (0.1)
p=20,000 \mathcal{R} 20 (0.5) 20 (0.6) 19 (0.9) 18 (1.4) 16 (1.7) 12 (2.0)
\mathcal{F} 2636 (52.9) 1176 (36.7) 374 (21.7) 75 (7.4) 8 (3.0) 0 (0.6)

E.5 Simulations for continuous data

In this section, we perform simulations for continuous data. The distribution family is chosen as the normal distribution. We consider three dimension setups: $p=500$, $5000$ and $20,000$. The sample size is set as $n=1000$ and the number of cluster-relevant features is $s=20$.

We first consider the balanced scenario. We set the number of clusters as $G=5$ and the proportions of the clusters as ${\bm{\alpha}}=(\alpha_{1},\dots,\alpha_{5})=(0.2,0.2,0.2,0.2,0.2)$. For the $i$th sample, we first randomly assign it to a cluster $g$ with probability $\alpha_{g}$. Then, if the $j$th feature is cluster-relevant ($j=1,\dots,20$), we randomly sample $x_{ij}$ from ${\rm Normal}(\mu_{gj},\sigma^{2}_{j})$; if it is cluster-irrelevant ($j=21,\dots,p$), we randomly sample $x_{ij}$ from ${\rm Normal}(\mu_{j},\sigma^{2}_{j})$. We independently generate $\sigma_{j}$ from the uniform distribution ${\rm U}(1,1.5)$. For the cluster-relevant features ($j=1,\dots,20$), the mean parameters $\mu_{gj}$ are set as either $u_{j}$ or $u_{j}+D_{j}$, where $u_{j}$ is generated from ${\rm U}(-5,5)$ and $D_{j}$ controls the signal strength (the differences between clusters). We generate $D_{j}$ from ${\rm U}(10,11)$. For the first 5 features ($1\leq j\leq 5$), we set $\mu_{2j}=u_{j}+D_{j}$ and $\mu_{gj}=u_{j}$ ($g\neq 2$). Similarly, for $5k+1\leq j\leq 5k+5$ ($k=1,2,3$), we set $\mu_{k+2,j}=u_{j}+D_{j}$ and $\mu_{gj}=u_{j}$ ($g\neq k+2$). For all cluster-irrelevant features ($j=21,\dots,p$), we set $\mu_{j}=u_{j}$.

We then consider the unbalanced scenario, with $G=5$ clusters and cluster proportions ${\bm{\alpha}}=(\alpha_{1},\dots,\alpha_{5})=(0.5,0.125,0.125,0.125,0.125)$. We independently generate $\sigma_{j}$ from the uniform distribution ${\rm U}(1,2)$ and $D_{j}$ from ${\rm U}(3,4)$. The mean parameters $\mu_{gj}$ are otherwise constructed exactly as in the balanced scenario.
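A compact Python sketch of the balanced generator is given below (hypothetical code for illustration, not the exact code used for the paper; the unbalanced scenario only changes alpha and the ranges of sigma_j and D_j).

```python
import numpy as np

rng = np.random.default_rng(0)   # illustrative seed
n, p, G, s = 1000, 5000, 5, 20

alpha = np.full(G, 0.2)                  # balanced cluster proportions
sigma = rng.uniform(1, 1.5, size=p)
u = rng.uniform(-5, 5, size=p)
D = rng.uniform(10, 11, size=s)

mu = np.tile(u, (G, 1))                  # same block structure as in Appendix D
for k in range(4):
    block = np.arange(5 * k, 5 * k + 5)
    mu[k + 1, block] += D[block]

z = rng.choice(G, size=n, p=alpha)       # cluster labels
X = rng.normal(loc=mu[z], scale=sigma)   # n-by-p data matrix
```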

Table 10: The mean numbers of correctly retained features (\mathcal{R}) and falsely retained features (\mathcal{F}) by different methods in the normal simulations. EM-adjust means that the features are selected by the adjusted p-values and EM-0.35 means that we choose the threshold as $n^{0.35}$. KS-test is the test used by IF-PCA (Jin and Wang, 2016), Dip-test is the unimodality test (Chan and Hall, 2010) and COSCI is the feature screening method of Banerjee et al. (2017). COSCI is not evaluated for $p=20,000$ because of its high computational cost.
EM-adjust EM-0.35 SC-FS KS-test Dip-test COSCI
Case 1: Balanced
p=500 \mathcal{R} 20 (0.0) 20 (0.0) 20 (0.0) 20 (0.0) 20 (0.0) 16 (2.1)
\mathcal{F} 0 (0.5) 4 (2.2) 0 (0.0) 0 (0.0) 0 (0.0) 45 (4.6)
p=5000 \mathcal{R} 20 (0.0) 20 (0.0) 20 (0.0) 20 (0.0) 20 (0.0) 17 (1.6)
\mathcal{F} 0 (0.6) 39 (6.2) 0 (0.0) 0 (0.0) 0 (0.0) 532 (16.4)
p=20,000 \mathcal{R} 20 (0.0) 20 (0.0) 16 (4.3) 20 (0.0) 20 (0.0) NA
\mathcal{F} 0 (0.6) 153 (12.5) 0 (0.0) 0 (0.0) 0 (0.0) NA
Case 2: Unbalanced
p=500 \mathcal{R} 18 (1.4) 19 (0.8) 20 (0.7) 6 (2.3) 0 (0.0) 1 (0.9)
\mathcal{F} 0 (0.5) 4 (2.0) 0 (0.0) 0 (0.0) 0 (0.0) 53 (5.3)
p=5000 \mathcal{R} 16 (1.7) 19 (0.9) 0 (0.3) 4 (2.0) 0 (0.0) 1 (1.1)
\mathcal{F} 0 (0.5) 39 (5.9) 0 (0.0) 0 (0.0) 0 (0.0) 538 (17.2)
p=20,000 \mathcal{R} 15 (2.0) 19 (0.9) 0 (0.0) 3 (1.7) 0 (0.0) NA
\mathcal{F} 0 (0.6) 153 (11.9) 0 (0.0) 0 (0.0) 0 (0.0) NA
Table 11: Similar to Table 10 but for clustering accuracy. The means and standard deviations (in parentheses) of ARIs over 100 replications by different methods in the normal simulations. The values in the table are shown as the actual values × 100. No-Screening means that we use all features for clustering. Oracle means that we only use the $s=20$ cluster-relevant features for clustering.
No-Screening Oracle EM-adjust EM-0.35 SC-FS KS-test Dip-test COSCI
Case 1: Balanced
p=500 71 (10.6) 96 (9.5) 97 (8.9) 94 (11.4) 97 (8.6) 96 (9.5) 96 (9.7) 87 (11.6)
p=5000 62 (2.8) 94 (11.1) 94 (11.1) 90 (9.3) 94 (11.4) 94 (11.1) 96 (9.7) 75 (4.1)
p=20,000 16 (4.9) 93 (11.7) 93 (12.3) 80 (5.1) 66 (18.1) 93 (11.7) 96 (10.0) NA
Case 2: Unbalanced
p=500 61 (9.9) 97 (1.5) 95 (3.4) 96 (2.1) 96 (5.3) 52 (21.5) 0 (0.0) 1 (1.8)
p=5000 1 (0.6) 96 (3.0) 93 (5.1) 94 (3.7) 0 (1.6) 36 (20.4) 0 (0.0) 0 (0.2)
p=20,000 0 (0.2) 97 (1.5) 92 (7.1) 86 (8.2) 0 (0.0) 32 (19.8) 0 (0.0) NA

Results for these two scenarios are presented in Tables 10 and 11. We first compare the different algorithms in terms of the numbers of correctly and falsely retained features (Table 10). The balanced case is a relatively simple scenario and many methods work well. Overall, the two versions of EM-test correctly select most cluster-relevant features and have very few false positives. The unbalanced case is much more challenging, and EM-test outperforms the other methods by a large margin, especially when $p$ is large. Dip-test (Chan and Hall, 2010) is very conservative in this case and does not select any features. KS-test (the test used by IF-PCA) is able to select a few important features with very few false positives. COSCI has a large number of false positives. SC-FS performs well in the low dimensional case ($p=500$), but cannot select any feature in the higher dimensional cases. We also compare the clustering accuracy based on the selected features (Table 11). Again, EM-test outperforms the other methods, especially in the more challenging unbalanced case.

E.6 EM-test under mis-specified models

We consider two mis-specified models to investigate the robustness of the proposed method: the Poisson-truncated-normal and the binomial-Gamma distributions. A random variable $x$ is said to follow a Poisson-truncated-normal distribution ${\rm PTN}(\mu,\sigma,\gamma)$ ($\mu\in\mathbb{R}$, $\sigma>0$, $\gamma>0$) if, conditional on a latent variable $\lambda$, $x$ follows a Poisson distribution with mean $\lambda$, and the latent variable $\lambda$ follows the truncated normal distribution with probability density function

f(\lambda;\mu,\sigma,\gamma)={\frac{1}{\sigma}}\,{\frac{\phi\left({\frac{\lambda-\mu}{\sigma}}\right)}{1-\Phi\left({\frac{\gamma-\mu}{\sigma}}\right)}},\qquad\lambda\geq\gamma,

where $\phi(\cdot)$ is the probability density function of the standard normal distribution and $\Phi(\cdot)$ is its cumulative distribution function. In this simulation, we set $\gamma=0.5$ and $\sigma=1$.
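For illustration, sampling from ${\rm PTN}(\mu,\sigma,\gamma)$ amounts to drawing the truncated-normal latent rate first and the Poisson count second. Below is a minimal sketch assuming scipy is available (the function name rptn is ours, not part of the paper's code):

```python
import numpy as np
from scipy.stats import truncnorm

def rptn(mu, sigma=1.0, gamma=0.5, size=1, rng=None):
    """Draw from PTN(mu, sigma, gamma): Poisson counts whose rate lambda
    follows a normal distribution truncated to [gamma, infinity)."""
    rng = rng if rng is not None else np.random.default_rng()
    a = (gamma - mu) / sigma                 # standardized lower bound
    lam = truncnorm.rvs(a, np.inf, loc=mu, scale=sigma,
                        size=size, random_state=rng)
    return rng.poisson(lam)
```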

The simulation data generation is otherwise the same as in the negative binomial case. We set the number of clusters as $G=5$ and the proportions of the clusters as

{\bm{\alpha}}=(\alpha_{1},\dots,\alpha_{5})=(0.5,0.125,0.125,0.125,0.125).

We consider three dimension setups: $p=500$, $5000$ and $20,000$. The sample size is set as $n=1000$ and the number of cluster-relevant features is $s=20$. For the cluster-relevant features ($j=1,\dots,20$), the mean parameters $\mu_{gj}$ of the truncated normal distribution are set as either $\exp(u_{j})$ or $\exp(u_{j})+D_{j}$, where $u_{j}$ is generated from ${\rm U}(\log 2,\log 5)$ and $D_{j}$ controls the signal strength (the differences between clusters). We generate $D_{j}$ from ${\rm U}(7,8)$. The block structure of the $\mu_{gj}$'s is the same as in Appendix D: for the first 5 features ($1\leq j\leq 5$), we set $\mu_{2j}=\exp(u_{j})+D_{j}$ and $\mu_{gj}=\exp(u_{j})$ ($g\neq 2$); for $5k+1\leq j\leq 5k+5$ ($k=1,2,3$), we set $\mu_{k+2,j}=\exp(u_{j})+D_{j}$ and $\mu_{gj}=\exp(u_{j})$ ($g\neq k+2$); and for all cluster-irrelevant features ($j=21,\dots,p$), we set $\mu_{j}=\exp(u_{j})$.

We next consider another mis-specified model, the binomial-Gamma distribution, defined as follows. A random variable $x$ is said to follow a binomial-Gamma distribution ${\rm BG}(z,\mu,r)$ if

x\,|\,\lambda\sim{\rm Binomial}(\lceil\max(z,\lambda)\rceil,\lambda/\lceil\max(z,\lambda)\rceil),\qquad\lambda\sim{\rm Gamma}(r,\mu/r),

where $r$ is the shape parameter and $\mu/r$ is the scale parameter of the Gamma distribution. In this simulation, we set $z=100$. We set the number of clusters as $G=5$ and the proportions of the clusters as ${\bm{\alpha}}=(\alpha_{1},\dots,\alpha_{5})=(0.5,0.125,0.125,0.125,0.125)$. We consider three dimension setups: $p=500$, $5000$ and $20,000$. The sample size is set as $n=1000$ and the number of cluster-relevant features is $s=20$. We independently generate $r_{j}$ from the uniform distribution ${\rm U}(5,6)$. For the cluster-relevant features ($j=1,\dots,20$), the mean parameters $\mu_{gj}$ of the binomial-Gamma distribution (${\rm BG}(z,\mu_{gj},r_{j})$) are set as either $\exp(u_{j})$ or $\exp(u_{j})+D_{j}$, where $u_{j}$ is generated from ${\rm U}(\log 2,\log 5)$ and $D_{j}$ controls the signal strength. We generate $D_{j}$ from ${\rm U}(2,3)$. The block structure of the $\mu_{gj}$'s is again the same as in Appendix D, and for all cluster-irrelevant features ($j=21,\dots,p$) we set $\mu_{j}=\exp(u_{j})$.
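Sampling from ${\rm BG}(z,\mu,r)$ follows directly from the definition; a hypothetical sketch (the function name rbg is ours):

```python
import numpy as np

def rbg(mu, r, z=100, size=1, rng=None):
    """Draw from BG(z, mu, r): Binomial counts with a Gamma latent rate."""
    rng = rng if rng is not None else np.random.default_rng()
    lam = rng.gamma(shape=r, scale=mu / r, size=size)  # E[lambda] = mu
    n = np.ceil(np.maximum(z, lam)).astype(int)        # number of trials
    return rng.binomial(n, lam / n)                    # success probability lambda/n <= 1
```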

For comparison, we also include the continuous methods IF-PCA (KS-test), Dip-test and COSCI. Before applying the continuous methods, we apply a log transformation ($\log(x+1)$) to the count data to make the data more like continuous data. Because of the computational burden, COSCI is not considered for the $p=20,000$ simulations. Tables 12 and 13 show the simulation results for the Poisson-truncated-normal model. EM-test still performs the best under this mis-specified model. For example, the ARIs of the clustering results based on the features selected by EM-test are consistently larger than those of the other methods, especially in the higher dimensional setups (Table 12). EM-test selects almost all cluster-relevant features with few false positives. The other methods either select too few cluster-relevant features or report too many false positives. The continuous methods do not perform well for these count data: KS-test and Dip-test report many false positives and select almost all features as cluster-relevant (adjusted p-values $<0.01$), whereas COSCI is very conservative in this simulation and does not select any features. The simulation results for the binomial-Gamma mis-specified model are similar and are shown in Tables 14 and 15.

Table 12: The means and standard deviations (in parentheses) of ARIs over 100 replications by different methods under the Poisson-truncated-normal (mis-specified) model. The values in the table are shown as the actual values × 100. EM-adjust means that the features are selected by the adjusted p-values and EM-0.35 means that we choose the threshold as $n^{0.35}$. The three continuous methods Dip-test, KS-test and COSCI are applied to the normalized data (log normalization).
No-Screening EM-adjust EM-0.35 Chi-square SC-FS Dip-test KS-test COSCI
p=500 93 (6.1) 99 (2.7) 99 (2.7) 82 (15.4) 98 (4.0) 95 (1.4) 95 (1.5) 0 (0.0)
p=5000 7 (2.7) 99 (1.2) 99 (0.5) 50 (24.6) 37 (28.0) 9 (2.6) 9 (2.6) 0 (0.0)
p=20,000 0 (0.3) 99 (0.9) 98 (0.7) 24 (21.4) 0 (0.0) 1 (0.3) 1 (0.3) NA
Table 13: The mean numbers of correctly retained features (\mathcal{R}) and falsely retained features (\mathcal{F}) by different methods under the Poisson-truncated-normal (mis-specified) model.
EM-adjust EM-0.35 Chi-square SC-FS Dip-test KS-test COSCI
p=500 \mathcal{R} 20 (0.4) 20 (0.1) 10 (2.7) 20 (0.4) 20 (0.0) 20 (0.0) 0 (0.0)
\mathcal{F} 0 (0.3) 3 (1.7) 0 (0.4) 0 (0.0) 480 (0.0) 480 (0.0) 0 (0.0)
p=5000 \mathcal{R} 19 (0.7) 20 (0.0) 5 (2.5) 14 (4.7) 20 (0.0) 20 (0.0) 0 (0.0)
\mathcal{F} 0 (0.3) 31 (6.0) 0 (0.5) 0 (0.0) 4980 (0.0) 4980 (0.0) 0 (0.0)
p=20,000 \mathcal{R} 19 (1.1) 20 (0.1) 2 (1.9) 0 (0.2) 20 (0.0) 20 (0.0) NA
\mathcal{F} 0 (0.4) 124 (10.3) 0 (0.3) 0 (0.0) 19980 (0.0) 19980 (0.0) NA
Table 14: The means and standard deviations (in parentheses) of ARIs over 100 replications by different methods under the binomial-Gamma (mis-specified) model. The values in the table are shown as the actual values × 100. EM-adjust means that the features are selected by the adjusted p-values and EM-0.35 means that we choose the threshold as $n^{0.35}$. The three continuous methods Dip-test, KS-test and COSCI are applied to the normalized data (log normalization).
No-Screening EM-adjust EM-0.35 Chi-square SC-FS Dip-test KS-test COSCI
p=500 73 (33.1) 83 (32.9) 86 (29.9) 75 (39.7) 82 (30.9) 79 (33.7) 78 (33.9) 2 (4.0)
p=5000 14 (12.0) 82 (35.2) 84 (31.2) 70 (39.2) 56 (47.6) 16 (12.6) 16 (13.3) 0 (1.1)
p=20,000 1 (0.6) 81 (35.4) 81 (33.9) 45 (42.3) 0 (2.1) 1 (0.7) 1 (0.8) NA
Table 15: The mean numbers of correctly retained features (\mathcal{R}) and falsely retained features (\mathcal{F}) by different methods under the binomial-Gamma (mis-specified) model.
EM-adjust EM-0.35 Chi-square SC-FS Dip-test KS-test COSCI
p=500 \mathcal{R} 16 (6.8) 17 (5.7) 14 (8.0) 18 (5.0) 20 (0.3) 19 (1.5) 1 (1.1)
\mathcal{F} 0 (0.3) 3 (2.4) 68 (95.4) 0 (0.3) 480 (0.2) 453 (35.2) 40 (22.5)
p=5000 \mathcal{R} 16 (7.2) 17 (5.6) 14 (8.3) 13 (9.0) 20 (0.3) 19 (1.5) 1 (1.1)
\mathcal{F} 0 (0.4) 25 (17.5) 676 (975.2) 0 (0.0) 4980 (0.9) 4697 (362.0) 425 (227.6)
p=20,000 \mathcal{R} 16 (7.2) 17 (5.7) 13 (8.4) 0 (1.4) 20 (0.3) 19 (1.2) NA
\mathcal{F} 0 (0.4) 100 (68.9) 2703 (3934.1) 0 (0.0) 19979 (4.4) 18835 (1459.0) NA

E.7 Simulations for the limiting distribution

Table 16: The means and standard deviations (in parentheses) of FDR and power over 100 replications. EM(1) means that the p-values are obtained from the $\chi^{2}(3)$ distribution, and EM(2) means that the p-values are calculated using the limiting distribution in Theorem 4. The Benjamini–Hochberg procedure controls the FDR at 0.01. Simulations are generated from the negative binomial model (Section 4.1 in the main manuscript).
EM (1) EM (2) Chi-square
FDR Power FDR Power FDR Power
Case 1: High signal and low noise
p=500 0.00 (0.01) 1.00 (0.01) 0.01 (0.02) 1.00 (0.00) 0.02 (0.03) 0.98 (0.03)
p=5000 0.00 (0.00) 1.00 (0.00) 0.02 (0.03) 1.00 (0.00) 0.02 (0.03) 0.93 (0.06)
p=20,000 0.00 (0.01) 1.00 (0.00) NA NA 0.01 (0.03) 0.90 (0.06)
Case 2: High signal and high noise
p=500 0.00 (0.01) 1.00 (0.01) 0.02 (0.03) 1.00 (0.00) 0.01 (0.03) 0.92 (0.06)
p=5000 0.00 (0.01) 1.00 (0.00) 0.04 (0.04) 1.00 (0.00) 0.02 (0.03) 0.83 (0.09)
p=20,000 0.00 (0.01) 1.00 (0.01) NA NA 0.02 (0.04) 0.75 (0.10)
Case 3: Medium signal and low noise
p=500 0.00 (0.01) 1.00 (0.01) 0.01 (0.02) 1.00 (0.01) 0.01 (0.03) 0.78 (0.12)
p=5000 0.00 (0.00) 0.99 (0.02) 0.02 (0.03) 1.00 (0.02) 0.02 (0.03) 0.61 (0.11)
p=20,000 0.00 (0.01) 0.98 (0.03) NA NA 0.01 (0.04) 0.50 (0.13)
Case 4: Medium signal and high noise
p=500 0.00 (0.01) 0.99 (0.03) 0.02 (0.03) 1.00 (0.02) 0.02 (0.04) 0.65 (0.11)
p=5000 0.00 (0.01) 0.96 (0.04) 0.04 (0.05) 0.98 (0.03) 0.01 (0.03) 0.46 (0.13)
p=20,000 0.00 (0.01) 0.93 (0.06) NA NA 0.02 (0.05) 0.34 (0.13)
Case 5: Low signal and low noise
p=500 0.00 (0.01) 0.82 (0.10) 0.01 (0.02) 0.90 (0.08) 0.02 (0.07) 0.20 (0.12)
p=5000 0.00 (0.01) 0.69 (0.10) 0.03 (0.04) 0.80 (0.09) 0.03 (0.12) 0.08 (0.08)
p=20,000 0.00 (0.01) 0.61 (0.12) NA NA 0.02 (0.11) 0.04 (0.05)
Case 6: Low signal and high noise
p=500 0.00 (0.01) 0.73 (0.10) 0.02 (0.04) 0.83 (0.09) 0.02 (0.07) 0.14 (0.10)
p=5000 0.00 (0.01) 0.60 (0.11) 0.05 (0.05) 0.74 (0.09) 0.02 (0.11) 0.05 (0.05)
p=20,000 0.01 (0.02) 0.50 (0.12) NA NA 0.02 (0.12) 0.03 (0.05)

E.8 Computation time of different methods

Table 17: The computation time (in seconds) of different methods. Simulations are generated from the negative binomial model (Section 4.1 in the main manuscript).
Time (s) EM-test Chi-square SC-FS Skmeans KS-test Dip-test COSCI
p=500 14.59 25.57 0.98 265.36 1.83 1.86 53.73
p=5000 127.21 246.97 7.60 3074.06 7.01 7.09 470.50
p=20,000 505.21 986.82 30.53 NA 25.90 25.84 NA

Appendix F Details for the application to scRNA-seq data

In the analysis of the scRNA-seq data from Heming et al. (2021), we mainly follow the analysis protocol of Seurat (Butler et al., 2018). Since there are 31 patients in this dataset, we must account for batch effects (Haghverdi et al., 2018), which may have a non-negligible influence on the count matrices from different patients. It is also known that systematic differences in library size between cells are often observed in scRNA-seq data (Stegle et al., 2015). Therefore, before applying our screening procedure, we remove this confounding effect via down-sampling, such that each cell has the same total number of unique molecular identifier (UMI) counts, as sketched below.
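One common way to implement such down-sampling is to draw, for each cell, a fixed number of UMIs without replacement (a sketch under the assumption that X is an integer cell-by-gene count matrix; the paper does not specify the exact routine):

```python
import numpy as np

def downsample_counts(X, rng=None):
    """Down-sample each cell (row of X) without replacement so that all
    cells end up with the same total UMI count."""
    rng = rng if rng is not None else np.random.default_rng()
    X = np.asarray(X, dtype=np.int64)
    target = int(X.sum(axis=1).min())      # smallest library size
    out = np.empty_like(X)
    for i, cell in enumerate(X):
        out[i] = rng.multivariate_hypergeometric(cell, target)
    return out
```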

Under the assumption that clustering-informative genes must be heterogeneously distributed in at least one batch, we apply the EM-test to each batch $b=1,\dots,B$ separately and obtain a p-value $p_{j}^{(b)}$ for each gene $j$. Then, we calculate the Bonferroni-type combined p-values (Vovk and Wang, 2020):

p_{j}^{\text{comb}}=B\cdot\min\left\{p_{j}^{(1)},p_{j}^{(2)},\dots,p_{j}^{(B)}\right\},

and then perform the Benjamini–Hochberg (BH) procedure (Benjamini and Hochberg, 1995b) for false discovery rate control on the $p_{j}^{\text{comb}}$'s. Finally, we select genes with an adjusted p-value smaller than 0.01 for downstream analysis.
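The gene selection step can be sketched as follows (a minimal numpy sketch, assuming P is a genes-by-batches array of the EM-test p-values; combine_and_adjust is our name for the routine):

```python
import numpy as np

def combine_and_adjust(P):
    """Bonferroni-type combination across batches followed by a
    Benjamini-Hochberg adjustment across genes."""
    B = P.shape[1]
    p_comb = np.minimum(B * P.min(axis=1), 1.0)    # p_j^comb = B * min_b p_j^(b)
    m = p_comb.size
    order = np.argsort(p_comb)
    adj = p_comb[order] * m / np.arange(1, m + 1)  # BH: p_(i) * m / i
    adj = np.minimum.accumulate(adj[::-1])[::-1]   # enforce monotonicity
    out = np.empty(m)
    out[order] = np.minimum(adj, 1.0)
    return out

# Genes kept for downstream analysis:
# selected = np.where(combine_and_adjust(P) < 0.01)[0]
```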

Before applying dimension reduction methods, we first normalize and scale the count matrix with NormalizeData() and ScaleData() in Seurat. Then, we perform PCA and select the first 40 principal components as the input to Harmony for batch-effect removal (Korsunsky et al., 2018). After that, following the standard analysis protocol of Seurat, we construct an SNN graph from the derived "harmony dimensions", perform clustering with the Louvain method, and further reduce the dimensions of the data to two via UMAP for visualization. To annotate the derived clusters, we check the expression of the marker genes provided by Heming et al. (2021).

For the implementation of the chi-square test, we group the data into four bins according to the median and the lower and upper quartiles (as sketched below) and perform the same analyses as with the EM-test.
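A minimal sketch of the binning step (illustrative only; the chi-square statistic is then computed on the four resulting bins):

```python
import numpy as np

def quartile_bins(x):
    """Assign each observation to one of four bins split at the lower
    quartile, median, and upper quartile of x."""
    cuts = np.quantile(x, [0.25, 0.5, 0.75])
    return np.digitize(x, cuts)   # bin labels 0, 1, 2, 3
```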


References

  • Al-Mossawi et al. (2019) Al-Mossawi, H., N. Yager, C. A. Taylor, E. Lau, S. Danielli, J. J. S. de Wit, J. J. Gilchrist, I. Nassiri, E. A. Mahe, W. Lee, L. Rizvi, S. Makino, J. Cheeseman, M. J. Neville, J. C. Knight, P. Bowness, and B. P. Fairfax (2019). Context-specific regulation of surface and soluble IL7R expression by an autoimmune risk allele. Nature Communications 10.
  • Andrews and Hemberg (2019) Andrews, T. S. and M. Hemberg (2019). M3Drop: dropout-based feature selection for scRNASeq. Bioinformatics 35, 2865–2867.
  • Banerjee et al. (2017) Banerjee, T., G. Mukherjee, and P. Radchenko (2017). Feature screening in large scale cluster analysis. Journal of Multivariate Analysis 161, 191–212.
  • Barndorff-Nielsen (1965) Barndorff-Nielsen, O. E. (1965). Identifiability of mixtures of exponential families. Journal of Mathematical Analysis and Applications 12, 115–121.
  • Benjamini and Hochberg (1995a) Benjamini, Y. and Y. Hochberg (1995a). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 57(1), 289–300.
  • Benjamini and Hochberg (1995b) Benjamini, Y. and Y. Hochberg (1995b). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 57(1), 289–300.
  • Butler et al. (2018) Butler, A., P. J. Hoffman, P. Smibert, E. Papalexi, and R. Satija (2018). Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature Biotechnology 36, 411–420.
  • Cai et al. (2019) Cai, T. T., J. Ma, and L. Zhang (2019). CHIME: Clustering of high-dimensional Gaussian mixtures with EM algorithm and its optimality. The Annals of Statistics.
  • Caliński and Harabasz (1974) Caliński, T. and J. Harabasz (1974). A dendrite method for cluster analysis. Communications in Statistics-theory and Methods 3, 1–27.
  • Chan and Hall (2010) Chan, Y.-b. and P. Hall (2010). Using evidence of mixed populations to select variables for clustering very high-dimensional data. Journal of the American Statistical Association 105(490), 798–809.
  • Chen (1995) Chen, J. (1995). Optimal rate of convergence for finite mixture models. The Annals of Statistics, 221–233.
  • Chen et al. (2018) Chen, W., Y. Li, J. Easton, D. Finkelstein, G. Wu, and X. Chen (2018). UMI-count modeling and differential expression analysis for single-cell RNA sequencing. Genome Biology 19.
  • Chernoff and Lander (1995) Chernoff, H. and E. Lander (1995). Asymptotic distribution of the likelihood ratio test that a mixture of two binomials is a single binomial. Journal of Statistical Planning and Inference 43(1-2), 19–40.
  • Fan and Lv (2008) Fan, J. and J. Lv (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70(5), 849–911.
  • Fop and Murphy (2018) Fop, M. and T. B. Murphy (2018). Variable selection methods for model-based clustering. Statistics Surveys 12, 18–65.
  • Haghverdi et al. (2018) Haghverdi, L., A. T. L. Lun, M. D. Morgan, and J. C. Marioni (2018). Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nature Biotechnology 36, 421–427.
  • Hartigan (1985) Hartigan, J. A. (1985). A failure of likelihood asymptotics for normal mixtures. In Proceedings of the Berkeley conference in honor of Jerzy Neyman and Jack Kiefer, Volume 2, pp.  807–810.
  • Heming et al. (2021) Heming, M., X. Li, S. Räuber, A. K. Mausberg, A.-L. Börsch, M. Hartlehnert, A. Singhal, I.-N. Lu, M. Fleischer, F. Szepanowski, O. Witzke, T. Brenner, U. Dittmer, N. Yosef, C. Kleinschnitz, H. Wiendl, M. Stettner, and G. M. zu Hörste (2021). Neurological manifestations of COVID-19 feature T cell exhaustion and dedifferentiated monocytes in cerebrospinal fluid. Immunity 54, 164–175.e6.
  • Jin and Wang (2016) Jin, J. and W. Wang (2016). Influential features PCA for high dimensional clustering. The Annals of Statistics 44(6), 2323–2359.
  • Kiselev et al. (2019) Kiselev, V. Y., T. S. Andrews, and M. Hemberg (2019). Challenges in unsupervised clustering of single-cell RNA-seq data. Nature Reviews Genetics 20(5), 273–282.
  • Korsunsky et al. (2018) Korsunsky, I., J. Fan, K. Slowikowski, F. Zhang, K. Wei, Y. Baglaenko, M. B. Brenner, P.-R. Loh, and S. Raychaudhuri (2018). Fast, sensitive, and accurate integration of single cell data with Harmony. Nature Methods 16, 1289 – 1296.
  • Li and Chen (2010) Li, P. and J. Chen (2010). Testing the order of a finite mixture. Journal of the American Statistical Association 105(491), 1084–1092.
  • Li et al. (2009) Li, P., J. Chen, and P. Marriott (2009). Non-finite Fisher information and homogeneity: an EM approach. Biometrika 96(2), 411–426.
  • Li et al. (2012) Li, R., W. Zhong, and L. Zhu (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association 107(499), 1129–1139.
  • Liu et al. (2015) Liu, J., W. Zhong, and R. Li (2015). A selective overview of feature screening for ultrahigh-dimensional data. Science China Mathematics 58(10), 1–22.
  • Liu et al. (2022) Liu, T., Y. Lu, B. Zhu, and H. Zhao (2022). Clustering high-dimensional data via feature selection. Biometrics.
  • Löffler et al. (2019) Löffler, M., A. Y. Zhang, and H. H. Zhou (2019). Optimality of spectral clustering for Gaussian mixture model. ArXiv abs/1911.00538.
  • McInnes and Healy (2018) McInnes, L. and J. Healy (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. ArXiv abs/1802.03426.
  • Nguyen (2013) Nguyen, X. (2013). Convergence of latent mixing measures in finite and infinite mixture models. The Annals of Statistics 41(1), 370–400.
  • Niu et al. (2011) Niu, X., P. Li, and P. Zhang (2011). Testing homogeneity in a multivariate mixture model. Canadian Journal of Statistics 39(2), 218–238.
  • Rousseeuw (1987) Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, 53–65.
  • Stegle et al. (2015) Stegle, O., S. A. Teichmann, and J. C. Marioni (2015). Computational and analytical challenges in single-cell transcriptomics. Nature Reviews Genetics 16, 133–145.
  • Van der Vaart (2000) Van der Vaart, A. W. (2000). Asymptotic statistics, Volume 3. Cambridge University Press.
  • Vandenberghe (2010) Vandenberghe, L. (2010). The CVXOPT linear and quadratic cone program solvers. Online: http://cvxopt.org/documentation/coneprog.pdf.
  • Vershynin (2018) Vershynin, R. (2018). High-dimensional probability: an introduction with applications in data science, Volume 47. Cambridge University Press.
  • Vovk and Wang (2020) Vovk, V. and R. Wang (2020). Combining p-values via averaging. Biometrika 107(4), 791–808.
  • Wainwright (2019) Wainwright, M. J. (2019). High-dimensional statistics: a non-asymptotic viewpoint, Volume 48. Cambridge University Press.
  • Witten and Tibshirani (2010) Witten, D. M. and R. Tibshirani (2010). A framework for feature selection in clustering. Journal of the American Statistical Association 105(490), 713–726.
  • Wong and Shen (1995) Wong, W. H. and X. Shen (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLEs. The Annals of Statistics, 339–362.
  • Yakowitz and Spragins (1968) Yakowitz, S. J. and J. D. Spragins (1968). On the identifiability of finite mixtures. Annals of Mathematical Statistics 39, 209–214.
  • Zhu et al. (2011) Zhu, L.-P., L. Li, R. Li, and L.-X. Zhu (2011). Model-free feature screening for ultrahigh-dimensional data. Journal of the American Statistical Association 106(496), 1464–1475.