
High-Dimensional Differentially-Private EM Algorithm: Methods and Near-Optimal Statistical Guarantees

Zhe Zhang and Linjun Zhang
Department of Statistics, Rutgers University
Abstract

In this paper, we develop a general framework for designing differentially private expectation-maximization (EM) algorithms in high-dimensional latent variable models, based on noisy iterative hard-thresholding. We derive the statistical guarantees of the proposed framework and apply it to three specific models: Gaussian mixture, mixture of regression, and regression with missing covariates. In each model, we establish the near-optimal rate of convergence under differential privacy constraints, and show that the proposed algorithm is minimax rate optimal up to logarithmic factors. The technical tools developed for the high-dimensional setting are then extended to the classic low-dimensional latent variable models, and we propose a near rate-optimal EM algorithm with differential privacy guarantees in this setting. Simulation studies and real data analysis are conducted to support our results.


Keywords: Differential privacy; High-dimensional data; EM algorithm; Optimal rate of convergence.

1 Introduction

In the era of big data, an unprecedented number of large data sets has become available for researchers and industries to retrieve important information. At the same time, these large data sets often include sensitive personal information, creating an urgent demand for privacy-preserving algorithms that protect individual information during data analysis. One widely adopted criterion for privacy-preserving algorithms is differential privacy (DP) (Dwork et al., 2006b, a). This notion has been widely developed and is used nowadays in Microsoft (Erlingsson et al., 2014), Google (Ding et al., 2017), Facebook (Kifer et al., 2020) and the U.S. Census Bureau (Abowd, 2016) to help protect individual information, including user habits, browsing history, social connections, and health records. The basic idea behind differential privacy is that the contribution of a single individual to the training data should remain hidden, so that, given the outcome, it is almost impossible to tell whether a particular individual is in the dataset.

The attractiveness of differential privacy can partly be attributed to the ease of building privacy-preserving algorithms. Many traditional algorithms and statistical methods have been extended to their private counterparts, including top-k selection (Bafna and Ullman, 2017; Steinke and Ullman, 2017), multiple testing (Dwork et al., 2018), decision trees (Jagannathan et al., 2009), and random forests (Rana et al., 2015). However, many existing works focus only on designing privacy-preserving algorithms and lack a sharp analysis of accuracy in terms of minimax optimality. At a high level, these privatized algorithms are designed by injecting random noise into the traditional algorithms. Such noise-injection procedures typically sacrifice statistical accuracy, and therefore it is essential to understand the best accuracy an algorithm can achieve while maintaining a given level of differential privacy. This motivates us to study the trade-off between privacy and statistical accuracy.

This paper is devoted to studying this trade-off in latent variable models by proposing a DP version of the expectation-maximization (EM) algorithm. The EM algorithm is a commonly used approach for dealing with latent variables and missing values Ranjan et al. (2016); Quost and Denoeux (2016); Kadir et al. (2014); Ding and Song (2016). The statistical properties, such as the local convergence and minimax optimality, of the standard EM algorithm have recently been studied in Balakrishnan et al. (2017); Wang et al. (2015); Yi and Caramanis (2015); Cai et al. (2019a); Zhao et al. (2020), while the development of DP EM algorithms, especially the theory of the optimal trade-off between privacy and accuracy, remains largely unexplored. In this paper, we propose novel DP EM algorithms in both the classic low-dimensional setting and the contemporary high-dimensional setting, where the parameter of interest is sparse and its dimension is much larger than the sample size. We demonstrate the superiority of the proposed algorithm by applying it to specific statistical models. Under these specific statistical models, the convergence rates of the proposed algorithm are shown to be minimax optimal under privacy constraints. The main contributions of this paper are summarized as follows.

  • We design a novel DP EM algorithm in the high-dimensional setting based on a noisy iterative hard-thresholding algorithm in the M-step. Such a step effectively enforces the sparsity of the attained estimator while maintaining differential privacy, and allows us to establish a sharp rate of convergence for the algorithm. To the best of our knowledge, this is the first DP EM algorithm in high dimensions with statistical guarantees.

  • We then apply the proposed DP EM algorithm to three common models in the high-dimensional setting: Gaussian mixture model, mixture of regression model, and regression with missing covariates. Under mild regularity conditions, we show that our algorithm obtains the minimax optimal rate of convergence with privacy constraints.

  • In the classical low-dimensional setting, a DP EM algorithm based on the Gaussian mechanism is designed. The technical tools developed for the high-dimensional setting are then used to establish the optimality in this general low-dimensional setting. We show that both theoretically and numerically, the proposed algorithm outperforms several existing private EM algorithms in this classical setting.

Related Work. The expectation-maximization (EM) algorithm, proposed in (Dempster et al., 1977), is a common approach for handling latent variable models, and it has a long history of study (Wu, 1983; McLachlan and Krishnan, 2007). Recently, a seminal work (Balakrishnan et al., 2017) developed a general framework for studying the statistical guarantees of EM algorithms in the classic low-dimensional setting. In subsequent works, the convergence rates of the EM algorithm under various latent variable models were studied, including the Gaussian mixture (Xu et al., 2016; Daskalakis et al., 2017; Yan et al., 2017; Kwon and Caramanis, 2020) and mixture of linear regression (Yi et al., 2014; Kwon et al., 2019, 2020). Another important direction is to design variants of EM algorithms in the high-dimensional regime. Such a goal is fulfilled through regularization (Cai et al., 2019a; Yi and Caramanis, 2015; Zhang et al., 2020) and truncation (Wang et al., 2015) in the M-step. Due to the increasing attention on data privacy, the design of private EM algorithms is in great demand but still largely lacking.

Differential privacy is arguably the most popular notion of privacy nowadays. Since its invention, its basic properties have been studied in (Dwork et al., 2010; Dwork and Roth, 2014; Dwork et al., 2017; Dwork and Feldman, 2018; Mirshani et al., 2019). The trade-off between statistical accuracy and privacy is one of the fundamental topics in differential privacy. In the low-dimensional setting, various works focus on this trade-off, including mean estimation (Dwork et al., 2015; Steinke and Ullman, 2016; Bun et al., 2018; Kamath et al., 2019; Cai et al., 2019b; Kamath et al., 2020b), confidence intervals for the Gaussian mean (Karwa and Vadhan, 2017) and the binomial mean (Awan and Slavković, 2020), linear regression (Wang, 2018; Cai et al., 2019b), generalized linear models Song et al. (2020); Cai et al. (2020); Song et al. (2021), principal component analysis (Dwork et al., 2014; Chaudhuri et al., 2013), convex empirical risk minimization (Bassily et al., 2014), and robust M-estimators (Avella-Medina, 2020; Avella-Medina et al., 2021).

However, in the high-dimensional setting, where the dimension is much larger than the sample size, the trade-off between statistical accuracy and privacy is less studied. Most of the existing works focus on relatively standard statistical problems such as sparse mean estimation and regression. For example, (Steinke and Ullman, 2017) studies optimal bounds for private top-k selection problems, and (Cai et al., 2019b) studies near-optimal algorithms for sparse mean estimation. For high-dimensional sparse linear regression, (Talwar et al., 2015) obtains a DP LASSO algorithm which is near-optimal in terms of excess risk, and (Cai et al., 2019b) proposes another DP LASSO algorithm with the optimal rate of convergence in estimation. In (Cai et al., 2020), a near-minimax optimal DP algorithm for high-dimensional generalized linear models is introduced.

In the literature of differential privacy, most of the existing works on latent variable models focus on the low-dimensional case, while the study of the high-dimensional regime is largely lacking. In the classical low-dimensional setting, (Park et al., 2017) introduces a DP EM algorithm, but offers no statistical guarantees or accuracy analysis. (Nissim et al., 2007) provides a result for the low-dimensional Gaussian mixture model based on the sample-aggregate framework and reaches an O(\sqrt{d^{3}/{n}}\cdot{\log(1/\delta)}/{\epsilon}) convergence rate in estimation for the mixture of spherical Gaussian distributions. (Kamath et al., 2020a) considers more general Gaussian mixture models and studies the total variation distance. (Wang et al., 2020) studies the DP EM algorithm in the classical low-dimensional setting and obtains an estimation error of order O(\sqrt{d^{2}/{n}}\cdot{\log(1/\delta)}/{\epsilon}). In the current paper, we show that this rate can be significantly improved to O(\sqrt{d/n}+{d\cdot\sqrt{\log(1/\delta)}}/{n\epsilon}), attained by our proposed algorithm in the classic low-dimensional setting.

Organization of this paper. This paper is organized as follows. In Section 2, we introduce the problem formulation as well as some preliminaries on the EM algorithm and differential privacy. In Section 3, we present the main results of this paper and establish the convergence rate of the proposed EM algorithm in the high-dimensional setting. In Section 4, we apply the results obtained in Section 3 to three specific models: Gaussian mixture model, mixture of regression, and regression with missing covariates. We present the estimation error bounds for these three models and show their optimality. In Section 5, we consider the DP EM algorithm in the classic low-dimensional setting. In Section 6, simulation studies of the proposed algorithm are conducted to support our theory. Section 7 summarizes the paper and discusses some possible future work. In Appendix A, we provide some supplementary materials. We prove the main results in Appendix B, and the proofs of other results and technical lemmas are given in Appendix C.

Notations. Throughout this paper, let \bm{v}=(v_{1},v_{2},\ldots,v_{d})^{\top}\in\mathbb{R}^{d} be a vector. \mathcal{S} denotes a set of indices and \bm{v}_{\mathcal{S}} denotes the restriction of the vector \bm{v} to the set \mathcal{S}. Also, \|\bm{v}\|_{q} denotes the \ell_{q} norm for 1\leq q\leq\infty, and \|\bm{v}\|_{0} denotes the number of non-zero coordinates of \bm{v}, which we also call the sparsity level of \bm{v}. We use \odot to denote the Hadamard product. Generally, we use n to denote the number of samples, d the dimension of a vector, and s the sparsity level of a vector. We also define the truncation function \Pi_{T}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} as the projection onto the \ell_{\infty} ball of radius T centered at the origin in \mathbb{R}^{d}. Moreover, we use \nabla Q(\cdot;\cdot) to denote the gradient of the function Q(\cdot;\cdot); unless specified otherwise, this gradient is taken with respect to the first argument. For two sequences \{a_{n}\} and \{b_{n}\}, we write a_{n}=o(b_{n}) if a_{n}/b_{n}\rightarrow 0. We write a_{n}=O(b_{n}) if there exists a constant c such that a_{n}\leq cb_{n}, and a_{n}=\Omega(b_{n}) if there exists a constant c^{\prime} such that a_{n}\geq c^{\prime}b_{n}. We also write a_{n}\asymp b_{n} if a_{n}=O(b_{n}) and a_{n}=\Omega(b_{n}). Throughout, c_{0},c_{1},m_{0},m_{1},C,C^{\prime},K,K^{\prime},\ldots denote universal constants whose specific values may vary from place to place.
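As a small illustration of the notation (our own sketch in Python/NumPy, not part of the paper's algorithms), the truncation map \Pi_{T} can be realized as coordinate-wise clipping.

import numpy as np

def project_linf_ball(v, T):
    # Pi_T(v): coordinate-wise projection onto the l_infinity ball of radius T.
    return np.clip(np.asarray(v, dtype=float), -T, T)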

2 Problem Formulation

In this section, we present some preliminaries that are important for the discussions in the rest of the paper. We are going to introduce the EM algorithm in Section 2.1, and the formal definition and some critical properties of differential privacy in Section 2.2.

2.1 The EM algorithm

The EM algorithm is a widely used algorithm to compute the maximum likelihood estimator when there are latent or unobserved variables in the model. We first introduce the standard EM algorithm. Let 𝒀\bm{Y} and 𝒁\bm{Z} be random variables taking values in the sample spaces 𝒴\mathcal{Y} and 𝒵\mathcal{Z}. For each pair of data (𝒀,𝒁)(\bm{Y},\bm{Z}), we assume that only 𝒀\bm{Y} is observed, while 𝒁\bm{Z} remains unobserved. Suppose that the pair (𝒀,𝒁)(\bm{Y},\bm{Z}) has a joint density function f𝜷(𝒚,𝒛)f_{\bm{\beta}^{*}}(\bm{y},\bm{z}), where 𝜷\bm{\beta}^{*} is the true parameter that we would like to estimate. Let h𝜷(𝒚)h_{\bm{\beta}^{*}}(\bm{y}) be the marginal density function of the observed variable 𝒀\bm{Y}. Then, we can write h𝜷(𝒚)h_{\bm{\beta}^{*}}(\bm{y}) by integrating out 𝒛\bm{z}

h𝜷(𝒚)=𝒵f𝜷(𝒚,𝒛)𝑑𝒛.h_{\bm{\beta}^{*}}(\bm{y})=\int_{\mathcal{Z}}f_{\bm{\beta}^{*}}(\bm{y},\bm{z})d\bm{z}. (1)

The goal of the EM algorithm is to obtain an estimator of 𝜷\bm{\beta}^{*} through maximizing the likelihood function. Specifically, suppose we have nn i.i.d.i.i.d. observations of 𝒀\bm{Y}: 𝒚1,𝒚2,𝒚nh𝜷(𝒚)\bm{y}_{1},\bm{y}_{2},...\bm{y}_{n}\sim h_{{\bm{\beta}}^{*}}(\bm{y}), we aim to use EM algorithm to maximize the log-likelihood ln(𝜷)=i=1nlogh𝜷(𝒚i)l_{n}(\bm{\beta})=\sum_{i=1}^{n}\log h_{\bm{\beta}}(\bm{y}_{i}), and get an estimator of 𝜷{\bm{\beta}}^{*}. In many latent variable models, it is generally difficult to evaluate the log-likelihood ln(𝜷)l_{n}(\bm{\beta}) directly, but relatively easy to compute the log-likelihood for the joint distribution f𝜷(𝒚,𝒛)f_{\bm{\beta}}(\bm{y},\bm{z}). Such situations are in need of EM algorithms. Specifically, for a given parameter 𝜷{\bm{\beta}}, let k𝜷(𝒛|𝒚)k_{\bm{\beta}}(\bm{z}|\bm{y}) be the conditional distribution of 𝒁\bm{Z} given the observed variable 𝒀\bm{Y}, that is, k𝜷(𝒛|𝒚)=f𝜷(𝒚,𝒛)h𝜷(𝒚).k_{\bm{\beta}}(\bm{z}|\bm{y})=\frac{f_{\bm{\beta}}(\bm{y},\bm{z})}{h_{\bm{\beta}}(\bm{y})}.

The EM algorithm uses an iterative approach motivated by Jensen’s inequality. Suppose that in the tt-th iteration, we have obtained 𝜷t{\bm{\beta}}^{t} and would like to update it to 𝜷t+1{\bm{\beta}}^{t+1} with a larger log-likelihood. The log-likelihood evaluated at 𝜷t+1\bm{\beta}^{t+1} can always be lower bounded, as shown in the following expression.

1nln(𝜷t+1)\displaystyle\frac{1}{n}l_{n}(\bm{\beta}^{t+1}) =1ni=1nlogh𝜷t+1(𝒚i)\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\log h_{\bm{\beta}^{t+1}}(\bm{y}_{i}) (2)
1ni=1n𝒵k𝜷t(𝒛i|𝒚i)logf𝜷t+1(𝒚i,𝒛i)𝑑𝒛iQn(𝜷t+1;𝜷t)1ni=1n𝒵k𝜷t(𝒛i|𝒚i)logk𝜷t(𝒛i|𝒚i)𝑑𝒛i,\displaystyle\geq\underbrace{\frac{1}{n}\sum_{i=1}^{n}\int_{\mathcal{Z}}k_{{\bm{\beta}}^{t}}(\bm{z}_{i}|\bm{y}_{i})\log f_{\bm{\beta}^{t+1}}(\bm{y}_{i},\bm{z}_{i})d\bm{z}_{i}}_{Q_{n}({\bm{\beta}^{t+1}};{{\bm{\beta}}^{t}})}-\frac{1}{n}\sum_{i=1}^{n}\int_{\mathcal{Z}}k_{{\bm{\beta}}^{t}}(\bm{z}_{i}|\bm{y}_{i})\log k_{{\bm{\beta}}^{t}}(\bm{z}_{i}|\bm{y}_{i})d\bm{z}_{i},

with equality holding when {\bm{\beta}}^{t+1}={\bm{\beta}}^{t}. The basic idea of the EM algorithm is to successively maximize the lower bound in (2) with respect to {\bm{\beta}}^{t+1}. In the E-step, we evaluate the lower bound in (2) at the current parameter {\bm{\beta}}^{t}. Since the second term in (2) depends only on \bm{\beta}^{t}, which is fixed given the current {\bm{\beta}}^{t}, we only need to evaluate Q_{n} in the E-step. Then, in the M-step, we compute a new parameter {\bm{\beta}}^{t+1} that moves in the direction maximizing Q_{n}, so the lower bound in (2) increases when we update {\bm{\beta}}^{t} to {\bm{\beta}}^{t+1}. We use the new parameter {\bm{\beta}}^{t+1} in the (t+1)-th iteration and repeat the E-step and M-step until convergence.

In the t-th iteration, the M-step in the standard EM algorithm maximizes Q_{n}(\cdot;\bm{\beta}^{t}) (Dempster et al., 1977), that is, \bm{\beta}^{t+1}=\text{argmax}_{\bm{\beta}}Q_{n}(\bm{\beta};\bm{\beta}^{t}). However, it is sometimes computationally expensive to compute the maximizer directly. As an alternative, the gradient EM algorithm (Balakrishnan et al., 2017) takes one gradient-ascent step \bm{\beta}^{t+1}=\bm{\beta}^{t}+\eta\cdot\nabla Q_{n}(\bm{\beta}^{t};\bm{\beta}^{t}) in the M-step. When Q_{n}(\bm{\beta};\bm{\beta}^{\prime}) is strongly concave with respect to {\bm{\beta}}, this gradient EM approach is shown to attain the same statistical guarantee as the standard EM approach (Balakrishnan et al., 2017). In the high-dimensional setting, where the data dimension is much larger than the sample size and the true parameter is sparse, several variants of the EM algorithm have been developed. For example, in the M-step, the maximization approach is generalized to regularized maximization (Yi and Caramanis, 2015; Cai et al., 2019a; Zhang et al., 2020) and the gradient approach is generalized to the truncated gradient (Wang et al., 2015).
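To make the gradient EM update concrete, the following sketch (illustrative Python; grad_Qn is a hypothetical, model-specific routine returning \nabla Q_{n}(\bm{\beta};\bm{\beta}), not something defined in the paper) iterates the non-private update \bm{\beta}^{t+1}=\bm{\beta}^{t}+\eta\cdot\nabla Q_{n}(\bm{\beta}^{t};\bm{\beta}^{t}).

import numpy as np

def gradient_em(grad_Qn, beta0, eta, n_iter):
    # Generic (non-private) gradient EM: grad_Qn(beta) must return the sample
    # gradient of Q_n evaluated at (beta; beta), i.e. the E-step weights and
    # the M-step gradient folded into one call.
    beta = np.array(beta0, dtype=float)
    for _ in range(n_iter):
        beta = beta + eta * grad_Qn(beta)  # one gradient-ascent M-step
    return beta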

2.2 Some basic properties of differential privacy

In this section, we introduce the concepts and properties of differential privacy. These properties will play an important role in the design of the DP EM algorithm. First, the formal definition of differential privacy is given below.

Definition 2.1 (Differential Privacy (Dwork et al., 2006b)).

Let 𝒳\mathcal{X} be the sample space for an individual data, a randomized algorithm M:𝒳nM:\mathcal{X}^{n}\rightarrow\mathbb{R} is (ϵ,δ)(\epsilon,\delta)-DP if and only if for every pair of adjacent data sets 𝐗,𝐗𝒳n\bm{X},\bm{X}^{\prime}\in\mathcal{X}^{n} and for any SS\subseteq\mathbb{R}, the inequality below holds:

(M(𝑿)S)eε(M(𝑿)S)+δ,\displaystyle\mathbb{P}\left(M(\bm{X})\in S\right)\leq e^{\varepsilon}\cdot\mathbb{P}\left(M(\bm{X}^{\prime})\in S\right)+\delta,

where we say that two data sets 𝐗={𝐱i}i=1n\bm{X}=\{\bm{x}_{i}\}_{i=1}^{n} and 𝐗={𝐱i}i=1n\bm{X}^{\prime}=\{{{\bm{x}}_{i}^{\prime}}\}_{i=1}^{n} are adjacent if and only if they differ by one individual datum.

According to the definition, the two parameters \epsilon,\delta control the privacy level: with smaller \epsilon and \delta, the privacy constraint becomes more stringent. Intuitively, this definition says that for a DP algorithm M, an adversary cannot distinguish whether the original dataset is \bm{X} or \bm{X}^{\prime} when \bm{X} and \bm{X}^{\prime} are adjacent, implying that the information of each individual is protected after releasing M.

We then list several useful facts for designing DP algorithms. To create a DP algorithm, arguably the most common strategy is to add random noise to the output. Intuitively, the scale of the noise cannot be too large, otherwise the accuracy of the output is sacrificed. This scale is characterized by the sensitivity of the algorithm.

Definition 2.2.

For any algorithm f:𝒳ndf:\mathcal{X}^{n}\rightarrow{\mathbb{R}}^{d} and two adjacent data sets 𝐗\bm{X} and 𝐗\bm{X}^{\prime}, the p\ell_{p}-sensitivity of ff is defined as: Δp(f)=sup𝐗,𝐗𝒳n adjacentf(𝐗)f(𝐗)p.\Delta_{p}(f)=\sup_{\bm{X},\bm{X}^{\prime}\in\mathcal{X}^{n}\text{ adjacent}}\|f(\bm{X})-f(\bm{X}^{\prime})\|_{p}.
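As a simple worked example (ours, under the assumption that every datum lies in the \ell_{\infty} ball of radius T): replacing one datum changes each coordinate of the sample mean by at most 2T/n, so the \ell_{1}- and \ell_{2}-sensitivities of the mean are at most 2Td/n and 2T\sqrt{d}/n, respectively. The snippet below simply records these bounds.

import numpy as np

def bounded_mean_sensitivity(n, d, T):
    # Worst-case sensitivities of the coordinate-wise mean of n points in [-T, T]^d:
    # changing one point moves each coordinate of the mean by at most 2T/n.
    delta_1 = 2.0 * T * d / n            # l_1 sensitivity bound
    delta_2 = 2.0 * T * np.sqrt(d) / n   # l_2 sensitivity bound
    return delta_1, delta_2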

For algorithms with finite \ell_{1}-sensitivity or \ell_{2}-sensitivity, we can achieve differential privacy by adding Laplace noise or Gaussian noise, respectively.

Proposition 2.3 (The Laplace Mechanism (Dwork et al., 2006b)).

Let f:\mathcal{X}^{n}\to\mathbb{R}^{d} be a deterministic algorithm with \Delta_{1}(f)<\infty. Let \bm{w}\in\mathbb{R}^{d} have coordinates w_{1},w_{2},\cdots,w_{d} drawn i.i.d. from Laplace(\Delta_{1}(f)/\epsilon). Then f(\bm{X})+\bm{w} is (\epsilon,0)-DP.

Proposition 2.4 (The Gaussian Mechanism (Dwork et al., 2006b)).

Let f:\mathcal{X}^{n}\to\mathbb{R}^{d} be a deterministic algorithm with \Delta_{2}(f)<\infty. Let \bm{w}=(w_{1},w_{2},\cdots,w_{d}) have coordinates drawn i.i.d. from N(0,2(\Delta_{2}(f)/\epsilon)^{2}\log(1.25/\delta)). Then f(\bm{X})+\bm{w} is (\epsilon,\delta)-DP.
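For concreteness, both mechanisms amount to a few lines of code. The sketch below (illustrative Python; delta1 and delta2 are assumed to be known upper bounds on \Delta_{1}(f) and \Delta_{2}(f)) follows the noise scales in Propositions 2.3 and 2.4.

import numpy as np

def laplace_mechanism(value, delta1, epsilon):
    # Proposition 2.3: add i.i.d. Laplace(Delta_1(f)/epsilon) noise to each coordinate.
    value = np.asarray(value, dtype=float)
    return value + np.random.laplace(scale=delta1 / epsilon, size=value.shape)

def gaussian_mechanism(value, delta2, epsilon, delta):
    # Proposition 2.4: add N(0, 2 * (Delta_2(f)/epsilon)^2 * log(1.25/delta)) noise.
    value = np.asarray(value, dtype=float)
    sigma = (delta2 / epsilon) * np.sqrt(2.0 * np.log(1.25 / delta))
    return value + np.random.normal(scale=sigma, size=value.shape)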

These two mechanisms are computationally efficient, and are typically used to build more complicated algorithms. In the following, we introduce the post-processing and composition properties of differential privacy, which enable us to design complicated DP algorithms by combining simpler ones.

Proposition 2.5 (Post-processing Property (Dwork et al., 2006b)).

Let MM be an (ϵ,δ)(\epsilon,\delta)-DP algorithm and gg be an arbitrary function which takes M(𝐗)M(\bm{X}) as input, then g(M(𝐗))g(M(\bm{X})) is also (ϵ,δ)(\epsilon,\delta)-DP.

Proposition 2.6 (Composition property (Dwork et al., 2006b)).

For i=1,2, let M_{i} be an (\varepsilon_{i},\delta_{i})-DP algorithm. Then (M_{1},M_{2}) is an (\epsilon_{1}+\epsilon_{2},\delta_{1}+\delta_{2})-DP algorithm.

In the following section, we will see that the above two properties are particularly useful in the design of the DP EM algorithm.

3 High-dimensional EM Algorithm

In this section, we develop a novel DP EM algorithm for the (sparse) high-dimensional latent variable models. We are going to first introduce the detailed description of the proposed algorithm in Section 3.1, and then present its theoretical properties in Section 3.2. We will further apply our general method to three specific models in the next section.

3.1 Methodology

Suppose we have i.i.di.i.d data sampled from the latent variable model (1) and aim to use the EM algorithm to find the maximum likelihood estimator in a DP manner. Since the EM algorithm is an iterative approach where the tt-th iteration takes as input the 𝜷t1{\bm{\beta}}^{t-1} from the M-step in the last iteration, it suffices to make each 𝜷t{\bm{\beta}}^{t} differentially private to ensure the privacy guarantee of the final output.

Our algorithm relies on two key designs in the M-step. First, we use the gradient EM approach, and in the gradient update stage, we introduce a truncation step on the gradient to control the sensitivity of the gradient update, and thus we can appropriately determine the scale of noise we need to achieve differential privacy. Secondly, we propose to apply the noisy hard-thresholding algorithm (NoisyHT) (Dwork et al., 2018) to pursue sparsity and achieve privacy at the same time. The NoisyHT algorithm is defined as follows,

1: Input: vector-valued function \bm{v}=\bm{v}(\bm{X})\in\mathbb{R}^{d} with data \bm{X}, sparsity s, privacy parameters \varepsilon,\delta, sensitivity \lambda.
2: Initialization: S=\emptyset.
3: For i in 1 to s:
4:  Generate \bm{w}_{i}\in\mathbb{R}^{d} with w_{i1},w_{i2},\cdots,w_{id}\stackrel{\text{i.i.d.}}{\sim}\text{Laplace}\left(\lambda\cdot\frac{2\sqrt{3s\log(1/\delta)}}{\varepsilon}\right)
5:  Append j^{*}=\text{argmax}_{j\in[d]\setminus S}(|v_{j}|+w_{ij}) to S
6: End For
7: Generate \tilde{\bm{w}} with \tilde{w}_{1},\cdots,\tilde{w}_{d}\stackrel{\text{i.i.d.}}{\sim}\text{Laplace}\left(\lambda\cdot\frac{2\sqrt{3s\log(1/\delta)}}{\varepsilon}\right)
8: Output: P_{S}(\bm{v}+\tilde{\bm{w}})
Algorithm 1 Noisy Hard Thresholding (NoisyHT(𝒗,s,λ,ϵ,δ\bm{v},s,\lambda,\epsilon,\delta)) (Dwork et al., 2018)

In the last step, PS(𝒖)P_{S}(\bm{u}) denotes the operator that makes 𝒖Sc=0\bm{u}_{S^{c}}=0 while preserving 𝒖S\bm{u}_{S}. A great advantage of this algorithm is that when the vector 𝒗=v(𝑿)\bm{v}=v(\bm{X}) has bounded \ell_{\infty} sensitivity λ\lambda, the algorithm is guaranteed to be DP, as we can see in the lemma below.
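A direct transcription of Algorithm 1 into code reads as follows (an illustrative Python sketch of the pseudocode above, not the authors' implementation); the Laplace noise scale \lambda\cdot 2\sqrt{3s\log(1/\delta)}/\varepsilon and the peeling loop follow the listed steps.

import numpy as np

def noisy_hard_threshold(v, s, lam, epsilon, delta):
    # Sketch of NoisyHT(v, s, lam, epsilon, delta) from Algorithm 1.
    v = np.asarray(v, dtype=float)
    d = v.shape[0]
    scale = lam * 2.0 * np.sqrt(3.0 * s * np.log(1.0 / delta)) / epsilon
    selected = []
    for _ in range(s):
        noisy_scores = np.abs(v) + np.random.laplace(scale=scale, size=d)
        noisy_scores[selected] = -np.inf           # exclude already-chosen indices
        selected.append(int(np.argmax(noisy_scores)))
    w_tilde = np.random.laplace(scale=scale, size=d)
    out = np.zeros(d)
    out[selected] = (v + w_tilde)[selected]        # P_S(v + w_tilde)
    return out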

Lemma 3.1 ((Dwork et al., 2018)).

If for every pair of adjacent data sets \bm{X},\bm{X}^{\prime}\in\mathcal{X}^{n} we have ||v(\bm{X})-v(\bm{X}^{\prime})||_{\infty}<\lambda, then the noisy hard-thresholding algorithm is (\epsilon,\delta)-DP.

Another important property of the NoisyHT algorithm is its accuracy. Specifically, for the coordinates that are not chosen by the NoisyHT algorithm, their \ell_{2} norm is upper bounded by that of any equally sized subset of the chosen coordinates, up to an error term. The formal statement is given below, with the proof in Appendix C.1.

Lemma 3.2.

Let SS be the set chosen by the NoisyHT algorithm, 𝐯\bm{v} be the input vector and {𝐰i}i[s]\{\bm{w}_{i}\}_{i\in[s]} be defined as in the NoisyHT algorithm. For every set R1SR_{1}\subset S and R2ScR_{2}\subset S^{c} such that |R1|=|R2||R_{1}|=|R_{2}| and for every c>0c>0, we have:

𝒗R222(1+c)𝒗R122+4(1+1/c)i[s]𝒘i2.||\bm{v}_{R_{2}}||_{2}^{2}\leq(1+c)||\bm{v}_{R_{1}}||_{2}^{2}+4\cdot(1+1/c)\sum_{i\in[s]}||\bm{w}_{i}||_{\infty}^{2}.

After introducing the NoisyHT algorithm, we now proceed to the development of the DP EM algorithm. In each M-step, we update the estimator \hat{\bm{\beta}} via the NoisyHT algorithm after a truncation step. Such a modified M-step guarantees that the output {\bm{\beta}}^{t} is sparse and differentially private in each iteration. We also note that we use sample splitting in the algorithm, which makes {\bm{\beta}}^{t} independent of the batch of samples used in the t-th iteration and helps control the sensitivity of the gradient. The algorithm is summarized below:

1: Input: privacy parameters (\epsilon,\delta), step size \eta, truncation level T, maximum number of iterations N_{0}, sparsity parameter \hat{s}.
2: Initialization: \bm{\beta}^{0} with ||\bm{\beta}^{0}||_{0}\leq\hat{s}.
3: For t=0,1,2,\ldots,N_{0}-1:
4:  Compute \bm{\beta}^{t+0.5}=\bm{\beta}^{t}+\eta\cdot f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})).
5:  Let \bm{\beta}^{t+1}=\text{NoisyHT}(\bm{\beta}^{t+0.5},\hat{s},2\eta\cdot N_{0}\cdot T/n,\epsilon,\delta).
6: End For
7: Output: \hat{{\bm{\beta}}}={\bm{\beta}}^{N_{0}}
Algorithm 2 High-Dimensional DP EM algorithm

In the above algorithm, the truncation operator fT(Qn(𝜷;𝜷))f_{T}(\nabla Q_{n}(\bm{\beta};\bm{\beta})) with truncation level TT is defined as fT(Qn(𝜷;𝜷))=1ni=1nhT(qi(𝜷;𝜷))f_{T}(\nabla Q_{n}(\bm{\beta};\bm{\beta}))=\frac{1}{n}\sum_{i=1}^{n}h_{T}(\nabla q_{i}(\bm{\beta};\bm{\beta})), where qi(𝜷;𝜷)=𝒵k𝜷(𝒛i|𝒚i)logf𝜷(𝒚i,𝒛i)𝑑𝒛iq_{i}(\bm{\beta};\bm{\beta})=\int_{\mathcal{Z}}k_{{\bm{\beta}}}(\bm{z}_{i}|\bm{y}_{i})\log f_{\bm{\beta}}(\bm{y}_{i},\bm{z}_{i})d\bm{z}_{i}, and hTh_{T} denotes some generic truncation function with \ell_{\infty} norm upper bounded by TT. Later in the application to different specific models, we will specify the form of hTh_{T} for each model. The next lemma shows Algorithm 2 is (ϵ,δ)(\epsilon,\delta)-DP.

Lemma 3.3.

The output of high-dimensional DP EM algorithm (Algorithm 2) is (ϵ,δ)(\epsilon,\delta)-DP.

The proof of Lemma 3.3 is given in Appendix C.2. In theory, we take the truncation level T and the number of iterations N_{0} to be of order \sqrt{\log n} and \log n, respectively, as we will show in the next section. The sparsity parameter \hat{s} should be chosen of the same order as the true sparsity level s^{*}. Since s^{*} is unknown in practice, \hat{s} can be chosen through cross-validation. Given the differential privacy guarantee, we analyze the utility of this algorithm in the next section.
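Putting the pieces together, one iteration of Algorithm 2 computes the truncated gradient on the current data batch, takes a gradient step, and passes the result through NoisyHT. The sketch below is our illustrative Python rendering; it reuses the noisy_hard_threshold routine sketched after Algorithm 1, and trunc_grad_Q is a hypothetical, model-specific function returning the truncated gradient f_{T}(\nabla Q(\bm{\beta};\bm{\beta})) on a batch.

import numpy as np

def dp_em_high_dim(trunc_grad_Q, data, beta0, eta, T, N0, s_hat, epsilon, delta):
    # Sketch of Algorithm 2 (high-dimensional DP EM with sample splitting).
    n = len(data)
    batches = np.array_split(np.arange(n), N0)   # one disjoint batch per iteration
    beta = np.array(beta0, dtype=float)
    for t in range(N0):
        batch = [data[i] for i in batches[t]]
        beta_half = beta + eta * trunc_grad_Q(batch, beta)
        # the l_infinity sensitivity of beta_half is at most 2 * eta * N0 * T / n
        beta = noisy_hard_threshold(beta_half, s_hat, 2.0 * eta * N0 * T / n,
                                    epsilon, delta)
    return beta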

3.2 Theoretical guarantees

In this section, we analyze the theoretical properties of the proposed high-dimensional DP EM algorithm. Before we lay out the main results, we first introduce four technical conditions. The first three conditions are standard in the literature of EM algorithms; for example, see (Balakrishnan et al., 2017; Wang et al., 2015; Zhu et al., 2017; Wang et al., 2020). The fourth condition is needed to control the sensitivity in the design of DP algorithms. We will verify these four conditions in the specific models in the next section.

Condition 3.4 (Lipschitz-Gradient(γ,)(\gamma,\mathcal{B})).

For the true parameter \bm{\beta}^{*} and any \bm{\beta}\in\mathcal{B}, where \mathcal{B} denotes a region in the parameter space, we have

Q(𝜷;𝜷)Q(𝜷;𝜷)2γ𝜷𝜷2.||\nabla Q(\bm{\beta};\bm{\beta}^{*})-\nabla Q(\bm{\beta};\bm{\beta})||_{2}\leq\gamma||\bm{\beta}-\bm{\beta}^{*}||_{2}.

This condition implies that if \bm{\beta} is close to the true parameter \bm{\beta}^{*}, then the gradients \nabla Q(\bm{\beta};\bm{\beta}^{*}) and \nabla Q(\bm{\beta};\bm{\beta}) are also close; that is, the gradient is stable with respect to the second argument.

Condition 3.5 (Concavity-Smoothness (μ,ν,)(\mu,\nu,\mathcal{B})).

For any 𝛃1,𝛃2\bm{\beta}_{1},\bm{\beta}_{2}\in\mathcal{B}, Q(;𝛃)Q(\cdot;\bm{\beta}^{*}) is μ\mu-smooth such that

Q(𝜷1;𝜷)Q(𝜷2;𝜷)+𝜷1𝜷2,Q(𝜷2;𝜷)μ/2𝜷2𝜷122,Q(\bm{\beta}_{1};\bm{\beta}^{*})\geq Q(\bm{\beta}_{2};\bm{\beta}^{*})+\langle\bm{\beta}_{1}-\bm{\beta}_{2},\nabla Q(\bm{\beta}_{2};\bm{\beta}^{*})\rangle-\mu/2\cdot||\bm{\beta}_{2}-\bm{\beta}_{1}||_{2}^{2},

and ν\nu-strongly concave such that

Q(𝜷1;𝜷)Q(𝜷2;𝜷)+𝜷1𝜷2,Q(𝜷2;𝜷)ν/2𝜷2𝜷122.Q(\bm{\beta}_{1};\bm{\beta}^{*})\leq Q(\bm{\beta}_{2};\bm{\beta}^{*})+\langle\bm{\beta}_{1}-\bm{\beta}_{2},\nabla Q(\bm{\beta}_{2};\bm{\beta}^{*})\rangle-\nu/2\cdot||\bm{\beta}_{2}-\bm{\beta}_{1}||_{2}^{2}.

The Concavity-Smoothness condition indicates that when the second argument of Q(\cdot;\cdot) is \bm{\beta}^{*}, the function Q(\cdot;\bm{\beta}^{*}) is a well-behaved concave function, in the sense that it can be upper and lower bounded by two quadratic functions. This condition ensures the geometric decay of the optimization error in the statistical analysis.

Condition 3.6 (Statistical-Error(α,τ,s,n,)(\alpha,\tau,s,n,\mathcal{B})).

For any fixed 𝛃\bm{\beta}\in\mathcal{B} and 𝛃0s||\bm{\beta}||_{0}\leq s, we have with probability at least 1τ1-\tau,

Q(𝜷;𝜷)Qn(𝜷;𝜷)α.||\nabla Q(\bm{\beta};\bm{\beta})-\nabla Q_{n}(\bm{\beta};\bm{\beta})||_{\infty}\leq\alpha.

In this condition, the statistical error is quantified by the \ell_{\infty} norm between the population quantity Q(𝜷;𝜷)\nabla Q(\bm{\beta};\bm{\beta}) and its sample version Qn(𝜷;𝜷)\nabla Q_{n}(\bm{\beta};\bm{\beta}). Such a bound is different from the 2\ell_{2} norm bound considered in the classic low-dimensional DP EM algorithms (Wang et al., 2020). In the high-dimensional setting, although for each index, the statistical error is small, the 2\ell_{2} norm can still be quite large. This fine-grained \ell_{\infty} bound enables us to iteratively quantify the statistical accuracy when using the NoisyHT in the M-step.

Condition 3.7 (Truncation-Error(ξ,ϕ,s,n,T,)(\xi,\phi,s,n,T,\mathcal{B})).

For any \bm{\beta}\in\mathcal{B} with \|\bm{\beta}\|_{0}\leq s, there exists a non-increasing function \phi such that, for the truncation level T, with probability 1-\phi(\xi),

Qn(𝜷;𝜷)fT(Qn(𝜷;𝜷))2ξ.||\nabla Q_{n}(\bm{\beta};\bm{\beta})-f_{T}(\nabla Q_{n}(\bm{\beta};\bm{\beta}))||_{2}\leq\xi.

The Truncation-Error condition quantifies the error caused by the truncation step in Algorithm 2. Intuitively, when T is large, the truncation error \xi can be made very small, but a large T also leads to larger sensitivity and hence more injected noise to ensure privacy. We therefore need to choose T carefully to strike a balance between statistical accuracy and the cost of privacy. Below we state the main result of this section, with the detailed proof given in Section B.1.

Theorem 3.8.

Suppose the true parameter 𝛃\bm{\beta}^{*} is sparse with 𝛃0s\|\bm{\beta}^{*}\|_{0}\leq s^{*}. We define ={𝛃:𝛃𝛃2R}\mathcal{B}=\{\bm{\beta}:||\bm{\beta}-\bm{\beta}^{*}||_{2}\leq R\} with R=L𝛃2R=L\cdot\|\bm{\beta}^{*}\|_{2} for some L(0,1)L\in(0,1). We assume the Concavity-Smoothness(μ,ν,)(\mu,\nu,\mathcal{B}) holds and the initialization 𝛃0\bm{\beta}^{0} satisfies 𝛃0𝛃2R/2||\bm{\beta}^{0}-\bm{\beta}^{*}||_{2}\leq R/2. We further assume that the Lipschitz-Gradient(γ,)(\gamma,\mathcal{B}) holds and define κ=12νγν+μ(0,1)\kappa=1-2\cdot\frac{\nu-\gamma}{\nu+\mu}\in(0,1). In Algorithm 2, we let the step size η=2/(μ+ν)\eta=2/(\mu+\nu), the number of iterations N0lognN_{0}\asymp\log n, the sparsity level s^c0max(16(1/κ1)2,4(1+L)2(1L)2)s\hat{s}\geq c_{0}\cdot\max(\frac{16}{(1/\kappa-1)^{2}},\frac{4\cdot(1+L)^{2}}{(1-L)^{2}})\cdot s^{*} where c0c_{0} is a constant greater than 1 and s^=O(s)\hat{s}=O(s^{*}). We assume the condition Truncation-Error(ξ,ϕ,s^,n/N0,T,\xi,\phi,\hat{s},n/N_{0},T,\mathcal{B}) holds with TlognT\asymp\sqrt{\log n} and ϕ(ξ)N0=o(1)\phi(\xi)\cdot N_{0}=o(1). Moreover, we assume the condition Statistical-Error(α,τ/N0,s^,n/N0,)(\alpha,\tau/{N_{0}},\hat{s},n/{N_{0}},\mathcal{B}) holds and assume that α=o(1)\alpha=o(1) and there exists a constant c1>0c_{1}>0, and (s^+c1/1Ls)ηαmin((1κ)2R,(1L)22(1+L)β2)(\sqrt{\hat{s}}+c_{1}/\sqrt{1-L}\cdot\sqrt{s^{*}})\cdot\eta\cdot\alpha\leq\min((1-\sqrt{\kappa})^{2}\cdot R,\frac{(1-L)^{2}}{2\cdot(1+L)}\cdot\|\beta^{*}\|_{2}), also slogdn(logn)32=o(1)\frac{s^{*}\log d}{n}\cdot(\log n)^{\frac{3}{2}}=o(1). Then there exist constants K,m0,m1K,m_{0},m_{1}, it holds that

𝜷t𝜷2\displaystyle\|\bm{\beta}^{t}-\bm{\beta}^{*}\|_{2} κt/2𝜷0𝜷2+(s^+c1/1Ls)η1κα+Kslogdlog(1/δ)log3/2nnϵ\displaystyle\leq\kappa^{t/2}\cdot||{\bm{\beta}}^{0}-\bm{\beta}^{*}||_{2}+\frac{(\sqrt{\hat{s}}+c_{1}/\sqrt{1-L}\cdot\sqrt{s^{*}})\cdot\eta}{1-\sqrt{\kappa}}\cdot\alpha+K\cdot\frac{s^{*}\cdot\log d\cdot\sqrt{\log(1/\delta)}\log^{3/2}n}{n\epsilon}
+ηξ/(1κ),\displaystyle+\eta\cdot\xi/(1-\sqrt{\kappa}), (3)

with probability 1τN0ϕ(ξ)m0slognexp(m1logd)1-\tau-N_{0}\cdot\phi(\xi)-m_{0}\cdot s^{*}\cdot\log n\cdot\exp(-m_{1}\log d). Specifically, for the output in Algorithm 2, it holds that

𝜷N0𝜷2\displaystyle\|\bm{\beta}^{N_{0}}-\bm{\beta}^{*}\|_{2} (s^+c1/1Ls)η1κα+Kslogdlog(1/δ)log3/2nnϵ+ηξ/(1κ),\displaystyle\leq\frac{(\sqrt{\hat{s}}+c_{1}/\sqrt{1-L}\cdot\sqrt{s^{*}})\cdot\eta}{1-\sqrt{\kappa}}\cdot\alpha+K\cdot\frac{s^{*}\cdot\log d\cdot\sqrt{\log(1/\delta)}\log^{3/2}n}{n\epsilon}+\eta\cdot\xi/(1-\sqrt{\kappa}),

with probability 1τN0ϕ(ξ)m0slognexp(m1logd)1-\tau-N_{0}\cdot\phi(\xi)-m_{0}\cdot s^{*}\cdot\log n\cdot\exp(-m_{1}\log d).

To interpret this result, let us discuss the four terms in (3). The first term, κt/2𝜷0𝜷2=O(κt/2𝜷2)\kappa^{t/2}\cdot||{\bm{\beta}}^{0}-\bm{\beta}^{*}||_{2}=O(\kappa^{t/2}\cdot\|{\bm{\beta}}^{*}\|_{2}), is the optimization error. With κ(0,1)\kappa\in(0,1), this term shrinks to zero at a geometric rate when the iteration number tt is sufficiently large. The second term is of order sα\sqrt{s^{*}}\cdot\alpha when s^\hat{s} is chosen as the same order of ss^{*}, this is the statistical error caused by finite samples. We will further show that in some specific models, α\alpha is of the order logdlogn/n\sqrt{\log d\cdot\log n/n} and makes the second term to be O(slogdlogn/n)O(\sqrt{s^{*}\log d\cdot\log n/n}). The third term Kslogdlog(1/δ)log3/2nnϵK\cdot\frac{s^{*}\log d\cdot\sqrt{\log(1/\delta)}\log^{3/2}n}{n\epsilon} can be seen as the cost of privacy, as this error is introduced by the additional requirement that the output needs to be (ϵ,δ)(\epsilon,\delta)-DP. This term becomes larger when the privacy constraint becomes more stringent (ϵ,δ\epsilon,\delta become smaller). Typically, in practice, we choose δ=O(1/na)\delta=O(1/n^{a}) for some a1a\geq 1 and ϵ\epsilon a small constant. This implies that when d,sd,s^{*} and nn satisfy slogd(logn)2nlog1/δϵ2=o(1)\frac{s^{*}\log d\cdot(\log n)^{2}}{n}\cdot\frac{\log{1/\delta}}{\epsilon^{2}}=o(1), the cost of privacy will be negligible comparing to the statistical error. In this case, we can gain privacy for free in terms of convergence rate. The fourth term is due to the truncation of the gradient. Under the Truncation-Error condition with an appropriately chosen truncation parameter, this term reaches a convergence rate dominated by the statistical error up to logarithm factors.

4 DP EM Algorithm in Specific Models

In this section, we apply the results developed in Section 3 to specific statistical models and establish concrete convergence rates. We discuss three models, the Gaussian mixture model, the mixture of regression model, and regression with missing covariates, in Sections 4.1, 4.2 and 4.3, respectively. Further, for each statistical model, we establish the minimax lower bound on the convergence rate and demonstrate that our algorithm attains a near minimax optimal rate of convergence.

4.1 Gaussian mixture model

In this subsection, we first apply the results in Section 3 to the Gaussian mixture model. By verifying the conditions in Theorem 3.8, we establish the convergence rate of the DP estimation in the Gaussian mixture model.

In a standard Gaussian mixture model, we assume:

𝒀=Z𝜷+𝒆,\bm{Y}=Z\cdot\bm{\beta}^{*}+\bm{e}, (4)

where 𝒀\bm{Y} is a dd-dimensional output and 𝒆N(𝟎,σ2𝑰d)\bm{e}\sim N(\bm{0},\sigma^{2}\bm{I}_{d}). In this model, 𝜷\bm{\beta}^{*} and 𝜷-{\bm{\beta}}^{*} are the dd-dimensional vectors representing the population means of each class, and ZZ is a class indicator variable with (Z=1)=1/2{\mathbb{P}}(Z=1)=1/2 and (Z=1)=1/2{\mathbb{P}}(Z=-1)=1/2. Note that ZZ is a hidden variable and independent of 𝒆\bm{e}. In the high dimensional setting, we assume 𝜷\bm{\beta}^{*} to be sparse.

Let 𝒚1,𝒚2𝒚n\bm{y}_{1},\bm{y}_{2}...\bm{y}_{n} be nn i.i.di.i.d samples from the Gaussian mixture model. Using the framework of the EM method introduced in Section 3, we need to calculate

Qn(𝜷;𝜷)=12ni=1nw𝜷(𝒚i)𝒚i𝜷22+[1w𝜷(𝒚i)]𝒚i+𝜷22,Q_{n}(\bm{\beta}^{\prime};\bm{\beta})=-\frac{1}{2n}\sum_{i=1}^{n}w_{\bm{\beta}}(\bm{y}_{i})||\bm{y}_{i}-\bm{\beta}^{\prime}||_{2}^{2}+[1-w_{\bm{\beta}}(\bm{y}_{i})]\cdot||\bm{y}_{i}+\bm{\beta}^{\prime}||_{2}^{2},

where

w𝜷(𝒚)=11+exp(𝜷,𝒚/σ2).w_{\bm{\beta}}(\bm{y})=\frac{1}{1+\exp(-\langle\bm{\beta},\bm{y}\rangle/{\sigma^{2}})}.

Then, for the MM-step in the tt-th iteration given 𝜷t\bm{\beta}^{t}, the update rule is given by

𝜷t+1=𝜷t+ηQn(𝜷t;𝜷t), where Qn(𝜷;𝜷)=1ni=1n[2w𝜷(𝒚i)1]𝒚i𝜷.\bm{\beta}^{t+1}=\bm{\beta}^{t}+\eta\cdot\nabla Q_{n}(\bm{\beta}^{t};\bm{\beta}^{t})\text{, where }\nabla Q_{n}(\bm{\beta};\bm{\beta})=\frac{1}{n}\sum_{i=1}^{n}[2\cdot w_{\bm{\beta}}(\bm{y}_{i})-1]\cdot\bm{y}_{i}-\bm{\beta}.

Given this expression, we now present the DP estimation in the high-dimensional Gaussian mixture model by applying Algorithm 2. The algorithm is presented in Algorithm 3.

1:Private parameters (ϵ,δ)(\epsilon,\delta), step size η\eta, truncation level TT, maximum number of iterations N0N_{0}, sparsity parameter s^\hat{s}.
2:Initialization: 𝜷0\bm{\beta}^{0} with 𝜷00s^||\bm{\beta}^{0}||_{0}\leq\hat{s}.
3:For t=0,1,2,N01t=0,1,2,...N_{0}-1:
4: Compute 𝜷t+0.5=𝜷t+ηN0/ni=1n/N0[(2w𝜷t(𝒚i)1)ΠT(𝒚i)𝜷t]\bm{\beta}^{t+0.5}=\bm{\beta}^{t}+\eta\cdot N_{0}/n\cdot\sum_{i=1}^{n/N_{0}}[(2w_{\bm{\beta}^{t}}(\bm{y}_{i})-1)\cdot\Pi_{T}(\bm{y}_{i})-\bm{\beta}^{t}].
5: Let 𝜷t+1=NoisyHT(𝜷t+0.5,s^,2ηTN0/n,ϵ,δ)\bm{\beta}^{t+1}=\text{NoisyHT}(\bm{\beta}^{t+0.5},\hat{s},2\eta\cdot T\cdot N_{0}/n,\epsilon,\delta).
6: End For
7: Output: \hat{\bm{\beta}}=\bm{\beta}^{N_{0}}
Algorithm 3 DP Algorithm for High-Dimensional Gaussian Mixture Model
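As an illustration of how step 4 of Algorithm 3 specializes the generic truncated gradient, the sketch below (illustrative Python; Y_batch is assumed to be an (n/N_{0})\times d array holding the current batch of observations) computes the weights w_{\bm{\beta}} and one truncated gradient step for the Gaussian mixture model.

import numpy as np

def gmm_weights(Y, beta, sigma):
    # w_beta(y) = 1 / (1 + exp(-<beta, y> / sigma^2))
    return 1.0 / (1.0 + np.exp(-(Y @ beta) / sigma**2))

def gmm_truncated_step(Y_batch, beta, eta, T, sigma):
    # beta^{t+0.5} = beta^t + eta * mean_i [ (2 w_beta(y_i) - 1) * Pi_T(y_i) - beta^t ]
    w = gmm_weights(Y_batch, beta, sigma)
    Y_clip = np.clip(Y_batch, -T, T)               # Pi_T applied coordinate-wise
    grad = np.mean((2.0 * w - 1.0)[:, None] * Y_clip - beta[None, :], axis=0)
    return beta + eta * grad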

To derive the convergence rate of \hat{\bm{\beta}}, we first need to verify Conditions 3.4-3.7. Conditions 3.4-3.6 are standard in the literature on EM algorithms (Balakrishnan et al., 2017; Wang et al., 2015). We adapt them to our setting and state the results together below.

Lemma 4.1.

Suppose the signal-to-noise ratio satisfies ||\bm{\beta}^{*}||_{2}/\sigma>\phi for a sufficiently large constant \phi. Then

  • There exists a constant C>0C>0 such that Condition 3.4, Lipschitz-Gradient (γ,)(\gamma,\mathcal{B}) and Condition 3.5, Concavity-Smoothness (μ,ν,)(\mu,\nu,\mathcal{B}) hold with the parameters

    γ=exp(Cϕ2),μ=ν=1,={𝜷:𝜷𝜷2R} with R=1/4𝜷2.\gamma=\exp(-C\cdot\phi^{2}),\mu=\nu=1,\mathcal{B}=\{{\bm{\beta}}:||\bm{\beta}-\bm{\beta}^{*}||_{2}\leq R\}\text{ with }R=1/4\cdot||\bm{\beta}^{*}||_{2}.
  • Condition 3.6, Statistical-Error(α,τ,s^,n,)(\alpha,\tau,\hat{s},n,\mathcal{B}) holds with a constant C1C_{1} and

    α=C1(𝜷+σ)logd+log(2/τ)n.\alpha=C_{1}\cdot(||\bm{\beta}^{*}||_{\infty}+\sigma)\cdot\sqrt{\frac{\log d+\log(2/\tau)}{n}}.
  • Condition 3.7, Truncation-Error(ξ,ϕ,s^,n/N0,T,)(\xi,\phi,\hat{s},n/N_{0},T,\mathcal{B}) holds with TlognT\asymp\sqrt{\log n} and with probability 1m0/logdlogn1-m_{0}/\log d\cdot\log n, there exists a constant C2C_{2}, such that

    Qn/N0(𝜷;𝜷)fT(Qn/N0(𝜷;𝜷))2C2slogdlognn.||\nabla Q_{n/N_{0}}(\bm{\beta};\bm{\beta})-f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta};\bm{\beta}))||_{2}\leq C_{2}\cdot\sqrt{\frac{s^{*}\cdot\log d\cdot\log n}{n}}.

The detailed proof of Lemma 4.1 is given in Appendix C.3. Given these verified conditions, the following theorem establishes the results for the DP estimation in the high-dimensional Gaussian mixture model.

Theorem 4.2.

We implement Algorithm 3 to the observations generated from the Gaussian mixture model (4). Let ,R,μ,ν,γ\mathcal{B},R,\mu,\nu,\gamma defined the same way as in Lemma 4.1. We assume 𝛃2/σ>ϕ||\bm{\beta}^{*}||_{2}/\sigma>\phi for a sufficiently large constant ϕ>0\phi>0. Let the initialization 𝛃0{\bm{\beta}}^{0} satisfy 𝛃0𝛃2R/2||\bm{\beta}^{0}-\bm{\beta}^{*}||_{2}\leq R/2 and κ=γ\kappa=\gamma. Also, set the sparsity level s^c0max(16(1/κ1)2,100/9)s\hat{s}\geq c_{0}\cdot\max(\frac{16}{(1/\kappa-1)^{2}},100/9)\cdot s^{*} with c0>1c_{0}>1 be a constant and s^=O(s)\hat{s}=O(s^{*}). The step size is chosen as η=1\eta=1. We choose the number of iterations N0lognN_{0}\asymp\log n, and let truncation level TlognT\asymp\sqrt{\log n}. We further assume that slogdn(logn)32=o(1)\frac{s^{*}\log d}{n}\cdot(\log n)^{\frac{3}{2}}=o(1). Then, the proposed Algorithm 3 is (ϵ,δ)(\epsilon,\delta)-DP. Also, we can show that there exist sufficient large constants C,C1C,C_{1}, it holds that

𝜷^𝜷2\displaystyle||\hat{\bm{\beta}}-\bm{\beta}^{*}||_{2} Cslogdlognn+C1slogdlog(1/δ)(logn)32nϵ.\displaystyle\leq C\cdot\sqrt{\frac{s^{*}\cdot\log d\cdot\log n}{n}}+C_{1}\cdot\frac{s^{*}\log d\cdot\sqrt{\log(1/{\delta})}{(\log n)}^{\frac{3}{2}}}{n\epsilon}. (5)

with probability 1m0slognexp(m1logd)m2/logdm3d1/21-m_{0}\cdot s^{*}\cdot\log n\cdot\exp(-m_{1}\log d)-m_{2}/\log d-m_{3}\cdot d^{-1/2}, where m0,m1,m2,m3m_{0},m_{1},m_{2},m_{3} are constants.

The proof of Theorem 4.2 is given in Section B.2. Similar to the interpretation of the results in (3), the first and second terms are, respectively, the statistical error and the cost of privacy.

In the following, we present the minimax lower bound for the estimation in the high-dimensional Gaussian mixture model with differential privacy constraints, indicating the convergence rate obtained above is near optimal.

Proposition 4.3.

Suppose 𝐘={𝐲1,𝐲2,,𝐲n}\bm{Y}=\{\bm{y}_{1},\bm{y}_{2},...,\bm{y}_{n}\} be the data set of nn samples observed from the Gaussian mixture model (4) and let MM be any algorithm such that Mϵ,δM\in\mathcal{M}_{\epsilon,\delta}, where ϵ,δ\mathcal{M}_{\epsilon,\delta} be the set of all (ϵ,δ)(\epsilon,\delta)-DP algorithms for the estimation of the true parameter 𝛃\bm{\beta}^{*}. Then there exists a constant cc, if s=o(d1ω)s^{*}=o(d^{1-\omega}) for some fixed ω>0\omega>0, 0<ϵ<10<\epsilon<1 and δ<n(1+ω)\delta<n^{-(1+\omega)} for some fixed ω>0\omega>0, we have

infMϵ,δsup𝜷d,𝜷0s𝔼M(𝒀)𝜷2c(slogdn+slogdnϵ).\inf_{M\in\mathcal{M}_{\epsilon,\delta}}\sup_{\bm{\beta}^{*}\in\mathbb{R}^{d},\|\bm{\beta}^{*}\|_{0}\leq s^{*}}\mathbb{E}\|{M(\bm{Y})}-\bm{\beta}^{*}\|_{2}\geq c\cdot(\sqrt{\frac{s^{*}\log d}{n}}+\frac{s^{*}\log d}{n\epsilon}).

The proof of Proposition 4.3 is given in Section B.3. By comparing the results in Theorem 4.2 and Proposition 4.3, our algorithm is shown to attain the minimax optimality up to logarithm factors in the high-dimensional Gaussian mixture models.

4.2 Mixture of regression model

We continue to demonstrate the proposed algorithm in the mixture of regression model, where we assume the following data generative process

Y=Z𝑿𝜷+𝒆,Y=Z\cdot\bm{X}^{\top}\bm{\beta}^{*}+\bm{e}, (6)

where Y\in\mathbb{R} is the response, Z is an indicator variable with {\mathbb{P}}(Z=1)={\mathbb{P}}(Z=-1)=1/2, \bm{X}\sim N(0,\bm{I}_{d}), e\sim N(0,\sigma^{2}), and \bm{\beta}^{*} is a d-dimensional coefficient vector, which we again require to be sparse in the high-dimensional setting. Note that Z, e and \bm{X} are mutually independent.

Let (𝒙1,y1),(𝒙2,y2)(𝒙n,yn)(\bm{x}_{1},y_{1}),(\bm{x}_{2},y_{2})...(\bm{x}_{n},y_{n}) be the nn i.i.d.i.i.d. observed samples from the mixture of regression model. Then, to use the EM algorithm, we need to compute

Qn(𝜷;𝜷)=12ni=1nw𝜷(𝒙i,yi)(yi𝒙i,𝜷)2+[1w𝜷(𝒙i,yi)](yi+𝒙i,𝜷)2,Q_{n}(\bm{\beta}^{\prime};\bm{\beta})=-\frac{1}{2n}\sum_{i=1}^{n}w_{\bm{\beta}}(\bm{x}_{i},y_{i})(y_{i}-\langle\bm{x}_{i},{\bm{\beta}}^{\prime}\rangle)^{2}+[1-w_{\bm{{\bm{\beta}}}}(\bm{x}_{i},y_{i})]\cdot(y_{i}+\langle\bm{x}_{i},\bm{\beta}^{\prime}\rangle)^{2},

where w𝜷(𝒙,y)=(1+exp(y𝜷,𝒙/σ2))1.w_{\bm{\beta}}(\bm{x},y)=({1+\exp(-y\cdot\langle\bm{\beta},\bm{x}\rangle/{\sigma^{2}})})^{-1}.

According to the gradient EM update rule, for the tt-th iteration 𝜷t\bm{\beta}^{t}, we update 𝜷^\hat{\bm{\beta}} by

𝜷t+1=𝜷t+ηQn(𝜷t;𝜷t) ,where Qn(𝜷;𝜷)=1ni=1n[2w𝜷(𝒙i,yi)yi𝒙i𝒙i𝒙i𝜷t].\bm{\beta}^{t+1}=\bm{\beta}^{t}+\eta\cdot\nabla Q_{n}(\bm{\beta}^{t};\bm{\beta}^{t})\text{ ,where }\nabla Q_{n}(\bm{\beta};\bm{\beta})=\frac{1}{n}\sum_{i=1}^{n}[2\cdot w_{\bm{\beta}}(\bm{x}_{i},y_{i})\cdot y_{i}\cdot\bm{x}_{i}-\bm{x}_{i}\cdot\bm{x}_{i}^{\top}\cdot\bm{\beta}^{t}].

Similar to the Gaussian Mixture model, to apply Algorithm 2, we need to specify the truncation operator fT(Qn(𝜷t;𝜷t))f_{T}(\nabla Q_{n}(\bm{\beta}^{t};\bm{\beta}^{t})). Rather than using truncation on the whole gradient, we perform the truncation on yiy_{i}, 𝒙i\bm{x}_{i} and 𝒙i𝜷\bm{x}_{i}^{\top}\bm{\beta} respectively, which leads to a more refined analysis and improved rate in the statistical analysis. Specifically, we define

fT(Qn(𝜷;𝜷))=1ni=1n[2w𝜷(𝒙i,yi)ΠT(yi)ΠT(𝒙i)ΠT(𝒙i)ΠT(𝒙i𝜷)].f_{T}(\nabla Q_{n}(\bm{\beta};\bm{\beta}))=\frac{1}{n}\sum_{i=1}^{n}[2w_{\bm{\beta}}(\bm{x}_{i},y_{i})\cdot\Pi_{T}(y_{i})\cdot\Pi_{T}(\bm{x}_{i})-\Pi_{T}(\bm{x}_{i})\cdot\Pi_{T}(\bm{x}_{i}^{\top}\bm{\beta})].
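A sketch of this truncated gradient (illustrative Python; X is an n\times d design matrix and y an n-vector, names of our choosing) is given below; note that y_{i}, \bm{x}_{i} and \bm{x}_{i}^{\top}\bm{\beta} are clipped separately before being combined.

import numpy as np

def mor_weights(X, y, beta, sigma):
    # w_beta(x, y) = 1 / (1 + exp(-y * <beta, x> / sigma^2))
    return 1.0 / (1.0 + np.exp(-y * (X @ beta) / sigma**2))

def mor_truncated_gradient(X, y, beta, T, sigma):
    # f_T(grad Q_n) = mean_i [ Pi_T(x_i) * (2 w_i * Pi_T(y_i) - Pi_T(x_i^T beta)) ]
    w = mor_weights(X, y, beta, sigma)
    y_clip = np.clip(y, -T, T)
    X_clip = np.clip(X, -T, T)
    xb_clip = np.clip(X @ beta, -T, T)
    return np.mean((2.0 * w * y_clip - xb_clip)[:, None] * X_clip, axis=0)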

Due to space constraints, we present the full algorithm for mixture of regression model in section A.1. We also verify Conditions 3.4-3.7 in the mixture of regression model, and summarize the results in Lemma A.1. In the following, we show the theoretical guarantees for the high-dimensional DP EM algorithm on the mixture of regression model, with proof given in Section B.4.

Theorem 4.4.

We implement the Algorithm 5 to the sample generated from the mixture of regression model (6). Let ,R,μ,ν,γ\mathcal{B},R,\mu,\nu,\gamma defined as in Lemma A.1. We assume 𝛃2/σ>ϕ||\bm{\beta}^{*}||_{2}/\sigma>\phi for a sufficiently large ϕ>0\phi>0. Let the initialization 𝛃0𝛃2R/2||\bm{\beta}^{0}-\bm{\beta}||_{2}\leq R/2 and κ=γ\kappa=\gamma. Also, set the sparsity level s^c0max(16(1/κ1)2,4332312)s\hat{s}\geq c_{0}\cdot\max(\frac{16}{(1/\kappa-1)^{2}},\frac{4\cdot 33^{2}}{31^{2}})\cdot s^{*} with c0>1c_{0}>1 be a constant and s^=O(s)\hat{s}=O(s^{*}). The step size is chosen as η=1\eta=1. We choose the number of iterations N0lognN_{0}\asymp\log n, and let the truncation level TlognT\asymp\sqrt{\log n}. We assume that slogdn(logn)32=o(1)\frac{s^{*}\log d}{n}\cdot(\log n)^{\frac{3}{2}}=o(1). Then, the proposed Algorithm 5 is (ϵ,δ)(\epsilon,\delta)-DP, there exist constants C,C1C,C_{1}, it holds that

𝜷^𝜷2\displaystyle||\hat{\bm{\beta}}-\bm{\beta}^{*}||_{2} Cslogdnlogn+C1slogdlog(1/δ)(logn)32nϵ.\displaystyle\leq C\cdot\sqrt{\frac{s^{*}\log d}{n}}\cdot\log n+C_{1}\cdot\frac{s^{*}\cdot\log d\cdot\sqrt{\log(1/{\delta})}{(\log n)}^{\frac{3}{2}}}{n\epsilon}. (7)

with probability 1m0slognexp(m1logd)m2/logdm3d1/21-m_{0}\cdot s^{*}\log n\cdot\exp(-m_{1}\log d)-m_{2}/\log d-m_{3}\cdot d^{-1/2}, where m0,m1,m2,m3m_{0},m_{1},m_{2},m_{3} are constants.

Theorem 4.4 gives a rate consisting of a statistical error term and a privacy cost term, similar to those discussed above for the general DP EM algorithm and the Gaussian mixture model. The proposition below gives the corresponding lower bound for the mixture of regression model.

Proposition 4.5.

Suppose (Y,𝐗)={(y1,𝐱1),(y2,𝐱2),(yn,𝐱n)}(Y,\bm{X})=\{(y_{1},\bm{x}_{1}),(y_{2},\bm{x}_{2}),...(y_{n},\bm{x}_{n})\} be the data set of nn samples observed from the mixture of regression model (6). Let MM and ϵ,δ\mathcal{M}_{\epsilon,\delta} defined as in Proposition 4.3. Then there exists a constant cc, if s=o(d1ω)s^{*}=o(d^{1-\omega}) for some fixed ω>0\omega>0, 0<ϵ<10<\epsilon<1 and δ<n(1+ω)\delta<n^{-(1+\omega)} for some fixed ω>0\omega>0, we have

infMϵ,δsup𝜷d,𝜷0s𝔼M(Y,𝑿)𝜷2c(slogdn+slogdnϵ).\inf_{M\in\mathcal{M}_{\epsilon,\delta}}\sup_{\bm{\beta}^{*}\in\mathbb{R}^{d},\|\bm{\beta}^{*}\|_{0}\leq s^{*}}\mathbb{E}\|{M(Y,\bm{X})}-\bm{\beta}^{*}\|_{2}\geq c\cdot(\sqrt{\frac{s^{*}\log d}{n}}+\frac{s^{*}\log d}{n\epsilon}).

The proof of Proposition 4.5 is in Section B.5. Comparing the results from Theorem 4.4 and Proposition 4.5, our algorithm attains the lower bound up to logarithm factors.

4.3 Regression with missing covariates

The last model we discuss in this section is the regression with missing covariates. For the model setup, we assume the following data generative process

Y=𝑿𝜷+e,Y=\bm{X}^{\top}\bm{\beta}^{*}+e,

where YY\in\mathbb{R} is the response, 𝑿N(0,𝑰d)\bm{X}\sim N(0,\bm{I}_{d}), eN(0,σ2)e\sim N(0,\sigma^{2}) and e,𝑿e,\bm{X} are independent. 𝜷\bm{\beta}^{*} is a dd-dimensional coefficient vector, and we require 𝜷\bm{\beta}^{*} to be sparse in the high-dimensional setting with 𝜷0s\|\bm{\beta}^{*}\|_{0}\leq s^{*}. Let (𝒙1,y1),,(𝒙n,yn)(\bm{x}_{1},y_{1}),...,(\bm{x}_{n},y_{n}) be nn i.i.d.i.i.d. samples generated from the above model. For each 𝒙i\bm{x}_{i}, we assume the missing completely at random model such that each coordinate of 𝒙i\bm{x}_{i} is missing independently with probability p[0,1)p\in[0,1). Specifically, for each 𝒙i\bm{x}_{i}, we denote 𝒙~i\tilde{\bm{x}}_{i} be the observed covariates such that 𝒙~i=𝒛i𝒙i\tilde{\bm{x}}_{i}=\bm{z}_{i}\odot\bm{x}_{i}, where \odot denotes the Hadamard product and 𝒛i\bm{z}_{i} is a dd-dimensional Bernoulli random vector with zij=1z_{ij}=1 if xijx_{ij} is observed and zij=0z_{ij}=0 if xijx_{ij} is missing. Then, by the EM algorithm, we compute

Qn(𝜷;𝜷)=1ni=1nyi(𝜷)m𝜷(𝒙~i,yi)12(𝜷)K𝜷(𝒙~i,yi)𝜷.Q_{n}(\bm{\beta}^{\prime};\bm{\beta})=\frac{1}{n}\sum_{i=1}^{n}y_{i}\cdot(\bm{\beta}^{\prime})^{\top}\cdot m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i})-\frac{1}{2}(\bm{\beta}^{\prime})^{\top}\cdot K_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i})\cdot\bm{\beta}^{\prime}.

Here, m𝜷(,)dm_{\bm{\beta}}(\cdot,\cdot)\in\mathbb{R}^{d} and K𝜷(,)d×dK_{\bm{\beta}}(\cdot,\cdot)\in\mathbb{R}^{d\times d} are defined as

m𝜷(𝒙~i,yi)=𝒛i𝒙i+yi𝜷,𝒛i𝒙iσ2+(𝟏𝒛i)𝜷22(𝟏𝒛i)𝜷,m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i})=\bm{z}_{i}\odot\bm{x}_{i}+\frac{y_{i}-\langle\bm{\beta},\bm{z}_{i}\odot\bm{x}_{i}\rangle}{\sigma^{2}+\|(\bm{1}-\bm{z}_{i})\odot\bm{\beta}\|_{2}^{2}}\cdot(\bm{1}-\bm{z}_{i})\odot\bm{\beta},
K𝜷(𝒙~i,yi)=diag(𝟏𝒛i)+m𝜷(𝒙~i,yi)[m𝜷(𝒙~i,yi)][(𝟏𝒛i)m𝜷(𝒙~i,yi)][(𝟏𝒛i)m𝜷(𝒙~i,yi)].\displaystyle K_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i})=\text{diag}(\bm{1}-\bm{z}_{i})+m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i})\cdot[m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i})]^{\top}-[(\bm{1}-\bm{z}_{i})\odot m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i})]\cdot[(\bm{1}-\bm{z}_{i})\odot m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i})]^{\top}.

Then, according to the gradient EM update, for the tt-th iteration 𝜷t\bm{\beta}^{t}, the update rule for the estimation of 𝜷\bm{\beta} is given below

𝜷t+1=𝜷t+ηQn(𝜷t;𝜷t) ,where Qn(𝜷;𝜷)=1ni=1n[yim𝜷(𝒙~i,yi)K𝜷(𝒙~i,yi)𝜷].\bm{\beta}^{t+1}=\bm{\beta}^{t}+\eta\cdot\nabla Q_{n}(\bm{\beta}^{t};\bm{\beta}^{t})\text{ ,where }\nabla Q_{n}(\bm{\beta};\bm{\beta})=\frac{1}{n}\sum_{i=1}^{n}[y_{i}\cdot m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i})-K_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i})\cdot\bm{\beta}].

Similar as before, to apply Algorithm 2, we also need to specify the truncation operator fT(Qn(𝜷t;𝜷t))f_{T}(\nabla Q_{n}(\bm{\beta}^{t};\bm{\beta}^{t})). According to the definition of m𝜷(𝒙~i,yi)m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i}), when 𝜷\bm{\beta} is close to 𝜷\bm{\beta}^{*}, we find that m𝜷(𝒙~i,yi)m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i}) is close to 𝒛i𝒙i\bm{z}_{i}\odot\bm{x}_{i}, so we propose to truncate on m𝜷(𝒙~i,yi)m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i}) and m𝜷(𝒙~i,yi)𝜷m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i})^{\top}\cdot\bm{\beta}. Specifically, let

fT(Qn(𝜷;𝜷))\displaystyle f_{T}(\nabla Q_{n}(\bm{\beta};\bm{\beta})) =1ni=1n[ΠT(yi)ΠT(m𝜷(𝒙~i,yi))diag(𝟏𝒛i)𝜷\displaystyle=\frac{1}{n}\sum_{i=1}^{n}[\Pi_{T}(y_{i})\cdot\Pi_{T}(m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i}))-\text{diag}(\bm{1}-\bm{z}_{i})\cdot\bm{\beta}
ΠT(m𝜷(𝒙~i,yi))ΠT(m𝜷(𝒙~i,yi)𝜷)\displaystyle-\Pi_{T}(m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i}))\cdot\Pi_{T}(m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i})^{\top}\cdot\bm{\beta})
+ΠT((𝟏𝒛i)m𝜷(𝒙~i,yi))ΠT(((𝟏𝒛i)m𝜷(𝒙~i,yi))𝜷).\displaystyle+\Pi_{T}((\bm{1}-\bm{z}_{i})\odot m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i}))\cdot\Pi_{T}(((\bm{1}-\bm{z}_{i})\odot m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i}))^{\top}\cdot\bm{\beta}).
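The sketch below (illustrative Python; X_obs holds \tilde{\bm{x}}_{i}=\bm{z}_{i}\odot\bm{x}_{i} with missing entries set to zero and Z holds the observation indicators, names of our choosing) computes m_{\bm{\beta}} and the truncated gradient above for a batch of samples.

import numpy as np

def m_beta(x_obs, z, y, beta, sigma):
    # m_beta(x_tilde, y): fills in the missing coordinates using the current beta.
    resid = y - x_obs @ beta
    denom = sigma**2 + np.sum(((1.0 - z) * beta) ** 2)
    return x_obs + resid / denom * (1.0 - z) * beta

def rmc_truncated_gradient(X_obs, Z, y, beta, T, sigma):
    # f_T(grad Q_n) for regression with missing covariates; y_i, m_i, m_i^T beta
    # and ((1 - z_i) * m_i)^T beta are each clipped separately.
    grads = []
    for x_obs, z, yi in zip(X_obs, Z, y):
        m = m_beta(x_obs, z, yi, beta, sigma)
        m_clip = np.clip(m, -T, T)
        miss_m = (1.0 - z) * m
        miss_m_clip = np.clip(miss_m, -T, T)
        g = (np.clip(yi, -T, T) * m_clip
             - (1.0 - z) * beta                          # diag(1 - z_i) * beta
             - m_clip * np.clip(m @ beta, -T, T)
             + miss_m_clip * np.clip(miss_m @ beta, -T, T))
        grads.append(g)
    return np.mean(grads, axis=0)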

We present the full DP algorithm for the regression with missing covariates model in Section A.2, and verify Conditions 3.4-3.7 for this model; the results are summarized in Lemma A.2. In the following, we show the theoretical guarantees of the high-dimensional DP EM algorithm for regression with missing covariates, with the proof given in Section B.6.

Theorem 4.6.

We implement Algorithm 6 to the samples generated from the regression with missing covariates model. Let ,R,L,μ,ν,γ\mathcal{B},R,L,\mu,\nu,\gamma are defined as in Lemma A.2, and assume the initialization 𝛃0{\bm{\beta}}^{0} satisfies 𝛃0𝛃2R/2||\bm{\beta}^{0}-\bm{\beta}^{*}||_{2}\leq R/2 and κ=γ\kappa=\gamma. Also, set the sparsity level s^c0max(16(1/κ1)2,4(1+L)2(1L)2)s\hat{s}\geq c_{0}\cdot\max(\frac{16}{(1/\kappa-1)^{2}},\frac{4\cdot(1+L)^{2}}{(1-L)^{2}})\cdot s^{*} with c0>1c_{0}>1 be a constant and s^=O(s)\hat{s}=O(s^{*}). We choose the step size η=1\eta=1, the number of iterations N0lognN_{0}\asymp\log n, and the truncation level TlognT\asymp\sqrt{\log n}. We further assume that slogdn(logn)32=o(1)\frac{s^{*}\log d}{n}\cdot(\log n)^{\frac{3}{2}}=o(1). Then, the proposed Algorithm 6 is (ϵ,δ)(\epsilon,\delta)-DP, there exist constants C,C1,m0,m1,m2,m3C,C_{1},m_{0},m_{1},m_{2},m_{3}, it holds that

𝜷^𝜷2\displaystyle||\hat{{\bm{\beta}}}-{\bm{\beta}}^{*}||_{2} Cslogdnlogn+C1slogdlog(1/δ)(logn)32nϵ.\displaystyle\leq C\cdot\sqrt{\frac{s^{*}\log d}{n}}\cdot\log n+C_{1}\cdot\frac{s^{*}\cdot\log d\cdot\sqrt{\log(1/{\delta})}{(\log n)}^{\frac{3}{2}}}{n\epsilon}.

with probability 1m0slognexp(m1logd)m2/logdm3d1/21-m_{0}\cdot s^{*}\cdot\log n\cdot\exp(-m_{1}\log d)-m_{2}/\log d-m_{3}\cdot d^{-1/2} where m0,m1,m2,m3m_{0},m_{1},m_{2},m_{3} are constants.

This result is similar to the previous ones: the convergence rate of the DP estimator in regression with missing covariates consists of the statistical error O_{P}(\sqrt{\frac{s^{*}\log d}{n}}\cdot\log n) and the privacy cost O_{P}(\frac{s^{*}\cdot\log d\cdot\sqrt{\log(1/{\delta})}{(\log n)}^{\frac{3}{2}}}{n\epsilon}). Again, we present the minimax lower bound for estimation in regression with missing covariates under differential privacy constraints.

Proposition 4.7.

Suppose (Y,𝐗)={(y1,𝐱1),(y2,𝐱2),(yn,𝐱n)}(Y,\bm{X})=\{(y_{1},\bm{x}_{1}),(y_{2},\bm{x}_{2}),...(y_{n},\bm{x}_{n})\} be the data set of nn samples observed from the regression with missing covariates discussed above and let MM and ϵ,δ\mathcal{M}_{\epsilon,\delta} defined as in Proposition 4.3. Then there exists a constant cc, if s=o(d1ω)s^{*}=o(d^{1-\omega}) for some fixed ω>0\omega>0, 0<ϵ<10<\epsilon<1 and δ<n(1+ω)\delta<n^{-(1+\omega)} for some fixed ω>0\omega>0, we have

infMϵ,δsup𝜷d,𝜷0s𝔼M(Y,𝑿)𝜷2c(slogdn+slogdnϵ).\inf_{M\in\mathcal{M}_{\epsilon,\delta}}\sup_{\bm{\beta}^{*}\in\mathbb{R}^{d},\|\bm{\beta}^{*}\|_{0}\leq s^{*}}\mathbb{E}\|{M(Y,\bm{X})}-\bm{\beta}^{*}\|_{2}\geq c\cdot(\sqrt{\frac{s^{*}\log d}{n}}+\frac{s^{*}\log d}{n\epsilon}).

As a result, for the high-dimensional regression with missing covariates model, comparing the upper bound in Theorem 4.6 with the lower bound in Proposition 4.7 shows that our algorithm attains the optimal rate of convergence up to logarithm factors.

5 Low-dimensional DP EM Algorithm

In this section, we extend the technical tools we developed in Section 3 to the classic low-dimensional setting, and propose the DP EM algorithm in this low-dimensional regime. We will further apply our proposed algorithm to the Gaussian mixture model as an example, and show that the proposed algorithm obtains a near-optimal rate of convergence.

For the low-dimensional case, instead of using the noisy hard-thresholding algorithm, we use the Gaussian mechanism in the M-step of each iteration. Similar to the high-dimensional setting, we use sample splitting across iterations and a truncation step in each M-step to ensure bounded sensitivity. The algorithm is summarized below.

1:Private parameters (ϵ,δ)(\epsilon,\delta), step size η\eta, truncation level TT, maximum number of iterations N0N_{0}.
2:Initialization: 𝜷0\bm{\beta}^{0}
3:For t=0,1,2,N01t=0,1,2,...N_{0}-1:
4: Compute 𝜷t+1=𝜷t+ηfT(Qn/N0(𝜷t;𝜷t))+Wt\bm{\beta}^{t+1}=\bm{\beta}^{t}+\eta f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))+W_{t}, where WtW_{t} is a random vector (ξ1,ξ2,ξd)(\xi_{1},\xi_{2},...\xi_{d})^{\top} and ξ1,ξ2,ξd\xi_{1},\xi_{2},...\xi_{d} are i.i.d. samples drawn from N(0,2η2d(2T)2N02log(1.25/δ)n2ϵ2)N(0,\frac{2\eta^{2}d(2T)^{2}{N_{0}}^{2}\log(1.25/\delta)}{n^{2}\epsilon^{2}}).
5:End For
6:𝜷^=𝜷N0\hat{\bm{\beta}}=\bm{\beta}^{N_{0}}
Algorithm 4 Low-Dimensional DP EM algorithm
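To make the structure of Algorithm 4 concrete, below is a minimal Python sketch (for illustration only, not the implementation used in our experiments). The model-specific truncated gradient f_T(∇Q) is passed in as the placeholder callable grad_q, and a generic coordinate-wise clipping to [-T, T] stands in for the truncation; the Gaussian noise variance matches the calibration stated in Algorithm 4.

```python
import numpy as np

def low_dim_dp_em(data, grad_q, beta0, eps, delta, eta, T, N0, rng=None):
    """Sketch of Algorithm 4: low-dimensional DP EM via the Gaussian mechanism.

    grad_q(batch, beta) is a placeholder for the model-specific truncated
    gradient f_T(grad Q) computed on `batch`; coordinate-wise clipping to
    [-T, T] is used below as a generic stand-in for the truncation.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = data.shape[0]
    d = beta0.shape[0]
    # Gaussian-mechanism variance from Algorithm 4:
    # 2 * eta^2 * d * (2T)^2 * N0^2 * log(1.25/delta) / (n^2 * eps^2)
    sigma2 = (2 * eta**2 * d * (2 * T)**2 * N0**2
              * np.log(1.25 / delta) / (n**2 * eps**2))
    batches = np.array_split(rng.permutation(n), N0)   # sample splitting
    beta = np.asarray(beta0, dtype=float).copy()
    for t in range(N0):
        g = np.clip(grad_q(data[batches[t]], beta), -T, T)
        w = rng.normal(0.0, np.sqrt(sigma2), size=d)   # the noise vector W_t
        beta = beta + eta * g + w
    return beta
```

For the Gaussian mixture model, for instance, grad_q would implement the truncated gradient used in Algorithm 3.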
Lemma 5.1.

The output 𝛃^\hat{\bm{\beta}} of the low dimensional DP EM algorithm (Algorithm 4) is (ϵ,δ)(\epsilon,\delta)-DP.

Given the privacy guarantee, we then analyze the statistical accuracy of this algorithm. Before that, we need to modify Conditions 3.6 and 3.7 to fit the low-dimensional setting.

Condition 5.2 (Statistical-Error-2 (α,τ,n,)(\alpha,\tau,n,\mathcal{B})).

For any fixed 𝛃{\bm{\beta}}\in\mathcal{B}, we have that with probability at least 1τ1-\tau,

Q(𝜷;𝜷)Qn(𝜷;𝜷)2α.||\nabla Q(\bm{\beta};\bm{\beta})-\nabla Q_{n}(\bm{\beta};\bm{\beta})||_{2}\leq\alpha.

A difference between the high-dimensional and low-dimensional cases is that, in the low-dimensional case, the dimension of 𝜷\bm{\beta}^{*}, denoted by dd, can be much smaller than the sample size nn. Rather than using the infinity norm, the statistical error can therefore be measured directly in the 2\ell_{2} norm, which reflects the accumulated difference over the coordinates of the true 𝜷{\bm{\beta}}^{*} and 𝜷^\hat{\bm{\beta}}.

Condition 5.3 (Truncation-Error-2 (ξ,ϕ,n,T,)(\xi,\phi,n,T,\mathcal{B})).

For any 𝛃\bm{\beta}\in\mathcal{B}, there exists a non-increasing function ϕ\phi, such that for the truncation level TT, with probability 1ϕ(ξ)1-\phi(\xi),

Qn(𝜷;𝜷)fT(Qn(𝜷;𝜷))2ξ.||\nabla Q_{n}(\bm{\beta};\bm{\beta})-f_{T}(\nabla Q_{n}(\bm{\beta};\bm{\beta}))||_{2}\leq\xi.

Below is the main theorem for the low-dimensional DP EM algorithm.

Theorem 5.4.

For Algorithm 4, we define ={𝛃:𝛃𝛃2R}\mathcal{B}=\{\bm{\beta}:||\bm{\beta}-\bm{\beta}^{*}||_{2}\leq R\} and assume 𝛃0𝛃2R/2||\bm{\beta}^{0}-\bm{\beta}^{*}||_{2}\leq R/2. We assume that the Lipschitz-Gradient(γ,)(\gamma,\mathcal{B}) and Concavity-Smoothness(μ,ν,)(\mu,\nu,\mathcal{B}) conditions hold. Define κ=12ν2γμ+ν(0,1)\kappa=1-\frac{2\nu-2\gamma}{\mu+\nu}\in(0,1). For the parameters, we choose the step size η=2μ+ν\eta=\frac{2}{\mu+\nu}, the truncation level TlognT\asymp\sqrt{\log n}, and the number of iterations N0lognN_{0}\asymp\log n. We assume that α(νγ)R/4\alpha\leq(\nu-\gamma)\cdot R/4 and ξ(νγ)R/4\xi\leq(\nu-\gamma)\cdot R/4. For the sample size, we assume there exists a constant KK such that nKd(logn)32log(1/δ)(1κ)Rϵn\geq K\cdot\frac{d(\log n)^{\frac{3}{2}}\sqrt{\log(1/\delta)}}{(1-\kappa)\cdot R\cdot\epsilon}. Then, under the conditions Statistical-Error-2(α,τ/N0,n/N0,)(\alpha,\tau/{N_{0}},n/{N_{0}},\mathcal{B}) and Truncation-Error-2(ξ,ϕ,n/N0,T,)(\xi,\phi,n/N_{0},T,\mathcal{B}), there exists a sufficiently large constant CC such that, with probability 1c0lognexp(c1d)c2ϕ(ξ)lognτ1-c_{0}\log n\cdot\exp(-c_{1}d)-c_{2}\phi(\xi)\cdot\log n-\tau,

𝜷t𝜷2\displaystyle||\bm{\beta}^{t}-\bm{\beta}^{*}||_{2} κt2R+Cd(logn)32log(1/δ)nϵ+η(ξ+α)1κ.\displaystyle\leq\frac{\kappa^{t}}{2}R+C\cdot\frac{d\cdot(\log n)^{\frac{3}{2}}\sqrt{\log(1/\delta)}}{n\epsilon}+\frac{\eta\cdot(\xi+\alpha)}{1-\kappa}. (8)

Specifically, for the output in Algorithm 4, it holds that

𝜷N0𝜷2\displaystyle||\bm{\beta}^{N_{0}}-\bm{\beta}^{*}||_{2} Cd(logn)32log(1/δ)nϵ+η(ξ+α)1κ.\displaystyle\leq C\cdot\frac{d\cdot(\log n)^{\frac{3}{2}}\sqrt{\log(1/\delta)}}{n\epsilon}+\frac{\eta\cdot(\xi+\alpha)}{1-\kappa}.

The proof of Theorem 5.4 is given in Section B.7. There are three terms in the result (8), and their interpretations align with those in the high-dimensional case. The first term is the optimization error, which converges to zero at a geometric rate and is small once the number of iterations tt is large. The second term is the cost of privacy caused by the Gaussian noise added in each iteration to achieve DP; when the privacy constraint becomes more stringent (ϵ,δ\epsilon,\delta become smaller), this term becomes larger. The third term reflects the truncation error and the statistical error. On one hand, by choosing an appropriate TT, nearly every coordinate falls below the threshold TT in each iteration, so the truncation causes little accuracy loss. On the other hand, the statistical error is caused by the finite sample. With a proper choice of TT and sufficiently large nn, the third term is small.

The result in Theorem 5.4 is obtained for general latent variable models. The convergence rates of α\alpha and ξ\xi are unspecified in this general case and may vary across specific models. To illustrate the theoretical guarantees of the proposed algorithm, we apply it to the Gaussian mixture model as an example and defer the results for the other two specific models to Appendices A.4.1 and A.4.2.

Theorem 5.5.

For Algorithm 4 applied to the Gaussian mixture model, let the truncation of the gradient be the same as that in Algorithm 3. We define ={𝛃:𝛃𝛃2R}\mathcal{B}=\{\bm{\beta}:||\bm{\beta}-\bm{\beta}^{*}||_{2}\leq R\} and assume 𝛃0𝛃2R/2||\bm{\beta}^{0}-\bm{\beta}^{*}||_{2}\leq R/2. Define R,μ,ν,γR,\mu,\nu,\gamma as in Lemma 4.1 and κ=γ\kappa=\gamma. For the choice of parameters, the step size is chosen as η=1\eta=1, the truncation level is chosen as TlognT\asymp\sqrt{\log n}, and the number of iterations is chosen as N0lognN_{0}\asymp\log n. We assume the sample size nn is sufficiently large that there exist constants K,KK,K^{\prime} such that nKd(logn)32log(1/δ)(1κ)Rϵn\geq K\cdot\frac{d(\log n)^{\frac{3}{2}}\sqrt{\log(1/\delta)}}{(1-\kappa)\cdot R\cdot\epsilon} and Kdnlogn(1γ)R/4K^{\prime}\cdot\sqrt{\frac{d}{n}}\cdot\log n\leq(1-\gamma)\cdot R/4. Then there exists a sufficiently large constant CC such that, with probability 1c0lognexp(c1d)c2/lognc3n1/21-c_{0}\log n\cdot\exp(-c_{1}d)-c_{2}/\log n-c_{3}\cdot n^{-1/2},

𝜷^𝜷2\displaystyle||\hat{\bm{\beta}}-\bm{\beta}^{*}||_{2} Cd(logn)32log(1/δ)nϵ+η1κdnlogn.\displaystyle\leq C\cdot\frac{d\cdot(\log n)^{\frac{3}{2}}\sqrt{\log(1/\delta)}}{n\epsilon}+\frac{\eta}{1-\kappa}\cdot\sqrt{\frac{d}{n}}\cdot\log n.

We remark that, in the literature, (Wang et al., 2020) also analyzed the Gaussian mixture model and obtained a rate of convergence of order O(d2/nlog(1/δ)/ϵ)O(\sqrt{d^{2}/{n}}\cdot{\log(1/\delta)}/{\epsilon}). Our result in Theorem 5.5 gives a faster rate of convergence. In the following, we show that this rate cannot be improved further up to logarithm factors.

Proposition 5.6.

Let 𝐘={𝐲1,𝐲2,𝐲n}\bm{Y}=\{\bm{y}_{1},\bm{y}_{2},...\bm{y}_{n}\} be the data set of nn samples observed from the Gaussian mixture model and let MM be any algorithm such that Mϵ,δM\in\mathcal{M}_{\epsilon,\delta}, where ϵ,δ\mathcal{M}_{\epsilon,\delta} is the set of all (ϵ,δ)(\epsilon,\delta)-DP algorithms for the estimation of the true parameter 𝛃\bm{\beta}^{*} in the low-dimensional setting. Then there exists a constant cc such that, if 0<ϵ<10<\epsilon<1 and n1exp(nϵ)<δ<n(1+ω)n^{-1}\exp(-n\epsilon)<\delta<n^{-(1+\omega)} for some fixed ω>0\omega>0 with d>c0log(1/δ)d>c_{0}\log(1/\delta) and n>c1dlog(1/δ)/ϵn>c_{1}\cdot\sqrt{d\log(1/\delta)/\epsilon}, we have

infMϵ,δsup𝜷d𝔼M(𝒀)𝜷2c(dn+dlog(1/δ)nϵ).\inf_{M\in\mathcal{M}_{\epsilon,\delta}}\sup_{\bm{\beta}^{*}\in\mathbb{R}^{d}}\mathbb{E}\|{M(\bm{Y})}-\bm{\beta}^{*}\|_{2}\geq c\cdot(\sqrt{\frac{d}{n}}+\frac{d\sqrt{\log(1/\delta)}}{n\epsilon}).

The detailed proofs of Theorem 5.5 and Proposition 5.6 are in Appendix B.8. Comparing these two results, our proposed Algorithm 4 attains the minimax optimal rate of convergence up to logarithm factors.

6 Numerical Experiments

In this section, we investigate the numerical performance of the proposed DP EM algorithms. Specifically, in the high-dimensional setting, for illustration purposes, we investigate the Gaussian mixture model (Algorithm 3) in Section 6.1 on simulated data sets in detail. Due to space constraints, an additional simulation for the mixture of regression model is presented in the supplementary materials (Section A.3). Then, in the low-dimensional case, we compare our Algorithm 4 with the algorithm in (Wang et al., 2020) under the Gaussian mixture model in Section 6.2. Further, in Section 6.3, we demonstrate the numerical performance of the proposed algorithm on a real data set.

6.1 Simulation results for Gaussian mixture model

For the DP EM algorithm in the high-dimensional Gaussian mixture model, the simulated data set is constructed as follows. First, we set 𝜷\bm{\beta}^{*} to be a unit vector whose first ss^{*} coordinates equal 1/s1/\sqrt{s^{*}} and whose remaining coordinates are zero. For i[n]i\in[n], we generate 𝒚i=zi𝜷+𝒆i\bm{y}_{i}=z_{i}\cdot\bm{\beta}^{*}+\bm{e}_{i}, where (zi=1)=(zi=1)=1/2{\mathbb{P}}(z_{i}=1)={\mathbb{P}}(z_{i}=-1)=1/2 and 𝒆iN(0,σ2Id)\bm{e}_{i}\sim N(0,\sigma^{2}I_{d}) with σ=0.5\sigma=0.5. We consider the following three settings:

  1. Fix d=1000,s=10,ϵ=0.5,δ=(2n)1d=1000,s^{*}=10,\epsilon=0.5,\delta=(2n)^{-1}. Compare the results of Algorithm 3 when n=4000,5000,6000n=4000,5000,6000, respectively.

  2. Fix n=4000,d=1000,ϵ=0.5,δ=(2n)1n=4000,d=1000,\epsilon=0.5,\delta=(2n)^{-1}. Compare the results of Algorithm 3 when s=5,10,15s^{*}=5,10,15, respectively.

  3. Fix n=4000,d=1000,s=10,δ=(2n)1n=4000,d=1000,s^{*}=10,\delta=(2n)^{-1}. Compare the results of Algorithm 3 when ϵ=0.3,0.5,0.8\epsilon=0.3,0.5,0.8, respectively.

For each setting, we repeat the experiment 50 times and report the average estimation error 𝜷t𝜷2\|{\bm{\beta}}^{t}-{\bm{\beta}}^{*}\|_{2}. For each experiment, we set the step size η\eta to 0.5. The results are plotted in Figure 1.
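For reference, the data-generating process described above can be sketched in a few lines of Python (a minimal sketch using NumPy; the actual experiments additionally run Algorithm 3 on the generated data):

```python
import numpy as np

def simulate_gmm(n, d, s_star, sigma=0.5, rng=None):
    """Generate data from the high-dimensional two-component Gaussian mixture:
    beta* has its first s_star coordinates equal to 1/sqrt(s_star) and is zero
    elsewhere; y_i = z_i * beta* + e_i with z_i = +/-1 w.p. 1/2 and
    e_i ~ N(0, sigma^2 I_d)."""
    rng = np.random.default_rng() if rng is None else rng
    beta_star = np.zeros(d)
    beta_star[:s_star] = 1.0 / np.sqrt(s_star)
    z = rng.choice([-1.0, 1.0], size=n)
    e = rng.normal(0.0, sigma, size=(n, d))
    y = z[:, None] * beta_star + e
    return y, beta_star

# Setting 1 with n = 4000, for example:
# y, beta_star = simulate_gmm(n=4000, d=1000, s_star=10)
```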

Figure 1: The average estimation error under different settings in the high-dimensional Gaussian mixture model.

From Figure 1, we can clearly see the relationship between the choices of n,s,ϵn,s^{*},\epsilon and the performance of the proposed algorithm in the Gaussian mixture model. The left panel of Figure 1 shows that the estimation of 𝜷{\bm{\beta}}^{*} becomes more accurate as nn becomes larger. The middle panel shows that when ss^{*} becomes smaller, the estimation error becomes smaller. The right panel shows that when ϵ\epsilon becomes larger (the privacy constraints are more relaxed), the cost of privacy becomes smaller, and therefore the estimator achieves a smaller estimation error. As ϵ\epsilon becomes large enough, the estimator 𝜷^\hat{{\bm{\beta}}} approaches the one obtained in the non-private setting.

6.2 Comparison with other algorithms

In the literature, (Wang et al., 2020) also studied the DP EM algorithm in classic low-dimensional latent variable models. In this section, we compare our proposed method with 1) the algorithm proposed in (Wang et al., 2020), and 2) the standard (non-private) gradient EM algorithm (Balakrishnan et al., 2017).

The synthetic data are generated as follows. We first set the true parameter 𝜷{\bm{\beta}}^{*} to be a unit vector with each element equal to 1/d1/\sqrt{d}. Then we simulate the Gaussian mixture model with ziz_{i} satisfying (zi=1)=(zi=1)=1/2{\mathbb{P}}(z_{i}=1)={\mathbb{P}}(z_{i}=-1)=1/2 and sample a multivariate Gaussian variable 𝒆iN(0,σ2Id)\bm{e}_{i}\sim N(0,\sigma^{2}I_{d}) with σ=0.5\sigma=0.5. Finally, we compute 𝒚i=zi𝜷+𝒆i\bm{y}_{i}=z_{i}\cdot\bm{\beta}^{*}+\bm{e}_{i}.

We consider the following two settings. In the first setting, we fix d,ϵ,δd,\epsilon,\delta and vary nn from 5000 to 25000; in the second setting, we fix n,ϵ,δn,\epsilon,\delta and vary dd from 5 to 25. For each setting, we report the average estimation error 𝜷𝜷^2\|{\bm{\beta}}^{*}-\hat{{\bm{\beta}}}\|_{2} over 50 repetitions. The simulation results are summarized in Figure 2. The results indicate that although there is always a gap between our algorithm and the non-private EM algorithm (due to the cost of privacy), our algorithm achieves a much smaller error than the algorithm in (Wang et al., 2020).

Figure 2: The average estimation error under different settings in the classic low-dimensional Gaussian mixture model.

6.3 Real data analysis

In this section, we apply the proposed DP EM algorithm for the high-dimensional Gaussian mixture model to the Breast Cancer Wisconsin (Diagnostic) Data Set, which is available at the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)) (Dua and Graff, 2017). This data set contains 569 instances and 30 attributes. Each instance describes the diagnostic result for an individual as ‘Benign’ or ‘Malignant’. In the data set, 212 instances are labeled as ‘Malignant’ while the remaining 357 instances are labeled as ‘Benign’. Such a medical diagnosis data set often contains sensitive personal information and thus serves as a suitable data set for applying our privacy-preserving algorithms.

In our experiment, all the attributes are normalized to have zero mean and unit variance. Moreover, to make the data set fit the symmetric two-component mixture, we first randomly drop 145 data points from the ‘Benign’ class (so that the two classes have equal size, 212 each) and compute the overall sample mean; we then subtract this sample mean from each data point. After preprocessing, the data are randomly split into two parts, where 70% of the instances are used for training and 30% for testing.

In the training stage, we estimate the parameter 𝜷\bm{\beta}^{*} using the proposed algorithm for the high-dimensional Gaussian mixture model (Algorithm 3). We run Algorithm 3 for 50 iterations with step size η=0.5\eta=0.5. The initialization 𝜷0\bm{\beta}^{0} is chosen as the unit vector whose coordinates all equal 1/301/\sqrt{30}. We fix δ=1/2n\delta=1/2n, and choose various sparsity levels s^\hat{s} and privacy parameters ϵ\epsilon as displayed in Table 3. In the testing stage, we classify each testing point as ‘Benign’ or ‘Malignant’ by comparing the 2\ell_{2} distance between its attributes and 𝜷^\hat{\bm{\beta}} with the distance between its attributes and 𝜷^-\hat{\bm{\beta}}. Then we compute the misclassification rate by comparing with the true diagnostic outcomes. For each choice of parameters, we repeat the whole training and testing procedure 50 times, and report the average misclassification rate and its standard error. To compare with the non-private setting, for each choice of parameters, we also use the algorithm described in (Wang et al., 2015) as the non-private baseline (shown as ϵ=\epsilon=\infty in the table below). The results are summarized in Table 3.
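The testing-stage rule is a simple nearest-center comparison, sketched below. Here X_test denotes the preprocessed test attributes and labels the true diagnoses coded as +1/-1; the mapping between the sign of the estimated center and the two diagnosis labels is an illustrative assumption, since only the distance rule is specified above.

```python
import numpy as np

def classify_by_center(X_test, beta_hat):
    """Assign each test point to +beta_hat or -beta_hat, whichever is closer
    in ell_2 distance (the testing-stage rule described above)."""
    d_pos = np.linalg.norm(X_test - beta_hat, axis=1)
    d_neg = np.linalg.norm(X_test + beta_hat, axis=1)
    return np.where(d_pos <= d_neg, 1, -1)

def misclassification_rate(pred, labels):
    """Fraction of test points whose predicted component disagrees with the
    true diagnosis (labels coded as +1/-1; the +/- coding of the two
    diagnoses is an illustrative assumption)."""
    return np.mean(pred != labels)
```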

s^=5\hat{s}=5 s^=10\hat{s}=10 s^=15\hat{s}=15
ε=0.2\varepsilon=0.2 0.14(.07) 0.12(.05) 0.10(.04)
ε=0.5\varepsilon=0.5 0.08(.02) 0.07(.02) 0.07(.01)
ε=\varepsilon=\infty 0.07(.02) 0.06(.02) 0.06(.01)
Table 3: The average and standard error of the misclassification rates of Algorithm 3 for the Breast Cancer Wisconsin Data Set.

The results suggest that when the privacy requirements become more stringent, the classification accuracy drops only mildly. Considering the importance of achieving privacy guarantees, such a loss of accuracy is acceptable.

7 Conclusion

In this paper, we introduce novel DP EM algorithms in both the high-dimensional and low-dimensional settings. In the high-dimensional setting, we propose an algorithm based on noisy iterative hard thresholding and show that this method is minimax rate-optimal up to logarithm factors in three specific models: Gaussian mixture, mixture of regression, and regression with missing covariates. In the low-dimensional setting, an algorithm based on the Gaussian mechanism is also developed and shown to be near minimax optimal.

References

  • Abowd [2016] John M Abowd. The challenge of scientific reproducibility and privacy protection for statistical agencies. Census Scientific Advisory Committee, 2016.
  • Avella-Medina [2020] Marco Avella-Medina. Privacy-preserving parametric inference: a case for robust statistics. J. Am. Stat. Assoc., pages 1–15, 2020.
  • Avella-Medina et al. [2021] Marco Avella-Medina, Casey Bradshaw, and Po-Ling Loh. Differentially private inference via noisy optimization. arXiv preprint arXiv:2103.11003, 2021.
  • Awan and Slavković [2020] Jordan Awan and Aleksandra Slavković. Differentially private inference for binomial data. Journal of Privacy and Confidentiality, 10(1):1–40, 2020.
  • Bafna and Ullman [2017] Mitali Bafna and Jonathan Ullman. The price of selection in differential privacy. In COLT, pages 151–168. PMLR, 2017.
  • Balakrishnan et al. [2017] Sivaraman Balakrishnan, Martin J Wainwright, and Bin Yu. Statistical guarantees for the em algorithm: From population to sample-based analysis. Ann Stat, 45(1):77–120, 2017.
  • Bassily et al. [2014] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In FOCS, pages 464–473. IEEE, 2014.
  • Bun et al. [2018] Mark Bun, Jonathan Ullman, and Salil Vadhan. Fingerprinting codes and the price of approximate differential privacy. SIAM J. Comput., 47(5):1888–1938, 2018.
  • Cai et al. [2019a] T Tony Cai, Jing Ma, and Linjun Zhang. Chime: Clustering of high-dimensional gaussian mixtures with em algorithm and its optimality. Ann Stat., 47(3):1234–1267, 2019a.
  • Cai et al. [2019b] T Tony Cai, Yichen Wang, and Linjun Zhang. The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy. arXiv preprint arXiv:1902.04495, 2019b.
  • Cai et al. [2020] T Tony Cai, Yichen Wang, and Linjun Zhang. The cost of privacy in generalized linear models: Algorithms and minimax lower bounds. arXiv preprint arXiv:2011.03900, 2020.
  • Chaudhuri et al. [2013] Kamalika Chaudhuri, Anand D Sarwate, and Kaushik Sinha. A near-optimal algorithm for differentially-private principal components. J Mach Learn Res, 14, 2013.
  • Daskalakis et al. [2017] Constantinos Daskalakis, Christos Tzamos, and Manolis Zampetakis. Ten steps of em suffice for mixtures of two gaussians. In COLT, pages 704–710. PMLR, 2017.
  • Dempster et al. [1977] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. J R Stat Soc Series B, 39(1):1–22, 1977.
  • Ding et al. [2017] Bolin Ding, Janardhan Kulkarni, and Sergey Yekhanin. Collecting telemetry data privately. In NeurIPS, pages 3574–3583, 2017.
  • Ding and Song [2016] Wei Ding and Peter X-K Song. Em algorithm in gaussian copula with missing data. Comput Stat Data Anal, 101:1–11, 2016.
  • Dua and Graff [2017] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Dwork and Feldman [2018] Cynthia Dwork and Vitaly Feldman. Privacy-preserving prediction. In COLT, pages 1693–1702. PMLR, 2018.
  • Dwork and Roth [2014] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3-4):211–407, 2014.
  • Dwork et al. [2006a] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Eurocrypt, pages 486–503. Springer, 2006a.
  • Dwork et al. [2006b] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284. Springer, 2006b.
  • Dwork et al. [2010] Cynthia Dwork, Guy N Rothblum, and Salil Vadhan. Boosting and differential privacy. In FOCS, pages 51–60. IEEE, 2010.
  • Dwork et al. [2014] Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Analyze gauss: optimal bounds for privacy-preserving principal component analysis. In STOC, 2014.
  • Dwork et al. [2015] Cynthia Dwork, Adam Smith, Thomas Steinke, Jonathan Ullman, and Salil Vadhan. Robust traceability from trace amounts. In FOCS, pages 650–669. IEEE, 2015.
  • Dwork et al. [2017] Cynthia Dwork, Adam Smith, Thomas Steinke, and Jonathan Ullman. Exposed! a survey of attacks on private data. Annu. Rev. Stat. Appl., 4:61–84, 2017.
  • Dwork et al. [2018] Cynthia Dwork, Weijie J Su, and Li Zhang. Differentially private false discovery rate control. arXiv preprint arXiv:1807.04209, 2018.
  • Erlingsson et al. [2014] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. Rappor: Randomized aggregatable privacy-preserving ordinal response. In CCS’14, pages 1054–1067, 2014.
  • Jagannathan et al. [2009] Geetha Jagannathan, Krishnan Pillaipakkamnatt, and Rebecca N Wright. A practical differentially private random decision tree classifier. In ICDM. IEEE, 2009.
  • Kadir et al. [2014] Shabnam N Kadir, Dan FM Goodman, and Kenneth D Harris. High-dimensional cluster analysis with the masked em algorithm. Neural Comput, 26(11):2379–2394, 2014.
  • Kamath et al. [2019] Gautam Kamath, Jerry Li, Vikrant Singhal, and Jonathan Ullman. Privately learning high-dimensional distributions. In COLT, pages 1853–1902. PMLR, 2019.
  • Kamath et al. [2020a] Gautam Kamath, Or Sheffet, Vikrant Singhal, and Jonathan Ullman. Differentially private algorithms for learning mixtures of separated gaussians. In ITA, pages 1–62. IEEE, 2020a.
  • Kamath et al. [2020b] Gautam Kamath, Vikrant Singhal, and Jonathan Ullman. Private mean estimation of heavy-tailed distributions. In COLT, pages 2204–2235. PMLR, 2020b.
  • Karwa and Vadhan [2017] Vishesh Karwa and Salil Vadhan. Finite sample differentially private confidence intervals. arXiv preprint arXiv:1711.03908, 2017.
  • Kifer et al. [2020] Daniel Kifer, Solomon Messing, Aaron Roth, Abhradeep Thakurta, and Danfeng Zhang. Guidelines for implementing and auditing differentially private systems. arXiv preprint arXiv:2002.04049, 2020.
  • Kwon and Caramanis [2020] Jeongyeol Kwon and Constantine Caramanis. The EM algorithm gives sample-optimality for learning mixtures of well-separated gaussians. In COLT. PMLR, 2020.
  • Kwon et al. [2019] Jeongyeol Kwon, Wei Qian, Constantine Caramanis, Yudong Chen, and Damek Davis. Global convergence of the em algorithm for mixtures of two component linear regression. In COLT, pages 2055–2110. PMLR, 2019.
  • Kwon et al. [2020] Jeongyeol Kwon, Nhat Ho, and Constantine Caramanis. On the minimax optimality of the em algorithm for learning two-component mixed linear regression. arXiv preprint arXiv:2006.02601, 2020.
  • McLachlan and Krishnan [2007] Geoffrey J McLachlan and Thriyambakam Krishnan. The EM algorithm and extensions, volume 382. John Wiley & Sons, 2007.
  • Mirshani et al. [2019] Ardalan Mirshani, Matthew Reimherr, and Aleksandra Slavković. Formal privacy for functional data with gaussian perturbations. In ICML, pages 4595–4604. PMLR, 2019.
  • Nissim et al. [2007] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In STOC, pages 75–84, 2007.
  • Park et al. [2017] Mijung Park, James Foulds, Kamalika Choudhary, and Max Welling. Dp-em: Differentially private expectation maximization. In AISTATS, pages 896–904. PMLR, 2017.
  • Quost and Denoeux [2016] Benjamin Quost and Thierry Denoeux. Clustering and classification of fuzzy data using the fuzzy em algorithm. Fuzzy Sets Syst, 286:134–156, 2016.
  • Rana et al. [2015] Santu Rana, Sunil Kumar Gupta, and Svetha Venkatesh. Differentially private random forest with high utility. In ICDM2015, pages 955–960. IEEE, 2015.
  • Ranjan et al. [2016] Rishik Ranjan, Biao Huang, and Alireza Fatehi. Robust gaussian process modeling using em algorithm. J Process Control, 42:125–136, 2016.
  • Song et al. [2020] Shuang Song, Om Thakkar, and Abhradeep Thakurta. Characterizing private clipped gradient descent on convex generalized linear problems. arXiv preprint arXiv:2006.06783, 2020.
  • Song et al. [2021] Shuang Song, Thomas Steinke, Om Thakkar, and Abhradeep Thakurta. Evading the curse of dimensionality in unconstrained private glms. In AISTATS. PMLR, 2021.
  • Steinke and Ullman [2016] Thomas Steinke and Jonathan Ullman. Between pure and approximate differential privacy. Journal of Privacy and Confidentiality, 7(2):3–22, 2016.
  • Steinke and Ullman [2017] Thomas Steinke and Jonathan Ullman. Tight lower bounds for differentially private selection. In FOCS, pages 552–563. IEEE, 2017.
  • Talwar et al. [2015] Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Nearly-optimal private lasso. In NeurIPS, pages 3025–3033, 2015.
  • Wang et al. [2020] Di Wang, Jiahao Ding, Zejun Xie, Miao Pan, and Jinhui Xu. Differentially private (gradient) expectation maximization algorithm with statistical guarantees. arXiv preprint arXiv:2010.13520, 2020.
  • Wang [2018] Yu-Xiang Wang. Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain. arXiv preprint arXiv:1803.02596, 2018.
  • Wang et al. [2015] Zhaoran Wang, Quanquan Gu, Yang Ning, and Han Liu. High dimensional em algorithm: Statistical optimization and asymptotic normality. NeurIPS, 28:2512, 2015.
  • Wu [1983] CF Jeff Wu. On the convergence properties of the em algorithm. Ann Stat., pages 95–103, 1983.
  • Xu et al. [2016] Ji Xu, Daniel J Hsu, and Arian Maleki. Global analysis of expectation maximization for mixtures of two gaussians. NeurIPS, 29, 2016.
  • Yan et al. [2017] Bowei Yan, Mingzhang Yin, and Purnamrita Sarkar. Convergence of gradient em on multi-component mixture of gaussians. NeurIPS, 2017.
  • Yi and Caramanis [2015] Xinyang Yi and Constantine Caramanis. Regularized em algorithms: A unified framework and statistical guarantees. arXiv preprint arXiv:1511.08551, 2015.
  • Yi et al. [2014] Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi. Alternating minimization for mixed linear regression. In ICML, pages 613–621. PMLR, 2014.
  • Zhang et al. [2020] Linjun Zhang, Rong Ma, T Tony Cai, and Hongzhe Li. Estimation, confidence intervals, and large-scale hypotheses testing for high-dimensional mixed linear regression. arXiv preprint arXiv:2011.03598, 2020.
  • Zhao et al. [2020] Ruofei Zhao, Yuanzhi Li, and Yuekai Sun. Statistical convergence of the em algorithm on gaussian mixture models. Electron. J. Stat., 14(1):632–660, 2020.
  • Zhu et al. [2017] Rongda Zhu, Lingxiao Wang, Chengxiang Zhai, and Quanquan Gu. High-dimensional variance-reduced stochastic gradient expectation-maximization algorithm. In ICML, pages 4180–4188. PMLR, 2017.

Appendix A Supplement materials

A.1 DP Algorithm and theories for Mixture of Regression Model in high-dimensional settings

By applying Algorithm 2, the DP estimation algorithm for the high-dimensional mixture of regression model is presented in detail below:

1:Private parameters (ϵ,δ)(\epsilon,\delta), step size η\eta, truncation level TT, maximum number of iterations N0N_{0}, sparsity parameter s^\hat{s}.
2:Initialization: 𝜷0\bm{\beta}^{0} with 𝜷00s^||\bm{\beta}^{0}||_{0}\leq\hat{s}.
3:For t=0,1,2,N01t=0,1,2,...N_{0}-1:
4: Compute 𝜷t+0.5=𝜷t+ηfT(Qn/N0(𝜷t;𝜷t))\bm{\beta}^{t+0.5}=\bm{\beta}^{t}+\eta\cdot f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})).
5: Let 𝜷t+1=NoisyHT(𝜷t+0.5,s^,4ηT2N0/n,ϵ,δ)\bm{\beta}^{t+1}=\text{NoisyHT}(\bm{\beta}^{t+0.5},\hat{s},4\eta\cdot T^{2}\cdot N_{0}/n,\epsilon,\delta).
6:End For
7:𝜷^=𝜷N0\hat{\bm{\beta}}=\bm{\beta}^{N_{0}}
Algorithm 5 DP Algorithm for High-Dimensional Mixture of Regression Model
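To make the iteration concrete, below is a minimal Python sketch of Algorithm 5. The noisy_hard_threshold function here is a simplified stand-in for the NoisyHT (peeling) subroutine used in Algorithm 5: it selects ŝ coordinates by Laplace-noised magnitude and perturbs the retained entries, with an illustrative noise scale only; the exact mechanism and its calibration are the ones defined earlier in the paper, not this sketch. The model-specific truncated gradient is passed in as the placeholder trunc_grad.

```python
import numpy as np

def noisy_hard_threshold(v, s_hat, lam, eps, delta, rng):
    """Simplified stand-in for NoisyHT: keep s_hat coordinates selected by
    Laplace-noised magnitude, then add Laplace noise to the kept entries.
    lam is the per-iteration sensitivity (4*eta*T^2*N0/n in Algorithm 5);
    the noise scale below is illustrative, not the exact calibration."""
    d = v.shape[0]
    scale = lam * np.sqrt(s_hat * np.log(1.0 / delta)) / eps
    scores = np.abs(v) + rng.laplace(0.0, scale, size=d)
    support = np.argsort(scores)[-s_hat:]
    out = np.zeros(d)
    out[support] = v[support] + rng.laplace(0.0, scale, size=s_hat)
    return out

def dp_em_high_dim_mor(X, y, trunc_grad, beta0, s_hat, eps, delta,
                       eta, T, N0, rng=None):
    """Sketch of the iteration in Algorithm 5; trunc_grad(X_batch, y_batch,
    beta, T) is a placeholder for the truncated gradient f_T(grad Q)."""
    rng = np.random.default_rng() if rng is None else rng
    n = y.shape[0]
    batches = np.array_split(rng.permutation(n), N0)   # sample splitting
    beta = np.asarray(beta0, dtype=float).copy()
    for t in range(N0):
        idx = batches[t]
        beta_half = beta + eta * trunc_grad(X[idx], y[idx], beta, T)
        beta = noisy_hard_threshold(beta_half, s_hat,
                                    4 * eta * T**2 * N0 / n, eps, delta, rng)
    return beta
```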

For the truncation step fT(Qn(𝜷t;𝜷t))f_{T}(\nabla Q_{n}(\bm{\beta}^{t};\bm{\beta}^{t})) in Algorithm 5, rather than truncating the whole gradient, we truncate yiy_{i}, 𝒙i\bm{x}_{i} and 𝒙i𝜷\bm{x}_{i}^{\top}\bm{\beta} separately, which leads to a more refined analysis and an improved rate in the statistical analysis. Specifically, we define

fT(Qn(𝜷;𝜷))=1ni=1n[2w𝜷(𝒙i,yi)ΠT(yi)ΠT(𝒙i)ΠT(𝒙i)ΠT(𝒙i𝜷)].f_{T}(\nabla Q_{n}(\bm{\beta};\bm{\beta}))=\frac{1}{n}\sum_{i=1}^{n}[2w_{\bm{\beta}}(\bm{x}_{i},y_{i})\cdot\Pi_{T}(y_{i})\cdot\Pi_{T}(\bm{x}_{i})-\Pi_{T}(\bm{x}_{i})\cdot\Pi_{T}(\bm{x}_{i}^{\top}\bm{\beta})].
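In code, this truncated gradient could be assembled as follows (a minimal sketch; the E-step weights w_β(x_i, y_i) are assumed to be precomputed from the current β and passed in as w, since their formula is given in the model section rather than repeated here):

```python
import numpy as np

def truncated_gradient_mor(X, y, w, beta, T):
    """Truncated gradient f_T(grad Q_n) for the mixture of regression model.

    Assumption: w is the length-n array of E-step weights w_beta(x_i, y_i),
    computed beforehand from the current beta (not re-derived here)."""
    n = y.shape[0]
    Xc = np.clip(X, -T, T)            # Pi_T(x_i), coordinate-wise
    yc = np.clip(y, -T, T)            # Pi_T(y_i)
    xb = np.clip(X @ beta, -T, T)     # Pi_T(x_i^T beta)
    terms = (2.0 * w * yc)[:, None] * Xc - Xc * xb[:, None]
    return terms.sum(axis=0) / n
```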

Then, we could verify Conditions 3.4-3.7 in the mixture of regression model.

Lemma A.1.

Suppose the signal-to-noise ratio satisfies 𝛃2/σ>ϕ||\bm{\beta}^{*}||_{2}/\sigma>\phi for a sufficiently large constant ϕ\phi. Then

  • Condition 3.4 (Lipschitz-Gradient (γ,)(\gamma,\mathcal{B})) and Condition 3.5 (Concavity-Smoothness (μ,ν,)(\mu,\nu,\mathcal{B})) both hold with the parameters

    γ(0,1/4),μ=ν=1,={𝜷:𝜷𝜷2R} with R=1/32𝜷2.\gamma\in(0,1/4),\mu=\nu=1,\mathcal{B}=\{\bm{\beta}:||\bm{\beta}-\bm{\beta}^{*}||_{2}\leq R\}\text{ with }R=1/32\cdot||\bm{\beta}^{*}||_{2}.
  • For Condition 3.6, the condition Statistical-Error(α,τ,s^,n,)(\alpha,\tau,\hat{s},n,\mathcal{B}) holds with a constant CC and

    α=Cηmax(𝜷22+σ2,1,s^𝜷2)logd+log(4/τ)n.\alpha=C\cdot\eta\cdot\max(||\bm{\beta}^{*}||_{2}^{2}+\sigma^{2},1,\sqrt{\hat{s}}\cdot||\bm{\beta}^{*}||_{2})\cdot\sqrt{\frac{\log d+\log(4/\tau)}{n}}.
  • For Condition 3.7, the condition Truncation-Error (ξ,ϕ,s^,n/N0,T,)(\xi,\phi,\hat{s},n/N_{0},T,\mathcal{B}) holds with TlognT\asymp\sqrt{\log n} and with probability 1m0/logdlogn1-m_{0}/\log d\cdot\log n, there exists a constant C1C_{1}, such that

    Qn/N0(𝜷;𝜷)fT(Qn/N0(𝜷;𝜷))2C1slogdnlogn.||\nabla Q_{n/N_{0}}(\bm{\beta};\bm{\beta})-f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta};\bm{\beta}))||_{2}\leq C_{1}\cdot\sqrt{\frac{s^{*}\cdot\log d}{n}}\cdot\log n.

The detailed proof of Lemma A.1 is given in Appendix C.4.

A.2 DP Algorithm and theories for Regression with missing covariates Model in high-dimensional settings

By applying Algorithm 2, we present below the DP estimation algorithm for the high-dimensional regression with missing covariates model.

1:Private parameters (ϵ,δ)(\epsilon,\delta), step size η\eta, truncation level TT, maximum number of iterations N0N_{0}, sparsity parameter s^\hat{s}.
2:Initialization: 𝜷0\bm{\beta}^{0} with 𝜷00s^||\bm{\beta}^{0}||_{0}\leq\hat{s}.
3:For t=0,1,2,N01t=0,1,2,...N_{0}-1:
4: Compute 𝜷t+0.5=𝜷t+ηfT(Qn/N0(𝜷t;𝜷t))\bm{\beta}^{t+0.5}=\bm{\beta}^{t}+\eta\cdot f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})).
5: Let 𝜷t+1=NoisyHT(𝜷t+0.5,s^,6ηT2N0/n,ϵ,δ)\bm{\beta}^{t+1}=\text{NoisyHT}(\bm{\beta}^{t+0.5},\hat{s},6\eta\cdot T^{2}\cdot N_{0}/n,\epsilon,\delta).
6:End For
7:𝜷^=𝜷N0\hat{\bm{\beta}}=\bm{\beta}^{N_{0}}
Algorithm 6 DP Algorithm for High-Dim Regression with Missing Covariates

For the term fT(Qn(𝜷t;𝜷t))f_{T}(\nabla Q_{n}(\bm{\beta}^{t};\bm{\beta}^{t})) in Algorithm 6, we design a truncation step specifically for this model. According to the definition of m𝜷(𝒙~i,yi)m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i}), when 𝜷\bm{\beta} is close to 𝜷\bm{\beta}^{*}, we find that m𝜷(𝒙~i,yi)m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i}) is close to 𝒛i𝒙i\bm{z}_{i}\odot\bm{x}_{i}, so we propose to truncate m𝜷(𝒙~i,yi)m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i}) and m𝜷(𝒙~i,yi)𝜷m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i})^{\top}\cdot\bm{\beta}. Specifically, let

fT(Qn(𝜷;𝜷))\displaystyle f_{T}(\nabla Q_{n}(\bm{\beta};\bm{\beta})) =1ni=1n[ΠT(yi)ΠT(m𝜷(𝒙~i,yi))diag(𝟏𝒛i)𝜷\displaystyle=\frac{1}{n}\sum_{i=1}^{n}[\Pi_{T}(y_{i})\cdot\Pi_{T}(m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i}))-\text{diag}(\bm{1}-\bm{z}_{i})\cdot\bm{\beta}
ΠT(m𝜷(𝒙~i,yi))ΠT(m𝜷(𝒙~i,yi)𝜷)\displaystyle-\Pi_{T}(m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i}))\cdot\Pi_{T}(m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i})^{\top}\cdot\bm{\beta})
+ΠT((𝟏𝒛i)m𝜷(𝒙~i,yi))ΠT(((𝟏𝒛i)m𝜷(𝒙~i,yi))𝜷).\displaystyle+\Pi_{T}((\bm{1}-\bm{z}_{i})\odot m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i}))\cdot\Pi_{T}(((\bm{1}-\bm{z}_{i})\odot m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i}))^{\top}\cdot\bm{\beta}).
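A minimal sketch of this truncation in code, assuming the imputed covariates m_β(x̃_i, y_i) are precomputed under the current β and passed in as the rows of m, and that z holds the 0/1 observation indicators (their construction is given in the model section):

```python
import numpy as np

def truncated_gradient_missing_cov(m, z, y, beta, T):
    """Truncated gradient f_T(grad Q_n) for regression with missing covariates.

    Assumptions: m is the (n, d) array whose i-th row is m_beta(x_tilde_i, y_i),
    the imputed covariate vector under the current beta, and z is the (n, d)
    array of observation indicators (z[i, j] = 1 if x_{ij} is observed)."""
    n, d = m.shape
    grad = np.zeros(d)
    for i in range(n):
        mi, zi = m[i], z[i]
        mi_c = np.clip(mi, -T, T)               # Pi_T(m_beta)
        mz = (1.0 - zi) * mi                    # (1 - z_i) * m_beta, elementwise
        grad += (
            np.clip(y[i], -T, T) * mi_c         # Pi_T(y_i) Pi_T(m_beta)
            - (1.0 - zi) * beta                 # diag(1 - z_i) beta
            - mi_c * np.clip(mi @ beta, -T, T)  # Pi_T(m_beta) Pi_T(m_beta^T beta)
            + np.clip(mz, -T, T) * np.clip(mz @ beta, -T, T)
        )
    return grad / n
```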

Then, we could verify Conditions 3.4-3.7 in the Regression with missing covariates model.

Lemma A.2.

Suppose the signal-to-noise ratio satisfies 𝛃2/σr||\bm{\beta}^{*}||_{2}/\sigma\leq r, where r>0r>0 is a constant. Also, for the probability pp that each coordinate of 𝐱i\bm{x}_{i} is missing, we assume p<1/(1+2b+2b2)p<1/(1+2b+2b^{2}), where b=r2(1+L)2b=r^{2}\cdot(1+L)^{2} and L(0,1)L\in(0,1) is a constant.

  • Condition 3.4 (Lipschitz-Gradient (γ,)(\gamma,\mathcal{B})) and Condition 3.5 (Concavity-Smoothness (μ,ν,)(\mu,\nu,\mathcal{B})) hold with the parameters

    γ=b+p(1+2b+2b2)1+b<1,μ=ν=1,={𝜷:𝜷𝜷2R} with R=L𝜷2.\gamma=\frac{b+p\cdot(1+2b+2b^{2})}{1+b}<1,\mu=\nu=1,\mathcal{B}=\{\bm{\beta}:||\bm{\beta}-\bm{\beta}^{*}||_{2}\leq R\}\text{ with }R=L\cdot||\bm{\beta}^{*}||_{2}.
  • For Condition 3.6, the condition Statistical-Error(α,τ,s^,n,)(\alpha,\tau,\hat{s},n,\mathcal{B}) holds with a constant CC and

    α=Cη[s^𝜷22(1+L)(1+Lr)2+max(𝜷22+σ2,(1+Lr)2)]logd+log(12/τ)n.\alpha=C\cdot\eta\cdot[\sqrt{\hat{s}}\cdot\|\bm{\beta}^{*}\|_{2}^{2}\cdot(1+L)\cdot(1+L\cdot r)^{2}+\max(||\bm{\beta}^{*}||_{2}^{2}+\sigma^{2},(1+L\cdot r)^{2})]\cdot\sqrt{\frac{\log d+\log(12/\tau)}{n}}.
  • For Condition 3.7, the condition Truncation-Error(ξ,ϕ,s^,n/N0,T,)(\xi,\phi,\hat{s},n/N_{0},T,\mathcal{B}) holds with TlognT\asymp\sqrt{\log n} and with probability 1m0/logdlogn1-m_{0}/\log d\cdot\log n, there exists a constant C1C_{1}, such that

    Qn/N0(𝜷;𝜷)fT(Qn/N0(𝜷;𝜷))2C1slogdnlogn.||\nabla Q_{n/N_{0}}(\bm{\beta};\bm{\beta})-f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta};\bm{\beta}))||_{2}\leq C_{1}\cdot\sqrt{\frac{s^{*}\cdot\log d}{n}}\cdot\log n.

The detailed proof of Lemma A.2 is in the Appendix C.5.

A.3 Simulation results for mixture of regression model

For the DP EM algorithm in the high-dimensional mixture of regression model, the simulated data set is constructed as follows. First, we set 𝜷\bm{\beta}^{*} to be a unit vector whose first ss^{*} coordinates equal 1/s1/\sqrt{s^{*}} and whose remaining coordinates are zero. For i[n]i\in[n], we let yi=zi𝒙i𝜷+eiy_{i}=z_{i}\cdot\bm{x}_{i}^{\top}\bm{\beta}^{*}+e_{i}, where 𝒙iN(0,𝑰d)\bm{x}_{i}\sim N(0,\bm{I}_{d}), (zi=1)=(zi=1)=1/2{\mathbb{P}}(z_{i}=1)={\mathbb{P}}(z_{i}=-1)=1/2 and eiN(0,σ2)e_{i}\sim N(0,\sigma^{2}) with σ=0.5\sigma=0.5. We consider the following three experimental settings:

  • Fix d=1000,s=10,ϵ=0.6,δ=(2n)1d=1000,s^{*}=10,\epsilon=0.6,\delta=(2n)^{-1}. Compare the results of Algorithm 5 when n=4000,5000,6000n=4000,5000,6000, respectively.

  • Fix n=5000,d=1000,ϵ=0.6,δ=(2n)1n=5000,d=1000,\epsilon=0.6,\delta=(2n)^{-1}. Compare the results of Algorithm 5 when s=5,10,15s^{*}=5,10,15, respectively.

  • Fix n=5000,d=1000,s=10,δ=(2n)1n=5000,d=1000,s^{*}=10,\delta=(2n)^{-1}. Compare the results of Algorithm 5 when ϵ=0.4,0.6,0.8\epsilon=0.4,0.6,0.8, respectively.

For each setting, we repeat the experiment 50 times and report the average error 𝜷t𝜷2\|{\bm{\beta}}^{t}-{\bm{\beta}}^{*}\|_{2}. For each experiment, the initialization 𝜷0{\bm{\beta}}^{0} is chosen to be close to the true 𝜷{\bm{\beta}}^{*} and the step size η\eta is set to 0.5. The results are shown in Figure 4.
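As with the Gaussian mixture simulation in Section 6.1, this data-generating process can be sketched in a few lines of Python (a minimal sketch with NumPy; the experiments then run Algorithm 5 on the generated data):

```python
import numpy as np

def simulate_mixture_regression(n, d, s_star, sigma=0.5, rng=None):
    """Generate data from the high-dimensional mixture of regression model:
    y_i = z_i * <x_i, beta*> + e_i with x_i ~ N(0, I_d), z_i = +/-1 w.p. 1/2
    and e_i ~ N(0, sigma^2)."""
    rng = np.random.default_rng() if rng is None else rng
    beta_star = np.zeros(d)
    beta_star[:s_star] = 1.0 / np.sqrt(s_star)
    X = rng.normal(size=(n, d))
    z = rng.choice([-1.0, 1.0], size=n)
    e = rng.normal(0.0, sigma, size=n)
    y = z * (X @ beta_star) + e
    return X, y, beta_star
```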

Figure 4: The average estimation error under different settings in the high-dimensional mixture of regression model.

From the results in Figure 4, we also discover the relationship between different choices of n,s,ϵn,s^{*},\epsilon and the performance of the proposed DP EM algorithm on the mixture of regression model. Similar to the results under the Gaussian mixture model, with larger nn, smaller ss^{*} and larger ϵ\epsilon, the estimator 𝜷^\hat{{\bm{\beta}}} has a smaller error in estimating 𝜷{\bm{\beta}}^{*}.

A.4 Results for specific models in Low-dimensional settings

In this subsection, we present results for the DP algorithms for the mixture of regression model and the regression with missing covariates model in the low-dimensional setting. We list the algorithms and theorems below. Since the proofs for these two specific models are highly similar to the proofs of Theorems 4.4 and 4.6 and Propositions 4.5 and 4.7 in the high-dimensional setting, and to those of Theorem 5.5 and Proposition 5.6 for the low-dimensional Gaussian mixture model, we omit them here.

A.4.1 Algorithm and theories for Mixture of Regression Model in low-dimensional settings

The algorithm is listed below:

1:Private parameters (ϵ,δ)(\epsilon,\delta), step size η\eta, truncation level TT, maximum number of iterations N0N_{0}.
2:Initialization: 𝜷0\bm{\beta}^{0}
3:For t=0,1,2,N01t=0,1,2,...N_{0}-1:
4: Compute 𝜷t+1=𝜷t+ηfT(Qn/N0(𝜷t;𝜷t))+Wt\bm{\beta}^{t+1}=\bm{\beta}^{t}+\eta f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))+W_{t}, where WtW_{t} is a random vector (ξ1,ξ2,ξd)(\xi_{1},\xi_{2},...\xi_{d})^{\top} and ξ1,ξ2,ξd\xi_{1},\xi_{2},...\xi_{d} are i.i.d. samples drawn from N(0,2η2d(4T2)2N02log(1.25/δ)n2ϵ2)N(0,\frac{2\eta^{2}d(4T^{2})^{2}{N_{0}}^{2}\log(1.25/\delta)}{n^{2}\epsilon^{2}}).
5:End For
6:𝜷^=𝜷N0\hat{\bm{\beta}}=\bm{\beta}^{N_{0}}
Algorithm 7 Low-Dimensional DP EM Algorithm for the Mixture of Regression Model

Here the truncation of the gradient is the same as the truncation in Algorithm 5.

Theorem A.3.

For Algorithm 7 in the mixture of regression model, we define ={𝛃:𝛃𝛃2R}\mathcal{B}=\{\bm{\beta}:||\bm{\beta}-\bm{\beta}^{*}||_{2}\leq R\} and assume 𝛃0𝛃2R/2||\bm{\beta}^{0}-\bm{\beta}^{*}||_{2}\leq R/2. Define R,μ,ν,γR,\mu,\nu,\gamma as in Lemma A.1 and κ=γ\kappa=\gamma. For the choice of parameters, the step size is chosen as η=1\eta=1, the truncation level as TlognT\asymp\sqrt{\log n}, and the number of iterations as N0lognN_{0}\asymp\log n. We assume the sample size nn is sufficiently large that there exist constants K,KK,K^{\prime} such that nKd(logn)32log(1/δ)(1κ)Rϵn\geq K\cdot\frac{d(\log n)^{\frac{3}{2}}\sqrt{\log(1/\delta)}}{(1-\kappa)\cdot R\cdot\epsilon} and Kdn(logn)3/2(1γ)R/4K^{\prime}\cdot\sqrt{\frac{d}{n}}\cdot(\log n)^{3/2}\leq(1-\gamma)\cdot R/4. Then there exists a sufficiently large constant CC such that, with probability 1c0lognexp(c1d)c2/lognc3n1/21-c_{0}\log n\cdot\exp(-c_{1}d)-c_{2}/\log n-c_{3}n^{-1/2}:

𝜷^𝜷2\displaystyle||\hat{\bm{\beta}}-\bm{\beta}^{*}||_{2} Cd(logn)32log(1/δ)nϵ+η1κdn(logn)3/2.\displaystyle\leq C\cdot\frac{d\cdot(\log n)^{\frac{3}{2}}\sqrt{\log(1/\delta)}}{n\epsilon}+\frac{\eta}{1-\kappa}\cdot\sqrt{\frac{d}{n}}\cdot(\log n)^{3/2}.
Theorem A.4.

Let (Y,𝐗)={(y1,𝐱1),(y2,𝐱2),(yn,𝐱n)}(Y,\bm{X})=\{(y_{1},\bm{x}_{1}),(y_{2},\bm{x}_{2}),...(y_{n},\bm{x}_{n})\} be the data set of nn samples observed from the mixture of regression model discussed above, and let MM be any corresponding (ϵ,δ)(\epsilon,\delta)-differentially private algorithm for the estimation of the true parameter 𝛃\bm{\beta}^{*}. Then there exists a constant cc such that, if 0<ϵ<10<\epsilon<1 and δ<n(1+ω)\delta<n^{-(1+\omega)} for some fixed ω>0\omega>0, we have:

infMϵ,δsup𝜷d𝔼M(Y,𝑿)𝜷2c(dn+dnϵ).\inf_{M\in\mathcal{M}_{\epsilon,\delta}}\sup_{\bm{\beta}^{*}\in\mathbb{R}^{d}}\mathbb{E}\|{M(Y,\bm{X})}-\bm{\beta}^{*}\|_{2}\geq c\cdot(\sqrt{\frac{d}{n}}+\frac{d}{n\epsilon}).

A.4.2 Algorithm and theories for Regression with Missing Covariates in low dimension cases

The algorithm is listed below:

1:Private parameters (ϵ,δ)(\epsilon,\delta), step size η\eta, truncation level TT, maximum number of iterations N0N_{0}.
2:Initialization: 𝜷0\bm{\beta}^{0}
3:For t=0,1,2,N01t=0,1,2,...N_{0}-1:
4: Compute 𝜷t+1=𝜷t+ηfT(Qn/N0(𝜷t;𝜷t))+Wt\bm{\beta}^{t+1}=\bm{\beta}^{t}+\eta f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))+W_{t}, where WtW_{t} is a random vector (ξ1,ξ2,ξd)(\xi_{1},\xi_{2},...\xi_{d})^{\top} and ξ1,ξ2,ξd\xi_{1},\xi_{2},...\xi_{d} are i.i.d. samples drawn from N(0,2η2d(6T2)2N02log(1.25/δ)n2ϵ2)N(0,\frac{2\eta^{2}d(6T^{2})^{2}{N_{0}}^{2}\log(1.25/\delta)}{n^{2}\epsilon^{2}}).
5:End For
6:𝜷^=𝜷N0\hat{\bm{\beta}}=\bm{\beta}^{N_{0}}
Algorithm 8 Low-Dimensional DP EM Algorithm for Regression with Missing Covariates

Here the truncation of the gradient is the same as the truncation in Algorithm 6.

Theorem A.5.

For Algorithm 8, we define ={𝛃:𝛃𝛃2R}\mathcal{B}=\{\bm{\beta}:||\bm{\beta}-\bm{\beta}^{*}||_{2}\leq R\} and assume 𝛃0𝛃2R/2||\bm{\beta}^{0}-\bm{\beta}^{*}||_{2}\leq R/2. Define R,μ,ν,γR,\mu,\nu,\gamma as in Lemma A.2 and κ=γ\kappa=\gamma. For the choice of parameters, the step size is chosen as η=1\eta=1, the truncation level as TlognT\asymp\sqrt{\log n}, and the number of iterations as N0lognN_{0}\asymp\log n. We assume the sample size nn is sufficiently large that there exist constants K,KK,K^{\prime} such that nKd(logn)32log(1/δ)(1κ)Rϵn\geq K\cdot\frac{d(\log n)^{\frac{3}{2}}\sqrt{\log(1/\delta)}}{(1-\kappa)\cdot R\cdot\epsilon} and Kdn(logn)3/2(1γ)R/4K^{\prime}\cdot\sqrt{\frac{d}{n}}\cdot(\log n)^{3/2}\leq(1-\gamma)\cdot R/4. Then there exists a sufficiently large constant CC such that, with probability 1c0lognexp(c1d)c2/lognc3n1/21-c_{0}\log n\cdot\exp(-c_{1}d)-c_{2}/\log n-c_{3}n^{-1/2}:

𝜷^𝜷2\displaystyle||\hat{\bm{\beta}}-\bm{\beta}^{*}||_{2} Cd(logn)32log(1/δ)nϵ+η1κdn(logn)3/2.\displaystyle\leq C\cdot\frac{d\cdot(\log n)^{\frac{3}{2}}\sqrt{\log(1/\delta)}}{n\epsilon}+\frac{\eta}{1-\kappa}\cdot\sqrt{\frac{d}{n}}\cdot(\log n)^{3/2}.
Theorem A.6.

Let (Y,𝐗)={(y1,𝐱1),(y2,𝐱2),(yn,𝐱n)}(Y,\bm{X})=\{(y_{1},\bm{x}_{1}),(y_{2},\bm{x}_{2}),...(y_{n},\bm{x}_{n})\} be the data set of nn samples observed from the regression with missing covariates model discussed above, and let MM be any corresponding (ϵ,δ)(\epsilon,\delta)-differentially private algorithm for the estimation of the true parameter 𝛃\bm{\beta}^{*}. Then there exists a constant cc such that, if 0<ϵ<10<\epsilon<1 and δ<n(1+ω)\delta<n^{-(1+\omega)} for some fixed ω>0\omega>0, we have:

infMϵ,δsup𝜷d𝔼M(Y,𝑿)𝜷2c(dn+dnϵ).\inf_{M\in\mathcal{M}_{\epsilon,\delta}}\sup_{\bm{\beta}^{*}\in\mathbb{R}^{d}}\mathbb{E}\|{M(Y,\bm{X})}-\bm{\beta}^{*}\|_{2}\geq c\cdot(\sqrt{\frac{d}{n}}+\frac{d}{n\epsilon}).

Therefore, both the low-dimensional DP EM algorithm for the mixture of regression model and the one for regression with missing covariates attain a near-optimal rate of convergence up to logarithm factors.

Appendix B Proofs of main results

B.1 Proof of Theorem 3.8

With the step size η\eta chosen as stated, the privacy guarantee follows from Lemma 3.3. We now turn to the statistical analysis. For simplicity, we denote n0=n/N0n_{0}=n/N_{0}. During the tt-th iteration, we can write the two iterated steps as follows:

𝜷t+0.5=𝜷t+ηfT(Qn0(𝜷t;𝜷t)),𝜷t+1=trunc(𝜷t+0.5,𝒮t+0.5)+𝑾Lapt.\bm{\beta}^{t+0.5}=\bm{\beta}^{t}+\eta\cdot f_{T}(\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})),\quad\bm{\beta}^{t+1}=\text{trunc}(\bm{\beta}^{t+0.5},\mathcal{S}^{t+0.5})+\bm{W}_{Lap}^{t}. (9)

where 𝒮t+0.5\mathcal{S}^{t+0.5} is the set of indices selected by the private peeling algorithm, and 𝑾Lapt\bm{W}_{Lap}^{t} is the vector of Laplace noises supported on 𝒮t+0.5\mathcal{S}^{t+0.5}. Furthermore, we introduce the following notation:

𝜷¯t+0.5=𝜷t+ηQ(𝜷t;𝜷t),𝜷¯t+1=trunc(𝜷¯t+0.5,𝒮t+0.5).\bar{\bm{\beta}}^{t+0.5}=\bm{\beta}^{t}+\eta\cdot\nabla Q(\bm{\beta}^{t};\bm{\beta}^{t}),\quad\bar{\bm{\beta}}^{t+1}=\text{trunc}(\bar{\bm{\beta}}^{t+0.5},\mathcal{S}^{t+0.5}). (10)

Before continuing the proof, we first introduce two lemmas.

Lemma B.1.

Suppose we have

𝜷¯t+0.5𝜷2L𝜷2,\|\bar{\bm{\beta}}^{t+0.5}-\bm{\beta}^{*}\|_{2}\leq L\|\bm{\beta}^{*}\|_{2}, (11)

for some L(0,1)L\in(0,1). Also, suppose the sparsity level satisfies s^4(1+L)2(1L)2s\hat{s}\geq\frac{4\cdot(1+L)^{2}}{(1-L)^{2}}\cdot s^{*}. We assume that α=o(1)\alpha=o(1) and s^α(1L)22η(1+L)𝛃2\sqrt{\hat{s}}\cdot\alpha\leq\frac{(1-L)^{2}}{2\eta(1+L)}\cdot||\bm{\beta}^{*}||_{2}. We further assume slogdn(logn)32=o(1)\frac{s^{*}\log d}{n}\cdot(\log n)^{\frac{3}{2}}=o(1). Then there exists a constant K1>0K_{1}>0 such that

𝜷¯t+1𝜷2(1+4s/s^)12𝜷¯t+0.5𝜷2+K1s1Lηα.\|\bar{\bm{\beta}}^{t+1}-\bm{\beta}^{*}\|_{2}\leq(1+4\sqrt{{s^{*}}/\hat{s}})^{\frac{1}{2}}\|\bar{\bm{\beta}}^{t+0.5}-\bm{\beta}^{*}\|_{2}+\frac{K_{1}\sqrt{s^{*}}}{\sqrt{1-L}}{\cdot\eta\cdot\alpha}. (12)

with probability 1τN0ϕ(ξ)m0slognexp(m1logd)1-\tau-N_{0}\phi(\xi)-m_{0}\cdot s^{*}\cdot\log n\cdot\exp(-m_{1}\log d), for all tt in 0,1,2,,N010,1,2,...,N_{0}-1.

Lemma B.2.

For ν\nu, μ\mu and γ\gamma defined in Theorem 3.8, the following inequality holds:

𝜷¯t+0.5𝜷2(12νγν+μ)𝜷t𝜷2.||\bar{\bm{\beta}}^{t+0.5}-\bm{\beta}^{*}||_{2}\leq(1-2\cdot\frac{\nu-\gamma}{\nu+\mu})\cdot||{\bm{\beta}}^{t}-\bm{\beta}^{*}||_{2}. (13)

The detailed proof of Lemma B.1 is in Appendix C.6, and Lemma B.2 follows from Lemma 5.2 of [Wang et al., 2015]. Then, by the two lemmas above:

𝜷t+1𝜷2\displaystyle\|\bm{\beta}^{t+1}-\bm{\beta}^{*}\|_{2}
trunc(𝜷t+0.5,𝒮t+0.5)𝜷2+𝑾Lapt2\displaystyle\leq\|\text{trunc}(\bm{\beta}^{t+0.5},\mathcal{S}^{t+0.5})-\bm{\beta}^{*}\|_{2}+\|\bm{W}_{Lap}^{t}\|_{2}
trunc(𝜷t+0.5,𝒮t+0.5)trunc(𝜷¯t+0.5,𝒮t+0.5)2+trunc(𝜷¯t+0.5,𝒮t+0.5)𝜷2+𝑾Lapt2\displaystyle\leq\|\text{trunc}(\bm{\beta}^{t+0.5},\mathcal{S}^{t+0.5})-\text{trunc}(\bar{\bm{\beta}}^{t+0.5},\mathcal{S}^{t+0.5})\|_{2}+\|\text{trunc}(\bar{\bm{\beta}}^{t+0.5},\mathcal{S}^{t+0.5})-\bm{\beta}^{*}\|_{2}+\|\bm{W}_{Lap}^{t}\|_{2}
=trunc(𝜷t+0.5,St+0.5)trunc(𝜷¯t+0.5,𝒮t+0.5)2(1)+𝜷¯t+1𝜷2(2)+𝑾Lapt2.\displaystyle=\underbrace{\|\text{trunc}(\bm{\beta}^{t+0.5},S^{t+0.5})-\text{trunc}(\bar{\bm{\beta}}^{t+0.5},\mathcal{S}^{t+0.5})\|_{2}}_{(1)}+\underbrace{\|\bar{\bm{\beta}}^{t+1}-\bm{\beta}^{*}\|_{2}}_{(2)}+\|\bm{W}_{Lap}^{t}\|_{2}. (14)

First, we notice that for the term (1) in (B.1):

trunc(𝜷t+0.5,𝒮t+0.5)trunc(𝜷¯t+0.5,𝒮t+0.5)2\displaystyle\|\text{trunc}(\bm{\beta}^{t+0.5},\mathcal{S}^{t+0.5})-\text{trunc}(\bar{\bm{\beta}}^{t+0.5},\mathcal{S}^{t+0.5})\|_{2}
=(ηfT(Qn0(𝜷t;𝜷t))ηQ(𝜷t;𝜷t))𝒮t+0.52\displaystyle=\|(\eta f_{T}(\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))-\eta\nabla Q(\bm{\beta}^{t};\bm{\beta}^{t}))_{\mathcal{S}^{t+0.5}}\|_{2}
(ηfT(Qn0(𝜷t;𝜷t))ηQn0(𝜷t;𝜷t))𝒮t+0.52+(ηQn0(𝜷t;𝜷t)ηQ(𝜷t;𝜷t))𝒮t+0.52\displaystyle\leq\|(\eta f_{T}(\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))-\eta\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))_{\mathcal{S}^{t+0.5}}\|_{2}+\|(\eta\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-\eta\nabla Q(\bm{\beta}^{t};\bm{\beta}^{t}))_{\mathcal{S}^{t+0.5}}\|_{2}
ηfT(Qn0(𝜷t;𝜷t))Qn0(𝜷t;𝜷t)2+ηs^Qn0(𝜷t;𝜷t)Q(𝜷t;𝜷t)\displaystyle\leq\eta\|f_{T}(\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))-\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})\|_{2}+\eta\sqrt{\hat{s}}\cdot\|\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-\nabla Q(\bm{\beta}^{t};\bm{\beta}^{t})\|_{\infty}
ηfT(Qn0(𝜷t;𝜷t))Qn0(𝜷t;𝜷t)2+ηs^α.\displaystyle\leq\eta\|f_{T}(\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))-\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})\|_{2}+\eta\sqrt{\hat{s}}\cdot\alpha. (15)

Second, for the term (2) in (B.1), by Lemma B.1 and Lemma B.2, we have:

𝜷¯t+1𝜷2\displaystyle{}\|\bar{\bm{\beta}}^{t+1}-\bm{\beta}^{*}\|_{2} (1+4s/s^)12𝜷¯t+0.5𝜷2+K1s1Lηα\displaystyle\leq(1+4\sqrt{{s^{*}}/\hat{s}})^{\frac{1}{2}}\|\bar{\bm{\beta}}^{t+0.5}-\bm{\beta}^{*}\|_{2}+\frac{K_{1}\sqrt{s^{*}}}{\sqrt{1-L}}{\cdot\eta\cdot\alpha}
(1+4s/s^)12(12νγν+μ)𝜷t𝜷2+K1s1Lηα.\displaystyle\leq(1+4\sqrt{{s^{*}}/\hat{s}})^{\frac{1}{2}}\cdot(1-2\cdot\frac{\nu-\gamma}{\nu+\mu})\cdot||{\bm{\beta}}^{t}-\bm{\beta}^{*}||_{2}+\frac{K_{1}\sqrt{s^{*}}}{\sqrt{1-L}}{\cdot\eta\cdot\alpha}. (16)

Third, for the term 𝑾Lapt2\|\bm{W}_{Lap}^{t}\|_{2}, note that for each i=1,2,di=1,2,...d, {WLapt}iLaplace(λ0)\{W_{Lap}^{t}\}_{i}\sim Laplace(\lambda_{0}). By the concentration of the Laplace distribution, for every C>1C>1, we have:

(WLapt22>s^C2λ02)s^eC.{\mathbb{P}}(\|W_{Lap}^{t}\|_{2}^{2}>\hat{s}C^{2}\lambda_{0}^{2})\leq\hat{s}e^{-C}. (17)

Here λ0=Clogns^log(1/δ)/(n0ϵ)\lambda_{0}=C^{\prime}\sqrt{\log n}\sqrt{\hat{s}\log(1/\delta)}/(n_{0}\cdot\epsilon). Taking ClogdC\asymp\log d, there exist constants C1,m2,m3C_{1},m_{2},m_{3} such that with probability 1m2sexp(m3logd)1-m_{2}\cdot s^{*}\cdot\exp(-m_{3}\log d), we have:

𝑾Lapt22C1(slogd)2log(1/δ)lognn02ϵ2.\|\bm{W}_{Lap}^{t}\|_{2}^{2}\leq C_{1}\cdot\frac{(s^{*}\log d)^{2}\log(1/\delta)\log n}{n_{0}^{2}\epsilon^{2}}. (18)

By a union bound and the choice of N0N_{0} to be N0lognN_{0}\asymp\log n, we find that with probability 1m2slognexp(m3logd)1-m_{2}\cdot s^{*}\cdot\log n\cdot\exp(-m_{3}\log d), we have:

maxt𝑾Lapt22C1(slogd)2log(1/δ)lognn02ϵ2.\max_{t}\|\bm{W}_{Lap}^{t}\|_{2}^{2}\leq C_{1}\cdot\frac{(s^{*}\log d)^{2}\log(1/\delta)\log n}{n_{0}^{2}\epsilon^{2}}. (19)

According to our assumptions, 𝑾Lapt2\|\bm{W}_{Lap}^{t}\|_{2} is o(1)o(1). Thus, combining the bounds on terms (1) and (2) with the bound on the Laplace noise, we find that:

𝜷t+1𝜷2\displaystyle\|\bm{\beta}^{t+1}-\bm{\beta}^{*}\|_{2} ηs^α+(1+4s/s^)12(12νγν+μ)𝜷t𝜷2+K1s1Lηα\displaystyle\leq\eta\sqrt{\hat{s}}\cdot\alpha+(1+4\sqrt{{s^{*}}/\hat{s}})^{\frac{1}{2}}\cdot(1-2\cdot\frac{\nu-\gamma}{\nu+\mu})\cdot||{\bm{\beta}}^{t}-\bm{\beta}^{*}||_{2}+\frac{K_{1}\sqrt{s^{*}}}{\sqrt{1-L}}{\cdot\eta\cdot\alpha}
+ηfT(Qn0(𝜷t;𝜷t))Qn0(𝜷t;𝜷t)2+𝑾Lapt2.\displaystyle+\eta\|f_{T}(\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))-\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})\|_{2}+\|\bm{W}_{Lap}^{t}\|_{2}. (20)

Denote κ=12νγν+μ\kappa=1-2\cdot\frac{\nu-\gamma}{\nu+\mu}. Then, if βt\beta^{t}\in\mathcal{B}, by our assumption in the theorem that s^16(1/κ1)2s\hat{s}\geq 16\cdot(1/\kappa-1)^{-2}\cdot s^{*}, we have (1+4s/s^)1/21/κ(1+4\cdot\sqrt{s^{*}/\hat{s}})^{1/2}\leq 1/\sqrt{\kappa}, and thus (1+4s/s^)12(12νγν+μ)κ(1+4\sqrt{{s^{*}}/\hat{s}})^{\frac{1}{2}}\cdot(1-2\cdot\frac{\nu-\gamma}{\nu+\mu})\leq\sqrt{\kappa}.

On the other hand, by the assumptions we also have (s^+K1s1L)ηα(1κ)2R(\sqrt{\hat{s}}+\frac{K_{1}\sqrt{s^{*}}}{\sqrt{1-L}})\cdot\eta\cdot\alpha\leq(1-\sqrt{\kappa})^{2}\cdot R. Furthermore, by the assumptions, WLapt2\|W_{Lap}^{t}\|_{2} is o(1)o(1) for any tt, and with probability 1N0ϕ(ξ)1-N_{0}\cdot\phi(\xi), maxtfT(Qn0(𝜷t;𝜷t))Qn0(𝜷t;𝜷t)2<ξ\max_{t}\|f_{T}(\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))-\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})\|_{2}<\xi, where ξ=o(1)\xi=o(1). Therefore,

𝜷t+1𝜷2(1κ)2R+κRR,\displaystyle\|\bm{\beta}^{t+1}-\bm{\beta}^{*}\|_{2}\leq(1-\sqrt{\kappa})^{2}R+\sqrt{\kappa}R\leq R, (21)

which guarantees that if 𝜷t\bm{\beta}^{t}\in\mathcal{B}, then 𝜷t+1\bm{\beta}^{t+1}\in\mathcal{B} as well. Iterating this argument, we obtain the connection between 𝜷t𝜷2\|\bm{\beta}^{t}-\bm{\beta}^{*}\|_{2} and 𝜷0𝜷2\|\bm{\beta}^{0}-\bm{\beta}^{*}\|_{2}:

𝜷t𝜷2\displaystyle\|\bm{\beta}^{t}-\bm{\beta}^{*}\|_{2} (s^+K1/1Ls)η1κα+κt/2𝜷0𝜷2\displaystyle\leq\frac{(\sqrt{\hat{s}}+K_{1}/\sqrt{1-L}\cdot\sqrt{s^{*}})\cdot\eta}{1-\sqrt{\kappa}}\cdot\alpha+\kappa^{t/2}\cdot||{\bm{\beta}}^{0}-\bm{\beta}^{*}||_{2}
+ηξ/(1κ)+K2slogdlog(1/δ)(logn)3/2nϵ,\displaystyle+\eta\cdot\xi/(1-\sqrt{\kappa})+K_{2}\cdot\frac{s^{*}\log d\cdot\sqrt{\log(1/\delta)}(\log n)^{3/2}}{n\epsilon}, (22)

which finishes the proof for the theorem. \square

B.2 Proof of Theorem 4.2

The proof of Theorem 4.2 consists of three parts. The first part establishes the privacy guarantee. The second part verifies that the conditions on α\alpha are satisfied. The third part derives the convergence rate of Qn/N0(𝜷;𝜷)fT(Qn/N0(𝜷;𝜷))2||\nabla Q_{n/N_{0}}(\bm{\beta};\bm{\beta})-f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta};\bm{\beta}))||_{2} for any 𝜷\bm{\beta} on the iteration path. We begin with the first part.

For the privacy guarantee, consider two adjacent data sets and let yiy_{i} and yi{y_{i}}^{\prime} denote the samples in which they differ.

fT(Qn/N0(𝜷;𝜷))fT(Qn/N0(𝜷;𝜷))\displaystyle||f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta};\bm{\beta}))-f_{T}(\nabla{Q_{n/N_{0}}}^{\prime}(\bm{\beta};\bm{\beta}))||_{\infty}
=N0n[2w𝜷(yi)1]ΠT(yi)N0n[2w𝜷(yi)1]ΠT(yi)\displaystyle=||\frac{N_{0}}{n}\cdot[2w_{\bm{\beta}}(y_{i})-1]\cdot\Pi_{T}(y_{i})-\frac{N_{0}}{n}\cdot[2w_{\bm{\beta}}({y_{i}}^{\prime})-1]\cdot\Pi_{T}({y_{i}}^{\prime})||_{\infty}
N0n[ΠT(yi)+ΠT(yi)]\displaystyle\leq\frac{N_{0}}{n}\cdot[||\Pi_{T}(y_{i})||_{\infty}+||\Pi_{T}({y_{i}}^{\prime})||_{\infty}]
2TN0n.\displaystyle\leq\frac{2T\cdot N_{0}}{n}. (23)

Then, by Lemma 3.3, we can claim that the DP algorithm for the Gaussian mixture model is (ϵ,δ)(\epsilon,\delta)-differentially private.

Also, for the conditions on α\alpha in Theorem 3.8, we find that (s^+c1/1Ls)ηα(\sqrt{\hat{s}}+c_{1}/\sqrt{1-L}\cdot\sqrt{s^{*}})\cdot\eta\cdot\alpha is of order O(slogdlogn/n)O(\sqrt{s^{*}\cdot\log d\cdot\log n/n}), and by the assumption that slogdn(logn)52=o(1)\frac{s^{*}\log d}{n}\cdot(\log n)^{\frac{5}{2}}=o(1), it is in fact o(1)o(1); thus the condition (s^+c1/1Ls)ηαmin((1κ)2R,(1L)22(1+L)𝜷2)(\sqrt{\hat{s}}+c_{1}/\sqrt{1-L}\cdot\sqrt{s^{*}})\cdot\eta\cdot\alpha\leq\min((1-\sqrt{\kappa})^{2}\cdot R,\frac{(1-L)^{2}}{2\cdot(1+L)}\cdot\|\bm{\beta}^{*}\|_{2}) is satisfied. Therefore, we can find a constant CC such that:

(s^+c1/1Ls)η1καCslogdlognn.\frac{(\sqrt{\hat{s}}+c_{1}/\sqrt{1-L}\cdot\sqrt{s^{*}})\cdot\eta}{1-\sqrt{\kappa}}\cdot\alpha\leq C\cdot\sqrt{\frac{s^{*}\cdot\log d\cdot\log n}{n}}. (24)

Then, to finish the proof, it suffices to show that for each 𝜷t\bm{\beta}^{t}, t=0,1,2,N01t=0,1,2,...N_{0}-1, the quantity Qn/N0(𝜷t;𝜷t)fT(Qn/N0(𝜷t;𝜷t))2||\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))||_{2} is O(slogdlognn)O(\sqrt{\frac{s^{*}\cdot\log d\cdot\log n}{n}}); we can then follow the proof of Theorem 3.8 to finish the proof of Theorem 4.2.

By the third claim in Lemma 4.1, for any t=0,1,2,,N01t=0,1,2,\cdots,N_{0}-1, if we choose ξ=O(slogdlognn)\xi=O(\sqrt{\frac{s^{*}\cdot\log d\cdot\log n}{n}}), then for a constant CC^{\prime}:

P(Qn/N0(𝜷t;𝜷t)fT(Qn/N0(𝜷t;𝜷t))2>ξ)C1logdlogn.P(||\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))||_{2}>\xi)\leq C^{\prime}\cdot\frac{1}{\log d\cdot\log n}.

Furthermore, since N0N_{0} is chosen to be O(logn)O(\log n), we can apply a union bound over t=0,1,N01t=0,1,...N_{0}-1 and claim that, with probability 1C/logd1-C^{\prime}/\log d, Qn/N0(𝜷t;𝜷t)fT(Qn/N0(𝜷t;𝜷t))2=O(slogdlognn)||\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))||_{2}=O(\sqrt{\frac{s^{*}\cdot\log d\cdot\log n}{n}}) for all tt, which finishes the proof of Theorem 4.2. \square

B.3 Proof of Proposition 4.3

Let 𝒀={𝒚1,𝒚2,𝒚n}\bm{Y}=\{\bm{y}_{1},\bm{y}_{2},...\bm{y}_{n}\} be the data set of nn samples observed from the Gaussian mixture model and let MM be any corresponding (ϵ,δ)(\epsilon,\delta)-differentially private algorithm. Consider another model without the hidden variables ZZ, in which 𝒀=β+𝒆\bm{Y}=\beta^{*}+\bm{e}. Let 𝒀1={𝒚1,𝒚2,𝒚n}\bm{Y}_{1}=\{\bm{y}_{1}^{\prime},\bm{y}_{2}^{\prime},...\bm{y}_{n}^{\prime}\} be a data set of nn samples observed from the latter model and let M1M_{1} be any corresponding (ϵ,δ)(\epsilon,\delta)-differentially private algorithm. Then the estimation of the true 𝜷\bm{\beta}^{*} in the latter model is a mean-estimation problem. By Lemma 3.2 and Theorem 3.3 from [Cai et al., 2019b], we have:

infM1ϵ,δsup𝜷d,𝜷0s𝔼M1(𝒀1)𝜷2c(slogdn+slogdnϵ)\inf_{M_{1}\in\mathcal{M}_{\epsilon,\delta}}\sup_{\bm{\beta}^{*}\in\mathbb{R}^{d},\|\bm{\beta}^{*}\|_{0}\leq s^{*}}\mathbb{E}\|{M_{1}(\bm{Y}_{1})}-\bm{\beta}^{*}\|_{2}\geq c\cdot(\sqrt{\frac{s^{*}\log d}{n}}+\frac{s^{*}\log d}{n\epsilon})

Since the model without hidden variables can be seen as a special case in which all hidden variables ZZ equal 1, we have:

infM0ϵ,δsup𝜷d,𝜷0s𝔼M(𝒀)𝜷2infM1ϵ,δsup𝜷d,𝜷0s𝔼M1(𝒀1)𝜷2\inf_{M_{0}\in\mathcal{M}_{\epsilon,\delta}}\sup_{\bm{\beta}^{*}\in\mathbb{R}^{d},\|\bm{\beta}^{*}\|_{0}\leq s^{*}}\mathbb{E}\|{M(\bm{Y})}-\bm{\beta}^{*}\|_{2}\geq\inf_{M_{1}\in\mathcal{M}_{\epsilon,\delta}}\sup_{\bm{\beta}^{*}\in\mathbb{R}^{d},\|\bm{\beta}^{*}\|_{0}\leq s^{*}}\mathbb{E}\|{M_{1}(\bm{Y}_{1})}-\bm{\beta}^{*}\|_{2}

Combining the two inequalities finishes the proof. \square

B.4 Proof of Theorem 4.4

Similar to the proof of Theorem 4.2, the proof consists of three parts: the privacy guarantee, the verification of the conditions on $\alpha$, and the convergence of $\max_{t}||\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))||_{2}$. The verification of the conditions on $\alpha$ is the same as in the proof of Theorem 4.2, so we omit it here.

For the privacy guarantee, consider two adjacent data sets and let $(\bm{x}_{i},y_{i})$ and $(\bm{x}_{i}^{\prime},y_{i}^{\prime})$ denote the records in which they differ. Then:

fT(Qn/N0(𝜷;𝜷))fT(Qn/N0(𝜷;𝜷))\displaystyle||f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta};\bm{\beta}))-f_{T}(\nabla{Q_{n/N_{0}}^{\prime}}(\bm{\beta};\bm{\beta}))||_{\infty}
=||N0n[2w𝜷(𝒙i,yi)ΠT(yi)ΠT(𝒙i)ΠT(𝒙i)ΠT(𝒙i𝜷)]\displaystyle=||\frac{N_{0}}{n}\cdot[2w_{\bm{\beta}}(\bm{x}_{i},y_{i})\cdot\Pi_{T}(y_{i})\cdot\Pi_{T}(\bm{x}_{i})-\Pi_{T}(\bm{x}_{i})\cdot\Pi_{T}(\bm{x}_{i}^{\top}\bm{\beta})]
N0n[2w𝜷(𝒙i,yi)ΠT(yi)ΠT(𝒙i)ΠT(𝒙i)ΠT(𝒙𝒊𝜷)]||\displaystyle-\frac{N_{0}}{n}\cdot[2w_{\bm{\beta}}(\bm{x}_{i}^{\prime},y_{i}^{\prime})\cdot\Pi_{T}(y_{i}^{\prime})\cdot\Pi_{T}(\bm{x}_{i}^{\prime})-\Pi_{T}(\bm{x}_{i}^{\prime})\cdot\Pi_{T}(\bm{x_{i}^{\prime}}^{\top}\bm{\beta})]||_{\infty}
N0nΠT(yi)ΠT(𝒙i)ΠT(yi)ΠT(𝒙i)+N0nΠT(𝒙i)ΠT(𝒙i𝜷)ΠT(𝒙i)ΠT(𝒙𝒊𝜷)\displaystyle\leq\frac{N_{0}}{n}\|\Pi_{T}(y_{i})\cdot\Pi_{T}(\bm{x}_{i})-\Pi_{T}(y_{i}^{\prime})\cdot\Pi_{T}(\bm{x}_{i}^{\prime})\|_{\infty}+\frac{N_{0}}{n}\|\Pi_{T}(\bm{x}_{i})\cdot\Pi_{T}(\bm{x}_{i}^{\top}\bm{\beta})-\Pi_{T}(\bm{x}_{i}^{\prime})\cdot\Pi_{T}(\bm{x_{i}^{\prime}}^{\top}\bm{\beta})\|_{\infty}
4T2N0n.\displaystyle\leq\frac{4T^{2}\cdot N_{0}}{n}. (25)
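The sensitivity bound (25) is what calibrates the Gaussian noise injected at each iteration. As a rough numerical illustration (a sketch only: the helper names and the use of the standard Gaussian-mechanism calibration $\sigma=\Delta\sqrt{2\log(1.25/\delta)}/\epsilon$ are our assumptions, not the exact constants of the algorithm in the main text), one can compute the per-coordinate sensitivity $4T^{2}N_{0}/n$ and the corresponding noise scale as follows.

```python
import numpy as np

def linf_sensitivity_mor(T, n, N0):
    """Per-coordinate (l-infinity) sensitivity of the truncated
    mixture-of-regression gradient over one data fold, as bounded in (25)."""
    return 4.0 * T**2 * N0 / n

def gaussian_noise_scale(sensitivity, eps, delta):
    """Standard Gaussian-mechanism noise standard deviation for the given
    sensitivity; this exact calibration is an assumption for illustration."""
    return sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps

# Illustrative parameters: n samples, N0 ~ log n folds, truncation T ~ sqrt(log n).
n, eps, delta = 10_000, 1.0, 1e-5
N0 = int(np.ceil(np.log(n)))
T = np.sqrt(np.log(n))
Delta = linf_sensitivity_mor(T, n, N0)
sigma = gaussian_noise_scale(Delta, eps, delta)
print(f"sensitivity = {Delta:.2e}, per-coordinate noise sd = {sigma:.2e}")
```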

Next, let us find the convergence rate of the truncation error. By Lemma A.1, with probability $1-m_{0}/(\log n\cdot\log d)$, for any $t=0,1,2,\ldots,N_{0}-1$ we have:

(Qn0(𝜷t;𝜷t)fT(Qn0(𝜷t;𝜷t))2>Cslogdnlogn)C1/logdlogn,{\mathbb{P}}(||\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-f_{T}(\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))||_{2}>C\cdot\sqrt{\frac{s^{*}\log d}{n}}\cdot\log n)\leq C^{\prime}\cdot 1/{\log d\cdot\log n}, (26)

By a union bound over $t$, taking the number of iterations $N_{0}\asymp\log n$, we have:

(maxtQn0(𝜷t;𝜷t)fT(Qn0(𝜷t;𝜷t))2>Cslogdnlogn)C1/logd.{\mathbb{P}}(\max_{t}||\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-f_{T}(\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))||_{2}>C\cdot\sqrt{\frac{s^{*}\log d}{n}}\cdot\log n)\leq C^{\prime}\cdot 1/{\log d}. (27)

Thus, with probability $1-C^{\prime}/\log d$, for all $t$, $||\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))||_{2}=O(\sqrt{\frac{s^{*}\log d}{n}}\cdot\log n)$; then, following the proof of Theorem 3.8, we finish the proof of Theorem 4.4. \square

B.5 Proof of Proposition 4.5

Consider the traditional linear regression model without hidden variables $Z$, namely $Y=\bm{X}^{\top}\bm{\beta}^{*}+\bm{e}$. Let $(Y_{1},\bm{X}_{1})=\{(y_{1}^{\prime},\bm{x}_{1}^{\prime}),(y_{2}^{\prime},\bm{x}_{2}^{\prime}),\ldots,(y_{n}^{\prime},\bm{x}_{n}^{\prime})\}$ be a data set of $n$ samples observed from this linear regression model and let $M_{1}$ be any corresponding $(\epsilon,\delta)$-differentially private algorithm. Then, by Lemma 4.3 and Theorem 4.3 in [Cai et al., 2019b], we have:

infM1ϵ,δsup𝜷d,𝜷0s𝔼M1(Y1,𝑿1)𝜷2c(slogdn+slogdnϵ).\inf_{M_{1}\in\mathcal{M}_{\epsilon,\delta}}\sup_{\bm{\beta}^{*}\in\mathbb{R}^{d},\|\bm{\beta}^{*}\|_{0}\leq s^{*}}\mathbb{E}\|{M_{1}(Y_{1},\bm{X}_{1})}-\bm{\beta}^{*}\|_{2}\geq c\cdot(\sqrt{\frac{s^{*}\log d}{n}}+\frac{s^{*}\log d}{n\epsilon}).

Since the model without hidden variables can be seen as a special case in which all hidden variables $Z$ equal 1, we have:

infMϵ,δsup𝜷d,𝜷0s𝔼M(Y,𝑿)𝜷2infM1ϵ,δsup𝜷d,𝜷0s𝔼M1(Y1,𝑿1)𝜷2.\inf_{M\in\mathcal{M}_{\epsilon,\delta}}\sup_{\bm{\beta}^{*}\in\mathbb{R}^{d},\|\bm{\beta}^{*}\|_{0}\leq s^{*}}\mathbb{E}\|{M(Y,\bm{X})}-\bm{\beta}^{*}\|_{2}\geq\inf_{M_{1}\in\mathcal{M}_{\epsilon,\delta}}\sup_{\bm{\beta}^{*}\in\mathbb{R}^{d},\|\bm{\beta}^{*}\|_{0}\leq s^{*}}\mathbb{E}\|{M_{1}(Y_{1},\bm{X}_{1})}-\bm{\beta}^{*}\|_{2}.

Combining the two inequalities finishes the proof. \square

B.6 Proof of Theorem 4.6

The proof of Theorem 4.6 requires verifying two properties: the privacy guarantee and the convergence of $\max_{t}||\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))||_{2}$.

For the privacy guarantee, consider two adjacent data sets and let $(\bm{x}_{i},y_{i})$ and $(\bm{x}_{i}^{\prime},y_{i}^{\prime})$ denote the records in which they differ. For ease of notation, we write $n_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i},z_{i})=(1-\bm{z}_{i})\odot m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i})$ and $u_{i}=(1-\bm{z}_{i})\odot m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i})^{\top}\cdot\bm{\beta}$. Then:

fT(Qn/N0(𝜷;𝜷))fT(Qn/N0(𝜷;𝜷))\displaystyle||f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta};\bm{\beta}))-f_{T}(\nabla{Q_{n/N_{0}}^{\prime}}(\bm{\beta};\bm{\beta}))||_{\infty}
N0nΠT(yi)ΠT(mβ(𝒙~i,yi))ΠT(yi)ΠT(mβ(𝒙~i,yi))\displaystyle\leq\frac{N_{0}}{n}||\Pi_{T}(y_{i})\cdot\Pi_{T}(m_{\beta}(\tilde{\bm{x}}_{i},y_{i}))-\Pi_{T}(y_{i}^{\prime})\cdot\Pi_{T}(m_{\beta}(\tilde{\bm{x}}_{i}^{\prime},y_{i}^{\prime}))||_{\infty}
+N0nΠT(m𝜷(𝒙~i,yi))ΠT(m𝜷(𝒙~i,yi)𝜷)ΠT(m𝜷(𝒙~i,yi))ΠT(m𝜷(𝒙~i,yi)𝜷)\displaystyle+\frac{N_{0}}{n}||\Pi_{T}(m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i}))\cdot\Pi_{T}(m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i})^{\top}\cdot\bm{\beta})-\Pi_{T}(m_{\bm{\beta}}(\tilde{\bm{x}}_{i}^{\prime},y_{i}^{\prime}))\cdot\Pi_{T}(m_{\bm{\beta}}(\tilde{\bm{x}}_{i}^{\prime},y_{i}^{\prime})^{\top}\cdot\bm{\beta})||_{\infty}
+N0nΠT(n𝜷(𝒙~i,yi,zi))ΠT(ui)ΠT(n𝜷(𝒙~i,yi,zi))ΠT(ui)\displaystyle+\frac{N_{0}}{n}||\Pi_{T}(n_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i},z_{i}))\cdot\Pi_{T}(u_{i})-\Pi_{T}(n_{\bm{\beta}}(\tilde{\bm{x}}_{i}^{\prime},y_{i}^{\prime},z_{i}^{\prime}))\cdot\Pi_{T}(u_{i}^{\prime})||_{\infty}
6T2N0n\displaystyle\leq\frac{6T^{2}\cdot N_{0}}{n} (28)

From Lemma A.2, under the truncation condition, with probability $1-m_{0}/(\log n\cdot\log d)$, for any $t=0,1,2,\ldots,N_{0}-1$ we have:

P(Qn/N0(𝜷t;𝜷t)fT(Qn/N0(𝜷t;𝜷t))2>Cslogdnlogn)C1/logdlogn,P(||\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))||_{2}>C\cdot\sqrt{\frac{s^{*}\log d}{n}}\cdot\log n)\leq C^{\prime}\cdot 1/{\log d\cdot\log n}, (29)

Again, by a union bound over $t$, taking the number of iterations $N_{0}\asymp\log n$, we have:

P(maxtQn/N0(𝜷t;𝜷t)fT(Qn/N0(𝜷t;𝜷t))2>Cslogdnlogn)C1/logdP(\max_{t}||\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))||_{2}>C\cdot\sqrt{\frac{s^{*}\log d}{n}}\cdot\log n)\leq C^{\prime}\cdot 1/{\log d} (30)

Thus, with probability $1-C^{\prime}/\log d$, $\max_{t}||\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))||_{2}=O(\sqrt{\frac{s^{*}\log d}{n}}\cdot\log n)$; then, by Theorem 3.8, we finish the proof of Theorem 4.6. \square

B.7 Proof of Theorem 5.4

Before the proof, let us first introduce a lemma:

Lemma B.3.

Suppose $0\leq\gamma<\nu\leq\mu$ and that Conditions 3.4 and 3.5 hold with parameters $\gamma$ and $(\mu,\nu)$, respectively. Then, with step size $\eta=\frac{2}{\mu+\nu}$, we have:

𝜷+ηQ(𝜷;𝜷)𝜷2(12ν2γμ+ν)𝜷𝜷2.\displaystyle||\bm{\beta}+\eta\cdot\nabla{Q(\bm{\beta};\bm{\beta})}-\bm{\beta}^{*}||_{2}\leq(1-\frac{2\nu-2\gamma}{\mu+\nu})||\bm{\beta}-\bm{\beta}^{*}||_{2}.

The detailed proof of Lemma B.3 is given in Theorem 3 of [Balakrishnan et al., 2017]. Then, for $t=0,1,2,3,\ldots,N_{0}-1$, we have:

𝜷t+1𝜷2\displaystyle||\bm{\beta}^{t+1}-\bm{\beta}^{*}||_{2} =𝜷t+ηfT(Qn/N0(𝜷t;𝜷t))+𝑾t𝜷2\displaystyle=||\bm{\beta}^{t}+\eta f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))+\bm{W}_{t}-\bm{\beta}^{*}||_{2}
𝜷t+ηQ(𝜷t;𝜷t)𝜷2+ηfT(Qn/N0(𝜷t;𝜷t))Q(𝜷t;𝜷t)2+𝑾t2\displaystyle\leq||\bm{\beta}^{t}+\eta\nabla Q(\bm{\beta}^{t};\bm{\beta}^{t})-\bm{\beta}^{*}||_{2}+\eta||f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))-\nabla Q(\bm{\beta}^{t};\bm{\beta}^{t})||_{2}+||\bm{W}_{t}||_{2}
κ𝜷t𝜷2+𝑾t2+ηfT(Qn/N0(𝜷t;𝜷t))Qn/N0(𝜷t;𝜷t)2\displaystyle\leq\kappa||\bm{\beta}^{t}-\bm{\beta}^{*}||_{2}+||\bm{W}_{t}||_{2}+\eta||f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))-\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})||_{2}
+ηQn/N0(𝜷t;𝜷t)Q(𝜷t;𝜷t)2\displaystyle+\eta||\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-\nabla Q(\bm{\beta}^{t};\bm{\beta}^{t})||_{2}
κ𝜷t𝜷2+𝑾t2+ηfT(Qn/N0(𝜷t;𝜷t))Qn/N0(𝜷t;𝜷t)2+ηα.\displaystyle\leq\kappa||\bm{\beta}^{t}-\bm{\beta}^{*}||_{2}+||\bm{W}_{t}||_{2}+\eta||f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))-\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})||_{2}+\eta\cdot\alpha. (31)

In the above inequality, choose $T\asymp\sqrt{\log n}$. By assumption, with probability $1-N_{0}\cdot\phi(\xi)$, for every $t$ we have $||f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))-\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})||_{2}\leq\xi$. Also, by Condition 5.2, for a single $t$, with probability $1-\tau/N_{0}$, $||\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-\nabla Q(\bm{\beta}^{t};\bm{\beta}^{t})||_{2}\leq\alpha$; by a union bound over $t=0,1,2,\ldots,N_{0}-1$, with probability $1-\tau$, $\max_{t}||\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-\nabla Q(\bm{\beta}^{t};\bm{\beta}^{t})||_{2}\leq\alpha$. Now suppose $\bm{\beta}^{t}\in\mathcal{B}$, i.e., $||\bm{\beta}^{t}-\bm{\beta}^{*}||_{2}\leq R$. Then, if $\alpha$ satisfies $\alpha\leq\frac{\nu-\gamma}{4}\cdot R$, we have $\eta\cdot(\alpha+\xi)\leq\frac{1-\kappa}{2}\cdot R$.

On the other hand, choose $T\asymp\sqrt{\log n}$ and the number of iterations $N_{0}\asymp\log n$. Notice that each coordinate $W_{ti}$, $i=1,2,3,\ldots,d$, satisfies $W_{ti}\sim N(0,\sigma^{2})$ with $\sigma^{2}=\frac{2\eta^{2}d(2T)^{2}{N_{0}}^{2}\log(1.25/\delta)}{n^{2}\epsilon^{2}}$. Hence $||\bm{W}_{t}||_{2}^{2}/\sigma^{2}$ follows a chi-square distribution $\chi^{2}(d)$. By the concentration of the chi-square distribution, there exist constants $c_{0},c_{1},c_{2}$ such that:

(𝑾t22σ2(1+c1)d)c0exp(min{c12d,c1d}/8)c0exp(c2d).\displaystyle{\mathbb{P}}(||\bm{W}_{t}||_{2}^{2}\geq\sigma^{2}(1+c_{1})d)\leq c_{0}\exp(-\min\{c_{1}^{2}d,c_{1}d\}/8)\leq c_{0}\exp(-c_{2}d). (32)

Then, with probability $1-c_{0}\exp(-c_{2}d)$, there exists a constant $c_{3}$ such that:

𝑾t2c3dlog3nlog(1/δ)nϵ.||\bm{W}_{t}||_{2}\leq c_{3}\cdot\frac{d\sqrt{\log^{3}n}\sqrt{\log(1/\delta)}}{n\epsilon}.

By a union bound, with probability $1-c_{0}\cdot\log n\cdot\exp(-c_{2}d)$, there exists a constant $c_{4}$ such that:

maxt𝑾t2c4dlog3nlog(1/δ)nϵ.\max_{t}||\bm{W}_{t}||_{2}\leq c_{4}\cdot\frac{d\sqrt{\log^{3}n}\sqrt{\log(1/\delta)}}{n\epsilon}.

Then, we choose $n$ large enough that $\max_{t}||\bm{W}_{t}||_{2}\leq\frac{1-\kappa}{2}\cdot R$. Thus, whenever $||\bm{\beta}^{t}-\bm{\beta}^{*}||_{2}\leq R$, we also have $||\bm{\beta}^{t+1}-\bm{\beta}^{*}||_{2}\leq R$, so we can iterate the conclusion in (31) and obtain:

𝜷t𝜷2\displaystyle||\bm{\beta}^{t}-\bm{\beta}^{*}||_{2} κt𝜷0𝜷2+i=0t1κt1i𝑾i2+η1κ[ξ+α]\displaystyle\leq\kappa^{t}||\bm{\beta}^{0}-\bm{\beta}^{*}||_{2}+\sum_{i=0}^{t-1}\kappa^{t-1-i}||\bm{W}_{i}||_{2}+\frac{\eta}{1-\kappa}[\xi+\alpha]
κt2R+Cdlog3nlog(1/δ)nϵ+η1κ[ξ+α].\displaystyle\leq\frac{\kappa^{t}}{2}R+C\cdot\frac{d\sqrt{\log^{3}n}\sqrt{\log(1/\delta)}}{n\epsilon}+\frac{\eta}{1-\kappa}[\xi+\alpha].

\hfill\square
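To get a feel for the noise term controlled in the proof above, the following numerical sketch samples the injected noise vectors $\bm{W}_{t}$ with the variance $\sigma^{2}$ given in the proof and compares their norms with the claimed rate $d\sqrt{\log^{3}n}\sqrt{\log(1/\delta)}/(n\epsilon)$; the specific values of $n$, $d$, $\epsilon$, $\delta$ and $\eta$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters; T ~ sqrt(log n) and N0 ~ log n follow the text,
# while eta = 1 and the specific n, d, eps, delta are assumptions.
n, d, eps, delta, eta = 50_000, 20, 1.0, 1e-5, 1.0
N0 = int(np.ceil(np.log(n)))
T = np.sqrt(np.log(n))

# sigma^2 = 2 * eta^2 * d * (2T)^2 * N0^2 * log(1.25/delta) / (n^2 * eps^2)
sigma = np.sqrt(2 * eta**2 * d * (2 * T)**2 * N0**2 * np.log(1.25 / delta)) / (n * eps)

# Compare the empirical norms of W_t with the claimed rate (up to constants).
norms = np.linalg.norm(rng.normal(0.0, sigma, size=(1000, d)), axis=1)
rate = d * np.sqrt(np.log(n) ** 3) * np.sqrt(np.log(1 / delta)) / (n * eps)
print(f"95th percentile of ||W_t||_2 : {np.quantile(norms, 0.95):.3e}")
print(f"d*sqrt(log^3 n)*sqrt(log(1/delta))/(n*eps): {rate:.3e}")
```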

B.8 Proof of Theorem 5.5 and Proposition 5.6

The proof of Theorem 5.5 focuses on two parts. The first part is the convergence of $\alpha$, which has already been shown in Corollary 9 of [Balakrishnan et al., 2017]: when $T\asymp\sqrt{\log n}$, $\alpha=O(\sqrt{\frac{d}{n}}\cdot\log n)$. It remains to establish the convergence rate of the truncation error.

Following the proof of Lemma 4.1, we observe that in the low-dimensional setting, $\mathbb{E}||\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))||_{2}^{2}=O(\frac{d}{n})$. Thus, if we choose $\xi=O(\sqrt{\frac{d\cdot\log n}{n}})$, then there exists a constant $C^{\prime}$ such that:

(Qn/N0(𝜷t;𝜷t)fT(Qn/N0(𝜷t;𝜷t))2>ξ)C1logn{\mathbb{P}}(||\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))||_{2}>\xi)\leq C^{\prime}\cdot\frac{1}{\log n}

By the definition of $\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})$ in the Gaussian mixture model, the truncation error does not depend on the choice of $\bm{\beta}^{t}$; thus, with probability $1-C^{\prime}/\log n$, for all $t$, $||\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))||_{2}=O(\sqrt{\frac{d\cdot\log n}{n}})$, which finishes the proof of Theorem 5.5.

Then, for the proof of Proposition 5.6, we follow the same argument as in the proof of Proposition 4.3, except that instead of Lemma 3.2 and Theorem 3.3 from [Cai et al., 2019b], we use Lemma 3.1 and Theorem 3.1 from [Cai et al., 2019b]. \hfill\square
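The guarantees of Theorems 5.4 and 5.5 concern the noisy gradient EM recursion $\bm{\beta}^{t+1}=\bm{\beta}^{t}+\eta f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))+\bm{W}_{t}$ with data splitting. The sketch below illustrates this recursion for the symmetric two-component Gaussian mixture; the posterior weight $w_{\bm{\beta}}(y)=1/(1+\exp(-2\langle\bm{\beta},y\rangle/\sigma^{2}))$, the clipping-based truncation, and the noise constants are assumptions made for illustration and are not claimed to match the exact algorithm in the main text.

```python
import numpy as np

def dp_gradient_em_gmm(y, eps, delta, eta=1.0, sigma2=1.0, beta0=None, rng=None):
    """Minimal sketch of the data-split noisy gradient EM recursion
    beta^{t+1} = beta^t + eta * f_T(grad Q_{n/N0}) + W_t for the symmetric
    two-component Gaussian mixture.  The weight w_beta(y), the clipping used
    as the truncation f_T, and the noise constants are illustrative assumptions."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = y.shape
    N0 = max(int(np.ceil(np.log(n))), 1)   # number of iterations = number of folds
    T = np.sqrt(np.log(n))                 # truncation level, T ~ sqrt(log n)
    n0 = n // N0
    # Crude per-coordinate sensitivity 4*T*N0/n for the clipped per-sample term,
    # combined with the standard Gaussian-mechanism calibration (an assumption).
    noise_sd = eta * (4 * T * N0 / n) * np.sqrt(2 * np.log(1.25 / delta)) / eps
    beta = np.zeros(d) if beta0 is None else np.asarray(beta0, dtype=float).copy()
    for t in range(N0):
        fold = np.clip(y[t * n0:(t + 1) * n0], -T, T)           # Pi_T, coordinate-wise
        w = 1.0 / (1.0 + np.exp(-2.0 * fold @ beta / sigma2))   # E-step weights
        grad = (2.0 * w[:, None] * fold).mean(axis=0) - beta    # f_T(grad Q_{n0})
        beta = beta + eta * grad + rng.normal(0.0, noise_sd, size=d)
    return beta

# Toy run with beta* = (1,...,1)/sqrt(5).
rng = np.random.default_rng(1)
beta_star = np.ones(5) / np.sqrt(5)
z = rng.choice([-1.0, 1.0], size=20_000)
y = z[:, None] * beta_star + rng.normal(size=(20_000, 5))
print(np.round(dp_gradient_em_gmm(y, eps=2.0, delta=1e-5, beta0=0.6 * beta_star, rng=rng), 3))
```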

Appendix C Proof of lemmas

C.1 Proof of Lemma 3.2

Let ψ:R2R1\psi:R_{2}\to R_{1} be a bijection. By the selection criterion of Algorithm 1, for each jR2j\in R_{2} we have |vj|+wij|vψ(j)|+wiψ(j)|v_{j}|+w_{ij}\leq|v_{\psi(j)}|+w_{i\psi(j)}, where ii is the index of the iteration in which ψ(j)\psi(j) is appended to SS. It follows that, for every c>0c>0,

vj2\displaystyle v_{j}^{2} (|vψ(j)|+wiψ(j)wij)2\displaystyle\leq\left(|v_{\psi(j)}|+w_{i\psi(j)}-w_{ij}\right)^{2}
(1+c)vψ(j)2+(1+1/c)(wiψ(j)wij)2(1+c)vψ(j)2+4(1+1/c)𝒘i2\displaystyle\leq(1+c)v_{\psi(j)}^{2}+(1+1/c)(w_{i\psi(j)}-w_{ij})^{2}\leq(1+c)v_{\psi(j)}^{2}+4(1+1/c)\|\bm{w}_{i}\|_{\infty}^{2}

Summing over jj then leads to

𝒗R222(1+c)𝒗R122+4(1+1/c)i[s]𝒘i2,\displaystyle\|\bm{v}_{R_{2}}\|_{2}^{2}\leq(1+c)\|\bm{v}_{R_{1}}\|_{2}^{2}+4(1+1/c)\sum_{i\in[s]}\|\bm{w}_{i}\|^{2}_{\infty},

which finishes the proof of Lemma 3.2. \square
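Lemma 3.2 compares the noisy greedy selection $R_{1}$ with any competing index set $R_{2}$ of the same size. For concreteness, the following sketch shows the type of selection rule being analyzed, appending at step $i$ the index maximizing $|v_{j}|+w_{ij}$; the Laplace noise used here is only a placeholder assumption, since the actual noise distribution and scale are fixed by the NoisyHT algorithm (Algorithm 1) in the main text.

```python
import numpy as np

def noisy_top_s(v, s_hat, noise_scale, rng=None):
    """Greedy noisy selection of the kind analyzed in Lemma 3.2: at step i,
    append to S the not-yet-chosen index maximizing |v_j| + w_{ij}, where w_i
    is a fresh noise vector.  The Laplace noise here is a placeholder
    assumption; the actual distribution and scale are fixed by Algorithm 1."""
    rng = np.random.default_rng() if rng is None else rng
    selected = []
    for _ in range(s_hat):
        scores = np.abs(v) + rng.laplace(scale=noise_scale, size=len(v))  # |v_j| + w_{ij}
        scores[selected] = -np.inf                                        # exclude chosen indices
        selected.append(int(np.argmax(scores)))
    return selected

v = np.array([5.0, 0.1, -4.0, 0.2, 3.0, 0.0])
print(noisy_top_s(v, s_hat=3, noise_scale=0.5, rng=np.random.default_rng(0)))
```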

C.2 Proof of Lemma 3.3

To prove this result, we first show that each iteration is $(\epsilon,\delta)$-differentially private. Suppose the observed data points are $y_{1},y_{2},\ldots,y_{n}$, and that one data point, say $y_{n}$, is replaced by $\tilde{y}_{n}$. Defining $\nabla Q_{y_{n}}(\cdot;\cdot)$ as the gradient term associated with $y_{n}$, we can show that:

fT(Qn/N0(𝜷t;𝜷t))fT(Qn/N0(𝜷t;𝜷t))\displaystyle{||f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))-f_{T}(\nabla{Q_{n/N_{0}}^{\prime}(\bm{\beta}^{t};\bm{\beta}^{t}))||}}_{\infty} =N0nhT(Qyn(𝜷t;𝜷t))hT(Qy~n(𝜷t;𝜷t))\displaystyle=\frac{N_{0}}{n}||h_{T}(\nabla Q_{y_{n}}(\bm{\beta}^{t};\bm{\beta}^{t}))-h_{T}(\nabla Q_{\tilde{y}_{n}}(\bm{\beta}^{t};\bm{\beta}^{t}))||_{\infty}
2TN0n,\displaystyle\leq\frac{2T\cdot N_{0}}{n}, (33)

where the last inequality follows from the definition of $h_{T}$. Then, by Lemma 3.1, each iteration is $(\epsilon,\delta)$-differentially private. Next, we show that an iterative algorithm with data-splitting, each of whose iterations is $(\epsilon,\delta)$-differentially private, is itself $(\epsilon,\delta)$-differentially private. Let us start with the simple case of a two-step iterative algorithm.

Let $D$ denote the data set and $D^{\prime}$ be an adjacent data set of $D$. Assume the data set is split into two parts $D=D_{1}\cup D_{2}$ with $D_{1}\cap D_{2}=\varnothing$. Let $M_{1}(D_{1})$ be an $(\epsilon,\delta)$-differentially private algorithm with output $v$, and let $M_{2}(v,D_{2})$ be an $(\epsilon,\delta)$-differentially private algorithm for any given $v$. Define $M(D)=M_{2}(M_{1}(D_{1}),D_{2})$; we claim that $M(D)$ is also $(\epsilon,\delta)$-differentially private. To prove this claim, we use the definition of differential privacy. Since $D$ and $D^{\prime}$ are adjacent, they differ in only one individual record, so either $D^{\prime}=D_{1}^{\prime}\cup D_{2}$ or $D^{\prime}=D_{1}\cup D_{2}^{\prime}$. We discuss these two cases one by one. For the first case, $D^{\prime}=D_{1}^{\prime}\cup D_{2}$, from the definition we have:

(M(D)S)\displaystyle\mathbb{P}(M(D^{\prime})\in S) =(M2(M1(D1),D2)S)\displaystyle=\mathbb{P}(M_{2}(M_{1}(D_{1}^{\prime}),D_{2})\in S)
=𝔼M2(M2(M1(D1),D2)S|M2)\displaystyle=\mathbb{E}_{M_{2}}{\mathbb{P}}(M_{2}(M_{1}(D_{1}^{\prime}),D_{2})\in S|M_{2})
=𝔼M2(M1(D1)S(M2)|M2)\displaystyle=\mathbb{E}_{M_{2}}{\mathbb{P}}(M_{1}(D_{1}^{\prime})\in S(M_{2})|M_{2})
𝔼M2eϵ(M1(D1)S(M2)|M2)+δ\displaystyle\leq\mathbb{E}_{M_{2}}e^{\epsilon}{\mathbb{P}}(M_{1}(D_{1})\in S(M_{2})|M_{2})+\delta
eϵ(M2(M1(D1),D2)S)+δ\displaystyle\leq e^{\epsilon}{\mathbb{P}}(M_{2}(M_{1}(D_{1}),D_{2})\in S)+\delta
=eϵ(M(D)S)+δ\displaystyle=e^{\epsilon}{\mathbb{P}}(M(D)\in S)+\delta

From the definition, $M$ is $(\epsilon,\delta)$-differentially private in this case. For the second case, we have:

(M(D)S)\displaystyle{\mathbb{P}}(M(D^{\prime})\in S) =(M2(M1(D1),D2)S)\displaystyle={\mathbb{P}}(M_{2}(M_{1}(D_{1}),D_{2}^{\prime})\in S)
=𝔼M1P(M2(M1(D1),D2)S|M1)\displaystyle=\mathbb{E}_{M_{1}}P(M_{2}(M_{1}(D_{1}),D_{2}^{\prime})\in S|M_{1})
𝔼M1eϵ(M2(M1(D1),D2)S|M1)+δ\displaystyle\leq\mathbb{E}_{M_{1}}e^{\epsilon}{\mathbb{P}}(M_{2}(M_{1}(D_{1}),D_{2})\in S|M_{1})+\delta
=eϵ(M2(M1(D1),D2)S)+δ\displaystyle=e^{\epsilon}{\mathbb{P}}(M_{2}(M_{1}(D_{1}),D_{2})\in S)+\delta
=eϵ(M(D)S)+δ\displaystyle=e^{\epsilon}{\mathbb{P}}(M(D)\in S)+\delta

Hence, in the second case, $M$ is also $(\epsilon,\delta)$-differentially private. When the data set is split into $k=2,3,4,\ldots$ subsets, induction shows that for an iterative algorithm with data-splitting in which each iteration is $(\epsilon,\delta)$-differentially private, the combined algorithm is also $(\epsilon,\delta)$-differentially private, which finishes the proof of this lemma. \square
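The argument above is essentially parallel composition: each record appears in exactly one fold, so the per-iteration $(\epsilon,\delta)$ guarantee carries over to the whole pipeline. A minimal sketch of this data-splitting composition (with a toy Gaussian-mechanism step whose clipping range and noise scale are assumptions for illustration) is given below.

```python
import numpy as np

def split_and_compose(data, mechanisms):
    """Sketch of the data-splitting composition in Lemma 3.3: the data set is
    partitioned into disjoint folds and the k-th mechanism only sees fold k
    plus the (already privatized) output of the previous step, mirroring
    M(D) = M_2(M_1(D_1), D_2).  Each mechanism is assumed to be
    (eps, delta)-differentially private on its own fold."""
    folds = np.array_split(np.asarray(data), len(mechanisms))
    state = None
    for mech, fold in zip(mechanisms, folds):
        state = mech(state, fold)   # M_k(previous output, D_k)
    return state

# Toy example: each step is a clipped, noisy mean update on its own fold
# (the clipping range and noise scale are illustrative assumptions).
def noisy_mean_step(prev, fold, sigma=0.05, rng=np.random.default_rng(2)):
    est = float(np.clip(fold, -1.0, 1.0).mean()) + rng.normal(0.0, sigma)
    return est if prev is None else 0.5 * (prev + est)

data = np.random.default_rng(3).normal(0.3, 1.0, size=3000)
print(split_and_compose(data, [noisy_mean_step] * 5))
```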

C.3 Proof of Lemma 4.1

The proof consists of three parts. For the first proposition, the verification of Condition 3.4 and Condition 3.5, the proof is slightly different from the proof of Corollary 1 in [Balakrishnan et al., 2017]: for the Lipschitz-gradient condition, instead of the Lipschitz-Gradient-1$(\gamma_{1},\mathcal{B})$ condition in [Balakrishnan et al., 2017], we use Lipschitz-Gradient$(\gamma,\mathcal{B})$. However, the population-level $\nabla Q(\cdot;\cdot)$ satisfies:

Q(𝜷;𝜷)=2𝔼[w𝜷(Y)Y]𝜷.\nabla Q(\bm{\beta}^{\prime};\bm{\beta})=2\mathbb{E}[w_{\bm{\beta}}(Y)\cdot Y]-\bm{\beta}^{\prime}.

So obviously,

Q(𝜷;𝜷)Q(𝜷;𝜷)=2𝔼[(w𝜷w𝜷)(Y)Y].\nabla Q(\bm{\beta};\bm{\beta}^{*})-\nabla Q(\bm{\beta};\bm{\beta})=2\mathbb{E}[(w_{\bm{\beta}^{*}}-w_{\bm{\beta}})(Y)\cdot Y].

Also, let M(𝜷)=argmax𝜷Q(𝜷;𝜷)M(\bm{\beta})=\text{argmax}_{\bm{\beta}^{\prime}}Q(\bm{\beta}^{\prime};\bm{\beta}), then we find that,

Q(M(𝜷);𝜷)Q(M(𝜷);𝜷)=2𝔼[(w𝜷w𝜷)(Y)Y]\nabla Q(M(\bm{\beta});\bm{\beta}^{*})-\nabla Q(M(\bm{\beta});\bm{\beta})=2\mathbb{E}[(w_{\bm{\beta}^{*}}-w_{\bm{\beta}})(Y)\cdot Y]

so in the case of the Gaussian mixture model $\gamma_{1}=\gamma$, and following the proof of Corollary 1 in [Balakrishnan et al., 2017], the Lipschitz-Gradient$(\gamma,\mathcal{B})$ condition holds when taking $\gamma=\exp(-L\cdot\phi^{2})$. For the second proposition, Condition 3.6, see the detailed proof of Lemma 3.6 in [Wang et al., 2015]. For the third proposition, Condition 3.7, by the proof in (B.1) for Theorem 3.8 we only need to bound $||\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))||_{2}$ restricted to the support $\mathcal{S}^{t+0.5}$, where $\mathcal{S}^{t+0.5}$ is the set of indexes chosen by the NoisyHT algorithm during the $t$-th iteration. Thus, for any $\xi>0$, since:

Qn/N0(𝜷t;𝜷t)fT(Qn/N0(𝜷t;𝜷t))2\displaystyle||\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))||_{2} 1ni=1n[2w𝜷t(yi)1](ΠT(𝒚i)𝒚i)2\displaystyle\leq\frac{1}{n}||\sum_{i=1}^{n}[2w_{\bm{\beta}^{t}}(y_{i})-1](\Pi_{T}(\bm{y}_{i})-\bm{y}_{i})||_{2}
1ni=1nΠT(𝒚i)𝒚i2\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}||\Pi_{T}(\bm{y}_{i})-\bm{y}_{i}||_{2}

Thus, we have:

((Qn/N0(βt;βt)fT(Qn/N0(βt;βt)))𝒮t+0.5>ξ)\displaystyle{\mathbb{P}}(||(\nabla Q_{n/N_{0}}(\beta^{t};\beta^{t})-f_{T}(\nabla Q_{n/N_{0}}(\beta^{t};\beta^{t})))_{\mathcal{S}^{t+0.5}}||>\xi) (N0ni=1n/N0(ΠT(𝒚i)𝒚i)𝒮t+0.52>ξ)\displaystyle\leq{\mathbb{P}}(\frac{N_{0}}{n}\sum_{i=1}^{n/N_{0}}||(\Pi_{T}(\bm{y}_{i})-\bm{y}_{i})_{\mathcal{S}^{t+0.5}}||_{2}>\xi)
𝔼[N0ni=1n/N0(ΠT(𝒚i)𝒚i)𝒮t+0.52]2ξ2\displaystyle\leq\frac{\mathbb{E}[\frac{N_{0}}{n}\sum_{i=1}^{n/N_{0}}||(\Pi_{T}(\bm{y}_{i})-\bm{y}_{i})_{\mathcal{S}^{t+0.5}}||_{2}]^{2}}{\xi^{2}}
N0ni=1n/N0𝔼(ΠT(𝒚i)𝒚i)𝒮t+0.522ξ2\displaystyle\leq\frac{\frac{N_{0}}{n}\sum_{i=1}^{n/N_{0}}\mathbb{E}||(\Pi_{T}(\bm{y}_{i})-\bm{y}_{i})_{\mathcal{S}^{t+0.5}}||_{2}^{2}}{\xi^{2}}
𝔼(ΠT(𝒀)𝒀)𝒮t+0.522ξ2,\displaystyle\leq\frac{\mathbb{E}||(\Pi_{T}(\bm{Y})-\bm{Y})_{\mathcal{S}^{t+0.5}}||_{2}^{2}}{\xi^{2}}, (34)

where $\bm{Y}$ is the response of the Gaussian mixture model at the population level. Then, writing the squared norm coordinate-wise over $j=1,2,\ldots,d$, we obtain:

𝔼(ΠT(𝒀)𝒀)𝒮t+0.522=j=1d𝔼(ΠT(Yj)Yj)21j𝒮t+0.5.\mathbb{E}||(\Pi_{T}(\bm{Y})-\bm{Y})_{\mathcal{S}^{t+0.5}}||_{2}^{2}=\sum_{j=1}^{d}\mathbb{E}(\Pi_{T}(Y_{j})-Y_{j})^{2}\cdot\textbf{1}_{j\in\mathcal{S}^{t+0.5}}. (35)

Since for any $j\in\mathcal{S}^{t+0.5}$ we have ${\mathbb{P}}(Y_{j}\sim N(\beta_{j},\sigma^{2}))=\frac{1}{2}$ and ${\mathbb{P}}(Y_{j}\sim N(-\beta_{j},\sigma^{2}))=\frac{1}{2}$, denoting the density function of $N(\beta_{j},\sigma^{2})$ by $f_{j1}$, the density function of $N(-\beta_{j},\sigma^{2})$ by $f_{j2}$, and the density function of $N(0,\sigma^{2})$ by $f_{z}$, we get:

𝔼(ΠT(Yj)Yj)2\displaystyle\mathbb{E}(\Pi_{T}(Y_{j})-Y_{j})^{2}
=12𝔼[(ΠT(Yj)Yj)2|YjN(βj,σ2)]+12𝔼[(ΠT(Yj)Yj)2|YjN(βj,σ2)]\displaystyle=\frac{1}{2}\mathbb{E}[(\Pi_{T}(Y_{j})-Y_{j})^{2}|Y_{j}\sim N(\beta_{j},\sigma^{2})]+\frac{1}{2}\mathbb{E}[(\Pi_{T}(Y_{j})-Y_{j})^{2}|Y_{j}\sim N(-\beta_{j},\sigma^{2})]
=12[T(yT)2fj1𝑑y+T(y+T)2fj1𝑑y+T(yT)2fj2𝑑y+T(y+T)2fj2𝑑y]\displaystyle=\frac{1}{2}[\int_{T}^{\infty}(y-T)^{2}f_{j1}dy+\int_{-\infty}^{-T}(y+T)^{2}f_{j1}dy+\int_{T}^{\infty}(y-T)^{2}f_{j2}dy+\int_{-\infty}^{-T}(y+T)^{2}f_{j2}dy]
=12[T+βj(zβjT)2fzdz+T+βj(zβj+T)2fzdz+Tβj(z+βjT)2fzdz\displaystyle=\frac{1}{2}[\int_{T+\beta_{j}}^{\infty}(z-\beta_{j}-T)^{2}f_{z}dz+\int_{-\infty}^{-T+\beta_{j}}(z-\beta_{j}+T)^{2}f_{z}dz+\int_{T-\beta_{j}}^{\infty}(z+\beta_{j}-T)^{2}f_{z}dz
+Tβj(z+βj+T)2fzdz]\displaystyle+\int_{-\infty}^{-T-\beta_{j}}(z+\beta_{j}+T)^{2}f_{z}dz]
=T+βj(zβjT)2fz𝑑z+Tβj(z+βjT)2fz𝑑z.\displaystyle=\int_{T+\beta_{j}}^{\infty}(z-\beta_{j}-T)^{2}f_{z}dz+\int_{T-\beta_{j}}^{\infty}(z+\beta_{j}-T)^{2}f_{z}dz. (36)

For the first term in the above result (36), by Fubini's theorem:

\displaystyle\int_{T+\beta_{j}}^{\infty}(z-\beta_{j}-T)^{2}f_{z}dz =\int_{T+\beta_{j}}^{\infty}\int_{0}^{z-\beta_{j}-T}2t\,dt\,f_{z}dz
\displaystyle=2\int_{0}^{\infty}\int_{T+\beta_{j}+t}^{\infty}f_{z}dz\cdot t\,dt
\displaystyle=2\int_{0}^{\infty}{\mathbb{P}}(Z\geq T+\beta_{j}+t)\,t\,dt. (37)

By the tail bound of Gaussian distributions, we can have:

(ZT+βj+t)exp((T+βj+t)22σ2){\mathbb{P}}(Z\geq T+\beta_{j}+t)\leq\exp(-\frac{(T+\beta_{j}+t)^{2}}{2\sigma^{2}}) (38)

Inserting the tail bound (38) into (37), we obtain:

\displaystyle\int_{T+\beta_{j}}^{\infty}(z-\beta_{j}-T)^{2}f_{z}dz \leq 2\int_{T+\beta_{j}}^{\infty}\exp(-\frac{t^{2}}{2\sigma^{2}})(t-T-\beta_{j})dt
\displaystyle\leq\int_{T+\beta_{j}}^{\infty}\exp(-\frac{t^{2}}{2\sigma^{2}})dt^{2}
\displaystyle\leq 2\sigma^{2}\cdot\exp(-\frac{(T+\beta_{j})^{2}}{2\sigma^{2}}). (39)

If we choose $T=c\cdot\sigma\sqrt{\log n}$ and analyze the second term in (36) by a similar argument, we find that $\mathbb{E}(\Pi_{T}(Y_{j})-Y_{j})^{2}=O(\frac{1}{n})$. Then, according to (35), for sufficiently large $n$, $\mathbb{E}||(\Pi_{T}(\bm{Y})-\bm{Y})_{\mathcal{S}^{t+0.5}}||_{2}^{2}=O(\frac{s^{*}}{n})$. So if we choose $\xi=O(\sqrt{\frac{s^{*}\cdot\log d\cdot\log n}{n}})$, then there exists a constant $C^{\prime}$ such that:

(Qn/N0(𝜷t;𝜷t)fT(Qn/N0(𝜷t;𝜷t))2>ξ)C1logdlogn,{\mathbb{P}}(||\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))||_{2}>\xi)\leq C^{\prime}\cdot\frac{1}{\log d\cdot\log n},

which finishes the proof of the third proposition and thus completes the proof of Lemma 4.1. \square

C.4 Proof of Lemma A.1

For the first proposition in Lemma A.1, see the detailed proof in Corollary 3 of [Balakrishnan et al., 2017]; for the second proposition, see the detailed proof in Lemma 3.9 of [Wang et al., 2015]. In this section, our major focus is the third proposition, which verifies Condition 3.7 for the truncation error. Before starting the proof of the third proposition, we first introduce two lemmas that are central to the following analysis.

Lemma C.1.

Let $X$ be a sub-gaussian random variable on $\mathbb{R}$ with mean zero and variance $\sigma^{2}$. Then, for the choice $T\asymp\sigma\sqrt{\log n}$, we have $\mathbb{E}(\Pi_{T}(X)-X)^{2}=O(\frac{1}{n})$. Further, we also have $\mathbb{E}(\Pi_{T}(X)-X)^{4}=O(\frac{\log n}{n})$.

Lemma C.2.

Let $\bm{X}\in\mathbb{R}^{d}$ be a sub-gaussian random vector and $Y\in\mathbb{R}$ be a sub-gaussian random variable. Let $\bm{x}_{1},\bm{x}_{2},\ldots,\bm{x}_{n_{0}}$ be $n_{0}$ realizations of $\bm{X}$ and $y_{1},y_{2},\ldots,y_{n_{0}}$ be $n_{0}$ realizations of $Y$. Suppose we have an index set $\mathcal{S}$ with $|\mathcal{S}|=s$ and let $T\asymp\sqrt{\log n}$. Then there exist constants $C,m_{0}$ such that:

1n0i=1n0(yi𝒙iΠT(yi)ΠT(𝒙i))𝒮2Cslogdnlogn\frac{1}{n_{0}}\sum_{i=1}^{n_{0}}||(y_{i}\cdot\bm{x}_{i}-\Pi_{T}(y_{i})\Pi_{T}(\bm{x}_{i}))_{\mathcal{S}}||_{2}\leq C\cdot\sqrt{\frac{s\log d}{n}}\cdot\log n (40)

with probability greater than 1m0/(lognlogd)1-m_{0}/(\log n\cdot\log d).

The proofs of Lemma C.1 and Lemma C.2 can be found in Appendices C.7 and C.8. We now start the proof of Lemma A.1. As in the Gaussian mixture model, for each $t$ it suffices to bound the error restricted to the support $\mathcal{S}^{t+0.5}$ chosen by the NoisyHT algorithm. Writing $n_{0}=n/N_{0}$ for ease of notation, we can break $||\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-f_{T}(\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))||_{2}$ into two parts:

Qn0(𝜷t;𝜷t)fT(Qn0(𝜷t;𝜷t))2\displaystyle||\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-f_{T}(\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))||_{2}
=(Qn0(𝜷t;𝜷t)fT(Qn0(𝜷t;𝜷t)))𝒮t+0.52\displaystyle=||(\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-f_{T}(\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})))_{\mathcal{S}^{t+0.5}}||_{2}
1n0i=1n0(yi𝒙iΠT(yi)ΠT(𝒙i))𝒮t+0.52(C.4.1)+1n0i=1n0||(𝒙i𝒙i𝜷tΠT(𝒙i𝜷t)ΠT(𝒙i))𝒮t+0.5||2.(C.4.2)\displaystyle\leq\underbrace{\frac{1}{n_{0}}\sum_{i=1}^{n_{0}}||(y_{i}\cdot\bm{x}_{i}-\Pi_{T}(y_{i})\Pi_{T}(\bm{x}_{i}))_{\mathcal{S}^{t+0.5}}||_{2}}_{(\ref{peq29}.1)}+\underbrace{\frac{1}{n_{0}}\sum_{i=1}^{n_{0}}||(\bm{x}_{i}\cdot\bm{x}_{i}^{\top}\bm{\beta}^{t}-\Pi_{T}(\bm{x}_{i}^{\top}\bm{\beta}^{t})\Pi_{T}(\bm{x}_{i}))_{\mathcal{S}^{t+0.5}}||_{2}.}_{(\ref{peq29}.2)} (41)

Since $\bm{x}_{i}\sim N(\bm{0},\bm{I}_{d})$, $y_{i}\sim N(0,\bm{\beta}^{\top}\bm{\beta}+\sigma^{2})$ and $\bm{x}_{i}^{\top}\bm{\beta}^{t}\sim N(0,{\bm{\beta}^{t}}^{\top}\bm{\beta}^{t})$ are all Gaussian random variables, by Lemma C.2 both the term (C.4.1) and the term (C.4.2) are $O(\sqrt{\frac{s^{*}\log d}{n}}\cdot\log n)$ with probability $1-m_{0}/(\log n\cdot\log d)$. Therefore, for any $t=0,1,2,\ldots,N_{0}-1$, we have:

(Qn0(𝜷t;𝜷t)fT(Qn0(𝜷t;𝜷t))2>Cslogdnlogn)C1/logdlogn,{\mathbb{P}}(||\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-f_{T}(\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))||_{2}>C\cdot\sqrt{\frac{s^{*}\log d}{n}}\cdot\log n)\leq C^{\prime}\cdot 1/{\log d\cdot\log n}, (42)

which finishes the proof of the third proposition in Lemma A.1. \square

C.5 Proof of Lemma A.2

For the first proposition, the detailed proof is in Corollary 6 of [Balakrishnan et al., 2017]; for the second proposition of Lemma A.2, the proof is in Lemma 3.12 of [Wang et al., 2015]. We therefore focus on the proof of the third proposition.

Similar to the previous approach, for each $t$ we only need to bound the error restricted to the support $\mathcal{S}^{t+0.5}$. Denote $n_{0}=n/N_{0}$; we can then break the error into three parts. For ease of notation, we write $n_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i},z_{i})=(1-\bm{z}_{i})\odot m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i})$ and $u_{i}=(1-\bm{z}_{i})\odot m_{\bm{\beta}}(\tilde{\bm{x}}_{i},y_{i})^{\top}\cdot\bm{\beta}$.

Qn0(𝜷t;𝜷t)fT(Qn0(𝜷t;𝜷t))2\displaystyle||\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-f_{T}(\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))||_{2}
=(Qn0(𝜷t;𝜷t)fT(Qn0(𝜷t;𝜷t)))𝒮t+0.52\displaystyle=||(\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-f_{T}(\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})))_{\mathcal{S}^{t+0.5}}||_{2}
1n0i=1n0(ΠT(yi)ΠT(mβ(𝒙~i,yi))yimβ(𝒙~i,yi))𝒮t+0.52\displaystyle\leq\frac{1}{n_{0}}\sum_{i=1}^{n_{0}}||(\Pi_{T}(y_{i})\cdot\Pi_{T}(m_{\beta}(\tilde{\bm{x}}_{i},y_{i}))-y_{i}\cdot m_{\beta}(\tilde{\bm{x}}_{i},y_{i}))_{\mathcal{S}^{t+0.5}}||_{2} (43)
+1n0i=1n0(ΠT(nβ(𝒙~i,yi,zi))ΠT(ui)nβ(𝒙~i,yi,zi)ui)𝒮t+0.52\displaystyle+\frac{1}{n_{0}}\sum_{i=1}^{n_{0}}||(\Pi_{T}(n_{\beta}(\tilde{\bm{x}}_{i},y_{i},z_{i}))\cdot\Pi_{T}(u_{i})-n_{\beta}(\tilde{\bm{x}}_{i},y_{i},z_{i})\cdot u_{i})_{\mathcal{S}^{t+0.5}}||_{2} (44)
+1n0i=1n0(ΠT(mβ(𝒙~i,yi))ΠT(mβ(𝒙~i,yi)β)mβ(𝒙~i,yi)(mβ(𝒙~i,yi)β))𝒮t+0.52\displaystyle+\frac{1}{n_{0}}\sum_{i=1}^{n_{0}}||(\Pi_{T}(m_{\beta}(\tilde{\bm{x}}_{i},y_{i}))\cdot\Pi_{T}(m_{\beta}(\tilde{\bm{x}}_{i},y_{i})^{\top}\cdot\beta)-m_{\beta}(\tilde{\bm{x}}_{i},y_{i})\cdot(m_{\beta}(\tilde{\bm{x}}_{i},y_{i})^{\top}\cdot\beta))_{\mathcal{S}^{t+0.5}}||_{2} (45)

Then, to analyze (43), (44) and (45), we first introduce the following lemma.

Lemma C.3.

Under the conditions of Theorem 4.6, the random vector mβ(𝐱~i,yi)m_{\beta}(\tilde{\bm{x}}_{i},y_{i}) is sub-gaussian with a constant parameter.

The detailed proof of Lemma C.3 is contained in the proof of Lemma 10 in [Balakrishnan et al., 2017]. Then, by the definition of a sub-gaussian random vector, $m_{\beta}(\tilde{\bm{x}}_{i},y_{i})^{\top}\cdot\bm{\beta}$ is also sub-gaussian.

Further, for the term $n_{\beta}(\tilde{\bm{x}}_{i},y_{i},z_{i})=(1-\bm{z}_{i})\odot m_{\beta}(\tilde{\bm{x}}_{i},y_{i})$, note that for any unit vector $v\in\mathbb{R}^{d}$, $n_{\beta}(\tilde{\bm{x}}_{i},y_{i},z_{i})^{\top}\cdot v=m_{\beta}(\tilde{\bm{x}}_{i},y_{i})^{\top}\cdot[(1-\bm{z}_{i})\odot v]$, so $n_{\beta}(\tilde{\bm{x}}_{i},y_{i},z_{i})$ is also a sub-gaussian vector. Similarly, the $u_{i}$ are sub-gaussian random variables. So by Lemma C.2, the terms (43), (44) and (45) are all $O(\sqrt{\frac{s^{*}\log d}{n}}\cdot\log n)$ with probability $1-m_{0}/(\log n\cdot\log d)$. Therefore, for any $t=0,1,2,\ldots,N_{0}-1$, we have:

(Qn/N0(𝜷t;𝜷t)fT(Qn/N0(𝜷t;𝜷t))2>Cslogdnlogn)C1/logdlogn,{\mathbb{P}}(||\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-f_{T}(\nabla Q_{n/N_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))||_{2}>C\cdot\sqrt{\frac{s^{*}\log d}{n}}\cdot\log n)\leq C^{\prime}\cdot 1/{\log d\cdot\log n}, (46)

which completes the proof of Lemma A.2. \square

C.6 Proof of Lemma B.1

By assumption (11), we have

(1L)𝜷2𝜷¯t+0.52(1+L)𝜷2.(1-L)\|\bm{\beta}^{*}\|_{2}\leq\|\bar{\bm{\beta}}^{t+0.5}\|_{2}\leq(1+L)\|\bm{\beta}^{*}\|_{2}. (47)

Then, for simplicity, denote

𝜽¯=𝜷¯t+0.5𝜷¯t+0.52,𝜽=𝜷t+0.5𝜷¯t+0.52, and 𝜽=𝜷𝜷2.\bar{\bm{\theta}}=\frac{\bar{\bm{\beta}}^{t+0.5}}{\|\bar{\bm{\beta}}^{t+0.5}\|_{2}},\bm{\theta}=\frac{\bm{\beta}^{t+0.5}}{\|\bar{\bm{\beta}}^{t+0.5}\|_{2}}\text{, and }\bm{\theta}^{*}=\frac{\bm{\beta}^{*}}{\|\bm{\beta}^{*}\|_{2}}.

Both $\bar{\bm{\theta}}$ and $\bm{\theta}^{*}$ are unit vectors. Further, we define the sets $\mathcal{I}_{1},\mathcal{I}_{2}$ and $\mathcal{I}_{3}$ as follows:

1=𝒮\𝒮t+0.5,2=𝒮𝒮t+0.5, and 3=𝒮t+0.5\𝒮,\mathcal{I}_{1}=\mathcal{S}^{*}\backslash{\mathcal{S}}^{t+0.5},\mathcal{I}_{2}=\mathcal{S}^{*}\bigcap{\mathcal{S}}^{t+0.5}\text{, and }\mathcal{I}_{3}={\mathcal{S}}^{t+0.5}\backslash\mathcal{S}^{*},

where $\mathcal{S}^{*}=\text{supp}(\bm{\beta}^{*})$ is the support of $\bm{\beta}^{*}$ and $\mathcal{S}^{t+0.5}$ is the set of indexes chosen by the NoisyHT algorithm during the $t$-th iteration. Let $s_{i}=|\mathcal{I}_{i}|$ for $i=1,2,3$. Then, defining $\Delta=\langle\bar{\bm{\theta}},\bm{\theta}^{*}\rangle$, the following holds:

Δ=𝜽¯,𝜽=j𝒮𝜽¯jθj=j1𝜽¯jθj+j2𝜽¯jθj𝜽¯12𝜽12+𝜽¯22𝜽22.\Delta=\langle\bar{\bm{\theta}},\bm{\theta}^{*}\rangle=\sum_{j\in\mathcal{S}^{*}}\bar{\bm{\theta}}_{j}\theta_{j}^{*}=\sum_{j\in\mathcal{I}_{1}}\bar{\bm{\theta}}_{j}\theta_{j}^{*}+\sum_{j\in\mathcal{I}_{2}}\bar{\bm{\theta}}_{j}\theta_{j}^{*}\leq\|\bar{\bm{\theta}}_{\mathcal{I}_{1}}\|_{2}\|\bm{\theta}^{*}_{\mathcal{I}_{1}}\|_{2}+\|\bar{\bm{\theta}}_{\mathcal{I}_{2}}\|_{2}\|\bm{\theta}^{*}_{\mathcal{I}_{2}}\|_{2}. (48)

By the Cauchy-Schwarz inequality, we have

Δ2\displaystyle\Delta^{2} (𝜽¯12𝜽12+𝜽¯22𝜽22)2(𝜽¯122+𝜽¯222)(𝜽122+𝜽222)\displaystyle\leq(\|\bar{\bm{\theta}}_{\mathcal{I}_{1}}\|_{2}\|\bm{\theta}^{*}_{\mathcal{I}_{1}}\|_{2}+\|\bar{\bm{\theta}}_{\mathcal{I}_{2}}\|_{2}\|\bm{\theta}^{*}_{\mathcal{I}_{2}}\|_{2})^{2}\leq(\|\bar{\bm{\theta}}_{\mathcal{I}_{1}}\|_{2}^{2}+\|\bar{\bm{\theta}}_{\mathcal{I}_{2}}\|_{2}^{2})(\|\bm{\theta}^{*}_{\mathcal{I}_{1}}\|_{2}^{2}+\|\bm{\theta}^{*}_{\mathcal{I}_{2}}\|_{2}^{2})
=(1𝜽¯322)(1𝜽322)1𝜽¯322.\displaystyle=(1-\|\bar{\bm{\theta}}_{\mathcal{I}_{3}}\|_{2}^{2})(1-\|\bm{\theta}^{*}_{\mathcal{I}_{3}}\|_{2}^{2})\leq 1-\|\bar{\bm{\theta}}_{\mathcal{I}_{3}}\|_{2}^{2}. (49)

Then, let $R$ be the set of the $s_{1}$ indexes in $\mathcal{I}_{3}$ with the smallest magnitudes $|\bm{\beta}^{t+0.5}_{j}|$; this is possible since, by assumption, $\hat{s}>s^{*}$ and thus $|\mathcal{I}_{1}|<|\mathcal{I}_{3}|$. Then, applying Lemma 3.2 and defining $\bm{W}=\sum_{i\in[\hat{s}]}||\bm{w}_{i}||_{\infty}^{2}$, we have for any $c_{0}>0$:

𝜷1t+0.522(1+c0)𝜷Rt+0.522+4(1+1c0)𝑾.||\bm{\beta}_{\mathcal{I}_{1}}^{t+0.5}||_{2}^{2}\leq(1+c_{0})||\bm{\beta}_{R}^{t+0.5}||_{2}^{2}+4(1+\frac{1}{c_{0}})\bm{W}. (50)

Then, by the fact that $\frac{||\bm{\beta}_{\mathcal{I}_{3}}^{t+0.5}||_{2}}{\sqrt{s_{3}}}\geq\frac{||\bm{\beta}_{R}^{t+0.5}||_{2}}{\sqrt{s_{1}}}$ according to the choice of $R$, and the standard inequality $a^{2}+b^{2}\leq(a+b)^{2}$ for $a,b>0$, we have:

𝜷1t+0.52s11+c0𝜷3t+0.52s3+2(1+1c0)𝑾/s1.\frac{||\bm{\beta}_{\mathcal{I}_{1}}^{t+0.5}||_{2}}{\sqrt{s_{1}}}\leq\sqrt{1+c_{0}}\cdot\frac{||\bm{\beta}_{\mathcal{I}_{3}}^{t+0.5}||_{2}}{\sqrt{s_{3}}}+2\sqrt{(1+\frac{1}{c_{0}})\bm{W}}/{\sqrt{s_{1}}}. (51)

Therefore,

𝜽1t+0.521+c0s1𝜽3t+0.52s3+2(1+1c0)𝑾/(1+c0s1𝜷¯t+0.52).\frac{||\bm{\theta}_{\mathcal{I}_{1}}^{t+0.5}||_{2}}{\sqrt{1+c_{0}}\cdot\sqrt{s_{1}}}\leq\frac{||\bm{\theta}_{\mathcal{I}_{3}}^{t+0.5}||_{2}}{\sqrt{s_{3}}}+2\sqrt{(1+\frac{1}{c_{0}})\bm{W}}/{(\sqrt{1+c_{0}}\cdot\sqrt{s_{1}}\cdot||\bar{\bm{\beta}}^{t+0.5}||_{2})}. (52)

Then, we denote 𝑾1=2(1+1c0)𝑾/(1+c0s1𝜷¯t+0.52)\bm{W}_{1}=2\sqrt{(1+\frac{1}{c_{0}})\bm{W}}/{(\sqrt{1+c_{0}}\cdot\sqrt{s_{1}}\cdot||\bar{\bm{\beta}}^{t+0.5}||_{2})}. Further, we define that

ϵ0=2ηα/𝜷¯t+0.52 and ϵ1=ηfT(Qn0(𝜷t;𝜷t))Qn0(𝜷t;𝜷t)2/𝜷¯t+0.52.\epsilon_{0}=2\eta\alpha/||\bar{\bm{\beta}}^{t+0.5}||_{2}\text{ and }\epsilon_{1}=\eta\cdot||f_{T}(\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))-\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})||_{2}/||\bar{\bm{\beta}}^{t+0.5}||_{2}.

Thus we have, with probability 1τN0ϕ(ξ)1-\tau-N_{0}\cdot\phi(\xi):

𝜽1𝜽¯12s1\displaystyle\frac{\|\bm{\theta}_{\mathcal{I}_{1}}-\bar{\bm{\theta}}_{\mathcal{I}_{1}}\|_{2}}{\sqrt{s_{1}}} 𝜷1t+0.5𝜷¯1t+0.52s1𝜷¯t+0.52\displaystyle\leq\frac{\|\bm{\beta}_{\mathcal{I}_{1}}^{t+0.5}-\bar{\bm{\beta}}_{\mathcal{I}_{1}}^{t+0.5}\|_{2}}{\sqrt{s_{1}}\cdot||\bar{\bm{\beta}}^{t+0.5}||_{2}}
[ηQ(𝜷t;𝜷t)ηfT(Qn0(𝜷t;𝜷t))]12s1𝜷¯t+0.52\displaystyle\leq\frac{\|[\eta\nabla Q(\bm{\beta}^{t};\bm{\beta}^{t})-\eta f_{T}(\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))]_{\mathcal{I}_{1}}\|_{2}}{\sqrt{s_{1}}\cdot||\bar{\bm{\beta}}^{t+0.5}||_{2}}
η[Q(𝜷t;𝜷t)Qn0(𝜷t;𝜷t)]12s1𝜷¯t+0.52+η[Qn0(𝜷t;𝜷t)fT(Qn0(𝜷t;𝜷t))]12s1𝜷¯t+0.52\displaystyle\leq\frac{\eta\|[\nabla Q(\bm{\beta}^{t};\bm{\beta}^{t})-\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})]_{\mathcal{I}_{1}}\|_{2}}{\sqrt{s_{1}}\cdot||\bar{\bm{\beta}}^{t+0.5}||_{2}}+\frac{\eta\|[\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-f_{T}(\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))]_{\mathcal{I}_{1}}\|_{2}}{\sqrt{s_{1}}\cdot||\bar{\bm{\beta}}^{t+0.5}||_{2}}
ηs1Q(𝜷t;𝜷t)Qn0(𝜷t;𝜷t)s1𝜷¯t+0.52+η[Qn0(𝜷t;𝜷t)fT(Qn0(𝜷t;𝜷t))]12s1𝜷¯t+0.52\displaystyle\leq\frac{\eta\sqrt{s_{1}}\|\nabla Q(\bm{\beta}^{t};\bm{\beta}^{t})-\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})\|_{\infty}}{\sqrt{s_{1}}\cdot||\bar{\bm{\beta}}^{t+0.5}||_{2}}+\frac{\eta\|[\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t})-f_{T}(\nabla Q_{n_{0}}(\bm{\beta}^{t};\bm{\beta}^{t}))]_{\mathcal{I}_{1}}\|_{2}}{\sqrt{s_{1}}\cdot||\bar{\bm{\beta}}^{t+0.5}||_{2}}
ϵ0/2+ϵ1/s1.\displaystyle\leq\epsilon_{0}/2+\epsilon_{1}/\sqrt{s_{1}}.

Similarly, we also have that:

𝜽3𝜽¯32s3ϵ0/2+ϵ1/s3.\frac{\|\bm{\theta}_{\mathcal{I}_{3}}-\bar{\bm{\theta}}_{\mathcal{I}_{3}}\|_{2}}{\sqrt{s_{3}}}\leq\epsilon_{0}/2+\epsilon_{1}/\sqrt{s_{3}}.

Define ϵ~=𝑾1+(1+11+c0)ϵ0/2+(1s3+1s11+c0)ϵ1\tilde{\epsilon}=\bm{W}_{1}+(1+\frac{1}{\sqrt{1+c_{0}}})\epsilon_{0}/2+(\frac{1}{\sqrt{s_{3}}}+\frac{1}{\sqrt{s_{1}}\cdot\sqrt{1+c_{0}}})\epsilon_{1}, which implies that

𝜽¯32s3𝜽32s3𝜽3𝜽¯32s3\displaystyle\frac{\|\bar{\bm{\theta}}_{\mathcal{I}_{3}}\|_{2}}{\sqrt{s_{3}}}\geq\frac{\|\bm{\theta}_{\mathcal{I}_{3}}\|_{2}}{\sqrt{s_{3}}}-\frac{\|\bm{\theta}_{\mathcal{I}_{3}}-\bar{\bm{\theta}}_{\mathcal{I}_{3}}\|_{2}}{\sqrt{s_{3}}} 𝜽121+c0s1𝑾1𝜽3𝜽¯32s3\displaystyle\geq\frac{\|\bm{\theta}_{\mathcal{I}_{1}}\|_{2}}{\sqrt{1+c_{0}}\cdot\sqrt{s_{1}}}-\bm{W}_{1}-\frac{\|\bm{\theta}_{\mathcal{I}_{3}}-\bar{\bm{\theta}}_{\mathcal{I}_{3}}\|_{2}}{\sqrt{s_{3}}}
𝜽¯121+c0s1𝑾1𝜽3𝜽¯32s3𝜽1𝜽¯121+c0s1\displaystyle\geq\frac{\|\bar{\bm{\theta}}_{\mathcal{I}_{1}}\|_{2}}{\sqrt{1+c_{0}}\cdot\sqrt{s_{1}}}-\bm{W}_{1}-\frac{\|\bm{\theta}_{\mathcal{I}_{3}}-\bar{\bm{\theta}}_{\mathcal{I}_{3}}\|_{2}}{\sqrt{s_{3}}}-\frac{\|\bm{\theta}_{\mathcal{I}_{1}}-\bar{\bm{\theta}}_{\mathcal{I}_{1}}\|_{2}}{\sqrt{1+c_{0}}\cdot\sqrt{s_{1}}}
𝜽¯121+c0s1ϵ~,\displaystyle\geq\frac{\|\bar{\bm{\theta}}_{\mathcal{I}_{1}}\|_{2}}{\sqrt{1+c_{0}}\cdot\sqrt{s_{1}}}-\tilde{\epsilon}, (53)

where the second inequality follows by plugging in (52). Plugging (53) into (49), we have

Δ21𝜽¯3221(s3s1(1+c0)𝜽¯12s3ϵ~)2.\Delta^{2}\leq 1-\|\bar{\bm{\theta}}_{\mathcal{I}_{3}}\|_{2}^{2}\leq 1-(\sqrt{\frac{s_{3}}{s_{1}\cdot(1+c_{0})}}\|\bar{\bm{\theta}}_{\mathcal{I}_{1}}\|_{2}-\sqrt{s_{3}}\tilde{\epsilon})^{2}. (54)

After solving θ¯12\|\bar{\theta}_{\mathcal{I}_{1}}\|_{2} in (54), we can obtain the inequality below:

𝜽¯12s1(1+c0)s31Δ2+s1(1+c0)ϵ~ss^1Δ2+1+c0s1ϵ~.\|\bar{\bm{\theta}}_{\mathcal{I}_{1}}\|_{2}\leq\sqrt{\frac{s_{1}\cdot(1+c_{0})}{s_{3}}}\sqrt{1-\Delta^{2}}+\sqrt{s_{1}\cdot(1+c_{0})}\cdot\tilde{\epsilon}\leq\sqrt{\frac{s^{*}}{\hat{s}}}\cdot\sqrt{1-\Delta^{2}}+\sqrt{1+c_{0}}\cdot\sqrt{s_{1}}\cdot\tilde{\epsilon}. (55)

The final inequality is due to $\frac{s_{1}}{s_{3}}\leq\frac{s_{1}+s_{2}}{s_{3}+s_{2}}=\frac{s^{*}}{\hat{s}}$. By choosing $c_{0}$ such that $c_{0}\leq\min((s^{*}\cdot s_{3})/(\hat{s}\cdot s_{1})-1,(2\sqrt{s^{*}/s_{1}}-1)^{2}-1)$, we have $(1+c_{0})\cdot\frac{s_{1}}{s_{3}}\leq\frac{s^{*}}{\hat{s}}$. In the following, we prove that the right-hand side of (55) is upper bounded by $\Delta$.

By the assumptions of this lemma, we first observe that $\epsilon_{0}=o(1)$ and $\epsilon_{1}=o(1)$. Moreover, for $\bm{W}$, by Lemma A.1 from [Cai et al., 2019b] there exist constants $c_{1},m_{0},m_{1}$ such that, with probability $1-m_{0}\cdot s^{*}\cdot\exp(-m_{1}\log d)$:

𝑾c1(slogd)2log(1/δ)log3nn2ϵ2.\bm{W}\leq c_{1}\cdot\frac{(s^{*}\log d)^{2}\log(1/\delta)\log^{3}n}{n^{2}\epsilon^{2}}.

Thus, letting $\bm{W}^{\prime}$ be the maximum of $\bm{W}$ over all $N_{0}=O(\log n)$ iterations, a union bound gives that, with probability $1-m_{0}\cdot s^{*}\log n\cdot\exp(-m_{1}\log d)$:

𝑾c1(slogd)2log(1/δ)log3nn2ϵ2.\bm{W}^{\prime}\leq c_{1}\cdot\frac{(s^{*}\log d)^{2}\log(1/\delta)\log^{3}n}{n^{2}\epsilon^{2}}.

Therefore,

s1𝑾1c1slogdlog(1/δ)log3/2nnϵ.\sqrt{s_{1}}\cdot\bm{W}_{1}\leq{c_{1}^{\prime}}\cdot\frac{s^{*}\log d\cdot\log(1/\delta)\log^{3/2}n}{n\epsilon}. (56)

By the assumptions of this lemma, we have $\sqrt{s_{1}}\cdot\bm{W}_{1}=o(1)$; thus, for $\tilde{\epsilon}$, we find that:

s1ϵ~\displaystyle\sqrt{s_{1}}\tilde{\epsilon} s1𝑾1+s1(1+11+c0)ϵ0/2+(s1s3+s1s11+c0)ϵ1\displaystyle\leq\sqrt{s_{1}}\bm{W}_{1}+\sqrt{s_{1}}(1+\frac{1}{\sqrt{1+c_{0}}})\epsilon_{0}/2+(\frac{\sqrt{s_{1}}}{\sqrt{s_{3}}}+\frac{\sqrt{s_{1}}}{\sqrt{s_{1}}\cdot\sqrt{1+c_{0}}})\epsilon_{1}
s1𝑾1+s1(1+11+c0)ϵ0/2+(1+11+c0)ϵ1\displaystyle\leq\sqrt{s_{1}}\bm{W}_{1}+\sqrt{s_{1}}(1+\frac{1}{\sqrt{1+c_{0}}})\epsilon_{0}/2+(1+\frac{1}{\sqrt{1+c_{0}}})\epsilon_{1} (57)

The first term and the third term of (57) are both $o(1)$; plugging this into (55), we have

𝜽¯12ss^1Δ2+(1+c0+1)s1ϵ0/2.\|\bar{\bm{\theta}}_{\mathcal{I}_{1}}\|_{2}\leq\sqrt{\frac{s^{*}}{\hat{s}}}\cdot\sqrt{1-\Delta^{2}}+(\sqrt{1+c_{0}}+1)\cdot\sqrt{s_{1}}\epsilon_{0}/2.

By choosing $c_{0}$ properly, we have $(\sqrt{1+c_{0}}+1)\cdot\sqrt{s_{1}}\leq 2\sqrt{s^{*}}$, so

𝜽¯12ss^1Δ2+sϵ0.\|\bar{\bm{\theta}}_{\mathcal{I}_{1}}\|_{2}\leq\sqrt{\frac{s^{*}}{\hat{s}}}\cdot\sqrt{1-\Delta^{2}}+\sqrt{s^{*}}\epsilon_{0}. (58)

In the following steps, we prove that the right-hand side of (58) is upper bounded by $\Delta$. Such a bound holds if we have:

\displaystyle\Delta\geq\frac{\sqrt{s^{*}}\epsilon_{0}+[s^{*}\epsilon_{0}^{2}-(s^{*}/\hat{s}+1)(s^{*}\epsilon_{0}^{2}-s^{*}/\hat{s})]^{1/2}}{s^{*}/\hat{s}+1}=\frac{\sqrt{s^{*}}\epsilon_{0}+[-(s^{*}\epsilon_{0})^{2}/\hat{s}+(s^{*}/\hat{s}+1)s^{*}/\hat{s}]^{1/2}}{s^{*}/\hat{s}+1}. (59)

To prove (59), we first note that sϵ0Δ\sqrt{s^{*}}{\epsilon_{0}}\leq\Delta, which holds because:

sϵ0s^ϵ0=2ηαs^𝜷¯t+0.521L1+LΔ,\sqrt{s^{*}}{\epsilon_{0}}\leq\sqrt{\hat{s}}{\epsilon_{0}}=\frac{2\eta\alpha\sqrt{\hat{s}}}{\|\bar{\bm{\beta}}^{t+0.5}\|_{2}}\leq\frac{1-L}{1+L}\leq\Delta, (60)

where the second inequality is due to the assumptions in the lemma and the final inequality is due to the fact that:

𝜷¯t+0.522+𝜷222𝜷¯t+0.5,𝜷=𝜷¯t+0.5𝜷22L2𝜷22.\|\bar{\bm{\beta}}^{t+0.5}\|_{2}^{2}+\|\bm{\beta}^{*}\|_{2}^{2}-2\langle\bar{\bm{\beta}}^{t+0.5},\bm{\beta}^{*}\rangle=\|\bar{\bm{\beta}}^{t+0.5}-\bm{\beta}^{*}\|_{2}^{2}\leq L^{2}\|\bm{\beta}^{*}\|_{2}^{2}. (61)

and

\Delta=\langle\bar{\bm{\theta}},\bm{\theta}^{*}\rangle=\frac{\langle\bar{\bm{\beta}}^{t+0.5},\bm{\beta}^{*}\rangle}{\|\bar{\bm{\beta}}^{t+0.5}\|_{2}\|\bm{\beta}^{*}\|_{2}}\geq\frac{\|\bar{\bm{\beta}}^{t+0.5}\|_{2}^{2}+\|\bm{\beta}^{*}\|_{2}^{2}-L^{2}\|\bm{\beta}^{*}\|_{2}^{2}}{2\|\bar{\bm{\beta}}^{t+0.5}\|_{2}\|\bm{\beta}^{*}\|_{2}}\geq\frac{(1-L)^{2}+1-L^{2}}{2(1+L)}=\frac{1-L}{1+L}. (62)

By combining (61) and (62), we can finish the proof of (60). Then, we can verify that (59) holds. By (60), we have

s^ϵ01L1+L<1<s+s^s^,\sqrt{\hat{s}}\cdot{\epsilon_{0}}\leq\frac{1-L}{1+L}<1<\sqrt{\frac{s^{*}+\hat{s}}{\hat{s}}}, (63)

The above inequality implies that ϵ0s+s^s^{\epsilon_{0}}\leq\frac{\sqrt{s^{*}+\hat{s}}}{\hat{s}}. Then, for the right hand side of (59), we observe that:

\displaystyle\frac{\sqrt{s^{*}}{\epsilon_{0}}+[-(s^{*}{\epsilon_{0}})^{2}/\hat{s}+(s^{*}/\hat{s}+1)s^{*}/\hat{s}]^{1/2}}{s^{*}/\hat{s}+1} \leq\frac{\sqrt{s^{*}}{\epsilon_{0}}+[(s^{*}/\hat{s}+1)s^{*}/\hat{s}]^{1/2}}{s^{*}/\hat{s}+1}
2ss+s^\displaystyle\leq 2\sqrt{\frac{s^{*}}{s^{*}+\hat{s}}}
211+4(1+L)2/(1L)2\displaystyle\leq 2\sqrt{\frac{1}{1+4(1+L)^{2}/(1-L)^{2}}}
1L1+LΔ.\displaystyle\leq\frac{1-L}{1+L}\leq\Delta. (64)

Thus, we can claim that

𝜽¯12Δ.\|\bar{\bm{\theta}}_{\mathcal{I}_{1}}\|_{2}\leq\Delta. (65)

Furthermore, from (48), we have:

Δ𝜽¯12𝜽12+𝜽¯22𝜽22𝜽¯12𝜽12+(1𝜽¯122)(1𝜽122),\Delta\leq\|\bar{\bm{\theta}}_{\mathcal{I}_{1}}\|_{2}\|\bm{\theta}^{*}_{\mathcal{I}_{1}}\|_{2}+\|\bar{\bm{\theta}}_{\mathcal{I}_{2}}\|_{2}\|\bm{\theta}^{*}_{\mathcal{I}_{2}}\|_{2}\leq\|\bar{\bm{\theta}}_{\mathcal{I}_{1}}\|_{2}\|\bm{\theta}^{*}_{\mathcal{I}_{1}}\|_{2}+\sqrt{(1-\|\bar{\bm{\theta}}_{\mathcal{I}_{1}}\|_{2}^{2})}\sqrt{(1-\|\bm{\theta}^{*}_{\mathcal{I}_{1}}\|^{2}_{2})},

Noticing that $\|\bar{\bm{\theta}}_{\mathcal{I}_{1}}\|_{2}\|\bm{\theta}^{*}_{\mathcal{I}_{1}}\|_{2}\leq\Delta$ (by (65) and $\|\bm{\theta}^{*}_{\mathcal{I}_{1}}\|_{2}\leq 1$), we thus have

(Δ𝜽¯12𝜽12)2(1𝜽¯122)(1𝜽122).(\Delta-\|\bar{\bm{\theta}}_{\mathcal{I}_{1}}\|_{2}\|\bm{\theta}^{*}_{\mathcal{I}_{1}}\|_{2})^{2}\leq(1-\|\bar{\bm{\theta}}_{\mathcal{I}_{1}}\|_{2}^{2})(1-\|\bm{\theta}^{*}_{\mathcal{I}_{1}}\|^{2}_{2}).

By solving 𝜽12\|\bm{\theta}^{*}_{\mathcal{I}_{1}}\|_{2} from the above inequality, and by the fact that Δ1\Delta\leq 1, we have:

𝜽12\displaystyle\|\bm{\theta}^{*}_{\mathcal{I}_{1}}\|_{2} 𝜽¯12Δ+1𝜽¯1221Δ2𝜽¯12+1Δ2\displaystyle\leq\|\bar{\bm{\theta}}_{\mathcal{I}_{1}}\|_{2}\Delta+\sqrt{1-\|\bar{\bm{\theta}}_{\mathcal{I}_{1}}\|_{2}^{2}}\sqrt{1-\Delta^{2}}\leq\|\bar{\bm{\theta}}_{\mathcal{I}_{1}}\|_{2}+\sqrt{1-\Delta^{2}}
(ss^+1)1Δ2+sϵ0,\displaystyle\leq(\sqrt{\frac{s^{*}}{\hat{s}}}+1)\sqrt{1-\Delta^{2}}+\sqrt{s^{*}}{\epsilon_{0}}, (66)

where the third inequality is due to (58). Therefore, combining (58) and (66), we obtain:

𝜽12𝜽¯12[(ss^+1)1Δ2+sϵ0][ss^1Δ2+sϵ0].\|\bm{\theta}^{*}_{\mathcal{I}_{1}}\|_{2}\|\bar{\bm{\theta}}_{\mathcal{I}_{1}}\|_{2}\leq[(\sqrt{\frac{s^{*}}{\hat{s}}}+1)\sqrt{1-\Delta^{2}}+\sqrt{s^{*}}{\epsilon_{0}}]\cdot[\sqrt{\frac{s^{*}}{\hat{s}}}\sqrt{1-\Delta^{2}}+\sqrt{s^{*}}{\epsilon_{0}}]. (67)

Note by the definition of 𝜽¯\bar{\bm{\theta}}, we can find that:

𝜷¯t+1=trunc(𝜷¯t+0.5,𝒮t+0.5)=trunc(𝜽¯,𝒮t+0.5)𝜷¯t+0.52.\bar{\bm{\beta}}^{t+1}=\text{trunc}(\bar{\bm{\beta}}^{t+0.5},{\mathcal{S}}^{t+0.5})=\text{trunc}(\bar{\bm{\theta}},{\mathcal{S}}^{t+0.5})\cdot\|\bar{\bm{\beta}}^{t+0.5}\|_{2}. (68)

Therefore, we have:

𝜷¯t+1𝜷¯t+0.52,𝜷𝜷2=trunc(𝜽¯,𝒮t+0.5),𝜽=𝜽¯2,𝜽2𝜽¯,𝜽𝜽¯12𝜽12.\langle\frac{\bar{\bm{\beta}}^{t+1}}{\|\bar{\bm{\beta}}^{t+0.5}\|_{2}},\frac{\bm{\beta}^{*}}{\|\bm{\beta}^{*}\|_{2}}\rangle=\langle\text{trunc}(\bar{\bm{\theta}},{\mathcal{S}}^{t+0.5}),\bm{\theta}^{*}\rangle=\langle\bar{\bm{\theta}}_{\mathcal{I}_{2}},\bm{\theta}^{*}_{\mathcal{I}_{2}}\rangle\geq\langle\bar{\bm{\theta}},\bm{\theta}^{*}\rangle-\|\bar{\bm{\theta}}_{\mathcal{I}_{1}}\|_{2}\|\bm{\theta}^{*}_{\mathcal{I}_{1}}\|_{2}. (69)

Define χ=𝜷¯t+0.52𝜷2\chi=\|\bar{\bm{\beta}}^{t+0.5}\|_{2}\|\bm{\beta}^{*}\|_{2}. Combining the results from (69) and (67), we observe that:

𝜷¯t+1,𝜷\displaystyle\langle\bar{\bm{\beta}}^{t+1},\bm{\beta}^{*}\rangle
𝜷¯t+0.5,𝜷[(s/s^+1)χ(1Δ2)+sχϵ0][s/s^χ(1Δ2)+sχϵ0]\displaystyle\geq\langle\bar{\bm{\beta}}^{t+0.5},\bm{\beta}^{*}\rangle-[(\sqrt{s^{*}/\hat{s}}+1)\cdot\sqrt{\chi(1-\Delta^{2})}+\sqrt{s^{*}}\cdot\sqrt{\chi}{\epsilon_{0}}]\cdot[\sqrt{{s^{*}}/\hat{s}}\cdot\sqrt{\chi(1-\Delta^{2})}+\sqrt{s^{*}}\cdot\sqrt{\chi}{\epsilon_{0}}]
\displaystyle=\langle\bar{\bm{\beta}}^{t+0.5},\bm{\beta}^{*}\rangle-(\sqrt{{s^{*}}/\hat{s}}+{s^{*}}/\hat{s})\cdot\chi(1-\Delta^{2})-(1+2\sqrt{{s^{*}}/\hat{s}})\cdot\sqrt{\chi(1-\Delta^{2})}\sqrt{s^{*}}\sqrt{\chi}{\epsilon_{0}}-(\sqrt{s^{*}}\sqrt{\chi}{\epsilon_{0}})^{2} (70)

Then, noting the definition of $\Delta$, we can bound the term $\sqrt{\chi(1-\Delta^{2})}$ as follows:

χ(1Δ2)\displaystyle\sqrt{\chi(1-\Delta^{2})} 2χ(1Δ)2𝜷¯t+0.52𝜷22𝜷¯t+0.5,𝜷\displaystyle\leq\sqrt{2\chi(1-\Delta)}\leq\sqrt{2\|\bar{\bm{\beta}}^{t+0.5}\|_{2}\|\bm{\beta}^{*}\|_{2}-2\langle\bar{\bm{\beta}}^{t+0.5},\bm{\beta}^{*}\rangle}
𝜷¯t+0.522+𝜷222𝜷¯t+0.5,𝜷\displaystyle\leq\sqrt{\|\bar{\bm{\beta}}^{t+0.5}\|_{2}^{2}+\|\bm{\beta}^{*}\|_{2}^{2}-2\langle\bar{\bm{\beta}}^{t+0.5},\bm{\beta}^{*}\rangle}
𝜷¯t+0.5𝜷2.\displaystyle\leq\|\bar{\bm{\beta}}^{t+0.5}-\bm{\beta}^{*}\|_{2}. (71)

Then, for the term χϵ0\sqrt{\chi}{\epsilon_{0}}, we have

χϵ0=𝜷¯t+0.52𝜷22ηα𝜷¯t+0.522ηα1L.\displaystyle\sqrt{\chi}{\epsilon_{0}}=\sqrt{\|\bar{\bm{\beta}}^{t+0.5}\|_{2}\|\bm{\beta}^{*}\|_{2}}\frac{2\eta\alpha}{\|\bar{\bm{\beta}}^{t+0.5}\|_{2}}\leq\frac{2\eta\alpha}{\sqrt{1-L}}. (72)

Then, plugging (71) and (72) into (70), we obtain:

𝜷¯t+1,𝜷\displaystyle\langle\bar{\bm{\beta}}^{t+1},\bm{\beta}^{*}\rangle 𝜷¯t+0.5,𝜷(s/s^+s/s^)𝜷¯t+0.5𝜷22\displaystyle\geq\langle\bar{\bm{\beta}}^{t+0.5},\bm{\beta}^{*}\rangle-(\sqrt{{s^{*}}/\hat{s}}+{s^{*}}/\hat{s})\|\bar{\bm{\beta}}^{t+0.5}-\bm{\beta}^{*}\|_{2}^{2}
(1+2s/s^)𝜷¯t+0.5𝜷2s1L2ηα4η2s1Lα2.\displaystyle-(1+2\sqrt{{s^{*}}/\hat{s}})\|\bar{\bm{\beta}}^{t+0.5}-\bm{\beta}^{*}\|_{2}\cdot\frac{\sqrt{s^{*}}}{\sqrt{1-L}}\cdot 2\eta\alpha-\frac{4\eta^{2}s^{*}}{1-L}{\alpha}^{2}. (73)

Then, notice that $\bar{\bm{\beta}}^{t+1}$ is obtained by truncating $\bar{\bm{\beta}}^{t+0.5}$, so $\|\bar{\bm{\beta}}^{t+1}\|_{2}^{2}+\|\bm{\beta}^{*}\|_{2}^{2}\leq\|\bar{\bm{\beta}}^{t+0.5}\|_{2}^{2}+\|\bm{\beta}^{*}\|_{2}^{2}$; plugging this fact into (73), we obtain:

𝜷¯t+1𝜷22\displaystyle\|\bar{\bm{\beta}}^{t+1}-\bm{\beta}^{*}\|_{2}^{2} (1+s/s^+ss^)𝜷¯t+0.5𝜷22+4η2s1Lα2+(1+2s/s^)𝜷¯t+0.5𝜷2s1L2ηα\displaystyle\leq(1+\sqrt{{s^{*}}/\hat{s}}+\frac{s^{*}}{\hat{s}})\|\bar{\bm{\beta}}^{t+0.5}-\bm{\beta}^{*}\|_{2}^{2}+\frac{4\eta^{2}s^{*}}{1-L}{\alpha}^{2}+(1+2\sqrt{{s^{*}}/\hat{s}})\|\bar{\bm{\beta}}^{t+0.5}-\bm{\beta}^{*}\|_{2}\frac{\sqrt{s^{*}}}{\sqrt{1-L}}{2\eta\alpha}
(1+2s/s^+2s/s^)[𝜷¯t+0.5𝜷2+s1Lηα]2+4η2s1Lα2.\displaystyle\leq(1+2\sqrt{{s^{*}}/\hat{s}}+2{s^{*}}/\hat{s})[\|\bar{\bm{\beta}}^{t+0.5}-\bm{\beta}^{*}\|_{2}+\frac{\sqrt{s^{*}}}{\sqrt{1-L}}{\eta\alpha}]^{2}+\frac{4\eta^{2}s^{*}}{1-L}{\alpha}^{2}. (74)

By taking the square root of both sides of (74) and using the fact that $\sqrt{a^{2}+b^{2}}\leq a+b$ for $a,b>0$, we can find a constant $K_{1}$ such that:

𝜷¯t+1𝜷2(1+4s/s^)12𝜷¯t+0.5𝜷2+K1s1Lηα.\|\bar{\bm{\beta}}^{t+1}-\bm{\beta}^{*}\|_{2}\leq(1+4\sqrt{{s^{*}}/\hat{s}})^{\frac{1}{2}}\|\bar{\bm{\beta}}^{t+0.5}-\bm{\beta}^{*}\|_{2}+\frac{K_{1}\sqrt{s^{*}}}{\sqrt{1-L}}{\eta\alpha}. (75)

Then, the lemma is proved. \square

C.7 Proof of Lemma C.1

Since $X$ is a sub-gaussian random variable with mean zero and variance $\sigma^{2}$, by the definition of sub-gaussian random variables we have:

(X>t)exp(t22σ2) , (X<t)exp(t22σ2).{\mathbb{P}}(X>t)\leq\exp(-\frac{t^{2}}{2\sigma^{2}})\text{ , }{\mathbb{P}}(X<-t)\leq\exp(-\frac{t^{2}}{2\sigma^{2}}). (76)

Thus, with this tail bound, we can calculate $\mathbb{E}(\Pi_{T}(X)-X)^{2}$ directly. Suppose the density function of $X$ is $f_{x}$; then, according to the definition of $\Pi_{T}$, we have:

𝔼[(ΠT(X)X)]2\displaystyle\mathbb{E}[(\Pi_{T}(X)-X)]^{2} =T(xT)2fx𝑑x+T(T+x)2fx𝑑x.\displaystyle=\int_{T}^{\infty}(x-T)^{2}f_{x}dx+\int_{-\infty}^{-T}(T+x)^{2}f_{x}dx. (77)

Then, we will analyze these two terms in (77) separately. For the first term, we have:

T(xT)2fx𝑑x\displaystyle\int_{T}^{\infty}(x-T)^{2}f_{x}dx =TTx2(tT)𝑑tfx𝑑x\displaystyle=\int_{T}^{\infty}\int_{T}^{x}2(t-T)dtf_{x}dx
=T2(tT)P(xt)𝑑t\displaystyle=\int_{T}^{\infty}2(t-T)P(x\geq t)dt
2T(tT)exp(t22σ2)𝑑t\displaystyle\leq 2\cdot\int_{T}^{\infty}(t-T)\exp(-\frac{t^{2}}{2\sigma^{2}})dt
\displaystyle\leq 2\sigma^{2}\cdot\exp(-\frac{T^{2}}{2\sigma^{2}}). (78)

Then, by choosing $T=c\cdot\sigma\sqrt{2\log n}$, we have $\int_{T}^{\infty}(x-T)^{2}f_{x}dx=O(\frac{1}{n})$. By a similar argument, we also obtain $\int_{-\infty}^{-T}(x+T)^{2}f_{x}dx=O(\frac{1}{n})$. Thus, $\mathbb{E}(\Pi_{T}(X)-X)^{2}=O(\frac{1}{n})$.

For the second part of the lemma, we calculate:

𝔼[(ΠT(X)X)]4\displaystyle\mathbb{E}[(\Pi_{T}(X)-X)]^{4} =T(xT)4fx𝑑x+T(T+x)4fx𝑑x.\displaystyle=\int_{T}^{\infty}(x-T)^{4}f_{x}dx+\int_{-\infty}^{-T}(T+x)^{4}f_{x}dx. (79)

As in the first part, we analyze the two terms in (79) separately. For the first term, we have:

T(xT)4fx𝑑x\displaystyle\int_{T}^{\infty}(x-T)^{4}f_{x}dx =TTx4(tT)3𝑑tfx𝑑x\displaystyle=\int_{T}^{\infty}\int_{T}^{x}4(t-T)^{3}dtf_{x}dx
=Ttfx𝑑x4(tT)3𝑑t\displaystyle=\int_{T}^{\infty}\int_{t}^{\infty}f_{x}dx4(t-T)^{3}dt
4T(tT)3exp(t22σ2)𝑑t\displaystyle\leq 4\int_{T}^{\infty}(t-T)^{3}\exp(-\frac{t^{2}}{2\sigma^{2}})dt
4Tt3exp(t22σ2)𝑑t\displaystyle\leq 4\int_{T}^{\infty}t^{3}\exp(-\frac{t^{2}}{2\sigma^{2}})dt
\displaystyle=c_{0}\cdot T^{2}\exp(-\frac{T^{2}}{2\sigma^{2}})+c_{1}\int_{T}^{\infty}t\exp(-\frac{t^{2}}{2\sigma^{2}})dt
\displaystyle=c_{0}\cdot T^{2}\exp(-\frac{T^{2}}{2\sigma^{2}})+c_{2}\cdot\exp(-\frac{T^{2}}{2\sigma^{2}}). (80)

Then, by choosing $T=c\cdot\sigma\sqrt{2\log n}$, we have $\int_{T}^{\infty}(x-T)^{4}f_{x}dx=O(\frac{\log n}{n})$. By a similar argument, $\int_{-\infty}^{-T}(x+T)^{4}f_{x}dx=O(\frac{\log n}{n})$. Thus, $\mathbb{E}(\Pi_{T}(X)-X)^{4}=O(\frac{\log n}{n})$, which finishes the proof of this lemma. \square
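The rates in Lemma C.1 are easy to verify numerically. The following sketch (under the illustrative assumption $X\sim N(0,\sigma^{2})$) estimates $\mathbb{E}(\Pi_{T}(X)-X)^{2}$ and $\mathbb{E}(\Pi_{T}(X)-X)^{4}$ with $T=\sigma\sqrt{2\log n}$ and compares them with $1/n$ and $\log n/n$.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 1.0
x = rng.normal(0.0, sigma, size=1_000_000)   # Monte Carlo sample of X ~ N(0, sigma^2)
for n in (10**3, 10**4, 10**5):
    T = sigma * np.sqrt(2 * np.log(n))        # truncation level used in the proof
    err = np.clip(x, -T, T) - x               # Pi_T(X) - X
    print(f"n={n:>6}:  E(err^2)={np.mean(err**2):.2e} (1/n={1/n:.2e}), "
          f"E(err^4)={np.mean(err**4):.2e} (log n/n={np.log(n)/n:.2e})")
```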

C.8 Proof of Lemma C.2

First, we separate the left-hand side of (40) as follows:

(yi𝒙iΠT(yi)ΠT(𝒙i))𝒮2\displaystyle||(y_{i}\cdot\bm{x}_{i}-\Pi_{T}(y_{i})\Pi_{T}(\bm{x}_{i}))_{\mathcal{S}}||_{2}
(yiΠT(𝒙i)ΠT(yi)ΠT(𝒙i))𝒮2+(yi𝒙iyiΠT(𝒙i))𝒮2\displaystyle\leq||(y_{i}\cdot\Pi_{T}(\bm{x}_{i})-\Pi_{T}(y_{i})\Pi_{T}(\bm{x}_{i}))_{\mathcal{S}}||_{2}+||(y_{i}\cdot\bm{x}_{i}-y_{i}\Pi_{T}(\bm{x}_{i}))_{\mathcal{S}}||_{2}
Ts|ΠT(yi)yi|(C.8.1)+((ΠT(yi)yi)(ΠT(𝒙i)𝒙i))𝒮2(C.8.2)+T(ΠT(𝒙i)𝒙i)𝒮2(C.8.3).\displaystyle\leq\underbrace{T\cdot\sqrt{s}\cdot|\Pi_{T}(y_{i})-y_{i}|}_{(\ref{peq35}.1)}+\underbrace{||((\Pi_{T}(y_{i})-y_{i})(\Pi_{T}(\bm{x}_{i})-\bm{x}_{i}))_{\mathcal{S}}||_{2}}_{(\ref{peq35}.2)}+\underbrace{T\cdot||(\Pi_{T}(\bm{x}_{i})-\bm{x}_{i})_{\mathcal{S}}||_{2}}_{(\ref{peq35}.3)}. (81)

For the term (C.8.1), note that $y\sim N(0,\beta^{\top}\beta+\sigma^{2})$ follows a Gaussian distribution. Then, by Lemma C.1, if we choose $T\asymp\sqrt{\log n}$, we have $\mathbb{E}[\Pi_{T}(y_{i})-y_{i}]^{2}=O(\frac{1}{n})$, and thus $T^{2}\cdot s\cdot\mathbb{E}[\Pi_{T}(y_{i})-y_{i}]^{2}=O(\frac{s\cdot\log n}{n})$.

Next, we analyze the term (C.8.3). For any $j\in\mathcal{S}$:

𝔼(ΠT(𝒙i)𝒙i)𝒮22\displaystyle\mathbb{E}||(\Pi_{T}(\bm{x}_{i})-\bm{x}_{i})_{\mathcal{S}}||_{2}^{2} =s𝔼(ΠT(𝒙ij)𝒙ij)2.\displaystyle=s\cdot\mathbb{E}(\Pi_{T}(\bm{x}_{ij})-\bm{x}_{ij})^{2}. (82)

By Lemma C.1, when we choose $T\asymp\sqrt{\log n}$, $\mathbb{E}(\Pi_{T}(\bm{x}_{ij})-\bm{x}_{ij})^{2}=O(\frac{1}{n})$, which means that $\mathbb{E}||(\Pi_{T}(\bm{x}_{i})-\bm{x}_{i})_{\mathcal{S}}||_{2}^{2}=O(\frac{s}{n})$. Therefore, for the term (C.8.3), $T^{2}\cdot\mathbb{E}||(\Pi_{T}(\bm{x}_{i})-\bm{x}_{i})_{\mathcal{S}}||_{2}^{2}=O(\frac{s\log n}{n})$.

Finally, let us analyze the term (C.8.2)(\ref{peq35}.2), for any j𝒮j\in\mathcal{S}:

𝔼((ΠT(yi)yi)(ΠT(𝒙i)𝒙i))𝒮22\displaystyle\mathbb{E}||((\Pi_{T}(y_{i})-y_{i})(\Pi_{T}(\bm{x}_{i})-\bm{x}_{i}))_{\mathcal{S}}||_{2}^{2} =s𝔼(ΠT(yi)yi)2(ΠT(𝒙ij)𝒙ij)2\displaystyle=s\cdot\mathbb{E}(\Pi_{T}(y_{i})-y_{i})^{2}(\Pi_{T}(\bm{x}_{ij})-\bm{x}_{ij})^{2}
s𝔼(ΠT(yi)yi)4𝔼(ΠT(𝒙ij)𝒙ij)4.\displaystyle\leq s\cdot\sqrt{\mathbb{E}(\Pi_{T}(y_{i})-y_{i})^{4}\cdot\mathbb{E}(\Pi_{T}(\bm{x}_{ij})-\bm{x}_{ij})^{4}}. (83)

Again by Lemma C.1, both $\mathbb{E}(\Pi_{T}(y_{i})-y_{i})^{4}$ and $\mathbb{E}(\Pi_{T}(\bm{x}_{ij})-\bm{x}_{ij})^{4}$ are $O(\frac{\log n}{n})$; inserting this into (83), we conclude that $\mathbb{E}||((\Pi_{T}(y_{i})-y_{i})(\Pi_{T}(\bm{x}_{i})-\bm{x}_{i}))_{\mathcal{S}}||_{2}^{2}=O(\frac{s\log n}{n})$.

Therefore, if we choose ξ=O(slogdnlogn)\xi=O(\sqrt{\frac{s\log d}{n}}\cdot\log n), we have:

(1n0i=1n0(yi𝒙iΠT(yi)ΠT(𝒙i))𝒮12>ξ/2)\displaystyle{\mathbb{P}}(\frac{1}{n_{0}}\sum_{i=1}^{n_{0}}||(y_{i}\cdot\bm{x}_{i}-\Pi_{T}(y_{i})\Pi_{T}(\bm{x}_{i}))_{\mathcal{S}_{1}}||_{2}>\xi/2)
(1n0i=1n0[(C.8.1)+(C.8.2)+(C.8.3)]>ξ/2)\displaystyle\leq{\mathbb{P}}(\frac{1}{n_{0}}\sum_{i=1}^{n_{0}}[(\ref{peq35}.1)+(\ref{peq35}.2)+(\ref{peq35}.3)]>\xi/2)
(1n0i=1n0[(C.8.1)]>ξ/6)+(1n0i=1n0[(C.8.2)]>ξ/6)+(1n0i=1n0[(C.8.3)]>ξ/6)\displaystyle\leq{\mathbb{P}}(\frac{1}{n_{0}}\sum_{i=1}^{n_{0}}[(\ref{peq35}.1)]>\xi/6)+{\mathbb{P}}(\frac{1}{n_{0}}\sum_{i=1}^{n_{0}}[(\ref{peq35}.2)]>\xi/6)+{\mathbb{P}}(\frac{1}{n_{0}}\sum_{i=1}^{n_{0}}[(\ref{peq35}.3)]>\xi/6)
36𝔼[(C.8.1)2]ξ2+36𝔼[(C.8.2)2]ξ2+36𝔼[(C.8.3)2]ξ2\displaystyle\leq\frac{36\mathbb{E}[(\ref{peq35}.1)^{2}]}{\xi^{2}}+\frac{36\mathbb{E}[(\ref{peq35}.2)^{2}]}{\xi^{2}}+\frac{36\mathbb{E}[(\ref{peq35}.3)^{2}]}{\xi^{2}}
=O(1lognlogd)\displaystyle=O(\frac{1}{\log n\cdot\log d}) (84)

This finishes the proof. \square
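As with Lemma C.1, the bound (40) can be checked empirically. The sketch below draws linear-regression-style data (an illustrative choice; the design, the index set $\mathcal{S}$ and the truncation level are assumptions) and compares the empirical average of $||(y_{i}\bm{x}_{i}-\Pi_{T}(y_{i})\Pi_{T}(\bm{x}_{i}))_{\mathcal{S}}||_{2}$ with $\sqrt{s\log d/n}\cdot\log n$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, s = 20_000, 500, 10
beta = np.zeros(d); beta[:s] = 1.0 / np.sqrt(s)
x = rng.normal(size=(n, d))                    # sub-gaussian design
y = x @ beta + rng.normal(size=n)              # sub-gaussian response
T = np.sqrt(np.log(n))                         # truncation level, T ~ sqrt(log n)
S = np.arange(s)                               # a fixed index set with |S| = s

# Empirical left-hand side of (40) versus the claimed rate (up to constants).
diff = y[:, None] * x[:, S] - np.clip(y, -T, T)[:, None] * np.clip(x[:, S], -T, T)
lhs = np.linalg.norm(diff, axis=1).mean()
rhs = np.sqrt(s * np.log(d) / n) * np.log(n)
print(f"empirical average = {lhs:.3e},   sqrt(s log d / n) * log n = {rhs:.3e}")
```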